ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Takuya UESHIN	8014b0b5d6	[SPARK-32160][CORE][PYSPARK] Add a config to switch allow/disallow to create SparkContext in executors ### What changes were proposed in this pull request? This is a follow-up of #28986. This PR adds a config to switch allow/disallow to create `SparkContext` in executors. - `spark.driver.allowSparkContextInExecutors` ### Why are the changes needed? Some users or libraries actually create `SparkContext` in executors. We shouldn't break their workloads. ### Does this PR introduce _any_ user-facing change? Yes, users will be able to create `SparkContext` in executors with the config enabled. ### How was this patch tested? More tests are added. Closes #29278 from ueshin/issues/SPARK-32160/add_configs. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-31 17:28:35 +09:00
HyukjinKwon	e1d7321034	[SPARK-32478][R][SQL] Error message to show the schema mismatch in gapply with Arrow vectorization ### What changes were proposed in this pull request? This PR proposes to: 1. Fix the error message when the output schema is misbatched with R DataFrame from the given function. For example, ```R df <- createDataFrame(list(list(a=1L, b="2"))) count(gapply(df, "a", function(key, group) { group }, structType("a int, b int"))) ``` Before: ``` Error in handleErrors(returnStatus, conn) : ... java.lang.UnsupportedOperationException ... ``` After: ``` Error in handleErrors(returnStatus, conn) : ... java.lang.AssertionError: assertion failed: Invalid schema from gapply: expected IntegerType, IntegerType, got IntegerType, StringType ... ``` 2. Update documentation about the schema matching for `gapply` and `dapply`. ### Why are the changes needed? To show which schema is not matched, and let users know what's going on. ### Does this PR introduce _any_ user-facing change? Yes, error message is updated as above, and documentation is updated. ### How was this patch tested? Manually tested and unitttests were added. Closes #29283 from HyukjinKwon/r-vectorized-error. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-30 15:16:02 +09:00
Max Gekk	99a855575c	[SPARK-32431][SQL] Check duplicate nested columns in read from in-built datasources ### What changes were proposed in this pull request? When `spark.sql.caseSensitive` is `false` (by default), check that there are not duplicate column names on the same level (top level or nested levels) in reading from in-built datasources Parquet, ORC, Avro and JSON. If such duplicate columns exist, throw the exception: ``` org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: ``` ### Why are the changes needed? To make handling of duplicate nested columns is similar to handling of duplicate top-level columns i. e. output the same error when `spark.sql.caseSensitive` is `false`: ```Scala org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `camelcase` ``` Checking of top-level duplicates was introduced by https://github.com/apache/spark/pull/17758. ### Does this PR introduce _any_ user-facing change? Yes. For the example from SPARK-32431: ORC: ```scala java.io.IOException: Error reading file: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-c02c2f9a-0cdc-4859-94fc-b9c809ca58b1/part-00001-63e8c3f0-7131-4ec9-be02-30b3fdd276f4-c000.snappy.orc at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1329) at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78) ... Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream Stream for column 3 kind DATA position: 6 length: 6 range: 0 offset: 12 limit: 12 range 0 = 0 to 6 uncompressed: 3 to 3 at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61) at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323) ``` JSON: ```scala +------------+ \|StructColumn\| +------------+ \| [,,]\| +------------+ ``` Parquet: ```scala +------------+ \|StructColumn\| +------------+ \| [0,, 1]\| +------------+ ``` Avro: ```scala +------------+ \|StructColumn\| +------------+ \| [,,]\| +------------+ ``` After the changes, Parquet, ORC, JSON and Avro output the same error: ```scala Found duplicate column(s) in the data schema: `camelcase`; org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `camelcase`; at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:112) at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtils.scala:51) at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtils.scala:67) ``` ### How was this patch tested? Run modified test suites: ``` $ build/sbt "sql/test:testOnly org.apache.spark.sql.FileBasedDataSourceSuite" $ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.*" ``` and added new UT to `SchemaUtilsSuite`. Closes #29234 from MaxGekk/nested-case-insensitive-column. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-30 06:05:55 +00:00
HyukjinKwon	89d9b7cc64	[SPARK-32010][PYTHON][CORE] Add InheritableThread for local properties and fixing a thread leak issue in pinned thread mode ### What changes were proposed in this pull request? This PR proposes: 1. To introduce `InheritableThread` class, that works identically with `threading.Thread` but it can inherit the inheritable attributes of a JVM thread such as `InheritableThreadLocal`. This was a problem from the pinned thread mode, see also https://github.com/apache/spark/pull/24898. Now it works as below: ```python import pyspark spark.sparkContext.setLocalProperty("a", "hi") def print_prop(): print(spark.sparkContext.getLocalProperty("a")) pyspark.InheritableThread(target=print_prop).start() ``` ``` hi ``` 2. Also, it adds the resource leak fix into `InheritableThread`. Py4J leaks the thread and does not close the connection from Python to JVM. In `InheritableThread`, it manually closes the connections when PVM garbage collection happens. So, JVM threads finish safely. I manually verified by profiling but there's also another easy way to verify: ```bash PYSPARK_PIN_THREAD=true ./bin/pyspark ``` ```python >>> from threading import Thread >>> Thread(target=lambda: spark.range(1000).collect()).start() >>> Thread(target=lambda: spark.range(1000).collect()).start() >>> Thread(target=lambda: spark.range(1000).collect()).start() >>> spark._jvm._gateway_client.deque deque([<py4j.clientserver.ClientServerConnection object at 0x119f7aba8>, <py4j.clientserver.ClientServerConnection object at 0x119fc9b70>, <py4j.clientserver.ClientServerConnection object at 0x119fc9e10>, <py4j.clientserver.ClientServerConnection object at 0x11a015358>, <py4j.clientserver.ClientServerConnection object at 0x119fc00f0>]) >>> Thread(target=lambda: spark.range(1000).collect()).start() >>> spark._jvm._gateway_client.deque deque([<py4j.clientserver.ClientServerConnection object at 0x119f7aba8>, <py4j.clientserver.ClientServerConnection object at 0x119fc9b70>, <py4j.clientserver.ClientServerConnection object at 0x119fc9e10>, <py4j.clientserver.ClientServerConnection object at 0x11a015358>, <py4j.clientserver.ClientServerConnection object at 0x119fc08d0>, <py4j.clientserver.ClientServerConnection object at 0x119fc00f0>]) ``` This issue is fixed now. 3. Because now we have a fix for the issue here, it also proposes to deprecate `collectWithJobGroup` which was a temporary workaround added to avoid this leak issue. ### Why are the changes needed? To support pinned thread mode properly without a resource leak, and a proper inheritable local properties. ### Does this PR introduce _any_ user-facing change? Yes, it adds an API `InheritableThread` class for pinned thread mode. ### How was this patch tested? Manually tested as described above, and unit test was added as well. Closes #28968 from HyukjinKwon/SPARK-32010. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-30 10:15:25 +09:00
Thomas Graves	e926d419d3	[SPARK-30322][DOCS] Add stage level scheduling docs ### What changes were proposed in this pull request? Document the stage level scheduling feature. ### Why are the changes needed? Document the stage level scheduling feature. ### Does this PR introduce _any_ user-facing change? Documentation. ### How was this patch tested? n/a docs only Closes #29292 from tgravescs/SPARK-30322. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2020-07-29 13:46:28 -05:00
HyukjinKwon	5491c08bf1	Revert "[SPARK-31525][SQL] Return an empty list for df.head() when df is empty" This reverts commit `44a5258ac2`.	2020-07-29 12:07:35 +09:00
Xiaochang Wu	44c868b73a	[SPARK-32339][ML][DOC] Improve MLlib BLAS native acceleration docs ### What changes were proposed in this pull request? Rewrite a clearer and complete BLAS native acceleration enabling guide. ### Why are the changes needed? The document of enabling BLAS native acceleration in ML guide (https://spark.apache.org/docs/latest/ml-guide.html#dependencies) is incomplete and unclear to the user. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes #29139 from xwu99/blas-doc. Lead-authored-by: Xiaochang Wu <xiaochang.wu@intel.com> Co-authored-by: Wu, Xiaochang <xiaochang.wu@intel.com> Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>	2020-07-28 08:36:11 -07:00
Tianshi Zhu	44a5258ac2	[SPARK-31525][SQL] Return an empty list for df.head() when df is empty ### What changes were proposed in this pull request? return an empty list instead of None when calling `df.head()` ### Why are the changes needed? `df.head()` and `df.head(1)` are inconsistent when df is empty. ### Does this PR introduce _any_ user-facing change? Yes. If a user relies on `df.head()` to return None, things like `if df.head() is None:` will be broken. ### How was this patch tested? Closes #29214 from tianshizz/SPARK-31525. Authored-by: Tianshi Zhu <zhutianshirea@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-28 12:32:19 +09:00
GuoPhilipse	8de43338be	[SPARK-31753][SQL][DOCS] Add missing keywords in the SQL docs ### What changes were proposed in this pull request? update sql-ref docs, the following key words will be added in this PR. CASE/ELSE WHEN/THEN MAP KEYS TERMINATED BY NULL DEFINED AS LINES TERMINATED BY ESCAPED BY COLLECTION ITEMS TERMINATED BY PIVOT LATERAL VIEW OUTER? ROW FORMAT SERDE ROW FORMAT DELIMITED FIELDS TERMINATED BY IGNORE NULLS FIRST LAST ### Why are the changes needed? let more users know the sql key words usage ### Does this PR introduce _any_ user-facing change? ![image](https://user-images.githubusercontent.com/46367746/88148830-c6dc1f80-cc31-11ea-81ea-13bc9dc34550.png) ![image](https://user-images.githubusercontent.com/46367746/88148968-fb4fdb80-cc31-11ea-8649-e8297cf5813e.png) ![image](https://user-images.githubusercontent.com/46367746/88149000-073b9d80-cc32-11ea-9aa4-f914ecd72663.png) ![image](https://user-images.githubusercontent.com/46367746/88149021-0f93d880-cc32-11ea-86ed-7db8672b5aac.png) ### How was this patch tested? No Closes #29056 from GuoPhilipse/add-missing-keywords. Lead-authored-by: GuoPhilipse <guofei_ok@126.com> Co-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-07-28 09:41:53 +09:00
Kent Yao	d315ebf3a7	[SPARK-32424][SQL] Fix silent data change for timestamp parsing if overflow happens ### What changes were proposed in this pull request? When using `Seconds.toMicros` API to convert epoch seconds to microseconds, ```scala /** * Equivalent to * {link #convert(long, TimeUnit) MICROSECONDS.convert(duration, this)}. * param duration the duration * return the converted duration, * or {code Long.MIN_VALUE} if conversion would negatively * overflow, or {code Long.MAX_VALUE} if it would positively overflow. */ ``` This PR change it to `Math.multiplyExact(epochSeconds, MICROS_PER_SECOND)` ### Why are the changes needed? fix silent data change between 3.x and 2.x ``` ~/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200722  bin/spark-sql -S -e "select to_timestamp('300000', 'y');" +294247-01-10 12:00:54.775807 ``` ``` kentyaohulk  ~/Downloads/spark/spark-2.4.5-bin-hadoop2.7  bin/spark-sql -S -e "select to_timestamp('300000', 'y');" 284550-10-19 15:58:1010.448384 ``` ### Does this PR introduce _any_ user-facing change? Yes, we will raise `ArithmeticException` instead of giving the wrong answer if overflow. ### How was this patch tested? add unit test Closes #29220 from yaooqinn/SPARK-32424. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-27 17:03:14 +00:00
HyukjinKwon	6ab29b37cf	[SPARK-32179][SPARK-32188][PYTHON][DOCS] Replace and redesign the documentation base ### What changes were proposed in this pull request? This PR proposes to redesign the PySpark documentation. I made a demo site to make it easier to review: https://hyukjin-spark.readthedocs.io/en/stable/reference/index.html. Here is the initial draft for the final PySpark docs shape: https://hyukjin-spark.readthedocs.io/en/latest/index.html. In more details, this PR proposes: 1. Use [pydata_sphinx_theme](https://github.com/pandas-dev/pydata-sphinx-theme) theme - [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/) use this theme. The CSS overwrite is ported from Koalas. The colours in the CSS were actually chosen by designers to use in Spark. 2. Use the Sphinx option to separate `source` and `build` directories as the documentation pages will likely grow. 3. Port current API documentation into the new style. It mimics Koalas and pandas to use the theme most effectively. One disadvantage of this approach is that you should list up APIs or classes; however, I think this isn't a big issue in PySpark since we're being conservative on adding APIs. I also intentionally listed classes only instead of functions in ML and MLlib to make it relatively easier to manage. ### Why are the changes needed? Often I hear the complaints, from the users, that current PySpark documentation is pretty messy to read - https://spark.apache.org/docs/latest/api/python/index.html compared other projects such as [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/). It would be nicer if we can make it more organised instead of just listing all classes, methods and attributes to make it easier to navigate. Also, the documentation has been there from almost the very first version of PySpark. Maybe it's time to update it. ### Does this PR introduce _any_ user-facing change? Yes, PySpark API documentation will be redesigned. ### How was this patch tested? Manually tested, and the demo site was made to show. Closes #29188 from HyukjinKwon/SPARK-32179. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-27 17:49:21 +09:00
HyukjinKwon	bfa5d57bbd	[SPARK-32452][R][SQL] Bump up the minimum Arrow version as 1.0.0 in SparkR ### What changes were proposed in this pull request? This PR proposes to set the minimum Arrow version as 1.0.0 to minimise the maintenance overhead and keep the minimal version up to date. Other required changes to support 1.0.0 were already made in SPARK-32451. ### Why are the changes needed? R side, people rather aggressively encourage people to use the latest version, and SparkR vectorization is very experimental that was added from Spark 3.0. Also, we're technically not testing old Arrow versions in SparkR for now. ### Does this PR introduce _any_ user-facing change? Yes, users wouldn't be able to use SparkR with old Arrow. ### How was this patch tested? GitHub Actions and AppVeyor are already testing them. Closes #29253 from HyukjinKwon/SPARK-32452. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-27 14:21:15 +09:00
Kent Yao	d3596c04b0	[SPARK-32406][SQL] Make RESET syntax support single configuration reset ### What changes were proposed in this pull request? This PR extends the RESET command to support reset SQL configuration one by one. ### Why are the changes needed? Currently, the reset command only supports restore all of the runtime configurations to their defaults. In most cases, users do not want this, but just want to restore one or a small group of settings. The SET command can work as a workaround for this, but you have to keep the defaults in your mind or by temp variables, which turns out not very convenient to use. Hive supports this: https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-BeelineExample reset <key> \| Resets the value of a particular configuration variable (key) to the default value.Note: If you misspell the variable name, Beeline will not show an error. -- \| -- PostgreSQL supports this too https://www.postgresql.org/docs/9.1/sql-reset.html ### Does this PR introduce _any_ user-facing change? yes, reset can restore one configuration now ### How was this patch tested? add new unit tests. Closes #29202 from yaooqinn/SPARK-32406. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-24 09:13:26 -07:00
Wenchen Fan	aa54dcf193	[SPARK-32251][SQL][TESTS][FOLLOWUP] improve SQL keyword test ### What changes were proposed in this pull request? Improve the `SQLKeywordSuite` so that: 1. it checks keywords under default mode as well 2. it checks if there are typos in the doc (found one and fixed in this PR) ### Why are the changes needed? better test coverage ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A Closes #29200 from cloud-fan/test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-23 14:02:38 +00:00
ulysses	184074de22	[SPARK-31999][SQL] Add REFRESH FUNCTION command ### What changes were proposed in this pull request? In Hive mode, permanent functions are shared with Hive metastore so that functions may be modified by other Hive client. With in long-lived spark scene, it's hard to update the change of function. Here are 2 reasons: * Spark cache the function in memory using `FunctionRegistry`. * User may not know the location or classname of udf when using `replace function`. Note that we use v2 command code path to add new command. ### Why are the changes needed? Give a easy way to make spark function registry sync with Hive metastore. Then we can call ``` refresh function functionName ``` ### Does this PR introduce _any_ user-facing change? Yes, new command. ### How was this patch tested? New UT. Closes #28840 from ulysses-you/SPARK-31999. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-22 19:05:50 +00:00
Brandon	1267d80db6	[MINOR][DOCS] add link for Debugging your Application in running-on-yarn.html#launching-spark-on-yarn ### What changes were proposed in this pull request? add link for Debugging your Application in `running-on-yarn.html#launching-spark-on-yar` ### Why are the changes needed? Currrently on running-on-yarn.html page launching-spark-on-yarn section, it mentions to refer for Debugging your Application. It is better to add a direct link for it to save reader time to find the section ![image](https://user-images.githubusercontent.com/20021316/87867542-80cc5500-c9c0-11ea-8560-5ddcb5a308bc.png) ### Does this PR introduce _any_ user-facing change? Yes. Docs changes. 1. add link for Debugging your Application in `running-on-yarn.html#launching-spark-on-yarn` section Updated behavior: ![image](https://user-images.githubusercontent.com/20021316/87867534-6eeab200-c9c0-11ea-94ee-d3fa58157156.png) 2. update Spark Properties link to anchor link only ### How was this patch tested? manual test has been performed to test the updated Closes #29154 from brandonJY/patch-1. Authored-by: Brandon <brandonJY@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-21 13:42:19 +09:00
Gengliang Wang	c2afe1c0b9	[SPARK-32366][DOC] Fix doc link of datetime pattern in 3.0 migration guide ### What changes were proposed in this pull request? In http://spark.apache.org/docs/latest/sql-migration-guide.html#query-engine, there is a invalid reference for datetime reference "sql-ref-datetime-pattern.md". We should fix the link as http://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html. ![image](https://user-images.githubusercontent.com/1097932/87916920-fff57380-ca28-11ea-9028-99b9f9ebdfa4.png) Also, it is nice to add url for [DateTimeFormatter](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html) ### Why are the changes needed? Fix migration guide doc ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Build the doc in local env and check it: ![image](https://user-images.githubusercontent.com/1097932/87919723-13a2d900-ca2d-11ea-9923-a29b4cefaf3c.png) Closes #29162 from gengliangwang/fixDoc. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-20 20:49:22 +09:00
Igor Dvorzhak	32a0451376	[MINOR][DOCS] Fix links to Cloud Storage connectors docs Closes #29155 from medb/patch-1. Authored-by: Igor Dvorzhak <idv@google.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-19 12:19:36 -07:00
Kent Yao	bdeb626c5a	[SPARK-32272][SQL] Add SQL standard command SET TIME ZONE ### What changes were proposed in this pull request? This PR adds the SQL standard command - `SET TIME ZONE` to the current default time zone displacement for the current SQL-session, which is the same as the existing `set spark.sql.session.timeZone=xxx'. All in all, this PR adds syntax as following, ``` SET TIME ZONE LOCAL; SET TIME ZONE 'valid time zone'; -- zone offset or region SET TIME ZONE INTERVAL XXXX; -- xxx must in [-18, + 18] hours, * this range is bigger than ansi [-14, + 14] ``` ### Why are the changes needed? ANSI compliance and supply pure SQL users a way to retrieve all supported TimeZones ### Does this PR introduce _any_ user-facing change? yes, add new syntax. ### How was this patch tested? add unit tests. and locally verified reference doc ![image](https://user-images.githubusercontent.com/8326978/87510244-c8dc3680-c6a5-11ea-954c-b098be84afee.png) Closes #29064 from yaooqinn/SPARK-32272. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-16 13:01:53 +00:00
Warren Zhu	db47c6e340	[SPARK-32125][UI] Support get taskList by status in Web UI and SHS Rest API ### What changes were proposed in this pull request? Support fetching taskList by status as below: ``` /applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList?status=failed ``` ### Why are the changes needed? When there're large number of tasks in one stage, current api is hard to get taskList by status ### Does this PR introduce _any_ user-facing change? Yes. Updated monitoring doc. ### How was this patch tested? Added tests in `HistoryServerSuite` Closes #28942 from warrenzhu25/SPARK-32125. Authored-by: Warren Zhu <zhonzh@microsoft.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-07-16 11:31:24 +08:00
Erik Krogen	cf22d947fb	[SPARK-32036] Replace references to blacklist/whitelist language with more appropriate terminology, excluding the blacklisting feature ### What changes were proposed in this pull request? This PR will remove references to these "blacklist" and "whitelist" terms besides the blacklisting feature as a whole, which can be handled in a separate JIRA/PR. This touches quite a few files, but the changes are straightforward (variable/method/etc. name changes) and most quite self-contained. ### Why are the changes needed? As per discussion on the Spark dev list, it will be beneficial to remove references to problematic language that can alienate potential community members. One such reference is "blacklist" and "whitelist". While it seems to me that there is some valid debate as to whether these terms have racist origins, the cultural connotations are inescapable in today's world. ### Does this PR introduce _any_ user-facing change? In the test file `HiveQueryFileTest`, a developer has the ability to specify the system property `spark.hive.whitelist` to specify a list of Hive query files that should be tested. This system property has been renamed to `spark.hive.includelist`. The old property has been kept for compatibility, but will log a warning if used. I am open to feedback from others on whether keeping a deprecated property here is unnecessary given that this is just for developers running tests. ### How was this patch tested? Existing tests should be suitable since no behavior changes are expected as a result of this PR. Closes #28874 from xkrogen/xkrogen-SPARK-32036-rename-blacklists. Authored-by: Erik Krogen <ekrogen@linkedin.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2020-07-15 11:40:55 -05:00
Baohe Zhang	90b0c26b22	[SPARK-31608][CORE][WEBUI] Add a new type of KVStore to make loading UI faster ### What changes were proposed in this pull request? Add a new class HybridStore to make the history server faster when loading event files. When rebuilding the application state from event logs, HybridStore will write data to InMemoryStore at first and use a background thread to dump data to LevelDB once the writing to InMemoryStore is completed. HybridStore is to make content serving faster by using more memory. It's only safe to enable it when the cluster is not having a heavy load. ### Why are the changes needed? HybridStore can greatly reduce the event logs loading time, especially for large log files. In general, it has 4x - 6x UI loading speed improvement for large log files. The detailed result is shown in comments. ### Does this PR introduce any user-facing change? This PR adds new configs `spark.history.store.hybridStore.enabled` and `spark.history.store.hybridStore.maxMemoryUsage`. ### How was this patch tested? A test suite for HybridStore is added. I also manually tested it on 3.1.0 on mac os. This is a follow-up for the work done by Hieu Huynh in 2019. Closes #28412 from baohe-zhang/SPARK-31608. Authored-by: Baohe Zhang <baohe.zhang@verizonmedia.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-07-15 07:51:13 +09:00
HyukjinKwon	4ad9bfd53b	[SPARK-32138] Drop Python 2.7, 3.4 and 3.5 ### What changes were proposed in this pull request? This PR aims to drop Python 2.7, 3.4 and 3.5. Roughly speaking, it removes all the widely known Python 2 compatibility workarounds such as `sys.version` comparison, `__future__`. Also, it removes the Python 2 dedicated codes such as `ArrayConstructor` in Spark. ### Why are the changes needed? 1. Unsupport EOL Python versions 2. Reduce maintenance overhead and remove a bit of legacy codes and hacks for Python 2. 3. PyPy2 has a critical bug that causes a flaky test, SPARK-28358 given my testing and investigation. 4. Users can use Python type hints with Pandas UDFs without thinking about Python version 5. Users can leverage one latest cloudpickle, https://github.com/apache/spark/pull/28950. With Python 3.8+ it can also leverage C pickle. ### Does this PR introduce _any_ user-facing change? Yes, users cannot use Python 2.7, 3.4 and 3.5 in the upcoming Spark version. ### How was this patch tested? Manually tested and also tested in Jenkins. Closes #28957 from HyukjinKwon/SPARK-32138. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-14 11:22:44 +09:00
Holden Karau	90ac9f975b	[SPARK-32004][ALL] Drop references to slave ### What changes were proposed in this pull request? This change replaces the world slave with alternatives matching the context. ### Why are the changes needed? There is no need to call things slave, we might as well use better clearer names. ### Does this PR introduce _any_ user-facing change? Yes, the ouput JSON does change. To allow backwards compatibility this is an additive change. The shell scripts for starting & stopping workers are renamed, and for backwards compatibility old scripts are added to call through to the new ones while printing a deprecation message to stderr. ### How was this patch tested? Existing tests. Closes #28864 from holdenk/SPARK-32004-drop-references-to-slave. Lead-authored-by: Holden Karau <hkarau@apple.com> Co-authored-by: Holden Karau <holden@pigscanfly.ca> Signed-off-by: Holden Karau <hkarau@apple.com>	2020-07-13 14:05:33 -07:00
Chuliang Xiao	c56c84af47	[MINOR][DOCS] Fix typo in PySpark example in ml-datasource.md ### What changes were proposed in this pull request? This PR changes `true` to `True` in the python code. ### Why are the changes needed? The previous example is a syntax error. ### Does this PR introduce _any_ user-facing change? Yes, but this is doc-only typo fix. ### How was this patch tested? Manually run the example. Closes #29073 from ChuliangXiao/patch-1. Authored-by: Chuliang Xiao <ChuliangX@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-12 09:01:41 -07:00
Wenchen Fan	84db660ebe	[SPARK-32251][SQL][DOCS][TESTS] Fix SQL keyword document ### What changes were proposed in this pull request? This PR improves the test to make sure all the SQL keywords are documented correctly. It fixes several issues: 1. some keywords are not documented 2. some keywords are not ANSI SQL keywords but documented as reserved/non-reserved. ### Why are the changes needed? To make sure the implementation matches the doc. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? new test Closes #29055 from cloud-fan/keyword. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-10 15:10:28 -07:00
Kent Yao	4609f1fdab	[SPARK-32207][SQL] Support 'F'-suffixed Float Literals ### What changes were proposed in this pull request? In this PR, I suppose we support 'f'-suffixed float literal, e.g. `select 1.1f` ### Why are the changes needed? a very common feature across platforms, checked with pg, presto, hive, MySQL... ### Does this PR introduce _any_ user-facing change? yes, `select 1.1f` results float value 1.1 instead of throwing AnlaysisExceptiion`Can't extract value from 1: need struct type but got int;` ### How was this patch tested? add unit tests Closes #29022 from yaooqinn/SPARK-32207. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-09 19:45:16 -07:00
moovlin	9331a5c44b	[SPARK-32035][DOCS][EXAMPLES] Fixed typos involving AWS Access, Secret, & Sessions tokens ### What changes were proposed in this pull request? I resolved some of the inconsistencies of AWS env variables. They're fixed in the documentation as well as in the examples. I grep-ed through the repo to try & find any more instances but nothing popped up. ### Why are the changes needed? As previously mentioned, there is a JIRA request, SPARK-32035, which encapsulates all the issues. But, in summary, the naming of items was inconsistent. ### Does this PR introduce _any_ user-facing change? Correct names: AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN These are the same that AWS uses in their libraries. However, looking through the Spark documentation and comments, I see that these are not denoted correctly across the board: docs/cloud-integration.md 106:1. `spark-submit` reads the `AWS_ACCESS_KEY`, `AWS_SECRET_KEY` <-- both different 107:and `AWS_SESSION_TOKEN` environment variables and sets the associated authentication options docs/streaming-kinesis-integration.md 232:- Set up the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_KEY` with your AWS credentials. <-- secret key different external/kinesis-asl/src/main/python/examples/streaming/kinesis_wordcount_asl.py 34: $ export AWS_ACCESS_KEY_ID=<your-access-key> 35: $ export AWS_SECRET_KEY=<your-secret-key> <-- different 48: Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY <-- secret key different core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala 438: val keyId = System.getenv("AWS_ACCESS_KEY_ID") 439: val accessKey = System.getenv("AWS_SECRET_ACCESS_KEY") 448: val sessionToken = System.getenv("AWS_SESSION_TOKEN") external/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala 53: * $ export AWS_ACCESS_KEY_ID=<your-access-key> 54: * $ export AWS_SECRET_KEY=<your-secret-key> <-- different 65: * Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY <-- secret key different external/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java 59: * $ export AWS_ACCESS_KEY_ID=[your-access-key] 60: * $ export AWS_SECRET_KEY=<your-secret-key> <-- different 71: * Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY <-- secret key different These were all fixed to match names listed under the "correct names" heading. ### How was this patch tested? I built the documentation using jekyll and verified that the changes were present & accurate. Closes #29058 from Moovlin/SPARK-32035. Authored-by: moovlin <richjoerger@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-09 10:35:21 -07:00
GuoPhilipse	09cc6c51ea	[SPARK-32193][SQL][DOCS] Update regexp usage in SQL docs ### What changes were proposed in this pull request? update REGEXP usage and examples in sql-ref-syntx-qry-select-like.cmd ### Why are the changes needed? make the usage of REGEXP known to more users ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No tests Closes #29009 from GuoPhilipse/update-migrate-guide. Lead-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com> Co-authored-by: GuoPhilipse <guofei_ok@126.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-07-09 16:14:33 +09:00
Max Gekk	1261fac674	[SPARK-31710][SQL][FOLLOWUP] Allow cast numeric to timestamp by default ### What changes were proposed in this pull request? 1. Set the SQL config `spark.sql.legacy.allowCastNumericToTimestamp` to `true` by default 2. Remove explicit sets of `spark.sql.legacy.allowCastNumericToTimestamp` to `true` in the cast suites. ### Why are the changes needed? To avoid breaking changes in minor versions (in the upcoming Spark 3.1.0) according to the the semantic versioning guidelines (https://spark.apache.org/versioning-policy.html) ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? By `CastSuite`. Closes #29012 from MaxGekk/allow-cast-numeric-to-timestamp. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-07 14:09:40 -07:00
Huaxin Gao	492d5d174a	[SPARK-32171][SQL][DOCS] Change file locations for use db and refresh table ### What changes were proposed in this pull request? docs/sql-ref-syntax-qry-select-usedb.md -> docs/sql-ref-syntax-ddl-usedb.md docs/sql-ref-syntax-aux-refresh-table.md -> docs/sql-ref-syntax-aux-cache-refresh-table.md ### Why are the changes needed? usedb belongs to DDL. Its location should be consistent with other DDL commands file locations similar reason for refresh table ### Does this PR introduce _any_ user-facing change? before change, when clicking USE DATABASE, the side bar menu shows select commands <img width="1200" alt="Screen Shot 2020-07-04 at 9 05 35 AM" src="https://user-images.githubusercontent.com/13592258/86516696-b45f8a80-bdd7-11ea-8dba-3a5cca22aad3.png"> after change, when clicking USE DATABASE, the side bar menu shows DDL commands <img width="1120" alt="Screen Shot 2020-07-04 at 9 06 06 AM" src="https://user-images.githubusercontent.com/13592258/86516703-bf1a1f80-bdd7-11ea-8a90-ae7eaaafd44c.png"> before change, when clicking refresh table, the side bar menu shows Auxiliary statements <img width="1200" alt="Screen Shot 2020-07-04 at 9 30 40 AM" src="https://user-images.githubusercontent.com/13592258/86516877-3d2af600-bdd9-11ea-9568-0a6f156f57da.png"> after change, when clicking refresh table, the side bar menu shows Cache statements <img width="1199" alt="Screen Shot 2020-07-04 at 9 35 21 AM" src="https://user-images.githubusercontent.com/13592258/86516937-b4f92080-bdd9-11ea-8ad1-5f5a7f58d76b.png"> ### How was this patch tested? Manually build and check Closes #28995 from huaxingao/docs_fix. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>	2020-07-04 19:01:07 -07:00
Max Gekk	bcf23307f4	[SPARK-32130][SQL] Disable the JSON option `inferTimestamp` by default ### What changes were proposed in this pull request? Set the JSON option `inferTimestamp` to `false` if an user don't pass it as datasource option. ### Why are the changes needed? To prevent perf regression while inferring schemas from JSON with potential timestamps fields. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? - Modified existing tests in `JsonSuite` and `JsonInferSchemaSuite`. - Regenerated results of `JsonBenchmark` in the environment: \| Item \| Description \| \| ---- \| ----\| \| Region \| us-west-2 (Oregon) \| \| Instance \| r3.xlarge \| \| AMI \| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) \| \| Java \| OpenJDK 64-Bit Server VM 1.8.0_252 and OpenJDK 64-Bit Server VM 11.0.7+10 \| Closes #28966 from MaxGekk/json-inferTimestamps-disable-by-default. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-01 15:45:39 -07:00
Kousuke Saruta	5176707ac3	[MINOR][DOCS] Fix a typo for a configuration property of resources allocation ### What changes were proposed in this pull request? This PR fixes a typo for a configuration property in the `spark-standalone.md`. `spark.driver.resourcesfile` should be `spark.driver.resourcesFile`. I look for similar typo but this is the only typo. ### Why are the changes needed? The property name is wrong. ### Does this PR introduce _any_ user-facing change? Yes. The property name is corrected. ### How was this patch tested? I confirmed the spell of the property name is the correct from the property name defined in o.a.s.internal.config.package.scala. Closes #28958 from sarutak/fix-resource-typo. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-06-30 09:28:54 -07:00
Guy Khazma	44aecaa912	[SPARK-32099][DOCS] Remove broken link in cloud integration documentation ### What changes were proposed in this pull request? The 3rd link in `IBM Cloud Object Storage connector for Apache Spark` is broken. The PR removes this link. ### Why are the changes needed? broken link ### Does this PR introduce _any_ user-facing change? yes, the broken link is removed from the doc. ### How was this patch tested? doc generation passes successfully as before Closes #28927 from guykhazma/spark32099. Authored-by: Guy Khazma <guykhag@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-06-26 19:12:42 -07:00
gatorsmile	d06604f60a	[SPARK-32078][DOC] Add a redirect to sql-ref from sql-reference ### What changes were proposed in this pull request? This PR is to add a redirect to sql-ref.html. ### Why are the changes needed? Before Spark 3.0 release, we are using sql-reference.md, which was replaced by sql-ref.md instead. A number of Google searches I’ve done today have turned up https://spark.apache.org/docs/latest/sql-reference.html, which does not exist any more. Thus, we should add a redirect to sql-ref.html. ### Does this PR introduce _any_ user-facing change? https://spark.apache.org/docs/latest/sql-reference.html will be redirected to https://spark.apache.org/docs/latest/sql-ref.html ### How was this patch tested? Build it in my local environment. It works well. The sql-reference.html file was generated. The contents are like: ``` <!DOCTYPE html> <html lang="en-US"> <meta charset="utf-8"> <title>Redirecting…</title> <link rel="canonical" href="http://localhost:4000/sql-ref.html"> <script>location="http://localhost:4000/sql-ref.html"</script> <meta http-equiv="refresh" content="0; url=http://localhost:4000/sql-ref.html"> <meta name="robots" content="noindex"> <h1>Redirecting…</h1> <a href="http://localhost:4000/sql-ref.html">Click here if you are not redirected.</a> </html> ``` Closes #28914 from gatorsmile/addRedirectSQLRef. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-06-24 11:00:20 -07:00
sidedoorleftroad	986fa01747	[SPARK-32075][DOCS] Fix a few issues in parameters table ### What changes were proposed in this pull request? Fix a few issues in parameters table in structured-streaming-kafka-integration doc. ### Why are the changes needed? Make the title of the table consistent with the data. ### Does this PR introduce _any_ user-facing change? Yes. Before: ![image](https://user-images.githubusercontent.com/67275816/85414316-8475e300-b59e-11ea-84ec-fa78ecc980b3.png) After: ![image](https://user-images.githubusercontent.com/67275816/85414562-d61e6d80-b59e-11ea-9fe6-247e0ad4d9ee.png) Before: ![image](https://user-images.githubusercontent.com/67275816/85414467-b8510880-b59e-11ea-92a0-7205542fe28b.png) After: ![image](https://user-images.githubusercontent.com/67275816/85414589-de76a880-b59e-11ea-91f2-5073eaf3444b.png) Before: ![image](https://user-images.githubusercontent.com/67275816/85414502-c69f2480-b59e-11ea-837f-1201f10a56b6.png) After: ![image](https://user-images.githubusercontent.com/67275816/85414615-e9313d80-b59e-11ea-9b1a-fc11da0b6bc5.png) ### How was this patch tested? Manually build and check. Closes #28910 from sidedoorleftroad/SPARK-32075. Authored-by: sidedoorleftroad <sidedoorleftroad@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-06-24 13:39:55 +09:00
HyukjinKwon	b62e2536db	[SPARK-32073][R] Drop R < 3.5 support ### What changes were proposed in this pull request? Spark 3.0 accidentally dropped R < 3.5. It is built by R 3.6.3 which not support R < 3.5: ``` Error in readRDS(pfile) : cannot read workspace version 3 written by R 3.6.3; need R 3.5.0 or newer version. ``` In fact, with SPARK-31918, we will have to drop R < 3.5 entirely to support R 4.0.0. This is inevitable to release on CRAN because they require to make the tests pass with the latest R. ### Why are the changes needed? To show the supported versions correctly, and support R 4.0.0 to unblock the releases. ### Does this PR introduce _any_ user-facing change? In fact, no because Spark 3.0.0 already does not work with R < 3.5. Compared to Spark 2.4, yes. R < 3.5 would not work. ### How was this patch tested? Jenkins should test it out. Closes #28908 from HyukjinKwon/SPARK-32073. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-06-24 11:05:27 +09:00
Gabor Somogyi	a9247c39d2	[SPARK-32033][SS][DSTEAMS] Use new poll API in Kafka connector executor side to avoid infinite wait ### What changes were proposed in this pull request? Spark uses an old and deprecated API named `KafkaConsumer.poll(long)` which never returns and stays in live lock if metadata is not updated (for instance when broker disappears at consumer creation). Please see [Kafka documentation](https://kafka.apache.org/25/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#poll-long-) and [standalone test application](https://github.com/gaborgsomogyi/kafka-get-assignment) for further details. In this PR I've applied the new `KafkaConsumer.poll(Duration)` API on executor side. Please note driver side still uses the old API which will be fixed in SPARK-32032. ### Why are the changes needed? Infinite wait in `KafkaConsumer.poll(long)`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #28871 from gaborgsomogyi/SPARK-32033. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-06-19 14:46:26 -07:00
Yuanjian Li	8750363c8d	[MINOR][DOCS] Emphasize the Streaming tab is for DStream API ### What changes were proposed in this pull request? Emphasize the Streaming tab is for DStream API. ### Why are the changes needed? Some users reported that it's a little confusing of the streaming tab and structured streaming tab. ### Does this PR introduce _any_ user-facing change? Document change. ### How was this patch tested? N/A Closes #28854 from xuanyuanking/minor-doc. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-06-19 12:17:40 +09:00
James Yu	ac98a9a07f	[MINOR][DOCS] Update running-on-kubernetes.md ### What changes were proposed in this pull request? Fix executor container name typo. `executor` should be `spark-kubernetes-executor`. ### Why are the changes needed? The Executor pod container name the users actually get from their Kubernetes clusters is different from that described in the documentation. For example, below is what a user get from an executor pod. ``` Containers: spark-kubernetes-executor: Container ID: docker://aaaabbbbccccddddeeeeffff Image: <imagename> Image ID: docker-pullable://0000.dkr.ecr.us-east-0.amazonaws.com/spark Port: 7079/TCP Host Port: 0/TCP Args: executor State: Running Started: Thu, 28 May 2020 05:54:04 -0700 Ready: True Restart Count: 0 Limits: memory: 16Gi ``` ### Does this PR introduce _any_ user-facing change? Document change. ### How was this patch tested? N/A Closes #28862 from yuj/patch-1. Authored-by: James Yu <yuj@users.noreply.github.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-06-18 14:36:20 -07:00
DB Tsai	9b792518b2	[SPARK-31960][YARN][BUILD] Only populate Hadoop classpath for no-hadoop build ### What changes were proposed in this pull request? If a Spark distribution has built-in hadoop runtime, Spark will not populate the hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath` when a job is submitted to Yarn. Users can override this behavior by setting `spark.yarn.populateHadoopClasspath` to `true`. ### Why are the changes needed? Without this, Spark will populate the hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath` even Spark distribution has built-in hadoop. This results jar conflict and many unexpected behaviors in runtime. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually test with two builds, with-hadoop and no-hadoop builds. Closes #28788 from dbtsai/yarn-classpath. Authored-by: DB Tsai <d_tsai@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2020-06-18 06:08:40 +00:00
GuoPhilipse	f0e6d0ec13	[SPARK-31710][SQL] Fail casting numeric to timestamp by default ## What changes were proposed in this pull request? we fail casting from numeric to timestamp by default. ## Why are the changes needed? casting from numeric to timestamp is not a non-standard,meanwhile it may generate different result between spark and other systems,for example hive ## Does this PR introduce any user-facing change? Yes,user cannot cast numeric to timestamp directly,user have to use the following function to achieve the same effect:TIMESTAMP_SECONDS/TIMESTAMP_MILLIS/TIMESTAMP_MICROS ## How was this patch tested? unit test added Closes #28593 from GuoPhilipse/31710-fix-compatibility. Lead-authored-by: GuoPhilipse <guofei_ok@126.com> Co-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-06-16 08:35:35 +00:00
Jungtaek Lim (HeartSaVioR)	fe68e95a5a	[SPARK-24634][SS][FOLLOWUP] Rename the variable from "numLateInputs" to "numRowsDroppedByWatermark" ### What changes were proposed in this pull request? This PR renames the variable from "numLateInputs" to "numRowsDroppedByWatermark" so that it becomes self-explanation. ### Why are the changes needed? This is originated from post-review, see https://github.com/apache/spark/pull/28607#discussion_r439853232 ### Does this PR introduce _any_ user-facing change? No, as SPARK-24634 is not introduced in any release yet. ### How was this patch tested? Existing UTs. Closes #28828 from HeartSaVioR/SPARK-24634-v3-followup. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-06-16 16:41:08 +09:00
Takeshi Yamamuro	3698a14204	[SPARK-26905][SQL] Follow the SQL:2016 reserved keywords ### What changes were proposed in this pull request? This PR intends to move keywords `ANTI`, `SEMI`, and `MINUS` from reserved to non-reserved. ### Why are the changes needed? To comply with the ANSI/SQL standard. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests. Closes #28807 from maropu/SPARK-26905-2. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-06-16 00:27:45 +09:00
yi.wu	54e702c0dd	[SPARK-31970][CORE] Make MDC configuration step be consistent between setLocalProperty and log4j.properties ### What changes were proposed in this pull request? This PR proposes to use "mdc.XXX" as the consistent key for both `sc.setLocalProperty` and `log4j.properties` when setting up configurations for MDC. ### Why are the changes needed? It's weird that we use "mdc.XXX" as key to set MDC value via `sc.setLocalProperty` while we use "XXX" as key to set MDC pattern in log4j.properties. It could also bring extra burden to the user. ### Does this PR introduce _any_ user-facing change? No, as MDC feature is added in version 3.1, which hasn't been released. ### How was this patch tested? Tested manually. Closes #28801 from Ngone51/consistent-mdc. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-06-14 14:26:11 -07:00
Jungtaek Lim (HeartSaVioR)	84815d0550	[SPARK-24634][SS] Add a new metric regarding number of inputs later than watermark plus allowed delay ### What changes were proposed in this pull request? Please refer https://issues.apache.org/jira/browse/SPARK-24634 to see rationalization of the issue. This patch adds a new metric to count the number of inputs arrived later than watermark plus allowed delay. To make changes simpler, this patch doesn't count the exact number of input rows which are later than watermark plus allowed delay. Instead, this patch counts the inputs which are dropped in the logic of operator. The difference of twos are shown in streaming aggregation: to optimize the calculation, streaming aggregation "pre-aggregates" the input rows, and later checks the lateness against "pre-aggregated" inputs, hence the number might be reduced. The new metric will be provided via two places: 1. On Spark UI: check the metrics in stateful operator nodes in query execution details page in SQL tab 2. On Streaming Query Listener: check "numLateInputs" in "stateOperators" in QueryProcessEvent. ### Why are the changes needed? Dropping late inputs means that end users might not get expected outputs. Even end users may indicate the fact and tolerate the result (as that's what allowed lateness is for), but they should be able to observe whether the current value of allowed lateness drops inputs or not so that they can adjust the value. Also, whatever the chance they have multiple of stateful operators in a single query, if Spark drops late inputs "between" these operators, it becomes "correctness" issue. Spark should disallow such possibility, but given we already provided the flexibility, at least we should provide the way to observe the correctness issue and decide whether they should make correction of their query or not. ### Does this PR introduce _any_ user-facing change? Yes. End users will be able to retrieve the information of late inputs via two ways: 1. SQL tab in Spark UI 2. Streaming Query Listener ### How was this patch tested? New UTs added & existing UTs are modified to reflect the change. And ran manual test reproducing SPARK-28094. I've picked the specific case on "B outer C outer D" which is enough to represent the "intermediate late row" issue due to global watermark. https://gist.github.com/jammann/b58bfbe0f4374b89ecea63c1e32c8f17 Spark logs warning message on the query which means SPARK-28074 is working correctly, ``` 20/05/30 17:52:47 WARN UnsupportedOperationChecker: Detected pattern of possible 'correctness' issue due to global watermark. The query contains stateful operation which can emit rows older than the current watermark plus allowed late record delay, which are "late rows" in downstream stateful operations and these rows can be discarded. Please refer the programming guide doc for more details.; Join LeftOuter, ((D_FK#28 = D_ID#87) AND (B_LAST_MOD#26-T30000ms = D_LAST_MOD#88-T30000ms)) :- Join LeftOuter, ((C_FK#27 = C_ID#58) AND (B_LAST_MOD#26-T30000ms = C_LAST_MOD#59-T30000ms)) : :- EventTimeWatermark B_LAST_MOD#26: timestamp, 30 seconds : : +- Project [v#23.B_ID AS B_ID#25, v#23.B_LAST_MOD AS B_LAST_MOD#26, v#23.C_FK AS C_FK#27, v#23.D_FK AS D_FK#28] : : +- Project [from_json(StructField(B_ID,StringType,false), StructField(B_LAST_MOD,TimestampType,false), StructField(C_FK,StringType,true), StructField(D_FK,StringType,true), value#21, Some(UTC)) AS v#23] : : +- Project [cast(value#8 as string) AS value#21] : : +- StreamingRelationV2 org.apache.spark.sql.kafka010.KafkaSourceProvider3a7fd18c, kafka, org.apache.spark.sql.kafka010.KafkaSourceProvider$KafkaTable396d2958, org.apache.spark.sql.util.CaseInsensitiveStringMapa51ee61a, [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13], StreamingRelation DataSource(org.apache.spark.sql.SparkSessiond221af8,kafka,List(),None,List(),None,Map(inferSchema -> true, startingOffsets -> earliest, subscribe -> B, kafka.bootstrap.servers -> localhost:9092),None), kafka, [key#0, value#1, topic#2, partition#3, offset#4L, timestamp#5, timestampType#6] : +- EventTimeWatermark C_LAST_MOD#59: timestamp, 30 seconds : +- Project [v#56.C_ID AS C_ID#58, v#56.C_LAST_MOD AS C_LAST_MOD#59] : +- Project [from_json(StructField(C_ID,StringType,false), StructField(C_LAST_MOD,TimestampType,false), value#54, Some(UTC)) AS v#56] : +- Project [cast(value#41 as string) AS value#54] : +- StreamingRelationV2 org.apache.spark.sql.kafka010.KafkaSourceProvider3f507373, kafka, org.apache.spark.sql.kafka010.KafkaSourceProvider$KafkaTable7b6736a4, org.apache.spark.sql.util.CaseInsensitiveStringMapa51ee61b, [key#40, value#41, topic#42, partition#43, offset#44L, timestamp#45, timestampType#46], StreamingRelation DataSource(org.apache.spark.sql.SparkSessiond221af8,kafka,List(),None,List(),None,Map(inferSchema -> true, startingOffsets -> earliest, subscribe -> C, kafka.bootstrap.servers -> localhost:9092),None), kafka, [key#33, value#34, topic#35, partition#36, offset#37L, timestamp#38, timestampType#39] +- EventTimeWatermark D_LAST_MOD#88: timestamp, 30 seconds +- Project [v#85.D_ID AS D_ID#87, v#85.D_LAST_MOD AS D_LAST_MOD#88] +- Project [from_json(StructField(D_ID,StringType,false), StructField(D_LAST_MOD,TimestampType,false), value#83, Some(UTC)) AS v#85] +- Project [cast(value#70 as string) AS value#83] +- StreamingRelationV2 org.apache.spark.sql.kafka010.KafkaSourceProvider2b90e779, kafka, org.apache.spark.sql.kafka010.KafkaSourceProvider$KafkaTable36f8cd29, org.apache.spark.sql.util.CaseInsensitiveStringMapa51ee620, [key#69, value#70, topic#71, partition#72, offset#73L, timestamp#74, timestampType#75], StreamingRelation DataSource(org.apache.spark.sql.SparkSessiond221af8,kafka,List(),None,List(),None,Map(inferSchema -> true, startingOffsets -> earliest, subscribe -> D, kafka.bootstrap.servers -> localhost:9092),None), kafka, [key#62, value#63, topic#64, partition#65, offset#66L, timestamp#67, timestampType#68] ``` and we can find the late inputs from the batch 4 as follows: ![Screen Shot 2020-05-30 at 18 02 53](https://user-images.githubusercontent.com/1317309/83324401-058fd200-a2a0-11ea-8bf6-89cf777e9326.png) which represents intermediate inputs are being lost, ended up with correctness issue. Closes #28607 from HeartSaVioR/SPARK-24634-v3. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-06-14 14:37:38 +09:00
Kent Yao	6a424b93e5	[SPARK-31830][SQL] Consistent error handling for datetime formatting and parsing functions ### What changes were proposed in this pull request? Currently, `date_format` and `from_unixtime`, `unix_timestamp`,`to_unix_timestamp`, `to_timestamp`, `to_date` have different exception handling behavior for formatting datetime values. In this PR, we apply the exception handling behavior of `date_format` to `from_unixtime`, `unix_timestamp`,`to_unix_timestamp`, `to_timestamp` and `to_date`. In the phase of creating the datetime formatted or formating, exceptions will be raised. e.g. ```java spark-sql> select date_format(make_timestamp(1, 1 ,1,1,1,1), 'yyyyyyyyyyy-MM-aaa'); 20/05/28 15:25:38 ERROR SparkSQLDriver: Failed in [select date_format(make_timestamp(1, 1 ,1,1,1,1), 'yyyyyyyyyyy-MM-aaa')] org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyyyyyyyyy-MM-aaa' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html ``` ```java spark-sql> select date_format(make_timestamp(1, 1 ,1,1,1,1), 'yyyyyyyyyyy-MM-AAA'); 20/05/28 15:26:10 ERROR SparkSQLDriver: Failed in [select date_format(make_timestamp(1, 1 ,1,1,1,1), 'yyyyyyyyyyy-MM-AAA')] java.lang.IllegalArgumentException: Illegal pattern character: A ``` ```java spark-sql> select date_format(make_timestamp(1,1,1,1,1,1), 'yyyyyyyyyyy-MM-dd'); 20/05/28 15:23:23 ERROR SparkSQLDriver: Failed in [select date_format(make_timestamp(1,1,1,1,1,1), 'yyyyyyyyyyy-MM-dd')] java.lang.ArrayIndexOutOfBoundsException: 11 at java.time.format.DateTimeFormatterBuilder$NumberPrinterParser.format(DateTimeFormatterBuilder.java:2568) ``` In the phase of parsing, `DateTimeParseException \| DateTimeException \| ParseException` will be suppressed, but `SparkUpgradeException` will still be raised e.g. ```java spark-sql> set spark.sql.legacy.timeParserPolicy=exception; spark.sql.legacy.timeParserPolicy exception spark-sql> select to_timestamp("2020-01-27T20:06:11.847-0800", "yyyy-MM-dd'T'HH:mm:ss.SSSz"); 20/05/28 15:31:15 ERROR SparkSQLDriver: Failed in [select to_timestamp("2020-01-27T20:06:11.847-0800", "yyyy-MM-dd'T'HH:mm:ss.SSSz")] org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '2020-01-27T20:06:11.847-0800' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string. ``` ```java spark-sql> set spark.sql.legacy.timeParserPolicy=corrected; spark.sql.legacy.timeParserPolicy corrected spark-sql> select to_timestamp("2020-01-27T20:06:11.847-0800", "yyyy-MM-dd'T'HH:mm:ss.SSSz"); NULL spark-sql> set spark.sql.legacy.timeParserPolicy=legacy; spark.sql.legacy.timeParserPolicy legacy spark-sql> select to_timestamp("2020-01-27T20:06:11.847-0800", "yyyy-MM-dd'T'HH:mm:ss.SSSz"); 2020-01-28 12:06:11.847 ``` ### Why are the changes needed? Consistency ### Does this PR introduce _any_ user-facing change? Yes, invalid datetime patterns will fail `from_unixtime`, `unix_timestamp`,`to_unix_timestamp`, `to_timestamp` and `to_date` instead of resulting `NULL` ### How was this patch tested? add more tests Closes #28650 from yaooqinn/SPARK-31830. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-06-09 16:56:45 +00:00
Akshat Bordia	6befb2d8bd	[SPARK-31486][CORE] spark.submit.waitAppCompletion flag to control spark-submit exit in Standalone Cluster Mode ### What changes were proposed in this pull request? These changes implement an application wait mechanism which will allow spark-submit to wait until the application finishes in Standalone Spark Mode. This will delay the exit of spark-submit JVM until the job is completed. This implementation will keep monitoring the application until it is either finished, failed or killed. This will be controlled via a flag (spark.submit.waitForCompletion) which will be set to false by default. ### Why are the changes needed? Currently, Livy API for Standalone Cluster Mode doesn't know when the job has finished. If this flag is enabled, this can be used by Livy API (/batches/{batchId}/state) to find out when the application has finished/failed. This flag is Similar to spark.yarn.submit.waitAppCompletion. ### Does this PR introduce any user-facing change? Yes, this PR introduces a new flag but it will be disabled by default. ### How was this patch tested? Couldn't implement unit tests since the pollAndReportStatus method has System.exit() calls. Please provide any suggestions. Tested spark-submit locally for the following scenarios: 1. With the flag enabled, spark-submit exits once the job is finished. 2. With the flag enabled and job failed, spark-submit exits when the job fails. 3. With the flag disabled, spark-submit exists post submitting the job (existing behavior). 4. Existing behavior is unchanged when the flag is not added explicitly. Closes #28258 from akshatb1/master. Lead-authored-by: Akshat Bordia <akshat.bordia31@gmail.com> Co-authored-by: Akshat Bordia <akshat.bordia@citrix.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-06-09 09:29:37 -05:00
Jungtaek Lim (HeartSaVioR)	8305b77796	[SPARK-28199][SPARK-28199][SS][FOLLOWUP] Mention the change of into the SS migration guide ### What changes were proposed in this pull request? SPARK-28199 (#24996) made the trigger related public API to be exposed only from static methods of Trigger class. This is backward incompatible change, so some users may experience compilation error after upgrading to Spark 3.0.0. While we plan to mention the change into release note, it's good to mention the change to the migration guide doc as well, since the purpose of the doc is to collect the major changes/incompatibilities between versions and end users would refer the doc. ### Why are the changes needed? SPARK-28199 is technically backward incompatible change and we should kindly guide the change. ### Does this PR introduce _any_ user-facing change? Doc change. ### How was this patch tested? N/A, as it's just a doc change. Closes #28763 from HeartSaVioR/SPARK-28199-FOLLOWUP-doc. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-06-09 04:52:48 +00:00
Gabor Somogyi	04f66bfd4e	[MINOR][SS][DOCS] fileNameOnly parameter description re-unite ### What changes were proposed in this pull request? `fileNameOnly` parameter is split to 2 pieces in [this](`dbb8143501`) commit. This PR re-unites it. ### Why are the changes needed? Parameter description split in doc. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? ``` cd docs/ SKIP_API=1 jekyll build ``` Manual webpage check. Closes #28739 from gaborgsomogyi/datasettxtfix. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-06-06 16:49:48 +09:00

1 2 3 4 5 ...

2907 commits