Commit graph

28661 commits

Author SHA1 Message Date
HyukjinKwon df8d3f1bf7 [SPARK-33544][SQL][FOLLOW-UP] Rename NoSideEffect to NoThrow and clarify the documentation more
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/30504. It proposes:

- Rename `NoSideEffect` to `NoThrow`, and use it together with `Expression.deterministic` where it is used.
- Clarify, in the expression docs, that it means they don't throw exceptions

### Why are the changes needed?

`NoSideEffect` virtually means that `Expression.eval` does not throw an exception, and the expressions are deterministic.
It's best to be explicit, so `NoThrow` was proposed. I looked for an existing name that represents this concept and borrowed it from [nothrow](https://clang.llvm.org/docs/AttributeReference.html#nothrow).
For determinism, we already have a way to note it under `Expression.deterministic`.
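
A minimal sketch of the idea, using simplified stand-ins rather than the actual Catalyst classes:
```scala
// Simplified stand-ins, not the actual Catalyst traits.
trait Expression {
  def deterministic: Boolean
  def eval(): Any
}
// Marker trait: mixing it in documents that eval() never throws an exception.
trait NoThrow { self: Expression => }

// An expression is safe to pre-evaluate/fold only if it never throws and is deterministic.
def safeToEvalEagerly(e: Expression): Boolean = e.isInstanceOf[NoThrow] && e.deterministic
```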

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually ran the existing unit tests.

Closes #30570 from HyukjinKwon/SPARK-33544.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-02 16:03:08 +00:00
neko 28dad1ba77 [SPARK-33504][CORE] Sensitive attributes in the application log in the Spark history server should be redacted
### What changes were proposed in this pull request?
To make sure that sensitive attributes are redacted in the history server log.

### Why are the changes needed?
We found that sensitive attributes like passwords in SparkListenerJobStart and SparkListenerStageSubmitted events were not redacted, so they could be viewed directly.
A screenshot can be viewed in the attachment of JIRA SPARK-33504.
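
A minimal sketch of the kind of redaction applied to event properties; the regex and replacement text are illustrative, not the actual Spark implementation:
```scala
// Redact values whose keys look sensitive; pattern and replacement are illustrative.
val sensitiveKeyPattern = "(?i)secret|password|token|access[.]key".r

def redact(props: Map[String, String]): Map[String, String] =
  props.map { case (k, v) =>
    if (sensitiveKeyPattern.findFirstIn(k).isDefined) k -> "*********(redacted)" else k -> v
  }

// redact(Map("spark.ssl.keyPassword" -> "hunter2", "spark.app.name" -> "demo"))
// => Map("spark.ssl.keyPassword" -> "*********(redacted)", "spark.app.name" -> "demo")
```
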
### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Manual tests work well; I have also added a unit test case.

Closes #30446 from akiyamaneko/eventlog_unredact.

Authored-by: neko <echohlne@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-12-02 09:24:19 -06:00
yangjie01 084d38b64e [SPARK-33557][CORE][MESOS][TEST] Ensure the relationship between STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT and NETWORK_TIMEOUT
### What changes were proposed in this pull request?
As described in SPARK-33557, `HeartbeatReceiver` and `MesosCoarseGrainedSchedulerBackend` will always use `Network.NETWORK_TIMEOUT.defaultValueString` as the value of `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` when we configure `NETWORK_TIMEOUT` without configuring `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT`. This is different from the relationship described in `configuration.md`.

To fix this problem, the main changes of this PR are as follows:

- Remove the explicit default value of `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT`

- Use the actual value of `NETWORK_TIMEOUT` as `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` when `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` is not configured in `HeartbeatReceiver` and `MesosCoarseGrainedSchedulerBackend`

### Why are the changes needed?
To ensure the relationship between `NETWORK_TIMEOUT` and `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` matches what is described in `configuration.md`.
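
A sketch of the intended fallback semantics; the config key names are illustrative and values are plain milliseconds to avoid time-string parsing, so this is not the actual config API:
```scala
// Fall back to the configured NETWORK_TIMEOUT value, not to its default value string.
def heartbeatTimeoutMs(conf: Map[String, Long]): Long = {
  val networkTimeoutMs = conf.getOrElse("spark.network.timeout", 120000L)
  conf.getOrElse("spark.storage.blockManagerHeartbeatTimeoutMs", networkTimeoutMs)
}

// heartbeatTimeoutMs(Map("spark.network.timeout" -> 300000L)) == 300000L
```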

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Action

- Manually tested configuring `NETWORK_TIMEOUT` and `STORAGE_BLOCKMANAGER_HEARTBEAT_TIMEOUT` locally

Closes #30547 from LuciferYang/SPARK-33557.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-02 18:41:49 +09:00
Dongjoon Hyun 290aa02179 [SPARK-33618][CORE] Use hadoop-client instead of hadoop-client-api to make hadoop-aws work
### What changes were proposed in this pull request?

This reverts commit SPARK-33212 (cb3fa6c936) mostly with three exceptions:
1. `SparkSubmitUtils` was updated recently by SPARK-33580
2. `resource-managers/yarn/pom.xml` was updated recently by SPARK-33104 to add `hadoop-yarn-server-resourcemanager` test dependency.
3. Adjust `com.fasterxml.jackson.module:jackson-module-jaxb-annotations` dependency in K8s module which is updated recently by SPARK-33471.

### Why are the changes needed?

According to [HADOOP-16080](https://issues.apache.org/jira/browse/HADOOP-16080), since Apache Hadoop 3.1.1, `hadoop-aws` doesn't work with `hadoop-client-api`. It fails at write operations like the following.

**1. Spark distribution with `-Phadoop-cloud`**

```scala
$ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY
20/11/30 23:01:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context available as 'sc' (master = local[*], app id = local-1606806088715).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.read.parquet("s3a://dongjoon/users.parquet").show
20/11/30 23:01:34 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+

scala> Seq(1).toDF.write.parquet("s3a://dongjoon/out.parquet")
20/11/30 23:02:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)/ 1]
java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V
```

**2. Spark distribution without `-Phadoop-cloud`**
```scala
$ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY -c spark.eventLog.enabled=true -c spark.eventLog.dir=s3a://dongjoon/spark-events/ --packages org.apache.hadoop:hadoop-aws:3.2.0,org.apache.hadoop:hadoop-common:3.2.0
...
java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V
  at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:772)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CI.

Closes #30508 from dongjoon-hyun/SPARK-33212-REVERT.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-02 18:23:48 +09:00
Cheng Su a4788ee8c6 [MINOR][SS] Rename auxiliary protected methods in StreamingJoinSuite
### What changes were proposed in this pull request?

Per request from https://github.com/apache/spark/pull/30395#issuecomment-735028698, here we remove `Windowed` from the method names `setupWindowedJoinWithRangeCondition` and `setupWindowedSelfJoin`, as they don't join on a time window.

### Why are the changes needed?

There's no official name for a `windowed join`, so this helps avoid confusion for future developers.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests.

Closes #30563 from c21/stream-minor.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-12-02 15:28:16 +09:00
Cheng Su 51ebcd95a5 [SPARK-32863][SS] Full outer stream-stream join
### What changes were proposed in this pull request?

This PR adds full outer stream-stream join. The implementation of the full outer join is:
* For a left-side input row, check if there's a match in the right-side state store.
  * If there's a match, output the joined row; otherwise output nothing. Put the row in the left-side state store.
* For a right-side input row, check if there's a match in the left-side state store.
  * If there's a match, output the joined row; otherwise output nothing. Put the row in the right-side state store.
* State store eviction: evict rows below the watermark from the left/right state stores, and output rows that never matched before (a combination of left outer and right outer join). A simplified sketch follows this list.
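
A minimal sketch of the per-row logic, assuming state stores are modeled as in-memory buffers and a row is a simplified (key, value) pair; this is illustrative pseudocode, not the actual Structured Streaming implementation:
```scala
import scala.collection.mutable.ArrayBuffer

// Simplified stand-in for an input row and a state store.
type Row = (Int, String)

def processInput(row: Row, myState: ArrayBuffer[Row], otherState: ArrayBuffer[Row]): Seq[(Row, Row)] = {
  val matches = otherState.filter(_._1 == row._1) // look up matches in the other side's state
  myState += row                                  // always store the input row in this side's state
  matches.map(other => (row, other)).toSeq        // emit joined rows only when there is a match
}
// On eviction below the watermark, any stored row that never matched is emitted with
// nulls on the other side, combining left outer and right outer behavior.
```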

### Why are the changes needed?

Enable more use cases for spark stream-stream join.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests in `UnsupportedOperationChecker.scala` and `StreamingJoinSuite.scala`.

Closes #30395 from c21/stream-foj.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-12-02 10:17:00 +09:00
Thomas Graves f71f34572d [SPARK-33544][SQL] Optimize size of CreateArray/CreateMap to be the size of its children
### What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-32295 added an optimization to insert a filter for not-null and size > 0 when using inner explode/inline. This is fine in most cases, but the extra filter is not needed if the explode is over a CreateArray that does not use Literals (Literals are already handled). In that case you know that the values aren't null and the array has a size. The empty array is already handled.

The not-null check is already optimized out because CreateArray and CreateMap are not nullable, which leaves the size > 0 check. To handle that, this PR makes ConstantFolding optimize the size check into the size of the children in the array or map. That makes it a literal, which is then ultimately optimized out.
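
A sketch of the folding idea on a simplified expression tree; the types and rule are illustrative, not the actual Catalyst implementation:
```scala
// Simplified expression tree, not the actual Catalyst classes.
sealed trait Expr
case class Literal(value: Any) extends Expr
case class CreateArray(children: Seq[Expr]) extends Expr
case class Size(child: Expr) extends Expr

def foldSize(e: Expr): Expr = e match {
  // size(array(a, b, c)) is known statically: replace it with the literal 3,
  // which lets the later `size > 0` filter be constant-folded away.
  case Size(CreateArray(children)) => Literal(children.length)
  case other                       => other
}
```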

### Why are the changes needed?
Remove an unneeded filter.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Unit tests added and manually tested various cases

Closes #30504 from tgravescs/SPARK-33544.

Lead-authored-by: Thomas Graves <tgraves@nvidia.com>
Co-authored-by: Thomas Graves <tgraves@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-02 09:50:02 +09:00
zero323 5a1c5ac807 [SPARK-33622][R][ML] Add array_to_vector to SparkR
### What changes were proposed in this pull request?

This PR adds `array_to_vector` to the R API.

### Why are the changes needed?

Feature parity.

### Does this PR introduce _any_ user-facing change?

New function exposed in the public API.

### How was this patch tested?

New unit test.
Manual verification of the documentation examples.

Closes #30561 from zero323/SPARK-33622.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-01 10:44:14 -08:00
Gengliang Wang 5d0045eedf [SPARK-33611][UI] Avoid encoding twice on the query parameter of rewritten proxy URL
### What changes were proposed in this pull request?

When running Spark behind a reverse proxy(e.g. Nginx, Apache HTTP server), the request URL can be encoded twice if we pass the query string directly to the constructor of `java.net.URI`:
```
> val uri = "http://localhost:8081/test"
> val query = "order%5B0%5D%5Bcolumn%5D=0"  // query string of URL from the reverse proxy
> val rewrittenURI = URI.create(uri.toString())

> new URI(rewrittenURI.getScheme(),
      rewrittenURI.getAuthority(),
      rewrittenURI.getPath(),
      query,
      rewrittenURI.getFragment()).toString
result: http://localhost:8081/test?order%255B0%255D%255Bcolumn%255D=0
```

In Spark's stage page, the URL of "/taskTable" contains the query parameter order[0][dir]. After encoding twice, the query parameter becomes `order%255B0%255D%255Bdir%255D` and it will be decoded as `order%5B0%5D%5Bdir%5D` instead of `order[0][dir]`. As a result, there will be a NullPointerException from https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/status/api/v1/StagesResource.scala#L176
Other parameters may also not work as expected after being encoded twice.

This PR is to fix the bug by calling the method `URI.create(String URL)` directly. This convenience method can avoid encoding twice on the query parameter.
```
> val uri = "http://localhost:8081/test"
> val query = "order%5B0%5D%5Bcolumn%5D=0"
> URI.create(s"$uri?$query").toString
result: http://localhost:8081/test?order%5B0%5D%5Bcolumn%5D=0

> URI.create(s"$uri?$query").getQuery
result: order[0][column]=0
```

### Why are the changes needed?

Fix a potential bug when Spark's reverse proxy is enabled.
The bug itself is similar to https://github.com/apache/spark/pull/29271.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Add a new unit test.
Also, manual UI testing for the master, worker and app UI with an nginx proxy

Spark config:
```
spark.ui.port 8080
spark.ui.reverseProxy=true
spark.ui.reverseProxyUrl=/path/to/spark/
```
nginx config:
```
server {
    listen 9000;
    set $SPARK_MASTER http://127.0.0.1:8080;
    # split spark UI path into prefix and local path within master UI
    location ~ ^(/path/to/spark/) {
        # strip prefix when forwarding request
        rewrite /path/to/spark(/.*) $1  break;
        #rewrite /path/to/spark/ "/" ;
        # forward to spark master UI
        proxy_pass $SPARK_MASTER;
        proxy_intercept_errors on;
        error_page 301 302 307 = handle_redirects;
    }
    location handle_redirects {
        set $saved_redirect_location '$upstream_http_location';
        proxy_pass $saved_redirect_location;
    }
}
```

Closes #30552 from gengliangwang/decodeProxyRedirect.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-12-02 01:36:41 +08:00
Anton Okolnychyi c24f2b2d6a [SPARK-33612][SQL] Add dataSourceRewriteRules batch to Optimizer
### What changes were proposed in this pull request?

This PR adds a new batch to the optimizer for executing rules that rewrite plans for data sources.

### Why are the changes needed?

Right now, we have a special place in the optimizer where we construct v2 scans. As time shows, we need more rewrite rules that would be executed after the operator optimization and before any stats-related rules for v2 tables. Not all rules will be specific to reads. One option is to rename the current batch into something more generic but it would require changing quite some places. That's why it seems better to introduce a new batch and use it for all rewrites. The name is generic so that we don't limit ourselves to v2 data sources only.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The change is trivial and SPARK-23889 will depend on it.

Closes #30558 from aokolnychyi/spark-33612.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-01 09:27:46 -08:00
Anton Okolnychyi 478fb7f528 [SPARK-33608][SQL] Handle DELETE/UPDATE/MERGE in PullupCorrelatedPredicates
### What changes were proposed in this pull request?

This PR adds logic to handle DELETE/UPDATE/MERGE plans in `PullupCorrelatedPredicates`.

### Why are the changes needed?

Right now, `PullupCorrelatedPredicates` applies only to filters and unary nodes. As a result, correlated predicates in DELETE/UPDATE/MERGE are not rewritten.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The PR adds 3 new test cases.

Closes #30555 from aokolnychyi/spark-33608.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-01 14:11:01 +00:00
Prakhar Jain cf4ad212b1 [SPARK-33503][SQL] Refactor SortOrder class to allow multiple children
### What changes were proposed in this pull request?
This is a followup of #30302. As part of this PR, the sameOrderExpressions set is made part of the children of the SortOrder node, so that they don't need any special handling as done in #30302.

### Why are the changes needed?
sameOrderExpressions should get the same treatment as the child expression, so making them part of children makes them easier to transform.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #30430 from prakharjain09/SPARK-33400-sortorder-refactor.

Authored-by: Prakhar Jain <prakharjain09@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-12-01 21:13:27 +09:00
gengjiaan 9273d4250d [SPARK-33045][SQL][FOLLOWUP] Support built-in function like_any and fix StackOverflowError issue
### What changes were proposed in this pull request?
Spark already supports the `LIKE ANY` syntax, but it will throw `StackOverflowError` if there are many elements (more than 14378 elements). We should implement a built-in function for `LIKE ANY` to fix this issue.

Why can the stack overflow happen in the current approach?
The current approach uses reduceLeft to chain each `Like(e, p)` expression together, which makes the call depth of the thread too large and causes `StackOverflowError` problems.

Why does the fix in this PR avoid the error?
This PR adds a built-in function for `LIKE ANY`, which keeps the expression tree flat and avoids this issue.
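
A minimal sketch of why chaining predicates with reduceLeft builds a tree deep enough to overflow the stack; the types are illustrative, not the actual Catalyst expressions:
```scala
sealed trait Expr
case class Like(col: String, pattern: String) extends Expr
case class Or(left: Expr, right: Expr) extends Expr
case class LikeAny(col: String, patterns: Seq[String]) extends Expr

val patterns = (1 to 20000).map(i => s"pattern$i%")

// reduceLeft nests each step one level deeper, producing a left-deep Or tree ~20000 levels
// deep; recursively traversing or evaluating such a tree is what triggers StackOverflowError.
val deepTree: Expr = patterns.map(p => Like("col", p): Expr).reduceLeft(Or(_, _))

// A dedicated expression keeps the tree flat, no matter how many patterns there are.
val flat: Expr = LikeAny("col", patterns)
```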

### Why are the changes needed?
1. Fix the `StackOverflowError` issue.
2. Support the built-in function `like_any`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Jenkins test.

Closes #30465 from beliefer/SPARK-33045-like_any-bak.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-01 11:48:30 +00:00
Huaxin Gao d38883c1d8 [SPARK-32405][SQL][FOLLOWUP] Throw Exception if provider is specified in JDBCTableCatalog create table
### What changes were proposed in this pull request?
Throw Exception if JDBC Table Catalog has provider in create table.

### Why are the changes needed?
The JDBC Table Catalog doesn't support a provider, so we should throw an exception. Previously, the CREATE TABLE syntax forced people to specify a provider, so we had to add a `USING _`. Now that problem is fixed, and we throw an exception when a provider is specified.

### Does this PR introduce _any_ user-facing change?
Yes. We throw an exception if a provider is specified in CREATE TABLE for the JDBC Table Catalog.

### How was this patch tested?
Existing tests (remove `USING _`)

Closes #30544 from huaxingao/followup.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-01 11:38:42 +00:00
Gabor Somogyi e5bb2937f6 [SPARK-32032][SS] Avoid infinite wait in driver because of KafkaConsumer.poll(long) API
### What changes were proposed in this pull request?
Deprecated `KafkaConsumer.poll(long)` API calls may cause an infinite wait in the driver. In this PR I've added a new `AdminClient` based offset fetching which is turned off by default. There is a new flag named `spark.sql.streaming.kafka.useDeprecatedOffsetFetching` (default: `true`) which can be set to `false` to enable the newly added functionality. The Structured Streaming migration guide contains more information about what migration considerations must be made. Please see the following [doc](https://docs.google.com/document/d/1gAh0pKgZUgyqO2Re3sAy-fdYpe_SxpJ6DkeXE8R1P7E/edit?usp=sharing) for further details.

The PR contains the following changes:
* Added `AdminClient` based offset fetching
* GroupId prefix feature removed from driver but only in `AdminClient` based approach (`AdminClient` doesn't need any GroupId)
* GroupId override feature removed from driver but only in `AdminClient` based approach  (`AdminClient` doesn't need any GroupId)
* Additional unit tests
* Code comment changes
* Minor bugfixes here and there
* Removed the Kafka auto topic creation feature, but only in the `AdminClient` based approach (please see the doc for the rationale). In short, it's very hidden, error prone, and probably never used in production.
* Added documentation to `ss-migration-guide` and `structured-streaming-kafka-integration`

### Why are the changes needed?
Driver may hang forever.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing + additional unit tests.
Cluster test with a simple Kafka topic-to-topic query.
Documentation:
```
cd docs/
SKIP_API=1 jekyll build
```
Manual webpage check.

Closes #29729 from gaborgsomogyi/SPARK-32032.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-12-01 20:34:00 +09:00
zky.zhoukeyong 1034815519 [SPARK-33572][SQL] Datetime building should fail if the year, month, ..., second combination is invalid
### What changes were proposed in this pull request?
Datetime building should fail if the year, month, ..., second combination is invalid when ANSI mode is enabled. This patch updates MakeDate, MakeTimestamp and MakeInterval accordingly.
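
An assumed illustration of the intended behavior, to be run in spark-shell where `spark` is predefined; the exact error is paraphrased, not verified output:
```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
// With ANSI mode enabled, an invalid field combination should fail at runtime:
spark.sql("SELECT make_date(2020, 13, 1)").show()   // expected to throw: month 13 is invalid

spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT make_date(2020, 13, 1)").show()   // non-ANSI mode returns NULL instead
```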

### Why are the changes needed?
For ANSI mode.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added UT and Existing UT.

Closes #30516 from waitinfuture/SPARK-33498.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: waitinfuture <waitinfuture@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-01 11:07:16 +00:00
Jungtaek Lim (HeartSaVioR) 52e5cc46bc [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files
### What changes were proposed in this pull request?

This patch proposes to provide a new option to specify a time-to-live (TTL) for output file entries in FileStreamSink. TTL is defined as the current timestamp minus the last modified time of the file.

This patch filters out outdated output files from the metadata while compacting batches (other batches don't have the functionality to clean entries), which keeps the metadata from growing indefinitely; filtered-out files will "eventually" no longer be seen in reader queries which leverage File(Stream)Source.

### Why are the changes needed?

The metadata log greatly helps to achieve exactly-once semantics, but given that the output path is open to arbitrary readers, there's no way to compact the metadata log, so the metadata file keeps growing as the query runs for a long time, especially for compacted batches.

Lots of end users have been reporting the issue: see comments in [SPARK-24295](https://issues.apache.org/jira/browse/SPARK-24295) and [SPARK-29995](https://issues.apache.org/jira/browse/SPARK-29995), and [SPARK-30462](https://issues.apache.org/jira/browse/SPARK-30462).
(There're some reports from end users which include their workarounds: SPARK-24295)

### Does this PR introduce any user-facing change?

No, as the configuration is new and by default it is not applied.

### How was this patch tested?

New UT.

Closes #28363 from HeartSaVioR/SPARK-27188-v2.

Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-12-01 14:42:48 +09:00
HyukjinKwon 1a042cc414 [SPARK-33530][CORE] Support --archives and spark.archives option natively
### What changes were proposed in this pull request?

TL;DR:
- This PR completes the support of archives in Spark itself instead of Yarn-only
  - It makes `--archives` option work in other cluster modes too and adds `spark.archives` configuration.
-  After this PR, PySpark users can leverage Conda to ship Python packages together as below:
    ```python
    conda create -y -n pyspark_env -c conda-forge pyarrow==2.0.0 pandas==1.1.4 conda-pack==0.5.0
    conda activate pyspark_env
    conda pack -f -o pyspark_env.tar.gz
    PYSPARK_DRIVER_PYTHON=python PYSPARK_PYTHON=./environment/bin/python pyspark --archives pyspark_env.tar.gz#environment
   ```
- Issue a warning that undocumented and hidden behavior of partial archive handling in `spark.files` / `SparkContext.addFile` will be deprecated, and users can use `spark.archives` and `SparkContext.addArchive`.

This PR proposes to add Spark's native `--archives` in Spark submit, and `spark.archives` configuration. Currently, both are supported only in Yarn mode:

```bash
./bin/spark-submit --help
```

```
Options:
...
 Spark on YARN only:
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
```

This `archives` feature is often useful when you have to ship a directory and unpack it into executors. One example is native libraries used via e.g. JNI. Another example is to ship Python packages together via a Conda environment.

Especially for Conda, PySpark currently does not have a nice way to ship a package that works in general, please see also https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html#using-zipped-virtual-environment (PySpark new documentation demo for 3.1.0).

The neatest way is arguably to ship a zipped Conda environment, but this is currently dependent on this archive feature. NOTE that we are able to use `spark.files` by relying on its undocumented behaviour of untarring `tar.gz` files, but I don't think we should document such ways and promote people to rely on them.

Also, note that this PR does not yet target feature parity with `spark.files.overwrite`, `spark.files.useFetchCache`, etc. I documented that this is an experimental feature as well.

### Why are the changes needed?

To complete the feature parity, and to provide a better support of shipping Python libraries together with Conda env.

### Does this PR introduce _any_ user-facing change?

Yes, this makes `--archives` works in Spark instead of Yarn-only, and adds a new configuration `spark.archives`.

### How was this patch tested?

I added unit tests. Also, manually tested in standalone cluster, local-cluster, and local modes.

Closes #30486 from HyukjinKwon/native-archive.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-01 13:43:02 +09:00
Jungtaek Lim (HeartSaVioR) 2af2da5a4b [SPARK-30900][SS] FileStreamSource: Avoid reading compact metadata log twice if the query restarts from compact batch
### What changes were proposed in this pull request?

This patch addresses the case where the compact metadata file is read twice in FileStreamSource when restarting a query.

When restarting the query, there is a case where the query starts from a compaction batch, and the batch has a source metadata file to read. One such case is that the previous run succeeded in reading from the inputs but did not finalize the batch for various reasons.

The patch finds the latest compaction batch when restoring from the metadata log, and puts entries for the batch into the file entry cache, which avoids reading the compact batch file twice.

FileStreamSourceLog doesn't know about the offset/commit metadata in the checkpoint, so it doesn't know exactly which batch to start from, but in practice only a couple of the latest batches are candidates to be started from when restarting a query. This patch leverages that fact to skip the calculation when possible.

### Why are the changes needed?

Spark incurs the unnecessary cost of reading the compact metadata file twice in some cases, which may not be negligible when the query has processed a huge number of files so far.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New UT.

Closes #27649 from HeartSaVioR/SPARK-30900.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-12-01 13:11:14 +09:00
Kousuke Saruta c50fcac00e [SPARK-33607][SS][WEBUI] Input Rate timeline/histogram aren't rendered if built with Scala 2.13
### What changes were proposed in this pull request?

This PR fixes an issue that the histogram and timeline aren't rendered in the `Streaming Query Statistics` page if we built Spark with Scala 2.13.

![before-fix-the-issue](https://user-images.githubusercontent.com/4736016/100612855-f543d700-3356-11eb-90d9-ede57b8b3f4f.png)
![NaN_Error](https://user-images.githubusercontent.com/4736016/100612879-00970280-3357-11eb-97cf-43978bbe2d3a.png)

The reason is [`maxRecordRate` can be `NaN`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/ui/StreamingQueryStatisticsPage.scala#L371) for Scala 2.13.

The `NaN` is the result of [`query.recentProgress.map(_.inputRowsPerSecond).max`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/ui/StreamingQueryStatisticsPage.scala#L372) when the first element of `query.recentProgress.map(_.inputRowsPerSecond)` is `NaN`.
Actually, the comparison logic for `Double` type was changed in Scala 2.13.
https://github.com/scala/bug/issues/12107
https://github.com/scala/scala/pull/6410

So this issue happens as of Scala 2.13.

The root cause of the `NaN` is [here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L164).
This `NaN` seems to be an initial value of `inputTimeSec` so I think `Double.PositiveInfinity` is suitable rather than `NaN` and this change can resolve this issue.

### Why are the changes needed?

To make sure we can use the histogram/timeline with Scala 2.13.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

First, I built with the following commands.
```
$ ./dev/change-scala-version.sh 2.13
$ build/sbt -Phive -Phive-thriftserver -Pscala-2.13 package
```

Then, ran the following query (this is brought from #30427 ).
```
import org.apache.spark.sql.streaming.Trigger
val query = spark
  .readStream
  .format("rate")
  .option("rowsPerSecond", 1000)
  .option("rampUpTime", "10s")
  .load()
  .selectExpr("*", "CAST(CAST(timestamp AS BIGINT) - CAST((RAND() * 100000) AS BIGINT) AS TIMESTAMP) AS tsMod")
  .selectExpr("tsMod", "mod(value, 100) as mod", "value")
  .withWatermark("tsMod", "10 seconds")
  .groupBy(window($"tsMod", "1 minute", "10 seconds"), $"mod")
  .agg(max("value").as("max_value"), min("value").as("min_value"), avg("value").as("avg_value"))
  .writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .outputMode("append")
  .start()
```

Finally, I confirmed that the timeline and histogram are rendered.
![after-fix-the-issue](https://user-images.githubusercontent.com/4736016/100612736-c9285600-3356-11eb-856d-7e53cc656c36.png)


Closes #30546 from sarutak/ss-nan.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-12-01 11:45:32 +09:00
Weichen Xu 80161238fe [SPARK-33592] Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading
### What changes were proposed in this pull request?
Fix: Pyspark ML Validator params in estimatorParamMaps may be lost after saving and reloading

When saving the validator's estimatorParamMaps, check all nested stages in the tuned estimator to get the correct param parent.

Two typical cases to manually test:
~~~python
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression()
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100]) \
    .addGrid(lr.maxIter, [100, 200]) \
    .build()
tvs = TrainValidationSplit(estimator=pipeline,
                           estimatorParamMaps=paramGrid,
                           evaluator=MulticlassClassificationEvaluator())

tvs.save(tvsPath)
loadedTvs = TrainValidationSplit.load(tvsPath)

# check `loadedTvs.getEstimatorParamMaps()` restored correctly.
~~~

~~~python
lr = LogisticRegression()
ova = OneVsRest(classifier=lr)
grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 200]).build()
evaluator = MulticlassClassificationEvaluator()
tvs = TrainValidationSplit(estimator=ova, estimatorParamMaps=grid, evaluator=evaluator)

tvs.save(tvsPath)
loadedTvs = TrainValidationSplit.load(tvsPath)

# check `loadedTvs.getEstimatorParamMaps()` restored correctly.
~~~

### Why are the changes needed?
Bug fix.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test.

Closes #30539 from WeichenXu123/fix_tuning_param_maps_io.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
2020-12-01 09:36:42 +08:00
Bryan Cutler aeb3649fb9 [SPARK-33613][PYTHON][TESTS] Replace deprecated APIs in pyspark tests
### What changes were proposed in this pull request?

This replaces deprecated API usage in PySpark tests with the preferred APIs. These have been deprecated for some time and usage is not consistent within tests.

- https://docs.python.org/3/library/unittest.html#deprecated-aliases

### Why are the changes needed?

For consistency and eventual removal of deprecated APIs.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

Closes #30557 from BryanCutler/replace-deprecated-apis-in-tests.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-01 10:34:40 +09:00
Weichen Xu 596fbc1d29 [SPARK-33556][ML] Add array_to_vector function for dataframe column
### What changes were proposed in this pull request?

Add array_to_vector function for dataframe column

### Why are the changes needed?
Utility function for array to vector conversion.
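
A minimal usage sketch, assuming the function is exposed under `org.apache.spark.ml.functions` and run in spark-shell where `spark` is available:
```scala
import org.apache.spark.ml.functions.array_to_vector
import org.apache.spark.sql.functions.col

// Convert an array<double> column into an ML vector column.
val df = spark.createDataFrame(Seq(Tuple1(Seq(1.0, 2.0, 3.0)))).toDF("arr")
df.select(array_to_vector(col("arr")).as("vec")).show(false)
```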

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
scala unit test & doctest.

Closes #30498 from WeichenXu123/array_to_vec.

Lead-authored-by: Weichen Xu <weichen.xu@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-01 09:52:19 +09:00
Jungtaek Lim (HeartSaVioR) f5d2165c95 [SPARK-33440][CORE] Use current timestamp with warning log in HadoopFSDelegationTokenProvider when the issue date for token is not set up properly
### What changes were proposed in this pull request?

This PR proposes to use current timestamp with warning log when the issue date for token is not set up properly. The next section will explain the rationalization with details.

### Why are the changes needed?

Unfortunately, not every implementation respects the `issue date` in `AbstractDelegationTokenIdentifier`, which Spark relies on while calculating the renewal interval. The default value of the issue date is 0L, which is far from the actual issue date, breaking the logic for calculating the next renewal date under some circumstances and leading to a 0 (immediate) interval when rescheduling token renewal.

In HadoopFSDelegationTokenProvider, Spark calculates token renewal interval as below:

2c64b731ae/core/src/main/scala/org/apache/spark/deploy/security/HadoopFSDelegationTokenProvider.scala (L123-L134)

The interval is calculated as `token.renew() - identifier.getIssueDate`, which provides a correct interval assuming both `token.renew()` and `identifier.getIssueDate` produce correct values, but it goes wrong when `identifier.getIssueDate` provides 0L (the default value), like below:

```
20/10/13 06:34:19 INFO security.HadoopFSDelegationTokenProvider: Renewal interval is 1603175657000 for token S3ADelegationToken/IDBroker
20/10/13 06:34:19 INFO security.HadoopFSDelegationTokenProvider: Renewal interval is 86400048 for token HDFS_DELEGATION_TOKEN
```

Fortunately we pick the minimum value as a safety guard (so in this case, `86400048` is picked up), but the safety guard has an unintended bad impact in this case.

2c64b731ae/core/src/main/scala/org/apache/spark/deploy/security/HadoopFSDelegationTokenProvider.scala (L58-L71)

Spark takes the interval calculated above, the "minimum" of the intervals, and blindly adds the value to the token's issue date to calculate the next renewal date for the token, then picks the "minimum" value again. In the problematic case, the value would be `86400048` (86400048 + 0), which is much smaller than the current timestamp.

2c64b731ae/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala (L228-L234)

The current timestamp is then subtracted from the next renewal date to get the interval, which is multiplied by the configured ratio to produce the final schedule interval. In the problematic case, this value goes negative.

2c64b731ae/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala (L180-L188)

There's a safety guard that doesn't allow a negative value, but that simply yields 0, meaning schedule immediately. This triggers the next calculation of the next renewal date and schedule interval, which leads to the same behavior, hence the delegation token is updated immediately and continuously.
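
The arithmetic of the failure mode, with simplified names; the interval and ratio values are taken from the example log and typical defaults, so this is illustrative only:
```scala
val now             = System.currentTimeMillis()
val brokenIssueDate = 0L                             // identifier.getIssueDate left at its default
val minInterval     = 86400048L                      // minimum renewal interval across tokens
// Next renewal date is computed as issueDate + minInterval:
val nextRenewalDate = brokenIssueDate + minInterval  // = 86400048, i.e. early Jan 1970, far in the past
// Scheduling delay is (nextRenewalDate - now) * ratio, clamped at zero:
val delay = math.max(0L, ((nextRenewalDate - now) * 0.75).toLong)  // 0 => renew immediately, forever
```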

As we fetch the token just before the calculation happens, the actual issue date is likely only slightly earlier, so it's not that dangerous to use the current timestamp as the issue date for a token whose issue date has not been set up properly. Still, it's better not to leave the token implementation as it is, so we log a warning message to let end users consult with the token implementer.

### Does this PR introduce _any_ user-facing change?

Yes. End users won't encounter the tight token renewal scheduling loop after this PR. From the end user's perspective, there's nothing they need to change.

### How was this patch tested?

Manually tested with problematic environment.

Closes #30366 from HeartSaVioR/SPARK-33440.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-12-01 06:44:15 +09:00
Dongjoon Hyun c6994354f7 [SPARK-33545][CORE] Support Fallback Storage during Worker decommission
### What changes were proposed in this pull request?

This PR aims to support storage migration to the fallback storage like cloud storage (`S3`) during worker decommission for the corner cases where the exceptions occur or there is no live peer left.

Although this PR focuses on cloud storage like `S3` which has a TTL feature in order to simplify Spark's logic, we can use alternative fallback storages like HDFS/NFS(EFS) if the user provides a clean-up mechanism.

### Why are the changes needed?

Currently, storage migration is not possible when there is no available executor. For example, when there is one executor, the executor cannot perform storage migration because it has no peer.

### Does this PR introduce _any_ user-facing change?

Yes. This is a new feature.

### How was this patch tested?

Pass the CIs with newly added test cases.

Closes #30492 from dongjoon-hyun/SPARK-33545.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-30 13:29:50 -08:00
Erik Krogen f3c2583cc3 [SPARK-33185][YARN][FOLLOW-ON] Leverage RM's RPC API instead of REST to fetch driver log links in yarn.Client
### What changes were proposed in this pull request?
This is a follow-on to PR #30096 which initially added support for printing direct links to the driver stdout/stderr logs from the application report output in `yarn.Client` using the `spark.yarn.includeDriverLogsLink` configuration. That PR made use of the ResourceManager's REST APIs to fetch the necessary information to construct the links. This PR proposes removing the dependency on the REST API, since the new logic is the only place in `yarn.Client` which makes use of this API, and instead leverages the RPC API via `YarnClient`, which brings the code in line with the rest of `yarn.Client`.

### Why are the changes needed?

While the old logic worked okay when running a Spark application in a "standard" environment with full access to Kerberos credentials, it can fail when run in an environment with restricted Kerberos credentials. In our case, this environment is represented by [Azkaban](https://azkaban.github.io/), but it likely affects other job scheduling systems as well. In such an environment, the application has delegation tokens which enabled it to communicate with services such as YARN, but the RM REST API is not typically covered by such delegation tokens (note that although YARN does actually support accessing the RM REST API via a delegation token as documented [here](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Delegation_Tokens_API), it is a new feature in alpha phase, and most deployments are likely not retrieving this token today).

Besides this enhancement, leveraging the `YarnClient` APIs greatly simplifies the processing logic, such as removing all JSON parsing.

### Does this PR introduce _any_ user-facing change?

Very minimal user-facing changes on top of PR #30096. Basically expands the scope of environments in which that feature will operate correctly.

### How was this patch tested?

In addition to redoing the `spark-submit` testing as mentioned in PR #30096, I also tested this logic in a restricted-credentials environment (Azkaban). It succeeds where the previous logic would fail with a 401 error.

Closes #30450 from xkrogen/xkrogen-SPARK-33185-driverlogs-followon.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2020-11-30 14:40:51 -06:00
Max Gekk 030b3139da [SPARK-33569][SPARK-33452][SQL][FOLLOWUP] Fix a build error in ShowPartitionsExec
### What changes were proposed in this pull request?
Use `listPartitionIdentifiers` instead of `listPartitionByNames` in `ShowPartitionsExec`. `listPartitionByNames` was renamed by https://github.com/apache/spark/pull/30514.

### Why are the changes needed?
To fix build error.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running tests for the `SHOW PARTITIONS` command:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowPartitionsSuite"
```

Closes #30553 from MaxGekk/fix-build-show-partitions-exec.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-30 16:40:36 +00:00
Max Gekk 6fd148fea8 [SPARK-33569][SQL] Remove getting partitions by an identifier prefix
### What changes were proposed in this pull request?
1. Remove the method `listPartitionIdentifiers()` from the `SupportsPartitionManagement` interface. The method lists partitions by ident prefix.
2. Rename `listPartitionByNames()` to `listPartitionIdentifiers()`.
3. Re-implement the default method `partitionExists()` using new method.

### Why are the changes needed?
Getting partitions by ident prefix only is not used, and it can be removed to improve code maintenance. Also this makes the `SupportsPartitionManagement` interface cleaner.

### Does this PR introduce _any_ user-facing change?
Should not.

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "test:testOnly org.apache.spark.sql.connector.catalog.*"
```

Closes #30514 from MaxGekk/remove-listPartitionIdentifiers.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-30 14:05:49 +00:00
Max Gekk 0a612b6a40 [SPARK-33452][SQL] Support v2 SHOW PARTITIONS
### What changes were proposed in this pull request?
1. Remove the V2 logical node `ShowPartitionsStatement`, and replace it with V2 `ShowPartitions`.
2. Implement V2 execution node `ShowPartitionsExec` similar to V1 `ShowPartitionsCommand`.

### Why are the changes needed?
To have feature parity with Datasource V1.

### Does this PR introduce _any_ user-facing change?
Yes.

Before the change, `SHOW PARTITIONS` fails in V2 table catalogs with the exception:
```
org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is only supported with v1 tables.
   at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog.org$apache$spark$sql$catalyst$analysis$ResolveSessionCatalog$$parseV1Table(ResolveSessionCatalog.scala:628)
   at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog$$anonfun$apply$1.applyOrElse(ResolveSessionCatalog.scala:466)
```

### How was this patch tested?
By running the following test suites:
1. Modified `ShowPartitionsParserSuite` where `ShowPartitionsStatement` is replaced by V2 `ShowPartitions`.
2. `v2.ShowPartitionsSuite`

Closes #30398 from MaxGekk/show-partitions-exec-v2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-30 13:45:53 +00:00
Pascal Gillet 6e5446e61f [SPARK-33579][UI] Fix executor blank page behind proxy
### What changes were proposed in this pull request?

Fix some "hardcoded" API urls in Web UI.
More specifically, we avoid the use of `location.origin` when constructing URLs for internal API calls within the JavaScript.
Instead, we use `apiRoot` global variable.

### Why are the changes needed?

On one hand, it allows us to build relative URLs. On the other hand, `apiRoot` reflects the Spark property `spark.ui.proxyBase` which can be set to change the root path of the Web UI.

If `spark.ui.proxyBase` is actually set, the original URLs become incorrect, and we end up with a blank Executors page.
I encountered this bug when accessing the Web UI behind a proxy (in my case a Kubernetes Ingress).

See the following link for more context:
https://github.com/jupyterhub/jupyter-server-proxy/issues/57#issuecomment-699163115

### Does this PR introduce _any_ user-facing change?

Yes, as all the changes introduced are in the JavaScript for the Web UI.

### How was this patch tested?
I modified/debugged the JavaScript as in the commit with the help of the developer tools in Google Chrome, while accessing the Web UI of my Spark app behind my k8s ingress.

Closes #30523 from pgillet/fix-executors-blank-page-behind-proxy.

Authored-by: Pascal Gillet <pascal.gillet@stack-labs.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2020-11-30 19:31:42 +09:00
Wenchen Fan 5cfbdddefe [SPARK-33480][SQL] Support char/varchar type
### What changes were proposed in this pull request?

This PR adds the char/varchar type which is kind of a variant of string type:
1. Char type is a fixed-length string. When comparing char type values, we need to pad the shorter one to the longer length.
2. Varchar type is a string with a length limitation.

To implement the char/varchar semantic, this PR:
1. Do string length check when writing to char/varchar type columns.
2. Do string padding when reading char type columns. We don't do it at the writing side to save storage space.
3. Do string padding when comparing a char type column with a string literal or another char type column (a string literal has a fixed length, so it should be treated as char type as well). A short sketch of these semantics follows this list.
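
A minimal illustration of these semantics; this is an assumed example to be run in spark-shell where `spark` is available, with errors paraphrased:
```scala
spark.sql("CREATE TABLE t (c CHAR(5), v VARCHAR(5)) USING parquet")

spark.sql("INSERT INTO t VALUES ('ab', 'ab')")     // ok; 'ab' is padded to 'ab   ' when c is read
spark.sql("INSERT INTO t VALUES ('abcdef', 'x')")  // fails: value exceeds the CHAR(5) length limit

spark.sql("SELECT c = 'ab' FROM t").show()         // true: the literal is padded before comparison
```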

To simplify the implementation, this PR doesn't propagate char/varchar type info through functions/operators (e.g. `substring`). That said, a column can only be char/varchar type if it's a table column, not a derived column like `SELECT substring(col)`.

To be safe, this PR doesn't add char/varchar type to the query engine(expression input check, internal row framework, codegen framework, etc.). We will replace char/varchar type by string type with metadata (`Attribute.metadata` or `StructField.metadata`) that includes the original type string before it goes into the query engine. That said, the existing code will not see char/varchar type but only string type.

char/varchar type may come from several places:
1. v1 table from hive catalog.
2. v2 table from v2 catalog.
3. user-specified schema in `spark.read.schema` and `spark.readStream.schema`
4. `Column.cast`
5. schema string in places like `from_json`, pandas UDF, etc. These places use SQL parser which replaces char/varchar with string already, even before this PR.

This PR covers all the above cases, implements the length check and padding feature by looking at string type with special metadata.

### Why are the changes needed?

char and varchar are standard SQL types. varchar is widely used in other databases instead of string type.

### Does this PR introduce _any_ user-facing change?

For hive tables: now the table insertion fails if the value exceeds the char/varchar length. Previously we truncated the value silently.

For other tables:
1. Now char type is allowed.
2. Now we have a length check when inserting into varchar columns. Previously we wrote the value as-is.

### How was this patch tested?

new tests

Closes #30412 from cloud-fan/char.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-30 09:23:05 +00:00
gengjiaan b665d58819 [SPARK-28646][SQL] Fix bug of Count so as consistent with mainstream databases
### What changes were proposed in this pull request?
Currently, Spark allows calling `count` with no arguments even though `count` is not a parameterless aggregate function. For example, the following query actually works:
`SELECT count() FROM tenk1;`
On the other hand, mainstream databases will throw an error.
**Oracle**
`> ORA-00909: invalid number of arguments`
**PgSQL**
`ERROR:  count(*) must be used to call a parameterless aggregate function`
**MySQL**
`> 1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ')`

### Why are the changes needed?
Fix a bug so that Spark is consistent with mainstream databases.
Here is an example query output with/without this fix.
`SELECT count() FROM testData;`
The output before this fix:
`0`
The output after this fix:
```
org.apache.spark.sql.AnalysisException
cannot resolve 'count()' due to data type mismatch: count requires at least one argument.; line 1 pos 7
```

### Does this PR introduce _any_ user-facing change?
Yes.
If no parameter is specified for `count`, an error will be thrown.

### How was this patch tested?
Jenkins test.

Closes #30541 from beliefer/SPARK-28646.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-30 17:04:38 +09:00
xuewei.linxuewei 225c2e2815 [SPARK-33498][SQL][FOLLOW-UP] Deduplicate the unittest by using checkCastWithParseError
### What changes were proposed in this pull request?

Duplicated code from SPARK-33498 removed as a follow-up.

### Why are the changes needed?

Nit.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing UT.

Closes #30540 from leanken/leanken-SPARK-33498.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-30 15:36:26 +09:00
Terry Kim 0fd9f57dd4 [SPARK-33448][SQL] Support CACHE/UNCACHE TABLE commands for v2 tables
### What changes were proposed in this pull request?

This PR proposes to support `CACHE/UNCACHE TABLE` commands for v2 tables.

In addition, this PR proposes to migrate `CACHE/UNCACHE TABLE` to use `UnresolvedTableOrView` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or the [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?

To support `CACHE/UNCACHE TABLE` commands for v2 tables.

Note that `CACHE/UNCACHE TABLE` for v1 tables/views go through `SparkSession.table` to resolve identifier, which resolves temp views first, so there is no change in the behavior by moving to the new framework.

### Does this PR introduce _any_ user-facing change?

Yes. Now the user can run `CACHE/UNCACHE TABLE` commands on v2 tables.

### How was this patch tested?

Added/updated existing tests.

Closes #30403 from imback82/cache_table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-30 05:37:10 +00:00
Kent Yao 2da72593c1 [SPARK-32976][SQL] Support column list in INSERT statement
### What changes were proposed in this pull request?

#### JIRA expectations
```
   INSERT currently does not support named column lists.

   INSERT INTO <table> (col1, col2,…) VALUES( 'val1', 'val2', … )
   Note, we assume the column list contains all the column names. Issue an exception if the list is not complete. The column order could be different from the column order defined in the table definition.
```
#### implemetations
In this PR, we add a column list  as an optional part to the `INSERT OVERWRITE/INTO` statements:
```
  /**
   * {{{
   *   INSERT OVERWRITE TABLE tableIdentifier [partitionSpec [IF NOT EXISTS]]? [identifierList] ...
   *   INSERT INTO [TABLE] tableIdentifier [partitionSpec]  [identifierList] ...
   * }}}
   */
```
The column list represents all expected columns, in an explicit order, that you want to insert into the target table. **Particularly**, in the current implementation we assume the column list contains all the column names; it will fail when the list is incomplete.

In the **Analyzer**, we add a code path to resolve the column list in the `ResolveOutputRelation` rule before it is transformed to a v1 or v2 command. It will fail here if the list has any field that does not belong to the target table.

Then, for v2 commands, e.g. `AppendData`, we use the resolved column list and the output of the target table to resolve the output of the source query in the `ResolveOutputRelation` rule. If the list has duplicated columns, we fail. If the list is not empty but its size does not match the target table, we fail. If no other exceptions occur, we use the column list to map the output of the source query to the output of the target table. The column list is then set to Nil so it will not hit the rule again after it is resolved.

For v1 commands, all of this happens in the `PreprocessTableInsertion` rule.
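
A small usage sketch of the new syntax; this is an assumed example to be run in spark-shell where `spark` is available:
```scala
spark.sql("CREATE TABLE target (c1 INT, c2 STRING) USING parquet")

// The column list may reorder columns relative to the table definition,
// but it must be complete in the current implementation:
spark.sql("INSERT INTO target (c2, c1) VALUES ('a', 1)")
spark.sql("SELECT * FROM target").show()   // => c1 = 1, c2 = 'a'
```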

### Why are the changes needed?
New feature support.

### Does this PR introduce _any_ user-facing change?

Yes, INSERT INTO/OVERWRITE TABLE supports specifying a column list.
### How was this patch tested?

new tests

Closes #29893 from yaooqinn/SPARK-32976.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-30 05:23:23 +00:00
Josh Soref 485145326a [MINOR] Spelling bin core docs external mllib repl
### What changes were proposed in this pull request?

This PR intends to fix typos in the sub-modules:
* `bin`
* `core`
* `docs`
* `external`
* `mllib`
* `repl`
* `pom.xml`

Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618

NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)

### Why are the changes needed?

Misspelled words make it harder to read / understand content.

### Does this PR introduce _any_ user-facing change?

There are various fixes to documentation, etc...

### How was this patch tested?

No testing was performed

Closes #30530 from jsoref/spelling-bin-core-docs-external-mllib-repl.

Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-11-30 13:59:51 +09:00
Chao Sun feda7299e3 [SPARK-33567][SQL] DSv2: Use callback instead of passing Spark session and v2 relation for refreshing cache
### What changes were proposed in this pull request?

This removes the Spark session and `DataSourceV2Relation` from V2 write plans, replacing them with an `afterWrite` callback.

### Why are the changes needed?

Per discussion in #30429, it's better to not pass Spark session and `DataSourceV2Relation` through Spark plans. Instead we can use a callback which makes the interface cleaner.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #30491 from sunchao/SPARK-33492-followup.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-30 04:50:50 +00:00
Yuming Wang a5e13acd19 [SPARK-33582][SQL] Hive Metastore support filter by not-equals
### What changes were proposed in this pull request?

This PR makes partition predicate pushdown into the Hive metastore support the not-equals operator.

Hive related changes:
b8bd4594be/itests/hive-unit/src/test/java/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java (L2194-L2207)
https://issues.apache.org/jira/browse/HIVE-2702

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30534 from wangyum/SPARK-33582.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-30 11:24:15 +09:00
Yuming Wang f93d4395b2 [SPARK-33589][SQL] Close opened session if the initialization fails
### What changes were proposed in this pull request?

This PR adds a try-catch when opening a session, so that the session is closed if its initialization fails.
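
A minimal sketch of the pattern, assuming a simplified session-manager API rather than the actual Thrift server classes:

```scala
// Illustrative sketch: if session initialization throws, close the session
// before propagating the error so no half-initialized session is left behind.
trait Session {
  def init(): Unit   // e.g. switching to the requested database may fail here
  def close(): Unit
}

class SessionManagerSketch {
  def openSession(newSession: () => Session): Session = {
    val session = newSession()
    try {
      session.init()
      session
    } catch {
      case e: Throwable =>
        try session.close() catch { case _: Throwable => () }  // best-effort cleanup
        throw e
    }
  }
}
```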

### Why are the changes needed?

Close opened session if the initialization fails.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Before this pr:

```
[root@spark-3267648 spark]#  bin/beeline -u jdbc:hive2://localhost:10000/db_not_exist
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
Connecting to jdbc:hive2://localhost:10000/db_not_exist
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Error: Could not open client transport with JDBC Uri: jdbc:hive2://localhost:10000/db_not_exist: Database 'db_not_exist' not found; (state=08S01,code=0)
Beeline version 2.3.7 by Apache Hive
beeline>
```
![image](https://user-images.githubusercontent.com/5399861/100560975-73ba5d80-32f2-11eb-8f92-b2509e7a121f.png)

After this pr:
```
[root@spark-3267648 spark]#  bin/beeline -u jdbc:hive2://localhost:10000/db_not_exist
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly.
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Connecting to jdbc:hive2://localhost:10000/db_not_exist
Error: Could not open client transport with JDBC Uri: jdbc:hive2://localhost:10000/db_not_exist: Failed to open new session: org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'db_not_exist' not found; (state=08S01,code=0)
Beeline version 2.3.7 by Apache Hive
beeline>
```
![image](https://user-images.githubusercontent.com/5399861/100560917-479edc80-32f2-11eb-986f-7a997f1163fc.png)

Closes #30536 from wangyum/SPARK-33589.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-30 11:21:02 +09:00
liucht 3d54774fb9 [SPARK-33517][SQL][DOCS] Fix the correct menu items and page links in PySpark Usage Guide for Pandas with Apache Arrow
### What changes were proposed in this pull request?

Change "Apache Arrow in Spark" to "Apache Arrow in PySpark"
and the link to “/sql-pyspark-pandas-with-arrow.html#apache-arrow-in-pyspark”

### Why are the changes needed?
When I click on the menu item, it doesn't point to the correct page; from the parent menu I can infer that the correct menu item name and link should be "Apache Arrow in PySpark", like this:
![image](https://user-images.githubusercontent.com/28332082/99954725-2b64e200-2dbe-11eb-9576-cf6a3d758980.png)

### Does this PR introduce any user-facing change?
Yes, clicking on the menu item will take you to the correct guide page

### How was this patch tested?
Manually build the doc. This can be verified as below:

cd docs
SKIP_API=1 jekyll build
open _site/sql-pyspark-pandas-with-arrow.html

Closes #30466 from liucht-inspur/master.

Authored-by: liucht <liucht@inspur.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-30 10:03:18 +09:00
Max Gekk a088a801ed [SPARK-33585][SQL][DOCS] Fix the comment for SQLContext.tables() and mention the database column
### What changes were proposed in this pull request?
Change the comments for `SQLContext.tables()` to "The returned DataFrame has three columns, database, tableName and isTemporary".

### Why are the changes needed?
Currently, the comment mentions only 2 columns, but `tables()` actually returns 3 columns:
```scala
scala> spark.range(10).createOrReplaceTempView("view1")
scala> val tables = spark.sqlContext.tables()
tables: org.apache.spark.sql.DataFrame = [database: string, tableName: string ... 1 more field]

scala> tables.printSchema
root
 |-- database: string (nullable = false)
 |-- tableName: string (nullable = false)
 |-- isTemporary: boolean (nullable = false)

scala> tables.show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| default|       t1|      false|
| default|       t2|      false|
| default|      ymd|      false|
|        |    view1|       true|
+--------+---------+-----------+
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `./dev/scalastyle`

Closes #30526 from MaxGekk/sqlcontext-tables-doc.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-29 12:18:07 -08:00
Max Gekk 0054fc937f [SPARK-33588][SQL] Respect the spark.sql.caseSensitive config while resolving partition spec in v1 SHOW TABLE EXTENDED
### What changes were proposed in this pull request?
Perform partition spec normalization in `ShowTablesCommand` according to the table schema before getting partitions from the catalog. The normalization via `PartitioningUtils.normalizePartitionSpec()` adjusts the column names in partition specification, w.r.t. the real partition column names and case sensitivity.
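
The normalization idea can be sketched as below; the helper is an illustrative stand-in, not the actual `PartitioningUtils` code:

```scala
// Illustrative sketch: map user-supplied partition column names onto the
// table's real partition columns, honoring spark.sql.caseSensitive.
object SpecNormalizationSketch {
  def normalize(
      spec: Map[String, String],
      partitionColumns: Seq[String],
      caseSensitive: Boolean): Map[String, String] = {
    val resolver: (String, String) => Boolean =
      if (caseSensitive) {
        (a: String, b: String) => a == b
      } else {
        (a: String, b: String) => a.equalsIgnoreCase(b)
      }
    spec.map { case (key, value) =>
      val normalizedKey = partitionColumns.find(resolver(_, key)).getOrElse(
        throw new IllegalArgumentException(
          s"$key is not a partition column in [${partitionColumns.mkString(", ")}]"))
      normalizedKey -> value
    }
  }

  def main(args: Array[String]): Unit = {
    // With case sensitivity off, (YEAR, Month) resolves to (year, month).
    println(normalize(Map("YEAR" -> "2015", "Month" -> "1"), Seq("year", "month"), caseSensitive = false))
  }
}
```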

### Why are the changes needed?
Even when `spark.sql.caseSensitive` is `false`, which is the default value, v1 `SHOW TABLE EXTENDED` is case-sensitive:
```sql
spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int)
         > USING parquet
         > partitioned by (year, month);
spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1;
spark-sql> SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION(YEAR = 2015, Month = 1);
Error in query: Partition spec is invalid. The spec (YEAR, Month) must match the partition spec (year, month) defined in table '`default`.`tbl1`';
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the `SHOW TABLE EXTENDED` command respects the SQL config, and for the example above it returns the correct result:
```sql
spark-sql> SHOW TABLE EXTENDED LIKE 'tbl1' PARTITION(YEAR = 2015, Month = 1);
default	tbl1	false	Partition Values: [year=2015, month=1]
Location: file:/Users/maximgekk/spark-warehouse/tbl1/year=2015/month=1
Serde Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Storage Properties: [serialization.format=1, path=file:/Users/maximgekk/spark-warehouse/tbl1]
Partition Parameters: {transient_lastDdlTime=1606595118, totalSize=623, numFiles=1}
Created Time: Sat Nov 28 23:25:18 MSK 2020
Last Access: UNKNOWN
Partition Statistics: 623 bytes
```

### How was this patch tested?
By running the modified test suite `v1/ShowTablesSuite`

Closes #30529 from MaxGekk/show-table-case-sensitive-spec.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-29 12:10:16 -08:00
Shixiong Zhu c8286ec416 [SPARK-33587][CORE] Kill the executor on nested fatal errors
### What changes were proposed in this pull request?

Currently we will kill the executor when hitting a fatal error. However, if the fatal error is wrapped by another exception, such as
- `java.util.concurrent.ExecutionException`, `com.google.common.util.concurrent.UncheckedExecutionException`, or `com.google.common.util.concurrent.ExecutionError` when using a Guava cache or a Java thread pool.
- SparkException thrown from cf98a761de/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala (L231) or cf98a761de/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala (L296)

We will still keep the executor running. Fatal errors are usually unrecoverable (such as OutOfMemoryError); some components may be in a broken state when hitting a fatal error, and it's hard to predict the behavior of a broken component. Hence, it's better to detect the nested fatal error as well and kill the executor. Then we can rely on Spark's fault tolerance to recover.
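
A rough sketch of the nested check (not the exact `Executor.isFatalError` implementation; the depth bound corresponds to the new `spark.executor.killOnFatalError.depth` config mentioned below):

```scala
import scala.annotation.tailrec

// Illustrative sketch: walk the cause chain up to a bounded depth and report
// whether any link in the chain is a fatal JVM error such as OutOfMemoryError.
object NestedFatalErrorSketch {
  @tailrec
  def isNestedFatalError(t: Throwable, depthLeft: Int): Boolean = {
    if (t == null || depthLeft <= 0) {
      false
    } else if (t.isInstanceOf[Error] && !t.isInstanceOf[StackOverflowError]) {
      // The real check uses Spark's own notion of fatal errors; this is a simplification.
      true
    } else {
      isNestedFatalError(t.getCause, depthLeft - 1)
    }
  }

  def main(args: Array[String]): Unit = {
    val wrapped = new RuntimeException("task failed", new OutOfMemoryError("heap"))
    println(isNestedFatalError(wrapped, depthLeft = 5))  // true
  }
}
```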

### Why are the changes needed?

Fatal errors are usually unrecoverable (such as OutOfMemoryError); some components may be in a broken state when hitting a fatal error, and it's hard to predict the behavior of a broken component. Hence, it's better to detect the nested fatal error as well and kill the executor. Then we can rely on Spark's fault tolerance to recover.

### Does this PR introduce _any_ user-facing change?

Yep. There is a slight internal behavior change on when to kill an executor. We will kill the executor when detecting a nested fatal error in the exception chain. `spark.executor.killOnFatalError.depth` is added to allow users to turn off this change if the slight behavior change impacts them.

### How was this patch tested?

The new method `Executor.isFatalError` is tested by `spark.executor.killOnNestedFatalError`.

Closes #30528 from zsxwing/SPARK-33587.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-29 11:56:48 -08:00
Kazuaki Ishizaki b94ff1e870 [SPARK-33590][DOCS][SQL] Add missing sub-bullets in Spark SQL Guide
### What changes were proposed in this pull request?

Add the missing sub-bullets to the left side of the `Spark SQL Guide`.

### Why are the changes needed?

The three sub-bullets on the left side are not consistent with the contents (five bullets) on the right side.

![image](https://user-images.githubusercontent.com/1315079/100546388-7a21e880-32a4-11eb-922d-62a52f4f9f9b.png)

### Does this PR introduce _any_ user-facing change?

Yes, you can see more lines in the left menu.

### How was this patch tested?

Manually build the doc as follows. This can be verified as attached:

```
cd docs
SKIP_API=1 jekyll build
firefox _site/sql-pyspark-pandas-with-arrow.html
```

![image](https://user-images.githubusercontent.com/1315079/100546399-8ad25e80-32a4-11eb-80ac-44af0aebc717.png)

Closes #30537 from kiszk/SPARK-33590.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-29 11:24:58 -08:00
Yuming Wang ba178f852f [SPARK-33581][SQL][TEST] Refactor HivePartitionFilteringSuite
### What changes were proposed in this pull request?

This PR refactors `HivePartitionFilteringSuite`.

### Why are the changes needed?

To make it easier to maintain.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #30525 from wangyum/SPARK-33581.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2020-11-29 09:36:55 +08:00
Max Gekk bfe9380ba2 [MINOR][SQL] Remove getTables() from r.SQLUtils
### What changes were proposed in this pull request?
Remove the unused method `getTables()` from `r.SQLUtils`. The method was used before the changes in https://github.com/apache/spark/pull/17483, but R's `tables.default` was rewritten using `listTables()`: https://github.com/apache/spark/pull/17483/files#diff-2c01472a7bcb1d318244afcd621d726e00d36cd15dffe7e44fa96c54fce4cd9aR220-R223

### Why are the changes needed?
To improve code maintenance and remove dead code.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By R tests.

Closes #30527 from MaxGekk/remove-getTables-in-r-SQLUtils.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-28 16:58:40 -08:00
Liang-Chi Hsieh 3650a6bd97 [SPARK-33580][CORE] resolveDependencyPaths should use classifier attribute of artifact
### What changes were proposed in this pull request?

This patch proposes to use the classifier attribute instead of the type to construct the artifact path.

### Why are the changes needed?

`resolveDependencyPaths` currently uses the artifact type to decide whether to add the "-tests" suffix. However, the Ivy path pattern in `resolveMavenCoordinates` is `[organization]_[artifact][revision](-[classifier]).[ext]`, so we should use the classifier instead of the type to construct the file path.
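
A hedged sketch of the intended path construction (the helper below is hypothetical; the real change lives in `resolveDependencyPaths`):

```scala
// Illustrative sketch: build the cached jar's file name from the resolved
// module attributes, keyed off the classifier (e.g. "tests") rather than the type.
object DependencyPathSketch {
  def jarName(
      organization: String,
      name: String,
      revision: String,
      classifier: Option[String]): String = {
    val classifierSuffix = classifier.map("-" + _).getOrElse("")
    s"${organization}_$name-$revision$classifierSuffix.jar"
  }

  def main(args: Array[String]): Unit = {
    println(jarName("org.example", "somelib", "1.0.0", classifier = Some("tests")))
    // org.example_somelib-1.0.0-tests.jar
  }
}
```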

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test. Manual test.

Closes #30524 from viirya/SPARK-33580.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-28 12:47:47 -08:00
Kousuke Saruta cf98a761de [SPARK-33570][SQL][TESTS] Set the proper version of gssapi plugin automatically for MariaDBKrbIntegrationSuite
### What changes were proposed in this pull request?

This PR changes mariadb_docker_entrypoint.sh to automatically set the proper version of mariadb-plugin-gssapi-server.
The proper version is based on that of mariadb-server.
Also, this PR makes it possible to use an arbitrary docker image by setting the environment variable `MARIADB_CONTAINER_IMAGE_NAME`.

### Why are the changes needed?

For `MariaDBKrbIntegrationSuite`, the version of `mariadb-plugin-gssapi-server` is currently set to `10.5.5` in `mariadb_docker_entrypoint.sh`, but that version is no longer available in the official apt repository, so `MariaDBKrbIntegrationSuite` doesn't pass for now.
It seems that only the most recent three versions are available for each major version and they are `10.5.6`, `10.5.7` and `10.5.8` for now.
Further, the release cycle of MariaDB seems to be very rapid (1 ~ 2 months), so I don't think it's a good idea to pin `mariadb-plugin-gssapi-server` to a specific version.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Confirmed that `MariaDBKrbIntegrationSuite` passes with the following commands.
```
$  build/sbt -Pdocker-integration-tests -Phive -Phive-thriftserver package "testOnly org.apache.spark.sql.jdbc.MariaDBKrbIntegrationSuite"
```
In this case, we can see what version of `mariadb-plugin-gssapi-server` is going to be installed in the following container log message.
```
Installing mariadb-plugin-gssapi-server=1:10.5.8+maria~focal
```

Or, we can set MARIADB_CONTAINER_IMAGE_NAME for a specific version of MariaDB.
```
$ MARIADB_DOCKER_IMAGE_NAME=mariadb:10.5.6 build/sbt -Pdocker-integration-tests -Phive -Phive-thriftserver package "testOnly org.apache.spark.sql.jdbc.MariaDBKrbIntegrationSuite"
```
```
Installing mariadb-plugin-gssapi-server=1:10.5.6+maria~focal
```

Closes #30515 from sarutak/fix-MariaDBKrbIntegrationSuite.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-11-28 23:38:11 +09:00
Josh Soref 13fd272cd3 Spelling r common dev mlib external project streaming resource managers python
### What changes were proposed in this pull request?

This PR intends to fix typos in the sub-modules:
* `R`
* `common`
* `dev`
* `mlib`
* `external`
* `project`
* `streaming`
* `resource-managers`
* `python`

Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618

NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)

### Why are the changes needed?

Misspelled words make it harder to read / understand content.

### Does this PR introduce _any_ user-facing change?

There are various fixes to documentation, etc...

### How was this patch tested?

No testing was performed

Closes #30402 from jsoref/spelling-R_common_dev_mlib_external_project_streaming_resource-managers_python.

Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-11-27 10:22:45 -06:00
luluorta 35ded12fc6 [SPARK-33141][SQL] Capture SQL configs when creating permanent views
### What changes were proposed in this pull request?
This PR makes `CreateViewCommand`/`AlterViewAsCommand` capture runtime SQL configs and store them as view properties. These configs will be applied during the parsing and analysis phases of view resolution. Users can set `spark.sql.legacy.useCurrentConfigsForView` to `true` to restore the previous behavior.
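
A minimal sketch of the capture-and-replay idea, with a hypothetical property prefix (not necessarily the one used by this PR):

```scala
// Illustrative sketch: snapshot SQL configs into view properties at CREATE VIEW
// time, and read them back when the view is parsed/analyzed later.
object ViewConfigCaptureSketch {
  val Prefix = "view.sqlConfig."  // hypothetical prefix for captured configs

  def captureConfigs(activeConfigs: Map[String, String]): Map[String, String] =
    activeConfigs.map { case (k, v) => (Prefix + k) -> v }

  def restoreConfigs(viewProperties: Map[String, String]): Map[String, String] =
    viewProperties.collect {
      case (k, v) if k.startsWith(Prefix) => k.stripPrefix(Prefix) -> v
    }

  def main(args: Array[String]): Unit = {
    val props = captureConfigs(Map("spark.sql.ansi.enabled" -> "false"))
    println(restoreConfigs(props))  // Map(spark.sql.ansi.enabled -> false)
  }
}
```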

### Why are the changes needed?
This PR is a sub-task of [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138), which proposes to unify temp view and permanent view behaviors. This PR makes permanent views mimic the temp view behavior, which "fixes" view semantics by directly storing the resolved LogicalPlan. For example, if a user uses Spark 2.4 to create a view that contains null values from division-by-zero expressions, they may not want other users' queries that reference the view to throw exceptions when run on Spark 3.x with ANSI mode on.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
added UT + existing UTs (improved)

Closes #30289 from luluorta/SPARK-33141.

Authored-by: luluorta <luluorta@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-27 13:32:25 +00:00