### What changes were proposed in this pull request?
Remove "in cluster mode" from the description of `spark.executor.memoryOverhead`
### Why are the changes needed?
Fix a correctness issue in the documentation.
### Does this PR introduce _any_ user-facing change?
Yes; users will no longer be confused by the description of `spark.executor.memoryOverhead`.
### How was this patch tested?
pass GA doc generation
Closes #30311 from yaooqinn/minordoc.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Instead of returning NULL, throw a runtime ArrayIndexOutOfBoundsException when ANSI mode is enabled for the `element_at`, `elt`, and `GetArrayItem` functions.
### Why are the changes needed?
For ANSI mode compliance.
### Does this PR introduce any user-facing change?
When `spark.sql.ansi.enabled` is true, Spark throws an `ArrayIndexOutOfBoundsException` if an out-of-range index is used when accessing array elements.
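For illustration, a minimal repro of the new behavior (assuming a running `SparkSession` named `spark`; the exact error text is an assumption):
```scala
// Hedged sketch: with ANSI mode on, an out-of-range index now fails the query
// instead of yielding NULL.
spark.conf.set("spark.sql.ansi.enabled", "true")

// Throws java.lang.ArrayIndexOutOfBoundsException instead of returning NULL.
spark.sql("SELECT element_at(array(1, 2, 3), 5)").collect()
```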
### How was this patch tested?
Added UT and existing UT.
Closes #30297 from leanken/leanken-SPARK-33386.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to use R 3.6.3 in K8s R image and re-enable `RTestsSuite`.
### Why are the changes needed?
Jenkins Server is using `R 3.6.3`.
```
+ SPARK_HOME=/home/jenkins/workspace/SparkPullRequestBuilder-K8s
+ /usr/bin/R CMD check --as-cran --no-tests SparkR_3.1.0.tar.gz
* using log directory ‘/home/jenkins/workspace/SparkPullRequestBuilder-K8s/R/SparkR.Rcheck’
* using R version 3.6.3 (2020-02-29)
```
The OpenJDK docker image uses `R 3.5.2 (2018-12-20)`, which is old, and `spark-3.0.1` currently fails to run SparkR.
```
$ cd spark-3.0.1-bin-hadoop3.2
$ bin/docker-image-tool.sh -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile -n build
...
exit code: 1
termination reason: Error
...
$ bin/spark-submit --master k8s://https://192.168.64.49:8443 --deploy-mode cluster --conf spark.kubernetes.container.image=spark-r:latest local:///opt/spark/examples/src/main/r/dataframe.R
$ k logs dataframe-r-b1c14b75b0c09eeb-driver
...
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.17.0.4 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.RRunner local:///opt/spark/examples/src/main/r/dataframe.R
20/11/10 06:03:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (io.netty.util.internal.logging.InternalLoggerFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Error: package or namespace load failed for ‘SparkR’ in rbind(info, getNamespaceInfo(env, "S3methods")):
number of columns of matrices must match (see arg 2)
In addition: Warning message:
package ‘SparkR’ was built under R version 4.0.2
Execution halted
```
In addition, this PR aims to recover the test coverage.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass K8S IT Jenkins job.
Closes #30130 from dongjoon-hyun/SPARK-32354.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### Why are the changes needed?
Follow the comment: https://github.com/apache/spark/pull/26935#discussion_r514697997
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing test and Mima test.
Closes #30344 from xuanyuanking/SPARK-30294-follow.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
This reverts commit 61ee5d8a4e.
### What changes were proposed in this pull request?
I need to merge https://github.com/apache/spark/pull/30327 to https://github.com/apache/spark/pull/30009,
but I merged it to master by mistake.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #30345 from zhengruifeng/revert-30327-adaptively_blockify_linear_svc_II.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
* resend
* address comments
* directly gen new Iter
* directly gen new Iter
* update blockify strategy
* address comments
* try to fix 2.13
* try to fix scala 2.13
* use 1.0 as the default value for gemv
* update
Co-authored-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
This PR intends to fix typos in the sub-modules: graphx, external, and examples.
Split per holdenk https://github.com/apache/spark/pull/30323#issuecomment-725159710
NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)
### Why are the changes needed?
Misspelled words make it harder to read / understand content.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No testing was performed
Closes #30326 from jsoref/spelling-graphx.
Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
1. Applies the SQL changes in SPARK-33230 to SparkHadoopWriter, so that `rdd.saveAsNewAPIHadoopDataset` passes in a unique job UUID in `spark.sql.sources.writeJobUUID`
1. `SparkHadoopWriterUtils.createJobTrackerID` generates a JobID by appending a random long number to the supplied timestamp, so the probability of a collision is near zero (see the sketch below).
1. With tests of uniqueness, round trips and negative jobID rejection.
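As an illustration, a minimal sketch of the second item (the helper's shape is assumed, not the exact Spark code):
```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}
import scala.util.Random

// Sketch: append a random non-negative long to the formatted timestamp so
// that two jobs started in the same second get different job tracker IDs.
def createJobTrackerID(time: Date): String = {
  val base = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US).format(time)
  val random = Random.nextLong() & Long.MaxValue // drop the sign bit
  s"${base}_$random"
}
```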
### Why are the changes needed?
Without this, if more than one job is started in the same second *and the committer expects application attempt IDs to be unique*, those jobs are at risk of clashing with each other.
With the fix,
* those committers which use the ID set in `spark.sql.sources.writeJobUUID` as a priority ID will pick that up instead and so be unique.
* committers which use the Hadoop JobID for unique paths and filenames will get the randomly generated jobID. Assuming all clocks in a cluster are in sync, the probability of two jobs launched in the same second colliding drops from 1 to 1/(2^63).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests
There's a new test suite, SparkHadoopWriterUtilsSuite, which creates job IDs, verifies they are unique even for the same timestamp, and verifies they can be marshalled to string and parsed back by the Hadoop code, which contains some (brittle) assumptions about the format of job IDs.
Functional Integration Tests
1. Hadoop-trunk built with [HADOOP-17318], publishing to local maven repository
1. Spark built with hadoop.version=3.4.0-SNAPSHOT to pick up these JARs.
1. Spark + Object store integration tests at [https://github.com/hortonworks-spark/cloud-integration](https://github.com/hortonworks-spark/cloud-integration) were built against that local spark version
1. And executed against AWS london.
The tests were run with `fs.s3a.committer.require.uuid=true`, so the s3a committers fail fast if they don't get a job ID down. This showed that `rdd.saveAsNewAPIHadoopDataset` wasn't setting the UUID option and was again using the current Date value for an app attempt, which is not guaranteed to be unique.
With the change applied to spark, the relevant tests work, therefore the committers are getting unique job IDs.
Closes #30319 from steveloughran/BUG/SPARK-33402-jobuuid.
Authored-by: Steve Loughran <stevel@cloudera.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Updated results of `DateTimeBenchmark` in the environment:
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 installed by `sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk` |
### Why are the changes needed?
The fix https://github.com/apache/spark/pull/30303 slowed down `date_trunc`. This PR updates benchmark results to have actual info about performance of `date_trunc`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By regenerating benchmark results:
```
$ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeBenchmark"
```
Closes #30338 from MaxGekk/fix-trunc_date-benchmark.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Removes encoding of the JVM response in `pyspark.sql.column.Column.__repr__`.
### Why are the changes needed?
API consistency and improved readability of the expressions.
### Does this PR introduce _any_ user-facing change?
Before this change, `col("abc")` and `col("wąż")` result in `Column<b'abc'>` and `Column<b'w\xc4\x85\xc5\xbc'>` respectively.
After this change we'll get `Column<'abc'>` and `Column<'wąż'>`.
### How was this patch tested?
Existing tests and manual inspection.
Closes #30322 from zero323/SPARK-33415.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch is trying to add `AlterTableAddPartitionExec` and `AlterTableDropPartitionExec` with the new table partition API, defined in #28617.
### Does this PR introduce _any_ user-facing change?
Yes. Users can use `ALTER TABLE ... ADD PARTITION` or `ALTER TABLE ... DROP PARTITION` to create/drop partitions in a v2 table.
### How was this patch tested?
Run suites and fix old tests.
Closes #29339 from stczwd/SPARK-32512-new.
Lead-authored-by: stczwd <qcsd2011@163.com>
Co-authored-by: Jacky Lee <qcsd2011@163.com>
Co-authored-by: Jackey Lee <qcsd2011@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Make a special case in `ResolveReferences`, which resolves `OverwriteByExpression`'s condition expression based on the table relation instead of the input query.
### Why are the changes needed?
The condition expression is passed to the table implementation at the end, so we should resolve it using the table schema. Previously this worked because we have a hack in `ResolveReferences` to delay the resolution if `outputResolved == false`. However, this hack doesn't work for tables accepting any schema like https://github.com/delta-io/delta/pull/521 . We may wrongly resolve the delete condition using the input query's output columns, which don't match the table column names.
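For example, a hedged sketch of the affected API (assuming `df` is an existing DataFrame and `catalog.db.target` a writable v2 table):
```scala
import org.apache.spark.sql.functions.{col, lit}

// The overwrite condition refers to columns of the *target* table, so it must
// be resolved against the table schema rather than the input query's output.
df.writeTo("catalog.db.target")
  .overwrite(col("date") === lit("2020-11-10"))
```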
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing tests and updated test in v2 write.
Closes #30318 from cloud-fan/v2-write.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR intends to fix the behaviour of query filters in `TPCDSQueryBenchmark`. We can use an option `--query-filter` for selecting TPCDS queries to run, e.g., `--query-filter q6,q8,q13`. But, the current master has a weird behaviour about the option. For example, if we pass `--query-filter q6` so as to run the TPCDS q6 only, `TPCDSQueryBenchmark` runs `q6` and `q6-v2.7` because the `filterQueries` method does not respect the name suffix. So, there is no way now to run the TPCDS q6 only.
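A minimal sketch of the intended filtering behavior (an assumed shape, not the actual Spark code): a suffixed variant such as `q6-v2.7` only runs when the filter spells out the suffix.
```scala
// Sketch: compare the filter entries against the fully suffixed query name,
// so `--query-filter q6` selects q6 alone and q6-v2.7 must be named explicitly.
def filterQueries(
    origQueries: Seq[String],
    queryFilter: Set[String],
    nameSuffix: String = ""): Seq[String] = {
  if (queryFilter.isEmpty) {
    origQueries
  } else {
    origQueries.filter(name => queryFilter.contains(s"$name$nameSuffix"))
  }
}
```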
### Why are the changes needed?
Bugfix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually checked.
Closes #30324 from maropu/FilterBugInTPCDSQueryBenchmark.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR proposes to migrate `SHOW CREATE TABLE` to use `UnresolvedTableOrView` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
Note that `SHOW CREATE TABLE` works only with a v1 table or a permanent view, and is not supported for v2 tables.
### Why are the changes needed?
The changes allow consistent resolution behavior when resolving the table identifier. For example, the following is the current behavior:
```scala
sql("CREATE TEMPORARY VIEW t AS SELECT 1")
sql("CREATE DATABASE db")
sql("CREATE TABLE t (key INT, value STRING) USING hive")
sql("USE db")
sql("SHOW CREATE TABLE t AS SERDE") // Succeeds
```
With this change, `SHOW CREATE TABLE ... AS SERDE` above fails with the following:
```
org.apache.spark.sql.AnalysisException: t is a temp view not table or permanent view.; line 1 pos 0
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.$anonfun$applyOrElse$43(Analyzer.scala:883)
at scala.Option.map(Option.scala:230)
```
, which is expected since temporary view is resolved first and `SHOW CREATE TABLE ... AS SERDE` doesn't support a temporary view.
Note that there is no behavior change for `SHOW CREATE TABLE` without `AS SERDE` since it was already resolving to a temporary view first. See below for more detail.
### Does this PR introduce _any_ user-facing change?
After this PR, `SHOW CREATE TABLE t AS SERDE` is resolved to a temp view `t` instead of table `db.t` in the above scenario.
Note that there is no behavior change for `SHOW CREATE TABLE` without `AS SERDE`, but the exception message changes from `SHOW CREATE TABLE is not supported on a temporary view` to `t is a temp view not table or permanent view`.
### How was this patch tested?
Updated existing tests.
Closes #30321 from imback82/show_create_table.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to gather common `SHOW TABLES` tests into one trait `org.apache.spark.sql.execution.command.ShowTablesSuite`, and put datasource specific tests to the `v1.ShowTablesSuite` and `v2.ShowTablesSuite`. Also tests for parsing `SHOW TABLES` are extracted to `ShowTablesParserSuite`.
### Why are the changes needed?
- The unification will allow running common `SHOW TABLES` tests for both DSv1 and DSv2
- We can detect missing features and differences between DSv1 and DSv2 implementations.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running new test suites:
- `org.apache.spark.sql.execution.command.v1.ShowTablesSuite`
- `org.apache.spark.sql.execution.command.v2.ShowTablesSuite`
- `ShowTablesParserSuite`
Closes #30287 from MaxGekk/unify-dsv1_v2-tests.
Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Make Literal support char array.
### Why are the changes needed?
We often use `Literal()` to create a foldable value, and `char[]` is a common data type. We can make it easy to create a string `Literal` from a `char[]`.
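A hedged usage sketch of what the change enables (the variable name is illustrative):
```scala
import org.apache.spark.sql.catalyst.expressions.Literal

// After the change, a char[] is accepted and produces a string literal,
// equivalent to Literal("abc").
val charArrayLiteral = Literal(Array('a', 'b', 'c'))
```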
### Does this PR introduce _any_ user-facing change?
Yes, user can call `Literal()` with `char[]`.
### How was this patch tested?
Add test.
Closes #30295 from ulysses-you/SPARK-33390.
Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The following query produces incorrect results:
```
SELECT date_trunc('minute', '1769-10-17 17:10:02')
```
Spark currently incorrectly returns
```
1769-10-17 17:10:02
```
against the expected return value of
```
1769-10-17 17:10:00
```
**Steps to repro**
Run the following commands in spark-shell:
```
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.sql("SELECT date_trunc('minute', '1769-10-17 17:10:02')").show()
```
This happens as `truncTimestamp` in package `org.apache.spark.sql.catalyst.util.DateTimeUtils` incorrectly assumes that time zone offsets can never have the granularity of a second and thus does not account for time zone adjustment when truncating the given timestamp to `minute`.
This assumption is currently used when truncating the timestamps to `microsecond, millisecond, second, or minute`.
This PR fixes this issue and always uses time zone knowledge when truncating timestamps regardless of the truncation unit.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added new tests to `DateTimeUtilsSuite` which previously failed and pass now.
Closes #30303 from utkarsh39/trunc-timestamp-fix.
Authored-by: Utkarsh <utkarsh.agarwal@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Currently we skip subexpression elimination in branches of conditional expressions including `If`, `CaseWhen`, and `Coalesce`. Actually we can do subexpression elimination for such branches if the subexpression is common across all branches. This patch proposes to support subexpression elimination in branches of conditional expressions.
### Why are the changes needed?
We may miss subexpression elimination chances in branches of conditional expressions. This kind of subexpression is frequently seen. It may be written manually by users or come from the query optimizer. For example, project collapsing could embed expressions between two `Project`s and produce a conditional expression like:
```
CASE WHEN jsonToStruct(json).a = '1' THEN 1.0 WHEN jsonToStruct(json).a = '2' THEN 2.0 ... ELSE 1.2 END
```
If `jsonToStruct(json)` is a time-expensive expression, we currently don't eliminate the duplication and waste time running it repeatedly.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
Closes #30245 from viirya/SPARK-33337.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Changes
pyspark.sql.dataframe.DataFrame
to
:py:class:`pyspark.sql.DataFrame`
### Why are the changes needed?
Consistency (see https://github.com/apache/spark/pull/30285#pullrequestreview-526764104).
### Does this PR introduce _any_ user-facing change?
Users will see a shorter reference with a link.
### How was this patch tested?
`dev/lint-python` and manual check of the rendered docs.
Closes #30313 from zero323/SPARK-33251-FOLLOW-UP.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
### What changes were proposed in this pull request?
This removes the `sharesHadoopClasses` flag from `IsolatedClientLoader` in Hive module.
### Why are the changes needed?
Currently, when initializing `IsolatedClientLoader`, users can set the `sharesHadoopClasses` flag to decide whether the `HiveClient` created should share Hadoop classes with Spark itself or not. In the latter case, the client will only load Hadoop classes from the Hive dependencies.
There are two reasons to remove this:
1. this feature is currently used in two cases: 1) unit tests, 2) when the Hadoop version defined in Maven cannot be found when `spark.sql.hive.metastore.jars` is equal to "maven", which could be very rare.
2. when `sharesHadoopClasses` is false, Spark doesn't really only use Hadoop classes from Hive jars: we also download the `hadoop-client` jar and put all the sub-module jars (e.g., `hadoop-common`, `hadoop-hdfs`) together with the Hive jars, and the Hadoop version used by `hadoop-client` is the same version used by Spark itself. As a result, we're mixing two versions of Hadoop jars in the classpath, which could potentially cause issues, especially considering that the default Hadoop version is already 3.2.0 while most Hive versions supported by the `IsolatedClientLoader` are still using Hadoop 2.x or even lower.
### Does this PR introduce _any_ user-facing change?
This affects Spark users in one scenario: when `spark.sql.hive.metastore.jars` is set to `maven` AND the Hadoop version specified in pom file cannot be downloaded, currently the behavior is to switch to _not_ share Hadoop classes, but with the PR it will share Hadoop classes with Spark.
### How was this patch tested?
Existing UTs.
Closes #30284 from sunchao/SPARK-33376.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Push down filters through `Expand`. For the case below:
```
create table t1(pid int, uid int, sid int, dt date, suid int) using parquet;
create table t2(pid int, vs int, uid int, csid int) using parquet;
SELECT years,
       appversion,
       SUM(uusers) AS users
FROM (SELECT Date_trunc('year', dt) AS years,
             CASE
               WHEN h.pid = 3 THEN 'iOS'
               WHEN h.pid = 4 THEN 'Android'
               ELSE 'Other'
             END AS viewport,
             h.vs AS appversion,
             Count(DISTINCT u.uid) AS uusers,
             Count(DISTINCT u.suid) AS srcusers
      FROM t1 u
        JOIN t2 h
          ON h.uid = u.uid
      GROUP BY 1, 2, 3) AS a
WHERE viewport = 'iOS'
GROUP BY 1, 2
```
Plan before this PR:
```
== Physical Plan ==
*(5) HashAggregate(keys=[years#30, appversion#32], functions=[sum(uusers#33L)])
+- Exchange hashpartitioning(years#30, appversion#32, 200), true, [id=#251]
+- *(4) HashAggregate(keys=[years#30, appversion#32], functions=[partial_sum(uusers#33L)])
+- *(4) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12], functions=[count(if ((gid#44 = 1)) u.`uid`#47 else null)])
+- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, 200), true, [id=#246]
+- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12], functions=[partial_count(if ((gid#44 = 1)) u.`uid`#47 else null)])
+- *(3) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], functions=[])
+- Exchange hashpartitioning(date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44, 200), true, [id=#241]
+- *(2) HashAggregate(keys=[date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44], functions=[])
+- *(2) Filter (CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46 = iOS)
+- *(2) Expand [ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, vs#12, uid#7, null, 1), ArrayBuffer(date_trunc(year, cast(dt#9 as timestamp), Some(Etc/GMT+7)), CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END, vs#12, null, suid#10, 2)], [date_trunc('year', CAST(u.`dt` AS TIMESTAMP))#45, CASE WHEN (h.`pid` = 3) THEN 'iOS' WHEN (h.`pid` = 4) THEN 'Android' ELSE 'Other' END#46, vs#12, u.`uid`#47, u.`suid`#48, gid#44]
+- *(2) Project [uid#7, dt#9, suid#10, pid#11, vs#12]
+- *(2) BroadcastHashJoin [uid#7], [uid#13], Inner, BuildRight
:- *(2) Project [uid#7, dt#9, suid#10]
: +- *(2) Filter isnotnull(uid#7)
: +- *(2) ColumnarToRow
: +- FileScan parquet default.t1[uid#7,dt#9,suid#10] Batched: true, DataFilters: [isnotnull(uid#7)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.0-bin-hadoop3.2/spark-warehouse/t1], PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: struct<uid:int,dt:date,suid:int>
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[2, int, true] as bigint))), [id=#233]
+- *(1) Project [pid#11, vs#12, uid#13]
+- *(1) Filter isnotnull(uid#13)
+- *(1) ColumnarToRow
+- FileScan parquet default.t2[pid#11,vs#12,uid#13] Batched: true, DataFilters: [isnotnull(uid#13)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.0-bin-hadoop3.2/spark-warehouse/t2], PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: struct<pid:int,vs:int,uid:int>
```
Plan after this PR:
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[years#0, appversion#2], functions=[sum(uusers#3L)], output=[years#0, appversion#2, users#5L])
+- Exchange hashpartitioning(years#0, appversion#2, 5), true, [id=#71]
+- HashAggregate(keys=[years#0, appversion#2], functions=[partial_sum(uusers#3L)], output=[years#0, appversion#2, sum#22L])
+- HashAggregate(keys=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12], functions=[count(distinct uid#7)], output=[years#0, appversion#2, uusers#3L])
+- Exchange hashpartitioning(date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, 5), true, [id=#67]
+- HashAggregate(keys=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12], functions=[partial_count(distinct uid#7)], output=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, count#27L])
+- HashAggregate(keys=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7], functions=[], output=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7])
+- Exchange hashpartitioning(date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7, 5), true, [id=#63]
+- HashAggregate(keys=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles)) AS date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END AS CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7], functions=[], output=[date_trunc(year, cast(dt#9 as timestamp), Some(America/Los_Angeles))#23, CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END#24, vs#12, uid#7])
+- Project [uid#7, dt#9, pid#11, vs#12]
+- BroadcastHashJoin [uid#7], [uid#13], Inner, BuildRight, false
:- Filter isnotnull(uid#7)
: +- FileScan parquet default.t1[uid#7,dt#9] Batched: true, DataFilters: [isnotnull(uid#7)], Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/4l/7_c5c97s1_gb0d9_d6shygx00000gn/T/warehouse-c069d87..., PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: struct<uid:int,dt:date>
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[2, int, false] as bigint)),false), [id=#58]
+- Filter ((CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END = iOS) AND isnotnull(uid#13))
+- FileScan parquet default.t2[pid#11,vs#12,uid#13] Batched: true, DataFilters: [(CASE WHEN (pid#11 = 3) THEN iOS WHEN (pid#11 = 4) THEN Android ELSE Other END = iOS), isnotnull..., Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/4l/7_c5c97s1_gb0d9_d6shygx00000gn/T/warehouse-c069d87..., PartitionFilters: [], PushedFilters: [IsNotNull(uid)], ReadSchema: struct<pid:int,vs:int,uid:int>
```
### Why are the changes needed?
Improve performance by filtering out more data earlier.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes #30278 from AngersZhuuuu/SPARK-33302.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This changes `DropTableExec` to also invalidate caches referencing the table to be dropped, in a cascading manner.
### Why are the changes needed?
In DSv1, the `DROP TABLE` command also invalidates caches, as described in [SPARK-19765](https://issues.apache.org/jira/browse/SPARK-19765). However, in DSv2 the same command only drops the table but doesn't handle the caches. This could lead to correctness issues.
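A hypothetical illustration (`testcat` and the `foo` source are assumed names, as commonly used in Spark's DSv2 tests):
```scala
// Create, cache, and drop a v2 table.
spark.sql("CREATE TABLE testcat.ns.t (id BIGINT) USING foo")
spark.sql("CACHE TABLE testcat.ns.t")
spark.sql("DROP TABLE testcat.ns.t")
// With this change, cached plans referencing testcat.ns.t are also
// invalidated (cascading), matching the DSv1 behavior.
```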
### Does this PR introduce _any_ user-facing change?
Yes. Now DSv2 `DROP TABLE` command also invalidates cache.
### How was this patch tested?
Added a new UT
Closes #30211 from sunchao/SPARK-33305.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
When a SystemExit exception occurs during processing, the Python worker exits abnormally, and the executor task then keeps waiting to read from the worker's socket, causing it to hang.
The SystemExit exception may be raised by the user's code, but Spark should at least throw an error to alert the user instead of getting stuck.
We can run a simple test to reproduce this case:
```
from pyspark.sql import SparkSession
def err(line):
    raise SystemExit
spark = SparkSession.builder.appName("test").getOrCreate()
spark.sparkContext.parallelize(range(1,2), 2).map(err).collect()
spark.stop()
```
### Why are the changes needed?
To make sure a PySpark application won't hang if a non-Exception error occurs in the Python worker.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
added a new test and also manually tested the case above
Closes #30248 from li36909/pyspark.
Lead-authored-by: lrz <lrz@lrzdeMacBook-Pro.local>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
`element_at` with `CreateArray` does not respect the one-based index.
Repro steps:
```
var df = spark.sql("select element_at(array(3, 2, 1), 0)")
df.printSchema()
df = spark.sql("select element_at(array(3, 2, 1), 1)")
df.printSchema()
df = spark.sql("select element_at(array(3, 2, 1), 2)")
df.printSchema()
df = spark.sql("select element_at(array(3, 2, 1), 3)")
df.printSchema()
root
 |-- element_at(array(3, 2, 1), 0): integer (nullable = false)
root
 |-- element_at(array(3, 2, 1), 1): integer (nullable = false)
root
 |-- element_at(array(3, 2, 1), 2): integer (nullable = false)
root
 |-- element_at(array(3, 2, 1), 3): integer (nullable = true)

The correct nullability should be:
0 -> true (out of bounds, so the default of true)
1 -> false
2 -> false
3 -> false
```
For expression evaluation, it respects the one-based index, but when checking nullability, it computes with the zero-based index in `computeNullabilityFromArray`.
### Why are the changes needed?
Correctness issue.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added UT and existing UT.
Closes #30296 from leanken/leanken-SPARK-33391.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
- Call `spark.read.table` in `spark.table` so both share one code path (see the sketch below).
- Add comments for `spark.table` to emphasize that it also supports reading streaming temp views.
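A minimal sketch of the intended delegation (assumed shape, not the exact Spark code):
```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch: SparkSession.table goes through DataFrameReader.table so that both
// entry points share one code path (and the same options handling).
class SessionTableSketch(spark: SparkSession) {
  def table(tableName: String): DataFrame = spark.read.table(tableName)
}
```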
### Why are the changes needed?
The code paths of `spark.table` and `spark.read.table` should be the same. This behavior was broken by SPARK-32592 since we need to respect options in the `spark.read.table` API.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing UT.
Closes #30148 from xuanyuanking/SPARK-33244.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to migrate `LOAD DATA` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
Note that `LOAD DATA` is not supported for v2 tables.
### Why are the changes needed?
The changes allow consistent resolution behavior when resolving the table identifier. For example, the following is the current behavior:
```scala
sql("CREATE TEMPORARY VIEW t AS SELECT 1")
sql("CREATE DATABASE db")
sql("CREATE TABLE t (key INT, value STRING) USING hive")
sql("USE db")
sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE t") // Succeeds
```
With this change, `LOAD DATA` above fails with the following:
```
org.apache.spark.sql.AnalysisException: t is a temp view not table.; line 1 pos 0
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.$anonfun$applyOrElse$39(Analyzer.scala:865)
at scala.Option.foreach(Option.scala:407)
```
, which is expected since temporary view is resolved first and `LOAD DATA` doesn't support a temporary view.
### Does this PR introduce _any_ user-facing change?
After this PR, `LOAD DATA ... t` is resolved to a temp view `t` instead of table `db.t` in the above scenario.
### How was this patch tested?
Updated existing tests.
Closes #30270 from imback82/load_data_cmd.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
When `TableProvider.supportsExternalMetadata()` is true, Spark will use the input DataFrame's schema in `DataFrameWriter.save()`/`DataStreamWriter.start()` and skip schema/partitioning inference.
### Why are the changes needed?
For all the v2 data sources which are not FileDataSourceV2, Spark always infers the table schema/partitioning on `DataFrameWriter.save()`/`DataStreamWriter.start()`.
The inference of table schema/partitioning can be expensive. However, there is no trait or flag for indicating that a v2 source can use the input DataFrame's schema on `DataFrameWriter.save()`/`DataStreamWriter.start()`. We can resolve the problem by adding a new expected behavior for the method `TableProvider.supportsExternalMetadata()`.
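A hedged sketch of a v2 source opting in (the class name and method bodies are placeholders):
```scala
import java.util

import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// By returning true from supportsExternalMetadata(), this source lets
// DataFrameWriter.save() / DataStreamWriter.start() pass the input
// DataFrame's schema directly and skip schema/partitioning inference.
class ExternalMetadataProvider extends TableProvider {
  override def supportsExternalMetadata(): Boolean = true

  // Still used when Spark has to infer the schema itself (e.g. on read).
  override def inferSchema(options: CaseInsensitiveStringMap): StructType = ???

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = ???
}
```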
### Does this PR introduce _any_ user-facing change?
Yes, a new behavior for the data source v2 API `TableProvider.supportsExternalMetadata()` when it returns true.
### How was this patch tested?
Unit test
Closes #30273 from gengliangwang/supportsExternalMetadata.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This upgrades the Apache Arrow version from 1.0.1 to 2.0.0.
### Why are the changes needed?
Apache Arrow 2.0.0 was released with some improvements from Java side, so it's better to upgrade Spark to the new version.
Note that the format version in Arrow 2.0.0 is still 1.0.0, so the API should still be compatible between 1.0.1 and 2.0.0.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes #30306 from sunchao/SPARK-33213.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
The Structured Streaming UI does not contain state information. This PR adds it.
### Why are the changes needed?
Missing state information.
### Does this PR introduce _any_ user-facing change?
Additional UI elements appear.
### How was this patch tested?
Existing unit tests + manual test.
<img width="1044" alt="Screenshot 2020-10-30 at 15 14 21" src="https://user-images.githubusercontent.com/18561820/97715405-a1797000-1ac2-11eb-886a-e3e6efa3af3e.png">
Closes #30151 from gaborgsomogyi/SPARK-33223.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Add prompt information about the current application ID, Web UI URL, and master when pyspark / sparkR starts.
### Why are the changes needed?
The information printed when pyspark/sparkR starts does not show basic information about the current application, which is inconvenient when using pyspark/sparkR from a command prompt.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
manual test result shows below:
![pyspark new print](https://user-images.githubusercontent.com/52202080/98274268-2a663f00-1fce-11eb-88ce-964ce90b439e.png)
![sparkR](https://user-images.githubusercontent.com/52202080/98541235-1a01dd00-22ca-11eb-9304-09bcde87b05e.png)
Closes #30266 from akiyamaneko/pyspark-hint-info.
Authored-by: neko <echohlne@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade `commons-compress` from 1.8 to 1.20.
### Why are the changes needed?
- https://commons.apache.org/proper/commons-compress/security-reports.html
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #30304 from dongjoon-hyun/SPARK-33405.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes migration of `pyspark.ml` to NumPy documentation style.
### Why are the changes needed?
To improve documentation style.
### Does this PR introduce _any_ user-facing change?
Yes, this changes both rendered HTML docs and console representation (SPARK-33243).
### How was this patch tested?
`dev/lint-python` and manual inspection.
Closes #30285 from zero323/SPARK-33251.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Update the commons-crypto package to v1.1.0 to support the aarch64 platform.
- https://issues.apache.org/jira/browse/CRYPTO-139
### Why are the changes needed?
The commons-crypto-1.0.0 package available in the Maven repository doesn't support the aarch64 platform. `CryptoRandomFactory.getCryptoRandom(properties).nextBytes(iv)` takes a long time when `NettyBlockRpcServer` receives block data from a client; if the time exceeds the default value of 120s, an IOException is raised and the client retries replicating the block data to other executors. But the replication is in fact already complete, so the replication count becomes incorrect.
This makes DistributedSuite tests pass.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Pass the CIs.
Closes #30275 from huangtianhua/SPARK-32691.
Authored-by: huangtianhua <huangtianhua223@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This is one of the patches for SPIP [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602) which is needed for push-based shuffle.
Summary of changes:
- Adds an implementation of `MergedShuffleFileManager` which was introduced with [SPARK-32915](https://issues.apache.org/jira/browse/SPARK-32915).
- Integrated the push-based shuffle service with `YarnShuffleService`.
### Why are the changes needed?
Refer to the SPIP in [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added unit tests.
The reference PR with the consolidated changes covering the complete implementation is also provided in [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).
We have already verified the functionality and the improved performance as documented in the SPIP doc.
Lead-authored-by: Min Shen <mshen@linkedin.com>
Co-authored-by: Chandni Singh <chsingh@linkedin.com>
Co-authored-by: Ye Zhou <yezhou@linkedin.com>
Closes #30062 from otterc/SPARK-32916.
Lead-authored-by: Chandni Singh <singh.chandni@gmail.com>
Co-authored-by: Chandni Singh <chsingh@linkedin.com>
Co-authored-by: Ye Zhou <yezhou@linkedin.com>
Co-authored-by: Min Shen <mshen@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?
This PR modifies the `ExtractPythonUDFs` rule to deduplicate deterministic PythonUDF calls.
Before this PR the dataframe: `df.withColumn("c", batchedPythonUDF(col("a"))).withColumn("d", col("c"))` has the plan:
```
*(1) Project [value#1 AS a#4, pythonUDF1#15 AS c#7, pythonUDF1#15 AS d#10]
+- BatchEvalPython [dummyUDF(value#1), dummyUDF(value#1)], [pythonUDF0#14, pythonUDF1#15]
+- LocalTableScan [value#1]
```
After this PR the deterministic PythonUDF calls are deduplicated:
```
*(1) Project [value#1 AS a#4, pythonUDF0#14 AS c#7, pythonUDF0#14 AS d#10]
+- BatchEvalPython [dummyUDF(value#1)], [pythonUDF0#14]
+- LocalTableScan [value#1]
```
### Why are the changes needed?
To fix a performance issue.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New and existing UTs.
Closes #30203 from peter-toth/SPARK-33303-deduplicate-deterministic-udf-calls.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
After #30097, all rules use `SparkSession.active` to get the `SQLConf`
and `SparkSession`. But in AQE, when applying the rules to the initial plan,
we should use the Spark session from the AQE context.
### Why are the changes needed?
Fix a potential problem caused by using the wrong Spark session.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing ut
Closes #30294 from linhongliu-db/SPARK-33140-followup.
Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds support for bucket pruning on the `IsNaN` predicate.
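A hypothetical illustration (the table name and bucket count are arbitrary):
```scala
// With a table bucketed by a double column, a filter on isnan(d) can now
// prune buckets instead of scanning every bucket file.
spark.sql("CREATE TABLE t (d DOUBLE) USING parquet CLUSTERED BY (d) INTO 8 BUCKETS")
spark.sql("SELECT * FROM t WHERE isnan(d)").explain()
```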
### Why are the changes needed?
Improve query performance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes #30291 from wangyum/SPARK-33385.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR makes `DataFrameWriterV2` to create query plans with `UnresolvedRelation` and leave the table resolution work to the analyzer.
### Why are the changes needed?
Table resolution work should be done by the analyzer. After this PR, the behavior is more consistent between different APIs (DataFrameWriter, DataFrameWriterV2 and SQL). See the next section for behavior changes.
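A hedged usage sketch (assuming `df` is an existing DataFrame and `testcat.ns.t` a writable v2 table):
```scala
// The identifier below is now planned as an UnresolvedRelation and resolved
// by the analyzer, consistent with DataFrameWriter and SQL INSERT.
df.writeTo("testcat.ns.t").append()
```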
### Does this PR introduce _any_ user-facing change?
Yes.
1. writes to a temp view of v2 relation: previously it fails with table not found exception, now it works if the v2 relation is writable. This is consistent with `DataFrameWriter` and SQL INSERT.
2. writes to other temp views: previously it fails with table not found exception, now it fails with a more explicit error message, saying that writing to a temp view of non-v2-relation is not allowed.
3. writes to a view: previously it fails with table not writable error, now it fails with a more explicit error message, saying that writing to a view is not allowed.
4. writes to a v1 table: previously it fails with table not writable error, now it fails with a more explicit error message, saying that writing to a v1 table is not allowed. (We can allow it later, by falling back to v1 command)
### How was this patch tested?
new tests
Closes #29970 from cloud-fan/refactor.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently, in JDBCTableCatalog, we ignore the table options when creating a table.
```
// TODO (SPARK-32405): Apply table options while creating tables in JDBC Table Catalog
if (!properties.isEmpty) {
  logWarning("Cannot create JDBC table with properties, these properties will be " +
    "ignored: " + properties.asScala.map { case (k, v) => s"$k=$v" }.mkString("[", ", ", "]"))
}
```
### Why are the changes needed?
We need to apply the table options when we create the table.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
add new test
Closes #30154 from huaxingao/table_options.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to support sorted shuffle block migration.
### Why are the changes needed?
Since the current shuffle block migration works in a random order, a failure during worker decommissioning affects all shuffles. We had better finish the shuffles one by one to minimize the number of affected shuffles.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs with the newly added test case.
Closes #30293 from dongjoon-hyun/SPARK-33387.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In `RenameBasedFSDataOutputStream.cancel`, we do two things in a single try/catch block: close the underlying stream and delete the temporary file. Closing the `OutputStream` could throw an `IOException`, so we could miss deleting the temporary file.
This patch proposes to delete the temporary file even if the underlying stream throws an error.
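A minimal sketch of the idea (names and shape are assumed, not the exact Spark code):
```scala
import java.io.OutputStream

import org.apache.hadoop.fs.{FileSystem, Path}

// Delete the temporary file in a finally block so it happens even when
// closing the underlying stream throws an IOException.
class CancellableStreamSketch(fs: FileSystem, tempPath: Path, out: OutputStream) {
  def cancel(): Unit = {
    try {
      out.close()
    } finally {
      fs.delete(tempPath, false)
    }
  }
}
```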
### Why are the changes needed?
To avoid leaving temporary files during canceling writing in `RenameBasedFSDataOutputStream`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
Closes #30290 from viirya/SPARK-33384.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
There are two similar compilation warnings about procedure-like declaration in Scala 2.13:
```
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala:70: procedure syntax is deprecated for constructors: add `=`, as in method definition
```
and
```
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala:211: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `run`'s return type
```
This PR is the first part of resolving SPARK-33352:
- For constructor definitions, add `=` to convert to function syntax.
- For method definitions without a return type, add `: Unit =` to convert to function syntax, as illustrated below.
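An illustration of the two rewrites (the class and method names are made up):
```scala
class Worker(name: String, retries: Int) {
  // Before (deprecated procedure syntax): def this(name: String) { this(name, 0) }
  def this(name: String) = this(name, 0)

  // Before (deprecated procedure syntax): def run() { println("running") }
  def run(): Unit = {
    println("running")
  }
}
```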
### Why are the changes needed?
Eliminate compilation warnings in Scala 2.13; this change should also be compatible with Scala 2.12.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes #30255 from LuciferYang/SPARK-29392-FOLLOWUP.1.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Javadocs updated for the encoder to include maps as a collection type
### Why are the changes needed?
The Javadocs were not updated with the fix for SPARK-16706.
### Does this PR introduce _any_ user-facing change?
Yes, the javadocs are updated
### How was this patch tested?
sbt was run to ensure it meets scalastyle
Closes #30274 from hannahkamundson/SPARK-32860.
Lead-authored-by: Hannah Amundson <amundson.hannah@heb.com>
Co-authored-by: Hannah <48397717+hannahkamundson@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
There are 4 fields in `MutableApplicationInfo` that seem useless:
- `coresGranted`
- `maxCores`
- `coresPerExecutor`
- `memoryPerExecutorMB`
They are always `None` and are never reassigned.
So the main change of this PR is to clean up these useless fields in `MutableApplicationInfo`.
### Why are the changes needed?
Cleanup useless variables.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes #30251 from LuciferYang/SPARK-33347.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
The changes in [SPARK-32501 Inconsistent NULL conversions to strings](https://issues.apache.org/jira/browse/SPARK-32501) introduced some behavior that I'd like to clean up a bit.
Here's sample code to illustrate the behavior I'd like to clean up:
```scala
val rows = Seq[String](null)
  .toDF("value")
  .withColumn("struct1", struct('value as "value1"))
  .withColumn("struct2", struct('value as "value1", 'value as "value2"))
  .withColumn("array1", array('value))
  .withColumn("array2", array('value, 'value))
// Show the DataFrame using the "first" codepath.
rows.show(truncate=false)
+-----+-------+-------------+------+--------+
|value|struct1|struct2 |array1|array2 |
+-----+-------+-------------+------+--------+
|null |{ null}|{ null, null}|[] |[, null]|
+-----+-------+-------------+------+--------+
// Write the DataFrame to disk, then read it back and show it to trigger the "codegen" code path:
rows.write.parquet("rows")
spark.read.parquet("rows").show(truncate=false)
+-----+-------+-------------+-------+-------------+
|value|struct1|struct2 |array1 |array2 |
+-----+-------+-------------+-------+-------------+
|null |{ null}|{ null, null}|[ null]|[ null, null]|
+-----+-------+-------------+-------+-------------+
```
Notice:
1. If the first element of a struct is null, it is printed with a leading space (e.g. "\{ null\}"). I think it's preferable to print it without the leading space (e.g. "\{null\}"). This is consistent with how non-null values are printed inside a struct.
2. If the first element of an array is null, it is not printed at all in the first code path, and the "codegen" code path prints it with a leading space. I think both code paths should be consistent and print it without a leading space (e.g. "[null]").
The desired result of this PR is to produce the following output via both code paths:
```
+-----+-------+------------+------+------------+
|value|struct1|struct2 |array1|array2 |
+-----+-------+------------+------+------------+
|null |{null} |{null, null}|[null]|[null, null]|
+-----+-------+------------+------+------------+
```
This contribution is my original work and I license the work to the project under the project’s open source license.
### Why are the changes needed?
To correct errors and inconsistencies in how DataFrame.show() displays nulls inside arrays and structs.
### Does this PR introduce _any_ user-facing change?
Yes. This PR changes what is printed out by DataFrame.show().
### How was this patch tested?
I added new test cases in CastSuite.scala to cover the cases addressed by this PR.
Closes #30189 from stwhit/show_nulls.
Authored-by: Stuart White <stuart.white1@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Add executor peak jvm memory metrics in executors page
![image](https://user-images.githubusercontent.com/1633312/97767765-9121bf00-1adb-11eb-93c7-7912d9fe7826.png)
### Why are the changes needed?
Users can see executor peak JVM memory metrics on the executors page.
### Does this PR introduce _any_ user-facing change?
Users can see executor peak JVM memory metrics on the executors page.
### How was this patch tested?
Manually tested
Closes #30186 from warrenzhu25/23432.
Authored-by: Warren Zhu <warren.zhu25@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>