Commit graph

3250 commits

Oleksandr Shevchenko d2fbf0dce4 [SPARK-35405][DOC] Submitting Applications documentation has outdated information about K8s client mode support
### What changes were proposed in this pull request?
[Submitting Applications doc](https://spark.apache.org/docs/latest/submitting-applications.html#master-urls) has outdated information about K8s client mode support.
It still says "Client mode is currently unsupported and will be supported in future releases".
![image](https://user-images.githubusercontent.com/31073930/118268920-b5b51580-b4c6-11eb-8eed-975be8d37964.png)

Whereas it's already supported, and the [Running Spark on Kubernetes doc](https://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode) says that it has been supported since 2.4.0 and contains all the needed information.
![image](https://user-images.githubusercontent.com/31073930/118268947-bd74ba00-b4c6-11eb-98d5-37961327642f.png)

Changes:
![image](https://user-images.githubusercontent.com/31073930/118269179-12b0cb80-b4c7-11eb-8a37-d9d301bbda53.png)

JIRA: https://issues.apache.org/jira/browse/SPARK-35405
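
For context, a minimal client-mode submission to Kubernetes (supported since 2.4.0) might look like the following; the API server address and example jar path are placeholders:

```
bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode client \
  --class org.apache.spark.examples.SparkPi \
  local:///opt/spark/examples/jars/spark-examples.jar
```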

### Why are the changes needed?
Outdated information in the doc is misleading

### Does this PR introduce _any_ user-facing change?
Documentation changes

### How was this patch tested?
Documentation changes

Closes #32551 from o-shevchenko/SPARK-35405.

Authored-by: Oleksandr Shevchenko <oleksandr.shevchenko@datarobot.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-14 11:26:35 -07:00
Kent Yao d424771ec2 [MINOR][DOC] ADD toc for monitoring page
### What changes were proposed in this pull request?

Add a TOC tag to monitoring.md.
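
For reference, a minimal sketch of the kramdown-style TOC marker that other Spark doc pages use (assuming monitoring.md follows the same convention):

```
* This will become a table of contents (this text will be scraped).
{:toc}
```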

### Why are the changes needed?

Fix the doc: the monitoring page lacks a table of contents, which makes it hard to navigate.

### Does this PR introduce _any_ user-facing change?

Yes, the table of contents of the monitoring page will be shown on the official doc site.

### How was this patch tested?

Passed the GitHub Actions doc build.

Closes #32545 from yaooqinn/minor.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
2021-05-14 14:19:15 +08:00
Pablo Langa 9ea55fe771 [SPARK-35207][SQL] Normalize hash function behavior with negative zero (floating point types)
### What changes were proposed in this pull request?

Generally, we would expect that `x = y` implies `hash(x) = hash(y)`. However, +0.0 and -0.0 hash to different values for floating point types.
```
scala> spark.sql("select hash(cast('0.0' as double)), hash(cast('-0.0' as double))").show
+-------------------------+--------------------------+
|hash(CAST(0.0 AS DOUBLE))|hash(CAST(-0.0 AS DOUBLE))|
+-------------------------+--------------------------+
|              -1670924195|                -853646085|
+-------------------------+--------------------------+
scala> spark.sql("select cast('0.0' as double) == cast('-0.0' as double)").show
+--------------------------------------------+
|(CAST(0.0 AS DOUBLE) = CAST(-0.0 AS DOUBLE))|
+--------------------------------------------+
|                                        true|
+--------------------------------------------+
```
Here is an extract from IEEE 754:

> The two zeros are distinguishable arithmetically only by either division-by-zero (producing appropriately signed infinities) or else by the CopySign function recommended by IEEE 754/854. Infinities, SNaNs, NaNs and Subnormal numbers necessitate four more special cases

From this, I deduce that the hash function must produce the same result for 0.0 and -0.0.
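
A minimal sketch of the normalization idea in Scala (illustrative; not necessarily the exact patch): since `-0.0 == 0.0` evaluates to true, mapping negative zero onto positive zero before hashing is enough to make equal values hash equally.

```
// Map -0.0 to 0.0 before hashing; the comparison -0.0 == 0.0 is true,
// so only negative zero is rewritten, and every other value
// (including NaN, for which the comparison is false) is untouched.
def normalizeDouble(d: Double): Double = if (d == 0.0d) 0.0d else d
def normalizeFloat(f: Float): Float = if (f == 0.0f) 0.0f else f
```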

### Why are the changes needed?

It is a correctness issue: two equal values must produce the same hash.

### Does this PR introduce _any_ user-facing change?

This change only affects the hash function applied to the -0.0 value for float and double types.

### How was this patch tested?

Unit testing and manual testing

Closes #32496 from planga82/feature/spark35207_hashnegativezero.

Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-14 12:40:36 +08:00
Gabor Somogyi b6a0a7ea53 [SPARK-35311][SS][UI][DOCS] Structured Streaming Web UI state information documentation
### What changes were proposed in this pull request?
In this PR I'm adding Structured Streaming Web UI state information documentation.

### Why are the changes needed?
Missing documentation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
```
cd docs/
SKIP_API=1 bundle exec jekyll build
```
Manual webpage check.

Closes #32433 from gaborgsomogyi/SPARK-35311.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-05-14 10:40:12 +09:00
Luca Canali ae0579a945 [SPARK-35369][DOC] Document ExecutorAllocationManager metrics
### What changes were proposed in this pull request?
This proposes to document the available metrics for ExecutorAllocationManager in the Spark monitoring documentation.

### Why are the changes needed?
The ExecutorAllocationManager is instrumented with metrics using the Spark metrics system.
The relevant work was done in SPARK-7007 and SPARK-33763, but the ExecutorAllocationManager metrics are currently undocumented.
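
As a quick way to inspect these metrics, one could enable a sink in `conf/metrics.properties` (a sketch; the console sink is an arbitrary choice, and the ExecutorAllocationManager source only exists when dynamic allocation is enabled):

```
# Report all metric sources, including ExecutorAllocationManager,
# to the console every 10 seconds.
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds
```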

### Does this PR introduce _any_ user-facing change?
This PR adds documentation only.

### How was this patch tested?
N/A (documentation-only change).

Closes #32500 from LucaCanali/followupMetricsDocSPARK33763.

Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-05-12 13:02:00 -07:00
Ludovic Henry b52d47a920 [SPARK-35295][ML] Replace fully com.github.fommil.netlib by dev.ludovic.netlib:2.0
### What changes were proposed in this pull request?

Bump to `dev.ludovic.netlib:2.0`, which provides JNI-based wrappers for BLAS, ARPACK, and LAPACK. These take no dependency on GPL or LGPL libraries, allowing Spark to provide out-of-the-box support for hardware acceleration when a native library is present (it is still up to the end user to install such a library on their system, e.g. OpenBLAS, Intel MKL, or libarpack2).
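
A quick way to verify which implementation is picked up at runtime (a sketch; the `getInstance` facade accessor is assumed from the dev.ludovic.netlib README):

```
// Print the BLAS implementation dev.ludovic.netlib resolved to, e.g.
// dev.ludovic.netlib.blas.JNIBLAS when a native library like OpenBLAS is installed.
val blas = dev.ludovic.netlib.BLAS.getInstance()
println(s"Loaded BLAS: ${blas.getClass.getName}")
```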

### Why are the changes needed?

A significant performance improvement for ML-related workloads on vanilla distributions of Spark.

### Does this PR introduce _any_ user-facing change?

Users now take advantage of hardware acceleration as long as a native library is installed (like OpenBLAS, Intel MKL and libarpack2).

### How was this patch tested?

Spark test suite + the dev.ludovic.netlib test suite.

#### JDK8:
```
[info] OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Linux 5.8.0-50-generic
[info] Intel(R) Xeon(R) E-2276G CPU  3.80GHz
[info]
[info] f2jBLAS    = dev.ludovic.netlib.blas.F2jBLAS
[info] javaBLAS   = dev.ludovic.netlib.blas.Java8BLAS
[info] nativeBLAS = dev.ludovic.netlib.blas.JNIBLAS
[info]
[info] daxpy:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        220            226           6        454.9           2.2       1.0X
[info] java                       221            228           5        451.9           2.2       1.0X
[info] native                     209            215           5        478.7           2.1       1.1X
[info]
[info] saxpy:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        121            125           3        823.3           1.2       1.0X
[info] java                       121            125           3        824.3           1.2       1.0X
[info] native                     101            105           3        988.4           1.0       1.2X
[info]
[info] dcopy:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        212            219           6        470.9           2.1       1.0X
[info] java                       208            212           4        481.0           2.1       1.0X
[info] native                     209            215           5        478.5           2.1       1.0X
[info]
[info] scopy:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        114            119           3        878.9           1.1       1.0X
[info] java                        99            105           3       1011.4           1.0       1.2X
[info] native                      97            103           3       1026.7           1.0       1.2X
[info]
[info] ddot:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        108            111           2        925.9           1.1       1.0X
[info] java                        71             73           2       1414.9           0.7       1.5X
[info] native                      54             56           2       1847.0           0.5       2.0X
[info]
[info] sdot:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                         96             97           2       1046.8           1.0       1.0X
[info] java                        47             48           1       2129.8           0.5       2.0X
[info] native                      29             30           1       3404.7           0.3       3.3X
[info]
[info] dnrm2:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        139            143           2        718.2           1.4       1.0X
[info] java                        46             47           1       2171.2           0.5       3.0X
[info] native                      44             46           2       2261.8           0.4       3.1X
[info]
[info] snrm2:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        154            157           4        651.0           1.5       1.0X
[info] java                        40             42           1       2469.3           0.4       3.8X
[info] native                      26             27           1       3787.6           0.3       5.8X
[info]
[info] dscal:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        185            195           8        541.0           1.8       1.0X
[info] java                       186            196           7        538.5           1.9       1.0X
[info] native                     177            187           7        564.1           1.8       1.0X
[info]
[info] sscal:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                         98            102           3       1016.2           1.0       1.0X
[info] java                        98            102           3       1017.8           1.0       1.0X
[info] native                      87             91           3       1143.2           0.9       1.1X
[info]
[info] dgemv[N]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                         68             70           1       1474.7           0.7       1.0X
[info] java                        51             52           1       1973.0           0.5       1.3X
[info] native                      30             32           1       3298.8           0.3       2.2X
[info]
[info] dgemv[T]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                         96             99           2       1037.9           1.0       1.0X
[info] java                        50             51           1       1999.6           0.5       1.9X
[info] native                      30             31           1       3368.1           0.3       3.2X
[info]
[info] sgemv[N]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                         59             61           1       1688.7           0.6       1.0X
[info] java                        41             42           1       2461.9           0.4       1.5X
[info] native                      15             16           1       6593.0           0.2       3.9X
[info]
[info] sgemv[T]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                         90             92           1       1116.2           0.9       1.0X
[info] java                        39             40           1       2565.8           0.4       2.3X
[info] native                      15             16           1       6594.2           0.2       5.9X
[info]
[info] dger:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        192            202           7        520.5           1.9       1.0X
[info] java                       203            214           7        491.9           2.0       0.9X
[info] native                     176            187           7        568.8           1.8       1.1X
[info]
[info] dspmv[U]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                         59             61           1        846.1           1.2       1.0X
[info] java                        38             39           1       1313.5           0.8       1.6X
[info] native                      24             27           1       2047.8           0.5       2.4X
[info]
[info] dspr[U]:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                         97            101           3        515.4           1.9       1.0X
[info] java                        97            101           2        515.1           1.9       1.0X
[info] native                      88             91           3        569.1           1.8       1.1X
[info]
[info] dsyr[U]:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        169            174           3        295.4           3.4       1.0X
[info] java                       169            174           3        295.4           3.4       1.0X
[info] native                     160            165           4        312.2           3.2       1.1X
[info]
[info] dgemm[N,N]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        561            577          13       1782.3           0.6       1.0X
[info] java                       225            231           4       4446.2           0.2       2.5X
[info] native                      31             32           3      32473.1           0.0      18.2X
[info]
[info] dgemm[N,T]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        570            584           9       1754.8           0.6       1.0X
[info] java                       224            230           4       4457.3           0.2       2.5X
[info] native                      31             32           1      32493.4           0.0      18.5X
[info]
[info] dgemm[T,N]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        855            866           6       1169.2           0.9       1.0X
[info] java                       224            228           3       4466.9           0.2       3.8X
[info] native                      31             32           1      32395.5           0.0      27.7X
[info]
[info] dgemm[T,T]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                       1328           1344           8        752.8           1.3       1.0X
[info] java                       224            230           4       4458.9           0.2       5.9X
[info] native                      31             32           1      32201.8           0.0      42.8X
[info]
[info] sgemm[N,N]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        534            541           5       1873.0           0.5       1.0X
[info] java                       220            224           3       4542.8           0.2       2.4X
[info] native                      15             16           1      66803.1           0.0      35.7X
[info]
[info] sgemm[N,T]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        544            551           6       1839.6           0.5       1.0X
[info] java                       220            224           4       4538.2           0.2       2.5X
[info] native                      15             16           1      65589.9           0.0      35.7X
[info]
[info] sgemm[T,N]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        833            845          21       1201.0           0.8       1.0X
[info] java                       220            224           3       4548.7           0.2       3.8X
[info] native                      15             16           1      66603.2           0.0      55.5X
[info]
[info] sgemm[T,T]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        899            907           5       1112.9           0.9       1.0X
[info] java                       221            224           2       4531.6           0.2       4.1X
[info] native                      15             16           1      65944.9           0.0      59.3X
```

#### JDK11:
```
[info] OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.8.0-50-generic
[info] Intel(R) Xeon(R) E-2276G CPU  3.80GHz
[info]
[info] f2jBLAS    = dev.ludovic.netlib.blas.F2jBLAS
[info] javaBLAS   = dev.ludovic.netlib.blas.Java11BLAS
[info] nativeBLAS = dev.ludovic.netlib.blas.JNIBLAS
[info]
[info] daxpy:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        195            200           3        512.2           2.0       1.0X
[info] java                       197            202           3        507.0           2.0       1.0X
[info] native                     184            189           4        543.0           1.8       1.1X
[info]
[info] saxpy:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        108            112           3        921.8           1.1       1.0X
[info] java                       101            105           3        989.4           1.0       1.1X
[info] native                      87             91           3       1147.1           0.9       1.2X
[info]
[info] dcopy:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        187            191           3        535.1           1.9       1.0X
[info] java                       182            188           3        548.8           1.8       1.0X
[info] native                     178            182           3        562.2           1.8       1.1X
[info]
[info] scopy:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        110            114           3        909.3           1.1       1.0X
[info] java                        86             93           4       1159.3           0.9       1.3X
[info] native                      86             90           3       1162.4           0.9       1.3X
[info]
[info] ddot:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        106            108           2        943.6           1.1       1.0X
[info] java                        70             71           2       1426.8           0.7       1.5X
[info] native                      54             56           2       1835.4           0.5       1.9X
[info]
[info] sdot:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                         96             97           1       1047.1           1.0       1.0X
[info] java                        43             44           1       2331.9           0.4       2.2X
[info] native                      29             30           1       3392.1           0.3       3.2X
[info]
[info] dnrm2:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        114            115           2        880.7           1.1       1.0X
[info] java                        42             43           1       2398.1           0.4       2.7X
[info] native                      45             46           1       2233.3           0.4       2.5X
[info]
[info] snrm2:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        140            143           2        714.6           1.4       1.0X
[info] java                        28             29           1       3531.0           0.3       4.9X
[info] native                      26             27           1       3820.0           0.3       5.3X
[info]
[info] dscal:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        156            166           7        641.3           1.6       1.0X
[info] java                       158            167           6        633.2           1.6       1.0X
[info] native                     150            160           7        664.8           1.5       1.0X
[info]
[info] sscal:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                         85             88           2       1181.7           0.8       1.0X
[info] java                        85             88           2       1176.0           0.9       1.0X
[info] native                      75             78           2       1333.2           0.8       1.1X
[info]
[info] dgemv[N]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                         58             59           1       1731.1           0.6       1.0X
[info] java                        41             43           1       2415.5           0.4       1.4X
[info] native                      30             31           1       3293.9           0.3       1.9X
[info]
[info] dgemv[T]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                         94             96           1       1063.4           0.9       1.0X
[info] java                        41             42           1       2435.8           0.4       2.3X
[info] native                      30             30           1       3379.8           0.3       3.2X
[info]
[info] sgemv[N]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                         44             45           1       2278.9           0.4       1.0X
[info] java                        37             38           0       2686.8           0.4       1.2X
[info] native                      15             16           1       6555.4           0.2       2.9X
[info]
[info] sgemv[T]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                         88             89           1       1142.1           0.9       1.0X
[info] java                        33             34           1       3010.7           0.3       2.6X
[info] native                      15             16           1       6553.9           0.2       5.7X
[info]
[info] dger:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        164            172           4        609.4           1.6       1.0X
[info] java                       163            172           5        612.6           1.6       1.0X
[info] native                     150            159           4        667.0           1.5       1.1X
[info]
[info] dspmv[U]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                         49             50           1       1029.4           1.0       1.0X
[info] java                        41             42           1       1209.4           0.8       1.2X
[info] native                      25             27           1       2029.2           0.5       2.0X
[info]
[info] dspr[U]:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                         80             85           3        622.2           1.6       1.0X
[info] java                        80             85           3        622.4           1.6       1.0X
[info] native                      75             79           3        668.7           1.5       1.1X
[info]
[info] dsyr[U]:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        137            142           3        364.1           2.7       1.0X
[info] java                       139            142           2        360.4           2.8       1.0X
[info] native                     131            135           3        380.4           2.6       1.0X
[info]
[info] dgemm[N,N]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        517            525           5       1935.5           0.5       1.0X
[info] java                       213            216           3       4704.8           0.2       2.4X
[info] native                      31             31           1      32705.6           0.0      16.9X
[info]
[info] dgemm[N,T]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        589            601           6       1698.6           0.6       1.0X
[info] java                       213            217           3       4693.3           0.2       2.8X
[info] native                      31             32           1      32498.9           0.0      19.1X
[info]
[info] dgemm[T,N]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        851            865           6       1175.3           0.9       1.0X
[info] java                       212            216           3       4717.0           0.2       4.0X
[info] native                      30             32           1      32903.0           0.0      28.0X
[info]
[info] dgemm[T,T]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                       1301           1316           6        768.4           1.3       1.0X
[info] java                       212            216           2       4717.4           0.2       6.1X
[info] native                      31             32           1      32606.0           0.0      42.4X
[info]
[info] sgemm[N,N]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        454            460           2       2203.0           0.5       1.0X
[info] java                       208            212           3       4803.8           0.2       2.2X
[info] native                      15             16           0      66586.0           0.0      30.2X
[info]
[info] sgemm[N,T]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        529            536           4       1889.7           0.5       1.0X
[info] java                       208            212           3       4798.6           0.2       2.5X
[info] native                      15             16           1      66751.4           0.0      35.3X
[info]
[info] sgemm[T,N]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        830            840           5       1205.1           0.8       1.0X
[info] java                       208            211           2       4814.1           0.2       4.0X
[info] native                      15             15           1      67676.4           0.0      56.2X
[info]
[info] sgemm[T,T]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        894            907           7       1118.7           0.9       1.0X
[info] java                       208            211           3       4809.6           0.2       4.3X
[info] native                      15             16           1      66675.2           0.0      59.6X
```

#### JDK16:
```
[info] OpenJDK 64-Bit Server VM 16+36 on Linux 5.8.0-50-generic
[info] Intel(R) Xeon(R) E-2276G CPU  3.80GHz
[info]
[info] f2jBLAS    = dev.ludovic.netlib.blas.F2jBLAS
[info] javaBLAS   = dev.ludovic.netlib.blas.VectorBLAS
[info] nativeBLAS = dev.ludovic.netlib.blas.JNIBLAS
[info]
[info] daxpy:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        193            199           3        517.5           1.9       1.0X
[info] java                       181            186           4        553.2           1.8       1.1X
[info] native                     181            185           5        553.6           1.8       1.1X
[info]
[info] saxpy:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        108            112           2        925.1           1.1       1.0X
[info] java                        88             91           3       1138.6           0.9       1.2X
[info] native                      87             91           3       1144.2           0.9       1.2X
[info]
[info] dcopy:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        184            189           3        542.5           1.8       1.0X
[info] java                       181            185           3        552.8           1.8       1.0X
[info] native                     179            183           2        558.0           1.8       1.0X
[info]
[info] scopy:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                         97            101           3       1031.6           1.0       1.0X
[info] java                        86             90           2       1163.7           0.9       1.1X
[info] native                      85             88           2       1182.9           0.8       1.1X
[info]
[info] ddot:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        107            109           2        932.4           1.1       1.0X
[info] java                        54             56           2       1846.7           0.5       2.0X
[info] native                      54             56           2       1846.7           0.5       2.0X
[info]
[info] sdot:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                         96             97           1       1043.6           1.0       1.0X
[info] java                        29             30           1       3439.3           0.3       3.3X
[info] native                      29             30           1       3423.9           0.3       3.3X
[info]
[info] dnrm2:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        121            123           2        829.8           1.2       1.0X
[info] java                        32             32           1       3171.3           0.3       3.8X
[info] native                      45             46           1       2246.2           0.4       2.7X
[info]
[info] snrm2:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        142            144           2        705.9           1.4       1.0X
[info] java                        15             16           1       6585.8           0.2       9.3X
[info] native                      26             27           1       3839.5           0.3       5.4X
[info]
[info] dscal:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        157            165           5        635.6           1.6       1.0X
[info] java                       151            159           5        664.0           1.5       1.0X
[info] native                     151            160           5        663.6           1.5       1.0X
[info]
[info] sscal:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                         85             89           2       1172.3           0.9       1.0X
[info] java                        75             79           3       1337.3           0.7       1.1X
[info] native                      75             79           2       1335.5           0.7       1.1X
[info]
[info] dgemv[N]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                         58             59           1       1731.5           0.6       1.0X
[info] java                        28             29           1       3544.2           0.3       2.0X
[info] native                      30             31           1       3306.2           0.3       1.9X
[info]
[info] dgemv[T]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                         90             92           1       1108.3           0.9       1.0X
[info] java                        28             28           1       3622.5           0.3       3.3X
[info] native                      30             31           1       3381.3           0.3       3.1X
[info]
[info] sgemv[N]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                         44             45           1       2284.7           0.4       1.0X
[info] java                        14             15           1       7034.0           0.1       3.1X
[info] native                      15             16           1       6643.7           0.2       2.9X
[info]
[info] sgemv[T]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                         85             86           1       1177.4           0.8       1.0X
[info] java                        15             15           1       6886.1           0.1       5.8X
[info] native                      15             16           1       6560.1           0.2       5.6X
[info]
[info] dger:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        164            173           6        608.1           1.6       1.0X
[info] java                       148            157           5        675.2           1.5       1.1X
[info] native                     152            160           5        659.9           1.5       1.1X
[info]
[info] dspmv[U]:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                         61             63           1        815.4           1.2       1.0X
[info] java                        16             17           1       3104.3           0.3       3.8X
[info] native                      24             27           1       2071.9           0.5       2.5X
[info]
[info] dspr[U]:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                         81             85           2        616.4           1.6       1.0X
[info] java                        81             85           2        614.7           1.6       1.0X
[info] native                      75             78           2        669.5           1.5       1.1X
[info]
[info] dsyr[U]:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        138            141           3        362.7           2.8       1.0X
[info] java                       137            140           2        365.3           2.7       1.0X
[info] native                     131            134           2        382.9           2.6       1.1X
[info]
[info] dgemm[N,N]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        525            544           8       1906.2           0.5       1.0X
[info] java                        61             68           3      16358.1           0.1       8.6X
[info] native                      31             32           1      32623.7           0.0      17.1X
[info]
[info] dgemm[N,T]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        580            598          12       1724.5           0.6       1.0X
[info] java                        61             68           4      16302.5           0.1       9.5X
[info] native                      30             32           1      32962.8           0.0      19.1X
[info]
[info] dgemm[T,N]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        829            838           4       1206.2           0.8       1.0X
[info] java                        61             69           3      16339.7           0.1      13.5X
[info] native                      30             31           1      33231.9           0.0      27.6X
[info]
[info] dgemm[T,T]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                       1352           1363           5        739.6           1.4       1.0X
[info] java                        61             69           3      16347.0           0.1      22.1X
[info] native                      31             32           1      32740.3           0.0      44.3X
[info]
[info] sgemm[N,N]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        482            493           7       2073.1           0.5       1.0X
[info] java                        35             38           2      28315.3           0.0      13.7X
[info] native                      15             15           1      67579.7           0.0      32.6X
[info]
[info] sgemm[N,T]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        472            482           4       2119.0           0.5       1.0X
[info] java                        36             38           2      28138.1           0.0      13.3X
[info] native                      15             16           1      66616.5           0.0      31.4X
[info]
[info] sgemm[T,N]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        823            830           5       1215.2           0.8       1.0X
[info] java                        35             38           2      28681.4           0.0      23.6X
[info] native                      15             15           1      67908.4           0.0      55.9X
[info]
[info] sgemm[T,T]:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] -----------------------------------------------------------------------------------------------
[info] f2j                        896            908           7       1115.8           0.9       1.0X
[info] java                        35             38           2      28402.0           0.0      25.5X
[info] native                      15             16           0      66691.2           0.0      59.8X
```

TODO:
- [x] update documentation in `docs/` and `docs/ml-linalg-guide.md` referring to `com.github.fommil.netlib`
- [ ] merge https://github.com/luhenry/netlib/pull/1 with all feedback from this PR + remove references to snapshot repositories in `pom.xml` and `project/SparkBuild.scala`.

Closes #32415 from luhenry/master.

Authored-by: Ludovic Henry <git@ludovic.dev>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-05-12 08:59:36 -05:00
Kousuke Saruta 2b6640a169 [SPARK-35229][WEBUI] Limit the maximum number of items on the timeline view
### What changes were proposed in this pull request?

This PR proposes to introduce three new configurations to limit the maximum number of jobs/stages/executors on the timeline view.

### Why are the changes needed?

If the number of items on the timeline view grows beyond 1,000, rendering can become significantly slow.
https://issues.apache.org/jira/browse/SPARK-35229

The maximum number of tasks on the timeline is already limited by `spark.ui.timeline.tasks.maximum`, so I propose to mitigate this issue in the same manner.

### Does this PR introduce _any_ user-facing change?

Yes. The maximum number of items shown on the timeline view is limited.
I propose a default value of 500 for jobs and stages, and 250 for executors.
Since an executor has at most two timeline items (added and removed), 250 is chosen for executors.
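
With those defaults, usage might look like the following in `spark-defaults.conf` (the property names here mirror the existing `spark.ui.timeline.tasks.maximum` and are my assumption of what this PR introduces):

```
spark.ui.timeline.jobs.maximum       500
spark.ui.timeline.stages.maximum     500
spark.ui.timeline.executors.maximum  250
```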

### How was this patch tested?

I manually confirmed this change works with the following procedure.
```
# launch a cluster
$ bin/spark-shell --conf spark.ui.retainedDeadExecutors=300 --master "local-cluster[4, 1, 1024]"

// Confirm the maximum number of jobs
(1 to 1000).foreach { _ => sc.parallelize(List(1)).collect }

// Confirm the maximum number of stages
var df = sc.parallelize(1 to 2)
(1 to 1000).foreach { i =>  df = df.repartition(i % 5 + 1) }
df.collect

// Confirm the maximum number of executors
(1 to 300).foreach { _ => try sc.parallelize(List(1)).foreach { _ => System.exit(0) } catch { case e => }}
```

Screenshots here.
![jobs_limited](https://user-images.githubusercontent.com/4736016/116386937-3e8c4a00-a855-11eb-8f4c-151cf7ddd3b8.png)
![stages_limited](https://user-images.githubusercontent.com/4736016/116386990-49df7580-a855-11eb-9f71-8e129e3336ab.png)
![executors_limited](https://user-images.githubusercontent.com/4736016/116387009-4f3cc000-a855-11eb-8697-a2eb4c9c99e6.png)

Closes #32381 from sarutak/mitigate-timeline-issue.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-05-11 20:53:11 +08:00
Max Gekk 335f00b19b [SPARK-35285][SQL] Parse ANSI interval types in SQL schema
### What changes were proposed in this pull request?
1. Extend Spark SQL parser to support parsing of:
    - `INTERVAL YEAR TO MONTH` to `YearMonthIntervalType`
    - `INTERVAL DAY TO SECOND` to `DayTimeIntervalType`
2. Assign new names to the ANSI interval types according to the SQL standard to be able to parse the names back by Spark SQL parser. Override the `typeName()` name of `YearMonthIntervalType`/`DayTimeIntervalType`.

### Why are the changes needed?
To be able to use new ANSI interval types in SQL. The SQL standard requires the types to be defined according to the rules:
```
<interval type> ::= INTERVAL <interval qualifier>
<interval qualifier> ::= <start field> TO <end field> | <single datetime field>
<start field> ::= <non-second primary datetime field> [ <left paren> <interval leading field precision> <right paren> ]
<end field> ::= <non-second primary datetime field> | SECOND [ <left paren> <interval fractional seconds precision> <right paren> ]
<primary datetime field> ::= <non-second primary datetime field> | SECOND
<non-second primary datetime field> ::= YEAR | MONTH | DAY | HOUR | MINUTE
<interval fractional seconds precision> ::= <unsigned integer>
<interval leading field precision> ::= <unsigned integer>
```
Currently, Spark SQL supports only `YEAR TO MONTH` and `DAY TO SECOND` as `<interval qualifier>`.
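
For illustration, a schema using the newly parseable type names might look like this (a sketch only, since the types have not been released yet):

```
CREATE TABLE events (
  ym INTERVAL YEAR TO MONTH,
  dt INTERVAL DAY TO SECOND
);
```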

### Does this PR introduce _any_ user-facing change?
Should not, since the types have not been released yet.

### How was this patch tested?
By running the affected tests such as:
```
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z interval.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z datetime.sql"
$ build/sbt "test:testOnly *ExpressionTypeCheckingSuite"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z windowFrameCoercion.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z literals.sql"
```

Closes #32409 from MaxGekk/parse-ansi-interval-types.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-03 13:50:35 +09:00
Kousuke Saruta 132cbf0c8c [SPARK-35105][SQL] Support multiple paths for ADD FILE/JAR/ARCHIVE commands
### What changes were proposed in this pull request?

This PR extends `ADD FILE/JAR/ARCHIVE` commands to be able to take multiple path arguments like Hive.

### Why are the changes needed?

To make those commands more useful.

### Does this PR introduce _any_ user-facing change?

Yes. In the current implementation, those commands can take a path containing whitespace without enclosing it in either `'` or `"`, but after this change users need to quote such paths.
I've noted this incompatibility in the migration guide.
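
A hypothetical example of the new behavior, with multiple paths per command and a whitespace-containing path quoted:

```
ADD FILE '/data/one.txt' '/data/two.txt';
ADD JAR '/libs/with space/extra.jar';
```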

### How was this patch tested?

New tests.

Closes #32205 from sarutak/add-multiple-files.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-04-29 13:58:51 +09:00
Kousuke Saruta 529b875901 [SPARK-35226][SQL] Support refreshKrb5Config option in JDBC datasources
### What changes were proposed in this pull request?

This PR proposes to introduce a new JDBC option `refreshKrb5Config` which allows to reflect the change of `krb5.conf`.

### Why are the changes needed?

In the current master, JDBC datasources can't accept `refreshKrb5Config`, which is defined in `Krb5LoginModule`.
So even if we change `krb5.conf` after establishing a connection, the change will not be reflected.

A similar issue happens when we run multiple `*KrbIntegrationSuite`s at the same time.
`MiniKDC` starts and stops for every KerberosIntegrationSuite, and a different port number is recorded in `krb5.conf` each time.
Because `SecureConnectionProvider.JDBCConfiguration` doesn't take `refreshKrb5Config`, the KerberosIntegrationSuites other than the first running one see the wrong port, so those suites fail.
You can easily confirm with the following command.
```
build/sbt -Phive -Phive-thriftserver -Pdocker-integration-tests "testOnly org.apache.spark.sql.jdbc.*KrbIntegrationSuite"
```
### Does this PR introduce _any_ user-facing change?

Yes. Users can set `refreshKrb5Config` to refresh the Kerberos (krb5) configuration.
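
A hypothetical read using the new option (`keytab` and `principal` are the existing Kerberos JDBC options; the URL and table names are placeholders):

```
val df = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://example.com:5432/testdb")
  .option("dbtable", "accounts")
  .option("keytab", "/path/to/client.keytab")
  .option("principal", "client@EXAMPLE.COM")
  // Re-read krb5.conf instead of relying on the cached configuration.
  .option("refreshKrb5Config", "true")
  .load()
```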

### How was this patch tested?

New test.

Closes #32344 from sarutak/kerberos-refresh-issue.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-04-29 13:55:53 +09:00
Angerszhuuuu 26a5e339a6 [SPARK-33976][SQL][DOCS][FOLLOWUP] Fix syntax error in select doc page
### What changes were proposed in this pull request?
Add docs about `TRANSFORM` and related functions.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed (documentation-only change).

Closes #32257 from AngersZhuuuu/SPARK-33976-followup.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-04-28 16:47:02 +09:00
Ludovic Henry 5b77ebb57b [SPARK-35150][ML] Accelerate fallback BLAS with dev.ludovic.netlib
### What changes were proposed in this pull request?

Following https://github.com/apache/spark/pull/30810, I've continued looking for ways to accelerate the usage of BLAS in Spark. With this PR, I integrate work done in the [`dev.ludovic.netlib`](https://github.com/luhenry/netlib/) Maven package.

The `dev.ludovic.netlib` library wraps the original `com.github.fommil.netlib` library and focuses on accelerating the linear algebra routines in use in Spark. When running the `org.apache.spark.ml.linalg.BLASBenchmark` benchmarking suite, I get the results at [1] on an Intel machine. Moreover, this library is thoroughly tested to return exactly the same results as the reference implementation.

Under the hood, it reimplements the necessary algorithms in pure, autovectorization-friendly Java 8, and takes advantage of the Vector API and Foreign Linker API introduced in JDK 16 when available.

A table summarising which version gets loaded in which case:

```
|                       | BLAS.nativeBLAS                                    | BLAS.javaBLAS                                      |
| --------------------- | -------------------------------------------------- | -------------------------------------------------- |
| with -Pnetlib-lgpl    | 1. dev.ludovic.netlib.blas.NetlibNativeBLAS, a     | 1. dev.ludovic.netlib.blas.VectorizedBLAS          |
|                       |     wrapper for com.github.fommil:all              |    (JDK16+, relies on the Vector API, requires     |
|                       | 2. dev.ludovic.netlib.blas.ForeignBLAS (JDK16+,    |     `--add-modules=jdk.incubator.vector` on JDK16) |
|                       |    relies on the Foreign Linker API, requires      | 2. dev.ludovic.netlib.blas.Java11BLAS (JDK11+)     |
|                       |    `--add-modules=jdk.incubator.foreign            | 3. dev.ludovic.netlib.blas.JavaBLAS                |
|                       |     -Dforeign.restricted=warn`)                    | 4. dev.ludovic.netlib.blas.NetlibF2jBLAS, a        |
|                       | 3. fails to load, falls back to BLAS.javaBLAS in   |     wrapper for com.github.fommil:core             |
|                       |     org.apache.spark.ml.linalg.BLAS                |                                                    |
| --------------------- | -------------------------------------------------- | -------------------------------------------------- |
| without -Pnetlib-lgpl | 1. dev.ludovic.netlib.blas.ForeignBLAS (JDK16+,    | 1. dev.ludovic.netlib.blas.VectorizedBLAS          |
|                       |    relies on the Foreign Linker API, requires      |    (JDK16+, relies on the Vector API, requires     |
|                       |    `--add-modules=jdk.incubator.foreign            |     `--add-modules=jdk.incubator.vector` on JDK16) |
|                       |     -Dforeign.restricted=warn`)                    | 2. dev.ludovic.netlib.blas.Java11BLAS (JDK11+)     |
|                       | 2. fails to load, falls back to BLAS.javaBLAS in   | 3. dev.ludovic.netlib.blas.JavaBLAS                |
|                       |     org.apache.spark.ml.linalg.BLAS                | 4. dev.ludovic.netlib.blas.NetlibF2jBLAS, a        |
|                       |                                                    |     wrapper for com.github.fommil:core             |
| --------------------- | -------------------------------------------------- | -------------------------------------------------- |
```
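
As a rough sketch of checking which implementation gets loaded at runtime (assuming the library exposes a `getInstance()` factory on `dev.ludovic.netlib.blas.BLAS` — an assumption about its API, not part of this PR):

```scala
// Sketch, assuming a getInstance() factory on the BLAS interface.
import dev.ludovic.netlib.blas.BLAS

val blas = BLAS.getInstance()            // resolves to the best available implementation
println(blas.getClass.getName)           // e.g. ...Java11BLAS or ...VectorizedBLAS
val x = Array(1.0, 2.0, 3.0)
val y = Array(4.0, 5.0, 6.0)
println(blas.ddot(x.length, x, 1, y, 1)) // 1*4 + 2*5 + 3*6 = 32.0
```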

### Why are the changes needed?

Accelerates linear algebra operations when the pure-Java fallback is in use, and transparently uses a native implementation (OpenBLAS, MKL) when one is available.

### Does this PR introduce _any_ user-facing change?

No, all changes are transparent to the user.

### How was this patch tested?

The `dev.ludovic.netlib` library has its own test suite [2]. It has also been validated by running the Spark test suite and benchmarking suite.

[1] Results for `org.apache.spark.ml.linalg.BLASBenchmark`:
#### JDK8:
```
[info] OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Linux 5.8.0-50-generic
[info] Intel(R) Xeon(R) E-2276G CPU @ 3.80GHz
[info]
[info] f2jBLAS    = dev.ludovic.netlib.blas.NetlibF2jBLAS
[info] javaBLAS   = dev.ludovic.netlib.blas.Java8BLAS
[info] nativeBLAS = dev.ludovic.netlib.blas.Java8BLAS
[info]
[info] daxpy:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 223            232           8        448.0           2.2       1.0X
[info] java                                                221            228           7        453.0           2.2       1.0X
[info]
[info] saxpy:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 122            128           4        821.2           1.2       1.0X
[info] java                                                122            128           4        822.3           1.2       1.0X
[info]
[info] ddot:                                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 109            112           2        921.4           1.1       1.0X
[info] java                                                 70             74           3       1423.5           0.7       1.5X
[info]
[info] sdot:                                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                  96             98           2       1046.1           1.0       1.0X
[info] java                                                 47             49           2       2121.7           0.5       2.0X
[info]
[info] dscal:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 184            195           8        544.3           1.8       1.0X
[info] java                                                185            196           7        539.5           1.9       1.0X
[info]
[info] sscal:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                  99            104           4       1011.9           1.0       1.0X
[info] java                                                 99            104           4       1010.4           1.0       1.0X
[info]
[info] dspmv[U]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0        947.2           1.1       1.0X
[info] java                                                  0              0           0       1584.8           0.6       1.7X
[info]
[info] dspr[U]:                                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0        867.4           1.2       1.0X
[info] java                                                  1              1           0        865.0           1.2       1.0X
[info]
[info] dsyr[U]:                                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0        485.9           2.1       1.0X
[info] java                                                  1              1           0        486.8           2.1       1.0X
[info]
[info] dgemv[N]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0       1843.0           0.5       1.0X
[info] java                                                  0              0           0       2690.6           0.4       1.5X
[info]
[info] dgemv[T]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0       1214.7           0.8       1.0X
[info] java                                                  0              0           0       2536.8           0.4       2.1X
[info]
[info] sgemv[N]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0       1895.9           0.5       1.0X
[info] java                                                  0              0           0       2961.1           0.3       1.6X
[info]
[info] sgemv[T]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0       1223.4           0.8       1.0X
[info] java                                                  0              0           0       3091.4           0.3       2.5X
[info]
[info] dgemm[N,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 560            575          20       1787.1           0.6       1.0X
[info] java                                                226            232           5       4432.4           0.2       2.5X
[info]
[info] dgemm[N,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 570            586          23       1755.2           0.6       1.0X
[info] java                                                227            232           4       4410.1           0.2       2.5X
[info]
[info] dgemm[T,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 863            879          17       1158.4           0.9       1.0X
[info] java                                                227            231           3       4407.9           0.2       3.8X
[info]
[info] dgemm[T,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                1282           1305          23        780.0           1.3       1.0X
[info] java                                                227            232           4       4413.4           0.2       5.7X
[info]
[info] sgemm[N,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 538            548           8       1858.6           0.5       1.0X
[info] java                                                221            226           3       4521.1           0.2       2.4X
[info]
[info] sgemm[N,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 549            558          10       1819.9           0.5       1.0X
[info] java                                                222            229           7       4503.5           0.2       2.5X
[info]
[info] sgemm[T,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 838            852          12       1193.0           0.8       1.0X
[info] java                                                222            229           5       4500.5           0.2       3.8X
[info]
[info] sgemm[T,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 905            919          18       1104.8           0.9       1.0X
[info] java                                                221            228           5       4521.3           0.2       4.1X
```

#### JDK11:
```
[info] OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.8.0-50-generic
[info] Intel(R) Xeon(R) E-2276G CPU @ 3.80GHz
[info]
[info] f2jBLAS    = dev.ludovic.netlib.blas.NetlibF2jBLAS
[info] javaBLAS   = dev.ludovic.netlib.blas.Java11BLAS
[info] nativeBLAS = dev.ludovic.netlib.blas.Java11BLAS
[info]
[info] daxpy:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 195            204          10        512.7           2.0       1.0X
[info] java                                                195            202           7        512.4           2.0       1.0X
[info]
[info] saxpy:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 108            113           4        923.3           1.1       1.0X
[info] java                                                102            107           4        984.4           1.0       1.1X
[info]
[info] ddot:                                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 107            110           3        938.1           1.1       1.0X
[info] java                                                 69             72           3       1447.1           0.7       1.5X
[info]
[info] sdot:                                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                  96             98           2       1046.5           1.0       1.0X
[info] java                                                 43             45           2       2317.1           0.4       2.2X
[info]
[info] dscal:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 155            168           8        644.2           1.6       1.0X
[info] java                                                158            169           8        632.8           1.6       1.0X
[info]
[info] sscal:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                  85             90           4       1178.1           0.8       1.0X
[info] java                                                 86             90           4       1167.7           0.9       1.0X
[info]
[info] dspmv[U]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   0              0           0       1182.1           0.8       1.0X
[info] java                                                  0              0           0       1432.1           0.7       1.2X
[info]
[info] dspr[U]:                                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0        898.7           1.1       1.0X
[info] java                                                  1              1           0        891.5           1.1       1.0X
[info]
[info] dsyr[U]:                                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0        495.4           2.0       1.0X
[info] java                                                  1              1           0        495.7           2.0       1.0X
[info]
[info] dgemv[N]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   0              0           0       2271.6           0.4       1.0X
[info] java                                                  0              0           0       3648.1           0.3       1.6X
[info]
[info] dgemv[T]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0       1229.3           0.8       1.0X
[info] java                                                  0              0           0       2711.3           0.4       2.2X
[info]
[info] sgemv[N]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   0              0           0       2677.5           0.4       1.0X
[info] java                                                  0              0           0       3288.2           0.3       1.2X
[info]
[info] sgemv[T]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0       1233.0           0.8       1.0X
[info] java                                                  0              0           0       2766.3           0.4       2.2X
[info]
[info] dgemm[N,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 520            536          16       1923.6           0.5       1.0X
[info] java                                                214            221           7       4669.5           0.2       2.4X
[info]
[info] dgemm[N,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 593            612          17       1686.5           0.6       1.0X
[info] java                                                215            219           3       4643.3           0.2       2.8X
[info]
[info] dgemm[T,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 853            870          16       1172.8           0.9       1.0X
[info] java                                                215            218           3       4659.7           0.2       4.0X
[info]
[info] dgemm[T,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                1350           1370          23        740.8           1.3       1.0X
[info] java                                                215            219           4       4656.6           0.2       6.3X
[info]
[info] sgemm[N,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 460            468           6       2173.2           0.5       1.0X
[info] java                                                210            213           2       4752.7           0.2       2.2X
[info]
[info] sgemm[N,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 535            544           8       1869.3           0.5       1.0X
[info] java                                                210            215           5       4761.8           0.2       2.5X
[info]
[info] sgemm[T,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 843            853          11       1186.8           0.8       1.0X
[info] java                                                209            214           4       4793.4           0.2       4.0X
[info]
[info] sgemm[T,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 891            904          15       1122.0           0.9       1.0X
[info] java                                                209            214           4       4777.2           0.2       4.3X
```

#### JDK16:
```
[info] OpenJDK 64-Bit Server VM 16+36 on Linux 5.8.0-50-generic
[info] Intel(R) Xeon(R) E-2276G CPU @ 3.80GHz
[info]
[info] f2jBLAS    = dev.ludovic.netlib.blas.NetlibF2jBLAS
[info] javaBLAS   = dev.ludovic.netlib.blas.VectorizedBLAS
[info] nativeBLAS = dev.ludovic.netlib.blas.VectorizedBLAS
[info]
[info] daxpy:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 194            199           7        515.7           1.9       1.0X
[info] java                                                181            186           3        551.1           1.8       1.1X
[info]
[info] saxpy:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 109            115           4        915.0           1.1       1.0X
[info] java                                                 88             92           3       1138.8           0.9       1.2X
[info]
[info] ddot:                                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 108            110           2        922.6           1.1       1.0X
[info] java                                                 54             56           2       1839.2           0.5       2.0X
[info]
[info] sdot:                                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                  96             97           2       1046.1           1.0       1.0X
[info] java                                                 29             30           1       3393.4           0.3       3.2X
[info]
[info] dscal:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 156            165           5        643.0           1.6       1.0X
[info] java                                                150            159           5        667.1           1.5       1.0X
[info]
[info] sscal:                                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                  85             91           6       1171.0           0.9       1.0X
[info] java                                                 75             79           3       1340.6           0.7       1.1X
[info]
[info] dspmv[U]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0        917.0           1.1       1.0X
[info] java                                                  0              0           0       8147.2           0.1       8.9X
[info]
[info] dspr[U]:                                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0        859.3           1.2       1.0X
[info] java                                                  1              1           0        859.3           1.2       1.0X
[info]
[info] dsyr[U]:                                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0        482.1           2.1       1.0X
[info] java                                                  1              1           0        482.6           2.1       1.0X
[info]
[info] dgemv[N]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   0              0           0       2214.2           0.5       1.0X
[info] java                                                  0              0           0       7975.8           0.1       3.6X
[info]
[info] dgemv[T]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0       1231.4           0.8       1.0X
[info] java                                                  0              0           0       8680.9           0.1       7.0X
[info]
[info] sgemv[N]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   0              0           0       2684.3           0.4       1.0X
[info] java                                                  0              0           0      18527.1           0.1       6.9X
[info]
[info] sgemv[T]:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                   1              1           0       1235.4           0.8       1.0X
[info] java                                                  0              0           0      17347.9           0.1      14.0X
[info]
[info] dgemm[N,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 530            552          18       1887.5           0.5       1.0X
[info] java                                                 58             64           3      17143.9           0.1       9.1X
[info]
[info] dgemm[N,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 598            620          17       1671.1           0.6       1.0X
[info] java                                                 58             64           3      17196.6           0.1      10.3X
[info]
[info] dgemm[T,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 834            847          14       1199.4           0.8       1.0X
[info] java                                                 57             63           4      17486.9           0.1      14.6X
[info]
[info] dgemm[T,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                1338           1366          22        747.3           1.3       1.0X
[info] java                                                 58             63           3      17356.6           0.1      23.2X
[info]
[info] sgemm[N,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 489            501           9       2045.5           0.5       1.0X
[info] java                                                 36             38           2      27721.9           0.0      13.6X
[info]
[info] sgemm[N,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 478            488           9       2094.0           0.5       1.0X
[info] java                                                 36             38           2      27813.2           0.0      13.3X
[info]
[info] sgemm[T,N]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 825            837          10       1211.6           0.8       1.0X
[info] java                                                 35             38           2      28433.1           0.0      23.5X
[info]
[info] sgemm[T,T]:                               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] f2j                                                 900            918          15       1111.6           0.9       1.0X
[info] java                                                 36             38           2      28073.0           0.0      25.3X
```

[2] https://github.com/luhenry/netlib/tree/master/blas/src/test/java/dev/ludovic/netlib/blas

Closes #32253 from luhenry/master.

Authored-by: Ludovic Henry <git@ludovic.dev>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-04-27 14:00:59 -05:00
Daoyuan Wang 26a8d2f908 [SPARK-35238][DOC] Add JindoFS SDK in cloud integration documents
### What changes were proposed in this pull request?
Add a link to the JindoFS SDK documents in the cloud integration section of Spark's official documentation.

### Why are the changes needed?
If Spark users need to interact with Alibaba Cloud OSS, JindoFS SDK is the official solution provided by Alibaba Cloud.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tested the URL manually.

Closes #32360 from adrian-wang/jindodoc.

Authored-by: Daoyuan Wang <daoyuan.wdy@alibaba-inc.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-04-27 09:32:47 -05:00
Angerszhuuuu 1db031f158 [SPARK-35220][DOCS][FOLLOWUP] DayTimeIntervalType/YearMonthIntervalType show different between Hive SerDe and row format delimited
### What changes were proposed in this pull request?
Add a note to the migration guide that `DayTimeIntervalType`/`YearMonthIntervalType` values display differently between Hive SerDe mode and row format delimited mode.

### Why are the changes needed?
Add note

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #32343 from AngersZhuuuu/SPARK-35220-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-26 17:47:47 +03:00
Angerszhuuuu 20d68dc2f4 [SPARK-35159][SQL][DOCS] Extract hive format doc
### What changes were proposed in this pull request?
Extract the common doc about Hive format so that `sql-ref-syntax-ddl-create-table-hiveformat.md` and `sql-ref-syntax-qry-select-transform.md` can refer to it.

![image](https://user-images.githubusercontent.com/46485123/115802193-04641800-a411-11eb-827d-d92544881842.png)

### Why are the changes needed?
Improve doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #32264 from AngersZhuuuu/SPARK-35159.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-23 05:47:48 +00:00
Angerszhuuuu b22d54a58a [SPARK-35026][SQL] Support nested CUBE/ROLLUP/GROUPING SETS in GROUPING SETS
### What changes were proposed in this pull request?
PostgreSQL and Oracle both support using CUBE/ROLLUP/GROUPING SETS inside a GROUPING SETS grouping set, as syntactic sugar.
![image](https://user-images.githubusercontent.com/46485123/114975588-139a1180-9eb7-11eb-8f53-498c1db934e0.png)

In this PR, we support it in Spark SQL too.

### Why are the changes needed?
Keep consistent with PostgreSQL and Oracle.

### Does this PR introduce _any_ user-facing change?
Users can write grouping analytics like:
```
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(ROLLUP(a, b));
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), (a), ());
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b), (a), ()));
```

### How was this patch tested?
Added Test

Closes #32201 from AngersZhuuuu/SPARK-35026.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-22 13:08:22 +00:00
Shardul Mahadik 83f753e4e1 [SPARK-34472][YARN] Ship ivySettings file to driver in cluster mode
### What changes were proposed in this pull request?

In YARN, ship the `spark.jars.ivySettings` file to the driver when using `cluster` deploy mode so that `addJar` is able to find it in order to resolve ivy paths.

### Why are the changes needed?

SPARK-33084 introduced support for Ivy paths in `sc.addJar` or Spark SQL `ADD JAR`. If we use a custom ivySettings file using `spark.jars.ivySettings`, it is loaded at b26e7b510b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (L1280). However, this file is only accessible on the client machine. In YARN cluster mode, this file is not available on the driver and so `addJar` fails to find it.
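
For context, a minimal sketch of the Ivy-path `addJar` usage from SPARK-33084 that needs the settings file (the Maven coordinate is a placeholder); the custom settings file itself is supplied via `spark.jars.ivySettings` at submit time:

```scala
// Hypothetical illustration; the Maven coordinate is a placeholder.
spark.sparkContext.addJar("ivy://org.apache.logging.log4j:log4j-api:2.17.1")
// ... or the equivalent SQL form:
spark.sql("ADD JAR ivy://org.apache.logging.log4j:log4j-api:2.17.1")
```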

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit tests to verify that the `ivySettings` file is localized by the YARN client and that a YARN cluster mode application is able to find and load the `ivySettings` file.

Closes #31591 from shardulm94/SPARK-34472.

Authored-by: Shardul Mahadik <smahadik@linkedin.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2021-04-20 13:35:57 -05:00
Angerszhuuuu 9c956abb1d [SPARK-33976][SQL][DOCS] Add a SQL doc page for a TRANSFORM clause
### What changes were proposed in this pull request?
Add doc about `TRANSFORM` and related functions.

![image](https://user-images.githubusercontent.com/46485123/114332579-1627fe80-9b79-11eb-8fa7-131f0a20f72f.png)
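
As a small illustration of the clause this page documents (`testData` is a placeholder table with `key`/`value` columns):

```scala
// Hypothetical illustration; testData is a placeholder table.
spark.sql("""
  SELECT TRANSFORM (key, value)
    USING 'cat' AS (k string, v string)
  FROM testData
""").show()
```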

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #31010 from AngersZhuuuu/SPARK-33976.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-20 10:30:26 +00:00
Dongjoon Hyun 00f06dd267 [SPARK-35131][K8S] Support early driver service clean-up during app termination
### What changes were proposed in this pull request?

This PR aims to support a new configuration, `spark.kubernetes.driver.service.deleteOnTermination`, to clean up the `Driver Service` resource during app termination.

### Why are the changes needed?

The K8s service is an important resource, and it is sometimes controlled by a quota.
```
$ k describe quota
Name:       service
Namespace:  default
Resource    Used  Hard
--------    ----  ----
services    1     3
```

Apache Spark creates a service for the driver whose lifecycle is the same as the driver pod's.
This means a new Spark job submission fails once the number of completed Spark jobs reaches the service quota.

**BEFORE**
```
$ k get pod
NAME                                                        READY   STATUS      RESTARTS   AGE
org-apache-spark-examples-sparkpi-a32c9278e7061b4d-driver   0/1     Completed   0          31m
org-apache-spark-examples-sparkpi-a9f1f578e721ef62-driver   0/1     Completed   0          78s

$ k get svc
NAME                                                            TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                      AGE
kubernetes                                                      ClusterIP   10.96.0.1    <none>        443/TCP                      80m
org-apache-spark-examples-sparkpi-a32c9278e7061b4d-driver-svc   ClusterIP   None         <none>        7078/TCP,7079/TCP,4040/TCP   31m
org-apache-spark-examples-sparkpi-a9f1f578e721ef62-driver-svc   ClusterIP   None         <none>        7078/TCP,7079/TCP,4040/TCP   80s

$ k describe quota
Name:       service
Namespace:  default
Resource    Used  Hard
--------    ----  ----
services    3     3

$ bin/spark-submit...
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException:
Failure executing: POST at: https://192.168.64.50:8443/api/v1/namespaces/default/services.
Message: Forbidden! User minikube doesn't have permission.
services "org-apache-spark-examples-sparkpi-843f6978e722819c-driver-svc" is forbidden:
exceeded quota: service, requested: services=1, used: services=3, limited: services=3.
```

**AFTER**
```
$ k get pod
NAME                                                        READY   STATUS      RESTARTS   AGE
org-apache-spark-examples-sparkpi-23d5f278e77731a7-driver   0/1     Completed   0          26s
org-apache-spark-examples-sparkpi-d1292278e7768ed4-driver   0/1     Completed   0          67s
org-apache-spark-examples-sparkpi-e5bedf78e776ea9d-driver   0/1     Completed   0          44s

$ k get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   172m

$ k describe quota
Name:       service
Namespace:  default
Resource    Used  Hard
--------    ----  ----
services    1     3
```

### Does this PR introduce _any_ user-facing change?

Yes, this PR adds a new configuration, `spark.kubernetes.driver.service.deleteOnTermination`, and enables it by default.
The change is documented in the migration guide.
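
For example, a user who wants to keep the old behavior could opt out (a minimal sketch; the same setting can also be passed with `--conf` at submit time):

```scala
import org.apache.spark.SparkConf

// Sketch: opt out of the new early driver-service clean-up.
val conf = new SparkConf()
  .set("spark.kubernetes.driver.service.deleteOnTermination", "false")
```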

### How was this patch tested?

Pass the CIs.

This is tested with K8s IT manually.

```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 19 minutes, 9 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #32226 from dongjoon-hyun/SPARK-35131.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-19 12:11:08 -07:00
Max Gekk 1d1ed3eb25 [SPARK-35107][SQL] Parse unit-to-unit interval literals to ANSI intervals
### What changes were proposed in this pull request?
Parse the year-month interval literals like `INTERVAL '1-1' YEAR TO MONTH` to values of `YearMonthIntervalType`, and day-time interval literals to `DayTimeIntervalType` values. Currently, Spark SQL supports:
- DAY TO HOUR
- DAY TO MINUTE
- DAY TO SECOND
- HOUR TO MINUTE
- HOUR TO SECOND
- MINUTE TO SECOND

All such interval literals are converted to `DayTimeIntervalType`, and `YEAR TO MONTH` to `YearMonthIntervalType`, while losing info about the `from` and `to` units.

**Note**: the new behavior is under the SQL config `spark.sql.legacy.interval.enabled`, which is `false` by default. When the config is set to `true`, the interval literals are parsed to `CalendarIntervalType` values.
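
A minimal sketch of toggling the legacy behavior (assuming the flag is a runtime SQL config):

```scala
// Sketch: restore parsing to CalendarIntervalType via the legacy flag.
spark.conf.set("spark.sql.legacy.interval.enabled", "true")
spark.sql("SELECT typeof(INTERVAL '1-1' YEAR TO MONTH)").show() // back to: interval
```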

Closes #32176

### Why are the changes needed?
To conform to the ANSI SQL standard, which assumes conversion of interval literals to year-month or day-time intervals, but not to a mixed interval type like Catalyst's `CalendarIntervalType`.

### Does this PR introduce _any_ user-facing change?
Yes.

Before:
```sql
spark-sql> SELECT INTERVAL '1 01:02:03.123' DAY TO SECOND;
1 days 1 hours 2 minutes 3.123 seconds
spark-sql> SELECT typeof(INTERVAL '1 01:02:03.123' DAY TO SECOND);
interval
```

After:
```sql
spark-sql> SELECT INTERVAL '1 01:02:03.123' DAY TO SECOND;
1 01:02:03.123000000
spark-sql> SELECT typeof(INTERVAL '1 01:02:03.123' DAY TO SECOND);
day-time interval
```

### How was this patch tested?
1. By running the affected test suites:
```
$ ./build/sbt "test:testOnly *.ExpressionParserSuite"
$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite -- -z interval.sql"
$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite -- -z create_view.sql"
$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite -- -z date.sql"
$ SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite -- -z timestamp.sql"
```
2. PostgreSQL tests are executed with `spark.sql.legacy.interval.enabled` set to `true` to keep compatibility with PostgreSQL output:
```sql
> SELECT interval '999' second;
0 years 0 mons 0 days 0 hours 16 mins 39.00 secs
```

Closes #32209 from MaxGekk/parse-ansi-interval-literals.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-19 16:00:59 +03:00
Angerszhuuuu a74f601040 [SPARK-31937][SQL] Support processing ArrayType/MapType/StructType data using no-serde mode script transform
### What changes were proposed in this pull request?
Support processing ArrayType/MapType/StructType data in no-serde mode script transform.

### Why are the changes needed?
Let users process array/map/struct data.

### Does this PR introduce _any_ user-facing change?
Yes, users can process array/map/struct data in script transform `no-serde` mode.
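
For illustration, a hypothetical query exercising the new support (the `cat` script and inline `VALUES` source are placeholders):

```scala
// Hypothetical illustration of no-serde TRANSFORM over complex-typed inputs.
spark.sql("""
  SELECT TRANSFORM (array(1, 2), map('a', 1), named_struct('x', 1))
    USING 'cat' AS (a string, b string, c string)
  FROM VALUES (0) AS t(id)
""").show()
```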

### How was this patch tested?
Added UT

Closes #30957 from AngersZhuuuu/SPARK-31937.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-19 17:02:32 +09:00
Sean Owen 2e1e1f83e4 [MINOR][DOCS] Soften security warning and keep it in cluster management docs only
### What changes were proposed in this pull request?

Soften security warning and keep it in cluster management docs only, not in the main doc page, where it's not necessarily relevant.

### Why are the changes needed?

The statement is perhaps unnecessarily 'frightening' as the first section of the main docs page. It applies to clusters, not local mode, anyhow.

### Does this PR introduce _any_ user-facing change?

Just a docs change.

### How was this patch tested?

N/A

Closes #32206 from srowen/SecurityStatement.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-04-17 08:44:00 -05:00
ulysses-you 345c380778 [SPARK-35083][CORE] Support remote scheduler pool files
### What changes were proposed in this pull request?

Use the Hadoop `FileSystem` API instead of `FileInputStream`.

### Why are the changes needed?

Make `spark.scheduler.allocation.file` support remote files. When using Spark as a server (e.g. SparkThriftServer), it's hard for users to specify a local path as the scheduler pool file.
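
A minimal sketch of the newly supported usage:

```scala
import org.apache.spark.SparkConf

// Sketch: load the fair-scheduler pool file from HDFS instead of the local FS.
val conf = new SparkConf()
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "hdfs:///tmp/fairscheduler.xml")
```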

### Does this PR introduce _any_ user-facing change?

Yes, a minor feature.

### How was this patch tested?

Pass `core/src/test/scala/org/apache/spark/scheduler/PoolSuite.scala` and a manual test.
After adding the config `spark.scheduler.allocation.file=hdfs:///tmp/fairscheduler.xml`, we introduce the configured pool.
![pool1](https://user-images.githubusercontent.com/12025282/114810037-df065700-9ddd-11eb-8d7a-54b59a07ee7b.jpg)

Closes #32184 from ulysses-you/SPARK-35083.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-16 00:18:35 -07:00
Angerszhuuuu 71133e1c2a [SPARK-35070][SQL] TRANSFORM not support alias in inputs
### What changes were proposed in this pull request?
Normal function parameters should not support aliases; Hive does not support them either.
![image](https://user-images.githubusercontent.com/46485123/114645556-4a7ff400-9d0c-11eb-91eb-bc679ea0039a.png)
In this PR we forbid the use of aliases in `TRANSFORM`'s inputs.

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #32165 from AngersZhuuuu/SPARK-35070.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-15 06:51:07 +00:00
Angerszhuuuu 0fc97b5bf4 [SPARK-34577][SQL][FOLLOWUP] Add change of DESC NAMESPACE's schema to migration guide
### What changes were proposed in this pull request?
Add the change of `DESC NAMESPACE`'s schema to the migration guide.

### Why are the changes needed?
Update doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #32155 from AngersZhuuuu/SPARK-34577-followup.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-14 06:30:45 +00:00
Sean Owen 700aa1769c [SPARK-35050][DOCS][MESOS] Document deprecation of Apache Mesos in 3.2.0
### What changes were proposed in this pull request?

Deprecate Apache Mesos support for Spark 3.2.0 by adding documentation to this effect.

### Why are the changes needed?

Apache Mesos is ceasing development (https://lists.apache.org/thread.html/rab2a820507f7c846e54a847398ab20f47698ec5bce0c8e182bfe51ba%40%3Cdev.mesos.apache.org%3E); at some point we'll want to drop support, so we deprecate it now.

This doesn't mean it'll go away in 3.3.0.

### Does this PR introduce _any_ user-facing change?

No, docs only.

### How was this patch tested?

N/A

Closes #32150 from srowen/SPARK-35050.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-14 13:17:58 +09:00
Gengliang Wang 79e55b44f7 [SPARK-35028][SQL] ANSI mode: disallow group by aliases
### What changes were proposed in this pull request?

Disallow group by aliases under ANSI mode.

### Why are the changes needed?

As per the ANSI SQL standard, section 7.12 <group by clause>:

>Each `grouping column reference` shall unambiguously reference a column of the table resulting from the `from clause`. A column referenced in a `group by clause` is a grouping column.

By forbidding it, we can avoid ambiguous SQL queries like:
```
SELECT col + 1 as col FROM t GROUP BY col
```
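
A sketch of the behavior change (assuming a table `t` whose columns do not include `a`):

```scala
// Sketch: under ANSI mode, GROUP BY no longer resolves against SELECT aliases.
spark.conf.set("spark.sql.ansi.enabled", "true")
// spark.sql("SELECT col1 AS a, count(*) FROM t GROUP BY a") // now fails: `a` is an alias, not a column
```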

### Does this PR introduce _any_ user-facing change?

Yes, group by aliases are not allowed under ANSI mode.

### How was this patch tested?

Unit tests

Closes #32129 from gengliangwang/disallowGroupByAlias.

Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-04-13 10:42:57 +08:00
Yuming Wang e40fce919a [SPARK-34562][SQL] Add test and doc for Parquet Bloom filter push down
### What changes were proposed in this pull request?

This PR adds a test and documentation for Parquet Bloom filter push-down.
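
For context, a hedged sketch of the kind of usage the new doc covers (the per-column Parquet option key and the paths are assumptions here, not taken from this PR):

```scala
// Sketch: write Parquet with a Bloom filter on `id`, then filter on that column.
spark.range(100).write
  .option("parquet.bloom.filter.enabled#id", "true") // assumed per-column option key
  .parquet("/tmp/bloom_demo")
spark.read.parquet("/tmp/bloom_demo").where("id = 42").show()
```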

### Why are the changes needed?

Improve the documentation.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Generating docs:
![image](https://user-images.githubusercontent.com/5399861/114327472-c131bb80-9b6b-11eb-87a0-6f9a74eb1097.png)

Closes #32123 from wangyum/SPARK-34562.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-12 17:07:35 +03:00
Angerszhuuuu 21232377ba [SPARK-33229][SQL] Support partial grouping analytics and concatenated grouping analytics
### What changes were proposed in this pull request?
Support GROUP BY with separate columns and CUBE/ROLLUP.

PostgreSQL supports:
```
select a, b, c, count(1) from t group by a, b, cube (a, b, c);
select a, b, c, count(1) from t group by a, b, rollup(a, b, c);
select a, b, c, count(1) from t group by cube(a, b), rollup (a, b, c);
select a, b, c, count(1) from t group by a, b, grouping sets((a, b), (a), ());
```
In this PR, we have done two things, as below:

1. Support partial grouping analytics such as `group by a, cube(a, b)`
2. Support mixed grouping analytics such as `group by cube(a, b), rollup(b,c)`

*Partial Groupings*

    Partial groupings mean there are both `group_expression` and `CUBE|ROLLUP|GROUPING SETS`
    in the GROUP BY clause. For example:
    `GROUP BY warehouse, CUBE(product, location)` is equivalent to
    `GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse, location), (warehouse))`.
    `GROUP BY warehouse, ROLLUP(product, location)` is equivalent to
    `GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse))`.
    `GROUP BY warehouse, GROUPING SETS((product, location), (product), ())` is equivalent to
    `GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse))`.

*Concatenated Groupings*

    Concatenated groupings offer a concise way to generate useful combinations of groupings. Groupings specified
    with concatenated groupings yield the cross-product of groupings from each grouping set. The cross-product
    operation enables even a small number of concatenated groupings to generate a large number of final groups.
    The concatenated groupings are specified simply by listing multiple `GROUPING SETS`, `CUBE`, and `ROLLUP` clauses,
    and separating them with commas. For example:
    `GROUP BY GROUPING SETS((warehouse), (producet)), GROUPING SETS((location), (size))` is equivalent to
    `GROUP BY GROUPING SETS((warehouse, location), (warehouse, size), (product, location), (product, size))`.
    `GROUP BY CUBE((warehouse), (product)), ROLLUP((location), (size))` is equivalent to
    `GROUP BY GROUPING SETS((warehouse, product), (warehouse), (product), ()), GROUPING SETS((location, size), (location), ())`, which in turn is equivalent to
    `GROUP BY GROUPING SETS(
        (warehouse, product, location, size), (warehouse, product, location), (warehouse, product),
        (warehouse, location, size), (warehouse, location), (warehouse),
        (product, location, size), (product, location), (product),
        (location, size), (location), ())`.
    `GROUP BY order, CUBE((warehouse), (product)), ROLLUP((location), (size))` is equivalent to
    `GROUP BY order, GROUPING SETS((warehouse, product), (warehouse), (product), ()), GROUPING SETS((location, size), (location), ())`, which in turn is equivalent to
    `GROUP BY GROUPING SETS(
        (order, warehouse, product, location, size), (order, warehouse, product, location), (order, warehouse, product),
        (order, warehouse, location, size), (order, warehouse, location), (order, warehouse),
        (order, product, location, size), (order, product, location), (order, product),
        (order, location, size), (order, location), (order))`.

### Why are the changes needed?
Support more flexible grouping analytics

### Does this PR introduce _any_ user-facing change?
Users can use SQL like:
```
select a, b, c, agg_expr() from table group by a, cube(b, c)
```

### How was this patch tested?
Added UT

Closes #30144 from AngersZhuuuu/SPARK-33229.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-12 08:23:52 +00:00
hissy 214a46aa88 [MINOR][SS][DOC] Fix wrong Python code sample
### What changes were proposed in this pull request?
This patch fixes a wrong Python code sample in the docs.

### Why are the changes needed?
Sample code is wrong.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Doc only.

Closes #32119 from Hisssy/ss-doc-typo-1.

Authored-by: hissy <aozora@live.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-10 12:33:30 +03:00
itholic 8e15ac1c67 [SPARK-34493][DOCS] Add "TEXT Files" page for Data Source documents
### What changes were proposed in this pull request?

This PR aims to add documentation on how to read and write TEXT files through various APIs such as Scala, Python and Java in Spark to [Data Source documents](https://spark.apache.org/docs/latest/sql-data-sources.html#data-sources).
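As a rough sketch of the APIs that page covers (spark-shell; the paths are hypothetical):
```scala
// read a plain-text file as a DataFrame with a single "value" string column,
// then write a text file back out
val df = spark.read.text("examples/src/main/resources/people.txt")
df.show()
df.write.text("/tmp/text-output")
```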

### Why are the changes needed?

Documentation on how Spark handles TEXT files is missing. It should be added to the document for user convenience.

### Does this PR introduce _any_ user-facing change?

Yes, this PR adds a new page to Data Sources documents.

### How was this patch tested?

Manually built the documents and checked the page locally, as below.

![Screen Shot 2021-04-07 at 4 05 01 PM](https://user-images.githubusercontent.com/44108233/113824674-085e2c00-97bb-11eb-91ae-d2cc19dfd369.png)

Closes #32053 from itholic/SPARK-34491-TEXT.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-07 17:11:43 +03:00
Kent Yao 7cffacef18 [SPARK-34935][SQL] CREATE TABLE LIKE should respect the reserved table properties
### What changes were proposed in this pull request?

CREATE TABLE LIKE should respect the reserved properties of tables and fail if they are specified; use `spark.sql.legacy.notReserveProperties` to restore the old behavior.
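A rough sketch of the new behavior (spark-shell; the table names are hypothetical, and treating `location` as one of the reserved properties is an assumption here):
```scala
// specifying a reserved property now fails for CREATE TABLE LIKE
spark.sql("CREATE TABLE t2 LIKE t1 TBLPROPERTIES ('location' = '/tmp/t2')")  // error
// the legacy flag restores the old, lenient behavior
spark.conf.set("spark.sql.legacy.notReserveProperties", "true")
```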

### Why are the changes needed?

Make DDLs treat reserved properties consistently.

### Does this PR introduce _any_ user-facing change?

Yes, this is a breaking change: `CREATE TABLE LIKE` with reserved properties will now fail.

### How was this patch tested?

new test

Closes #32025 from yaooqinn/SPARK-34935.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-04-06 08:52:48 +09:00
Wenchen Fan 39d5677ee3 [SPARK-34932][SQL] deprecate GROUP BY ... GROUPING SETS (...) and promote GROUP BY GROUPING SETS (...)
### What changes were proposed in this pull request?

GROUP BY ... GROUPING SETS (...) is a weird SQL syntax we copied from Hive. It's not in the SQL standard or any other mainstream databases. This syntax requires users to repeat the expressions inside `GROUPING SETS (...)` after `GROUP BY`, and has weird null semantics if `GROUP BY` contains more expressions than `GROUPING SETS (...)`.

This PR deprecates this syntax:
1. Do not promote it in the document and only mention it as a Hive-compatible syntax.
2. Simplify the code to only keep it for Hive compatibility.

### Why are the changes needed?

Deprecate a weird grammar.

### Does this PR introduce _any_ user-facing change?

No breaking change, but it removes a check to simplify the code: `GROUP BY a GROUPING SETS(a, b)` failed before, forcing users to also put `b` after `GROUP BY`. Now it works just like `GROUP BY GROUPING SETS(a, b)`, as sketched below.
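Concretely (a spark-shell sketch; the table `t` with columns `a` and `b` is hypothetical):
```scala
// the Hive-compatible form no longer requires b to be repeated after GROUP BY;
// it now behaves exactly like the standard form on the next line
spark.sql("SELECT a, b, count(*) FROM t GROUP BY a GROUPING SETS (a, b)")
spark.sql("SELECT a, b, count(*) FROM t GROUP BY GROUPING SETS (a, b)")
```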

### How was this patch tested?

existing tests

Closes #32022 from cloud-fan/followup.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-04-06 08:49:08 +09:00
twoentartian a72f0d7c90 [SPARK-34492][DOCS] Add "CSV Files" page for Data Source documents
### What changes were proposed in this pull request?

Fix [SPARK-34492], add Scala examples to read/write CSV files.
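As a rough sketch of the kind of example the page adds (spark-shell; the paths are hypothetical):
```scala
// read a CSV file with a header row, then write one back out
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("examples/src/main/resources/people.csv")
df.write.option("header", "true").csv("/tmp/csv-output")
```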

### Why are the changes needed?

Fix [SPARK-34492].

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Build the document with "SKIP_API=1 bundle exec jekyll build", and everything looks fine.

Closes #31827 from twoentartian/master.

Authored-by: twoentartian <twoentartian@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-05 11:17:42 +09:00
Kousuke Saruta 8724f2b8b7 [SPARK-34933][DOC][SQL] Remove the description that || and && can be used as logical operators from the document
### What changes were proposed in this pull request?

This PR removes the description that `||` and `&&` can be used as logical operators from the migration guide.

### Why are the changes needed?

In the `Compatibility with Apache Hive` section of the migration guide, it is described that `||` and `&&` can be used as logical operators.
But, in fact, they cannot be used as described.
AFAIK, Hive also doesn't support `&&` and `||` as logical operators.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed that `&&` and `||` cannot be used as logical operators with both Hive's interactive shell and `spark-sql`.
I also built the modified document and confirmed that the modified document doesn't break layout.

Closes #32023 from sarutak/modify-hive-compatibility-doc.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-04-01 17:14:41 -05:00
Angerszhuuuu 2796812cea [SPARK-26399][WEBUI][CORE] Add new stage-level REST APIs and parameters
### What changes were proposed in this pull request?
Add more flexible parameters for the stage endpoint
/applications/{app-id}/stages. It can be:

/applications/{app-id}/stages?details=[true|false]&status=[ACTIVE|COMPLETE|FAILED|PENDING|SKIPPED]&withSummaries=[true|false]&quantiles=[comma separated quantiles string]&taskStatus=[RUNNING|SUCCESS|FAILED|PENDING]

where
```
query parameter details=true is to show the detailed task information within each stage. The default value is details=false;
query parameter status can select those stages with the specified status. When the status parameter is not specified, a list of all stages is generated.
query parameter withSummaries=true is to show both task summary information and executor summary information in percentile distribution. The default value is withSummaries=false.
query parameter quantiles supports user-defined quantiles; the default quantiles are `0.0,0.25,0.5,0.75,1.0`.
query parameter taskStatus is to show only those tasks with the specified status within their corresponding stages. This parameter takes effect only when details=true (i.e. it is ignored when details=false).
```

### Why are the changes needed?
A more flexible REST API.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
UT

Closes #31204 from AngersZhuuuu/SPARK-26399-NEW.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-04-01 12:48:26 -05:00
Max Gekk 5911faa0d4 [SPARK-34903][SQL] Return day-time interval from timestamps subtraction
### What changes were proposed in this pull request?
Modify the `SubtractTimestamps` expression to return values of `DayTimeIntervalType` when `spark.sql.legacy.interval.enabled` is set to `false` (which is the default).
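A rough spark-shell sketch of the new default behavior (the timestamps are chosen for illustration):
```scala
// with spark.sql.legacy.interval.enabled=false (the default), timestamp
// subtraction now yields a DayTimeIntervalType value
spark.sql("SELECT timestamp'2021-04-01 12:00:00' - timestamp'2021-04-01 10:30:00'").show()
// expected: a day-time interval of 1 hour 30 minutes
```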

### Why are the changes needed?
To conform to the ANSI SQL standard which requires ANSI intervals as the result of timestamps subtraction, see
<img width="656" alt="Screenshot 2021-03-29 at 19 09 34" src="https://user-images.githubusercontent.com/1580697/112866455-7e2f0d00-90c2-11eb-96e6-3feb7eea7e09.png">

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
By running new tests:
```
$ build/sbt "test:testOnly *DateTimeUtilsSuite"
$ build/sbt "test:testOnly *DateExpressionsSuite"
$ build/sbt "test:testOnly *ColumnExpressionSuite"
```
and some tests from `SQLQueryTestSuite`:
```
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z timestamp.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z datetime.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z interval.sql"
```

Closes #32016 from MaxGekk/subtract-timestamps-to-intervals.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-01 10:27:58 +03:00
Gengliang Wang 3951e3371a [SPARK-34881][SQL] New SQL Function: TRY_CAST
### What changes were proposed in this pull request?

Add a new SQL function `try_cast`.
`try_cast` is identical to  `AnsiCast` (or `Cast` when `spark.sql.ansi.enabled` is true), except it returns NULL instead of raising an error.
This expression has one major difference from `cast` with `spark.sql.ansi.enabled` as true: when the source value can't be stored in the target integral(Byte/Short/Int/Long) type, `try_cast` returns null instead of returning the low order bytes of the source value.
Note that the result of `try_cast` is not affected by the configuration `spark.sql.ansi.enabled`.

This is learned from Google BigQuery and Snowflake:
https://docs.snowflake.com/en/sql-reference/functions/try_cast.html
https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#safe_casting
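A quick sketch of the difference (spark-shell; the values are chosen for illustration):
```scala
// try_cast returns NULL on failure instead of raising, even with ANSI mode on
spark.sql("SELECT try_cast('abc' AS INT)").show()
// ... and NULL on integral overflow instead of the low-order bytes
spark.sql("SELECT try_cast(2147483648 AS INT)").show()
```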

### Why are the changes needed?

This is useful in the following scenarios:
1. When ANSI mode is on, users can choose `try_cast` as an alternative way to run SQL without errors for certain operations.
2. When ANSI mode is off, users can use `try_cast` to get a more reasonable result when casting a value to an integral type: when an overflow error happens, `try_cast` returns null while `cast` returns the low-order bytes of the source value.

### Does this PR introduce _any_ user-facing change?

Yes, adding a new function `try_cast`

### How was this patch tested?

Unit tests.

Closes #31982 from gengliangwang/tryCast.

Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-03-31 20:47:04 +08:00
Angerszhuuuu 1c3bdabc03 [SPARK-34911][SQL] Fix code not close issue in monitoring.md
### What changes were proposed in this pull request?
Fix code not close issue in monitoring.md

### Why are the changes needed?
Fix doc issue

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #32008 from AngersZhuuuu/SPARK-34911.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-03-31 02:55:08 -05:00
Max Gekk 162f0560e6 [SPARK-34896][SQL] Return day-time interval from dates subtraction
### What changes were proposed in this pull request?
1. Add the SQL config `spark.sql.legacy.interval.enabled` which will control when Spark SQL should use `CalendarIntervalType` instead of ANSI intervals.
2. Modify the `SubtractDates` expression to return values of `DayTimeIntervalType` when `spark.sql.legacy.interval.enabled` is set to `false` (which is the default).
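A rough spark-shell sketch (the dates are chosen for illustration):
```scala
// with spark.sql.legacy.interval.enabled=false (the default), date
// subtraction now yields a DayTimeIntervalType value
spark.sql("SELECT date'2021-03-31' - date'2021-01-01'").show()
// expected: a day-time interval of 89 days
```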

### Why are the changes needed?
To conform to the ANSI SQL standard which requires ANSI intervals as the result of dates subtraction, see
<img width="656" alt="Screenshot 2021-03-29 at 19 09 34" src="https://user-images.githubusercontent.com/1580697/112866455-7e2f0d00-90c2-11eb-96e6-3feb7eea7e09.png">

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
By running new tests:
```
$ build/sbt "test:testOnly *DateExpressionsSuite"
$ build/sbt "test:testOnly *ColumnExpressionSuite"
```
and some tests from `SQLQueryTestSuite`:
```
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z date.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z datetime.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z interval.sql"
```

Closes #31996 from MaxGekk/subtract-dates-to-intervals.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-03-31 08:20:13 +03:00
Erik Krogen 9f065ff375 [SPARK-34828][YARN] Make shuffle service name configurable on client side and allow for classpath-based config override on server side
### What changes were proposed in this pull request?
Add a new config, `spark.shuffle.service.name`, which allows for Spark applications to look for a YARN shuffle service which is defined at a name other than the default `spark_shuffle`.

Add a new config, `spark.yarn.shuffle.service.metrics.namespace`, which allows for configuring the namespace used when emitting metrics from the shuffle service into the NodeManager's `metrics2` system.

Add a new mechanism by which to override shuffle service configurations independently of the configurations in the NodeManager. When a resource `spark-shuffle-site.xml` is present on the classpath of the shuffle service, the configs present within it will be used to override the configs coming from `yarn-site.xml` (via the NodeManager).
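A minimal client-side sketch under these changes (the service name value is hypothetical):
```scala
import org.apache.spark.SparkConf

// point the application at an alternatively-named shuffle service
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.shuffle.service.name", "spark_shuffle_3.2.0")
```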

### Why are the changes needed?
There are two use cases which can benefit from these changes.

One use case is to run multiple instances of the shuffle service side-by-side in the same NodeManager. This can be helpful, for example, when running a YARN cluster with a mixed workload of applications running multiple Spark versions, since a given version of the shuffle service is not always compatible with other versions of Spark (e.g. see SPARK-27780). With this PR, it is possible to run two shuffle services like `spark_shuffle` and `spark_shuffle_3.2.0`, one of which is "legacy" and one of which is for new applications. This is possible because YARN versions since 2.9.0 support the ability to run shuffle services within an isolated classloader (see YARN-4577), meaning multiple Spark versions can coexist.

Besides this, the separation of shuffle service configs into `spark-shuffle-site.xml` can be useful for administrators who want to change and/or deploy Spark shuffle service configurations independently of the configurations for the NodeManager (e.g., perhaps they are owned by two different teams).

### Does this PR introduce _any_ user-facing change?
Yes. There are two new configurations related to the external shuffle service, and a new mechanism which can optionally be used to configure the shuffle service. `docs/running-on-yarn.md` has been updated to provide user instructions; please see this guide for more details.

### How was this patch tested?
In addition to the new unit tests added, I have deployed this to a live YARN cluster and successfully deployed two Spark shuffle services simultaneously, one running a modified version of Spark 2.3.0 (which supports some of the newer shuffle protocols) and one running Spark 3.1.1. Spark applications of both versions are able to communicate with their respective shuffle services without issue.

Closes #31936 from xkrogen/xkrogen-SPARK-34828-shufflecompat-config-from-classpath.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2021-03-30 10:09:00 -05:00
angerszhu a98dc60408 [SPARK-33308][SQL] Refactor current grouping analytics
### What changes were proposed in this pull request?
As discussed in
https://github.com/apache/spark/pull/30145#discussion_r514728642
https://github.com/apache/spark/pull/30145#discussion_r514734648

We need to rewrite the current grouping analytics grammar to be as flexible as PostgreSQL's and to support subsequent development.
PostgreSQL supports:
```
select a, b, c, count(1) from t group by cube (a, b, c);
select a, b, c, count(1) from t group by cube(a, b, c);
select a, b, c, count(1) from t group by cube (a, b, c, (a, b), (a, b, c));
select a, b, c, count(1) from t group by rollup(a, b, c);
select a, b, c, count(1) from t group by rollup (a, b, c);
select a, b, c, count(1) from t group by rollup (a, b, c, (a, b), (a, b, c));
```
In this PR, we have done three things, as below, and we will split them into different PRs:

 - Refactor CUBE/ROLLUP (regarding them as ANTLR tokens in a parser)
 - Refactor GROUPING SETS (the logical node -> a new expr)
 - Support new syntax for CUBE/ROLLUP (e.g., GROUP BY CUBE ((a, b), (a, c)))

### Why are the changes needed?
Rewrite the current grouping analytics grammar to be as flexible as PostgreSQL's and to support subsequent development.

### Does this PR introduce _any_ user-facing change?
Users can write grouping analytics clauses as flexibly as in PostgreSQL.

### How was this patch tested?
Added UT

Closes #30212 from AngersZhuuuu/refact-grouping-analytics.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-30 12:31:58 +00:00
Kent Yao 5692aa0c2c [SPARK-34894][CORE] Use 'io.connectionTimeout' as a hint instead of 'spark.network.timeout' for lost connections
### What changes were proposed in this pull request?

Currently, when a connection for TransportClient is marked as idle and closed, we suggest users adjust `spark.network.timeout` for all transport modules. As a lot of timeout configs fall back to `spark.network.timeout`, this advice can be overkill; we should give a more targeted one with `spark.${moduleName}.io.connectionTimeout`.

### Why are the changes needed?

Better advice for overloaded network traffic cases.

### Does this PR introduce _any_ user-facing change?

Yes, when a connection is zombied and closed by Spark internally, users now get a more targeted config to tune their jobs.

### How was this patch tested?

Just log and doc. Passing Jenkins and GA

Closes #31990 from yaooqinn/SPARK-34894.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
2021-03-30 09:58:24 +08:00
Angerszhuuuu 066c055b52 [SPARK-34092][SQL] Support Stage level restful api filter task details by task status
### What changes were proposed in this pull request?
When we want to get a stage's detail info with task information, the API returns all tasks; the content is huge, while often we just want to see failed or running tasks together with the whole stage info to judge whether a task has a problem. This PR supports using
```
/applications/[app-id]/stages/[stage-id]?details=true&taskStatus=xxx
/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]?details=true&taskStatus=xxx
```
to filter task details by task status

### Why are the changes needed?
A more flexible REST API.

### Does this PR introduce _any_ user-facing change?
Users can use
```
/applications/[app-id]/stages/[stage-id]?details=true&taskStatus=xxx
/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]?details=true&taskStatus=xxx
```
to filter task details by task status

### How was this patch tested?
Added

Closes #31165 from AngersZhuuuu/SPARK-34092.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-03-27 16:26:07 -05:00
Josh Soref d58587b60d [SPARK-33717][LAUNCHER] deprecate spark.launcher.childConectionTimeout
### What changes were proposed in this pull request?
Deprecating `spark.launcher.childConectionTimeout` in favor of `spark.launcher.childConnectionTimeout`

### Why are the changes needed?
srowen suggested it https://github.com/apache/spark/pull/30323#discussion_r521449342

### How was this patch tested?
No testing. Not even compiled

Closes #30679 from jsoref/spelling-connection.

Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-03-26 15:53:52 -05:00
Gengliang Wang 0515f49018 [SPARK-34856][SQL] ANSI mode: Allow casting complex types as string type
### What changes were proposed in this pull request?

Allow casting complex types as string type in ANSI mode.

### Why are the changes needed?

Currently, complex types are not allowed to be cast as string type. This breaks the `DataFrame.show()` API. E.g.
```
scala> sql("select array(1, 2, 2)").show(false)
org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`array(1, 2, 2)` AS STRING)' due to data type mismatch:
 cannot cast array<int> to string with ANSI mode on.
```
We should allow the conversion as an extension of the ANSI SQL standard, so that `DataFrame.show()` still works in ANSI mode.

### Does this PR introduce _any_ user-facing change?

Yes, casting complex types as string type is now allowed in ANSI mode.

### How was this patch tested?

Unit tests.

Closes #31954 from gengliangwang/fixExplicitCast.

Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-03-26 00:17:43 +08:00
Angerszhuuuu 8ed5808f64 [SPARK-34488][CORE] Support task Metrics Distributions and executor Metrics Distributions in the REST API call for a specified stage
### What changes were proposed in this pull request?
For a specific stage, it is useful to show the task metrics in percentile distribution.  This information can help users know whether or not there is a skew/bottleneck among tasks in a given stage.  We list an example in taskMetricsDistributions.json

Similarly, it is useful to show the executor metrics in percentile distribution for a specific stage. This information can show whether or not there is a skewed load on some executors.  We list an example in executorMetricsDistributions.json

We define the `withSummaries` and `quantiles` query parameters in the REST API for a specific stage as:

applications/<application_id>/<application_attempt>/stages/<stage_id>/<stage_attempt>?withSummaries=[true|false]&quantiles=0.05,0.25,0.5,0.75,0.95

1. withSummaries: the default is false; it defines whether to show the current stage's taskMetricsDistribution and executorMetricsDistribution.
2. quantiles: the default is `0.0,0.25,0.5,0.75,1.0`; it takes effect only when `withSummaries=true` and defines the quantiles we use when calculating metrics distributions.

When withSummaries=true, both task metrics in percentile distribution and executor metrics in percentile distribution are included in the REST API output. The default value of withSummaries is false, i.e. no metrics percentile distribution will be included in the REST API output.


### Why are the changes needed?
For a specific stage, it is useful to show the task metrics in percentile distribution.  This information can help users know whether or not there is a skew/bottleneck among tasks in a given stage.  We list an example in taskMetricsDistributions.json

### Does this PR introduce _any_ user-facing change?
Users can use the REST API below to get the task metrics distribution and executor metrics distribution for an individual stage:
```
applications/<application_id>/<application_attempt>/stages/<stage_id>/<stage_attempt>?withSummaries=[true|false]
```

### How was this patch tested?
Added UT

Closes #31611 from AngersZhuuuu/SPARK-34488.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-03-24 08:50:45 -05:00
Liang-Chi Hsieh 95c61df0fa [SPARK-34295][CORE] Exclude filesystems from token renewal at YARN
### What changes were proposed in this pull request?

This patch adds a config `spark.yarn.kerberos.renewal.excludeHadoopFileSystems` which lists the filesystems to be excluded from delegation token renewal at YARN.
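A minimal sketch of the new config (the filesystem URI is hypothetical):
```scala
import org.apache.spark.SparkConf

// skip delegation token renewal for one remote HDFS cluster
val conf = new SparkConf()
  .set("spark.yarn.kerberos.renewal.excludeHadoopFileSystems",
    "hdfs://remote-nn1.example.com:8020")
```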

### Why are the changes needed?

MapReduce jobs can instruct YARN to skip renewal of tokens obtained from certain hosts by specifying the hosts with configuration mapreduce.job.hdfs-servers.token-renewal.exclude=<host1>,<host2>,..,<hostN>.

But Spark seems to lack a similar option, so job submission fails if YARN fails to renew the DelegationToken for any of the remote HDFS clusters. The failure in DT renewal can happen for many reasons, e.g. the remote HDFS does not trust the Kerberos identity of YARN. We have a customer facing such an issue.

### Does this PR introduce _any_ user-facing change?

No, if the config is not set. Yes, as users can use this config to instruct YARN not to renew delegation token from certain filesystems.

### How was this patch tested?

It is hard to do unit test for this. We did verify it work from the customer using this fix in the production environment.

Closes #31761 from viirya/SPARK-34295.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-03-24 01:11:53 -07:00
robert4os 06d40696dc [MINOR][DOCS] Update sql-ref-syntax-dml-insert-into.md
### What changes were proposed in this pull request?

The given example uses a non-standard syntax for CREATE TABLE, defining the partitioning column together with the other columns instead of in PARTITIONED BY.

This works in this case because the partitioning column happens to be the last column defined, but it would break if 'name' were used for partitioning instead.

I therefore suggest changing the example to use a standard syntax, like in
https://spark.apache.org/docs/3.1.1/sql-ref-syntax-ddl-create-table-hiveformat.html
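Roughly, the standard form would look like this (a sketch assuming Hive-format DDL as on the linked page; the table and columns are illustrative):
```scala
// the partitioning column lives in PARTITIONED BY, not in the column list
spark.sql("""
  CREATE TABLE students (id INT)
  PARTITIONED BY (name STRING)
""")
```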

### Why are the changes needed?

To improve the documentation.

### Does this PR introduce _any_ user-facing change?

Yes, this fixes the user-facing docs.

### How was this patch tested?

CI should test it out.

Closes #31900 from robert4os/patch-1.

Authored-by: robert4os <robert4os@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-24 00:13:06 +09:00
Lena d32bb4e5ee [MINOR][DOCS] Updating the link for Azure Data Lake Gen 2 in docs
### What changes were proposed in this pull request?

Current link for `Azure Blob Storage and Azure Datalake Gen 2` leads to AWS information. Replacing the link to point to the right page.

### Why are the changes needed?

For users to access the correct link.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the link correctly.

### How was this patch tested?

N/A

Closes #31938 from lenadroid/patch-1.

Authored-by: Lena <alehall@microsoft.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-03-23 10:13:32 +03:00
Ismaël Mejía 8a552bfc76 [SPARK-34778][BUILD] Upgrade to Avro 1.10.2
### What changes were proposed in this pull request?
Update the Avro version to 1.10.2.

### Why are the changes needed?
To stay up to date with upstream and catch compatibility issues with zstd

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests

Closes #31866 from iemejia/SPARK-27733-upgrade-avro-1.10.2.

Authored-by: Ismaël Mejía <iemejia@gmail.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-03-22 19:30:14 +08:00
Dongjoon Hyun 3bc6fe4e77 [SPARK-34809][CORE] Enable spark.hadoopRDD.ignoreEmptySplits by default
### What changes were proposed in this pull request?

This PR aims to enable `spark.hadoopRDD.ignoreEmptySplits` by default for Apache Spark 3.2.0.

### Why are the changes needed?

Although this is a safe improvement, this hasn't been enabled by default to avoid the explicit behavior change. This PR aims to switch the default explicitly in Apache Spark 3.2.0.

### Does this PR introduce _any_ user-facing change?

Yes, the behavior change is documented.

### How was this patch tested?

Pass the existing CIs.

Closes #31909 from dongjoon-hyun/SPARK-34809.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-21 14:34:02 -07:00
Sean Owen ed641fbad6 [MINOR][DOCS][ML] Doc 'mode' as a supported Imputer strategy in Pyspark
### What changes were proposed in this pull request?

Document `mode` as a supported Imputer strategy in Pyspark docs.
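For reference, a rough Scala sketch of the strategy the docs now mention (the column names are hypothetical; the Python API mirrors this):
```scala
import org.apache.spark.ml.feature.Imputer

// impute missing values with the most frequent value per column
val imputer = new Imputer()
  .setStrategy("mode")
  .setInputCols(Array("age", "income"))
  .setOutputCols(Array("age_imputed", "income_imputed"))
```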

### Why are the changes needed?

Support was added in 3.1, and documented in Scala, but some Python docs were missed.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #31883 from srowen/ImputerModeDocs.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-03-20 01:16:49 -05:00
Dongjoon Hyun 2fa792aa64 [SPARK-34783][K8S] Support remote template files
### What changes were proposed in this pull request?

This PR aims to support remote driver/executor template files.

### Why are the changes needed?

Currently, `KubernetesUtils.loadPodFromTemplate` supports only local files.

With this PR, we can do the following.
```bash
bin/spark-submit \
...
-c spark.kubernetes.driver.podTemplateFile=s3a://dongjoon/driver.yml \
-c spark.kubernetes.executor.podTemplateFile=s3a://dongjoon/executor.yml \
...
```

### Does this PR introduce _any_ user-facing change?

Yes, this is an improvement.

### How was this patch tested?

Manual testing.

Closes #31877 from dongjoon-hyun/SPARK-34783-2.

Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-19 08:52:42 -07:00
Gengliang Wang 143303147b [SPARK-34742][SQL] ANSI mode: Abs throws exception if input is out of range
### What changes were proposed in this pull request?

For the following cases, ABS should throw exceptions since the results are out of the range of the result data types in ANSI mode.
```
SELECT abs(${Int.MinValue});
SELECT abs(${Long.MinValue});
```
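Concretely (a spark-shell sketch; -2147483648 is `Int.MinValue`):
```scala
// abs(Int.MinValue) does not fit back into an Int, so ANSI mode now raises
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT abs(CAST(-2147483648 AS INT))").show()  // arithmetic overflow error
```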
### Why are the changes needed?

Better ANSI compliance

### Does this PR introduce _any_ user-facing change?

Yes, Abs throws an exception if input is out of range in ANSI mode

### How was this patch tested?

Unit test

Closes #31836 from gengliangwang/ansiAbs.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-17 06:57:25 +00:00
Kousuke Saruta 2fd85174e9 [SPARK-34603][SQL] Support ADD ARCHIVE and LIST ARCHIVES command
### What changes were proposed in this pull request?

This PR adds `ADD ARCHIVE` and `LIST ARCHIVES` commands to SQL and updates relevant documents.
SPARK-33530 added `addArchive` and `listArchives` to `SparkContext` but it's not supported yet to add/list archives with SQL.
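A quick sketch of the new commands (the archive path is hypothetical):
```scala
// distribute an archive to executors and list what has been added
spark.sql("ADD ARCHIVE /tmp/dependencies.zip")
spark.sql("LIST ARCHIVES").show(false)
```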

### Why are the changes needed?

To complement features.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added new test and confirmed the generated HTML from the updated documents.

Closes #31721 from sarutak/sql-archive.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-09 21:28:35 +09:00
Amandeep Sharma a9c11896a5 [SPARK-34649][SQL][DOCS] org.apache.spark.sql.DataFrameNaFunctions.replace() fails for column name having a dot
### What changes were proposed in this pull request?

Use resolved attributes instead of data-frame fields for replacing values.

### Why are the changes needed?

`dataframe.na.replace()` does not work for a column having a dot in its name, as sketched below.
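A minimal sketch of the pattern (spark-shell; the data and the backtick quoting are illustrative):
```scala
import spark.implicits._

// a column whose name contains a dot
val df = Seq(("abc", 23)).toDF("name.first", "age")
// with the fix, the dotted column resolves and the value is replaced
val replaced = df.na.replace("`name.first`", Map("abc" -> "Abc"))
```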

### Does this PR introduce _any_ user-facing change?

None

### How was this patch tested?

Added unit tests for the same

Closes #31769 from amandeep-sharma/master.

Authored-by: Amandeep Sharma <happyaman91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-03-09 11:47:01 +00:00
Gengliang Wang ee756fd695 [SPARK-34665][SQL][DOCS] Revise the type coercion section of ANSI Compliance
### What changes were proposed in this pull request?

1. Fix the table of valid type coercion combinations. Binary type should be allowed casting to String type and disallowed casting to Numeric types.
2. Summary all the `CAST`s that can cause runtime exceptions.

### Why are the changes needed?

Fix a mistake in the docs.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Run `jekyll serve` and preview:

![image](https://user-images.githubusercontent.com/1097932/110334374-8fab5a80-7fd7-11eb-86e7-c519cfa41b99.png)

Closes #31781 from gengliangwang/reviseAnsiDoc2.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-03-09 13:19:14 +09:00
nickhliu 75db6e7d9e [MINOR][SQL][DOCS] Fix some spelling issues in SQL migration guide
### What changes were proposed in this pull request?

1. Add a space between words.
2. Unify the initials' case.

### Why are the changes needed?

Correct spelling issues for a better user experience.

### Does this PR introduce _any_ user-facing change?

yes.

### How was this patch tested?

manually

Closes #31748 from hopefulnick/doc_rectify.

Authored-by: nickhliu <nickhliu@tencent.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-04 22:37:17 -08:00
Gengliang Wang 2b1c170016 [SPARK-34614][SQL] ANSI mode: Casting String to Boolean should throw exception on parse error
### What changes were proposed in this pull request?

In ANSI mode, casting String to Boolean should throw an exception on parse error, instead of returning null
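Roughly (a spark-shell sketch; the input string is arbitrary):
```scala
// with ANSI mode on, an unparseable boolean string now raises
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT CAST('maybe' AS BOOLEAN)").show()  // parse error instead of NULL
```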

### Why are the changes needed?

For better ANSI compliance

### Does this PR introduce _any_ user-facing change?

Yes, in ANSI mode there will be an exception on parse failure of casting String value to Boolean type.

### How was this patch tested?

Unit tests.

Closes #31734 from gengliangwang/ansiCastToBoolean.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2021-03-04 19:04:16 +08:00
angerszhu 56edb8156f [SPARK-33474][SQL] Support TypeConstructed partition spec value
### What changes were proposed in this pull request?
Hive supports type-constructed values as partition spec values; Spark should support them too.

### Why are the changes needed?
Support type-constructed partition spec values, keeping parity with Hive.

### Does this PR introduce _any_ user-facing change?
Yes, users can use type-constructed values as partition spec values, such as:
```
CREATE TABLE t1(name STRING) PARTITIONED BY (part DATE)
INSERT INTO t1 PARTITION(part = date'2019-01-02') VALUES('a')

CREATE TABLE t2(name STRING) PARTITIONED BY (part TIMESTAMP)
INSERT INTO t2 PARTITION(part = timestamp'2019-01-02 11:11:11') VALUES('a')

CREATE TABLE t4(name STRING) PARTITIONED BY (part BINARY)
INSERT INTO t4 PARTITION(part = X'537061726B2053514C') VALUES('a')
```

### How was this patch tested?
Added UT

Closes #30421 from AngersZhuuuu/SPARK-33474.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-03-03 16:48:50 +09:00
Kent Yao 499f620037 [MINOR][SQL][DOCS] Fix some wrong default values in SQL tuning guide's AQE section
### What changes were proposed in this pull request?

- `spark.sql.adaptive.coalescePartitions.initialPartitionNum`: 200 -> (none)
- `spark.sql.adaptive.skewJoin.skewedPartitionFactor`: 10 -> 5

### Why are the changes needed?

The wrong defaults in the doc misguide people.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Passing doc build.

Closes #31717 from yaooqinn/minordoc0.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-03 15:00:09 +09:00
Dongjoon Hyun 499cc79344 [SPARK-34503][DOCS][FOLLOWUP] Document available codecs for event log compression
### What changes were proposed in this pull request?

This PR is a follow-up of https://github.com/apache/spark/pull/31618 to document the available codecs for event log compression.

### Why are the changes needed?

Documentation.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual.

Closes #31695 from dongjoon-hyun/SPARK-34503-DOC.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-01 15:42:10 -08:00
Yikun Jiang 85b50d4258 [SPARK-34539][BUILD][INFRA] Remove stand-alone version Zinc server
### What changes were proposed in this pull request?
Clean up all Zinc standalone server code and related configuration.

### Why are the changes needed?
![image](https://user-images.githubusercontent.com/1736354/109154790-c1d3e580-77a9-11eb-8cde-835deed6e10e.png)
- Zinc is the incremental compiler used to speed up compilation.
- The scala-maven-plugin is the Maven plugin used by Spark; one of its functions is to integrate Zinc to enable incremental compilation.
- Since Spark v3.0.0 ([SPARK-28759](https://issues.apache.org/jira/browse/SPARK-28759)), the scala-maven-plugin has been upgraded to v4.X, which means the Zinc v0.3.13 standalone server is no longer needed.

However, we still download, install, and start the standalone Zinc server. We should remove all Zinc standalone server code and all related configuration.

See more in [SPARK-34539](https://issues.apache.org/jira/projects/SPARK/issues/SPARK-34539) or the doc [Zinc standalone server is useless after scala-maven-plugin 4.x](https://docs.google.com/document/d/1u4kCHDx7KjVlHGerfmbcKSB0cZo6AD4cBdHSse-SBsM).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Run any mvn build:
./build/mvn -DskipTests clean package -pl core
You can see that incremental compilation is still working; the "scala-maven-plugin:4.3.0:compile (scala-compile-first)" stage prints incremental compilation info, like:
```
[INFO] --- scala-maven-plugin:4.3.0:testCompile (scala-test-compile-first)  spark-core_2.12 ---
[INFO] Using incremental compilation using Mixed compile order
[INFO] Compiler bridge file: /root/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.12-1.3.1-bin_2.12.10__52.0-1.3.1_20191012T045515.jar
[INFO] compiler plugin: BasicArtifact(com.github.ghik,silencer-plugin_2.12.10,1.6.0,null)
[INFO] Compiling 303 Scala sources and 27 Java sources to /root/spark/core/target/scala-2.12/test-classes ...
```

Closes #31647 from Yikun/cleanup-zinc.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-03-01 08:39:38 -06:00
Shardul Mahadik 0216051aca [SPARK-34506][CORE] ADD JAR with ivy coordinates should be compatible with Hive transitive behavior
### What changes were proposed in this pull request?
SPARK-33084 added the ability to use ivy coordinates with `SparkContext.addJar`. PR #29966 claims to mimic Hive behavior although I found a few cases where it doesn't

1) The default value of the transitive parameter is false, both when the parameter is not specified in the coordinate and when the parameter value is invalid. The Hive behavior is that transitive is [true if not specified](cb2ac3dcc6/ql/src/java/org/apache/hadoop/hive/ql/util/DependencyResolver.java (L169)) in the coordinate and [false for invalid values](cb2ac3dcc6/ql/src/java/org/apache/hadoop/hive/ql/util/DependencyResolver.java (L124)). Also, regardless of Hive, I think a default of true for the transitive parameter also matches [ivy's own defaults](https://ant.apache.org/ivy/history/2.5.0/ivyfile/dependency.html#_attributes).

2) The parameter value for the transitive parameter is regarded as case-sensitive [based on the understanding](https://github.com/apache/spark/pull/29966#discussion_r547752259) that Hive behavior is case-sensitive. However, this is not correct: Hive [treats the parameter value case-insensitively](cb2ac3dcc6/ql/src/java/org/apache/hadoop/hive/ql/util/DependencyResolver.java (L122)).

I propose that we be compatible with Hive for these behaviors

### Why are the changes needed?
To make `ADD JAR` with ivy coordinates compatible with Hive's transitive behavior

### Does this PR introduce _any_ user-facing change?

The user-facing changes here are within master as the feature introduced in SPARK-33084 has not been released yet
1. Previously an ivy coordinate without `transitive` parameter specified did not resolve transitive dependency, now it does.
2. Previously a `transitive` parameter value was treated case-sensitively, e.g. `transitive=TRUE` would be treated as false since it did not exactly match `true`. Now it is treated case-insensitively, as sketched below.
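A rough sketch of the resulting behavior (the `ivy://` scheme comes from SPARK-33084; the coordinate is arbitrary):
```scala
// transitive dependencies are now resolved by default ...
spark.sql("ADD JAR ivy://org.apache.logging.log4j:log4j-core:2.14.0")
// ... and the parameter value is case-insensitive, so this disables transitivity
spark.sql("ADD JAR ivy://org.apache.logging.log4j:log4j-core:2.14.0?transitive=FALSE")
```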

### How was this patch tested?

Modified existing unit tests to test new behavior
Add new unit test to cover usage of `exclude` with unspecified `transitive`

Closes #31623 from shardulm94/spark-34506.

Authored-by: Shardul Mahadik <smahadik@linkedin.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-03-01 09:10:20 +09:00
Yuming Wang d07fc3076b [SPARK-33687][SQL] Support analyze all tables in a specific database
### What changes were proposed in this pull request?

This PR adds support for analyzing all tables in a specific database:
```g4
 ANALYZE TABLES ((FROM | IN) multipartIdentifier)? COMPUTE STATISTICS (identifier)?
```
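For example (a spark-shell sketch; the database name is hypothetical):
```scala
// compute statistics for every table in the database without scanning data
spark.sql("ANALYZE TABLES IN mydb COMPUTE STATISTICS NOSCAN")
```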

### Why are the changes needed?

1. Make it easy to analyze all tables in a specific database.
2. PostgreSQL has a similar implementation: https://www.postgresql.org/docs/12/sql-analyze.html.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The feature tested by unit test.
The documentation tested by regenerating the documentation:

menu-sql.yaml |  sql-ref-syntax-aux-analyze-tables.md
-- | --
![image](https://user-images.githubusercontent.com/5399861/109098769-dc33a200-775c-11eb-86b1-55531e5425e0.png) | ![image](https://user-images.githubusercontent.com/5399861/109098841-02594200-775d-11eb-8588-de8da97ec94a.png)

Closes #30648 from wangyum/SPARK-33687.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-03-01 09:06:47 +09:00
Phillip Henry 397b843890 [SPARK-34415][ML] Randomization in hyperparameter optimization
### What changes were proposed in this pull request?

Code in the PR generates random parameters for hyperparameter tuning. A discussion with Sean Owen can be found on the dev mailing list here:

http://apache-spark-developers-list.1001551.n3.nabble.com/Hyperparameter-Optimization-via-Randomization-td30629.html

All code is entirely my own work and I license the work to the project under the project’s open source license.

### Why are the changes needed?

Randomization can be a more effective technique than a grid search since minima/maxima can fall between grid points and never be found. Randomization is not so restricted, although the probability of finding minima/maxima depends on the number of attempts.

Alice Zheng has an accessible description on how this technique works at https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html

Although there are Python libraries with more sophisticated techniques, not every Spark developer is using Python.

### Does this PR introduce _any_ user-facing change?

A new class (`ParamRandomBuilder.scala`) and its tests have been created, but there is no change to existing code. This class offers an alternative to `ParamGridBuilder` and can be dropped into the code wherever `ParamGridBuilder` appears. Indeed, it extends `ParamGridBuilder` and is completely compatible with its interface. It merely adds one method that provides a range over which a hyperparameter will be randomly defined.
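A rough sketch of how it might be used (the `addRandom` call below is an assumed signature for illustration, not the verbatim API):
```scala
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.ParamRandomBuilder

val lr = new LinearRegression()
// drop-in replacement for ParamGridBuilder: grid methods still work, plus a
// method to sample a parameter from a range (hypothetical: 5 draws from [0.01, 0.1])
val params = new ParamRandomBuilder()
  .addGrid(lr.fitIntercept)
  .addRandom(lr.regParam, 0.01, 0.1, 5)
  .build()
```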

### How was this patch tested?

Tests `ParamRandomBuilderSuite.scala` and `RandomRangesSuite.scala` were added.

`ParamRandomBuilderSuite` is the analogue of the already existing `ParamGridBuilderSuite` which tests the user-facing interface.

`RandomRangesSuite` uses ScalaCheck to test the random ranges over which hyperparameters are distributed.

Closes #31535 from PhillHenry/ParamRandomBuilder.

Authored-by: Phillip Henry <PhillHenry@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-02-27 08:34:39 -06:00
HyukjinKwon 22383e312d [SPARK-34531][CORE] Remove Experimental API tag in PrometheusServlet
### What changes were proposed in this pull request?

The endpoints of Prometheus metrics are properly marked and documented as experimental (SPARK-31674). The class `PrometheusServlet` itself is not part of an API, so this PR proposes to remove the Experimental tag from it.

### Why are the changes needed?

To avoid marking a non-API as an API.

### Does this PR introduce _any_ user-facing change?

No, the class is already `private[spark]`.

### How was this patch tested?

Existing tests should cover.

Closes #31640 from HyukjinKwon/SPARK-34531.

Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-24 18:11:25 -08:00
Wenchen Fan 87409c42bc [SPARK-31891][SQL][DOCS][FOLLOWUP] Fix typo in the description of MSCK REPAIR TABLE
### What changes were proposed in this pull request?
Fix typo and highlight that `ADD PARTITIONS` is the default.

### Why are the changes needed?
Fix a typo which can mislead users.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
n/a

Closes #31633 from MaxGekk/repair-table-drop-partitions-followup.

Lead-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Co-authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-24 21:13:58 +09:00
Dongjoon Hyun a6dcd5544d [MINOR][DOCS][K8S] Use hadoop-aws 3.2.2 in K8s example
### What changes were proposed in this pull request?

This PR aims to update `Hadoop` dependency in K8S doc example.

### Why are the changes needed?

Apache Spark 3.2.0 is using Apache Hadoop 3.2.2 by default.

### Does this PR introduce _any_ user-facing change?

No. This is a doc-only change.

### How was this patch tested?

N/A

Closes #31628 from dongjoon-hyun/minor-doc.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-24 11:34:29 +09:00
Dongjoon Hyun 2e31e2c5f3 [SPARK-34503][CORE] Use zstd for spark.eventLog.compression.codec by default
### What changes were proposed in this pull request?

Apache Spark 3.0 introduced `spark.eventLog.compression.codec` configuration.
For Apache Spark 3.2, this PR aims to set `zstd` as the default value for `spark.eventLog.compression.codec` configuration.
This only affects creating a new log file.

### Why are the changes needed?

The main purpose of event logs is archiving. Many logs are generated and occupy the storage, but most of them are never accessed by users.

**1. Save storage resources (and money)**

In general, ZSTD output is much smaller than LZ4 output.
For example, in case of TPCDS (Scale 200) log, ZSTD generates about 3 times smaller log files than LZ4.

| CODEC | SIZE (bytes) |
|---------|-------------|
| LZ4         | 184001434|
| ZSTD      |  64522396|

And, the plain file is 17.6 times bigger.
```
-rw-r--r--    1 dongjoon  staff  1135464691 Feb 21 22:31 spark-a1843ead29834f46b1125a03eca32679
-rw-r--r--    1 dongjoon  staff    64522396 Feb 21 22:31 spark-a1843ead29834f46b1125a03eca32679.zstd
```

**2. Better Usability**

We cannot decompress Spark-generated LZ4 event log files via CLI while we can for ZSTD event log files. Spark's LZ4 event log files are inconvenient to some users who want to uncompress and access them.
```
$ lz4 -d spark-d3deba027bd34435ba849e14fc2c42ef.lz4
Decoding file spark-d3deba027bd34435ba849e14fc2c42ef
Error 44 : Unrecognized header : file cannot be decoded
```
```
$ zstd -d spark-a1843ead29834f46b1125a03eca32679.zstd
spark-a1843ead29834f46b1125a03eca32679.zstd: 1135464691 bytes
```

**3. Speed**
The following results are collected by running [lzbench](https://github.com/inikep/lzbench) on the above Spark event log. Note that
- This is not a direct comparison of Spark's compression/decompression codecs.
- `lzbench` is an in-memory benchmark. So, it doesn't show the benefit of the reduced network traffic due to the small size of ZSTD.

Here,
- To get ZSTD 1.4.8-1 result, `lzbench` `master` branch is used because Spark is using ZSTD 1.4.8.
- To get LZ4 1.7.5 result, `lzbench` `v1.7` branch is used because Spark is using LZ4 1.7.1.
```
Compressor name      Compress. Decompress. Compr. size  Ratio Filename
memcpy               7393 MB/s  7166 MB/s  1135464691 100.00 spark-a1843ead29834f46b1125a03eca32679
zstd 1.4.8 -1        1344 MB/s  3351 MB/s    56665767   4.99 spark-a1843ead29834f46b1125a03eca32679
lz4 1.7.5            1385 MB/s  4782 MB/s   127662168  11.24 spark-a1843ead29834f46b1125a03eca32679
```

### Does this PR introduce _any_ user-facing change?

- No for the apps which doesn't use `spark.eventLog.compress` because `spark.eventLog.compress` is disabled by default.
- No for the apps using `spark.eventLog.compression.codec` explicitly because this is a change of the default value.
- Yes for the apps using `spark.eventLog.compress` without setting `spark.eventLog.compression.codec`. In this case, previously `spark.io.compression.codec` value was used whose default is `lz4`.

So this JIRA issue, SPARK-34503, is labeled with `releasenotes`.

### How was this patch tested?

Pass the updated UT.

Closes #31618 from dongjoon-hyun/SPARK-34503.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-23 16:37:29 -08:00
Max Gekk 7f27d33a3c [SPARK-31891][SQL] Support MSCK REPAIR TABLE .. [{ADD|DROP|SYNC} PARTITIONS]
### What changes were proposed in this pull request?

In the PR, I propose to extend the `MSCK REPAIR TABLE` command, and support new options `{ADD|DROP|SYNC} PARTITIONS`. In particular:

1. Extend the logical node `RepairTable`, and add two new flags `enableAddPartitions` and `enableDropPartitions`.
2. Add similar flags to the v1 execution node `AlterTableRecoverPartitionsCommand`
3. Add new method `dropPartitions()` to `AlterTableRecoverPartitionsCommand` which drops partitions from the catalog if their locations in the file system don't exist.
4. Updated public docs about the `MSCK REPAIR TABLE` command:
<img width="1037" alt="Screenshot 2021-02-16 at 13 46 39" src="https://user-images.githubusercontent.com/1580697/108052607-7446d280-705d-11eb-8e25-7398254787a4.png">

Closes #31097

### Why are the changes needed?
- The changes allow recovering tables with removed partitions. The example below portrays the problem:
```sql
spark-sql> create table tbl2 (col int, part int) partitioned by (part);
spark-sql> insert into tbl2 partition (part=1) select 1;
spark-sql> insert into tbl2 partition (part=0) select 0;
spark-sql> show table extended like 'tbl2' partition (part = 0);
default	tbl2	false	Partition Values: [part=0]
Location: file:/Users/maximgekk/proj/apache-spark/spark-warehouse/tbl2/part=0
...
```
Remove the partition (part = 0) from the filesystem:
```
$ rm -rf /Users/maximgekk/proj/apache-spark/spark-warehouse/tbl2/part=0
```
Even after recovering, we cannot query the table:
```sql
spark-sql> msck repair table tbl2;
spark-sql> select * from tbl2;
21/01/08 22:49:13 ERROR SparkSQLDriver: Failed in [select * from tbl2]
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/maximgekk/proj/apache-spark/spark-warehouse/tbl2/part=0
```

- To have feature parity with Hive: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, we can query recovered table:
```sql
spark-sql> msck repair table tbl2 sync partitions;
spark-sql> select * from tbl2;
1	1
spark-sql> show partitions tbl2;
part=1
```

### How was this patch tested?
- By running the modified test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *MsckRepairTableParserSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *PlanResolutionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRecoverPartitionsSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRecoverPartitionsParallelSuite"
```
- Added unified v1 and v2 tests for `MSCK REPAIR TABLE`:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *MsckRepairTableSuite"
```

Closes #31499 from MaxGekk/repair-table-drop-partitions.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-23 13:45:15 -08:00
Kousuke Saruta 612d52315b [SPARK-34500][DOCS][EXAMPLES] Replace symbol literals with $"" in examples and documents
### What changes were proposed in this pull request?

This PR replaces all the occurrences of symbol literals (`'name`) with string interpolation (`$"name"`) in examples and documents.

### Why are the changes needed?

Symbol literals are used to represent columns in Spark SQL, but the Scala community seems to be removing `Symbol` completely.
As we discussed in #31569, we should first replace symbol literals with `$"name"` in user-facing examples and documents.
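The replacement pattern, roughly (a spark-shell sketch; the data is illustrative):
```scala
import spark.implicits._  // provides the $"..." interpolator

val df = spark.range(1).selectExpr("'Alice' AS name")
df.select('name)    // symbol literal, being phased out
df.select($"name")  // string interpolation, the preferred form
```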

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Build docs.

Closes #31615 from sarutak/replace-symbol-literals-in-doc-and-examples.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-23 11:22:02 +09:00
Karl-WangSK a6a82c8e69 [MINOR][DOCS] Add table_identifier in sql-migration-guide for SHOW CREATE TABLE
### What changes were proposed in this pull request?
Add `table_identifier` in sql-migration-guide for SHOW CREATE TABLE.

### Why are the changes needed?
To make the document more readable.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing test suites.

Closes #31608 from Karl-WangSK/sqldoc.

Lead-authored-by: Karl-WangSK <shikai.wang@linkflowtech.com>
Co-authored-by: ShiKai Wang <wskqing@gmail.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-02-22 20:15:19 +08:00
Max Gekk 6ea4b5fda7 [SPARK-34401][SQL][DOCS] Update docs about altering cached tables/views
### What changes were proposed in this pull request?
Update public docs of SQL commands about altering cached tables/views. For instance:
<img width="869" alt="Screenshot 2021-02-08 at 15 11 48" src="https://user-images.githubusercontent.com/1580697/107217940-fd3b8980-6a1f-11eb-98b9-9b2e3fe7f4ef.png">

### Why are the changes needed?
To inform users about commands behavior in altering cached tables or views.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the command below and manually checking the docs:
```
$ SKIP_API=1 SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 jekyll serve --watch
```

Closes #31524 from MaxGekk/doc-cmd-caching.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-22 04:32:09 +00:00
Kousuke Saruta 82b33a3041 [SPARK-34379][SQL] Map JDBC RowID to StringType rather than LongType
### What changes were proposed in this pull request?

This PR fixes an issue where `java.sql.RowId` is mapped to `LongType`, and prefers `StringType` instead.

In the current implementation, the JDBC RowID type is mapped to `LongType` except for `OracleDialect`, but there is no guarantee that a RowID can be converted to a long.
`java.sql.RowId` declares `toString`, and the specification of `java.sql.RowId` says:

> _all methods on the RowId interface must be fully implemented if the JDBC driver supports the data type_
(https://docs.oracle.com/javase/8/docs/api/java/sql/RowId.html)

So, we should prefer StringType to LongType.

### Why are the changes needed?

This seems to be a potential bug.

### Does this PR introduce _any_ user-facing change?

Yes. RowID is mapped to StringType rather than LongType.

### How was this patch tested?

A new test was added, and the existing test case `SPARK-32992: map Oracle's ROWID type to StringType` in `OracleIntegrationSuite` passes.

Closes #31491 from sarutak/rowid-type.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-02-20 23:45:56 +09:00
Bo Zhang 489d32aa9b [SPARK-34471][SS][DOCS] Document Streaming Table APIs in Structured Streaming Programming Guide
### What changes were proposed in this pull request?

This change documents the newly added streaming table APIs in the Structured Streaming Programming Guide.
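
For reference, a minimal sketch of the two APIs the new section covers (the table names and checkpoint path are illustrative):

```scala
// Read an existing table as a stream (DataStreamReader.table, Spark 3.1+).
val stream = spark.readStream.table("source_table")

// Write the stream back out as a table (DataStreamWriter.toTable, Spark 3.1+).
val query = stream.writeStream
  .option("checkpointLocation", "/tmp/checkpoints/source_copy")
  .toTable("target_table")
```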

### Why are the changes needed?

This will help our users when they try to use the new APIs.

### Does this PR introduce _any_ user-facing change?
Yes. Users will see the changes in the programming guide.

### How was this patch tested?
Built the HTML page and verified.

Attached is a screenshot of the section added:
![Table APIs Section - Scala](https://user-images.githubusercontent.com/44179472/108581923-1ff86700-736b-11eb-8fcd-efa04ac936de.png)

Closes #31590 from bozhang2820/table-api-doc.

Lead-authored-by: Bo Zhang <bo.zhang@databricks.com>
Co-authored-by: Bo Zhang <bozhang2820@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2021-02-20 15:54:43 +09:00
Max Gekk 4a9a1d42e7 [SPARK-34466][SQL][DOCS] Improve docs for ALTER TABLE .. RENAME TO
### What changes were proposed in this pull request?
Explicitly highlight that the table rename command cannot move a table between databases.

### Why are the changes needed?
To inform users about the actual behavior of the table rename command.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
```sql
spark-sql> CREATE DATABASE db1;
spark-sql> CREATE DATABASE db2;
spark-sql> CREATE TABLE db1.tbl1 (c0 INT);
spark-sql> ALTER TABLE db1.tbl1 RENAME TO db2.tbl1;
Error in query: RENAME TABLE source and destination databases do not match: 'db1' != 'db2';
spark-sql> ALTER TABLE db1.tbl1 RENAME TO db1.tbl2;
spark-sql> SHOW TABLES IN db1 LIKE '*';
db1	tbl2	false
```

Closes #31586 from MaxGekk/doc-rename-table.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-19 04:48:16 +00:00
Steve Loughran ff5115c3ac [SPARK-33739][SQL] Jobs committed through the S3A Magic committer don't track bytes
This PR changes BasicWriteStatsTracker to probe for a custom XAttr when the size of
the generated file is 0 bytes; if the attribute is found and parseable, it is used as
the declared length of the output.

The matching Hadoop patch in HADOOP-17414:

* Returns all S3 object headers as XAttr attributes prefixed "header."
* Sets the custom header x-hadoop-s3a-magic-data-length to the length of
  the data in the marker file.

As a result, Spark job tracking will correctly report the amount of data uploaded
and yet to materialize.

### Why are the changes needed?

Now that S3 is consistent, it's a lot easier to use the S3A "magic" committer
which redirects a file written to `dest/__magic/job_0011/task_1245/__base/year=2020/output.avro`
to its final destination `dest/year=2020/output.avro` , adding a zero byte marker file at
the end and a json file `dest/__magic/job_0011/task_1245/__base/year=2020/output.avro.pending`
containing all the information for the job committer to complete the upload.

But the write tracker statistics don't show progress, as they measure the length of the
created file, find the marker file, and report 0 bytes.
By probing for a specific HTTP header in the marker file and parsing it if
retrieved, the real progress can be reported.

There's a matching change in Hadoop [https://github.com/apache/hadoop/pull/2530](https://github.com/apache/hadoop/pull/2530)
which adds getXAttr API support to the S3A connector and returns the headers; the magic
committer adds the relevant attributes.

If the FS being probed doesn't support the XAttr API, the header is missing,
or the value is not a positive long, then a size of 0 is returned.
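
A hedged standalone sketch of that probe, assuming the HADOOP-17414 header name; the real logic lives in `BasicWriteStatsTracker`:

```scala
import java.nio.charset.StandardCharsets
import org.apache.hadoop.fs.{FileSystem, Path}

// If the visible file length is 0, look for the magic-committer header;
// fall back to 0 on any failure (no XAttr support, missing header, bad value).
def declaredLength(fs: FileSystem, path: Path, statusLen: Long): Long = {
  if (statusLen > 0) {
    statusLen
  } else {
    try {
      val bytes = fs.getXAttr(path, "header.x-hadoop-s3a-magic-data-length")
      val parsed = new String(bytes, StandardCharsets.UTF_8).trim.toLong
      if (parsed > 0) parsed else 0L
    } catch {
      case _: Exception => 0L
    }
  }
}
```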

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New tests in BasicWriteTaskStatsTrackerSuite which use a filter FS to
implement getXAttr on top of LocalFS; this is used to explore the set of
options:
* no XAttr API implementation (existing tests; what callers would see with
  most filesystems)
* no attribute found (HDFS, ABFS without the attribute)
* invalid data of different forms

All of these return Some(0) as file length.

The Hadoop PR verifies XAttr implementation in S3A and that
the commit protocol attaches the header to the files.

External downstream testing has done the full hadoop+spark end
to end operation, with manual review of logs to verify that the
data was successfully collected from the attribute.

Closes #30714 from steveloughran/cdpd/SPARK-33739-magic-commit-tracking-master.

Authored-by: Steve Loughran <stevel@cloudera.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2021-02-18 08:43:18 -06:00
Max Gekk b58f0976a9 [SPARK-34437][SQL][DOCS] Update Spark SQL guide about the rebasing DS options and SQL configs
### What changes were proposed in this pull request?
In the PR, I propose to update the Spark SQL guide about the SQL configs that are related to datetime rebasing:
- spark.sql.parquet.int96RebaseModeInWrite
- spark.sql.parquet.datetimeRebaseModeInWrite
- spark.sql.parquet.int96RebaseModeInRead
- spark.sql.parquet.datetimeRebaseModeInRead
- spark.sql.avro.datetimeRebaseModeInWrite
- spark.sql.avro.datetimeRebaseModeInRead

Parquet options added by #31489:
- datetimeRebaseMode
- int96RebaseMode

and Avro options added by #31529 (an illustrative read using these options follows the list):
- datetimeRebaseMode
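
An illustrative read using the new per-source options (the path is an example; the documented mode values are `EXCEPTION`, `CORRECTED`, and `LEGACY`):

```scala
val df = spark.read
  .option("datetimeRebaseMode", "CORRECTED")  // trust dates/timestamps as-is
  .option("int96RebaseMode", "CORRECTED")     // same policy for INT96 timestamps
  .parquet("/data/legacy-parquet")
```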

<img width="998" alt="Screenshot 2021-02-17 at 21 42 09" src="https://user-images.githubusercontent.com/1580697/108252043-3afb8900-7169-11eb-8568-511e21fa7f78.png">

### Why are the changes needed?
To inform users about supported DS options and SQL configs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By generating the doc and manually checking:
```
$ SKIP_API=1 SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 jekyll serve --watch
```

Closes #31564 from MaxGekk/doc-rebase-options.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-18 17:48:50 +09:00
Max Gekk 7b549c3e53 [SPARK-34455][SQL] Deprecate spark.sql.legacy.replaceDatabricksSparkAvro.enabled
### What changes were proposed in this pull request?
1. Put the SQL config `spark.sql.legacy.replaceDatabricksSparkAvro.enabled` to the list of deprecated configs `deprecatedSQLConfigs`
2. Update docs for the Avro datasource
<img width="982" alt="Screenshot 2021-02-17 at 21 04 26" src="https://user-images.githubusercontent.com/1580697/108249890-abed7180-7166-11eb-8cb7-0c246d2a34fc.png">

### Why are the changes needed?
The config has existed for long enough. We can deprecate it and recommend that users use `.format("avro")` instead.

### Does this PR introduce _any_ user-facing change?
It should not, except for the warning recommending the `avro` format.

### How was this patch tested?
1. By generating docs via:
```
$ SKIP_API=1 SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 jekyll serve --watch
```
2. Manually checking the warning:
```
scala> spark.conf.set("spark.sql.legacy.replaceDatabricksSparkAvro.enabled", false)
21/02/17 21:20:18 WARN SQLConf: The SQL config 'spark.sql.legacy.replaceDatabricksSparkAvro.enabled' has been deprecated in Spark v3.2 and may be removed in the future. Use `.format("avro")` in `DataFrameWriter` or `DataFrameReader` instead.
```

Closes #31578 from MaxGekk/deprecate-replaceDatabricksSparkAvro.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-17 21:54:20 -08:00
“attilapiros” bdcad33d8b [SPARK-34433][DOCS] Lock Jekyll version by Gemfile and Bundler
### What changes were proposed in this pull request?

Improving the documentation and release process by pinning Jekyll version by Gemfile and Bundler.

Some files and their responsibilities within this PR:
- `docs/.bundle/config` is used to specify a directory "docs/.local_ruby_bundle" which will be used as destination to install the ruby packages into instead of the global one which requires root access
- `docs/Gemfile` is specifying the required Jekyll version and other top level gem versions
- `docs/Gemfile.lock` is generated by "bundle install". This file contains the exact resolved versions of all the gems, including the top-level gems and all of their direct and transitive dependencies. When this file is generated, it contains a platform-related section "PLATFORMS" (in my case, after the generation, it was "universal-darwin-19"). Still, this file must be under version control, as an error occurs when the version of a gem does not match the one specified in `Gemfile` (i.e. if the `Gemfile.lock` was generated for Jekyll 4.1.0 and the version in the `Gemfile` is updated to 4.2.0, it triggers the error: "The bundle currently has jekyll locked at 4.1.0."). This solution is also suggested officially in [its documentation](https://bundler.io/rationale.html#checking-your-code-into-version-control). To get rid of the specific platform (like "universal-darwin-19"), first we have to add "ruby" as a platform [which means this should work on every platform where Ruby runs](https://guides.rubygems.org/what-is-a-gem/) by running "bundle lock --add-platform ruby"; then the specific platform can be removed by "bundle lock --remove-platform universal-darwin-19".

After this the correct process to update Jekyll version is the following:
1. update the version in `Gemfile`
2. run "bundle update" which updates the `Gemfile.lock`
3. commit both files

This version-update process was tested; for details, please check the testing section.

### Why are the changes needed?

Using different Jekyll versions can generate different output documents.
This PR standardizes the process.

### Does this PR introduce _any_ user-facing change?

No, assuming the release was done via docker by using `do-release-docker.sh`.
In that case, there should be no difference at all, as the same Jekyll version is specified in the Gemfile.

### How was this patch tested?

#### Testing document generation

Doc generation step was triggered via  the docker release:

```
$ ./do-release-docker.sh -d ~/working -n -s docs
...
========================
= Building documentation...
Command: /opt/spark-rm/release-build.sh docs
Log file: docs.log
Skipping publish step.
```

The docs.log contains the followings:
```
Building Spark docs
Fetching gem metadata from https://rubygems.org/.........
Using bundler 2.2.9
Fetching rb-fsevent 0.10.4
Fetching forwardable-extended 2.6.0
Fetching public_suffix 4.0.6
Fetching colorator 1.1.0
Fetching eventmachine 1.2.7
Fetching http_parser.rb 0.6.0
Fetching ffi 1.14.2
Fetching concurrent-ruby 1.1.8
Installing colorator 1.1.0
Installing forwardable-extended 2.6.0
Installing rb-fsevent 0.10.4
Installing public_suffix 4.0.6
Installing http_parser.rb 0.6.0 with native extensions
Installing eventmachine 1.2.7 with native extensions
Installing concurrent-ruby 1.1.8
Fetching rexml 3.2.4
Fetching liquid 4.0.3
Installing ffi 1.14.2 with native extensions
Installing rexml 3.2.4
Installing liquid 4.0.3
Fetching mercenary 0.4.0
Installing mercenary 0.4.0
Fetching rouge 3.26.0
Installing rouge 3.26.0
Fetching safe_yaml 1.0.5
Installing safe_yaml 1.0.5
Fetching unicode-display_width 1.7.0
Installing unicode-display_width 1.7.0
Fetching webrick 1.7.0
Installing webrick 1.7.0
Fetching pathutil 0.16.2
Fetching kramdown 2.3.0
Fetching terminal-table 2.0.0
Fetching addressable 2.7.0
Fetching i18n 1.8.9
Installing terminal-table 2.0.0
Installing pathutil 0.16.2
Installing i18n 1.8.9
Installing addressable 2.7.0
Installing kramdown 2.3.0
Fetching kramdown-parser-gfm 1.1.0
Installing kramdown-parser-gfm 1.1.0
Fetching rb-inotify 0.10.1
Fetching sassc 2.4.0
Fetching em-websocket 0.5.2
Installing rb-inotify 0.10.1
Installing em-websocket 0.5.2
Installing sassc 2.4.0 with native extensions
Fetching listen 3.4.1
Installing listen 3.4.1
Fetching jekyll-watch 2.2.1
Installing jekyll-watch 2.2.1
Fetching jekyll-sass-converter 2.1.0
Installing jekyll-sass-converter 2.1.0
Fetching jekyll 4.2.0
Installing jekyll 4.2.0
Fetching jekyll-redirect-from 0.16.0
Installing jekyll-redirect-from 0.16.0
Bundle complete! 4 Gemfile dependencies, 30 gems now installed.
Bundled gems are installed into `./.local_ruby_bundle`
```

#### Testing Jekyll (or other gem) update

First locally I reverted Jekyll to 4.1.0:
```
$ rm Gemfile.lock
$ rm -rf .local_ruby_bundle

# edited Gemfile to use version 4.1.0
$ cat Gemfile
source "https://rubygems.org"

gem "jekyll", "4.1.0"
gem "rouge", "3.26.0"
gem "jekyll-redirect-from", "0.16.0"
gem "webrick", "1.7"
$ bundle install
...
```

Testing Jekyll version before the update:

```
$ bundle exec jekyll --version
jekyll 4.1.0
```

Imitating Jekyll update coming from git by reverting my local changes:

```
$ git checkout Gemfile
Updated 1 path from the index
$ cat Gemfile
source "https://rubygems.org"

gem "jekyll", "4.2.0"
gem "rouge", "3.26.0"
gem "jekyll-redirect-from", "0.16.0"
gem "webrick", "1.7"

$ git checkout Gemfile.lock
Updated 1 path from the index
```

Run the install:

```
$ bundle install
...
```

Checking the updated Jekyll version:
```
$ bundle exec jekyll --version
jekyll 4.2.0
```

Closes #31559 from attilapiros/pin-jekyll-version.

Lead-authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Co-authored-by: Attila Zsolt Piros <2017933+attilapiros@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-18 12:17:57 +09:00
Cheng Su a575e805a1 [SPARK-34446][SS][DOCS] Update doc for stream-stream join (full outer + left semi)
### What changes were proposed in this pull request?

Per discussion in https://issues.apache.org/jira/browse/SPARK-32883?focusedCommentId=17285057&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17285057, we should add documentation for added new features of full outer and left semi joins into SS programming guide.

* Reworded the section "Outer Joins with Watermarking" to make it work for full outer join, and updated the code snippet to show full outer and left semi joins (a sketch follows this list).
* Added one section for "Semi Joins with Watermarking", similar to "Outer Joins with Watermarking".
* Updated "Support matrix for joins in streaming queries" to reflect latest fact for full outer and left semi join.

### Why are the changes needed?

Good for users and developers to follow guide to try out these two new features.

### Does this PR introduce _any_ user-facing change?

Yes. They will see the corresponding updated guide.

### How was this patch tested?

No, just documentation change. Previewed the markdown file in browser.
Also attached here for the change to the "Support matrix for joins in streaming queries" table.

<img width="896" alt="Screen Shot 2021-02-16 at 8 12 07 PM" src="https://user-images.githubusercontent.com/4629931/108155275-73c92e80-7093-11eb-9f0b-c8b4bb7321e5.png">

Closes #31572 from c21/ss-doc.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-02-18 09:34:33 +09:00
Kousuke Saruta dd6383f0a3 [SPARK-34333][SQL] Fix PostgresDialect to handle money types properly
### What changes were proposed in this pull request?

This PR changes the type mapping for `money` and `money[]`  types for PostgreSQL.
Currently, Spark tries to convert those types to `DoubleType` and `ArrayType` of `DoubleType`, respectively.
But the JDBC driver seems unable to handle those types properly.

https://github.com/pgjdbc/pgjdbc/issues/100
https://github.com/pgjdbc/pgjdbc/issues/1405

Due to these issues, we can get errors like the following.

money type.
```
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (192.168.1.204 executor driver): org.postgresql.util.PSQLException: Bad value for type double : 1,000.00
[info] 	at org.postgresql.jdbc.PgResultSet.toDouble(PgResultSet.java:3104)
[info] 	at org.postgresql.jdbc.PgResultSet.getDouble(PgResultSet.java:2432)
[info] 	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$5(JdbcUtils.scala:418)
```

money[] type.
```
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (192.168.1.204 executor driver): org.postgresql.util.PSQLException: Bad value for type double : $2,000.00
[info] 	at org.postgresql.jdbc.PgResultSet.toDouble(PgResultSet.java:3104)
[info] 	at org.postgresql.jdbc.ArrayDecoding$5.parseValue(ArrayDecoding.java:235)
[info] 	at org.postgresql.jdbc.ArrayDecoding$AbstractObjectStringArrayDecoder.populateFromString(ArrayDecoding.java:122)
[info] 	at org.postgresql.jdbc.ArrayDecoding.readStringArray(ArrayDecoding.java:764)
[info] 	at org.postgresql.jdbc.PgArray.buildArray(PgArray.java:310)
[info] 	at org.postgresql.jdbc.PgArray.getArrayImpl(PgArray.java:171)
[info] 	at org.postgresql.jdbc.PgArray.getArray(PgArray.java:111)
```

For the money type, a known workaround is to treat it as a string, so this PR does that.
For money[], however, there is no reasonable workaround, so this PR removes the support.
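
A hedged standalone sketch of the resulting mapping decision (the real change lives in `PostgresDialect`; this function just mirrors the decision described above):

```scala
import org.apache.spark.sql.types.{DataType, StringType}

// money -> StringType (the driver can't parse "$1,000.00" as a double);
// money[] -> no mapping, i.e. support removed.
def mapMoneyType(typeName: String, isArray: Boolean): Option[DataType] =
  if (typeName == "money" && !isArray) Some(StringType) else None
```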

### Why are the changes needed?

This is a bug.

### Does this PR introduce _any_ user-facing change?

Yes. Once this PR is merged, the money type is mapped to `StringType` rather than `DoubleType`, and the support for money[] is stopped.
For the money type, values of less than one thousand (`$100.00`, for instance) work without this change, so I also updated the migration guide because it's a behavior change for such small values.
On the other hand, money[] seems not to work with any value, but it is mentioned in the migration guide just in case.

### How was this patch tested?

New test.

Closes #31442 from sarutak/fix-for-money-type.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-02-17 10:50:06 +09:00
Gabor Somogyi 0a37a95224 [SPARK-31816][SQL][DOCS] Added high level description about JDBC connection providers for users/developers
### What changes were proposed in this pull request?
JDBC connection provider API and embedded connection providers already added to the code but no in-depth description about the internals. In this PR I've added both user and developer documentation and additionally added an example custom JDBC connection provider.

### Why are the changes needed?
No documentation and example custom JDBC provider.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
```
cd docs/
SKIP_API=1 jekyll build
```
<img width="793" alt="Screenshot 2021-02-02 at 16 35 43" src="https://user-images.githubusercontent.com/18561820/106623428-e48d2880-6574-11eb-8d14-e5c2aa7c37f1.png">

Closes #31384 from gaborgsomogyi/SPARK-31816.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-02-10 12:28:28 +09:00
Angerszhuuuu 123365e05c [SPARK-34240][SQL] Unify output of SHOW TBLPROPERTIES clause's output attribute's schema and ExprID
### What changes were proposed in this pull request?
Passing around the output attributes has more benefits, like keeping the exprID unchanged, which avoids bugs when we apply more operators above the command output DataFrame.

This PR did two things:

1. After this PR, a `SHOW TBLPROPERTIES` clause's output shows `key` and `value` columns whether or not you specify the table property `key`. Before this PR, a `SHOW TBLPROPERTIES` clause's output only shows a `value` column when you specify the table property `key`.
2. Keep `SHOW TBLPROPERTIES` command's output attribute exprId unchanged.

### Why are the changes needed?
 1. Keep the `SHOW TBLPROPERTIES` output schema consistent
 2. Keep `SHOW TBLPROPERTIES` command's output attribute exprId unchanged.

### Does this PR introduce _any_ user-facing change?
After this PR, a `SHOW TBLPROPERTIES` clause's output shows `key` and `value` columns whether or not you specify the table property `key`. Before this PR, the output only shows a `value` column when you specify the table property `key`.

Before this PR:
```
sql > SHOW TBLPROPERTIES table_name('key')
value
value_of_key
```

After this PR
```
sql > SHOW TBLPROPERTIES table_name('key')
key value
key value_of_key
```

### How was this patch tested?
Added UT

Closes #31378 from AngersZhuuuu/SPARK-34240.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-10 03:19:52 +00:00
Liang-Chi Hsieh 1fbd576410 [SPARK-34080][ML][PYTHON][FOLLOW-UP] Update score function in UnivariateFeatureSelector document
### What changes were proposed in this pull request?

This follows up #31160 to update score function in the document.

### Why are the changes needed?

Currently we use `f_classif`, `chi2`, and `f_regression`, which sound like sklearn's naming to me. It is good to have them, but I think it is nicer to have formal score function names alongside sklearn's ones.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

No, only doc change.

Closes #31531 from viirya/SPARK-34080-minor.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-10 09:24:25 +09:00
Gengliang Wang 88ced28141 [SPARK-33354][DOC] Remove an unnecessary quote in doc
### What changes were proposed in this pull request?

Remove an unnecessary quote in the documentation.
Super trivial.

### Why are the changes needed?

Fix a mistake.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Just doc

Closes #31523 from gengliangwang/removeQuote.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-08 21:08:34 +09:00
gengjiaan 2c243c93d9 [SPARK-34157][SQL] Unify output of SHOW TABLES and pass output attributes properly
### What changes were proposed in this pull request?
The current implementation of some DDL commands neither unifies the output nor passes the output properly to the physical command.
For example, `ShowTables` outputs the attribute `namespace`, but `ShowTablesCommand` outputs the attribute `database`.

At the query-plan level, this PR passes the output attributes from `ShowTables` to `ShowTablesCommand`, and from `ShowTableExtended` to `ShowTablesCommand`.

Take `show tables` and `show table extended like 'tbl'` as example.
The output before this PR:
`show tables`
|database|tableName|isTemporary|
-- | -- | --
| default|      tbl|      false|

If catalog is v2 session catalog, the output before this PR:
|namespace|tableName|
-- | --
| default|      tbl

`show table extended like 'tbl'`
|database|tableName|isTemporary|         information|
-- | -- | -- | --
| default|      tbl|      false|Database: default...|

The output after this PR:
`show tables`
|namespace|tableName|isTemporary|
-- | -- | --
|  default|      tbl|      false|

`show table extended like 'tbl'`
|namespace|tableName|isTemporary|         information|
-- | -- | -- | --
|  default|      tbl|      false|Database: default...|

### Why are the changes needed?
This PR has the following benefits:
First, it unifies the schema for the output of SHOW TABLES.
Second, passing the output attributes keeps the expr ID unchanged, which avoids bugs when we apply more operators above the command output dataframe.

### Does this PR introduce _any_ user-facing change?
Yes.
The output schema of `SHOW TABLES` replaces `database` with `namespace`.

### How was this patch tested?
Jenkins test.

Closes #31245 from beliefer/SPARK-34157.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-08 08:39:58 +00:00
raphaelauv 34a1a65b39 [SPARK-34398][DOCS] Fix PySpark migration link
### What changes were proposed in this pull request?

docs/pyspark-migration-guide.md

### Why are the changes needed?
broken link

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually build and check

Closes #31514 from raphaelauv/patch-2.

Authored-by: raphaelauv <raphaelauv@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-08 09:12:15 +09:00
Wenchen Fan 361d702f8d [SPARK-34359][SQL] Add a legacy config to restore the output schema of SHOW DATABASES
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/26006

In #26006 , we merged the v1 and v2 SHOW DATABASES/NAMESPACES commands, but we missed a behavior change that the output schema of SHOW DATABASES becomes different.

This PR adds a legacy config to restore the old schema, with a migration guide item to mention this behavior change.

### Why are the changes needed?

Improve backward compatibility

### Does this PR introduce _any_ user-facing change?

No (the legacy config is false by default)

### How was this patch tested?

a new test

Closes #31474 from cloud-fan/command-schema.

Lead-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-05 04:57:51 +00:00
Gengliang Wang ff1b6ecc37 [SPARK-33591][SQL][FOLLOW-UP] Revise the version and doc of spark.sql.legacy.parseNullPartitionSpecAsStringLiteral
### What changes were proposed in this pull request?

Correct the version of SQL configuration `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` from 3.2.0 to 3.0.2.
Also, revise the documentation and test case.

### Why are the changes needed?

The release version in https://github.com/apache/spark/pull/31421 was wrong.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes #31434 from gengliangwang/reviseVersion.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-02 13:51:20 +00:00
Linhong Liu bb9bf66bb6 [SPARK-34199][SQL] Block table.* inside function to follow ANSI standard and other SQL engines
### What changes were proposed in this pull request?
In Spark, `count(table.*)` may produce a very weird result, for example:
```
select count(*) from (select 1 as a, null as b) t;
output: 1
select count(t.*) from (select 1 as a, null as b) t;
output: 0
```
This is because Spark expands `t.*` while converting `*` to `count(1)`, which confuses
users. After checking the ANSI standard, `count(*)` should always be `count(1)`, while `count(t.*)`
is not allowed. What's more, this is also not allowed by common databases, e.g. MySQL and Oracle.

So, this PR proposes to block the ambiguous behavior and print a clear error message for users.

### Why are the changes needed?
To avoid ambiguous behavior and to follow the ANSI standard and other SQL engines.

### Does this PR introduce _any_ user-facing change?
Yes, `count(table.*)` behavior will be blocked and output an error message.

### How was this patch tested?
newly added and existing tests

Closes #31286 from linhongliu-db/fix-table-star.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-02 07:49:50 +00:00
Gengliang Wang 521397f2f9 [SPARK-33591][SQL][FOLLOWUP] Add legacy config for recognizing null partition spec values
### What changes were proposed in this pull request?

This is a follow up for https://github.com/apache/spark/pull/30538.
It adds a legacy conf `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` in case users wants the legacy behavior.
It also adds document for the behavior change.

### Why are the changes needed?

In case users want the legacy behavior, they can set `spark.sql.legacy.parseNullPartitionSpecAsStringLiteral` as true.

### Does this PR introduce _any_ user-facing change?

Yes, adding a legacy configuration to restore the old behavior.

### How was this patch tested?

Unit test.

Closes #31421 from gengliangwang/legacyNullStringConstant.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-02 16:13:40 +09:00
HyukjinKwon 30468a9015 [SPARK-34306][SQL][PYTHON][R] Use Snake naming rule across the function APIs
### What changes were proposed in this pull request?

This PR completes the snake_case rule for function APIs across the languages; see also SPARK-10621.

In more detail, this PR (a usage sketch follows this list):
- Adds `count_distinct` in Scala, Python, and R, and documents that `count_distinct` is encouraged. This was not deprecated because `countDistinct` is pretty commonly used. We could deprecate it in future releases.
- (Scala-specific) adds `typedlit` but doesn't deprecate `typedLit`, which is arguably commonly used. Likewise, we could deprecate it in future releases.
- Deprecates and renames:
  - `sumDistinct` -> `sum_distinct`
  - `bitwiseNOT` -> `bitwise_not`
  - `shiftLeft` -> `shiftleft` (matched with SQL name in `FunctionRegistry`)
  - `shiftRight` -> `shiftright` (matched with SQL name in `FunctionRegistry`)
  - `shiftRightUnsigned` -> `shiftrightunsigned` (matched with SQL name in `FunctionRegistry`)
  - (Scala-specific) `callUDF` -> `call_udf`
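
A quick usage sketch of some of the renamed functions (assuming any DataFrame `df` with numeric columns `a` and `b`):

```scala
import org.apache.spark.sql.functions._

df.select(
  count_distinct(col("a")),  // encouraged alongside countDistinct
  sum_distinct(col("b")),    // replaces the deprecated sumDistinct
  bitwise_not(col("a")),     // replaces the deprecated bitwiseNOT
  shiftleft(col("a"), 1))    // replaces the deprecated shiftLeft
```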

### Why are the changes needed?

To keep the consistent naming in APIs.

### Does this PR introduce _any_ user-facing change?

Yes, it deprecates some APIs and add new renamed APIs as described above.

### How was this patch tested?

Unittests were added.

Closes #31408 from HyukjinKwon/SPARK-34306.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-02 09:29:40 +09:00
Dongjoon Hyun 78244bafe8 [SPARK-34281][K8S] Promote spark.kubernetes.executor.podNamePrefix to the public conf
### What changes were proposed in this pull request?

This PR aims to remove `internal()` from `spark.kubernetes.executor.podNamePrefix` in order to make the configuration public.

### Why are the changes needed?

In line with K8s GA, this will allow some users to control the full executor pod names officially.
This is useful when we want a custom executor pod name pattern independent of the app name.

### Does this PR introduce _any_ user-facing change?

No, this has been there since Apache Spark 2.3.0.

### How was this patch tested?

N/A.

Closes #31386 from dongjoon-hyun/SPARK-34281.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-28 13:01:18 -08:00
HyukjinKwon 1217c8b418 Revert "[SPARK-31168][SPARK-33913][BUILD] Upgrade Scala to 2.12.13 and Kafka to 2.7.0"
This reverts commit a65e86a65e.
2021-01-27 17:03:15 +09:00
Chao Sun c2320a43c7 [SPARK-34052][FOLLOWUP][DOC] Add document in SQL migration guide
### What changes were proposed in this pull request?

Add document for the behavior change in SPARK-34052, in SQL migration guide.

### Why are the changes needed?

Document behavior change for Spark users.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #31351 from sunchao/SPARK-34052-followup.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-26 15:11:45 -08:00
Max Gekk ac8307d75c [SPARK-34215][SQL] Keep tables cached after truncation
### What changes were proposed in this pull request?
Invoke `CatalogImpl.refreshTable()` instead of combination of `SessionCatalog.refreshTable()` + `uncacheQuery()`. This allows to clear cached table data while keeping the table cached.

### Why are the changes needed?
1. To improve user experience with Spark SQL
2. To be consistent to other commands, see https://github.com/apache/spark/pull/31206

### Does this PR introduce _any_ user-facing change?
Yes.

Before:
```scala
scala> sql("CREATE TABLE tbl (c0 int)")
res1: org.apache.spark.sql.DataFrame = []
scala> sql("INSERT INTO tbl SELECT 0")
res2: org.apache.spark.sql.DataFrame = []
scala> sql("CACHE TABLE tbl")
res3: org.apache.spark.sql.DataFrame = []
scala> sql("SELECT * FROM tbl").show(false)
+---+
|c0 |
+---+
|0  |
+---+
scala> spark.catalog.isCached("tbl")
res5: Boolean = true
scala> sql("TRUNCATE TABLE tbl")
res6: org.apache.spark.sql.DataFrame = []
scala> spark.catalog.isCached("tbl")
res7: Boolean = false
```

After:
```scala
scala> sql("TRUNCATE TABLE tbl")
res6: org.apache.spark.sql.DataFrame = []
scala> spark.catalog.isCached("tbl")
res7: Boolean = true
```

### How was this patch tested?
Added new test to `CachedTableSuite`:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CachedTableSuite"
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *CatalogedDDLSuite"
```

Closes #31308 from MaxGekk/truncate-table-cached.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-26 15:36:44 +00:00
Angerszhuuuu 7bd4165c11 [SPARK-32852][SQL][FOLLOW_UP] Add notice about keep hive version consistence when config hive jars location
### What changes were proposed in this pull request?
Add a notice about keeping the Hive version consistent when configuring the Hive jars location.

With PR #29881, if we don't keep the Hive versions consistent, we get the error below.
```
Builtin jars can only be used when hive execution version == hive metastore version. Execution: 2.3.8 != Metastore: 1.2.1. Specify a valid path to the correct hive jars using spark.sql.hive.metastore.jars or change spark.sql.hive.metastore.version to 2.3.8.
```

![image](https://user-images.githubusercontent.com/46485123/105795169-512d8380-5fc7-11eb-97c3-0259a0d2aa58.png)

### Why are the changes needed?
Make the config doc more detailed.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #31317 from AngersZhuuuu/SPARK-32852-followup.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-26 13:40:20 +09:00
Max Gekk 652bdf0d5a [SPARK-34027][SPARK-34213][SQL][FOLLOWUP][DOCS] Update the SQL migration guide about table re-caching
### What changes were proposed in this pull request?
This is a follow up of the PRs https://github.com/apache/spark/pull/31066 and https://github.com/apache/spark/pull/31304 that changed behavior of some commands regarding to table cache refreshing. The PR updates the SQL migration guide, in particular, the item which describes new behavior.

### Why are the changes needed?
To inform users about command behavior changes.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
N/A

Closes #31309 from MaxGekk/refreshTable-sql-migration-guide.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-24 11:56:35 -08:00
Max Gekk e79c1cde1b [SPARK-34138][SQL] Keep dependants cached while refreshing v1 tables
### What changes were proposed in this pull request?
This PR changes cache refreshing of v1 tables in v1 commands. In particular, v1 table dependents are not removed from the cache after this PR. Compared to the current implementation, we just clear the cached data of all dependents and keep them in the cache. So, the next actions will fill in the cached data of the original v1 table and its dependents. In more detail:
1. Modified the `CatalogImpl.refreshTable()` method to use `recacheByPlan()` instead of `lookupCachedData()`, `uncacheQuery()` and `cacheQuery()`. Users can call this method via public API like `spark.catalog.refreshTable()`.
2. Rewrote the part of `CatalogImpl.refreshTable()` that was responsible for table meta-data refreshing, because this code stopped working properly after removing the second `sparkSession.table(tableIdent)`.
3. Added a new private method `invalidateCachedTable()` to `SessionCatalog`. Compared to the existing `SessionCatalog.refreshTable`, it invalidates only the relation cache. If we called `SessionCatalog.refreshTable` from `CatalogImpl.refreshTable()`, we would refresh temporary and global temporary views twice (which could lead to refreshing the file index twice).

### Why are the changes needed?
1. This should improve user experience with table/view caching. For example, let's imagine that a user has cached a v1 table and a cached view based on that table, and the user passed the table to an external library which drops/renames/adds partitions in the v1 table. Unfortunately, the user gets the view uncached after that, even though he/she hasn't uncached the view explicitly.
2. To improve code maintenance.
3. To reduce the amount of calls to Hive external catalog.
4. Also this should speed up table recaching.
5. To have the same behavior as for v2 tables supported by https://github.com/apache/spark/pull/31172

### Does this PR introduce _any_ user-facing change?
From the point of view of query-result correctness, there are no behavior changes, but the changes might influence memory consumption and query execution time. For example:

Before:
```scala
scala> sql("CREATE TABLE tbl (c int)")
scala> sql("CACHE TABLE tbl")
scala> sql("CREATE VIEW v AS SELECT * FROM tbl")
scala> sql("CACHE TABLE v")

scala> spark.catalog.isCached("v")
res6: Boolean = true
scala> spark.catalog.refreshTable("tbl")

scala> spark.catalog.isCached("v")
res8: Boolean = false
```

After:
```scala
scala> spark.catalog.refreshTable("tbl")

scala> spark.catalog.isCached("v")
res8: Boolean = true
```

### How was this patch tested?
1. Added new unit tests that create a view, a temporary view and a global temporary view on top of v1/v2 tables, and refresh the base table via `ALTER TABLE .. ADD/DROP/RENAME PARTITION`.
2. By running the unified test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite"
```

Closes #31206 from MaxGekk/refreshTable-recache-by-plan.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-21 13:03:24 +00:00
Angerszhuuuu faa4f0c2bd [SPARK-34181][DOC] Update Prerequisites for build doc of ruby 3.0 issue
### What changes were proposed in this pull request?
When the Ruby version is 3.0, the Jekyll server fails with
```
yi.zhu$ SKIP_API=1 jekyll serve --watch
Configuration file: /Users/yi.zhu/Documents/project/Angerszhuuuu/spark/docs/_config.yml
            Source: /Users/yi.zhu/Documents/project/Angerszhuuuu/spark/docs
       Destination: /Users/yi.zhu/Documents/project/Angerszhuuuu/spark/docs/_site
 Incremental build: disabled. Enable with --incremental
      Generating...
                    done in 5.085 seconds.
 Auto-regeneration: enabled for '/Users/yi.zhu/Documents/project/Angerszhuuuu/spark/docs'
                    ------------------------------------------------
      Jekyll 4.2.0   Please append `--trace` to the `serve` command
                     for any additional information or backtrace.
                    ------------------------------------------------
<internal:/usr/local/Cellar/ruby/3.0.0_1/lib/ruby/3.0.0/rubygems/core_ext/kernel_require.rb>:85:in `require': cannot load such file -- webrick (LoadError)
	from <internal:/usr/local/Cellar/ruby/3.0.0_1/lib/ruby/3.0.0/rubygems/core_ext/kernel_require.rb>:85:in `require'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/commands/serve/servlet.rb:3:in `<top (required)>'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/commands/serve.rb:179:in `require_relative'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/commands/serve.rb:179:in `setup'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/commands/serve.rb:100:in `process'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/command.rb:91:in `block in process_with_graceful_fail'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/command.rb:91:in `each'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/command.rb:91:in `process_with_graceful_fail'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/lib/jekyll/commands/serve.rb:86:in `block (2 levels) in init_with_program'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `block in execute'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `each'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/command.rb:221:in `execute'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary/program.rb:44:in `go'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/mercenary-0.4.0/lib/mercenary.rb:21:in `program'
	from /Users/yi.zhu/.gem/ruby/3.0.0/gems/jekyll-4.2.0/exe/jekyll:15:in `<top (required)>'
	from /usr/local/bin/jekyll:23:in `load'
	from /usr/local/bin/jekyll:23:in `<main>'
```

This issue is solved in https://github.com/jekyll/jekyll/issues/8523

### Why are the changes needed?
Fix build issue

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #31263 from AngersZhuuuu/SPARK-34181.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-21 11:36:09 +09:00
Ismaël Mejía e9e81f798f [SPARK-27733][CORE] Upgrade Avro to version 1.10.1
### What changes were proposed in this pull request?

Update Avro dependency to version 1.10.1

### Why are the changes needed?

To catch up multiple improvements of Avro as well as fix security issues on transitive dependencies.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Since there were no API changes required we just run the tests

Closes #31232 from iemejia/SPARK-27733-avro-upgrade.

Authored-by: Ismaël Mejía <iemejia@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-20 15:42:27 -08:00
Dongjoon Hyun 7e1651e315 [SPARK-34162][DOCS][PYSPARK] Add PyArrow compatibility note for Python 3.9
### What changes were proposed in this pull request?

This PR aims to add a note for Apache Arrow project's `PyArrow` compatibility for Python 3.9.

### Why are the changes needed?

Although Apache Spark documentation claims `Spark runs on Java 8/11, Scala 2.12, Python 3.6+ and R 3.5+.`,
Apache Arrow's `PyArrow` is not compatible with Python 3.9.x yet. Without the `PyArrow` library installed, PySpark UTs passed without any problem. So, it would be enough to add a note about this limitation and the compatibility link of the Apache Arrow website.
- https://arrow.apache.org/docs/python/install.html#python-compatibility

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

**BEFORE**
<img width="804" alt="Screen Shot 2021-01-19 at 1 45 07 PM" src="https://user-images.githubusercontent.com/9700541/105096867-8fbdbe00-5a5c-11eb-88f7-8caae2427583.png">

**AFTER**
<img width="908" alt="Screen Shot 2021-01-19 at 7 06 41 PM" src="https://user-images.githubusercontent.com/9700541/105121661-85fe7f80-5a89-11eb-8af7-1b37e12c55c1.png">

Closes #31251 from dongjoon-hyun/SPARK-34162.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-19 19:09:14 -08:00
Dongjoon Hyun a65e86a65e [SPARK-31168][SPARK-33913][BUILD] Upgrade Scala to 2.12.13 and Kafka to 2.7.0
### What changes were proposed in this pull request?

This PR is the 3rd try to upgrade Scala 2.12.x in order to see the feasibility.
- https://github.com/apache/spark/pull/27929 (Upgrade Scala to 2.12.11, wangyum )
- https://github.com/apache/spark/pull/30940 (Upgrade Scala to 2.12.12, viirya )

The `silencer` library is updated accordingly. A Kafka version upgrade is also required because it otherwise fails as follows.
```
[info] KafkaDataConsumerSuite:
[info] org.apache.spark.streaming.kafka010.KafkaDataConsumerSuite *** ABORTED *** (1 second, 580 milliseconds)
[info]   java.lang.NoClassDefFoundError: scala/math/Ordering$$anon$7
[info]   at kafka.api.ApiVersion$.orderingByVersion(ApiVersion.scala:45)
```

### Why are the changes needed?

Apache Spark was stuck on 2.12.10 due to regressions in Scala 2.12.11 and 2.12.12. This will bring all the bug fixes.
- https://github.com/scala/scala/releases/tag/v2.12.13
- https://github.com/scala/scala/releases/tag/v2.12.12
- https://github.com/scala/scala/releases/tag/v2.12.11

### Does this PR introduce _any_ user-facing change?

Yes, but this is a bug-fixed version.

### How was this patch tested?

Pass the CIs.

Closes #31223 from dongjoon-hyun/SPARK-31168.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-18 13:45:06 -08:00
Yuming Wang c87b0085c9 [SPARK-33696][BUILD][SQL] Upgrade built-in Hive to 2.3.8
### What changes were proposed in this pull request?

Hive 2.3.8 changes:
HIVE-19662: Upgrade Avro to 1.8.2
HIVE-24324: Remove deprecated API usage from Avro
HIVE-23980: Shade Guava from hive-exec in Hive 2.3
HIVE-24436: Fix Avro NULL_DEFAULT_VALUE compatibility issue
HIVE-24512: Exclude calcite in packaging.
HIVE-22708: Fix for HttpTransport to replace String.equals
HIVE-24551: Hive should include transitive dependencies from calcite after shading it
HIVE-24553: Exclude calcite from test-jar dependency of hive-exec

### Why are the changes needed?

To upgrade Avro and Parquet to their latest versions.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests, plus a test PR that tries to upgrade Parquet to 1.11.1 and Avro to 1.10.1: https://github.com/apache/spark/pull/30517

Closes #30657 from wangyum/SPARK-33696.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-17 21:54:35 -08:00
Mitsuru Kariya 536a7258a8 [MINOR][DOCS] Fix typos in sql-ref-datatypes.md
### What changes were proposed in this pull request?
Fixing typos in the docs sql-ref-datatypes.md.

### Why are the changes needed?
To display '<element_type>' correctly.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually run jekyll.

before this fix
![image](https://user-images.githubusercontent.com/2217224/104865408-3df33600-597f-11eb-857b-c6223ff9159a.png)

after this fix
![image](https://user-images.githubusercontent.com/2217224/104865458-62e7a900-597f-11eb-8a21-6d838eecaaf2.png)

Closes #31221 from kariya-mitsuru/fix-typo.

Authored-by: Mitsuru Kariya <Mitsuru.Kariya@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-18 13:18:03 +09:00
Huaxin Gao 8847b7fa6d [MINOR][DOCS] Fix broken python doc links
### What changes were proposed in this pull request?
Fix broken python links

### Why are the changes needed?
links broken.
![image](https://user-images.githubusercontent.com/13592258/104859361-9f60c980-58d9-11eb-8810-cb0669040af4.png)

![image](https://user-images.githubusercontent.com/13592258/104859350-8b1ccc80-58d9-11eb-9a8a-6ba8792595aa.png)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually checked

Closes #31220 from huaxingao/docs.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-18 10:06:45 +09:00
William Hyun 1cf09b77eb [MINOR][DOCS] Update Parquet website link
### What changes were proposed in this pull request?
This PR aims to update the Parquet website link from http://parquet.io to https://parquet.apache.org

### Why are the changes needed?
The old website goes to the incubator site.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
N/A

Closes #31208 from williamhyun/minor-parquet.

Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-16 11:24:32 -08:00
Huaxin Gao f3548837c6 [SPARK-34080][ML][PYTHON] Add UnivariateFeatureSelector
### What changes were proposed in this pull request?
Add UnivariateFeatureSelector

### Why are the changes needed?
Have one UnivariateFeatureSelector, so we don't need to have three Feature Selectors.

### Does this PR introduce _any_ user-facing change?
Yes
```
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], featureType="categorical", labelType="continuous", selectorType="numTopFeatures",  numTopFeatures=100)
```

Or

numTopFeatures
```
selector = UnivariateFeatureSelector(featureCols=["x", "y", "z"], labelCol=["target"], scoreFunction="f_classif", selectorType="numTopFeatures",  numTopFeatures=100)
```

### How was this patch tested?
Add Unit test

Closes #31160 from huaxingao/UnivariateSelector.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
2021-01-16 11:09:23 +08:00
zero323 66cc12944a [SPARK-34132][DOCS][R] Update Roxygen version references to 7.1.1
### What changes were proposed in this pull request?

This PR updates `roxygen2` version reference in docs and `DESCRIPTION` file.

### Why are the changes needed?

According to information provided by shaneknapp (see [this comment](https://issues.apache.org/jira/browse/SPARK-30747?focusedCommentId=17265142&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17265142) to SPARK-30747) all workers use roxygen 7.1.1.

In GitHub workflow we install the latest version

c75c29dcaa/.github/workflows/build_and_test.yml (L346)

which [is also 7.1.1 at the moment](https://web.archive.org/web/20210115172522/https://cran.r-project.org/web/packages/roxygen2/).

### Does this PR introduce _any_ user-facing change?

Docs and DESCRIPTION mention the currently used package version.

### How was this patch tested?

- `dev/lint-r`.
- Manual check of command used in docs.

Closes #31200 from zero323/ROXYGEN-VERSION-UPDATE-DOCS.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-15 17:08:17 -08:00
Gengliang Wang feedd1b44d [SPARK-33354][FOLLOWUP][DOC] Shorten the table width of ANSI compliance casting document
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/30260
It shortens the table width of ANSI compliance casting document.

### Why are the changes needed?

The table is too wide, and the UI of the doc site is broken if we scroll the page to the right side.
![Screen Shot 2021-01-14 at 3 04 57 PM](https://user-images.githubusercontent.com/1097932/104565897-d2693b80-5601-11eb-9f93-5f603cfc94c1.png)

### Does this PR introduce _any_ user-facing change?

Minor document change

### How was this patch tested?

Build doc site locally and preview:
![Screen Shot 2021-01-14 at 4 44 30 PM](https://user-images.githubusercontent.com/1097932/104565814-b2d21300-5601-11eb-94c4-78c785cda8ed.png)

Closes #31180 from gengliangwang/reviseAnsiDocStyle.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-14 13:55:39 -08:00
ulysses-you 92e5cfd58d [SPARK-33989][SQL] Strip auto-generated cast when using Cast.sql
### What changes were proposed in this pull request?

This PR aims to strip auto-generated cast. The main logic is:
1. Add tag if Cast is specified by user.
2. Wrap `PrettyAttribute` in usePrettyExpression.

### Why are the changes needed?

Make sql consistent with dsl. Here is an inconsistent example before this PR:

```
-- output field name: FLOOR(1)
spark.emptyDataFrame.select(floor(lit(1)))

-- output field name: FLOOR(CAST(1 AS DOUBLE))
spark.sql("select floor(1)")
```

Note that, we don't remove the `Cast` so the auto-generated `Cast` can still work. The only changed place is `usePrettyExpression`, we use `PrettyAttribute` replace `Cast` to give a better sql string.

### Does this PR introduce _any_ user-facing change?

Yes, the default field name may change.

### How was this patch tested?

Add test and pass exists test.

Closes #31034 from ulysses-you/SPARK-33989.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-14 15:27:14 +00:00
Kousuke Saruta b7da108cae [SPARK-33690][SQL][FOLLOWUP] Escape further meta-characters in showString
### What changes were proposed in this pull request?

This is a followup PR for SPARK-33690 (#30647) .
In addition to the original PR, this PR intends to escape the following meta-characters in `Dataset#showString`.

  * `\r` (carrige ret)
  * `\f` (form feed)
  * `\b` (backspace)
  * `\u000B` (vertical tab)
  * `\u0007` (bell)

### Why are the changes needed?

To avoid breaking the layout of `Dataset#showString`.
`\u0007` does not break the layout of `Dataset#showString` but it's noisy (beeps for each row) so it should be also escaped.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified the existing tests.
I also build the documents and check the generated html for `sql-migration-guide.md`.

Closes #31144 from sarutak/escape-metacharacters-in-getRows.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-13 18:13:01 -06:00
schintap bd5039fc35 [SPARK-33741][CORE] Add min threshold time speculation config
### What changes were proposed in this pull request?
Add min threshold time speculation config

### Why are the changes needed?
When we turn on speculation with default configs we have the last 10% of the tasks subject to speculation. There are a lot of stages where the stage runs for few seconds to minutes. Also in general we don't want to speculate tasks that run within a minimum threshold. By setting a minimum threshold for speculation config gives us better control for speculative tasks

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test

Closes #30710 from redsanket/SPARK-33741.

Lead-authored-by: schintap <schintap@verizonmedia.com>
Co-authored-by: Sanket Chintapalli <chintapalli.sanketreddy@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2021-01-13 08:57:56 -06:00
Liang-Chi Hsieh ad9fad72a9 [MINOR][SS] Add some description about auto reset and data loss note to SS doc
### What changes were proposed in this pull request?

This patch adds a few description to SS doc about offset reset and data loss.

### Why are the changes needed?

During recent SS testing, the behavior of gradually reducing input rows was confusing to me. Compared with Flink, I do not see similar behavior. After looking into the code and doing some tests, I feel it is better to add some more description in the SS doc.

### Does this PR introduce _any_ user-facing change?

No, doc only.

### How was this patch tested?

Doc only.

Closes #31089 from viirya/ss-minor-5.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-11 11:47:07 -08:00
Kousuke Saruta 023eba2ad7 [SPARK-33796][DOCS][FOLLOWUP] Tweak the width of left-menu of Spark SQL Guide
### What changes were proposed in this pull request?

This PR tweaks the width of left-menu of Spark SQL Guide.
When I view the Spark SQL Guide with browsers on macOS, the title `Spark SQL Guide` looks pretty.
But I often use Pop!_OS, an Ubuntu variant, and the title is overlapped in browsers on it.
![spark-sql-guide-layout-before](https://user-images.githubusercontent.com/4736016/104002743-d56cc200-51e4-11eb-9e3a-28abcd46e0bf.png)

After this change, the title is no longer overlapped.
![spark-sql-guide-layout-after](https://user-images.githubusercontent.com/4736016/104002847-f9c89e80-51e4-11eb-85c0-01d69cee46b7.png)

### Why are the changes needed?

For the pretty layout.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Built the document with `cd docs && SKIP_API=1 jekyll build` and confirmed the layout.

Closes #31091 from sarutak/modify-layout-sparksql-guide.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-08 09:43:11 -06:00
Gabor Somogyi 71d261ab8f [SPARK-34032][SS] Add truststore and keystore type config possibility for Kafka delegation token
### What changes were proposed in this pull request?
The Kafka delegation token is obtained with `AdminClient`, where security settings can be set. The keystore and truststore types, however, can't be set. In this PR I've added these new configurations. This can be useful when the type is different; a good example is making Spark FIPS compliant, where the default JKS is not accepted.
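
A hedged example of setting the two new parameters; the key names assume the existing `spark.kafka.clusters.<id>.ssl.*` namespace used for delegation-token settings:

```scala
import org.apache.spark.SparkConf

// Example: a FIPS-compliant deployment where JKS stores are not accepted.
val conf = new SparkConf()
  .set("spark.kafka.clusters.cluster1.ssl.truststore.type", "PKCS12")
  .set("spark.kafka.clusters.cluster1.ssl.keystore.type", "PKCS12")
```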

### Why are the changes needed?
Missing configurations.

### Does this PR introduce _any_ user-facing change?
Yes, adding 2 additional config parameters.

### How was this patch tested?
Existing + modified unit tests + simple Kafka to Kafka app on cluster.

Closes #31070 from gaborgsomogyi/SPARK-34032.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2021-01-08 20:04:56 +09:00
angerszhu 9b54da490d [SPARK-33818][SQL][DOC] Add descriptions about spark.sql.parser.quotedRegexColumnNames in the SQL documents
### What changes were proposed in this pull request?
According to https://github.com/apache/spark/pull/30805#issuecomment-747179899,
we should document `spark.sql.parser.quotedRegexColumnNames`, since users need to know about this configuration and it is useful.

![image](https://user-images.githubusercontent.com/46485123/103656543-afa4aa80-4fa3-11eb-8cd3-a9d1b87a3489.png)
![image](https://user-images.githubusercontent.com/46485123/103656551-b2070480-4fa3-11eb-9ce7-95cc424242a6.png)
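
A minimal sketch of what the documented configuration does (run in spark-shell, where `spark` is predefined):

```scala
// When enabled, quoted identifiers in SELECT are treated as regular expressions
// over column names (Hive-compatible behavior).
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")
spark.sql("SELECT `(id|name)` FROM VALUES (1, 'a', true) AS t(id, name, flag)").show()
// Only the columns whose names match the regex are returned: id, name
```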

### Why are the changes needed?
Complete doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed

Closes #30816 from AngersZhuuuu/SPARK-33818.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-07 18:55:27 -08:00
Dongjoon Hyun 5b16d70d6a [SPARK-34044][DOCS] Add spark.sql.hive.metastore.jars.path to sql-data-sources-hive-tables.md
### What changes were proposed in this pull request?

This PR adds new configuration to `sql-data-sources-hive-tables`.

### Why are the changes needed?

SPARK-32852 added a new configuration, `spark.sql.hive.metastore.jars.path`.
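
A sketch of the configuration combination the new doc entry covers (the version and path values are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.hive.metastore.version", "2.3.7")   // placeholder version
  .config("spark.sql.hive.metastore.jars", "path")       // use jars from an explicit path
  .config("spark.sql.hive.metastore.jars.path", "file:///opt/hive-jars/*.jar") // the new config
  .enableHiveSupport()
  .getOrCreate()
```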

### Does this PR introduce _any_ user-facing change?

Yes, but a document only.

### How was this patch tested?

**BEFORE**
![Screen Shot 2021-01-07 at 2 57 57 PM](https://user-images.githubusercontent.com/9700541/103954318-cc9ec200-50f8-11eb-86d3-cd89b07fcd21.png)

**AFTER**
![Screen Shot 2021-01-07 at 2 56 34 PM](https://user-images.githubusercontent.com/9700541/103954221-9d885080-50f8-11eb-8938-fb91394a33cb.png)

Closes #31085 from dongjoon-hyun/SPARK-34044.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-08 09:34:40 +09:00
HyukjinKwon aa388cf3d0 [SPARK-34041][PYTHON][DOCS] Miscellaneous cleanup for new PySpark documentation
### What changes were proposed in this pull request?

This PR proposes to:
- Add a link of quick start in PySpark docs into "Programming Guides" in Spark main docs
- `ML` / `MLlib` -> `MLlib (DataFrame-based)` / `MLlib (RDD-based)` in API reference page
- Mention other user guides as well, such as [ML](http://spark.apache.org/docs/latest/ml-guide.html) and [SQL](http://spark.apache.org/docs/latest/sql-programming-guide.html).
- Mention other migration guides as well, because PySpark can be affected by them.

### Why are the changes needed?

For better documentation.

### Does this PR introduce _any_ user-facing change?

It fixes user-facing docs. However, they are not released yet.

### How was this patch tested?

Manually tested by running:

```bash
cd docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch
```

Closes #31082 from HyukjinKwon/SPARK-34041.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-08 09:28:31 +09:00
Dongjoon Hyun 9b5df2afaa [SPARK-34036][DOCS] Update ORC data source documentation
### What changes were proposed in this pull request?

This PR aims to update SQL documentation about ORC data sources.

The new structure looks like the following; a small usage sketch follows the list.
- ORC Implementation
- Vectorized Reader
- Schema Merging
- Zstandard
- Bloom Filters
- Columnar Encryption
- Hive metastore ORC table conversion
- Configuration
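
For illustration, a small sketch touching two of the sections above, Zstandard and Bloom Filters (run in spark-shell; the option spellings follow the ORC data source options and should be treated as assumptions):

```scala
spark.range(1000).write
  .format("orc")
  .option("compression", "zstd")             // Zstandard codec
  .option("orc.bloom.filter.columns", "id")  // build a bloom filter on the id column
  .save("/tmp/orc_zstd_bloom_demo")          // placeholder path
```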

### Why are the changes needed?

This document is not up-to-date. Apache Spark 3.2.0 can utilize new improvements from Apache ORC 1.6.6.

### Does this PR introduce _any_ user-facing change?

No, this is documentation only.

### How was this patch tested?

Manual.
```
SKIP_API=1 jekyll build
```

---
**BEFORE**
![Screen Shot 2021-01-06 at 5 08 19 PM](https://user-images.githubusercontent.com/9700541/103838399-d0bbd880-5041-11eb-8757-297728d2793f.png)

---
**AFTER**
![Screen Shot 2021-01-06 at 7 03 38 PM](https://user-images.githubusercontent.com/9700541/103845972-0963ae00-5052-11eb-905e-8e8b335c760a.png)
![Screen Shot 2021-01-06 at 7 03 49 PM](https://user-images.githubusercontent.com/9700541/103845971-08cb1780-5052-11eb-9b2a-d3acfa4b9278.png)
![Screen Shot 2021-01-06 at 7 03 59 PM](https://user-images.githubusercontent.com/9700541/103845970-08328100-5052-11eb-8982-7079fd7b0efc.png)
![Screen Shot 2021-01-06 at 7 04 10 PM](https://user-images.githubusercontent.com/9700541/103845968-08328100-5052-11eb-9ef5-db99c7cc64d3.png)
![Screen Shot 2021-01-06 at 7 04 16 PM](https://user-images.githubusercontent.com/9700541/103845963-07015400-5052-11eb-955f-8126d417e8aa.png)

Closes #31075 from dongjoon-hyun/SPARK-34036.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-06 20:19:16 -08:00
angerszhu 3cdc4ef5b4 [SPARK-32685][SQL][FOLLOW-UP] Update migration guide about change default filed.delim to '\t' when user specifies serde
### What changes were proposed in this pull request?
Update migration guide according to https://github.com/apache/spark/pull/30942#issuecomment-755054562

### Why are the changes needed?
update migration guide.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed

Closes #31051 from AngersZhuuuu/SPARK-32685-FOLLOW-UP.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-06 13:45:48 +00:00
gengjiaan 6788304240 [SPARK-33977][SQL][DOCS] Add doc for "'like any' and 'like all' operators"
### What changes were proposed in this pull request?
Add doc for 'like any' and 'like all' operators in sql-ref-syntax-qry-select-like.md

### Why are the changes needed?
make the usage of 'like any' and 'like all' known to more users
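
A short sketch of the documented operators (run in spark-shell; the sample data is made up):

```scala
// LIKE ANY: the row matches if at least one pattern matches.
spark.sql("""
  SELECT * FROM VALUES ('spark'), ('hive'), ('kafka') AS t(name)
  WHERE name LIKE ANY ('%ark%', '%ive%')
""").show()  // returns 'spark' and 'hive'

// LIKE ALL: the row matches only if every pattern matches.
spark.sql("""
  SELECT * FROM VALUES ('spark'), ('sparkle'), ('scala') AS t(name)
  WHERE name LIKE ALL ('spark%', '%k%')
""").show()  // returns 'spark' and 'sparkle'
```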

### Does this PR introduce _any_ user-facing change?
Yes.

<img width="687" alt="Screen Shot 2021-01-06 at 21 10 38" src="https://user-images.githubusercontent.com/692303/103767385-dc1ffb80-5063-11eb-9529-89479531425f.png">
<img width="495" alt="Screen Shot 2021-01-06 at 21 11 06" src="https://user-images.githubusercontent.com/692303/103767391-dde9bf00-5063-11eb-82ce-63bdd11593a1.png">
<img width="406" alt="Screen Shot 2021-01-06 at 21 11 20" src="https://user-images.githubusercontent.com/692303/103767396-df1aec00-5063-11eb-8e81-a192e6c72431.png">

### How was this patch tested?
No tests

Closes #31008 from beliefer/SPARK-33977.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-01-06 21:14:45 +09:00
Chongguang LIU 976e97a80d [SPARK-33794][SQL] NextDay expression throw runtime IllegalArgumentException when receiving invalid input under ANSI mode
### What changes were proposed in this pull request?

Instead of returning NULL, the next_day function now throws a runtime IllegalArgumentException when ANSI mode is enabled and it receives invalid input for the dayOfWeek parameter.

### Why are the changes needed?

For ANSI mode compliance.

### Does this PR introduce _any_ user-facing change?

Yes.
When spark.sql.ansi.enabled = true, the next_day function will throw IllegalArgumentException when receiving invalid input of the dayOfWeek parameter.
When spark.sql.ansi.enabled = false, same behaviour as before.
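
A sketch of the behavior difference (run in spark-shell):

```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
// Invalid dayOfWeek now fails at runtime with IllegalArgumentException:
spark.sql("SELECT next_day(date'2021-01-05', 'invalid_day')").show()

spark.conf.set("spark.sql.ansi.enabled", "false")
// Previous behavior is preserved: invalid input yields NULL.
spark.sql("SELECT next_day(date'2021-01-05', 'invalid_day')").show()
```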

### How was this patch tested?

Ansi mode is tested with existing tests.
End-to-end tests have been added.

Closes #30807 from chongguang/SPARK-33794.

Authored-by: Chongguang LIU <chongguang.liu@laposte.fr>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-05 05:20:16 +00:00
Pradyumn Agrawal (pradyumn.ag) 13e8c28409 [SPARK-33942][DOCS] Remove hiveClientCalls.count in CodeGenerator metrics docs
### What changes were proposed in this pull request?
Removed the **hiveClientCalls.count** entry from the CodeGenerator metrics under component instance = Executor.

### Why are the changes needed?
Wrong information regarding metrics was being displayed in the monitoring documentation. I had referred to this documentation when adding metrics logging in Graphite. This metric was not being reported, so I had to check whether the issue was at my application's end, in the Spark code, or in the documentation. The documentation had the wrong info.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual, checked it on my forked repository feature branch [SPARK-33942](https://github.com/coderbond007/spark/blob/SPARK-33942/docs/monitoring.md)

Closes #30976 from coderbond007/SPARK-33942.

Authored-by: Pradyumn Agrawal (pradyumn.ag) <pradyumn.ag@media.net>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-30 17:25:46 -08:00
gengjiaan ba974ea8e4 [SPARK-30789][SQL] Support (IGNORE | RESPECT) NULLS for LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE
### What changes were proposed in this pull request?
All of `LEAD`/`LAG`/`NTH_VALUE`/`FIRST_VALUE`/`LAST_VALUE` should support IGNORE NULLS | RESPECT NULLS. For example:
```
LEAD (value_expr [, offset ])
[ IGNORE NULLS | RESPECT NULLS ]
OVER ( [ PARTITION BY window_partition ] ORDER BY window_ordering )
```

```
LAG (value_expr [, offset ])
[ IGNORE NULLS | RESPECT NULLS ]
OVER ( [ PARTITION BY window_partition ] ORDER BY window_ordering )
```

```
NTH_VALUE (expr, offset)
[ IGNORE NULLS | RESPECT NULLS ]
OVER
( [ PARTITION BY window_partition ]
[ ORDER BY window_ordering
 frame_clause ] )
```
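
A minimal usage sketch of the new syntax (run in spark-shell; the sample data is made up):

```scala
spark.sql("""
  SELECT id, v,
         LAST_VALUE(v) IGNORE NULLS OVER (
           ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
         ) AS last_non_null
  FROM VALUES (1, 10), (2, NULL), (3, 30), (4, NULL) AS t(id, v)
""").show()
// last_non_null carries the most recent non-null v forward: 10, 10, 30, 30
```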

Mainstream databases and engines that support this syntax include:
**Oracle**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/NTH_VALUE.html#GUID-F8A0E88C-67E5-4AA6-9515-95D03A7F9EA0

**Redshift**
https://docs.aws.amazon.com/redshift/latest/dg/r_WF_NTH.html

**Presto**
https://prestodb.io/docs/current/functions/window.html

**DB2**
https://www.ibm.com/support/knowledgecenter/SSGU8G_14.1.0/com.ibm.sqls.doc/ids_sqs_1513.htm

**Teradata**
https://docs.teradata.com/r/756LNiPSFdY~4JcCCcR5Cw/GjCT6l7trjkIEjt~7Dhx4w

**Snowflake**
https://docs.snowflake.com/en/sql-reference/functions/lead.html
https://docs.snowflake.com/en/sql-reference/functions/lag.html
https://docs.snowflake.com/en/sql-reference/functions/nth_value.html
https://docs.snowflake.com/en/sql-reference/functions/first_value.html
https://docs.snowflake.com/en/sql-reference/functions/last_value.html

**Exasol**
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/lead.htm
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/lag.htm
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/nth_value.htm
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/first_value.htm
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/last_value.htm

### Why are the changes needed?
Supporting `(IGNORE | RESPECT) NULLS` for `LEAD`/`LAG`/`NTH_VALUE`/`FIRST_VALUE`/`LAST_VALUE` is very useful.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Jenkins test

Closes #30943 from beliefer/SPARK-30789.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-30 13:14:31 +00:00
angerszhu aadda4b561 [SPARK-33930][SQL] Script Transform default FIELD DELIMIT should be \u0001 for no serde
### What changes were proposed in this pull request?
For same SQL
```
SELECT TRANSFORM(a, b, c, null)
ROW FORMAT DELIMITED
USING 'cat'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '&'
FROM (select 1 as a, 2 as b, 3  as c) t
```
In hive:
```
hive> SELECT TRANSFORM(a, b, c, null)
    > ROW FORMAT DELIMITED
    > USING 'cat'
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '&'
    > FROM (select 1 as a, 2 as b, 3  as c) t;
OK
123\N	NULL
Time taken: 14.519 seconds, Fetched: 1 row(s)
hive> packet_write_wait: Connection to 10.191.58.100 port 32200: Broken pipe
```

In Spark
```
Spark master: local[*], Application Id: local-1609225830376
spark-sql> SELECT TRANSFORM(a, b, c, null)
         > ROW FORMAT DELIMITED
         > USING 'cat'
         > ROW FORMAT DELIMITED
         > FIELDS TERMINATED BY '&'
         > FROM (select 1 as a, 2 as b, 3  as c) t;
1	2	3	null	NULL
Time taken: 4.297 seconds, Fetched 1 row(s)
spark-sql>
```
We should keep the behavior the same, so change the default ROW FORMAT FIELD DELIMITER to `\u0001`.

In Hive the default value is '1', which as a char is '\u0001':
```
bucket_count -1
column.name.delimiter ,
columns
columns.comments
columns.types
file.inputformat org.apache.hadoop.hive.ql.io.NullRowsInputFormat
```

### Why are the changes needed?
Keep the same behavior as Hive.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #30958 from AngersZhuuuu/SPARK-33930.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-29 23:26:27 +09:00
Max Gekk b2dfeae18b [SPARK-33911][SQL][DOCS] Update the SQL migration guide about changes in HiveClientImpl
### What changes were proposed in this pull request?
Update the SQL migration guide about the changes made by:
- https://github.com/apache/spark/pull/30778
- https://github.com/apache/spark/pull/30711
- https://github.com/apache/spark/pull/30866

### Why are the changes needed?
To inform users about the recent changes in the upcoming releases.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
N/A

Closes #30925 from MaxGekk/sql-migr-guide-hiveclientimpl.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-27 17:57:42 +09:00
angerszhu 10b6466e91 [SPARK-33084][CORE][SQL] Add jar support ivy path
### What changes were proposed in this pull request?
Support ADD JAR with an Ivy path.

### Why are the changes needed?
Since application submission already supports Ivy coordinates, ADD JAR can now support Ivy as well.

### Does this PR introduce _any_ user-facing change?
Users can add a jar with SQL like:
```
add jar ivy://group:artifact:version?exclude=xxx,xxx&transitive=true
add jar ivy://group:artifact:version?exclude=xxx,xxx&transitive=false
```

core api
```
sparkContext.addJar("ivy:://group:artifict:version?exclude=xxx,xxx&transitive=true")
sparkContext.addJar("ivy:://group:artifict:version?exclude=xxx,xxx&transitive=false")
```

#### Doc Update snapshot
![image](https://user-images.githubusercontent.com/46485123/101227738-de451200-36d3-11eb-813d-78a8b879da4f.png)

### How was this patch tested?
Added UT

Closes #29966 from AngersZhuuuu/support-add-jar-ivy.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-12-25 09:07:48 +09:00
Kent Yao 368a2c341d [SPARK-33877][SQL][FOLLOWUP] SQL reference documents for INSERT w/ a column list
### What changes were proposed in this pull request?

followup of a3dd8dacee via suggestion https://github.com/apache/spark/pull/30888#discussion_r547822642

### Why are the changes needed?

doc improvement

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

passing GA doc

Closes #30909 from yaooqinn/SPARK-33877-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-23 15:38:32 -08:00
Dongjoon Hyun 47d1aa4e93 [SPARK-33891][DOCS][CORE] Update dynamic allocation related documents
### What changes were proposed in this pull request?

This PR aims to update the following (a configuration sketch follows the list):
- Remove the outdated requirement for `spark.shuffle.service.enabled` in `configuration.md`
- Dynamic allocation section in `job-scheduling.md`
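
For context, a sketch of dynamic allocation without the external shuffle service, the combination the removed requirement used to forbid; treating `spark.dynamicAllocation.shuffleTracking.enabled` as the alternative mechanism is my reading of the change:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  // Track shuffle files on executors instead of relying on the external shuffle service:
  .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")
```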

### Why are the changes needed?

To make the document up-to-date.

### Does this PR introduce _any_ user-facing change?

No, it's a documentation update.

### How was this patch tested?

Manual.

**BEFORE**
![Screen Shot 2020-12-23 at 2 22 04 AM](https://user-images.githubusercontent.com/9700541/102986441-ae647f80-44c5-11eb-97a3-87c2d368952a.png)
![Screen Shot 2020-12-23 at 2 22 34 AM](https://user-images.githubusercontent.com/9700541/102986473-bcb29b80-44c5-11eb-8eae-6802001c6dfa.png)

**AFTER**
![Screen Shot 2020-12-23 at 2 25 36 AM](https://user-images.githubusercontent.com/9700541/102986767-2df24e80-44c6-11eb-8540-e74856a4c313.png)
![Screen Shot 2020-12-23 at 2 21 13 AM](https://user-images.githubusercontent.com/9700541/102986366-8e34c080-44c5-11eb-8054-1efd07c9458c.png)

Closes #30906 from dongjoon-hyun/SPARK-33891.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-23 23:43:21 +09:00
HyukjinKwon d98c216e19 [SPARK-31960][YARN][DOCS][FOLLOW-UP] Document the behaviour change of Hadoop's classpath propagation in migration guide
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/28788, and proposes to update migration guide.

### Why are the changes needed?

To tell users about the behaviour change.

### Does this PR introduce _any_ user-facing change?

Yes, it updates migration guides for users.

### How was this patch tested?

GitHub Actions' documentation build should test it.

Closes #30903 from HyukjinKwon/SPARK-31960-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-23 18:04:28 +09:00
Dongjoon Hyun 90d6f86001 [SPARK-33870][CORE] Enable spark.storage.replication.proactive by default
### What changes were proposed in this pull request?

This PR aims to enable `spark.storage.replication.proactive` by default for Apache Spark 3.2.0.

### Why are the changes needed?

`spark.storage.replication.proactive` was added by SPARK-15355 in Apache Spark 2.2.0 and has been helpful when block manager loss occurs frequently, as in K8s environments.

### Does this PR introduce _any_ user-facing change?

Yes, this will make the Spark jobs more robust.
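
For users who prefer the old behavior, a one-line sketch to opt out of the new default:

```scala
import org.apache.spark.SparkConf

// Restore the pre-3.2.0 default of non-proactive block replication.
val conf = new SparkConf().set("spark.storage.replication.proactive", "false")
```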

### How was this patch tested?

Pass the existing UTs.

Closes #30876 from dongjoon-hyun/SPARK-33870.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-22 21:59:53 -08:00
Kent Yao a3dd8dacee [SPARK-33877][SQL] SQL reference documents for INSERT w/ a column list
We have supported a column list in INSERT statements since Spark v3.1.0 (see SPARK-32976, https://github.com/apache/spark/pull/29893). So, this PR aims to document it in the SQL reference.
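
A short sketch of the documented feature (run in spark-shell; the table and values are made up):

```scala
spark.sql("CREATE TABLE demo (id INT, name STRING, age INT) USING parquet")
// The column list lets the VALUES be given in an order different from the schema:
spark.sql("INSERT INTO demo (name, id, age) VALUES ('spark', 1, 30)")
spark.sql("SELECT * FROM demo").show()
```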

### What changes were proposed in this pull request?

improve doc
### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

doc
### How was this patch tested?

passing GA doc gen.

![image](https://user-images.githubusercontent.com/8326978/102954876-8994fa00-450f-11eb-81f9-931af6d1f69b.png)
![image](https://user-images.githubusercontent.com/8326978/102954900-99acd980-450f-11eb-9733-115ad37d2319.png)

![image](https://user-images.githubusercontent.com/8326978/102954935-af220380-450f-11eb-9aaa-fdae0725d41e.png)
![image](https://user-images.githubusercontent.com/8326978/102954949-bc3ef280-450f-11eb-8a0d-d7b688efa7bb.png)

Closes #30888 from yaooqinn/SPARK-33877.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-22 19:46:37 -08:00
ulysses-you bc46d273e0 [SPARK-33840][DOCS] Add spark.sql.files.minPartitionNum to performance tuning doc
### What changes were proposed in this pull request?

Add `spark.sql.files.minPartitionNum` and its description to sql-performance-tuning.md.
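
A sketch of the documented config in use (run in spark-shell; the path is a placeholder):

```scala
// A suggested (not guaranteed) minimum number of partitions for file-based scans.
spark.conf.set("spark.sql.files.minPartitionNum", "200")
val df = spark.read.parquet("/path/to/data")
println(df.rdd.getNumPartitions)  // typically at least the suggested minimum
```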

### Why are the changes needed?

Help users find it.

### Does this PR introduce _any_ user-facing change?

Yes, it's the doc.

### How was this patch tested?

Pass CI.

Closes #30838 from ulysses-you/SPARK-33840.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-18 20:27:16 +09:00
Liang-Chi Hsieh 42e1831ebb [SPARK-33797][SS][DOCS] Update SS doc about State Store and task locality
### What changes were proposed in this pull request?

This updates SS documentation to document about State Store and task locality.

### Why are the changes needed?

While running some tests for Structured Streaming, I found that State Store locality sometimes becomes an issue, and it is not very straightforward for end users. It'd be great if we can document it.
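
For context, a sketch of one scheduler knob relevant to the locality behavior described here; that `spark.locality.wait` is the knob the added doc text discusses is my assumption:

```scala
import org.apache.spark.SparkConf

// Lowering the locality wait lets stateful tasks be scheduled on other executors
// sooner, at the cost of reloading state store data from the checkpoint location.
val conf = new SparkConf().set("spark.locality.wait", "1s")
```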

### Does this PR introduce _any_ user-facing change?

No, only doc change.

### How was this patch tested?

No, only doc change.

Closes #30789 from viirya/ss-statestore-doc.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2020-12-18 10:48:51 +09:00
Gengliang Wang dd042f58e7 [SPARK-33796][DOCS] Show hidden text from the left menu of Spark Doc
### What changes were proposed in this pull request?

If the text in the left menu of the Spark docs is too long, it gets hidden.
![sql1](https://user-images.githubusercontent.com/1097932/102249583-5ae7a580-3eb7-11eb-813c-f2e2fe019d28.jpeg)

This PR is to fix the style issue.

### Why are the changes needed?

Improve the UI of Spark documentation.

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

Manual test
After changes:
![sql2](https://user-images.githubusercontent.com/1097932/102249603-5fac5980-3eb7-11eb-806d-4e7b8248e6b6.jpeg)

Closes #30786 from gengliangwang/fixDocStyle.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-16 10:07:35 +09:00
David McWhorter 87c58367cd [SPARK-22256][MESOS] Introduce spark.mesos.driver.memoryOverhead
### What changes were proposed in this pull request?
This is a simple change to support allocating a specified amount of overhead memory for the driver's Mesos container. This is already supported for executors.
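
A sketch of the new option next to its executor counterpart; the values mirror the executor setting and are assumed to be in MiB:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.mesos.executor.memoryOverhead", "512") // existing executor-side option
  .set("spark.mesos.driver.memoryOverhead", "512")   // new driver-side option from this PR
```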

### Why are the changes needed?
This is needed to keep the driver process from exceeding memory limits and being killed off when running on Mesos.

### Does this PR introduce _any_ user-facing change?
Yes, it adds a `spark.mesos.driver.memoryOverhead` configuration option.  Documentation changes for this option are included in the PR.

### How was this patch tested?
Test cases covering allocation of driver memory overhead are included in the changes.

### Other notes
This is a second attempt to get this change reviewed, accepted and merged.  The original pull request was closed as stale back in January: https://github.com/apache/spark/pull/21006.
For this pull request, I took the original change by pmackles, rebased it onto the current master branch, and added a test case that was requested in the original code review.
I'm happy to make any further edits or do anything needed so that this can be included in a future Spark release. I keep having to build custom Spark distributions so that we can use Spark within our Mesos clusters.

Closes #30739 from dmcwhorter/dmcwhorter-SPARK-22256.

Lead-authored-by: David McWhorter <david_mcwhorter@premierinc.com>
Co-authored-by: Paul Mackles <pmackles@adobe.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-15 14:00:38 -08:00
HyukjinKwon a99a47ca1d [SPARK-33748][K8S] Respect environment variables and configurations for Python executables
### What changes were proposed in this pull request?

This PR proposes:

- Respect `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables, or `spark.pyspark.python` and `spark.pyspark.driver.python` configurations, in Kubernetes just like other cluster types in Spark.

- Deprecate `spark.kubernetes.pyspark.pythonVersion` and guide users to set the environment variables and configurations for Python executables.
    NOTE that `spark.kubernetes.pyspark.pythonVersion` is already a no-op configuration without this PR. Default is `3` and other values are disallowed.

- In order for Python executable settings to be used consistently, fix the `spark.archives` option to unpack into the current working directory in the driver in Kubernetes cluster mode. This behaviour is identical to YARN cluster mode. By doing this, users can leverage Conda or virtualenv in cluster mode as below:

   ```python
    conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack
    conda activate pyspark_conda_env
    conda pack -f -o pyspark_conda_env.tar.gz
    PYSPARK_PYTHON=./environment/bin/python spark-submit --archives pyspark_conda_env.tar.gz#environment app.py
   ```

- Removed several unused or useless code paths such as `extractS3Key` and `renameResourcesToLocalFS`

### Why are the changes needed?

- To provide consistent support of PySpark by using `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables, or `spark.pyspark.python` and `spark.pyspark.driver.python` configurations.
- To provide Conda and virtualenv support via `spark.archives` options.

### Does this PR introduce _any_ user-facing change?

Yes:

- `spark.kubernetes.pyspark.pythonVersion` is deprecated.
- `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables, and `spark.pyspark.python` and `spark.pyspark.driver.python` configurations are respected.

### How was this patch tested?

Manually tested via:

```bash
minikube delete
minikube start --cpus 12 --memory 16384
kubectl create namespace spark-integration-test
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark-integration-test
EOF
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=spark-integration-test:spark --namespace=spark-integration-test
dev/make-distribution.sh --pip --tgz -Pkubernetes
resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh --spark-tgz `pwd`/spark-3.2.0-SNAPSHOT-bin-3.2.0.tgz  --service-account spark --namespace spark-integration-test
```

Unittests were also added.

Closes #30735 from HyukjinKwon/SPARK-33748.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-15 08:56:45 +09:00
Linhong Liu b7c8210135 [SPARK-33142][SPARK-33647][SQL][FOLLOW-UP] Add docs and test cases
### What changes were proposed in this pull request?
Addressed comments in PR #30567, including:
1. add test case for SPARK-33647 and SPARK-33142
2. add migration guide
3. add `getRawTempView` and `getRawGlobalTempView` to return the raw view info (i.e. TemporaryViewRelation)
4. other minor code cleanup

### Why are the changes needed?
Code cleanup and more test cases

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing and newly added test cases

Closes #30666 from linhongliu-db/SPARK-33142-followup.

Lead-authored-by: Linhong Liu <linhong.liu@databricks.com>
Co-authored-by: Linhong Liu <67896261+linhongliu-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-14 08:31:50 +00:00
Gengliang Wang 6e862792fb [SPARK-33723][SQL] ANSI mode: Casting String to Date should throw exception on parse error
### What changes were proposed in this pull request?

Currently, when casting a string to the timestamp type in ANSI mode, Spark throws a runtime exception on a parsing error.
However, casting a string to a date currently returns null on a parsing error. We should throw an exception in that case as well.
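
A sketch of the behavior after this change (run in spark-shell):

```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
// Now throws a runtime exception on the unparseable date instead of returning NULL:
spark.sql("SELECT CAST('2021-13-99' AS DATE)").show()

spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('2021-13-99' AS DATE)").show()  // NULL, as before
```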

### Why are the changes needed?

Add missing feature for ANSI mode

### Does this PR introduce _any_ user-facing change?

Yes for ANSI mode, Casting string to date will throw an exception on parsing error

### How was this patch tested?

Unit test

Closes #30687 from gengliangwang/castDate.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-14 10:22:37 +09:00
Takeshi Yamamuro 8197ee3b15 [SPARK-33690][SQL] Escape meta-characters in showString
### What changes were proposed in this pull request?

This PR intends to escape meta-characters (e.g., \n and \t) in `Dataset.showString`.
Before this PR:
```
scala> Seq("aaa\nbbb\t\tccccc").toDF("value").show()
+--------------+
|         value|
+--------------+
|aaa
bbb		ccccc|
+--------------+
```
After this PR:
```
+-----------------+
|            value|
+-----------------+
|aaa\nbbb\t\tccccc|
+-----------------+
```

### Why are the changes needed?

For better output.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a unit test.

Closes #30647 from maropu/EscapeMetaInShow.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-13 15:04:23 -08:00
Gengliang Wang 9959d49942 [SPARK-33719][DOC] Add make_date/make_timestamp/make_interval into the doc of ANSI Compliance
### What changes were proposed in this pull request?

Add make_date/make_timestamp/make_interval into the doc of ANSI Compliance

### Why are the changes needed?

This lets users know that these functions throw runtime exceptions under ANSI mode if the result is not valid.
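
A sketch of the documented behavior (run in spark-shell):

```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT make_date(2021, 2, 30)").show()  // invalid day-of-month: runtime error

spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT make_date(2021, 2, 30)").show()  // NULL instead of an error
```
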
### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Build doc and check it in browser:
![image](https://user-images.githubusercontent.com/1097932/101608930-34a79e80-39bb-11eb-9294-9d9b8c3f6faa.png)

Closes #30683 from gengliangwang/improveDoc.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-09 19:47:20 +09:00
Kent Yao c88eddac3b [SPARK-33641][SQL][DOC][FOLLOW-UP] Add migration guide for CHAR VARCHAR types
### What changes were proposed in this pull request?

Add migration guide for CHAR VARCHAR types
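
For context, a sketch of the semantics the migration note is about, assuming the 3.1 behavior where `CHAR(n)` values are right-padded to length n and `VARCHAR(n)` enforces a length limit (run in spark-shell):

```scala
spark.sql("CREATE TABLE chars_demo (c CHAR(5), v VARCHAR(5)) USING parquet")
spark.sql("INSERT INTO chars_demo VALUES ('ab', 'ab')")
spark.sql("SELECT length(c), length(v) FROM chars_demo").show()
// c is padded to the declared length (5); v keeps its actual length (2)
```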

### Why are the changes needed?

for migration

### Does this PR introduce _any_ user-facing change?

doc change

### How was this patch tested?

passing ci

Closes #30654 from yaooqinn/SPARK-33641-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-09 06:44:10 +00:00
Dongjoon Hyun 031c5ef280 [SPARK-33679][SQL] Enable spark.sql.adaptive.enabled by default
### What changes were proposed in this pull request?

This PR aims to enable `spark.sql.adaptive.enabled` by default for Apache Spark **3.2.0**.

### Why are the changes needed?

By switching the default for Apache Spark 3.2, the whole community can focus more seriously on stabilizing this feature in various situations.
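
With the default flipped, a one-line sketch for sessions that still need the old behavior:

```scala
// Disable adaptive query execution for the current session if it causes trouble.
spark.conf.set("spark.sql.adaptive.enabled", "false")
```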

### Does this PR introduce _any_ user-facing change?

Yes, but this is an improvement and it's supposed to have no bugs.

### How was this patch tested?

Pass the CIs.

Closes #30628 from dongjoon-hyun/SPARK-33679.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-07 23:10:35 -08:00
Dongjoon Hyun de9818f043 [SPARK-33662][BUILD] Setting version to 3.2.0-SNAPSHOT
### What changes were proposed in this pull request?

This PR aims to update `master` branch version to 3.2.0-SNAPSHOT.

### Why are the changes needed?

Start to prepare Apache Spark 3.2.0.

### Does this PR introduce _any_ user-facing change?

N/A.

### How was this patch tested?

Pass the CIs.

Closes #30606 from dongjoon-hyun/SPARK-3.2.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-04 14:10:42 -08:00
german d671e053e9 [SPARK-33660][DOCS][SS] Fix Kafka Headers Documentation
### What changes were proposed in this pull request?

Update the Kafka headers documentation; the type is no longer a map but an array.

[jira](https://issues.apache.org/jira/browse/SPARK-33660)
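
A sketch reflecting the corrected type (run in spark-shell with the Kafka connector; the broker address is a placeholder):

```scala
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092") // placeholder address
  .option("subscribe", "topic1")
  .option("includeHeaders", "true")
  .load()
// The headers column is array<struct<key: string, value: binary>>; an array,
// not a map, so repeated header keys are preserved.
df.select("key", "value", "headers").printSchema()
```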

### Why are the changes needed?
To help users

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

It is only documentation

Closes #30605 from Gschiavon/SPARK-33660-fix-kafka-headers-documentation.

Authored-by: german <germanschiavon@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-12-05 06:51:54 +09:00
HyukjinKwon 990bee9c58 [SPARK-33615][K8S] Make 'spark.archives' working in Kubernetes
### What changes were proposed in this pull request?

This PR proposes to make the `spark.archives` configuration work in Kubernetes.
It works without a problem in a standalone cluster, but there seems to be a bug in Kubernetes.
It fails to fetch the file on the driver side as below:

```
20/12/03 13:33:53 INFO SparkContext: Added JAR file:/tmp/spark-75004286-c83a-4369-b624-14c5d2d2a748/spark-examples_2.12-3.1.0-SNAPSHOT.jar at spark://spark-test-app-48ae737628cee6f8-driver-svc.spark-integration-test.svc:7078/jars/spark-examples_2.12-3.1.0-SNAPSHOT.jar with timestamp 1607002432558
20/12/03 13:33:53 INFO SparkContext: Added archive file:///tmp/tmp4542734800151332666.txt.tar.gz#test_tar_gz at spark://spark-test-app-48ae737628cee6f8-driver-svc.spark-integration-test.svc:7078/files/tmp4542734800151332666.txt.tar.gz with timestamp 1607002432558
20/12/03 13:33:53 INFO TransportClientFactory: Successfully created connection to spark-test-app-48ae737628cee6f8-driver-svc.spark-integration-test.svc/172.17.0.4:7078 after 83 ms (47 ms spent in bootstraps)
20/12/03 13:33:53 INFO Utils: Fetching spark://spark-test-app-48ae737628cee6f8-driver-svc.spark-integration-test.svc:7078/files/tmp4542734800151332666.txt.tar.gz to /tmp/spark-66573e24-27a3-427c-99f4-36f06d9e9cd5/fetchFileTemp2665785666227461849.tmp
20/12/03 13:33:53 ERROR SparkContext: Error initializing SparkContext.
java.lang.RuntimeException: Stream '/files/tmp4542734800151332666.txt.tar.gz' was not found.
	at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:242)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:142)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
```

This is because `spark.archives` was not actually added on the driver side correctly. The changes here fix it by adding and resolving URIs correctly.

### Why are the changes needed?

The `spark.archives` feature can be leveraged for many things such as Conda support. We should make it work in Kubernetes as well.
This is a bug fix too.

### Does this PR introduce _any_ user-facing change?

No, this feature is not out yet.

### How was this patch tested?

I manually tested with Minikube 1.15.1. For an environment issue (?), I had to use a custom namespace, service account and roles. `default` service account does not work for me and complains it doesn't have permissions to get/list pods, etc.

```bash
minikube delete
minikube start --cpus 12 --memory 16384
kubectl create namespace spark-integration-test
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark-integration-test
EOF
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=spark-integration-test:spark --namespace=spark-integration-test
dev/make-distribution.sh --pip --tgz -Pkubernetes
resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh --spark-tgz `pwd`/spark-3.1.0-SNAPSHOT-bin-3.2.0.tgz  --service-account spark --namespace spark-integration-test
```

Closes #30581 from HyukjinKwon/SPARK-33615.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-04 19:37:03 +09:00