Commit graph

840 commits

Author SHA1 Message Date
Gengliang Wang 0c57bb8f7f Preparing development version 3.2.1-SNAPSHOT 2021-09-27 08:24:50 +00:00
Gengliang Wang 49aea14c5a Preparing Spark release v3.2.0-rc5 2021-09-27 08:24:44 +00:00
Gengliang Wang 2348cce37e Preparing development version 3.2.1-SNAPSHOT 2021-09-26 12:28:46 +00:00
Gengliang Wang 2ed8c08c5b Preparing Spark release v3.2.0-rc5 2021-09-26 12:28:40 +00:00
Gengliang Wang da722d43cb Preparing development version 3.2.1-SNAPSHOT 2021-09-24 10:03:23 +00:00
Gengliang Wang 9e35703211 Preparing Spark release v3.2.0-rc5 2021-09-24 10:03:16 +00:00
Gengliang Wang 0fb7127f85 Preparing development version 3.2.1-SNAPSHOT 2021-09-23 08:46:28 +00:00
Gengliang Wang b609f2fe0c Preparing Spark release v3.2.0-rc4 2021-09-23 08:46:22 +00:00
Gengliang Wang b0249851f6 Preparing development version 3.2.1-SNAPSHOT 2021-09-18 11:30:12 +00:00
Gengliang Wang 96044e9735 Preparing Spark release v3.2.0-rc3 2021-09-18 11:30:06 +00:00
Hyukjin Kwon e9f2e34261 [SPARK-36631][R] Ask users if they want to download and install SparkR in non Spark scripts
### What changes were proposed in this pull request?

This PR proposes to ask users if they want to download and install SparkR when they install SparkR from CRAN.

The `SPARKR_ASK_INSTALLATION` environment variable was added in case other notebook projects are affected.
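As a minimal sketch, an environment that needs the previous non-interactive behaviour could opt out before creating the session (the exact accepted value is an assumption; the PR only names the variable):

```r
# Hypothetical opt-out: skip the interactive prompt
Sys.setenv(SPARKR_ASK_INSTALLATION = "FALSE")

library(SparkR)
sparkR.session(master = "local")  # falls back to the previous non-interactive behaviour
```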

### Why are the changes needed?

This is required for CRAN. Currently SparkR is removed: https://cran.r-project.org/web/packages/SparkR/index.html.
See also https://lists.apache.org/thread.html/r02b9046273a518e347dfe85f864d23d63d3502c6c1edd33df17a3b86%40%3Cdev.spark.apache.org%3E

### Does this PR introduce _any_ user-facing change?

Yes, `sparkR.session(...)` will ask whether users want to download and install the Spark package when they are in the plain R shell or `Rscript`.

### How was this patch tested?

**R shell**

Valid input (`n`):

```
> sparkR.session(master="local")
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): n
```
```
Error in sparkCheckInstall(sparkHome, master, deployMode) :
  Please make sure Spark package is installed in this machine.
- If there is one, set the path in sparkHome parameter or environment variable SPARK_HOME.
- If not, you may run install.spark function to do the job.
```

Invalid input:

```
> sparkR.session(master="local")
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): abc
```
```
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n):
```

Valid input (`y`):

```
> sparkR.session(master="local")
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): y
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://ftp.riken.jp/net/apache/spark
Downloading spark-3.3.0 for Hadoop 2.7 from:
- https://ftp.riken.jp/net/apache/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz
trying URL 'https://ftp.riken.jp/net/apache/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz'
...
```

**Rscript**

```
cat tmp.R
```
```
library(SparkR, lib.loc = c(file.path(".", "R", "lib")))
sparkR.session(master="local")
```

```
Rscript tmp.R
```

Valid input (`n`):

```
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): n
```
```
Error in sparkCheckInstall(sparkHome, master, deployMode) :
  Please make sure Spark package is installed in this machine.
- If there is one, set the path in sparkHome parameter or environment variable SPARK_HOME.
- If not, you may run install.spark function to do the job.
Calls: sparkR.session -> sparkCheckInstall
```

Invalid input:

```
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): abc
```
```
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n):
```

Valid input (`y`):

```
...
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): y
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://ftp.riken.jp/net/apache/spark
Downloading spark-3.3.0 for Hadoop 2.7 from:
...
```

`bin/sparkR` and `bin/spark-submit *.R` are not affected (tested).

Closes #33887 from HyukjinKwon/SPARK-36631.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit e983ba8fce)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-02 13:27:55 +09:00
Gengliang Wang 1bad04d028 Preparing development version 3.2.1-SNAPSHOT 2021-08-31 17:04:14 +00:00
Gengliang Wang 03f5d23e96 Preparing Spark release v3.2.0-rc2 2021-08-31 17:04:08 +00:00
Gengliang Wang 69be513c5e Preparing development version 3.2.1-SNAPSHOT 2021-08-20 12:40:47 +00:00
Gengliang Wang c829ed53ff Revert "Preparing development version 3.2.1-SNAPSHOT"
This reverts commit 4f1d21571d.
2021-08-20 20:07:01 +08:00
Gengliang Wang 4f1d21571d Preparing development version 3.2.1-SNAPSHOT 2021-08-19 14:08:32 +00:00
Dominik Gehl 3a09024636 [SPARK-36154][DOCS] Documenting week and quarter as valid formats in pyspark sql/functions trunc
### What changes were proposed in this pull request?
Added missing documentation of week and quarter as valid formats to pyspark sql/functions trunc
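For reference, a minimal SparkR sketch of the same formats (the PySpark `trunc` call is analogous; the column name is hypothetical):

```r
df <- createDataFrame(data.frame(d = as.Date("2021-07-15")))

head(select(df,
            trunc(df$d, "week"),       # first day of the week containing d
            trunc(df$d, "quarter")))   # first day of the quarter containing d
```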

### Why are the changes needed?
The PySpark documentation and the Scala documentation didn't mention the same supported formats

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Only documentation change

Closes #33359 from dominikgehl/feature/SPARK-36154.

Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 802f632a28)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-15 16:51:25 +03:00
Wenchen Fan c1d8178817 [SPARK-35968][SQL] Make sure partitions are not too small in AQE partition coalescing
### What changes were proposed in this pull request?

By default, AQE will set `COALESCE_PARTITIONS_MIN_PARTITION_NUM` to the spark default parallelism, which is usually quite big. This is to keep the parallelism on par with non-AQE, to avoid perf regressions.

However, this usually leads to many small/empty partitions, and hurts performance (although not worse than non-AQE). Users usually blindly set `COALESCE_PARTITIONS_MIN_PARTITION_NUM` to 1, which makes this config quite useless.

This PR adds a new config to set the min partition size, to avoid too small partitions after coalescing. By default, Spark will not respect the target size, and only respect this min partition size, to maximize the parallelism and avoid perf regression in AQE. This PR also adds a bool config to respect the target size when coalescing partitions, and it's recommended to set it to get better overall performance. This PR also deprecates the `COALESCE_PARTITIONS_MIN_PARTITION_NUM` config.
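A sketch of how a user might set the new knobs from SparkR; the config keys `spark.sql.adaptive.coalescePartitions.minPartitionSize` and `spark.sql.adaptive.coalescePartitions.parallelismFirst` are assumptions inferred from this description, not quoted from the patch:

```r
sparkR.session(
  master = "local[*]",
  sparkConfig = list(
    # assumed config: lower bound on the size of a coalesced partition
    "spark.sql.adaptive.coalescePartitions.minPartitionSize" = "1MB",
    # assumed bool config: prefer the target size over maximizing parallelism
    "spark.sql.adaptive.coalescePartitions.parallelismFirst" = "false"
  )
)
```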

### Why are the changes needed?

AQE is on by default now; we should make the performance better in the default case.

### Does this PR introduce _any_ user-facing change?

yes, a new config.

### How was this patch tested?

new tests

Closes #33172 from cloud-fan/aqe2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0c9c8ff569)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-02 16:07:46 +08:00
itholic 745756ca4c [SPARK-35603][R][DOCS] Add data source options link for R API documentation
### What changes were proposed in this pull request?

Options for every data source are documented on the Data Source Options page.

For Python, Scala, and Java, a link to the Data Source Options page was added to each API documentation.

- Python
<img width="732" alt="Screen Shot 2021-06-07 at 12 25 45 PM" src="https://user-images.githubusercontent.com/44108233/120955187-cbe38800-c78b-11eb-9475-ccf89bbc3c95.png">

- Scala
<img width="677" alt="Screen Shot 2021-06-07 at 12 26 41 PM" src="https://user-images.githubusercontent.com/44108233/120955186-cab25b00-c78b-11eb-9fed-3f0d2024029b.png">

- JAVA
<img width="726" alt="Screen Shot 2021-06-07 at 12 27 49 PM" src="https://user-images.githubusercontent.com/44108233/120955182-c8e89780-c78b-11eb-9cf1-13e41ba35b3e.png">

However, there is no such link in the R documentation, so we should add it there as well.

### Why are the changes needed?

To show users the available options for each data source when they read from or write to it.
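For example, the options documented on that page are the ones a SparkR user passes through `...` in `read.df`/`write.df`; a minimal sketch (paths are hypothetical):

```r
# "header" and "inferSchema" are CSV options listed on the Data Source Options page
df <- read.df("/tmp/people.csv", source = "csv", header = "true", inferSchema = "true")
write.df(df, path = "/tmp/people", source = "parquet", mode = "overwrite")
```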

### Does this PR introduce _any_ user-facing change?

Yes, the link for Data Source Option is added to R documentation as below.

<img width="855" alt="Screen Shot 2021-06-07 at 12 29 26 PM" src="https://user-images.githubusercontent.com/44108233/120955302-064d2500-c78c-11eb-8dc3-cb22dfd5fd14.png">

### How was this patch tested?

Manually built the docs and checked them one by one

Closes #32797 from itholic/SPARK-35603.

Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-08 11:58:38 +09:00
Hyukjin Kwon 1ba1b70cfe [SPARK-35573][R][TESTS] Make SparkR tests pass with R 4.1+
### What changes were proposed in this pull request?

This PR proposes to support R 4.1.0+ in SparkR. Currently the tests fail as below:

```
══ Failed ══════════════════════════════════════════════════════════════════════
── 1. Failure (test_sparkSQL_arrow.R:71:3): createDataFrame/collect Arrow optimi
collect(createDataFrame(rdf)) not equal to `expected`.
Component “g”: 'tzone' attributes are inconsistent ('UTC' and '')

── 2. Failure (test_sparkSQL_arrow.R:143:3): dapply() Arrow optimization - type
collect(ret) not equal to `rdf`.
Component “b”: 'tzone' attributes are inconsistent ('UTC' and '')

── 3. Failure (test_sparkSQL_arrow.R:229:3): gapply() Arrow optimization - type
collect(ret) not equal to `rdf`.
Component “b”: 'tzone' attributes are inconsistent ('UTC' and '')

── 4. Error (test_sparkSQL.R:1454:3): column functions ─────────────────────────
Error: (converted from warning) cannot xtfrm data frames
Backtrace:
  1. base::sort(collect(distinct(select(df, input_file_name())))) test_sparkSQL.R:1454:2
  2. base::sort.default(collect(distinct(select(df, input_file_name()))))
  5. base::order(x, na.last = na.last, decreasing = decreasing)
  6. base::lapply(z, function(x) if (is.object(x)) as.vector(xtfrm(x)) else x)
  7. base:::FUN(X[[i]], ...)
 10. base::xtfrm.data.frame(x)

── 5. Failure (test_utils.R:67:3): cleanClosure on R functions ─────────────────
`actual` not equal to `g`.
names for current but not for target
Length mismatch: comparison on first 0 components

── 6. Failure (test_utils.R:80:3): cleanClosure on R functions ─────────────────
`actual` not equal to `g`.
names for current but not for target
Length mismatch: comparison on first 0 components
```

It fixes three as below:

- Avoid a sort on DataFrame which isn't legitimate: https://github.com/apache/spark/pull/32709#discussion_r642458108
- Treat the empty timezone and local timezone as equivalent in SparkR: https://github.com/apache/spark/pull/32709#discussion_r642464454
- Disable `check.environment` in the cleaned closure comparison (enabled by default from R 4.1+, https://cran.r-project.org/doc/manuals/r-release/NEWS.html), and keep the test as is https://github.com/apache/spark/pull/32709#discussion_r642510089
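A base-R sketch of the third point, assuming only that R 4.1 compares function environments by default and that `check.environment = FALSE` restores the old comparison:

```r
f <- local({ x <- 1; function() x })
g <- local({ x <- 1; y <- 2; function() x })  # same formals and body, extra binding in its environment

isTRUE(all.equal(f, g))                             # FALSE on R 4.1+: environments are compared too
isTRUE(all.equal(f, g, check.environment = FALSE))  # TRUE: only formals and body are compared
```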

### Why are the changes needed?

Newer R versions have bug fixes and improvements. More importantly, R users tend to use the latest R versions.

### Does this PR introduce _any_ user-facing change?

Yes, SparkR will work together with R 4.1.0+

### How was this patch tested?

```bash
./R/run-tests.sh
```

```
sparkSQL_arrow:
SparkSQL Arrow optimization: .................

...

sparkSQL:
SparkSQL functions: ........................................................................................................................................................................................................
........................................................................................................................................................................................................
........................................................................................................................................................................................................
........................................................................................................................................................................................................
........................................................................................................................................................................................................
........................................................................................................................................................................................................

...

utils:
functions in utils.R: ..............................................
```

Closes #32709 from HyukjinKwon/SPARK-35573.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-01 10:35:52 +09:00
Felix Cheung 1530876615 [SPARK-35495][R] Change SparkR maintainer for CRAN
### What changes were proposed in this pull request?

As discussed, update SparkR maintainer for future release.

### Why are the changes needed?

Shivaram will not be able to work with this in the future, so we would like to migrate off the maintainer contact email.

shivaram

Closes #32642 from felixcheung/sparkr-maintainer.

Authored-by: Felix Cheung <felixcheung@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-23 19:08:54 -07:00
Hyukjin Kwon ecb48ccb7d [SPARK-35381][R] Fix lambda variable name issues in nested higher order functions at R APIs
### What changes were proposed in this pull request?

This PR fixes the same issue as https://github.com/apache/spark/pull/32424

```r
df <- sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
collect(select(
  df,
  array_transform("numbers", function(number) {
    array_transform("letters", function(latter) {
      struct(alias(number, "n"), alias(latter, "l"))
    })
  })
))
```

**Before:**

```
... a, a, b, b, c, c, a, a, b, b, c, c, a, a, b, b, c, c
```

**After:**

```
... 1, a, 1, b, 1, c, 2, a, 2, b, 2, c, 3, a, 3, b, 3, c
```

### Why are the changes needed?

To produce the correct results.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the results to be correct as mentioned above.

### How was this patch tested?

Manually tested as above, and unit test was added.

Closes #32517 from HyukjinKwon/SPARK-35381.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-12 16:52:39 +09:00
Ruifeng Zheng 1f150b9392 [SPARK-35024][ML] Refactor LinearSVC - support virtual centering
### What changes were proposed in this pull request?
1. Remove the existing agg and use a new agg supporting virtual centering.
2. Add related test suites.

### Why are the changes needed?
Centering vectors should accelerate convergence and generate solutions closer to R's.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Updated existing test suites and added new ones.

Closes #32124 from zhengruifeng/svc_agg_refactor.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
2021-04-25 13:16:46 +08:00
Kousuke Saruta c0972dec1d [SPARK-35180][BUILD] Allow to build SparkR with SBT
### What changes were proposed in this pull request?

This PR proposes a change that allows us to build SparkR with SBT.

### Why are the changes needed?

In the current master, SparkR can be built only with Maven.
It would be helpful if we could build it with SBT as well.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed that I can build SparkR on Ubuntu 20.04 with the following command.
```
build/sbt -Psparkr package
```

Closes #32285 from sarutak/sbt-sparkr.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2021-04-22 20:56:33 +09:00
Yuanjian Li 8e9e70045b [SPARK-35171][R] Declare the markdown package as a dependency of the SparkR package
### What changes were proposed in this pull request?
Declare the markdown package as a dependency of the SparkR package

### Why are the changes needed?
If pandoc is not installed locally, running make-distribution.sh fails with the following message:
```
— re-building ‘sparkr-vignettes.Rmd’ using rmarkdown
Warning in engine$weave(file, quiet = quiet, encoding = enc) :
Pandoc (>= 1.12.3) not available. Falling back to R Markdown v1.
Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics:
The 'markdown' package should be declared as a dependency of the 'SparkR' package (e.g., in the 'Suggests' field of DESCRIPTION), because the latter contains vignette(s) built with the 'markdown' package. Please see https://github.com/yihui/knitr/issues/1864 for more information.
— failed re-building ‘sparkr-vignettes.Rmd’
```
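A sketch of the resulting `DESCRIPTION` stanza; the surrounding package list is illustrative, the point is the added `markdown` entry:

```
Suggests:
    knitr,
    rmarkdown,
    markdown,
    testthat,
    e1071,
    survival
```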

### Does this PR introduce _any_ user-facing change?
Yes. Workaround for R packaging.

### How was this patch tested?
Manually tested. After the fix, the command `sh dev/make-distribution.sh -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn` passes in an environment without pandoc.

Closes #32270 from xuanyuanking/SPARK-35171.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-21 20:43:47 +09:00
HyukjinKwon f72b9068ad [SPARK-34643][R][DOCS] Use CRAN URL in canonical form
### What changes were proposed in this pull request?

This PR fixes the URL links to use CRAN URLs in canonical form.
The CRAN package submission failed as below:

```
   Found the following (possibly) invalid URLs:
     URL: https://cran.r-project.org/web/packages/e1071/index.html
       From: man/spark.naiveBayes.Rd
       Status: 200
       Message: OK
       CRAN URL not in canonical form
     URL: https://cran.r-project.org/web/packages/mixtools/index.html
       From: man/spark.gaussianMixture.Rd
       Status: 200
       Message: OK
       CRAN URL not in canonical form
     URL: https://cran.r-project.org/web/packages/survival/index.html
       From: man/spark.survreg.Rd
       Status: 200
       Message: OK
       CRAN URL not in canonical form
     URL: https://cran.r-project.org/web/packages/topicmodels/index.html
       From: man/spark.lda.Rd
       Status: 200
       Message: OK
       CRAN URL not in canonical form
     The canonical URL of the CRAN page for a package is
       https://CRAN.R-project.org/package=pkgname
```

### Why are the changes needed?

To fix CRAN package submission

### Does this PR introduce _any_ user-facing change?

It exposes the canonical form of URLs to end users.

### How was this patch tested?

I manually clicked each link.

Closes #31759 from HyukjinKwon/minor-doc-fixes.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-05 10:08:11 -08:00
Richard Penney 7d0743b493 [SPARK-33678][SQL] Product aggregation function
### Why is this change being proposed?
This patch adds support for a new "product" aggregation function in `sql.functions` which multiplies together all values in an aggregation group.

This is likely to be useful in statistical applications which involve combining probabilities, or financial applications that involve combining cumulative interest rates, but is also a versatile mathematical operation of similar status to `sum` or `stddev`. Other users [have noted](https://stackoverflow.com/questions/52991640/cumulative-product-in-spark) the absence of such a function in current releases of Spark.

This function is both much more concise than an expression of the form `exp(sum(log(...)))`, and avoids awkward edge-cases associated with some values being zero or negative, as well as being less computationally costly.

### Does this PR introduce _any_ user-facing change?
No - only adds new function.

### How was this patch tested?
Built-in tests have been added for the new `catalyst.expressions.aggregate.Product` class and its invocation via the (scala) `sql.functions.product` function. The latter, and the PySpark wrapper have also been manually tested in spark-shell and pyspark sessions. The SparkR wrapper is currently untested, and may need separate validation (I'm not an "R" user myself).

An illustration of the new functionality, within PySpark is as follows:
```
import pyspark.sql.functions as pf, pyspark.sql.window as pw

df = sqlContext.range(1, 17).toDF("x")
win = pw.Window.partitionBy(pf.lit(1)).orderBy(pf.col("x"))

df.withColumn("factorial", pf.product("x").over(win)).show(20, False)
+---+---------------+
|x  |factorial      |
+---+---------------+
|1  |1.0            |
|2  |2.0            |
|3  |6.0            |
|4  |24.0           |
|5  |120.0          |
|6  |720.0          |
|7  |5040.0         |
|8  |40320.0        |
|9  |362880.0       |
|10 |3628800.0      |
|11 |3.99168E7      |
|12 |4.790016E8     |
|13 |6.2270208E9    |
|14 |8.71782912E10  |
|15 |1.307674368E12 |
|16 |2.0922789888E13|
+---+---------------+
```

Closes #30745 from rwpenney/feature/agg-product.

Lead-authored-by: Richard Penney <rwp@rwpenney.uk>
Co-authored-by: Richard Penney <rwpenney@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-03-02 16:51:07 +09:00
HyukjinKwon 30468a9015 [SPARK-34306][SQL][PYTHON][R] Use Snake naming rule across the function APIs
### What changes were proposed in this pull request?

This PR completes the snake_case rule for function APIs across the languages; see also SPARK-10621.

In more details, this PR:
- Adds `count_distinct` in Scala, Python, and R, and documents that `count_distinct` is encouraged (see the sketch after this list). This was not deprecated because `countDistinct` is pretty commonly used. We could deprecate it in future releases.
- (Scala-specific) adds `typedlit` but doesn't deprecate `typedLit` which is arguably commonly used. Likewise, we could deprecate in the future releases.
- Deprecates and renames:
  - `sumDistinct` -> `sum_distinct`
  - `bitwiseNOT` -> `bitwise_not`
  - `shiftLeft` -> `shiftleft` (matched with SQL name in `FunctionRegistry`)
  - `shiftRight` -> `shiftright` (matched with SQL name in `FunctionRegistry`)
  - `shiftRightUnsigned` -> `shiftrightunsigned` (matched with SQL name in `FunctionRegistry`)
  - (Scala-specific) `callUDF` -> `call_udf`
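A minimal SparkR illustration of the new snake_case names (the data frame is made up):

```r
df <- createDataFrame(data.frame(x = c(1, 1, 2, 3)))

head(select(df, count_distinct(df$x)))  # new, encouraged name
head(select(df, countDistinct(df$x)))   # still available, not deprecated
head(select(df, sum_distinct(df$x)))    # renamed from sumDistinct (now deprecated)
```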

### Why are the changes needed?

To keep the consistent naming in APIs.

### Does this PR introduce _any_ user-facing change?

Yes, it deprecates some APIs and adds new renamed APIs as described above.

### How was this patch tested?

Unittests were added.

Closes #31408 from HyukjinKwon/SPARK-34306.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-02 09:29:40 +09:00
Max Gekk 2b76e6d15c [SPARK-34301][SQL] Use logical plan of alter table in CatalogImpl.recoverPartitions()
### What changes were proposed in this pull request?
Replace v1 exec node `AlterTableRecoverPartitionsCommand` by the logical node `AlterTableRecoverPartitions` in `CatalogImpl.recoverPartitions()`.

### Why are the changes needed?
1. Print user friendly error message for views:
```
my_temp_table is a temp view. 'recoverPartitions()' expects a table
```
Before the changes:
```
Table or view 'my_temp_table' not found in database 'default'
```

2. To avoid binding to the v1 `ALTER TABLE .. RECOVER PARTITIONS` command, and to potentially support v2 tables as well.

### Does this PR introduce _any_ user-facing change?
Yes, it can.

### How was this patch tested?
By running new test in `CatalogSuite`:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly org.apache.spark.sql.internal.CatalogSuite"
```

Closes #31403 from MaxGekk/catalogimpl-recoverPartitions.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-01 14:09:40 +00:00
HyukjinKwon b5bdbf2ebc [SPARK-30682][R][SQL][FOLLOW-UP] Keep the name similar with Scala side in higher order functions
### What changes were proposed in this pull request?

This PR is a followup of #27433. It fixes the naming to match with Scala side, and this is similar with https://github.com/apache/spark/pull/31062.

Note that:

- there is already a bit of inconsistency, e.g. `x`, `y` in SparkR, and they are documented together for doc deduplication. This part I did not change, but the `zero` vs `initialValue` naming difference looks unnecessary.
- such naming matching seems already pretty common in SparkR.

### Why are the changes needed?

To make the usage similar with Scala side, and for consistency.

### Does this PR introduce _any_ user-facing change?

No, this is not released yet.

### How was this patch tested?

GitHub Actions and Jenkins build will test it out.

Also, I manually tested:

```r
> df <- select(createDataFrame(data.frame(id = 1)),expr("CAST(array(1.0, 2.0, -3.0, -4.0) AS array<double>) xs"))
> collect(select(df, array_aggregate("xs", initialValue = lit(0.0), merge = function(x, y) otherwise(when(x > y, x), y))))
  aggregate(xs, 0.0, lambdafunction(CASE WHEN (x > y) THEN x ELSE y END, x, y), lambdafunction(id, id))
1                                                                                                     2
```

Closes #31226 from HyukjinKwon/SPARK-30682.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-18 14:19:14 +09:00
zero323 66cc12944a [SPARK-34132][DOCS][R] Update Roxygen version references to 7.1.1
### What changes were proposed in this pull request?

This PR updates `roxygen2` version reference in docs and `DESCRIPTION` file.

### Why are the changes needed?

According to information provided by shaneknapp (see [this comment](https://issues.apache.org/jira/browse/SPARK-30747?focusedCommentId=17265142&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17265142) to SPARK-30747) all workers use roxygen 7.1.1.

In GitHub workflow we install the latest version

c75c29dcaa/.github/workflows/build_and_test.yml (L346)

which [is also 7.1.1 at the moment](https://web.archive.org/web/20210115172522/https://cran.r-project.org/web/packages/roxygen2/).

### Does this PR introduce _any_ user-facing change?

Docs and description mention the currently used package version.

### How was this patch tested?

- `dev/lint-r`.
- Manual check of command used in docs.

Closes #31200 from zero323/ROXYGEN-VERSION-UPDATE-DOCS.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-15 17:08:17 -08:00
HyukjinKwon 0ba3ab4c23 [SPARK-34021][R] Fix hyper links in SparkR documentation for CRAN submission
### What changes were proposed in this pull request?

The 3.0.1 CRAN submission failed for the reason below:

```
   Found the following (possibly) invalid URLs:
     URL: http://jsonlines.org/ (moved to https://jsonlines.org/)
       From: man/read.json.Rd
             man/write.json.Rd
       Status: 200
       Message: OK
     URL: https://dl.acm.org/citation.cfm?id=1608614 (moved to
https://dl.acm.org/doi/10.1109/MC.2009.263)
       From: inst/doc/sparkr-vignettes.html
       Status: 200
       Message: OK
 ```

These links are now being redirected. This PR checks all hyperlinks in the docs such as `href{...}` and `url{...}`, and fixes all of them in SparkR:

- Fix two problems above.
- Fix http to https
- Fix `https://www.apache.org/ https://spark.apache.org/` -> `https://www.apache.org https://spark.apache.org`.

### Why are the changes needed?

For CRAN submission.

### Does this PR introduce _any_ user-facing change?

Virtually no because it's just cleanup that CRAN requires.

### How was this patch tested?

Manually tested by clicking the links

Closes #31058 from HyukjinKwon/SPARK-34021.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-07 13:58:13 +09:00
Tom.Howland 3d8ee492d6 [SPARK-34015][R] Fixing input timing in gapply
### What changes were proposed in this pull request?

When SparkR is run at log level INFO, a summary of how the worker spent its time processing the partition is printed. There is a logic error where it over-reports the time spent reading input rows.

In detail: the variable inputElap in a wider context is used to mark the end of reading rows, but in the part changed here it was used as a local variable for measuring the beginning of compute time in a loop over the groups in the partition. Thus, the error is not observable if there is only one group per partition, which is what you get in unit tests.

For our application, here's what a log entry looks like before these changes were applied:

`20/10/09 04:08:58 INFO RRunner: Times: boot = 0.013 s, init = 0.005 s, broadcast = 0.000 s, read-input = 529.471 s, compute = 492.037 s, write-output = 0.020 s, total = 1021.546 s`

This indicates that we're spending more time reading rows than operating on them.

After these changes, it looks like this:

`20/12/15 06:43:29 INFO RRunner: Times: boot = 0.013 s, init = 0.010 s, broadcast = 0.000 s, read-input = 120.275 s, compute = 1680.161 s, write-output = 0.045 s, total = 1812.553 s`

### Why are the changes needed?

Metrics shouldn't mislead?

### Does this PR introduce _any_ user-facing change?

Aside from no longer misleading, no

### How was this patch tested?

Unit tests passed. Field test results seem plausible.

Closes #31021 from WamBamBoozle/input_timing.

Authored-by: Tom.Howland <Tom.Howland@target.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-06 11:40:02 +09:00
Michael Chirico 12b69cc27c [SPARK-26199][SPARK-31517][R] Fix strategy for handling ... names in mutate
### What changes were proposed in this pull request?

Change the strategy for how the varargs are handled in the default `mutate` method

### Why are the changes needed?

Bugfix -- `deparse` + `sapply` not working as intended due to `width.cutoff`
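A base-R sketch of the underlying pitfall (the long call is made up): `deparse()` breaks long expressions into several strings at `width.cutoff` (default 60), so `sapply(..., deparse)` can silently return something other than one string per argument.

```r
long_call <- quote(transform(df, very_long_column_name = another_long_column_name + yet_another_long_column_name))

length(deparse(long_call))                 # > 1: the call is split across lines
paste(deparse(long_call), collapse = " ")  # one robust way to get a single string back
```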

### Does this PR introduce any user-facing change?

Yes, bugfix. Shouldn't change any working code.

### How was this patch tested?

None! yet.

Closes #28386 from MichaelChirico/r-mutate-deparse.

Lead-authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-17 17:20:45 +09:00
Dongjoon Hyun de9818f043
[SPARK-33662][BUILD] Setting version to 3.2.0-SNAPSHOT
### What changes were proposed in this pull request?

This PR aims to update `master` branch version to 3.2.0-SNAPSHOT.

### Why are the changes needed?

Start to prepare Apache Spark 3.2.0.

### Does this PR introduce _any_ user-facing change?

N/A.

### How was this patch tested?

Pass the CIs.

Closes #30606 from dongjoon-hyun/SPARK-3.2.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-04 14:10:42 -08:00
zero323 5a1c5ac807 [SPARK-33622][R][ML] Add array_to_vector to SparkR
### What changes were proposed in this pull request?

This PR adds `array_to_vector` to R API.
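A minimal sketch of the new wrapper (the column construction is illustrative):

```r
df <- createDataFrame(data.frame(id = 1))
df <- withColumn(df, "arr", create_array(lit(1.0), lit(2.0), lit(3.0)))

# Convert the array<double> column into an ML vector column
head(select(df, array_to_vector(df$arr)))
```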

### Why are the changes needed?

Feature parity.

### Does this PR introduce _any_ user-facing change?

New function exposed in the public API.

### How was this patch tested?

New unit test.
Manual verification of the documentation examples.

Closes #30561 from zero323/SPARK-33622.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-01 10:44:14 -08:00
Josh Soref 13fd272cd3 Spelling r common dev mlib external project streaming resource managers python
### What changes were proposed in this pull request?

This PR intends to fix typos in the sub-modules:
* `R`
* `common`
* `dev`
* `mlib`
* `external`
* `project`
* `streaming`
* `resource-managers`
* `python`

Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618

NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)

### Why are the changes needed?

Misspelled words make it harder to read / understand content.

### Does this PR introduce _any_ user-facing change?

There are various fixes to documentation, etc...

### How was this patch tested?

No testing was performed

Closes #30402 from jsoref/spelling-R_common_dev_mlib_external_project_streaming_resource-managers_python.

Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-11-27 10:22:45 -06:00
zero323 d082ad0abf [SPARK-33563][PYTHON][R][SQL] Expose inverse hyperbolic trig functions in PySpark and SparkR
### What changes were proposed in this pull request?

This PR adds the following functions (introduced in Scala API with SPARK-33061):

- `acosh`
- `asinh`
- `atanh`

to Python and R.
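A minimal SparkR sketch (input values are arbitrary):

```r
df <- createDataFrame(data.frame(x = c(0.1, 0.5, 0.9)))

head(select(df, asinh(df$x), atanh(df$x)))  # defined for these inputs
head(select(df, acosh(df$x + 1)))           # acosh needs values >= 1
```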

### Why are the changes needed?

Feature parity.

### Does this PR introduce _any_ user-facing change?

New functions.

### How was this patch tested?

New unit tests.

Closes #30501 from zero323/SPARK-33563.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-27 11:00:09 +09:00
zero323 56a8510e19 [SPARK-33304][R][SQL] Add from_avro and to_avro functions to SparkR
### What changes were proposed in this pull request?

Adds `from_avro` and `to_avro` functions to SparkR.
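A rough sketch, assuming the SparkR signatures mirror the Scala ones (`to_avro(column)`, `from_avro(column, jsonFormatSchema)`) and that the external spark-avro package is supplied to the session (the package coordinates below are illustrative):

```r
sparkR.session(sparkPackages = "org.apache.spark:spark-avro_2.12:3.1.0")

df <- createDataFrame(data.frame(name = "Alice", age = 30L))
avroDF <- select(df, alias(to_avro(struct(df$name, df$age)), "avro"))

schema <- '{"type":"record","name":"person","fields":[
  {"name":"name","type":"string"},{"name":"age","type":"int"}]}'
head(select(avroDF, from_avro(avroDF$avro, schema)))
```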

### Why are the changes needed?

Feature parity.

### Does this PR introduce _any_ user-facing change?

New functions exposed in SparkR API.

### How was this patch tested?

New unit tests.

Closes #30216 from zero323/SPARK-33304.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-19 09:52:29 +09:00
neko 4360c6f12a [SPARK-33363] Add prompt information related to the current task when pyspark/sparkR starts
### What changes were proposed in this pull request?
Add prompt information about the current applicationId, current URL, and master info when pyspark/sparkR starts.

### Why are the changes needed?
The information printed when pyspark/sparkR starts does not include basic information about the current application, which is inconvenient when using pyspark/sparkR from a DOS console.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Manual test results are shown below:
![pyspark new print](https://user-images.githubusercontent.com/52202080/98274268-2a663f00-1fce-11eb-88ce-964ce90b439e.png)
![sparkR](https://user-images.githubusercontent.com/52202080/98541235-1a01dd00-22ca-11eb-9304-09bcde87b05e.png)

Closes #30266 from akiyamaneko/pyspark-hint-info.

Authored-by: neko <echohlne@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-10 11:12:19 +09:00
zero323 d71b2febaf [SPARK-30663][SPARK-33313][TESTS][R] Drop testthat 1.x support and add testthat 3.x support
### What changes were proposed in this pull request?

This PR modifies `R/pkg/tests/run-all.R` by:

- Removing `testthat` 1.x support, as Jenkins has been upgraded to 2.x with SPARK-30637 and this code is no longer relevant.
- Add `testthat` 3.x support to avoid AppVeyor failures.

### Why are the changes needed?

The internal API currently in use has been removed in the latest `testthat` release.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tests executed against `testthat == 2.3.2` and `testthat == 3.0.0`

Closes #30219 from zero323/SPARK-33313.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-02 08:54:08 +09:00
Max Gekk b409025641 [SPARK-33281][SQL] Return SQL schema instead of Catalog string from the SchemaOfCsv expression
### What changes were proposed in this pull request?
Return schema in SQL format instead of Catalog string from the SchemaOfCsv expression.

### Why are the changes needed?
To unify the output of `schema_of_json()` and `schema_of_csv()`.

### Does this PR introduce _any_ user-facing change?
Yes, but since `schema_of_csv()` is usually used in combination with `from_csv()`, the schema format shouldn't matter much.

Before:
```
> SELECT schema_of_csv('1,abc');
  struct<_c0:int,_c1:string>
```

After:
```
> SELECT schema_of_csv('1,abc');
  STRUCT<`_c0`: INT, `_c1`: STRING>
```

### How was this patch tested?
By existing test suites `CsvFunctionsSuite` and `CsvExpressionsSuite`.

Closes #30180 from MaxGekk/schema_of_csv-sql-schema.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-29 21:02:10 +09:00
Max Gekk 9d5e48ea95 [SPARK-33270][SQL] Return SQL schema instead of Catalog string from the SchemaOfJson expression
### What changes were proposed in this pull request?
Return schema in SQL format instead of Catalog string from the `SchemaOfJson` expression.

### Why are the changes needed?
In some cases, `from_json()` cannot parse schemas returned by `schema_of_json`, for instance, when JSON fields have spaces (gaps). Such fields will be quoted after the changes, and can be parsed by `from_json()`.

Here is the example:
```scala
val in = Seq("""{"a b": 1}""").toDS()
in.select(from_json('value, schema_of_json("""{"a b": 100}""")) as "parsed")
```
raises the exception:
```
== SQL ==
struct<a b:bigint>
------^^^

	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:76)
	at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:131)
	at org.apache.spark.sql.catalyst.expressions.ExprUtils$.evalTypeExpr(ExprUtils.scala:33)
	at org.apache.spark.sql.catalyst.expressions.JsonToStructs.<init>(jsonExpressions.scala:537)
	at org.apache.spark.sql.functions$.from_json(functions.scala:4141)
```

### Does this PR introduce _any_ user-facing change?
Yes. For example, `schema_of_json` for the input `{"col":0}`.

Before: `struct<col:bigint>`
After: `STRUCT<`col`: BIGINT>`

### How was this patch tested?
By existing test suites `JsonFunctionsSuite` and `JsonExpressionsSuite`.

Closes #30172 from MaxGekk/schema_of_json-sql-schema.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-29 10:30:41 +09:00
zero323 ea709d6748 [SPARK-33258][R][SQL] Add asc_nulls_* and desc_nulls_* methods to SparkR
### What changes were proposed in this pull request?

This PR adds the following `Column` methods to R API:

- asc_nulls_first
- asc_nulls_last
- desc_nulls_first
- desc_nulls_last
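A minimal SparkR sketch (the data frame is made up):

```r
df <- createDataFrame(data.frame(x = c(2, NA, 1)))

head(arrange(df, asc_nulls_last(df$x)))    # 1, 2, then the NULL row
head(arrange(df, desc_nulls_first(df$x)))  # the NULL row first, then 2, 1
```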

### Why are the changes needed?

Feature parity.

### Does this PR introduce _any_ user-facing change?

No, new methods.

### How was this patch tested?

New unit tests.

Closes #30159 from zero323/SPARK-33258.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-28 09:46:13 +09:00
xuewei.linxuewei dc697a8b59 [SPARK-13860][SQL] Change statistical aggregate function to return null instead of Double.NaN when divideByZero
### What changes were proposed in this pull request?

As [SPARK-13860](https://issues.apache.org/jira/browse/SPARK-13860) stated, TPCDS Query 39 returns wrong results using SparkSQL. The root cause is that when stddev_samp is applied to a single-element set, the TPCDS answer is null, whereas SparkSQL returns Double.NaN, which causes the wrong result.

This adds an extra legacy config to fall back to the NaN logic, and returns null by default to align with the TPCDS standard.

### Why are the changes needed?

SQL correctness issue.

### Does this PR introduce any user-facing change?
Yes. See the SQL migration guide:

In Spark 3.1, statistical aggregation function includes `std`, `stddev`, `stddev_samp`, `variance`, `var_samp`, `skewness`, `kurtosis`, `covar_samp`, `corr` will return `NULL` instead of `Double.NaN` when `DivideByZero` occurs during expression evaluation, for example, when `stddev_samp` applied on a single element set. In Spark version 3.0 and earlier, it will return `Double.NaN` in such case. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.statisticalAggregate` to `true`.
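A SparkR sketch of the behaviour change on a single-element group; the legacy config name is the one quoted above:

```r
df <- createDataFrame(data.frame(x = 1))

head(select(df, stddev_samp(df$x)))  # NULL (NA) after this change; Double.NaN before

# To restore the pre-3.1 behaviour, set the legacy flag before running the query:
# sparkR.session(sparkConfig = list("spark.sql.legacy.statisticalAggregate" = "true"))
```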

### How was this patch tested?
Updated DataFrameAggregateSuite/DataFrameWindowFunctionsSuite to test both default and legacy behavior.
Adjusted DataFrameWindowFunctionsSuite/SQLQueryTestSuite and some R cases to reflect the default return-null behavior.

Closes #29983 from leanken/leanken-SPARK-13860.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-13 13:21:45 +00:00
zero323 3beab8d8a8 [SPARK-32793][FOLLOW-UP] Minor corrections for PySpark annotations and SparkR
### What changes were proposed in this pull request?

- Annotated return types of `assert_true` and `raise_error` as discussed [here](https://github.com/apache/spark/pull/29947#pullrequestreview-504495801).
- Add `assert_true` and `raise_error`  to SparkR NAMESPACE.
- Validating message vector size in SparkR as discussed [here](https://github.com/apache/spark/pull/29947#pullrequestreview-504539004).

### Why are the changes needed?

As discussed in review for #29947.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- Existing tests.
- Validation of annotations using MyPy

Closes #29978 from zero323/SPARK-32793-FOLLOW-UP.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-09 09:50:45 +09:00
Karen Feng 39510b0e9b [SPARK-32793][SQL] Add raise_error function, adds error message parameter to assert_true
### What changes were proposed in this pull request?

Adds a SQL function `raise_error` which underlies the refactored `assert_true` function. `assert_true` now also (optionally) accepts a custom error message field.
`raise_error` is exposed in SQL, Python, Scala, and R.
`assert_true` was previously only exposed in SQL; it is now also exposed in Python, Scala, and R.
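A minimal SparkR sketch (data and messages are made up; the positional error-message argument is an assumption based on the description above):

```r
df <- createDataFrame(data.frame(x = c(1, 2, 3)))

# The assertion holds, so assert_true evaluates to NULL for every row
head(select(df, assert_true(df$x > 0, "x must be positive"), df$x))

# Unconditional failure with a custom message (would raise an error if executed)
# head(select(df, raise_error("something went wrong")))
```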

### Why are the changes needed?

Improves usability of `assert_true` by clarifying error messaging, and adds the useful helper function `raise_error`.

### Does this PR introduce _any_ user-facing change?

Yes:
- Adds `raise_error` function to the SQL, Python, Scala, and R APIs.
- Adds `assert_true` function to the SQL, Python and R APIs.

### How was this patch tested?

Adds unit tests in SQL, Python, Scala, and R for `assert_true` and `raise_error`.

Closes #29947 from karenfeng/spark-32793.

Lead-authored-by: Karen Feng <karen.feng@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-08 12:05:39 +09:00
zero323 473b3ba6aa [SPARK-32511][FOLLOW-UP][SQL][R][PYTHON] Add dropFields to SparkR and PySpark
### What changes were proposed in this pull request?

This PR adds `dropFields` method to:

- PySpark `Column`
- SparkR `Column`
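A minimal SparkR sketch (the struct construction and field names are illustrative):

```r
df <- select(createDataFrame(data.frame(id = 1)),
             alias(struct(alias(lit(1), "a"), alias(lit(2), "b")), "s"))

# Drop the nested field "a" from the struct column "s"
head(select(df, dropFields(df$s, "a")))
```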

### Why are the changes needed?

Feature parity.

### Does this PR introduce _any_ user-facing change?

No, new API.

### How was this patch tested?

- New unit tests.
- Manual verification of examples / doctests.
- Manual run of MyPy tests

Closes #29967 from zero323/SPARK-32511-FOLLOW-UP-PYSPARK-SPARKR.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-08 10:37:42 +09:00
zero323 24f890e8e8 [SPARK-33040][FOLLOW-UP][R] Reorder argument choices and add examples
### What changes were proposed in this pull request?

- Reorder choices of `dtype` to match Scala defaults.
- Add example to ml_functions.

### Why are the changes needed?

As requested:

- https://github.com/apache/spark/pull/29917#pullrequestreview-501715344
- https://github.com/apache/spark/pull/29917#pullrequestreview-501716521

### Does this PR introduce _any_ user-facing change?

No (changes to newly added component).

### How was this patch tested?

Existing tests.

Closes #29944 from zero323/SPARK-33040-FOLLOW-UP.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-05 16:31:17 +09:00
zero323 e83d03ca48 [SPARK-33040][R][ML] Add SparkR wrapper for vector_to_array
### What changes were proposed in this pull request?

Add SparkR wrapper for `o.a.s.ml.functions.vector_to_array`
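A minimal sketch of the wrapper; the libsvm sample file ships with Spark distributions, so treat the path as illustrative:

```r
# Load a DataFrame with an ML vector column named "features"
df <- read.df("data/mllib/sample_libsvm_data.txt", source = "libsvm")

# Expose the vector as a plain array<double> column that collect() can handle
head(select(df, vector_to_array(df$features)))
```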

### Why are the changes needed?

- Currently ML vectors, including predictions, are almost inaccessible to R users. That is a serious loss of functionality.
- Feature parity.

### Does this PR introduce _any_ user-facing change?

Yes, new R function is added.

### How was this patch tested?

- New unit tests.
- Manual verification.

Closes #29917 from zero323/SPARK-33040.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-05 13:18:12 +09:00