ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Josh Soref	485145326a	[MINOR] Spelling bin core docs external mllib repl ### What changes were proposed in this pull request? This PR intends to fix typos in the sub-modules: * `bin` * `core` * `docs` * `external` * `mllib` * `repl` * `pom.xml` Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618 NOTE: The misspellings have been reported at `706a726f87 (commitcomment-44064356)` ### Why are the changes needed? Misspelled words make it harder to read / understand content. ### Does this PR introduce _any_ user-facing change? There are various fixes to documentation, etc... ### How was this patch tested? No testing was performed Closes #30530 from jsoref/spelling-bin-core-docs-external-mllib-repl. Authored-by: Josh Soref <jsoref@users.noreply.github.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-11-30 13:59:51 +09:00
HyukjinKwon	e1d7321034	[SPARK-32478][R][SQL] Error message to show the schema mismatch in gapply with Arrow vectorization ### What changes were proposed in this pull request? This PR proposes to: 1. Fix the error message when the output schema does not match the R DataFrame returned from the given function. For example, ```R df <- createDataFrame(list(list(a=1L, b="2"))) count(gapply(df, "a", function(key, group) { group }, structType("a int, b int"))) ``` Before: ``` Error in handleErrors(returnStatus, conn) : ... java.lang.UnsupportedOperationException ... ``` After: ``` Error in handleErrors(returnStatus, conn) : ... java.lang.AssertionError: assertion failed: Invalid schema from gapply: expected IntegerType, IntegerType, got IntegerType, StringType ... ``` 2. Update documentation about the schema matching for `gapply` and `dapply`. ### Why are the changes needed? To show which schema is not matched, and let users know what's going on. ### Does this PR introduce _any_ user-facing change? Yes, the error message is updated as above, and documentation is updated. ### How was this patch tested? Manually tested and unit tests were added. Closes #29283 from HyukjinKwon/r-vectorized-error. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-30 15:16:02 +09:00
HyukjinKwon	bfa5d57bbd	[SPARK-32452][R][SQL] Bump up the minimum Arrow version as 1.0.0 in SparkR ### What changes were proposed in this pull request? This PR proposes to set the minimum Arrow version as 1.0.0 to minimise the maintenance overhead and keep the minimal version up to date. Other required changes to support 1.0.0 were already made in SPARK-32451. ### Why are the changes needed? R side, people rather aggressively encourage people to use the latest version, and SparkR vectorization is very experimental that was added from Spark 3.0. Also, we're technically not testing old Arrow versions in SparkR for now. ### Does this PR introduce _any_ user-facing change? Yes, users wouldn't be able to use SparkR with old Arrow. ### How was this patch tested? GitHub Actions and AppVeyor are already testing them. Closes #29253 from HyukjinKwon/SPARK-32452. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-27 14:21:15 +09:00
HyukjinKwon	e1315cd656	[SPARK-31701][R][SQL] Bump up the minimum Arrow version as 0.15.1 in SparkR ### What changes were proposed in this pull request? This PR proposes to set the minimum Arrow version as 0.15.1 to be consistent with the PySpark side. ### Why are the changes needed? It will reduce the maintenance overhead to match the Arrow versions, and minimize the supported range. SparkR Arrow optimization is still experimental. ### Does this PR introduce _any_ user-facing change? No, it's the change in unreleased branches only. ### How was this patch tested? 0.15.x was already tested at SPARK-29378, and we're testing the latest version of SparkR currently in AppVeyor. I already manually tested too. Closes #28520 from HyukjinKwon/SPARK-31701. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-05-13 10:03:12 -07:00
zero323	697fe911ac	[SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR ### What changes were proposed in this pull request? This pull request adds SparkR wrapper for `FMRegressor`: - Supporting ` org.apache.spark.ml.r.FMRegressorWrapper`. - `FMRegressionModel` S4 class. - Corresponding `spark.fmRegressor`, `predict`, `summary` and `write.ml` generics. - Corresponding docs and tests. ### Why are the changes needed? Feature parity. ### Does this PR introduce any user-facing change? No (new API). ### How was this patch tested? New unit tests. Closes #27571 from zero323/SPARK-30819. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-04-09 19:38:11 -05:00
zero323	0063462d55	[SPARK-30818][SPARKR][ML] Add SparkR LinearRegression wrapper ### What changes were proposed in this pull request? This pull request adds SparkR wrapper for `LinearRegression`: - Supporting `org.apache.spark.ml.r.LinearRegressionWrapper`. - `LinearRegressionModel` S4 class. - Corresponding `spark.lm`, `predict`, `summary` and `write.ml` generics. - Corresponding docs and tests. ### Why are the changes needed? Feature parity. ### Does this PR introduce any user-facing change? No (new API). ### How was this patch tested? New unit tests. Closes #27593 from zero323/SPARK-30818. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-04-08 22:29:44 -05:00
zero323	0d37f794ef	[SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR ### What changes were proposed in this pull request? This pull request adds SparkR wrapper for `FMClassifier`: - Supporting ` org.apache.spark.ml.r.FMClassifierWrapper`. - `FMClassificationModel` S4 class. - Corresponding `spark.fmClassifier`, `predict`, `summary` and `write.ml` generics. - Corresponding docs and tests. ### Why are the changes needed? Feature parity. ### Does this PR introduce any user-facing change? No (new API). ### How was this patch tested? New unit tests. Closes #27570 from zero323/SPARK-30820. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-04-07 09:01:45 -05:00
HyukjinKwon	0f48aafab8	[SPARK-29339][R] Support Arrow 0.14 in vectorized dapply and gapply (test it in AppVeyor build) ### What changes were proposed in this pull request? This PR proposes: 1. to use `is.data.frame` to check if it is a DataFrame. 2. to install Arrow and test Arrow optimization in AppVeyor build. We're currently not testing this in CI. ### Why are the changes needed? 1. To support SparkR with Arrow 0.14 2. To check if there's any regression and if it works correctly. ### Does this PR introduce any user-facing change? ```r df <- createDataFrame(mtcars) collect(dapply(df, function(rdf) { data.frame(rdf$gear + 1) }, structType("gear double"))) ``` Before: ``` Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 'n' argument ``` After: ``` gear 1 5 2 5 3 5 4 4 5 4 6 4 7 4 8 5 9 5 ... ``` ### How was this patch tested? AppVeyor Closes #25993 from HyukjinKwon/arrow-r-appveyor. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-04 08:56:45 +09:00
HyukjinKwon	7d4eb38bbc	[SPARK-29052][DOCS][ML][PYTHON][CORE][R][SQL][SS] Create a Migration Guide tab in Spark documentation ### What changes were proposed in this pull request? Currently, there is no migration section for PySpark, SparkCore and Structured Streaming. It is difficult for users to know what to do when they upgrade. This PR proposes to create a "Migration Guide" tab in the Spark documentation. ![Screen Shot 2019-09-11 at 7 02 05 PM](https://user-images.githubusercontent.com/6477701/64688126-ad712f80-d4c6-11e9-8672-9a2c56c05bf8.png) ![Screen Shot 2019-09-11 at 7 27 15 PM](https://user-images.githubusercontent.com/6477701/64689915-389ff480-d4ca-11e9-8c54-7f46095d0d23.png) This page will contain migration guides for Spark SQL, PySpark, SparkR, MLlib, Structured Streaming and Core. Basically it is a refactoring. Some new information is added, on which I will leave inline comments for easier review. 1. MLlib Merge [ml-guide.html#migration-guide](https://spark.apache.org/docs/latest/ml-guide.html#migration-guide) and [ml-migration-guides.html](https://spark.apache.org/docs/latest/ml-migration-guides.html) ``` 'docs/ml-guide.md' ↓ Merge new/old migration guides 'docs/ml-migration-guide.md' ``` 2. PySpark Extract PySpark specific items from https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html ``` 'docs/sql-migration-guide-upgrade.md' ↓ Extract PySpark specific items 'docs/pyspark-migration-guide.md' ``` 3. SparkR Move [sparkr.html#migration-guide](https://spark.apache.org/docs/latest/sparkr.html#migration-guide) into a separate file, and extract from [sql-migration-guide-upgrade.html](https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html) ``` 'docs/sparkr.md' 'docs/sql-migration-guide-upgrade.md' Move migration guide section ↘ ↙ Extract SparkR specific items docs/sparkr-migration-guide.md ``` 4. Core Newly created at `'docs/core-migration-guide.md'`. I skimmed resolved JIRAs at 3.0.0 and found some items to note. 5. Structured Streaming Newly created at `'docs/ss-migration-guide.md'`. I skimmed resolved JIRAs at 3.0.0 and found some items to note. 6. SQL Merged [sql-migration-guide-upgrade.html](https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html) and [sql-migration-guide-hive-compatibility.html](https://spark.apache.org/docs/latest/sql-migration-guide-hive-compatibility.html) ``` 'docs/sql-migration-guide-hive-compatibility.md' 'docs/sql-migration-guide-upgrade.md' Move Hive compatibility section ↘ ↙ Left over after filtering PySpark and SparkR items 'docs/sql-migration-guide.md' ``` ### Why are the changes needed? In order for users in production to effectively migrate to higher versions, and detect behaviour or breaking changes before upgrading and/or migrating. ### Does this PR introduce any user-facing change? Yes, this changes Spark's documentation at https://spark.apache.org/docs/latest/index.html. ### How was this patch tested? Manually built the doc. This can be verified as below: ```bash cd docs SKIP_API=1 jekyll build open _site/index.html ``` Closes #25757 from HyukjinKwon/migration-doc. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-09-15 11:17:30 -07:00
HyukjinKwon	db48da87f0	[SPARK-27834][SQL][R][PYTHON] Make separate PySpark/SparkR vectorization configurations ## What changes were proposed in this pull request? `spark.sql.execution.arrow.enabled` was added when we added PySpark Arrow optimization. Later, in the current master, SparkR Arrow optimization was added and it's controlled by the same configuration `spark.sql.execution.arrow.enabled`. There appear to be two issues with this: 1. `spark.sql.execution.arrow.enabled` in PySpark was added in 2.3.0 whereas SparkR optimization was added in 3.0.0. The stability is different, so it's problematic when we change the default value for one of the two optimizations first. 2. Suppose users want to share some JVM between PySpark and SparkR. They are currently forced to use the optimization for all or none if the configuration is set globally. This PR proposes two separate configuration groups for PySpark and SparkR about Arrow optimization: - Deprecate `spark.sql.execution.arrow.enabled` - Add `spark.sql.execution.arrow.pyspark.enabled` (fallback to `spark.sql.execution.arrow.enabled`) - Add `spark.sql.execution.arrow.sparkr.enabled` - Deprecate `spark.sql.execution.arrow.fallback.enabled` - Add `spark.sql.execution.arrow.pyspark.fallback.enabled` (fallback to `spark.sql.execution.arrow.fallback.enabled`) Note that `spark.sql.execution.arrow.maxRecordsPerBatch` is used within JVM side for both. Note that `spark.sql.execution.arrow.fallback.enabled` was added due to behaviour change. We don't need it in SparkR - SparkR side has the automatic fallback. ## How was this patch tested? Manually tested and some unit tests were added. Closes #24700 from HyukjinKwon/separate-sparkr-arrow. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-06-03 10:01:37 +09:00
HyukjinKwon	cc0b9d41cd	[MINOR][DOCS][R] Use actual version in SparkR Arrow guide for copy-and-paste ## What changes were proposed in this pull request? To address https://github.com/apache/spark/pull/24506#discussion_r280964509 ## How was this patch tested? N/A Closes #24701 from HyukjinKwon/minor-arrow-r-doc. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-24 10:38:26 -07:00
Liang-Chi Hsieh	253a8793f0	[SPARK-26921][R][DOCS][FOLLOWUP] Document Arrow optimization and vectorized R APIs ## What changes were proposed in this pull request? There are a few suspect items in the newly added doc. This follow-up fixes them and a typo. ## How was this patch tested? N/A Closes #24514 from viirya/SPARK-26924-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-05-02 20:36:34 +09:00
HyukjinKwon	3670826af6	[SPARK-26921][R][DOCS] Document Arrow optimization and vectorized R APIs ## What changes were proposed in this pull request? This PR adds SparkR with Arrow optimization documentation. Note that the CRAN issue on the Arrow side looks unlikely to be fixed soon, IMHO, even after Spark 3.0. If it happens to be fixed, I will update this doc later. Another note is that the Arrow R package itself requires R 3.5+. So, I intentionally didn't note this. ## How was this patch tested? Manually built and checked. Closes #24506 from HyukjinKwon/SPARK-26924. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-05-02 10:02:14 +09:00
Sean Owen	754f820035	[SPARK-26918][DOCS] All .md should have ASF license header ## What changes were proposed in this pull request? Add AL2 license to metadata of all .md files. This seemed to be the tidiest way as it will get ignored by .md renderers and other tools. Attempts to write them as markdown comments revealed that there is no such standard thing. ## How was this patch tested? Doc build Closes #24243 from srowen/SPARK-26918. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-30 19:49:45 -05:00
Huaxin Gao	05cf81e6de	[SPARK-19827][R] spark.ml R API for PIC ## What changes were proposed in this pull request? Add PowerIterationCluster (PIC) in R ## How was this patch tested? Add test case Closes #23072 from huaxingao/spark-19827. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-12-10 18:28:13 -06:00
Keiji Yoshida	c3f27b2437	[MINOR][DOCS] Fix typos ## What changes were proposed in this pull request? Fix Typos. This PR is the complete version of https://github.com/apache/spark/pull/23145. ## How was this patch tested? NA Closes #23185 from kjmrknsn/docUpdate. Authored-by: Keiji Yoshida <kjmrknsn@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-11-30 09:03:46 -06:00
gatorsmile	94145786a5	[SPARK-25908][SQL][FOLLOW-UP] Add back unionAll ## What changes were proposed in this pull request? This PR is to add back `unionAll`, which is widely used. The name is also consistent with our ANSI SQL. We also have the corresponding `intersectAll` and `exceptAll`, which were introduced in Spark 2.4. ## How was this patch tested? Added a test case in DataFrameSuite Closes #23131 from gatorsmile/addBackUnionAll. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-11-25 15:53:07 -08:00
DB Tsai	ad853c5678	[SPARK-25956] Make Scala 2.12 as default Scala version in Spark 3.0 ## What changes were proposed in this pull request? This PR makes Spark's default Scala version as 2.12, and Scala 2.11 will be the alternative version. This implies that Scala 2.12 will be used by our CI builds including pull request builds. We'll update the Jenkins to include a new compile-only jobs for Scala 2.11 to ensure the code can be still compiled with Scala 2.11. ## How was this patch tested? existing tests Closes #22967 from dbtsai/scala2.12. Authored-by: DB Tsai <d_tsai@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2018-11-14 16:22:23 -08:00
Felix Cheung	41e1416f4d	[SPARK-16693][SPARKR] Remove methods deprecated ## What changes were proposed in this pull request? Remove deprecated functions, which include: SQLContext/HiveContext stuff, sparkR.init, jsonFile, parquetFile, registerTempTable, saveAsParquetFile, unionAll, createExternalTable, dropTempTable ## How was this patch tested? jenkins Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #22843 from felixcheung/rrddapi.	2018-10-27 15:11:29 -07:00
Sean Owen	ca545f7941	[SPARK-25821][SQL] Remove SQLContext methods deprecated in 1.4 ## What changes were proposed in this pull request? Remove SQLContext methods deprecated in 1.4 ## How was this patch tested? Existing tests. Closes #22815 from srowen/SPARK-25821. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-26 16:49:48 -05:00
adrian555	ddd1b1e8ae	[SPARK-24572][SPARKR] "eager execution" for R shell, IDE ## What changes were proposed in this pull request? Check the `spark.sql.repl.eagerEval.enabled` configuration property in SparkDataFrame `show()` method. If the `SparkSession` has eager execution enabled, the data will be returned to the R client when the data frame is created. So instead of seeing this ``` > df <- createDataFrame(faithful) > df SparkDataFrame[eruptions:double, waiting:double] ``` you will see ``` > df <- createDataFrame(faithful) > df +---------+-------+ \|eruptions\|waiting\| +---------+-------+ \| 3.6\| 79.0\| \| 1.8\| 54.0\| \| 3.333\| 74.0\| \| 2.283\| 62.0\| \| 4.533\| 85.0\| \| 2.883\| 55.0\| \| 4.7\| 88.0\| \| 3.6\| 85.0\| \| 1.95\| 51.0\| \| 4.35\| 85.0\| \| 1.833\| 54.0\| \| 3.917\| 84.0\| \| 4.2\| 78.0\| \| 1.75\| 47.0\| \| 4.7\| 83.0\| \| 2.167\| 52.0\| \| 1.75\| 62.0\| \| 4.8\| 84.0\| \| 1.6\| 52.0\| \| 4.25\| 79.0\| +---------+-------+ only showing top 20 rows ``` ## How was this patch tested? Manual tests as well as unit tests (one new test case is added). Author: adrian555 <v2ave10p> Closes #22455 from adrian555/eager_execution.	2018-10-24 23:42:06 -07:00
Huaxin Gao	fc64e83f95	[SPARK-24207][R] add R API for PrefixSpan ## What changes were proposed in this pull request? add R API for PrefixSpan ## How was this patch tested? add test in test_mllib_fpm.R Author: Huaxin Gao <huaxing@us.ibm.com> Closes #21710 from huaxingao/spark-24207.	2018-10-21 12:32:43 -07:00
Yuanjian Li	987f386588	[SPARK-24499][SQL][DOC] Split the page of sql-programming-guide.html to multiple separate pages ## What changes were proposed in this pull request? 1. Split the main page of sql-programming-guide into 7 parts: - Getting Started - Data Sources - Performance Tuning - Distributed SQL Engine - PySpark Usage Guide for Pandas with Apache Arrow - Migration Guide - Reference 2. Add left menu for sql-programming-guide, keep first level index for each part in the menu. ![image](https://user-images.githubusercontent.com/4833765/47016859-6332e180-d183-11e8-92e8-ce62518a83c4.png) ## How was this patch tested? Local test with jekyll build/serve. Closes #22746 from xuanyuanking/SPARK-24499. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2018-10-18 11:59:06 -07:00
Ilan Filonenko	51540c2fa6	[SPARK-25372][YARN][K8S] Deprecate and generalize keytab / principal config ## What changes were proposed in this pull request? SparkSubmit already logs in the user if a keytab is provided, the only issue is that it uses the existing configs which have "yarn" in their name. As such, the configs were changed to: `spark.kerberos.keytab` and `spark.kerberos.principal`. ## How was this patch tested? Will be tested with K8S tests, but needs to be tested with Yarn - [x] K8S Secure HDFS tests - [x] Yarn Secure HDFS tests vanzin Closes #22362 from ifilonenko/SPARK-25372. Authored-by: Ilan Filonenko <if56@cornell.edu> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2018-09-26 17:24:52 -07:00
Sean Owen	35f7f5ce83	[DOCS][MINOR] Fix a few broken links and typos, and, nit, use HTTPS more consistently ## What changes were proposed in this pull request? Fix a few broken links and typos, and, nit, use HTTPS more consistently esp. on scripts and Apache links ## How was this patch tested? Doc build Closes #22172 from srowen/DocTypo. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2018-08-22 01:02:17 +08:00
Liang-Chi Hsieh	8b0e94d896	[SPARK-23042][ML] Use OneHotEncoderModel to encode labels in MultilayerPerceptronClassifier ## What changes were proposed in this pull request? In MultilayerPerceptronClassifier, we use RDD operation to encode labels for now. I think we should use ML's OneHotEncoderEstimator/Model to do the encoding. ## How was this patch tested? Existing tests. Closes #20232 from viirya/SPARK-23042. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2018-08-17 18:40:29 +00:00
hyukjinkwon	1c9c5de951	[SPARK-23291][R][FOLLOWUP] Update SparkR migration note for ## What changes were proposed in this pull request? This PR fixes the migration note for SPARK-23291 since it is going to be backported to 2.3.1. See the discussion in https://issues.apache.org/jira/browse/SPARK-23291 ## How was this patch tested? N/A Author: hyukjinkwon <gurwls223@apache.org> Closes #21249 from HyukjinKwon/SPARK-23291.	2018-05-07 14:52:14 -07:00
Daniel Sakuma	6ade5cbb49	[MINOR][DOC] Fix some typos and grammar issues ## What changes were proposed in this pull request? Easy fix in the documentation. ## How was this patch tested? N/A Closes #20948 Author: Daniel Sakuma <dsakuma@gmail.com> Closes #20928 from dsakuma/fix_typo_configuration_docs.	2018-04-06 13:37:08 +08:00
Liang-Chi Hsieh	53561d27c4	[SPARK-23291][SQL][R] R's substr should not reduce starting position by 1 when calling Scala API ## What changes were proposed in this pull request? It seems R's substr API treats the Scala substr API as zero-based and so subtracts 1 from the given starting position. Because Scala's substr API also accepts a zero-based starting position (treated as the first element), the current R substr test results are correct, as they all use 1 as the starting position. ## How was this patch tested? Modified tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #20464 from viirya/SPARK-23291.	2018-03-07 09:37:42 -08:00
Felix Cheung	02214b0943	[SPARK-21293][SPARKR][DOCS] structured streaming doc update ## What changes were proposed in this pull request? doc update Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #20197 from felixcheung/rwadoc.	2018-01-08 22:08:19 -08:00
Felix Cheung	7a702d8d5e	[SPARK-21616][SPARKR][DOCS] update R migration guide and vignettes ## What changes were proposed in this pull request? update R migration guide and vignettes ## How was this patch tested? manually Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #20106 from felixcheung/rreleasenote23.	2018-01-02 07:00:31 +09:00
Zheng RuiFeng	a97c497045	[SPARK-20849][DOC][SPARKR] Document R DecisionTree ## What changes were proposed in this pull request? 1, add an example for sparkr `decisionTree` 2, document it in user guide ## How was this patch tested? local submit Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #18067 from zhengruifeng/dt_example.	2017-05-25 23:00:50 -07:00
Felix Cheung	b8302ccd02	[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example ## What changes were proposed in this pull request? Add - R vignettes - R programming guide - SS programming guide - R example Also disable spark.als in vignettes for now since it's failing (SPARK-20402) ## How was this patch tested? manually Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17814 from felixcheung/rdocss.	2017-05-04 00:27:10 -07:00
Felix Cheung	d20a976e89	[SPARK-20192][SPARKR][DOC] SparkR migration guide to 2.2.0 ## What changes were proposed in this pull request? Updating R Programming Guide ## How was this patch tested? manually Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #17816 from felixcheung/r22relnote.	2017-05-01 21:03:48 -07:00
wangmiao1981	b28c3bc202	[SPARK-20477][SPARKR][DOC] Document R bisecting k-means in R programming guide ## What changes were proposed in this pull request? Add hyper link in the SparkR programming guide. ## How was this patch tested? Build doc and manually check the doc link. Author: wangmiao1981 <wm624@hotmail.com> Closes #17805 from wangmiao1981/doc.	2017-04-29 10:31:01 -07:00
wangmiao1981	7fe8249793	[SPARKR][DOC] Document LinearSVC in R programming guide ## What changes were proposed in this pull request? add link to svmLinear in the SparkR programming document. ## How was this patch tested? Build doc manually and click the link to the document. It looks good. Author: wangmiao1981 <wm624@hotmail.com> Closes #17797 from wangmiao1981/doc.	2017-04-27 22:29:47 -07:00
zero323	ba7666274e	[SPARK-20208][DOCS][FOLLOW-UP] Add FP-Growth to SparkR programming guide ## What changes were proposed in this pull request? Add `spark.fpGrowth` to SparkR programming guide. ## How was this patch tested? Manual tests. Author: zero323 <zero323@users.noreply.github.com> Closes #17775 from zero323/SPARK-20208-FOLLOW-UP.	2017-04-27 00:34:20 -07:00
zero323	df58a95a33	[SPARK-20437][R] R wrappers for rollup and cube ## What changes were proposed in this pull request? - Add `rollup` and `cube` methods and corresponding generics. - Add short description to the vignette. ## How was this patch tested? - Existing unit tests. - Additional unit tests covering new features. - `check-cran.sh`. Author: zero323 <zero323@users.noreply.github.com> Closes #17728 from zero323/SPARK-20437.	2017-04-25 22:00:45 -07:00
Yanbo Liang	1d00761b91	[MINOR][SPARKR] Move 'Data type mapping between R and Spark' to right place in SparkR doc. Section ```Data type mapping between R and Spark``` was put in the wrong place in SparkR doc currently, we should move it to a separate section. ## What changes were proposed in this pull request? Before this PR: ![image](https://cloud.githubusercontent.com/assets/1962026/24340911/bc01a532-126a-11e7-9a08-0d60d13a547c.png) After this PR: ![image](https://cloud.githubusercontent.com/assets/1962026/24340938/d9d32a9a-126a-11e7-8891-d2f5b46e0c71.png) Author: Yanbo Liang <ybliang8@gmail.com> Closes #17440 from yanboliang/sparkr-doc.	2017-03-27 17:37:24 -07:00
Felix Cheung	38fd163d0d	[SPARK-18849][ML][SPARKR][DOC] vignettes final check reorg ## What changes were proposed in this pull request? Reorganizing content (copy/paste) ## How was this patch tested? https://felixcheung.github.io/sparkr-vignettes.html Previous: https://felixcheung.github.io/sparkr-vignettes_old.html Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16301 from felixcheung/rvignettespass2.	2016-12-17 14:37:34 -08:00
Yanbo Liang	9bf8f3cd4f	[SPARK-18325][SPARKR][ML] SparkR ML wrappers example code and user guide ## What changes were proposed in this pull request? * Add all R examples for ML wrappers which were added during 2.1 release cycle. * Split the whole ```ml.R``` example file into an individual example for each algorithm, which will be convenient for users to rerun them. * Add corresponding examples to ML user guide. * Update ML section of SparkR user guide. Note: MLlib Scala/Java/Python examples will be consistent, however, SparkR examples may differ from them, since R users may use the algorithms in a different way, for example, using R ```formula``` to specify ```featuresCol``` and ```labelCol```. ## How was this patch tested? Run all examples manually. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16148 from yanboliang/spark-18325.	2016-12-08 06:19:38 -08:00
Felix Cheung	b019b3a8ac	[SPARK-18643][SPARKR] SparkR hangs at session start when installed as a package without Spark ## What changes were proposed in this pull request? If SparkR is running as a package and it has previously downloaded Spark Jar it should be able to run as before without having to set SPARK_HOME. Basically with this bug the auto install Spark will only work in the first session. This seems to be a regression on the earlier behavior. Fix is to always try to install or check for the cached Spark if running in an interactive session. As discussed before, we should probably only install Spark iff running in an interactive session (R shell, RStudio etc) ## How was this patch tested? Manually Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16077 from felixcheung/rsessioninteractive.	2016-12-04 20:25:11 -08:00
Sean Owen	7e0cd1d9b1	[SPARK-18073][DOCS][WIP] Migrate wiki to spark.apache.org web site ## What changes were proposed in this pull request? Updates links to the wiki to links to the new location of content on spark.apache.org. ## How was this patch tested? Doc builds Author: Sean Owen <sowen@cloudera.com> Closes #15967 from srowen/SPARK-18073.1.	2016-11-23 11:25:47 +00:00
Felix Cheung	44c8bfda79	[SQL][DOC] updating doc for JSON source to link to jsonlines.org ## What changes were proposed in this pull request? API and programming guide doc changes for Scala, Python and R. ## How was this patch tested? manual test Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #15629 from felixcheung/jsondoc.	2016-10-26 23:06:11 -07:00
Felix Cheung	e21e1c946c	[SPARK-18013][SPARKR] add crossJoin API ## What changes were proposed in this pull request? Add crossJoin and do not default to cross join if joinExpr is left out ## How was this patch tested? unit test Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #15559 from felixcheung/rcrossjoin.	2016-10-21 12:35:37 -07:00
Jeff Zhang	f62ddc5983	[SPARK-17210][SPARKR] sparkr.zip is not distributed to executors when running sparkr in RStudio ## What changes were proposed in this pull request? Spark will add sparkr.zip to archive only when it is yarn mode (SparkSubmit.scala). ``` if (args.isR && clusterManager == YARN) { val sparkRPackagePath = RUtils.localSparkRPackagePath if (sparkRPackagePath.isEmpty) { printErrorAndExit("SPARK_HOME does not exist for R application in YARN mode.") } val sparkRPackageFile = new File(sparkRPackagePath.get, SPARKR_PACKAGE_ARCHIVE) if (!sparkRPackageFile.exists()) { printErrorAndExit(s"$SPARKR_PACKAGE_ARCHIVE does not exist for R application in YARN mode.") } val sparkRPackageURI = Utils.resolveURI(sparkRPackageFile.getAbsolutePath).toString // Distribute the SparkR package. // Assigns a symbol link name "sparkr" to the shipped package. args.archives = mergeFileLists(args.archives, sparkRPackageURI + "#sparkr") // Distribute the R package archive containing all the built R packages. if (!RUtils.rPackages.isEmpty) { val rPackageFile = RPackageUtils.zipRLibraries(new File(RUtils.rPackages.get), R_PACKAGE_ARCHIVE) if (!rPackageFile.exists()) { printErrorAndExit("Failed to zip all the built R packages.") } val rPackageURI = Utils.resolveURI(rPackageFile.getAbsolutePath).toString // Assigns a symbol link name "rpkg" to the shipped package. args.archives = mergeFileLists(args.archives, rPackageURI + "#rpkg") } } ``` So it is necessary to pass spark.master from R process to JVM. Otherwise sparkr.zip won't be distributed to executor. Besides that I also pass spark.yarn.keytab/spark.yarn.principal to spark side, because JVM process need them to access secured cluster. ## How was this patch tested? Verify it manually in R Studio using the following code. 
``` Sys.setenv(SPARK_HOME="/Users/jzhang/github/spark") .libPaths(c(file.path(Sys.getenv(), "R", "lib"), .libPaths())) library(SparkR) sparkR.session(master="yarn-client", sparkConfig = list(spark.executor.instances="1")) df <- as.DataFrame(mtcars) head(df) ``` … Author: Jeff Zhang <zjffdu@apache.org> Closes #14784 from zjffdu/SPARK-17210.	2016-09-23 11:37:43 -07:00
Sean Owen	dc0a4c9161	[SPARK-17445][DOCS] Reference an ASF page as the main place to find third-party packages ## What changes were proposed in this pull request? Point references to spark-packages.org to https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects This will be accompanied by a parallel change to the spark-website repo, and additional changes to this wiki. ## How was this patch tested? Jenkins tests. Author: Sean Owen <sowen@cloudera.com> Closes #15075 from srowen/SPARK-17445.	2016-09-14 10:10:16 +01:00
Felix Cheung	b73defdd79	[SPARKR][DOCS] fix broken url in doc ## What changes were proposed in this pull request? Fix broken url, also, sparkR.session.stop doc page should have it in the header, instead of saying "sparkR.stop" ![image](https://cloud.githubusercontent.com/assets/8969467/17080129/26d41308-50d9-11e6-8967-79d6c920313f.png) Data type section is in the middle of a list of gapply/gapplyCollect subsections: ![image](https://cloud.githubusercontent.com/assets/8969467/17080122/f992d00a-50d8-11e6-8f2c-fd5786213920.png) ## How was this patch tested? manual test Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14329 from felixcheung/rdoclinkfix.	2016-07-25 11:25:41 -07:00
Felix Cheung	75f0efe74d	[SPARKR][DOCS] minor code sample update in R programming guide ## What changes were proposed in this pull request? Fix code style from ad hoc review of RC4 doc ## How was this patch tested? manual shivaram Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #14250 from felixcheung/rdocs2rc4.	2016-07-18 16:01:57 -07:00
Narine Kokhlikyan	4167304836	[SPARK-16112][SPARKR] Programming guide for gapply/gapplyCollect ## What changes were proposed in this pull request? Updates programming guide for spark.gapply/spark.gapplyCollect. Similar to other examples I used `faithful` dataset to demonstrate gapply's functionality. Please, let me know if you prefer another example. ## How was this patch tested? Existing test cases in R Author: Narine Kokhlikyan <narine@slice.com> Closes #14090 from NarineK/gapplyProgGuide.	2016-07-16 16:56:16 -07:00

75 commits