Added the parameter `drop` to the subsetting operator `[`. This is useful for getting a Column from a DataFrame given its name, and mirrors base R's behavior.
In R:
```
> name <- "Sepal_Length"
> class(iris[, name])
[1] "numeric"
```
Currently, in SparkR:
```
> name <- "Sepal_Length"
> class(irisDF[, name])
[1] "DataFrame"
```
The code above returns a DataFrame, which is inconsistent with R's behavior; SparkR should return a Column instead. Currently, the only way for a user to obtain a Column given a column name stored in a character variable is through `eval(parse(x))`, where x is the string `"irisDF$Sepal_Length"`, which is pretty hacky. `SparkR:::getColumn()` is another option, but I don't see why that method should be externalized. Instead, following R's conventions, the proposed implementation allows this:
```
> name <- "Sepal_Length"
> class(irisDF[, name, drop=T])
[1] "Column"
> class(irisDF[, name, drop=F])
[1] "DataFrame"
```
This is consistent with R:
```
> name <- "Sepal_Length"
> class(iris[, name])
[1] "numeric"
> class(iris[, name, drop=F])
[1] "data.frame"
```
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
Closes#11318 from olarayej/SPARK-13436.
## What changes were proposed in this pull request?
Added method histogram() to compute the histogram of a Column
Usage:
```
## Create a DataFrame from the Iris dataset
irisDF <- createDataFrame(sqlContext, iris)
## Render a histogram for the Sepal_Length column
histogram(irisDF, "Sepal_Length", nbins=12)
```
![histogram](https://cloud.githubusercontent.com/assets/13985649/13588486/e1e751c6-e484-11e5-85db-2fc2115c4bb2.png)
Note: Usage will change once SPARK-9325 is figured out so that histogram() only takes a Column as a parameter, as opposed to a DataFrame and a name
## How was this patch tested?
All unit tests pass. I added specific unit cases for different scenarios.
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
Closes#11569 from olarayej/SPARK-13734.
## What changes were proposed in this pull request?
```AFTSurvivalRegressionModel``` supports ```save/load``` in SparkR.
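For reference, a minimal usage sketch; the fit call follows the `survreg` wrapper described later in this log and the `ml.save`/`ml.load` names follow the NaiveBayes PR below, so both are assumptions here rather than verified API:
```r
# Assumed usage sketch: fit an AFT model, save it, and load it back
library(survival)                              # only for the ovarian example dataset
df <- createDataFrame(sqlContext, ovarian)     # column names with "." become "_"
model <- survreg(Surv(futime, fustat) ~ ecog_ps + rx, df)
ml.save(model, "/tmp/aft-model")               # path is illustrative
model2 <- ml.load("/tmp/aft-model")
summary(model2)
```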
## How was this patch tested?
Unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12685 from yanboliang/spark-14313.
## What changes were proposed in this pull request?
SparkR ```NaiveBayesModel``` supports ```save/load``` by the following API:
```
df <- createDataFrame(sqlContext, infert)
model <- naiveBayes(education ~ ., df, laplace = 0)
ml.save(model, path)
model2 <- ml.load(path)
```
## How was this patch tested?
Add unit tests.
cc mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12573 from yanboliang/spark-14312.
## What changes were proposed in this pull request?
In order to support running SQL directly on files, we added some code in ResolveRelations to catch the exception thrown by catalog.lookupRelation and ignore it. This unfortunately masks all the exceptions. This patch changes the logic to simply test the table's existence.
## How was this patch tested?
I manually hacked some bugs into Spark and made sure the exceptions were being propagated up.
Author: Reynold Xin <rxin@databricks.com>
Closes#12634 from rxin/SPARK-14869.
## What changes were proposed in this pull request?
Changed the class name defined in R from "DataFrame" to "SparkDataFrame". A popular package, S4Vector, already defines "DataFrame"; this change is to avoid the conflict.
Aside from the class name and API/roxygen2 references, SparkR APIs like `createDataFrame` and `as.DataFrame` are not changed (S4Vector does not define an "as.DataFrame").
Since in R one would rarely reference the type/class, this change should have minimal to no impact on SparkR users in terms of backward compatibility.
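For illustration, a small sketch of what the rename looks like from a user's perspective (the output shown is what one would expect after this change, not verified output):
```r
df <- as.DataFrame(sqlContext, iris)
class(df)
# Expected after this change:
# [1] "SparkDataFrame"
# attr(,"package")
# [1] "SparkR"
```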
## How was this patch tested?
SparkR tests, manually loading S4Vector then SparkR package
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#12621 from felixcheung/rdataframe.
## What changes were proposed in this pull request?
The concurrency issue reported in SPARK-13178 was fixed by the PR https://github.com/apache/spark/pull/10947 for SPARK-12792.
This PR just removes a workaround not needed anymore.
## How was this patch tested?
SparkR unit tests.
Author: Sun Rui <rui.sun@intel.com>
Closes#12606 from sun-rui/SPARK-13178.
## What changes were proposed in this pull request?
This PR aims to add `setLogLevel` function to SparkR shell.
**Spark Shell**
```scala
scala> sc.setLogLevel("ERROR")
```
**PySpark**
```python
>>> sc.setLogLevel("ERROR")
```
**SparkR (this PR)**
```r
> setLogLevel(sc, "ERROR")
NULL
```
## How was this patch tested?
Pass the Jenkins tests including a new R testcase.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12547 from dongjoon-hyun/SPARK-14780.
## What changes were proposed in this pull request?
This issue aims to expose Scala `bround` function in Python/R API.
`bround` function is implemented in SPARK-14614 by extending current `round` function.
We used the following semantics from Hive.
```java
public static double bround(double input, int scale) {
  if (Double.isNaN(input) || Double.isInfinite(input)) {
    return input;
  }
  return BigDecimal.valueOf(input).setScale(scale, RoundingMode.HALF_EVEN).doubleValue();
}
```
After this PR, `pyspark` and `sparkR` also support the `bround` function.
**PySpark**
```python
>>> from pyspark.sql.functions import bround
>>> sqlContext.createDataFrame([(2.5,)], ['a']).select(bround('a', 0).alias('r')).collect()
[Row(r=2.0)]
```
**SparkR**
```r
> df = createDataFrame(sqlContext, data.frame(x = c(2.5, 3.5)))
> head(collect(select(df, bround(df$x, 0))))
  bround(x, 0)
1            2
2            4
```
## How was this patch tested?
Pass the Jenkins tests (including new testcases).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12509 from dongjoon-hyun/SPARK-14639.
## What changes were proposed in this pull request?
Change the signature of as.data.frame() to be consistent with the one in the R base package, to meet R users' expectations.
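A minimal sketch of the intended call, assuming the new signature mirrors base R's `as.data.frame(x, row.names = NULL, optional = FALSE, ...)`:
```r
# Assumed usage sketch: collect a SparkR DataFrame into a local data.frame
df <- createDataFrame(sqlContext, iris)
localDF <- as.data.frame(df, row.names = NULL, optional = FALSE)
class(localDF)
# [1] "data.frame"
```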
## How was this patch tested?
dev/lint-r
SparkR unit tests
Author: Sun Rui <rui.sun@intel.com>
Closes#11811 from sun-rui/SPARK-13905.
Add R API for `read.jdbc`, `write.jdbc`.
Tested this quite a bit manually with different combinations of parameters. It's not clear if we could have automated tests in R for this, since the Scala `JDBCSuite` depends on the Java H2 in-memory database.
Refactored some code into util so it could be tested.
Core's R SerDe code needs to be updated to allow access to java.util.Properties as a `jobj` handle, which is required by DataFrameReader/Writer's `jdbc` method. It would be possible, though it would take more code, to add a `sql/r/SQLUtils` helper function instead.
Tested:
```
# with postgresql
../bin/sparkR --driver-class-path /usr/share/java/postgresql-9.4.1207.jre7.jar
# read.jdbc
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", user = "user", password = "12345")
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", user = "user", password = 12345)
# partitionColumn and numPartitions test
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", partitionColumn = "did", lowerBound = 0, upperBound = 200, numPartitions = 4, user = "user", password = 12345)
a <- SparkR:::toRDD(df)
SparkR:::getNumPartitions(a)
[1] 4
SparkR:::collectPartition(a, 2L)
# defaultParallelism test
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", partitionColumn = "did", lowerBound = 0, upperBound = 200, user = "user", password = 12345)
SparkR:::getNumPartitions(a)
[1] 2
# predicates test
df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", predicates = list("did<=105"), user = "user", password = 12345)
count(df) == 1
# write.jdbc, default save mode "error"
irisDf <- as.DataFrame(sqlContext, iris)
write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "films2", user = "user", password = "12345")
"error, already exists"
write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "iris", user = "user", password = "12345")
```
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#10480 from felixcheung/rreadjdbc.
## What changes were proposed in this pull request?
Expose R-like summary statistics in SparkR::glm for more family and link functions.
Note: Not all values in R [summary.glm](http://stat.ethz.ch/R-manual/R-patched/library/stats/html/summary.glm.html) are exposed, we only provide the most commonly used statistics in this PR. More statistics can be added in the followup work.
## How was this patch tested?
Unit tests.
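For reference, a sketch of the call that would produce the output below; the formula and family are inferred from the coefficients shown and should be treated as assumptions:
```r
# Assumed fit: gaussian GLM of Sepal_Width on Sepal_Length and Species
irisDF <- createDataFrame(sqlContext, iris)
model <- glm(Sepal_Width ~ Sepal_Length + Species, data = irisDF, family = "gaussian")
summary(model)
```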
SparkR Output:
```
Deviance Residuals:
(Note: These are approximate quantiles with relative error <= 0.01)
     Min        1Q    Median        3Q       Max
-0.95096  -0.16585  -0.00232   0.17410   0.72918

Coefficients:
                    Estimate  Std. Error  t value   Pr(>|t|)
(Intercept)          1.6765    0.23536      7.1231  4.4561e-11
Sepal_Length         0.34988   0.046301     7.5566  4.1873e-12
Species_versicolor  -0.98339   0.072075   -13.644   0
Species_virginica   -1.0075    0.093306   -10.798   0

(Dispersion parameter for gaussian family taken to be 0.08351462)

    Null deviance: 28.307  on 149  degrees of freedom
Residual deviance: 12.193  on 146  degrees of freedom
AIC: 59.22

Number of Fisher Scoring iterations: 1
```
R output:
```
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.95096  -0.16522   0.00171   0.18416   0.72918

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)         1.67650    0.23536   7.123 4.46e-11 ***
Sepal.Length        0.34988    0.04630   7.557 4.19e-12 ***
Speciesversicolor  -0.98339    0.07207 -13.644  < 2e-16 ***
Speciesvirginica   -1.00751    0.09331 -10.798  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.08351462)

    Null deviance: 28.307  on 149  degrees of freedom
Residual deviance: 12.193  on 146  degrees of freedom
AIC: 59.217

Number of Fisher Scoring iterations: 2
```
cc mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12393 from yanboliang/spark-13925.
* SparkR glm supports families and link functions which match R's signature for family (see the usage sketch after this list).
* SparkR glm API refactor. The comparative standard of the new API is R glm, so I only expose the arguments that R glm supports: ```formula, family, data, epsilon and maxit```.
* This PR focuses on glm() and predict(); summary statistics will be done in a separate PR after this gets in.
* This PR depends on #12287, which makes GLMs support link prediction on the Scala side. After that is merged, I will add more tests for predict() to this PR.
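A minimal usage sketch under the assumption that the exposed arguments follow R's glm; the dataset, formula, and the "prediction" output column name are illustrative assumptions:
```r
# Assumed usage sketch: R-style family/link plus the epsilon/maxit controls listed above
df <- createDataFrame(sqlContext, iris)
model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df,
             family = gaussian(link = "identity"), epsilon = 1e-6, maxit = 25)
preds <- predict(model, df)              # predictions returned as a DataFrame
head(select(preds, "prediction"))        # column name assumed to be "prediction"
```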
Unit tests.
cc mengxr jkbradley hhbyyh
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12294 from yanboliang/spark-12566.
#### What changes were proposed in this pull request?
This PR is to address the comment: https://github.com/apache/spark/pull/12146#discussion-diff-59092238. It removes the function `isViewSupported` from `SessionCatalog`. After the removal, we still can capture the user errors if users try to drop a table using `DROP VIEW`.
#### How was this patch tested?
Modified the existing test cases
Author: gatorsmile <gatorsmile@gmail.com>
Closes#12284 from gatorsmile/followupDropTable.
## What changes were proposed in this pull request?
The `window` function was added to Dataset with [this PR](https://github.com/apache/spark/pull/12008).
This PR adds the R API for this function.
With this PR, SQL, Java, and Scala will share the same APIs; users can call:
- `window(timeColumn, windowDuration)`
- `window(timeColumn, windowDuration, slideDuration)`
- `window(timeColumn, windowDuration, slideDuration, startTime)`
In Python and R, users can access all the APIs above, but in addition they can do the following.
- In R:
`window(timeColumn, windowDuration, startTime=...)`
That is, they can provide the startTime without providing the `slideDuration`; in this case, we will generate tumbling windows.
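A usage sketch of the new R API; the data, column names, and the aggregation are illustrative, shown only to demonstrate how the window column is typically used:
```r
# Assumed usage sketch: tumbling vs. sliding windows over a timestamp column
df <- createDataFrame(sqlContext,
                      data.frame(time = as.POSIXct("2016-01-01 00:00:00") + 0:9,
                                 value = 1:10))
# 5-second tumbling windows
head(agg(groupBy(df, window(df$time, "5 seconds")), sum(df$value)))
# 5-second windows sliding every second
head(agg(groupBy(df, window(df$time, "5 seconds", "1 second")), sum(df$value)))
```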
## How was this patch tested?
Unit tests + manual tests
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#12141 from brkyvz/R-windows.
## What changes were proposed in this pull request?
Refactor RRDD by separating the common logic interacting with the R worker to a new class RRunner, which can be used to evaluate R UDFs.
Now RRDD relies on RRunner for RDD computation, and RRDD could be removed if we want to remove the RDD API in SparkR later.
## How was this patch tested?
dev/lint-r
SparkR unit tests
Author: Sun Rui <rui.sun@intel.com>
Closes#12024 from sun-rui/SPARK-12792_new.
Refactor RRDD by separating the common logic interacting with the R worker to a new class RRunner, which can be used to evaluate R UDFs.
Now RRDD relies on RRunner for RDD computation, and RRDD could be removed if we want to remove the RDD API in SparkR later.
Author: Sun Rui <rui.sun@intel.com>
Closes#10947 from sun-rui/SPARK-12792.
## What changes were proposed in this pull request?
This reopens #11836, which was merged but promptly reverted because it introduced flaky Hive tests.
## How was this patch tested?
See `CatalogTestCases`, `SessionCatalogSuite` and `HiveContextSuite`.
Author: Andrew Or <andrew@databricks.com>
Closes#11938 from andrewor14/session-catalog-again.
## What changes were proposed in this pull request?
This PR continues the work in #11447; we implemented a wrapper of ```AFTSurvivalRegression``` named ```survreg``` in SparkR.
## How was this patch tested?
Test against output from R package survival's survreg.
cc mengxr felixcheung
Close#11447
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11932 from yanboliang/spark-13010-new.
## What changes were proposed in this pull request?
`SessionCatalog`, introduced in #11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations to `ExternalCatalog`. This functionality overlaps a lot with the existing `analysis.Catalog`.
As of this commit, `SessionCatalog` and `ExternalCatalog` will no longer be dead code. There are still things that need to be done after this patch, namely:
- SPARK-14013: Properly implement temporary functions in `SessionCatalog`
- SPARK-13879: Decide which DDL/DML commands to support natively in Spark
- SPARK-?????: Implement the ones we do want to support through `SessionCatalog`.
- SPARK-?????: Merge SQL/HiveContext
## How was this patch tested?
This is largely a refactoring task so there are no new tests introduced. The particularly relevant tests are `SessionCatalogSuite` and `ExternalCatalogSuite`.
Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#11836 from andrewor14/use-session-catalog.
## What changes were proposed in this pull request?
This PR continues the work in #11486 from yinxusen with some code refactoring. In R package e1071, `naiveBayes` supports both categorical (Bernoulli) and continuous features (Gaussian), while in MLlib we support Bernoulli and multinomial. This PR implements the common subset: Bernoulli.
I moved the implementation out from SparkRWrappers to NaiveBayesWrapper to make it easier to read. Argument names, default values, and summary now match e1071's naiveBayes.
I removed the preprocessing part that omits NA values, because we don't know which columns to process.
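A minimal sketch of the resulting API; the argument names follow e1071 as described above, and the dataset, laplace value, and use of predict/summary are illustrative assumptions:
```r
# Assumed usage sketch: e1071-style naiveBayes on a SparkR DataFrame
df <- createDataFrame(sqlContext, infert)
model <- naiveBayes(education ~ ., df, laplace = 1)
summary(model)                 # summary formatted to match e1071's naiveBayes
head(predict(model, df))       # predictions returned as a DataFrame
```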
## How was this patch tested?
Test against output from R package e1071's naiveBayes.
cc: yanboliang yinxusen
Closes#11486
Author: Xusen Yin <yinxusen@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#11890 from mengxr/SPARK-13449.
## What changes were proposed in this pull request?
This PR fixes all newly captured SparkR lint-r errors after the lintr package is updated from github.
## How was this patch tested?
dev/lint-r
SparkR unit tests
Author: Sun Rui <rui.sun@intel.com>
Closes#11652 from sun-rui/SPARK-13812.
## What changes were proposed in this pull request?
Add support for ignoring NAs in SparkR's ```first```/```last``` functions.
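A sketch of the intended behavior, assuming the NA handling is exposed through an R-style `na.rm` argument (the argument name and data are assumptions):
```r
# Assumed usage sketch: first/last that skip NAs
df <- createDataFrame(sqlContext, data.frame(x = c(NA, 1, 2)))
collect(select(df, first(df$x, na.rm = TRUE), last(df$x, na.rm = TRUE)))
```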
cc sun-rui felixcheung shivaram
## How was this patch tested?
unit tests
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11267 from yanboliang/spark-13389.
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
Closes#11220 from olarayej/SPARK-13312-3.
## What changes were proposed in this pull request?
Add ```approxQuantile``` for SparkR.
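A brief usage sketch; the argument names are assumed to mirror the underlying DataFrame stat function and the data are illustrative:
```r
# Assumed usage sketch: approximate quartiles of a numeric column
df <- createDataFrame(sqlContext, data.frame(x = 1:100))
approxQuantile(df, "x", probabilities = c(0.25, 0.5, 0.75), relativeError = 0.01)
```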
## How was this patch tested?
unit tests
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11383 from yanboliang/spark-13504 and squashes the following commits:
4f17adb [Yanbo Liang] Add approxQuantile for SparkR
JIRA: https://issues.apache.org/jira/browse/SPARK-13472
## What changes were proposed in this pull request?
One Kmeans test in R is unstable and sometimes fails. We should fix it.
## How was this patch tested?
Unit test is modified in this PR.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#11345 from viirya/fix-kmeans-r-test and squashes the following commits:
f959f61 [Liang-Chi Hsieh] Sort resulted clusters.
This PR introduces several major changes:
1. Replacing `Expression.prettyString` with `Expression.sql`
The `prettyString` method is mostly an internal, developer-facing facility for debugging purposes, and shouldn't be exposed to users.
1. Using SQL-like representations as column names for selected fields that are not named expressions (back-ticks and double quotes should be removed)
Before, we were using `prettyString` as column names when possible, and sometimes the resulting column names could be weird. Here are several examples:
Expression | `prettyString` | `sql` | Note
------------------ | -------------- | ---------- | ---------------
`a && b` | `a && b` | `a AND b` |
`a.getField("f")` | `a[f]` | `a.f` | `a` is a struct
1. Adding trait `NonSQLExpression` extending from `Expression` for expressions that don't have a SQL representation (e.g. Scala UDF/UDAF and Java/Scala object expressions used for encoders)
`NonSQLExpression.sql` may return an arbitrary user facing string representation of the expression.
Author: Cheng Lian <lian@databricks.com>
Closes#10757 from liancheng/spark-12799.simplify-expression-string-methods.
Add ```covar_samp``` and ```covar_pop``` for SparkR.
Should we also provide a ```cov``` alias for ```covar_samp```? There is already a ```cov``` implementation in stats.R which masks ```stats::cov```, but adding the alias may introduce a breaking API change.
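A small sketch of how the two aggregates would be used (the data are illustrative):
```r
# Assumed usage sketch: sample and population covariance as Column aggregates
df <- createDataFrame(sqlContext, data.frame(x = 1:10, y = (1:10) * 2 + rnorm(10)))
collect(select(df, covar_samp(df$x, df$y), covar_pop(df$x, df$y)))
```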
cc sun-rui felixcheung shivaram
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10829 from yanboliang/spark-12903.
I've tried to solve some of the issues mentioned in: https://issues.apache.org/jira/browse/SPARK-12629
Please let me know what you think.
Thanks!
Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
Closes#10580 from NarineK/sparkrSavaAsRable.
The current parser turns a decimal literal, for example ```12.1```, into a Double. The problem with this approach is that we convert an exact literal into a non-exact ```Double```. The PR changes this behavior: a decimal literal is now converted into an exact ```BigDecimal```.
The behavior for scientific decimals, for example ```12.1e01```, is unchanged. This will be converted into a Double.
This PR replaces the ```BigDecimal``` literal by a ```Double``` literal, because the ```BigDecimal``` is the default now. You can use the double literal by appending a 'D' to the value, for instance: ```3.141527D```
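A sketch of how the difference would show up from SparkR; the resulting types are my expectation under this change and are not verified output:
```r
# Assumed behavior sketch: exact decimal vs. double literals
printSchema(sql(sqlContext,
                "SELECT 12.1 AS exact_dec, 12.1e01 AS sci_dbl, 3.141527D AS suffixed_dbl"))
# exact_dec    : expected to be a decimal type after this change
# sci_dbl      : double (scientific notation, unchanged)
# suffixed_dbl : double (explicit 'D' suffix)
```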
cc davies rxin
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#10796 from hvanhovell/SPARK-12848.
shivaram sorry it took longer to fix some conflicts, this is the change to add an alias for `table`
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#10406 from felixcheung/readtable.
Currently this is what is reported when loading the SparkR package in R (we would probably also add is.nan):
```
Loading required package: methods
Attaching package: ‘SparkR’
The following objects are masked from ‘package:stats’:
cov, filter, lag, na.omit, predict, sd, var
The following objects are masked from ‘package:base’:
colnames, colnames<-, intersect, rank, rbind, sample, subset,
summary, table, transform
```
Adding this test gives an automated way to track changes to masked methods.
Also, the second part of this test checks for those functions that would not be accessible without a namespace/package prefix.
Incidentally, this might point to how we could fix those inaccessible functions in base or stats.
Looking for feedback for adding this test.
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#10171 from felixcheung/rmaskedtest.
Slight correction: I'm leaving sparkR as-is (i.e., R file not supported) and fixed only run-tests.sh, as shivaram described.
I also assume we are going to cover all doc changes in https://issues.apache.org/jira/browse/SPARK-12846 instead of here.
rxin shivaram zjffdu
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#10792 from felixcheung/sparkRcmd.
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>
Author: Oscar D. Lara Yejas <oscar.lara.yejas@us.ibm.com>
Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
Closes#9613 from olarayej/SPARK-11031.
This PR makes bucketing and exchange share one common hash algorithm, so that we can guarantee the data distribution is the same between shuffle and the bucketed data source, which enables us to shuffle only one side when joining a bucketed table with a normal one.
This PR also fixes the tests that are broken by the new hash behaviour in shuffle.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#10703 from cloud-fan/use-hash-expr-in-shuffle.
Add ```read.text``` and ```write.text``` for SparkR.
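A minimal sketch of the new text I/O API; the paths are illustrative, and the single string column produced by the text source is assumed to be named "value":
```r
# Assumed usage sketch: read and write plain-text data
df <- read.text(sqlContext, "/tmp/input.txt")
printSchema(df)                    # expected: one string column, "value"
write.text(df, "/tmp/text-output")
```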
cc sun-rui felixcheung shivaram
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10348 from yanboliang/spark-12393.
rxin davies shivaram
Took the save mode from my PR #10480, and moved everything to writer methods. This is related to PR #10559.
- [x] it seems jsonRDD() is broken, need to investigate - this is not a public API though; will look into it some more tonight. (fixed)
Author: felixcheung <felixcheung_m@hotmail.com>
Closes#10584 from felixcheung/rremovedeprecated.
* Changes api.r.SQLUtils to use ```SQLContext.getOrCreate``` instead of creating a new context.
* Adds a simple test
[SPARK-11199] #comment link with JIRA
Author: Hossein <hossein@databricks.com>
Closes#9185 from falaki/SPARK-11199.
`ifelse`, `when`, and `otherwise` are unable to take `Column`-typed S4 objects as values.
For example:
```r
ifelse(lit(1) == lit(1), lit(2), lit(3))
ifelse(df$mpg > 0, df$mpg, 0)
```
will both fail with
```r
attempt to replicate an object of type 'environment'
```
The PR replaces `ifelse` calls with `if ... else ...` inside the function implementations to avoid the attempt to vectorize (i.e., `rep()`). It remains to be discussed whether we should instead support vectorization in these functions for consistency, because `ifelse` in base R is vectorized, but I cannot foresee any scenario in which these functions would need to be vectorized in SparkR.
For reference, added test cases which trigger failures:
```r
. Error: when(), otherwise() and ifelse() with column on a DataFrame ----------
error in evaluating the argument 'x' in selecting a method for function 'collect':
error in evaluating the argument 'col' in selecting a method for function 'select':
attempt to replicate an object of type 'environment'
Calls: when -> when -> ifelse -> ifelse
1: withCallingHandlers(eval(code, new_test_environment), error = capture_calls, message = function(c) invokeRestart("muffleMessage"))
2: eval(code, new_test_environment)
3: eval(expr, envir, enclos)
4: expect_equal(collect(select(df, when(df$a > 1 & df$b > 2, lit(1))))[, 1], c(NA, 1)) at test_sparkSQL.R:1126
5: expect_that(object, equals(expected, label = expected.label, ...), info = info, label = label)
6: condition(object)
7: compare(actual, expected, ...)
8: collect(select(df, when(df$a > 1 & df$b > 2, lit(1))))
Error: Test failures
Execution halted
```
Author: Forest Fang <forest.fang@outlook.com>
Closes#10481 from saurfang/spark-12526.
Add ```write.json``` and ```write.parquet``` for SparkR, and deprecated ```saveAsParquetFile```.
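A minimal sketch of the new writer API (the output paths are illustrative):
```r
# Assumed usage sketch: JSON and Parquet writers
df <- createDataFrame(sqlContext, iris)
write.json(df, "/tmp/iris-json")
write.parquet(df, "/tmp/iris-parquet")   # preferred over the deprecated saveAsParquetFile
```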
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10281 from yanboliang/spark-12310.
The existing sample functions are missing the parameter `seed`; however, the corresponding function interface in `generics` has such a parameter. Thus, although the caller can pass a `seed`, we are not using the value.
This could cause SparkR unit tests to fail. For example, I hit it in another PR:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47213/consoleFull
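For illustration, a sketch of what honoring the seed enables; the argument order `sample(x, withReplacement, fraction, seed)` is assumed from the generic:
```r
# Assumed behavior sketch: the same seed should give the same sample
df <- createDataFrame(sqlContext, data.frame(x = 1:100))
s1 <- collect(sample(df, FALSE, 0.1, seed = 42))
s2 <- collect(sample(df, FALSE, 0.1, seed = 42))
identical(s1, s2)   # expected TRUE once the seed is actually passed through
```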
Author: gatorsmile <gatorsmile@gmail.com>
Closes#10160 from gatorsmile/sampleR.
* ```jsonFile``` should support multiple input files, such as:
```R
jsonFile(sqlContext, c("path1", "path2")) # character vector as arguments
jsonFile(sqlContext, "path1,path2")
```
* Meanwhile, ```jsonFile``` has been deprecated by Spark SQL and will be removed in Spark 2.0, so we mark ```jsonFile``` as deprecated and use ```read.json``` on the SparkR side.
* Replace all ```jsonFile``` calls with ```read.json``` in test_sparkSQL.R, but still keep a jsonFile test case.
* If this PR is accepted, we should also make almost the same change for ```parquetFile```.
cc felixcheung sun-rui shivaram
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10145 from yanboliang/spark-12146.
Fix a ```subset``` function error that occurs when only the ```select``` argument is set. Please refer to the [JIRA](https://issues.apache.org/jira/browse/SPARK-12234) for the error and how to reproduce it.
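A sketch of the kind of call that previously failed and should now work (the columns are illustrative):
```r
# Assumed usage sketch: subset with only the select argument
df <- createDataFrame(sqlContext, iris)
head(subset(df, select = c("Sepal_Length", "Species")))
```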
cc sun-rui felixcheung shivaram
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10217 from yanboliang/spark-12234.