From d7dd59a6b4b734284b157af1191af337ceb1d256 Mon Sep 17 00:00:00 2001
From: Hyukjin Kwon
Date: Wed, 3 Apr 2019 08:30:24 +0900
Subject: [PATCH] [SPARK-26224][SQL][PYTHON][R][FOLLOW-UP] Add notes about many projects in withColumn at SparkR and PySpark as well

## What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/23285. This PR adds the notes into PySpark and SparkR documentation as well.

While I am here, I revised the doc a bit to make it sound a bit more neutral.

## How was this patch tested?

Manually built the doc and verified.

Closes #24272 from HyukjinKwon/SPARK-26224.

Authored-by: Hyukjin Kwon
Signed-off-by: Hyukjin Kwon
---
 R/pkg/R/DataFrame.R                                      | 5 +++++
 python/pyspark/sql/dataframe.py                          | 5 +++++
 .../src/main/scala/org/apache/spark/sql/Dataset.scala    | 8 ++++----
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index 014ba285ba..774a2b2202 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -2143,6 +2143,11 @@ setMethod("selectExpr",
 #' Return a new SparkDataFrame by adding a column or replacing the existing column
 #' that has the same name.
 #'
+#' Note: This method introduces a projection internally. Therefore, calling it multiple times,
+#' for instance, via loops in order to add multiple columns can generate big plans which
+#' can cause performance issues and even \code{StackOverflowException}. To avoid this,
+#' use \code{select} with the multiple columns at once.
+#'
 #' @param x a SparkDataFrame.
 #' @param colName a column name.
 #' @param col a Column expression (which must refer only to this SparkDataFrame), or an atomic

diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 58d74f5d7d..659cbc4487 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -1974,6 +1974,11 @@ class DataFrame(object):
         :param colName: string, name of the new column.
         :param col: a :class:`Column` expression for the new column.
 
+        .. note:: This method introduces a projection internally. Therefore, calling it multiple
+            times, for instance, via loops in order to add multiple columns can generate big
+            plans which can cause performance issues and even `StackOverflowException`.
+            To avoid this, use :func:`select` with the multiple columns at once.
+
         >>> df.withColumn('age2', df.age + 2).collect()
         [Row(age=2, name=u'Alice', age2=4), Row(age=5, name=u'Bob', age2=7)]

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index 477e873250..05015f8dcd 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -2151,10 +2151,10 @@ class Dataset[T] private[sql](
    * `column`'s expression must only refer to attributes supplied by this Dataset. It is an
    * error to add a column that refers to some other Dataset.
    *
-   * Please notice that this method introduces a `Project`. This means that using it in loops in
-   * order to add several columns can generate very big plans which can cause huge performance
-   * issues and even `StackOverflowException`s. A much better alternative use `select` with the
-   * list of columns to add.
+   * @note this method introduces a projection internally. Therefore, calling it multiple times,
+   * for instance, via loops in order to add multiple columns can generate big plans which
+   * can cause performance issues and even `StackOverflowException`. To avoid this,
+   * use `select` with the multiple columns at once.
    *
    * @group untypedrel
    * @since 2.0.0
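
For context, here is a minimal PySpark sketch of the pattern the added notes describe; the SparkSession setup, the sample data, and the generated column names are assumptions made for illustration and are not part of this patch.

```python
# Sketch only (not part of the patch): illustrates the withColumn note above.
# The session setup, sample data, and column names are assumed for the example.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("withColumn-note-example").getOrCreate()
df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], ["age", "name"])

# Discouraged: each withColumn call adds another projection, so the query plan
# grows with every loop iteration.
df_loop = df
for i in range(3):
    df_loop = df_loop.withColumn("age_plus_{}".format(i), F.col("age") + i)

# Preferred: a single select adds all the new columns in one projection.
df_single = df.select(
    "*",
    *[(F.col("age") + i).alias("age_plus_{}".format(i)) for i in range(3)]
)
df_single.show()
```

Both approaches produce the same columns, but the single `select` keeps the plan to one extra projection, which is what the updated documentation recommends.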