*This contribution is my original work and that I license the work to the project under the project's open source license.*
## What changes were proposed in this pull request?
Documentation updates to PySpark's GroupedData
## How was this patch tested?
Manual Tests
Author: Josh Howes <josh.howes@gmail.com>
Author: Josh Howes <josh.howes@maxpoint.com>
Closes#13724 from josh-howes/bugfix/SPARK-15973.
## What changes were proposed in this pull request?
Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code.
## How was this patch tested?
Existing test.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#13242 from WeichenXu123/python_doctest_update_sparksession.
This PR makes bucketing and exchange share one common hash algorithm, so that we can guarantee the data distribution is same between shuffle and bucketed data source, which enables us to only shuffle one side when join a bucketed table and a normal one.
This PR also fixes the tests that are broken by the new hash behaviour in shuffle.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#10703 from cloud-fan/use-hash-expr-in-shuffle.
In SPARK-11946 the API for pivot was changed a bit and got updated doc, the doc changes were not made for the python api though. This PR updates the python doc to be consistent.
Author: Andrew Ray <ray.andrew@gmail.com>
Closes#10176 from aray/sql-pivot-python-doc.
Currently pivot's signature looks like
```scala
scala.annotation.varargs
def pivot(pivotColumn: Column, values: Column*): GroupedData
scala.annotation.varargs
def pivot(pivotColumn: String, values: Any*): GroupedData
```
I think we can remove the one that takes "Column" types, since callers should always be passing in literals. It'd also be more clear if the values are not varargs, but rather Seq or java.util.List.
I also made similar changes for Python.
Author: Reynold Xin <rxin@databricks.com>
Closes#9929 from rxin/SPARK-11946.
This PR adds pivot to the python api of GroupedData with the same syntax as Scala/Java.
Author: Andrew Ray <ray.andrew@gmail.com>
Closes#9653 from aray/sql-pivot-python.
We added a bunch of higher order statistics such as skewness and kurtosis to GroupedData. I don't think they are common enough to justify being listed, since users can always use the normal statistics aggregate functions.
That is to say, after this change, we won't support
```scala
df.groupBy("key").kurtosis("colA", "colB")
```
However, we will still support
```scala
df.groupBy("key").agg(kurtosis(col("colA")), kurtosis(col("colB")))
```
Author: Reynold Xin <rxin@databricks.com>
Closes#9446 from rxin/SPARK-11489.
Add Python API for stddev/stddev_pop/stddev_samp/variance/var_pop/var_samp/skewness/kurtosis
Author: Davies Liu <davies@databricks.com>
Closes#9424 from davies/py_var.
Our current BinaryExpression abstract class is not for generic binary expressions, i.e. it requires left/right children to have the same type. However, due to its name, contributors build new binary expressions that don't have that assumption (e.g. Sha) and still extend BinaryExpression.
This patch creates a new BinaryOperator abstract class, and update the analyzer o only apply type casting rule there. This patch also adds the notion of "prettyName" to expressions, which defines the user-facing name for the expression.
Author: Reynold Xin <rxin@databricks.com>
Closes#7174 from rxin/binary-opterator and squashes the following commits:
f31900d [Reynold Xin] [SPARK-8770][SQL] Create BinaryOperator abstract class.
fceb216 [Reynold Xin] Merge branch 'master' of github.com:apache/spark into binary-opterator
d8518cf [Reynold Xin] Updated Python tests.
Add version info for public Python SQL API.
cc rxin
Author: Davies Liu <davies@databricks.com>
Closes#6295 from davies/versions and squashes the following commits:
cfd91e6 [Davies Liu] add more version for DataFrame API
600834d [Davies Liu] add version to SQL API docs
dataframe.py is splited into column.py, group.py and dataframe.py:
```
360 column.py
1223 dataframe.py
183 group.py
```
Author: Davies Liu <davies@databricks.com>
Closes#6201 from davies/split_df and squashes the following commits:
fc8f5ab [Davies Liu] split dataframe.py into multiple files