hyukjinkwon 5cd8ea99f0 [SPARK-21779][PYTHON] Simpler DataFrame.sample API in Python
## What changes were proposed in this pull request?

This PR makes it possible to omit `withReplacement` in `DataFrame.sample(...)`, defaulting it to `False`, consistent with the equivalent Scala / Java API.

In short, the following calls are now allowed (note that without a fixed seed, the sampled count varies between runs):

```python
>>> df = spark.range(10)
>>> df.sample(0.5).count()
7
>>> df.sample(fraction=0.5).count()
3
>>> df.sample(0.5, seed=42).count()
5
>>> df.sample(fraction=0.5, seed=42).count()
5
```
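Because the old `sample(withReplacement, fraction, seed)` call form must keep working, the method has to detect which form the caller used. Below is a minimal sketch of that dispatch, assuming all three parameters default to `None`; it is illustrative only, and the merged logic lives in `python/pyspark/sql/dataframe.py`:

```python
def sample(self, withReplacement=None, fraction=None, seed=None):
    # New form, e.g. sample(0.5) or sample(0.5, 42): the first positional
    # argument is really the fraction. bool is not a subclass of float,
    # so the old form sample(True, 0.5) does not match this branch.
    if isinstance(withReplacement, float):
        if fraction is not None:
            seed = fraction  # sample(0.5, 42) passed the seed second
        fraction = withReplacement
        withReplacement = None
    # ... validate the arguments and delegate to the JVM as before ...
```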

In addition, this PR adds type-checking logic, as shown below:

```python
>>> df = spark.range(10)
>>> df.sample().count()
...
TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [].
>>> df.sample(True).count()
...
TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'bool'>].
>>> df.sample(42).count()
...
TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'int'>].
>>> df.sample(fraction=False, seed="a").count()
...
TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'bool'>, <type 'str'>].
>>> df.sample(seed=[1]).count()
...
TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'list'>].
>>> df.sample(withReplacement="a", fraction=0.5, seed=1)
...
TypeError: withReplacement (optional), fraction (required) and seed (optional) should be a bool, float and number; however, got [<type 'str'>, <type 'float'>, <type 'int'>].
```
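The check itself can be done by testing each (possibly shifted) argument against its expected type and echoing back the types actually received. Here is a sketch that reproduces the messages above, under the same assumptions as the previous snippet; the helper name is hypothetical, and on Python 3 the types print as `<class '...'>` rather than `<type '...'>`:

```python
def _validate_sample_args(withReplacement, fraction, seed):
    # fraction must be a float; seed must be an int, but bool is a
    # subclass of int in Python, so it is excluded explicitly.
    ok = ((withReplacement is None or isinstance(withReplacement, bool))
          and isinstance(fraction, float)
          and (seed is None
               or (isinstance(seed, int) and not isinstance(seed, bool))))
    if not ok:
        # Echo back the types of only the arguments that were given.
        argtypes = [str(type(arg)) for arg in (withReplacement, fraction, seed)
                    if arg is not None]
        raise TypeError(
            "withReplacement (optional), fraction (required) and seed"
            " (optional) should be a bool, float and number; however, "
            "got [%s]." % ", ".join(argtypes))
```

For example, `df.sample(fraction=False, seed="a")` fails the `isinstance(fraction, float)` test and reports the two supplied types, matching the fourth message above.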

## How was this patch tested?

Manually tested; unit tests were added as doctests, and the built Python documentation was checked manually.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18999 from HyukjinKwon/SPARK-21779.
2017-09-01 13:01:23 +09:00
| File (python/pyspark/sql) | Last commit | Date |
|---|---|---|
| __init__.py | [SPARK-16772][PYTHON][DOCS] Fix API doc references to UDFRegistration + Update "important classes" | 2016-08-06 05:02:59 +01:00 |
| catalog.py | [SPARK-18777][PYTHON][SQL] Return UDF from udf.register | 2017-05-06 22:28:42 -07:00 |
| column.py | [SPARK-19165][PYTHON][SQL] PySpark APIs using columns as arguments should validate input types for column | 2017-08-24 20:29:03 +09:00 |
| conf.py | [SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code | 2016-05-23 18:14:48 -07:00 |
| context.py | [SPARK-20586][SQL] Add deterministic to ScalaUDF | 2017-07-25 17:19:44 -07:00 |
| dataframe.py | [SPARK-21779][PYTHON] Simpler DataFrame.sample API in Python | 2017-09-01 13:01:23 +09:00 |
| functions.py | [SPARK][DOCS] Added note on meaning of position to substring function | 2017-08-07 17:16:03 +01:00 |
| group.py | [MINOR][PYSPARK][DOC] Fix wrongly formatted examples in PySpark documentation | 2016-07-06 10:45:51 -07:00 |
| readwriter.py | [SPARK-21839][SQL] Support SQL config for ORC compression | 2017-08-31 08:16:58 +09:00 |
| session.py | [SPARK-19507][SPARK-21296][PYTHON] Avoid per-record type dispatch in schema verification and improve exception message | 2017-07-04 20:45:58 +08:00 |
| streaming.py | [SPARK-21756][SQL] Add JSON option to allow unquoted control characters | 2017-08-25 10:18:03 -07:00 |
| tests.py | [SPARK-21779][PYTHON] Simpler DataFrame.sample API in Python | 2017-09-01 13:01:23 +09:00 |
| types.py | [SPARK-20090][PYTHON] Add StructType.fieldNames in PySpark | 2017-07-28 20:59:32 -07:00 |
| utils.py | [MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo | 2017-01-04 15:07:29 +00:00 |
| window.py | [SPARK-18690][PYTHON][SQL] Backward compatibility of unbounded frames | 2016-12-02 17:39:28 -08:00 |