spark-instrumented-optimizer/python/pyspark
hyukjinkwon 68ea290b3a
[SPARK-13748][PYSPARK][DOC] Add the description for explicitly setting None for a named argument for a Row
## What changes were proposed in this pull request?

When creating a DataFrame from dicts, a key may simply be omitted to indicate that the value is `None` or missing, as below:

``` python
spark.createDataFrame([{"x": 1}, {"y": 2}]).show()
```

```
+----+----+
|   x|   y|
+----+----+
|   1|null|
|null|   2|
+----+----+
```

However, the same is not allowed for `Row`, as below:

``` python
from pyspark.sql import Row
spark.createDataFrame([Row(x=1), Row(y=2)]).show()
```

``` scala
16/06/19 16:25:56 ERROR Executor: Exception in task 6.0 in stage 66.0 (TID 316)
java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 2 fields are required while 1 values are provided.
    at org.apache.spark.sql.execution.python.EvaluatePython$.fromJava(EvaluatePython.scala:147)
    at org.apache.spark.sql.SparkSession$$anonfun$7.apply(SparkSession.scala:656)
    at org.apache.spark.sql.SparkSession$$anonfun$7.apply(SparkSession.scala:656)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
```

The behaviour seems right, but it can confuse users, which is how this JIRA came to be reported.

This PR adds an explanation of this behaviour to the `Row` class documentation.
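
For reference, a minimal sketch of the behaviour being documented (assuming the `spark` session provided by the PySpark shell): explicitly passing `None` for the missing field makes the `Row`-based example work as well:

``` python
from pyspark.sql import Row

# Explicitly set None for a field that has no value, so that every Row
# carries the same fields and matches the inferred schema.
spark.createDataFrame([Row(x=1, y=None), Row(x=None, y=2)]).show()
```

This should produce the same two-row output as the dict-based example above.
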
## How was this patch tested?

N/A

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13771 from HyukjinKwon/SPARK-13748.
2017-01-07 12:52:41 +00:00
| Name | Latest commit | Date |
|---|---|---|
| `ml` | [MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo | 2017-01-04 15:07:29 +00:00 |
| `mllib` | [SPARK-17645][MLLIB][ML] add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE) | 2016-12-28 00:49:36 -08:00 |
| `sql` | [SPARK-13748][PYSPARK][DOC] Add the description for explictly setting None for a named argument for a Row | 2017-01-07 12:52:41 +00:00 |
| `streaming` | [SPARK-18447][DOCS] Fix the markdown for Note:/NOTE:/Note that across Python API documentation | 2016-11-22 11:40:18 +00:00 |
| `__init__.py` | [SPARK-18576][PYTHON] Add basic TaskContext information to PySpark | 2016-12-20 15:51:21 -08:00 |
| `accumulators.py` | [SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod() | 2015-06-26 08:12:22 -07:00 |
| `broadcast.py` | [SPARK-17472] [PYSPARK] Better error message for serialization failures of large objects in Python | 2016-09-14 13:37:35 -07:00 |
| `cloudpickle.py` | [SPARK-17472] [PYSPARK] Better error message for serialization failures of large objects in Python | 2016-09-14 13:37:35 -07:00 |
| `conf.py` | [SPARK-18447][DOCS] Fix the markdown for Note:/NOTE:/Note that across Python API documentation | 2016-11-22 11:40:18 +00:00 |
| `context.py` | [SPARK-18523][PYSPARK] Make SparkContext.stop more reliable | 2016-11-28 18:28:24 -08:00 |
| `daemon.py` | [SPARK-4897] [PySpark] Python 3 support | 2015-04-16 16:20:57 -07:00 |
| `files.py` | [SPARK-3309] [PySpark] Put all public API in `__all__` | 2014-09-03 11:49:45 -07:00 |
| `find_spark_home.py` | [SPARK-1267][SPARK-18129] Allow PySpark to be pip installed | 2016-11-16 14:22:15 -08:00 |
| `heapq3.py` | [SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod() | 2015-06-26 08:12:22 -07:00 |
| `java_gateway.py` | [SPARK-1267][SPARK-18129] Allow PySpark to be pip installed | 2016-11-16 14:22:15 -08:00 |
| `join.py` | [SPARK-14202] [PYTHON] Use generator expression instead of list comp in python_full_outer_jo… | 2016-03-28 14:51:36 -07:00 |
| `profiler.py` | [SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod() | 2015-06-26 08:12:22 -07:00 |
| `rdd.py` | [SPARK-18281] [SQL] [PYSPARK] Remove timeout for reading data through socket for local iterator | 2016-12-20 13:12:16 -08:00 |
| `rddsampler.py` | [SPARK-4897] [PySpark] Python 3 support | 2015-04-16 16:20:57 -07:00 |
| `resultiterable.py` | [SPARK-3074] [PySpark] support groupByKey() with single huge key | 2015-04-09 17:07:23 -07:00 |
| `serializers.py` | [SPARK-16589] [PYTHON] Chained cartesian produces incorrect number of records | 2016-12-08 11:08:12 -08:00 |
| `shell.py` | [SPARK-16536][SQL][PYSPARK][MINOR] Expose sql in PySpark Shell | 2016-07-13 22:24:26 -07:00 |
| `shuffle.py` | [SPARK-10710] Remove ability to disable spilling in core and SQL | 2015-09-19 21:40:21 -07:00 |
| `statcounter.py` | [SPARK-6919] [PYSPARK] Add asDict method to StatCounter | 2015-09-29 13:38:15 -07:00 |
| `status.py` | [SPARK-4172] [PySpark] Progress API in Python | 2015-02-17 13:36:43 -08:00 |
| `storagelevel.py` | [SPARK-13992][CORE][PYSPARK][FOLLOWUP] Update OFF_HEAP semantics for Java api and Python api | 2016-04-12 23:06:55 -07:00 |
| `taskcontext.py` | [SPARK-18576][PYTHON] Add basic TaskContext information to PySpark | 2016-12-20 15:51:21 -08:00 |
| `tests.py` | [SPARK-18576][PYTHON] Add basic TaskContext information to PySpark | 2016-12-20 15:51:21 -08:00 |
| `traceback_utils.py` | [SPARK-1087] Move python traceback utilities into new traceback_utils.py file. | 2014-09-15 19:28:17 -07:00 |
| `version.py` | [SPARK-1267][SPARK-18129] Allow PySpark to be pip installed | 2016-11-16 14:22:15 -08:00 |
| `worker.py` | [SPARK-18576][PYTHON] Add basic TaskContext information to PySpark | 2016-12-20 15:51:21 -08:00 |