spark-instrumented-optimizer/python/pyspark
hyukjinkwon 68ea290b3a
[SPARK-13748][PYSPARK][DOC] Add the description for explicitly setting None for a named argument for a Row
## What changes were proposed in this pull request?

When creating a DataFrame from dicts, a key may simply be omitted to indicate that the value is `None` or missing, as below:

``` python
spark.createDataFrame([{"x": 1}, {"y": 2}]).show()
```

```
+----+----+
|   x|   y|
+----+----+
|   1|null|
|null|   2|
+----+----+
```

However, the same is not allowed for `Row`, as below:

``` python
from pyspark.sql import Row
spark.createDataFrame([Row(x=1), Row(y=2)]).show()
```

``` scala
16/06/19 16:25:56 ERROR Executor: Exception in task 6.0 in stage 66.0 (TID 316)
java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 2 fields are required while 1 values are provided.
    at org.apache.spark.sql.execution.python.EvaluatePython$.fromJava(EvaluatePython.scala:147)
    at org.apache.spark.sql.SparkSession$$anonfun$7.apply(SparkSession.scala:656)
    at org.apache.spark.sql.SparkSession$$anonfun$7.apply(SparkSession.scala:656)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
```

The behaviour seems right, but it can confuse users, which is how this JIRA came to be reported.

This PR adds an explanation of this behaviour to the `Row` class documentation.
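
For reference, a minimal sketch of the behaviour being documented (assuming the `spark` session provided by the PySpark shell): explicitly passing `None` for the missing field makes the `Row`-based example work as well:

``` python
from pyspark.sql import Row

# Explicitly set None for a field that has no value, so that every Row
# carries the same fields and matches the inferred schema.
spark.createDataFrame([Row(x=1, y=None), Row(x=None, y=2)]).show()
```

This should produce the same two-row output as the dict-based example above.
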
## How was this patch tested?

N/A

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13771 from HyukjinKwon/SPARK-13748.
2017-01-07 12:52:41 +00:00
| Name | Latest commit | Date |
|---|---|---|
| `ml` | [MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo | 2017-01-04 15:07:29 +00:00 |
| `mllib` | [SPARK-17645][MLLIB][ML] add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE) | 2016-12-28 00:49:36 -08:00 |
| `sql` | [SPARK-13748][PYSPARK][DOC] Add the description for explictly setting None for a named argument for a Row | 2017-01-07 12:52:41 +00:00 |
| `streaming` | [SPARK-18447][DOCS] Fix the markdown for Note:/NOTE:/Note that across Python API documentation | 2016-11-22 11:40:18 +00:00 |
| `__init__.py` | [SPARK-18576][PYTHON] Add basic TaskContext information to PySpark | 2016-12-20 15:51:21 -08:00 |
| `accumulators.py` | [SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod() | 2015-06-26 08:12:22 -07:00 |
| `broadcast.py` | [SPARK-17472] [PYSPARK] Better error message for serialization failures of large objects in Python | 2016-09-14 13:37:35 -07:00 |
| `cloudpickle.py` | [SPARK-17472] [PYSPARK] Better error message for serialization failures of large objects in Python | 2016-09-14 13:37:35 -07:00 |
| `conf.py` | [SPARK-18447][DOCS] Fix the markdown for Note:/NOTE:/Note that across Python API documentation | 2016-11-22 11:40:18 +00:00 |
| `context.py` | [SPARK-18523][PYSPARK] Make SparkContext.stop more reliable | 2016-11-28 18:28:24 -08:00 |
| `daemon.py` | [SPARK-4897] [PySpark] Python 3 support | 2015-04-16 16:20:57 -07:00 |
| `files.py` | [SPARK-3309] [PySpark] Put all public API in `__all__` | 2014-09-03 11:49:45 -07:00 |
| `find_spark_home.py` | [SPARK-1267][SPARK-18129] Allow PySpark to be pip installed | 2016-11-16 14:22:15 -08:00 |
| `heapq3.py` | [SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod() | 2015-06-26 08:12:22 -07:00 |
| `java_gateway.py` | [SPARK-1267][SPARK-18129] Allow PySpark to be pip installed | 2016-11-16 14:22:15 -08:00 |
| `join.py` | [SPARK-14202] [PYTHON] Use generator expression instead of list comp in python_full_outer_jo… | 2016-03-28 14:51:36 -07:00 |
| `profiler.py` | [SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod() | 2015-06-26 08:12:22 -07:00 |
| `rdd.py` | [SPARK-18281] [SQL] [PYSPARK] Remove timeout for reading data through socket for local iterator | 2016-12-20 13:12:16 -08:00 |
| `rddsampler.py` | [SPARK-4897] [PySpark] Python 3 support | 2015-04-16 16:20:57 -07:00 |
| `resultiterable.py` | [SPARK-3074] [PySpark] support groupByKey() with single huge key | 2015-04-09 17:07:23 -07:00 |
| `serializers.py` | [SPARK-16589] [PYTHON] Chained cartesian produces incorrect number of records | 2016-12-08 11:08:12 -08:00 |
| `shell.py` | [SPARK-16536][SQL][PYSPARK][MINOR] Expose sql in PySpark Shell | 2016-07-13 22:24:26 -07:00 |
| `shuffle.py` | [SPARK-10710] Remove ability to disable spilling in core and SQL | 2015-09-19 21:40:21 -07:00 |
| `statcounter.py` | [SPARK-6919] [PYSPARK] Add asDict method to StatCounter | 2015-09-29 13:38:15 -07:00 |
| `status.py` | [SPARK-4172] [PySpark] Progress API in Python | 2015-02-17 13:36:43 -08:00 |
| `storagelevel.py` | [SPARK-13992][CORE][PYSPARK][FOLLOWUP] Update OFF_HEAP semantics for Java api and Python api | 2016-04-12 23:06:55 -07:00 |
| `taskcontext.py` | [SPARK-18576][PYTHON] Add basic TaskContext information to PySpark | 2016-12-20 15:51:21 -08:00 |
| `tests.py` | [SPARK-18576][PYTHON] Add basic TaskContext information to PySpark | 2016-12-20 15:51:21 -08:00 |
| `traceback_utils.py` | [SPARK-1087] Move python traceback utilities into new traceback_utils.py file. | 2014-09-15 19:28:17 -07:00 |
| `version.py` | [SPARK-1267][SPARK-18129] Allow PySpark to be pip installed | 2016-11-16 14:22:15 -08:00 |
| `worker.py` | [SPARK-18576][PYTHON] Add basic TaskContext information to PySpark | 2016-12-20 15:51:21 -08:00 |