[SPARK-11437] [PYSPARK] Don't .take when converting RDD to DataFrame with provided schema
When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls `.take(10)` to verify that the first 10 rows of the RDD match the provided schema. This is similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue affected cases where a schema was not provided.

Verifying the first 10 rows is of limited utility and causes the DAG to be executed non-lazily. If necessary, I believe this verification should be done lazily on all rows. However, since the caller is providing a schema to follow, I think it's acceptable to simply fail if the schema is incorrect.

marmbrus We chatted about this at Spark Summit EU. davies You made a similar change for the infer-schema path in https://github.com/apache/spark/pull/6606.

Author: Jason White <jason.white@shopify.com>

Closes #9392 from JasonMWhite/createDataFrame_without_take.
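The laziness problem the commit message describes can be sketched in plain Python, without Spark. Here a hypothetical RDD is modeled as a generator whose consumption stands in for executing the DAG; `verify` is a stand-in for PySpark's internal row/schema check, not the real `_verify_type`. The names `lazy_rows`, `verify`, and `executed` are illustrative assumptions.

```python
import itertools

# Side-effect log: each entry means one row of "work" was actually computed.
executed = []

def lazy_rows():
    # Hypothetical lazy data source standing in for an RDD partition.
    for i in range(100):
        executed.append(i)  # computing a row = running part of the DAG
        yield {"id": i}

def verify(row, schema):
    # Stand-in for a per-row schema check (not the real _verify_type).
    assert set(row) == set(schema), "row does not match schema"

# Old behavior (eager): materialize the first 10 rows up front to check them.
for row in itertools.islice(lazy_rows(), 10):
    verify(row, ["id"])
eager_cost = len(executed)   # work happened before any real action was requested

# Lazy alternative: fold verification into the pipeline itself, so rows are
# only checked when something downstream actually consumes them.
executed.clear()
checked = (row for row in lazy_rows() if verify(row, ["id"]) or True)
lazy_cost = len(executed)    # the generator is untouched, so no work yet

print(eager_cost, lazy_cost)  # prints: 10 0
```

The commit takes the third option: with a caller-provided schema, skip the up-front check entirely and let a mismatched row fail at conversion time.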
This commit is contained in:
parent 71d1c907de
commit f92f334ca4
```diff
@@ -318,13 +318,7 @@ class SQLContext(object):
                     struct.names[i] = name
             schema = struct
 
-        elif isinstance(schema, StructType):
-            # take the first few rows to verify schema
-            rows = rdd.take(10)
-            for row in rows:
-                _verify_type(row, schema)
-
-        else:
+        elif not isinstance(schema, StructType):
             raise TypeError("schema should be StructType or list or None, but got: %s" % schema)
 
         # convert python objects to sql data
```