spark-instrumented-optimizer

Maxim Gekk 1d9338bb10 [SPARK-23786][SQL] Checking column names of csv headers	2018-06-03 22:02:21 -07:00

## What changes were proposed in this pull request?

Currently, column names in the headers of CSV files are not checked against the provided schema of the CSV data. This can cause errors like those shown in [SPARK-23786](https://issues.apache.org/jira/browse/SPARK-23786) and https://github.com/apache/spark/pull/20894#issuecomment-375957777.

I introduced a new CSV option, `enforceSchema`. If it is enabled (the default is `true`), Spark forcibly applies the provided or inferred schema to CSV files. In that case, CSV headers are ignored and not checked against the schema. If `enforceSchema` is set to `false`, additional checks can be performed. For example, if the columns in the CSV header and in the schema are ordered differently, the following exception is thrown:

```
java.lang.IllegalArgumentException: CSV file header does not contain the expected fields
 Header: depth, temperature
 Schema: temperature, depth
CSV file: marina.csv
```

## How was this patch tested?

The changes were tested by the existing tests of CSVSuite and by 2 new tests.

Author: Maxim Gekk <maxim.gekk@databricks.com>
Author: Maxim Gekk <max.gekk@gmail.com>

Closes #20894 from MaxGekk/check-column-names.
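The header check described in the commit message can be illustrated with a minimal pure-Python sketch. It is not Spark's implementation: the helper `check_csv_header`, its parameters, and the use of `ValueError` in place of `java.lang.IllegalArgumentException` are assumptions made for illustration only. It shows how a header and a schema with the same fields in a different order pass silently when the schema is enforced and fail when it is not.

```python
import csv
import io


def check_csv_header(csv_text, schema_fields, enforce_schema=True):
    """Hypothetical helper mimicking the behaviour described for `enforceSchema`.

    Returns the column names that would be used for the data.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = [name.strip() for name in next(reader)]

    if enforce_schema:
        # The schema is applied by position; header names are ignored.
        return list(schema_fields)

    # With enforcement disabled, the header must match the schema exactly,
    # including the order of the columns.
    if header != list(schema_fields):
        raise ValueError(
            "CSV file header does not contain the expected fields\n"
            f" Header: {', '.join(header)}\n"
            f" Schema: {', '.join(schema_fields)}"
        )
    return header


# Same sample as in the commit message: header order differs from the schema.
sample = "depth,temperature\n10,4.5\n"

# Default behaviour: no check, the schema names win.
columns = check_csv_header(sample, ["temperature", "depth"])

# Strict behaviour: the mismatch is reported.
try:
    check_csv_header(sample, ["temperature", "depth"], enforce_schema=False)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

In PySpark itself, the option is passed to the CSV reader, for example `spark.read.csv(path, schema=schema, header=True, enforceSchema=False)`.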
__init__.py	[SPARK-22369][PYTHON][DOCS] Exposes catalog API documentation in PySpark	2017-11-02 15:22:52 +01:00
catalog.py	[SPARK-23522][PYTHON] always use sys.exit over builtin exit	2018-03-08 20:38:34 +09:00
column.py	[SPARK-23847][PYTHON][SQL] Add asc_nulls_first, asc_nulls_last to PySpark	2018-04-08 12:09:06 +08:00
conf.py	[SPARK-23700][PYTHON] Cleanup imports in pyspark.sql	2018-03-26 12:42:32 +09:00
context.py	[SPARK-23706][PYTHON] spark.conf.get(value, default=None) should produce None in PySpark	2018-03-18 20:24:14 +09:00
dataframe.py	[SPARK-24392][PYTHON] Label pandas_udf as Experimental	2018-05-28 12:56:05 +08:00
functions.py	[SPARK-23920][SQL] add array_remove to remove all elements that equal element from array	2018-05-31 22:04:26 -07:00
group.py	[SPARK-24392][PYTHON] Label pandas_udf as Experimental	2018-05-28 12:56:05 +08:00
readwriter.py	[SPARK-23786][SQL] Checking column names of csv headers	2018-06-03 22:02:21 -07:00
session.py	[SPARK-24392][PYTHON] Label pandas_udf as Experimental	2018-05-28 12:56:05 +08:00
streaming.py	[SPARK-23786][SQL] Checking column names of csv headers	2018-06-03 22:02:21 -07:00
tests.py	[SPARK-23786][SQL] Checking column names of csv headers	2018-06-03 22:02:21 -07:00
types.py	[SPARK-24057][PYTHON] put the real data type in the AssertionError message	2018-04-26 14:21:22 -07:00
udf.py	[SPARK-23754][PYTHON] Re-raising StopIteration in client code	2018-05-30 18:11:33 +08:00
utils.py	[SPARK-23699][PYTHON][SQL] Raise same type of error caught with Arrow enabled	2018-03-27 20:06:12 -07:00
window.py	[SPARK-23861][SQL][DOC] Clarify default window frame with and without orderBy clause	2018-04-07 00:15:54 +08:00