spark-instrumented-optimizer/python/pyspark/sql
Maxim Gekk 1d9338bb10 [SPARK-23786][SQL] Checking column names of csv headers
## What changes were proposed in this pull request?

Currently, column names in CSV file headers are not checked against the provided schema of the CSV data. This can cause errors like those shown in [SPARK-23786](https://issues.apache.org/jira/browse/SPARK-23786) and https://github.com/apache/spark/pull/20894#issuecomment-375957777. I introduced a new CSV option, `enforceSchema`. If it is enabled (the default is `true`), Spark forcibly applies the provided or inferred schema to CSV files. In that case, CSV headers are ignored and not checked against the schema. If `enforceSchema` is set to `false`, additional checks can be performed. For example, if the columns in the CSV header and in the schema are ordered differently, the following exception is thrown:

```
java.lang.IllegalArgumentException: CSV file header does not contain the expected fields
 Header: depth, temperature
 Schema: temperature, depth
CSV file: marina.csv
```
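
The check described above can be illustrated with a minimal, pure-Python sketch. It is not Spark's implementation: `check_header` is a hypothetical helper, the exact position-by-position name comparison is an assumption, and a Python `ValueError` stands in for the `IllegalArgumentException` shown above.

```python
import csv
import io


def check_header(header, schema_names, file_name, enforce_schema=True):
    """Hypothetical helper: compare CSV header names with schema field names.

    Sketches the behaviour described in this PR; it is not Spark's code.
    """
    if enforce_schema:
        # The schema is applied forcibly, so the header is ignored.
        return
    # Assumption: names must match exactly and in the same order.
    if list(header) != list(schema_names):
        raise ValueError(
            "CSV file header does not contain the expected fields\n"
            " Header: {}\n Schema: {}\nCSV file: {}".format(
                ", ".join(header), ", ".join(schema_names), file_name))


# Illustrative data: the header order differs from the schema order.
data = "depth,temperature\n10,4.5\n"
header = next(csv.reader(io.StringIO(data)))
schema_names = ["temperature", "depth"]

# With enforce_schema=True the mismatch goes unnoticed.
check_header(header, schema_names, "marina.csv", enforce_schema=True)

# With enforce_schema=False the mismatch is reported.
try:
    check_header(header, schema_names, "marina.csv", enforce_schema=False)
except ValueError as e:
    message = str(e)
    print(message)
```

In PySpark, the option would presumably be passed to the reader as a keyword argument, for example `spark.read.csv(path, schema=schema, header=True, enforceSchema=False)`, where `path` and `schema` are placeholders.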

## How was this patch tested?

The changes were tested by existing tests of CSVSuite and by 2 new tests.

Author: Maxim Gekk <maxim.gekk@databricks.com>
Author: Maxim Gekk <max.gekk@gmail.com>

Closes #20894 from MaxGekk/check-column-names.
2018-06-03 22:02:21 -07:00

| File | Last commit | Date |
| --- | --- | --- |
| `..` | | |
| `__init__.py` | [SPARK-22369][PYTHON][DOCS] Exposes catalog API documentation in PySpark | 2017-11-02 15:22:52 +01:00 |
| `catalog.py` | [SPARK-23522][PYTHON] always use sys.exit over builtin exit | 2018-03-08 20:38:34 +09:00 |
| `column.py` | [SPARK-23847][PYTHON][SQL] Add asc_nulls_first, asc_nulls_last to PySpark | 2018-04-08 12:09:06 +08:00 |
| `conf.py` | [SPARK-23700][PYTHON] Cleanup imports in pyspark.sql | 2018-03-26 12:42:32 +09:00 |
| `context.py` | [SPARK-23706][PYTHON] spark.conf.get(value, default=None) should produce None in PySpark | 2018-03-18 20:24:14 +09:00 |
| `dataframe.py` | [SPARK-24392][PYTHON] Label pandas_udf as Experimental | 2018-05-28 12:56:05 +08:00 |
| `functions.py` | [SPARK-23920][SQL] add array_remove to remove all elements that equal element from array | 2018-05-31 22:04:26 -07:00 |
| `group.py` | [SPARK-24392][PYTHON] Label pandas_udf as Experimental | 2018-05-28 12:56:05 +08:00 |
| `readwriter.py` | [SPARK-23786][SQL] Checking column names of csv headers | 2018-06-03 22:02:21 -07:00 |
| `session.py` | [SPARK-24392][PYTHON] Label pandas_udf as Experimental | 2018-05-28 12:56:05 +08:00 |
| `streaming.py` | [SPARK-23786][SQL] Checking column names of csv headers | 2018-06-03 22:02:21 -07:00 |
| `tests.py` | [SPARK-23786][SQL] Checking column names of csv headers | 2018-06-03 22:02:21 -07:00 |
| `types.py` | [SPARK-24057][PYTHON] put the real data type in the AssertionError message | 2018-04-26 14:21:22 -07:00 |
| `udf.py` | [SPARK-23754][PYTHON] Re-raising StopIteration in client code | 2018-05-30 18:11:33 +08:00 |
| `utils.py` | [SPARK-23699][PYTHON][SQL] Raise same type of error caught with Arrow enabled | 2018-03-27 20:06:12 -07:00 |
| `window.py` | [SPARK-23861][SQL][DOC] Clarify default window frame with and without orderBy clause | 2018-04-07 00:15:54 +08:00 |