spark-instrumented-optimizer/python/pyspark/sql
Bill Chambers 603f4453a1 [SPARK-15264][SPARK-15274][SQL] CSV Reader Error on Blank Column Names
## What changes were proposed in this pull request?

When `header` is specified to be `true` and a CSV begins with:
- `,,`
OR
- `"","",`

meaning that the first column names are empty or blank strings, each such column name is replaced with `C` + the index number of that column. For example, if you were to read in the CSV:
```
"","second column"
"hello", "there"
```
Then the column names would become `"C0", "second column"`.

This behavior aligns with what currently happens when `header` is specified to be `false` in recent versions of Spark.
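
The renaming rule described above can be sketched in plain Python. This is only an illustration, not the code from this patch: the function name `fill_blank_column_names` is hypothetical, and treating `None` and whitespace-only names as blank is an assumption.

```python
from typing import List, Optional


def fill_blank_column_names(header: List[Optional[str]]) -> List[str]:
    """Replace blank header names with "C" + the column's zero-based index."""
    names = []
    for index, name in enumerate(header):
        # Assumption: None, "" and whitespace-only names all count as blank.
        if name is None or name.strip() == "":
            names.append("C" + str(index))
        else:
            names.append(name)
    return names


# Header row from the example CSV above.
print(fill_blank_column_names(["", "second column"]))  # ['C0', 'second column']
```

Because the generated name depends only on the column's position, a blank header gets the same name that column would get if no header were read at all.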

### Current Behavior in Spark <=1.6
In Spark <=1.6, a blank column name in a CSV is kept as a blank string, `""`, meaning that this column cannot be accessed. However, the CSV reads in without issue.

### Current Behavior in Spark 2.0
Spark throws a `NullPointerException` and will not read in the file.

#### Reproduction in 2.0
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/346304/2828750690305044/484361/latest.html

## How was this patch tested?
A new test was added to `CSVSuite` to account for this issue. The test asserts that both the empty column names and the regular column names can be selected.
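
The shape of such a check can be sketched with Python's standard `csv` module instead of Spark. This is not the `CSVSuite` test itself: the helper `read_csv_with_named_columns` is hypothetical, and `skipinitialspace=True` is a choice made here so that the space before `"there"` in the sample does not end up in the value.

```python
import csv
import io

# Sample file from the description: the first column name is blank.
SAMPLE = '"","second column"\n"hello", "there"\n'


def read_csv_with_named_columns(text):
    """Parse CSV text and return rows as dicts keyed by column name."""
    rows = list(csv.reader(io.StringIO(text), skipinitialspace=True))
    # Blank header names become "C" + the zero-based column index.
    header = ["C%d" % i if name == "" else name
              for i, name in enumerate(rows[0])]
    return [dict(zip(header, row)) for row in rows[1:]]


records = read_csv_with_named_columns(SAMPLE)

# Both the generated name and the regular name can be selected.
assert [r["C0"] for r in records] == ["hello"]
assert [r["second column"] for r in records] == ["there"]
```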

Author: Bill Chambers <bill@databricks.com>
Author: Bill Chambers <wchambers@ischool.berkeley.edu>

Closes #13041 from anabranch/master.
2016-05-11 17:42:13 -07:00
__init__.py [SPARK-14945][PYTHON] SparkSession Python API 2016-04-28 10:55:48 -07:00
catalog.py [SPARK-14988][PYTHON] SparkSession API follow-ups 2016-04-29 16:41:13 -07:00
column.py [SPARK-15278] [SQL] Remove experimental tag from Python DataFrame 2016-05-11 15:12:27 -07:00
conf.py [SPARK-15126][SQL] RuntimeConfig.set should return Unit 2016-05-04 14:26:05 -07:00
context.py [SPARK-15270] [SQL] Use SparkSession Builder to build a session with HiveSupport 2016-05-11 14:15:18 -07:00
dataframe.py [SPARK-15278] [SQL] Remove experimental tag from Python DataFrame 2016-05-11 15:12:27 -07:00
functions.py [SPARK-14639] [PYTHON] [R] Add bround function in Python/R. 2016-04-19 22:28:11 -07:00
group.py [SPARK-12756][SQL] use hash expression in Exchange 2016-01-13 22:43:28 -08:00
readwriter.py [SPARK-15264][SPARK-15274][SQL] CSV Reader Error on Blank Column Names 2016-05-11 17:42:13 -07:00
session.py [SPARK-15126][SQL] RuntimeConfig.set should return Unit 2016-05-04 14:26:05 -07:00
streaming.py [SPARK-14896][SQL] Deprecate HiveContext in python 2016-05-04 17:39:30 -07:00
tests.py [SPARK-15037] [SQL] [MLLIB] Part2: Use SparkSession instead of SQLContext in Python TestSuites 2016-05-11 11:24:16 -07:00
types.py [SPARK-12200][SQL] Add __contains__ implementation to Row 2016-05-11 13:15:11 -07:00
utils.py [SPARK-14603][SQL] Verification of Metadata Operations by Session Catalog 2016-05-10 11:25:55 -07:00
window.py [SPARK-14058][PYTHON] Incorrect docstring in Window.order 2016-03-21 23:52:33 -07:00