spark-instrumented-optimizer/python/pyspark/sql
Jeff Evans 95de93b24e [SPARK-24540][SQL] Support for multiple character delimiter in Spark CSV read
Updating univocity-parsers version to 2.8.3, which adds support for multiple character delimiters

Moving univocity-parsers version to spark-parent pom dependencyManagement section

Adding new utility method to build multi-char delimiter string, which delegates to existing one

Adding tests for multiple character delimited CSV

### What changes were proposed in this pull request?

Adds support for parsing CSV data using multiple-character delimiters.  Existing logic for converting the input delimiter string to characters was kept and invoked in a loop.  Project dependencies were updated to remove redundant declaration of `univocity-parsers` version, and also to change that version to the latest.

### Why are the changes needed?

It is quite common for people to have delimited data, where the delimiter is not a single character, but rather a sequence of characters.  Currently, it is difficult to handle such data in Spark (typically needs pre-processing).

### Does this PR introduce any user-facing change?

Yes. Specifying the "delimiter" option for the DataFrame read, and providing more than one character, will no longer result in an exception.  Instead, it will be converted as before and passed to the underlying library (Univocity), which has accepted multiple character delimiters since 2.8.0.

### How was this patch tested?

The `CSVSuite` tests were confirmed passing (including new methods), and `sbt` tests for `sql` were executed.

Closes #26027 from jeff303/SPARK-24540.

Authored-by: Jeff Evans <jeffrey.wayne.evans@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-15 15:44:51 -05:00
..
avro [SPARK-28698][SQL] Support user-specified output schema in to_avro 2019-08-13 20:52:16 +08:00
tests [SPARK-29402][PYTHON][TESTS] Added tests for grouped map pandas_udf with window 2019-10-11 16:19:13 -07:00
__init__.py [SPARK-27463][PYTHON][FOLLOW-UP] Miscellaneous documentation and code cleanup of cogroup pandas UDF 2019-09-30 22:25:35 +09:00
catalog.py [SPARK-28980][CORE][SQL][STREAMING][MLLIB] Remove most items deprecated in Spark 2.2.0 or earlier, for Spark 3 2019-09-09 10:19:40 -05:00
cogroup.py [SPARK-27463][PYTHON][FOLLOW-UP] Miscellaneous documentation and code cleanup of cogroup pandas UDF 2019-09-30 22:25:35 +09:00
column.py [SPARK-28031][PYSPARK][TEST] Improve doctest on over function of Column 2019-06-13 11:04:41 +09:00
conf.py [SPARK-23698][PYTHON] Resolve undefined names in Python 3 2018-08-22 10:06:59 -07:00
context.py [SPARK-28980][CORE][SQL][STREAMING][MLLIB] Remove most items deprecated in Spark 2.2.0 or earlier, for Spark 3 2019-09-09 10:19:40 -05:00
dataframe.py [SPARK-27659][PYTHON] Allow PySpark to prefetch during toLocalIterator 2019-09-20 09:59:31 -07:00
functions.py [SPARK-29240][PYTHON] Pass Py4J column instance to support PySpark column in element_at function 2019-09-27 11:04:55 -07:00
group.py [SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs 2019-09-17 17:13:50 -07:00
readwriter.py [SPARK-24540][SQL] Support for multiple character delimiter in Spark CSV read 2019-10-15 15:44:51 -05:00
session.py [SPARK-27995][PYTHON] Note the difference between str of Python 2 and 3 at Arrow optimized 2019-06-11 18:43:59 +09:00
streaming.py [SPARK-24540][SQL] Support for multiple character delimiter in Spark CSV read 2019-10-15 15:44:51 -05:00
types.py [SPARK-29041][PYTHON] Allows createDataFrame to accept bytes as binary type 2019-09-12 08:52:25 +09:00
udf.py [SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs 2019-09-17 17:13:50 -07:00
utils.py [SPARK-21045][PYTHON] Allow non-ascii string as an exception message from python execution in Python 2 2019-09-21 08:09:19 +09:00
window.py [SPARK-28855][CORE][ML][SQL][STREAMING] Remove outdated usages of Experimental, Evolving annotations 2019-09-01 10:15:00 -05:00