Jeff Evans 95de93b24e [SPARK-24540][SQL] Support for multiple character delimiter in Spark CSV read
Updating univocity-parsers version to 2.8.3, which adds support for multiple-character delimiters

Moving the univocity-parsers version to the spark-parent pom dependencyManagement section

Adding a new utility method to build a multi-char delimiter string, which delegates to the existing one

Adding tests for multiple-character-delimited CSV

### What changes were proposed in this pull request?

Adds support for parsing CSV data using multiple-character delimiters. The existing logic for converting the input delimiter string to characters was kept and is now invoked in a loop. Project dependencies were updated to remove a redundant declaration of the `univocity-parsers` version and to bump that version to the latest release.
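
A minimal sketch of that approach (`toChar` and `toDelimiterStr` here are illustrative stand-ins, not the exact Spark internals): the pre-existing single-character unescaping logic is kept and simply applied in a loop over the input string.

```scala
object DelimiterUtils {
  // Illustrative stand-in for the pre-existing single-character converter,
  // which unescapes sequences such as "\\t" into the real character.
  def toChar(str: String): Char = str match {
    case s if s.length == 1 => s.charAt(0)
    case "\\t" => '\t'
    case "\\r" => '\r'
    case "\\n" => '\n'
    case "\\\\" => '\\'
    case s => throw new IllegalArgumentException(s"Unsupported delimiter: $s")
  }

  // Builds a multi-character delimiter by delegating to the existing
  // one-character conversion for each (possibly escaped) unit of the input.
  def toDelimiterStr(str: String): String = {
    val out = new StringBuilder
    var i = 0
    while (i < str.length) {
      if (str.charAt(i) == '\\' && i + 1 < str.length) {
        out += toChar(str.substring(i, i + 2))
        i += 2
      } else {
        out += toChar(str.substring(i, i + 1))
        i += 1
      }
    }
    out.toString
  }
}
```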

### Why are the changes needed?

It is quite common to have delimited data where the delimiter is not a single character but a sequence of characters. Such data is currently difficult to handle in Spark and typically requires pre-processing.

### Does this PR introduce any user-facing change?

Yes. Supplying more than one character in the "delimiter" option of a DataFrame read will no longer result in an exception. Instead, the value is converted as before and passed to the underlying library (Univocity), which has accepted multiple-character delimiters since version 2.8.0.
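
For example (a sketch; the input path is hypothetical), a multi-character delimiter can now be passed directly to the standard CSV reader:

```scala
import org.apache.spark.sql.SparkSession

object MultiCharDelimiterExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("multi-char-delimiter-example")
      .master("local[*]")
      .getOrCreate()

    // Before this change, any "delimiter" longer than one character threw an
    // exception; now the value is forwarded to Univocity.
    val df = spark.read
      .option("delimiter", "||")
      .option("header", "true")
      .csv("examples/multi_delim.csv") // hypothetical input path
    df.show()

    spark.stop()
  }
}
```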

### How was this patch tested?

The `CSVSuite` tests (including the newly added ones) were confirmed passing, and the `sbt` tests for the `sql` module were executed.
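
A sketch of the kind of test this adds (the test name and data are illustrative, not the exact `CSVSuite` additions; it assumes the suite's usual harness, i.e. a `spark` session with `testImplicits` in scope):

```scala
import org.apache.spark.sql.Row

test("SPARK-24540: read CSV using a multiple-character delimiter") {
  import testImplicits._
  // Build an in-memory CSV dataset whose fields are separated by "||".
  val input = Seq("col1||col2||col3", "1||2||3").toDS()
  val df = spark.read
    .option("delimiter", "||")
    .option("header", "true")
    .csv(input)
  assert(df.columns.toSeq === Seq("col1", "col2", "col3"))
  assert(df.collect().toSeq === Seq(Row("1", "2", "3")))
}
```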

Closes #26027 from jeff303/SPARK-24540.

Authored-by: Jeff Evans <jeffrey.wayne.evans@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-15 15:44:51 -05:00
create-release [SPARK-28906][BUILD] Fix incorrect information in bin/spark-submit --version 2019-09-11 08:12:44 -05:00
deps [SPARK-24540][SQL] Support for multiple character delimiter in Spark CSV read 2019-10-15 15:44:51 -05:00
sparktestsupport [SPARK-27463][PYTHON][FOLLOW-UP] Run the tests of Cogrouped pandas UDF 2019-09-22 21:39:30 +09:00
tests [MINOR] Fix typos in dev/* scripts. 2018-01-31 07:37:25 +09:00
.gitignore [SPARK-23174][BUILD][PYTHON][FOLLOWUP] Add pycodestyle*.py to .gitignore file. 2018-01-31 00:51:00 +09:00
.rat-excludes [SPARK-27489][WEBUI] UI updates to show executor resource information 2019-09-04 09:45:44 +08:00
.scalafmt.conf [SPARK-26177] Config change followup to [SPARK-26177] Automated formatting for Scala code 2018-12-03 10:03:51 -06:00
appveyor-guide.md [SPARK-26918][DOCS] All .md should have ASF license header 2019-03-30 19:49:45 -05:00
appveyor-install-dependencies.ps1 [SPARK-29159][BUILD] Increase ReservedCodeCacheSize to 1G 2019-09-19 00:24:15 -07:00
change-scala-version.sh [SPARK-26132][BUILD][CORE] Remove support for Scala 2.11 in Spark 3.0.0 2019-03-25 10:46:42 -05:00
check-license [MINOR][BUILD] Upgrade apache-rat to 0.13 2019-04-01 16:44:42 +09:00
checkstyle-suppressions.xml [MINOR][BUILD] Update all checkstyle dtd to use "https://checkstyle.org" 2019-02-25 11:25:53 -08:00
checkstyle.xml [SPARK-29470][BUILD] Update plugins to latest versions 2019-10-15 11:55:52 -07:00
github_jira_sync.py [SPARK-27889][INFRA] Make development scripts under dev/ support Python 3 2019-08-09 18:55:48 +09:00
lint-java [SPARK-23063][K8S] K8s changes for publishing scripts (and a couple of other misses) 2018-01-13 21:34:28 -08:00
lint-python [BUILD] refactor dev/lint-python into something readable 2018-11-20 12:38:40 -08:00
lint-r [SPARK-10328] [SPARKR] Fix generic for na.omit 2015-08-28 00:37:50 -07:00
lint-r.R [SPARK-22063][R] Fixes lint check failures in R by latest commit sha1 ID of lint-r 2017-10-01 18:42:45 +09:00
lint-scala [SPARK-27158][BUILD] dev/mima and dev/scalastyle support dynamic profiles 2019-03-15 08:20:42 +09:00
make-distribution.sh [SPARK-29159][BUILD] Increase ReservedCodeCacheSize to 1G 2019-09-19 00:24:15 -07:00
merge_spark_pr.py [MINOR][BUILD] Decode output of commands during merge script as UTF-8 consistently 2019-10-02 11:28:55 +09:00
mima [SPARK-27158][BUILD] dev/mima and dev/scalastyle support dynamic profiles 2019-03-15 08:20:42 +09:00
pip-sanity-check.py [SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis 2019-01-17 19:40:39 -06:00
README.md Merge pull request #565 from pwendell/dev-scripts. Closes #565. 2014-02-08 23:13:34 -08:00
requirements.txt [SPARK-25270] lint-python: Add flake8 to find syntax errors and undefined names 2018-09-07 09:35:25 -07:00
run-pip-tests Fix typos detected by github.com/client9/misspell 2018-08-11 21:23:36 -05:00
run-tests [SPARK-22302][INFRA] Remove manual backports for subprocess and print explicit message for < Python 2.7 2017-10-22 02:22:35 +09:00
run-tests-jenkins [MINOR] Fix typos in dev/* scripts. 2018-01-31 07:37:25 +09:00
run-tests-jenkins.py [SPARK-28701][TEST-HADOOP3.2][TEST-JAVA11][K8S] adding java11 support for pull request builds 2019-08-27 00:48:01 +09:00
run-tests.py [SPARK-28701][INFRA][FOLLOWUP] Fix the key error when looking in os.environ 2019-08-26 12:40:31 -07:00
sbt-checkstyle [SPARK-27158][BUILD] dev/mima and dev/scalastyle support dynamic profiles 2019-03-15 08:20:42 +09:00
scalafmt [SPARK-26177] Automated formatting for Scala code 2018-11-29 08:54:31 -06:00
scalastyle [SPARK-27158][BUILD] dev/mima and dev/scalastyle support dynamic profiles 2019-03-15 08:20:42 +09:00
test-dependencies.sh [SPARK-29308][BUILD] Update deps in dev/deps/spark-deps-hadoop-3.2 for hadoop-3.2 2019-10-13 12:53:12 -05:00
tox.ini [SPARK-23367][BUILD] Include python document style checking 2018-10-27 08:20:42 -05:00

Spark Developer Scripts

This directory contains scripts useful to developers when packaging, testing, or committing to Spark.

Many of these scripts require Apache credentials to work correctly.