spark-instrumented-optimizer

hyukjinkwon 7e5359be5c [SPARK-19610][SQL] Support parsing multiline CSV files	2017-02-28 13:34:33 -08:00

## What changes were proposed in this pull request?

This PR proposes support for multiline records in CSV, mirroring the multiline support in the JSON datasource (which, for JSON, is per file). It introduces the `wholeFile` option, which makes the format non-splittable and reads each file as a whole.

Since the Univocity parser can produce each row from a stream, it should be capable of parsing very large documents when the internal rows fit in memory.

## How was this patch tested?

Unit tests in `CSVSuite` and `tests.py`.

Manual tests with a single 9GB CSV file in the local file system, for example:

```scala
spark.read.option("wholeFile", true).option("inferSchema", true).csv("tmp.csv").count()
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16976 from HyukjinKwon/SPARK-19610.

ml	[SPARK-14489][ML][PYSPARK] ALS unknown user/item prediction strategy	2017-02-28 16:17:35 +02:00
mllib	[SPARK-17645][MLLIB][ML][FOLLOW-UP] document minor change	2017-01-10 13:09:58 +00:00
sql	[SPARK-19610][SQL] Support parsing multiline CSV files	2017-02-28 13:34:33 -08:00
streaming	[SPARK-19405][STREAMING] Support for cross-account Kinesis reads via STS	2017-02-22 11:32:36 -05:00
__init__.py	[SPARK-18576][PYTHON] Add basic TaskContext information to PySpark	2016-12-20 15:51:21 -08:00
accumulators.py	[SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod()	2015-06-26 08:12:22 -07:00
broadcast.py	[SPARK-17472] [PYSPARK] Better error message for serialization failures of large objects in Python	2016-09-14 13:37:35 -07:00
cloudpickle.py	[SPARK-19019] [PYTHON] Fix hijacked `collections.namedtuple` and port cloudpickle changes for PySpark to work with Python 3.6.0	2017-01-17 09:53:20 -08:00
conf.py	[SPARK-18447][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that` across Python API documentation	2016-11-22 11:40:18 +00:00
context.py	[SPARK-13330][PYSPARK] PYTHONHASHSEED is not propgated to python worker	2017-02-24 15:04:42 -08:00
daemon.py	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
files.py	[SPARK-3309] [PySpark] Put all public API in __all__	2014-09-03 11:49:45 -07:00
find_spark_home.py	[SPARK-1267][SPARK-18129] Allow PySpark to be pip installed	2016-11-16 14:22:15 -08:00
heapq3.py	[SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod()	2015-06-26 08:12:22 -07:00
java_gateway.py	[SPARK-1267][SPARK-18129] Allow PySpark to be pip installed	2016-11-16 14:22:15 -08:00
join.py	[SPARK-14202] [PYTHON] Use generator expression instead of list comp in python_full_outer_jo…	2016-03-28 14:51:36 -07:00
profiler.py	[SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod()	2015-06-26 08:12:22 -07:00
rdd.py	[SPARK-13330][PYSPARK] PYTHONHASHSEED is not propgated to python worker	2017-02-24 15:04:42 -08:00
rddsampler.py	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
resultiterable.py	[SPARK-3074] [PySpark] support groupByKey() with single huge key	2015-04-09 17:07:23 -07:00
serializers.py	[SPARK-19019] [PYTHON] Fix hijacked `collections.namedtuple` and port cloudpickle changes for PySpark to work with Python 3.6.0	2017-01-17 09:53:20 -08:00
shell.py	[SPARK-16536][SQL][PYSPARK][MINOR] Expose `sql` in PySpark Shell	2016-07-13 22:24:26 -07:00
shuffle.py	[SPARK-10710] Remove ability to disable spilling in core and SQL	2015-09-19 21:40:21 -07:00
statcounter.py	[SPARK-6919] [PYSPARK] Add asDict method to StatCounter	2015-09-29 13:38:15 -07:00
status.py	[SPARK-4172] [PySpark] Progress API in Python	2015-02-17 13:36:43 -08:00
storagelevel.py	[SPARK-13992][CORE][PYSPARK][FOLLOWUP] Update OFF_HEAP semantics for Java api and Python api	2016-04-12 23:06:55 -07:00
taskcontext.py	[SPARK-18576][PYTHON] Add basic TaskContext information to PySpark	2016-12-20 15:51:21 -08:00
tests.py	[SPARK-19660][CORE][SQL] Replace the configuration property names that are deprecated in the version of Hadoop 2.6	2017-02-28 10:13:42 +00:00
traceback_utils.py	[SPARK-1087] Move python traceback utilities into new traceback_utils.py file.	2014-09-15 19:28:17 -07:00
version.py	[SPARK-1267][SPARK-18129] Allow PySpark to be pip installed	2016-11-16 14:22:15 -08:00
worker.py	[SPARK-18576][PYTHON] Add basic TaskContext information to PySpark	2016-12-20 15:51:21 -08:00