spark-instrumented-optimizer

hyukjinkwon 25a020be99 [SPARK-17583][SQL] Remove uesless rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV	2016-09-21 10:35:29 +01:00

## What changes were proposed in this pull request?

This PR includes the changes below:

1. Upgrade the Univocity library from 2.1.1 to 2.2.1

   This brings some performance improvements and also enables an auto-expanding buffer for the `maxCharsPerColumn` option in CSV. Please refer to the [release notes](https://github.com/uniVocity/univocity-parsers/releases).

2. Remove the useless `rowSeparator` variable in `CSVOptions`

   This unused variable exists in [CSVOptions.scala#L127](`29952ed096/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala (L127)`), and it can cause confusion because it does not actually handle `\r\n`. For example, there is an open issue about this, [SPARK-17227](https://issues.apache.org/jira/browse/SPARK-17227), describing this variable. The variable is effectively unused because we rely on `LineRecordReader` in Hadoop, which handles only `\n` and `\r\n`.

3. Set the default value of `maxCharsPerColumn` to auto-expanding

   We currently set 1000000 as the maximum length of each column. It would be more sensible to allow auto-expanding by default rather than a fixed length. Using `-1` for this is described in the release notes for [2.2.0](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.2.0).

## How was this patch tested?

N/A

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15138 from HyukjinKwon/SPARK-17583.
__init__.py	[SPARK-16772][PYTHON][DOCS] Fix API doc references to UDFRegistration + Update "important classes"	2016-08-06 05:02:59 +01:00
catalog.py	[SPARK-16772] Correct API doc references to PySpark classes + formatting fixes	2016-07-28 14:57:15 -07:00
column.py	[SPARK-17215][SQL] Method `SQLContext.parseDataType(dataTypeString: String)` could be removed.	2016-08-24 23:36:04 -07:00
conf.py	[SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code	2016-05-23 18:14:48 -07:00
context.py	[SPARK-16700][PYSPARK][SQL] create DataFrame from dict/Row with schema	2016-08-15 12:41:27 -07:00
dataframe.py	[SPARK-17514] df.take(1) and df.limit(1).collect() should perform the same in Python	2016-09-14 10:10:01 -07:00
functions.py	[SPARK-17215][SQL] Method `SQLContext.parseDataType(dataTypeString: String)` could be removed.	2016-08-24 23:36:04 -07:00
group.py	[MINOR][PYSPARK][DOC] Fix wrongly formatted examples in PySpark documentation	2016-07-06 10:45:51 -07:00
readwriter.py	[SPARK-17583][SQL] Remove uesless rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV	2016-09-21 10:35:29 +01:00
session.py	[SPARK-17261] [PYSPARK] Using HiveContext after re-creating SparkContext in Spark 2.0 throws "Java.lang.illegalStateException: Cannot call methods on a stopped sparkContext"	2016-09-02 10:08:14 -07:00
streaming.py	[SPARK-17583][SQL] Remove uesless rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV	2016-09-21 10:35:29 +01:00
tests.py	[SPARK-17100] [SQL] fix Python udf in filter on top of outer join	2016-09-19 13:24:16 -07:00
types.py	[SPARK-17215][SQL] Method `SQLContext.parseDataType(dataTypeString: String)` could be removed.	2016-08-24 23:36:04 -07:00
utils.py	[SPARK-15953][WIP][STREAMING] Renamed ContinuousQuery to StreamingQuery	2016-06-15 10:46:07 -07:00
window.py	[SPARK-14058][PYTHON] Incorrect docstring in Window.order	2016-03-21 23:52:33 -07:00