spark-instrumented-optimizer

goldmedal 1fdfe69352 [SPARK-22112][PYSPARK] Supports RDD of strings as input in spark.read.csv in PySpark

## What changes were proposed in this pull request?

We added a method to the Scala API for creating a `DataFrame` from `Dataset[String]` storing CSV in [SPARK-15463](https://issues.apache.org/jira/browse/SPARK-15463), but PySpark doesn't have `Dataset` to support this feature. Therefore, I added an API to create a `DataFrame` from `RDD[String]` storing CSV, which is also consistent with PySpark's `spark.read.json`.

For example:

```
>>> rdd = sc.textFile('python/test_support/sql/ages.csv')
>>> df2 = spark.read.csv(rdd)
>>> df2.dtypes
[('_c0', 'string'), ('_c1', 'string')]
```

## How was this patch tested?

Added unit test cases.

Author: goldmedal <liugs963@gmail.com>

Closes #19339 from goldmedal/SPARK-22112.

2017-09-27 11:19:45 +09:00
ml	[SPARK-21981][PYTHON][ML] Added Python interface for ClusteringEvaluator	2017-09-22 13:12:33 +08:00
mllib	[SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel	2017-05-24 22:55:38 +08:00
sql	[SPARK-22112][PYSPARK] Supports RDD of strings as input in spark.read.csv in PySpark	2017-09-27 11:19:45 +09:00
streaming	[SPARK-21893][BUILD][STREAMING][WIP] Put Kafka 0.8 behind a profile	2017-09-13 10:10:40 +01:00
__init__.py	[MINOR] Fix some typo of the document	2017-06-19 20:35:58 +01:00
accumulators.py	[SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod()	2015-06-26 08:12:22 -07:00
broadcast.py	[SPARK-12717][PYTHON] Adding thread-safe broadcast pickle registry	2017-08-02 07:12:23 +09:00
cloudpickle.py	[SPARK-21070][PYSPARK] Attempt to update cloudpickle again	2017-08-22 11:17:53 +09:00
conf.py	[SPARK-18447][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that` across Python API documentation	2016-11-22 11:40:18 +00:00
context.py	[SPARK-22043][PYTHON] Improves error message for show_profiles and dump_profiles	2017-09-18 13:20:11 +09:00
daemon.py	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
files.py	[SPARK-3309] [PySpark] Put all public API in __all__	2014-09-03 11:49:45 -07:00
find_spark_home.py	[SPARK-1267][SPARK-18129] Allow PySpark to be pip installed	2016-11-16 14:22:15 -08:00
heapq3.py	[SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod()	2015-06-26 08:12:22 -07:00
java_gateway.py	[SPARK-1267][SPARK-18129] Allow PySpark to be pip installed	2016-11-16 14:22:15 -08:00
join.py	[SPARK-14202] [PYTHON] Use generator expression instead of list comp in python_full_outer_jo…	2016-03-28 14:51:36 -07:00
profiler.py	[SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod()	2015-06-26 08:12:22 -07:00
rdd.py	[SPARK-21551][PYTHON] Increase timeout for PythonRDD.serveIterator	2017-08-09 14:03:18 -07:00
rddsampler.py	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
resultiterable.py	[SPARK-3074] [PySpark] support groupByKey() with single huge key	2015-04-09 17:07:23 -07:00
serializers.py	[SPARK-22106][PYSPARK][SQL] Disable 0-parameter pandas_udf and add doctests	2017-09-26 10:54:00 +09:00
shell.py	[SPARK-19570][PYSPARK] Allow to disable hive in pyspark shell	2017-04-12 10:54:50 -07:00
shuffle.py	[SPARK-10710] Remove ability to disable spilling in core and SQL	2015-09-19 21:40:21 -07:00
statcounter.py	[SPARK-6919] [PYSPARK] Add asDict method to StatCounter	2015-09-29 13:38:15 -07:00
status.py	[SPARK-4172] [PySpark] Progress API in Python	2015-02-17 13:36:43 -08:00
storagelevel.py	[SPARK-13992][CORE][PYSPARK][FOLLOWUP] Update OFF_HEAP semantics for Java api and Python api	2016-04-12 23:06:55 -07:00
taskcontext.py	[SPARK-18576][PYTHON] Add basic TaskContext information to PySpark	2016-12-20 15:51:21 -08:00
tests.py	[SPARK-22043][PYTHON] Improves error message for show_profiles and dump_profiles	2017-09-18 13:20:11 +09:00
traceback_utils.py	[SPARK-1087] Move python traceback utilities into new traceback_utils.py file.	2014-09-15 19:28:17 -07:00
util.py	[SPARK-19505][PYTHON] AttributeError on Exception.message in Python3	2017-04-11 12:18:31 -07:00
version.py	[MINOR] Bump SparkR and PySpark version to 2.3.0.	2017-06-19 11:13:03 +01:00
worker.py	[SPARK-22106][PYSPARK][SQL] Disable 0-parameter pandas_udf and add doctests	2017-09-26 10:54:00 +09:00