## What changes were proposed in this pull request?
Adds an quoteAll option for writing CSV which will quote all fields.
See https://issues.apache.org/jira/browse/SPARK-13638
## How was this patch tested?
Added a test to verify the output columns are quoted for all fields in the Dataframe
Author: Jurriaan Pruis <email@jurriaanpruis.nl>
Closes#13374 from jurriaan/csv-quote-all.
## What changes were proposed in this pull request?
This PR corrects ORC compression option for PySpark as well. I think this was missed mistakenly in https://github.com/apache/spark/pull/13948.
## How was this patch tested?
N/A
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#13963 from HyukjinKwon/minor-orc-compress.
#### What changes were proposed in this pull request?
In Python API, we have the same issue. Thanks for identifying this issue, zsxwing ! Below is an example:
```Python
spark.read.format('json').load('python/test_support/sql/people.json')
```
#### How was this patch tested?
Existing test cases cover the changes by this PR
Author: gatorsmile <gatorsmile@gmail.com>
Closes#13965 from gatorsmile/optionPaths.
## What changes were proposed in this pull request?
- Moved DataStreamReader/Writer from pyspark.sql to pyspark.sql.streaming to make them consistent with scala packaging
- Exposed the necessary classes in sql.streaming package so that they appear in the docs
- Added pyspark.sql.streaming module to the docs
## How was this patch tested?
- updated unit tests.
- generated docs for testing visibility of pyspark.sql.streaming classes.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#13955 from tdas/SPARK-16266.
## What changes were proposed in this pull request?
There are some duplicated code for options in DataFrame reader/writer API, this PR clean them up, it also fix a bug for `escapeQuotes` of csv().
## How was this patch tested?
Existing tests.
Author: Davies Liu <davies@databricks.com>
Closes#13948 from davies/csv_options.
## What changes were proposed in this pull request?
This is a follow-up to https://github.com/apache/spark/pull/13795 to properly set CSV options in Python API. As part of this, I also make the Python option setting for both CSV and JSON more robust against positional errors.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#13800 from rxin/SPARK-13792-2.
## What changes were proposed in this pull request?
This pull request adds a new option (maxMalformedLogPerPartition) in CSV reader to limit the maximum of logging message Spark generates per partition for malformed records.
The error log looks something like
```
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: More than 10 malformed records have been found on this partition. Malformed records from now on will not be logged.
```
Closes#12173
## How was this patch tested?
Manually tested.
Author: Reynold Xin <rxin@databricks.com>
Closes#13795 from rxin/SPARK-13792.
## What changes were proposed in this pull request?
- Fixed bug in Python API of DataStreamReader. Because a single path was being converted to a array before calling Java DataStreamReader method (which takes a string only), it gave the following error.
```
File "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/readwriter.py", line 947, in pyspark.sql.readwriter.DataStreamReader.json
Failed example:
json_sdf = spark.readStream.json(os.path.join(tempfile.mkdtemp(), 'data'), schema = sdf_schema)
Exception raised:
Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", line 1253, in __run
compileflags, 1) in test.globs
File "<doctest pyspark.sql.readwriter.DataStreamReader.json[0]>", line 1, in <module>
json_sdf = spark.readStream.json(os.path.join(tempfile.mkdtemp(), 'data'), schema = sdf_schema)
File "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/readwriter.py", line 963, in json
return self._df(self._jreader.json(path))
File "/Users/tdas/Projects/Spark/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/Users/tdas/Projects/Spark/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 316, in get_return_value
format(target_id, ".", name, value))
Py4JError: An error occurred while calling o121.json. Trace:
py4j.Py4JException: Method json([class java.util.ArrayList]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:744)
```
- Reduced code duplication between DataStreamReader and DataFrameWriter
- Added missing Python doctests
## How was this patch tested?
New tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#13703 from tdas/SPARK-15981.
Renamed for simplicity, so that its obvious that its related to streaming.
Existing unit tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#13673 from tdas/SPARK-15953.
## What changes were proposed in this pull request?
Currently, the DataFrameReader/Writer has method that are needed for streaming and non-streaming DFs. This is quite awkward because each method in them through runtime exception for one case or the other. So rather having half the methods throw runtime exceptions, its just better to have a different reader/writer API for streams.
- [x] Python API!!
## How was this patch tested?
Existing unit tests + two sets of unit tests for DataFrameReader/Writer and DataStreamReader/Writer.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#13653 from tdas/SPARK-15933.
## What changes were proposed in this pull request?
This pr is to add doc for turning off quotations because this behavior is different from `com.databricks.spark.csv`.
## How was this patch tested?
Check behavior to put an empty string in csv options.
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes#13616 from maropu/SPARK-15585-2.
## What changes were proposed in this pull request?
This pr fixes the behaviour of `format("csv").option("quote", null)` along with one of spark-csv.
Also, it explicitly sets default values for CSV options in python.
## How was this patch tested?
Added tests in CSVSuite.
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes#13372 from maropu/SPARK-15585.
## What changes were proposed in this pull request?
Currently structured streaming only supports append output mode. This PR adds the following.
- Added support for Complete output mode in the internal state store, analyzer and planner.
- Added public API in Scala and Python for users to specify output mode
- Added checks for unsupported combinations of output mode and DF operations
- Plans with no aggregation should support only Append mode
- Plans with aggregation should support only Update and Complete modes
- Default output mode is Append mode (**Question: should we change this to automatically set to Complete mode when there is aggregation?**)
- Added support for Complete output mode in Memory Sink. So Memory Sink internally supports append and complete, update. But from public API only Complete and Append output modes are supported.
## How was this patch tested?
Unit tests in various test suites
- StreamingAggregationSuite: tests for complete mode
- MemorySinkSuite: tests for checking behavior in Append and Complete modes.
- UnsupportedOperationSuite: tests for checking unsupported combinations of DF ops and output modes
- DataFrameReaderWriterSuite: tests for checking that output mode cannot be called on static DFs
- Python doc test and existing unit tests modified to call write.outputMode.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#13286 from tdas/complete-mode.
## What changes were proposed in this pull request?
This reverts commit c24b6b679c. Sent a PR to run Jenkins tests due to the revert conflicts of `dev/deps/spark-deps-hadoop*`.
## How was this patch tested?
Jenkins unit tests, integration tests, manual tests)
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#13417 from zsxwing/revert-SPARK-11753.
## What changes were proposed in this pull request?
`a` -> `an`
I use regex to generate potential error lines:
`grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala`
and review them line by line.
## How was this patch tested?
local build
`lint-java` checking
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes#13317 from zhengruifeng/a_an.
## What changes were proposed in this pull request?
Jackson suppprts `allowNonNumericNumbers` option to parse non-standard non-numeric numbers such as "NaN", "Infinity", "INF". Currently used Jackson version (2.5.3) doesn't support it all. This patch upgrades the library and make the two ignored tests in `JsonParsingOptionsSuite` passed.
## How was this patch tested?
`JsonParsingOptionsSuite`.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes#9759 from viirya/fix-json-nonnumric.
## What changes were proposed in this pull request?
This patch is a follow-up to https://github.com/apache/spark/pull/13104 and adds documentation to clarify the semantics of read.text with respect to partitioning.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#13184 from rxin/SPARK-14463.
## What changes were proposed in this pull request?
Update the unit test code, examples, and documents to remove calls to deprecated method `dataset.registerTempTable`.
## How was this patch tested?
This PR only changes the unit test code, examples, and comments. It should be safe.
This is a follow up of PR https://github.com/apache/spark/pull/12945 which was merged.
Author: Sean Zhong <seanzhong@databricks.com>
Closes#13098 from clockfly/spark-15171-remove-deprecation.
## What changes were proposed in this pull request?
Seems db573fc743 did not remove withHiveSupport from readwrite.py
Author: Yin Huai <yhuai@databricks.com>
Closes#13069 from yhuai/fixPython.
## What changes were proposed in this pull request?
When a CSV begins with:
- `,,`
OR
- `"","",`
meaning that the first column names are either empty or blank strings and `header` is specified to be `true`, then the column name is replaced with `C` + the index number of that given column. For example, if you were to read in the CSV:
```
"","second column"
"hello", "there"
```
Then column names would become `"C0", "second column"`.
This behavior aligns with what currently happens when `header` is specified to be `false` in recent versions of Spark.
### Current Behavior in Spark <=1.6
In Spark <=1.6, a CSV with a blank column name becomes a blank string, `""`, meaning that this column cannot be accessed. However the CSV reads in without issue.
### Current Behavior in Spark 2.0
Spark throws a NullPointerError and will not read in the file.
#### Reproduction in 2.0
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/346304/2828750690305044/484361/latest.html
## How was this patch tested?
A new test was added to `CSVSuite` to account for this issue. We then have asserts that test for being able to select both the empty column names as well as the regular column names.
Author: Bill Chambers <bill@databricks.com>
Author: Bill Chambers <wchambers@ischool.berkeley.edu>
Closes#13041 from anabranch/master.
This PR:
* Corrects the documentation for the `properties` parameter, which is supposed to be a dictionary and not a list.
* Generally clarifies the Python docstring for DataFrameReader.jdbc() by pulling from the [Scala docstrings](b281377647/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala (L201-L251)) and rephrasing things.
* Corrects minor Sphinx typos.
Author: Nicholas Chammas <nicholas.chammas@gmail.com>
Closes#13034 from nchammas/SPARK-15256.
## What changes were proposed in this pull request?
Use SparkSession instead of SQLContext in Python TestSuites
## How was this patch tested?
Existing tests
Author: Sandeep Singh <sandeep@techaddict.me>
Closes#13044 from techaddict/SPARK-15037-python.
## What changes were proposed in this pull request?
This PR removes the old `json(path: String)` API which is covered by the new `json(paths: String*)`.
## How was this patch tested?
Jenkins tests (existing tests should cover this)
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Closes#13040 from HyukjinKwon/SPARK-15250.
## What changes were proposed in this pull request?
This patch removes experimental tag from DataFrameReader and DataFrameWriter, and explicitly tags a few methods added for structured streaming as experimental.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes#13038 from rxin/SPARK-15261.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-15050
This PR adds function parameters for Python API for reading and writing `csv()`.
## How was this patch tested?
This was tested by `./dev/run_tests`.
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Closes#12834 from HyukjinKwon/SPARK-15050.
## What changes were proposed in this pull request?
This PR adds the explanation and documentation for CSV options for reading and writing.
## How was this patch tested?
Style tests with `./dev/run_tests` for documentation style.
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Closes#12817 from HyukjinKwon/SPARK-13425.
## What changes were proposed in this pull request?
This PR adds Python APIs for:
- `ContinuousQueryManager`
- `ContinuousQueryException`
The `ContinuousQueryException` is a very basic wrapper, it doesn't provide the functionality that the Scala side provides, but it follows the same pattern for `AnalysisException`.
For `ContinuousQueryManager`, all APIs are provided except for registering listeners.
This PR also attempts to fix test flakiness by stopping all active streams just before tests.
## How was this patch tested?
Python Doc tests and unit tests
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#12673 from brkyvz/pyspark-cqm.
## What changes were proposed in this pull request?
```
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT
/_/
Using Python version 2.7.5 (default, Mar 9 2014 22:15:05)
SparkSession available as 'spark'.
>>> spark
<pyspark.sql.session.SparkSession object at 0x101f3bfd0>
>>> spark.sql("SHOW TABLES").show()
...
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
| src| false|
+---------+-----------+
>>> spark.range(1, 10, 2).show()
+---+
| id|
+---+
| 1|
| 3|
| 5|
| 7|
| 9|
+---+
```
**Note**: This API is NOT complete in its current state. In particular, for now I left out the `conf` and `catalog` APIs, which were added later in Scala. These will be added later before 2.0.
## How was this patch tested?
Python tests.
Author: Andrew Or <andrew@databricks.com>
Closes#12746 from andrewor14/python-spark-session.
## What changes were proposed in this pull request?
In Python, the `option` and `options` method of `DataFrameReader` and `DataFrameWriter` were sending the string "None" instead of `null` when passed `None`, therefore making it impossible to send an actual `null`. This fixes that problem.
This is based on #11305 from mathieulongtin.
## How was this patch tested?
Added test to readwriter.py.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: mathieu longtin <mathieu.longtin@nuance.com>
Closes#12494 from viirya/py-df-none-option.
## What changes were proposed in this pull request?
This patch provides a first cut of python APIs for structured streaming. This PR provides the new classes:
- ContinuousQuery
- Trigger
- ProcessingTime
in pyspark under `pyspark.sql.streaming`.
In addition, it contains the new methods added under:
- `DataFrameWriter`
a) `startStream`
b) `trigger`
c) `queryName`
- `DataFrameReader`
a) `stream`
- `DataFrame`
a) `isStreaming`
This PR doesn't contain all methods exposed for `ContinuousQuery`, for example:
- `exception`
- `sourceStatuses`
- `sinkStatus`
They may be added in a follow up.
This PR also contains some very minor doc fixes in the Scala side.
## How was this patch tested?
Python doc tests
TODO:
- [ ] verify Python docs look good
Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>
Closes#12320 from brkyvz/stream-python.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-14231
Currently, JSON data source supports to infer `DecimalType` for big numbers and `floatAsBigDecimal` option which reads floating-point values as `DecimalType`.
But there are few restrictions in Spark `DecimalType` below:
1. The precision cannot be bigger than 38.
2. scale cannot be bigger than precision.
Currently, both restrictions are not being handled.
This PR handles the cases by inferring them as `DoubleType`. Also, the option name was changed from `floatAsBigDecimal` to `prefersDecimal` as suggested [here](https://issues.apache.org/jira/browse/SPARK-14231?focusedCommentId=15215579&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15215579).
So, the codes below:
```scala
def doubleRecords: RDD[String] =
sqlContext.sparkContext.parallelize(
s"""{"a": 1${"0" * 38}, "b": 0.01}""" ::
s"""{"a": 2${"0" * 38}, "b": 0.02}""" :: Nil)
val jsonDF = sqlContext.read
.option("prefersDecimal", "true")
.json(doubleRecords)
jsonDF.printSchema()
```
produces below:
- **Before**
```scala
org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater than precision (1).;
at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:44)
at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:144)
at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:108)
at
...
```
- **After**
```scala
root
|-- a: double (nullable = true)
|-- b: double (nullable = true)
```
## How was this patch tested?
Unit tests were used and `./dev/run_tests` for coding style tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#12030 from HyukjinKwon/SPARK-14231.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13953
Currently, JSON data source creates a new field in `PERMISSIVE` mode for storing malformed string.
This field can be renamed via `spark.sql.columnNameOfCorruptRecord` option but it is a global configuration.
This PR make that option can be applied per read and can be specified via `option()`. This will overwrites `spark.sql.columnNameOfCorruptRecord` if it is set.
## How was this patch tested?
Unit tests were used and `./dev/run_tests` for coding style tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11881 from HyukjinKwon/SPARK-13953.
## What changes were proposed in this pull request?
Currently, there is no way to control the behaviour when fails to parse corrupt records in JSON data source .
This PR adds the support for parse modes just like CSV data source. There are three modes below:
- `PERMISSIVE` : When it fails to parse, this sets `null` to to field. This is a default mode when it has been this mode.
- `DROPMALFORMED`: When it fails to parse, this drops the whole record.
- `FAILFAST`: When it fails to parse, it just throws an exception.
This PR also make JSON data source share the `ParseModes` in CSV data source.
## How was this patch tested?
Unit tests were used and `./dev/run_tests` for code style tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11756 from HyukjinKwon/SPARK-13764.
## What changes were proposed in this pull request?
This PR adds the support to specify compression codecs for both ORC and Parquet.
## How was this patch tested?
unittests within IDE and code style tests with `dev/run_tests`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#11464 from HyukjinKwon/SPARK-13543.
https://issues.apache.org/jira/browse/SPARK-13507https://issues.apache.org/jira/browse/SPARK-13509
## What changes were proposed in this pull request?
This PR adds the support to write CSV data directly by a single call to the given path.
Several unitests were added for each functionality.
## How was this patch tested?
This was tested with unittests and with `dev/run_tests` for coding style
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>
Closes#11389 from HyukjinKwon/SPARK-13507-13509.
I tried to add this via `USE_BIG_DECIMAL_FOR_FLOATS` option from Jackson with no success.
Added test for non-complex types. Should I add a test for complex types?
Author: Brandon Bradley <bradleytastic@gmail.com>
Closes#10936 from blbradley/spark-12749.
We can provides the option to choose JSON parser can be enabled to accept quoting of all character or not.
Author: Cazen <Cazen@korea.com>
Author: Cazen Lee <cazen.lee@samsung.com>
Author: Cazen Lee <Cazen@korea.com>
Author: cazen.lee <cazen.lee@samsung.com>
Closes#10497 from Cazen/master.
Since we rename the column name from ```text``` to ```value``` for DataFrame load by ```SQLContext.read.text```, we need to update doc.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#10349 from yanboliang/text-value.
This patch makes it consistent to use varargs in all DataFrameReader methods, including Parquet, JSON, text, and the generic load function.
Also added a few more API tests for the Java API.
Author: Reynold Xin <rxin@databricks.com>
Closes#9945 from rxin/SPARK-11967.
This patch adds the following options to the JSON data source, for dealing with non-standard JSON files:
* `allowComments` (default `false`): ignores Java/C++ style comment in JSON records
* `allowUnquotedFieldNames` (default `false`): allows unquoted JSON field names
* `allowSingleQuotes` (default `true`): allows single quotes in addition to double quotes
* `allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers (e.g. 00012)
To avoid passing a lot of options throughout the json package, I introduced a new JSONOptions case class to define all JSON config options.
Also updated documentation to explain these options.
Scala
![screen shot 2015-11-15 at 6 12 12 pm](https://cloud.githubusercontent.com/assets/323388/11172965/e3ace6ec-8bc4-11e5-805e-2d78f80d0ed6.png)
Python
![screen shot 2015-11-15 at 6 11 28 pm](https://cloud.githubusercontent.com/assets/323388/11172964/e23ed6ee-8bc4-11e5-8216-312f5983acd5.png)
Author: Reynold Xin <rxin@databricks.com>
Closes#9724 from rxin/SPARK-11745.
Make sure comma-separated paths get processed correcly in ResolvedDataSource for a HadoopFsRelationProvider
Author: Koert Kuipers <koert@tresata.com>
Closes#8416 from koertkuipers/feat-sql-comma-separated-paths.
PySpark DataFrameReader should could accept an RDD of Strings (like the Scala version does) for JSON, rather than only taking a path.
If this PR is merged, it should be duplicated to cover the other input types (not just JSON).
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#8444 from yanboliang/spark-9964.
This PR adds DataFrame reader/writer shortcut methods for ORC in both Scala and Python.
Author: Cheng Lian <lian@databricks.com>
Closes#7444 from liancheng/spark-9100 and squashes the following commits:
284d043 [Cheng Lian] Fixes PySpark test cases and addresses PR comments
e0b09fb [Cheng Lian] Adds DataFrame reader/writer shortcut methods for ORC
Author: Reynold Xin <rxin@databricks.com>
Closes#7079 from rxin/SPARK-8698 and squashes the following commits:
8513e1c [Reynold Xin] [SPARK-8698] partitionBy in Python DataFrame reader/writer interface should not default to empty tuple.
I compared PySpark DataFrameReader/Writer against Scala ones. `Option` function is missing in both reader and writer, but the rest seems to all match.
I added `Option` to reader and writer and updated the `pyspark-sql` test.
Author: Cheolsoo Park <cheolsoop@netflix.com>
Closes#7078 from piaozhexiu/SPARK-8355 and squashes the following commits:
c63d419 [Cheolsoo Park] Fix version
524e0aa [Cheolsoo Park] Add option function to df reader and writer
https://issues.apache.org/jira/browse/SPARK-8532
This PR has two changes. First, it fixes the bug that save actions (i.e. `save/saveAsTable/json/parquet/jdbc`) always override mode. Second, it adds input argument `partitionBy` to `save/saveAsTable/parquet`.
Author: Yin Huai <yhuai@databricks.com>
Closes#6937 from yhuai/SPARK-8532 and squashes the following commits:
f972d5d [Yin Huai] davies's comment.
d37abd2 [Yin Huai] style.
d21290a [Yin Huai] Python doc.
889eb25 [Yin Huai] Minor refactoring and add partitionBy to save, saveAsTable, and parquet.
7fbc24b [Yin Huai] Use None instead of "error" as the default value of mode since JVM-side already uses "error" as the default value.
d696dff [Yin Huai] Python style.
88eb6c4 [Yin Huai] If mode is "error", do not call mode method.
c40c461 [Yin Huai] Regression test.
add schema()/format()/options() for reader, add mode()/format()/options()/partitionBy() for writer
cc rxin yhuai pwendell
Author: Davies Liu <davies@databricks.com>
Closes#6578 from davies/readwrite and squashes the following commits:
720d293 [Davies Liu] address comments
b65dfa2 [Davies Liu] Update readwriter.py
1299ab6 [Davies Liu] make Python API consistent with Scala
Add tests later.
Author: Davies Liu <davies@databricks.com>
Closes#6375 from davies/insertInto and squashes the following commits:
826423e [Davies Liu] add insertInto() to Writer
Add version info for public Python SQL API.
cc rxin
Author: Davies Liu <davies@databricks.com>
Closes#6295 from davies/versions and squashes the following commits:
cfd91e6 [Davies Liu] add more version for DataFrame API
600834d [Davies Liu] add version to SQL API docs
cc rxin, please take a quick look, I'm working on tests.
Author: Davies Liu <davies@databricks.com>
Closes#6238 from davies/readwrite and squashes the following commits:
c7200eb [Davies Liu] update tests
9cbf01b [Davies Liu] Merge branch 'master' of github.com:apache/spark into readwrite
f0c5a04 [Davies Liu] use sqlContext.read.load
5f68bc8 [Davies Liu] update tests
6437e9a [Davies Liu] Merge branch 'master' of github.com:apache/spark into readwrite
bcc6668 [Davies Liu] add reader amd writer API in Python