## What changes were proposed in this pull request?
Executors cache a list of their peers that is refreshed by default every minute. The cached stale references were randomly being used for replication. Since those executors were removed from the master, they did not occur in the block locations as reported by the master. This was fixed by
1. Refreshing peer cache in the block manager before trying to pro-actively replicate. This way the probability of replicating to a failed executor is eliminated.
2. Explicitly stopping the block manager in the tests. This shuts down the RPC endpoint use by the block manager. This way, even if a block manager tries to replicate using a stale reference, the replication logic should take care of refreshing the list of peers after failure.
## How was this patch tested?
Tested manually
Author: Shubham Chopra <schopra31@bloomberg.net>
Author: Kay Ousterhout <kayousterhout@gmail.com>
Author: Shubham Chopra <shubhamchopra@users.noreply.github.com>
Closes#17325 from shubhamchopra/SPARK-19803.
## What changes were proposed in this pull request?
It seems `checkType` and the type string in `structField` are not being tested closely. This string format currently seems SparkR-specific (see d1f6c64c4b/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala (L93-L131)) but resembles SQL type definition.
Therefore, it seems nicer if we test positive/negative cases in R side.
## How was this patch tested?
Unit tests in `test_sparkSQL.R`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17439 from HyukjinKwon/r-typestring-tests.
## What changes were proposed in this pull request?
The master snapshot publisher builds are currently broken due to two minor build issues:
1. For unknown reasons, the LFTP `mkdir -p` command began throwing errors when the remote directory already exists. This change of behavior might have been caused by configuration changes in the ASF's SFTP server, but I'm not entirely sure of that. To work around this problem, this patch updates the script to ignore errors from the `lftp mkdir -p` commands.
2. The PySpark `setup.py` file references a non-existent `pyspark.ml.stat` module, causing Python packaging to fail by complaining about a missing directory. The fix is to simply drop that line from the setup script.
## How was this patch tested?
The LFTP fix was tested by manually running the failing commands on AMPLab Jenkins against the ASF SFTP server. The PySpark fix was tested locally.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#17437 from JoshRosen/spark-20102.
## What changes were proposed in this pull request?
Instead of creating new `JavaSparkContext` we use `SparkContext.getOrCreate`.
## How was this patch tested?
Existing tests
Author: Hossein <hossein@databricks.com>
Closes#17423 from falaki/SPARK-20088.
## What changes were proposed in this pull request?
In current stage, we don't have advanced statistics such as sketches or histograms. As a result, some operator can't estimate `nullCount` accurately. E.g. left outer join estimation does not accurately update `nullCount` currently. So for `IsNull` and `IsNotNull` predicates, we only estimate them when the child is a leaf node, whose `nullCount` is accurate.
## How was this patch tested?
A new test case is added in `FilterEstimationSuite`.
Author: wangzhenhua <wangzhenhua@huawei.com>
Closes#17438 from wzhfy/nullEstimation.
## What changes were proposed in this pull request?
This PR proposes to match minor documentations changes in https://github.com/apache/spark/pull/17399 and https://github.com/apache/spark/pull/17380 to R/Python.
## How was this patch tested?
Manual tests in Python , Python tests via `./python/run-tests.py --module=pyspark-sql` and lint-checks for Python/R.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17429 from HyukjinKwon/minor-match-doc.
## What changes were proposed in this pull request?
- Add `HasSupport` and `HasConfidence` `Params`.
- Add new module `pyspark.ml.fpm`.
- Add `FPGrowth` / `FPGrowthModel` wrappers.
- Provide tests for new features.
## How was this patch tested?
Unit tests.
Author: zero323 <zero323@users.noreply.github.com>
Closes#17218 from zero323/SPARK-19281.
## What changes were proposed in this pull request?
The `CollapseWindow` is currently to aggressive when collapsing adjacent windows. It also collapses windows in the which the parent produces a column that is consumed by the child; this creates an invalid window which will fail at runtime.
This PR fixes this by adding a check for dependent adjacent windows to the `CollapseWindow` rule.
## How was this patch tested?
Added a new test case to `CollapseWindowSuite`
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#17432 from hvanhovell/SPARK-20086.
## What changes were proposed in this pull request?
Adding additional information to existing logging messages:
- YarnAllocator: log the executor ID together with the container id when a container for an executor is launched.
- NettyRpcEnv: log the receiver address when there is a timeout waiting for an answer to a remote call.
- ExecutorAllocationManager: fix a typo in the logging message for the list of executors to be removed.
## How was this patch tested?
Build spark and submit the word count example to a YARN cluster using cluster mode
Author: Juan Rodriguez Hortala <hortala@amazon.com>
Closes#17411 from juanrh/logging-improvements.
## What changes were proposed in this pull request?
We are currently detecting the changes in `R/` directory only and then trigger AppVeyor tests.
It seems we need to tests when there are Scala codes dedicated for R in `core/src/main/scala/org/apache/spark/api/r/`, `sql/core/src/main/scala/org/apache/spark/sql/api/r/` and `mllib/src/main/scala/org/apache/spark/ml/r/` too.
This will enables the tests, for example, for SPARK-20088.
## How was this patch tested?
Tests with manually created PRs.
- Changes in `sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala` https://github.com/spark-test/spark/pull/13
- Changes in `core/src/main/scala/org/apache/spark/api/r/SerDe.scala` https://github.com/spark-test/spark/pull/12
- Changes in `README.md` https://github.com/spark-test/spark/pull/14
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17427 from HyukjinKwon/SPARK-20092.
## What changes were proposed in this pull request?
The `FailureSafeParser` is only used in sql core, it doesn't make sense to put it in catalyst module.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#17408 from cloud-fan/minor.
## What changes were proposed in this pull request?
Use the new `compressed` method on matrices to store the logistic regression coefficients as sparse or dense - whichever is requires less memory.
Marked as WIP so we can add some performance test results. Basically, we should see if prediction is slower because of using a sparse matrix over a dense one. This can happen since sparse matrices do not use native BLAS operations when computing the margins.
## How was this patch tested?
Unit tests added.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#17426 from sethah/SPARK-17137.
## What changes were proposed in this pull request?
Adding configurable mesos executor names and labels using `spark.mesos.task.name` and `spark.mesos.task.labels`.
Labels were defined as `k1:v1,k2:v2`.
mgummelt
## How was this patch tested?
Added unit tests to verify labels were added correctly, with incorrect labels being ignored and added a test to test the name of the executor.
Tested with: `./build/sbt -Pmesos mesos/test`
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Kalvin Chau <kalvin.chau@viasat.com>
Closes#17404 from kalvinnchau/mesos-config.
### What changes were proposed in this pull request?
Multiple tests failed. Revert the changes on `supportCodegen` of `GenerateExec`. For example,
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75194/testReport/
### How was this patch tested?
N/A
Author: Xiao Li <gatorsmile@gmail.com>
Closes#17425 from gatorsmile/turnOnCodeGenGenerateExec.
## What changes were proposed in this pull request?
Commit 91fa80fe8a broke the build for scala 2.10. The commit uses `Regex.regex` field which is not available in Scala 2.10. This PR fixes this.
## How was this patch tested?
Existing tests.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#17420 from hvanhovell/SPARK-20070-2.0.
## What changes were proposed in this pull request?
Updated the description for the `format_number` description to indicate that it uses `HALF_EVEN` rounding. Updated the description for the `round` description to indicate that it uses `HALF_UP` rounding.
## How was this patch tested?
Just changing the two function comments so no testing involved.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Roxanne Moslehi <rmoslehi@palantir.com>
Author: roxannemoslehi <rmoslehi@berkeley.edu>
Closes#17399 from roxannemoslehi/patch-1.
## What changes were proposed in this pull request?
Constraint propagation can be computation expensive and block the driver execution for long time. For example, the below benchmark needs 30mins.
Compared with previous PRs #16998, #16785, this is a much simpler option: add a flag to disable constraint propagation.
### Benchmark
Run the following codes locally.
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.sql.internal.SQLConf
spark.conf.set(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key, false)
val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))
val indexers = df.columns.tail.map(c => new StringIndexer()
.setInputCol(c)
.setOutputCol(s"${c}_indexed")
.setHandleInvalid("skip"))
val encoders = indexers.map(indexer => new OneHotEncoder()
.setInputCol(indexer.getOutputCol)
.setOutputCol(s"${indexer.getOutputCol}_encoded")
.setDropLast(true))
val stages: Array[PipelineStage] = indexers ++ encoders
val pipeline = new Pipeline().setStages(stages)
val startTime = System.nanoTime
pipeline.fit(df).transform(df).show
val runningTime = System.nanoTime - startTime
Before this patch: 1786001 ms ~= 30 mins
After this patch: 26392 ms = less than half of a minute
Related PRs: #16998, #16785.
## How was this patch tested?
Jenkins tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#17186 from viirya/add-flag-disable-constraint-propagation.
## What changes were proposed in this pull request?
The explain output of `DataSourceScanExec` can contain sensitive information (like Amazon keys). Such information should not end up in logs, or be exposed to non privileged users.
This PR addresses this by adding a redaction facility for the `DataSourceScanExec.treeString`. A user can enable this by setting a regex in the `spark.redaction.string.regex` configuration.
## How was this patch tested?
Added a unit test to check the output of DataSourceScanExec.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes#17397 from hvanhovell/SPARK-20070.
## What changes were proposed in this pull request?
This patch adds a `compressed` method to ML `Matrix` class, which returns the minimal storage representation of the matrix - either sparse or dense. Because the space occupied by a sparse matrix is dependent upon its layout (i.e. column major or row major), this method must consider both cases. It may also be useful to force the layout to be column or row major beforehand, so an overload is added which takes in a `columnMajor: Boolean` parameter.
The compressed implementation relies upon two new abstract methods `toDense(columnMajor: Boolean)` and `toSparse(columnMajor: Boolean)`, similar to the compressed method implemented in the `Vector` class. These methods also allow the layout of the resulting matrix to be specified via the `columnMajor` parameter. More detail on the new methods is given below.
## How was this patch tested?
Added many new unit tests
## New methods (summary, not exhaustive list)
**Matrix trait**
- `private[ml] def toDenseMatrix(columnMajor: Boolean): DenseMatrix` (abstract) - converts the matrix (either sparse or dense) to dense format
- `private[ml] def toSparseMatrix(columnMajor: Boolean): SparseMatrix` (abstract) - converts the matrix (either sparse or dense) to sparse format
- `def toDense: DenseMatrix = toDense(true)` - converts the matrix (either sparse or dense) to dense format in column major layout
- `def toSparse: SparseMatrix = toSparse(true)` - converts the matrix (either sparse or dense) to sparse format in column major layout
- `def compressed: Matrix` - finds the minimum space representation of this matrix, considering both column and row major layouts, and converts it
- `def compressed(columnMajor: Boolean): Matrix` - finds the minimum space representation of this matrix considering only column OR row major, and converts it
**DenseMatrix class**
- `private[ml] def toDenseMatrix(columnMajor: Boolean): DenseMatrix` - converts the dense matrix to a dense matrix, optionally changing the layout (data is NOT duplicated if the layouts are the same)
- `private[ml] def toSparseMatrix(columnMajor: Boolean): SparseMatrix` - converts the dense matrix to sparse matrix, using the specified layout
**SparseMatrix class**
- `private[ml] def toDenseMatrix(columnMajor: Boolean): DenseMatrix` - converts the sparse matrix to a dense matrix, using the specified layout
- `private[ml] def toSparseMatrix(columnMajors: Boolean): SparseMatrix` - converts the sparse matrix to sparse matrix. If the sparse matrix contains any explicit zeros, they are removed. If the layout requested does not match the current layout, data is copied to a new representation. If the layouts match and no explicit zeros exist, the current matrix is returned.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#15628 from sethah/matrix_compress.
## What changes were proposed in this pull request?
- Add new KinesisDStream.scala containing KinesisDStream.Builder class
- Add KinesisDStreamBuilderSuite test suite
- Make KinesisInputDStream ctor args package private for testing
- Add JavaKinesisDStreamBuilderSuite test suite
- Add args to KinesisInputDStream and KinesisReceiver for optional
service-specific auth (Kinesis, DynamoDB and CloudWatch)
## How was this patch tested?
Added ```KinesisDStreamBuilderSuite``` to verify builder class works as expected
Author: Adam Budde <budde@amazon.com>
Closes#17250 from budde/KinesisStreamBuilder.
## What changes were proposed in this pull request?
Fix for typo in Analyzer
## How was this patch tested?
local build
Author: Jacek Laskowski <jacek@japila.pl>
Closes#17409 from jaceklaskowski/analyzer-typo.
Add Python wrapper for `Imputer` feature transformer.
## How was this patch tested?
New doc tests and tweak to PySpark ML `tests.py`
Author: Nick Pentreath <nickp@za.ibm.com>
Closes#17316 from MLnick/SPARK-15040-pyspark-imputer.
### What changes were proposed in this pull request?
This is a follow-up for the PR: https://github.com/apache/spark/pull/17311
- For safety, use `sessionState` to get the user name, instead of calling `SessionState.get()` in the function `toHiveTable`.
- Passing `user names` instead of `conf` when calling `toHiveTable`.
### How was this patch tested?
N/A
Author: Xiao Li <gatorsmile@gmail.com>
Closes#17405 from gatorsmile/user.
This commit adds a killTaskAttempt method to SparkContext, to allow users to
kill tasks so that they can be re-scheduled elsewhere.
This also refactors the task kill path to allow specifying a reason for the task kill. The reason is propagated opaquely through events, and will show up in the UI automatically as `(N killed: $reason)` and `TaskKilled: $reason`. Without this change, there is no way to provide the user feedback through the UI.
Currently used reasons are "stage cancelled", "another attempt succeeded", and "killed via SparkContext.killTask". The user can also specify a custom reason through `SparkContext.killTask`.
cc rxin
In the stage overview UI the reasons are summarized:
![1](https://cloud.githubusercontent.com/assets/14922/23929209/a83b2862-08e1-11e7-8b3e-ae1967bbe2e5.png)
Within the stage UI you can see individual task kill reasons:
![2](https://cloud.githubusercontent.com/assets/14922/23929200/9a798692-08e1-11e7-8697-72b27ad8a287.png)
Existing tests, tried killing some stages in the UI and verified the messages are as expected.
Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekl@google.com>
Closes#17166 from ericl/kill-reason.
## What changes were proposed in this pull request?
1. Use a MedianHeap to record durations of successful tasks. When check speculatable tasks, we can get the median duration with O(1) time complexity.
2. `checkSpeculatableTasks` will synchronize `TaskSchedulerImpl`. If `checkSpeculatableTasks` doesn't finish with 100ms, then the possibility exists for that thread to release and then immediately re-acquire the lock. Change `scheduleAtFixedRate` to be `scheduleWithFixedDelay` when call method of `checkSpeculatableTasks`.
## How was this patch tested?
Added MedianHeapSuite.
Author: jinxing <jinxing6042@126.com>
Closes#16867 from jinxing64/SPARK-16929.
## What changes were proposed in this pull request?
This patch adds the Dataframes-based support for the correlation statistics found in the `org.apache.spark.mllib.stat.correlation.Statistics`, following the design doc discussed in the JIRA ticket.
The current implementation is a simple wrapper around the `spark.mllib` implementation. Future optimizations can be implemented at a later stage.
## How was this patch tested?
```
build/sbt "testOnly org.apache.spark.ml.stat.StatisticsSuite"
```
Author: Timothy Hunter <timhunter@databricks.com>
Closes#17108 from thunterdb/19636.
## What changes were proposed in this pull request?
Fixes break caused by: 746a558de2
## How was this patch tested?
Compiled with `build/sbt -Dscala2.10 sql/compile` locally
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#17403 from brkyvz/onceTrigger2.10.
## What changes were proposed in this pull request?
Currently JDBC data source creates tables in the target database using the default type mapping, and the JDBC dialect mechanism. If users want to specify different database data type for only some of columns, there is no option available. In scenarios where default mapping does not work, users are forced to create tables on the target database before writing. This workaround is probably not acceptable from a usability point of view. This PR is to provide a user-defined type mapping for specific columns.
The solution is to allow users to specify database column data type for the create table as JDBC datasource option(createTableColumnTypes) on write. Data type information can be specified in the same format as table schema DDL format (e.g: `name CHAR(64), comments VARCHAR(1024)`).
All supported target database types can not be specified , the data types has to be valid spark sql data types also. For example user can not specify target database CLOB data type. This will be supported in the follow-up PR.
Example:
```Scala
df.write
.option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)")
.jdbc(url, "TEST.DBCOLTYPETEST", properties)
```
## How was this patch tested?
Added new test cases to the JDBCWriteSuite
Author: sureshthalamati <suresh.thalamati@gmail.com>
Closes#16209 from sureshthalamati/jdbc_custom_dbtype_option_json-spark-10849.
## What changes were proposed in this pull request?
Some `Schedulable` Entities(`Pool` and `TaskSetManager`) variables need refactoring for _immutability_ and _access modifiers_ levels as follows:
- From `var` to `val` (if there is no requirement): This is important to support immutability as much as possible.
- Sample => `Pool`: `weight`, `minShare`, `priority`, `name` and `taskSetSchedulingAlgorithm`.
- Access modifiers: Specially, `var`s access needs to be restricted from other parts of codebase to prevent potential side effects.
- `TaskSetManager`: `tasksSuccessful`, `totalResultSize`, `calculatedTasks` etc...
This PR is related with #15604 and has been created seperatedly to keep patch content as isolated and to help the reviewers.
## How was this patch tested?
Added new UTs and existing UT coverage.
Author: erenavsarogullari <erenavsarogullari@gmail.com>
Closes#16905 from erenavsarogullari/SPARK-19567.
## What changes were proposed in this pull request?
An additional trigger and trigger executor that will execute a single trigger only. One can use this OneTime trigger to have more control over the scheduling of triggers.
In addition, this patch requires an optimization to StreamExecution that logs a commit record at the end of successfully processing a batch. This new commit log will be used to determine the next batch (offsets) to process after a restart, instead of using the offset log itself to determine what batch to process next after restart; using the offset log to determine this would process the previously logged batch, always, thus not permitting a OneTime trigger feature.
## How was this patch tested?
A number of existing tests have been revised. These tests all assumed that when restarting a stream, the last batch in the offset log is to be re-processed. Given that we now have a commit log that will tell us if that last batch was processed successfully, the results/assumptions of those tests needed to be revised accordingly.
In addition, a OneTime trigger test was added to StreamingQuerySuite, which tests:
- The semantics of OneTime trigger (i.e., on start, execute a single batch, then stop).
- The case when the commit log was not able to successfully log the completion of a batch before restart, which would mean that we should fall back to what's in the offset log.
- A OneTime trigger execution that results in an exception being thrown.
marmbrus tdas zsxwing
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Tyson Condie <tcondie@gmail.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#17219 from tcondie/stream-commit.
## What changes were proposed in this pull request?
Fixup typo in comment.
## How was this patch tested?
Don't need.
Author: Ye Yin <eyniy@qq.com>
Closes#17396 from hustcat/fix.
## What changes were proposed in this pull request?
Several javadoc8 breaks have been introduced. This PR proposes fix those instances so that we can build Scala/Java API docs.
```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:6: error: reference not found
[error] * <code>flatMapGroupsWithState</code> operations on {link KeyValueGroupedDataset}.
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:10: error: reference not found
[error] * Both, <code>mapGroupsWithState</code> and <code>flatMapGroupsWithState</code> in {link KeyValueGroupedDataset}
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:51: error: reference not found
[error] * {link GroupStateTimeout.ProcessingTimeTimeout}) or event time (i.e.
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:52: error: reference not found
[error] * {link GroupStateTimeout.EventTimeTimeout}).
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:158: error: reference not found
[error] * Spark SQL types (see {link Encoder} for more details).
[error] ^
[error] .../spark/mllib/target/java/org/apache/spark/ml/fpm/FPGrowthParams.java:26: error: bad use of '>'
[error] * Number of partitions (>=1) used by parallel FP-growth. By default the param is not set, and
[error] ^
[error] .../spark/sql/core/src/main/java/org/apache/spark/api/java/function/FlatMapGroupsWithStateFunction.java:30: error: reference not found
[error] * {link org.apache.spark.sql.KeyValueGroupedDataset#flatMapGroupsWithState(
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:211: error: reference not found
[error] * See {link GroupState} for more details.
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:232: error: reference not found
[error] * See {link GroupState} for more details.
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:254: error: reference not found
[error] * See {link GroupState} for more details.
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:277: error: reference not found
[error] * See {link GroupState} for more details.
[error] ^
[error] .../spark/core/target/java/org/apache/spark/TaskContextImpl.java:10: error: reference not found
[error] * {link TaskMetrics} & {link MetricsSystem} objects are not thread safe.
[error] ^
[error] .../spark/core/target/java/org/apache/spark/TaskContextImpl.java:10: error: reference not found
[error] * {link TaskMetrics} & {link MetricsSystem} objects are not thread safe.
[error] ^
[info] 13 errors
```
```
jekyll 3.3.1 | Error: Unidoc generation failed
```
## How was this patch tested?
Manually via `jekyll build`
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17389 from HyukjinKwon/minor-javadoc8-fix.
## What changes were proposed in this pull request?
This PR proposes to support _not_ trimming the white spaces when writing out. These are `false` by default in CSV reading path but these are `true` by default in CSV writing in univocity parser.
Both `ignoreLeadingWhiteSpace` and `ignoreTrailingWhiteSpace` options are not being used for writing and therefore, we are always trimming the white spaces.
It seems we should provide a way to keep this white spaces easily.
WIth the data below:
```scala
val df = spark.read.csv(Seq("a , b , c").toDS)
df.show()
```
```
+---+----+---+
|_c0| _c1|_c2|
+---+----+---+
| a | b | c|
+---+----+---+
```
**Before**
```scala
df.write.csv("/tmp/text.csv")
spark.read.text("/tmp/text.csv").show()
```
```
+-----+
|value|
+-----+
|a,b,c|
+-----+
```
It seems this can't be worked around via `quoteAll` too.
```scala
df.write.option("quoteAll", true).csv("/tmp/text.csv")
spark.read.text("/tmp/text.csv").show()
```
```
+-----------+
| value|
+-----------+
|"a","b","c"|
+-----------+
```
**After**
```scala
df.write.option("ignoreLeadingWhiteSpace", false).option("ignoreTrailingWhiteSpace", false).csv("/tmp/text.csv")
spark.read.text("/tmp/text.csv").show()
```
```
+----------+
| value|
+----------+
|a , b , c|
+----------+
```
Note that this case is possible in R
```r
> system("cat text.csv")
f1,f2,f3
a , b , c
> df <- read.csv(file="text.csv")
> df
f1 f2 f3
1 a b c
> write.csv(df, file="text1.csv", quote=F, row.names=F)
> system("cat text1.csv")
f1,f2,f3
a , b , c
```
## How was this patch tested?
Unit tests in `CSVSuite` and manual tests for Python.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17310 from HyukjinKwon/SPARK-18579.
## What changes were proposed in this pull request?
Since the state is tied a "group" in the "mapGroupsWithState" operations, its better to call the state "GroupState" instead of a key. This would make it more general if you extends this operation to RelationGroupedDataset and python APIs.
## How was this patch tested?
Existing unit tests.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#17385 from tdas/SPARK-20057.
## What changes were proposed in this pull request?
Currently, when we perform count with timestamp types, it prints the internal representation as the column name as below:
```scala
Seq(new java.sql.Timestamp(1)).toDF("a").groupBy("a").pivot("a").count().show()
```
```
+--------------------+----+
| a|1000|
+--------------------+----+
|1969-12-31 16:00:...| 1|
+--------------------+----+
```
This PR proposes to use external Scala value instead of the internal representation in the column names as below:
```
+--------------------+-----------------------+
| a|1969-12-31 16:00:00.001|
+--------------------+-----------------------+
|1969-12-31 16:00:...| 1|
+--------------------+-----------------------+
```
## How was this patch tested?
Unit test in `DataFramePivotSuite` and manual tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17348 from HyukjinKwon/SPARK-20018.
## What changes were proposed in this pull request?
This PR proposes to make `mode` options in both CSV and JSON to use `cass object` and fix some related comments related previous fix.
Also, this PR modifies some tests related parse modes.
## How was this patch tested?
Modified unit tests in both `CSVSuite.scala` and `JsonSuite.scala`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17377 from HyukjinKwon/SPARK-19949.
## What changes were proposed in this pull request?
During build/sbt publish-local, build breaks due to javadocs errors. This patch fixes those errors.
## How was this patch tested?
Tested by running the sbt build.
Author: Prashant Sharma <prashsh1@in.ibm.com>
Closes#17358 from ScrapCodes/docs-fix.
## What changes were proposed in this pull request?
Add backslash for line continuation in python code.
## How was this patch tested?
Jenkins.
Author: uncleGen <hustyugm@gmail.com>
Author: dylon <hustyugm@gmail.com>
Closes#17352 from uncleGen/python-example-doc.
### What changes were proposed in this pull request?
Currently, `DESC FORMATTED` did not output the table comment, unlike what `DESC EXTENDED` does. This PR is to fix it.
Also correct the following displayed names in `DESC FORMATTED`, for being consistent with `DESC EXTENDED`
- `"Create Time:"` -> `"Created:"`
- `"Last Access Time:"` -> `"Last Access:"`
### How was this patch tested?
Added test cases in `describe.sql`
Author: Xiao Li <gatorsmile@gmail.com>
Closes#17381 from gatorsmile/descFormattedTableComment.
## What changes were proposed in this pull request?
SparkR ```spark.getSparkFiles``` fails when it was called on executors, see details at [SPARK-19925](https://issues.apache.org/jira/browse/SPARK-19925).
## How was this patch tested?
Add unit tests, and verify this fix at standalone and yarn cluster.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#17274 from yanboliang/spark-19925.
## What changes were proposed in this pull request?
Adding event time based timeout. The user sets the timeout timestamp directly using `KeyedState.setTimeoutTimestamp`. The keys times out when the watermark crosses the timeout timestamp.
## How was this patch tested?
Unit tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#17361 from tdas/SPARK-20030.
## What changes were proposed in this pull request?
There is a race condition between calling stop on a streaming query and deleting directories in `withTempDir` that causes test to fail, fixing to do lazy deletion using delete on shutdown JVM hook.
## How was this patch tested?
- Unit test
- repeated 300 runs with no failure
Author: Kunal Khamar <kkhamar@outlook.com>
Closes#17382 from kunalkhamar/partition-bugfix.
## What changes were proposed in this pull request?
This PR proposes to defer throwing the exception within `DataSource`.
Currently, if other datasources fail to infer the schema, it returns `None` and then this is being validated in `DataSource` as below:
```
scala> spark.read.json("emptydir")
org.apache.spark.sql.AnalysisException: Unable to infer schema for JSON. It must be specified manually.;
```
```
scala> spark.read.orc("emptydir")
org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.;
```
```
scala> spark.read.parquet("emptydir")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
```
However, CSV it checks it within the datasource implementation and throws another exception message as below:
```
scala> spark.read.csv("emptydir")
java.lang.IllegalArgumentException: requirement failed: Cannot infer schema from an empty set of files
```
We could remove this duplicated check and validate this in one place in the same way with the same message.
## How was this patch tested?
Unit test in `CSVSuite` and manual test.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#17256 from HyukjinKwon/SPARK-19919.
## What changes were proposed in this pull request?
The description in the comment for array_contains is vague/incomplete (i.e., doesn't mention that it returns `null` if the array is `null`); this PR fixes that.
## How was this patch tested?
No testing, since it merely changes a comment.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Will Manning <lwwmanning@gmail.com>
Closes#17380 from lwwmanning/patch-1.