## What changes were proposed in this pull request?
SPARK-24073 renames DataReaderFactory -> InputPartition and DataReader -> InputPartitionReader. Some classes still reflect the old names, which causes confusion. This patch renames the leftover classes to reflect the new interface and fixes a few comments.
## How was this patch tested?
Existing unit tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Arun Mahadevan <arunm@apache.org>
Closes #21355 from arunmahadevan/SPARK-24308.
## What changes were proposed in this pull request?
This PR adds an option `queryTimeout` for the number of seconds the driver will wait for a Statement object to execute.
## How was this patch tested?
Added tests in `JDBCSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes #21173 from maropu/SPARK-23856.
## What changes were proposed in this pull request?
Hadoop 3 introduces HDFS federation. This means that multiple namespaces are allowed on the same HDFS cluster. In Spark, we need to request delegation tokens for all the namenodes (one per namespace); otherwise, access to any namespace other than the default one (for which we already fetch the delegation token) fails.
The PR adds the automatic discovery of all the namenodes related to all the namespaces available according to the configs in hdfs-site.xml.
## How was this patch tested?
manual tests in dockerized env
Author: Marco Gaido <marcogaido91@gmail.com>
Closes #21216 from mgaido91/SPARK-24149.
The old code relied on a core configuration and extended its
default value with patterns that also redacted entries users
want to see in the app's environment. Instead, add a SQL-specific
option for which options to redact, and apply both the core and
SQL-specific rules when redacting the options in the save command.
This is a little sub-optimal since it adds another config, but it
retains the current default behavior.
While there I also fixed a typo and a couple of minor config API
usage issues in the related redaction option that SQL already had.
Tested with existing unit tests, plus checking the env page on
a shell UI.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #21158 from vanzin/SPARK-23850.
## What changes were proposed in this pull request?
Enabled no-data batches in flatMapGroupsWithState in the following two cases.
- When ProcessingTime timeout is used, then we always run a batch every trigger interval.
- When event-time watermark is defined, then the user may be doing arbitrary logic against the watermark value even if timeouts are not set. In such cases, it's best to run batches whenever the watermark has changed, irrespective of whether timeouts (i.e. event-time timeout) have been explicitly enabled.
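The two cases above boil down to a small decision rule. The sketch below is a plain-Python illustration of that rule, not Spark's implementation; `StateOperatorConfig` and `should_run_no_data_batch` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class StateOperatorConfig:
    # Hypothetical stand-in for the operator's settings (not a Spark class).
    timeout: str              # "none", "processing_time", or "event_time"
    watermark_defined: bool


def should_run_no_data_batch(conf, prev_watermark_ms, new_watermark_ms):
    """Decide whether to run a batch even though no new data arrived."""
    # Case 1: with a processing-time timeout, run a batch every trigger interval.
    if conf.timeout == "processing_time":
        return True
    # Case 2: with a watermark defined, run whenever the watermark has moved,
    # regardless of whether an event-time timeout was explicitly enabled.
    if conf.watermark_defined:
        return new_watermark_ms != prev_watermark_ms
    return False
```

Note that the second case does not look at `timeout` at all, which mirrors the point that user logic may depend on the watermark even when no timeout is set.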
## How was this patch tested?
updated tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #21345 from tdas/SPARK-24159.
## What changes were proposed in this pull request?
Wrap Dataset.reduce with `withNewExecutionId`.
Author: Soham Aurangabadkar <sohama4@gmail.com>
Closes #21316 from sohama4/dataset_reduce_withexecutionid.
## What changes were proposed in this pull request?
cloudpickle 0.4.4 is released - https://github.com/cloudpipe/cloudpickle/releases/tag/v0.4.4
There's no invasive change - the main difference is that we are now able to pickle the root logger, and that fix is pretty isolated.
## How was this patch tested?
Jenkins tests.
Author: hyukjinkwon <gurwls223@apache.org>
Closes #21350 from HyukjinKwon/SPARK-24303.
## What changes were proposed in this pull request?
In HadoopMapReduceCommitProtocol and FileFormatWriter, there are unnecessary settings in the Hadoop configuration; remove them.
Also clean up some code in the SQL module.
## How was this patch tested?
Unit test
Author: Gengliang Wang <gengliang.wang@databricks.com>
Closes #21329 from gengliangwang/codeCleanWrite.
## What changes were proposed in this pull request?
Converting clustering tests to also check code with structured streaming, using the ML testing infrastructure implemented in SPARK-22882.
This PR is a new version of https://github.com/apache/spark/pull/20319
Author: Sandor Murakozi <smurakozi@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #21358 from jkbradley/smurakozi-SPARK-22884.
## What changes were proposed in this pull request?
Have FPGrowth keep track of model training using the Instrumentation class.
## How was this patch tested?
manually
Author: Bago Amirbekian <bago@databricks.com>
Closes #21344 from MrBago/fpgrowth-instr.
## What changes were proposed in this pull request?
Fixes to tuning instrumentation.
## How was this patch tested?
Existing tests.
Author: Bago Amirbekian <bago@databricks.com>
Closes #21340 from MrBago/tunning-instrumentation.
## What changes were proposed in this pull request?
The physical plan of `select colA from t order by colB limit M` is `TakeOrderedAndProject`.
Currently `TakeOrderedAndProject` sorts data in memory, see https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala#L158
We can add a config: if the limit (M) is too big, we sort on disk instead, which resolves the memory issue.
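As a rough illustration of the idea, the sketch below switches between an in-memory top-K and a full sort depending on a threshold. It is plain Python, not the Spark operator: `take_ordered_and_project` and `in_memory_threshold` are hypothetical names, and the `sorted` call merely stands in for a sort that could spill to disk.

```python
import heapq


def take_ordered_and_project(rows, limit, in_memory_threshold=1000):
    """Return colA of the `limit` rows with the smallest colB.

    `rows` is an iterable of (colA, colB) tuples.
    """
    def sort_key(row):
        return row[1]

    if limit < in_memory_threshold:
        # Small limit: keep only `limit` rows in memory at a time.
        top = heapq.nsmallest(limit, rows, key=sort_key)
    else:
        # Large limit: a full sort, standing in for a disk-backed sort.
        top = sorted(rows, key=sort_key)[:limit]
    # Project colA only, as in `select colA ... order by colB limit M`.
    return [row[0] for row in top]


rows = [("a", 3), ("b", 1), ("c", 2), ("d", 0)]
small_limit = take_ordered_and_project(rows, 2, in_memory_threshold=1000)
large_limit = take_ordered_and_project(rows, 2, in_memory_threshold=1)
```

Both paths return the same rows; only the memory profile differs, which is what the proposed config would control.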
## How was this patch tested?
Test added
Author: jinxing <jinxing6042@126.com>
Closes #21252 from jinxing64/SPARK-24193.
## What changes were proposed in this pull request?
The PR adds the function `arrays_overlap`. This function returns `true` if the input arrays contain a non-null common element; if not, it returns `null` if any of the arrays contains a `null` element, `false` otherwise.
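The three-valued result described above can be sketched in plain Python, with `None` standing in for SQL `null`. This only emulates the semantics as stated in this description; it is not the Spark expression, and it assumes both inputs are non-null lists of hashable elements.

```python
def arrays_overlap(left, right):
    """Emulate the described semantics: True, None (null), or False."""
    left_values = {x for x in left if x is not None}
    right_values = {x for x in right if x is not None}
    # A common non-null element always wins.
    if left_values & right_values:
        return True
    # No common element, but a null in either array makes the answer unknown.
    if None in left or None in right:
        return None
    return False
```

For example, `arrays_overlap([1, None], [2])` yields `None` because the null element might have matched `2`, whereas `arrays_overlap([1, None], [1])` yields `True`.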
## How was this patch tested?
added UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes #21028 from mgaido91/SPARK-23922.
## What changes were proposed in this pull request?
According to the discussion in https://github.com/apache/spark/pull/21175 , this PR proposes 2 improvements:
1. add comments to explain why we call `limit` to write out `ByteBuffer` with slices.
2. remove the `try ... finally`
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes #21327 from cloud-fan/minor.
## What changes were proposed in this pull request?
There's a period of time when an accumulator has been garbage collected, but hasn't been removed from AccumulatorContext.originals by ContextCleaner. When an update is received for such an accumulator, it throws an exception and kills the whole job. This can happen when a stage completes, but there are still running tasks from other attempts, speculation etc. Since AccumulatorContext.get() returns an Option, we can just return None in such a case.
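The window described above can be reproduced with weak references in plain Python. This is only an analogy for the behaviour, assuming a registry that holds weak references; `Accumulator` and `AccumulatorRegistry` are hypothetical classes, not the Spark ones.

```python
import gc
import weakref


class Accumulator:
    def __init__(self, acc_id):
        self.id = acc_id
        self.value = 0


class AccumulatorRegistry:
    """Keeps weak references, so entries can outlive their accumulators."""

    def __init__(self):
        self._originals = {}

    def register(self, acc):
        self._originals[acc.id] = weakref.ref(acc)

    def get(self, acc_id):
        ref = self._originals.get(acc_id)
        if ref is None:
            return None
        # ref() is None once the accumulator was collected but the entry
        # has not been cleaned up yet; report "not found" instead of raising.
        return ref()


registry = AccumulatorRegistry()
acc = Accumulator(1)
registry.register(acc)
found_while_alive = registry.get(1) is acc
del acc
gc.collect()
found_after_gc = registry.get(1)
```

A late update for id `1` now sees `None` and can be skipped, rather than failing the whole job.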
## How was this patch tested?
Unit test.
Author: Artem Rudoy <artem.rudoy@gmail.com>
Closes #21114 from artemrd/SPARK-22371.
## What changes were proposed in this pull request?
The PR adds a new collection function, array_repeat. As there already was a function repeat with the same signature, with the only difference being the expected return type (String instead of Array), the new function is called array_repeat to distinguish the two.
The behaviour of the function is based on Presto's.
The function creates an array containing a given element repeated the requested number of times.
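In plain Python the core behaviour looks like the sketch below. It only illustrates the description above; how a missing or negative count is handled here (a null result and an empty array, respectively) is an assumption of this sketch rather than something stated in this PR.

```python
def array_repeat(element, count):
    """Return a list containing `element` repeated `count` times."""
    if count is None:
        # Assumption: a null count yields a null result.
        return None
    # Assumption: a negative count is treated like zero.
    return [element] * max(count, 0)
```

Unlike string `repeat`, the element is kept as a separate array entry each time, so `array_repeat("ab", 2)` gives `["ab", "ab"]` rather than `"abab"`.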
## How was this patch tested?
New unit tests added into:
- CollectionExpressionsSuite
- DataFrameFunctionsSuite
Author: Florent Pépin <florentpepin.92@gmail.com>
Author: Florent Pépin <florent.pepin14@imperial.ac.uk>
Closes #21208 from pepinoflo/SPARK-23925.
## What changes were proposed in this pull request?
### Problem
When we run the _PySpark shell in Yarn client mode_, the specified `--py-files` are not recognised on the _driver side_.
Here are the steps I took to check:
```bash
$ cat /home/spark/tmp.py
def testtest():
    return 1
```
```bash
$ ./bin/pyspark --master yarn --deploy-mode client --py-files /home/spark/tmp.py
```
```python
>>> def test():
...     import tmp
...     return tmp.testtest()
...
>>> spark.range(1).rdd.map(lambda _: test()).collect()  # executor side
[1]
>>> test()  # driver side
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in test
ImportError: No module named tmp
```
### How did it happen?
Unlike Yarn cluster mode and client mode with spark-submit, Yarn client mode with the PySpark shell specifically works as follows:
1. It first runs Python shell via:
3cb82047f2/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java (L158) as pointed out by tgravescs in the JIRA.
2. this triggers shell.py, which submits another application to launch a py4j gateway:
209b9361ac/python/pyspark/java_gateway.py (L45-L60)
3. it runs a Py4J gateway:
3cb82047f2/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (L425)
4. it copies (or downloads) --py-files into local temp directory:
3cb82047f2/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (L365-L376)
and then these files are set up to `spark.submit.pyFiles`
5. Py4J JVM is launched and then the Python paths are set via:
7013eea11c/python/pyspark/context.py (L209-L216)
However, these are not actually set because those files were copied into a tmp directory in step 4, whereas this code path looks in `SparkFiles.getRootDirectory`, where the files are stored only when `SparkContext.addFile()` is called.
In other cluster modes, `spark.files` are set via:
3cb82047f2/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (L554-L555)
and those files are explicitly added via:
ecb8b383af/core/src/main/scala/org/apache/spark/SparkContext.scala (L395)
So we are fine in other modes.
In case of Yarn client and cluster with _submit_, these are handled manually. In particular, https://github.com/apache/spark/pull/6360 added most of the logic. In this case, the Python path appears to be set manually via, for example, `deploy.PythonRunner`. We don't use `spark.files` here.
### How does the PR fix the problem?
I tried to keep the approach as isolated as possible: simply copy .py or .zip files into `SparkFiles.getRootDirectory()` on the driver side if they do not already exist there. Another possible way is to set `spark.files`, but that does unnecessary extra work and seems a bit invasive.
**Before**
```python
>>> def test():
...     import tmp
...     return tmp.testtest()
...
>>> spark.range(1).rdd.map(lambda _: test()).collect()
[1]
>>> test()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in test
ImportError: No module named tmp
```
**After**
```python
>>> def test():
...     import tmp
...     return tmp.testtest()
...
>>> spark.range(1).rdd.map(lambda _: test()).collect()
[1]
>>> test()
1
```
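The copy-if-missing step from the fix can be sketched as a small standalone helper. This is an illustration only, not the code in the PR: `copy_py_files_to_root` is a hypothetical name, and temporary directories stand in for the `--py-files` location and for `SparkFiles.getRootDirectory()`.

```python
import os
import shutil
import tempfile


def copy_py_files_to_root(py_files, root_dir):
    """Copy each file into root_dir unless a file of that name is already there."""
    copied = []
    for path in py_files:
        target = os.path.join(root_dir, os.path.basename(path))
        if os.path.isfile(path) and not os.path.exists(target):
            shutil.copyfile(path, target)
            copied.append(target)
    return copied


# Temporary directories stand in for the real source and root directories.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as root:
    module = os.path.join(src, "tmp.py")
    with open(module, "w") as f:
        f.write("def testtest():\n    return 1\n")
    first_copy = [os.path.basename(p) for p in copy_py_files_to_root([module], root)]
    second_copy = copy_py_files_to_root([module], root)
    present = os.path.exists(os.path.join(root, "tmp.py"))
```

The second call copies nothing, which is the "if not existing" part: files already placed in the root directory by another code path are left untouched.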
## How was this patch tested?
I manually tested in standalone and yarn cluster with PySpark shell. .zip and .py files were also tested with steps similar to those above. It's difficult to add a test.
Author: hyukjinkwon <gurwls223@apache.org>
Closes #21267 from HyukjinKwon/SPARK-21945.
## What changes were proposed in this pull request?
- Add seed parameter for variationalTopicInference
- Add seed for calling variationalTopicInference in submitMiniBatch
- Add var seed in LDAModel so that it can take the seed from LDA and use it for the function call of variationalTopicInference in logLikelihoodBound, topicDistributions, getTopicDistributionMethod, and topicDistribution.
## How was this patch tested?
Check the test result in mllib.clustering.LDASuite to make sure the result is repeatable with the seed.
Author: Lu WANG <lu.wang@databricks.com>
Closes #21183 from ludatabricks/SPARK-22210.
## What changes were proposed in this pull request?
This is a continuation of the larger task of enabling zero-data batches for more eager state cleanup. This PR enables it for stream-stream joins.
## How was this patch tested?
- Updated join tests. Additionally, updated them to not use `CheckLastBatch` anywhere, to set a good precedent for the future.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #21253 from tdas/SPARK-24158.
The repository.apache.org server still requires md5 checksums or
it won't publish the staging repo.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #21338 from vanzin/SPARK-23601.
## What changes were proposed in this pull request?
In #21145, DataReaderFactory is renamed to InputPartition.
This PR revises the wording in the comments to make it clearer.
## How was this patch tested?
None
Author: Gengliang Wang <gengliang.wang@databricks.com>
Closes #21326 from gengliangwang/revise_reader_comments.
## What changes were proposed in this pull request?
See SPARK-23455 for reference. Now default params in ML are saved separately in metadata file in Scala. We must change it for Python for Spark 2.4.0 as well in order to keep them in sync.
## How was this patch tested?
Added test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #21153 from viirya/SPARK-24058.
## What changes were proposed in this pull request?
Add evaluateEachIteration for GBTClassification and GBTRegressionModel
## How was this patch tested?
doctest
Author: Lu WANG <lu.wang@databricks.com>
Closes #21335 from ludatabricks/SPARK-14682.
## What changes were proposed in this pull request?
Support aggregates with exactly 1 partition in continuous processing.
A few small tweaks are needed to make this work:
* Replace currentEpoch tracking with a ThreadLocal. This means that current epoch is scoped to a task rather than a node, but I think that's sustainable even once we add shuffle.
* Add a new testing-only flag to disable the UnsupportedOperationChecker whitelist of allowed continuous processing nodes. I think this is preferable to writing a pile of custom logic to enforce that there is in fact only 1 partition; we plan to support multi-partition aggregates before the next Spark release, so we'd just have to tear that logic back out.
* Restart continuous processing queries from the first available uncommitted epoch, rather than one that's guaranteed to be unused. This is required for stateful operators to overwrite partial state from the previous attempt at the epoch, and there was no specific motivation for the original strategy. In another PR before stabilizing the StreamWriter API, we'll need to narrow down and document more precise semantic guarantees for the epoch IDs.
* We need a single-partition ContinuousMemoryStream. The way MemoryStream is constructed means it can't be a text option like it is for rate source, unfortunately.
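To illustrate the first tweak above, here is a plain-Python analogue of scoping the current epoch to the running task with thread-local storage. It is not the Spark code; `set_current_epoch`, `current_epoch`, and `run_task` are hypothetical names, and each thread stands in for one task.

```python
import threading

# Each thread (standing in for a task) sees its own epoch value.
_epoch = threading.local()


def set_current_epoch(epoch):
    _epoch.value = epoch


def current_epoch():
    # Threads that never set an epoch see None.
    return getattr(_epoch, "value", None)


def run_task(start_epoch, results, name):
    set_current_epoch(start_epoch)
    # Advancing the epoch here does not affect any other task.
    set_current_epoch(current_epoch() + 1)
    results[name] = current_epoch()


results = {}
threads = [
    threading.Thread(target=run_task, args=(epoch, results, f"task-{i}"))
    for i, epoch in enumerate([10, 20])
]
for t in threads:
    t.start()
for t in threads:
    t.join()
main_thread_epoch = current_epoch()
```

Each task ends up with its own incremented epoch, and the main thread, which never set one, still sees `None`.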
## How was this patch tested?
new unit tests
Author: Jose Torres <torres.joseph.f+github@gmail.com>
Closes #21239 from jose-torres/withAggr.
## What changes were proposed in this pull request?
Right now `ArrayWriter`, which is used to output Arrow data for array types, doesn't `clear` or `reset` after each batch, so it produces wrong output.
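The effect of a missing reset can be shown with a toy writer in plain Python. This is a simplified model, not the Arrow-backed `ArrayWriter` in Spark; the class, its `values`/`offsets` fields, and `write_batches` are hypothetical and exist only for this example.

```python
class ArrayWriter:
    """Toy writer that buffers array values and offsets for one batch."""

    def __init__(self):
        self.values = []
        self.offsets = [0]

    def write(self, array):
        self.values.extend(array)
        self.offsets.append(len(self.values))

    def finish(self):
        # Snapshot of what would be emitted for the current batch.
        return list(self.values), list(self.offsets)

    def reset(self):
        self.values.clear()
        self.offsets = [0]


def write_batches(batches, reset_between_batches):
    writer = ArrayWriter()
    out = []
    for batch in batches:
        for array in batch:
            writer.write(array)
        out.append(writer.finish())
        if reset_between_batches:
            writer.reset()
    return out


batches = [[[1, 2], [3]], [[4]]]
buggy = write_batches(batches, reset_between_batches=False)
fixed = write_batches(batches, reset_between_batches=True)
```

Without the reset, the second batch still carries the first batch's values; with it, the second batch contains only `[4]`.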
## How was this patch tested?
Added test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #21312 from viirya/SPARK-24259.
## What changes were proposed in this pull request?
```
~/spark-2.3.0-bin-hadoop2.7$ bin/spark-sql --num-executors 0 --conf spark.dynamicAllocation.enabled=true
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=1024m; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=1024m; support was removed in 8.0
Error: Number of executors must be a positive number
Run with --help for usage help or --verbose for debug output
```
Actually, we could previously start up with a minimum executor number of 0 if dynamic allocation was enabled.
## How was this patch tested?
ut added
Author: Kent Yao <yaooqinn@hotmail.com>
Closes #21290 from yaooqinn/SPARK-24241.
## What changes were proposed in this pull request?
1. Change antlr rule to fix the warning.
2. Add PIVOT/LATERAL check in AstBuilder with a more meaningful error message.
## How was this patch tested?
1. Add a counter case in `PlanParserSuite.test("lateral view")`
Author: maryannxue <maryann.xue@gmail.com>
Closes #21324 from maryannxue/spark-24035-fix.
## What changes were proposed in this pull request?
This PR adds `isEmpty()` to `Dataset`.
## How was this patch tested?
Unit tests added
Author: Goun Na <gounna@gmail.com>
Author: goungoun <gounna@gmail.com>
Closes #20800 from goungoun/SPARK-23627.
## What changes were proposed in this pull request?
change generic to get it to work with googleVis
also fix lintr
## How was this patch tested?
manual test, unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #21315 from felixcheung/googvis.
## What changes were proposed in this pull request?
Add a `withSQLConf(...)` wrapper to force Parquet filter pushdown for a test that relies on it.
## How was this patch tested?
Test passes
Author: Henry Robinson <henry@apache.org>
Closes #21323 from henryr/spark-23582.
## What changes were proposed in this pull request?
Currently, the `from_json` function supports `StructType` or `ArrayType` as the root type. This PR additionally allows `MapType(StringType, DataType)` to be specified as the root type. For example:
```scala
import org.apache.spark.sql.types._
val schema = MapType(StringType, IntegerType)
val in = Seq("""{"a": 1, "b": 2, "c": 3}""").toDS()
in.select(from_json($"value", schema, Map[String, String]())).collect()
```
```
res1: Array[org.apache.spark.sql.Row] = Array([Map(a -> 1, b -> 2, c -> 3)])
```
## How was this patch tested?
It was checked by new tests for the map type with integer type and struct type as value types. Also roundtrip tests like from_json(to_json) and to_json(from_json) for MapType are added.
Author: Maxim Gekk <maxim.gekk@databricks.com>
Author: Maxim Gekk <max.gekk@gmail.com>
Closes #21108 from MaxGekk/from_json-map-type.
## What changes were proposed in this pull request?
Changed the instrumentation for all of the clustering methods.
## How was this patch tested?
N/A
Author: Lu WANG <lu.wang@databricks.com>
Closes #21218 from ludatabricks/SPARK-23686-1.
## What changes were proposed in this pull request?
If there is an exception, it's better to set it as the cause of AnalysisException since the exception may contain useful debug information.
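The same idea, expressed with Python's exception chaining as an analogy: the original error is attached as the cause, so its details survive in the traceback. The `AnalysisException` class and `resolve_relation` function below are hypothetical stand-ins written for this sketch, not Spark's API.

```python
class AnalysisException(Exception):
    """Stand-in for an analysis-time error type."""


def resolve_relation(name, catalog):
    try:
        return catalog[name]
    except KeyError as e:
        # `from e` records the underlying error as __cause__, so the
        # debugging information in the original exception is not lost.
        raise AnalysisException(f"Table or view not found: {name}") from e


caught = None
try:
    resolve_relation("t", {})
except AnalysisException as err:
    caught = err
```

Here `caught.__cause__` is the original `KeyError`, which a caller or log line can inspect alongside the higher-level message.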
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <zsxwing@gmail.com>
Closes #21297 from zsxwing/SPARK-24246.
## What changes were proposed in this pull request?
Change text to grep for.
## How was this patch tested?
manual test
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #21314 from felixcheung/openjdkver.
## What changes were proposed in this pull request?
This PR fixes the following Java lint errors, caused by overlong lines and unused imports:
```
$ dev/lint-java
Using `mvn` from path: /usr/bin/mvn
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/sql/sources/v2/reader/partitioning/Distribution.java:[25] (sizes) LineLength: Line is longer than 100 characters (found 109).
[ERROR] src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/ContinuousReader.java:[38] (sizes) LineLength: Line is longer than 100 characters (found 102).
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[21,8] (imports) UnusedImports: Unused import - java.io.ByteArrayInputStream.
[ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java:[29,8] (imports) UnusedImports: Unused import - org.apache.spark.unsafe.Platform.
[ERROR] src/test/java/test/org/apache/spark/sql/sources/v2/JavaAdvancedDataSourceV2.java:[110] (sizes) LineLength: Line is longer than 100 characters (found 101).
```
With this PR
```
$ dev/lint-java
Using `mvn` from path: /usr/bin/mvn
Checkstyle checks passed.
```
## How was this patch tested?
Existing UTs. Also manually run checkstyles against these two files.
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes #21301 from kiszk/SPARK-24228.
## What changes were proposed in this pull request?
`reverse` and `concat` are already in functions.R as column string functions. Since these two functions are now categorized as collection functions in Scala and Python, we do the same in R.
## How was this patch tested?
Add test in test_sparkSQL.R
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes #21307 from huaxingao/spark_24186.
## What changes were proposed in this pull request?
I think the ‘n_t+t’ in the following doc comment may be wrong; it should be ‘n_t+1’, which is the number of points assigned to the cluster after it finishes mini-batch t+1.
* <blockquote>
* $$
* \begin{align}
* c_t+1 &= [(c_t * n_t * a) + (x_t * m_t)] / [n_t + m_t] \\
* n_t+t &= n_t * a + m_t
* \end{align}
* $$
* </blockquote>
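For reference, the update rule can be written out as a small plain-Python function, using the corrected name `n_next` for n_{t+1}. This follows the formula exactly as quoted in the doc comment and is only an illustration; `update_cluster` is a hypothetical name, and centers are represented as plain lists of floats.

```python
def update_cluster(c_t, n_t, x_t, m_t, a):
    """Apply one mini-batch update to a cluster.

    c_t: current center, n_t: points assigned so far,
    x_t: center of the new batch's points, m_t: points added by the batch,
    a:   decay factor.
    """
    # c_{t+1} = [(c_t * n_t * a) + (x_t * m_t)] / [n_t + m_t]
    c_next = [(c * n_t * a + x * m_t) / (n_t + m_t) for c, x in zip(c_t, x_t)]
    # n_{t+1} = n_t * a + m_t
    n_next = n_t * a + m_t
    return c_next, n_next


c_next, n_next = update_cluster([1.0, 2.0], 10, [3.0, 4.0], 10, 1.0)
```

With `a = 1.0` and equally sized old and new batches, the new center is the midpoint `[2.0, 3.0]` and the count becomes `20.0`, which makes it clear that the second line tracks the count after batch t+1.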
Author: Fan Donglai <ddna_1022@163.com>
Closes #21179 from ddna1021/master.
## What changes were proposed in this pull request?
Updates `functon` to `function`. This was called out in holdenk's PyCon 2018 conference talk. Didn't see any existing PR's for this.
holdenk happy to fix the Pandas.Series bug too but will need a bit more guidance.
Author: Kelley Robinson <krobinson@twilio.com>
Closes #21304 from robinske/master.
The `implicitNotFound` message for `Encoder` doesn't mention the name of
the type for which it can't find an encoder. Furthermore, it covers up
the fact that `Encoder` is the name of the relevant type class.
Hopefully this new message provides a little more specific type detail
while still giving the general message about which types are supported.
## What changes were proposed in this pull request?
Augment the existing message to mention that it's looking for an `Encoder` and what the type of the encoder is.
For example instead of:
```
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
```
return this message:
```
Unable to find encoder for type Exception. An implicit Encoder[Exception] is needed to store Exception instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
```
## How was this patch tested?
It was tested manually in the Scala REPL, since triggering this in a test would cause a compilation error.
```
scala> implicitly[Encoder[Exception]]
<console>:51: error: Unable to find encoder for type Exception. An implicit Encoder[Exception] is needed to store Exception instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
implicitly[Encoder[Exception]]
^
```
Author: Cody Allen <ceedubs@gmail.com>
Closes #20869 from ceedubs/encoder-implicit-msg.
## What changes were proposed in this pull request?
The PR adds the `slice` function to SparkR. The function returns a subset of consecutive elements from the given array.
```
> df <- createDataFrame(cbind(model = rownames(mtcars), mtcars))
> tmp <- mutate(df, v1 = create_array(df$mpg, df$cyl, df$hp))
> head(select(tmp, slice(tmp$v1, 2L, 2L)))
```
```
slice(v1, 2, 2)
1 6, 110
2 6, 110
3 4, 93
4 6, 110
5 8, 175
6 6, 105
```
## How was this patch tested?
A test added into R/pkg/tests/fulltests/test_sparkSQL.R
Author: Marek Novotny <mn.mikke@gmail.com>
Closes #21298 from mn-mikke/SPARK-24198.
## What changes were proposed in this pull request?
This patch removes the various regr_* functions in functions.scala. They are so uncommon that I don't think they deserve real estate in functions.scala. We can consider adding them later if more users need them.
## How was this patch tested?
Removed the associated test case as well.
Author: Reynold Xin <rxin@databricks.com>
Closes #21309 from rxin/SPARK-23907.
This change updates the SystemRequirements and also includes a runtime check if the JVM is being launched by R. The runtime check is done by querying `java -version`
## How was this patch tested?
Tested on a Mac and Windows machine
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #21278 from shivaram/sparkr-skip-solaris.
## What changes were proposed in this pull request?
It's useful to know what relationship between date1 and date2 results in a positive number.
Author: aditkumar <aditkumar@gmail.com>
Author: Adit Kumar <aditkumar@gmail.com>
Closes #20787 from aditkumar/master.
## What changes were proposed in this pull request?
In `PushDownOperatorsToDataSource`, we use `transformUp` to match `PhysicalOperation` and apply pushdown. This is problematic if we have multiple `Filter` and `Project` above the data source v2 relation.
e.g. for a query
```
Project
  Filter
    DataSourceV2Relation
```
The pattern match will be triggered twice and we will do operator pushdown twice. This is unnecessary; we can use `mapChildren` to apply pushdown only once.
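The difference between the two traversals can be seen on a toy plan tree in plain Python. This is a simplified model, not Catalyst: `Relation`, `Filter`, `Project`, `match_operation`, `transform_up`, and `push_down_once` are hypothetical names, and the matcher only recognizes a chain of `Project`/`Filter` nodes over a relation.

```python
from dataclasses import dataclass


@dataclass
class Relation:
    name: str
    pushed: tuple = ()   # operators already pushed into the source


@dataclass
class Filter:
    child: object


@dataclass
class Project:
    child: object


def match_operation(plan):
    """Return (ops, relation) for a Project/Filter chain over a Relation."""
    ops = []
    node = plan
    while isinstance(node, (Project, Filter)):
        ops.append(type(node).__name__)
        node = node.child
    if ops and isinstance(node, Relation):
        return ops, node
    return None


class PushDown:
    def __init__(self):
        self.fired = 0

    def rule(self, plan):
        matched = match_operation(plan)
        if matched is None:
            return plan
        ops, relation = matched
        self.fired += 1
        return Relation(relation.name, relation.pushed + tuple(ops))


def transform_up(plan, rule):
    # Bottom-up: rewrite the child first, then try the rule on this node.
    if isinstance(plan, (Project, Filter)):
        plan = type(plan)(transform_up(plan.child, rule))
    return rule(plan)


def push_down_once(plan, push_down):
    # Top-down: if the whole chain matches here, push down once and stop;
    # otherwise recurse into the child (the mapChildren-style step).
    if match_operation(plan) is not None:
        return push_down.rule(plan)
    if isinstance(plan, (Project, Filter)):
        return type(plan)(push_down_once(plan.child, push_down))
    return plan


query = Project(Filter(Relation("t")))
bottom_up = PushDown()
transform_up(query, bottom_up.rule)
top_down = PushDown()
result = push_down_once(query, top_down)
```

In this toy model the bottom-up traversal fires the rule twice (once for `Filter`, once for `Project`), while the top-down variant matches the whole chain at the root and fires once.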
## How was this patch tested?
existing test
Author: Wenchen Fan <wenchen@databricks.com>
Closes #21230 from cloud-fan/step2.
Instead of always throwing a generic exception when the AM fails,
print a generic error and throw the exception with the YARN
diagnostics containing the reason for the failure.
There was an issue with YARN sometimes providing a generic diagnostic
message, even though the AM provides a failure reason when
unregistering. That was happening because the AM was registering
too late, and if errors happened before the registration, YARN would
just create a generic "ExitCodeException" which wasn't very helpful.
Since most errors in this path are a result of not being able to
connect to the driver, this change modifies the AM registration
a bit so that the AM is registered before the connection to the
driver is established. That way, errors are properly propagated
through YARN back to the driver.
As part of that, I also removed the code that retried connections
to the driver from the client AM. At that point, the driver should
already be up and waiting for connections, so it's unlikely that
retrying would help - and in case it does, that means a flaky
network, which would mean problems would probably show up again.
The effect of that is that connection-related errors are reported
back to the driver much faster now (through the YARN report).
One thing to note is that there seems to be a race on the YARN
side that causes a report to be sent to the client without the
corresponding diagnostics string from the AM; the diagnostics are
available later from the RM web page. For that reason, the generic
error messages are kept in the Spark scheduler code, to help
guide users to a way of debugging their failure.
Also of note is that if YARN's max attempts configuration is lower
than Spark's, Spark will not unregister the AM with a proper
diagnostics message. Unfortunately there seems to be no way to
unregister the AM and still allow further re-attempts to happen.
Testing:
- existing unit tests
- some of our integration tests
- hardcoded an invalid driver address in the code and verified
the error in the shell. e.g.
```
scala> 18/05/04 15:09:34 ERROR cluster.YarnClientSchedulerBackend: YARN application has exited unexpectedly with state FAILED! Check the YARN application logs for more details.
18/05/04 15:09:34 ERROR cluster.YarnClientSchedulerBackend: Diagnostics message: Uncaught exception: org.apache.spark.SparkException: Exception thrown in awaitResult:
<AM stack trace>
Caused by: java.io.IOException: Failed to connect to localhost/127.0.0.1:1234
<More stack trace>
```
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #21243 from vanzin/SPARK-24182.