## What changes were proposed in this pull request?
This PR adds interpreted execution to `CreateExternalRow`.
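A minimal sketch of what the interpreted path can look like (assuming the expression evaluates its children against the input row; details may differ from the merged code):

```scala
// Sketch: evaluate every child expression and wrap the results in an
// external Row, mirroring what the generated code produces.
override def eval(input: InternalRow): Any = {
  val values = children.map(_.eval(input)).toArray
  new GenericRowWithSchema(values, schema)
}
```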
## How was this patch tested?
Added a unit test.
Author: Marco Gaido <marcogaido91@gmail.com>
Closes #20749 from mgaido91/SPARK-23590.
## What changes were proposed in this pull request?
Remove .md5 files from release artifacts
## How was this patch tested?
N/A
Author: Sean Owen <sowen@cloudera.com>
Closes #20737 from srowen/SPARK-23601.
## What changes were proposed in this pull request?
This PR adds interpreted execution for `GetExternalRowField`.
## How was this patch tested?
Added tests in `ObjectExpressionsSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes #20746 from maropu/SPARK-23594.
## What changes were proposed in this pull request?
Parquet 1.9 will change the semantics of `Statistics.isEmpty` slightly
to reflect whether the null value count has been set. That breaks a
timestamp interoperability test that cares only about whether there
are column values present in the statistics of a written file for an
INT96 column. Fix by using `Statistics.hasNonNullValue` instead.
## How was this patch tested?
Unit tests continue to pass against Parquet 1.8, and also pass against
a Parquet build including PARQUET-1217.
Author: Henry Robinson <henry@cloudera.com>
Closes #20740 from henryr/spark-23604.
## What changes were proposed in this pull request?
This PR adds interpreted execution to `WrapOption`.
## How was this patch tested?
Added a unit test.
Author: Marco Gaido <marcogaido91@gmail.com>
Closes #20741 from mgaido91/SPARK-23586_2.
The `__del__` method that explicitly detaches the object was moved from `JavaParams` to the `JavaWrapper` class; this way, model summaries can also be garbage collected on the Java side. A test case was added to make sure that relevant error messages are thrown after the objects are deleted.
I ran PySpark tests against the `pyspark-ml` module:
`./python/run-tests --python-executables=$(which python) --modules=pyspark-ml`
Author: Yogesh Garg <yogesh(dot)garg()databricks(dot)com>
Closes #20724 from yogeshg/java_wrapper_memory.
These options were used to configure the built-in JRE SSL libraries
when downloading files from HTTPS servers. But because they were also
used to set up the now (long) removed internal HTTPS file server,
their default configuration chose convenience over security by having
overly lenient settings.
This change removes the configuration options that affect the JRE SSL
libraries. The JRE trust store can still be configured via system
properties (or globally in the JRE security config). The only lost
functionality is not being able to disable the default hostname
verifier when using spark-submit, which should be fine since Spark
itself is not using https for any internal functionality anymore.
I also removed the HTTP-related code from the REPL class loader, since
we haven't had an HTTP server for REPL-generated classes for a while.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #20723 from vanzin/SPARK-23538.
## What changes were proposed in this pull request?
Before this commit, a non-interruptible iterator was returned if an aggregator or ordering was specified.
This commit also ensures that the sorter is closed even when the task is cancelled (killed) in the middle of sorting.
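A hedged sketch of the approach (simplified from the shuffle read path; details may differ from the merged change):

```scala
// Sketch: register cleanup so the sorter is stopped even if the task is
// killed mid-sort, and wrap the output so it re-checks the kill flag.
val sorter = new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), serializer = dep.serializer)
sorter.insertAll(aggregatedIter)
context.addTaskCompletionListener { _ => sorter.stop() }  // runs on cancellation too
new InterruptibleIterator(context, sorter.iterator)       // was returned un-wrapped before
```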
## How was this patch tested?
Add a unit test in JobCancellationSuite
Author: Xianjin YE <advancedxy@gmail.com>
Closes #20449 from advancedxy/SPARK-23040.
## What changes were proposed in this pull request?
Add an epoch ID argument to DataWriterFactory for use in streaming. As a side effect of passing in this value, DataWriter will now have a consistent lifecycle; commit() or abort() ends the lifecycle of a DataWriter instance in any execution mode.
I considered making a separate streaming interface and adding the epoch ID only to that one, but I think it requires a lot of extra work for no real gain. I think it makes sense to define epoch 0 as the one and only epoch of a non-streaming query.
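Sketched, the factory method gains an epoch argument (the signature below is an approximation of the merged API):

```scala
// Sketch: batch queries always pass epochId = 0; streaming passes the current
// epoch. commit()/abort() on the returned DataWriter ends its lifecycle.
trait DataWriterFactory[T] extends Serializable {
  def createDataWriter(partitionId: Int, attemptNumber: Int, epochId: Long): DataWriter[T]
}
```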
## How was this patch tested?
Existing unit tests.
Author: Jose Torres <jose@databricks.com>
Closes #20710 from jose-torres/api2.
## What changes were proposed in this pull request?
This PR adds interpreted execution to `UnwrapOption`.
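A minimal sketch of the interpreted path (mirroring the codegen semantics; names are illustrative):

```scala
// Sketch: a null or None Option becomes SQL NULL; otherwise unwrap the value.
override def eval(input: InternalRow): Any = {
  val inputObject = child.eval(input)
  if (inputObject == null) null else inputObject.asInstanceOf[Option[_]].orNull
}
```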
## How was this patch tested?
Added a unit test.
Author: Marco Gaido <marcogaido91@gmail.com>
Closes #20736 from mgaido91/SPARK-23586.
## What changes were proposed in this pull request?
Adds Structured Streaming tests for all models/transformers in spark.ml.classification.
## How was this patch tested?
N/A
Author: WeichenXu <weichen.xu@databricks.com>
Closes #20121 from WeichenXu123/ml_stream_test_classification.
## What changes were proposed in this pull request?
Removed the export tag to get rid of unknown-tag warnings.
## How was this patch tested?
Existing tests
Author: Rekha Joshi <rekhajoshm@gmail.com>
Author: rjoshi2 <rekhajoshm@gmail.com>
Closes #20501 from rekhajoshm/SPARK-22430.
## What changes were proposed in this pull request?
Provide more details in the trigonometric function documentation. Referenced `java.lang.Math` for further details in the descriptions.
## How was this patch tested?
Ran full build, checked generated documentation manually
Author: Mihaly Toth <misutoth@gmail.com>
Closes #20618 from misutoth/trigonometric-doc.
## What changes were proposed in this pull request?
The algorithm in `DefaultPartitionCoalescer.setupGroups` is responsible for picking preferred locations for coalesced partitions. It analyzes the preferred locations of input partitions. It starts by trying to create one partition for each unique location in the input. However, if the requested number of coalesced partitions is higher than the number of unique locations, it has to pick duplicate locations.
Previously, the duplicate locations would be picked by iterating over the input partitions in order, and copying their preferred locations to coalesced partitions. If the input partitions were clustered by location, this could result in severe skew.
With the fix, instead of iterating over the list of input partitions in order, we pick them at random. It's not perfectly balanced, but it's much better.
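A simplified sketch of the change (hypothetical helper name; the real logic in `DefaultPartitionCoalescer.setupGroups` also tracks group sizes and locality):

```scala
import scala.util.Random

// Sketch: when more coalesced partitions are requested than there are distinct
// locations, pick the input partitions donating the duplicate preferred
// locations at random instead of in input order.
def pickDonors[P](inputPartitions: Seq[P], needed: Int, rnd: Random): Seq[P] =
  rnd.shuffle(inputPartitions).take(needed)  // was effectively: inputPartitions.take(needed)
```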
## How was this patch tested?
Unit test reproducing the behavior was added.
Author: Ala Luszczak <ala@databricks.com>
Closes #20664 from ala/SPARK-23496.
## What changes were proposed in this pull request?
The current `CodegenContext` class also has immutable values and methods that do not depend on any mutable state.
This refactoring moves them to the `CodeGenerator` object, which can be accessed from anywhere without an instantiated `CodegenContext`.
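The shape of the refactoring, sketched with one state-free helper as an example (the actual set of moved members is larger):

```scala
// Before (sketch): even a pure helper required a CodegenContext instance.
class CodegenContext {
  def isPrimitiveType(jt: String): Boolean =
    Set("boolean", "byte", "short", "int", "long", "float", "double").contains(jt)
}

// After (sketch): the helper lives on the CodeGenerator object and can be
// called from anywhere, with no CodegenContext in scope.
object CodeGenerator {
  def isPrimitiveType(jt: String): Boolean =
    Set("boolean", "byte", "short", "int", "long", "float", "double").contains(jt)
}
```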
## How was this patch tested?
Existing tests
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes #20700 from kiszk/SPARK-23546.
## What changes were proposed in this pull request?
Check the Python version to determine whether to use `inspect.getargspec` or `inspect.getfullargspec` before applying the `pandas_udf` core logic to a function. The former is for Python 2.7 (deprecated in Python 3) and the latter for Python 3.x. The latter correctly accounts for type annotations, which are syntax errors in Python 2.x.
## How was this patch tested?
Tested locally on Python 2.7 and 3.6.
Author: Michael (Stu) Stewart <mstewart141@gmail.com>
Closes #20728 from mstewart141/pandas_udf_fix.
## What changes were proposed in this pull request?
It looks like this was incorrectly copied from `XPathFloat` in the class above.
## How was this patch tested?
Author: Eric Liang <ekhliang@gmail.com>
Closes #20730 from ericl/fix-typo-xpath.
## What changes were proposed in this pull request?
Currently, when the Kafka source reads from Kafka, it generates as many tasks as the number of partitions in the topic(s) to be read. In some cases, it may be beneficial to read the data with greater parallelism, that is, with more partitions/tasks. That means offset ranges must be divided up into smaller ranges such that the number of records in each partition ~= total records in batch / desired partitions. This would also balance out any data skew between topic-partitions.
In this patch, I have added a new option called `minPartitions`, which allows the user to specify the desired level of parallelism.
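Usage sketch (values illustrative):

```scala
// Sketch: ask the Kafka source to plan at least 60 tasks per batch by
// splitting large topic-partition offset ranges into smaller ones.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topic1")
  .option("minPartitions", "60")
  .load()
```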
## How was this patch tested?
New tests in KafkaMicroBatchV2SourceSuite.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #20698 from tdas/SPARK-23541.
## What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/20679 I missed a few places in SQL tests.
For hygiene, they should also use the sessionState interface where possible.
## How was this patch tested?
Modified existing tests.
Author: Juliusz Sompolski <julek@databricks.com>
Closes #20718 from juliuszsompolski/SPARK-23514-followup.
## What changes were proposed in this pull request?
Added subtree pruning in the translation from `LearningNode` to `Node`: a learning node having a single prediction value for all the leaves in the subtree rooted at it is translated into a `LeafNode`, instead of a (redundant) `InternalNode`.
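A sketch of the pruning idea with simplified node types (the real translation also carries impurity stats and the `prune` flag mentioned below):

```scala
// Sketch: collapse a subtree whose leaves all share one prediction into a
// single leaf instead of keeping a redundant internal node.
sealed trait Node { def prediction: Double }
case class Leaf(prediction: Double) extends Node
case class Internal(prediction: Double, left: Node, right: Node) extends Node

def prune(node: Node): Node = node match {
  case Internal(p, l, r) =>
    (prune(l), prune(r)) match {
      case (lf: Leaf, rf: Leaf) if lf.prediction == rf.prediction => lf
      case (pl, pr) => Internal(p, pl, pr)
    }
  case leaf => leaf
}
```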
## How was this patch tested?
Added two unit tests under "mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala":
- test("SPARK-3159 tree model redundancy - classification")
- test("SPARK-3159 tree model redundancy - regression")
Four existing unit tests relying on the tree structure (existence of a specific redundant subtree) had to be adapted, as the tested components in the output tree are now pruned (fixed by adding an extra `prune` parameter which can be used to disable pruning for testing).
Author: Alessandro Solimando <18898964+asolimando@users.noreply.github.com>
Closes #20632 from asolimando/master.
## What changes were proposed in this pull request?
Add Spark 2.3.0 to `HiveExternalCatalogVersionsSuite`, since Spark 2.3.0 has been released, to ensure backward compatibility.
## How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes #20720 from gatorsmile/add2.3.
## What changes were proposed in this pull request?
This PR moves structured streaming text socket source to V2.
Question: do we need to remove the old "socket" source?
## How was this patch tested?
Unit test and manual verification.
Author: jerryshao <sshao@hortonworks.com>
Closes #20382 from jerryshao/SPARK-23097.
## What changes were proposed in this pull request?
https://github.com/apache/spark/pull/18944 added a patch which allowed a Spark session to be created when the Hive metastore server is down. However, it did not allow running any commands with that Spark session. This causes trouble for users who only want to read / write data frames without a metastore setup.
## How was this patch tested?
Added some unit tests to read and write data frames based on the original HiveMetastoreLazyInitializationSuite.
Author: Feng Liu <fengliu@databricks.com>
Closes #20681 from liufengdb/completely-lazy.
## What changes were proposed in this pull request?
Fix a doc link that was changed in 2.3.
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #20711 from felixcheung/rvigmean.
## What changes were proposed in this pull request?
Adds structured streaming tests using testTransformer for these suites:
* BinarizerSuite
* BucketedRandomProjectionLSHSuite
* BucketizerSuite
* ChiSqSelectorSuite
* CountVectorizerSuite
* DCTSuite.scala
* ElementwiseProductSuite
* FeatureHasherSuite
* HashingTFSuite
## How was this patch tested?
It tests itself because it is a bunch of tests!
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #20111 from jkbradley/SPARK-22883-streaming-featureAM.
## What changes were proposed in this pull request?
I ran a SQL query: `select ls.cs_order_number from ls left semi join catalog_sales cs on ls.cs_order_number = cs.cs_order_number`. The `ls` table is a small table with one row. The `catalog_sales` table is a big table with 10 billion rows. The task hangs, and I found many null values of `cs_order_number` in the `catalog_sales` table. I think the null values should be removed in the logical plan.
>== Optimized Logical Plan ==
>Join LeftSemi, (cs_order_number#1 = cs_order_number#22)
>:- Project cs_order_number#1
> : +- Filter isnotnull(cs_order_number#1)
> : +- MetastoreRelation 100t, ls
>+- Project cs_order_number#22
> +- MetastoreRelation 100t, catalog_sales
Now, use this patch, the plan will be:
>== Optimized Logical Plan ==
>Join LeftSemi, (cs_order_number#1 = cs_order_number#22)
>:- Project cs_order_number#1
> : +- Filter isnotnull(cs_order_number#1)
> : +- MetastoreRelation 100t, ls
>+- Project cs_order_number#22
> : **+- Filter isnotnull(cs_order_number#22)**
> :+- MetastoreRelation 100t, catalog_sales
## How was this patch tested?
Author: KaiXinXiaoLei <584620569@qq.com>
Author: hanghang <584620569@qq.com>
Closes #20670 from KaiXinXiaoLei/Spark-23405.
## What changes were proposed in this pull request?
This is based on https://github.com/apache/spark/pull/20668 for supporting Hive 2.2 and Hive 2.3 metastore.
When we merge the PR, we should give the major credit to wangyum
## How was this patch tested?
Added the test cases
Author: Yuming Wang <yumwang@ebay.com>
Author: gatorsmile <gatorsmile@gmail.com>
Closes #20671 from gatorsmile/pr-20668.
## What changes were proposed in this pull request?
When the shuffle dependency specifies aggregation and `dependency.mapSideCombine = false`, there is no need for aggregation or sorting on the map side, so we should be able to use serialized sorting.
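A sketch of the relaxed condition (simplified; the real check lives in `SortShuffleManager` and has more clauses):

```scala
import org.apache.spark.ShuffleDependency

// Sketch: serialized (unsafe) shuffle only has to be ruled out when map-side
// combine is actually requested, not whenever an aggregator is defined.
def canUseSerializedShuffle(dep: ShuffleDependency[_, _, _]): Boolean =
  dep.serializer.supportsRelocationOfSerializedObjects &&
    !dep.mapSideCombine &&                      // was effectively: dep.aggregator.isEmpty
    dep.partitioner.numPartitions <= (1 << 24)  // serialized mode's partition limit
```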
## How was this patch tested?
Existing unit test
Author: liuxian <liu.xian3@zte.com.cn>
Closes #20576 from 10110346/mapsidecombine.
## What changes were proposed in this pull request?
Inside `OptimizeMetadataOnlyQuery.getPartitionAttrs`, avoid using `zip` to generate attribute map.
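The gist, sketched with a simplified attribute type:

```scala
// Sketch: resolve partition columns through an explicit name -> attribute map
// instead of a positional zip of two schemas that may not line up.
case class Attribute(name: String)

def getPartitionAttrs(partitionCols: Seq[String], output: Seq[Attribute]): Seq[Attribute] = {
  val byName = output.map(a => a.name.toLowerCase -> a).toMap
  partitionCols.map { col =>
    byName.getOrElse(col.toLowerCase,
      sys.error(s"Unable to find the partition column `$col` in the relation output"))
  }
}
```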
Also includes other minor updates to comments and formatting.
## How was this patch tested?
Existing test cases.
Author: Xingbo Jiang <xingbo.jiang@databricks.com>
Closes #20693 from jiangxb1987/SPARK-23523.
## What changes were proposed in this pull request?
A few places in `spark-sql` were using `sc.hadoopConfiguration` directly. They should be using `sessionState.newHadoopConf()` to blend in configs that were set through `SQLConf`.
Also, for better UX, for these configs blended in from `SQLConf`, we should consider removing the `spark.hadoop` prefix, so that the settings are recognized whether or not they were specified by the user.
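Usage sketch (inside Spark's own SQL code, where `sessionState` is accessible):

```scala
// Sketch: the session-scoped Hadoop conf folds in settings made through
// SQLConf, unlike the shared SparkContext configuration.
val hadoopConf = spark.sessionState.newHadoopConf()  // not: sc.hadoopConfiguration
val fs = new org.apache.hadoop.fs.Path("/some/path").getFileSystem(hadoopConf)
```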
## How was this patch tested?
Tested that AlterTableRecoverPartitions now correctly recognizes settings that are passed in to the FileSystem through SQLConf.
Author: Juliusz Sompolski <julek@databricks.com>
Closes #20679 from juliuszsompolski/SPARK-23514.
## What changes were proposed in this pull request?
This PR proposes for `pyspark.util._exception_message` to produce the trace from Java side by `Py4JJavaError`.
Currently, in Python 2, it uses the `message` attribute, which `Py4JJavaError` doesn't happen to have:
```python
>>> from pyspark.util import _exception_message
>>> try:
... sc._jvm.java.lang.String(None)
... except Exception as e:
... pass
...
>>> e.message
''
```
Seems we should use `str` instead for now:
aa6c53b590/py4j-python/src/py4j/protocol.py (L412)
but this doesn't address the problem with non-ascii string from Java side -
https://github.com/bartdag/py4j/issues/306
So, we could directly call `__str__()`:
```python
>>> e.__str__()
u'An error occurred while calling None.java.lang.String.\n: java.lang.NullPointerException\n\tat java.lang.String.<init>(String.java:588)\n\tat sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)\n\tat sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)\n\tat sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)\n\tat java.lang.reflect.Constructor.newInstance(Constructor.java:422)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:238)\n\tat py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)\n\tat py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:214)\n\tat java.lang.Thread.run(Thread.java:745)\n'
```
which doesn't coerce unicode to `str` in Python 2.
This can actually be a problem:
```python
from pyspark.sql.functions import udf
spark.conf.set("spark.sql.execution.arrow.enabled", True)
spark.range(1).select(udf(lambda x: [[]])()).toPandas()
```
**Before**
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/sql/dataframe.py", line 2009, in toPandas
raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
RuntimeError:
Note: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this.
```
**After**
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/sql/dataframe.py", line 2009, in toPandas
raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
RuntimeError: An error occurred while calling o47.collectAsArrowToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 1 times, most recent failure: Lost task 7.0 in stage 0.0 (TID 7, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/.../spark/python/pyspark/worker.py", line 245, in main
process()
File "/.../spark/python/pyspark/worker.py", line 240, in process
...
Note: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this.
```
## How was this patch tested?
Manually tested and unit tests were added.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #20680 from HyukjinKwon/SPARK-23517.
## What changes were proposed in this pull request?
`blockManagerIdCache` in `BlockManagerId` never removes old values, which may cause an OOM:
`val blockManagerIdCache = new ConcurrentHashMap[BlockManagerId, BlockManagerId]()`
Whenever we apply for a new `BlockManagerId`, it is put into this map.
This patch uses a Guava cache for `blockManagerIdCache` instead.
A heap dump is shown in [SPARK-23508](https://issues.apache.org/jira/browse/SPARK-23508).
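A sketch of the replacement (the maximum size is illustrative):

```scala
import com.google.common.cache.{CacheBuilder, CacheLoader}

// Sketch: a size-bounded Guava cache evicts old entries, unlike the previous
// unbounded ConcurrentHashMap.
val blockManagerIdCache = CacheBuilder.newBuilder()
  .maximumSize(10000)
  .build(new CacheLoader[BlockManagerId, BlockManagerId] {
    override def load(id: BlockManagerId): BlockManagerId = id
  })

def getCachedBlockManagerId(id: BlockManagerId): BlockManagerId =
  blockManagerIdCache.get(id)  // interns ids, evicting old ones as needed
```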
## How was this patch tested?
Existing tests.
Author: zhoukang <zhoukang199191@gmail.com>
Closes #20667 from caneGuy/zhoukang/fix-history.
## What changes were proposed in this pull request?
Clarify JSON and CSV reader behavior in document.
JSON doesn't support partial results for corrupted records.
CSV only supports partial results for the records with more or less tokens.
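Usage sketch of the documented behavior:

```scala
// Sketch: in PERMISSIVE mode a corrupt JSON record yields null for every data
// column (no partial result) and keeps the raw text in the corrupt-record
// column; CSV can still fill in whichever tokens did parse.
val df = spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json("/path/to/records.json")
```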
## How was this patch tested?
Pass existing tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #20666 from viirya/SPARK-23448-2.
## What changes were proposed in this pull request?
Fix the build instructions supplied by exception messages in python streaming tests.
I also added `-DskipTests` to the Maven instructions to avoid the 170 minutes of Scala tests that occur each time one wants to add a jar to the assembly directory.
## How was this patch tested?
- clone branch
- run build/sbt package
- run python/run-tests --modules "pyspark-streaming", expect error message
- follow instructions in error message, i.e., run build/sbt assembly/package streaming-kafka-0-8-assembly/assembly
- rerun python tests, expect error message
- follow instructions in error message, i.e., run build/sbt -Pflume assembly/package streaming-flume-assembly/assembly
- rerun python tests, see success
- repeat all of the above for the mvn version of the process
Author: Bruce Robbins <bersprockets@gmail.com>
Closes #20638 from bersprockets/SPARK-23417_propa.
As suggested in #20651, the code in `AllStagesPage` is very redundant and modifying it is copy-and-paste work. We should avoid such a pattern, which is error prone, and have a cleaner solution which avoids code redundancy.
Existing unit tests.
Author: Marco Gaido <marcogaido91@gmail.com>
Closes #20663 from mgaido91/SPARK-23475_followup.
The ExecutorAllocationManager should not adjust the target number of
executors when killing idle executors, as it has already adjusted the
target number down based on the task backlog.
The name `replace` was misleading with DynamicAllocation on, as the target number
of executors is changed outside of the call to `killExecutors`, so I adjusted that name. Also separated out the logic of `countFailures` as you don't always want that tied to `replace`.
While I was there I made two changes that weren't directly related to this:
1) Fixed `countFailures` in a couple of cases where it was getting an incorrect value since it used to be tied to `replace`, e.g. when killing executors on a blacklisted node.
2) Hard error if you call `sc.killExecutors` with dynamic allocation on, since that's another way the ExecutorAllocationManager and the CoarseGrainedSchedulerBackend would get out of sync.
Added a unit test case which verifies that the calls to ExecutorAllocationClient do not adjust the number of executors.
Author: Imran Rashid <irashid@cloudera.com>
Closes #20604 from squito/SPARK-23365.
## What changes were proposed in this pull request?
```scala
val tablePath = new File(s"${path.getCanonicalPath}/cOl3=c/cOl1=a/cOl5=e")
Seq(("a", "b", "c", "d", "e")).toDF("cOl1", "cOl2", "cOl3", "cOl4", "cOl5")
.write.json(tablePath.getCanonicalPath)
val df = spark.read.json(path.getCanonicalPath).select("CoL1", "CoL5", "CoL3").distinct()
df.show()
```
It generates a wrong result.
```
[c,e,a]
```
We have a bug in the rule `OptimizeMetadataOnlyQuery`. We should respect the attribute order in the original leaf node. This PR is to fix it.
## How was this patch tested?
Added a test case
Author: gatorsmile <gatorsmile@gmail.com>
Closes #20684 from gatorsmile/optimizeMetadataOnly.
## What changes were proposed in this pull request?
Add a configuration `spark.streaming.kafka.allowNonConsecutiveOffsets` to allow streaming jobs to proceed on compacted topics (or other situations involving gaps between offsets in the log).
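Usage sketch:

```scala
// Sketch: opt in before creating the Kafka direct stream, e.g. when reading
// a compacted topic whose offsets have gaps.
val conf = new org.apache.spark.SparkConf()
  .set("spark.streaming.kafka.allowNonConsecutiveOffsets", "true")
```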
## How was this patch tested?
Added new unit test
justinrmiller has been testing this branch in production for a few weeks
Author: cody koeninger <cody@koeninger.org>
Closes #20572 from koeninger/SPARK-17147.
## What changes were proposed in this pull request?
This PR avoids version conflicts of `commons-net` by upgrading commons-net from 2.2 to 3.1. We are seeing the following message during the build using sbt.
```
[warn] Found version conflict(s) in library dependencies; some are suspected to be binary incompatible:
...
[warn] * commons-net:commons-net:3.1 is selected over 2.2
[warn] +- org.apache.hadoop:hadoop-common:2.6.5 (depends on 3.1)
[warn] +- org.apache.spark:spark-core_2.11:2.4.0-SNAPSHOT (depends on 2.2)
[warn]
```
[Here](https://commons.apache.org/proper/commons-net/changes-report.html) is a release history.
[Here](https://commons.apache.org/proper/commons-net/migration.html) is a migration guide from 2.x to 3.0.
## How was this patch tested?
Existing tests
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes #20672 from kiszk/SPARK-23509.
## What changes were proposed in this pull request?
Refactor ColumnStat to be more flexible.
* Split `ColumnStat` and `CatalogColumnStat` just like `CatalogStatistics` is split from `Statistics`. This detaches how the statistics are stored from how they are processed in the query plan. `CatalogColumnStat` keeps `min` and `max` as `String`, making it not depend on dataType information.
* For `CatalogColumnStat`, parse column names from property names in the metastore (`KEY_VERSION` property), not from metastore schema. This means that `CatalogColumnStat`s can be created for columns even if the schema itself is not stored in the metastore.
* Make all fields optional. `min`, `max` and `histogram` for columns were optional already. Having them all optional is more consistent, and gives flexibility to e.g. drop some of the fields through transformations if they are difficult / impossible to calculate.
The added flexibility will make it possible to have alternative implementations for stats, and separates stats collection from stats and estimation processing in plans.
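The storage-side holder then looks roughly like this (the field set is an assumption based on the description; `Histogram` stands for Spark's histogram type):

```scala
// Sketch: every field optional; min/max kept as String so the stored form
// does not depend on the column's dataType.
case class CatalogColumnStat(
    distinctCount: Option[BigInt] = None,
    min: Option[String] = None,
    max: Option[String] = None,
    nullCount: Option[BigInt] = None,
    avgLen: Option[Long] = None,
    maxLen: Option[Long] = None,
    histogram: Option[Histogram] = None)
```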
## How was this patch tested?
Refactored existing tests to work with refactored `ColumnStat` and `CatalogColumnStat`.
New tests added in `StatisticsSuite` checking that backwards / forwards compatibility is not broken.
Author: Juliusz Sompolski <julek@databricks.com>
Closes #20624 from juliuszsompolski/SPARK-23445.
## What changes were proposed in this pull request?
Remove `queryExecutionThread.interrupt()` from `ContinuousExecution`. As detailed in the JIRA, interrupting the thread is only relevant in the microbatch case; for continuous processing the query execution can quickly clean itself up without it.
## How was this patch tested?
Existing tests.
Author: Jose Torres <jose@databricks.com>
Closes #20622 from jose-torres/SPARK-23441.
For some JVM options, like `-XX:+UnlockExperimentalVMOptions`, ordering is necessary.
## What changes were proposed in this pull request?
Keep original `extraJavaOptions` ordering, when passing them through environment variables inside the Docker container.
## How was this patch tested?
Ran the base branch a couple of times and checked the startup command in the logs; the ordering differed every time. With sorting added, the ordering was consistent with what the user had in `extraJavaOptions`.
Author: Andrew Korzhuev <korzhuev@andrusha.me>
Closes #20628 from andrusha/patch-2.
## What changes were proposed in this pull request?
There is a race condition introduced in SPARK-11141 which could cause data loss.
The problem is that the `ReceivedBlockTracker.insertAllocatedBatch` function assumes that all blocks from `streamIdToUnallocatedBlockQueues` are allocated to the batch, and clears the queue.
In this PR, only the allocated blocks are removed from the queue, which prevents the data loss.
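The core of the fix, sketched with simplified types:

```scala
import scala.collection.mutable

// Sketch: dequeue only the blocks that were actually allocated to this batch,
// leaving blocks that arrived in the meantime in the queue.
def insertAllocatedBatch[B](queue: mutable.Queue[B], allocatedToBatch: Seq[B]): Unit = {
  val allocated = allocatedToBatch.toSet
  queue.dequeueAll(allocated.contains)  // was: queue.clear()
}
```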
## How was this patch tested?
Additional unit test + manually.
Author: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Closes #20620 from gaborgsomogyi/SPARK-23438.
## What changes were proposed in this pull request?
Converting spark.ml.recommendation tests to also check code with structured streaming, using the ML testing infrastructure implemented in SPARK-22882.
## How was this patch tested?
Automated: Pass the Jenkins.
Author: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Closes #20362 from gaborgsomogyi/SPARK-22886.