## What changes were proposed in this pull request?
Some users depend on source compatibility with the org.apache.spark.sql.execution.streaming.Offset class. Although this is not a stable interface, we can keep it in place for now to simplify upgrades to 2.3.
Author: Jose Torres <jose@databricks.com>
Closes#20012 from joseph-torres/binary-compat.
Still look at the old one in case any Spark user is setting it
explicitly, though.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#19983 from vanzin/SPARK-22788.
## What changes were proposed in this pull request?
Like `Parquet`, users can use `ORC` with Apache Spark structured streaming. This PR adds `orc()` to `DataStreamReader`(Scala/Python) in order to support creating streaming dataset with ORC file format more easily like the other file formats. Also, this adds a test coverage for ORC data source and updates the document.
**BEFORE**
```scala
scala> spark.readStream.schema("a int").orc("/tmp/orc_ss").writeStream.format("console").start()
<console>:24: error: value orc is not a member of org.apache.spark.sql.streaming.DataStreamReader
spark.readStream.schema("a int").orc("/tmp/orc_ss").writeStream.format("console").start()
```
**AFTER**
```scala
scala> spark.readStream.schema("a int").orc("/tmp/orc_ss").writeStream.format("console").start()
res0: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper678b3746
scala>
-------------------------------------------
Batch: 0
-------------------------------------------
+---+
| a|
+---+
| 1|
+---+
```
## How was this patch tested?
Pass the newly added test cases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#19975 from dongjoon-hyun/SPARK-22781.
## What changes were proposed in this pull request?
This change adds local checkpoint support to datasets and respective bind from Python Dataframe API.
If reliability requirements can be lowered to favor performance, as in cases of further quick transformations followed by a reliable save, localCheckpoints() fit very well.
Furthermore, at the moment Reliable checkpoints still incur double computation (see #9428)
In general it makes the API more complete as well.
## How was this patch tested?
Python land quick use case:
```python
>>> from time import sleep
>>> from pyspark.sql import types as T
>>> from pyspark.sql import functions as F
>>> def f(x):
sleep(1)
return x*2
...:
>>> df1 = spark.range(30, numPartitions=6)
>>> df2 = df1.select(F.udf(f, T.LongType())("id"))
>>> %time _ = df2.collect()
CPU times: user 7.79 ms, sys: 5.84 ms, total: 13.6 ms
Wall time: 12.2 s
>>> %time df3 = df2.localCheckpoint()
CPU times: user 2.38 ms, sys: 2.3 ms, total: 4.68 ms
Wall time: 10.3 s
>>> %time _ = df3.collect()
CPU times: user 5.09 ms, sys: 410 µs, total: 5.5 ms
Wall time: 148 ms
>>> sc.setCheckpointDir(".")
>>> %time df3 = df2.checkpoint()
CPU times: user 4.04 ms, sys: 1.63 ms, total: 5.67 ms
Wall time: 20.3 s
```
Author: Fernando Pereira <fernando.pereira@epfl.ch>
Closes#19805 from ferdonline/feature_dataset_localCheckpoint.
## What changes were proposed in this pull request?
Currently, the task memory manager throws an OutofMemory error when there is an IO exception happens in spill() - https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L194. Similarly there any many other places in code when if a task is not able to acquire memory due to an exception we throw an OutofMemory error which kills the entire executor and hence failing all the tasks that are running on that executor instead of just failing one single task.
## How was this patch tested?
Unit tests
Author: Sital Kedia <skedia@fb.com>
Closes#20014 from sitalkedia/skedia/upstream_SPARK-22827.
## What changes were proposed in this pull request?
Test Coverage for `WidenSetOperationTypes`, `BooleanEquality`, `StackCoercion` and `Division`, this is a Sub-tasks for [SPARK-22722](https://issues.apache.org/jira/browse/SPARK-22722).
## How was this patch tested?
N/A
Author: Yuming Wang <wgyumg@gmail.com>
Closes#20006 from wangyum/SPARK-22821.
## What changes were proposed in this pull request?
The optimizer rule `InferFiltersFromConstraints` could trigger our batch `Operator Optimizations` exceeds the max iteration limit (i.e., 100) so that the final plan might not be properly optimized. The rule `InferFiltersFromConstraints` could conflict with the other Filter/Join predicate reduction rules. Thus, we need to separate `InferFiltersFromConstraints` from the other rules.
This PR is to separate `InferFiltersFromConstraints ` from the main batch `Operator Optimizations` .
## How was this patch tested?
The existing test cases.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#19149 from gatorsmile/inferFilterRule.
## What changes were proposed in this pull request?
This PR is follow-on of #19518. This PR tries to reduce the number of constant pool entries used for accessing mutable state.
There are two directions:
1. Primitive type variables should be allocated at the outer class due to better performance. Otherwise, this PR allocates an array.
2. The length of allocated array is up to 32768 due to avoiding usage of constant pool entry at access (e.g. `mutableStateArray[32767]`).
Here are some discussions to determine these directions.
1. [[1]](https://github.com/apache/spark/pull/19518#issuecomment-346690464), [[2]](https://github.com/apache/spark/pull/19518#issuecomment-346690642), [[3]](https://github.com/apache/spark/pull/19518#issuecomment-346828180), [[4]](https://github.com/apache/spark/pull/19518#issuecomment-346831544), [[5]](https://github.com/apache/spark/pull/19518#issuecomment-346857340)
2. [[6]](https://github.com/apache/spark/pull/19518#issuecomment-346729172), [[7]](https://github.com/apache/spark/pull/19518#issuecomment-346798358), [[8]](https://github.com/apache/spark/pull/19518#issuecomment-346870408)
This PR modifies `addMutableState` function in the `CodeGenerator` to check if the declared state can be easily initialized compacted into an array. We identify three types of states that cannot compacted:
- Primitive type state (ints, booleans, etc) if the number of them does not exceed threshold
- Multiple-dimensional array type
- `inline = true`
When `useFreshName = false`, the given name is used.
Many codes were ported from #19518. Many efforts were put here. I think this PR should credit to bdrillard
With this PR, the following code is generated:
```
/* 005 */ class SpecificMutableProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
/* 006 */
/* 007 */ private Object[] references;
/* 008 */ private InternalRow mutableRow;
/* 009 */ private boolean isNull_0;
/* 010 */ private boolean isNull_1;
/* 011 */ private boolean isNull_2;
/* 012 */ private int value_2;
/* 013 */ private boolean isNull_3;
...
/* 10006 */ private int value_4999;
/* 10007 */ private boolean isNull_5000;
/* 10008 */ private int value_5000;
/* 10009 */ private InternalRow[] mutableStateArray = new InternalRow[2];
/* 10010 */ private boolean[] mutableStateArray1 = new boolean[7001];
/* 10011 */ private int[] mutableStateArray2 = new int[1001];
/* 10012 */ private UTF8String[] mutableStateArray3 = new UTF8String[6000];
/* 10013 */
...
/* 107956 */ private void init_176() {
/* 107957 */ isNull_4986 = true;
/* 107958 */ value_4986 = -1;
...
/* 108004 */ }
...
```
## How was this patch tested?
Added a new test case to `GeneratedProjectionSuite`
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#19811 from kiszk/SPARK-18016.
## What changes were proposed in this pull request?
We could get incorrect results by running DecimalPrecision twice. This PR resolves the original found in https://github.com/apache/spark/pull/15048 and https://github.com/apache/spark/pull/14797. After this PR, it becomes easier to change it back using `children` instead of using `innerChildren`.
## How was this patch tested?
The existing test.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#20000 from gatorsmile/keepPromotePrecision.
## What changes were proposed in this pull request?
When calling explain on a query, the output can contain sensitive information. We should provide an admin/user to redact such information.
Before this PR, the plan of SS is like this
```
== Physical Plan ==
*HashAggregate(keys=[value#6], functions=[count(1)], output=[value#6, count(1)#12L])
+- StateStoreSave [value#6], state info [ checkpoint = file:/private/var/folders/vx/j0ydl5rn0gd9mgrh1pljnw900000gn/T/temporary-91c6fac0-609f-4bc8-ad57-52c189f06797/state, runId = 05a4b3af-f02c-40f8-9ff9-a3e18bae496f, opId = 0, ver = 0, numPartitions = 5], Complete, 0
+- *HashAggregate(keys=[value#6], functions=[merge_count(1)], output=[value#6, count#18L])
+- StateStoreRestore [value#6], state info [ checkpoint = file:/private/var/folders/vx/j0ydl5rn0gd9mgrh1pljnw900000gn/T/temporary-91c6fac0-609f-4bc8-ad57-52c189f06797/state, runId = 05a4b3af-f02c-40f8-9ff9-a3e18bae496f, opId = 0, ver = 0, numPartitions = 5]
+- *HashAggregate(keys=[value#6], functions=[merge_count(1)], output=[value#6, count#18L])
+- Exchange hashpartitioning(value#6, 5)
+- *HashAggregate(keys=[value#6], functions=[partial_count(1)], output=[value#6, count#18L])
+- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#6]
+- *MapElements <function1>, obj#5: java.lang.String
+- *DeserializeToObject value#30.toString, obj#4: java.lang.String
+- LocalTableScan [value#30]
```
After this PR, we can get the following output if users set `spark.redaction.string.regex` to `file:/[\\w_]+`
```
== Physical Plan ==
*HashAggregate(keys=[value#6], functions=[count(1)], output=[value#6, count(1)#12L])
+- StateStoreSave [value#6], state info [ checkpoint = *********(redacted)/var/folders/vx/j0ydl5rn0gd9mgrh1pljnw900000gn/T/temporary-e7da9b7d-3ec0-474d-8b8c-927f7d12ed72/state, runId = 8a9c3761-93d5-4896-ab82-14c06240dcea, opId = 0, ver = 0, numPartitions = 5], Complete, 0
+- *HashAggregate(keys=[value#6], functions=[merge_count(1)], output=[value#6, count#32L])
+- StateStoreRestore [value#6], state info [ checkpoint = *********(redacted)/var/folders/vx/j0ydl5rn0gd9mgrh1pljnw900000gn/T/temporary-e7da9b7d-3ec0-474d-8b8c-927f7d12ed72/state, runId = 8a9c3761-93d5-4896-ab82-14c06240dcea, opId = 0, ver = 0, numPartitions = 5]
+- *HashAggregate(keys=[value#6], functions=[merge_count(1)], output=[value#6, count#32L])
+- Exchange hashpartitioning(value#6, 5)
+- *HashAggregate(keys=[value#6], functions=[partial_count(1)], output=[value#6, count#32L])
+- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#6]
+- *MapElements <function1>, obj#5: java.lang.String
+- *DeserializeToObject value#27.toString, obj#4: java.lang.String
+- LocalTableScan [value#27]
```
## How was this patch tested?
Added a test case
Author: gatorsmile <gatorsmile@gmail.com>
Closes#19985 from gatorsmile/redactPlan.
## What changes were proposed in this pull request?
Equi-height histogram is one of the state-of-the-art statistics for cardinality estimation, which can provide better estimation accuracy, and good at cases with skew data.
This PR is to improve join estimation based on equi-height histogram. The difference from basic estimation (based on ndv) is the logic for computing join cardinality and the new ndv after join.
The main idea is as follows:
1. find overlapped ranges between two histograms from two join keys;
2. apply the formula `T(A IJ B) = T(A) * T(B) / max(V(A.k1), V(B.k1))` in each overlapped range.
## How was this patch tested?
Added new test cases.
Author: Zhenhua Wang <wangzhenhua@huawei.com>
Closes#19594 from wzhfy/join_estimation_histogram.
## What changes were proposed in this pull request?
The current implementation of InMemoryRelation always uses the most expensive execution plan when writing cache
With CBO enabled, we can actually have a more exact estimation of the underlying table size...
## How was this patch tested?
existing test
Author: CodingCat <zhunansjtu@gmail.com>
Author: Nan Zhu <CodingCat@users.noreply.github.com>
Author: Nan Zhu <nanzhu@uber.com>
Closes#19864 from CodingCat/SPARK-22673.
## What changes were proposed in this pull request?
Remove useless `zipWithIndex` from `ResolveAliases `.
## How was this patch tested?
The existing tests
Author: gatorsmile <gatorsmile@gmail.com>
Closes#20009 from gatorsmile/try22.
# What changes were proposed in this pull request?
1. entrypoint.sh for Kubernetes spark-base image is marked as executable (644 -> 755)
2. make-distribution script will now create kubernetes/dockerfiles directory when Kubernetes support is compiled.
## How was this patch tested?
Manual testing
cc/ ueshin jiangxb1987 mridulm vanzin rxin liyinan926
Author: foxish <ramanathana@google.com>
Closes#20007 from foxish/fix-dockerfiles.
## What changes were proposed in this pull request?
In [the environment where `/usr/sbin/lsof` does not exist](https://github.com/apache/spark/pull/19695#issuecomment-342865001), `./dev/run-tests.py` for `maven` causes the following error. This is because the current `./dev/run-tests.py` checks existence of only `/usr/sbin/lsof` and aborts immediately if it does not exist.
This PR changes to check whether `lsof` or `/usr/sbin/lsof` exists.
```
/bin/sh: 1: /usr/sbin/lsof: not found
Usage:
kill [options] <pid> [...]
Options:
<pid> [...] send signal to every <pid> listed
-<signal>, -s, --signal <signal>
specify the <signal> to be sent
-l, --list=[<signal>] list all signal names, or convert one to a name
-L, --table list all signal names in a nice table
-h, --help display this help and exit
-V, --version output version information and exit
For more details see kill(1).
Traceback (most recent call last):
File "./dev/run-tests.py", line 626, in <module>
main()
File "./dev/run-tests.py", line 597, in main
build_apache_spark(build_tool, hadoop_version)
File "./dev/run-tests.py", line 389, in build_apache_spark
build_spark_maven(hadoop_version)
File "./dev/run-tests.py", line 329, in build_spark_maven
exec_maven(profiles_and_goals)
File "./dev/run-tests.py", line 270, in exec_maven
kill_zinc_on_port(zinc_port)
File "./dev/run-tests.py", line 258, in kill_zinc_on_port
subprocess.check_call(cmd, shell=True)
File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/usr/sbin/lsof -P |grep 3156 | grep LISTEN | awk '{ print $2; }' | xargs kill' returned non-zero exit status 123
```
## How was this patch tested?
manually tested
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#19998 from kiszk/SPARK-22813.
This change restores the functionality that keeps a limited number of
different types (jobs, stages, etc) depending on configuration, to avoid
the store growing indefinitely over time.
The feature is implemented by creating a new type (ElementTrackingStore)
that wraps a KVStore and allows triggers to be set up for when elements
of a certain type meet a certain threshold. Triggers don't need to
necessarily only delete elements, but the current API is set up in a way
that makes that use case easier.
The new store also has a trigger for the "close" call, which makes it
easier for listeners to register code for cleaning things up and flushing
partial state to the store.
The old configurations for cleaning up the stored elements from the core
and SQL UIs are now active again, and the old unit tests are re-enabled.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#19751 from vanzin/SPARK-20653.
## What changes were proposed in this pull request?
Changes discussed in https://github.com/apache/spark/pull/19946#discussion_r157063535
docker -> container, since with CRI, we are not limited to running only docker images.
## How was this patch tested?
Manual testing
Author: foxish <ramanathana@google.com>
Closes#19995 from foxish/make-docker-container.
## What changes were proposed in this pull request?
Test Coverage for `PromoteStrings` and `InConversion`, this is a Sub-tasks for [SPARK-22722](https://issues.apache.org/jira/browse/SPARK-22722).
## How was this patch tested?
N/A
Author: Yuming Wang <wgyumg@gmail.com>
Closes#20001 from wangyum/SPARK-22816.
## What changes were proposed in this pull request?
Easy fix in the link.
## How was this patch tested?
Tested manually
Author: Mahmut CAVDAR <mahmutcvdr@gmail.com>
Closes#19996 from mcavdar/master.
## What changes were proposed in this pull request?
`testthat` 2.0.0 is released and AppVeyor now started to use it instead of 1.0.2. And then, we started to have R tests failed in AppVeyor. See - https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/1967-master
```
Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
object 'run_tests' not found
Calls: ::: -> get
```
This seems because we rely on internal `testthat:::run_tests` here:
https://github.com/r-lib/testthat/blob/v1.0.2/R/test-package.R#L62-L75dc4c351837/R/pkg/tests/run-all.R (L49-L52)
However, seems it was removed out from 2.0.0. I tried few other exposed APIs like `test_dir` but I failed to make a good compatible fix.
Seems we better fix the `testthat` version first to make the build passed.
## How was this patch tested?
Manually tested and AppVeyor tests.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes#20003 from HyukjinKwon/SPARK-22817.
## What changes were proposed in this pull request?
pyspark.ml.tests is missing a py4j import. I've added the import and fixed the test that uses it. This test was only failing when testing without Hive.
## How was this patch tested?
Existing tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Bago Amirbekian <bago@databricks.com>
Closes#19997 from MrBago/fix-ImageReaderTest2.
## What changes were proposed in this pull request?
Basic tests for IfCoercion and CaseWhenCoercion
## How was this patch tested?
N/A
Author: Yuming Wang <wgyumg@gmail.com>
Closes#19949 from wangyum/SPARK-22762.
## What changes were proposed in this pull request?
Add a test suite to ensure all the [SSB (Star Schema Benchmark)](https://www.cs.umb.edu/~poneil/StarSchemaB.PDF) queries can be successfully analyzed, optimized and compiled without hitting the max iteration threshold.
## How was this patch tested?
Added `SSBQuerySuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#19990 from maropu/SPARK-22800.
## What changes were proposed in this pull request?
As the discussion in https://github.com/apache/spark/pull/16481 and https://github.com/apache/spark/pull/18975#discussion_r155454606
Currently the BaseRelation returned by `dataSource.writeAndRead` only used in `CreateDataSourceTableAsSelect`, planForWriting and writeAndRead has some common code paths.
In this patch I removed the writeAndRead function and added the getRelation function which only use in `CreateDataSourceTableAsSelectCommand` while saving data to non-existing table.
## How was this patch tested?
Existing UT
Author: Yuanjian Li <xyliyuanjian@gmail.com>
Closes#19941 from xuanyuanking/SPARK-22753.
## What changes were proposed in this pull request?
Add a test suite to ensure all the TPC-H queries can be successfully analyzed, optimized and compiled without hitting the max iteration threshold.
## How was this patch tested?
N/A
Author: gatorsmile <gatorsmile@gmail.com>
Closes#19982 from gatorsmile/testTPCH.
## What changes were proposed in this pull request?
since hive 2.0+ upgrades log4j to log4j2,a lot of [changes](https://issues.apache.org/jira/browse/HIVE-11304) are made working on it.
as spark is not to ready to update its inner hive version(1.2.1) , so I manage to make little changes.
the function registerCurrentOperationLog is moved from SQLOperstion to its parent class ExecuteStatementOperation so spark can use it.
## How was this patch tested?
manual test
Closes#19721 from ChenjunZou/operation-log.
Author: zouchenjun <zouchenjun@youzan.com>
Closes#19961 from ChenjunZou/spark-22496.
## What changes were proposed in this pull request?
StreamExecution is now an abstract base class, which MicroBatchExecution (the current StreamExecution) inherits. When continuous processing is implemented, we'll have a new ContinuousExecution implementation of StreamExecution.
A few fields are also renamed to make them less microbatch-specific.
## How was this patch tested?
refactoring only
Author: Jose Torres <jose@databricks.com>
Closes#19926 from joseph-torres/continuous-refactor.
## What changes were proposed in this pull request?
This PR added the missing service metadata for `KubernetesClusterManager`. Without the metadata, the service loader couldn't load `KubernetesClusterManager`, and caused the driver to fail to create a `ExternalClusterManager`, as being reported in SPARK-22778. The PR also changed the `k8s:` prefix used to `k8s://`, which is what existing Spark on k8s users are familiar and used to.
## How was this patch tested?
Manual testing verified that the fix resolved the issue in SPARK-22778.
/cc vanzin felixcheung jiangxb1987
Author: Yinan Li <liyinan926@gmail.com>
Closes#19972 from liyinan926/fix-22778.
## What changes were proposed in this pull request?
Missing some changes about limit in TaskSetManager.scala
## How was this patch tested?
running tests
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: kellyzly <kellyzly@126.com>
Closes#19976 from kellyzly/SPARK-22660.2.
## What changes were proposed in this pull request?
In multiple text analysis problems, it is not often desirable for the rows to be split by "\n". There exists a wholeText reader for RDD API, and this JIRA just adds the same support for Dataset API.
## How was this patch tested?
Added relevant new tests for both scala and Java APIs
Author: Prashant Sharma <prashsh1@in.ibm.com>
Author: Prashant Sharma <prashant@apache.org>
Closes#14151 from ScrapCodes/SPARK-16496/wholetext.
## What changes were proposed in this pull request?
This PR adds check whether Java code generated by Catalyst can be compiled by `janino` correctly or not into `TPCDSQuerySuite`. Before this PR, this suite only checks whether analysis can be performed correctly or not.
This check will be able to avoid unexpected performance degrade by interpreter execution due to a Java compilation error.
## How was this patch tested?
Existing a test case, but updated it.
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#19971 from kiszk/SPARK-22774.
## What changes were proposed in this pull request?
`ColumnVector.anyNullsSet` is not called anywhere except tests, and we can easily replace it with `ColumnVector.numNulls > 0`
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#19980 from cloud-fan/minor.
## What changes were proposed in this pull request?
These dictionary related APIs are special to `WritableColumnVector` and should not be in `ColumnVector`, which will be public soon.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#19970 from cloud-fan/final.
SQLConf allows some callers to define a custom default value for
configs, and that complicates a little bit the handling of fallback
config entries, since most of the default value resolution is
hidden by the config code.
This change peaks into the internals of these fallback configs
to figure out the correct default value, and also returns the
current human-readable default when showing the default value
(e.g. through "set -v").
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#19974 from vanzin/SPARK-22779.
## What changes were proposed in this pull request?
This PR provides DataSourceV2 API support for structured streaming, including new pieces needed to support continuous processing [SPARK-20928]. High level summary:
- DataSourceV2 includes new mixins to support micro-batch and continuous reads and writes. For reads, we accept an optional user specified schema rather than using the ReadSupportWithSchema model, because doing so would severely complicate the interface.
- DataSourceV2Reader includes new interfaces to read a specific microbatch or read continuously from a given offset. These follow the same setter pattern as the existing Supports* mixins so that they can work with SupportsScanUnsafeRow.
- DataReader (the per-partition reader) has a new subinterface ContinuousDataReader only for continuous processing. This reader has a special method to check progress, and next() blocks for new input rather than returning false.
- Offset, an abstract representation of position in a streaming query, is ported to the public API. (Each type of reader will define its own Offset implementation.)
- DataSourceV2Writer has a new subinterface ContinuousWriter only for continuous processing. Commits to this interface come tagged with an epoch number, as the execution engine will continue to produce new epoch commits as the task continues indefinitely.
Note that this PR does not propose to change the existing DataSourceV2 batch API, or deprecate the existing streaming source/sink internal APIs in spark.sql.execution.streaming.
## How was this patch tested?
Toy implementations of the new interfaces with unit tests.
Author: Jose Torres <jose@databricks.com>
Closes#19925 from joseph-torres/continuous-api.
## What changes were proposed in this pull request?
MLlib ```LinearRegression``` supports _huber_ loss addition to _leastSquares_ loss. The huber loss objective function is:
![image](https://user-images.githubusercontent.com/1962026/29554124-9544d198-8750-11e7-8afa-33579ec419d5.png)
Refer Eq.(6) and Eq.(8) in [A robust hybrid of lasso and ridge regression](http://statweb.stanford.edu/~owen/reports/hhu.pdf). This objective is jointly convex as a function of (w, σ) ∈ R × (0,∞), we can use L-BFGS-B to solve it.
The current implementation is a straight forward porting for Python scikit-learn [```HuberRegressor```](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.HuberRegressor.html). There are some differences:
* We use mean loss (```lossSum/weightSum```), but sklearn uses total loss (```lossSum```).
* We multiply the loss function and L2 regularization by 1/2. It does not affect the result if we multiply the whole formula by a factor, we just keep consistent with _leastSquares_ loss.
So if fitting w/o regularization, MLlib and sklearn produce the same output. If fitting w/ regularization, MLlib should set ```regParam``` divide by the number of instances to match the output of sklearn.
## How was this patch tested?
Unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#19020 from yanboliang/spark-3181.
## What changes were proposed in this pull request?
This pr fixed a compilation error of TPCDS `q75`/`q77` caused by #19813;
```
java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 371, Column 16: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 371, Column 16: Expression "bhj_matched" is not an rvalue
at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
```
## How was this patch tested?
Manually checked `q75`/`q77` can be properly compiled
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#19969 from maropu/SPARK-22600-FOLLOWUP.
Use a semaphore to synchronize the tasks with the listener code
that is trying to cancel the job or stage, so that the listener
won't try to cancel a job or stage that has already finished.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#19956 from vanzin/SPARK-22764.
## What changes were proposed in this pull request?
In SPARK-22550 which fixes 64KB JVM bytecode limit problem with elt, `buildCodeBlocks` is used to split codes. However, we should use `splitExpressionsWithCurrentInputs` because it considers both normal and wholestage codgen (it is not supported yet, so it simply doesn't split the codes).
## How was this patch tested?
Existing tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#19964 from viirya/SPARK-22772.
## What changes were proposed in this pull request?
PR closed with all the comments -> https://github.com/apache/spark/pull/19793
It solves the problem when submitting a wrong CreateSubmissionRequest to Spark Dispatcher was causing a bad state of Dispatcher and making it inactive as a Mesos framework.
https://issues.apache.org/jira/browse/SPARK-22574
## How was this patch tested?
All spark test passed successfully.
It was tested sending a wrong request (without appArgs) before and after the change. The point is easy, check if the value is null before being accessed.
This was before the change, leaving the dispatcher inactive:
```
Exception in thread "Thread-22" java.lang.NullPointerException
at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.getDriverCommandValue(MesosClusterScheduler.scala:444)
at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.buildDriverCommand(MesosClusterScheduler.scala:451)
at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.org$apache$spark$scheduler$cluster$mesos$MesosClusterScheduler$$createTaskInfo(MesosClusterScheduler.scala:538)
at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:570)
at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:555)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:555)
at org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:621)
```
And after:
```
"message" : "Malformed request: org.apache.spark.deploy.rest.SubmitRestProtocolException: Validation of message CreateSubmissionRequest failed!\n\torg.apache.spark.deploy.rest.SubmitRestProtocolMessage.validate(SubmitRestProtocolMessage.scala:70)\n\torg.apache.spark.deploy.rest.SubmitRequestServlet.doPost(RestSubmissionServer.scala:272)\n\tjavax.servlet.http.HttpServlet.service(HttpServlet.java:707)\n\tjavax.servlet.http.HttpServlet.service(HttpServlet.java:790)\n\torg.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:845)\n\torg.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:583)\n\torg.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)\n\torg.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)\n\torg.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)\n\torg.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\torg.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\torg.spark_project.jetty.server.Server.handle(Server.java:524)\n\torg.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319)\n\torg.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)\n\torg.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)\n\torg.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95)\n\torg.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\n\torg.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)\n\torg.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)\n\torg.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)\n\torg.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)\n\torg.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)\n\tjava.lang.Thread.run(Thread.java:745)"
```
Author: German Schiavon <germanschiavon@gmail.com>
Closes#19966 from Gschiavon/fix-submission-request.
## What changes were proposed in this pull request?
While spark code changes, there are new events in event log: #19649
And we used to maintain a whitelist to avoid exceptions: #15663
Currently Spark history server will stop parsing on unknown events or unrecognized properties. We may still see part of the UI data.
For better compatibility, we can ignore unknown events and parse through the log file.
## How was this patch tested?
Unit test
Author: Wang Gengliang <ltnwgl@gmail.com>
Closes#19953 from gengliangwang/ReplayListenerBus.
… than spark.network.timeout or not
## What changes were proposed in this pull request?
If spark.executor.heartbeatInterval bigger than spark.network.timeout,it will almost always cause exception below.
`Job aborted due to stage failure: Task 4763 in stage 3.0 failed 4 times, most recent failure: Lost task 4763.3 in stage 3.0 (TID 22383, executor id: 4761, host: xxx): ExecutorLostFailure (executor 4761 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 154022 ms`
Since many users do not get that point.He will set spark.executor.heartbeatInterval incorrectly.
This patch check this case when submit applications.
## How was this patch tested?
Test in cluster
Author: zhoukang <zhoukang199191@gmail.com>
Closes#19942 from caneGuy/zhoukang/check-heartbeat.
## What changes were proposed in this pull request?
We should not operate on `references` directly in `Expression.doGenCode`, instead we should use the high-level API `addReferenceObj`.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#19962 from cloud-fan/codegen.