## What changes were proposed in this pull request?
`ApproximatePercentile` contains a workaround logic to compress the samples since at the beginning `QuantileSummaries` was ignoring the compression threshold. This problem was fixed in SPARK-17439, but the workaround logic was not removed. So we are compressing the samples many more times than needed: this could lead to critical performance degradation.
This can create serious performance issues in queries like:
```
select approx_percentile(id, array(0.1)) from range(10000000)
```
## How was this patch tested?
added UT
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21133 from mgaido91/SPARK-24013.
JIRA Issue: https://issues.apache.org/jira/browse/SPARK-24107?jql=text%20~%20%22ChunkedByteBuffer%22
ChunkedByteBuffer.writeFully method has not reset the limit value. When
chunks larger than bufferWriteChunkSize, such as 80 * 1024 * 1024 larger than
config.BUFFER_WRITE_CHUNK_SIZE(64 * 1024 * 1024),only while once, will lost 16 * 1024 * 1024 byte
Author: WangJinhai02 <jinhai.wang02@ele.me>
Closes#21175 from manbuyun/bugfix-ChunkedByteBuffer.
## What changes were proposed in this pull request?
This PR detects length overflow if total elements in inputs are not acceptable.
For example, when the three inputs has `0x7FFF_FF00`, `0x7FFF_FF00`, and `0xE00`, we should detect length overflow since we cannot allocate such a large structure on `byte[]`.
On the other hand, the current algorithm can allocate the result structure with `0x1000`-byte length due to integer sum overflow.
## How was this patch tested?
Existing UTs.
If we would create UTs, we need large heap (6-8GB). It may make test environment unstable.
If it is necessary to create UTs, I will create them.
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#21064 from kiszk/SPARK-23976.
## What changes were proposed in this pull request?
We need to determine Spark major and minor versions in PySpark. We can add a `majorMinorVersion` API to PySpark which is similar to the Scala API in `VersionUtils.majorMinorVersion`.
## How was this patch tested?
Added tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#21203 from viirya/SPARK-24131.
## What changes were proposed in this pull request?
Shell escaped the name passed to spark-submit and change how conf attributes are shell escaped.
## How was this patch tested?
This test has been tested manually with Hive-on-spark with mesos or with the use case described in the issue with the sparkPi application with a custom name which contains illegal shell characters.
With this PR, hive-on-spark on mesos works like a charm with hive 3.0.0-SNAPSHOT.
I state that this contribution is my original work and that I license the work to the project under the project’s open source license
Author: Bounkong Khamphousone <bounkong.khamphousone@ebiznext.com>
Closes#21014 from tiboun/fix/SPARK-23941.
## What changes were proposed in this pull request?
Add TypedFilter support for continuous processing application.
## How was this patch tested?
unit tests
Author: wangyanlin01 <wangyanlin01@baidu.com>
Closes#21136 from yanlin-Lynn/SPARK-24061.
## What changes were proposed in this pull request?
When `PyArrow` or `Pandas` are not available, the corresponding PySpark tests are skipped automatically. Currently, PySpark tests fail when we are not using `-Phive`. This PR aims to skip Hive related PySpark tests when `-Phive` is not given.
**BEFORE**
```bash
$ build/mvn -DskipTests clean package
$ python/run-tests.py --python-executables python2.7 --modules pyspark-sql
File "/Users/dongjoon/spark/python/pyspark/sql/readwriter.py", line 295, in pyspark.sql.readwriter.DataFrameReader.table
...
IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':"
**********************************************************************
1 of 3 in pyspark.sql.readwriter.DataFrameReader.table
***Test Failed*** 1 failures.
```
**AFTER**
```bash
$ build/mvn -DskipTests clean package
$ python/run-tests.py --python-executables python2.7 --modules pyspark-sql
...
Tests passed in 138 seconds
Skipped tests in pyspark.sql.tests with python2.7:
...
test_hivecontext (pyspark.sql.tests.HiveSparkSubmitTests) ... skipped 'Hive is not available.'
```
## How was this patch tested?
This is a test-only change. First, this should pass the Jenkins. Then, manually do the following.
```bash
build/mvn -DskipTests clean package
python/run-tests.py --python-executables python2.7 --modules pyspark-sql
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#21141 from dongjoon-hyun/SPARK-23853.
## What changes were proposed in this pull request?
Added support to specify the 'spark.executor.extraJavaOptions' value in terms of the `{{APP_ID}}` and/or `{{EXECUTOR_ID}}`, `{{APP_ID}}` will be replaced by Application Id and `{{EXECUTOR_ID}}` will be replaced by Executor Id while starting the executor.
## How was this patch tested?
I have verified this by checking the executor process command and gc logs. I verified the same in different deployment modes(Standalone, YARN, Mesos) client and cluster modes.
Author: Devaraj K <devaraj@apache.org>
Closes#21088 from devaraj-kavali/SPARK-24003.
## What changes were proposed in this pull request?
filters like parquet row group filter, which is actually pushed to the data source but still to be evaluated by Spark, should also count as `pushedFilters`.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#21143 from cloud-fan/step1.
## What changes were proposed in this pull request?
I propose to support the `samplingRatio` option for schema inferring of CSV datasource similar to the same option of JSON datasource:
b14993e1fc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala (L49-L50)
## How was this patch tested?
Added 2 tests for json and 2 tests for csv datasources. The tests checks that only subset of input dataset is used for schema inferring.
Author: Maxim Gekk <maxim.gekk@databricks.com>
Author: Maxim Gekk <max.gekk@gmail.com>
Closes#20959 from MaxGekk/csv-sampling.
## What changes were proposed in this pull request?
I propose new option for JSON datasource which allows to specify encoding (charset) of input and output files. Here is an example of using of the option:
```
spark.read.schema(schema)
.option("multiline", "true")
.option("encoding", "UTF-16LE")
.json(fileName)
```
If the option is not specified, charset auto-detection mechanism is used by default.
The option can be used for saving datasets to jsons. Currently Spark is able to save datasets into json files in `UTF-8` charset only. The changes allow to save data in any supported charset. Here is the approximate list of supported charsets by Oracle Java SE: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html . An user can specify the charset of output jsons via the charset option like `.option("charset", "UTF-16BE")`. By default the output charset is still `UTF-8` to keep backward compatibility.
The solution has the following restrictions for per-line mode (`multiline = false`):
- If charset is different from UTF-8, the lineSep option must be specified. The option required because Hadoop LineReader cannot detect the line separator correctly. Here is the ticket for solving the issue: https://issues.apache.org/jira/browse/SPARK-23725
- Encoding with [BOM](https://en.wikipedia.org/wiki/Byte_order_mark) are not supported. For example, the `UTF-16` and `UTF-32` encodings are blacklisted. The problem can be solved by https://github.com/MaxGekk/spark-1/pull/2
## How was this patch tested?
I added the following tests:
- reads an json file in `UTF-16LE` encoding with BOM in `multiline` mode
- read json file by using charset auto detection (`UTF-32BE` with BOM)
- read json file using of user's charset (`UTF-16LE`)
- saving in `UTF-32BE` and read the result by standard library (not by Spark)
- checking that default charset is `UTF-8`
- handling wrong (unsupported) charset
Author: Maxim Gekk <maxim.gekk@databricks.com>
Author: Maxim Gekk <max.gekk@gmail.com>
Closes#20937 from MaxGekk/json-encoding-line-sep.
## What changes were proposed in this pull request?
Fix `MemorySinkV2` toString() error
## How was this patch tested?
N/A
Author: Yuming Wang <yumwang@ebay.com>
Closes#21170 from wangyum/SPARK-22732.
## What changes were proposed in this pull request?
In the error messages we should return the SQL types (like `string` rather than the internal types like `StringType`).
## How was this patch tested?
added UT
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21181 from mgaido91/SPARK-23736_followup.
## What changes were proposed in this pull request?
Replace rate source with memory source in continuous mode test suite. Keep using "rate" source if the tests intend to put data periodically in background, or need to put short source name to load, since "memory" doesn't have provider for source.
## How was this patch tested?
Ran relevant test suite from IDE.
Author: Jungtaek Lim <kabhwan@gmail.com>
Closes#21152 from HeartSaVioR/SPARK-23688.
## What changes were proposed in this pull request?
Event `SparkListenerDriverAccumUpdates` may happen multiple times in a query - e.g. every `FileSourceScanExec` and `BroadcastExchangeExec` call `postDriverMetricUpdates`.
In Spark 2.2 `SQLListener` updated the map with new values. `SQLAppStatusListener` overwrites it.
Unless `update` preserved it in the KV store (dependant on `exec.lastWriteTime`), only the metrics from the last operator that does `postDriverMetricUpdates` are preserved.
## How was this patch tested?
Unit test added.
Author: Juliusz Sompolski <julek@databricks.com>
Closes#21171 from juliuszsompolski/SPARK-24104.
## What changes were proposed in this pull request?
In this case, the partition pruning happens before the planning phase of scalar subquery expressions.
For scalar subquery expressions, the planning occurs late in the cycle (after the physical planning) in "PlanSubqueries" just before execution. Currently we try to execute the scalar subquery expression as part of partition pruning and fail as it implements Unevaluable.
The fix attempts to ignore the Subquery expressions from partition pruning computation. Another option can be to somehow plan the subqueries before the partition pruning. Since this may not be a commonly occuring expression, i am opting for a simpler fix.
Repro
``` SQL
CREATE TABLE test_prc_bug (
id_value string
)
partitioned by (id_type string)
location '/tmp/test_prc_bug'
stored as parquet;
insert into test_prc_bug values ('1','a');
insert into test_prc_bug values ('2','a');
insert into test_prc_bug values ('3','b');
insert into test_prc_bug values ('4','b');
select * from test_prc_bug
where id_type = (select 'b');
```
## How was this patch tested?
Added test in SubquerySuite and hive/SQLQuerySuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes#21174 from dilipbiswal/spark-24085.
## What changes were proposed in this pull request?
A more informative message to tell you why a structured streaming query cannot continue if you have added more sources, than there are in the existing checkpoint offsets.
## How was this patch tested?
I added a Unit Test.
Author: Patrick McGloin <mcgloin.patrick@gmail.com>
Closes#20946 from patrickmcgloin/master.
## What changes were proposed in this pull request?
When a user specifies the wrong class -- or, in fact, a class instead of an object -- Spark throws an NPE which is not useful for debugging. This was reported in [SPARK-23830](https://issues.apache.org/jira/browse/SPARK-23830). This PR adds a check to ensure the main method was found and logs a useful error in the event that it's null.
## How was this patch tested?
* Unit tests + Manual testing
* The scope of the changes is very limited
Author: eric-maynard <emaynard@cloudera.com>
Author: Eric Maynard <emaynard@cloudera.com>
Closes#21168 from eric-maynard/feature/SPARK-23830.
## What changes were proposed in this pull request?
Previously, SPARK-22158 fixed for `USING hive` syntax. This PR aims to fix for `STORED AS` syntax. Although the test case covers ORC part, the patch considers both `convertMetastoreOrc` and `convertMetastoreParquet`.
## How was this patch tested?
Pass newly added test cases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#20522 from dongjoon-hyun/SPARK-22158-2.
## What changes were proposed in this pull request?
Log stacktrace for uncaught exception
## How was this patch tested?
UT and manually test
Author: zhoukang <zhoukang199191@gmail.com>
Closes#21151 from caneGuy/zhoukang/log-stacktrace.
## What changes were proposed in this pull request?
This PR proposes to remove duplicated dependency checking logics and also print out skipped tests from unittests.
For example, as below:
```
Skipped tests in pyspark.sql.tests with pypy:
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
...
Skipped tests in pyspark.sql.tests with python3:
test_createDataFrame_column_name_encoding (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
...
```
Currently, it's not printed out in the console. I think we should better print out skipped tests in the console.
## How was this patch tested?
Manually tested. Also, fortunately, Jenkins has good environment to test the skipped output.
Author: hyukjinkwon <gurwls223@apache.org>
Closes#21107 from HyukjinKwon/skipped-tests-print.
## What changes were proposed in this pull request?
Print out the data type in the AssertionError message to make it more meaningful.
## How was this patch tested?
I manually tested the changed code on my local, but didn't add any test.
Author: Huaxin Gao <huaxing@us.ibm.com>
Closes#21159 from huaxingao/spark-24057.
## What changes were proposed in this pull request?
`colStat.min` AND `colStat.max` are empty for string type. Thus, `evaluateInSet` should not return zero when either `colStat.min` or `colStat.max`.
## How was this patch tested?
Added a test case.
Author: gatorsmile <gatorsmile@gmail.com>
Closes#21147 from gatorsmile/cached.
## What changes were proposed in this pull request?
This makes it easy to understand at runtime which version is running. Great for debugging production issues.
## How was this patch tested?
Not necessary.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#21160 from tdas/SPARK-24094.
## What changes were proposed in this pull request?
For the details of the exception please see [SPARK-24062](https://issues.apache.org/jira/browse/SPARK-24062).
The issue is:
Spark on Yarn stores SASL secret in current UGI's credentials, this credentials will be distributed to AM and executors, so that executors and drive share the same secret to communicate. But STS/Hive library code will refresh the current UGI by UGI's loginFromKeytab() after Spark application is started, this will create a new UGI in the current driver's context with empty tokens and secret keys, so secret key is lost in the current context's UGI, that's why Spark driver throws secret key not found exception.
In Spark 2.2 code, Spark also stores this secret key in SecurityManager's class variable, so even UGI is refreshed, the secret is still existed in the object, so STS with SASL can still be worked in Spark 2.2. But in Spark 2.3, we always search key from current UGI, which makes it fail to work in Spark 2.3.
To fix this issue, there're two possible solutions:
1. Fix in STS/Hive library, when a new UGI is refreshed, copy the secret key from original UGI to the new one. The difficulty is that some codes to refresh the UGI is existed in Hive library, which makes us hard to change the code.
2. Roll back the logics in SecurityManager to match Spark 2.2, so that this issue can be fixed.
2nd solution seems a simple one. So I will propose a PR with 2nd solution.
## How was this patch tested?
Verified in local cluster.
CC vanzin tgravescs please help to review. Thanks!
Author: jerryshao <sshao@hortonworks.com>
Closes#21138 from jerryshao/SPARK-24062.
## What changes were proposed in this pull request?
The PR adds the SQL function `array_join`. The behavior of the function is based on Presto's one.
The function accepts an `array` of `string` which is to be joined, a `string` which is the delimiter to use between the items of the first argument and optionally a `string` which is used to replace `null` values.
## How was this patch tested?
added UTs
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21011 from mgaido91/SPARK-23916.
## What changes were proposed in this pull request?
HIVE-15511 introduced the `roundOff` flag in order to disable the rounding to 8 digits which is performed in `months_between`. Since this can be a computational intensive operation, skipping it may improve performances when the rounding is not needed.
## How was this patch tested?
modified existing UT
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21008 from mgaido91/SPARK-23902.
## What changes were proposed in this pull request?
Added the `samplingRatio` option to the `json()` method of PySpark DataFrame Reader. Improving existing tests for Scala API according to review of the PR: https://github.com/apache/spark/pull/20959
## How was this patch tested?
Added new test for PySpark, updated 2 existing tests according to reviews of https://github.com/apache/spark/pull/20959 and added new negative test
Author: Maxim Gekk <maxim.gekk@databricks.com>
Closes#21056 from MaxGekk/json-sampling.
## What changes were proposed in this pull request?
a followup of https://github.com/apache/spark/pull/21100
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#21154 from cloud-fan/test.
## What changes were proposed in this pull request?
In some streaming queries, the input and processing rates are not calculated at all (shows up as zero) because MicroBatchExecution fails to associated metrics from the executed plan of a trigger with the sources in the logical plan of the trigger. The way this executed-plan-leaf-to-logical-source attribution works is as follows. With V1 sources, there was no way to identify which execution plan leaves were generated by a streaming source. So did a best-effort attempt to match logical and execution plan leaves when the number of leaves were same. In cases where the number of leaves is different, we just give up and report zero rates. An example where this may happen is as follows.
```
val cachedStaticDF = someStaticDF.union(anotherStaticDF).cache()
val streamingInputDF = ...
val query = streamingInputDF.join(cachedStaticDF).writeStream....
```
In this case, the `cachedStaticDF` has multiple logical leaves, but in the trigger's execution plan it only has leaf because a cached subplan is represented as a single InMemoryTableScanExec leaf. This leads to a mismatch in the number of leaves causing the input rates to be computed as zero.
With DataSourceV2, all inputs are represented in the executed plan using `DataSourceV2ScanExec`, each of which has a reference to the associated logical `DataSource` and `DataSourceReader`. So its easy to associate the metrics to the original streaming sources.
In this PR, the solution is as follows. If all the streaming sources in a streaming query as v2 sources, then use a new code path where the execution-metrics-to-source mapping is done directly. Otherwise we fall back to existing mapping logic.
## How was this patch tested?
- New unit tests using V2 memory source
- Existing unit tests using V1 source
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#21126 from tdas/SPARK-24050.
## What changes were proposed in this pull request?
This pr fixed code so that `cache` could prevent any jobs from being triggered.
For example, in the current master, an operation below triggers a actual job;
```
val df = spark.range(10000000000L)
.filter('id > 1000)
.orderBy('id.desc)
.cache()
```
This triggers a job while the cache should be lazy. The problem is that, when creating `InMemoryRelation`, we build the RDD, which calls `SparkPlan.execute` and may trigger jobs, like sampling job for range partitioner, or broadcast job.
This pr removed the code to build a cached `RDD` in the constructor of `InMemoryRelation` and added `CachedRDDBuilder` to lazily build the `RDD` in `InMemoryRelation`. Then, the first call of `CachedRDDBuilder.cachedColumnBuffers` triggers a job to materialize the cache in `InMemoryTableScanExec` .
## How was this patch tested?
Added tests in `CachedTableSuite`.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#21018 from maropu/SPARK-23880.
## What changes were proposed in this pull request?
Union of map and other compatible column result in unresolved operator 'Union; exception
Reproduction
`spark-sql>select map(1,2), 'str' union all select map(1,2,3,null), 1`
Output:
```
Error in query: unresolved operator 'Union;;
'Union
:- Project [map(1, 2) AS map(1, 2)#106, str AS str#107]
: +- OneRowRelation$
+- Project [map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS INT), 3, CAST(NULL AS INT))#109, 1 AS 1#108]
+- OneRowRelation$
```
So, we should cast part of columns to be compatible when appropriate.
## How was this patch tested?
Added a test (query union of map and other columns) to SQLQueryTestSuite's union.sql.
Author: liutang123 <liutang123@yeah.net>
Closes#21100 from liutang123/SPARK-24012.
## What changes were proposed in this pull request?
Refactor continuous writing to its own class.
See WIP https://github.com/jose-torres/spark/pull/13 for the overall direction this is going, but I think this PR is very isolated and necessary anyway.
## How was this patch tested?
existing unit tests - refactoring only
Author: Jose Torres <torres.joseph.f+github@gmail.com>
Closes#21116 from jose-torres/SPARK-24038.
## What changes were proposed in this pull request?
Currently, the driver side of the Kafka source (i.e. KafkaMicroBatchReader) eagerly creates a consumer as soon as the Kafk aMicroBatchReader is created. However, we create dummy KafkaMicroBatchReader to get the schema and immediately stop it. Its better to make the consumer creation lazy, it will be created on the first attempt to fetch offsets using the KafkaOffsetReader.
## How was this patch tested?
Existing unit tests
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#21134 from tdas/SPARK-24056.
## What changes were proposed in this pull request?
Instruments logging improvements - ML regression package
I add an `OptionalInstrument` class which used in `WeightLeastSquares` and `IterativelyReweightedLeastSquares`.
## How was this patch tested?
N/A
Author: WeichenXu <weichen.xu@databricks.com>
Closes#21078 from WeichenXu123/inst_reg.
## What changes were proposed in this pull request?
We save ML's user-supplied params and default params as one entity in metadata. During loading the saved models, we set all the loaded params into created ML model instances as user-supplied params.
It causes some problems, e.g., if we strictly disallow some params to be set at the same time, a default param can fail the param check because it is treated as user-supplied param after loading.
The loaded default params should not be set as user-supplied params. We should save ML default params separately in metadata.
For backward compatibility, when loading metadata, if it is a metadata file from previous Spark, we shouldn't raise error if we can't find the default param field.
## How was this patch tested?
Pass existing tests and added tests.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#20633 from viirya/save-ml-default-params.
## What changes were proposed in this pull request?
1. Adds a `hadoop-3.1` profile build depending on the hadoop-3.1 artifacts.
1. In the hadoop-cloud module, adds an explicit hadoop-3.1 profile which switches from explicitly pulling in cloud connectors (hadoop-openstack, hadoop-aws, hadoop-azure) to depending on the hadoop-cloudstorage POM artifact, which pulls these in, has pre-excluded things like hadoop-common, and stays up to date with new connectors (hadoop-azuredatalake, hadoop-allyun). Goal: it becomes the Hadoop projects homework of keeping this clean, and the spark project doesn't need to handle new hadoop releases adding more dependencies.
1. the hadoop-cloud/hadoop-3.1 profile also declares support for jetty-ajax and jetty-util to ensure that these jars get into the distribution jar directory when needed by unshaded libraries.
1. Increases the curator and zookeeper versions to match those in hadoop-3, fixing spark core to build in sbt with the hadoop-3 dependencies.
## How was this patch tested?
* Everything this has been built and tested against both ASF Hadoop branch-3.1 and hadoop trunk.
* spark-shell was used to create connectors to all the stores and verify that file IO could take place.
The spark hive-1.2.1 JAR has problems here, as it's version check logic fails for Hadoop versions > 2.
This can be avoided with either of
* The hadoop JARs built to declare their version as Hadoop 2.11 `mvn install -DskipTests -DskipShade -Ddeclared.hadoop.version=2.11` . This is safe for local test runs, not for deployment (HDFS is very strict about cross-version deployment).
* A modified version of spark hive whose version check switch statement is happy with hadoop 3.
I've done both, with maven and SBT.
Three issues surfaced
1. A spark-core test failure —fixed in SPARK-23787.
1. SBT only: Zookeeper not being found in spark-core. Somehow curator 2.12.0 triggers some slightly different dependency resolution logic from previous versions, and Ivy was missing zookeeper.jar entirely. This patch adds the explicit declaration for all spark profiles, setting the ZK version = 3.4.9 for hadoop-3.1
1. Marking jetty-utils as provided in spark was stopping hadoop-azure from being able to instantiate the azure wasb:// client; it was using jetty-util-ajax, which could then not find a class in jetty-util.
Author: Steve Loughran <stevel@hortonworks.com>
Closes#20923 from steveloughran/cloud/SPARK-23807-hadoop-31.
## What changes were proposed in this pull request?
- Multiple possible input types is added in validateAndTransformSchema() and computeCost() while checking column type
- Add if statement in transform() to support array type as featuresCol
- Add the case statement in fit() while selecting columns from dataset
These changes will be applied to KMeans first, then to other clustering method
## How was this patch tested?
unit test is added
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Lu WANG <lu.wang@databricks.com>
Closes#21081 from ludatabricks/SPARK-23975.
## What changes were proposed in this pull request?
By default, the dynamic allocation will request enough executors to maximize the
parallelism according to the number of tasks to process. While this minimizes the
latency of the job, with small tasks this setting can waste a lot of resources due to
executor allocation overhead, as some executor might not even do any work.
This setting allows to set a ratio that will be used to reduce the number of
target executors w.r.t. full parallelism.
The number of executors computed with this setting is still fenced by
`spark.dynamicAllocation.maxExecutors` and `spark.dynamicAllocation.minExecutors`
## How was this patch tested?
Units tests and runs on various actual workloads on a Yarn Cluster
Author: Julien Cuquemelle <j.cuquemelle@criteo.com>
Closes#19881 from jcuquemelle/AddTaskPerExecutorSlot.
## What changes were proposed in this pull request?
This pr is a follow-up of #20980 and fixes code to reuse `InternalRow` for converting input keys/values in `ExternalMapToCatalyst` eval.
## How was this patch tested?
Existing tests.
Author: Takeshi Yamamuro <yamamuro@apache.org>
Closes#21137 from maropu/SPARK-23589-FOLLOWUP.
## What changes were proposed in this pull request?
In SPARK-23375 we introduced the ability of removing `Sort` operation during query optimization if the data is already sorted. In this follow-up we remove also a `Sort` which is followed by another `Sort`: in this case the first sort is not needed and can be safely removed.
The PR starts from henryr's comment: https://github.com/apache/spark/pull/20560#discussion_r180601594. So credit should be given to him.
## How was this patch tested?
added UT
Author: Marco Gaido <marcogaido91@gmail.com>
Closes#21072 from mgaido91/SPARK-23973.
"childOption" is for the remote connections, not for the server socket
that actually listens for incoming connections.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#21132 from vanzin/SPARK-24029.2.
TaskSetManager.hasAttemptOnHost had a misleading comment. The comment
said that it only checked for running tasks, but really it checked for
any tasks that might have run in the past as well. This updates to line
up with the implementation.
Author: wuyi <ngone_5451@163.com>
Closes#20998 from Ngone51/SPARK-23888.
## What changes were proposed in this pull request?
Adding PMML export to Spark ML's KMeans Model.
## How was this patch tested?
New unit test for Spark ML PMML export based on the old Spark MLlib unit test.
Author: Holden Karau <holden@pigscanfly.ca>
Closes#20907 from holdenk/SPARK-11237-Add-PMML-Export-for-KMeans.
## What changes were proposed in this pull request?
A structured streaming query with a streaming aggregation can throw the following error in rare cases.
```
java.lang.IllegalStateException: Cannot commit after already committed or aborted
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$verify(HDFSBackedStateStoreProvider.scala:643)
at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:135)
at org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3$$anon$2$$anonfun$hasNext$2.apply$mcV$sp(statefulOperators.scala:359)
at org.apache.spark.sql.execution.streaming.StateStoreWriter$class.timeTakenMs(statefulOperators.scala:102)
at org.apache.spark.sql.execution.streaming.StateStoreSaveExec.timeTakenMs(statefulOperators.scala:251)
at org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3$$anon$2.hasNext(statefulOperators.scala:359)
at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:188)
at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.<init>(ObjectAggregationIterator.scala:78)
at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:114)
at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:105)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:42)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:336)
```
This can happen when the following conditions are accidentally hit.
- Streaming aggregation with aggregation function that is a subset of [`TypedImperativeAggregation`](76b8b840dd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala (L473)) (for example, `collect_set`, `collect_list`, `percentile`, etc.).
- Query running in `update}` mode
- After the shuffle, a partition has exactly 128 records.
This causes StateStore.commit to be called twice. See the [JIRA](https://issues.apache.org/jira/browse/SPARK-23004) for a more detailed explanation. The solution is to use `NextIterator` or `CompletionIterator`, each of which has a flag to prevent the "onCompletion" task from being called more than once. In this PR, I chose to implement using `NextIterator`.
## How was this patch tested?
Added unit test that I have confirm will fail without the fix.
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#21124 from tdas/SPARK-23004.