[SPARK-30812][SQL][CORE] Revise boolean config name to comply with new config naming policy

### What changes were proposed in this pull request?

Revise the config names below to comply with the [new config naming policy](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-naming-policy-of-Spark-configs-td28875.html) (a short usage sketch with the new names follows the lists):

SQL:
* spark.sql.execution.subquery.reuse.enabled / [SPARK-27083](https://issues.apache.org/jira/browse/SPARK-27083)
* spark.sql.legacy.allowNegativeScaleOfDecimal.enabled / [SPARK-30252](https://issues.apache.org/jira/browse/SPARK-30252)
* spark.sql.adaptive.optimizeSkewedJoin.enabled / [SPARK-29544](https://issues.apache.org/jira/browse/SPARK-29544)
* spark.sql.legacy.property.nonReserved / [SPARK-30183](https://issues.apache.org/jira/browse/SPARK-30183)
* spark.sql.streaming.forceDeleteTempCheckpointLocation.enabled / [SPARK-26389](https://issues.apache.org/jira/browse/SPARK-26389)
* spark.sql.analyzer.failAmbiguousSelfJoin.enabled / [SPARK-28344](https://issues.apache.org/jira/browse/SPARK-28344)
* spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled / [SPARK-30074](https://issues.apache.org/jira/browse/SPARK-30074)
* spark.sql.execution.pandas.arrowSafeTypeConversion / [SPARK-25811](https://issues.apache.org/jira/browse/SPARK-25811)
* spark.sql.legacy.looseUpcast / [SPARK-24586](https://issues.apache.org/jira/browse/SPARK-24586)
* spark.sql.legacy.arrayExistsFollowsThreeValuedLogic / [SPARK-28052](https://issues.apache.org/jira/browse/SPARK-28052)
* spark.sql.sources.ignoreDataLocality.enabled / [SPARK-29189](https://issues.apache.org/jira/browse/SPARK-29189)
* spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled / [SPARK-9853](https://issues.apache.org/jira/browse/SPARK-9853)

CORE:
* spark.eventLog.erasureCoding.enabled / [SPARK-25855](https://issues.apache.org/jira/browse/SPARK-25855)
* spark.shuffle.readHostLocalDisk.enabled / [SPARK-30235](https://issues.apache.org/jira/browse/SPARK-30235)
* spark.scheduler.listenerbus.logSlowEvent.enabled / [SPARK-29001](https://issues.apache.org/jira/browse/SPARK-29001)
* spark.resources.coordinate.enable / [SPARK-27371](https://issues.apache.org/jira/browse/SPARK-27371)
* spark.eventLog.logStageExecutorMetrics.enabled / [SPARK-23429](https://issues.apache.org/jira/browse/SPARK-23429)
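
To make the rename concrete, here is a minimal, hypothetical Scala sketch (not part of this patch; the object name, app name, and master are placeholders) showing two of the keys being set under their new names on Spark 3.0:

```scala
import org.apache.spark.sql.SparkSession

object RenamedConfSketch {  // hypothetical example object, not part of Spark
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")              // placeholder master
      .appName("renamed-conf-sketch")  // placeholder app name
      // Core conf set at startup; was spark.eventLog.logStageExecutorMetrics.enabled
      // (it only takes effect when event logging itself is enabled).
      .config("spark.eventLog.logStageExecutorMetrics", "true")
      .getOrCreate()

    // Runtime-settable SQL conf; was spark.sql.legacy.allowNegativeScaleOfDecimal.enabled.
    spark.conf.set("spark.sql.legacy.allowNegativeScaleOfDecimal", "true")

    spark.stop()
  }
}
```

Note that not every rename simply drops the `.enabled`/`.enable` suffix; some keys are rephrased so the name itself reads as a predicate, e.g. `spark.sql.legacy.property.nonReserved` becomes `spark.sql.legacy.notReserveProperties` and `spark.sql.execution.subquery.reuse.enabled` becomes `spark.sql.execution.reuseSubquery` (see the diffs below).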

### Why are the changes needed?

To comply with the config naming policy.

### Does this PR introduce any user-facing change?

No. The configurations listed above were all newly introduced in Spark 3.0, so renaming them does not affect any released version.

### How was this patch tested?

Pass Jenkins.

Closes #27563 from Ngone51/revise_boolean_conf_name.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
yi.wu 2020-02-18 20:39:50 +08:00 committed by Wenchen Fan
parent 643a480b11
commit 68d7edf949
14 changed files with 52 additions and 52 deletions


@@ -38,7 +38,7 @@ package object config {
private[spark] val LISTENER_BUS_EVENT_QUEUE_PREFIX = "spark.scheduler.listenerbus.eventqueue"
private[spark] val SPARK_RESOURCES_COORDINATE =
ConfigBuilder("spark.resources.coordinate.enable")
ConfigBuilder("spark.resources.coordinateResourcesInStandalone")
.doc("Whether to coordinate resources automatically among workers/drivers(client only) " +
"in Standalone. If false, the user is responsible for configuring different resources " +
"for workers/drivers that run on the same host.")
@@ -159,7 +159,7 @@ package object config {
.createWithDefaultString("100k")
private[spark] val EVENT_LOG_STAGE_EXECUTOR_METRICS =
ConfigBuilder("spark.eventLog.logStageExecutorMetrics.enabled")
ConfigBuilder("spark.eventLog.logStageExecutorMetrics")
.doc("Whether to write per-stage peaks of executor metrics (for each executor) " +
"to the event log.")
.booleanConf
@@ -632,7 +632,7 @@ package object config {
.createWithDefault(128)
private[spark] val LISTENER_BUS_LOG_SLOW_EVENT_ENABLED =
ConfigBuilder("spark.scheduler.listenerbus.logSlowEvent.enabled")
ConfigBuilder("spark.scheduler.listenerbus.logSlowEvent")
.internal()
.doc("When enabled, log the event that takes too much time to process. This helps us " +
"discover the event types that cause performance bottlenecks. The time threshold is " +
@@ -644,7 +644,7 @@ package object config {
ConfigBuilder("spark.scheduler.listenerbus.logSlowEvent.threshold")
.internal()
.doc("The time threshold of whether a event is considered to be taking too much time to " +
"process. Log the event if spark.scheduler.listenerbus.logSlowEvent.enabled is true.")
s"process. Log the event if ${LISTENER_BUS_LOG_SLOW_EVENT_ENABLED.key} is true.")
.timeConf(TimeUnit.NANOSECONDS)
.createWithDefaultString("1s")
@@ -1115,16 +1115,6 @@ package object config {
.booleanConf
.createWithDefault(false)
private[spark] val STORAGE_LOCAL_DISK_BY_EXECUTORS_CACHE_SIZE =
ConfigBuilder("spark.storage.localDiskByExecutors.cacheSize")
.doc("The max number of executors for which the local dirs are stored. This size is " +
"both applied for the driver and both for the executors side to avoid having an " +
"unbounded store. This cache will be used to avoid the network in case of fetching disk " +
"persisted RDD blocks or shuffle blocks (when `spark.shuffle.readHostLocalDisk.enabled` " +
"is set) from the same host.")
.intConf
.createWithDefault(1000)
private[spark] val SHUFFLE_SYNC =
ConfigBuilder("spark.shuffle.sync")
.doc("Whether to force outstanding writes to disk.")
@@ -1161,13 +1151,23 @@ package object config {
.createWithDefault(false)
private[spark] val SHUFFLE_HOST_LOCAL_DISK_READING_ENABLED =
ConfigBuilder("spark.shuffle.readHostLocalDisk.enabled")
ConfigBuilder("spark.shuffle.readHostLocalDisk")
.doc(s"If enabled (and `${SHUFFLE_USE_OLD_FETCH_PROTOCOL.key}` is disabled), shuffle " +
"blocks requested from those block managers which are running on the same host are read " +
"from the disk directly instead of being fetched as remote blocks over the network.")
.booleanConf
.createWithDefault(true)
private[spark] val STORAGE_LOCAL_DISK_BY_EXECUTORS_CACHE_SIZE =
ConfigBuilder("spark.storage.localDiskByExecutors.cacheSize")
.doc("The max number of executors for which the local dirs are stored. This size is " +
"both applied for the driver and both for the executors side to avoid having an " +
"unbounded store. This cache will be used to avoid the network in case of fetching disk " +
s"persisted RDD blocks or shuffle blocks " +
s"(when `${SHUFFLE_HOST_LOCAL_DISK_READING_ENABLED.key}` is set) from the same host.")
.intConf
.createWithDefault(1000)
private[spark] val MEMORY_MAP_LIMIT_FOR_TESTS =
ConfigBuilder("spark.storage.memoryMapLimitForTests")
.internal()


@@ -40,7 +40,7 @@ import org.apache.spark.util.{JsonProtocol, Utils}
* spark.eventLog.enabled - Whether event logging is enabled.
* spark.eventLog.dir - Path to the directory in which events are logged.
* spark.eventLog.logBlockUpdates.enabled - Whether to log block updates
* spark.eventLog.logStageExecutorMetrics.enabled - Whether to log stage executor metrics
* spark.eventLog.logStageExecutorMetrics - Whether to log stage executor metrics
*
* Event log file writer maintains its own parameters: refer the doc of [[EventLogFileWriter]]
* and its descendant for more details.

File diff suppressed because one or more lines are too long


@@ -194,7 +194,7 @@ of the most common options to set are:
</td>
</tr>
<tr>
<td><code>spark.resources.coordinate.enable</code></td>
<td><code>spark.resources.coordinateResourcesInStandalone</code></td>
<td>true</td>
<td>
Whether to coordinate resources automatically among workers/drivers(client only)
@@ -230,7 +230,7 @@ of the most common options to set are:
write to STDOUT a JSON string in the format of the ResourceInformation class. This has a
name and an array of addresses. For a client-submitted driver in Standalone, discovery
script must assign different resource addresses to this driver comparing to workers' and
other drivers' when <code>spark.resources.coordinate.enable</code> is off.
other drivers' when <code>spark.resources.coordinateResourcesInStandalone</code> is off.
</td>
</tr>
<tr>
@@ -1641,7 +1641,7 @@ Apart from these, the following properties are also available, and may be useful
<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
<td><code>spark.eventLog.logStageExecutorMetrics.enabled</code></td>
<td><code>spark.eventLog.logStageExecutorMetrics</code></td>
<td>false</td>
<td>
Whether to write per-stage peaks of executor metrics (for each executor) to the event log.
@@ -2755,5 +2755,5 @@ There are configurations available to request resources for the driver: <code>sp
Spark will use the configurations specified to first request containers with the corresponding resources from the cluster manager. Once it gets the container, Spark launches an Executor in that container which will discover what resources the container has and the addresses associated with each resource. The Executor will register with the Driver and report back the resources available to that Executor. The Spark scheduler can then schedule tasks to each Executor and assign specific resource addresses based on the resource requirements the user specified. The user can see the resources assigned to a task using the <code>TaskContext.get().resources</code> api. On the driver, the user can see the resources assigned with the SparkContext <code>resources</code> call. It's then up to the user to use the assignedaddresses to do the processing they want or pass those into the ML/AI framework they are using.
See your cluster manager specific page for requirements and details on each of - [YARN](running-on-yarn.html#resource-allocation-and-configuration-overview), [Kubernetes](running-on-kubernetes.html#resource-allocation-and-configuration-overview) and [Standalone Mode](spark-standalone.html#resource-allocation-and-configuration-overview). It is currently not available with Mesos or local mode. If using local-cluster mode see the Spark Standalone documentation but be aware only a single worker resources file or discovery script can be specified the is shared by all the Workers so you should enable resource coordination (see <code>spark.resources.coordinate.enable</code>).
See your cluster manager specific page for requirements and details on each of - [YARN](running-on-yarn.html#resource-allocation-and-configuration-overview), [Kubernetes](running-on-kubernetes.html#resource-allocation-and-configuration-overview) and [Standalone Mode](spark-standalone.html#resource-allocation-and-configuration-overview). It is currently not available with Mesos or local mode. If using local-cluster mode see the Spark Standalone documentation but be aware only a single worker resources file or discovery script can be specified the is shared by all the Workers so you should enable resource coordination (see <code>spark.resources.coordinateResourcesInStandalone</code>).


@@ -659,7 +659,7 @@ A list of the available metrics, with a short description:
Executor-level metrics are sent from each executor to the driver as part of the Heartbeat to describe the performance metrics of Executor itself like JVM heap memory, GC information.
Executor metric values and their measured peak values per executor are exposed via the REST API at the end point `/applications/[app-id]/executors`.
In addition, aggregated per-stage peak values of the executor metrics are written to the event log if `spark.eventLog.logStageExecutorMetrics.enabled` is true.
In addition, aggregated per-stage peak values of the executor metrics are written to the event log if `spark.eventLog.logStageExecutorMetrics` is true.
Executor metrics are also exposed via the Spark metrics system based on the Dropwizard metrics library.
A list of the available metrics, with a short description:


@@ -34,7 +34,7 @@ Please refer [Migration Guide: SQL, Datasets and DataFrame](sql-migration-guide.
- In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder comes to not update the configurations. This is the same behavior as Java/Scala API in 2.3 and above. If you want to update them, you need to update them prior to creating a `SparkSession`.
- In PySpark, when Arrow optimization is enabled, if Arrow version is higher than 0.11.0, Arrow can perform safe type conversion when converting Pandas.Series to Arrow array during serialization. Arrow will raise errors when detecting unsafe type conversion like overflow. Setting `spark.sql.execution.pandas.arrowSafeTypeConversion` to true can enable it. The default setting is false. PySpark's behavior for Arrow versions is illustrated in the table below:
- In PySpark, when Arrow optimization is enabled, if Arrow version is higher than 0.11.0, Arrow can perform safe type conversion when converting Pandas.Series to Arrow array during serialization. Arrow will raise errors when detecting unsafe type conversion like overflow. Setting `spark.sql.execution.pandas.convertToArrowArraySafely` to true can enable it. The default setting is false. PySpark's behavior for Arrow versions is illustrated in the table below:
<table class="table">
<tr>
<th>


@@ -256,7 +256,7 @@ SPARK_MASTER_OPTS supports the following system properties:
<td>
Path to resource discovery script, which is used to find a particular resource while worker starting up.
And the output of the script should be formatted like the <code>ResourceInformation</code> class.
When <code>spark.resources.coordinate.enable</code> is off, the discovery script must assign different
When <code>spark.resources.coordinateResourcesInStandalone</code> is off, the discovery script must assign different
resources for workers and drivers in client mode that run on the same host to avoid resource conflict.
</td>
</tr>
@@ -267,7 +267,7 @@ SPARK_MASTER_OPTS supports the following system properties:
Path to resources file which is used to find various resources while worker starting up.
The content of resources file should be formatted like <code>
[[{"id":{"componentName": "spark.worker","resourceName":"gpu"},"addresses":["0","1","2"]}]]</code>.
When <code>spark.resources.coordinate.enable</code> is off, resources file must assign different
When <code>spark.resources.coordinateResourcesInStandalone</code> is off, resources file must assign different
resources for workers and drivers in client mode that run on the same host to avoid resource conflict.
If a particular resource is not found in the resources file, the discovery script would be used to
find that resource. If the discovery script also does not find the resources, the worker will fail
@@ -346,9 +346,9 @@ Please make sure to have read the Custom Resource Scheduling and Configuration O
Spark Standalone has 2 parts, the first is configuring the resources for the Worker, the second is the resource allocation for a specific application.
The user must configure the Workers to have a set of resources available so that it can assign them out to Executors. The <code>spark.worker.resource.{resourceName}.amount</code> is used to control the amount of each resource the worker has allocated. The user must also specify either <code>spark.worker.resourcesFile</code> or <code>spark.worker.resource.{resourceName}.discoveryScript</code> to specify how the Worker discovers the resources its assigned. See the descriptions above for each of those to see which method works best for your setup. Please take note of <code>spark.resources.coordinate.enable</code> as it indicates whether Spark should handle coordinating resources or if the user has made sure each Worker has separate resources. Also note that if using the resources coordination <code>spark.resources.dir</code> can be used to specify the directory used to do that coordination.
The user must configure the Workers to have a set of resources available so that it can assign them out to Executors. The <code>spark.worker.resource.{resourceName}.amount</code> is used to control the amount of each resource the worker has allocated. The user must also specify either <code>spark.worker.resourcesFile</code> or <code>spark.worker.resource.{resourceName}.discoveryScript</code> to specify how the Worker discovers the resources its assigned. See the descriptions above for each of those to see which method works best for your setup. Please take note of <code>spark.resources.coordinateResourcesInStandalone</code> as it indicates whether Spark should handle coordinating resources or if the user has made sure each Worker has separate resources. Also note that if using the resources coordination <code>spark.resources.dir</code> can be used to specify the directory used to do that coordination.
The second part is running an application on Spark Standalone. The only special case from the standard Spark resource configs is when you are running the Driver in client mode. For a Driver in client mode, the user can specify the resources it uses via <code>spark.driver.resourcesfile</code> or <code>spark.driver.resources.{resourceName}.discoveryScript</code>. If the Driver is running on the same host as other Drivers or Workers there are 2 ways to make sure the they don't use the same resources. The user can either configure <code>spark.resources.coordinate.enable</code> on and give all the Driver/Workers the same set or resources and Spark will handle make sure each Driver/Worker has separate resources, or the user can make sure the resources file or discovery script only returns resources the do not conflict with other Drivers or Workers running on the same node.
The second part is running an application on Spark Standalone. The only special case from the standard Spark resource configs is when you are running the Driver in client mode. For a Driver in client mode, the user can specify the resources it uses via <code>spark.driver.resourcesfile</code> or <code>spark.driver.resources.{resourceName}.discoveryScript</code>. If the Driver is running on the same host as other Drivers or Workers there are 2 ways to make sure the they don't use the same resources. The user can either configure <code>spark.resources.coordinateResourcesInStandalone</code> on and give all the Driver/Workers the same set or resources and Spark will handle make sure each Driver/Worker has separate resources, or the user can make sure the resources file or discovery script only returns resources the do not conflict with other Drivers or Workers running on the same node.
Note, the user does not need to specify a discovery script when submitting an application as the Worker will start each Executor with the resources it allocates to it.


@@ -97,7 +97,7 @@ license: |
- Since Spark 3.0, when Avro files are written with user provided non-nullable schema, even the catalyst schema is nullable, Spark is still able to write the files. However, Spark will throw runtime NPE if any of the records contains null.
- Since Spark 3.0, a higher-order function `exists` follows the three-valued boolean logic, i.e., if the `predicate` returns any `null`s and no `true` is obtained, then `exists` will return `null` instead of `false`. For example, `exists(array(1, null, 3), x -> x % 2 == 0)` will be `null`. The previous behaviour can be restored by setting `spark.sql.legacy.arrayExistsFollowsThreeValuedLogic` to `false`.
- Since Spark 3.0, a higher-order function `exists` follows the three-valued boolean logic, i.e., if the `predicate` returns any `null`s and no `true` is obtained, then `exists` will return `null` instead of `false`. For example, `exists(array(1, null, 3), x -> x % 2 == 0)` will be `null`. The previous behaviour can be restored by setting `spark.sql.legacy.followThreeValuedLogicInArrayExists` to `false`.
- Since Spark 3.0, if files or subdirectories disappear during recursive directory listing (i.e. they appear in an intermediate listing but then cannot be read or listed during later phases of the recursive directory listing, due to either concurrent file deletions or object store consistency issues) then the listing will fail with an exception unless `spark.sql.files.ignoreMissingFiles` is `true` (default `false`). In previous versions, these missing files or subdirectories would be ignored. Note that this change of behavior only applies during initial table file listing (or during `REFRESH TABLE`), not during query execution: the net change is that `spark.sql.files.ignoreMissingFiles` is now obeyed during table file listing / query planning, not only at query execution time.
@@ -109,7 +109,7 @@ license: |
- The result of `java.lang.Math`'s `log`, `log1p`, `exp`, `expm1`, and `pow` may vary across platforms. In Spark 3.0, the result of the equivalent SQL functions (including related SQL functions like `LOG10`) return values consistent with `java.lang.StrictMath`. In virtually all cases this makes no difference in the return value, and the difference is very small, but may not exactly match `java.lang.Math` on x86 platforms in cases like, for example, `log(3.0)`, whose value varies between `Math.log()` and `StrictMath.log()`.
- Since Spark 3.0, Dataset query fails if it contains ambiguous column reference that is caused by self join. A typical example: `val df1 = ...; val df2 = df1.filter(...);`, then `df1.join(df2, df1("a") > df2("a"))` returns an empty result which is quite confusing. This is because Spark cannot resolve Dataset column references that point to tables being self joined, and `df1("a")` is exactly the same as `df2("a")` in Spark. To restore the behavior before Spark 3.0, you can set `spark.sql.analyzer.failAmbiguousSelfJoin.enabled` to `false`.
- Since Spark 3.0, Dataset query fails if it contains ambiguous column reference that is caused by self join. A typical example: `val df1 = ...; val df2 = df1.filter(...);`, then `df1.join(df2, df1("a") > df2("a"))` returns an empty result which is quite confusing. This is because Spark cannot resolve Dataset column references that point to tables being self joined, and `df1("a")` is exactly the same as `df2("a")` in Spark. To restore the behavior before Spark 3.0, you can set `spark.sql.analyzer.failAmbiguousSelfJoin` to `false`.
- Since Spark 3.0, `Cast` function processes string literals such as 'Infinity', '+Infinity', '-Infinity', 'NaN', 'Inf', '+Inf', '-Inf' in case insensitive manner when casting the literals to `Double` or `Float` type to ensure greater compatibility with other database systems. This behaviour change is illustrated in the table below:
<table class="table">
@@ -258,13 +258,13 @@ license: |
- Since Spark 3.0, day-time interval strings are converted to intervals with respect to the `from` and `to` bounds. If an input string does not match to the pattern defined by specified bounds, the `ParseException` exception is thrown. For example, `interval '2 10:20' hour to minute` raises the exception because the expected format is `[+|-]h[h]:[m]m`. In Spark version 2.4, the `from` bound was not taken into account, and the `to` bound was used to truncate the resulted interval. For instance, the day-time interval string from the showed example is converted to `interval 10 hours 20 minutes`. To restore the behavior before Spark 3.0, you can set `spark.sql.legacy.fromDayTimeString.enabled` to `true`.
- Since Spark 3.0, negative scale of decimal is not allowed by default, e.g. data type of literal like `1E10BD` is `DecimalType(11, 0)`. In Spark version 2.4 and earlier, it was `DecimalType(2, -9)`. To restore the behavior before Spark 3.0, you can set `spark.sql.legacy.allowNegativeScaleOfDecimal.enabled` to `true`.
- Since Spark 3.0, negative scale of decimal is not allowed by default, e.g. data type of literal like `1E10BD` is `DecimalType(11, 0)`. In Spark version 2.4 and earlier, it was `DecimalType(2, -9)`. To restore the behavior before Spark 3.0, you can set `spark.sql.legacy.allowNegativeScaleOfDecimal` to `true`.
- Since Spark 3.0, the `date_add` and `date_sub` functions only accepts int, smallint, tinyint as the 2nd argument, fractional and string types are not valid anymore, e.g. `date_add(cast('1964-05-23' as date), '12.34')` will cause `AnalysisException`. In Spark version 2.4 and earlier, if the 2nd argument is fractional or string value, it will be coerced to int value, and the result will be a date value of `1964-06-04`.
- Since Spark 3.0, the function `percentile_approx` and its alias `approx_percentile` only accept integral value with range in `[1, 2147483647]` as its 3rd argument `accuracy`, fractional and string types are disallowed, e.g. `percentile_approx(10.0, 0.2, 1.8D)` will cause `AnalysisException`. In Spark version 2.4 and earlier, if `accuracy` is fractional or string value, it will be coerced to an int value, `percentile_approx(10.0, 0.2, 1.8D)` is operated as `percentile_approx(10.0, 0.2, 1)` which results in `10.0`.
- Since Spark 3.0, the properties listing below become reserved, commands will fail if we specify reserved properties in places like `CREATE DATABASE ... WITH DBPROPERTIES` and `ALTER TABLE ... SET TBLPROPERTIES`. We need their specific clauses to specify them, e.g. `CREATE DATABASE test COMMENT 'any comment' LOCATION 'some path'`. We can set `spark.sql.legacy.property.nonReserved` to `true` to ignore the `ParseException`, in this case, these properties will be silently removed, e.g `SET DBPROTERTIES('location'='/tmp')` will affect nothing. In Spark version 2.4 and earlier, these properties are neither reserved nor have side effects, e.g. `SET DBPROTERTIES('location'='/tmp')` will not change the location of the database but only create a headless property just like `'a'='b'`.
- Since Spark 3.0, the properties listing below become reserved, commands will fail if we specify reserved properties in places like `CREATE DATABASE ... WITH DBPROPERTIES` and `ALTER TABLE ... SET TBLPROPERTIES`. We need their specific clauses to specify them, e.g. `CREATE DATABASE test COMMENT 'any comment' LOCATION 'some path'`. We can set `spark.sql.legacy.notReserveProperties` to `true` to ignore the `ParseException`, in this case, these properties will be silently removed, e.g `SET DBPROTERTIES('location'='/tmp')` will affect nothing. In Spark version 2.4 and earlier, these properties are neither reserved nor have side effects, e.g. `SET DBPROTERTIES('location'='/tmp')` will not change the location of the database but only create a headless property just like `'a'='b'`.
<table class="table">
<tr>
<th>


@@ -161,7 +161,7 @@ class ArrowStreamPandasSerializer(ArrowStreamSerializer):
"Array (%s). It can be caused by overflows or other unsafe " + \
"conversions warned by Arrow. Arrow safe type check can be " + \
"disabled by using SQL config " + \
"`spark.sql.execution.pandas.arrowSafeTypeConversion`."
"`spark.sql.execution.pandas.convertToArrowArraySafely`."
raise RuntimeError(error_msg % (s.dtype, t), e)
return array


@@ -211,14 +211,14 @@ class PandasUDFTests(ReusedSQLTestCase):
# Since 0.11.0, PyArrow supports the feature to raise an error for unsafe cast.
with self.sql_conf({
"spark.sql.execution.pandas.arrowSafeTypeConversion": True}):
"spark.sql.execution.pandas.convertToArrowArraySafely": True}):
with self.assertRaisesRegexp(Exception,
"Exception thrown when converting pandas.Series"):
df.select(['A']).withColumn('udf', udf('A')).collect()
# Disabling Arrow safe type check.
with self.sql_conf({
"spark.sql.execution.pandas.arrowSafeTypeConversion": False}):
"spark.sql.execution.pandas.convertToArrowArraySafely": False}):
df.select(['A']).withColumn('udf', udf('A')).collect()
def test_pandas_udf_arrow_overflow(self):
@@ -232,13 +232,13 @@ class PandasUDFTests(ReusedSQLTestCase):
# When enabling safe type check, Arrow 0.11.0+ disallows overflow cast.
with self.sql_conf({
"spark.sql.execution.pandas.arrowSafeTypeConversion": True}):
"spark.sql.execution.pandas.convertToArrowArraySafely": True}):
with self.assertRaisesRegexp(Exception,
"Exception thrown when converting pandas.Series"):
df.withColumn('udf', udf('id')).collect()
# Disabling safe type check, let Arrow do the cast anyway.
with self.sql_conf({"spark.sql.execution.pandas.arrowSafeTypeConversion": False}):
with self.sql_conf({"spark.sql.execution.pandas.convertToArrowArraySafely": False}):
df.withColumn('udf', udf('id')).collect()


@@ -205,13 +205,13 @@ class TypesTests(ReusedSQLTestCase):
def test_negative_decimal(self):
try:
self.spark.sql("set spark.sql.legacy.allowNegativeScaleOfDecimal.enabled=true")
self.spark.sql("set spark.sql.legacy.allowNegativeScaleOfDecimal=true")
df = self.spark.createDataFrame([(1, ), (11, )], ["value"])
ret = df.select(col("value").cast(DecimalType(1, -1))).collect()
actual = list(map(lambda r: int(r.value), ret))
self.assertEqual(actual, [0, 10])
finally:
self.spark.sql("set spark.sql.legacy.allowNegativeScaleOfDecimal.enabled=false")
self.spark.sql("set spark.sql.legacy.allowNegativeScaleOfDecimal=false")
def test_create_dataframe_from_objects(self):
data = [MyObject(1, "1"), MyObject(2, "2")]


@@ -306,7 +306,7 @@ def read_udfs(pickleSer, infile, eval_type):
# NOTE: if timezone is set here, that implies respectSessionTimeZone is True
timezone = runner_conf.get("spark.sql.session.timeZone", None)
safecheck = runner_conf.get("spark.sql.execution.pandas.arrowSafeTypeConversion",
safecheck = runner_conf.get("spark.sql.execution.pandas.convertToArrowArraySafely",
"false").lower() == 'true'
# Used by SQL_GROUPED_MAP_PANDAS_UDF and SQL_SCALAR_PANDAS_UDF when returning StructType
assign_cols_by_name = runner_conf.get(


@@ -368,14 +368,14 @@ object SQLConf {
.createWithDefault(false)
val REDUCE_POST_SHUFFLE_PARTITIONS_ENABLED =
buildConf("spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled")
buildConf("spark.sql.adaptive.shuffle.reducePostShufflePartitions")
.doc(s"When true and '${ADAPTIVE_EXECUTION_ENABLED.key}' is enabled, this enables reducing " +
"the number of post-shuffle partitions based on map output statistics.")
.booleanConf
.createWithDefault(true)
val FETCH_SHUFFLE_BLOCKS_IN_BATCH_ENABLED =
buildConf("spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled")
buildConf("spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch")
.doc("Whether to fetch the continuous shuffle blocks in batch. Instead of fetching blocks " +
"one by one, fetching continuous shuffle blocks for the same map task in batch can " +
"reduce IO and improve performance. Note, multiple continuous blocks exist in single " +
@@ -425,7 +425,7 @@ object SQLConf {
.createWithDefault(true)
val ADAPTIVE_EXECUTION_SKEWED_JOIN_ENABLED =
buildConf("spark.sql.adaptive.optimizeSkewedJoin.enabled")
buildConf("spark.sql.adaptive.skewedJoinOptimization.enabled")
.doc("When true and adaptive execution is enabled, a skewed join is automatically handled at " +
"runtime.")
.booleanConf
@@ -894,7 +894,7 @@ object SQLConf {
.createWithDefault(10000)
val IGNORE_DATA_LOCALITY =
buildConf("spark.sql.sources.ignoreDataLocality.enabled")
buildConf("spark.sql.sources.ignoreDataLocality")
.doc("If true, Spark will not fetch the block locations for each file on " +
"listing files. This speeds up file listing, but the scheduler cannot " +
"schedule tasks to take advantage of data locality. It can be particularly " +
@@ -913,7 +913,7 @@ object SQLConf {
.createWithDefault(true)
val FAIL_AMBIGUOUS_SELF_JOIN_ENABLED =
buildConf("spark.sql.analyzer.failAmbiguousSelfJoin.enabled")
buildConf("spark.sql.analyzer.failAmbiguousSelfJoin")
.doc("When true, fail the Dataset query if it contains ambiguous self-join.")
.internal()
.booleanConf
@@ -1062,7 +1062,7 @@ object SQLConf {
.booleanConf
.createWithDefault(true)
val SUBQUERY_REUSE_ENABLED = buildConf("spark.sql.execution.subquery.reuse.enabled")
val SUBQUERY_REUSE_ENABLED = buildConf("spark.sql.execution.reuseSubquery")
.internal()
.doc("When true, the planner will try to find out duplicated subqueries and re-use them.")
.booleanConf
@@ -1102,7 +1102,7 @@ object SQLConf {
.createOptional
val FORCE_DELETE_TEMP_CHECKPOINT_LOCATION =
buildConf("spark.sql.streaming.forceDeleteTempCheckpointLocation.enabled")
buildConf("spark.sql.streaming.forceDeleteTempCheckpointLocation")
.doc("When true, enable temporary checkpoint locations force delete.")
.booleanConf
.createWithDefault(false)
@@ -1629,7 +1629,7 @@ object SQLConf {
.createWithDefault(true)
val PANDAS_ARROW_SAFE_TYPE_CONVERSION =
buildConf("spark.sql.execution.pandas.arrowSafeTypeConversion")
buildConf("spark.sql.execution.pandas.convertToArrowArraySafely")
.internal()
.doc("When true, Arrow will perform safe type conversion when converting " +
"Pandas.Series to Arrow array during serialization. Arrow will raise errors " +
@@ -1959,7 +1959,7 @@ object SQLConf {
.createWithDefault(false)
val LEGACY_ALLOW_NEGATIVE_SCALE_OF_DECIMAL_ENABLED =
buildConf("spark.sql.legacy.allowNegativeScaleOfDecimal.enabled")
buildConf("spark.sql.legacy.allowNegativeScaleOfDecimal")
.internal()
.doc("When set to true, negative scale of Decimal type is allowed. For example, " +
"the type of number 1E10BD under legacy mode is DecimalType(2, -9), but is " +
@@ -2100,7 +2100,7 @@ object SQLConf {
.stringConf
.createOptional
val LEGACY_LOOSE_UPCAST = buildConf("spark.sql.legacy.looseUpcast")
val LEGACY_LOOSE_UPCAST = buildConf("spark.sql.legacy.doLooseUpcast")
.internal()
.doc("When true, the upcast will be loose and allows string to atomic types.")
.booleanConf
@@ -2122,7 +2122,7 @@ object SQLConf {
.createWithDefault(LegacyBehaviorPolicy.EXCEPTION.toString)
val LEGACY_ARRAY_EXISTS_FOLLOWS_THREE_VALUED_LOGIC =
buildConf("spark.sql.legacy.arrayExistsFollowsThreeValuedLogic")
buildConf("spark.sql.legacy.followThreeValuedLogicInArrayExists")
.internal()
.doc("When true, the ArrayExists will follow the three-valued boolean logic.")
.booleanConf
@@ -2149,7 +2149,7 @@ object SQLConf {
.createWithDefault(false)
val LEGACY_PROPERTY_NON_RESERVED =
buildConf("spark.sql.legacy.property.nonReserved")
buildConf("spark.sql.legacy.notReserveProperties")
.internal()
.doc("When true, all database and table properties are not reserved and available for " +
"create/alter syntaxes. But please be aware that the reserved properties will be " +


@@ -159,7 +159,7 @@ object DecimalType extends AbstractDataType {
private[sql] def checkNegativeScale(scale: Int): Unit = {
if (scale < 0 && !SQLConf.get.allowNegativeScaleOfDecimalEnabled) {
throw new AnalysisException(s"Negative scale is not allowed: $scale. " +
s"You can use spark.sql.legacy.allowNegativeScaleOfDecimal.enabled=true " +
s"You can use spark.sql.legacy.allowNegativeScaleOfDecimal=true " +
s"to enable legacy mode to allow it.")
}
}