---
layout: global
title: Spark SQL Upgrading Guide
displayTitle: Spark SQL Upgrading Guide
---

* Table of contents
{:toc}

## Upgrading From Spark SQL 2.4 to 3.0

  - In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder tried to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder no longer updates the configurations. This is the same behavior as the Java/Scala API in 2.3 and above. If you want to update them, you need to do so prior to creating a `SparkSession`.

  - In Spark version 2.4 and earlier, the parser of the JSON data source treats empty strings as null for some data types such as `IntegerType`. For `FloatType` and `DoubleType`, it fails on empty strings and throws exceptions. Since Spark 3.0, we disallow empty strings and will throw exceptions for data types except for `StringType` and `BinaryType`.

  - Since Spark 3.0, the `from_json` function supports two modes - `PERMISSIVE` and `FAILFAST`. The modes can be set via the `mode` option. The default mode became `PERMISSIVE`. In previous versions, the behavior of `from_json` conformed to neither `PERMISSIVE` nor `FAILFAST`, especially in processing of malformed JSON records. For example, the JSON string `{"a" 1}` with the schema `a INT` is converted to `null` by previous versions, but Spark 3.0 converts it to `Row(null)`.

## Upgrading From Spark SQL 2.3 to 2.4

  - In Spark version 2.3 and earlier, the second parameter to the `array_contains` function is implicitly promoted to the element type of the first, array type parameter. This type promotion can be lossy and may cause `array_contains` to return a wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and is illustrated in the table below.
| Query | Result Spark 2.3 or Prior | Result Spark 2.4 | Remarks |
| --- | --- | --- | --- |
| `SELECT array_contains(array(1), 1.34D);` | `true` | `false` | In Spark 2.4, left and right parameters are promoted to array(double) and double type respectively. |
| `SELECT array_contains(array(1), '1');` | `true` | `AnalysisException` is thrown since integer type can not be promoted to string type in a loss-less manner. | Users can use explicit cast |
| `SELECT array_contains(array(1), 'anystring');` | `null` | `AnalysisException` is thrown since integer type can not be promoted to string type in a loss-less manner. | Users can use explicit cast |
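As the remarks note, an explicit cast resolves the `AnalysisException` in Spark 2.4. A minimal sketch (the cast targets below are one possible choice; any cast that makes the two sides type-compatible works):

```sql
-- Throws AnalysisException in Spark 2.4: integer cannot be promoted to
-- string in a loss-less manner.
-- SELECT array_contains(array(1), '1');

-- Works: explicitly cast the second argument to the array's element type.
SELECT array_contains(array(1), CAST('1' AS INT));

-- Alternatively, cast the array elements to string instead.
SELECT array_contains(array(CAST(1 AS STRING)), '1');
```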
InputA \ InputB | NullType | IntegerType | LongType | DecimalType(38,0)* | DoubleType | DateType | TimestampType | StringType |
---|---|---|---|---|---|---|---|---|
NullType | NullType | IntegerType | LongType | DecimalType(38,0) | DoubleType | DateType | TimestampType | StringType |
IntegerType | IntegerType | IntegerType | LongType | DecimalType(38,0) | DoubleType | StringType | StringType | StringType |
LongType | LongType | LongType | LongType | DecimalType(38,0) | StringType | StringType | StringType | StringType |
DecimalType(38,0)* | DecimalType(38,0) | DecimalType(38,0) | DecimalType(38,0) | DecimalType(38,0) | StringType | StringType | StringType | StringType |
DoubleType | DoubleType | DoubleType | StringType | StringType | DoubleType | StringType | StringType | StringType |
DateType | DateType | StringType | StringType | StringType | StringType | DateType | TimestampType | StringType |
TimestampType | TimestampType | StringType | StringType | StringType | StringType | TimestampType | TimestampType | StringType |
StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType |
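Returning to the Spark 3.0 `from_json` change described above, a minimal SQL sketch of the two modes (assumes Spark 3.0+, where the `mode` option is passed via the options map):

```sql
-- PERMISSIVE (the default in 3.0): the malformed record {"a" 1} produces a
-- row with a null field, i.e. Row(null), instead of a plain null.
SELECT from_json('{"a" 1}', 'a INT');

-- FAILFAST: the same malformed record raises an exception instead.
SELECT from_json('{"a" 1}', 'a INT', map('mode', 'FAILFAST'));
```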