[SPARK-27506][SQL][FOLLOWUP] Use option avroSchema to specify an evolved schema in from_avro
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/26780. In that PR, a new Avro data source option `actualSchema` was introduced for setting the original Avro schema in the function `from_avro`, while the expected (evolved) schema was supposed to be set in the parameter `jsonFormatSchema` of `from_avro`. However, there is already another Avro data source option `avroSchema`, which is used for setting the expected schema in reading and writing. This PR uses the existing `avroSchema` option for reading Avro data with an evolved schema and removes the new `actualSchema` option.

### Why are the changes needed?

Unify and simplify the Avro data source options.

### Does this PR introduce any user-facing change?

Yes. To deserialize Avro data with an evolved schema, before the changes:
```
from_avro('col, expectedSchema, ("actualSchema" -> actualSchema))
```
After the changes:
```
from_avro('col, actualSchema, ("avroSchema" -> expectedSchema))
```
The second parameter is always the actual Avro schema after the changes.

### How was this patch tested?

Updated the existing tests from https://github.com/apache/spark/pull/26780.

Closes #27045 from gengliangwang/renameAvroOption.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
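For reference, a minimal sketch of the new usage in Scala. The DataFrame `df`, its binary column `value`, and the record schemas are hypothetical; only the `from_avro(Column, String, java.util.Map[String, String])` signature and the `avroSchema` option come from this change:

```scala
import scala.collection.JavaConverters._

import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.col

// Writer (actual) schema that the binary Avro payload was encoded with.
val actualSchema =
  """{"type":"record","name":"person","fields":[
    |  {"name":"name","type":"string"}
    |]}""".stripMargin

// Evolved (reader) schema adding one column with a default value.
val evolvedSchema =
  """{"type":"record","name":"person","fields":[
    |  {"name":"name","type":"string"},
    |  {"name":"age","type":"int","default":-1}
    |]}""".stripMargin

// After this PR the second argument is always the actual schema, and the
// evolved schema is passed through the existing avroSchema option.
val decoded = df.select(
  from_avro(col("value"), actualSchema, Map("avroSchema" -> evolvedSchema).asJava).as("person"))
```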
parent a90ad5bf2a · commit 07593d362f
```diff
@@ -198,9 +198,22 @@ Data source options of Avro can be set via:
   <tr>
     <td><code>avroSchema</code></td>
     <td>None</td>
-    <td>Optional Avro schema provided by a user in JSON format. The data type and naming of record fields
-    should match the Avro data type when reading from Avro or match the Spark's internal data type (e.g., StringType, IntegerType) when writing to Avro files; otherwise, the read/write action will fail.</td>
-    <td>read and write</td>
+    <td>Optional schema provided by a user in JSON format.
+      <ul>
+        <li>
+          When reading Avro, this option can be set to an evolved schema, which is compatible but different with
+          the actual Avro schema. The deserialization schema will be consistent with the evolved schema.
+          For example, if we set an evolved schema containing one additional column with a default value,
+          the reading result in Spark will contain the new column too.
+        </li>
+        <li>
+          When writing Avro, this option can be set if the expected output Avro schema doesn't match the
+          schema converted by Spark. For example, the expected schema of one column is of "enum" type,
+          instead of "string" type in the default converted schema.
+        </li>
+      </ul>
+    </td>
+    <td>read, write and function <code>from_avro</code></td>
   </tr>
   <tr>
     <td><code>recordName</code></td>
```
```diff
@@ -240,15 +253,6 @@ Data source options of Avro can be set via:
     </td>
     <td>function <code>from_avro</code></td>
   </tr>
-  <tr>
-    <td><code>actualSchema</code></td>
-    <td>None</td>
-    <td>Optional Avro schema (in JSON format) that was used to serialize the data. This should be set if the schema provided
-    for deserialization is compatible with - but not the same as - the one used to originally convert the data to Avro.
-    For more information on Avro's schema evolution and compatibility, please refer to the [documentation of Confluent](https://docs.confluent.io/current/schema-registry/avro.html).
-    </td>
-    <td>function <code>from_avro</code></td>
-  </tr>
 </table>

 ## Configuration
```
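As a usage note for the updated scope column ("read, write and function `from_avro`"), the same `avroSchema` option already applies to the DataFrame reader and writer. A hedged sketch — the paths, schema, and `spark` session below are illustrative, not taken from this commit:

```scala
// Hypothetical evolved schema with one extra defaulted column.
val evolvedSchema =
  """{"type":"record","name":"person","fields":[
    |  {"name":"name","type":"string"},
    |  {"name":"age","type":"int","default":-1}
    |]}""".stripMargin

// Read existing Avro files against the evolved schema; the extra column is
// filled with its default value.
val people = spark.read
  .format("avro")
  .option("avroSchema", evolvedSchema)
  .load("/tmp/people_avro")

// Write with a caller-supplied schema instead of the one Spark derives from
// the DataFrame's Catalyst types.
people.write
  .format("avro")
  .option("avroSchema", evolvedSchema)
  .save("/tmp/people_avro_out")
```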
```diff
@@ -39,7 +39,7 @@ case class AvroDataToCatalyst(
   override def inputTypes: Seq[AbstractDataType] = Seq(BinaryType)

   override lazy val dataType: DataType = {
-    val dt = SchemaConverters.toSqlType(avroSchema).dataType
+    val dt = SchemaConverters.toSqlType(expectedSchema).dataType
     parseMode match {
       // With PermissiveMode, the output Catalyst row might contain columns of null values for
       // corrupt records, even if some of the columns are not nullable in the user-provided schema.
```
```diff
@@ -53,14 +53,15 @@ case class AvroDataToCatalyst(

   private lazy val avroOptions = AvroOptions(options)

-  @transient private lazy val avroSchema = new Schema.Parser().parse(jsonFormatSchema)
+  @transient private lazy val actualSchema = new Schema.Parser().parse(jsonFormatSchema)

-  @transient private lazy val reader = avroOptions.actualSchema
-    .map(actualSchema =>
-      new GenericDatumReader[Any](new Schema.Parser().parse(actualSchema), avroSchema))
-    .getOrElse(new GenericDatumReader[Any](avroSchema))
+  @transient private lazy val expectedSchema = avroOptions.schema
+    .map(expectedSchema => new Schema.Parser().parse(expectedSchema))
+    .getOrElse(actualSchema)
+
+  @transient private lazy val reader = new GenericDatumReader[Any](actualSchema, expectedSchema)

-  @transient private lazy val deserializer = new AvroDeserializer(avroSchema, dataType)
+  @transient private lazy val deserializer = new AvroDeserializer(expectedSchema, dataType)

   @transient private var decoder: BinaryDecoder = _

```
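For background on what the two-argument `GenericDatumReader` does here: this is plain Avro schema resolution, where a writer (actual) schema is reconciled against a reader (expected) schema at decode time and defaulted fields are filled in. A standalone sketch outside of Spark, with made-up schemas, assuming the Avro library is on the classpath:

```scala
import java.io.ByteArrayOutputStream

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

val writerSchema = new Schema.Parser().parse(
  """{"type":"record","name":"person","fields":[{"name":"name","type":"string"}]}""")
val readerSchema = new Schema.Parser().parse(
  """{"type":"record","name":"person","fields":[
    |  {"name":"name","type":"string"},
    |  {"name":"age","type":"int","default":-1}]}""".stripMargin)

// Encode one record with the writer (actual) schema.
val record = new GenericData.Record(writerSchema)
record.put("name", "alice")
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
new GenericDatumWriter[GenericRecord](writerSchema).write(record, encoder)
encoder.flush()

// Decode with both schemas: Avro reconciles them, so the "age" field that is
// missing from the writer schema comes back as its default value (-1).
val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
val reader = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
val decoded = reader.read(null, decoder)
assert(decoded.get("age") == -1)
```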
```diff
@@ -36,18 +36,19 @@ class AvroOptions(
   }

   /**
-   * Optional schema provided by an user in JSON format.
+   * Optional schema provided by a user in JSON format.
+   *
+   * When reading Avro, this option can be set to an evolved schema, which is compatible but
+   * different with the actual Avro schema. The deserialization schema will be consistent with
+   * the evolved schema. For example, if we set an evolved schema containing one additional
+   * column with a default value, the reading result in Spark will contain the new column too.
+   *
+   * When writing Avro, this option can be set if the expected output Avro schema doesn't match the
+   * schema converted by Spark. For example, the expected schema of one column is of "enum" type,
+   * instead of "string" type in the default converted schema.
    */
   val schema: Option[String] = parameters.get("avroSchema")

-  /**
-   * Optional Avro schema (in JSON format) that was used to serialize the data.
-   * This should be set if the schema provided for deserialization is compatible
-   * with - but not the same as - the one used to originally convert the data to Avro.
-   * See SPARK-27506 for more details.
-   */
-  val actualSchema: Option[String] = parameters.get("actualSchema")
-
   /**
    * Top level record name in write result, which is required in Avro spec.
    * See https://avro.apache.org/docs/1.8.2/spec.html#schema_record .
```
```diff
@@ -45,10 +45,11 @@ object functions {
   }

   /**
-   * Converts a binary column of Avro format into its corresponding catalyst value. If a schema is
-   * provided via the option actualSchema, a different (but compatible) schema can be used for
-   * reading. If no actualSchema option is provided, the specified schema must match the read data,
-   * otherwise the behavior is undefined: it may fail or return arbitrary result.
+   * Converts a binary column of Avro format into its corresponding catalyst value.
+   * The specified schema must match actual schema of the read data, otherwise the behavior
+   * is undefined: it may fail or return arbitrary result.
+   * To deserialize the data with a compatible and evolved schema, the expected Avro schema can be
+   * set via the option avroSchema.
    *
    * @param data the binary column.
    * @param jsonFormatSchema the avro schema in JSON string format.
```
```diff
@@ -197,8 +197,8 @@ class AvroFunctionsSuite extends QueryTest with SharedSparkSession {
       avroStructDF.select(
         functions.from_avro(
           'avro,
-          evolvedAvroSchema,
-          Map("actualSchema" -> actualAvroSchema).asJava)),
+          actualAvroSchema,
+          Map("avroSchema" -> evolvedAvroSchema).asJava)),
       expected)
   }
 }
```
```diff
@@ -30,10 +30,11 @@ from pyspark.util import _print_missing_jar
 @since(3.0)
 def from_avro(data, jsonFormatSchema, options={}):
     """
-    Converts a binary column of Avro format into its corresponding catalyst value. If a schema is
-    provided via the option actualSchema, a different (but compatible) schema can be used for
-    reading. If no actualSchema option is provided, the specified schema must match the read data,
-    otherwise the behavior is undefined: it may fail or return arbitrary result.
+    Converts a binary column of Avro format into its corresponding catalyst value.
+    The specified schema must match the read data, otherwise the behavior is undefined:
+    it may fail or return arbitrary result.
+    To deserialize the data with a compatible and evolved schema, the expected Avro schema can be
+    set via the option avroSchema.

     Note: Avro is built-in but external data source module since Spark 2.4. Please deploy the
     application as per the deployment section of "Apache Avro Data Source Guide".
```