{% highlight scala %}
val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
{% endhighlight %}

{% highlight java %}
Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
{% endhighlight %}

{% highlight python %}
df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
{% endhighlight %}

{% highlight r %}
df <- read.df("examples/src/main/resources/users.avro", "avro")
write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
{% endhighlight %}

{% highlight scala %}
import java.nio.file.{Files, Paths}

import org.apache.spark.sql.avro.functions._
import spark.implicits._ // enables the $"..." column syntax

// `from_avro` requires Avro schema in JSON string format.
val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")))

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()

// 1. Decode the Avro data into a struct;
// 2. Filter by column `favorite_color`;
// 3. Encode the column `name` in Avro format.
val output = df
  .select(from_avro($"value", jsonFormatSchema).as("user"))
  .where("user.favorite_color == \"red\"")
  .select(to_avro($"user.name").as("value"))

val query = output
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic2")
  .start()
{% endhighlight %}

{% highlight java %}
import java.nio.file.Files;
import java.nio.file.Paths;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.avro.functions.*;

// `from_avro` requires Avro schema in JSON string format.
String jsonFormatSchema = new String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")));

Dataset<Row> df = spark
  .readStream()
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load();

// 1. Decode the Avro data into a struct;
// 2. Filter by column `favorite_color`;
// 3. Encode the column `name` in Avro format.
Dataset<Row> output = df
  .select(from_avro(col("value"), jsonFormatSchema).as("user"))
  .where("user.favorite_color == \"red\"")
  .select(to_avro(col("user.name")).as("value"));

StreamingQuery query = output
  .writeStream()
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic2")
  .start();
{% endhighlight %}

{% highlight python %}
from pyspark.sql.avro.functions import from_avro, to_avro

# `from_avro` requires Avro schema in JSON string format.
jsonFormatSchema = open("examples/src/main/resources/user.avsc", "r").read()

df = spark\
  .readStream\
  .format("kafka")\
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")\
  .option("subscribe", "topic1")\
  .load()

# 1. Decode the Avro data into a struct;
# 2. Filter by column `favorite_color`;
# 3. Encode the column `name` in Avro format.
output = df\
  .select(from_avro("value", jsonFormatSchema).alias("user"))\
  .where('user.favorite_color == "red"')\
  .select(to_avro("user.name").alias("value"))

query = output\
  .writeStream\
  .format("kafka")\
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")\
  .option("topic", "topic2")\
  .start()
{% endhighlight %}

{% highlight r %}
# `from_avro` requires Avro schema in JSON string format.
jsonFormatSchema <- paste0(readLines("examples/src/main/resources/user.avsc"), collapse=" ")

df <- read.stream(
  "kafka",
  kafka.bootstrap.servers = "host1:port1,host2:port2",
  subscribe = "topic1"
)

# 1. Decode the Avro data into a struct;
# 2. Filter by column `favorite_color`;
# 3. Encode the column `name` in Avro format.
output <- select(
  filter(
    select(df, alias(from_avro("value", jsonFormatSchema), "user")),
    column("user.favorite_color") == "red"
  ),
  alias(to_avro("user.name"), "value")
)

write.stream(
  output,
  "kafka",
  kafka.bootstrap.servers = "host1:port1,host2:port2",
  topic = "topic2"
)
{% endhighlight %}

Property Name	Default	Meaning	Scope	Since Version
`avroSchema`	None	Optional schema provided by the user in JSON format. When reading Avro, this option can be set to an evolved schema that is compatible with, but different from, the actual Avro schema. The deserialization schema will be consistent with the evolved schema. For example, if the evolved schema contains one additional column with a default value, the result read by Spark will contain the new column too. When writing Avro, this option can be set if the expected output Avro schema doesn't match the schema converted by Spark. For example, the expected schema of a column may be of "enum" type instead of the "string" type in the default converted schema.	read, write and function `from_avro`	2.4.0
`recordName`	topLevelRecord	Top-level record name in the write result, which is required by the Avro spec.	write	2.4.0
`recordNamespace`	""	Record namespace in the write result.	write	2.4.0
`ignoreExtension`	true	Controls whether files without the `.avro` extension are ignored on read. If the option is enabled, all files (with and without the `.avro` extension) are loaded. The option is deprecated and will be removed in a future release. Use the general data source option `pathGlobFilter` to filter file names instead.	read	2.4.0
`compression`	snappy	Specifies the compression codec used on write. Currently supported codecs are `uncompressed`, `snappy`, `deflate`, `bzip2` and `xz`. If the option is not set, the `spark.sql.avro.compression.codec` configuration is used.	write	2.4.0
`mode`	FAILFAST	Specifies the parse mode for the function `from_avro`. Currently supported modes are: `FAILFAST`: throws an exception when processing a corrupted record. `PERMISSIVE`: corrupt records are processed as a null result; therefore, the data schema is forced to be fully nullable, which might differ from the one the user provided.	function `from_avro`	2.4.0
`datetimeRebaseMode`	(value of `spark.sql.avro.datetimeRebaseModeInRead` configuration)	Specifies the rebasing mode for values of the `date`, `timestamp-micros`, `timestamp-millis` logical types from the Julian to the Proleptic Gregorian calendar. Currently supported modes are: `EXCEPTION`: fails when reading ancient dates/timestamps that are ambiguous between the two calendars. `CORRECTED`: loads dates/timestamps without rebasing. `LEGACY`: rebases ancient dates/timestamps from the Julian to the Proleptic Gregorian calendar.	read and function `from_avro`	3.2.0
`positionalFieldMatching`	false	This can be used in tandem with the `avroSchema` option to adjust the behavior for matching the fields in the provided Avro schema with those in the SQL schema. By default, the matching will be performed using field names, ignoring their positions. If this option is set to "true", the matching will be based on the position of the fields.	read and write	3.2.0
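
As a rough illustration of the `avroSchema` and `mode` options in the table above, the sketch below builds an evolved reader schema and passes it to the Avro reader and to `from_avro`. The writer schema (`User` with `name` and `favorite_color`), the added `country` field, and the helper names are assumptions made for this example; the PySpark calls run only when a `SparkSession` or DataFrame is actually passed in.

```python
import json

# Hypothetical writer schema; the real schema of your files may differ.
WRITER_SCHEMA = {
    "type": "record",
    "name": "User",
    "namespace": "example.avro",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_color", "type": ["null", "string"], "default": None},
    ],
}


def evolve_schema(schema, new_field):
    """Return a copy of `schema` with one extra field appended."""
    if "default" not in new_field:
        raise ValueError("an added field needs a default value to stay compatible")
    evolved = dict(schema)
    evolved["fields"] = list(schema["fields"]) + [new_field]
    return evolved


# Evolved schema: one additional (hypothetical) column with a default value.
EVOLVED_SCHEMA_JSON = json.dumps(
    evolve_schema(
        WRITER_SCHEMA,
        {"name": "country", "type": "string", "default": "unknown"},
    )
)


def read_with_evolved_schema(spark, path):
    """Read Avro files using the evolved schema via the `avroSchema` option."""
    return (
        spark.read.format("avro")
        .option("avroSchema", EVOLVED_SCHEMA_JSON)
        .load(path)
    )


def decode_permissive(df, column="value"):
    """Decode an Avro binary column; corrupt records become null results."""
    # Imported lazily because it needs PySpark with the Avro module available.
    from pyspark.sql.avro.functions import from_avro

    options = {"mode": "PERMISSIVE"}
    return df.select(from_avro(column, EVOLVED_SCHEMA_JSON, options).alias("user"))
```

With `PERMISSIVE`, the decoded `user` column should be treated as fully nullable, as the table notes, so downstream code has to handle null structs.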

Property Name	Default	Meaning	Since Version
spark.sql.legacy.replaceDatabricksSparkAvro.enabled	true	If it is set to true, the data source provider `com.databricks.spark.avro` is mapped to the built-in but external Avro data source module for backward compatibility. Note: the SQL config has been deprecated in Spark 3.2 and might be removed in the future.	2.4.0
spark.sql.avro.compression.codec	snappy	Compression codec used when writing Avro files. Supported codecs: uncompressed, deflate, snappy, bzip2 and xz. The default codec is snappy.	2.4.0
spark.sql.avro.deflate.level	-1	Compression level for the deflate codec used when writing Avro files. Valid values are 1 to 9 inclusive, or -1. The default value is -1, which corresponds to level 6 in the current implementation.	2.4.0
spark.sql.avro.datetimeRebaseModeInRead	`EXCEPTION`	The rebasing mode for values of the `date`, `timestamp-micros`, `timestamp-millis` logical types from the Julian to the Proleptic Gregorian calendar: `EXCEPTION`: Spark will fail the read if it sees ancient dates/timestamps that are ambiguous between the two calendars. `CORRECTED`: Spark will not rebase and will read the dates/timestamps as they are. `LEGACY`: Spark will rebase dates/timestamps from the legacy hybrid (Julian + Gregorian) calendar to the Proleptic Gregorian calendar when reading Avro files. This config is only effective if the writer info (like Spark, Hive) of the Avro files is unknown.	3.0.0
spark.sql.avro.datetimeRebaseModeInWrite	`EXCEPTION`	The rebasing mode for values of the `date`, `timestamp-micros`, `timestamp-millis` logical types from the Proleptic Gregorian to the Julian calendar: `EXCEPTION`: Spark will fail the write if it sees ancient dates/timestamps that are ambiguous between the two calendars. `CORRECTED`: Spark will not rebase and will write the dates/timestamps as they are. `LEGACY`: Spark will rebase dates/timestamps from the Proleptic Gregorian calendar to the legacy hybrid (Julian + Gregorian) calendar when writing Avro files.	3.0.0
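
The session-level settings above could be applied along the lines of the following sketch. The helper names and the chosen values are illustrative assumptions; only the configuration keys, the codec list, and the valid deflate range come from the table.

```python
# Illustrative values; the keys are the SQL configs listed in the table above.
AVRO_WRITE_CONF = {
    "spark.sql.avro.compression.codec": "deflate",
    "spark.sql.avro.deflate.level": "5",
    "spark.sql.avro.datetimeRebaseModeInWrite": "CORRECTED",
}

SUPPORTED_CODECS = {"uncompressed", "deflate", "snappy", "bzip2", "xz"}


def is_valid_deflate_level(level):
    """True for -1 (the implementation default) or any level from 1 to 9."""
    return level == -1 or 1 <= level <= 9


def validate_conf(conf):
    """Raise ValueError if the codec or deflate level is outside the documented set."""
    codec = conf.get("spark.sql.avro.compression.codec", "snappy")
    if codec not in SUPPORTED_CODECS:
        raise ValueError(f"unsupported Avro codec: {codec}")
    level = int(conf.get("spark.sql.avro.deflate.level", "-1"))
    if not is_valid_deflate_level(level):
        raise ValueError(f"invalid deflate level: {level}")


def apply_conf(spark, conf):
    """Validate the settings, then set each one on an existing SparkSession."""
    validate_conf(conf)
    for key, value in conf.items():
        spark.conf.set(key, value)
```

Calling `apply_conf(spark, AVRO_WRITE_CONF)` before a write would make later Avro writes in that session use these settings unless a per-write option such as `compression` overrides them.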

Avro type	Spark SQL type
boolean	BooleanType
int	IntegerType
long	LongType
float	FloatType
double	DoubleType
string	StringType
enum	StringType
fixed	BinaryType
bytes	BinaryType
record	StructType
array	ArrayType
map	MapType
union	See below

Avro logical type	Avro type	Spark SQL type
date	int	DateType
timestamp-millis	long	TimestampType
timestamp-micros	long	TimestampType
decimal	fixed	DecimalType
decimal	bytes	DecimalType

Spark SQL type	Avro type	Avro logical type
ByteType	int
ShortType	int
BinaryType	bytes
DateType	int	date
TimestampType	long	timestamp-micros
DecimalType	fixed	decimal

Spark SQL type	Avro type	Avro logical type
BinaryType	fixed
StringType	enum
TimestampType	long	timestamp-millis
DecimalType	bytes	decimal
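
Assuming the table above lists conversions that apply only when an output schema is supplied through the `avroSchema` option, the sketch below writes a string column as an Avro `enum`. The enum name `Color`, its symbols, and the helper name are hypothetical; the DataFrame is assumed to have string columns `name` and `favorite_color`.

```python
import json

# Hypothetical output schema: `favorite_color` is an enum rather than a string.
OUTPUT_SCHEMA_JSON = json.dumps(
    {
        "type": "record",
        "name": "topLevelRecord",
        "fields": [
            {"name": "name", "type": "string"},
            {
                "name": "favorite_color",
                "type": {
                    "type": "enum",
                    "name": "Color",
                    "symbols": ["red", "green", "blue"],
                },
            },
        ],
    }
)


def write_with_enum(df, path):
    """Write `df` as Avro, using the user-specified schema above."""
    (
        df.select("name", "favorite_color")
        .write.format("avro")
        .option("avroSchema", OUTPUT_SCHEMA_JSON)
        .save(path)
    )
```

Rows whose `favorite_color` is null or not one of the declared symbols should be expected to fail the write, so it is worth filtering or validating the column first.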
