de42281527
## What changes were proposed in this pull request? Fix Typos. ## How was this patch tested? NA Closes #23145 from kjmrknsn/docUpdate. Authored-by: Keiji Yoshida <kjmrknsn@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>
642 lines
15 KiB
Markdown
642 lines
15 KiB
Markdown
---
|
|
layout: global
|
|
title: Reference
|
|
displayTitle: Reference
|
|
---
|
|
|
|
* Table of contents
|
|
{:toc}
|
|
|
|
## Data Types
|
|
|
|
Spark SQL and DataFrames support the following data types:
|
|
|
|
* Numeric types
|
|
- `ByteType`: Represents 1-byte signed integer numbers.
|
|
The range of numbers is from `-128` to `127`.
|
|
- `ShortType`: Represents 2-byte signed integer numbers.
|
|
The range of numbers is from `-32768` to `32767`.
|
|
- `IntegerType`: Represents 4-byte signed integer numbers.
|
|
The range of numbers is from `-2147483648` to `2147483647`.
|
|
- `LongType`: Represents 8-byte signed integer numbers.
|
|
The range of numbers is from `-9223372036854775808` to `9223372036854775807`.
|
|
- `FloatType`: Represents 4-byte single-precision floating point numbers.
|
|
- `DoubleType`: Represents 8-byte double-precision floating point numbers.
|
|
- `DecimalType`: Represents arbitrary-precision signed decimal numbers. Backed internally by `java.math.BigDecimal`. A `BigDecimal` consists of an arbitrary precision integer unscaled value and a 32-bit integer scale.
|
|
* String type
|
|
- `StringType`: Represents character string values.
|
|
* Binary type
|
|
- `BinaryType`: Represents byte sequence values.
|
|
* Boolean type
|
|
- `BooleanType`: Represents boolean values.
|
|
* Datetime type
|
|
- `TimestampType`: Represents values comprising values of fields year, month, day,
|
|
hour, minute, and second.
|
|
- `DateType`: Represents values comprising values of fields year, month, day.
|
|
* Complex types
|
|
- `ArrayType(elementType, containsNull)`: Represents values comprising a sequence of
|
|
elements with the type of `elementType`. `containsNull` is used to indicate if
|
|
elements in a `ArrayType` value can have `null` values.
|
|
- `MapType(keyType, valueType, valueContainsNull)`:
|
|
Represents values comprising a set of key-value pairs. The data type of keys is
|
|
described by `keyType` and the data type of values is described by `valueType`.
|
|
For a `MapType` value, keys are not allowed to have `null` values. `valueContainsNull`
|
|
is used to indicate if values of a `MapType` value can have `null` values.
|
|
- `StructType(fields)`: Represents values with the structure described by
|
|
a sequence of `StructField`s (`fields`).
|
|
* `StructField(name, dataType, nullable)`: Represents a field in a `StructType`.
|
|
The name of a field is indicated by `name`. The data type of a field is indicated
|
|
by `dataType`. `nullable` is used to indicate if values of these fields can have
|
|
`null` values.
|
|
|
|
<div class="codetabs">
|
|
<div data-lang="scala" markdown="1">
|
|
|
|
All data types of Spark SQL are located in the package `org.apache.spark.sql.types`.
|
|
You can access them by doing
|
|
|
|
{% include_example data_types scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
|
|
|
|
<table class="table">
|
|
<tr>
|
|
<th style="width:20%">Data type</th>
|
|
<th style="width:40%">Value type in Scala</th>
|
|
<th>API to access or create a data type</th></tr>
|
|
<tr>
|
|
<td> <b>ByteType</b> </td>
|
|
<td> Byte </td>
|
|
<td>
|
|
ByteType
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>ShortType</b> </td>
|
|
<td> Short </td>
|
|
<td>
|
|
ShortType
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>IntegerType</b> </td>
|
|
<td> Int </td>
|
|
<td>
|
|
IntegerType
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>LongType</b> </td>
|
|
<td> Long </td>
|
|
<td>
|
|
LongType
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>FloatType</b> </td>
|
|
<td> Float </td>
|
|
<td>
|
|
FloatType
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>DoubleType</b> </td>
|
|
<td> Double </td>
|
|
<td>
|
|
DoubleType
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>DecimalType</b> </td>
|
|
<td> java.math.BigDecimal </td>
|
|
<td>
|
|
DecimalType
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>StringType</b> </td>
|
|
<td> String </td>
|
|
<td>
|
|
StringType
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>BinaryType</b> </td>
|
|
<td> Array[Byte] </td>
|
|
<td>
|
|
BinaryType
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>BooleanType</b> </td>
|
|
<td> Boolean </td>
|
|
<td>
|
|
BooleanType
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>TimestampType</b> </td>
|
|
<td> java.sql.Timestamp </td>
|
|
<td>
|
|
TimestampType
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>DateType</b> </td>
|
|
<td> java.sql.Date </td>
|
|
<td>
|
|
DateType
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>ArrayType</b> </td>
|
|
<td> scala.collection.Seq </td>
|
|
<td>
|
|
ArrayType(<i>elementType</i>, [<i>containsNull</i>])<br />
|
|
<b>Note:</b> The default value of <i>containsNull</i> is <i>true</i>.
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>MapType</b> </td>
|
|
<td> scala.collection.Map </td>
|
|
<td>
|
|
MapType(<i>keyType</i>, <i>valueType</i>, [<i>valueContainsNull</i>])<br />
|
|
<b>Note:</b> The default value of <i>valueContainsNull</i> is <i>true</i>.
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>StructType</b> </td>
|
|
<td> org.apache.spark.sql.Row </td>
|
|
<td>
|
|
StructType(<i>fields</i>)<br />
|
|
<b>Note:</b> <i>fields</i> is a Seq of StructFields. Also, two fields with the same
|
|
name are not allowed.
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>StructField</b> </td>
|
|
<td> The value type in Scala of the data type of this field
|
|
(For example, Int for a StructField with the data type IntegerType) </td>
|
|
<td>
|
|
StructField(<i>name</i>, <i>dataType</i>, [<i>nullable</i>])<br />
|
|
<b>Note:</b> The default value of <i>nullable</i> is <i>true</i>.
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
</div>
|
|
|
|
<div data-lang="java" markdown="1">
|
|
|
|
All data types of Spark SQL are located in the package of
|
|
`org.apache.spark.sql.types`. To access or create a data type,
|
|
please use factory methods provided in
|
|
`org.apache.spark.sql.types.DataTypes`.
|
|
|
|
<table class="table">
|
|
<tr>
|
|
<th style="width:20%">Data type</th>
|
|
<th style="width:40%">Value type in Java</th>
|
|
<th>API to access or create a data type</th></tr>
|
|
<tr>
|
|
<td> <b>ByteType</b> </td>
|
|
<td> byte or Byte </td>
|
|
<td>
|
|
DataTypes.ByteType
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>ShortType</b> </td>
|
|
<td> short or Short </td>
|
|
<td>
|
|
DataTypes.ShortType
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>IntegerType</b> </td>
|
|
<td> int or Integer </td>
|
|
<td>
|
|
DataTypes.IntegerType
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>LongType</b> </td>
|
|
<td> long or Long </td>
|
|
<td>
|
|
DataTypes.LongType
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>FloatType</b> </td>
|
|
<td> float or Float </td>
|
|
<td>
|
|
DataTypes.FloatType
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>DoubleType</b> </td>
|
|
<td> double or Double </td>
|
|
<td>
|
|
DataTypes.DoubleType
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>DecimalType</b> </td>
|
|
<td> java.math.BigDecimal </td>
|
|
<td>
|
|
DataTypes.createDecimalType()<br />
|
|
DataTypes.createDecimalType(<i>precision</i>, <i>scale</i>).
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>StringType</b> </td>
|
|
<td> String </td>
|
|
<td>
|
|
DataTypes.StringType
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>BinaryType</b> </td>
|
|
<td> byte[] </td>
|
|
<td>
|
|
DataTypes.BinaryType
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>BooleanType</b> </td>
|
|
<td> boolean or Boolean </td>
|
|
<td>
|
|
DataTypes.BooleanType
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>TimestampType</b> </td>
|
|
<td> java.sql.Timestamp </td>
|
|
<td>
|
|
DataTypes.TimestampType
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>DateType</b> </td>
|
|
<td> java.sql.Date </td>
|
|
<td>
|
|
DataTypes.DateType
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>ArrayType</b> </td>
|
|
<td> java.util.List </td>
|
|
<td>
|
|
DataTypes.createArrayType(<i>elementType</i>)<br />
|
|
<b>Note:</b> The value of <i>containsNull</i> will be <i>true</i><br />
|
|
DataTypes.createArrayType(<i>elementType</i>, <i>containsNull</i>).
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>MapType</b> </td>
|
|
<td> java.util.Map </td>
|
|
<td>
|
|
DataTypes.createMapType(<i>keyType</i>, <i>valueType</i>)<br />
|
|
<b>Note:</b> The value of <i>valueContainsNull</i> will be <i>true</i>.<br />
|
|
DataTypes.createMapType(<i>keyType</i>, <i>valueType</i>, <i>valueContainsNull</i>)<br />
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>StructType</b> </td>
|
|
<td> org.apache.spark.sql.Row </td>
|
|
<td>
|
|
DataTypes.createStructType(<i>fields</i>)<br />
|
|
<b>Note:</b> <i>fields</i> is a List or an array of StructFields.
|
|
Also, two fields with the same name are not allowed.
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>StructField</b> </td>
|
|
<td> The value type in Java of the data type of this field
|
|
(For example, int for a StructField with the data type IntegerType) </td>
|
|
<td>
|
|
DataTypes.createStructField(<i>name</i>, <i>dataType</i>, <i>nullable</i>)
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
</div>
|
|
|
|
<div data-lang="python" markdown="1">
|
|
|
|
All data types of Spark SQL are located in the package of `pyspark.sql.types`.
|
|
You can access them by doing
|
|
{% highlight python %}
|
|
from pyspark.sql.types import *
|
|
{% endhighlight %}
|
|
|
|
<table class="table">
|
|
<tr>
|
|
<th style="width:20%">Data type</th>
|
|
<th style="width:40%">Value type in Python</th>
|
|
<th>API to access or create a data type</th></tr>
|
|
<tr>
|
|
<td> <b>ByteType</b> </td>
|
|
<td>
|
|
int or long <br />
|
|
<b>Note:</b> Numbers will be converted to 1-byte signed integer numbers at runtime.
|
|
Please make sure that numbers are within the range of -128 to 127.
|
|
</td>
|
|
<td>
|
|
ByteType()
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>ShortType</b> </td>
|
|
<td>
|
|
int or long <br />
|
|
<b>Note:</b> Numbers will be converted to 2-byte signed integer numbers at runtime.
|
|
Please make sure that numbers are within the range of -32768 to 32767.
|
|
</td>
|
|
<td>
|
|
ShortType()
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>IntegerType</b> </td>
|
|
<td> int or long </td>
|
|
<td>
|
|
IntegerType()
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>LongType</b> </td>
|
|
<td>
|
|
long <br />
|
|
<b>Note:</b> Numbers will be converted to 8-byte signed integer numbers at runtime.
|
|
Please make sure that numbers are within the range of
|
|
-9223372036854775808 to 9223372036854775807.
|
|
Otherwise, please convert data to decimal.Decimal and use DecimalType.
|
|
</td>
|
|
<td>
|
|
LongType()
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>FloatType</b> </td>
|
|
<td>
|
|
float <br />
|
|
<b>Note:</b> Numbers will be converted to 4-byte single-precision floating
|
|
point numbers at runtime.
|
|
</td>
|
|
<td>
|
|
FloatType()
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>DoubleType</b> </td>
|
|
<td> float </td>
|
|
<td>
|
|
DoubleType()
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>DecimalType</b> </td>
|
|
<td> decimal.Decimal </td>
|
|
<td>
|
|
DecimalType()
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>StringType</b> </td>
|
|
<td> string </td>
|
|
<td>
|
|
StringType()
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>BinaryType</b> </td>
|
|
<td> bytearray </td>
|
|
<td>
|
|
BinaryType()
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>BooleanType</b> </td>
|
|
<td> bool </td>
|
|
<td>
|
|
BooleanType()
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>TimestampType</b> </td>
|
|
<td> datetime.datetime </td>
|
|
<td>
|
|
TimestampType()
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>DateType</b> </td>
|
|
<td> datetime.date </td>
|
|
<td>
|
|
DateType()
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>ArrayType</b> </td>
|
|
<td> list, tuple, or array </td>
|
|
<td>
|
|
ArrayType(<i>elementType</i>, [<i>containsNull</i>])<br />
|
|
<b>Note:</b> The default value of <i>containsNull</i> is <i>True</i>.
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>MapType</b> </td>
|
|
<td> dict </td>
|
|
<td>
|
|
MapType(<i>keyType</i>, <i>valueType</i>, [<i>valueContainsNull</i>])<br />
|
|
<b>Note:</b> The default value of <i>valueContainsNull</i> is <i>True</i>.
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>StructType</b> </td>
|
|
<td> list or tuple </td>
|
|
<td>
|
|
StructType(<i>fields</i>)<br />
|
|
<b>Note:</b> <i>fields</i> is a Seq of StructFields. Also, two fields with the same
|
|
name are not allowed.
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>StructField</b> </td>
|
|
<td> The value type in Python of the data type of this field
|
|
(For example, Int for a StructField with the data type IntegerType) </td>
|
|
<td>
|
|
StructField(<i>name</i>, <i>dataType</i>, [<i>nullable</i>])<br />
|
|
<b>Note:</b> The default value of <i>nullable</i> is <i>True</i>.
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
</div>
|
|
|
|
<div data-lang="r" markdown="1">
|
|
|
|
<table class="table">
|
|
<tr>
|
|
<th style="width:20%">Data type</th>
|
|
<th style="width:40%">Value type in R</th>
|
|
<th>API to access or create a data type</th></tr>
|
|
<tr>
|
|
<td> <b>ByteType</b> </td>
|
|
<td>
|
|
integer <br />
|
|
<b>Note:</b> Numbers will be converted to 1-byte signed integer numbers at runtime.
|
|
Please make sure that numbers are within the range of -128 to 127.
|
|
</td>
|
|
<td>
|
|
"byte"
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>ShortType</b> </td>
|
|
<td>
|
|
integer <br />
|
|
<b>Note:</b> Numbers will be converted to 2-byte signed integer numbers at runtime.
|
|
Please make sure that numbers are within the range of -32768 to 32767.
|
|
</td>
|
|
<td>
|
|
"short"
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>IntegerType</b> </td>
|
|
<td> integer </td>
|
|
<td>
|
|
"integer"
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>LongType</b> </td>
|
|
<td>
|
|
integer <br />
|
|
<b>Note:</b> Numbers will be converted to 8-byte signed integer numbers at runtime.
|
|
Please make sure that numbers are within the range of
|
|
-9223372036854775808 to 9223372036854775807.
|
|
Otherwise, please convert data to decimal.Decimal and use DecimalType.
|
|
</td>
|
|
<td>
|
|
"long"
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>FloatType</b> </td>
|
|
<td>
|
|
numeric <br />
|
|
<b>Note:</b> Numbers will be converted to 4-byte single-precision floating
|
|
point numbers at runtime.
|
|
</td>
|
|
<td>
|
|
"float"
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>DoubleType</b> </td>
|
|
<td> numeric </td>
|
|
<td>
|
|
"double"
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>DecimalType</b> </td>
|
|
<td> Not supported </td>
|
|
<td>
|
|
Not supported
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>StringType</b> </td>
|
|
<td> character </td>
|
|
<td>
|
|
"string"
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>BinaryType</b> </td>
|
|
<td> raw </td>
|
|
<td>
|
|
"binary"
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>BooleanType</b> </td>
|
|
<td> logical </td>
|
|
<td>
|
|
"bool"
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>TimestampType</b> </td>
|
|
<td> POSIXct </td>
|
|
<td>
|
|
"timestamp"
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>DateType</b> </td>
|
|
<td> Date </td>
|
|
<td>
|
|
"date"
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>ArrayType</b> </td>
|
|
<td> vector or list </td>
|
|
<td>
|
|
list(type="array", elementType=<i>elementType</i>, containsNull=[<i>containsNull</i>])<br />
|
|
<b>Note:</b> The default value of <i>containsNull</i> is <i>TRUE</i>.
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>MapType</b> </td>
|
|
<td> environment </td>
|
|
<td>
|
|
list(type="map", keyType=<i>keyType</i>, valueType=<i>valueType</i>, valueContainsNull=[<i>valueContainsNull</i>])<br />
|
|
<b>Note:</b> The default value of <i>valueContainsNull</i> is <i>TRUE</i>.
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>StructType</b> </td>
|
|
<td> named list</td>
|
|
<td>
|
|
list(type="struct", fields=<i>fields</i>)<br />
|
|
<b>Note:</b> <i>fields</i> is a Seq of StructFields. Also, two fields with the same
|
|
name are not allowed.
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td> <b>StructField</b> </td>
|
|
<td> The value type in R of the data type of this field
|
|
(For example, integer for a StructField with the data type IntegerType) </td>
|
|
<td>
|
|
list(name=<i>name</i>, type=<i>dataType</i>, nullable=[<i>nullable</i>])<br />
|
|
<b>Note:</b> The default value of <i>nullable</i> is <i>TRUE</i>.
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
|
|
</div>
|
|
|
|
</div>
|
|
|
|
## NaN Semantics
|
|
|
|
There is specially handling for not-a-number (NaN) when dealing with `float` or `double` types that
|
|
does not exactly match standard floating point semantics.
|
|
Specifically:
|
|
|
|
- NaN = NaN returns true.
|
|
- In aggregations, all NaN values are grouped together.
|
|
- NaN is treated as a normal value in join keys.
|
|
- NaN values go last when in ascending order, larger than any other numeric value.
|
|
|
|
## Arithmetic operations
|
|
|
|
Operations performed on numeric types (with the exception of `decimal`) are not checked for overflow.
|
|
This means that in case an operation causes an overflow, the result is the same that the same operation
|
|
returns in a Java/Scala program (eg. if the sum of 2 integers is higher than the maximum value representable,
|
|
the result is a negative number).
|