---
layout: global
title: Reference
displayTitle: Reference
---
* Table of contents
{:toc}
## Data Types
Spark SQL and DataFrames support the following data types:
* Numeric types
    - `ByteType`: Represents 1-byte signed integer numbers.
    The range of numbers is from `-128` to `127`.
    - `ShortType`: Represents 2-byte signed integer numbers.
    The range of numbers is from `-32768` to `32767`.
    - `IntegerType`: Represents 4-byte signed integer numbers.
    The range of numbers is from `-2147483648` to `2147483647`.
    - `LongType`: Represents 8-byte signed integer numbers.
    The range of numbers is from `-9223372036854775808` to `9223372036854775807`.
    - `FloatType`: Represents 4-byte single-precision floating point numbers.
    - `DoubleType`: Represents 8-byte double-precision floating point numbers.
    - `DecimalType`: Represents arbitrary-precision signed decimal numbers. Backed internally by `java.math.BigDecimal`. A `BigDecimal` consists of an arbitrary-precision integer unscaled value and a 32-bit integer scale.
* String type
    - `StringType`: Represents character string values.
* Binary type
    - `BinaryType`: Represents byte sequence values.
* Boolean type
    - `BooleanType`: Represents boolean values.
* Datetime type
    - `TimestampType`: Represents values comprising values of fields year, month, day,
    hour, minute, and second.
    - `DateType`: Represents values comprising values of fields year, month, and day.
* Complex types
    - `ArrayType(elementType, containsNull)`: Represents values comprising a sequence of
    elements with the type of `elementType`. `containsNull` is used to indicate if
    elements in an `ArrayType` value can have `null` values.
    - `MapType(keyType, valueType, valueContainsNull)`:
    Represents values comprising a set of key-value pairs. The data type of keys is
    described by `keyType` and the data type of values is described by `valueType`.
    For a `MapType` value, keys are not allowed to have `null` values. `valueContainsNull`
    is used to indicate if values of a `MapType` value can have `null` values.
    - `StructType(fields)`: Represents values with the structure described by
    a sequence of `StructField`s (`fields`).
        - `StructField(name, dataType, nullable)`: Represents a field in a `StructType`.
        The name of a field is indicated by `name`. The data type of a field is indicated
        by `dataType`. `nullable` is used to indicate if values of these fields can have
        `null` values.
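The unscaled-value/scale decomposition behind `DecimalType` can be illustrated with Python's standard `decimal` module, which exposes the same representation as `java.math.BigDecimal` (a conceptual sketch, not Spark API):

```python
from decimal import Decimal

# A decimal number is stored as an integer "unscaled value" plus a "scale":
# 123.45 == 12345 * 10**(-2), i.e. unscaled value 12345 and scale 2.
d = Decimal("123.45")
sign, digits, exponent = d.as_tuple()
unscaled = int("".join(str(x) for x in digits))
scale = -exponent  # java.math.BigDecimal stores the scale as a 32-bit int

print(unscaled, scale)  # -> 12345 2
```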
<div class="codetabs">
<div data-lang="scala" markdown="1">
All data types of Spark SQL are located in the package `org.apache.spark.sql.types`.
You can access them by doing
{% include_example data_types scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
<table class="table">
<tr>
<th style="width:20%">Data type</th>
<th style="width:40%">Value type in Scala</th>
<th>API to access or create a data type</th></tr>
<tr>
<td> <b>ByteType</b> </td>
<td> Byte </td>
<td>
ByteType
</td>
</tr>
<tr>
<td> <b>ShortType</b> </td>
<td> Short </td>
<td>
ShortType
</td>
</tr>
<tr>
<td> <b>IntegerType</b> </td>
<td> Int </td>
<td>
IntegerType
</td>
</tr>
<tr>
<td> <b>LongType</b> </td>
<td> Long </td>
<td>
LongType
</td>
</tr>
<tr>
<td> <b>FloatType</b> </td>
<td> Float </td>
<td>
FloatType
</td>
</tr>
<tr>
<td> <b>DoubleType</b> </td>
<td> Double </td>
<td>
DoubleType
</td>
</tr>
<tr>
<td> <b>DecimalType</b> </td>
<td> java.math.BigDecimal </td>
<td>
DecimalType
</td>
</tr>
<tr>
<td> <b>StringType</b> </td>
<td> String </td>
<td>
StringType
</td>
</tr>
<tr>
<td> <b>BinaryType</b> </td>
<td> Array[Byte] </td>
<td>
BinaryType
</td>
</tr>
<tr>
<td> <b>BooleanType</b> </td>
<td> Boolean </td>
<td>
BooleanType
</td>
</tr>
<tr>
<td> <b>TimestampType</b> </td>
<td> java.sql.Timestamp </td>
<td>
TimestampType
</td>
</tr>
<tr>
<td> <b>DateType</b> </td>
<td> java.sql.Date </td>
<td>
DateType
</td>
</tr>
<tr>
<td> <b>ArrayType</b> </td>
<td> scala.collection.Seq </td>
<td>
ArrayType(<i>elementType</i>, [<i>containsNull</i>])<br />
<b>Note:</b> The default value of <i>containsNull</i> is <i>true</i>.
</td>
</tr>
<tr>
<td> <b>MapType</b> </td>
<td> scala.collection.Map </td>
<td>
MapType(<i>keyType</i>, <i>valueType</i>, [<i>valueContainsNull</i>])<br />
<b>Note:</b> The default value of <i>valueContainsNull</i> is <i>true</i>.
</td>
</tr>
<tr>
<td> <b>StructType</b> </td>
<td> org.apache.spark.sql.Row </td>
<td>
StructType(<i>fields</i>)<br />
<b>Note:</b> <i>fields</i> is a Seq of StructFields. Also, two fields with the same
name are not allowed.
</td>
</tr>
<tr>
<td> <b>StructField</b> </td>
<td> The value type in Scala of the data type of this field
(For example, Int for a StructField with the data type IntegerType) </td>
<td>
StructField(<i>name</i>, <i>dataType</i>, [<i>nullable</i>])<br />
<b>Note:</b> The default value of <i>nullable</i> is <i>true</i>.
</td>
</tr>
</table>
</div>
<div data-lang="java" markdown="1">
All data types of Spark SQL are located in the package
`org.apache.spark.sql.types`. To access or create a data type,
please use factory methods provided in
`org.apache.spark.sql.types.DataTypes`.
<table class="table">
<tr>
<th style="width:20%">Data type</th>
<th style="width:40%">Value type in Java</th>
<th>API to access or create a data type</th></tr>
<tr>
<td> <b>ByteType</b> </td>
<td> byte or Byte </td>
<td>
DataTypes.ByteType
</td>
</tr>
<tr>
<td> <b>ShortType</b> </td>
<td> short or Short </td>
<td>
DataTypes.ShortType
</td>
</tr>
<tr>
<td> <b>IntegerType</b> </td>
<td> int or Integer </td>
<td>
DataTypes.IntegerType
</td>
</tr>
<tr>
<td> <b>LongType</b> </td>
<td> long or Long </td>
<td>
DataTypes.LongType
</td>
</tr>
<tr>
<td> <b>FloatType</b> </td>
<td> float or Float </td>
<td>
DataTypes.FloatType
</td>
</tr>
<tr>
<td> <b>DoubleType</b> </td>
<td> double or Double </td>
<td>
DataTypes.DoubleType
</td>
</tr>
<tr>
<td> <b>DecimalType</b> </td>
<td> java.math.BigDecimal </td>
<td>
DataTypes.createDecimalType()<br />
DataTypes.createDecimalType(<i>precision</i>, <i>scale</i>).
</td>
</tr>
<tr>
<td> <b>StringType</b> </td>
<td> String </td>
<td>
DataTypes.StringType
</td>
</tr>
<tr>
<td> <b>BinaryType</b> </td>
<td> byte[] </td>
<td>
DataTypes.BinaryType
</td>
</tr>
<tr>
<td> <b>BooleanType</b> </td>
<td> boolean or Boolean </td>
<td>
DataTypes.BooleanType
</td>
</tr>
<tr>
<td> <b>TimestampType</b> </td>
<td> java.sql.Timestamp </td>
<td>
DataTypes.TimestampType
</td>
</tr>
<tr>
<td> <b>DateType</b> </td>
<td> java.sql.Date </td>
<td>
DataTypes.DateType
</td>
</tr>
<tr>
<td> <b>ArrayType</b> </td>
<td> java.util.List </td>
<td>
DataTypes.createArrayType(<i>elementType</i>)<br />
<b>Note:</b> The value of <i>containsNull</i> will be <i>true</i><br />
DataTypes.createArrayType(<i>elementType</i>, <i>containsNull</i>).
</td>
</tr>
<tr>
<td> <b>MapType</b> </td>
<td> java.util.Map </td>
<td>
DataTypes.createMapType(<i>keyType</i>, <i>valueType</i>)<br />
<b>Note:</b> The value of <i>valueContainsNull</i> will be <i>true</i>.<br />
DataTypes.createMapType(<i>keyType</i>, <i>valueType</i>, <i>valueContainsNull</i>)<br />
</td>
</tr>
<tr>
<td> <b>StructType</b> </td>
<td> org.apache.spark.sql.Row </td>
<td>
DataTypes.createStructType(<i>fields</i>)<br />
<b>Note:</b> <i>fields</i> is a List or an array of StructFields.
Also, two fields with the same name are not allowed.
</td>
</tr>
<tr>
<td> <b>StructField</b> </td>
<td> The value type in Java of the data type of this field
(For example, int for a StructField with the data type IntegerType) </td>
<td>
DataTypes.createStructField(<i>name</i>, <i>dataType</i>, <i>nullable</i>)
</td>
</tr>
</table>
</div>
<div data-lang="python" markdown="1">
All data types of Spark SQL are located in the package `pyspark.sql.types`.
You can access them by doing
{% highlight python %}
from pyspark.sql.types import *
{% endhighlight %}
<table class="table">
<tr>
<th style="width:20%">Data type</th>
<th style="width:40%">Value type in Python</th>
<th>API to access or create a data type</th></tr>
<tr>
<td> <b>ByteType</b> </td>
<td>
int or long <br />
<b>Note:</b> Numbers will be converted to 1-byte signed integer numbers at runtime.
Please make sure that numbers are within the range of -128 to 127.
</td>
<td>
ByteType()
</td>
</tr>
<tr>
<td> <b>ShortType</b> </td>
<td>
int or long <br />
<b>Note:</b> Numbers will be converted to 2-byte signed integer numbers at runtime.
Please make sure that numbers are within the range of -32768 to 32767.
</td>
<td>
ShortType()
</td>
</tr>
<tr>
<td> <b>IntegerType</b> </td>
<td> int or long </td>
<td>
IntegerType()
</td>
</tr>
<tr>
<td> <b>LongType</b> </td>
<td>
long <br />
<b>Note:</b> Numbers will be converted to 8-byte signed integer numbers at runtime.
Please make sure that numbers are within the range of
-9223372036854775808 to 9223372036854775807.
Otherwise, please convert data to decimal.Decimal and use DecimalType.
</td>
<td>
LongType()
</td>
</tr>
<tr>
<td> <b>FloatType</b> </td>
<td>
float <br />
<b>Note:</b> Numbers will be converted to 4-byte single-precision floating
point numbers at runtime.
</td>
<td>
FloatType()
</td>
</tr>
<tr>
<td> <b>DoubleType</b> </td>
<td> float </td>
<td>
DoubleType()
</td>
</tr>
<tr>
<td> <b>DecimalType</b> </td>
<td> decimal.Decimal </td>
<td>
DecimalType()
</td>
</tr>
<tr>
<td> <b>StringType</b> </td>
<td> string </td>
<td>
StringType()
</td>
</tr>
<tr>
<td> <b>BinaryType</b> </td>
<td> bytearray </td>
<td>
BinaryType()
</td>
</tr>
<tr>
<td> <b>BooleanType</b> </td>
<td> bool </td>
<td>
BooleanType()
</td>
</tr>
<tr>
<td> <b>TimestampType</b> </td>
<td> datetime.datetime </td>
<td>
TimestampType()
</td>
</tr>
<tr>
<td> <b>DateType</b> </td>
<td> datetime.date </td>
<td>
DateType()
</td>
</tr>
<tr>
<td> <b>ArrayType</b> </td>
<td> list, tuple, or array </td>
<td>
ArrayType(<i>elementType</i>, [<i>containsNull</i>])<br />
<b>Note:</b> The default value of <i>containsNull</i> is <i>True</i>.
</td>
</tr>
<tr>
<td> <b>MapType</b> </td>
<td> dict </td>
<td>
MapType(<i>keyType</i>, <i>valueType</i>, [<i>valueContainsNull</i>])<br />
<b>Note:</b> The default value of <i>valueContainsNull</i> is <i>True</i>.
</td>
</tr>
<tr>
<td> <b>StructType</b> </td>
<td> list or tuple </td>
<td>
StructType(<i>fields</i>)<br />
<b>Note:</b> <i>fields</i> is a list of StructFields. Also, two fields with the same
name are not allowed.
</td>
</tr>
<tr>
<td> <b>StructField</b> </td>
<td> The value type in Python of the data type of this field
(For example, int for a StructField with the data type IntegerType) </td>
<td>
StructField(<i>name</i>, <i>dataType</i>, [<i>nullable</i>])<br />
<b>Note:</b> The default value of <i>nullable</i> is <i>True</i>.
</td>
</tr>
</table>
</div>
<div data-lang="r" markdown="1">
<table class="table">
<tr>
<th style="width:20%">Data type</th>
<th style="width:40%">Value type in R</th>
<th>API to access or create a data type</th></tr>
<tr>
<td> <b>ByteType</b> </td>
<td>
integer <br />
<b>Note:</b> Numbers will be converted to 1-byte signed integer numbers at runtime.
Please make sure that numbers are within the range of -128 to 127.
</td>
<td>
"byte"
</td>
</tr>
<tr>
<td> <b>ShortType</b> </td>
<td>
integer <br />
<b>Note:</b> Numbers will be converted to 2-byte signed integer numbers at runtime.
Please make sure that numbers are within the range of -32768 to 32767.
</td>
<td>
"short"
</td>
</tr>
<tr>
<td> <b>IntegerType</b> </td>
<td> integer </td>
<td>
"integer"
</td>
</tr>
<tr>
<td> <b>LongType</b> </td>
<td>
integer <br />
<b>Note:</b> Numbers will be converted to 8-byte signed integer numbers at runtime.
Please make sure that numbers are within the range of
-9223372036854775808 to 9223372036854775807.
</td>
<td>
"long"
</td>
</tr>
<tr>
<td> <b>FloatType</b> </td>
<td>
numeric <br />
<b>Note:</b> Numbers will be converted to 4-byte single-precision floating
point numbers at runtime.
</td>
<td>
"float"
</td>
</tr>
<tr>
<td> <b>DoubleType</b> </td>
<td> numeric </td>
<td>
"double"
</td>
</tr>
<tr>
<td> <b>DecimalType</b> </td>
<td> Not supported </td>
<td>
Not supported
</td>
</tr>
<tr>
<td> <b>StringType</b> </td>
<td> character </td>
<td>
"string"
</td>
</tr>
<tr>
<td> <b>BinaryType</b> </td>
<td> raw </td>
<td>
"binary"
</td>
</tr>
<tr>
<td> <b>BooleanType</b> </td>
<td> logical </td>
<td>
"boolean"
</td>
</tr>
<tr>
<td> <b>TimestampType</b> </td>
<td> POSIXct </td>
<td>
"timestamp"
</td>
</tr>
<tr>
<td> <b>DateType</b> </td>
<td> Date </td>
<td>
"date"
</td>
</tr>
<tr>
<td> <b>ArrayType</b> </td>
<td> vector or list </td>
<td>
list(type="array", elementType=<i>elementType</i>, containsNull=[<i>containsNull</i>])<br />
<b>Note:</b> The default value of <i>containsNull</i> is <i>TRUE</i>.
</td>
</tr>
<tr>
<td> <b>MapType</b> </td>
<td> environment </td>
<td>
list(type="map", keyType=<i>keyType</i>, valueType=<i>valueType</i>, valueContainsNull=[<i>valueContainsNull</i>])<br />
<b>Note:</b> The default value of <i>valueContainsNull</i> is <i>TRUE</i>.
</td>
</tr>
<tr>
<td> <b>StructType</b> </td>
<td> named list</td>
<td>
list(type="struct", fields=<i>fields</i>)<br />
<b>Note:</b> <i>fields</i> is a list of StructFields. Also, two fields with the same
name are not allowed.
</td>
</tr>
<tr>
<td> <b>StructField</b> </td>
<td> The value type in R of the data type of this field
(For example, integer for a StructField with the data type IntegerType) </td>
<td>
list(name=<i>name</i>, type=<i>dataType</i>, nullable=[<i>nullable</i>])<br />
<b>Note:</b> The default value of <i>nullable</i> is <i>TRUE</i>.
</td>
</tr>
</table>
</div>
</div>
## NaN Semantics
There is special handling for not-a-number (NaN) when dealing with `float` or `double` types that
does not exactly match standard floating point semantics.
Specifically:
- NaN = NaN returns true.
- In aggregations, all NaN values are grouped together.
- NaN is treated as a normal value in join keys.
- NaN values go last when in ascending order, larger than any other numeric value.
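Plain Python follows the standard IEEE 754 semantics, which makes the contrast concrete; the snippet below also mimics Spark's ascending NaN-last ordering with a sort key (a sketch in pure Python, not Spark API):

```python
import math

nan = float("nan")

# Standard IEEE 754 semantics (which Python follows): NaN != NaN.
assert nan != nan
# Spark SQL instead treats NaN = NaN as true, groups NaN values together
# in aggregations, and sorts NaN after every other numeric value.

# Mimic Spark's ascending, NaN-last ordering:
vals = [1.0, nan, -5.0, 3.0]
spark_order = sorted(vals, key=lambda x: (math.isnan(x), x))
print(spark_order)  # -> [-5.0, 1.0, 3.0, nan]
```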
## Arithmetic operations
Operations performed on numeric types (with the exception of `decimal`) are not checked for overflow.
This means that if an operation causes an overflow, the result is the same as what the same operation
returns in a Java/Scala program (e.g. if the sum of two integers is higher than the maximum representable
value, the result is a negative number).
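Because Python integers are arbitrary precision, the wraparound has to be emulated to see it; the helper below (a hypothetical name, for illustration only) sketches what Java/Scala 32-bit `int` addition does on overflow:

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def java_int_add(a: int, b: int) -> int:
    """Emulate Java's 32-bit signed int addition, which wraps on overflow."""
    r = (a + b) & 0xFFFFFFFF           # keep only the low 32 bits
    return r - 2**32 if r > INT_MAX else r

# Adding 1 to the maximum 32-bit int wraps around to the minimum:
print(java_int_add(INT_MAX, 1))  # -> -2147483648
```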