e3b6e4ad43
### What changes were proposed in this pull request?

Documentation for `SparkSession.conf.isModifiable` is missing from the Python API site, so we added a Configuration section to the Spark SQL page to expose docs for the `RuntimeConfig` class (the class containing `isModifiable`). A `:class:` reference to `RuntimeConfig` was then added to the `SparkSession.conf` docstring to create a link there as well.

### Why are the changes needed?

No docs were generated for `pyspark.sql.conf.RuntimeConfig`.

### Does this PR introduce _any_ user-facing change?

Yes: a new Configuration section on the Spark SQL page, and a `Returns` section in the `SparkSession.conf` docstring that now links to the `pyspark.sql.conf.RuntimeConfig` page. This is a change compared to both the released Spark version and the unreleased master branch.

### How was this patch tested?

First built the Python docs:

```bash
cd $SPARK_HOME/docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve
```

Then verified all pages and links:

1. Configuration link displayed on the API Reference page, and it clicks through to the Spark SQL page: http://localhost:4000/api/python/reference/index.html
   ![image](https://user-images.githubusercontent.com/1160861/107601918-a2f02380-6bed-11eb-9b8f-974a0681a2a9.png)
2. Configuration section displayed on the Spark SQL page, and the RuntimeConfig link clicks through to the RuntimeConfig page: http://localhost:4000/api/python/reference/pyspark.sql.html#configuration
   ![image](https://user-images.githubusercontent.com/1160861/107602058-0d08c880-6bee-11eb-8cbb-ad8c47588085.png)
3. RuntimeConfig page displayed: http://localhost:4000/api/python/reference/api/pyspark.sql.conf.RuntimeConfig.html
   ![image](https://user-images.githubusercontent.com/1160861/107602278-94eed280-6bee-11eb-95fc-445ea62ac1a4.png)
4. SparkSession.conf page displays the RuntimeConfig link, and it navigates to the RuntimeConfig page: http://localhost:4000/api/python/reference/api/pyspark.sql.SparkSession.conf.html
   ![image](https://user-images.githubusercontent.com/1160861/107602435-1f373680-6bef-11eb-985a-b72432464940.png)

Closes #31483 from Eric-Lemmon/SPARK-33434-document-isModifiable.

Authored-by: Eric Lemmon <eric@lemmon.cc>
Signed-off-by: Sean Owen <srowen@gmail.com>
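The docstring change can be sketched like this. The class below is a hypothetical, heavily abbreviated stand-in for the real `pyspark.sql.SparkSession`, shown only to illustrate how a numpydoc `Returns` section carrying a `:class:` role yields the cross-reference link:

```python
class SparkSession:
    """Abbreviated stand-in for pyspark.sql.SparkSession (illustration only)."""

    @property
    def conf(self):
        """Runtime configuration interface for Spark.

        Returns
        -------
        :class:`pyspark.sql.conf.RuntimeConfig`
            Sphinx renders this role as a link to the RuntimeConfig page.
        """
        raise NotImplementedError  # the real property returns a RuntimeConfig
```

When Sphinx autodoc processes such a docstring, the `:class:` role resolves against the `pyspark.sql.conf.RuntimeConfig` entry that the new Configuration section generates.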
.. Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements. See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership. The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License. You may obtain a copy of the License at

..  http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied. See the License for the
   specific language governing permissions and limitations
   under the License.

=========
Spark SQL
=========

Core Classes
------------

.. currentmodule:: pyspark.sql

.. autosummary::
    :toctree: api/

    SparkSession
    DataFrame
    Column
    Row
    GroupedData
    PandasCogroupedOps
    DataFrameNaFunctions
    DataFrameStatFunctions
    Window

Spark Session APIs
------------------

.. currentmodule:: pyspark.sql

The entry point to programming Spark with the Dataset and DataFrame API.
To create a Spark session, you should use the ``SparkSession.builder`` attribute.
See also :class:`SparkSession`.

.. autosummary::
    :toctree: api/

    SparkSession.builder.appName
    SparkSession.builder.config
    SparkSession.builder.enableHiveSupport
    SparkSession.builder.getOrCreate
    SparkSession.builder.master
    SparkSession.catalog
    SparkSession.conf
    SparkSession.createDataFrame
    SparkSession.getActiveSession
    SparkSession.newSession
    SparkSession.range
    SparkSession.read
    SparkSession.readStream
    SparkSession.sparkContext
    SparkSession.sql
    SparkSession.stop
    SparkSession.streams
    SparkSession.table
    SparkSession.udf
    SparkSession.version

Configuration
-------------

.. currentmodule:: pyspark.sql.conf

.. autosummary::
    :toctree: api/

    RuntimeConfig

Input and Output
----------------

.. currentmodule:: pyspark.sql

.. autosummary::
    :toctree: api/

    DataFrameReader.csv
    DataFrameReader.format
    DataFrameReader.jdbc
    DataFrameReader.json
    DataFrameReader.load
    DataFrameReader.option
    DataFrameReader.options
    DataFrameReader.orc
    DataFrameReader.parquet
    DataFrameReader.schema
    DataFrameReader.table
    DataFrameWriter.bucketBy
    DataFrameWriter.csv
    DataFrameWriter.format
    DataFrameWriter.insertInto
    DataFrameWriter.jdbc
    DataFrameWriter.json
    DataFrameWriter.mode
    DataFrameWriter.option
    DataFrameWriter.options
    DataFrameWriter.orc
    DataFrameWriter.parquet
    DataFrameWriter.partitionBy
    DataFrameWriter.save
    DataFrameWriter.saveAsTable
    DataFrameWriter.sortBy
    DataFrameWriter.text

DataFrame APIs
--------------

.. currentmodule:: pyspark.sql

.. autosummary::
    :toctree: api/

    DataFrame.agg
    DataFrame.alias
    DataFrame.approxQuantile
    DataFrame.cache
    DataFrame.checkpoint
    DataFrame.coalesce
    DataFrame.colRegex
    DataFrame.collect
    DataFrame.columns
    DataFrame.corr
    DataFrame.count
    DataFrame.cov
    DataFrame.createGlobalTempView
    DataFrame.createOrReplaceGlobalTempView
    DataFrame.createOrReplaceTempView
    DataFrame.createTempView
    DataFrame.crossJoin
    DataFrame.crosstab
    DataFrame.cube
    DataFrame.describe
    DataFrame.distinct
    DataFrame.drop
    DataFrame.dropDuplicates
    DataFrame.drop_duplicates
    DataFrame.dropna
    DataFrame.dtypes
    DataFrame.exceptAll
    DataFrame.explain
    DataFrame.fillna
    DataFrame.filter
    DataFrame.first
    DataFrame.foreach
    DataFrame.foreachPartition
    DataFrame.freqItems
    DataFrame.groupBy
    DataFrame.head
    DataFrame.hint
    DataFrame.inputFiles
    DataFrame.intersect
    DataFrame.intersectAll
    DataFrame.isLocal
    DataFrame.isStreaming
    DataFrame.join
    DataFrame.limit
    DataFrame.localCheckpoint
    DataFrame.mapInPandas
    DataFrame.na
    DataFrame.orderBy
    DataFrame.persist
    DataFrame.printSchema
    DataFrame.randomSplit
    DataFrame.rdd
    DataFrame.registerTempTable
    DataFrame.repartition
    DataFrame.repartitionByRange
    DataFrame.replace
    DataFrame.rollup
    DataFrame.sameSemantics
    DataFrame.sample
    DataFrame.sampleBy
    DataFrame.schema
    DataFrame.select
    DataFrame.selectExpr
    DataFrame.semanticHash
    DataFrame.show
    DataFrame.sort
    DataFrame.sortWithinPartitions
    DataFrame.stat
    DataFrame.storageLevel
    DataFrame.subtract
    DataFrame.summary
    DataFrame.tail
    DataFrame.take
    DataFrame.toDF
    DataFrame.toJSON
    DataFrame.toLocalIterator
    DataFrame.toPandas
    DataFrame.transform
    DataFrame.union
    DataFrame.unionAll
    DataFrame.unionByName
    DataFrame.unpersist
    DataFrame.where
    DataFrame.withColumn
    DataFrame.withColumnRenamed
    DataFrame.withWatermark
    DataFrame.write
    DataFrame.writeStream
    DataFrame.writeTo
    DataFrameNaFunctions.drop
    DataFrameNaFunctions.fill
    DataFrameNaFunctions.replace
    DataFrameStatFunctions.approxQuantile
    DataFrameStatFunctions.corr
    DataFrameStatFunctions.cov
    DataFrameStatFunctions.crosstab
    DataFrameStatFunctions.freqItems
    DataFrameStatFunctions.sampleBy

Column APIs
-----------

.. currentmodule:: pyspark.sql

.. autosummary::
    :toctree: api/

    Column.alias
    Column.asc
    Column.asc_nulls_first
    Column.asc_nulls_last
    Column.astype
    Column.between
    Column.bitwiseAND
    Column.bitwiseOR
    Column.bitwiseXOR
    Column.cast
    Column.contains
    Column.desc
    Column.desc_nulls_first
    Column.desc_nulls_last
    Column.dropFields
    Column.endswith
    Column.eqNullSafe
    Column.getField
    Column.getItem
    Column.isNotNull
    Column.isNull
    Column.isin
    Column.like
    Column.name
    Column.otherwise
    Column.over
    Column.rlike
    Column.startswith
    Column.substr
    Column.when
    Column.withField

Data Types
----------

.. currentmodule:: pyspark.sql.types

.. autosummary::
    :template: autosummary/class_with_docs.rst
    :toctree: api/

    ArrayType
    BinaryType
    BooleanType
    ByteType
    DataType
    DateType
    DecimalType
    DoubleType
    FloatType
    IntegerType
    LongType
    MapType
    NullType
    ShortType
    StringType
    StructField
    StructType
    TimestampType

Row
---

.. currentmodule:: pyspark.sql

.. autosummary::
    :toctree: api/

    Row.asDict

Functions
---------

.. currentmodule:: pyspark.sql.functions

.. autosummary::
    :toctree: api/

    abs
    acos
    acosh
    add_months
    aggregate
    approxCountDistinct
    approx_count_distinct
    array
    array_contains
    array_distinct
    array_except
    array_intersect
    array_join
    array_max
    array_min
    array_position
    array_remove
    array_repeat
    array_sort
    array_union
    arrays_overlap
    arrays_zip
    asc
    asc_nulls_first
    asc_nulls_last
    ascii
    asin
    asinh
    assert_true
    atan
    atanh
    atan2
    avg
    base64
    bin
    bitwise_not
    bitwiseNOT
    broadcast
    bround
    bucket
    cbrt
    ceil
    coalesce
    col
    collect_list
    collect_set
    column
    concat
    concat_ws
    conv
    corr
    cos
    cosh
    count
    count_distinct
    countDistinct
    covar_pop
    covar_samp
    crc32
    create_map
    cume_dist
    current_date
    current_timestamp
    date_add
    date_format
    date_sub
    date_trunc
    datediff
    dayofmonth
    dayofweek
    dayofyear
    days
    decode
    degrees
    dense_rank
    desc
    desc_nulls_first
    desc_nulls_last
    element_at
    encode
    exists
    exp
    explode
    explode_outer
    expm1
    expr
    factorial
    filter
    first
    flatten
    floor
    forall
    format_number
    format_string
    from_csv
    from_json
    from_unixtime
    from_utc_timestamp
    get_json_object
    greatest
    grouping
    grouping_id
    hash
    hex
    hour
    hours
    hypot
    initcap
    input_file_name
    instr
    isnan
    isnull
    json_tuple
    kurtosis
    lag
    last
    last_day
    lead
    least
    length
    levenshtein
    lit
    locate
    log
    log10
    log1p
    log2
    lower
    lpad
    ltrim
    map_concat
    map_entries
    map_filter
    map_from_arrays
    map_from_entries
    map_keys
    map_values
    map_zip_with
    max
    md5
    mean
    min
    minute
    monotonically_increasing_id
    month
    months
    months_between
    nanvl
    next_day
    nth_value
    ntile
    overlay
    pandas_udf
    percent_rank
    percentile_approx
    posexplode
    posexplode_outer
    pow
    quarter
    radians
    raise_error
    rand
    randn
    rank
    regexp_extract
    regexp_replace
    repeat
    reverse
    rint
    round
    row_number
    rpad
    rtrim
    schema_of_csv
    schema_of_json
    second
    sequence
    sha1
    sha2
    shiftleft
    shiftright
    shiftrightunsigned
    shuffle
    signum
    sin
    sinh
    size
    skewness
    slice
    sort_array
    soundex
    spark_partition_id
    split
    sqrt
    stddev
    stddev_pop
    stddev_samp
    struct
    substring
    substring_index
    sum
    sum_distinct
    sumDistinct
    tan
    tanh
    timestamp_seconds
    toDegrees
    toRadians
    to_csv
    to_date
    to_json
    to_timestamp
    to_utc_timestamp
    transform
    transform_keys
    transform_values
    translate
    trim
    trunc
    udf
    unbase64
    unhex
    unix_timestamp
    upper
    var_pop
    var_samp
    variance
    weekofyear
    when
    window
    xxhash64
    year
    years
    zip_with


.. currentmodule:: pyspark.sql.avro.functions

.. autosummary::
    :toctree: api/

    from_avro
    to_avro

Window
------

.. currentmodule:: pyspark.sql

.. autosummary::
    :toctree: api/

    Window.currentRow
    Window.orderBy
    Window.partitionBy
    Window.rangeBetween
    Window.rowsBetween
    Window.unboundedFollowing
    Window.unboundedPreceding
    WindowSpec.orderBy
    WindowSpec.partitionBy
    WindowSpec.rangeBetween
    WindowSpec.rowsBetween

Grouping
--------

.. currentmodule:: pyspark.sql

.. autosummary::
    :toctree: api/

    GroupedData.agg
    GroupedData.apply
    GroupedData.applyInPandas
    GroupedData.avg
    GroupedData.cogroup
    GroupedData.count
    GroupedData.max
    GroupedData.mean
    GroupedData.min
    GroupedData.pivot
    GroupedData.sum
    PandasCogroupedOps.applyInPandas