[SPARK-36865][PYTHON][DOCS] Add PySpark API document of session_window
### What changes were proposed in this pull request?
This PR adds PySpark API document of `session_window`.
The docstring of the function doesn't comply with the numpydoc format, so this PR also fixes it.
Further, the API document of `window` lacks a `Parameters` section, so it's also added in this PR.
### Why are the changes needed?
To provide PySpark users with the API document of the newly added function.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Ran `make html` in `python/docs` and got the following docs.
[window]
![time-window-python-doc-after](https://user-images.githubusercontent.com/4736016/134963797-ce25b268-20ca-48e3-ac8d-cbcbd85ebb3e.png)
[session_window]
![session-window-python-doc-after](https://user-images.githubusercontent.com/4736016/134963853-dd9d8417-139b-41ee-9924-14544b1a91af.png)
Closes #34118 from sarutak/python-session-window-doc.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit 5a32e41e9c)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
This commit is contained in:
parent 939c4d93b5
commit 8b2b6bb0d3
```diff
@@ -497,6 +497,7 @@ Functions
     second
     sentences
     sequence
+    session_window
     sha1
     sha2
     shiftleft
```
```diff
@@ -2300,6 +2300,29 @@ def window(timeColumn, windowDuration, slideDuration=None, startTime=None):

     .. versionadded:: 2.0.0

+    Parameters
+    ----------
+    timeColumn : :class:`~pyspark.sql.Column`
+        The column or the expression to use as the timestamp for windowing by time.
+        The time column must be of TimestampType.
+    windowDuration : str
+        A string specifying the width of the window, e.g. `10 minutes`,
+        `1 second`. Check `org.apache.spark.unsafe.types.CalendarInterval` for
+        valid duration identifiers. Note that the duration is a fixed length of
+        time, and does not vary over time according to a calendar. For example,
+        `1 day` always means 86,400,000 milliseconds, not a calendar day.
+    slideDuration : str, optional
+        A new window will be generated every `slideDuration`. Must be less than
+        or equal to the `windowDuration`. Check
+        `org.apache.spark.unsafe.types.CalendarInterval` for valid duration
+        identifiers. This duration is likewise absolute, and does not vary
+        according to a calendar.
+    startTime : str, optional
+        The offset with respect to 1970-01-01 00:00:00 UTC with which to start
+        window intervals. For example, in order to have hourly tumbling windows that
+        start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide
+        `startTime` as `15 minutes`.
+
     Examples
     --------
     >>> df = spark.createDataFrame([("2016-03-11 09:00:07", 1)]).toDF("date", "val")
```
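The tumbling-window semantics described by the added `window` docstring (fixed-length buckets, optional `startTime` offset from the Unix epoch) can be sketched in plain Python. This is an illustrative sketch only, not Spark's implementation; the function name is hypothetical:

```python
from datetime import datetime, timedelta

def tumbling_window(ts, window_duration, start_time=timedelta(0)):
    # Assign a timestamp to its fixed-length window. The window grid is
    # anchored at 1970-01-01 00:00:00 plus `start_time`, matching the
    # `startTime` behavior described in the docstring.
    epoch = datetime(1970, 1, 1)
    offset = (ts - epoch - start_time) % window_duration
    start = ts - offset
    return start, start + window_duration

# The docstring's example timestamp with 10-minute tumbling windows:
start, end = tumbling_window(datetime(2016, 3, 11, 9, 0, 7),
                             timedelta(minutes=10))
# 09:00:07 falls in the [09:00:00, 09:10:00) window
```

With `start_time=timedelta(minutes=15)` and a one-hour duration, the same arithmetic yields the 12:15-13:15-style windows the docstring mentions.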
```diff
@@ -2347,7 +2370,19 @@ def session_window(timeColumn, gapDuration):
     input row.
     The output column will be a struct called 'session_window' by default with the nested columns
     'start' and 'end', where 'start' and 'end' will be of :class:`pyspark.sql.types.TimestampType`.

     .. versionadded:: 3.2.0

+    Parameters
+    ----------
+    timeColumn : :class:`~pyspark.sql.Column`
+        The column or the expression to use as the timestamp for windowing by time.
+        The time column must be of TimestampType.
+    gapDuration : :class:`~pyspark.sql.Column` or str
+        A column or string specifying the timeout of the session. It could be static value,
+        e.g. `10 minutes`, `1 second`, or an expression/UDF that specifies gap
+        duration dynamically based on the input row.
+
     Examples
     --------
     >>> df = spark.createDataFrame([("2016-03-11 09:00:07", 1)]).toDF("date", "val")
```
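The gap-based semantics described by the `session_window` docstring (a session stays open while consecutive events arrive within `gapDuration`, and its window ends one gap after the last event) can likewise be sketched in plain Python. Again an illustrative sketch under those stated semantics, not Spark's implementation:

```python
from datetime import datetime, timedelta

def session_windows(timestamps, gap):
    # Group timestamps into session windows: an event that falls before the
    # current session's end extends that session; otherwise it opens a new one.
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            sessions[-1][1] = ts + gap        # extend the open session
        else:
            sessions.append([ts, ts + gap])   # start a new session
    return [(s, e) for s, e in sessions]

events = [datetime(2016, 3, 11, 9, 0, 7),
          datetime(2016, 3, 11, 9, 0, 25),
          datetime(2016, 3, 11, 9, 30, 0)]
wins = session_windows(events, timedelta(minutes=5))
# Two sessions: the first two events merge, the third starts a new session
```

Unlike the fixed-length windows of `window`, these window bounds depend on the data itself, which is why `gapDuration` may also be an expression evaluated per input row.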