f9180f8752
## What changes were proposed in this pull request?

Most SQL functions defined in `spark.sql.functions` have two calling patterns: one that takes a Column object as input, and another that takes a string representing a column name, which is then converted into a Column object internally. There are, however, a few notable exceptions:

- lower()
- upper()
- abs()
- bitwiseNOT()
- ltrim()
- rtrim()
- trim()
- ascii()
- base64()
- unbase64()

While this doesn't break anything, since you can easily create a Column object yourself before passing it to one of these functions, it has two undesirable consequences:

1. It is surprising: it breaks coders' expectations when they are first starting with Spark. Every API should be as consistent as possible, so as to make the learning curve smoother and to reduce causes for human error;
2. It gets in the way of stylistic conventions. Most of the time it makes Python code more readable to use literal names, and the API provides ample support for that, but these few exceptions prevent the pattern from being universally applicable.

This patch fixes the problem described above.

### Effect

This patch **enables** support for passing column names as input to the functions mentioned above.

### Side effects

This PR also **fixes** an issue with some functions being defined multiple times via `_create_function()`.

### How it works

`_create_function()` was redefined to always convert its argument to a Column object. The old implementation has been kept under `_create_name_function()`, and is still used to generate the following special functions:

- lit()
- col()
- column()
- asc()
- desc()
- asc_nulls_first()
- asc_nulls_last()
- desc_nulls_first()
- desc_nulls_last()

This is because these functions can only take a column name as their argument, which is what their semantics require.

## How was this patch tested?

Ran ./dev/run-tests and tested it manually.

Closes #23882 from asmello/col-name-support-pyspark.
Authored-by: André Sá de Mello <amello@palantir.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>