spark-instrumented-optimizer

Julien 35c5516355 [SPARK-26024][SQL] Update documentation for repartitionByRange	2018-11-19 22:24:53 +08:00

Following [SPARK-26024](https://issues.apache.org/jira/browse/SPARK-26024), I noticed the number of elements in each partition after repartitioning using `df.repartitionByRange` can vary for the same setup:

```scala
// Shuffle numbers from 0 to 1000, and make a DataFrame
val df = Random.shuffle(0.to(1000)).toDF("val")

// Repartition it using 3 partitions
// Sum up number of elements in each partition, and collect it.
// And do it several times
for (i <- 0 to 9) {
  var counts = df.repartitionByRange(3, col("val"))
    .mapPartitions{part => Iterator(part.size)}
    .collect()
  println(counts.toList)
}
// -> the number of elements in each partition varies
```

This is expected as for performance reasons this method uses sampling to estimate the ranges (with default size of 100). Hence, the output may not be consistent, since sampling can return different values. But documentation was not mentioning it at all, leading to misunderstanding.

## What changes were proposed in this pull request?

Update the documentation (Spark & PySpark) to mention the impact of `spark.sql.execution.rangeExchange.sampleSizePerPartition` on the resulting partitioned DataFrame.

Closes #23025 from JulienPeloton/SPARK-26024.

Authored-by: Julien <peloton@lal.in2p3.fr>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
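
The effect described in the commit message can be reproduced without a Spark cluster. The following is a minimal pure-Python sketch, not Spark's actual implementation: it assumes a single global random sample and evenly spaced quantile cut points, whereas Spark samples per input partition. The helper name `range_partition_counts` and its parameters are hypothetical and exist only for illustration.

```python
import bisect
import random


def range_partition_counts(values, num_partitions, sample_size, rng):
    """Count elements per partition when range bounds come from a random sample.

    Hypothetical helper: a simplified stand-in for sample-based range
    partitioning, not the algorithm Spark uses internally.
    """
    # Estimate the range boundaries from a random sample instead of a full sort.
    sample = sorted(rng.sample(values, min(sample_size, len(values))))
    # Pick num_partitions - 1 cut points at evenly spaced quantiles of the sample.
    bounds = [
        sample[len(sample) * k // num_partitions]
        for k in range(1, num_partitions)
    ]
    counts = [0] * num_partitions
    for v in values:
        # bisect_right maps each value to the first range whose upper bound exceeds it.
        counts[bisect.bisect_right(bounds, v)] += 1
    return counts


values = list(range(1001))  # same data as the commit message: 0 to 1000

# A sample of 100 gives noisy bounds, so partition sizes depend on the draw.
for seed in range(3):
    print(range_partition_counts(values, 3, 100, random.Random(seed)))

# Sampling every element makes the bounds exact, so the result no longer varies.
print(range_partition_counts(values, 3, len(values), random.Random(0)))
```

In this sketch the partition sizes always sum to the input size, but how they split depends on which elements were sampled. This is the same reason a larger `spark.sql.execution.rangeExchange.sampleSizePerPartition` would be expected to give more consistent partition sizes, at the cost of more sampling work.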
tests	[SPARK-26036][PYTHON] Break large tests.py files into smaller files	2018-11-15 12:30:52 +08:00
__init__.py	[SPARK-22369][PYTHON][DOCS] Exposes catalog API documentation in PySpark	2017-11-02 15:22:52 +01:00
catalog.py	[SPARK-24665][PYSPARK][FOLLOWUP] Use SQLConf in PySpark to manage all sql configs	2018-08-17 10:18:08 +08:00
column.py	[SPARK-23847][PYTHON][SQL] Add asc_nulls_first, asc_nulls_last to PySpark	2018-04-08 12:09:06 +08:00
conf.py	[SPARK-23698][PYTHON] Resolve undefined names in Python 3	2018-08-22 10:06:59 -07:00
context.py	[SPARK-25540][SQL][PYSPARK] Make HiveContext in PySpark behave as the same as Scala.	2018-09-27 09:51:20 +08:00
dataframe.py	[SPARK-26024][SQL] Update documentation for repartitionByRange	2018-11-19 22:24:53 +08:00
functions.py	[SPARK-26112][SQL] Update since versions of new built-in functions.	2018-11-19 22:18:20 +08:00
group.py	[SPARK-24722][SQL] pivot() with Column type argument	2018-08-04 14:17:32 +08:00
readwriter.py	[SPARK-25945][SQL] Support locale while parsing date/timestamp from CSV/JSON	2018-11-09 09:45:06 +08:00
session.py	[SPARK-25255][PYTHON] Add getActiveSession to SparkSession in PySpark	2018-10-26 09:40:13 -07:00
streaming.py	[SPARK-25972][PYTHON] Missed JSON options in streaming.py	2018-11-11 21:01:29 +08:00
types.py	[SPARK-25238][PYTHON] lint-python: Fix W605 warnings for pycodestyle 2.4	2018-09-13 11:19:43 +08:00
udf.py	[SPARK-25601][PYTHON] Register Grouped aggregate UDF Vectorized UDFs for SQL Statement	2018-10-04 09:36:23 +08:00
utils.py	[SPARK-24721][SQL] Exclude Python UDFs filters in FileSourceStrategy	2018-08-28 10:57:13 +08:00
window.py	[SPARK-25842][SQL] Deprecate rangeBetween APIs introduced in SPARK-21608	2018-10-26 13:17:24 +08:00