spark-instrumented-optimizer/python/pyspark/sql
WeichenXu 7f605f5559 [SPARK-28621][SQL] Make spark.sql.crossJoin.enabled default value true
### What changes were proposed in this pull request?

Make `spark.sql.crossJoin.enabled` default value true

### Why are the changes needed?

For an implicit cross join, we can set up a watchdog to cancel it if it runs for too long, as sketched below.
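For example, a caller could cancel a runaway cross join from the application side using job groups; a hypothetical sketch (the group name and the 10-minute timeout are illustrative, while `setJobGroup` and `cancelJobGroup` are standard `SparkContext` APIs):
```
import java.util.{Timer, TimerTask}

// Tag all jobs launched from this thread so they can be cancelled as a group.
spark.sparkContext.setJobGroup("maybe-cross-join", "watchdogged query", interruptOnCancel = true)

// Illustrative watchdog: cancel the job group if it runs longer than 10 minutes.
val watchdog = new Timer(true)
watchdog.schedule(new TimerTask {
  def run(): Unit = spark.sparkContext.cancelJobGroup("maybe-cross-join")
}, 10 * 60 * 1000L)

val result = spark.sql("select sm1.id, bg1.id from bg1 join sm1 where sm1.id < bg1.id").collect()
watchdog.cancel() // the query finished in time; stop the watchdog
```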
When "spark.sql.crossJoin.enabled" is false, because `CheckCartesianProducts` is implemented in logical plan stage, it may generate some mismatching error which may confuse end user:
* it's done in logical phase, so we may fail queries that can be executed via broadcast join, which is very fast.
* if we move the check to the physical phase, then a query may success at the beginning, and begin to fail when the table size gets larger (other people insert data to the table). This can be quite confusing.
* the CROSS JOIN syntax doesn't work well if join reorder happens.
* some non-equi-join will generate plan using cartesian product, but `CheckCartesianProducts` do not detect it and raise error.

To address this in a simpler way, we turn off this cross-join error by default.
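Users who want the old safety check back can re-enable it explicitly; a minimal sketch (the config key is the one this PR changes, the session setup is illustrative):
```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cross-join-check") // illustrative app name
  .config("spark.sql.crossJoin.enabled", "false") // restore the old default
  .getOrCreate()

// Or toggle it on an existing session:
spark.conf.set("spark.sql.crossJoin.enabled", "false")
```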

For reference, here are some cases where the error does not match the actual plan.
Given these views:
```
spark.range(2).createOrReplaceTempView("sm1") // can be broadcast
spark.range(50000000).createOrReplaceTempView("bg1") // cannot be broadcast
spark.range(60000000).createOrReplaceTempView("bg2") // cannot be broadcast
```
1) Some joins could be converted to a broadcast nested loop join, but `CheckCartesianProducts` raises an error, e.g.
```
select sm1.id, bg1.id from bg1 join sm1 where sm1.id < bg1.id
```
2) Some joins run as a cartesian product, but `CheckCartesianProducts` does NOT raise an error, e.g.
```
select bg1.id, bg2.id from bg1 join bg2 where bg1.id < bg2.id
```
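With the new default (`spark.sql.crossJoin.enabled=true`), `explain()` shows what actually gets planned; a minimal sketch against the views above:
```
// Case 1: planned as a BroadcastNestedLoopJoin (fast), yet the old
// logical-phase check used to reject it.
spark.sql("select sm1.id, bg1.id from bg1 join sm1 where sm1.id < bg1.id").explain()

// Case 2: the physical plan contains a CartesianProduct, but the old
// check never raised an error for it anyway.
spark.sql("select bg1.id, bg2.id from bg1 join bg2 where bg1.id < bg2.id").explain()
```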

### Does this PR introduce any user-facing change?

Yes. The default value of `spark.sql.crossJoin.enabled` changes from `false` to `true`, so queries containing an implicit cartesian product no longer fail by default.
### How was this patch tested?

Existing unit tests were updated for the new default.
Closes #25520 from WeichenXu123/SPARK-28621.

Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-27 21:53:37 +08:00
avro [SPARK-28698][SQL] Support user-specified output schema in to_avro 2019-08-13 20:52:16 +08:00
tests [SPARK-28621][SQL] Make spark.sql.crossJoin.enabled default value true 2019-08-27 21:53:37 +08:00
__init__.py [SPARK-22369][PYTHON][DOCS] Exposes catalog API documentation in PySpark 2017-11-02 15:22:52 +01:00
catalog.py [SPARK-24665][PYSPARK][FOLLOWUP] Use SQLConf in PySpark to manage all sql configs 2018-08-17 10:18:08 +08:00
column.py [SPARK-28031][PYSPARK][TEST] Improve doctest on over function of Column 2019-06-13 11:04:41 +09:00
conf.py [SPARK-23698][PYTHON] Resolve undefined names in Python 3 2018-08-22 10:06:59 -07:00
context.py [SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis 2019-01-17 19:40:39 -06:00
dataframe.py [SPARK-28378][PYTHON] Remove usage of cgi.escape 2019-07-14 15:26:00 +09:00
functions.py [SPARK-28777][PYTHON][DOCS] Fix format_string doc string with the correct parameters 2019-08-19 20:44:46 -07:00
group.py [SPARK-24722][SQL] pivot() with Column type argument 2018-08-04 14:17:32 +08:00
readwriter.py [SPARK-28471][SQL] Replace yyyy by uuuu in date-timestamp patterns without era 2019-07-28 20:36:36 -07:00
session.py [SPARK-27995][PYTHON] Note the difference between str of Python 2 and 3 at Arrow optimized 2019-06-11 18:43:59 +09:00
streaming.py [SPARK-28651][SS] Force the schema of Streaming file source to be nullable 2019-08-09 18:54:55 +09:00
types.py [SPARK-28454][PYTHON] Validate LongType in createDataFrame(verifySchema=True) 2019-08-08 11:47:25 +09:00
udf.py [SPARK-28273][SQL][PYTHON] Convert and port 'pgSQL/case.sql' into UDF test base 2019-07-09 10:50:07 +08:00
utils.py [SPARK-27609][PYTHON] Convert values of function options to strings 2019-07-18 13:37:03 +09:00
window.py [MINOR][PYSPARK][SQL][DOC] Fix rowsBetween doc in Window 2019-06-14 09:56:37 +09:00