Fix the R style issue that is not caught by the R style checker. Got this error:
```
R/DataFrame.R:1244:17: style: Closing curly-braces should always be on their own line, unless it's followed by an else.
}, finally = {
^
lintr checks failed.
```
Closes #29574 from lu-wang-dl/fix-r-style.
Lead-authored-by: Lu WANG <lu.wang@databricks.com>
Co-authored-by: Lu Wang <38018689+lu-wang-dl@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The following if clause:
```sql
if(p, null, false)
```
can be simplified to:
```sql
and(p, null)
```
Similarly, the clause:
```sql
if(p, null, true)
```
can be simplified to:
```sql
or(not(p), null)
```
iff the predicate `p` is non-nullable, i.e., can be evaluated to either true or false, but not null.
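For illustration, here is a minimal standalone sketch of the rewrite in terms of Catalyst expressions (`simplifyIf` is a hypothetical helper; the actual rule lives in the optimizer and may be structured differently):
```scala
import org.apache.spark.sql.catalyst.expressions.{And, Expression, If, Literal, Not, Or}
import org.apache.spark.sql.types.BooleanType

// Hypothetical helper sketching the rewrite; not Spark's actual optimizer rule.
def simplifyIf(e: Expression): Expression = e match {
  // if(p, null, false) => and(p, null), valid only when p cannot evaluate to null
  case If(p, Literal(null, BooleanType), Literal(false, BooleanType)) if !p.nullable =>
    And(p, Literal(null, BooleanType))
  // if(p, null, true) => or(not(p), null), under the same non-nullability condition
  case If(p, Literal(null, BooleanType), Literal(true, BooleanType)) if !p.nullable =>
    Or(Not(p), Literal(null, BooleanType))
  case other => other
}
```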
### Why are the changes needed?
Converting `if` to `or`/`and` clauses allows filters to be pushed down better.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes #29567 from sunchao/SPARK-32721.
Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: DB Tsai <d_tsai@apple.com>
### What changes were proposed in this pull request?
Pass the specified options in `DataFrameReader.table` to `JDBCTableCatalog.loadTable`.
### Why are the changes needed?
Currently, `DataFrameReader.table` ignores the specified options. Options specified like the following are lost:
```
val df = spark.read
.option("partitionColumn", "id")
.option("lowerBound", "0")
.option("upperBound", "3")
.option("numPartitions", "2")
.table("h2.test.people")
```
We need to make `DataFrameReader.table` take the specified options.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually test for now. Will add a test after V2 JDBC read is implemented.
Closes #29535 from huaxingao/table_options.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to deduplicate configuration set/unset in `test_sparkSQL_arrow.R`.
Setting `spark.sql.execution.arrow.sparkr.enabled` can be globally done instead of doing it in each test case.
### Why are the changes needed?
To deduplicate the code.
### Does this PR introduce _any_ user-facing change?
No, dev-only
### How was this patch tested?
Manually ran the tests.
Closes #29592 from HyukjinKwon/SPARK-32747.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Removal of branched `StringIO` import.
### Why are the changes needed?
Top level `StringIO` is no longer present in Python 3.x.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #29590 from zero323/SPARK-32138-FOLLOW-UP.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
All three aggregate physical operators (`HashAggregateExec`, `ObjectHashAggregateExec` and `SortAggregateExec`) have the same `outputPartitioning` and `requiredChildDistribution` logic. Refactor this shared logic into their superclass `BaseAggregateExec` to avoid code duplication and future bugs (similar to `HashJoin` and `ShuffledJoin`).
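For illustration, a self-contained sketch of the refactoring pattern (the names and types below are simplified stand-ins, not Spark's actual classes):
```scala
// Shared logic that previously lived in each concrete operator moves into the base trait.
trait BaseAggregate {
  def groupingKeys: Seq[String]
  // Defined once here instead of being duplicated in every subclass.
  def outputPartitioning: String =
    if (groupingKeys.isEmpty) "single-partition" else s"hash(${groupingKeys.mkString(", ")})"
}
final case class HashAggregate(groupingKeys: Seq[String]) extends BaseAggregate
final case class ObjectHashAggregate(groupingKeys: Seq[String]) extends BaseAggregate
final case class SortAggregate(groupingKeys: Seq[String]) extends BaseAggregate
```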
### Why are the changes needed?
Reduce duplicated code across classes and prevent future bugs if we only update one class but forget another. We already did similar refactoring for join (`HashJoin` and `ShuffledJoin`).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests as this is pure refactoring and no new logic added.
Closes #29583 from c21/aggregate-refactor.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
https://issues.apache.org/jira/browse/SPARK-32719
### What changes were proposed in this pull request?
Add a check to detect missing imports. This makes sure that if we use a specific class, it is explicitly imported (not via a wildcard).
### Why are the changes needed?
To make sure that the quality of the Python code is up to standard.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing unit tests and Flake8 static analysis.
Closes #29563 from Fokko/fd-add-check-missing-imports.
Authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adds extended information for a function (arguments, examples, notes and the since field) to `SparkGetFunctionOperation`.
### Why are the changes needed?
Better user experience: it will help JDBC users to have a better understanding of our built-in functions.
### Does this PR introduce _any_ user-facing change?
Yes, BI tools and JDBC users will get full information on a Spark function instead of only fragmentary usage info.
e.g. date_part
#### before
```
date_part(field, source) - Extracts a part of the date/timestamp or interval source.
```
#### after
```
Usage:
date_part(field, source) - Extracts a part of the date/timestamp or interval source.
Arguments:
* field - selects which part of the source should be extracted, and supported string values are as same as the fields of the equivalent function `EXTRACT`.
* source - a date/timestamp or interval column from where `field` should be extracted
Examples:
> SELECT date_part('YEAR', TIMESTAMP '2019-08-12 01:00:00.123456');
2019
> SELECT date_part('week', timestamp'2019-08-12 01:00:00.123456');
33
> SELECT date_part('doy', DATE'2019-08-12');
224
> SELECT date_part('SECONDS', timestamp'2019-10-01 00:00:01.000001');
1.000001
> SELECT date_part('days', interval 1 year 10 months 5 days);
5
> SELECT date_part('seconds', interval 5 hours 30 seconds 1 milliseconds 1 microseconds);
30.001001
Note:
The date_part function is equivalent to the SQL-standard function `EXTRACT(field FROM source)`
Since: 3.0.0
```
### How was this patch tested?
New tests
Closes #29577 from yaooqinn/SPARK-32733.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Instead of deleting the data, we can move it to the trash.
Based on the configuration provided by the user, it will later be deleted permanently from the trash.
### Why are the changes needed?
Instead of directly deleting the data, we can provide flexibility to move data to the trash and then delete it permanently.
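A minimal sketch of the idea, assuming Hadoop's `Trash` API is the underlying mechanism (`removeData` and the `useTrash` flag are illustrative; the PR's actual config name and call site may differ):
```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path, Trash}

// Move the data to trash when enabled, otherwise delete it outright.
def removeData(path: Path, conf: Configuration, useTrash: Boolean): Boolean = {
  val fs = FileSystem.get(path.toUri, conf)
  if (useTrash) {
    // Honors fs.trash.interval; returns false when trash is disabled.
    Trash.moveToAppropriateTrash(fs, path, conf)
  } else {
    fs.delete(path, true) // recursive hard delete
  }
}
```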
### Does this PR introduce _any_ user-facing change?
Yes. After `TRUNCATE TABLE`, the data is no longer permanently deleted right away.
It is first moved to the trash and then, after the configured time, deleted permanently.
### How was this patch tested?
New UTs added.
Closes #29552 from Udbhav30/truncate.
Authored-by: Udbhav30 <u.agrawal30@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/29342, doing two things:
* Per https://github.com/apache/spark/pull/29342#discussion_r470153323, change from Java's `HashSet` to Spark's in-house `OpenHashSet` to track matched rows for non-unique join keys. I checked the `OpenHashSet` implementation, which is built from a key index (`OpenHashSet._bitset` as `BitSet`) and a key array (`OpenHashSet._data` as `Array`). Java's `HashSet` is built on `HashMap`, which stores values in `Node` linked lists and in theory should take more memory than `OpenHashSet`. Reran the same benchmark query used in https://github.com/apache/spark/pull/29342 and verified the query has similar performance between `HashSet` and `OpenHashSet`.
* Track metrics of the extra data structure `BitSet`/`OpenHashSet` for full outer SHJ. This depends on the item above, because there seems to be no easy way to get the memory size of a Java `HashSet`.
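A sketch of the data-structure swap (note that `OpenHashSet` is `private[spark]`, so this compiles only inside Spark's own source tree; `rowIndex` stands in for a matched row index):
```scala
import org.apache.spark.util.collection.OpenHashSet

val rowIndex = 42L
val matchedRows = new OpenHashSet[Long]()  // before: new java.util.HashSet[java.lang.Long]()
matchedRows.add(rowIndex)                  // no per-entry linked-list Node, no key boxing
assert(matchedRows.contains(rowIndex))
```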
### Why are the changes needed?
To surface the memory usage of full outer SHJ more accurately.
This can help users/developers to debug/improve full outer SHJ.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added a unit test in `SQLMetricsSuite.scala`.
Closes #29566 from c21/add-metrics.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Remove the YEAR, MONTH, DAY, HOUR, MINUTE, SECOND keywords. They are not useful in the parser, as we need to support plural like YEARS, so the parser has to accept the general identifier as interval unit anyway.
### Why are the changes needed?
These keywords are reserved in ANSI. If Spark has these keywords, then they become reserved under ANSI mode. This makes Spark unable to run TPCDS queries, as they use YEAR as an alias name.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added `TPCDSQueryANSISuite`, to make sure Spark with ANSI mode can run TPCDS queries.
Closes #29560 from cloud-fan/keyword.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Fixed `CrossValidatorModel.copy()` so that it correctly calls `.copy()` on the models instead of lists of models.
### Why are the changes needed?
`copy()` was first changed in #29445. The issue was found in the CI of #29524 and fixed. This PR introduces the exact same change so that `CrossValidatorModel.copy()` and its related tests are aligned in branch `master` and branch `branch-3.0`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Updated `test_copy` to make sure `copy()` is called on models instead of lists of models.
Closes #29553 from Louiszr/fix-cv-copy.
Authored-by: Louiszr <zxhst14@gmail.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
### What changes were proposed in this pull request?
Remove the assertion in ParquetSchemaConverter that the parquet mapKey field must be PrimitiveType.
### Why are the changes needed?
There is a parquet file in the attachment of [SPARK-32639](https://issues.apache.org/jira/browse/SPARK-32639), and the MessageType recorded in the file is:
```
message parquet_schema {
  optional group value (MAP) {
    repeated group key_value {
      required group key {
        optional binary first (UTF8);
        optional binary middle (UTF8);
        optional binary last (UTF8);
      }
      optional binary value (UTF8);
    }
  }
}
```
Use `spark.read.parquet("000.snappy.parquet")` to read the file. Spark will throw an exception when converting Parquet MessageType to Spark SQL StructType:
> AssertionError(Map key type is expected to be a primitive type, but found...)
Using `spark.read.schema("value MAP<STRUCT<first:STRING, middle:STRING, last:STRING>, STRING>").parquet("000.snappy.parquet")` to read the file, Spark returns the correct result.
According to the parquet project document (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps), the mapKey in the parquet format does not need to be a primitive type.
Note: this parquet file was not written by Spark, because Spark writes additional sparkSchema string information into the parquet files it produces. When reading such a file, Spark directly uses the embedded sparkSchema information instead of converting the Parquet MessageType to a Spark SQL StructType.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a unit test case
Closes #29451 from izchen/SPARK-32639.
Authored-by: Chen Zhang <izchen@126.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add missing since version for math functions, including
SPARK-8223 shiftright/shiftleft
SPARK-8215 pi
SPARK-8212 e
SPARK-6829 sin/asin/sinh/cos/acos/cosh/tan/atan/tanh/ceil/floor/rint/cbrt/signum/isignum/Fsignum/Lsignum/degrees/radians/log/log10/log1p/exp/expm1/pow/hypot/atan2
SPARK-8209 conv
SPARK-8213 factorial
SPARK-20751 cot
SPARK-2813 sqrt
SPARK-8227 unhex
SPARK-8218 log(a,b)
SPARK-8207 bin
SPARK-8214 hex
SPARK-8206 round
SPARK-14614 bround
### Why are the changes needed?
Fix SQL docs.
### Does this PR introduce _any_ user-facing change?
Yes, docs updated.
### How was this patch tested?
Passing doc generation.
Closes #29571 from yaooqinn/minor.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to add a specific `AQEOptimizer` for the `AdaptiveSparkPlanExec` instead of implementing an anonymous `RuleExecutor`. At the same time, this PR also adds the configuration `spark.sql.adaptive.optimizer.excludedRules`, which follows the same pattern of `Optimizer`, to make the `AQEOptimizer` more flexible for users and developers.
### Why are the changes needed?
Currently, `AdaptiveSparkPlanExec` has implemented an anonymous `RuleExecutor` to apply the AQE optimize rules on the plan. However, an anonymous class is usually inconvenient to maintain and extend over the long term.
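Hypothetical usage of the new configuration in a spark-shell session (the rule name below is illustrative; substitute whichever AQE optimizer rule you want to skip):
```scala
spark.conf.set(
  "spark.sql.adaptive.optimizer.excludedRules",
  "org.apache.spark.sql.execution.adaptive.DemoteBroadcastHashJoin")
```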
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
It's a pure refactoring, so passing the existing tests should be sufficient.
Closes #29559 from Ngone51/impro-aqe-optimizer.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch corrects the method doc of `DataFrameWriterV2.replace()`, in which the explanation of the thrown exception was inverted.
### Why are the changes needed?
The method doc is incorrect.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Only doc change.
Closes #29568 from HeartSaVioR/SPARK-28612-FOLLOWUP-fix-doc-nit.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to move Arrow usage guide from Spark documentation site to PySpark documentation site (at "User Guide").
Here is the demo for reviewing quicker: https://hyukjin-spark.readthedocs.io/en/stable/user_guide/arrow_pandas.html
### Why are the changes needed?
To have a single place for PySpark users, and better documentation.
### Does this PR introduce _any_ user-facing change?
Yes, it will move https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html to our PySpark documentation.
### How was this patch tested?
```bash
cd docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch
```
and
```bash
cd python/docs
make clean html
```
Closes #29548 from HyukjinKwon/SPARK-32183.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR changes key data types check in `HashJoin` to use `sameType`.
### Why are the changes needed?
Looking at the resolution condition of `SetOperation`, it requires only that each left-side data type be `sameType` as the corresponding right-side one. Logically, the `EqualTo` expression in an equi-join likewise requires only that the left data type be `sameType` as the right data type. `HashJoin`, however, requires the left key data types to be exactly the same as the right key data types, which looks unreasonable.
This causes inconsistent results when doing `except` between two DataFrames.
If two DataFrames have no nested fields, `HashJoin` passes the key type check even when field nullability differs, because it checks each field individually and ignores nullability.
If two DataFrames have nested fields such as structs, `HashJoin` fails the key type check, because it then compares the two struct types as a whole and nullability matters.
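An illustrative version of the relaxed check (not Spark's exact code): compare key data types with `sameType`, which ignores nullability recursively, instead of requiring exact equality:
```scala
import org.apache.spark.sql.types.DataType

def keyTypesCompatible(leftKeys: Seq[DataType], rightKeys: Seq[DataType]): Boolean =
  leftKeys.length == rightKeys.length &&
    leftKeys.zip(rightKeys).forall { case (l, r) => l.sameType(r) }
```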
### Does this PR introduce _any_ user-facing change?
Yes. The `except` operation between DataFrames becomes consistent.
### How was this patch tested?
Unit test.
Closes #29555 from viirya/SPARK-32693.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR aims to support executor id placeholder in `spark.kubernetes.executor.volumes.persistentVolumeClaim.myname.options.claimName` configuration like the following.
```
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=pvc-spark-SPARK_EXECUTOR_ID \
```
### Why are the changes needed?
This is a convenient way to mount a corresponding PV to each executor.
### Does this PR introduce _any_ user-facing change?
Yes, but this is a new feature and there is no regression, because users don't use `SPARK_EXECUTOR_ID` in PVC claim names.
### How was this patch tested?
Pass the newly added test case.
Closes #29557 from dongjoon-hyun/SPARK-PVC.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
The current documentation states that the default value of spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version is 1, which is not entirely true, since this configuration isn't set anywhere in Spark but rather is inherited from the Hadoop FileOutputCommitter class.
### What changes were proposed in this pull request?
I'm submitting this change to clarify that the default value depends entirely on the Hadoop version of the runtime environment.
### Why are the changes needed?
An application would end up using algorithm version 1 in certain environments, but, without any changes, the exact same application would use version 2 in environments running Hadoop 3.0 and later. This can have pretty bad consequences in certain scenarios; for example, two tasks can partially overwrite their output if speculation is enabled. Also, please refer to the following JIRA:
https://issues.apache.org/jira/browse/MAPREDUCE-7282
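One way to sidestep the environment-dependent default is to pin the algorithm version explicitly at session construction (v1 shown here; choose the version deliberately):
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("pin-committer-version")
  // Pin explicitly so behavior doesn't silently change with the Hadoop version.
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
  .getOrCreate()
```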
### Does this PR introduce _any_ user-facing change?
Yes. The configuration page previously highlighted explicitly that the default FileOutputCommitter algorithm version was v1; this has now changed to "Dependent on environment", with additional information in the description column to elaborate.
### How was this patch tested?
Checked the changes locally in a browser.
Closes #29541 from waleedfateem/SPARK-32701.
Authored-by: waleedfateem <waleed.fateem@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
We are using an older version of Bootstrap (v. 2.1.0) for the online documentation site. Bootstrap 2.x was moved to EOL in Aug 2013 and Bootstrap 3.x was moved to EOL in July 2019 (https://github.com/twbs/release). Older versions of Bootstrap are also getting flagged in security scans for various CVEs:
https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-72889
https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-173700
https://snyk.io/vuln/npm:bootstrap:20180529
https://snyk.io/vuln/npm:bootstrap:20160627
I haven't validated each CVE, but it would probably be good practice to resolve any potential issues and get on a supported release.
The bad news is that there have been quite a few changes between Bootstrap 2 and Bootstrap 4. I've tried updating the library, refactoring/tweaking the CSS and JS to maintain a similar appearance and functionality, and testing the documentation. This is a fairly large change so I'm sure additional testing and fixes will be needed.
### How was this patch tested?
This has been manually tested, but as there is a lot of documentation it is possible issues were missed. Additional testing and feedback is welcomed. If it appears a whole section was missed let me know and I'll take a pass at addressing that section.
Closes #27369 from clarkead/bootstrap4-docs-upgrade.
Authored-by: Dale Clarke <a.dale.clarke@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR lets JDBC clients identify Spark interval columns properly.
### Why are the changes needed?
JDBC users can query interval values through the Thrift server and create views with interval columns, e.g.
```sql
CREATE global temp view view1 as select interval 1 day as i;
```
but when they want to get the details of the columns of view1, they will fail with `Unrecognized type name: INTERVAL`:
```
Caused by: java.lang.IllegalArgumentException: Unrecognized type name: INTERVAL
at org.apache.hadoop.hive.serde2.thrift.Type.getType(Type.java:170)
at org.apache.spark.sql.hive.thriftserver.ThriftserverShimUtils$.toJavaSQLType(ThriftserverShimUtils.scala:53)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$addToRowSet$1(SparkGetColumnsOperation.scala:157)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.addToRowSet(SparkGetColumnsOperation.scala:149)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$6(SparkGetColumnsOperation.scala:113)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$6$adapted(SparkGetColumnsOperation.scala:112)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$5(SparkGetColumnsOperation.scala:112)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$5$adapted(SparkGetColumnsOperation.scala:111)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.runInternal(SparkGetColumnsOperation.scala:111)
... 34 more
```
### Does this PR introduce _any_ user-facing change?
YES,
#### before
![image](https://user-images.githubusercontent.com/8326978/91162239-6cd1ec80-e6fe-11ea-8c2c-914ddb325c4e.png)
#### after
![image](https://user-images.githubusercontent.com/8326978/91162025-1a90cb80-e6fe-11ea-94c4-03a6f2ec296b.png)
### How was this patch tested?
New tests.
Closes #29539 from yaooqinn/SPARK-32696.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently, `EmptyHashedRelation` and `HashedRelationWithAllNullKeys` are plain objects, which causes a Java deserialization exception like the following:
```
20/08/26 11:13:30 WARN [task-result-getter-2] TaskSetManager: Lost task 34.0 in stage 57.0 (TID 18076, emr-worker-5.cluster-183257, executor 18): java.io.InvalidClassException: org.apache.spark.sql.execution.joins.EmptyHashedRelation$; no valid constructor
at java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:169)
at java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:874)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1572)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:430)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$unBlockifyObject$4(TorrentBroadcast.scala:328)
```
This PR includes:
* Using a case object instead, to fix the serialization issue.
* Changing EmptyHashedRelation to no longer extend NullAwareHashedRelation, since it is already used in other non-NAAJ joins.
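A self-contained illustration (not Spark's code) of why `case object` is the safer choice for a broadcast singleton: it is `Serializable` and deserializes back to the very same instance via the generated `readResolve`, so identity checks still hold:
```scala
import java.io._

trait Relation extends Serializable
case object EmptyRelation extends Relation

val buf = new ByteArrayOutputStream()
val out = new ObjectOutputStream(buf)
out.writeObject(EmptyRelation)
out.close()
val back = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray)).readObject()
assert(back eq EmptyRelation) // the round trip preserves the singleton
```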
### Why are the changes needed?
Without this fix, BHJ fails when the build side is empty, and BHJ (NAAJ) fails when the build side has null partition keys.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
* Existing UT.
* Run entire TPCDS for E2E coverage.
Closes #29547 from leanken/leanken-SPARK-32705.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a follow-up PR to #29328, applying the same constraint (the `path` option cannot coexist with a path parameter) to `DataFrameWriter.save()`, `DataStreamReader.load()` and `DataStreamWriter.start()`.
### Why are the changes needed?
The current behavior silently overwrites the `path` option if a path parameter is passed to `DataFrameWriter.save()`, `DataStreamReader.load()` or `DataStreamWriter.start()`.
For example,
```
Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
```
will write the result to `/tmp/path2`.
### Does this PR introduce _any_ user-facing change?
Yes, if `path` option coexists with path parameter to any of the above methods, it will throw `AnalysisException`:
```
scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
org.apache.spark.sql.AnalysisException: There is a 'path' option set and save() is called with a path parameter. Either remove the path option, or call save() without the parameter. To ignore this check, set 'spark.sql.legacy.pathOptionBehavior.enabled' to 'true'.;
```
The user can restore the previous behavior by setting `spark.sql.legacy.pathOptionBehavior.enabled` to `true`.
### How was this patch tested?
Added new tests.
Closes #29543 from imback82/path_option.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The decommissioning state is a bit fragmented across two places in `TaskSchedulerImpl`:
https://github.com/apache/spark/pull/29014/ stored the incoming decommission info messages in TaskSchedulerImpl.executorsPendingDecommission.
Meanwhile, https://github.com/apache/spark/pull/28619/ stored just the executor end time in the map TaskSetManager.tidToExecutorKillTimeMapping (which in turn is contained in TaskSchedulerImpl).
While the two states are not really overlapping, it's a bit of a code hygiene concern to save this state in two places.
With https://github.com/apache/spark/pull/29422, TaskSchedulerImpl is emerging as the place where all decommissioning bookkeeping is kept within the driver. So consolidate the information in _tidToExecutorKillTimeMapping_ into _executorsPendingDecommission_.
However, in order to do so, we need to walk away from keeping the raw ExecutorDecommissionInfo messages and instead keep another class, ExecutorDecommissionState. This decoupling will allow the RPC message class ExecutorDecommissionInfo to evolve independently from the bookkeeping ExecutorDecommissionState.
### Why are the changes needed?
This is just a code cleanup. These two features were added independently, and it's time to consolidate their state for good hygiene.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests.
Closes #29452 from agrawaldevesh/consolidate_decom_state.
Authored-by: Devesh Agrawal <devesh.agrawal@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
### What changes were proposed in this pull request?
Fixes a typo in the docstring of `toDF`.
### Why are the changes needed?
The third argument of `toDF` is actually `sampleRatio`.
Related discussion: https://github.com/apache/spark/pull/12746#discussion-diff-62704834
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
This patch doesn't affect any logic, so existing tests should cover it.
Closes #29551 from unirt/minor_fix_docs.
Authored-by: unirt <lunirtc@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Use the `InSet` expression to fix a data correctness issue when pruning DPP on a non-atomic type. For example:
```scala
spark.range(1000)
.select(col("id"), col("id").as("k"))
.write
.partitionBy("k")
.format("parquet")
.mode("overwrite")
.saveAsTable("df1");
spark.range(100)
.select(col("id"), col("id").as("k"))
.write
.partitionBy("k")
.format("parquet")
.mode("overwrite")
.saveAsTable("df2")
spark.sql("set spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2")
spark.sql("set spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false")
spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = struct(df2.k) AND df2.id < 2").show
```
It should return two records, but it returns an empty result.
### Why are the changes needed?
Fixes a data correctness issue.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add new unit test.
Closes #29475 from wangyum/SPARK-32659.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Instead of deleting the data, we can move it to the trash.
Based on the configuration provided by the user, it will later be deleted permanently from the trash.
### Why are the changes needed?
Instead of directly deleting the data, we can provide flexibility to move data to the trash and then delete it permanently.
### Does this PR introduce _any_ user-facing change?
Yes. After `TRUNCATE TABLE`, the data is no longer permanently deleted right away.
It is first moved to the trash and then, after the configured time, deleted permanently.
### How was this patch tested?
New UTs added.
Closes #29387 from Udbhav30/tuncateTrash.
Authored-by: Udbhav30 <u.agrawal30@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR regenerates the golden explain file based on the fix: https://github.com/apache/spark/pull/29537
### Why are the changes needed?
Eliminates personally identifiable information (e.g., local directories) in the explain plan.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Checked manually.
Closes #29546 from Ngone51/follow-up-gen-golden-file.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to explicitly cache and hash the files/directories under 'build' for SBT and Zinc at GitHub Actions. Otherwise, it can end up overwriting the `build` directory. See also https://github.com/apache/spark/pull/29286#issuecomment-679368436
Previously, other files like `build/mvn` and `build/sbt` were also cached and overwritten, so when you had some changes there, they were ignored.
### Why are the changes needed?
To make GitHub Actions build stable.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
The builds in this PR test it out.
Closes #29536 from HyukjinKwon/SPARK-32695.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Spark's CSV source can optionally ignore lines starting with a comment char. Some code paths check whether it is set before applying comment logic (i.e., not set to the default of `\0`), but many do not, including the one that passes the option to Univocity. This means that rows beginning with a null char were being treated as comments even when the feature was 'disabled'.
### Why are the changes needed?
To avoid dropping rows that start with a null char when this is not requested or intended. See JIRA for an example.
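A hypothetical spark-shell repro (path and data are made up): with no comment character configured, a row starting with a null char must survive the read:
```scala
import spark.implicits._

val path = "/tmp/csv-nullchar-demo"
Seq("\u0000x,1", "y,2").toDF("line").write.mode("overwrite").text(path)
spark.read.csv(path).show() // both rows should appear after the fix
```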
### Does this PR introduce _any_ user-facing change?
Nothing beyond the effect of the bug fix.
### How was this patch tested?
Existing tests plus new test case.
Closes #29516 from srowen/SPARK-32614.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Fixing a flaky test in `ExecutorAllocationManagerSuite`. The problem is a timing issue around when `numExecutorsToAddPerResourceProfileId` gets reset when we do a reset. The fix is to always set those values back to 1 when we call `reset()`.
### Why are the changes needed?
Fixes a flaky test in `ExecutorAllocationManagerSuite`.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Ran the unit test via this PR a bunch of times; the fix seems to be working.
Closes #29508 from tgravescs/debugExecAllocTest.
Authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Changing an info log to a debug log based on SPARK-32664
### Why are the changes needed?
It is outlined in SPARK-32664
### Does this PR introduce _any_ user-facing change?
There are changes to the debug and info logs
### How was this patch tested?
Tested by looking at the logs
Closes #29527 from dmoore62/SPARK-32664.
Authored-by: Daniel Moore <moore@knights.ucf.edu>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes the doc error and add a migration guide for datetime pattern.
### Why are the changes needed?
This is a bug of the doc that we inherited from JDK https://bugs.openjdk.java.net/browse/JDK-8169482
The `SimpleDateFormat` pattern (**F: day of week in month**) we used in 2.x and the `DateTimeFormatter` pattern (**F: week-of-month**) we use now both have meanings opposite to what they declare in the Java docs. Unfortunately, this also leads to silent data changes in Spark.
The `week-of-month` is actually the pattern `W` in DatetimeFormatter, which is banned to use in Spark 3.x.
If we want to keep pattern `F`, we need to accept the behavior change with a proper migration guide and fix the doc in Spark.
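A quick way to check what pattern `F` actually yields on a given JDK (output intentionally not asserted here, since the documented meaning and the behavior diverge, see JDK-8169482):
```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

println(LocalDate.of(2019, 8, 12).format(DateTimeFormatter.ofPattern("F")))
```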
### Does this PR introduce _any_ user-facing change?
Yes, doc changed
### How was this patch tested?
Passing the CI doc generation job.
Closes #29538 from yaooqinn/SPARK-32683.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Extract `SQLQueryTestSuite.replaceNotIncludedMsg` to `PlanTest`.
2. Reuse `replaceNotIncludedMsg` to normalize the explain plan that generated in `PlanStabilitySuite`.
### Why are the changes needed?
This is a follow-up of https://github.com/apache/spark/pull/29270.
It eliminates personally identifiable information (e.g., local directories) in the explain plan.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Updated test.
Closes #29537 from Ngone51/follow-up-plan-stablity.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…er operations"
### What changes were proposed in this pull request?
This reverts commit 510a1656e6.
### Why are the changes needed?
see https://github.com/apache/spark/pull/29204#discussion_r475716547
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
Passing CI tools.
Closes #29531 from yaooqinn/revert.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
There is a bug in the way the optimizer rule `SimplifyExtractValueOps` is currently written in the master branch, which yields incorrect results in scenarios like the following:
```
sql("SELECT CAST(NULL AS struct<a:int,b:int>) struct_col")
.select($"struct_col".withField("d", lit(4)).getField("d").as("d"))
// currently returns this:
+---+
|d |
+---+
|4 |
+---+
// when in fact it should return this:
+----+
|d |
+----+
|null|
+----+
```
The changes in this PR will fix this bug.
### Why are the changes needed?
To fix the aforementioned bug. Optimizer rules should improve the performance of the query but yield exactly the same results.
### Does this PR introduce _any_ user-facing change?
Yes, this bug will no longer occur.
That said, this isn't something to be concerned about as this bug was introduced in Spark 3.1 and Spark 3.1 has yet to be released.
### How was this patch tested?
Unit tests were added. Jenkins must pass them.
Closes #29522 from fqaiser94/SPARK-32641.
Authored-by: fqaiser94@gmail.com <fqaiser94@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to fix ORC predicate pushdown under case-insensitive analysis case. The field names in pushed down predicates don't need to match in exact letter case with physical field names in ORC files, if we enable case-insensitive analysis.
This re-submits #29457. Because #29457 had a hive-1.2 error and some tests were failing with the hive-1.2 profile at the same time, #29457 was reverted to unblock others.
### Why are the changes needed?
Currently, ORC predicate pushdown doesn't work with case-insensitive analysis. A predicate "a < 0" cannot be pushed down to an ORC file with field name "A" under case-insensitive analysis.
Parquet predicate pushdown, however, works in this case. We should make ORC predicate pushdown work with case-insensitive analysis too.
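A hypothetical spark-shell repro (path and data are made up): the file's physical field is "A", the predicate uses "a", and with `spark.sql.caseSensitive=false` the filter should still be pushed down to the ORC reader after this fix:
```scala
import spark.implicits._

spark.conf.set("spark.sql.caseSensitive", "false")
Seq(-1, 0, 1).toDF("A").write.mode("overwrite").orc("/tmp/orc-case-demo")
spark.read.orc("/tmp/orc-case-demo").where("a < 0").show() // expect one row: -1
```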
### Does this PR introduce _any_ user-facing change?
Yes, after this PR, under case-insensitive analysis, ORC predicate pushdown will work.
### How was this patch tested?
Unit tests.
Closes #29530 from viirya/fix-orc-pushdown.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR enhances `Catalog.createTable()` to allow users to set the table's description. This corresponds to the following SQL syntax:
```sql
CREATE TABLE ...
COMMENT 'this is a fancy table';
```
### Why are the changes needed?
This brings the Scala/Python catalog APIs a bit closer to what's already possible via SQL.
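Hypothetical usage of the new parameter (the exact overload shape may differ; see the `Catalog.createTable` scaladoc for the signatures this PR adds):
```scala
import org.apache.spark.sql.types.StructType

spark.catalog.createTable(
  "fancy_table",            // tableName
  "parquet",                // source
  new StructType().add("id", "long"),
  "this is a fancy table",  // description (the new parameter)
  Map("path" -> "/tmp/fancy_table"))
```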
### Does this PR introduce _any_ user-facing change?
Yes, it adds a new parameter to `Catalog.createTable()`.
### How was this patch tested?
Existing unit tests:
```sh
./python/run-tests \
--python-executables python3.7 \
--testnames 'pyspark.sql.tests.test_catalog,pyspark.sql.tests.test_context'
```
```
$ ./build/sbt
testOnly org.apache.spark.sql.internal.CatalogSuite org.apache.spark.sql.CachedTableSuite org.apache.spark.sql.hive.MetastoreDataSourcesSuite org.apache.spark.sql.hive.execution.HiveDDLSuite
```
Closes #27908 from nchammas/SPARK-31000-table-description.
Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This pull request adds 2 test suites for 2 new classes, HybridStore and HistoryServerMemoryManager, which were created in https://github.com/apache/spark/pull/28412. It also makes some minor changes in these 2 classes to expose some variables for testing. Besides the 2 suites, this pull request adds a unit test in FsHistoryProviderSuite to test parsing logs with HybridStore.
### Why are the changes needed?
Unit tests are needed for new features.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes #29509 from baohe-zhang/SPARK-31608-UT.
Authored-by: Baohe Zhang <baohe.zhang@verizonmedia.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Added Java docs for Long data types in the Row class.
### Why are the changes needed?
The Long datatype is somehow missing in Row.scala's `apply` and `get` methods.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UTs.
Closes #29534 from yeshengm/docs-fix.
Authored-by: Yesheng Ma <kimi.ysma@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Change the `def this(dataTypes: Seq[DataType])` constructor of `SpecificInternalRow` to use `dataTypes.foreach` instead of fetching each element by index, because random access performance is unsatisfactory when the input argument is not an `IndexedSeq`.
This PR follows srowen's advice.
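A self-contained sketch of the core idea: for a `Seq` that is not an `IndexedSeq` (e.g. `List`), `apply(i)` is O(i), so an indexed loop is O(n^2) overall, while `foreach` traverses in O(n) regardless of the `Seq` type:
```scala
def fillByIndex(dataTypes: Seq[String]): Array[String] = {
  val out = new Array[String](dataTypes.length)
  var i = 0
  while (i < out.length) { out(i) = dataTypes(i); i += 1 } // quadratic for a List
  out
}

def fillByForeach(dataTypes: Seq[String]): Array[String] = {
  val out = new Array[String](dataTypes.length)
  var i = 0
  dataTypes.foreach { dt => out(i) = dt; i += 1 } // linear for any Seq
  out
}
```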
### Why are the changes needed?
I found that SPARK-32550 had some negative impact on performance. The typical case is "deterministic cardinality estimation" in `HyperLogLogPlusPlusSuite` when rsd is 0.001; the code that is significantly slower is line 41 in `HyperLogLogPlusPlusSuite`: `new SpecificInternalRow(hll.aggBufferAttributes.map(_.dataType))`
08b951b1cb/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlusSuite.scala (L40-L44)
The size of "hll.aggBufferAttributes" in this case is 209716. The comparison results before and after SPARK-32550 was merged are as follows (unit: ns):
| Case | After SPARK-32550 createBuffer | After SPARK-32550 end to end | Before SPARK-32550 createBuffer | Before SPARK-32550 end to end |
| -- | -- | -- | -- | -- |
| rsd 0.001, n 1000 | 52715513243 | 53004810687 | 195807999 | 773977677 |
| rsd 0.001, n 5000 | 51881246165 | 52519358215 | 13689949 | 249974855 |
| rsd 0.001, n 10000 | 52234282788 | 52374639172 | 14199071 | 183452846 |
| rsd 0.001, n 50000 | 55503517122 | 55664035449 | 15219394 | 584477125 |
| rsd 0.001, n 100000 | 51862662845 | 52116774177 | 19662834 | 166483678 |
| rsd 0.001, n 500000 | 51619226715 | 52183189526 | 178048012 | 16681330 |
| rsd 0.001, n 1000000 | 54861366981 | 54976399142 | 226178708 | 18826340 |
| rsd 0.001, n 5000000 | 52023602143 | 52354615149 | 388173579 | 15446409 |
| rsd 0.001, n 10000000 | 53008591660 | 53601392304 | 533454460 | 16033032 |
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
`mvn test -pl sql/catalyst -DwildcardSuites=org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlusSuite -Dtest=none`
**Before**:
```
Run completed in 8 minutes, 18 seconds.
Total number of tests run: 5
Suites: completed 2, aborted 0
Tests: succeeded 5, failed 0, canceled 0, ignored 0, pending 0
```
**After**:
```
Run completed in 7 seconds, 65 milliseconds.
Total number of tests run: 5
Suites: completed 2, aborted 0
Tests: succeeded 5, failed 0, canceled 0, ignored 0, pending 0
```
Closes #29529 from LuciferYang/revert-spark-32550.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
As discussed in https://github.com/apache/spark/pull/29491#discussion_r474451282 and in SPARK-32686, this PR un-deprecates Spark's ability to infer a DataFrame schema from a list of dictionaries. The ability is Pythonic and matches functionality offered by Pandas.
### Why are the changes needed?
This change clarifies to users that this behavior is supported and is not going away in the near future.
### Does this PR introduce _any_ user-facing change?
Yes. There used to be a `UserWarning` for this, but now there isn't.
### How was this patch tested?
I tested this manually.
Before:
```python
>>> spark.createDataFrame(spark.sparkContext.parallelize([{'a': 5}]))
/Users/nchamm/Documents/GitHub/nchammas/spark/python/pyspark/sql/session.py:388: UserWarning: Using RDD of dict to inferSchema is deprecated. Use pyspark.sql.Row instead
warnings.warn("Using RDD of dict to inferSchema is deprecated. "
DataFrame[a: bigint]
>>> spark.createDataFrame([{'a': 5}])
.../python/pyspark/sql/session.py:378: UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
warnings.warn("inferring schema from dict is deprecated,"
DataFrame[a: bigint]
```
After:
```python
>>> spark.createDataFrame(spark.sparkContext.parallelize([{'a': 5}]))
DataFrame[a: bigint]
>>> spark.createDataFrame([{'a': 5}])
DataFrame[a: bigint]
```
Closes #29510 from nchammas/SPARK-32686-df-dict-infer-schema.
Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to make the behavior consistent for the `path` option when loading DataFrames with a single path (e.g., `option("path", path).format("parquet").load(path)` vs. `option("path", path).parquet(path)`) by disallowing the `path` option to coexist with `load`'s path parameters.
### Why are the changes needed?
The current behavior is inconsistent:
```scala
scala> Seq(1).toDF.write.mode("overwrite").parquet("/tmp/test")
scala> spark.read.option("path", "/tmp/test").format("parquet").load("/tmp/test").show
+-----+
|value|
+-----+
| 1|
+-----+
scala> spark.read.option("path", "/tmp/test").parquet("/tmp/test").show
+-----+
|value|
+-----+
| 1|
| 1|
+-----+
```
### Does this PR introduce _any_ user-facing change?
Yes, now if the `path` option is specified along with `load`'s path parameters, it would fail:
```scala
scala> Seq(1).toDF.write.mode("overwrite").parquet("/tmp/test")
scala> spark.read.option("path", "/tmp/test").format("parquet").load("/tmp/test").show
org.apache.spark.sql.AnalysisException: There is a path option set and load() is called with path parameters. Either remove the path option or move it into the load() parameters.;
at org.apache.spark.sql.DataFrameReader.verifyPathOptionDoesNotExist(DataFrameReader.scala:310)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
... 47 elided
scala> spark.read.option("path", "/tmp/test").parquet("/tmp/test").show
org.apache.spark.sql.AnalysisException: There is a path option set and load() is called with path parameters. Either remove the path option or move it into the load() parameters.;
at org.apache.spark.sql.DataFrameReader.verifyPathOptionDoesNotExist(DataFrameReader.scala:310)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:250)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:778)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:756)
... 47 elided
```
The user can restore the previous behavior by setting `spark.sql.legacy.pathOptionBehavior.enabled` to `true`.
### How was this patch tested?
Added a test
Closes #29328 from imback82/dfw_option.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>