### What changes were proposed in this pull request?
Inline type hints for python/pyspark/sql/window.py
### Why are the changes needed?
Currently, stub files are used for type hints. However, statements within functions cannot be type-checked.
This PR inlines type hints for python/pyspark/sql/window.py so that statements within functions can be type-checked.
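For illustration only (not the actual diff), a minimal sketch of what inlining buys us: once the hints live in the `.py` file rather than a stub, a type checker can also verify statements inside the function body. The function below is made up for the example.
```python
# Hypothetical example, not code from window.py.
def rows_between(start: int, end: int) -> str:
    # With inlined hints, mypy can check this body against the declared types.
    frame = f"ROWS BETWEEN {start} AND {end}"
    return frame
```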
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#34173 from xinrong-databricks/inline_window.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
We found an issue where a user configured both AQE and push-based shuffle, but the job started to hang after running some stages. Thread dumps taken from the executors showed tasks still waiting to fetch shuffle blocks.
This PR proposes changes to fix the issue.
### What changes were proposed in this pull request?
Disable batch fetch when push-based shuffle is enabled.
### Why are the changes needed?
Without this patch, enabling both AQE and push-based shuffle can cause tasks to hang.
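For reference, a minimal sketch of the configuration combination that exposed the hang (values are illustrative and this is not a complete push-based shuffle setup):
```python
from pyspark.sql import SparkSession

# Illustrative: AQE and push-based shuffle enabled together.
spark = (
    SparkSession.builder
    .config("spark.sql.adaptive.enabled", "true")      # AQE
    .config("spark.shuffle.push.enabled", "true")      # push-based shuffle (client side)
    .config("spark.shuffle.service.enabled", "true")   # external shuffle service, required for push
    .getOrCreate()
)
```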
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested the PR in our environment with spark-shell, using the following queries:
sql("""SELECT CASE WHEN rand() < 0.8 THEN 100 ELSE CAST(rand() * 30000000 AS INT) END AS s_item_id, CAST(rand() * 100 AS INT) AS s_quantity, DATE_ADD(current_date(), - CAST(rand() * 360 AS INT)) AS s_date FROM RANGE(1000000000)""").createOrReplaceTempView("sales")
// Dynamically coalesce partitions
sql("""SELECT s_date, sum(s_quantity) AS q FROM sales GROUP BY s_date ORDER BY q DESC""").collect
Unit tests to be added.
Closes#34156 from zhouyejoe/SPARK-36892.
Authored-by: Ye Zhou <yezhou@linkedin.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR proposes to refactor typing logic for multi-index support that was mostly introduced in https://github.com/apache/spark/pull/34176
At a high level, the functions below were introduced:
```bash
_extract_types # renamed from `extract_types`
```
```
_is_named_params
_address_named_type_holders
_to_tuple_of_params
_convert_tuples_to_zip
_address_unnamed_type_holders
```
In this PR, they are simplified to the following:
```bash
_to_type_holders # renamed from `_extract_types`
```
```bash
_new_type_holders # merged from `_is_named_params`, etc.
```
### Why are the changes needed?
To make the codes easier to read.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Existing tests should cover them.
Closes#34181 from HyukjinKwon/SPARK-36711.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
- Make the val lazy wherever `isPushBasedShuffleEnabled` is invoked when it is a class instance variable, so it can happen after user-defined jars/classes in `spark.kryo.classesToRegister` are downloaded and available on executor-side, as part of the fix for the exception mentioned below.
- Add a flag `checkSerializer` to control whether we need to check that a serializer `supportsRelocationOfSerializedObjects` within `isPushBasedShuffleEnabled`, as part of the fix for the exception mentioned below. Specifically, we don't check this in `registerWithExternalShuffleServer()` in `BlockManager` and `createLocalDirsForMergedShuffleBlocks()` in `DiskBlockManager.scala`, as the same issue would arise otherwise.
- Move `instantiateClassFromConf` and `instantiateClass` from `SparkEnv` into `Utils`, in order to let `isPushBasedShuffleEnabled` leverage them for instantiating serializer instances.
### Why are the changes needed?
When a user sets classes for Kryo serialization via `spark.kryo.classesToRegister`, the exception below (or a similar one) is encountered in `isPushBasedShuffleEnabled`.
Reproduced the issue in our internal branch by launching spark-shell as:
```
spark-shell --spark-version 3.1.1 --packages ml.dmlc:xgboost4j_2.12:1.3.1 --conf spark.kryo.classesToRegister=ml.dmlc.xgboost4j.scala.Booster
```
```
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1911)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:393)
at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend$.main(YarnCoarseGrainedExecutorBackend.scala:83)
at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.main(YarnCoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Failed to register classes with Kryo
at org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$5(KryoSerializer.scala:183)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:230)
at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:171)
at org.apache.spark.serializer.KryoSerializer$$anon$1.create(KryoSerializer.scala:102)
at com.esotericsoftware.kryo.pool.KryoPoolQueueImpl.borrow(KryoPoolQueueImpl.java:48)
at org.apache.spark.serializer.KryoSerializer$PoolWrapper.borrow(KryoSerializer.scala:109)
at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:346)
at org.apache.spark.serializer.KryoSerializerInstance.getAutoReset(KryoSerializer.scala:446)
at org.apache.spark.serializer.KryoSerializer.supportsRelocationOfSerializedObjects$lzycompute(KryoSerializer.scala:253)
at org.apache.spark.serializer.KryoSerializer.supportsRelocationOfSerializedObjects(KryoSerializer.scala:249)
at org.apache.spark.util.Utils$.isPushBasedShuffleEnabled(Utils.scala:2584)
at org.apache.spark.MapOutputTrackerWorker.<init>(MapOutputTracker.scala:1109)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:322)
at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:205)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:442)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
... 4 more
Caused by: java.lang.ClassNotFoundException: ml.dmlc.xgboost4j.scala.Booster
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:217)
at org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$6(KryoSerializer.scala:174)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$5(KryoSerializer.scala:173)
... 24 more
```
Registering user classes for Kryo serialization happens after serializer creation in SparkEnv. Serializer creation can happen in `isPushBasedShuffleEnabled`, which can be called in some places before SparkEnv is created. Also, as per analysis by JoshRosen, Kryo instantiation was probably failing because the added packages hadn't been downloaded to the executor yet (this code runs during executor startup, not task startup). The proposed change fixes this issue.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Passed existing tests.
Tested this patch in our internal branch where user reported the issue. Issue is now not reproducible with this patch.
Closes#34158 from rmcyang/SPARK-33781-bugFix.
Lead-authored-by: Minchu Yang <minyang@minyang-mn3.linkedin.biz>
Co-authored-by: Minchu Yang <31781684+rmcyang@users.noreply.github.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?
Change allowed input types of `Abs()` from:
```
NumericType + CalendarIntervalType + YearMonthIntervalType + DayTimeIntervalType
```
to
```
NumericType + YearMonthIntervalType + DayTimeIntervalType
```
### Why are the changes needed?
The changes make the error message clearer.
Before changes:
```sql
spark-sql> set spark.sql.legacy.interval.enabled=true;
spark.sql.legacy.interval.enabled true
spark-sql> select abs(interval -10 days -20 minutes);
21/10/05 09:11:30 ERROR SparkSQLDriver: Failed in [select abs(interval -10 days -20 minutes)]
java.lang.ClassCastException: org.apache.spark.sql.types.CalendarIntervalType$ cannot be cast to org.apache.spark.sql.types.NumericType
at org.apache.spark.sql.catalyst.util.TypeUtils$.getNumeric(TypeUtils.scala:77)
at org.apache.spark.sql.catalyst.expressions.Abs.numeric$lzycompute(arithmetic.scala:172)
at org.apache.spark.sql.catalyst.expressions.Abs.numeric(arithmetic.scala:169)
```
After:
```sql
spark.sql.legacy.interval.enabled true
spark-sql> select abs(interval -10 days -20 minutes);
Error in query: cannot resolve 'abs(INTERVAL '-10 days -20 minutes')' due to data type mismatch: argument 1 requires (numeric or interval day to second or interval year to month) type, however, 'INTERVAL '-10 days -20 minutes'' is of interval type.; line 1 pos 7;
'Project [unresolvedalias(abs(-10 days -20 minutes, false), None)]
+- OneRowRelation
```
### Does this PR introduce _any_ user-facing change?
No, because the original changes of https://github.com/apache/spark/pull/34169 haven't been released yet.
### How was this patch tested?
Manually checked in the command line, see examples above.
Closes#34183 from MaxGekk/fix-abs-input-types.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
1. Add the `SET CATALOG xxx` statement to change the current catalog.
2. Retain the `USE` statement for changing the current catalog.
3. Force loading of the new catalog when the current catalog is changed.
### Why are the changes needed?
ANSI SQL uses the `SET CATALOG XXX` statement to change the catalog.
[DISCUSS](https://github.com/apache/spark/pull/34030#issuecomment-925936538)
<img width="521" alt="set-catalog" src="https://user-images.githubusercontent.com/41178002/134658562-4e4dd879-b6e5-484c-9461-6345c3faaf2e.png">
### Does this PR introduce _any_ user-facing change?
Yes, users can use `SET CATALOG XXX` to change the current catalog.
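A rough usage sketch (assuming an active `spark` session as in the PySpark shell, and a made-up v2 catalog named `my_catalog` that has already been configured):
```python
# Hypothetical: requires a catalog registered via spark.sql.catalog.my_catalog=...
spark.sql("SET CATALOG my_catalog")
spark.sql("SELECT current_catalog()").show()
```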
### How was this patch tested?
Added unit test cases.
Closes#34096 from Peng-Lei/set-catalog-statement.
Authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This bug was introduced by https://github.com/apache/spark/pull/33177
When checking overflow of the sum value in the average function, we should use the `sumDataType` instead of the input decimal type.
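For context, a sketch of the kind of query affected (run in the PySpark shell where `spark` is predefined; values are made up): the sum of many `Decimal(10, 0)` values needs a wider decimal type than the input, so the overflow check must use that wider sum data type.
```python
from decimal import Decimal
from pyspark.sql.types import DecimalType, StructField, StructType

schema = StructType([StructField("v", DecimalType(10, 0))])
# Ten rows whose sum no longer fits in Decimal(10, 0).
df = spark.createDataFrame([(Decimal(9_999_999_999),)] * 10, schema)
df.selectExpr("avg(v)").show()
```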
### Why are the changes needed?
fix a regression
### Does this PR introduce _any_ user-facing change?
Yes, the result was wrong before this PR.
### How was this patch tested?
a new test
Closes#34180 from cloud-fan/bug.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Inline type hints from `python/pyspark/sql/functions.pyi` to `python/pyspark/sql/functions.py`.
### Why are the changes needed?
Currently, there is a type hint stub file `python/pyspark/sql/functions.pyi` to show the expected types for functions, but we can also take advantage of static type checking within the functions by inlining the type hints.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing test.
Closes#34130 from xinrong-databricks/inline_functions.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Support multi-index in new syntax to specify index data type
### Why are the changes needed?
Support multi-index in new syntax to specify index data type
https://issues.apache.org/jira/browse/SPARK-36707
### Does this PR introduce _any_ user-facing change?
After this PR user can use
``` python
>>> ps.DataFrame[[int, int],[int, int]]
typing.Tuple[pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.NameType, pyspark.pandas.typedef.typehints.NameType]
>>> arrays = [[1, 1, 2], ['red', 'blue', 'red']]
>>> idx = pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
>>> pdf = pd.DataFrame([[1,2,3],[2,3,4],[4,5,6]], index=idx, columns=["a", "b", "c"])
>>> ps.DataFrame[pdf.index.dtypes, pdf.dtypes]
typing.Tuple[pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.NameType, pyspark.pandas.typedef.typehints.NameType, pyspark.pandas.typedef.typehints.NameType]
>>> ps.DataFrame[[("index", int), ("index-2", int)], [("id", int), ("A", int)]]
typing.Tuple[pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.NameType, pyspark.pandas.typedef.typehints.NameType]
>>> ps.DataFrame[zip(pdf.index.names, pdf.index.dtypes), zip(pdf.columns, pdf.dtypes)]
typing.Tuple[pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.NameType, pyspark.pandas.typedef.typehints.NameType, pyspark.pandas.typedef.typehints.NameType]
```
### How was this patch tested?
Existing tests.
Closes#34176 from dchvn/SPARK-36711.
Authored-by: dchvn nguyen <dgd_contributor@viettel.com.vn>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes an issue that ambiguous self join can't be detected if the left and right DataFrame are swapped.
This is an example.
```
val df1 = Seq((1, 2, "A1"),(2, 1, "A2")).toDF("key1", "key2", "value")
val df2 = df1.filter($"value" === "A2")
df1.join(df2, df1("key1") === df2("key2")) // Ambiguous self join is detected and AnalysisException is thrown.
df2.join(df1, df1("key1") === df2("key2")) // Ambiguous self join is not detected.
```
The root cause seems to be that the inner function `collectConflictPlans` in `DeduplicateRelations` doesn't copy the `dataset_id` tag when it copies a `LogicalPlan`.
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes#34172 from sarutak/fix-deduplication-issue.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to handle ANSI interval types by the `Abs` expression, and the `abs()` function as a consequence of that:
- for positive and zero intervals, `ABS()` returns the input value unchanged,
- for the minimal supported values (`Int.MinValue` months for year-month intervals and `Long.MinValue` microseconds for day-time intervals), `ABS()` throws an arithmetic overflow exception,
- for other supported negative intervals, `ABS()` negates its input and returns a positive interval.
For example:
```sql
spark-sql> SELECT ABS(INTERVAL -'10-8' YEAR TO MONTH);
10-8
spark-sql> SELECT ABS(INTERVAL '-10 01:02:03.123456' DAY TO SECOND);
10 01:02:03.123456000
```
### Why are the changes needed?
To improve user experience with Spark SQL.
### Does this PR introduce _any_ user-facing change?
No, this PR just extends `ABS()` by supporting new types.
### How was this patch tested?
By running new tests:
```
$ build/sbt "test:testOnly *ArithmeticExpressionSuite"
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z interval.sql"
$ build/sbt "sql/test:testOnly org.apache.spark.sql.expressions.ExpressionInfoSuite"
```
Closes#34169 from MaxGekk/abs-ansi-intervals.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Inline type hints for python/pyspark/sql/catalog.py.
### Why are the changes needed?
Currently, the type hint stub file python/pyspark/sql/catalog.pyi is used. We can leverage static type checking by inlining the type hints.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing test.
Closes#34133 from xinrong-databricks/inline_catalog.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Inline type hints for conf.py and observation.py in python/pyspark/sql.
### Why are the changes needed?
Currently, there are type hint stub files (*.pyi) to show the expected types for functions, but we can also take advantage of static type checking within the functions by inlining the type hints.
### Does this PR introduce _any_ user-facing change?
No.
It has a DOC typo fix:
`Metrics are aggregation expressions, which are applied to the DataFrame while **is** is being`
is changed to
`Metrics are aggregation expressions, which are applied to the DataFrame while **it** is being`
### How was this patch tested?
Existing test.
Closes#34159 from xinrong-databricks/inline_conf_observation.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Use `SQLConf.resolver` for `caseSensitiveResolution`/`caseInsensitiveResolution` instead of adding a new method.
### Why are the changes needed?
remove redundant code
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing code
Closes#34171 from huaxingao/minor.
Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Fix `DataFrameGroupBy.apply` without shortcut.
Pandas' `DataFrameGroupBy.apply` sometimes behaves oddly when the UDF returns a `Series`, depending on whether there is only one group or more. E.g.:
```py
>>> pdf = pd.DataFrame(
... {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
... columns=["a", "b", "c"],
... )
>>> pdf.groupby('b').apply(lambda x: x['a'])
b
1 0 1
1 2
2 2 3
3 3 4
5 4 5
8 5 6
Name: a, dtype: int64
>>> pdf[pdf['b'] == 1].groupby('b').apply(lambda x: x['a'])
a 0 1
b
1 1 2
```
If there is only one group, it returns a "wide" `DataFrame` instead of a `Series`.
In our non-shortcut path, there is always only one group because the UDF runs via `groupby-applyInPandas`, so we will get a `DataFrame`, and then we should convert it to a `Series` ourselves.
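A pure-pandas sketch of the reshaping involved (illustrative only, not the actual pandas-on-Spark code path): with exactly one group the result comes back as a wide `DataFrame`, and something like `stack()` recovers the `Series` shape.
```python
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2, 3], "b": [1, 1, 1]})
res = pdf.groupby("b").apply(lambda x: x["a"])

# Single group -> wide DataFrame; convert it back to a Series ourselves.
if isinstance(res, pd.DataFrame):
    res = res.stack()
print(res)
```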
### Why are the changes needed?
`DataFrameGroupBy.apply` without shortcut could raise an exception when it returns `Series`.
```py
>>> ps.options.compute.shortcut_limit = 3
>>> psdf = ps.DataFrame(
... {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
... columns=["a", "b", "c"],
... )
>>> psdf.groupby("b").apply(lambda x: x["a"])
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
...
ValueError: Length mismatch: Expected axis has 2 elements, new values have 3 elements
```
### Does this PR introduce _any_ user-facing change?
The error above will be gone:
```py
>>> psdf.groupby("b").apply(lambda x: x["a"])
b
1 0 1
1 2
2 2 3
3 3 4
5 4 5
8 5 6
Name: a, dtype: int64
```
### How was this patch tested?
Added tests.
Closes#34160 from ueshin/issues/SPARK-36907/groupby-apply.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add a new test suite `ParquetVectorizedSuite` to provide more coverage on vectorized Parquet decoding logic, with different combinations on column index, dictionary, batch size, page size, etc.
To facilitate the test, this PR also refactors `SpecificParquetRecordReaderBase` and makes the Parquet row group reader pluggable.
### Why are the changes needed?
Currently `ParquetIOSuite` and `ParquetColumnIndexSuite` only test on the high-level API which is insufficient, especially after the introduction of column index support, for which we want to cover various combinations involving row ranges, first row index, batch size, page size, etc.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added new test suite.
Closes#34149 from sunchao/SPARK-36891-parquet-test.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Typo fix
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes#34129 from fishmandev/patch-1.
Authored-by: Dmitriy Fishman <fishman.code@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Currently there are no speculation metrics available for Spark either at application/job/stage level. This PR is to add some basic speculation metrics for a stage when speculation execution is enabled.
This is similar to the existing stage-level metrics, tracking numTotal (total number of speculated tasks), numCompleted (total number of successful speculated tasks), numFailed (total number of failed speculated tasks), numKilled (total number of killed speculated tasks), etc.
This new set of metrics helps in further understanding the speculative execution feature in the context of the application and also helps in tuning the speculative execution config knobs.
Screenshot of Spark UI with speculation summary:
![Screen Shot 2021-09-22 at 12 12 20 PM](https://user-images.githubusercontent.com/8871522/135321311-db7699ad-f1ae-4729-afea-d1e2c4e86103.png)
Screenshot of Spark UI with API output:
![Screen Shot 2021-09-22 at 12 10 37 PM](https://user-images.githubusercontent.com/8871522/135321486-4dbb7a67-5580-47f8-bccf-81c758c2e988.png)
### Why are the changes needed?
Additional metrics for speculative execution.
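For reference, a minimal sketch of turning speculation on so these metrics get populated (threshold values are illustrative; these must be set before the application starts):
```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .set("spark.speculation", "true")            # enable speculative execution
    .set("spark.speculation.quantile", "0.75")   # fraction of tasks that must finish first
    .set("spark.speculation.multiplier", "1.5")  # how much slower than the median counts as slow
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
```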
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit tests added and also deployed in our internal platform for quite some time now.
Lead-authored by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Co-authored by: Ron Hu <rhu@linkedin.com>
Co-authored by: Thejdeep Gudivada <tgudivada@linkedin.com>
Closes#33253 from venkata91/speculation-metrics.
Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
This PR proposes implementing `MultiIndex.equal_levels`.
```python
>>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
>>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
>>> psmidx1.equal_levels(psmidx2)
True
>>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z"), ("a", "y")])
>>> psmidx2 = ps.MultiIndex.from_tuples([("a", "y"), ("b", "x"), ("c", "z"), ("c", "x")])
>>> psmidx1.equal_levels(psmidx2)
True
```
This was originally proposed in https://github.com/databricks/koalas/pull/1789, and all reviews in the original PR have been resolved.
### Why are the changes needed?
We should support the pandas API as much as possible for pandas-on-Spark module.
### Does this PR introduce _any_ user-facing change?
Yes, the `MultiIndex.equal_levels` API is available.
### How was this patch tested?
Unittests
Closes#34113 from itholic/SPARK-36435.
Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adds the `INTERNAL_ERROR` error class and the `isInternalError` API to `SparkThrowable`.
Removes existing error classes that are internal-only and replaces them with `INTERNAL_ERROR`.
### Why are the changes needed?
Makes it easy for end-users to diagnose whether an error is an internal error. If an end-user encounters an internal error, it should be reported immediately. This also limits the number of error classes, making it easy to audit. We do not need high-quality error messages for internal errors, as they should not be exposed to the end-user.
### Does this PR introduce _any_ user-facing change?
Yes; this changes the error class in master.
### How was this patch tested?
Unit tests
Closes#34123 from karenfeng/internal-error-class.
Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to support reading and writing ANSI intervals from/to JSON datasources.
With this change, interval data is written in a literal form like `{"col":"INTERVAL '1-2' YEAR TO MONTH"}`.
For the reading part, we need to specify the schema explicitly like:
```
val readDF = spark.read.schema("col INTERVAL YEAR TO MONTH").json(...)
```
### Why are the changes needed?
For better usability. There should be no reason to prohibit reading/writing ANSI intervals from/to JSON datasources.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test. It covers both V1 and V2 sources.
Closes#34155 from sarutak/ansi-interval-json-source.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR adds PySpark API document of `session_window`.
The docstring of the function doesn't comply with the numpydoc format, so this PR also fixes it.
Further, the API document of `window` doesn't have a `Parameters` section, so it's also added in this PR.
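A small usage sketch of the function being documented (run in the PySpark shell where `spark` is predefined; data and column names are made up):
```python
from pyspark.sql import functions as F

# Count events per user within sessions that close after 5 minutes of inactivity.
events = spark.createDataFrame(
    [("2021-10-01 10:00:00", "u1"), ("2021-10-01 10:02:00", "u1")],
    ["ts", "user"],
).withColumn("ts", F.col("ts").cast("timestamp"))

events.groupBy(F.session_window("ts", "5 minutes"), "user").count().show(truncate=False)
```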
### Why are the changes needed?
To provide PySpark users with the API document of the newly added function.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Ran `make html` in `python/docs` and got the following docs.
[window]
![time-window-python-doc-after](https://user-images.githubusercontent.com/4736016/134963797-ce25b268-20ca-48e3-ac8d-cbcbd85ebb3e.png)
[session_window]
![session-window-python-doc-after](https://user-images.githubusercontent.com/4736016/134963853-dd9d8417-139b-41ee-9924-14544b1a91af.png)
Closes#34118 from sarutak/python-session-window-doc.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Support ILIKE (case insensitive LIKE) API on R.
### Why are the changes needed?
The ILIKE statement on the SQL interface is supported as of SPARK-36674.
This PR adds the R API for it.
### Does this PR introduce _any_ user-facing change?
Yes. Users can call ilike from R.
### How was this patch tested?
Unit tests.
Closes#34152 from yoda-mon/r-ilike.
Authored-by: Leona Yoda <yodal@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
Currently `dropTempView` and `dropGlobalTempView` don't return a value, which conflicts with their docstring:
`Returns true if this view is dropped successfully, false otherwise.` It's also inconsistent with the same API in other languages.
The PR proposes a fix for that.
### Why are the changes needed?
Be consistent with API in other languages.
### Does this PR introduce _any_ user-facing change?
Yes.
#### From
```py
# dropTempView
>>> spark.createDataFrame([(1, 1)]).createTempView("my_table")
>>> spark.table("my_table").collect()
[Row(_1=1, _2=1)]
>>> spark.catalog.dropTempView("my_table")
>>> spark.catalog.dropTempView("my_table")
# dropGlobalTempView
>>> spark.createDataFrame([(1, 1)]).createGlobalTempView("my_table")
>>> spark.table("global_temp.my_table").collect()
[Row(_1=1, _2=1)]
>>> spark.catalog.dropGlobalTempView("my_table")
>>> spark.catalog.dropGlobalTempView("my_table")
```
#### To
```py
# dropTempView
>>> spark.createDataFrame([(1, 1)]).createTempView("my_table")
>>> spark.table("my_table").collect()
[Row(_1=1, _2=1)]
>>> spark.catalog.dropTempView("my_table")
True
>>> spark.catalog.dropTempView("my_table")
False
# dropGlobalTempView
>>> spark.createDataFrame([(1, 1)]).createGlobalTempView("my_table")
>>> spark.table("global_temp.my_table").collect()
[Row(_1=1, _2=1)]
>>> spark.catalog.dropGlobalTempView("my_table")
True
>>> spark.catalog.dropGlobalTempView("my_table")
False
```
### How was this patch tested?
Existing tests.
Closes#34147 from xinrong-databricks/fix_return.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Upgrade Mesos to 1.4.3.
### Why are the changes needed?
Fix CVE-2018-11793
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually
Closes#34144 from warrenzhu25/mesos.
Authored-by: Warren Zhu <warren.zhu25@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Add tests to `HashExpressionsSuite` for the `sha2` function with bit lengths of 0 and 512. Also add comments to clarify an existing ambiguous comment.
### Why are the changes needed?
Currently, the `sha2` function with bit lengths of `0` and `512` was not tested. This PR adds those tests.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Ran the sha2 test in `HashExpressionsSuite` to ensure added tests pass.
Closes#34145 from richardc-db/add_sha_tests.
Authored-by: Richard Chen <r.chen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade GitHub Action CI image to recover CRAN installation failure.
### Why are the changes needed?
Sometimes, the GitHub Action linter job fails:
- https://github.com/apache/spark/runs/3739748809
The new image has R 4.1.1 and will recover from the failure.
```
$ docker run -it --rm dongjoon/apache-spark-github-action-image:20210928 R --version
R version 4.1.1 (2021-08-10) -- "Kick Things"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
https://www.gnu.org/licenses/.
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass `GitHub Action`.
Closes#34138 from dongjoon-hyun/SPARK-36883.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to support reading and writing ANSI intervals from/to CSV datasources.
With this change, interval data is written in a literal form like `INTERVAL '1-2' YEAR TO MONTH`.
For the reading part, we need to specify the schema explicitly like:
```
val readDF = spark.read.schema("col INTERVAL YEAR TO MONTH").csv(...)
```
### Why are the changes needed?
For better usability. There should be no reason to prohibit reading/writing ANSI intervals from/to CSV datasources.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test. It covers both V1 and V2 sources.
Closes#34142 from sarutak/ansi-interval-csv-source.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
In the current code for YARN client mode, even when the user runs `yarn application -kill` to kill the application, the driver side still exits with code 0. Because of this behavior, the job scheduler cannot tell that the job did not succeed, and neither can the user.
In this case we should exit the program with a non-zero code.
### Why are the changes needed?
Make the application's status clearer to the scheduler/user.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Closes#33873 from AngersZhuuuu/SPDI-36624.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
When the exception is InvocationTargetException, get cause and stack trace.
### Why are the changes needed?
Currently, when UDF reflection fails, `InvocationTargetException` is thrown, but it is not a specific exception and hides the underlying cause.
```
Error in query: No handler for Hive UDF 'XXX': java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
manual test
Closes#33796 from cxzl25/SPARK-36550.
Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
In the PR, I propose to take into account the SQL config `spark.sql.parquet.filterPushdown` while building the array of pushed down filters in v2 `ParquetScanBuilder`.
### Why are the changes needed?
Before the changes, `explain()` outputs some pushed filters even when filter pushdown to Parquet is disabled:
```
== Physical Plan ==
*(1) Filter (isnotnull(c0#7) AND (c0#7 = 1))
+- *(1) ColumnarToRow
+- BatchScan[c0#7] ParquetScan DataFilters: [isnotnull(c0#7), (c0#7 = 1)], ... PushedFilters: [IsNotNull(c0), EqualTo(c0,1)] RuntimeFilters: []
```
### Does this PR introduce _any_ user-facing change?
If users parse the output of `explain()`, this PR can affect them, but in general it shouldn't. Note that the PR affects DSv2 only.
### How was this patch tested?
By running new test in `ParquetV2FilterSuite`:
```
$ build/sbt "test:testOnly *ParquetV2FilterSuite"
```
Closes#34140 from MaxGekk/fix-parquet-v2-pushed-filters.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
- Split the plan into several groups; every child of a union is a new group
- Coalesce partitions for every group
### Why are the changes needed?
#### First Issue
The rule `CoalesceShufflePartitions` can only coalesce partitions if
* leaf node is ShuffleQueryStage
* all shuffle have same partition number
With `Union`, this assumption might break. Let's say we have a plan like:
```
Union
  HashAggregate
    ShuffleQueryStage
  FileScan
```
`CoalesceShufflePartitions` cannot optimize it, and the resulting partition number would be `shuffle partitions + FileScan partitions`, which can be quite large.
It's better to support partial optimization with `Union`.
#### Second Issue
The partition-coalescing formula uses the **sum** as the total input size, which is not friendly for unions; see
```
// ShufflePartitionsUtil.coalescePartitions
val totalPostShuffleInputSize = mapOutputStatistics.flatMap(_.map(_.bytesByPartitionId.sum)).sum
```
So for a case such as:
```
Union
  HashAggregate
    ShuffleQueryStage
  HashAggregate
    ShuffleQueryStage
```
The `CoalesceShufflePartitions` rule will return an unexpected partition number.
### Does this PR introduce _any_ user-facing change?
Probably yes, the resulting partition numbers might change.
### How was this patch tested?
Add test.
Closes#32084 from ulysses-you/SPARK-34980.
Lead-authored-by: ulysses-you <ulyssesyou18@gmail.com>
Co-authored-by: ulysses <ulyssesyou18@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Support ILIKE (case insensitive LIKE) API on Python.
### Why are the changes needed?
The ILIKE statement on the SQL interface is supported as of SPARK-36674.
This PR adds the Python API for it.
### Does this PR introduce _any_ user-facing change?
Yes. Users can call `ilike` from Python.
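A small usage sketch of the new API (run in the PySpark shell where `spark` is predefined; data is made up):
```python
df = spark.createDataFrame([("Spark",), ("spark",), ("flink",)], ["name"])

# Case-insensitive LIKE: matches both "Spark" and "spark".
df.filter(df.name.ilike("spark%")).show()
```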
### How was this patch tested?
Unit tests.
Closes#34135 from yoda-mon/python-ilike.
Authored-by: Leona Yoda <yodal@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
Indexes are database objects created on one or more columns of a table, and are used to improve query performance. A detailed explanation of database indexes is here: https://en.wikipedia.org/wiki/Database_index
This PR adds `supportsIndex` interface that provides APIs to work with indexes.
### Why are the changes needed?
Many data sources support indexes to improve query performance. In order to take advantage of index support in data sources, this `supportsIndex` interface is added to let users create/drop an index, list indexes, etc.
### Does this PR introduce _any_ user-facing change?
yes, the following new APIs are added:
- createIndex
- dropIndex
- indexExists
- listIndexes
New SQL syntax:
```
CREATE [index_type] INDEX [index_name] ON [TABLE] table_name (column_index_property_list)[OPTIONS indexPropertyList]
column_index_property_list: column_name [OPTIONS(indexPropertyList)] [ , . . . ]
indexPropertyList: index_property_name = index_property_value [ , . . . ]
DROP INDEX index_name
```
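A hypothetical usage sketch of the syntax above (table, column, and index names are made up, and this assumes a data source that actually implements the new interface; only the interface itself is added in this PR):
```python
# Hypothetical: the underlying catalog/data source must support indexes.
spark.sql("CREATE INDEX people_idx ON TABLE people (name)")
spark.sql("DROP INDEX people_idx")
```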
### How was this patch tested?
Only the interface is added for now. Tests will be added along with the implementation.
Closes#33754 from huaxingao/index_interface.
Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch simply removes a few `unsupportedOperationCheck` in `TextSocketStreamSuite`.
### Why are the changes needed?
`unsupportedOperationCheck` is used to disable the check for unsupported operations. Since we are not testing unsupported operations, setting it in `TextSocketStreamSuite` was unnecessary and could cause unexpected errors by skipping the check.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests.
Closes#34132 from viirya/minor-test.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Inlines type hint files under `pyspark/sql/pandas` folder, except for `pyspark/sql/pandas/functions.pyi` and files under `pyspark/sql/pandas/_typing`.
- Since the file contains a lot of overloads, we should revisit and manage it separately.
- We can't inline files under `pyspark/sql/pandas/_typing` because it includes new syntax for type hints.
### Why are the changes needed?
Currently there are type hint stub files (`*.pyi`) to show the expected types for functions, but we can also take advantage of static type checking within the functions by inlining the type hints.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#34101 from ueshin/issues/SPARK-36846/inline_typehints.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/34104
`SHOW CURRENT NAMESPACE` is a very simple command that does not involve the v2 catalog API, does not need analysis, and has no children. We can simply use `RunnableCommand` to avoid defining a physical plan.
### Why are the changes needed?
code simplification
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests
Closes#34128 from cloud-fan/follow.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to pushdown filters with ANSI intervals as filter values to the parquet datasource. After the changes, filter values are pushed down with the following values via Filter API:
- `java.time.Period` for year-month filters
- `java.time.Duration` for day-time filters.
At the Parquet filter level we don't have info about Catalyst's types (`YearMonthIntervalType` and `DayTimeIntervalType`), only about the primitive Parquet types `INT32` and `INT64`. As a consequence, Spark has to convert filter values "dynamically" to `Int`/`Long` while building Parquet filters.
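A rough sketch of the value mapping involved (plain Python arithmetic, not the actual conversion code): year-month intervals become a month count (INT32) and day-time intervals a microsecond count (INT64).
```python
# Illustrative encoding only.
def year_month_to_int(years: int, months: int) -> int:
    return years * 12 + months  # e.g. INTERVAL '1-2' YEAR TO MONTH -> 14

def day_time_to_long(days: int, hours: int = 0, minutes: int = 0, seconds: int = 0) -> int:
    total_seconds = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    return total_seconds * 1_000_000  # microseconds, e.g. INTERVAL '1' DAY -> 86_400_000_000

print(year_month_to_int(1, 2), day_time_to_long(1))
```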
### Why are the changes needed?
The PR https://github.com/apache/spark/pull/34057 supported ANSI intervals in the Parquet datasource as INT32 (year-month interval) and INT64 (day-time interval), but filters with such values are not pushed down. So, compared to primitive types, ANSI intervals can suffer from worse read performance. This PR aims to solve the issue and achieve feature parity with other types.
### Does this PR introduce _any_ user-facing change?
No, the changes might only influence the performance of the Parquet datasource.
### How was this patch tested?
By running new tests in `ParquetFilterSuite`:
```
$ build/sbt clean "test:testOnly *ParquetV1FilterSuite"
$ build/sbt clean "test:testOnly *ParquetV2FilterSuite"
```
Closes#34115 from MaxGekk/interval-parquet-filter-pushdown.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
`sha2(input, bit_length)` returns incorrect results when `bit_length == 224` for all inputs.
This error can be reproduced by running `spark.sql("SELECT sha2('abc', 224)").show()`, for instance, in spark-shell.
Spark currently returns
```
#\t}"4�"�B�w��U�*��你���l��
```
while the expected result is
```
23097d223405d8228642a477bda255b32aadbce4bda0b3f7e36c9da7
```
This appears to happen because `MessageDigest.digest()` returns bytes intended to be interpreted as a `BigInt` rather than as a string. Thus, the output of `MessageDigest.digest()` must first be interpreted as a `BigInt` and then transformed into a hex string, rather than being interpreted directly as a hex string.
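As a quick cross-check of the expected digest, independent of Spark:
```python
import hashlib

# SHA-224 of "abc"; matches the expected value shown above.
print(hashlib.sha224(b"abc").hexdigest())
# 23097d223405d8228642a477bda255b32aadbce4bda0b3f7e36c9da7
```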
### Why are the changes needed?
`sha2(input, bit_length)` with a `bit_length` input of `224` previously returned an incorrect result.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a new test to `HashExpressionsSuite.scala` which previously failed and now passes.
Closes#34086 from richardc-db/sha224.
Authored-by: Richard Chen <r.chen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Add provided Guava dependency to `network-yarn` module.
### Why are the changes needed?
In Spark 3.1 and earlier the `network-yarn` module implicitly relied on Guava from the `hadoop-client` dependency. This was changed by SPARK-33212, where we moved to the shaded Hadoop client which no longer exposes the transitive Guava dependency. It stayed fine for a while since we were not using `createDependencyReducedPom`, so the module picked up the transitive dependency from `spark-network-common` instead. However, things started to break after SPARK-36835, where we restored `createDependencyReducedPom`, and now the build is no longer able to locate Guava classes:
```
build/mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 -Pspark-ganglia-lgpl -Pyarn
...
[INFO] Compiling 1 Java source to /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ...
[WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8
[ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32: package com.google.common.annotations does not exist
[ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33: package com.google.common.base does not exist
[ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34: package com.google.common.collect does not exist
[ERROR] [Error] /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118: cannot find symbol
symbol: class VisibleForTesting
location: class org.apache.spark.network.yarn.YarnShuffleService
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested with the above `mvn` command and it's now passing.
Closes#34125 from sunchao/SPARK-36873.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Update `dev/test-dependencies.sh` so that it now covers all released Spark artifacts such as `hadoop-cloud`.
### Why are the changes needed?
Currently `dev/test-dependencies.sh` doesn't cover all released Spark modules. Therefore, if there is any dependency change in those modules (e.g., `hadoop-cloud`), we won't be able to detect it early.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
Closes#34119 from sunchao/SPARK-36863-dependency.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch proposes to remove the broadcast variable in `InSubqueryExec`, which is used in DPP.
### Why are the changes needed?
Currently we include a broadcast variable in `InSubqueryExec`. We use it to hold the filtering-side query result of DPP. It looks weird because we don't use the result in executors but only need it in the driver during query planning. We already hold the original result, so basically we hold two copies of the query result at the moment.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing test.
Closes#34051 from viirya/dpp-no-broadcast.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently, JAVA_HOME may be improperly set to the path "/usr"; with this change, JAVA_HOME is fetched from the command "/usr/libexec/java_home" on macOS.
### Why are the changes needed?
Command "./build/mvn xxx" will be stuck on MacOS 11.4, because JAVA_HOME is set to path "/usr" improperly.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
`build/mvn -DskipTests package` passed on `macOS 11.5.2`.
Closes#34111 from copperybean/work.
Authored-by: copperybean <copperybean.zhang@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
For a LEFT SEMI or LEFT ANTI hash equi-join without an extra join condition, we only need to keep one row per unique join key inside the hash table (`HashedRelation`) when building it. This helps reduce the size of the join's hash table.
This PR adds the optimization in `UnsafeHashedRelation` for broadcast hash join and shuffled hash join. The optimization for `LongHashedRelation` will be added later, because it needs more changes to the underlying hash table data structure `LongToUnsafeRowMap` to check whether a key already exists in the hash table.
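A sketch of the kind of query that benefits (run in the PySpark shell where `spark` is predefined; table and column names are made up): a LEFT SEMI join with no extra condition, where only the existence of a matching key matters, so one row per key in the build-side hash table is enough.
```python
orders = spark.createDataFrame([(1, "a"), (2, "b"), (2, "c")], ["cust_id", "item"])
customers = spark.createDataFrame([(1,), (2,), (3,)], ["cust_id"])

# Only key existence matters, so duplicate build-side keys (cust_id = 2)
# can be deduplicated in the hash table.
customers.join(orders, "cust_id", "left_semi").show()
```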
### Why are the changes needed?
Help reduce the hash table size of join for LEFT SEMI and LEFT ANTI.
This can increase the chance of broadcast join of these queries, and reduce OOM possibility of shuffled hash join.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit test in `JoinSuite.scala`.
To help ease review (a lot of files changed due to unit test plan changes), the files below contain the real code changes:
* BroadcastExchangeExec.scala
* BroadcastHashJoinExec.scala
* HashJoin.scala
* HashedRelation.scala
* ShuffledHashJoinExec.scala
* JoinSuite.scala
* DebuggingSuite.scala
All other files change are generated with followed commands:
```
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *SQLQueryTestSuite"
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilitySuite"
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilityWithStatsSuite"
```
Closes#34034 from c21/join-deduplicate.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This reverts commit aaa0d2a66b.
### Why are the changes needed?
This approach has 2 disadvantages:
1. It needs to disable `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly`.
2. The filtering side will be evaluated 2 times. For example: https://github.com/apache/spark/pull/29726#issuecomment-780266596
Instead, we can use bloom filter join pruning in the future.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#34116 from wangyum/revert-SPARK-32855.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
### What changes were proposed in this pull request?
A follow up of https://github.com/apache/spark/pull/34054. Three things changed:
1. Add a test for extendable class `ColumnarBatch`
2. Make `ColumnarBatchRow` public.
3. Change private fields to protected fields.
### Why are the changes needed?
A follow up of https://github.com/apache/spark/pull/34054. The class `ColumnarBatch` needs to be extendable to support better vectorized reading in multiple data sources. For example, Iceberg needs to filter out deleted rows in a batch before Spark consumes it, to support row-level deletes (apache/iceberg#3141) in vectorized reads.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
A new test is added
Closes#34087 from flyrain/SPARK-36821.
Authored-by: Yufei Gu <yufei_gu@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
### What changes were proposed in this pull request?
Migrate `ShowCurrentNamespaceStatement` to v2 command framework
### Why are the changes needed?
Migrate to the standard V2 framework
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes#34104 from huaxingao/namesp.
Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>