ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Pavithraramachandran	d7d5bdfd79	[SPARK-32103][CORE] Support IPv6 host/port in core module ### What changes were proposed in this pull request? In an IPv6 scenario, the current logic for splitting the hostname and port is not correct. ### Why are the changes needed? To support the IPv6 deployment scenario. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT and IPv6 Spark deployment with YARN. Closes #28931 from PavithraRamachandran/ipv6_issue. Authored-by: Pavithraramachandran <pavi.rams@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-10 13:55:20 -07:00
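The host/port splitting problem in SPARK-32103 can be illustrated with a minimal Python sketch. This is not the code from the patch: the function name `split_host_port` is hypothetical, and the sketch assumes IPv6 literals are written in the bracketed form `[addr]:port`.

```python
from typing import Tuple


def split_host_port(address: str) -> Tuple[str, int]:
    """Split 'host:port', accepting bracketed IPv6 literals such as '[::1]:7077'."""
    if address.startswith("["):
        # Bracketed IPv6 literal: the host ends at the closing bracket.
        end = address.find("]")
        if end == -1 or address[end + 1:end + 2] != ":":
            raise ValueError(f"malformed IPv6 address: {address!r}")
        return address[1:end], int(address[end + 2:])
    # IPv4 address or hostname: split on the last colon.
    host, sep, port = address.rpartition(":")
    if not sep:
        raise ValueError(f"missing port in {address!r}")
    return host, int(port)


# Naive splitting on the first colon cuts an IPv6 literal in the middle.
naive_host = "[2001:db8::1]:7077".split(":")[0]
ipv6_host, ipv6_port = split_host_port("[2001:db8::1]:7077")
```

Splitting on the first colon is only safe for IPv4 addresses and hostnames, which is why the bracketed case is handled separately.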
TJX2014	500877e785	[SPARK-32133][SQL] Forbid time field steps for date start/end in Sequence ### What changes were proposed in this pull request? 1. Add a time field steps check for date start/end in Sequence at `org.apache.spark.sql.catalyst.expressions.Sequence.TemporalSequenceImpl` 2. Add a UT: `SPARK-32133: Sequence step must be a day interval if start and end values are dates` at `org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite` ### Why are the changes needed? Sequence time field steps for date start/end look strange in Spark, as follows: ``` scala> sql("select explode(sequence(cast('2011-03-01' as date), cast('2011-03-02' as date), interval 1 hour))").head(3) res0: Array[org.apache.spark.sql.Row] = _Array([2011-03-01], [2011-03-01], [2011-03-01])_ <- strange result. scala> sql("select explode(sequence(cast('2011-03-01' as date), cast('2011-03-02' as date), interval 1 day))").head(3) res1: Array[org.apache.spark.sql.Row] = Array([2011-03-01], [2011-03-02]) ``` While this behavior in Presto makes sense: ``` presto> select sequence(date('2011-03-01'),date('2011-03-02'),interval '1' hour); Query 20200624_122744_00002_pehix failed: sequence step must be a day interval if start and end values are dates presto> select sequence(date('2011-03-01'),date('2011-03-02'),interval '1' day); _col0 [2011-03-01, 2011-03-02] ``` ### Does this PR introduce _any_ user-facing change? Yes, after this patch, users will be informed that `sequence step must be a day interval if start and end values are dates` when using time field steps for date start/end in Sequence. ### How was this patch tested? Unit test. Closes #28926 from TJX2014/master-SPARK-31982-sequence-cross-dst-follow-presto. Authored-by: TJX2014 <xiaoxingstack@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-10 11:06:52 -07:00
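The rule introduced by SPARK-32133 can be sketched in Python as follows. This is an illustration only, not Spark's `TemporalSequenceImpl`: the function name `date_sequence` is hypothetical, the step is modelled as a `datetime.timedelta`, and month-based intervals are ignored.

```python
from datetime import date, timedelta
from typing import List


def date_sequence(start: date, stop: date, step: timedelta) -> List[date]:
    """Generate dates from start to stop (inclusive), rejecting sub-day steps."""
    # A step with an hour/minute/second part makes no sense between two dates.
    if step.seconds or step.microseconds:
        raise ValueError(
            "sequence step must be a day interval if start and end values are dates"
        )
    if step.days == 0:
        raise ValueError("sequence step must not be zero")
    result = []
    current = start
    if step.days > 0:
        while current <= stop:
            result.append(current)
            current += step
    else:
        while current >= stop:
            result.append(current)
            current += step
    return result


days = date_sequence(date(2011, 3, 1), date(2011, 3, 2), timedelta(days=1))
```

Calling `date_sequence` with `timedelta(hours=1)` raises `ValueError` instead of returning repeated dates.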
angerszhu	560fe1f54c	[SPARK-32220][SQL] SHUFFLE_REPLICATE_NL Hint should not change Non-Cartesian Product join result ### What changes were proposed in this pull request? In the current join hint strategies, if we use the SHUFFLE_REPLICATE_NL hint, it will directly convert the join to a Cartesian Product Join and lose the join condition, making the result incorrect. For Example: ``` spark-sql> select * from test4 order by a asc; 1 2 Time taken: 1.063 seconds, Fetched 4 row(s)20/07/08 14:11:25 INFO SparkSQLCLIDriver: Time taken: 1.063 seconds, Fetched 4 row(s) spark-sql>select * from test5 order by a asc 1 2 2 2 Time taken: 1.18 seconds, Fetched 24 row(s)20/07/08 14:13:59 INFO SparkSQLCLIDriver: Time taken: 1.18 seconds, Fetched 24 row(s)spar spark-sql>select /*+ shuffle_replicate_nl(test4) */ * from test4 join test5 where test4.a = test5.a order by test4.a asc ; 1 2 1 2 1 2 2 2 Time taken: 0.351 seconds, Fetched 2 row(s) 20/07/08 14:18:16 INFO SparkSQLCLIDriver: Time taken: 0.351 seconds, Fetched 2 row(s) ``` ### Why are the changes needed? Fix wrong data result ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? Added UT Closes #29035 from AngersZhuuuu/SPARK-32220. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-10 09:03:16 -07:00
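The difference between a plain Cartesian product and a nested-loop join that keeps its condition (the bug class behind SPARK-32220) can be sketched in Python. The table contents and function names below are hypothetical and are not taken from the commit's example output.

```python
from itertools import product


def cartesian_product(left, right):
    """Pair every left row with every right row; no condition is applied."""
    return list(product(left, right))


def nested_loop_join(left, right, condition):
    """Pair rows like a Cartesian product, but keep only pairs matching the condition."""
    return [(l, r) for l, r in product(left, right) if condition(l, r)]


# Hypothetical single-column tables.
left_rows = [1, 2]
right_rows = [1, 2, 2, 2]

dropped_condition = cartesian_product(left_rows, right_rows)
kept_condition = nested_loop_join(left_rows, right_rows, lambda l, r: l == r)
```

If a planner swaps the second form for the first, the query returns every pairing instead of only the matching ones, which is a wrong result rather than a performance problem.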
yi.wu	578b90cdec	[SPARK-32091][CORE] Ignore timeout error when remove blocks on the lost executor ### What changes were proposed in this pull request? This PR adds a check to see whether the executor is lost (by asking the `CoarseGrainedSchedulerBackend`) after a timeout error is raised in `BlockManagerMasterEndpoint` while removing blocks (e.g. RDD, broadcast, shuffle). If the executor is lost, we ignore the error. Otherwise, we throw the error. ### Why are the changes needed? When removing blocks (e.g. RDD, broadcast, shuffle), `BlockManagerMasterEndpoint` makes RPC calls to each known `BlockManagerSlaveEndpoint` to remove the specific blocks. The RPC call can sometimes end in a timeout when the executor has been lost but the `BlockManagerMasterEndpoint` is only notified after the removal call has already happened. The timeout error could therefore fail the whole job. In this case, we can simply ignore the error, since the blocks on the lost executor can be considered removed already. ### Does this PR introduce _any_ user-facing change? Yes. Users who hit this issue will have the job execute successfully instead of throwing the exception. ### How was this patch tested? Added unit tests. Closes #28924 from Ngone51/ignore-timeout-error-for-inactive-executor. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-10 13:36:29 +00:00
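The "ignore the timeout only if the executor is already lost" policy from SPARK-32091 can be sketched in Python. This is a simplified model, not Spark's RPC code: `remove_blocks`, `remove_on`, and `is_executor_lost` are hypothetical names, and Python's built-in `TimeoutError` stands in for the RPC timeout.

```python
from typing import Callable, Iterable, List


def remove_blocks(
    executors: Iterable[str],
    remove_on: Callable[[str], None],
    is_executor_lost: Callable[[str], bool],
) -> List[str]:
    """Ask each executor to remove blocks; return executors whose timeout was ignored."""
    ignored = []
    for executor_id in executors:
        try:
            remove_on(executor_id)
        except TimeoutError:
            if is_executor_lost(executor_id):
                # Blocks on a lost executor are gone anyway, so the error is harmless.
                ignored.append(executor_id)
                continue
            raise  # A live executor timed out: surface the error.
    return ignored


def fake_remove(executor_id: str) -> None:
    """Hypothetical RPC stub: the call to 'exec-2' always times out."""
    if executor_id == "exec-2":
        raise TimeoutError(f"remove request to {executor_id} timed out")


lost_executors = {"exec-2"}
ignored = remove_blocks(["exec-1", "exec-2"], fake_remove, lost_executors.__contains__)
```

With an empty `lost_executors` set, the same call re-raises the `TimeoutError`, matching the "otherwise, throw the error" branch.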
Shixiong Zhu	c8779d9dfc	[SPARK-32256][SQL][TEST-HADOOP2.7] Force to initialize Hadoop VersionInfo in HiveExternalCatalog ### What changes were proposed in this pull request? Force to initialize Hadoop VersionInfo in HiveExternalCatalog to make sure Hive can get the Hadoop version when using the isolated classloader. ### Why are the changes needed? This is a regression in Spark 3.0.0 because we switched the default Hive execution version from 1.2.1 to 2.3.7. Spark allows the user to set `spark.sql.hive.metastore.jars` to specify jars to access Hive Metastore. These jars are loaded by the isolated classloader. Because we also share Hadoop classes with the isolated classloader, the user doesn't need to add Hadoop jars to `spark.sql.hive.metastore.jars`, which means when we are using the isolated classloader, hadoop-common jar is not available in this case. If Hadoop VersionInfo is not initialized before we switch to the isolated classloader, and we try to initialize it using the isolated classloader (the current thread context classloader), it will fail and report `Unknown` which causes Hive to throw the following exception: ``` java.lang.RuntimeException: Illegal Hadoop Version: Unknown (expected A.B.* format) at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:147) at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:122) at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:88) at org.apache.hadoop.hive.metastore.ObjectStore.getDataSourceProps(ObjectStore.java:377) at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:268) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:76) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136) at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:58) at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:67) at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:517) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:482) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:544) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:370) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:78) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:84) at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:219) at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:67) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1548) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104) at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3080) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3108) at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3349) at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:217) at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:204) at 
org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:331) at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:292) at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:262) at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:247) at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:543) at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:511) at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:175) at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:128) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:301) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:431) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:324) at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:72) at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:71) at org.apache.spark.sql.hive.client.HadoopVersionInfoSuite.$anonfun$new$1(HadoopVersionInfoSuite.scala:63) at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) ``` Technically, This is indeed an issue of Hadoop VersionInfo which has been fixed: https://issues.apache.org/jira/browse/HADOOP-14067. But since we are still supporting old Hadoop versions, we should fix it. Why this issue starts to happen in Spark 3.0.0? In Spark 2.4.x, we use Hive 1.2.1 by default. It will trigger `VersionInfo` initialization in the static codes of `Hive` class. 
This will happen when we load `HiveClientImpl` class because `HiveClientImpl.client` method refers to `Hive` class. At this moment, the thread context classloader is not using the isolated classloader, so it can access hadoop-common jar on the classpath and initialize it correctly. In Spark 3.0.0, we use Hive 2.3.7. The static codes of `Hive` class are not accessing `VersionInfo` because of the change in https://issues.apache.org/jira/browse/HIVE-11657. Instead, accessing `VersionInfo` happens when creating a `Hive` object (See the above stack trace). This happens here https://github.com/apache/spark/blob/v3.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L260. But we switch to the isolated classloader before calling `HiveClientImpl.client` (See https://github.com/apache/spark/blob/v3.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L283). This is exactly what I mentioned above: `If Hadoop VersionInfo is not initialized before we switch to the isolated classloader, and we try to initialize it using the isolated classloader (the current thread context classloader), it will fail` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The new regression test added in this PR. Note that the new UT doesn't fail with the default profiles (-Phadoop-3.2) because it's already fixed at Hadoop 3.1. Please use the following to verify this. ``` build/sbt -Phadoop-2.7 -Phive "hive/testOnly *.HadoopVersionInfoSuite" ``` Closes #29059 from zsxwing/SPARK-32256. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-10 21:14:29 +09:00
Jungtaek Lim (HeartSaVioR)	e6e43cb2f9	[SPARK-32242][SQL] CliSuite flakiness fix via differentiating cli driver bootup timeout and query execution timeout ### What changes were proposed in this pull request? This patch tries to mitigate the flakiness of CliSuite via the changes below: 1. differentiate cli driver boot-up timeout (2 mins) and query execution timeout (parameter). Cli driver boot-up is determined by the master and app ID message. Given spark-sql doesn't print the message if the `-e` option is specified, the patch simply adds 2 mins to the timeout for that case to cover the boot-up timeout. 2. don't fail the test even if spark-sql doesn't gracefully shut down in 1 min. 3. extend timeout for `path command` test in CliSuite ### Why are the changes needed? It took around 40 seconds for the boot-up message (master: ... Application Id: ...) to be printed to stderr, while the overall timeout is 1 minute in many tests. In this case the actual timeout for query execution is just 20 seconds, which may not be enough. Some of the tests also failed with `org.scalatest.exceptions.TestFailedException: spark-sql did not exit gracefully`, which I don't think should fail the test. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Verified with multiple triggers of Jenkins builds Closes #29036 from HeartSaVioR/clisuite-flakiness-fix. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-07-10 13:12:25 +09:00
Kent Yao	4609f1fdab	[SPARK-32207][SQL] Support 'F'-suffixed Float Literals ### What changes were proposed in this pull request? In this PR, I propose that we support 'f'-suffixed float literals, e.g. `select 1.1f` ### Why are the changes needed? This is a very common feature across platforms; checked with pg, presto, hive, MySQL... ### Does this PR introduce _any_ user-facing change? Yes, `select 1.1f` returns the float value 1.1 instead of throwing AnalysisException `Can't extract value from 1: need struct type but got int;` ### How was this patch tested? Added unit tests. Closes #29022 from yaooqinn/SPARK-32207. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-09 19:45:16 -07:00
HyukjinKwon	01e9dd9050	[SPARK-20680][SQL][FOLLOW-UP] Revert NullType.simpleString from 'unknown' to 'null' ### What changes were proposed in this pull request? This PR proposes to partially revert the simple string in `NullType` at https://github.com/apache/spark/pull/28833: `NullType.simpleString` back from `unknown` to `null`. ### Why are the changes needed? - Technically speaking, it's orthogonal to the issue itself, SPARK-20680. - It needs some more discussion, see https://github.com/apache/spark/pull/28833#issuecomment-655277714 ### Does this PR introduce _any_ user-facing change? It reverts the user-facing changes at https://github.com/apache/spark/pull/28833. The simple string of `NullType` is back to `null`. ### How was this patch tested? I just logically reverted. Jenkins should test it out. Closes #29041 from HyukjinKwon/SPARK-20680. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-09 19:44:08 -07:00
Dilip Biswal	18aae21d96	[SPARK-31875][SQL] Provide a option to disable user supplied Hints ### What changes were proposed in this pull request? Introduce a new SQL config `spark.sql.optimizer.ignoreHints`. When this is set to true, the application of hints is disabled. This is similar to Oracle's OPTIMIZER_IGNORE_HINTS. It can be helpful for studying the performance difference between applying hints and not applying them. ### Why are the changes needed? It can be helpful for studying the performance difference between applying hints and not applying them. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New tests added in ResolveHintsSuite. Closes #28683 from dilipbiswal/disable_hint. Authored-by: Dilip Biswal <dkbiswal@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-09 18:27:07 -07:00
Jungtaek Lim (HeartSaVioR)	ac6406e757	[SPARK-31831][SQL] HiveSessionImplSuite flakiness fix via mocking instances earlier than initializing HiveSessionImpl ### What changes were proposed in this pull request? This patch changes the HiveSessionImplSuite to mock instances "before" initializing HiveSessionImpl, to avoid a possible classloader issue. ### Why are the changes needed? The failures of HiveSessionImplSuite always come from a classloader issue. While I don't have a clear idea of what is happening, no part of the suite could be dealing with the classloader except initializing HiveSessionImpl. We can move the mock initializations earlier than initializing HiveSessionImpl so that it can avoid a possible classloader issue. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Verified with multiple triggers of Jenkins builds Closes #29039 from HeartSaVioR/hive-session-impl-suite-flakiness-fix. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-09 14:32:20 -07:00
moovlin	9331a5c44b	[SPARK-32035][DOCS][EXAMPLES] Fixed typos involving AWS Access, Secret, & Sessions tokens ### What changes were proposed in this pull request? I resolved some of the inconsistencies of AWS env variables. They're fixed in the documentation as well as in the examples. I grep-ed through the repo to try & find any more instances but nothing popped up. ### Why are the changes needed? As previously mentioned, there is a JIRA request, SPARK-32035, which encapsulates all the issues. But, in summary, the naming of items was inconsistent. ### Does this PR introduce _any_ user-facing change? Correct names: AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN These are the same that AWS uses in their libraries. However, looking through the Spark documentation and comments, I see that these are not denoted correctly across the board: docs/cloud-integration.md 106:1. `spark-submit` reads the `AWS_ACCESS_KEY`, `AWS_SECRET_KEY` <-- both different 107:and `AWS_SESSION_TOKEN` environment variables and sets the associated authentication options docs/streaming-kinesis-integration.md 232:- Set up the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_KEY` with your AWS credentials. 
<-- secret key different external/kinesis-asl/src/main/python/examples/streaming/kinesis_wordcount_asl.py 34: $ export AWS_ACCESS_KEY_ID=<your-access-key> 35: $ export AWS_SECRET_KEY=<your-secret-key> <-- different 48: Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY <-- secret key different core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala 438: val keyId = System.getenv("AWS_ACCESS_KEY_ID") 439: val accessKey = System.getenv("AWS_SECRET_ACCESS_KEY") 448: val sessionToken = System.getenv("AWS_SESSION_TOKEN") external/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala 53: * $ export AWS_ACCESS_KEY_ID=<your-access-key> 54: * $ export AWS_SECRET_KEY=<your-secret-key> <-- different 65: * Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY <-- secret key different external/kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java 59: * $ export AWS_ACCESS_KEY_ID=[your-access-key] 60: * $ export AWS_SECRET_KEY=<your-secret-key> <-- different 71: * Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY <-- secret key different These were all fixed to match names listed under the "correct names" heading. ### How was this patch tested? I built the documentation using jekyll and verified that the changes were present & accurate. Closes #29058 from Moovlin/SPARK-32035. Authored-by: moovlin <richjoerger@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-09 10:35:21 -07:00
xiepengjie	523e238d2a	[SPARK-32192][SQL] Print column name when throws ClassCastException ### What changes were proposed in this pull request? When somebody changes the type of a partition's field, Spark throws ClassCastException. For example, we have a table like this: ``` drop table if exists cast_exception_test; create table cast_exception_test(c1 int, c2 string) partitioned by (dt string) stored as orc; insert into table cast_exception_test partition(dt='2020-04-08') values('1', 'jeff_1'); ``` If you change the field's type in Hive and then query the old partition, Spark throws ClassCastException, but Hive does not: ``` -- change the field's type using hive alter table cast_exception_test change column c1 c1 string; -- hive correct, but spark throws ClassCastException select * from cast_exception_test where dt='2020-04-08'; ``` ### Why are the changes needed? When the table has many fields, we don't know which field has been changed. Logging the field name with this exception makes troubleshooting much easier. ### Does this PR introduce _any_ user-facing change? When the ClassCastException is caused by a changed field type, you can find which field has the problem in the executor logs: ``` 20/04/09 17:22:05 ERROR hive.HadoopTableReader: Exception thrown in field <c1> ``` ### How was this patch tested? First, prepare the test data; the table is partitioned and stored as ORC: ``` drop table if exists cast_exception_test; create table cast_exception_test(c1 int, c2 string) partitioned by (dt string) stored as orc; insert into table cast_exception_test partition(dt='2020-04-08') values('1', 'jeff_1'); ``` Then, change the field's type in Hive. ``` alter table cast_exception_test change column c1 c1 string; ``` Now the metadata of the table has been modified, but the partition's metadata, which is stored in the ORC file or the Hive metastore's MySQL database, is still old. 
So the query throws ClassCastException in Spark, because Spark uses the table's metadata, which differs from the ORC file's metadata, whereas Hive uses the partition's metadata, which matches the ORC file's metadata. If you query the old partition, Spark throws ClassCastException, but Hive does not: ``` select * from cast_exception_test where dt='2020-04-08'; ``` Closes #29010 from StefanXiepj/SPARK-32192. Authored-by: xiepengjie <xiepengjie@didiglobal.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-07-09 09:33:54 -05:00
Erik Erlandson	1cb5bfc47a	[SPARK-32159][SQL] Fix integration between Aggregator[Array[_], _, _] and UnresolvedMapObjects Context: The fix for SPARK-27296 introduced by #25024 allows `Aggregator` objects to appear in queries. This works fine for aggregators with atomic input types, e.g. `Aggregator[Double, _, _]`. However it can cause a null pointer exception if the input type is `Array[_]`. This was historically considered an ignorable case for serialization of `UnresolvedMapObjects`, but the new ScalaAggregator class causes these expressions to be serialized over to executors because the resolve-and-bind is being deferred. ### What changes were proposed in this pull request? A new rule `ResolveEncodersInScalaAgg` that performs the resolution of the expressions contained in the encoders so that properly resolved expressions are serialized over to executors. ### Why are the changes needed? Applying an aggregator of the form `Aggregator[Array[_], _, _]` using `functions.udaf()` currently causes a null pointer error in Catalyst. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? A unit test has been added that does aggregation with array types for input, buffer, and output. I have done additional testing with my own custom aggregators in the spark REPL. Closes #28983 from erikerlandson/fix-spark-32159. Authored-by: Erik Erlandson <eerlands@redhat.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-09 08:42:20 +00:00
Dongjoon Hyun	c5bd0730a2	[SPARK-32231][R][INFRA] Use Hadoop 3.2 winutils in AppVeyor build ### What changes were proposed in this pull request? This PR proposes to use Hadoop 3 winutils to make AppVeyor builds pass. Currently the build is failing, as shown here: https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/33976604 ### Why are the changes needed? To recover the build in AppVeyor. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? AppVeyor build will test it out. Closes #29042 from HyukjinKwon/SPARK-32231. Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-09 17:18:39 +09:00
Jungtaek Lim (HeartSaVioR)	526cb2d1ba	[SPARK-32148][SS] Fix stream-stream join issue on missing to copy reused unsafe row ### What changes were proposed in this pull request? This patch fixes the odd join result produced by stream-stream join for state store format V2. There are some spots on the V2 path which leverage UnsafeProjection. As the result row is reused, the row should be copied to avoid its value changing during reading (or the caller must make sure it isn't affected by such behavior), but `SymmetricHashJoinStateManager.removeByValueCondition` violates this. This patch makes `KeyWithIndexToValueRowConverterV2.convertValue` copy the row by itself so that callers don't need to take care of it. This patch doesn't change the behavior of `KeyWithIndexToValueRowConverterV2.convertToValueRow` to avoid double-copying, as the caller is expected to store the row, for which the implementation of state store will call `copy()`. This patch adds such behavior into each method doc in `KeyWithIndexToValueRowConverter`, so that further contributors can read through and make sure a change or new addition doesn't break the contract. ### Why are the changes needed? Stream-stream join with state store format V2 (newly added in Spark 3.0.0) has a serious correctness bug which brings nondeterministic results. ### Does this PR introduce _any_ user-facing change? Yes, some Spark 3.0.0 users using stream-stream join from a new checkpoint (as the bug exists only on the V2 format path) may encounter wrong join results. This patch will fix it. ### How was this patch tested? Reported case is converted to the new UT, and confirmed UT passed. All UTs in StreamingInnerJoinSuite and StreamingOuterJoinSuite passed as well Closes #28975 from HeartSaVioR/SPARK-32148. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-09 07:37:06 +00:00
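The hazard of a reused result row, which SPARK-32148 addresses by copying, can be shown with a small Python analogy. The class `ReusedRowProjection` is hypothetical and only mimics the "one shared output buffer" behavior described above; it is not Spark's `UnsafeProjection`.

```python
class ReusedRowProjection:
    """Mimics a projection that writes every result into one shared buffer."""

    def __init__(self):
        self._buffer = [None]

    def project(self, value):
        # Overwrite the shared buffer and hand out the same list object each time.
        self._buffer[0] = value
        return self._buffer


projection = ReusedRowProjection()

# Holding on to the returned rows without copying: all entries alias one buffer.
without_copy = [projection.project(v) for v in (1, 2, 3)]

# Copying each row before storing it preserves the value seen at that moment.
with_copy = [list(projection.project(v)) for v in (1, 2, 3)]
```

`without_copy` ends up as three references to the same buffer holding the last value, while `with_copy` keeps the three distinct values.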
GuoPhilipse	09cc6c51ea	[SPARK-32193][SQL][DOCS] Update regexp usage in SQL docs ### What changes were proposed in this pull request? Update REGEXP usage and examples in sql-ref-syntax-qry-select-like.md ### Why are the changes needed? Make the usage of REGEXP known to more users ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No tests Closes #29009 from GuoPhilipse/update-migrate-guide. Lead-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com> Co-authored-by: GuoPhilipse <guofei_ok@126.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-07-09 16:14:33 +09:00
Wenchen Fan	8c5bee599d	[SPARK-28067][SPARK-32018] Fix decimal overflow issues ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/27627 to fix the remaining issues. There are 2 issues fixed in this PR: 1. `UnsafeRow.setDecimal` can set an overflowed decimal and causes an error when reading it. The expected behavior is to return null. 2. The update/merge expression for decimal type in `Sum` is wrong. We shouldn't turn the `sum` value back to 0 after it becomes null due to overflow. This issue was hidden because: 2.1 for hash aggregate, the buffer is unsafe row. Due to the first bug, we fail when overflow happens, so there is no chance to mistakenly turn null back to 0. 2.2 for sort-based aggregate, the buffer is generic row. The decimal can overflow (the Decimal class has unlimited precision) and we don't have the null problem. If we only fix the first bug, then the second bug is exposed and test fails. If we only fix the second bug, there is no way to test it. This PR fixes these 2 bugs together. ### Why are the changes needed? Fix issues during decimal sum when overflow happens ### Does this PR introduce _any_ user-facing change? Yes. Now decimal sum can return null correctly for overflow under non-ansi mode. ### How was this patch tested? new test and updated test Closes #29026 from cloud-fan/decimal. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-09 15:56:40 +09:00
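The second bug described in SPARK-28067/SPARK-32018 (a sum that becomes null on overflow must not turn back into 0) can be sketched in Python. This is a simplified model, not Spark's `Sum` expression: overflow is modelled as exceeding a hypothetical bound `max_abs` rather than a decimal precision, and `None` stands in for SQL null.

```python
from decimal import Decimal
from typing import Iterable, Optional


def buggy_sum(values: Iterable[Decimal], max_abs: Decimal) -> Optional[Decimal]:
    """Faulty update rule: a null (overflowed) buffer is treated as 0 again."""
    total: Optional[Decimal] = Decimal(0)
    for value in values:
        base = total if total is not None else Decimal(0)  # bug: null becomes 0
        candidate = base + value
        total = candidate if abs(candidate) <= max_abs else None
    return total


def fixed_sum(values: Iterable[Decimal], max_abs: Decimal) -> Optional[Decimal]:
    """Once the running sum overflows, the result stays null."""
    total = Decimal(0)
    for value in values:
        total += value
        if abs(total) > max_abs:
            return None
    return total


inputs = [Decimal("90"), Decimal("20"), Decimal("5")]
limit = Decimal("99")
buggy_result = buggy_sum(inputs, limit)
fixed_result = fixed_sum(inputs, limit)
```

The buggy version silently reports the sum of only the rows that came after the overflow, which is why the two fixes had to land together.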
Takuya UESHIN	cfecc2030d	[SPARK-32160][CORE][PYSPARK] Disallow to create SparkContext in executors ### What changes were proposed in this pull request? This PR proposes to disallow creating `SparkContext` in executors, e.g., in UDFs. ### Why are the changes needed? Currently executors can create SparkContext, but shouldn't be able to create it. ```scala sc.range(0, 1).foreach { _ => new SparkContext(new SparkConf().setAppName("test").setMaster("local")) } ``` ### Does this PR introduce _any_ user-facing change? Yes, users won't be able to create `SparkContext` in executors. ### How was this patch tested? Added tests. Closes #28986 from ueshin/issues/SPARK-32160/disallow_spark_context_in_executors. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-09 15:51:56 +09:00
Jungtaek Lim (HeartSaVioR)	161cf2a126	[SPARK-32024][WEBUI][FOLLOWUP] Quick fix on test failure on missing when statements ### What changes were proposed in this pull request? This patch fixes the test failure due to the missing when statements for destination path. Note that it didn't fail on master branch, because `245aee9` got rid of size call in destination path, but still good to not depend on `245aee9`. ### Why are the changes needed? The build against branch-3.0 / branch-2.4 starts to fail after merging SPARK-32024 (#28859) and this patch will fix it. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Ran modified UT against master / branch-3.0 / branch-2.4. Closes #29046 from HeartSaVioR/QUICKFIX-SPARK-32024. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-09 15:26:38 +09:00
Warren Zhu	d1d16d14bc	[SPARK-31723][CORE][TEST] Reenable one test case in HistoryServerSuite ### What changes were proposed in this pull request? Enable test("static relative links are prefixed with uiRoot (spark.ui.proxyBase)") ### Why are the changes needed? In Jira, the failed test is another one test("ajax rendered relative links are prefixed with uiRoot (spark.ui.proxyBase)"). This test has been fixed in `6a895d0` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Fix UT Closes #28970 from warrenzhu25/31723. Authored-by: Warren Zhu <zhonzh@microsoft.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-08 16:45:36 -07:00
Ryan Blue	3bb1ac597a	[SPARK-32168][SQL] Fix hidden partitioning correctness bug in SQL overwrite ### What changes were proposed in this pull request? When converting an `INSERT OVERWRITE` query to a v2 overwrite plan, Spark attempts to detect when a dynamic overwrite and a static overwrite will produce the same result so it can use the static overwrite. Spark incorrectly detects when dynamic and static overwrites are equivalent when there are hidden partitions, such as `days(ts)`. This updates the analyzer rule `ResolveInsertInto` to always use a dynamic overwrite when the mode is dynamic, and static when the mode is static. This avoids the problem by not trying to determine whether the two plans are equivalent and always using the one that corresponds to the partition overwrite mode. ### Why are the changes needed? This is a correctness bug. If a table has hidden partitions, all of the values for those partitions are dropped instead of dynamically overwriting changed partitions. This only affects SQL commands (not `DataFrameWriter`) writing to tables that have hidden partitions. It is also only a problem when the partition overwrite mode is dynamic. ### Does this PR introduce _any_ user-facing change? Yes, it fixes the correctness bug detailed above. ### How was this patch tested? * This updates the in-memory table to support a hidden partition transform, `days`, and adds a test case to `DataSourceV2SQLSuite` in which the table uses this hidden partition function. This test fails without the fix to `ResolveInsertInto`. * This updates the test case `InsertInto: overwrite - multiple static partitions - dynamic mode` in `InsertIntoTests`. The result of the SQL command is unchanged, but the SQL command will now use a dynamic overwrite so the test now uses `dynamicOverwriteTest`. Closes #28993 from rdblue/fix-insert-overwrite-v2-conversion. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-08 16:06:40 -07:00
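Why static and dynamic overwrites differ for a table with a hidden partition (SPARK-32168) can be sketched in Python. This is a toy model, not the analyzer rule: tables are dictionaries from partition value to rows, `days_of` is a hypothetical stand-in for the `days(ts)` transform, and the static overwrite is assumed to carry no static partition values, so it replaces the whole table.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Table = Dict[str, List[Row]]


def group_by_partition(rows: List[Row], partition_of: Callable[[Row], str]) -> Table:
    """Group incoming rows by the partition value derived from each row."""
    grouped: Table = {}
    for row in rows:
        grouped.setdefault(partition_of(row), []).append(row)
    return grouped


def static_overwrite(table: Table, rows: List[Row], partition_of) -> Table:
    """With no static partition filter, every existing partition is replaced."""
    return group_by_partition(rows, partition_of)


def dynamic_overwrite(table: Table, rows: List[Row], partition_of) -> Table:
    """Only partitions that receive new rows are replaced; the rest are kept."""
    result = dict(table)
    result.update(group_by_partition(rows, partition_of))
    return result


def days_of(row: Row) -> str:
    """Hypothetical hidden partition transform: the date part of an ISO timestamp."""
    return row["ts"][:10]


existing: Table = {
    "2020-07-01": [{"ts": "2020-07-01T08:00:00", "v": "old-a"}],
    "2020-07-02": [{"ts": "2020-07-02T08:00:00", "v": "old-b"}],
}
incoming = [{"ts": "2020-07-02T09:30:00", "v": "new-b"}]

after_static = static_overwrite(existing, incoming, days_of)
after_dynamic = dynamic_overwrite(existing, incoming, days_of)
```

In this model, treating the two plans as equivalent drops the untouched `2020-07-01` partition, which mirrors the data loss described in the commit message.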
Dongjoon Hyun	17997a5796	[SPARK-32233][TESTS] Disable SBT unidoc generation testing in Jenkins ### What changes were proposed in this pull request? This PR aims to disable SBT `unidoc` generation testing in Jenkins environment because it's flaky in Jenkins environment and not used for the official documentation generation. Also, GitHub Action has the correct test coverage for the official documentation generation. - https://github.com/apache/spark/pull/28848#issuecomment-654577911 (amp-jenkins-worker-06) - https://github.com/apache/spark/pull/28926#issuecomment-654316537 (amp-jenkins-worker-06) - https://github.com/apache/spark/pull/28969#issuecomment-654918636 (amp-jenkins-worker-06) - https://github.com/apache/spark/pull/28975#issuecomment-654447955 (amp-jenkins-worker-05) - https://github.com/apache/spark/pull/28986#issuecomment-654416543 (amp-jenkins-worker-05) - https://github.com/apache/spark/pull/28992#issuecomment-654371469 (amp-jenkins-worker-06) - https://github.com/apache/spark/pull/28993#issuecomment-655289237 (amp-jenkins-worker-05) - https://github.com/apache/spark/pull/28999#issuecomment-653976760 (amp-jenkins-worker-04) - https://github.com/apache/spark/pull/29010#issuecomment-655246083 (amp-jenkins-worker-03) - https://github.com/apache/spark/pull/29013#issuecomment-654292483 (amp-jenkins-worker-04) - https://github.com/apache/spark/pull/29016#issuecomment-654495070 (amp-jenkins-worker-05) - https://github.com/apache/spark/pull/29025#issuecomment-654889938 (amp-jenkins-worker-04) - https://github.com/apache/spark/pull/29042#issuecomment-655587989 (amp-jenkins-worker-03) ### Why are the changes needed? Apache Spark `release-build.sh` generates the official document by using the following command. - https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L341 ```bash PRODUCTION=1 RELEASE_VERSION="$SPARK_VERSION" jekyll build ``` And, this is executed by the following `unidoc` command for Scala/Java API doc. 
- https://github.com/apache/spark/blob/master/docs/_plugins/copy_api_dirs.rb#L30 ```ruby system("build/sbt -Pkinesis-asl clean compile unidoc") \|\| raise("Unidoc generation failed") ``` However, the PR builder disabled `Jekyll build` and instead has a different test coverage. ```python # determine if docs were changed and if we're inside the amplab environment # note - the below commented out until all Jenkins workers can get `jekyll` installed # if "DOCS" in changed_modules and test_env == "amplab_jenkins": # build_spark_documentation() ``` ``` Building Unidoc API Documentation ======================================================================== [info] Building Spark unidoc using SBT with these arguments: -Phadoop-3.2 -Phive-2.3 -Pspark-ganglia-lgpl -Pkubernetes -Pmesos -Phadoop-cloud -Phive -Phive-thriftserver -Pkinesis-asl -Pyarn unidoc ``` ### Does this PR introduce _any_ user-facing change? No. (This is used only for testing and not used in the official doc generation.) ### How was this patch tested? Pass the Jenkins without doc generation invocation. Closes #29017 from dongjoon-hyun/SPARK-DOC-GEN. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-08 14:11:18 -07:00
HyukjinKwon	f60b3b7e47	[MINOR][INFRA][R] Show the installed packages in R in a prettier way ### What changes were proposed in this pull request? This PR proposes to fix the AppVeyor configuration to show all installed R packages with name/versions. Before: ``` [1] '1.29' [1] '2.3' [1] '2.3.2' [1] '1.7.3' [1] '3.2.3' [1] '0.17.1' ``` After: ``` Package Version arrow arrow 0.17.1 askpass askpass 1.1 assertthat assertthat 0.2.1 backports backports 1.1.8 base64enc base64enc 0.1-3 bit bit 1.1-15.2 bit64 bit64 0.9-7 ... ``` ### Why are the changes needed? To show the package versions in a prettier way, and don't update the line every time when a package is added. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? AppVeyor CI should test it out. Closes #29038 from HyukjinKwon/minor-appveyor. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-08 07:50:07 -07:00
Zhen Li	8e7fc04637	[SPARK-32024][WEBUI] Update ApplicationStoreInfo.size during HistoryServerDiskManager initializing ### What changes were proposed in this pull request? Update `ApplicationStoreInfo.size` to the real size while `HistoryServerDiskManager` is initializing. ### Why are the changes needed? This PR fixes bug [32024](https://issues.apache.org/jira/browse/SPARK-32024). We found that after a history server restart, the following error would randomly happen: "java.lang.IllegalStateException: Disk usage tracker went negative (now = -*, delta = -)" from `HistoryServerDiskManager`. ![Capture](https://user-images.githubusercontent.com/10524738/85034468-fda4ae80-b136-11ea-9011-f0c3e6508002.JPG) Cause: Reading data from LevelDB can trigger table file compaction, which may also change the size of the LevelDB directory. This size change may not be recorded in LevelDB (`ApplicationStoreInfo` in `listing`). When the service restarts, `currentUsage` is calculated from the real directory size, but the `ApplicationStoreInfo` entries are loaded from LevelDB, so `currentUsage` may be less than the sum of `ApplicationStoreInfo.size`. In the `makeRoom()` function, `ApplicationStoreInfo.size` is used to update usage. `currentUsage` then becomes negative after several rounds of `release()` and `lease()` (`makeRoom()`). Reproduce: we can reproduce this issue in a dev environment by reducing the config values of "spark.history.retainedApplications" and "spark.history.store.maxDiskUsage" to some small values. Here are the steps: 1. Start the history server, load some applications and access some pages (maybe the "stages" page to trigger LevelDB compaction). 2. Restart the history server and refresh the pages. I also added a UT to simulate this case in `HistoryServerDiskManagerSuite`. Benefit: this change would help improve history server reliability. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a unit test and tested it manually. 
Closes #28859 from zhli1142015/update-ApplicationStoreInfo.size-during-disk-manager-initialize. Authored-by: Zhen Li <zhli@microsoft.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-07-08 21:58:45 +09:00
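The accounting drift behind SPARK-32024 can be reproduced with a few lines of arithmetic. This is a simplified model, not the `HistoryServerDiskManager` code: the function `release_all` and the sample sizes are hypothetical, and only the wording of the error message is taken from the commit.

```python
def release_all(current_usage, recorded_sizes):
    # Subtract each recorded application size from the usage counter,
    # roughly what eviction does when it frees space (simplified).
    for app_id, size in recorded_sizes.items():
        current_usage -= size
        if current_usage < 0:
            raise RuntimeError(
                f"Disk usage tracker went negative "
                f"(now = {current_usage}, delta = -{size})"
            )
    return current_usage

# Sizes persisted in the listing before compaction shrank app-1's store.
stale_sizes = {"app-1": 100, "app-2": 50}
# Sizes actually on disk after the restart.
real_sizes = {"app-1": 60, "app-2": 50}

# On restart the counter is seeded from the real directory size.
current_usage = sum(real_sizes.values())

try:
    release_all(current_usage, stale_sizes)
    drifted = False
except RuntimeError as err:
    drifted = True
    print(err)

# Refreshing the recorded sizes at initialization keeps the books consistent.
remaining = release_all(current_usage, real_sizes)
print(remaining)
```

With stale sizes the counter is asked to give back more than it ever held; once the recorded sizes match the directory, the same sequence of releases ends at zero.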
Kousuke Saruta	371b35d2e0	[SPARK-32214][SQL] The type conversion function generated in makeFromJava for "other" type uses a wrong variable ### What changes were proposed in this pull request? This PR fixes an inconsistency in `EvaluatePython.makeFromJava`, which creates a type conversion function for some Java/Scala types. `other` is a type but it should actually pass `obj`: ```scala case other => (obj: Any) => nullSafeConvert(other)(PartialFunction.empty) ``` This does not change the output because it always returns `null` for unsupported datatypes. ### Why are the changes needed? To make the codes coherent, and consistent. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No behaviour change. Closes #29029 from sarutak/fix-makeFromJava. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-08 17:46:25 +09:00
ulysses	65286aec4b	[SPARK-30703][SQL][FOLLOWUP] Update SqlBase.g4 invalid comment ### What changes were proposed in this pull request? Modify the comment of `SqlBase.g4`. ### Why are the changes needed? `docs/sql-keywords.md` has already moved to `docs/sql-ref-ansi-compliance.md#sql-keywords`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No need. Closes #29033 from ulysses-you/SPARK-30703-FOLLOWUP. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-07-08 11:30:47 +09:00
LantaoJin	b5297c43b0	[SPARK-20680][SQL] Spark-sql do not support for creating table with void column datatype ### What changes were proposed in this pull request? This is a new PR to address the closed one, #17953. 1. Support the "void" primitive data type in the `AstBuilder` and point it to `NullType` 2. Forbid creating tables with a VOID/NULL column type ### Why are the changes needed? 1. Spark is incompatible with the Hive void type. When a Hive table schema contains the void type, DESC table throws an exception in Spark. >hive> create table bad as select 1 x, null z from dual; >hive> describe bad; OK x int z void In Spark 2.0.x, reading this view works normally: >spark-sql> describe bad; x int NULL z void NULL Time taken: 4.431 seconds, Fetched 2 row(s) But in the latest Spark version, it fails with SparkException: Cannot recognize hive type string: void >spark-sql> describe bad; 17/05/09 03:12:08 ERROR thriftserver.SparkSQLDriver: Failed in [describe bad] org.apache.spark.SparkException: Cannot recognize hive type string: void Caused by: org.apache.spark.sql.catalyst.parser.ParseException: DataType void() is not supported.(line 1, pos 0) == SQL == void ^^^ ... 61 more org.apache.spark.SparkException: Cannot recognize hive type string: void 2. Hive CTAS statements throw an error when the select clause has a NULL/VOID type column since HIVE-11217. In Spark, creating a table with a VOID/NULL column should throw a readable exception message. This includes - create data source table (using parquet, json, ...) - create hive table (with or without stored as) - CTAS ### Does this PR introduce any user-facing change? No ### How was this patch tested? Add unit tests Closes #28833 from LantaoJin/SPARK-20680_COPY. Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-07 18:58:01 -07:00
Yuanjian Li	365961155a	[SPARK-32124][CORE][FOLLOW-UP] Use the invalid value Int.MinValue to fill the map index when the event logs from the old Spark version ### What changes were proposed in this pull request? Use the invalid value Int.MinValue to fill the map index when the event logs from the old Spark version. ### Why are the changes needed? Follow up PR for #28941. ### Does this PR introduce _any_ user-facing change? When we use the Spark version 3.0 history server reading the event log written by the old Spark version, we use the invalid value -2 to fill the map index. ### How was this patch tested? Existing UT. Closes #28965 from xuanyuanking/follow-up. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-07-08 09:36:06 +09:00
Ali Smesseim	8b0a54e6ff	[SPARK-32057][SQL][TEST-HIVE1.2][TEST-HADOOP2.7] ExecuteStatement: cancel and close should not transiently ERROR ### What changes were proposed in this pull request? #28671 introduced a change where the order in which CANCELED state for SparkExecuteStatementOperation is set was changed. Before setting the state to CANCELED, `cleanup()` was called which kills the jobs, causing an exception to be thrown inside `execute()`. This causes the state to transiently become ERROR before being set to CANCELED. This PR fixes the order. ### Why are the changes needed? Bug: wrong operation state is set. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test in SparkExecuteStatementOperationSuite.scala. Closes #28912 from alismess-db/execute-statement-operation-cleanup-order. Authored-by: Ali Smesseim <ali.smesseim@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-08 09:28:16 +09:00
Max Gekk	1261fac674	[SPARK-31710][SQL][FOLLOWUP] Allow cast numeric to timestamp by default ### What changes were proposed in this pull request? 1. Set the SQL config `spark.sql.legacy.allowCastNumericToTimestamp` to `true` by default 2. Remove the explicit settings of `spark.sql.legacy.allowCastNumericToTimestamp` to `true` in the cast suites. ### Why are the changes needed? To avoid breaking changes in minor versions (in the upcoming Spark 3.1.0) according to the semantic versioning guidelines (https://spark.apache.org/versioning-policy.html) ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? By `CastSuite`. Closes #29012 from MaxGekk/allow-cast-numeric-to-timestamp. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-07 14:09:40 -07:00
Liang-Chi Hsieh	90b9099064	[SPARK-32163][SQL] Nested pruning should work even with cosmetic variations ### What changes were proposed in this pull request? This patch proposes to deal with cosmetic variations when processing nested column extractors in `NestedColumnAliasing`. Currently if cosmetic variations are in the nested column extractors, the query is not optimized. ### Why are the changes needed? If the expressions extracting nested fields have cosmetic variations like qualifier difference, currently nested column pruning cannot work well. For example, two attributes which are semantically the same, are referred in a query, but the nested column extractors of them are treated differently when we deal with nested column pruning. ### Does this PR introduce _any_ user-facing change? Yes, fixing a bug in nested column pruning. ### How was this patch tested? Unit test. Closes #28988 from viirya/SPARK-32163. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-07 11:17:53 -07:00
Gabor Somogyi	eb8eda7d32	[SPARK-32211][SQL] Pin mariadb-plugin-gssapi-server version to fix MariaDBKrbIntegrationSuite ### What changes were proposed in this pull request? `MariaDBKrbIntegrationSuite` fails because the docker image contains MariaDB version `1:10.4.12+maria~bionic` but `1:10.4.13+maria~bionic` came out and `mariadb-plugin-gssapi-server` installation triggered unwanted database upgrade inside the docker image. The main problem is that the docker image scripts are prepared to handle `1:10.4.12+maria~bionic` version and not any future development. ### Why are the changes needed? Failing test. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Executed `MariaDBKrbIntegrationSuite` manually. Closes #29025 from gaborgsomogyi/SPARK-32211. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-07 09:38:08 -07:00
fqaiser94@gmail.com	4bbc343a4c	[SPARK-31317][SQL] Add withField method to Column ### What changes were proposed in this pull request? Added a new `withField` method to the `Column` class. This method should allow users to add or replace a `StructField` in a `StructType` column (with very similar semantics to the `withColumn` method on `Dataset`). ### Why are the changes needed? Often Spark users have to work with deeply nested data e.g. to fix a data quality issue with an existing `StructField`. To do this with the existing Spark APIs, users have to rebuild the entire struct column. For example, let's say you have the following deeply nested data structure which has a data quality issue (`5` is missing): ``` import org.apache.spark.sql._ import org.apache.spark.sql.functions._ import org.apache.spark.sql.types._ val data = spark.createDataFrame(sc.parallelize( Seq(Row(Row(Row(1, 2, 3), Row(Row(4, null, 6), Row(7, 8, 9), Row(10, 11, 12)), Row(13, 14, 15))))), StructType(Seq( StructField("a", StructType(Seq( StructField("a", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))), StructField("b", StructType(Seq( StructField("a", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))), StructField("b", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))), StructField("c", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))) ))), StructField("c", StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))) )))))).cache data.show(false) +---------------------------------+ \|a \| +---------------------------------+ \|[[1, 2, 3], [[4,, 6], [7, 8, 9]]]\| +---------------------------------+ ``` Currently, to replace the missing value users would have to do something like this: ``` val result = 
data.withColumn("a", struct( $"a.a", struct( struct( $"a.b.a.a", lit(5).as("b"), $"a.b.a.c" ).as("a"), $"a.b.b", $"a.b.c" ).as("b"), $"a.c" )) result.show(false) +---------------------------------------------------------------+ \|a \| +---------------------------------------------------------------+ \|[[1, 2, 3], [[4, 5, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]\| +---------------------------------------------------------------+ ``` As you can see above, with the existing methods users must call the `struct` function and list all fields, including fields they don't want to change. This is not ideal as: >this leads to complex, fragile code that cannot survive schema evolution. [SPARK-16483](https://issues.apache.org/jira/browse/SPARK-16483) In contrast, with the method added in this PR, a user could simply do something like this: ``` val result = data.withColumn("a", 'a.withField("b.a.b", lit(5))) result.show(false) +---------------------------------------------------------------+ \|a \| +---------------------------------------------------------------+ \|[[1, 2, 3], [[4, 5, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]\| +---------------------------------------------------------------+ ``` This is the first of maybe a few methods that could be added to the `Column` class to make it easier to manipulate nested data. Other methods under discussion in [SPARK-22231](https://issues.apache.org/jira/browse/SPARK-22231) include `drop` and `renameField`. However, these should be added in a separate PR. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New unit tests were added. Jenkins must pass them. ### Related JIRAs: - https://issues.apache.org/jira/browse/SPARK-22231 - https://issues.apache.org/jira/browse/SPARK-16483 Closes #27066 from fqaiser94/SPARK-22231-withField. 
Lead-authored-by: fqaiser94@gmail.com <fqaiser94@gmail.com> Co-authored-by: fqaiser94 <fqaiser94@gmail.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-07 16:34:03 +00:00
zhengruifeng	8d5c0947f8	[SPARK-32164][ML] GeneralizedLinearRegressionSummary optimization ### What changes were proposed in this pull request? 1. GeneralizedLinearRegressionSummary computes several statistics in a single pass 2. LinearRegressionSummary uses metrics.count ### Why are the changes needed? Avoid extra passes over the dataset. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test suites. Closes #28990 from zhengruifeng/glr_summary_opt. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>	2020-07-07 08:30:15 -07:00
Max Gekk	75d342858a	[SPARK-32209][SQL] Re-use GetTimestamp in ParseToDate ### What changes were proposed in this pull request? Replace the combination of expressions `SecondsToTimestamp` and `UnixTimestamp` by `GetTimestamp` in `ParseToDate`. ### Why are the changes needed? Eliminate unnecessary parsing overhead in: string -> timestamp -> long (seconds) -> timestamp -> date. After the changes, the chain will be: string -> timestamp -> date. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By existing test suites such as `DateFunctionsSuite`. Closes #28999 from MaxGekk/ParseToDate-parse-timestamp. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-07 07:26:40 -07:00
ulysses	2e23da2bda	[SPARK-31975][SQL] Show AnalysisException when WindowFunction is used without WindowExpression ### What changes were proposed in this pull request? Add a WindowFunction check to `CheckAnalysis`. ### Why are the changes needed? Provide a friendlier error message. BEFORE ```scala scala> sql("select rank() from values(1)").show java.lang.UnsupportedOperationException: Cannot generate code for expression: rank() ``` AFTER ```scala scala> sql("select rank() from values(1)").show org.apache.spark.sql.AnalysisException: Window function rank() requires an OVER clause.;; Project [rank() AS RANK()#3] +- LocalRelation [col1#2] ``` ### Does this PR introduce _any_ user-facing change? Yes, users will be given a better error message. ### How was this patch tested? Pass the newly added UT. Closes #28808 from ulysses-you/SPARK-31975. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-07 13:39:04 +00:00
Wenchen Fan	5d296ed39e	[SPARK-32167][SQL] Fix GetArrayStructFields to respect inner field's nullability together ### What changes were proposed in this pull request? Fix nullability of `GetArrayStructFields`. It should consider both the original array's `containsNull` and the inner field's nullability. ### Why are the changes needed? Fix a correctness issue. ### Does this PR introduce _any_ user-facing change? Yes. See the added test. ### How was this patch tested? a new UT and end-to-end test Closes #28992 from cloud-fan/bug. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-06 20:07:33 -07:00
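The nullability rule stated in SPARK-32167 can be written down as a one-line predicate. The sketch below assumes the two inputs are combined with a logical OR, which is how "consider both" is read here; the `Field` class and the function name are hypothetical and do not mirror Catalyst's types.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Field:
    # Minimal stand-in for a struct field (hypothetical).
    name: str
    nullable: bool

def extracted_array_contains_null(array_contains_null, field):
    # Extracting `field` from array<struct<...>> yields array<fieldType>.
    # An element can be null if the struct itself was null (the array's
    # containsNull) or if the field inside a non-null struct is nullable.
    return array_contains_null or field.nullable

cases = [
    (False, Field("a", False)),
    (False, Field("a", True)),
    (True, Field("a", False)),
    (True, Field("a", True)),
]
results = [extracted_array_contains_null(c, f) for c, f in cases]
print(results)
```

Only the first case, a non-null struct array with a non-nullable field, can safely report that the extracted array contains no nulls.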
sidedoorleftroad	3fe3365292	[SPARK-32172][CORE] Use createDirectory instead of mkdir ### What changes were proposed in this pull request? Use Files.createDirectory() to create local directory instead of File.mkdir() in DiskBlockManager. Many times, we will see such error log information like "Failed to create local dir in xxxxxx". But there is no clear information indicating why the directory creation failed. When Files.createDirectory() fails to create a local directory, it can give specific error information for subsequent troubleshooting(also throws IOException). ### Why are the changes needed? Throw clear error message when creating directory fails. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? `DiskBlockManagerSuite` Closes #28997 from sidedoorleftroad/SPARK-32172. Authored-by: sidedoorleftroad <sidedoorleftroad@163.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-06 09:20:31 -07:00
Bryan Cutler	1d1809636b	[SPARK-32162][PYTHON][TESTS] Improve error message of Pandas grouped map test with window ### What changes were proposed in this pull request? Improve the error message in test GroupedMapInPandasTests.test_grouped_over_window_with_key to show the incorrect values. ### Why are the changes needed? This test failure has come up often in Arrow testing because it tests a struct with timestamp values through a Pandas UDF. The current error message is not helpful as it doesn't show the incorrect values, only that it failed. This change will instead raise an assertion error with the incorrect values on a failure. Before: ``` ====================================================================== FAIL: test_grouped_over_window_with_key (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 588, in test_grouped_over_window_with_key self.assertTrue(all([r[0] for r in result])) AssertionError: False is not true ``` After: ``` ====================================================================== ERROR: test_grouped_over_window_with_key (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests) ---------------------------------------------------------------------- ... AssertionError: {'start': datetime.datetime(2018, 3, 20, 0, 0), 'end': datetime.datetime(2018, 3, 25, 0, 0)}, != {'start': datetime.datetime(2020, 3, 20, 0, 0), 'end': datetime.datetime(2020, 3, 25, 0, 0)} ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Improved existing test Closes #28987 from BryanCutler/pandas-grouped-map-test-output-SPARK-32162. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-06 21:39:41 +09:00
Kent Yao	59a70879c0	[SPARK-32145][SQL][TEST-HIVE1.2][TEST-HADOOP2.7] ThriftCLIService.GetOperationStatus should include exception's stack trace to the error message ### What changes were proposed in this pull request? In https://issues.apache.org/jira/browse/SPARK-29283, we only show the error message of the root cause to end-users through the JDBC client. In some cases, this erases the straightforward messages that we intentionally provide to help them understand the problem. The root cause can be obscure for JDBC end-users who only write SQL queries. For example, ``` Error running query: org.apache.spark.sql.AnalysisException: The second argument of 'date_sub' function needs to be an integer.; ``` is better than just ``` Caused by: java.lang.NumberFormatException: invalid input syntax for type numeric: 1.2 ``` We should do as Hive does in https://issues.apache.org/jira/browse/HIVE-14368. In general, this PR partially reverts SPARK-29283, ports HIVE-14368, and improves test coverage. ### Why are the changes needed? 1. Do the same as Hive 2.3 and later for getting an error message in ThriftCLIService.GetOperationStatus 2. The root cause can be obscure for JDBC end-users who only write SQL queries. 3. Consistency with the `spark-sql` script ### Does this PR introduce _any_ user-facing change? Yes, when a query run through the thrift server fails, you will get the full stack traces instead of only the message of the root cause ### How was this patch tested? Added a unit test. Closes #28963 from yaooqinn/SPARK-32145. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-06 10:34:31 +00:00
Dongjoon Hyun	dea7bc464d	[SPARK-32100][CORE][TESTS][FOLLOWUP] Reduce the required test resources ### What changes were proposed in this pull request? This PR aims to reduce the required test resources in WorkerDecommissionExtendedSuite. ### Why are the changes needed? When the Jenkins farm is crowded, the following failure currently happens [here](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-3.2-hive-2.3/890/testReport/junit/org.apache.spark.scheduler/WorkerDecommissionExtendedSuite/Worker_decommission_and_executor_idle_timeout/) ``` java.util.concurrent.TimeoutException: Can't find 20 executors before 60000 milliseconds elapsed at org.apache.spark.TestUtils$.waitUntilExecutorsUp(TestUtils.scala:326) at org.apache.spark.scheduler.WorkerDecommissionExtendedSuite.$anonfun$new$2(WorkerDecommissionExtendedSuite.scala:45) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the Jenkins. Closes #29001 from dongjoon-hyun/SPARK-32100-2. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-05 20:12:41 -07:00
Dongjoon Hyun	0e33b5ecde	[SPARK-32178][TESTS] Disable test-dependencies.sh from Jenkins jobs ### What changes were proposed in this pull request? This PR aims to disable the dependency tests (`test-dependencies.sh`) in Jenkins. ### Why are the changes needed? - First of all, GitHub Action already provides the same test capability and is more stable. - Second, `test-dependencies.sh` currently fails very frequently in the AmpLab Jenkins environment. For example, in the following unrelated PR, it failed 5 times within 6 hours. - https://github.com/apache/spark/pull/29001 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the Jenkins without `test-dependencies.sh` invocation. Closes #29004 from dongjoon-hyun/SPARK-32178. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-06 12:03:08 +09:00
Kousuke Saruta	3726aab640	[SPARK-32177][WEBUI] Remove the weird line from near the Spark logo on mouseover in the WebUI ### What changes were proposed in this pull request? This PR changes `webui.css` to fix a style issue on moving mouse cursor on the Spark logo. ### Why are the changes needed? In the webui, the Spark logo is on the top right side. When we move mouse cursor on the logo, a weird underline appears near the logo. <img width="209" alt="logo_with_line" src="https://user-images.githubusercontent.com/4736016/86542828-3c6a9f00-bf54-11ea-9b9d-cc50c12c2c9b.png"> ### Does this PR introduce _any_ user-facing change? Yes. After this change applied, no more weird line shown even if mouse cursor moves on the logo. <img width="207" alt="removed-line-from-logo" src="https://user-images.githubusercontent.com/4736016/86542877-98cdbe80-bf54-11ea-8695-ee39689673ab.png"> ### How was this patch tested? By moving mouse cursor on the Spark logo and confirmed no more weird line there. Closes #29003 from sarutak/fix-logo-underline. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-05 19:09:04 -07:00
Huaxin Gao	492d5d174a	[SPARK-32171][SQL][DOCS] Change file locations for use db and refresh table ### What changes were proposed in this pull request? docs/sql-ref-syntax-qry-select-usedb.md -> docs/sql-ref-syntax-ddl-usedb.md docs/sql-ref-syntax-aux-refresh-table.md -> docs/sql-ref-syntax-aux-cache-refresh-table.md ### Why are the changes needed? usedb belongs to DDL. Its location should be consistent with other DDL commands file locations similar reason for refresh table ### Does this PR introduce _any_ user-facing change? before change, when clicking USE DATABASE, the side bar menu shows select commands <img width="1200" alt="Screen Shot 2020-07-04 at 9 05 35 AM" src="https://user-images.githubusercontent.com/13592258/86516696-b45f8a80-bdd7-11ea-8dba-3a5cca22aad3.png"> after change, when clicking USE DATABASE, the side bar menu shows DDL commands <img width="1120" alt="Screen Shot 2020-07-04 at 9 06 06 AM" src="https://user-images.githubusercontent.com/13592258/86516703-bf1a1f80-bdd7-11ea-8a90-ae7eaaafd44c.png"> before change, when clicking refresh table, the side bar menu shows Auxiliary statements <img width="1200" alt="Screen Shot 2020-07-04 at 9 30 40 AM" src="https://user-images.githubusercontent.com/13592258/86516877-3d2af600-bdd9-11ea-9568-0a6f156f57da.png"> after change, when clicking refresh table, the side bar menu shows Cache statements <img width="1199" alt="Screen Shot 2020-07-04 at 9 35 21 AM" src="https://user-images.githubusercontent.com/13592258/86516937-b4f92080-bdd9-11ea-8ad1-5f5a7f58d76b.png"> ### How was this patch tested? Manually build and check Closes #28995 from huaxingao/docs_fix. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>	2020-07-04 19:01:07 -07:00
Max Gekk	42f01e314b	[SPARK-32130][SQL][FOLLOWUP] Enable timestamps inference in JsonBenchmark ### What changes were proposed in this pull request? Set the JSON option `inferTimestamp` to `true` for the cases that measure the performance of timestamp inference. ### Why are the changes needed? The PR https://github.com/apache/spark/pull/28966 disabled timestamp inference by default. As a consequence, some benchmarks don't measure the performance of timestamp inference from JSON fields. This PR explicitly enables such inference. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By re-generating results of `JsonBenchmark`. Closes #28981 from MaxGekk/json-inferTimestamps-disable-by-default-followup. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-02 13:26:57 -07:00
TJX2014	0acad589e1	[SPARK-32156][SPARK-31061][TESTS][SQL] Refactor two similar test cases from in HiveExternalCatalogSuite ### What changes were proposed in this pull request? 1. Merge two similar tests for SPARK-31061 and clean up the code. 2. Fix a table alter issue caused by a lost path. ### Why are the changes needed? Because these two tests for SPARK-31061 are very similar and can be merged. Also, the first test case should use `rawTable` instead of `parquetTable` for the alter. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #28980 from TJX2014/master-follow-merge-spark-31061-test-case. Authored-by: TJX2014 <xiaoxingstack@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-07-02 10:15:10 -07:00
stczwd	f082a7996a	[SPARK-31100][SQL] Check namespace existens when setting namespace ### What changes were proposed in this pull request? Check that the namespace exists when calling "use namespace", and throw NoSuchNamespaceException if it does not. ### Why are the changes needed? Users need to know that the namespace does not exist when they try to set a wrong namespace. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Run all suites and add a test for this. Closes #27900 from stczwd/SPARK-31100. Authored-by: stczwd <qcsd2011@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-07-02 14:49:40 +00:00
Wenchen Fan	f83415629b	[MINOR][TEST][SQL] Make in-limit.sql more robust ### What changes were proposed in this pull request? For queries like `t1d in (SELECT t2d FROM t2 ORDER BY t2c LIMIT 2)`, the result can be non-deterministic because the subquery may return different rows (it's not sorted by `t2d` and it has a shuffle). This PR makes the test more robust by sorting the output column. ### Why are the changes needed? To avoid a flaky test. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A Closes #28976 from cloud-fan/small. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-02 21:04:26 +09:00
animenon	45fe6b62a7	[MINOR][DOCS] Pyspark getActiveSession docstring ### What changes were proposed in this pull request? Minor fix to the documentation of `getActiveSession`. The sample code snippet was not rendered correctly, so spacing was added to fix it. A description of the return value was also added to the docs. ### Why are the changes needed? The sample code is rendered as part of the description in the docs. [Current Doc Link](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=getactivesession#pyspark.sql.SparkSession.getActiveSession) ### Does this PR introduce _any_ user-facing change? Yes, the documentation of getActiveSession is fixed, and a description of the return value is added. ### How was this patch tested? Adding spacing between the description and the code seems to fix the issue. Closes #28978 from animenon/docs_minor. Authored-by: animenon <animenon@mail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-02 21:02:00 +09:00
pancheng	7fda184f0f	[SPARK-32121][SHUFFLE] Support Windows OS in ExecutorDiskUtils ### What changes were proposed in this pull request? Correct the file separator used in `ExecutorDiskUtils.createNormalizedInternedPathname` on Windows. ### Why are the changes needed? `ExternalShuffleBlockResolverSuite` failed on Windows; see details at: https://issues.apache.org/jira/browse/SPARK-32121 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? The existing test suite. Closes #28940 from pan3793/SPARK-32121. Lead-authored-by: pancheng <379377944@qq.com> Co-authored-by: chengpan <cheng.pan@idiaoyan.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-07-02 19:21:11 +09:00

27767 commits