### What changes were proposed in this pull request?
This patch corrects the method doc of DataFrameWriterV2.replace() which explanation of exception is described oppositely.
### Why are the changes needed?
The method doc is incorrect.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Only doc change.
Closes#29568 from HeartSaVioR/SPARK-28612-FOLLOWUP-fix-doc-nit.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to move Arrow usage guide from Spark documentation site to PySpark documentation site (at "User Guide").
Here is the demo for reviewing quicker: https://hyukjin-spark.readthedocs.io/en/stable/user_guide/arrow_pandas.html
### Why are the changes needed?
To have a single place for PySpark users, and better documentation.
### Does this PR introduce _any_ user-facing change?
Yes, it will move https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html to our PySpark documentation.
### How was this patch tested?
```bash
cd docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch
```
and
```bash
cd python/docs
make clean html
```
Closes#29548 from HyukjinKwon/SPARK-32183.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR changes key data types check in `HashJoin` to use `sameType`.
### Why are the changes needed?
Looks at the resolving condition of `SetOperation`, it requires only each left data types should be `sameType` as the right ones. Logically the `EqualTo` expression in equi-join, also requires only left data type `sameType` as right data type. Then `HashJoin` requires left keys data type exactly the same as right keys data type, looks not reasonable.
It makes inconsistent results when doing `except` between two dataframes.
If two dataframes don't have nested fields, even their field nullable property different, `HashJoin` passes the key type check because it checks field individually so field nullable property is ignored.
If two dataframes have nested fields like struct, `HashJoin` fails the key type check because now it compare two struct types and nullable property now affects.
### Does this PR introduce _any_ user-facing change?
Yes. Making consistent `except` operation between dataframes.
### How was this patch tested?
Unit test.
Closes#29555 from viirya/SPARK-32693.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR aims to support executor id placeholder in `spark.kubernetes.executor.volumes.persistentVolumeClaim.myname.options.claimName` configuration like the following.
```
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=pvc-spark-SPARK_EXECUTOR_ID \
```
### Why are the changes needed?
This is a convenient way to mount corresponding PV to the executor.
### Does this PR introduce _any_ user-facing change?
Yes, but this is a new feature and there is no regression because users don't use `SPARK_EXECUTOR_ID` in PVC claim name.
### How was this patch tested?
Pass the newly added test case.
Closes#29557 from dongjoon-hyun/SPARK-PVC.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
The current documentation states that the default value of spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version is 1 which is not entirely true since this configuration isn't set anywhere in Spark but rather inherited from the Hadoop FileOutputCommitter class.
### What changes were proposed in this pull request?
I'm submitting this change, to clarify that the default value will entirely depend on the Hadoop version of the runtime environment.
### Why are the changes needed?
An application would end up using algorithm version 1 on certain environments but without any changes the same exact application will use version 2 on environments running Hadoop 3.0 and later. This can have pretty bad consequences in certain scenarios, for example, two tasks can partially overwrite their output if speculation is enabled. Also, please refer to the following JIRA:
https://issues.apache.org/jira/browse/MAPREDUCE-7282
### Does this PR introduce _any_ user-facing change?
Yes. Configuration page content was modified where previously we explicitly highlighted that the default version for the FileOutputCommitter algorithm was v1, this now has changed to "Dependent on environment" with additional information in the description column to elaborate.
### How was this patch tested?
Checked changes locally in browser
Closes#29541 from waleedfateem/SPARK-32701.
Authored-by: waleedfateem <waleed.fateem@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
We are using an older version of Bootstrap (v. 2.1.0) for the online documentation site. Bootstrap 2.x was moved to EOL in Aug 2013 and Bootstrap 3.x was moved to EOL in July 2019 (https://github.com/twbs/release). Older versions of Bootstrap are also getting flagged in security scans for various CVEs:
https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-72889https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-173700https://snyk.io/vuln/npm:bootstrap:20180529https://snyk.io/vuln/npm:bootstrap:20160627
I haven't validated each CVE, but it would probably be good practice to resolve any potential issues and get on a supported release.
The bad news is that there have been quite a few changes between Bootstrap 2 and Bootstrap 4. I've tried updating the library, refactoring/tweaking the CSS and JS to maintain a similar appearance and functionality, and testing the documentation. This is a fairly large change so I'm sure additional testing and fixes will be needed.
### How was this patch tested?
This has been manually tested, but as there is a lot of documentation it is possible issues were missed. Additional testing and feedback is welcomed. If it appears a whole section was missed let me know and I'll take a pass at addressing that section.
Closes#27369 from clarkead/bootstrap4-docs-upgrade.
Authored-by: Dale Clarke <a.dale.clarke@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR let JDBC clients identify spark interval columns properly.
### Why are the changes needed?
JDBC users can query interval values through thrift server, create views with interval columns, e.g.
```sql
CREATE global temp view view1 as select interval 1 day as i;
```
but when they want to get the details of the columns of view1, the will fail with `Unrecognized type name: INTERVAL`
```
Caused by: java.lang.IllegalArgumentException: Unrecognized type name: INTERVAL
at org.apache.hadoop.hive.serde2.thrift.Type.getType(Type.java:170)
at org.apache.spark.sql.hive.thriftserver.ThriftserverShimUtils$.toJavaSQLType(ThriftserverShimUtils.scala:53)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$addToRowSet$1(SparkGetColumnsOperation.scala:157)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.addToRowSet(SparkGetColumnsOperation.scala:149)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$6(SparkGetColumnsOperation.scala:113)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$6$adapted(SparkGetColumnsOperation.scala:112)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$5(SparkGetColumnsOperation.scala:112)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$5$adapted(SparkGetColumnsOperation.scala:111)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.runInternal(SparkGetColumnsOperation.scala:111)
... 34 more
```
### Does this PR introduce _any_ user-facing change?
YES,
#### before
![image](https://user-images.githubusercontent.com/8326978/91162239-6cd1ec80-e6fe-11ea-8c2c-914ddb325c4e.png)
#### after
![image](https://user-images.githubusercontent.com/8326978/91162025-1a90cb80-e6fe-11ea-94c4-03a6f2ec296b.png)
### How was this patch tested?
new tests
Closes#29539 from yaooqinn/SPARK-32696.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently, EmptyHashedRelation and HashedRelationWithAllNullKeys is an object, and it will cause JavaDeserialization Exception as following
```
20/08/26 11:13:30 WARN [task-result-getter-2] TaskSetManager: Lost task 34.0 in stage 57.0 (TID 18076, emr-worker-5.cluster-183257, executor 18): java.io.InvalidClassException: org.apache.spark.sql.execution.joins.EmptyHashedRelation$; no valid constructor
at java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:169)
at java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:874)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1572)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:430)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$unBlockifyObject$4(TorrentBroadcast.scala:328)
```
This PR includes
* Using case object instead to fix serialization issue.
* Also change EmptyHashedRelation not to extend NullAwareHashedRelation since it's already being used in other non-NAAJ joins.
### Why are the changes needed?
It will cause BHJ failed when buildSide is Empty and BHJ(NAAJ) failed when buildSide with null partition keys.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
* Existing UT.
* Run entire TPCDS for E2E coverage.
Closes#29547 from leanken/leanken-SPARK-32705.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a follow up PR to #29328 to apply the same constraint where `path` option cannot coexist with path parameter to `DataFrameWriter.save()`, `DataStreamReader.load()` and `DataStreamWriter.start()`.
### Why are the changes needed?
The current behavior silently overwrites the `path` option if path parameter is passed to `DataFrameWriter.save()`, `DataStreamReader.load()` and `DataStreamWriter.start()`.
For example,
```
Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
```
will write the result to `/tmp/path2`.
### Does this PR introduce _any_ user-facing change?
Yes, if `path` option coexists with path parameter to any of the above methods, it will throw `AnalysisException`:
```
scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
org.apache.spark.sql.AnalysisException: There is a 'path' option set and save() is called with a path parameter. Either remove the path option, or call save() without the parameter. To ignore this check, set 'spark.sql.legacy.pathOptionBehavior.enabled' to 'true'.;
```
The user can restore the previous behavior by setting `spark.sql.legacy.pathOptionBehavior.enabled` to `true`.
### How was this patch tested?
Added new tests.
Closes#29543 from imback82/path_option.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The decommissioning state is a bit fragment across two places in the TaskSchedulerImpl:
https://github.com/apache/spark/pull/29014/ stored the incoming decommission info messages in TaskSchedulerImpl.executorsPendingDecommission.
While https://github.com/apache/spark/pull/28619/ was storing just the executor end time in the map TaskSetManager.tidToExecutorKillTimeMapping (which in turn is contained in TaskSchedulerImpl).
While the two states are not really overlapping, it's a bit of a code hygiene concern to save this state in two places.
With https://github.com/apache/spark/pull/29422, TaskSchedulerImpl is emerging as the place where all decommissioning book keeping is kept within the driver. So consolidate the information in _tidToExecutorKillTimeMapping_ into _executorsPendingDecommission_.
However, in order to do so, we need to walk away from keeping the raw ExecutorDecommissionInfo messages and instead keep another class ExecutorDecommissionState. This decoupling will allow the RPC message class ExecutorDecommissionInfo to evolve independently from the book keeping ExecutorDecommissionState.
### Why are the changes needed?
This is just a code cleanup. These two features were added independently and its time to consolidate their state for good hygiene.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests.
Closes#29452 from agrawaldevesh/consolidate_decom_state.
Authored-by: Devesh Agrawal <devesh.agrawal@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
### What changes were proposed in this pull request?
Fixes typo in docsting of `toDF`
### Why are the changes needed?
The third argument of `toDF` is actually `sampleRatio`.
related discussion: https://github.com/apache/spark/pull/12746#discussion-diff-62704834
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
This patch doesn't affect any logic, so existing tests should cover it.
Closes#29551 from unirt/minor_fix_docs.
Authored-by: unirt <lunirtc@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Use `InSet` expression to fix data issue when pruning DPP on non-atomic type. for example:
```scala
spark.range(1000)
.select(col("id"), col("id").as("k"))
.write
.partitionBy("k")
.format("parquet")
.mode("overwrite")
.saveAsTable("df1");
spark.range(100)
.select(col("id"), col("id").as("k"))
.write
.partitionBy("k")
.format("parquet")
.mode("overwrite")
.saveAsTable("df2")
spark.sql("set spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2")
spark.sql("set spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false")
spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = struct(df2.k) AND df2.id < 2").show
```
It should return two records, but it returns empty.
### Why are the changes needed?
Fix data issue
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add new unit test.
Closes#29475 from wangyum/SPARK-32659.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Instead of deleting the data, we can move the data to trash.
Based on the configuration provided by the user it will be deleted permanently from the trash.
### Why are the changes needed?
Instead of directly deleting the data, we can provide flexibility to move data to the trash and then delete it permanently.
### Does this PR introduce _any_ user-facing change?
Yes, After truncate table the data is not permanently deleted now.
It is first moved to the trash and then after the given time deleted permanently;
### How was this patch tested?
new UTs added
Closes#29387 from Udbhav30/tuncateTrash.
Authored-by: Udbhav30 <u.agrawal30@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR regenerates the golden explain file based on the fix: https://github.com/apache/spark/pull/29537
### Why are the changes needed?
Eliminates the personal related information (e.g., local directories) in the explain plan.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Checked manually.
Closes#29546 from Ngone51/follow-up-gen-golden-file.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to explicitly cache and hash the files/directories under 'build' for SBT and Zinc at GitHub Actions. Otherwise, it can end up with overwriting `build` directory. See also https://github.com/apache/spark/pull/29286#issuecomment-679368436
Previously, other files like `build/mvn` and `build/sbt` are also cached and overwritten. So, when you have some changes there, they are ignored.
### Why are the changes needed?
To make GitHub Actions build stable.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
The builds in this PR test it out.
Closes#29536 from HyukjinKwon/SPARK-32695.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Spark's CSV source can optionally ignore lines starting with a comment char. Some code paths check to see if it's set before applying comment logic (i.e. not set to default of `\0`), but many do not, including the one that passes the option to Univocity. This means that rows beginning with a null char were being treated as comments even when 'disabled'.
### Why are the changes needed?
To avoid dropping rows that start with a null char when this is not requested or intended. See JIRA for an example.
### Does this PR introduce _any_ user-facing change?
Nothing beyond the effect of the bug fix.
### How was this patch tested?
Existing tests plus new test case.
Closes#29516 from srowen/SPARK-32614.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
fixing flaky test in ExecutorAllocationManagerSuite. The issue is that there is a timing issue when we do a reset as to when the numExecutorsToAddPerResourceProfileId gets reset. The fix is to just always set those values back to 1 when we call reset().
### Why are the changes needed?
fixing flaky test in ExecutorAllocationManagerSuite
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ran the unit test via this PR a bunch of times and the fix seems to be working.
Closes#29508 from tgravescs/debugExecAllocTest.
Authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Changing an info log to a debug log based on SPARK-32664
### Why are the changes needed?
It is outlined in SPARK-32664
### Does this PR introduce _any_ user-facing change?
There are changes to the debug and info logs
### How was this patch tested?
Tested by looking at the logs
Closes#29527 from dmoore62/SPARK-32664.
Authored-by: Daniel Moore <moore@knights.ucf.edu>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes the doc error and add a migration guide for datetime pattern.
### Why are the changes needed?
This is a bug of the doc that we inherited from JDK https://bugs.openjdk.java.net/browse/JDK-8169482
The SimpleDateFormatter(**F Day of week in month**) we used in 2.x and the DatetimeFormatter(**F week-of-month**) we use now both have the opposite meanings to what they declared in the java docs. And unfortunately, this also leads to silent data change in Spark too.
The `week-of-month` is actually the pattern `W` in DatetimeFormatter, which is banned to use in Spark 3.x.
If we want to keep pattern `F`, we need to accept the behavior change with proper migration guide and fix the doc in Spark
### Does this PR introduce _any_ user-facing change?
Yes, doc changed
### How was this patch tested?
passing ci doc generating job
Closes#29538 from yaooqinn/SPARK-32683.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Extract `SQLQueryTestSuite.replaceNotIncludedMsg` to `PlanTest`.
2. Reuse `replaceNotIncludedMsg` to normalize the explain plan that generated in `PlanStabilitySuite`.
### Why are the changes needed?
This's a follow-up of https://github.com/apache/spark/pull/29270.
Eliminates the personal related information (e.g., local directories) in the explain plan.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Updated test.
Closes#29537 from Ngone51/follow-up-plan-stablity.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…er operations"
### What changes were proposed in this pull request?
This reverts commit 510a1656e6.
### Why are the changes needed?
see https://github.com/apache/spark/pull/29204#discussion_r475716547
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
pass ci tools
Closes#29531 from yaooqinn/revert.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
There is a bug in the way the optimizer rule in `SimplifyExtractValueOps` is currently written in master branch which yields incorrect results in scenarios like the following:
```
sql("SELECT CAST(NULL AS struct<a:int,b:int>) struct_col")
.select($"struct_col".withField("d", lit(4)).getField("d").as("d"))
// currently returns this:
+---+
|d |
+---+
|4 |
+---+
// when in fact it should return this:
+----+
|d |
+----+
|null|
+----+
```
The changes in this PR will fix this bug.
### Why are the changes needed?
To fix the aforementioned bug. Optimizer rules should improve the performance of the query but yield exactly the same results.
### Does this PR introduce _any_ user-facing change?
Yes, this bug will no longer occur.
That said, this isn't something to be concerned about as this bug was introduced in Spark 3.1 and Spark 3.1 has yet to be released.
### How was this patch tested?
Unit tests were added. Jenkins must pass them.
Closes#29522 from fqaiser94/SPARK-32641.
Authored-by: fqaiser94@gmail.com <fqaiser94@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to fix ORC predicate pushdown under case-insensitive analysis case. The field names in pushed down predicates don't need to match in exact letter case with physical field names in ORC files, if we enable case-insensitive analysis.
This is re-submitted for #29457. Because #29457 has a hive-1.2 error and there were some tests failed with hive-1.2 profile at the same time, #29457 was reverted to unblock others.
### Why are the changes needed?
Currently ORC predicate pushdown doesn't work with case-insensitive analysis. A predicate "a < 0" cannot pushdown to ORC file with field name "A" under case-insensitive analysis.
But Parquet predicate pushdown works with this case. We should make ORC predicate pushdown work with case-insensitive analysis too.
### Does this PR introduce _any_ user-facing change?
Yes, after this PR, under case-insensitive analysis, ORC predicate pushdown will work.
### How was this patch tested?
Unit tests.
Closes#29530 from viirya/fix-orc-pushdown.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR enhances `Catalog.createTable()` to allow users to set the table's description. This corresponds to the following SQL syntax:
```sql
CREATE TABLE ...
COMMENT 'this is a fancy table';
```
### Why are the changes needed?
This brings the Scala/Python catalog APIs a bit closer to what's already possible via SQL.
### Does this PR introduce any user-facing change?
Yes, it adds a new parameter to `Catalog.createTable()`.
### How was this patch tested?
Existing unit tests:
```sh
./python/run-tests \
--python-executables python3.7 \
--testnames 'pyspark.sql.tests.test_catalog,pyspark.sql.tests.test_context'
```
```
$ ./build/sbt
testOnly org.apache.spark.sql.internal.CatalogSuite org.apache.spark.sql.CachedTableSuite org.apache.spark.sql.hive.MetastoreDataSourcesSuite org.apache.spark.sql.hive.execution.HiveDDLSuite
```
Closes#27908 from nchammas/SPARK-31000-table-description.
Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This pull request adds 2 test suites for 2 new classes HybridStore and HistoryServerMemoryManager, which were created in https://github.com/apache/spark/pull/28412. This pull request also did some minor changes in these 2 classes to expose some variables for testing. Besides 2 suites, this pull request adds a unit test in FsHistoryProviderSuite to test parsing logs with HybridStore.
### Why are the changes needed?
Unit tests are needed for new features.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes#29509 from baohe-zhang/SPARK-31608-UT.
Authored-by: Baohe Zhang <baohe.zhang@verizonmedia.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Added Java docs for Long data types in the Row class.
### Why are the changes needed?
The Long datatype is somehow missing in Row.scala's `apply` and `get` methods.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UTs.
Closes#29534 from yeshengm/docs-fix.
Authored-by: Yesheng Ma <kimi.ysma@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Change to use `dataTypes.foreach` instead of get the element use specified index in `def this(dataTypes: Seq[DataType]) `constructor of `SpecificInternalRow` because the random access performance is unsatisfactory if the input argument not a `IndexSeq`.
This pr followed srowen's advice.
### Why are the changes needed?
I found that SPARK-32550 had some negative impact on performance, the typical cases is "deterministic cardinality estimation" in `HyperLogLogPlusPlusSuite` when rsd is 0.001, we found the code that is significantly slower is line 41 in `HyperLogLogPlusPlusSuite`: `new SpecificInternalRow(hll.aggBufferAttributes.map(_.dataType)) `
08b951b1cb/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlusSuite.scala (L40-L44)
The size of "hll.aggBufferAttributes" in this case is 209716, the results of comparison before and after spark-32550 merged are as follows, The unit is ns:
| After SPARK-32550 createBuffer | After SPARK-32550 end to end | Before SPARK-32550 createBuffer | Before SPARK-32550 end to end
-- | -- | -- | -- | --
rsd 0.001, n 1000 | 52715513243 | 53004810687 | 195807999 | 773977677
rsd 0.001, n 5000 | 51881246165 | 52519358215 | 13689949 | 249974855
rsd 0.001, n 10000 | 52234282788 | 52374639172 | 14199071 | 183452846
rsd 0.001, n 50000 | 55503517122 | 55664035449 | 15219394 | 584477125
rsd 0.001, n 100000 | 51862662845 | 52116774177 | 19662834 | 166483678
rsd 0.001, n 500000 | 51619226715 | 52183189526 | 178048012 | 16681330
rsd 0.001, n 1000000 | 54861366981 | 54976399142 | 226178708 | 18826340
rsd 0.001, n 5000000 | 52023602143 | 52354615149 | 388173579 | 15446409
rsd 0.001, n 10000000 | 53008591660 | 53601392304 | 533454460 | 16033032
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
`mvn test -pl sql/catalyst -DwildcardSuites=org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlusSuite -Dtest=none`
**Before**:
```
Run completed in 8 minutes, 18 seconds.
Total number of tests run: 5
Suites: completed 2, aborted 0
Tests: succeeded 5, failed 0, canceled 0, ignored 0, pending 0
```
**After**
```
Run completed in 7 seconds, 65 milliseconds.
Total number of tests run: 5
Suites: completed 2, aborted 0
Tests: succeeded 5, failed 0, canceled 0, ignored 0, pending 0
```
Closes#29529 from LuciferYang/revert-spark-32550.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
As discussed in https://github.com/apache/spark/pull/29491#discussion_r474451282 and in SPARK-32686, this PR un-deprecates Spark's ability to infer a DataFrame schema from a list of dictionaries. The ability is Pythonic and matches functionality offered by Pandas.
### Why are the changes needed?
This change clarifies to users that this behavior is supported and is not going away in the near future.
### Does this PR introduce _any_ user-facing change?
Yes. There used to be a `UserWarning` for this, but now there isn't.
### How was this patch tested?
I tested this manually.
Before:
```python
>>> spark.createDataFrame(spark.sparkContext.parallelize([{'a': 5}]))
/Users/nchamm/Documents/GitHub/nchammas/spark/python/pyspark/sql/session.py:388: UserWarning: Using RDD of dict to inferSchema is deprecated. Use pyspark.sql.Row instead
warnings.warn("Using RDD of dict to inferSchema is deprecated. "
DataFrame[a: bigint]
>>> spark.createDataFrame([{'a': 5}])
.../python/pyspark/sql/session.py:378: UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
warnings.warn("inferring schema from dict is deprecated,"
DataFrame[a: bigint]
```
After:
```python
>>> spark.createDataFrame(spark.sparkContext.parallelize([{'a': 5}]))
DataFrame[a: bigint]
>>> spark.createDataFrame([{'a': 5}])
DataFrame[a: bigint]
```
Closes#29510 from nchammas/SPARK-32686-df-dict-infer-schema.
Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to make the behavior consistent for the `path` option when loading dataframes with a single path (e.g, `option("path", path).format("parquet").load(path)` vs. `option("path", path).parquet(path)`) by disallowing `path` option to coexist with `load`'s path parameters.
### Why are the changes needed?
The current behavior is inconsistent:
```scala
scala> Seq(1).toDF.write.mode("overwrite").parquet("/tmp/test")
scala> spark.read.option("path", "/tmp/test").format("parquet").load("/tmp/test").show
+-----+
|value|
+-----+
| 1|
+-----+
scala> spark.read.option("path", "/tmp/test").parquet("/tmp/test").show
+-----+
|value|
+-----+
| 1|
| 1|
+-----+
```
### Does this PR introduce _any_ user-facing change?
Yes, now if the `path` option is specified along with `load`'s path parameters, it would fail:
```scala
scala> Seq(1).toDF.write.mode("overwrite").parquet("/tmp/test")
scala> spark.read.option("path", "/tmp/test").format("parquet").load("/tmp/test").show
org.apache.spark.sql.AnalysisException: There is a path option set and load() is called with path parameters. Either remove the path option or move it into the load() parameters.;
at org.apache.spark.sql.DataFrameReader.verifyPathOptionDoesNotExist(DataFrameReader.scala:310)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
... 47 elided
scala> spark.read.option("path", "/tmp/test").parquet("/tmp/test").show
org.apache.spark.sql.AnalysisException: There is a path option set and load() is called with path parameters. Either remove the path option or move it into the load() parameters.;
at org.apache.spark.sql.DataFrameReader.verifyPathOptionDoesNotExist(DataFrameReader.scala:310)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:250)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:778)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:756)
... 47 elided
```
The user can restore the previous behavior by setting `spark.sql.legacy.pathOptionBehavior.enabled` to `true`.
### How was this patch tested?
Added a test
Closes#29328 from imback82/dfw_option.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
In order to produce consistent results from SizeEstimator the tests
override some system properties that are used during SizeEstimator
initialization. However there were several places where either the
compressed references property wasn't set or the system properties
were set but the SizeEstimator not re-initialized.
This caused failures when running the tests with a large heap build
of OpenJ9 because it does not use compressed references unlike most
environments.
### What changes were proposed in this pull request?
Initialize SizeEstimator class explicitly in the tests where required to avoid relying on a particular environment.
### Why are the changes needed?
Test failures can be seen when compressed references are disabled (e.g. using an OpenJ9 large heap build or Hotspot with a large heap).
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tests run on machine running OpenJ9 large heap build.
Closes#29407 from mundaym/fix-sizeestimator.
Authored-by: Michael Munday <mike.munday@ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
For broadcast hash join and shuffled hash join, whenever the build side hashed relation turns out to be empty. We don't need to execute stream side plan at all, and can return an empty iterator (for inner join and left semi join), because we know for sure that none of stream side rows can be outputted as there's no match.
### Why are the changes needed?
A very minor optimization for rare use case, but in case build side turns out to be empty, we can leverage it to short-cut stream side to save CPU and IO.
Example broadcast hash join query similar to `JoinBenchmark` with empty hashed relation:
```
def broadcastHashJoinLongKey(): Unit = {
val N = 20 << 20
val M = 1 << 16
val dim = broadcast(spark.range(0).selectExpr("id as k", "cast(id as string) as v"))
codegenBenchmark("Join w long", N) {
val df = spark.range(N).join(dim, (col("id") % M) === col("k"))
assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
df.noop()
}
}
```
Comparing wall clock time for enabling and disabling this PR (for non-codegen code path). Seeing like 8x improvement.
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
Join w long: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Join PR disabled 637 646 12 32.9 30.4 1.0X
Join PR enabled 77 78 2 271.8 3.7 8.3X
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit test in `JoinSuite`.
Closes#29484 from c21/empty-relation.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes a bug of #29406. #29406 partially pushes down data filter even if it mixed in partition filters. But in some cases partition columns might be in data columns too. It will possibly push down a predicate with partition column to datasource.
### Why are the changes needed?
The test "org.apache.spark.sql.hive.orc.HiveOrcHadoopFsRelationSuite.save()/load() - partitioned table - simple queries - partition columns in data" is currently failed with hive-1.2 profile in master branch.
```
[info] - save()/load() - partitioned table - simple queries - partition columns in data *** FAILED *** (1 second, 457 milliseconds)
[info] java.util.NoSuchElementException: key not found: p1
[info] at scala.collection.immutable.Map$Map2.apply(Map.scala:138)
[info] at org.apache.spark.sql.hive.orc.OrcFilters$.buildLeafSearchArgument(OrcFilters.scala:250)
[info] at org.apache.spark.sql.hive.orc.OrcFilters$.convertibleFiltersHelper$1(OrcFilters.scala:143)
[info] at org.apache.spark.sql.hive.orc.OrcFilters$.$anonfun$convertibleFilters$4(OrcFilters.scala:146)
[info] at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
[info] at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
[info] at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
[info] at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
[info] at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
[info] at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
[info] at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
[info] at org.apache.spark.sql.hive.orc.OrcFilters$.convertibleFilters(OrcFilters.scala:145)
[info] at org.apache.spark.sql.hive.orc.OrcFilters$.createFilter(OrcFilters.scala:83)
[info] at org.apache.spark.sql.hive.orc.OrcFileFormat.buildReader(OrcFileFormat.scala:142)
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#29526 from viirya/SPARK-32352-followup.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Copy to master branch the unit test added for branch-2.4(https://github.com/apache/spark/pull/29430).
### Why are the changes needed?
The unit test will pass at master branch, indicating that issue reported in https://issues.apache.org/jira/browse/SPARK-32609 is already fixed at master branch. But adding this unit test for future possible failure catch.
### Does this PR introduce _any_ user-facing change?
no.
### How was this patch tested?
sbt test run
Closes#29435 from mingjialiu/master.
Authored-by: mingjial <mingjial@google.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
The PR checks for the emptiness of `--py-files` value and uses it only if it is not empty.
### Why are the changes needed?
There is a bug in Mesos cluster mode REST Submission API. It is using `--py-files` option without specifying any value for the conf `spark.submit.pyFiles` by the user.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
* Submitting an application to a Mesos cluster:
`curl -X POST http://localhost:7077/v1/submissions/create --header "Content-Type:application/json" --data '{
"action": "CreateSubmissionRequest",
"appResource": "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
"clientSparkVersion": "3.0.0",
"appArgs": ["30"],
"environmentVariables": {},
"mainClass": "org.apache.spark.examples.SparkPi",
"sparkProperties": {
"spark.jars": "file:///opt/spark-3.0.0-bin-3.2.0/examples/jars/spark-examples_2.12-3.0.0.jar",
"spark.driver.supervise": "false",
"spark.executor.memory": "512m",
"spark.driver.memory": "512m",
"spark.submit.deployMode": "cluster",
"spark.app.name": "SparkPi",
"spark.master": "mesos://localhost:5050"
}}'`
* It should be able to pick the correct class and run the job successfully.
Closes#29499 from farhan5900/SPARK-32675.
Authored-by: farhan5900 <farhan5900@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Removing unused DELETE_ACTION in FileStreamSinkLog.
### Why are the changes needed?
DELETE_ACTION is not used nowhere in the code.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tests where not added, because code was removed.
Closes#29505 from michal-wieleba/SPARK-32648.
Authored-by: majsiel <majsiel@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Fix double caching in KMeans/BiKMeans:
1, let the callers of `runWithWeight` to pass whether `handlePersistence` is needed;
2, persist and unpersist inside of `runWithWeight`;
3, persist the `norms` if needed according to the comments;
### Why are the changes needed?
avoid double caching
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing testsuites
Closes#29501 from zhengruifeng/kmeans_handlePersistence.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Override `def get: Date` in `DaysWritable` use the `daysToMillis(int d)` from the parent class `DateWritable` instead of `long daysToMillis(int d, boolean doesTimeMatter)`.
### Why are the changes needed?
It fixes failures of `HiveSerDeReadWriteSuite` with the profile `hive-1.2`. In that case, the parent class `DateWritable` has different implementation before the commit to Hive da3ed68eda. In particular, `get()` calls `new Date(daysToMillis(daysSinceEpoch))` instead of overrided `def get(doesTimeMatter: Boolean): Date` in the child class. The `get()` method returns wrong result `1970-01-01` because it uses not updated `daysSinceEpoch`.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
By running the test suite `HiveSerDeReadWriteSuite`:
```
$ build/sbt -Phive-1.2 -Phadoop-2.7 "test:testOnly org.apache.spark.sql.hive.execution.HiveSerDeReadWriteSuite"
```
and
```
$ build/sbt -Phive-2.3 -Phadoop-2.7 "test:testOnly org.apache.spark.sql.hive.execution.HiveSerDeReadWriteSuite"
```
Closes#29523 from MaxGekk/insert-date-into-hive-table-1.2.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
As mentioned in https://github.com/apache/spark/pull/29428#issuecomment-678735163 by viirya ,
fix bug in UT, since in script transformation no-serde mode, output of decimal is same in both hive-1.2/hive-2.3
### Why are the changes needed?
FIX UT
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
EXISTED UT
Closes#29520 from AngersZhuuuu/SPARK-32608-FOLLOW.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This reverts commit e277ef1a83.
### Why are the changes needed?
Because master and branch-3.0 both have few tests failed under hive-1.2 profile. And the PR #29457 missed a change in hive-1.2 code that causes compilation error. So it will make debugging the failed tests harder. I'd like revert #29457 first to unblock it.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test
Closes#29519 from viirya/revert-SPARK-32646.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Some Code refine.
1. rename EmptyHashedRelationWithAllNullKeys to HashedRelationWithAllNullKeys.
2. simplify generated code for BHJ NAAJ.
### Why are the changes needed?
Refine code and naming to avoid confusing understanding.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing test.
Closes#29503 from leanken/leanken-SPARK-32678.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
Simple typo
### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes#29486 from Smith-Cruise/patch-1.
Authored-by: Smith Cruise <chendingchao1@126.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Changed the definitions of `CrossValidatorModel.copy()/_to_java()/_from_java()` so that exposed parameters (i.e. parameters with `get()` methods) are copied in these methods.
### Why are the changes needed?
Parameters are copied in the respective Scala interface for `CrossValidatorModel.copy()`.
It fits the semantics to persist parameters when calling `CrossValidatorModel.save()` and `CrossValidatorModel.load()` so that the user gets the same model by saving and loading it after. Not copying across `numFolds` also causes bugs like Array index out of bound and losing sub-models because this parameters will always default to 3 (as described in the JIRA ticket).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tests for `CrossValidatorModel.copy()` and `save()`/`load()` are updated so that they check parameters before and after function calls.
Closes#29445 from Louiszr/master.
Authored-by: Louiszr <zxhst14@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
The purpose of this pr is to resolve [SPARK-32526](https://issues.apache.org/jira/browse/SPARK-32526), all remaining failed cases are fixed.
The main change of this pr as follow:
- Change of `ExecutorAllocationManager.scala` for core module compilation in Scala 2.13, it's a blocking problem
- Change `Seq[_]` to `scala.collection.Seq[_]` refer to failed cases
- Added different expected plan of `Test 4: Star with several branches` of StarJoinCostBasedReorderSuite for Scala 2.13 because the candidates plans:
```
Join Inner, (d1_pk#5 = f1_fk1#0)
:- Join Inner, (f1_fk2#1 = d2_pk#8)
: :- Join Inner, (f1_fk3#2 = d3_pk#11)
```
and
```
Join Inner, (f1_fk2#1 = d2_pk#8)
:- Join Inner, (d1_pk#5 = f1_fk1#0)
: :- Join Inner, (f1_fk3#2 = d3_pk#11)
```
have same cost `Cost(200,9200)`, but `HashMap` is rewritten in scala 2.13 and The order of iterations leads to different results.
This pr fix test cases as follow:
- LiteralExpressionSuite (1 FAILED -> PASS)
- StarJoinCostBasedReorderSuite ( 1 FAILED-> PASS)
- ObjectExpressionsSuite( 2 FAILED-> PASS)
- ScalaReflectionSuite (1 FAILED-> PASS)
- RowEncoderSuite (10 FAILED-> PASS)
- ExpressionEncoderSuite (ABORTED-> PASS)
### Why are the changes needed?
We need to support a Scala 2.13 build.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
<!--
- Scala 2.12: Pass the Jenkins or GitHub Action
- Scala 2.13: Do the following:
```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests -pl sql/catalyst -Pscala-2.13 -am
mvn test -pl sql/catalyst -Pscala-2.13
```
**Before**
```
Tests: succeeded 4035, failed 17, canceled 0, ignored 6, pending 0
*** 1 SUITE ABORTED ***
*** 15 TESTS FAILED ***
```
**After**
```
Tests: succeeded 4338, failed 0, canceled 0, ignored 6, pending 0
All tests passed.
```
Closes#29434 from LuciferYang/sql-catalyst-tests.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Rephrase the description for some operations to make it clearer.
### Why are the changes needed?
Add more detail in the document.
### Does this PR introduce _any_ user-facing change?
No, document only.
### How was this patch tested?
Document only.
Closes#29269 from xuanyuanking/SPARK-31792-follow.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
## What changes were proposed in this pull request?
This fixed SPARK-32672 a data corruption. Essentially the BooleanBitSet CompressionScheme would miss nulls at the end of a CompressedBatch. The values would then default to false.
### Why are the changes needed?
It fixes data corruption
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
I manually tested it against the original issue that was producing errors for me. I also added in a unit test.
Closes#29506 from revans2/SPARK-32672.
Authored-by: Robert (Bobby) Evans <bobby@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>