Fix the R style issue that is not caught by the R style checker. Got this error:
```
R/DataFrame.R:1244:17: style: Closing curly-braces should always be on their own line, unless it's followed by an else.
}, finally = {
^
lintr checks failed.
```
Closes #29574 from lu-wang-dl/fix-r-style.
Lead-authored-by: Lu WANG <lu.wang@databricks.com>
Co-authored-by: Lu Wang <38018689+lu-wang-dl@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The following if clause:
```sql
if(p, null, false)
```
can be simplified to:
```sql
and(p, null)
```
Similarly, the clause:
```sql
if(p, null, true)
```
can be simplified to:
```sql
or(not(p), null)
```
iff the predicate `p` is non-nullable, i.e., can be evaluated to either true or false, but not null.
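For illustration, here is a minimal standalone sketch of the rewrite in terms of Catalyst expressions (`simplifyIf` is a hypothetical helper; the actual rule lives in the optimizer and may be structured differently):
```scala
import org.apache.spark.sql.catalyst.expressions.{And, Expression, If, Literal, Not, Or}
import org.apache.spark.sql.types.BooleanType

// Hypothetical helper sketching the rewrite; not Spark's actual optimizer rule.
def simplifyIf(e: Expression): Expression = e match {
  // if(p, null, false) => and(p, null), valid only when p cannot evaluate to null
  case If(p, Literal(null, BooleanType), Literal(false, BooleanType)) if !p.nullable =>
    And(p, Literal(null, BooleanType))
  // if(p, null, true) => or(not(p), null), under the same non-nullability condition
  case If(p, Literal(null, BooleanType), Literal(true, BooleanType)) if !p.nullable =>
    Or(Not(p), Literal(null, BooleanType))
  case other => other
}
```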
### Why are the changes needed?
Converting `if` to `or`/`and` clauses allows filters to be pushed down better.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes #29567 from sunchao/SPARK-32721.
Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: DB Tsai <d_tsai@apple.com>
### What changes were proposed in this pull request?
Pass the specified options in `DataFrameReader.table` to `JDBCTableCatalog.loadTable`.
### Why are the changes needed?
Currently, `DataFrameReader.table` ignores the specified options. Options specified like the following are lost:
```
val df = spark.read
.option("partitionColumn", "id")
.option("lowerBound", "0")
.option("upperBound", "3")
.option("numPartitions", "2")
.table("h2.test.people")
```
We need to make `DataFrameReader.table` take the specified options.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually test for now. Will add a test after V2 JDBC read is implemented.
Closes #29535 from huaxingao/table_options.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to deduplicate configuration set/unset in `test_sparkSQL_arrow.R`.
Setting `spark.sql.execution.arrow.sparkr.enabled` can be globally done instead of doing it in each test case.
### Why are the changes needed?
To deduplicate the code.
### Does this PR introduce _any_ user-facing change?
No, dev-only
### How was this patch tested?
Manually ran the tests.
Closes #29592 from HyukjinKwon/SPARK-32747.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Removal of branched `StringIO` import.
### Why are the changes needed?
Top level `StringIO` is no longer present in Python 3.x.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #29590 from zero323/SPARK-32138-FOLLOW-UP.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
All three aggregate physical operators (`HashAggregateExec`, `ObjectHashAggregateExec` and `SortAggregateExec`) have the same `outputPartitioning` and `requiredChildDistribution` logic. Refactor this shared logic into their superclass `BaseAggregateExec` to avoid code duplication and future bugs (similar to `HashJoin` and `ShuffledJoin`).
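For illustration, a self-contained sketch of the refactoring pattern (the names and types below are simplified stand-ins, not Spark's actual classes):
```scala
// Shared logic that previously lived in each concrete operator moves into the base trait.
trait BaseAggregate {
  def groupingKeys: Seq[String]
  // Defined once here instead of being duplicated in every subclass.
  def outputPartitioning: String =
    if (groupingKeys.isEmpty) "single-partition" else s"hash(${groupingKeys.mkString(", ")})"
}
final case class HashAggregate(groupingKeys: Seq[String]) extends BaseAggregate
final case class ObjectHashAggregate(groupingKeys: Seq[String]) extends BaseAggregate
final case class SortAggregate(groupingKeys: Seq[String]) extends BaseAggregate
```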
### Why are the changes needed?
Reduce duplicated code across classes and prevent future bugs if we only update one class but forget another. We already did similar refactoring for join (`HashJoin` and `ShuffledJoin`).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests as this is pure refactoring and no new logic added.
Closes #29583 from c21/aggregate-refactor.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
https://issues.apache.org/jira/browse/SPARK-32719
### What changes were proposed in this pull request?
Add a check to detect missing imports. This makes sure that if we use a specific class, it is explicitly imported (not via a wildcard).
### Why are the changes needed?
To make sure that the quality of the Python code is up to standard.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing unit tests and Flake8 static analysis.
Closes #29563 from Fokko/fd-add-check-missing-imports.
Authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adds extended information for a function (arguments, examples, notes and the since field) to `SparkGetFunctionOperation`.
### Why are the changes needed?
Better user experience: it will help JDBC users to have a better understanding of our built-in functions.
### Does this PR introduce _any_ user-facing change?
Yes, BI tools and JDBC users will get full information on a Spark function instead of only fragmentary usage info.
e.g. date_part
#### before
```
date_part(field, source) - Extracts a part of the date/timestamp or interval source.
```
#### after
```
Usage:
date_part(field, source) - Extracts a part of the date/timestamp or interval source.
Arguments:
* field - selects which part of the source should be extracted, and supported string values are as same as the fields of the equivalent function `EXTRACT`.
* source - a date/timestamp or interval column from where `field` should be extracted
Examples:
> SELECT date_part('YEAR', TIMESTAMP '2019-08-12 01:00:00.123456');
2019
> SELECT date_part('week', timestamp'2019-08-12 01:00:00.123456');
33
> SELECT date_part('doy', DATE'2019-08-12');
224
> SELECT date_part('SECONDS', timestamp'2019-10-01 00:00:01.000001');
1.000001
> SELECT date_part('days', interval 1 year 10 months 5 days);
5
> SELECT date_part('seconds', interval 5 hours 30 seconds 1 milliseconds 1 microseconds);
30.001001
Note:
The date_part function is equivalent to the SQL-standard function `EXTRACT(field FROM source)`
Since: 3.0.0
```
### How was this patch tested?
New tests
Closes #29577 from yaooqinn/SPARK-32733.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Instead of deleting the data, we can move it to the trash.
Based on the configuration provided by the user, it will later be deleted permanently from the trash.
### Why are the changes needed?
Instead of directly deleting the data, we can provide flexibility to move data to the trash and then delete it permanently.
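A minimal sketch of the idea, assuming Hadoop's `Trash` API is the underlying mechanism (`removeData` and the `useTrash` flag are illustrative; the PR's actual config name and call site may differ):
```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path, Trash}

// Move the data to trash when enabled, otherwise delete it outright.
def removeData(path: Path, conf: Configuration, useTrash: Boolean): Boolean = {
  val fs = FileSystem.get(path.toUri, conf)
  if (useTrash) {
    // Honors fs.trash.interval; returns false when trash is disabled.
    Trash.moveToAppropriateTrash(fs, path, conf)
  } else {
    fs.delete(path, true) // recursive hard delete
  }
}
```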
### Does this PR introduce _any_ user-facing change?
Yes. After `TRUNCATE TABLE`, the data is no longer permanently deleted right away.
It is first moved to the trash and then, after the configured time, deleted permanently.
### How was this patch tested?
New UTs added.
Closes #29552 from Udbhav30/truncate.
Authored-by: Udbhav30 <u.agrawal30@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/29342, doing two things:
* Per https://github.com/apache/spark/pull/29342#discussion_r470153323, change from Java's `HashSet` to Spark's in-house `OpenHashSet` to track matched rows for non-unique join keys. I checked the `OpenHashSet` implementation, which is built from a key index (`OpenHashSet._bitset` as `BitSet`) and a key array (`OpenHashSet._data` as `Array`). Java's `HashSet` is built on `HashMap`, which stores values in `Node` linked lists and in theory should take more memory than `OpenHashSet`. Reran the same benchmark query used in https://github.com/apache/spark/pull/29342 and verified the query has similar performance between `HashSet` and `OpenHashSet`.
* Track metrics of the extra data structure `BitSet`/`OpenHashSet` for full outer SHJ. This depends on the item above, because there seems to be no easy way to get the memory size of a Java `HashSet`.
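A sketch of the data-structure swap (note that `OpenHashSet` is `private[spark]`, so this compiles only inside Spark's own source tree; `rowIndex` stands in for a matched row index):
```scala
import org.apache.spark.util.collection.OpenHashSet

val rowIndex = 42L
val matchedRows = new OpenHashSet[Long]()  // before: new java.util.HashSet[java.lang.Long]()
matchedRows.add(rowIndex)                  // no per-entry linked-list Node, no key boxing
assert(matchedRows.contains(rowIndex))
```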
### Why are the changes needed?
To surface the memory usage of full outer SHJ more accurately.
This can help users/developers to debug/improve full outer SHJ.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added a unit test in `SQLMetricsSuite.scala`.
Closes #29566 from c21/add-metrics.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Remove the YEAR, MONTH, DAY, HOUR, MINUTE, SECOND keywords. They are not useful in the parser, as we need to support plural like YEARS, so the parser has to accept the general identifier as interval unit anyway.
### Why are the changes needed?
These keywords are reserved in ANSI. If Spark has these keywords, then they become reserved under ANSI mode. This makes Spark unable to run TPCDS queries, as they use YEAR as an alias name.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added `TPCDSQueryANSISuite`, to make sure Spark with ANSI mode can run TPCDS queries.
Closes #29560 from cloud-fan/keyword.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Fixed `CrossValidatorModel.copy()` so that it correctly calls `.copy()` on the models instead of lists of models.
### Why are the changes needed?
`copy()` was first changed in #29445. The issue was found in the CI of #29524 and fixed. This PR introduces the exact same change so that `CrossValidatorModel.copy()` and its related tests are aligned in branch `master` and branch `branch-3.0`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Updated `test_copy` to make sure `copy()` is called on models instead of lists of models.
Closes #29553 from Louiszr/fix-cv-copy.
Authored-by: Louiszr <zxhst14@gmail.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
### What changes were proposed in this pull request?
Remove the assertion in ParquetSchemaConverter that the parquet mapKey field must be PrimitiveType.
### Why are the changes needed?
There is a parquet file in the attachment of [SPARK-32639](https://issues.apache.org/jira/browse/SPARK-32639), and the MessageType recorded in the file is:
```
message parquet_schema {
  optional group value (MAP) {
    repeated group key_value {
      required group key {
        optional binary first (UTF8);
        optional binary middle (UTF8);
        optional binary last (UTF8);
      }
      optional binary value (UTF8);
    }
  }
}
```
Use `spark.read.parquet("000.snappy.parquet")` to read the file. Spark will throw an exception when converting Parquet MessageType to Spark SQL StructType:
> AssertionError(Map key type is expected to be a primitive type, but found...)
Using `spark.read.schema("value MAP<STRUCT<first:STRING, middle:STRING, last:STRING>, STRING>").parquet("000.snappy.parquet")` to read the file, Spark returns the correct result.
According to the parquet project document (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps), the mapKey in the parquet format does not need to be a primitive type.
Note: this parquet file was not written by Spark, because Spark writes additional sparkSchema string information into the parquet files it produces. When reading such a file, Spark directly uses the embedded sparkSchema information instead of converting the Parquet MessageType to a Spark SQL StructType.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a unit test case
Closes #29451 from izchen/SPARK-32639.
Authored-by: Chen Zhang <izchen@126.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add missing since version for math functions, including
SPARK-8223 shiftright/shiftleft
SPARK-8215 pi
SPARK-8212 e
SPARK-6829 sin/asin/sinh/cos/acos/cosh/tan/atan/tanh/ceil/floor/rint/cbrt/signum/isignum/Fsignum/Lsignum/degrees/radians/log/log10/log1p/exp/expm1/pow/hypot/atan2
SPARK-8209 conv
SPARK-8213 factorial
SPARK-20751 cot
SPARK-2813 sqrt
SPARK-8227 unhex
SPARK-8218 log(a,b)
SPARK-8207 bin
SPARK-8214 hex
SPARK-8206 round
SPARK-14614 bround
### Why are the changes needed?
Fix SQL docs.
### Does this PR introduce _any_ user-facing change?
Yes, docs updated.
### How was this patch tested?
Passing doc generation.
Closes #29571 from yaooqinn/minor.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to add a specific `AQEOptimizer` for the `AdaptiveSparkPlanExec` instead of implementing an anonymous `RuleExecutor`. At the same time, this PR also adds the configuration `spark.sql.adaptive.optimizer.excludedRules`, which follows the same pattern of `Optimizer`, to make the `AQEOptimizer` more flexible for users and developers.
### Why are the changes needed?
Currently, `AdaptiveSparkPlanExec` has implemented an anonymous `RuleExecutor` to apply the AQE optimize rules on the plan. However, an anonymous class is usually inconvenient to maintain and extend over the long term.
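Hypothetical usage of the new configuration in a spark-shell session (the rule name below is illustrative; substitute whichever AQE optimizer rule you want to skip):
```scala
spark.conf.set(
  "spark.sql.adaptive.optimizer.excludedRules",
  "org.apache.spark.sql.execution.adaptive.DemoteBroadcastHashJoin")
```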
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
It's a pure refactoring, so passing the existing tests should be sufficient.
Closes #29559 from Ngone51/impro-aqe-optimizer.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch corrects the method doc of `DataFrameWriterV2.replace()`, in which the explanation of the thrown exception was inverted.
### Why are the changes needed?
The method doc is incorrect.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Only doc change.
Closes #29568 from HeartSaVioR/SPARK-28612-FOLLOWUP-fix-doc-nit.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to move Arrow usage guide from Spark documentation site to PySpark documentation site (at "User Guide").
Here is the demo for reviewing quicker: https://hyukjin-spark.readthedocs.io/en/stable/user_guide/arrow_pandas.html
### Why are the changes needed?
To have a single place for PySpark users, and better documentation.
### Does this PR introduce _any_ user-facing change?
Yes, it will move https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html to our PySpark documentation.
### How was this patch tested?
```bash
cd docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch
```
and
```bash
cd python/docs
make clean html
```
Closes #29548 from HyukjinKwon/SPARK-32183.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR changes key data types check in `HashJoin` to use `sameType`.
### Why are the changes needed?
Looking at the resolution condition of `SetOperation`, it requires only that each left-side data type be `sameType` as the corresponding right-side one. Logically, the `EqualTo` expression in an equi-join likewise requires only that the left data type be `sameType` as the right data type. `HashJoin`, however, requires the left key data types to be exactly the same as the right key data types, which looks unreasonable.
This causes inconsistent results when doing `except` between two DataFrames.
If two DataFrames have no nested fields, `HashJoin` passes the key type check even when field nullability differs, because it checks each field individually and ignores nullability.
If two DataFrames have nested fields such as structs, `HashJoin` fails the key type check, because it then compares the two struct types as a whole and nullability matters.
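An illustrative version of the relaxed check (not Spark's exact code): compare key data types with `sameType`, which ignores nullability recursively, instead of requiring exact equality:
```scala
import org.apache.spark.sql.types.DataType

def keyTypesCompatible(leftKeys: Seq[DataType], rightKeys: Seq[DataType]): Boolean =
  leftKeys.length == rightKeys.length &&
    leftKeys.zip(rightKeys).forall { case (l, r) => l.sameType(r) }
```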
### Does this PR introduce _any_ user-facing change?
Yes. The `except` operation between DataFrames becomes consistent.
### How was this patch tested?
Unit test.
Closes #29555 from viirya/SPARK-32693.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR aims to support executor id placeholder in `spark.kubernetes.executor.volumes.persistentVolumeClaim.myname.options.claimName` configuration like the following.
```
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=pvc-spark-SPARK_EXECUTOR_ID \
```
### Why are the changes needed?
This is a convenient way to mount a corresponding PV to each executor.
### Does this PR introduce _any_ user-facing change?
Yes, but this is a new feature and there is no regression, because users don't use `SPARK_EXECUTOR_ID` in PVC claim names.
### How was this patch tested?
Pass the newly added test case.
Closes #29557 from dongjoon-hyun/SPARK-PVC.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
The current documentation states that the default value of spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version is 1, which is not entirely true, since this configuration isn't set anywhere in Spark but rather is inherited from the Hadoop FileOutputCommitter class.
### What changes were proposed in this pull request?
I'm submitting this change to clarify that the default value depends entirely on the Hadoop version of the runtime environment.
### Why are the changes needed?
An application would end up using algorithm version 1 in certain environments, but, without any changes, the exact same application would use version 2 in environments running Hadoop 3.0 and later. This can have pretty bad consequences in certain scenarios; for example, two tasks can partially overwrite their output if speculation is enabled. Also, please refer to the following JIRA:
https://issues.apache.org/jira/browse/MAPREDUCE-7282
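One way to sidestep the environment-dependent default is to pin the algorithm version explicitly at session construction (v1 shown here; choose the version deliberately):
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("pin-committer-version")
  // Pin explicitly so behavior doesn't silently change with the Hadoop version.
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
  .getOrCreate()
```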
### Does this PR introduce _any_ user-facing change?
Yes. The configuration page previously highlighted explicitly that the default FileOutputCommitter algorithm version was v1; this has now changed to "Dependent on environment", with additional information in the description column to elaborate.
### How was this patch tested?
Checked the changes locally in a browser.
Closes #29541 from waleedfateem/SPARK-32701.
Authored-by: waleedfateem <waleed.fateem@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
We are using an older version of Bootstrap (v. 2.1.0) for the online documentation site. Bootstrap 2.x was moved to EOL in Aug 2013 and Bootstrap 3.x was moved to EOL in July 2019 (https://github.com/twbs/release). Older versions of Bootstrap are also getting flagged in security scans for various CVEs:
https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-72889
https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-173700
https://snyk.io/vuln/npm:bootstrap:20180529
https://snyk.io/vuln/npm:bootstrap:20160627
I haven't validated each CVE, but it would probably be good practice to resolve any potential issues and get on a supported release.
The bad news is that there have been quite a few changes between Bootstrap 2 and Bootstrap 4. I've tried updating the library, refactoring/tweaking the CSS and JS to maintain a similar appearance and functionality, and testing the documentation. This is a fairly large change so I'm sure additional testing and fixes will be needed.
### How was this patch tested?
This has been manually tested, but as there is a lot of documentation it is possible issues were missed. Additional testing and feedback is welcomed. If it appears a whole section was missed let me know and I'll take a pass at addressing that section.
Closes #27369 from clarkead/bootstrap4-docs-upgrade.
Authored-by: Dale Clarke <a.dale.clarke@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR lets JDBC clients identify Spark interval columns properly.
### Why are the changes needed?
JDBC users can query interval values through the Thrift server and create views with interval columns, e.g.
```sql
CREATE global temp view view1 as select interval 1 day as i;
```
but when they want to get the details of the columns of view1, they will fail with `Unrecognized type name: INTERVAL`:
```
Caused by: java.lang.IllegalArgumentException: Unrecognized type name: INTERVAL
at org.apache.hadoop.hive.serde2.thrift.Type.getType(Type.java:170)
at org.apache.spark.sql.hive.thriftserver.ThriftserverShimUtils$.toJavaSQLType(ThriftserverShimUtils.scala:53)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$addToRowSet$1(SparkGetColumnsOperation.scala:157)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.addToRowSet(SparkGetColumnsOperation.scala:149)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$6(SparkGetColumnsOperation.scala:113)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$6$adapted(SparkGetColumnsOperation.scala:112)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$5(SparkGetColumnsOperation.scala:112)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$5$adapted(SparkGetColumnsOperation.scala:111)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.runInternal(SparkGetColumnsOperation.scala:111)
... 34 more
```
### Does this PR introduce _any_ user-facing change?
YES,
#### before
![image](https://user-images.githubusercontent.com/8326978/91162239-6cd1ec80-e6fe-11ea-8c2c-914ddb325c4e.png)
#### after
![image](https://user-images.githubusercontent.com/8326978/91162025-1a90cb80-e6fe-11ea-94c4-03a6f2ec296b.png)
### How was this patch tested?
New tests.
Closes #29539 from yaooqinn/SPARK-32696.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently, `EmptyHashedRelation` and `HashedRelationWithAllNullKeys` are plain objects, which causes a Java deserialization exception like the following:
```
20/08/26 11:13:30 WARN [task-result-getter-2] TaskSetManager: Lost task 34.0 in stage 57.0 (TID 18076, emr-worker-5.cluster-183257, executor 18): java.io.InvalidClassException: org.apache.spark.sql.execution.joins.EmptyHashedRelation$; no valid constructor
at java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:169)
at java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:874)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1572)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:430)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$unBlockifyObject$4(TorrentBroadcast.scala:328)
```
This PR includes:
* Using a case object instead, to fix the serialization issue.
* Changing EmptyHashedRelation to no longer extend NullAwareHashedRelation, since it is already used in other non-NAAJ joins.
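A self-contained illustration (not Spark's code) of why `case object` is the safer choice for a broadcast singleton: it is `Serializable` and deserializes back to the very same instance via the generated `readResolve`, so identity checks still hold:
```scala
import java.io._

trait Relation extends Serializable
case object EmptyRelation extends Relation

val buf = new ByteArrayOutputStream()
val out = new ObjectOutputStream(buf)
out.writeObject(EmptyRelation)
out.close()
val back = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray)).readObject()
assert(back eq EmptyRelation) // the round trip preserves the singleton
```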
### Why are the changes needed?
Without this fix, BHJ fails when the build side is empty, and BHJ (NAAJ) fails when the build side has null partition keys.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
* Existing UT.
* Run entire TPCDS for E2E coverage.
Closes #29547 from leanken/leanken-SPARK-32705.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a follow-up PR to #29328, applying the same constraint (the `path` option cannot coexist with a path parameter) to `DataFrameWriter.save()`, `DataStreamReader.load()` and `DataStreamWriter.start()`.
### Why are the changes needed?
The current behavior silently overwrites the `path` option if a path parameter is passed to `DataFrameWriter.save()`, `DataStreamReader.load()` or `DataStreamWriter.start()`.
For example,
```
Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
```
will write the result to `/tmp/path2`.
### Does this PR introduce _any_ user-facing change?
Yes, if `path` option coexists with path parameter to any of the above methods, it will throw `AnalysisException`:
```
scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
org.apache.spark.sql.AnalysisException: There is a 'path' option set and save() is called with a path parameter. Either remove the path option, or call save() without the parameter. To ignore this check, set 'spark.sql.legacy.pathOptionBehavior.enabled' to 'true'.;
```
The user can restore the previous behavior by setting `spark.sql.legacy.pathOptionBehavior.enabled` to `true`.
### How was this patch tested?
Added new tests.
Closes #29543 from imback82/path_option.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The decommissioning state is a bit fragmented across two places in `TaskSchedulerImpl`:
https://github.com/apache/spark/pull/29014/ stored the incoming decommission info messages in TaskSchedulerImpl.executorsPendingDecommission.
Meanwhile, https://github.com/apache/spark/pull/28619/ stored just the executor end time in the map TaskSetManager.tidToExecutorKillTimeMapping (which in turn is contained in TaskSchedulerImpl).
While the two states are not really overlapping, it's a bit of a code hygiene concern to save this state in two places.
With https://github.com/apache/spark/pull/29422, TaskSchedulerImpl is emerging as the place where all decommissioning bookkeeping is kept within the driver. So consolidate the information in _tidToExecutorKillTimeMapping_ into _executorsPendingDecommission_.
However, in order to do so, we need to walk away from keeping the raw ExecutorDecommissionInfo messages and instead keep another class, ExecutorDecommissionState. This decoupling will allow the RPC message class ExecutorDecommissionInfo to evolve independently from the bookkeeping ExecutorDecommissionState.
### Why are the changes needed?
This is just a code cleanup. These two features were added independently, and it's time to consolidate their state for good hygiene.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests.
Closes #29452 from agrawaldevesh/consolidate_decom_state.
Authored-by: Devesh Agrawal <devesh.agrawal@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
### What changes were proposed in this pull request?
Fixes a typo in the docstring of `toDF`.
### Why are the changes needed?
The third argument of `toDF` is actually `sampleRatio`.
Related discussion: https://github.com/apache/spark/pull/12746#discussion-diff-62704834
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
This patch doesn't affect any logic, so existing tests should cover it.
Closes #29551 from unirt/minor_fix_docs.
Authored-by: unirt <lunirtc@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Use the `InSet` expression to fix a data correctness issue when pruning DPP on a non-atomic type. For example:
```scala
spark.range(1000)
.select(col("id"), col("id").as("k"))
.write
.partitionBy("k")
.format("parquet")
.mode("overwrite")
.saveAsTable("df1");
spark.range(100)
.select(col("id"), col("id").as("k"))
.write
.partitionBy("k")
.format("parquet")
.mode("overwrite")
.saveAsTable("df2")
spark.sql("set spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2")
spark.sql("set spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false")
spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = struct(df2.k) AND df2.id < 2").show
```
It should return two records, but it returns an empty result.
### Why are the changes needed?
Fixes a data correctness issue.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add new unit test.
Closes #29475 from wangyum/SPARK-32659.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Instead of deleting the data, we can move it to the trash.
Based on the configuration provided by the user, it will later be deleted permanently from the trash.
### Why are the changes needed?
Instead of directly deleting the data, we can provide flexibility to move data to the trash and then delete it permanently.
### Does this PR introduce _any_ user-facing change?
Yes. After `TRUNCATE TABLE`, the data is no longer permanently deleted right away.
It is first moved to the trash and then, after the configured time, deleted permanently.
### How was this patch tested?
New UTs added.
Closes #29387 from Udbhav30/tuncateTrash.
Authored-by: Udbhav30 <u.agrawal30@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR regenerates the golden explain file based on the fix: https://github.com/apache/spark/pull/29537
### Why are the changes needed?
Eliminates personally identifiable information (e.g., local directories) in the explain plan.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Checked manually.
Closes #29546 from Ngone51/follow-up-gen-golden-file.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to explicitly cache and hash the files/directories under 'build' for SBT and Zinc at GitHub Actions. Otherwise, it can end up overwriting the `build` directory. See also https://github.com/apache/spark/pull/29286#issuecomment-679368436
Previously, other files like `build/mvn` and `build/sbt` were also cached and overwritten, so when you had some changes there, they were ignored.
### Why are the changes needed?
To make GitHub Actions build stable.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
The builds in this PR test it out.
Closes #29536 from HyukjinKwon/SPARK-32695.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Spark's CSV source can optionally ignore lines starting with a comment char. Some code paths check whether it is set before applying comment logic (i.e., not set to the default of `\0`), but many do not, including the one that passes the option to Univocity. This means that rows beginning with a null char were being treated as comments even when the feature was 'disabled'.
### Why are the changes needed?
To avoid dropping rows that start with a null char when this is not requested or intended. See JIRA for an example.
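A hypothetical spark-shell repro (path and data are made up): with no comment character configured, a row starting with a null char must survive the read:
```scala
import spark.implicits._

val path = "/tmp/csv-nullchar-demo"
Seq("\u0000x,1", "y,2").toDF("line").write.mode("overwrite").text(path)
spark.read.csv(path).show() // both rows should appear after the fix
```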
### Does this PR introduce _any_ user-facing change?
Nothing beyond the effect of the bug fix.
### How was this patch tested?
Existing tests plus new test case.
Closes #29516 from srowen/SPARK-32614.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Fixing a flaky test in `ExecutorAllocationManagerSuite`. The problem is a timing issue around when `numExecutorsToAddPerResourceProfileId` gets reset when we do a reset. The fix is to always set those values back to 1 when we call `reset()`.
### Why are the changes needed?
Fixes a flaky test in `ExecutorAllocationManagerSuite`.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Ran the unit test via this PR a bunch of times; the fix seems to be working.
Closes #29508 from tgravescs/debugExecAllocTest.
Authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Changing an info log to a debug log based on SPARK-32664
### Why are the changes needed?
It is outlined in SPARK-32664
### Does this PR introduce _any_ user-facing change?
There are changes to the debug and info logs
### How was this patch tested?
Tested by looking at the logs
Closes #29527 from dmoore62/SPARK-32664.
Authored-by: Daniel Moore <moore@knights.ucf.edu>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes the doc error and add a migration guide for datetime pattern.
### Why are the changes needed?
This is a bug of the doc that we inherited from JDK https://bugs.openjdk.java.net/browse/JDK-8169482
The `SimpleDateFormat` pattern (**F: day of week in month**) we used in 2.x and the `DateTimeFormatter` pattern (**F: week-of-month**) we use now both have meanings opposite to what they declare in the Java docs. Unfortunately, this also leads to silent data changes in Spark.
The `week-of-month` is actually the pattern `W` in DatetimeFormatter, which is banned to use in Spark 3.x.
If we want to keep pattern `F`, we need to accept the behavior change with a proper migration guide and fix the doc in Spark.
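A quick way to check what pattern `F` actually yields on a given JDK (output intentionally not asserted here, since the documented meaning and the behavior diverge, see JDK-8169482):
```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

println(LocalDate.of(2019, 8, 12).format(DateTimeFormatter.ofPattern("F")))
```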
### Does this PR introduce _any_ user-facing change?
Yes, doc changed
### How was this patch tested?
Passing the CI doc generation job.
Closes #29538 from yaooqinn/SPARK-32683.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Extract `SQLQueryTestSuite.replaceNotIncludedMsg` to `PlanTest`.
2. Reuse `replaceNotIncludedMsg` to normalize the explain plan that generated in `PlanStabilitySuite`.
### Why are the changes needed?
This is a follow-up of https://github.com/apache/spark/pull/29270.
It eliminates personally identifiable information (e.g., local directories) in the explain plan.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Updated test.
Closes #29537 from Ngone51/follow-up-plan-stablity.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…er operations"
### What changes were proposed in this pull request?
This reverts commit 510a1656e6.
### Why are the changes needed?
see https://github.com/apache/spark/pull/29204#discussion_r475716547
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
Passing CI tools.
Closes #29531 from yaooqinn/revert.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
There is a bug in the way the optimizer rule `SimplifyExtractValueOps` is currently written in the master branch, which yields incorrect results in scenarios like the following:
```
sql("SELECT CAST(NULL AS struct<a:int,b:int>) struct_col")
.select($"struct_col".withField("d", lit(4)).getField("d").as("d"))
// currently returns this:
+---+
|d |
+---+
|4 |
+---+
// when in fact it should return this:
+----+
|d |
+----+
|null|
+----+
```
The changes in this PR will fix this bug.
### Why are the changes needed?
To fix the aforementioned bug. Optimizer rules should improve the performance of the query but yield exactly the same results.
### Does this PR introduce _any_ user-facing change?
Yes, this bug will no longer occur.
That said, this isn't something to be concerned about as this bug was introduced in Spark 3.1 and Spark 3.1 has yet to be released.
### How was this patch tested?
Unit tests were added. Jenkins must pass them.
Closes #29522 from fqaiser94/SPARK-32641.
Authored-by: fqaiser94@gmail.com <fqaiser94@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to fix ORC predicate pushdown under case-insensitive analysis case. The field names in pushed down predicates don't need to match in exact letter case with physical field names in ORC files, if we enable case-insensitive analysis.
This re-submits #29457. Because #29457 had a hive-1.2 error and some tests were failing with the hive-1.2 profile at the same time, #29457 was reverted to unblock others.
### Why are the changes needed?
Currently, ORC predicate pushdown doesn't work with case-insensitive analysis. A predicate "a < 0" cannot be pushed down to an ORC file with field name "A" under case-insensitive analysis.
Parquet predicate pushdown, however, works in this case. We should make ORC predicate pushdown work with case-insensitive analysis too.
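A hypothetical spark-shell repro (path and data are made up): the file's physical field is "A", the predicate uses "a", and with `spark.sql.caseSensitive=false` the filter should still be pushed down to the ORC reader after this fix:
```scala
import spark.implicits._

spark.conf.set("spark.sql.caseSensitive", "false")
Seq(-1, 0, 1).toDF("A").write.mode("overwrite").orc("/tmp/orc-case-demo")
spark.read.orc("/tmp/orc-case-demo").where("a < 0").show() // expect one row: -1
```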
### Does this PR introduce _any_ user-facing change?
Yes, after this PR, under case-insensitive analysis, ORC predicate pushdown will work.
### How was this patch tested?
Unit tests.
Closes #29530 from viirya/fix-orc-pushdown.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR enhances `Catalog.createTable()` to allow users to set the table's description. This corresponds to the following SQL syntax:
```sql
CREATE TABLE ...
COMMENT 'this is a fancy table';
```
### Why are the changes needed?
This brings the Scala/Python catalog APIs a bit closer to what's already possible via SQL.
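Hypothetical usage of the new parameter (the exact overload shape may differ; see the `Catalog.createTable` scaladoc for the signatures this PR adds):
```scala
import org.apache.spark.sql.types.StructType

spark.catalog.createTable(
  "fancy_table",            // tableName
  "parquet",                // source
  new StructType().add("id", "long"),
  "this is a fancy table",  // description (the new parameter)
  Map("path" -> "/tmp/fancy_table"))
```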
### Does this PR introduce _any_ user-facing change?
Yes, it adds a new parameter to `Catalog.createTable()`.
### How was this patch tested?
Existing unit tests:
```sh
./python/run-tests \
--python-executables python3.7 \
--testnames 'pyspark.sql.tests.test_catalog,pyspark.sql.tests.test_context'
```
```
$ ./build/sbt
testOnly org.apache.spark.sql.internal.CatalogSuite org.apache.spark.sql.CachedTableSuite org.apache.spark.sql.hive.MetastoreDataSourcesSuite org.apache.spark.sql.hive.execution.HiveDDLSuite
```
Closes #27908 from nchammas/SPARK-31000-table-description.
Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This pull request adds 2 test suites for 2 new classes, HybridStore and HistoryServerMemoryManager, which were created in https://github.com/apache/spark/pull/28412. It also makes some minor changes in these 2 classes to expose some variables for testing. Besides the 2 suites, this pull request adds a unit test in FsHistoryProviderSuite to test parsing logs with HybridStore.
### Why are the changes needed?
Unit tests are needed for new features.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes #29509 from baohe-zhang/SPARK-31608-UT.
Authored-by: Baohe Zhang <baohe.zhang@verizonmedia.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Added Java docs for Long data types in the Row class.
### Why are the changes needed?
The Long datatype is somehow missing in Row.scala's `apply` and `get` methods.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UTs.
Closes #29534 from yeshengm/docs-fix.
Authored-by: Yesheng Ma <kimi.ysma@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Change the `def this(dataTypes: Seq[DataType])` constructor of `SpecificInternalRow` to use `dataTypes.foreach` instead of fetching each element by index, because random access performance is unsatisfactory when the input argument is not an `IndexedSeq`.
This PR follows srowen's advice.
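A self-contained sketch of the core idea: for a `Seq` that is not an `IndexedSeq` (e.g. `List`), `apply(i)` is O(i), so an indexed loop is O(n^2) overall, while `foreach` traverses in O(n) regardless of the `Seq` type:
```scala
def fillByIndex(dataTypes: Seq[String]): Array[String] = {
  val out = new Array[String](dataTypes.length)
  var i = 0
  while (i < out.length) { out(i) = dataTypes(i); i += 1 } // quadratic for a List
  out
}

def fillByForeach(dataTypes: Seq[String]): Array[String] = {
  val out = new Array[String](dataTypes.length)
  var i = 0
  dataTypes.foreach { dt => out(i) = dt; i += 1 } // linear for any Seq
  out
}
```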
### Why are the changes needed?
I found that SPARK-32550 had some negative impact on performance. The typical case is "deterministic cardinality estimation" in `HyperLogLogPlusPlusSuite` when rsd is 0.001; the code that is significantly slower is line 41 in `HyperLogLogPlusPlusSuite`: `new SpecificInternalRow(hll.aggBufferAttributes.map(_.dataType))`
08b951b1cb/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlusSuite.scala (L40-L44)
The size of "hll.aggBufferAttributes" in this case is 209716. The comparison results before and after SPARK-32550 was merged are as follows (unit: ns):
| Case | After SPARK-32550 createBuffer | After SPARK-32550 end to end | Before SPARK-32550 createBuffer | Before SPARK-32550 end to end |
| -- | -- | -- | -- | -- |
| rsd 0.001, n 1000 | 52715513243 | 53004810687 | 195807999 | 773977677 |
| rsd 0.001, n 5000 | 51881246165 | 52519358215 | 13689949 | 249974855 |
| rsd 0.001, n 10000 | 52234282788 | 52374639172 | 14199071 | 183452846 |
| rsd 0.001, n 50000 | 55503517122 | 55664035449 | 15219394 | 584477125 |
| rsd 0.001, n 100000 | 51862662845 | 52116774177 | 19662834 | 166483678 |
| rsd 0.001, n 500000 | 51619226715 | 52183189526 | 178048012 | 16681330 |
| rsd 0.001, n 1000000 | 54861366981 | 54976399142 | 226178708 | 18826340 |
| rsd 0.001, n 5000000 | 52023602143 | 52354615149 | 388173579 | 15446409 |
| rsd 0.001, n 10000000 | 53008591660 | 53601392304 | 533454460 | 16033032 |
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
`mvn test -pl sql/catalyst -DwildcardSuites=org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlusSuite -Dtest=none`
**Before**:
```
Run completed in 8 minutes, 18 seconds.
Total number of tests run: 5
Suites: completed 2, aborted 0
Tests: succeeded 5, failed 0, canceled 0, ignored 0, pending 0
```
**After**:
```
Run completed in 7 seconds, 65 milliseconds.
Total number of tests run: 5
Suites: completed 2, aborted 0
Tests: succeeded 5, failed 0, canceled 0, ignored 0, pending 0
```
Closes #29529 from LuciferYang/revert-spark-32550.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
As discussed in https://github.com/apache/spark/pull/29491#discussion_r474451282 and in SPARK-32686, this PR un-deprecates Spark's ability to infer a DataFrame schema from a list of dictionaries. The ability is Pythonic and matches functionality offered by Pandas.
### Why are the changes needed?
This change clarifies to users that this behavior is supported and is not going away in the near future.
### Does this PR introduce _any_ user-facing change?
Yes. There used to be a `UserWarning` for this, but now there isn't.
### How was this patch tested?
I tested this manually.
Before:
```python
>>> spark.createDataFrame(spark.sparkContext.parallelize([{'a': 5}]))
/Users/nchamm/Documents/GitHub/nchammas/spark/python/pyspark/sql/session.py:388: UserWarning: Using RDD of dict to inferSchema is deprecated. Use pyspark.sql.Row instead
warnings.warn("Using RDD of dict to inferSchema is deprecated. "
DataFrame[a: bigint]
>>> spark.createDataFrame([{'a': 5}])
.../python/pyspark/sql/session.py:378: UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
warnings.warn("inferring schema from dict is deprecated,"
DataFrame[a: bigint]
```
After:
```python
>>> spark.createDataFrame(spark.sparkContext.parallelize([{'a': 5}]))
DataFrame[a: bigint]
>>> spark.createDataFrame([{'a': 5}])
DataFrame[a: bigint]
```
Closes #29510 from nchammas/SPARK-32686-df-dict-infer-schema.
Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to make the behavior consistent for the `path` option when loading DataFrames with a single path (e.g., `option("path", path).format("parquet").load(path)` vs. `option("path", path).parquet(path)`) by disallowing the `path` option to coexist with `load`'s path parameters.
### Why are the changes needed?
The current behavior is inconsistent:
```scala
scala> Seq(1).toDF.write.mode("overwrite").parquet("/tmp/test")
scala> spark.read.option("path", "/tmp/test").format("parquet").load("/tmp/test").show
+-----+
|value|
+-----+
| 1|
+-----+
scala> spark.read.option("path", "/tmp/test").parquet("/tmp/test").show
+-----+
|value|
+-----+
| 1|
| 1|
+-----+
```
### Does this PR introduce _any_ user-facing change?
Yes, now if the `path` option is specified along with `load`'s path parameters, it would fail:
```scala
scala> Seq(1).toDF.write.mode("overwrite").parquet("/tmp/test")
scala> spark.read.option("path", "/tmp/test").format("parquet").load("/tmp/test").show
org.apache.spark.sql.AnalysisException: There is a path option set and load() is called with path parameters. Either remove the path option or move it into the load() parameters.;
at org.apache.spark.sql.DataFrameReader.verifyPathOptionDoesNotExist(DataFrameReader.scala:310)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
... 47 elided
scala> spark.read.option("path", "/tmp/test").parquet("/tmp/test").show
org.apache.spark.sql.AnalysisException: There is a path option set and load() is called with path parameters. Either remove the path option or move it into the load() parameters.;
at org.apache.spark.sql.DataFrameReader.verifyPathOptionDoesNotExist(DataFrameReader.scala:310)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:250)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:778)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:756)
... 47 elided
```
The user can restore the previous behavior by setting `spark.sql.legacy.pathOptionBehavior.enabled` to `true`.
### How was this patch tested?
Added a test
Closes #29328 from imback82/dfw_option.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>