### What changes were proposed in this pull request?
1. For the interval arithmetic functions, e.g. `add`/`subtract`/`negative`/`multiply`/`divide`, enable overflow checks when `ANSI` is on.
2. For `multiply`/`divide`, throw an exception when an overflow happens, regardless of whether `ANSI` is on or off.
3. `add`/`subtract`/`negative` stay the same for backward compatibility.
4. `divide` by 0 throws an ArithmeticException whether `ANSI` is on or off, the same as for numerics.
5. These behaviors fully match the numeric type operations when ANSI is on.
6. These behaviors fully match the numeric type operations when ANSI is off, except for items 2 and 4.
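A rough spark-shell sketch of how the rules above play out; the config name `spark.sql.ansi.enabled` and the exact error behavior are assumptions here, not quoted from the patch:
```scala
// Hedged illustration of the rules above; error messages are approximate.
spark.sql("SELECT interval 2147483647 month * 2").show()
// -> ArithmeticException (rule 2: multiply/divide throw regardless of ANSI)

spark.conf.set("spark.sql.ansi.enabled", false)
spark.sql("SELECT interval 2147483647 month + interval 1 month").show()
// -> wraps around silently (rule 3: add/subtract/negative keep legacy behavior)

spark.conf.set("spark.sql.ansi.enabled", true)
spark.sql("SELECT interval 2147483647 month + interval 1 month").show()
// -> ArithmeticException (rule 1: overflow check with ANSI on)
```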
### Why are the changes needed?
1. bug fix
2. `ANSI` support
### Does this PR introduce any user-facing change?
When `ANSI` is on, interval `add`/`subtract`/`negative`/`multiply`/`divide` will throw an exception if any field overflows
### How was this patch tested?
add unit tests
Closes #26995 from yaooqinn/SPARK-30341.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Revert https://github.com/apache/spark/pull/26338, as the syntax is actually the [Hive-style ALTER COLUMN](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ChangeColumnName/Type/Position/Comment).
This PR brings it back and makes it support multi-catalog:
1. Renaming is not allowed, as `AlterTableAlterColumnStatement` can't do renaming.
2. The column name can be multi-part.
### Why are the changes needed?
To avoid breaking Hive compatibility.
### Does this PR introduce any user-facing change?
No, as the removal was merged in 3.0.
### How was this patch tested?
new parser tests
Closes #27076 from cloud-fan/alter.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Update the CREATE VIEW command to store the current catalog and namespace instead of the current database in view metadata. Also update the analyzer to leverage the catalog and namespace in the view metadata to resolve relations inside views.
Note that this PR still keeps the way we resolve views, by recursively calling the Analyzer. This is necessary because view text may contain CTEs, window specs, etc., which need rules outside of the main resolution batch (e.g. `CTESubstitution`).
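A hypothetical spark-shell session illustrating the new behavior (the database names are made up for this sketch):
```scala
// Minimal sketch, assuming databases ns1/ns2 exist only for this example.
spark.sql("CREATE DATABASE ns1")
spark.sql("CREATE DATABASE ns2")
spark.sql("CREATE TABLE ns1.t(i INT) USING parquet")
spark.sql("USE ns1")
spark.sql("CREATE VIEW v AS SELECT * FROM t")   // view metadata records namespace ns1
spark.sql("USE ns2")
spark.sql("SELECT * FROM ns1.v").show()         // t still resolves to ns1.t, not ns2.t
```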
### Why are the changes needed?
To resolve relations inside view correctly.
### Does this PR introduce any user-facing change?
Yes, this fixes a bug: tables referred to by a view can now be resolved correctly even if the current catalog/namespace has been updated.
### How was this patch tested?
a new test
Closes #26923 from cloud-fan/view.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to fix the NoSuchElementException thrown when AQE is enabled with an InSubquery expression.
### Why are the changes needed?
To fix the exception described above.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added a new UT.
Closes #27068 from JkSelf/fixSubqueryIssue.
Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch is based on #23921 but revised to be simpler, and adds a UT to test the behavior.
(This patch contains the commit from #23921 to retain credit.)
Spark loads new JARs for `ADD JAR` and `CREATE FUNCTION ... USING JAR` into the jar classloader in shared state, and changes the current thread's context classloader to the jar classloader, as many parts of the remaining code rely on the current thread's context classloader.
This works if further queries run in the same thread and the thread's context classloader is unchanged, but once the current thread's context classloader is switched back for any reason, Spark fails to create an instance of the class for the function.
This bug mostly affects spark-shell, as spark-shell rolls back the current thread's context classloader at every prompt. But it may also affect the case of a job server, where queries may be running in multiple threads.
This patch fixes the issue by switching the context classloader to the classloader that loaded the class. Fortunately, the FunctionBuilder created by `makeFunctionBuilder` captures the Class as part of its closure, so the Class itself is available regardless of the current thread's context classloader.
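A minimal sketch of the idea; the helper name is illustrative, not the actual method in the patch:
```scala
// Run `body` with the given classloader as the thread's context classloader,
// restoring the previous one afterwards.
def withContextClassLoader[T](loader: ClassLoader)(body: => T): T = {
  val previous = Thread.currentThread().getContextClassLoader
  Thread.currentThread().setContextClassLoader(loader)
  try body finally Thread.currentThread().setContextClassLoader(previous)
}

// Usage: `clazz` is the UDF class captured in the FunctionBuilder's closure.
// withContextClassLoader(clazz.getClassLoader) { clazz.getConstructor().newInstance() }
```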
### Why are the changes needed?
Without this patch, end users cannot execute Hive UDF using JAR twice in spark-shell.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
New UT.
Closes #27025 from HeartSaVioR/SPARK-26560-revised.
Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Co-authored-by: nivo091 <nivedeeta.singh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Change config name from `spark.sql.legacy.typeCoercion.datetimeToString` to `spark.sql.legacy.typeCoercion.datetimeToString.enabled`.
### Why are the changes needed?
To follow the other boolean conf naming convention.
### Does this PR introduce any user-facing change?
No, it's newly added in Spark 3.0.
### How was this patch tested?
Pass Jenkins
Closes #27065 from Ngone51/SPARK-27638-FOLLOWUP.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Change config name from `spark.sql.optimizer.reassignLambdaVariableID` to `spark.sql.optimizer.reassignLambdaVariableID.enabled`.
### Why are the changes needed?
To follow the other boolean conf naming convention.
### Does this PR introduce any user-facing change?
No, it's newly added in Spark 3.0.
### How was this patch tested?
Pass Jenkins.
Closes #27063 from Ngone51/SPARK-27871-FOLLOWUP.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Calls to `requireNonStaticConf()` are removed from the typed `set()` methods in RuntimeConfig, because those methods delegate to `def set(key: String, value: String): Unit`, where `requireNonStaticConf()` is already called.
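A simplified sketch of the pattern; the signatures are abbreviated and the class below is a stand-in, not the real RuntimeConfig:
```scala
// Typed overloads delegate to the String overload, so the static-conf check
// only needs to live in one place.
class RuntimeConfigSketch(setConfString: (String, String) => Unit,
                          requireNonStaticConf: String => Unit) {
  def set(key: String, value: Boolean): Unit = set(key, value.toString)
  def set(key: String, value: Long): Unit = set(key, value.toString)
  def set(key: String, value: String): Unit = {
    requireNonStaticConf(key)  // the single remaining check
    setConfString(key, value)
  }
}
```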
### Why are the changes needed?
To avoid unnecessary calls of `requireNonStaticConf()`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing tests from `SQLConfSuite`
Closes #27062 from MaxGekk/call-requireNonStaticConf-once.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Documentation added for refresh resources command in spark-sql.
### Why are the changes needed?
Previously, only the refresh table command was documented.
### Does this PR introduce any user-facing change?
Yes. Now users can access documentation for the refresh resources command.
### How was this patch tested?
Manually.
Closes #27023 from iRakson/SPARK-30363.
Authored-by: root1 <raksonrakesh@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
There are too many classes placed in the single package `org.apache.spark.sql.kafka010`; these classes can be grouped by purpose.
As a part of the change in SPARK-21869 (#26845), we moved producer-related classes out to `org.apache.spark.sql.kafka010.producer` and only exposed the necessary classes/methods outside of the package. This patch applies the same to consumer-related classes.
### Why are the changes needed?
Described above.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes #26991 from HeartSaVioR/SPARK-30336.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1. Add new foreach-like methods: `foreach`/`foreachNonZero`.
2. Add iterators: `iterator`/`activeIterator`/`nonZeroIterator`.
### Why are the changes needed?
See the [ticket](https://issues.apache.org/jira/browse/SPARK-30329) for details.
`foreach`/`foreachNonZero`: for both convenience and performance (`SparseVector.foreach` should be faster than the current traversal method).
`iterator`/`activeIterator`/`nonZeroIterator`: add the three iterators so that we can later add or change implementations based on them on both the ml and mllib sides, to avoid vector conversions. A usage sketch follows below.
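A hedged usage sketch of the proposed methods (signatures as described here; exact behavior may differ in the final patch):
```scala
import org.apache.spark.ml.linalg.Vectors

val v = Vectors.sparse(4, Array(1, 3), Array(2.0, 0.0))
v.foreachNonZero((i, value) => println(s"$i -> $value")) // visits index 1 only
v.nonZeroIterator.toSeq    // Seq((1, 2.0)): skips the explicit zero at index 3
v.activeIterator.toSeq     // Seq((1, 2.0), (3, 0.0)): all stored entries
v.iterator.toSeq           // all four (index, value) pairs, including implicit zeros
```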
### Does this PR introduce any user-facing change?
Yes, new methods are added
### How was this patch tested?
added testsuites
Closes #26982 from zhengruifeng/vector_iter.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add `instr.logSumOfWeights` to the algorithms that have `weightCol` support.
### Why are the changes needed?
Many algorithms support `weightCol` now. The sum of weights is useful information to add to the log.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
manually tested
Closes #26972 from huaxingao/spark-30321.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Refactor `RandomForest.findSplits` by applying `aggregateByKey` instead of `groupByKey`
### Why are the changes needed?
The current implementation of `RandomForest.findSplits` uses `groupByKey` to collect all non-zero values for each feature, which is quite dangerous: all values of a feature must fit into a single buffer.
After looking into the split-finding logic, I found that collecting all non-zero values is not necessary; we only need the weight sums of the distinct values. A sketch of the refactor follows.
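A minimal spark-shell sketch of the refactor, assuming unit sample weights and made-up data:
```scala
// (featureIndex, value) pairs; each sample carries weight 1.0 in this sketch.
val pairs = sc.parallelize(Seq((0, 1.5), (0, 1.5), (0, 2.0), (1, 3.0)))

// Before: groupByKey materializes every value of a feature in one buffer.
// After: aggregateByKey folds values into value -> weight-sum maps per feature,
// which is all the split-finding logic actually needs.
val valueWeights = pairs.aggregateByKey(Map.empty[Double, Double])(
  (m, v) => m.updated(v, m.getOrElse(v, 0.0) + 1.0),
  (m1, m2) => m2.foldLeft(m1) { case (m, (v, w)) => m.updated(v, m.getOrElse(v, 0.0) + w) }
)
valueWeights.collect()  // Array((0, Map(1.5 -> 2.0, 2.0 -> 1.0)), (1, Map(3.0 -> 1.0)))
```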
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27040 from zhengruifeng/rf_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
add getter/setter in Python FM
### Why are the changes needed?
to be consistent with other algorithms
### Does this PR introduce any user-facing change?
Yes.
add getter/setter in Python FMRegressor/FMRegressionModel/FMClassifier/FMClassificationModel
### How was this patch tested?
doctest
Closes #27044 from huaxingao/spark-30378.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1. Expose `predictRaw` and `predictProbability`.
2. Add tests.
### Why are the changes needed?
Single-instance prediction is useful outside of Spark, especially for online prediction.
Currently `predict` is exposed, but it is not enough. A usage sketch follows.
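A hedged sketch of single-instance prediction with the newly exposed methods, assuming `model` is a fitted classification model such as a LogisticRegressionModel:
```scala
import org.apache.spark.ml.linalg.Vectors

val features = Vectors.dense(0.1, 2.3, -0.5)
// model.predict(features)            // already public: the predicted label
// model.predictRaw(features)         // newly exposed: raw scores per class
// model.predictProbability(features) // newly exposed: class probabilities
```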
### Does this PR introduce any user-facing change?
Yes, new methods are exposed
### How was this patch tested?
added testsuites
Closes #27015 from zhengruifeng/expose_raw_prob.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
This patch proposes to only convert the first few elements of collection accumulators in `LiveEntityHelpers.newAccumulatorInfos`.
### Why are the changes needed?
One Spark job on our cluster uses a collection accumulator to collect something and encountered an exception like:
```
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at java.lang.StringBuilder.append(StringBuilder.java:131)
at java.util.AbstractCollection.toString(AbstractCollection.java:462)
at java.util.Collections$UnmodifiableCollection.toString(Collections.java:1035)
at org.apache.spark.status.LiveEntityHelpers$$anonfun$newAccumulatorInfos$2$$anonfun$apply$3.apply(LiveEntity.scala:596)
at org.apache.spark.status.LiveEntityHelpers$$anonfun$newAccumulatorInfos$2$$anonfun$apply$3.apply(LiveEntity.scala:596)
at scala.Option.map(Option.scala:146)
at org.apache.spark.status.LiveEntityHelpers$$anonfun$newAccumulatorInfos$2.apply(LiveEntity.scala:596)
at org.apache.spark.status.LiveEntityHelpers$$anonfun$newAccumulatorInfos$2.apply(LiveEntity.scala:591)
```
`LiveEntityHelpers.newAccumulatorInfos` converts `AccumulableInfo`s to `v1.AccumulableInfo` by calling `toString` on the accumulator's value. For a collection accumulator, the string representation can take much more memory (for example, a collection accumulator of long values) and cause OOM (in this job, the driver memory is 6g).
The results of `newAccumulatorInfos` are used in the API and UI. For such usage, it also does not make sense to render the very long string of a complete collection accumulator. A minimal sketch of the mitigation follows.
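A minimal sketch of the mitigation; the element cap and the formatting are assumptions, not the exact implementation:
```scala
import scala.collection.JavaConverters._

// Stringify only a prefix of the collection instead of the whole value.
def truncatedToString(value: java.util.List[_], maxElements: Int = 5): String = {
  val prefix = value.asScala.take(maxElements + 1).toList
  if (prefix.length > maxElements) prefix.take(maxElements).mkString("[", ", ", ", ... ]")
  else prefix.mkString("[", ", ", "]")
}
```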
### Does this PR introduce any user-facing change?
Yes. Collection accumulators now only show the first few elements in the API and UI.
### How was this patch tested?
Unit test.
Manual test. Launched a Spark shell, ran:
```scala
val accum = sc.collectionAccumulator[Long]("Collection Accumulator Example")
sc.range(0, 10000, 1, 1).foreach(x => accum.add(x))
accum.value
```
![Screen Shot 2019-12-30 at 2 03 43 PM](https://user-images.githubusercontent.com/68855/71602488-6eb2c400-2b0d-11ea-8725-dba36478198f.png)
Closes #27038 from viirya/partial-collect-accu.
Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Currently, if function lookup fails, Spark gives it a second chance by casting decimal-typed arguments to double type. But in cases where no decimal type exists, it's meaningless to look up again, and it causes extra cost like unnecessary metastore access. We should throw the exception directly in these cases. A sketch of the guard follows.
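A small sketch of the guard; the helper name is illustrative, not the actual code in the patch:
```scala
import org.apache.spark.sql.types._

// Only retry the lookup with decimals cast to double when some argument
// actually has a decimal type.
def shouldRetryWithDouble(argTypes: Seq[DataType]): Boolean =
  argTypes.exists(_.isInstanceOf[DecimalType])

shouldRetryWithDouble(Seq(IntegerType, StringType))        // false -> fail fast
shouldRetryWithDouble(Seq(DecimalType(10, 2), StringType)) // true  -> second attempt
```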
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Covered by existing tests.
Closes #26994 from wzhfy/avoid_udf_fail_twice.
Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch fixes a small bug in the streaming query example: the type of observed metrics is a Java Map instead of a Scala Map, so it must be converted before calling foreach.
### Why are the changes needed?
Described above.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Ran below query via `spark-shell`:
**Streaming**
```scala
import scala.collection.JavaConverters._
import scala.util.Random
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._
spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    event.progress.observedMetrics.asScala.get("my_event").foreach { row =>
      // Trigger if the number of errors exceeds 5 percent
      val num_rows = row.getAs[Long]("rc")
      val num_error_rows = row.getAs[Long]("erc")
      val ratio = num_error_rows.toDouble / num_rows
      if (ratio > 0.05) {
        // Trigger alert
        println(s"alert! error ratio: $ratio")
      }
    }
  }
  def onQueryStarted(event: QueryStartedEvent): Unit = {}
  def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
})
val rates = spark
  .readStream
  .format("rate")
  .option("rowsPerSecond", 10)
  .load
val rand = new Random()
val df = rates.map { row => (row.getLong(1), if (row.getLong(1) % 2 == 0) "error" else null) }.toDF
val ds = df.selectExpr("_1 AS id", "_2 AS error")
// Observe row count (rc) and error row count (erc) in the batch Dataset
val observed_ds = ds.observe("my_event", count(lit(1)).as("rc"), count($"error").as("erc"))
observed_ds.writeStream.format("console").start()
```
Closes #27046 from HeartSaVioR/SPARK-29348-FOLLOWUP.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
The UI and event logs redact sensitive information. But the monitoring REST API, https://spark.apache.org/docs/latest/monitoring.html#rest-api , specifically /applications/[app-id]/environment, does not, which is a security issue.
### What changes were proposed in this pull request?
The REST API response is redacted before it is sent.
### Why are the changes needed?
If no redaction is done for REST API calls, sensitive information can leak.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Tested manually
Closes #27018 from ajithme/redactrest.
Authored-by: Ajith <ajith2489@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Remove meaningless tooltips from all pages.
### Why are the changes needed?
If we can't come up with meaningful tooltips, we should not add tooltips at all.
### Does this PR introduce any user-facing change?
Yes
![67598045-351ab100-f78a-11e9-88cf-573e09d7c50e](https://user-images.githubusercontent.com/8948111/71558018-81c58580-2a74-11ea-9f38-dcaebd3f0bbf.png)
Tooltips like the ones highlighted in the image above were removed.
### How was this patch tested?
Manual test.
Closes #27043 from 07ARB/SPARK-30383.
Authored-by: 07ARB <ankitrajboudh@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/26780
In https://github.com/apache/spark/pull/26780, a new Avro data source option `actualSchema` was introduced for setting the original Avro schema in the function `from_avro`, while the expected schema is supposed to be set in the parameter `jsonFormatSchema` of `from_avro`.
However, there is another Avro data source option `avroSchema`. It is used for setting the expected schema in reading and writing.
This PR uses the `avroSchema` option for reading Avro data with an evolved schema and removes the new `actualSchema` option.
### Why are the changes needed?
Unify and simplify the Avro data source options.
### Does this PR introduce any user-facing change?
Yes.
To deserialize Avro data with an evolved schema, before changes:
```
from_avro('col, expectedSchema, ("actualSchema" -> actualSchema))
```
After changes:
```
from_avro('col, actualSchema, ("avroSchema" -> expectedSchema))
```
The second parameter is always the actual Avro schema after changes.
### How was this patch tested?
Update the existing tests in https://github.com/apache/spark/pull/26780
Closes #27045 from gengliangwang/renameAvroOption.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In `SqlBase.g4`, `database` is defined as
```
database : DATABASE | SCHEMA;
```
and it is being used as `(database | NAMESPACE)` in many places.
This PR proposes to define the following and use it as discussed in https://github.com/apache/spark/pull/26847/files#r359754778:
```
namespace : NAMESPACE | DATABASE | SCHEMA;
```
### Why are the changes needed?
To simplify the grammar.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
There is no change in the actual grammar, so the existing tests should be sufficient.
Closes #27027 from imback82/sqlbase_namespace.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch fixes the flaky test failure on MasterSuite, "SPARK-27510: Master should avoid dead loop while launching executor failed in Worker".
The culprit of the test failure was, ironically, that the test ran too fast; the interval of `eventually` is 15 ms by default, but it took only 8 ms from submitting the driver to removing the app from the master.
```
19/12/23 15:45:06.533 dispatcher-event-loop-6 INFO Master: Registering worker localhost:9999 with 10 cores, 3.6 GiB RAM
19/12/23 15:45:06.534 dispatcher-event-loop-6 INFO Master: Driver submitted org.apache.spark.FakeClass
19/12/23 15:45:06.535 dispatcher-event-loop-6 INFO Master: Launching driver driver-20191223154506-0000 on worker 10001
19/12/23 15:45:06.536 dispatcher-event-loop-9 INFO Master: Registering app name
19/12/23 15:45:06.537 dispatcher-event-loop-9 INFO Master: Registered app name with ID app-20191223154506-0000
19/12/23 15:45:06.537 dispatcher-event-loop-9 INFO Master: Launching executor app-20191223154506-0000/0 on worker 10001
19/12/23 15:45:06.537 dispatcher-event-loop-10 INFO Master: Removing executor app-20191223154506-0000/0 because it is FAILED
...
19/12/23 15:45:06.542 dispatcher-event-loop-19 ERROR Master: Application name with ID app-20191223154506-0000 failed 10 times; removing it
```
Given the interval is already tiny, instead of lowering it further, the patch also accounts for the above case when verifying the status.
### Why are the changes needed?
We observed intermittent test failure in Jenkins build which should be fixed.
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115664/testReport/
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Modified UT.
Closes #27004 from HeartSaVioR/SPARK-30348.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Remove a lost executor from `CoarseGrainedSchedulerBackend` when `HeartbeatReceiver` recognizes it as lost.
### Why are the changes needed?
Currently, an application may hang if we don't remove a lost executor from `CoarseGrainedSchedulerBackend`. This may happen as follows:
1) In `expireDeadHosts()`, `HeartbeatReceiver` calls `scheduler.executorLost()`;
2) Before `HeartbeatReceiver` calls `sc.killAndReplaceExecutor()` (which would mark the lost executor as "pendingToRemove") in a separate thread, `CoarseGrainedSchedulerBackend` may begin to launch tasks on that executor without realizing it is already lost.
3) If that lost executor doesn't shut down gracefully, `CoarseGrainedSchedulerBackend` may never receive a disconnect event. As a result, tasks launched on that lost executor become orphans, while at the same time the driver thinks those tasks are still running and waits forever.
Removing the lost executor from `CoarseGrainedSchedulerBackend` lets `TaskSetManager` mark those tasks as failed, which avoids the app hang. Furthermore, it cleans up records in `executorDataMap`, which may otherwise never be removed in such a case.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Updated existed tests.
Closes #24350.
Closes #26980 from Ngone51/SPARK-27348.
Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Use `MetadataUtils.getNumFeatures` to extract the number of features.
### Why are the changes needed?
This may avoid a `first`/`head` job if the metadata already contains the attribute group.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#27037 from zhengruifeng/unify_numFeatures.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
This reverts commit f926809a1f.
Closes #27032 from gatorsmile/revertSPARK-29390.
Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
Updated the document for LIST FILE/JAR command.
### Why are the changes needed?
LIST FILE/JAR can take multiple file names as arguments and returns the files that were added as resources.
### Does this PR introduce any user-facing change?
Yes. Documentation updated for LIST FILE/JAR command
### How was this patch tested?
Manually
Closes #26996 from iRakson/SPARK-30342.
Authored-by: root1 <raksonrakesh@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Compute the count and quantiles in one pass.
### Why are the changes needed?
To avoid an extra pass over the data. A sketch of the idea follows.
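A one-pass sketch in spark-shell terms; `Summary` below is a hypothetical stand-in for the quantile sketch actually used:
```scala
// Fold a per-partition summary and a counter together in a single
// treeAggregate instead of running a separate count() job.
case class Summary(values: Vector[Double]) {
  def insert(v: Double): Summary = Summary(values :+ v)
  def merge(o: Summary): Summary = Summary(values ++ o.values)
}
val data = sc.parallelize(1 to 1000).map(_.toDouble)
val (summary, count) = data.treeAggregate((Summary(Vector.empty), 0L))(
  { case ((s, c), v) => (s.insert(v), c + 1) },
  { case ((s1, c1), (s2, c2)) => (s1.merge(s2), c1 + c2) }
)
```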
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #26990 from zhengruifeng/quantile_count_lsh.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1. Remove unused imports and variables.
2. Remove `countAccum: LongAccumulator`, since `costAccum: DoubleAccumulator` also records the count.
3. Mark `clusterCentersWithNorm` in KMeansModel transient and lazy, since it is only used in transformation and can be generated directly from the centers.
### Why are the changes needed?
1. Remove unused code.
2. Avoid repeated computation.
3. Reduce the broadcast size.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27014 from zhengruifeng/kmeans_clean_up.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
Currently the Spark history server displays the classpath entries in the Environment tab with the values redacted, because the event log file has the entry values redacted while writing. But when the same page is viewed in a running application's UI, the entries are not redacted. Redacting classpath entries is not needed and can be avoided.
### What changes were proposed in this pull request?
Event logs will no longer redact the classpath entries.
### Why are the changes needed?
Redaction of classpath entries is not needed.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Tested manually to verify on UI
Closes #27016 from ajithme/redactui.
Authored-by: Ajith <ajith2489@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
precompute the `DecisionTreeMetadata` and reuse it for all trees
### Why are the changes needed?
In the existing implementation, each `DecisionTreeRegressor` needs a pass over the whole dataset to calculate the same `DecisionTreeMetadata` repeatedly.
With this PR, at the default depth=5, it is about 8% faster than the existing implementation.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27011 from zhengruifeng/gbt_reuse_instr_meta.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
## What changes were proposed in this pull request?
Creating a temporary or permanent function should throw an AnalysisException if the resource is not found. We need to keep the behavior consistent across permanent and temporary functions.
## How was this patch tested?
Added UT and also tested manually
**Before Fix**
If the UDF resource is not present, then creating a temporary function throws an AnalysisException, whereas a permanent function does not throw. A permanent function throws an AnalysisException only after a SELECT operation is performed.
**After Fix**
For both temporary and permanent functions, check for the resource; if the UDF resource is not found, throw an AnalysisException.
![rt](https://user-images.githubusercontent.com/35216143/62781519-d1131580-bad5-11e9-9d58-69e65be86c03.png)
Closes #25399 from sandeep-katta/funcIssue.
Authored-by: sandeep katta <sandeep.katta2007@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This patch fixes the intermittent test failure on ThriftServerQueryTestSuite/ThriftServerWithSparkContextSuite, which got a ConnectException when querying the Thrift server.
(https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115646/testReport/)
The relevant unit test log messages are following:
```
19/12/23 13:33:01.875 pool-1-thread-1 INFO AbstractService: Service:ThriftBinaryCLIService is started.
19/12/23 13:33:01.875 pool-1-thread-1 INFO AbstractService: Service:HiveServer2 is started.
...
19/12/23 13:33:01.888 pool-1-thread-1 INFO ThriftServerWithSparkContextSuite: HiveThriftServer2 started successfully
...
19/12/23 13:33:01.909 pool-1-thread-1-ScalaTest-running-ThriftServerWithSparkContextSuite INFO ThriftServerWithSparkContextSuite:
===== TEST OUTPUT FOR o.a.s.sql.hive.thriftserver.ThriftServerWithSparkContextSuite: 'SPARK-29911: Uncache cached tables when session closed' =====
...
19/12/23 13:33:02.017 pool-1-thread-1-ScalaTest-running-ThriftServerWithSparkContextSuite INFO Utils: Supplied authorities: localhost:15441
19/12/23 13:33:02.018 pool-1-thread-1-ScalaTest-running-ThriftServerWithSparkContextSuite INFO Utils: Resolved authority: localhost:15441
19/12/23 13:33:02.078 HiveServer2-Background-Pool: Thread-213 INFO BaseSessionStateBuilder$$anon$2: Optimization rule 'org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation' is excluded from the optimizer.
19/12/23 13:33:02.078 HiveServer2-Background-Pool: Thread-213 INFO BaseSessionStateBuilder$$anon$2: Optimization rule 'org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation' is excluded from the optimizer.
19/12/23 13:33:02.121 pool-1-thread-1-ScalaTest-running-ThriftServerWithSparkContextSuite WARN HiveConnection: Failed to connect to localhost:15441
19/12/23 13:33:02.124 pool-1-thread-1-ScalaTest-running-ThriftServerWithSparkContextSuite INFO ThriftServerWithSparkContextSuite:
===== FINISHED o.a.s.sql.hive.thriftserver.ThriftServerWithSparkContextSuite: 'SPARK-29911: Uncache cached tables when session closed' =====
19/12/23 13:33:02.143 Thread-35 INFO ThriftCLIService: Starting ThriftBinaryCLIService on port 15441 with 5...500 worker threads
19/12/23 13:33:02.327 pool-1-thread-1 INFO HiveServer2: Shutting down HiveServer2
19/12/23 13:33:02.328 pool-1-thread-1 INFO ThriftCLIService: Thrift server has stopped
```
(Here the error is logged as `WARN HiveConnection: Failed to connect to localhost:15441` - the actual stack trace can be seen on Jenkins test summary.)
The reason for the test failure: Thrift(Binary|Http)CLIService prepares and launches the service asynchronously (in a new thread), but the suites do not wait for its completion and just start running tests, which ends up in a race condition.
That can be easily reproduced by adding an artificial sleep in `ThriftBinaryCLIService.run()` here:
ba3f6330dd/sql/hive-thriftserver/v2.3/src/main/java/org/apache/hive/service/cli/thrift/ThriftBinaryCLIService.java (L49)
(Note that `sleep` should be added before initializing server socket. E.g. Line 57)
This patch changes the test initialization logic to repeatedly try a simple query until the service is available. The patch also refactors the code so the change applies easily to both ThriftServerQueryTestSuite and ThriftServerWithSparkContextSuite.
### Why are the changes needed?
This patch fixes the intermittent failure observed here:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115646/testReport/
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Artificially made the test fail consistently (by the approach described above), and confirmed the patch fixed the test.
Closes #27001 from HeartSaVioR/SPARK-30345.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Unify `DriverEndpoint.executorIsAlive()` and `CoarseGrainedSchedulerBackend.isExecutorActive()`.
### Why are the changes needed?
`DriverEndpoint` has the method `executorIsAlive()` to check whether an executor is alive/active, while `CoarseGrainedSchedulerBackend` has the method `isExecutorActive()` to do the same work. But `isExecutorActive()` seems to forget to consider `executorsPendingLossReason`. Unifying these two methods makes the behavior consistent between `DriverEndpoint` and `CoarseGrainedSchedulerBackend` and makes the code easier to maintain. A sketch of the unified check follows.
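A sketch of the unified check, a fragment using the collections named in this description (`executorDataMap`, the pending-to-remove set, and `executorsPendingLossReason`), not necessarily the exact final code:
```scala
override def isExecutorActive(id: String): Boolean = synchronized {
  executorDataMap.contains(id) &&
    !executorsPendingToRemove.contains(id) &&
    !executorsPendingLossReason.contains(id)
}
```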
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass Jenkins.
Closes #27012 from Ngone51/unify-is-executor-alive.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
supports instance weighting in GMM
### Why are the changes needed?
ML should support instance weighting
### Does this PR introduce any user-facing change?
yes, a new param `weightCol` is exposed
### How was this patch tested?
added test suites
Closes #26735 from zhengruifeng/gmm_support_weight.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add a corresponding class `MultivariateGaussian`, containing a vector and a matrix, on the Python side, so Gaussians can be used from Python.
### Does this PR introduce any user-facing change?
add Python class ```MultivariateGaussian```
### How was this patch tested?
doctest
Closes #27020 from huaxingao/spark-30247.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Update the Spark SQL document menu and join strategy hints.
### Why are the changes needed?
- Several recent changes to the Spark SQL documentation didn't update menu-sql.yaml correspondingly.
- Update the demo code for join strategy hints.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Document change only.
Closes #26917 from xuanyuanking/SPARK-30278.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### Why are the changes needed?
`EnsureRequirements` adds a `ShuffleExchangeExec` (RangePartitioning) after a `Sort` if a `RoundRobinPartitioning` exchange is below it. This causes two shuffles, and the number of partitions in the final stage is not the number specified by `RoundRobinPartitioning`.
**Example SQL**
```
SELECT /*+ REPARTITION(5) */ * FROM test ORDER BY a
```
**BEFORE**
```
== Physical Plan ==
*(1) Sort [a#0 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(a#0 ASC NULLS FIRST, 200), true, [id=#11]
+- Exchange RoundRobinPartitioning(5), false, [id=#9]
+- Scan hive default.test [a#0, b#1], HiveTableRelation `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#0, b#1]
```
**AFTER**
```
== Physical Plan ==
*(1) Sort [a#0 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(a#0 ASC NULLS FIRST, 5), true, [id=#11]
+- Scan hive default.test [a#0, b#1], HiveTableRelation `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#0, b#1]
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Run suite Tests and add new test for this.
Closes #26946 from stczwd/RoundRobinPartitioning.
Lead-authored-by: lijunqing <lijunqing@baidu.com>
Co-authored-by: stczwd <qcsd2011@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Implement Factorization Machines as an ML pipeline component.
1. Supported loss functions: logloss, MSE.
2. Optimizers: GD, AdamW.
### Why are the changes needed?
Factorization Machines are widely used in advertising and recommendation systems to estimate CTR (click-through rate).
Advertising and recommendation systems usually have a lot of data, so we need Spark to estimate the CTR, and Factorization Machines are a common ML model for that; the model equation is shown after the references below.
References:
1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010.
https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
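For reference, the second-order FM model from Rendle's paper combines a linear term with pairwise interactions factorized through latent vectors:
```
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j
```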
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
run unit tests
Closes #27000 from mob-ai/ml/fm.
Authored-by: zhanjf <zhanjf@mob.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Quiet the logging of `ExecutorAllocationManager`, whose executor-removal request messages are too verbose. Otherwise, this class generates too many messages like
`INFO spark.ExecutorAllocationManager: Request to remove executorIds: 890`
when DRA is enabled.
### Why are the changes needed?
Log level improvement.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Closes #26925 from cfmcgrady/quiet-request-executor-remove-message.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR fixes `ScalaReflection.arrayClassFor()` to use an empty array instead of a one-element array for getting its class object by reflection.
### Why are the changes needed?
Because it may reduce unnecessary memory allocation.
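A quick illustration of why length 0 suffices: an array's Class object does not depend on its length, so allocating a one-element array just to call getClass was wasted work.
```scala
val before = java.lang.reflect.Array.newInstance(classOf[Int], 1).getClass
val after  = java.lang.reflect.Array.newInstance(classOf[Int], 0).getClass
assert(before == after)  // both are Class[int[]]
```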
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Ran the existing unit tests for sql/catalyst and confirmed that all of them succeeded.
Closes #27005 from sekikn/SPARK-30350.
Authored-by: Kengo Seki <sekikn@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
The filter predicate for aggregate expressions is part of ANSI SQL:
```
<aggregate function> ::=
COUNT <left paren> <asterisk> <right paren> [ <filter clause> ]
| <general set function> [ <filter clause> ]
| <binary set function> [ <filter clause> ]
| <ordered set function> [ <filter clause> ]
| <array aggregate function> [ <filter clause> ]
| <row pattern count function> [ <filter clause> ]
```
Several mainstream databases support this syntax.
**PostgreSQL:**
https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES
For example:
```
SELECT
year,
count(*) FILTER (WHERE gdp_per_capita >= 40000)
FROM
countries
GROUP BY
year
```
```
SELECT
year,
code,
gdp_per_capita,
count(*)
FILTER (WHERE gdp_per_capita >= 40000)
OVER (PARTITION BY year)
FROM
countries
```
**jOOQ:**
https://blog.jooq.org/2014/12/30/the-awesome-postgresql-9-4-sql2003-filter-clause-for-aggregate-functions/
**Notice:**
1. This PR only supports the FILTER predicate without codegen; maropu will create another PR related to SPARK-30027 to support codegen.
2. This PR only supports the FILTER predicate without DISTINCT; I will create another PR related to SPARK-30276 to support this.
3. This PR only supports FILTER predicates that can't reference the outer query; I created ticket SPARK-30219 to support that.
4. This PR only supports FILTER predicates that can't use IN/EXISTS predicate subqueries; I created ticket SPARK-30220 to support that.
5. Spark SQL cannot support SQL with nested aggregates; I created ticket SPARK-30182 to support it.
Below are some demonstrations of the PR in my production environment.
```
spark-sql> desc gja_test_partition;
key string NULL
value string NULL
other string NULL
col2 int NULL
# Partition Information
# col_name data_type comment
col2 int NULL
Time taken: 0.79 s
```
```
spark-sql> select * from gja_test_partition;
a A ao 1
b B bo 1
c C co 1
d D do 1
e E eo 2
g G go 2
h H ho 2
j J jo 2
f F fo 3
k K ko 3
l L lo 4
i I io 4
Time taken: 1.75 s
```
```
spark-sql> select count(key), sum(col2) from gja_test_partition;
12 26
Time taken: 1.848 s
```
```
spark-sql> select count(key) filter (where col2 > 1) from gja_test_partition;
8
Time taken: 2.926 s
```
```
spark-sql> select sum(col2) filter (where col2 > 2) from gja_test_partition;
14
Time taken: 2.087 s
```
```
spark-sql> select count(key) filter (where col2 > 1), sum(col2) filter (where col2 > 2) from gja_test_partition;
8 14
Time taken: 2.847 s
```
```
spark-sql> select count(key), count(key) filter (where col2 > 1), sum(col2), sum(col2) filter (where col2 > 2) from gja_test_partition;
12 8 26 14
Time taken: 1.787 s
```
```
spark-sql> desc student;
id int NULL
name string NULL
sex string NULL
class_id int NULL
Time taken: 0.206 s
```
```
spark-sql> select * from student;
1 张三 man 1
2 李四 man 1
3 王五 man 2
4 赵六 man 2
5 钱小花 woman 1
6 赵九红 woman 2
7 郭丽丽 woman 2
Time taken: 0.786 s
```
```
spark-sql> select class_id, count(id), sum(id) from student group by class_id;
1 3 8
2 4 20
Time taken: 18.783 s
```
```
spark-sql> select class_id, count(id) filter (where sex = 'man'), sum(id) filter (where sex = 'woman') from student group by class_id;
1 2 5
2 2 13
Time taken: 3.887 s
```
### Why are the changes needed?
Add new SQL feature.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UTs and new UTs.
Closes #26656 from beliefer/support-aggregate-clause.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch removes TODO comments that were left about changing case classes having vars into normal classes in the spark-sql-kafka module. The pattern is actually discouraged, but it is still worth keeping here, as we already rely on the automatic toString implementation and may rely on more.
### Why are the changes needed?
Described above.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes #26992 from HeartSaVioR/SPARK-30337.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
When I execute a query such as `select get_json_object(ytag, '$.y1') AS y1 from t4`, Spark SQL returns null but Hive returns the correct results.
In my production environment, ytag is JSON wrapped in single quotes, as follows:
```
{'y1': 'shuma', 'y2': 'shuma:shouji'}
{'y1': 'jiaoyu', 'y2': 'jiaoyu:gaokao'}
{'y1': 'yule', 'y2': 'yule:mingxing'}
```
Then I realized that some functions, including get_json_object and json_tuple, do not support parsing JSON with single quotes.
So I provide this PR to resolve the issue. A small demonstration follows.
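A spark-shell sketch of the fixed behavior, with a literal JSON string standing in for the `ytag` column (the exact output formatting is assumed):
```scala
spark.sql("""SELECT get_json_object("{'y1': 'shuma', 'y2': 'shuma:shouji'}", '$.y1')""").show()
// before this PR: null; after: shuma
```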
### Why are the changes needed?
Enabled for Hive compatibility
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
New tests.
Closes #26965 from wenfang6/enableSingleQuotesJsonForSparkSQL.
Authored-by: wenfang <wenfang@360.cn>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Make LibSVMDataSource attach an AttributeGroup.
### Why are the changes needed?
LibSVMDataSource attaches special metadata to indicate numFeatures:
```scala
scala> val data = spark.read.format("libsvm").load("/data0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
scala> data.schema("features").metadata
res0: org.apache.spark.sql.types.Metadata = {"numFeatures":4}
```
However, all ML implementations try to obtain the vector size via AttributeGroup, which cannot read this metadata:
```scala
scala> import org.apache.spark.ml.attribute._
import org.apache.spark.ml.attribute._
scala> AttributeGroup.fromStructField(data.schema("features")).size
res1: Int = -1
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
added tests
Closes #27003 from zhengruifeng/libsvm_attr_group.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Rename `spark.sql.streaming.forceDeleteTempCheckpointLocation` to `spark.sql.streaming.forceDeleteTempCheckpointLocation.enabled`.
### Why are the changes needed?
To follow the other boolean conf naming convention.
### Does this PR introduce any user-facing change?
No, as this config is newly added in 3.0.
### How was this patch tested?
Pass Jenkins.
Closes #26981 from Ngone51/SPARK-26389-FOLLOWUP.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
accuracyExpression can accept a Long, which may cause an overflow error.
accuracyExpression can accept fractions, which are implicitly floored.
accuracyExpression can accept null, which is implicitly changed to 0.
percentageExpression can accept null but causes a MatchError.
percentageExpression can accept ArrayType(_, nullable=true), in which the nulls are implicitly changed to zeros.
##### cases
```sql
select percentile_approx(10.0, 0.5, 2147483648); -- overflow and fail
select percentile_approx(10.0, 0.5, 4294967297); -- overflow but success
select percentile_approx(10.0, 0.5, null); -- null cast to 0
select percentile_approx(10.0, 0.5, 1.2); -- 1.2 cast to 1
select percentile_approx(10.0, null, 1); -- scala.MatchError
select percentile_approx(10.0, array(0.2, 0.4, null), 1); -- null cast to zero.
```
##### behavior before
```sql
+select percentile_approx(10.0, 0.5, 2147483648)
+org.apache.spark.sql.AnalysisException
+cannot resolve 'percentile_approx(10.0BD, CAST(0.5BD AS DOUBLE), CAST(2147483648L AS INT))' due to data type mismatch: The accuracy provided must be a positive integer literal (current value = -2147483648); line 1 pos 7
+
+select percentile_approx(10.0, 0.5, 4294967297)
+10.0
+
+select percentile_approx(10.0, 0.5, null)
+org.apache.spark.sql.AnalysisException
+cannot resolve 'percentile_approx(10.0BD, CAST(0.5BD AS DOUBLE), CAST(NULL AS INT))' due to data type mismatch: The accuracy provided must be a positive integer literal (current value = 0); line 1 pos 7
+
+select percentile_approx(10.0, 0.5, 1.2)
+10.0
+
+select percentile_approx(10.0, null, 1)
+scala.MatchError
+null
+
+
+select percentile_approx(10.0, array(0.2, 0.4, null), 1)
+[10.0,10.0,10.0]
```
##### behavior after
```sql
+select percentile_approx(10.0, 0.5, 2147483648)
+10.0
+
+select percentile_approx(10.0, 0.5, 4294967297)
+10.0
+
+select percentile_approx(10.0, 0.5, null)
+org.apache.spark.sql.AnalysisException
+cannot resolve 'percentile_approx(10.0BD, 0.5BD, NULL)' due to data type mismatch: argument 3 requires integral type, however, 'NULL' is of null type.; line 1 pos 7
+
+select percentile_approx(10.0, 0.5, 1.2)
+org.apache.spark.sql.AnalysisException
+cannot resolve 'percentile_approx(10.0BD, 0.5BD, 1.2BD)' due to data type mismatch: argument 3 requires integral type, however, '1.2BD' is of decimal(2,1) type.; line 1 pos 7
+
+select percentile_approx(10.0, null, 1)
+java.lang.IllegalArgumentException
+The value of percentage must be be between 0.0 and 1.0, but got null
+
+select percentile_approx(10.0, array(0.2, 0.4, null), 1)
+java.lang.IllegalArgumentException
+Each value of the percentage array must be be between 0.0 and 1.0, but got [0.2,0.4,null]
```
### Why are the changes needed?
bug fix
### Does this PR introduce any user-facing change?
Yes, this fixes some improper usages of percentile_approx, as the cases listed above show.
### How was this patch tested?
Added UTs.
Closes #26905 from yaooqinn/SPARK-30266.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>