### What changes were proposed in this pull request?
1. For the interval arithmetic functions, e.g. `add`/`subtract`/`negative`/`multiply`/`divide`, enable overflow checks when `ANSI` is on.
2. For `multiply`/`divide`, throw an exception when an overflow happens, regardless of whether `ANSI` is on or off.
3. `add`/`subtract`/`negative` stay the same for backward compatibility.
4. `divide` by 0 throws an ArithmeticException whether `ANSI` is on or off, the same as for numerics.
5. These behaviors fully match the numeric type operations when ANSI is on.
6. These behaviors fully match the numeric type operations when ANSI is off, except for items 2 and 4.
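A rough spark-shell sketch of how the rules above play out; the config name `spark.sql.ansi.enabled` and the exact error behavior are assumptions here, not quoted from the patch:
```scala
// Hedged illustration of the rules above; error messages are approximate.
spark.sql("SELECT interval 2147483647 month * 2").show()
// -> ArithmeticException (rule 2: multiply/divide throw regardless of ANSI)

spark.conf.set("spark.sql.ansi.enabled", false)
spark.sql("SELECT interval 2147483647 month + interval 1 month").show()
// -> wraps around silently (rule 3: add/subtract/negative keep legacy behavior)

spark.conf.set("spark.sql.ansi.enabled", true)
spark.sql("SELECT interval 2147483647 month + interval 1 month").show()
// -> ArithmeticException (rule 1: overflow check with ANSI on)
```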
### Why are the changes needed?
1. bug fix
2. `ANSI` support
### Does this PR introduce any user-facing change?
When `ANSI` is on, interval `add`/`subtract`/`negative`/`multiply`/`divide` will throw an exception if any field overflows
### How was this patch tested?
add unit tests
Closes #26995 from yaooqinn/SPARK-30341.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Revert https://github.com/apache/spark/pull/26338, as the syntax is actually the [Hive-style ALTER COLUMN](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ChangeColumnName/Type/Position/Comment).
This PR brings it back and makes it support multi-catalog:
1. Renaming is not allowed, as `AlterTableAlterColumnStatement` can't do renaming.
2. The column name can be multi-part.
### Why are the changes needed?
To avoid breaking Hive compatibility.
### Does this PR introduce any user-facing change?
No, as the removal was merged in 3.0.
### How was this patch tested?
new parser tests
Closes #27076 from cloud-fan/alter.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Update the CREATE VIEW command to store the current catalog and namespace instead of the current database in view metadata. Also update the analyzer to leverage the catalog and namespace in the view metadata to resolve relations inside views.
Note that this PR still keeps the way we resolve views, by recursively calling the Analyzer. This is necessary because view text may contain CTEs, window specs, etc., which need rules outside of the main resolution batch (e.g. `CTESubstitution`).
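A hypothetical spark-shell session illustrating the new behavior (the database names are made up for this sketch):
```scala
// Minimal sketch, assuming databases ns1/ns2 exist only for this example.
spark.sql("CREATE DATABASE ns1")
spark.sql("CREATE DATABASE ns2")
spark.sql("CREATE TABLE ns1.t(i INT) USING parquet")
spark.sql("USE ns1")
spark.sql("CREATE VIEW v AS SELECT * FROM t")   // view metadata records namespace ns1
spark.sql("USE ns2")
spark.sql("SELECT * FROM ns1.v").show()         // t still resolves to ns1.t, not ns2.t
```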
### Why are the changes needed?
To resolve relations inside view correctly.
### Does this PR introduce any user-facing change?
Yes, this fixes a bug: tables referred to by a view can now be resolved correctly even if the current catalog/namespace has been updated.
### How was this patch tested?
a new test
Closes #26923 from cloud-fan/view.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to fix the NoSuchElementException thrown when AQE is enabled with an InSubquery expression.
### Why are the changes needed?
To fix the exception described above.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added a new UT.
Closes #27068 from JkSelf/fixSubqueryIssue.
Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch is based on #23921 but revised to be simpler, and adds a UT to test the behavior.
(This patch contains the commit from #23921 to retain credit.)
Spark loads new JARs for `ADD JAR` and `CREATE FUNCTION ... USING JAR` into the jar classloader in shared state, and changes the current thread's context classloader to the jar classloader, as many parts of the remaining code rely on the current thread's context classloader.
This works if further queries run in the same thread and the thread's context classloader is unchanged, but once the current thread's context classloader is switched back for any reason, Spark fails to create an instance of the class for the function.
This bug mostly affects spark-shell, as spark-shell rolls back the current thread's context classloader at every prompt. But it may also affect the case of a job server, where queries may be running in multiple threads.
This patch fixes the issue by switching the context classloader to the classloader that loaded the class. Fortunately, the FunctionBuilder created by `makeFunctionBuilder` captures the Class as part of its closure, so the Class itself is available regardless of the current thread's context classloader.
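A minimal sketch of the idea; the helper name is illustrative, not the actual method in the patch:
```scala
// Run `body` with the given classloader as the thread's context classloader,
// restoring the previous one afterwards.
def withContextClassLoader[T](loader: ClassLoader)(body: => T): T = {
  val previous = Thread.currentThread().getContextClassLoader
  Thread.currentThread().setContextClassLoader(loader)
  try body finally Thread.currentThread().setContextClassLoader(previous)
}

// Usage: `clazz` is the UDF class captured in the FunctionBuilder's closure.
// withContextClassLoader(clazz.getClassLoader) { clazz.getConstructor().newInstance() }
```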
### Why are the changes needed?
Without this patch, end users cannot execute Hive UDF using JAR twice in spark-shell.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
New UT.
Closes #27025 from HeartSaVioR/SPARK-26560-revised.
Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Co-authored-by: nivo091 <nivedeeta.singh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Change config name from `spark.sql.legacy.typeCoercion.datetimeToString` to `spark.sql.legacy.typeCoercion.datetimeToString.enabled`.
### Why are the changes needed?
To follow the other boolean conf naming convention.
### Does this PR introduce any user-facing change?
No, it's newly added in Spark 3.0.
### How was this patch tested?
Pass Jenkins
Closes #27065 from Ngone51/SPARK-27638-FOLLOWUP.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Change config name from `spark.sql.optimizer.reassignLambdaVariableID` to `spark.sql.optimizer.reassignLambdaVariableID.enabled`.
### Why are the changes needed?
To follow the other boolean conf naming convention.
### Does this PR introduce any user-facing change?
No, it's newly added in Spark 3.0.
### How was this patch tested?
Pass Jenkins.
Closes #27063 from Ngone51/SPARK-27871-FOLLOWUP.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Calls to `requireNonStaticConf()` are removed from the typed `set()` methods in RuntimeConfig, because those methods delegate to `def set(key: String, value: String): Unit`, where `requireNonStaticConf()` is already called.
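A simplified sketch of the pattern; the signatures are abbreviated and the class below is a stand-in, not the real RuntimeConfig:
```scala
// Typed overloads delegate to the String overload, so the static-conf check
// only needs to live in one place.
class RuntimeConfigSketch(setConfString: (String, String) => Unit,
                          requireNonStaticConf: String => Unit) {
  def set(key: String, value: Boolean): Unit = set(key, value.toString)
  def set(key: String, value: Long): Unit = set(key, value.toString)
  def set(key: String, value: String): Unit = {
    requireNonStaticConf(key)  // the single remaining check
    setConfString(key, value)
  }
}
```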
### Why are the changes needed?
To avoid unnecessary calls of `requireNonStaticConf()`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing tests from `SQLConfSuite`
Closes #27062 from MaxGekk/call-requireNonStaticConf-once.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Documentation added for refresh resources command in spark-sql.
### Why are the changes needed?
Previously, only the refresh table command was documented.
### Does this PR introduce any user-facing change?
Yes. Now users can access documentation for the refresh resources command.
### How was this patch tested?
Manually.
Closes #27023 from iRakson/SPARK-30363.
Authored-by: root1 <raksonrakesh@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
There are too many classes placed in the single package `org.apache.spark.sql.kafka010`; these classes can be grouped by purpose.
As a part of the change in SPARK-21869 (#26845), we moved producer-related classes out to `org.apache.spark.sql.kafka010.producer` and only exposed the necessary classes/methods outside of the package. This patch applies the same to consumer-related classes.
### Why are the changes needed?
Described above.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes #26991 from HeartSaVioR/SPARK-30336.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1. Add new foreach-like methods: `foreach`/`foreachNonZero`.
2. Add iterators: `iterator`/`activeIterator`/`nonZeroIterator`.
### Why are the changes needed?
See the [ticket](https://issues.apache.org/jira/browse/SPARK-30329) for details.
`foreach`/`foreachNonZero`: for both convenience and performance (`SparseVector.foreach` should be faster than the current traversal method).
`iterator`/`activeIterator`/`nonZeroIterator`: add the three iterators so that we can later add or change implementations based on them on both the ml and mllib sides, to avoid vector conversions. A usage sketch follows below.
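A hedged usage sketch of the proposed methods (signatures as described here; exact behavior may differ in the final patch):
```scala
import org.apache.spark.ml.linalg.Vectors

val v = Vectors.sparse(4, Array(1, 3), Array(2.0, 0.0))
v.foreachNonZero((i, value) => println(s"$i -> $value")) // visits index 1 only
v.nonZeroIterator.toSeq    // Seq((1, 2.0)): skips the explicit zero at index 3
v.activeIterator.toSeq     // Seq((1, 2.0), (3, 0.0)): all stored entries
v.iterator.toSeq           // all four (index, value) pairs, including implicit zeros
```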
### Does this PR introduce any user-facing change?
Yes, new methods are added
### How was this patch tested?
added testsuites
Closes #26982 from zhengruifeng/vector_iter.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add `instr.logSumOfWeights` to the algorithms that have `weightCol` support.
### Why are the changes needed?
Many algorithms support `weightCol` now. The sum of weights is useful information to add to the log.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
manually tested
Closes #26972 from huaxingao/spark-30321.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Refactor `RandomForest.findSplits` by applying `aggregateByKey` instead of `groupByKey`
### Why are the changes needed?
The current implementation of `RandomForest.findSplits` uses `groupByKey` to collect all non-zero values for each feature, which is quite dangerous: all values of a feature must fit into a single buffer.
After looking into the split-finding logic, I found that collecting all non-zero values is not necessary; we only need the weight sums of the distinct values. A sketch of the refactor follows.
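A minimal spark-shell sketch of the refactor, assuming unit sample weights and made-up data:
```scala
// (featureIndex, value) pairs; each sample carries weight 1.0 in this sketch.
val pairs = sc.parallelize(Seq((0, 1.5), (0, 1.5), (0, 2.0), (1, 3.0)))

// Before: groupByKey materializes every value of a feature in one buffer.
// After: aggregateByKey folds values into value -> weight-sum maps per feature,
// which is all the split-finding logic actually needs.
val valueWeights = pairs.aggregateByKey(Map.empty[Double, Double])(
  (m, v) => m.updated(v, m.getOrElse(v, 0.0) + 1.0),
  (m1, m2) => m2.foldLeft(m1) { case (m, (v, w)) => m.updated(v, m.getOrElse(v, 0.0) + w) }
)
valueWeights.collect()  // Array((0, Map(1.5 -> 2.0, 2.0 -> 1.0)), (1, Map(3.0 -> 1.0)))
```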
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27040 from zhengruifeng/rf_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
add getter/setter in Python FM
### Why are the changes needed?
to be consistent with other algorithms
### Does this PR introduce any user-facing change?
Yes.
add getter/setter in Python FMRegressor/FMRegressionModel/FMClassifier/FMClassificationModel
### How was this patch tested?
doctest
Closes #27044 from huaxingao/spark-30378.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1. Expose `predictRaw` and `predictProbability`.
2. Add tests.
### Why are the changes needed?
Single-instance prediction is useful outside of Spark, especially for online prediction.
Currently `predict` is exposed, but it is not enough. A usage sketch follows.
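A hedged sketch of single-instance prediction with the newly exposed methods, assuming `model` is a fitted classification model such as a LogisticRegressionModel:
```scala
import org.apache.spark.ml.linalg.Vectors

val features = Vectors.dense(0.1, 2.3, -0.5)
// model.predict(features)            // already public: the predicted label
// model.predictRaw(features)         // newly exposed: raw scores per class
// model.predictProbability(features) // newly exposed: class probabilities
```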
### Does this PR introduce any user-facing change?
Yes, new methods are exposed
### How was this patch tested?
added testsuites
Closes #27015 from zhengruifeng/expose_raw_prob.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
This patch proposes to only convert the first few elements of collection accumulators in `LiveEntityHelpers.newAccumulatorInfos`.
### Why are the changes needed?
One Spark job on our cluster uses a collection accumulator to collect something and encountered an exception like:
```
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at java.lang.StringBuilder.append(StringBuilder.java:131)
at java.util.AbstractCollection.toString(AbstractCollection.java:462)
at java.util.Collections$UnmodifiableCollection.toString(Collections.java:1035)
at org.apache.spark.status.LiveEntityHelpers$$anonfun$newAccumulatorInfos$2$$anonfun$apply$3.apply(LiveEntity.scala:596)
at org.apache.spark.status.LiveEntityHelpers$$anonfun$newAccumulatorInfos$2$$anonfun$apply$3.apply(LiveEntity.scala:596)
at scala.Option.map(Option.scala:146)
at org.apache.spark.status.LiveEntityHelpers$$anonfun$newAccumulatorInfos$2.apply(LiveEntity.scala:596)
at org.apache.spark.status.LiveEntityHelpers$$anonfun$newAccumulatorInfos$2.apply(LiveEntity.scala:591)
```
`LiveEntityHelpers.newAccumulatorInfos` converts `AccumulableInfo`s to `v1.AccumulableInfo` by calling `toString` on the accumulator's value. For a collection accumulator, the string representation can take much more memory (for example, a collection accumulator of long values) and cause OOM (in this job, the driver memory is 6g).
The results of `newAccumulatorInfos` are used in the API and UI. For such usage, it also does not make sense to render the very long string of a complete collection accumulator. A minimal sketch of the mitigation follows.
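A minimal sketch of the mitigation; the element cap and the formatting are assumptions, not the exact implementation:
```scala
import scala.collection.JavaConverters._

// Stringify only a prefix of the collection instead of the whole value.
def truncatedToString(value: java.util.List[_], maxElements: Int = 5): String = {
  val prefix = value.asScala.take(maxElements + 1).toList
  if (prefix.length > maxElements) prefix.take(maxElements).mkString("[", ", ", ", ... ]")
  else prefix.mkString("[", ", ", "]")
}
```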
### Does this PR introduce any user-facing change?
Yes. Collection accumulators now only show the first few elements in the API and UI.
### How was this patch tested?
Unit test.
Manual test. Launched a Spark shell, ran:
```scala
val accum = sc.collectionAccumulator[Long]("Collection Accumulator Example")
sc.range(0, 10000, 1, 1).foreach(x => accum.add(x))
accum.value
```
![Screen Shot 2019-12-30 at 2 03 43 PM](https://user-images.githubusercontent.com/68855/71602488-6eb2c400-2b0d-11ea-8725-dba36478198f.png)
Closes #27038 from viirya/partial-collect-accu.
Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Currently, if function lookup fails, Spark gives it a second chance by casting decimal-typed arguments to double type. But in cases where no decimal type exists, it's meaningless to look up again, and it causes extra cost like unnecessary metastore access. We should throw the exception directly in these cases. A sketch of the guard follows.
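A small sketch of the guard; the helper name is illustrative, not the actual code in the patch:
```scala
import org.apache.spark.sql.types._

// Only retry the lookup with decimals cast to double when some argument
// actually has a decimal type.
def shouldRetryWithDouble(argTypes: Seq[DataType]): Boolean =
  argTypes.exists(_.isInstanceOf[DecimalType])

shouldRetryWithDouble(Seq(IntegerType, StringType))        // false -> fail fast
shouldRetryWithDouble(Seq(DecimalType(10, 2), StringType)) // true  -> second attempt
```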
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Covered by existing tests.
Closes #26994 from wzhfy/avoid_udf_fail_twice.
Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch fixes a small bug in the streaming query example: the type of observed metrics is a Java Map instead of a Scala Map, so it must be converted before calling foreach.
### Why are the changes needed?
Described above.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Ran below query via `spark-shell`:
**Streaming**
```scala
import scala.collection.JavaConverters._
import scala.util.Random
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._
spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    event.progress.observedMetrics.asScala.get("my_event").foreach { row =>
      // Trigger if the number of errors exceeds 5 percent
      val num_rows = row.getAs[Long]("rc")
      val num_error_rows = row.getAs[Long]("erc")
      val ratio = num_error_rows.toDouble / num_rows
      if (ratio > 0.05) {
        // Trigger alert
        println(s"alert! error ratio: $ratio")
      }
    }
  }
  def onQueryStarted(event: QueryStartedEvent): Unit = {}
  def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
})
val rates = spark
  .readStream
  .format("rate")
  .option("rowsPerSecond", 10)
  .load
val rand = new Random()
val df = rates.map { row => (row.getLong(1), if (row.getLong(1) % 2 == 0) "error" else null) }.toDF
val ds = df.selectExpr("_1 AS id", "_2 AS error")
// Observe row count (rc) and error row count (erc) in the batch Dataset
val observed_ds = ds.observe("my_event", count(lit(1)).as("rc"), count($"error").as("erc"))
observed_ds.writeStream.format("console").start()
```
Closes #27046 from HeartSaVioR/SPARK-29348-FOLLOWUP.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
The UI and event logs redact sensitive information. But the monitoring REST API, https://spark.apache.org/docs/latest/monitoring.html#rest-api , specifically /applications/[app-id]/environment, does not, which is a security issue.
### What changes were proposed in this pull request?
The REST API response is redacted before it is sent.
### Why are the changes needed?
If no redaction is done for REST API calls, sensitive information can leak.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Tested manually
Closes #27018 from ajithme/redactrest.
Authored-by: Ajith <ajith2489@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Remove meaningless tooltips from all pages.
### Why are the changes needed?
If we can't come up with meaningful tooltips, we should not add tooltips at all.
### Does this PR introduce any user-facing change?
Yes
![67598045-351ab100-f78a-11e9-88cf-573e09d7c50e](https://user-images.githubusercontent.com/8948111/71558018-81c58580-2a74-11ea-9f38-dcaebd3f0bbf.png)
Tooltips like the ones highlighted in the image above were removed.
### How was this patch tested?
Manual test.
Closes #27043 from 07ARB/SPARK-30383.
Authored-by: 07ARB <ankitrajboudh@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/26780
In https://github.com/apache/spark/pull/26780, a new Avro data source option `actualSchema` was introduced for setting the original Avro schema in the function `from_avro`, while the expected schema is supposed to be set in the parameter `jsonFormatSchema` of `from_avro`.
However, there is another Avro data source option `avroSchema`. It is used for setting the expected schema in reading and writing.
This PR uses the `avroSchema` option for reading Avro data with an evolved schema and removes the new `actualSchema` option.
### Why are the changes needed?
Unify and simplify the Avro data source options.
### Does this PR introduce any user-facing change?
Yes.
To deserialize Avro data with an evolved schema, before changes:
```
from_avro('col, expectedSchema, ("actualSchema" -> actualSchema))
```
After changes:
```
from_avro('col, actualSchema, ("avroSchema" -> expectedSchema))
```
The second parameter is always the actual Avro schema after changes.
### How was this patch tested?
Update the existing tests in https://github.com/apache/spark/pull/26780
Closes #27045 from gengliangwang/renameAvroOption.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In `SqlBase.g4`, `database` is defined as
```
database : DATABASE | SCHEMA;
```
and it is being used as `(database | NAMESPACE)` in many places.
This PR proposes to define the following and use it as discussed in https://github.com/apache/spark/pull/26847/files#r359754778:
```
namespace : NAMESPACE | DATABASE | SCHEMA;
```
### Why are the changes needed?
To simplify the grammar.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
There is no change in the actual grammar, so the existing tests should be sufficient.
Closes #27027 from imback82/sqlbase_namespace.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch fixes the flaky test failure on MasterSuite, "SPARK-27510: Master should avoid dead loop while launching executor failed in Worker".
The culprit of the test failure was, ironically, that the test ran too fast; the interval of `eventually` is 15 ms by default, but it took only 8 ms from submitting the driver to removing the app from the master.
```
19/12/23 15:45:06.533 dispatcher-event-loop-6 INFO Master: Registering worker localhost:9999 with 10 cores, 3.6 GiB RAM
19/12/23 15:45:06.534 dispatcher-event-loop-6 INFO Master: Driver submitted org.apache.spark.FakeClass
19/12/23 15:45:06.535 dispatcher-event-loop-6 INFO Master: Launching driver driver-20191223154506-0000 on worker 10001
19/12/23 15:45:06.536 dispatcher-event-loop-9 INFO Master: Registering app name
19/12/23 15:45:06.537 dispatcher-event-loop-9 INFO Master: Registered app name with ID app-20191223154506-0000
19/12/23 15:45:06.537 dispatcher-event-loop-9 INFO Master: Launching executor app-20191223154506-0000/0 on worker 10001
19/12/23 15:45:06.537 dispatcher-event-loop-10 INFO Master: Removing executor app-20191223154506-0000/0 because it is FAILED
...
19/12/23 15:45:06.542 dispatcher-event-loop-19 ERROR Master: Application name with ID app-20191223154506-0000 failed 10 times; removing it
```
Given the interval is already tiny, instead of lowering it further, the patch also accounts for the above case when verifying the status.
### Why are the changes needed?
We observed intermittent test failure in Jenkins build which should be fixed.
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115664/testReport/
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Modified UT.
Closes #27004 from HeartSaVioR/SPARK-30348.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Remove a lost executor from `CoarseGrainedSchedulerBackend` when `HeartbeatReceiver` recognizes it as lost.
### Why are the changes needed?
Currently, an application may hang if we don't remove a lost executor from `CoarseGrainedSchedulerBackend`. This may happen as follows:
1) In `expireDeadHosts()`, `HeartbeatReceiver` calls `scheduler.executorLost()`;
2) Before `HeartbeatReceiver` calls `sc.killAndReplaceExecutor()` (which would mark the lost executor as "pendingToRemove") in a separate thread, `CoarseGrainedSchedulerBackend` may begin to launch tasks on that executor without realizing it is already lost.
3) If that lost executor doesn't shut down gracefully, `CoarseGrainedSchedulerBackend` may never receive a disconnect event. As a result, tasks launched on that lost executor become orphans, while at the same time the driver thinks those tasks are still running and waits forever.
Removing the lost executor from `CoarseGrainedSchedulerBackend` lets `TaskSetManager` mark those tasks as failed, which avoids the app hang. Furthermore, it cleans up records in `executorDataMap`, which may otherwise never be removed in such a case.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Updated existed tests.
Closes #24350.
Closes #26980 from Ngone51/SPARK-27348.
Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Use `MetadataUtils.getNumFeatures` to extract the number of features.
### Why are the changes needed?
This may avoid a `first`/`head` job if the metadata already contains the attribute group.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#27037 from zhengruifeng/unify_numFeatures.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
This reverts commit f926809a1f.
Closes #27032 from gatorsmile/revertSPARK-29390.
Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
Updated the document for LIST FILE/JAR command.
### Why are the changes needed?
LIST FILE/JAR can take multiple file names as arguments and returns the files that were added as resources.
### Does this PR introduce any user-facing change?
Yes. Documentation updated for LIST FILE/JAR command
### How was this patch tested?
Manually
Closes #26996 from iRakson/SPARK-30342.
Authored-by: root1 <raksonrakesh@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Compute the count and quantiles in one pass.
### Why are the changes needed?
To avoid an extra pass over the data. A sketch of the idea follows.
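A one-pass sketch in spark-shell terms; `Summary` below is a hypothetical stand-in for the quantile sketch actually used:
```scala
// Fold a per-partition summary and a counter together in a single
// treeAggregate instead of running a separate count() job.
case class Summary(values: Vector[Double]) {
  def insert(v: Double): Summary = Summary(values :+ v)
  def merge(o: Summary): Summary = Summary(values ++ o.values)
}
val data = sc.parallelize(1 to 1000).map(_.toDouble)
val (summary, count) = data.treeAggregate((Summary(Vector.empty), 0L))(
  { case ((s, c), v) => (s.insert(v), c + 1) },
  { case ((s1, c1), (s2, c2)) => (s1.merge(s2), c1 + c2) }
)
```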
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #26990 from zhengruifeng/quantile_count_lsh.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
1. Remove unused imports and variables.
2. Remove `countAccum: LongAccumulator`, since `costAccum: DoubleAccumulator` also records the count.
3. Mark `clusterCentersWithNorm` in KMeansModel transient and lazy, since it is only used in transformation and can be generated directly from the centers.
### Why are the changes needed?
1. Remove unused code.
2. Avoid repeated computation.
3. Reduce the broadcast size.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27014 from zhengruifeng/kmeans_clean_up.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
Currently the Spark history server displays the classpath entries in the Environment tab with the values redacted, because the event log file has the entry values redacted while writing. But when the same page is viewed in a running application's UI, the entries are not redacted. Redacting classpath entries is not needed and can be avoided.
### What changes were proposed in this pull request?
Event logs will no longer redact the classpath entries.
### Why are the changes needed?
Redaction of classpath entries is not needed.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Tested manually to verify on UI
Closes #27016 from ajithme/redactui.
Authored-by: Ajith <ajith2489@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
precompute the `DecisionTreeMetadata` and reuse it for all trees
### Why are the changes needed?
In the existing implementation, each `DecisionTreeRegressor` needs a pass over the whole dataset to calculate the same `DecisionTreeMetadata` repeatedly.
With this PR, at the default depth=5, it is about 8% faster than the existing implementation.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #27011 from zhengruifeng/gbt_reuse_instr_meta.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
## What changes were proposed in this pull request?
Creating a temporary or permanent function should throw an AnalysisException if the resource is not found. We need to keep the behavior consistent across permanent and temporary functions.
## How was this patch tested?
Added UT and also tested manually
**Before Fix**
If the UDF resource is not present, then creating a temporary function throws an AnalysisException, whereas a permanent function does not throw. A permanent function throws an AnalysisException only after a SELECT operation is performed.
**After Fix**
For both temporary and permanent functions, check for the resource; if the UDF resource is not found, throw an AnalysisException.
![rt](https://user-images.githubusercontent.com/35216143/62781519-d1131580-bad5-11e9-9d58-69e65be86c03.png)
Closes #25399 from sandeep-katta/funcIssue.
Authored-by: sandeep katta <sandeep.katta2007@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This patch fixes the intermittent test failure on ThriftServerQueryTestSuite/ThriftServerWithSparkContextSuite, which got a ConnectException when querying the Thrift server.
(https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115646/testReport/)
The relevant unit test log messages are following:
```
19/12/23 13:33:01.875 pool-1-thread-1 INFO AbstractService: Service:ThriftBinaryCLIService is started.
19/12/23 13:33:01.875 pool-1-thread-1 INFO AbstractService: Service:HiveServer2 is started.
...
19/12/23 13:33:01.888 pool-1-thread-1 INFO ThriftServerWithSparkContextSuite: HiveThriftServer2 started successfully
...
19/12/23 13:33:01.909 pool-1-thread-1-ScalaTest-running-ThriftServerWithSparkContextSuite INFO ThriftServerWithSparkContextSuite:
===== TEST OUTPUT FOR o.a.s.sql.hive.thriftserver.ThriftServerWithSparkContextSuite: 'SPARK-29911: Uncache cached tables when session closed' =====
...
19/12/23 13:33:02.017 pool-1-thread-1-ScalaTest-running-ThriftServerWithSparkContextSuite INFO Utils: Supplied authorities: localhost:15441
19/12/23 13:33:02.018 pool-1-thread-1-ScalaTest-running-ThriftServerWithSparkContextSuite INFO Utils: Resolved authority: localhost:15441
19/12/23 13:33:02.078 HiveServer2-Background-Pool: Thread-213 INFO BaseSessionStateBuilder$$anon$2: Optimization rule 'org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation' is excluded from the optimizer.
19/12/23 13:33:02.078 HiveServer2-Background-Pool: Thread-213 INFO BaseSessionStateBuilder$$anon$2: Optimization rule 'org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation' is excluded from the optimizer.
19/12/23 13:33:02.121 pool-1-thread-1-ScalaTest-running-ThriftServerWithSparkContextSuite WARN HiveConnection: Failed to connect to localhost:15441
19/12/23 13:33:02.124 pool-1-thread-1-ScalaTest-running-ThriftServerWithSparkContextSuite INFO ThriftServerWithSparkContextSuite:
===== FINISHED o.a.s.sql.hive.thriftserver.ThriftServerWithSparkContextSuite: 'SPARK-29911: Uncache cached tables when session closed' =====
19/12/23 13:33:02.143 Thread-35 INFO ThriftCLIService: Starting ThriftBinaryCLIService on port 15441 with 5...500 worker threads
19/12/23 13:33:02.327 pool-1-thread-1 INFO HiveServer2: Shutting down HiveServer2
19/12/23 13:33:02.328 pool-1-thread-1 INFO ThriftCLIService: Thrift server has stopped
```
(Here the error is logged as `WARN HiveConnection: Failed to connect to localhost:15441` - the actual stack trace can be seen on Jenkins test summary.)
The reason for the test failure: Thrift(Binary|Http)CLIService prepares and launches the service asynchronously (in a new thread), but the suites do not wait for its completion and just start running tests, which ends up in a race condition.
That can be easily reproduced by adding an artificial sleep in `ThriftBinaryCLIService.run()` here:
ba3f6330dd/sql/hive-thriftserver/v2.3/src/main/java/org/apache/hive/service/cli/thrift/ThriftBinaryCLIService.java (L49)
(Note that `sleep` should be added before initializing server socket. E.g. Line 57)
This patch changes the test initialization logic to repeatedly try a simple query until the service is available. The patch also refactors the code so the change applies easily to both ThriftServerQueryTestSuite and ThriftServerWithSparkContextSuite.
### Why are the changes needed?
This patch fixes the intermittent failure observed here:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115646/testReport/
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Artificially made the test fail consistently (by the approach described above), and confirmed the patch fixed the test.
Closes #27001 from HeartSaVioR/SPARK-30345.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Unify `DriverEndpoint.executorIsAlive()` and `CoarseGrainedSchedulerBackend.isExecutorActive()`.
### Why are the changes needed?
`DriverEndpoint` has the method `executorIsAlive()` to check whether an executor is alive/active, while `CoarseGrainedSchedulerBackend` has the method `isExecutorActive()` to do the same work. But `isExecutorActive()` seems to forget to consider `executorsPendingLossReason`. Unifying these two methods makes the behavior consistent between `DriverEndpoint` and `CoarseGrainedSchedulerBackend` and makes the code easier to maintain. A sketch of the unified check follows.
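A sketch of the unified check, a fragment using the collections named in this description (`executorDataMap`, the pending-to-remove set, and `executorsPendingLossReason`), not necessarily the exact final code:
```scala
override def isExecutorActive(id: String): Boolean = synchronized {
  executorDataMap.contains(id) &&
    !executorsPendingToRemove.contains(id) &&
    !executorsPendingLossReason.contains(id)
}
```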
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass Jenkins.
Closes #27012 from Ngone51/unify-is-executor-alive.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
supports instance weighting in GMM
### Why are the changes needed?
ML should support instance weighting
### Does this PR introduce any user-facing change?
yes, a new param `weightCol` is exposed
### How was this patch tested?
added test suites
Closes #26735 from zhengruifeng/gmm_support_weight.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add a corresponding class `MultivariateGaussian`, containing a vector and a matrix, on the Python side, so Gaussians can be used from Python.
### Does this PR introduce any user-facing change?
add Python class ```MultivariateGaussian```
### How was this patch tested?
doctest
Closes #27020 from huaxingao/spark-30247.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Update the Spark SQL document menu and join strategy hints.
### Why are the changes needed?
- Several recent changes to the Spark SQL documentation didn't update menu-sql.yaml correspondingly.
- Update the demo code for join strategy hints.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Document change only.
Closes #26917 from xuanyuanking/SPARK-30278.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### Why are the changes needed?
`EnsureRequirements` adds a `ShuffleExchangeExec` (RangePartitioning) after a `Sort` if a `RoundRobinPartitioning` exchange is below it. This causes two shuffles, and the number of partitions in the final stage is not the number specified by `RoundRobinPartitioning`.
**Example SQL**
```
SELECT /*+ REPARTITION(5) */ * FROM test ORDER BY a
```
**BEFORE**
```
== Physical Plan ==
*(1) Sort [a#0 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(a#0 ASC NULLS FIRST, 200), true, [id=#11]
+- Exchange RoundRobinPartitioning(5), false, [id=#9]
+- Scan hive default.test [a#0, b#1], HiveTableRelation `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#0, b#1]
```
**AFTER**
```
== Physical Plan ==
*(1) Sort [a#0 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(a#0 ASC NULLS FIRST, 5), true, [id=#11]
+- Scan hive default.test [a#0, b#1], HiveTableRelation `default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#0, b#1]
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Run suite Tests and add new test for this.
Closes #26946 from stczwd/RoundRobinPartitioning.
Lead-authored-by: lijunqing <lijunqing@baidu.com>
Co-authored-by: stczwd <qcsd2011@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Implement Factorization Machines as an ML pipeline component.
1. Supported loss functions: logloss, MSE.
2. Optimizers: GD, AdamW.
### Why are the changes needed?
Factorization Machines are widely used in advertising and recommendation systems to estimate CTR (click-through rate).
Advertising and recommendation systems usually have a lot of data, so we need Spark to estimate the CTR, and Factorization Machines are a common ML model for that; the model equation is shown after the references below.
References:
1. S. Rendle, “Factorization machines,” in Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010.
https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
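For reference, the second-order FM model from Rendle's paper combines a linear term with pairwise interactions factorized through latent vectors:
```
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j
```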
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
run unit tests
Closes #27000 from mob-ai/ml/fm.
Authored-by: zhanjf <zhanjf@mob.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Quiet the logging of `ExecutorAllocationManager`, whose executor-removal request messages are too verbose. Otherwise, this class generates too many messages like
`INFO spark.ExecutorAllocationManager: Request to remove executorIds: 890`
when DRA is enabled.
### Why are the changes needed?
Log level improvement.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Closes #26925 from cfmcgrady/quiet-request-executor-remove-message.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR fixes `ScalaReflection.arrayClassFor()` to use an empty array instead of a one-element array for getting its class object by reflection.
### Why are the changes needed?
Because it may reduce unnecessary memory allocation.
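A quick illustration of why length 0 suffices: an array's Class object does not depend on its length, so allocating a one-element array just to call getClass was wasted work.
```scala
val before = java.lang.reflect.Array.newInstance(classOf[Int], 1).getClass
val after  = java.lang.reflect.Array.newInstance(classOf[Int], 0).getClass
assert(before == after)  // both are Class[int[]]
```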
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Ran the existing unit tests for sql/catalyst and confirmed that all of them succeeded.
Closes #27005 from sekikn/SPARK-30350.
Authored-by: Kengo Seki <sekikn@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
The filter predicate for aggregate expressions is part of ANSI SQL:
```
<aggregate function> ::=
COUNT <left paren> <asterisk> <right paren> [ <filter clause> ]
| <general set function> [ <filter clause> ]
| <binary set function> [ <filter clause> ]
| <ordered set function> [ <filter clause> ]
| <array aggregate function> [ <filter clause> ]
| <row pattern count function> [ <filter clause> ]
```
Several mainstream databases support this syntax.
**PostgreSQL:**
https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES
For example:
```
SELECT
year,
count(*) FILTER (WHERE gdp_per_capita >= 40000)
FROM
countries
GROUP BY
year
```
```
SELECT
year,
code,
gdp_per_capita,
count(*)
FILTER (WHERE gdp_per_capita >= 40000)
OVER (PARTITION BY year)
FROM
countries
```
**jOOQ:**
https://blog.jooq.org/2014/12/30/the-awesome-postgresql-9-4-sql2003-filter-clause-for-aggregate-functions/
**Notice:**
1. This PR only supports the FILTER predicate without codegen; maropu will create another PR related to SPARK-30027 to support codegen.
2. This PR only supports the FILTER predicate without DISTINCT; I will create another PR related to SPARK-30276 to support this.
3. This PR only supports FILTER predicates that can't reference the outer query; I created ticket SPARK-30219 to support that.
4. This PR only supports FILTER predicates that can't use IN/EXISTS predicate subqueries; I created ticket SPARK-30220 to support that.
5. Spark SQL cannot support SQL with nested aggregates; I created ticket SPARK-30182 to support it.
Below are some demonstrations of the PR in my production environment.
```
spark-sql> desc gja_test_partition;
key string NULL
value string NULL
other string NULL
col2 int NULL
# Partition Information
# col_name data_type comment
col2 int NULL
Time taken: 0.79 s
```
```
spark-sql> select * from gja_test_partition;
a A ao 1
b B bo 1
c C co 1
d D do 1
e E eo 2
g G go 2
h H ho 2
j J jo 2
f F fo 3
k K ko 3
l L lo 4
i I io 4
Time taken: 1.75 s
```
```
spark-sql> select count(key), sum(col2) from gja_test_partition;
12 26
Time taken: 1.848 s
```
```
spark-sql> select count(key) filter (where col2 > 1) from gja_test_partition;
8
Time taken: 2.926 s
```
```
spark-sql> select sum(col2) filter (where col2 > 2) from gja_test_partition;
14
Time taken: 2.087 s
```
```
spark-sql> select count(key) filter (where col2 > 1), sum(col2) filter (where col2 > 2) from gja_test_partition;
8 14
Time taken: 2.847 s
```
```
spark-sql> select count(key), count(key) filter (where col2 > 1), sum(col2), sum(col2) filter (where col2 > 2) from gja_test_partition;
12 8 26 14
Time taken: 1.787 s
```
```
spark-sql> desc student;
id int NULL
name string NULL
sex string NULL
class_id int NULL
Time taken: 0.206 s
```
```
spark-sql> select * from student;
1 张三 man 1
2 李四 man 1
3 王五 man 2
4 赵六 man 2
5 钱小花 woman 1
6 赵九红 woman 2
7 郭丽丽 woman 2
Time taken: 0.786 s
```
```
spark-sql> select class_id, count(id), sum(id) from student group by class_id;
1 3 8
2 4 20
Time taken: 18.783 s
```
```
spark-sql> select class_id, count(id) filter (where sex = 'man'), sum(id) filter (where sex = 'woman') from student group by class_id;
1 2 5
2 2 13
Time taken: 3.887 s
```
### Why are the changes needed?
Add new SQL feature.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UTs and new UTs.
Closes #26656 from beliefer/support-aggregate-clause.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch removes TODO comments that were left about changing case classes having vars into normal classes in the spark-sql-kafka module. The pattern is actually discouraged, but it is still worth keeping here, as we already rely on the automatic toString implementation and may rely on more.
### Why are the changes needed?
Described above.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes #26992 from HeartSaVioR/SPARK-30337.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
When I execute a query such as `select get_json_object(ytag, '$.y1') AS y1 from t4`, Spark SQL returns null but Hive returns the correct results.
In my production environment, ytag is JSON wrapped in single quotes, as follows:
```
{'y1': 'shuma', 'y2': 'shuma:shouji'}
{'y1': 'jiaoyu', 'y2': 'jiaoyu:gaokao'}
{'y1': 'yule', 'y2': 'yule:mingxing'}
```
Then I realized that some functions, including get_json_object and json_tuple, do not support parsing JSON with single quotes.
So I provide this PR to resolve the issue. A small demonstration follows.
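A spark-shell sketch of the fixed behavior, with a literal JSON string standing in for the `ytag` column (the exact output formatting is assumed):
```scala
spark.sql("""SELECT get_json_object("{'y1': 'shuma', 'y2': 'shuma:shouji'}", '$.y1')""").show()
// before this PR: null; after: shuma
```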
### Why are the changes needed?
Enabled for Hive compatibility
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
New tests.
Closes #26965 from wenfang6/enableSingleQuotesJsonForSparkSQL.
Authored-by: wenfang <wenfang@360.cn>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Make LibSVMDataSource attach an AttributeGroup.
### Why are the changes needed?
LibSVMDataSource attaches special metadata to indicate numFeatures:
```scala
scala> val data = spark.read.format("libsvm").load("/data0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
scala> data.schema("features").metadata
res0: org.apache.spark.sql.types.Metadata = {"numFeatures":4}
```
However, all ML implementations try to obtain the vector size via AttributeGroup, which cannot read this metadata:
```scala
scala> import org.apache.spark.ml.attribute._
import org.apache.spark.ml.attribute._
scala> AttributeGroup.fromStructField(data.schema("features")).size
res1: Int = -1
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
added tests
Closes #27003 from zhengruifeng/libsvm_attr_group.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Rename `spark.sql.streaming.forceDeleteTempCheckpointLocation` to `spark.sql.streaming.forceDeleteTempCheckpointLocation.enabled`.
### Why are the changes needed?
To follow the other boolean conf naming convention.
### Does this PR introduce any user-facing change?
No, as this config is newly added in 3.0.
### How was this patch tested?
Pass Jenkins.
Closes #26981 from Ngone51/SPARK-26389-FOLLOWUP.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
accuracyExpression can accept a Long, which may cause an overflow error.
accuracyExpression can accept fractions, which are implicitly floored.
accuracyExpression can accept null, which is implicitly changed to 0.
percentageExpression can accept null but causes a MatchError.
percentageExpression can accept ArrayType(_, nullable=true), in which the nulls are implicitly changed to zeros.
##### cases
```sql
select percentile_approx(10.0, 0.5, 2147483648); -- overflow and fail
select percentile_approx(10.0, 0.5, 4294967297); -- overflow but success
select percentile_approx(10.0, 0.5, null); -- null cast to 0
select percentile_approx(10.0, 0.5, 1.2); -- 1.2 cast to 1
select percentile_approx(10.0, null, 1); -- scala.MatchError
select percentile_approx(10.0, array(0.2, 0.4, null), 1); -- null cast to zero.
```
##### behavior before
```sql
+select percentile_approx(10.0, 0.5, 2147483648)
+org.apache.spark.sql.AnalysisException
+cannot resolve 'percentile_approx(10.0BD, CAST(0.5BD AS DOUBLE), CAST(2147483648L AS INT))' due to data type mismatch: The accuracy provided must be a positive integer literal (current value = -2147483648); line 1 pos 7
+
+select percentile_approx(10.0, 0.5, 4294967297)
+10.0
+
+select percentile_approx(10.0, 0.5, null)
+org.apache.spark.sql.AnalysisException
+cannot resolve 'percentile_approx(10.0BD, CAST(0.5BD AS DOUBLE), CAST(NULL AS INT))' due to data type mismatch: The accuracy provided must be a positive integer literal (current value = 0); line 1 pos 7
+
+select percentile_approx(10.0, 0.5, 1.2)
+10.0
+
+select percentile_approx(10.0, null, 1)
+scala.MatchError
+null
+
+
+select percentile_approx(10.0, array(0.2, 0.4, null), 1)
+[10.0,10.0,10.0]
```
##### behavior after
```sql
+select percentile_approx(10.0, 0.5, 2147483648)
+10.0
+
+select percentile_approx(10.0, 0.5, 4294967297)
+10.0
+
+select percentile_approx(10.0, 0.5, null)
+org.apache.spark.sql.AnalysisException
+cannot resolve 'percentile_approx(10.0BD, 0.5BD, NULL)' due to data type mismatch: argument 3 requires integral type, however, 'NULL' is of null type.; line 1 pos 7
+
+select percentile_approx(10.0, 0.5, 1.2)
+org.apache.spark.sql.AnalysisException
+cannot resolve 'percentile_approx(10.0BD, 0.5BD, 1.2BD)' due to data type mismatch: argument 3 requires integral type, however, '1.2BD' is of decimal(2,1) type.; line 1 pos 7
+
+select percentile_approx(10.0, null, 1)
+java.lang.IllegalArgumentException
+The value of percentage must be be between 0.0 and 1.0, but got null
+
+select percentile_approx(10.0, array(0.2, 0.4, null), 1)
+java.lang.IllegalArgumentException
+Each value of the percentage array must be be between 0.0 and 1.0, but got [0.2,0.4,null]
```
### Why are the changes needed?
bug fix
### Does this PR introduce any user-facing change?
Yes, this fixes some improper usages of percentile_approx, as the cases listed above show.
### How was this patch tested?
Added UTs.
Closes #26905 from yaooqinn/SPARK-30266.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>