Commit graph

3589 commits

Author SHA1 Message Date
Yesheng Ma 5e3520f7f4 [SPARK-27809][SQL] Make optional clauses order insensitive for CREATE DATABASE/VIEW SQL statement
## What changes were proposed in this pull request?

Each time I write a complex CREATE DATABASE/VIEW statement, I have to open the .g4 file to find the EXACT order of clauses. When the order is not right, I get a strange, confusing error message generated by ANTLR4.

The original g4 grammar for CREATE VIEW is
```
CREATE [OR REPLACE] [[GLOBAL] TEMPORARY] VIEW [db_name.]view_name
  [(col_name1 [COMMENT col_comment1], ...)]
  [COMMENT table_comment]
  [TBLPROPERTIES (key1=val1, key2=val2, ...)]
AS select_statement
```
The proposal is to make the following clauses order insensitive.
```
  [COMMENT table_comment]
  [TBLPROPERTIES (key1=val1, key2=val2, ...)]
```
The original g4 grammar for CREATE DATABASE is
```
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] db_name
  [COMMENT comment_text]
  [LOCATION path]
  [WITH DBPROPERTIES (key1=val1, key2=val2, ...)]
```
The proposal is to make the following clauses order insensitive.
```
  [COMMENT comment_text]
  [LOCATION path]
  [WITH DBPROPERTIES (key1=val1, key2=val2, ...)]
```
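For illustration, with this change both clause orders below should be accepted (a hedged sketch run in spark-shell; the database names, paths and properties are made up):

```scala
// Both statements describe the same kind of database; only the clause order differs.
spark.sql("""
  CREATE DATABASE IF NOT EXISTS db1
    COMMENT 'test db'
    LOCATION '/tmp/db1'
    WITH DBPROPERTIES ('k1' = 'v1')
""")
spark.sql("""
  CREATE DATABASE IF NOT EXISTS db2
    WITH DBPROPERTIES ('k1' = 'v1')
    LOCATION '/tmp/db2'
    COMMENT 'test db'
""")
```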
## How was this patch tested?

By adding new unit tests for duplicate clauses and modifying some existing unit tests to verify that those clauses are actually order insensitive.

Closes #24681 from yeshengm/create-view-parser.

Authored-by: Yesheng Ma <kimi.ysma@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-05-24 15:19:14 -07:00
maryannxue de13f70ce1 [SPARK-27824][SQL] Make rule EliminateResolvedHint idempotent
## What changes were proposed in this pull request?

This fix prevents the rule EliminateResolvedHint from being applied again if it's already applied.

## How was this patch tested?

Added new UT.

Closes #24692 from maryannxue/eliminatehint-bug.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-05-24 11:25:22 -07:00
Ryan Blue 6b28497d6f [SPARK-27732][SQL] Add v2 CreateTable implementation.
## What changes were proposed in this pull request?

This adds a v2 implementation of create table:
* `CreateV2Table` is the logical plan, named using v2 to avoid conflicting with the existing plan
* `CreateTableExec` is the physical plan

## How was this patch tested?

Added resolution and v2 SQL tests.

Closes #24617 from rdblue/SPARK-27732-add-v2-create-table.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-05-24 11:13:22 +08:00
gatorsmile f94247ec90 [SPARK-27770][SQL][PART 1] Port AGGREGATES.sql
## What changes were proposed in this pull request?

This PR is to port AGGREGATES.sql from PostgreSQL regression tests. 02ddd49932/src/test/regress/sql/aggregates.sql (L1-L143)

The expected results can be found in the link: https://github.com/postgres/postgres/blob/master/src/test/regress/expected/aggregates.out

While porting the test cases, we found three PostgreSQL-specific features that do not exist in Spark SQL.
- https://issues.apache.org/jira/browse/SPARK-27765: Type Casts: expression::type
- https://issues.apache.org/jira/browse/SPARK-27766: Data type: POINT(x, y)
- https://issues.apache.org/jira/browse/SPARK-27767: Built-in function: generate_series

We also found two bugs:
- https://issues.apache.org/jira/browse/SPARK-27768: Infinity, -Infinity, NaN should be recognized in a case insensitive manner
- https://issues.apache.org/jira/browse/SPARK-27769: Handling of sublinks within outer-level aggregates.

This PR also fixes the error message when the column can't be resolved.

For running the regression tests, this PR also added three tables `aggtest`, `onek` and `tenk1` from the PostgreSQL data sets: 02ddd49932/src/test/regress/data

## How was this patch tested?
N/A

Closes #24640 from gatorsmile/addTestCase.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2019-05-23 16:34:37 -07:00
HyukjinKwon c1e555711b Revert "Revert "[SPARK-27539][SQL] Fix inaccurate aggregate outputRows estimation with column containing null values""
This reverts commit 855399bbad.
2019-05-24 05:36:17 +09:00
HyukjinKwon 1ba4011a7f Revert "Revert "[SPARK-27351][SQL] Wrong outputRows estimation after AggregateEstimation wit…""
This reverts commit 516b0fb537.
2019-05-24 05:36:08 +09:00
Wenchen Fan 1a68fc38f0 [SPARK-27816][SQL] make TreeNode tag type safe
## What changes were proposed in this pull request?

Add type parameter to `TreeNodeTag`.
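A minimal sketch of what the typed tag buys (simplified assumptions; not the actual `TreeNode` implementation):

```scala
import scala.collection.mutable

// The tag carries a type parameter, so values set and read through it are type checked.
case class TreeNodeTag[T](name: String)

class Node {
  private val tags = mutable.Map.empty[TreeNodeTag[_], Any]
  def setTagValue[T](tag: TreeNodeTag[T], value: T): Unit = tags(tag) = value
  def getTagValue[T](tag: TreeNodeTag[T]): Option[T] = tags.get(tag).map(_.asInstanceOf[T])
}

val countTag = TreeNodeTag[Int]("count")
val node = new Node
node.setTagValue(countTag, 3)        // compiles
// node.setTagValue(countTag, "3")   // would not compile: the tag is typed as Int
```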

## How was this patch tested?

existing tests

Closes #24687 from cloud-fan/tag.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-05-23 11:53:21 -07:00
HyukjinKwon 516b0fb537 Revert "[SPARK-27351][SQL] Wrong outputRows estimation after AggregateEstimation wit…"
This reverts commit 40668c53ed.
2019-05-24 03:17:06 +09:00
HyukjinKwon 855399bbad Revert "[SPARK-27539][SQL] Fix inaccurate aggregate outputRows estimation with column containing null values"
This reverts commit 42cb4a2ccd.
2019-05-24 03:16:24 +09:00
Wenchen Fan a590a935b1 [SPARK-27806][SQL] byName/byPosition should apply to struct fields as well
## What changes were proposed in this pull request?

When writing a query to data source v2, we have two modes to resolve the input query's output: byName or byPosition.

In byName mode, we reorder the top-level columns according to their names and add type casts where possible. If the names don't match, we fail.

In byPosition mode, we don't reorder; we just add type casts directly where possible.

However, for struct-type fields we always apply byName mode. We should ignore the name difference when byPosition mode is used.
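A rough standalone illustration of the two modes (not Spark's actual analyzer code; `Col` is a made-up helper):

```scala
case class Col(name: String, values: Seq[Any])

// byName: reorder the query output to match the table columns by name; fail on a missing name.
def resolveByName(query: Seq[Col], tableColumns: Seq[String]): Seq[Col] =
  tableColumns.map { t =>
    query.find(_.name == t)
      .getOrElse(throw new IllegalArgumentException(s"cannot find column $t"))
  }

// byPosition: keep the query output order; names are ignored (casts would be added per slot).
def resolveByPosition(query: Seq[Col], tableColumns: Seq[String]): Seq[Col] = {
  require(query.length == tableColumns.length, "column count mismatch")
  query
}
```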

## How was this patch tested?

new tests

Closes #24678 from cloud-fan/write.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-23 10:37:45 -07:00
Liu Xiao bf617996aa [SPARK-27800][SQL][DOC] Fix wrong answer of example for BitwiseXor
## What changes were proposed in this pull request?

Fix example for bitwise xor function. 3 ^ 5 should be 6 rather than 2.
- See https://spark.apache.org/docs/latest/api/sql/index.html#_14

## How was this patch tested?

manual tests

Closes #24669 from alex-lx/master.

Authored-by: Liu Xiao <hhdxlx@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-21 21:52:19 -07:00
Wenchen Fan 03c9e8adee [SPARK-24586][SQL] Upcast should not allow casting from string to other types
## What changes were proposed in this pull request?

When converting a Dataset to another Dataset, Spark upcasts the fields of the original Dataset to the types of the corresponding fields in the target Dataset.

However, the current upcast behavior is a little weird: we don't allow upcasting from string to numeric types, but we do allow non-numeric target types like boolean, date, etc.

As a result, `Seq("str").toDS.as[Int]` fails, but `Seq("str").toDS.as[Boolean]` works and then throws an NPE during execution.

The motivation of the upcast is to prevent things like runtime NPEs, so it's more reasonable to make the upcast stricter.

This PR does 2 things:
1. rename `Cast.canSafeCast` to `Cast.canUpcast`, and support complex types
2. remove `Cast.mayTruncate` and replace it with `!Cast.canUpcast`

Note that the upcast change also affects persistent view resolution. But since we don't support changing column types of an existing table, there is no behavior change here.
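The behavior described above, restated as a spark-shell sketch (error messages omitted; assumes the usual `spark` session):

```scala
import spark.implicits._

Seq("str").toDS.as[Int]      // fails analysis: string cannot be upcast to int
Seq("str").toDS.as[Boolean]  // used to pass analysis and throw an NPE at runtime;
                             // with this change it also fails analysis
```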

## How was this patch tested?

new tests

Closes #21586 from cloud-fan/cast.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-05-22 11:35:51 +08:00
Wenchen Fan 1e0facb60d [SQL][DOC][MINOR] update documents for Table and WriteBuilder
## What changes were proposed in this pull request?

Update the docs to reflect the changes made by https://github.com/apache/spark/pull/24129

## How was this patch tested?

N/A

Closes #24658 from cloud-fan/comment.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-21 09:29:06 -07:00
Josh Rosen 604aa1b045 [SPARK-27786][SQL] Fix Sha1, Md5, and Base64 codegen when commons-codec is shaded
## What changes were proposed in this pull request?

When running a custom build of Spark which shades `commons-codec`, the `Sha1` expression generates code which fails to compile:

```
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 47, Column 93: A method named "sha1Hex" is not declared in any enclosing class nor any supertype, nor through a static import
```

This is caused by an interaction between Spark's code generator and the shading: the current codegen template includes the string `org.apache.commons.codec.digest.DigestUtils.sha1Hex` as part of a larger string literal, preventing JarJarLinks from being able to replace the class name with the shaded class's name. As a result, the generated code still references the original unshaded class name, triggering an error when the original unshaded dependency isn't on the classpath.

This problem impacts the `Sha1`, `Md5`, and `Base64` expressions.

To fix this problem and allow for proper shading, this PR updates the codegen templates to replace the hardcoded class names with `${classOf[<name>].getName}` calls.
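A sketch of the template change (not the exact Spark codegen template; `childCode` is a placeholder for the generated child expression code):

```scala
import org.apache.commons.codec.digest.DigestUtils

val childCode = "eval"  // placeholder for the generated child expression code
// Before: the class name sits inside a string literal, so shading tools never rewrite it.
val before = s"org.apache.commons.codec.digest.DigestUtils.sha1Hex($childCode)"
// After: the name is taken from classOf at codegen time, so a shaded build emits the shaded name.
val after = s"${classOf[DigestUtils].getName}.sha1Hex($childCode)"
```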

## How was this patch tested?

Existing tests.

To ensure that I found all occurrences of this problem, I used IntelliJ's "Find in Path" to search for lines matching the regex `^(?!import|package).*(org|com|net|io)\.(?!apache\.spark)` and then filtered matches to inspect only non-test "Usage in string constants" cases. This isn't _perfect_ but I think it'll catch most cases.

Closes #24655 from JoshRosen/fix-shaded-apache-commons.

Authored-by: Josh Rosen <rosenville@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-05-21 21:18:34 +08:00
Wenchen Fan 0e6601acdf [SPARK-27747][SQL] add a logical plan link in the physical plan
## What changes were proposed in this pull request?

It's pretty useful if we can convert a physical plan back to a logical plan, e.g., in https://github.com/apache/spark/pull/24389

This PR introduces a new feature to `TreeNode`, which allows `TreeNode` to carry some extra information via a mutable map, and keep the information when it's copied.

The planner leverages this feature to put the logical plan into the physical plan.

## How was this patch tested?

a test suite that runs all TPCDS queries and checks that some common physical plans contain the corresponding logical plans.

Closes #24626 from cloud-fan/link.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Peng Bo <bo.peng1019@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-05-20 13:42:25 -07:00
Ryan Blue bc46feaced [SPARK-27693][SQL] Add default catalog property
Add a SQL config property for the default v2 catalog.

Existing tests for regressions.

Closes #24594 from rdblue/SPARK-27693-add-default-catalog-config.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-19 21:30:52 -07:00
HyukjinKwon 2431ab0999 [SPARK-27771][SQL] Add SQL description for grouping functions (cube, rollup, grouping and grouping_id)
## What changes were proposed in this pull request?

Both appear to have been added as of 2.0 (see SPARK-12541 and SPARK-12706). I referred to existing docs and examples in other API docs.

## How was this patch tested?

Manually built the documentation and verified it by running the examples and `DESCRIBE FUNCTION EXTENDED`.
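For example, one of the new descriptions can be checked like this (assuming a spark-shell session):

```scala
spark.sql("DESCRIBE FUNCTION EXTENDED grouping_id").show(truncate = false)
```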

Closes #24642 from HyukjinKwon/SPARK-27771.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-19 19:26:20 -07:00
Wenchen Fan fc5bd6da77 [SPARK-27576][SQL] table capability to skip the output column resolution
## What changes were proposed in this pull request?

Currently we have an analyzer rule, which resolves the output columns of data source v2 writing plans, to make sure the schema of input query is compatible with the table.

However, not all data sources need this check. For example, the `NoopDataSource` doesn't care about the schema of input query at all.

This PR introduces a new table capability: ACCEPT_ANY_SCHEMA. If a table reports this capability, we skip resolving output columns for it during write.

Note that we already skip resolving output columns for `NoopDataSource` because it implements `SupportsSaveMode`. However, `SupportsSaveMode` is a hack and will be removed soon.

## How was this patch tested?

new test cases

Closes #24469 from cloud-fan/schema-check.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-16 16:24:53 -07:00
Shixiong Zhu 6a317c8f01 [SPARK-27735][SS] Parsing interval string should be case-insensitive in SS
## What changes were proposed in this pull request?

Some APIs in Structured Streaming require the user to specify an interval. Right now these APIs don't accept upper-case strings.

This PR adds a new method `fromCaseInsensitiveString` to `CalendarInterval` to support parsing upper-case strings, and fixes all APIs that need to parse an interval string.
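Roughly, the new method should accept both casings (a sketch; `fromCaseInsensitiveString` is the method named above, assumed here to be a static factory on `CalendarInterval`):

```scala
import org.apache.spark.unsafe.types.CalendarInterval

// Previously only the lower-case form parsed; with this change both should work.
CalendarInterval.fromCaseInsensitiveString("1 day")
CalendarInterval.fromCaseInsensitiveString("1 DAY")
```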

## How was this patch tested?

The new unit test.

Closes #24619 from zsxwing/SPARK-27735.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-16 13:58:27 -07:00
Wenchen Fan 3e30a98810 [SPARK-27674][SQL] the hint should not be dropped after cache lookup
## What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/20365 .

#20365 fixed this problem when the hint node is a root node. This PR fixes this problem for all the cases.

## How was this patch tested?

a new test

Closes #24580 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-05-15 15:47:52 -07:00
Xingbo Jiang 0bba5cf568 [SPARK-20774][SPARK-27036][SQL] Cancel the running broadcast execution on BroadcastTimeout
## What changes were proposed in this pull request?

In the existing code, a broadcast execution timeout for the Future only causes a query failure, but the job running with the broadcast and the computation in the Future are not canceled. This wastes resources and slows down the other jobs. This PR tries to cancel both the running job and the running hashed relation construction thread.

## How was this patch tested?

Add new test suite `BroadcastExchangeExec`

Closes #24595 from jiangxb1987/SPARK-20774.

Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-05-15 14:47:15 -07:00
Sean Owen bfb3ffe9b3 [SPARK-27682][CORE][GRAPHX][MLLIB] Replace use of collections and methods that will be removed in Scala 2.13 with work-alikes
## What changes were proposed in this pull request?

This replaces use of collection classes like `MutableList` and `ArrayStack` with work-alikes that are available in 2.12, as they will be removed in 2.13. It also removes use of `.to[Collection]`, as its use was superfluous anyway. Removing `collection.breakOut` will have to wait until 2.13.
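One example of the kind of replacement involved (illustrative only, not taken from the actual diff):

```scala
import scala.collection.mutable

// Before (MutableList is removed in Scala 2.13):
//   val buffer = new mutable.MutableList[Int]()
// A work-alike that exists in both 2.12 and 2.13:
val buffer = new mutable.ListBuffer[Int]()
buffer += 1
buffer += 2
println(buffer.toList)  // List(1, 2)
```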

## How was this patch tested?

Existing tests

Closes #24586 from srowen/SPARK-27682.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-05-15 09:29:12 -05:00
xy_xin fd9acf23b0 [SPARK-27713][SQL] Move org.apache.spark.sql.execution.* in catalyst to core
## What changes were proposed in this pull request?

`RecordBinaryComparator`, `UnsafeExternalRowSorter` and `UnsafeKeyValueSorter` currently live in catalyst but should be moved to core, as they're used only in physical plans.

## How was this patch tested?

Existing tests.

Closes #24607 from xianyinxin/SPARK-27713.

Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-05-15 15:24:21 +08:00
Ryan Blue 2da5b21834 [SPARK-24923][SQL] Implement v2 CreateTableAsSelect
## What changes were proposed in this pull request?

This adds a v2 implementation for CTAS queries; an example of the resulting SQL form is sketched after the list below.

* Update the SQL parser to parse CREATE queries using multi-part identifiers
* Update `CheckAnalysis` to validate partitioning references with the CTAS query schema
* Add `CreateTableAsSelect` v2 logical plan and `CreateTableAsSelectExec` v2 physical plan
* Update create conversion from `CreateTableAsSelectStatement` to support the new v2 logical plan
* Update `DataSourceV2Strategy` to convert v2 CTAS logical plan to the new physical plan
* Add `findNestedField` to `StructType` to support reference validation
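A hypothetical example of the SQL form handled by the new plans (the catalog, namespace and table names are made up):

```scala
spark.sql("""
  CREATE TABLE testcat.ns1.ns2.table_a
  USING foo
  PARTITIONED BY (id)
  AS SELECT id, data FROM source
""")
```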

## How was this patch tested?

We have been running these changes in production for several months. Also:

* Add a test suite `CreateTablePartitioningValidationSuite` for new analysis checks
* Add a test suite for v2 SQL, `DataSourceV2SQLSuite`
* Update catalyst `DDLParserSuite` to use multi-part identifiers (`Seq[String]`)
* Add test cases to `PlanResolutionSuite` for v2 CTAS: known catalog and v2 source implementation

Closes #24570 from rdblue/SPARK-24923-add-v2-ctas.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-05-15 11:24:03 +08:00
Sean Owen a10608cb82 [SPARK-27680][CORE][SQL][GRAPHX] Remove usage of Traversable
## What changes were proposed in this pull request?

This removes usage of `Traversable`, which is removed in Scala 2.13. This is mostly an internal change, except for the change in the `SparkConf.setAll` method. See additional comments below.
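The `SparkConf.setAll` change mentioned above has roughly this shape (a simplified sketch, not the actual method body):

```scala
// Before (Traversable is removed in Scala 2.13):
//   def setAll(settings: Traversable[(String, String)]): SparkConf
// After: the parameter type becomes Iterable, which exists in both 2.12 and 2.13.
def setAll(settings: Iterable[(String, String)]): Unit =
  settings.foreach { case (k, v) => println(s"$k=$v") }  // stand-in for storing the settings
```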

## How was this patch tested?

Existing tests.

Closes #24584 from srowen/SPARK-27680.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-05-14 09:14:56 -05:00
mingbo.pb 66f5a42ca5 [SPARK-27638][SQL] Cast string to date/timestamp in binary comparisons with dates/timestamps
## What changes were proposed in this pull request?

The example below works with both MySQL and Hive, but not with Spark.

```
mysql> select * from date_test where date_col >= '2000-1-1';
+------------+
| date_col   |
+------------+
| 2000-01-01 |
+------------+
```
The reason is that Spark casts both sides to String type during date and string comparison for partial date support. Please find more details in https://issues.apache.org/jira/browse/SPARK-8420.

Based on some tests, the behavior of date and string comparison in Hive and MySQL is:
- Hive: cast to Date; partial dates are not supported.
- MySQL: cast to Date; certain "partial dates" are supported via specific date-string parse rules. Check out str_to_datetime in https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c

Since the date patterns below are supported, this PR casts the string to a date when comparing a string with a date:
```
`yyyy`
`yyyy-[m]m`
`yyyy-[m]m-[d]d`
`yyyy-[m]m-[d]d `
`yyyy-[m]m-[d]d *`
`yyyy-[m]m-[d]dT*`
```
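With this change, the MySQL example from above should also work in Spark (the table and column exist only for illustration; assumes a spark-shell session):

```scala
spark.sql("SELECT * FROM date_test WHERE date_col >= '2000-1-1'").show()
```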

## How was this patch tested?
UT has been added

Closes #24567 from pengbo/SPARK-27638.

Authored-by: mingbo.pb <mingbo.pb@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-05-14 17:10:36 +08:00
Liang-Chi Hsieh 8b0bdaa8e0 [SPARK-27671][SQL] Fix error when casting from a nested null in a struct
## What changes were proposed in this pull request?

Currently, when there is a null in a nested field of a struct, casting from the struct throws an error.

```scala
scala> sql("select cast(struct(1, null) as struct<a:int,b:int>)").show
scala.MatchError: NullType (of class org.apache.spark.sql.types.NullType$)
  at org.apache.spark.sql.catalyst.expressions.Cast.castToInt(Cast.scala:447)
  at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:635)
  at org.apache.spark.sql.catalyst.expressions.Cast.$anonfun$castStruct$1(Cast.scala:603)
```

Similarly, an inline table, which casts nulls in nested fields under the hood, also throws an error.

```scala
scala> sql("select * FROM VALUES (('a', (10, null))), (('b', (10, 50))), (('c', null)) AS tab(x, y)").show
org.apache.spark.sql.AnalysisException: failed to evaluate expression named_struct('col1', 10, 'col2', NULL): NullType (of class org.apache.spark.sql.types.NullType$); line 1 pos 14
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:47)
  at org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$convert$6(ResolveInlineTables.scala:106)
```

This fixes the issue.
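After the fix, the two examples above run without error (spark-shell sketch):

```scala
spark.sql("SELECT CAST(struct(1, null) AS struct<a:int,b:int>)").show()
spark.sql(
  "SELECT * FROM VALUES (('a', (10, null))), (('b', (10, 50))), (('c', null)) AS tab(x, y)").show()
```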

## How was this patch tested?

Added tests.

Closes #24576 from viirya/cast-null.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-13 12:40:46 -07:00
Liang-Chi Hsieh d169b0aac3 [SPARK-27653][SQL] Add max_by() and min_by() SQL aggregate functions
## What changes were proposed in this pull request?

This PR adds the `max_by()` and `min_by()` SQL aggregate functions.

Quoting from the [Presto docs](https://prestodb.github.io/docs/current/functions/aggregate.html#max_by)

> max_by(x, y) → [same as x]
> Returns the value of x associated with the maximum value of y over all input values.

`min_by()` works similarly.
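A small usage sketch with made-up data (assuming a spark-shell session):

```scala
spark.sql("""
  SELECT max_by(name, salary) AS highest_paid,
         min_by(name, salary) AS lowest_paid
  FROM VALUES ('a', 10), ('b', 50), ('c', 20) AS employees(name, salary)
""").show()
// highest_paid = 'b' (salary 50), lowest_paid = 'a' (salary 10)
```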

## How was this patch tested?

Added tests.

Closes #24557 from viirya/SPARK-27653.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-05-13 22:37:34 +08:00
zhoukang 126310ca68 [SPARK-26601][SQL] Make broadcast-exchange thread pool configurable
## What changes were proposed in this pull request?

Currently, the thread count of the broadcast-exchange thread pool is fixed, and keepAliveSeconds is also fixed at 60s.

```
object BroadcastExchangeExec {
  private[execution] val executionContext = ExecutionContext.fromExecutorService(
    ThreadUtils.newDaemonCachedThreadPool("broadcast-exchange", 128))
}

 /**
   * Create a cached thread pool whose max number of threads is `maxThreadNumber`. Thread names
   * are formatted as prefix-ID, where ID is a unique, sequentially assigned integer.
   */
  def newDaemonCachedThreadPool(
      prefix: String, maxThreadNumber: Int, keepAliveSeconds: Int = 60): ThreadPoolExecutor = {
    val threadFactory = namedThreadFactory(prefix)
    val threadPool = new ThreadPoolExecutor(
      maxThreadNumber, // corePoolSize: the max number of threads to create before queuing the tasks
      maxThreadNumber, // maximumPoolSize: because we use LinkedBlockingDeque, this one is not used
      keepAliveSeconds,
      TimeUnit.SECONDS,
      new LinkedBlockingQueue[Runnable],
      threadFactory)
    threadPool.allowCoreThreadTimeOut(true)
    threadPool
  }
```

But sometimes, if the Thread objects are not GC'd quickly, this may cause a server (driver) OOM. In such cases, we need to make this thread pool configurable.
A case has described in https://issues.apache.org/jira/browse/SPARK-26601

## How was this patch tested?
UT

Closes #23670 from caneGuy/zhoukang/make-broadcat-config.

Authored-by: zhoukang <zhoukang199191@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-05-13 20:40:21 +09:00
HyukjinKwon c71f217de1 [SPARK-27673][SQL] Add since info to random, regex, null expressions
## What changes were proposed in this pull request?

We should add since info to all expressions.

SPARK-7886 Rand / Randn
af3746ce0d RLike, Like (I manually checked that it exists from 1.0.0)
SPARK-8262 Split
SPARK-8256 RegExpReplace
SPARK-8255 RegExpExtract
9aadcffabd Coalesce / IsNull / IsNotNull (I manually checked that it exists from 1.0.0)
SPARK-14541 IfNull / NullIf / Nvl / Nvl2
SPARK-9080 IsNaN
SPARK-9168 NaNvl

## How was this patch tested?

N/A

Closes #24579 from HyukjinKwon/SPARK-27673.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-10 09:24:04 -07:00
HyukjinKwon 3442fcaa9b [SPARK-27672][SQL] Add since info to string expressions
## What changes were proposed in this pull request?

This PR adds since information to all the string expressions below:

SPARK-8241 ConcatWs
SPARK-16276 Elt
SPARK-1995 Upper / Lower
SPARK-20750 StringReplace
SPARK-8266 StringTranslate
SPARK-8244 FindInSet
SPARK-8253 StringTrimLeft
SPARK-8260 StringTrimRight
SPARK-8267 StringTrim
SPARK-8247 StringInstr
SPARK-8264 SubstringIndex
SPARK-8249 StringLocate
SPARK-8252 StringLPad
SPARK-8259 StringRPad
SPARK-16281 ParseUrl
SPARK-9154 FormatString
SPARK-8269 Initcap
SPARK-8257 StringRepeat
SPARK-8261 StringSpace
SPARK-8263 Substring
SPARK-21007 Right
SPARK-21007 Left
SPARK-8248 Length
SPARK-20749 BitLength
SPARK-20749 OctetLength
SPARK-8270 Levenshtein
SPARK-8271 SoundEx
SPARK-8238 Ascii
SPARK-20748 Chr
SPARK-8239 Base64
SPARK-8268 UnBase64
SPARK-8242 Decode
SPARK-8243 Encode
SPARK-8245 format_number
SPARK-16285 Sentences

## How was this patch tested?

N/A

Closes #24578 from HyukjinKwon/SPARK-27672.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-10 09:11:12 -07:00
Marco Gaido 78748b5752 [SPARK-27625][SQL] ScalaReflection support for annotated types
## What changes were proposed in this pull request?

If a type is annotated, `ScalaReflection` can fail if the data type is an `Option`, a `Seq`, a `Map`, or another similar type. This is because it assumes we are dealing with `TypeRef`, while types with annotations are `AnnotatedType`.

The PR handles the case where the annotation is present.
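A hypothetical example of the shape that used to fail (the annotation class is made up; `Encoders.product` is just one entry point into `ScalaReflection`):

```scala
import org.apache.spark.sql.Encoders

class keep extends scala.annotation.StaticAnnotation

case class Foo(a: Option[Int] @keep)

// Deriving an encoder walks the type via ScalaReflection; the annotated Option field
// used to fail because its type is an AnnotatedType rather than a plain TypeRef.
val encoder = Encoders.product[Foo]
```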

## How was this patch tested?

added UT

Closes #24564 from mgaido91/SPARK-27625.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-05-10 22:48:36 +08:00
pgandhi 0969d7aa0c [SPARK-27207][SQL] : Ensure aggregate buffers are initialized again for So…
…rtBasedAggregate

Normally, the aggregate operations invoked on an aggregation buffer for User Defined Aggregate Functions (UDAF) follow an order like initialize(), update(), eval() or initialize(), merge(), eval(). However, after a certain threshold configurable by spark.sql.objectHashAggregate.sortBased.fallbackThreshold is reached, ObjectHashAggregate falls back to SortBasedAggregator, which invokes the merge or update operation without calling initialize() on the aggregate buffer.

## What changes were proposed in this pull request?

The fix here is to initialize aggregate buffers again when the fallback to the SortBasedAggregate operator happens.

## How was this patch tested?

The patch was tested as part of [SPARK-24935](https://issues.apache.org/jira/browse/SPARK-24935) as documented in PR https://github.com/apache/spark/pull/23778.

Closes #24149 from pgandhi999/SPARK-27207.

Authored-by: pgandhi <pgandhi@verizonmedia.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-05-09 11:12:20 +08:00
Jose Torres 83f628b57d [SPARK-27253][SQL][FOLLOW-UP] Add a legacy flag to restore old session init behavior
## What changes were proposed in this pull request?

Add a legacy flag to restore the old session init behavior, where SparkConf defaults take precedence over configs in a parent session.

Closes #24540 from jose-torres/oss.

Authored-by: Jose Torres <torres.joseph.f+github@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-07 20:04:09 -07:00
Ryan Blue 303ee3fce0 [SPARK-24252][SQL] Add TableCatalog API
## What changes were proposed in this pull request?

This adds the TableCatalog API proposed in the [Table Metadata API SPIP](https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d).

For `TableCatalog` to use `Table`, it needed to be moved into the catalyst module where the v2 catalog API is located. This also required moving `TableCapability`. Most of the files touched by this PR are import changes needed by this move.

## How was this patch tested?

This adds a test implementation and contract tests.

Closes #24246 from rdblue/SPARK-24252-add-table-catalog-api.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-05-08 10:31:06 +08:00
Adi Muraru 8ef4da753d [SPARK-27610][YARN] Shade netty native libraries
## What changes were proposed in this pull request?

Fixed the `spark-<version>-yarn-shuffle.jar` artifact packaging to shade the native netty libraries:
- shade the `META-INF/native/libnetty_*` native libraries when packaging the yarn shuffle service jar. This is required as the netty library loader derives the native library name from the shaded package name.
- updated the `org/spark_project` shade package prefix to `org/sparkproject`
(i.e. removed underscore) as the former breaks the netty native lib loading.

This was causing the yarn external shuffle service to fail
when spark.shuffle.io.mode=EPOLL

## How was this patch tested?
Manual tests

Closes #24502 from amuraru/SPARK-27610_master.

Authored-by: Adi Muraru <amuraru@adobe.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-05-07 10:47:36 -07:00
gaoweikang 3859ca37d9 [SPARK-27586][SQL] Improve binary comparison: replace Scala's for-comprehension if statements with while loop
## What changes were proposed in this pull request?

This PR replaces the for-comprehension/if statement with a while loop to gain better performance in `TypeUtils.compareBinary`.
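A sketch of the while-loop shape described (not the actual `TypeUtils` code; unsigned byte comparison is assumed):

```scala
def compareBinary(x: Array[Byte], y: Array[Byte]): Int = {
  var i = 0
  val len = math.min(x.length, y.length)
  while (i < len) {
    // compare bytes as unsigned values
    val diff = (x(i) & 0xff) - (y(i) & 0xff)
    if (diff != 0) return diff
    i += 1
  }
  x.length - y.length
}
```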

## How was this patch tested?

Add UT to test old version and new version comparison result

Closes #24494 from woudygao/opt_binary_compare.

Authored-by: gaoweikang <gaoweikang@bytedance.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-02 20:33:27 -07:00
Marco Gaido 7a8cc8e071 [SPARK-27607][SQL] Improve Row.toString performance
## What changes were proposed in this pull request?

`Row.toString` currently causes the needless creation of an `Array` containing all the values in the row before generating the string representation. This operation adds considerable overhead.

The PR proposes to avoid this operation in order to get a faster implementation.

## How was this patch tested?

Run

```scala
test("Row toString perf test") {
    val n = 100000
    val rows = (1 to n).map { i =>
      Row(i, i.toDouble, i.toString, i.toShort, true, null)
    }
    // warmup
    (1 to 10).foreach { _ => rows.foreach(_.toString) }

    val times = (1 to 100).map { _ =>
      val t0 = System.nanoTime()
      rows.foreach(_.toString)
      val t1 = System.nanoTime()
      t1 - t0
    }
    // scalastyle:off println
    println(s"Avg time on ${times.length} iterations for $n toString:" +
      s" ${times.sum.toDouble / times.length / 1e6} ms")
    // scalastyle:on println
  }
```
Before the PR:
```
Avg time on 100 iterations for 100000 toString: 61.08408419 ms
```
After the PR:
```
Avg time on 100 iterations for 100000 toString: 38.16539432 ms
```
This means the new implementation is about 1.60X faster than the original one.

Closes #24505 from mgaido91/SPARK-27607.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-02 07:20:33 -07:00
HyukjinKwon df8aa7ba8a [SPARK-27606][SQL] Deprecate 'extended' field in ExpressionDescription/ExpressionInfo
## What changes were proposed in this pull request?

After we added other fields, `arguments`, `examples`, `note` and `since` at SPARK-21485 and `deprecated` at SPARK-27328, we have nicer way to separately describe extended usages.

The `extended` field and method in `ExpressionDescription`/`ExpressionInfo` are now pretty useless: they're not used on the Spark side and only exist to keep backward compatibility.

This PR proposes to deprecate it.

## How was this patch tested?

Manually checked that the deprecation warning is properly shown.

Closes #24500 from HyukjinKwon/SPARK-27606.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-05-02 21:10:00 +09:00
gatorsmile 2da406cae5 [SPARK-27618][SQL][FOLLOW-UP] Unnecessary access to externalCatalog
## What changes were proposed in this pull request?
This PR is to add test cases for ensuring that we do not have unnecessary access to externalCatalog.

In the future, we can follow these examples to improve our test coverage in this area.

## How was this patch tested?
N/A

Closes #24511 from gatorsmile/addTestcaseSpark-27618.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-01 20:09:46 -07:00
HyukjinKwon 3670826af6 [SPARK-26921][R][DOCS] Document Arrow optimization and vectorized R APIs
## What changes were proposed in this pull request?

This PR adds SparkR with Arrow optimization documentation.

Note that it looks like the CRAN issue on the Arrow side won't be fixed soon, IMHO, even after Spark 3.0.
If it happens to be fixed, I will fix this doc too later.

Another note is that the Arrow R package itself requires R 3.5+. So, I intentionally didn't note this.

## How was this patch tested?

Manually built and checked.

Closes #24506 from HyukjinKwon/SPARK-26924.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-05-02 10:02:14 +09:00
Artem Kalchenko a35043c9e2 [SPARK-27591][SQL] Fix UnivocityParser for UserDefinedType
## What changes were proposed in this pull request?

Fix a bug in UnivocityParser: the makeConverter method didn't work correctly for UserDefinedType.

## How was this patch tested?

A test suite for UnivocityParser has been extended.

Closes #24496 from kalkolab/spark-27591.

Authored-by: Artem Kalchenko <artem.kalchenko@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-05-01 08:27:51 +09:00
Xiangrui Meng 618d6bff71 [SPARK-27588] Binary file data source fails fast and doesn't attempt to read very large files
## What changes were proposed in this pull request?

If a file is too big (>2GB), we should fail fast and not try to read the file.

## How was this patch tested?


Closes #24483 from mengxr/SPARK-27588.

Authored-by: Xiangrui Meng <meng@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2019-04-29 16:24:49 -07:00
Sean Owen 8a17d26784 [SPARK-27536][CORE][ML][SQL][STREAMING] Remove most use of scala.language.existentials
## What changes were proposed in this pull request?

I want to get rid of as much use of `scala.language.existentials` as possible for 3.0. It's a complicated language feature that generates warnings unless this value is imported. It might even be on the way out of Scala: https://contributors.scala-lang.org/t/proposal-to-remove-existential-types-from-the-language/2785

For Spark, it comes up mostly where the code plays fast and loose with generic types, not the advanced situations you'll often see referenced where this feature is explained. For example, it comes up in cases where a function returns something like `(String, Class[_])`. Scala doesn't like matching this to any other instance of `(String, Class[_])` because doing so requires inferring the existence of some type that satisfies both. Seems obvious if the generic type is a wildcard, but, not technically something Scala likes to let you get away with.

This is a large PR, and it only gets rid of _most_ instances of `scala.language.existentials`. The change should be all compile-time and shouldn't affect APIs or logic.

Many of the changes simply touch up sloppiness about generic types, making the known correct value explicit in the code.

Some fixes involve being more explicit about the existence of generic types in methods. For instance, `def foo(arg: Class[_])` seems innocent enough but should really be declared `def foo[T](arg: Class[T])` to let Scala select and fix a single type when evaluating calls to `foo`.
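The pattern described above, in a tiny standalone form (names are illustrative):

```scala
// Before: the wildcard leaks an existential type to every caller.
def instantiateOld(clazz: Class[_]): Any = clazz.getConstructor().newInstance()

// After: a type parameter lets Scala fix one concrete type per call site.
def instantiate[T](clazz: Class[T]): T = clazz.getConstructor().newInstance()

val sb: StringBuilder = instantiate(classOf[StringBuilder])  // no cast needed
```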

For kind of surprising reasons, this comes up in places where code evaluates a tuple of things that involve a generic type, but is OK if the two parts of the tuple are evaluated separately.

One key change was altering `Utils.classForName(...): Class[_]` to the more correct `Utils.classForName[T](...): Class[T]`. This caused a number of small but positive changes to callers that otherwise had to cast the result.

In several tests, `Dataset[_]` was used where `DataFrame` seems to be the clear intent.

Finally, in a few cases in MLlib, the return type `this.type` was used where there are no subclasses of the class that uses it. This really isn't needed and causes issues for Scala reasoning about the return type. These are just changed to be concrete classes as return types.

After this change, we have only a few classes that still import `scala.language.existentials` (because modifying them would require extensive rewrites to fix) and no build warnings.

## How was this patch tested?

Existing tests.

Closes #24431 from srowen/SPARK-27536.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-04-29 11:02:01 -05:00
Jash Gala 90085a1847 [SPARK-23619][DOCS] Add output description for some generator expressions / functions
## What changes were proposed in this pull request?

This PR addresses SPARK-23619: https://issues.apache.org/jira/browse/SPARK-23619

It adds additional comments indicating the default column names for the `explode` and `posexplode` functions in Spark SQL; the defaults are illustrated after the list below.

Functions for which comments have been updated so far:
* stack
* inline
* explode
* posexplode
* explode_outer
* posexplode_outer
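The default names in question, shown for two of the functions above (spark-shell sketch):

```scala
spark.sql("SELECT explode(array(10, 20))").printSchema()     // single output column: col
spark.sql("SELECT posexplode(array(10, 20))").printSchema()  // output columns: pos, col
```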

## How was this patch tested?

This is just a change in the comments. The package builds and tests successfully after the change.

Closes #23748 from jashgala/SPARK-23619.

Authored-by: Jash Gala <jashgala@amazon.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-04-27 10:30:12 +09:00
uncleGen 6328be78f9 [MINOR][TEST][DOC] Execute action miss name message
## What changes were proposed in this pull request?

Some minor updates:
- the `Execute` action misses the `name` message
- a typo in the SS document
- a typo in SQLConf

## How was this patch tested?

N/A

Closes #24466 from uncleGen/minor-fix.

Authored-by: uncleGen <hustyugm@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-04-27 09:28:31 +08:00
Liang-Chi Hsieh 8b86326521 [SPARK-27551][SQL] Improve error message of mismatched types for CASE WHEN
## What changes were proposed in this pull request?

When there are mismatched types among the CASE branches or the ELSE value in a CASE WHEN expression, the current error message makes it hard to figure out what the mismatch is and where it occurs.

This patch simply improves the error message for mismatched types in CASE WHEN.

Before:
```scala
scala> spark.range(100).select(when('id === 1, array(struct('id * 123456789 + 123456789 as "x"))).otherwise(array(struct('id * 987654321 + 987654321 as "y"))))
org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS BIGINT)) + CAST(123456789 AS BIGINT)))) ELSE array(named_struct('y', ((`id` * CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT)))) END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;;
```

After:
```scala
scala> spark.range(100).select(when('id === 1, array(struct('id * 123456789 + 123456789 as "x"))).otherwise(array(struct('id * 987654321 + 987654321 as "y"))))
org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`id` = CAST(1 AS BIGINT)) THEN array(named_struct('x', ((`id` * CAST(123456789 AS BIGINT)) + CAST(123456789 AS BIGINT)))) ELSE array(named_struct('y', ((`id` * CAST(987654321 AS BIGINT)) + CAST(987654321 AS BIGINT)))) END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type, got CASE WHEN ... THEN array<struct<x:bigint>> ELSE array<struct<y:bigint>> END;;
```

## How was this patch tested?

Added unit test.

Closes #24453 from viirya/SPARK-27551.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-04-25 08:47:19 -07:00
HyukjinKwon a30983db57 [SPARK-27512][SQL] Avoid to replace ',' in CSV's decimal type inference for backward compatibility
## What changes were proposed in this pull request?

The code below currently infers the column as decimal, but previously it was inferred as string.

**In branch-2.4**, the type inference path for decimal and the data parsing path are different.

2a8343121e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala (L153)

c284c4e1f6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala (L125)

So the code below:

```scala
scala> spark.read.option("delimiter", "|").option("inferSchema", "true").csv(Seq("1,2").toDS).printSchema()
```

produced string as its type.

```
root
 |-- _c0: string (nullable = true)
```

**In the current master**, it now infers decimal as below:

```
root
 |-- _c0: decimal(2,0) (nullable = true)
```

This happened after https://github.com/apache/spark/pull/22979 because, after that PR, we only have one way to parse decimals:

7a83d71403/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala (L92)

**After the fix:**

```
root
 |-- _c0: string (nullable = true)
```

This PR proposes to restore the previous behaviour back in `CSVInferSchema`.

## How was this patch tested?

Manually tested and unit tests were added.

Closes #24437 from HyukjinKwon/SPARK-27512.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-04-24 16:22:07 +09:00
Gengliang Wang 00f2f311f7 [SPARK-27128][SQL] Migrate JSON to File Data Source V2
## What changes were proposed in this pull request?
Migrate JSON to File Data Source V2

## How was this patch tested?

Unit test

Closes #24058 from gengliangwang/jsonV2.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-04-23 22:39:59 +08:00
pengbo d9b2ce0f0f [SPARK-27539][SQL] Fix inaccurate aggregate outputRows estimation with column containing null values
## What changes were proposed in this pull request?
This PR is a follow-up of https://github.com/apache/spark/pull/24286. As gatorsmile pointed out, the estimate for a column with null values is inaccurate as well.

```
> select key from test;
2
NULL
1
spark-sql> desc extended test key;
col_name key
data_type int
comment NULL
min 1
max 2
num_nulls 1
distinct_count 2
```

The distinct count should be distinct_count + 1 when the column contains null values.
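A sketch of the adjustment described (simplified; not the actual estimation code):

```scala
// Treat null as one extra distinct group when estimating aggregate output rows.
def estimatedGroups(distinctCount: Long, nullCount: Long): Long =
  if (nullCount > 0) distinctCount + 1 else distinctCount

// For the example column above: distinct_count = 2, num_nulls = 1 => 3 groups.
```
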
## How was this patch tested?

Existing tests & new UT added.

Closes #24436 from pengbo/aggregation_estimation.

Authored-by: pengbo <bo.peng1019@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-04-22 20:30:08 -07:00