ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Herman van Hovell	05af2de0fd	[SPARK-21830][SQL] Bump ANTLR version and fix a few issues. ## What changes were proposed in this pull request? This PR bumps the ANTLR version to 4.7, and fixes a number of small parser related issues uncovered by the bump. The main reason for upgrading is that in some cases the current version of ANTLR (4.5) can exhibit exponential slowdowns if it needs to parse boolean predicates. For example the following query will take forever to parse: ```sql SELECT * FROM RANGE(1000) WHERE TRUE AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' AND NOT upper(DESCRIPTION) LIKE '%FOO%' ``` This is caused by a known bug in ANTLR (https://github.com/antlr/antlr4/issues/994), which was fixed in version 4.6. ## How was this patch tested? Existing tests. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #19042 from hvanhovell/SPARK-21830.	2017-08-24 16:33:55 -07:00
Liang-Chi Hsieh	183d4cb71f	[SPARK-21759][SQL] In.checkInputDataTypes should not wrongly report unresolved plans for IN correlated subquery ## What changes were proposed in this pull request? With the check for structural integrity proposed in SPARK-21726, it is found that the optimization rule `PullupCorrelatedPredicates` can produce unresolved plans. For a correlated IN query that looks like: SELECT t1.a FROM t1 WHERE t1.a IN (SELECT t2.c FROM t2 WHERE t1.b < t2.d); The query plan might look like: Project [a#0] +- Filter a#0 IN (list#4 [b#1]) : +- Project [c#2] : +- Filter (outer(b#1) < d#3) : +- LocalRelation <empty>, [c#2, d#3] +- LocalRelation <empty>, [a#0, b#1] After `PullupCorrelatedPredicates`, it produces a query plan like: 'Project [a#0] +- 'Filter a#0 IN (list#4 [(b#1 < d#3)]) : +- Project [c#2, d#3] : +- LocalRelation <empty>, [c#2, d#3] +- LocalRelation <empty>, [a#0, b#1] Because the correlated predicate involves another attribute `d#3` in the subquery, it has been pulled out and added into the `Project` on top of the subquery. When `list` in `In` contains just one `ListQuery`, `In.checkInputDataTypes` checks whether the number of `value` expressions matches the output size of the subquery. In the above example, there is only one `value` expression and the subquery output has two attributes `c#2, d#3`, so it fails the check and `In.resolved` returns `false`. We should not let `In.checkInputDataTypes` wrongly report unresolved plans and fail the structural integrity check. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18968 from viirya/SPARK-21759.	2017-08-24 21:46:58 +08:00
Takuya UESHIN	9e33954ddf	[SPARK-21745][SQL] Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce WritableColumnVector. ## What changes were proposed in this pull request? This is a refactoring of `ColumnVector` hierarchy and related classes. 1. make `ColumnVector` read-only 2. introduce `WritableColumnVector` with write interface 3. remove `ReadOnlyColumnVector` ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #18958 from ueshin/issues/SPARK-21745.	2017-08-24 21:13:44 +08:00
Jen-Ming Chung	95713eb4f2	[SPARK-21804][SQL] json_tuple returns null values within repeated columns except the first one ## What changes were proposed in this pull request? When json_tuple extracts values from JSON, it returns null values for repeated columns except the first one, as below: ``` scala scala> spark.sql("""SELECT json_tuple('{"a":1, "b":2}', 'a', 'b', 'a')""").show() +---+---+----+ | c0| c1| c2| +---+---+----+ | 1| 2|null| +---+---+----+ ``` I think this should be consistent with Hive's implementation: ``` hive> SELECT json_tuple('{"a": 1, "b": 2}', 'a', 'a'); ... 1 1 ``` In this PR, we locate all the matched indices in `fieldNames` instead of returning only the first matched index, i.e., indexOf. ## How was this patch tested? Added test in JsonExpressionsSuite. Author: Jen-Ming Chung <jenmingisme@gmail.com> Closes #19017 from jmchung/SPARK-21804.	2017-08-24 19:24:00 +09:00
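The difference between the first-index lookup and the all-indices lookup described in SPARK-21804 can be sketched in plain Python. This is an illustration only, not Spark's implementation: the function names are hypothetical, and values are returned as parsed JSON values rather than strings.

```python
import json


def json_tuple_first_index(doc, field_names):
    """Old behaviour (sketch): a JSON key fills only the first matching column."""
    row = [None] * len(field_names)
    for key, value in json.loads(doc).items():
        if key in field_names:
            # list.index returns only the first position, like indexOf
            row[field_names.index(key)] = value
    return row


def json_tuple_all_indices(doc, field_names):
    """Fixed behaviour (sketch): a JSON key fills every matching column."""
    row = [None] * len(field_names)
    for key, value in json.loads(doc).items():
        for i, name in enumerate(field_names):
            if name == key:
                row[i] = value
    return row
```

With the fields `a, b, a` from the example above, the first function leaves the third column empty while the second one fills it.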
10129659	b8aaef49fb	[SPARK-21807][SQL] Override ++ operation in ExpressionSet to reduce clone time ## What changes were proposed in this pull request? The getAliasedConstraints function in LogicalPlan.scala clones the expression set each time an element is added, which takes a long time. This PR adds a function that adds multiple elements at once to reduce the clone time. Before the change, the cost of getAliasedConstraints is: 100 expressions: 41 seconds 150 expressions: 466 seconds After the change, the cost of getAliasedConstraints is: 100 expressions: 1.8 seconds 150 expressions: 6.5 seconds The test is like this: test("getAliasedConstraints") { val expressionNum = 150 val aggExpression = (1 to expressionNum).map(i => Alias(Count(Literal(1)), s"cnt$i")()) val aggPlan = Aggregate(Nil, aggExpression, LocalRelation()) val beginTime = System.currentTimeMillis() val expressions = aggPlan.validConstraints println(s"validConstraints cost: ${System.currentTimeMillis() - beginTime}ms") // The size of Aliased expression is n * (n - 1) / 2 + n assert( expressions.size === expressionNum * (expressionNum - 1) / 2 + expressionNum) } ## How was this patch tested? Ran the newly added test. Author: 10129659 <chen.yanshan@zte.com.cn> Closes #19022 from eatoncys/getAliasedConstraints.	2017-08-23 20:35:08 -07:00
Takeshi Yamamuro	6942aeeb0a	[SPARK-21603][SQL][FOLLOW-UP] Change the default value of maxLinesPerFunction into 4000 ## What changes were proposed in this pull request? This pr changed the default value of `maxLinesPerFunction` to `4000`. In #18810, we added this new option to disable code generation for overly long functions, and I found this option only affected `Q17` and `Q66` in TPC-DS. But `Q66` had some performance regression: ``` Q17 w/o #18810, 3224ms --> q17 w/#18810, 2627ms (improvement) Q66 w/o #18810, 1712ms --> q66 w/#18810, 3032ms (regression) ``` To keep the previous performance in TPC-DS, we should set a higher default value for `maxLinesPerFunction`. ## How was this patch tested? Existing tests. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #19021 from maropu/SPARK-21603-FOLLOWUP-1.	2017-08-23 12:02:24 -07:00
Jose Torres	3c0c2d09ca	[SPARK-21765] Set isStreaming on leaf nodes for streaming plans. ## What changes were proposed in this pull request? All streaming logical plans will now have isStreaming set. This involved adding isStreaming as a case class arg in a few cases, since a node might be logically streaming depending on where it came from. ## How was this patch tested? Existing unit tests - no functional change is intended in this PR. Author: Jose Torres <joseph-torres@databricks.com> Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #18973 from joseph-torres/SPARK-21765.	2017-08-22 19:07:43 -07:00
gatorsmile	43d71d9659	[SPARK-21499][SQL] Support creating persistent function for Spark UDAF(UserDefinedAggregateFunction) ## What changes were proposed in this pull request? This PR enables users to create a persistent Scala UDAF (one that extends UserDefinedAggregateFunction). ```SQL CREATE FUNCTION myDoubleAvg AS 'test.org.apache.spark.sql.MyDoubleAvg' ``` Before this PR, a Spark UDAF could only be registered through the API `spark.udf.register(...)` ## How was this patch tested? Added test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #18700 from gatorsmile/javaUDFinScala.	2017-08-22 13:01:35 -07:00
Wenchen Fan	7880909c45	[SPARK-21743][SQL][FOLLOW-UP] top-most limit should not cause memory leak ## What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/18955 , to fix a bug that we break whole stage codegen for `Limit`. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #18993 from cloud-fan/bug.	2017-08-18 11:19:22 -07:00
Masha Basmanova	23ea898080	[SPARK-21213][SQL] Support collecting partition-level statistics: rowCount and sizeInBytes ## What changes were proposed in this pull request? Added support for ANALYZE TABLE [db_name.]tablename PARTITION (partcol1[=val1], partcol2[=val2], ...) COMPUTE STATISTICS [NOSCAN] SQL command to calculate total number of rows and size in bytes for a subset of partitions. Calculated statistics are stored in Hive Metastore as user-defined properties attached to partition objects. Property names are the same as the ones used to store table-level statistics: spark.sql.statistics.totalSize and spark.sql.statistics.numRows. When partition specification contains all partition columns with values, the command collects statistics for a single partition that matches the specification. When some partition columns are missing or listed without their values, the command collects statistics for all partitions which match a subset of partition column values specified. For example, table t has 4 partitions with the following specs: * Partition1: (ds='2008-04-08', hr=11) * Partition2: (ds='2008-04-08', hr=12) * Partition3: (ds='2008-04-09', hr=11) * Partition4: (ds='2008-04-09', hr=12) 'ANALYZE TABLE t PARTITION (ds='2008-04-09', hr=11)' command will collect statistics only for partition 3. 'ANALYZE TABLE t PARTITION (ds='2008-04-09')' command will collect statistics for partitions 3 and 4. 'ANALYZE TABLE t PARTITION (ds, hr)' command will collect statistics for all four partitions. When the optional parameter NOSCAN is specified, the command doesn't count number of rows and only gathers size in bytes. The statistics gathered by ANALYZE TABLE command can be fetched using DESC EXTENDED [db_name.]tablename PARTITION command. ## How was this patch tested? Added tests. Author: Masha Basmanova <mbasmanova@fb.com> Closes #18421 from mbasmanova/mbasmanova-analyze-partition.	2017-08-18 09:54:39 -07:00
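The partition-matching rule described in SPARK-21213 can be sketched as follows. This is a plain-Python illustration under the assumption that a partial specification is a mapping from column name to value, with `None` standing for a column listed without a value; the function name is hypothetical and this is not Spark's code.

```python
def matching_partitions(partitions, spec):
    """Return the partitions consistent with a (possibly partial) partition spec.

    spec maps a partition column to a value, or to None when the column is
    listed without a value. Columns missing from spec are unconstrained.
    """
    return [
        p for p in partitions
        if all(value is None or p[column] == value for column, value in spec.items())
    ]


# The four partitions of table t from the commit message.
partitions = [
    {"ds": "2008-04-08", "hr": 11},
    {"ds": "2008-04-08", "hr": 12},
    {"ds": "2008-04-09", "hr": 11},
    {"ds": "2008-04-09", "hr": 12},
]
```

A fully specified spec selects one partition, `(ds='2008-04-09')` selects partitions 3 and 4, and `(ds, hr)` selects all four, matching the examples in the message.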
Jen-Ming Chung	7ab951885f	[SPARK-21677][SQL] json_tuple throws NullPointException when column is null as string type ## What changes were proposed in this pull request? ``` scala scala> Seq(("""{"Hyukjin": 224, "John": 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show() ... java.lang.NullPointerException at ... ``` Currently a `null` field name throws a NullPointerException. As a null field name can't match any field name in the JSON, we just output null as its column value. This PR achieves it by returning a very unlikely column name `__NullFieldName` in evaluation of the field names. ## How was this patch tested? Added unit test. Author: Jen-Ming Chung <jenmingisme@gmail.com> Closes #18930 from jmchung/SPARK-21677.	2017-08-17 15:59:45 -07:00
Takeshi Yamamuro	6aad02d036	[SPARK-18394][SQL] Make an AttributeSet.toSeq output order consistent ## What changes were proposed in this pull request? This pr sorted output attributes on their name and exprId in `AttributeSet.toSeq` to make the order consistent. If the order is different, spark possibly generates different code and then misses cache in `CodeGenerator`, e.g., `GenerateColumnAccessor` generates code depending on an input attribute order. ## How was this patch tested? Added tests in `AttributeSetSuite` and manually checked if the cache worked well in the given query of the JIRA. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #18959 from maropu/SPARK-18394.	2017-08-17 22:47:14 +02:00
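Why a fixed ordering matters for SPARK-18394 can be sketched in plain Python. The sketch assumes attributes are modelled as `(name, expr_id)` tuples and uses a hypothetical `to_seq` helper; it only illustrates the idea of sorting by name and then exprId so that the same set always yields the same sequence, and therefore the same generated code and cache key.

```python
def to_seq(attributes):
    """Order a set of (name, expr_id) attributes by name, then by expr_id."""
    return sorted(attributes, key=lambda attr: (attr[0], attr[1]))


# Two sets with the same members, built in a different insertion order.
first = {("b", 2), ("a", 3), ("a", 1)}
second = {("a", 1), ("b", 2), ("a", 3)}
```

Both sets produce the same sequence, so anything derived from that sequence is stable across runs.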
10129659	1cce1a3b63	[SPARK-21603][SQL] The wholestage codegen will be much slower then that is closed when the function is too long ## What changes were proposed in this pull request? Disable whole-stage codegen when the generated function has more lines than the maximum set by the spark.sql.codegen.MaxFunctionLength parameter, because when the function is too long, it will not be JIT-optimized. A benchmark shows a 10x slowdown when the generated function is too long: ignore("max function length of wholestagecodegen") { val N = 20 << 15 val benchmark = new Benchmark("max function length of wholestagecodegen", N) def f(): Unit = sparkSession.range(N) .selectExpr( "id", "(id & 1023) as k1", "cast(id & 1023 as double) as k2", "cast(id & 1023 as int) as k3", "case when id > 100 and id <= 200 then 1 else 0 end as v1", "case when id > 200 and id <= 300 then 1 else 0 end as v2", "case when id > 300 and id <= 400 then 1 else 0 end as v3", "case when id > 400 and id <= 500 then 1 else 0 end as v4", "case when id > 500 and id <= 600 then 1 else 0 end as v5", "case when id > 600 and id <= 700 then 1 else 0 end as v6", "case when id > 700 and id <= 800 then 1 else 0 end as v7", "case when id > 800 and id <= 900 then 1 else 0 end as v8", "case when id > 900 and id <= 1000 then 1 else 0 end as v9", "case when id > 1000 and id <= 1100 then 1 else 0 end as v10", "case when id > 1100 and id <= 1200 then 1 else 0 end as v11", "case when id > 1200 and id <= 1300 then 1 else 0 end as v12", "case when id > 1300 and id <= 1400 then 1 else 0 end as v13", "case when id > 1400 and id <= 1500 then 1 else 0 end as v14", "case when id > 1500 and id <= 1600 then 1 else 0 end as v15", "case when id > 1600 and id <= 1700 then 1 else 0 end as v16", "case when id > 1700 and id <= 1800 then 1 else 0 end as v17", "case when id > 1800 and id <= 1900 then 1 else 0 end as v18") .groupBy("k1", "k2", "k3") .sum() .collect() benchmark.addCase(s"codegen = F") { iter => sparkSession.conf.set("spark.sql.codegen.wholeStage", "false") f() } benchmark.addCase(s"codegen = T") { iter => sparkSession.conf.set("spark.sql.codegen.wholeStage", "true") sparkSession.conf.set("spark.sql.codegen.MaxFunctionLength", "10000") f() } benchmark.run() /* Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14 on Windows 7 6.1 Intel64 Family 6 Model 58 Stepping 9, GenuineIntel max function length of wholestagecodegen: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ codegen = F 443 / 507 1.5 676.0 1.0X codegen = T 3279 / 3283 0.2 5002.6 0.1X */ } ## How was this patch tested? Run the unit test Author: 10129659 <chen.yanshan@zte.com.cn> Closes #18810 from eatoncys/codegen.	2017-08-16 09:12:20 -07:00
WeichenXu	07549b20a3	[SPARK-19634][ML] Multivariate summarizer - dataframes API ## What changes were proposed in this pull request? This patch adds the DataFrames API to the multivariate summarizer (mean, variance, etc.). In addition to all the features of MultivariateOnlineSummarizer, it also allows the user to select a subset of the metrics. ## How was this patch tested? Testcases added. ## Performance Resolves several performance issues in #17419; further optimization is pending on the SQL team's work. One of the SQL-layer performance issues related to this feature has been resolved in #18712, thanks liancheng and cloud-fan ### Performance data (test on my laptop, use 2 partitions. tries out = 20, warm up = 10) The unit of test results is records/milliseconds (higher is better) Vector size/records number | 1/10000000 | 10/1000000 | 100/1000000 | 1000/100000 | 10000/10000 ----|------|----|---|----|---- Dataframe | 15149 | 7441 | 2118 | 224 | 21 RDD from Dataframe | 4992 | 4440 | 2328 | 320 | 33 raw RDD | 53931 | 20683 | 3966 | 528 | 53 Author: WeichenXu <WeichenXu123@outlook.com> Closes #18798 from WeichenXu123/SPARK-19634-dataframe-summarizer.	2017-08-16 10:41:05 +08:00
Marcelo Vanzin	3f958a9992	[SPARK-21731][BUILD] Upgrade scalastyle to 0.9. This version fixes a few issues in the import order checker; it provides better error messages, and detects more improper ordering (thus the need to change a lot of files in this patch). The main fix is that it correctly complains about the order of packages vs. classes. As part of the above, I moved some "SparkSession" import in ML examples inside the "$example on$" blocks; that didn't seem consistent across different source files to start with, and avoids having to add more on/off blocks around specific imports. The new scalastyle also seems to have a better header detector, so a few license headers had to be updated to match the expected indentation. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #18943 from vanzin/SPARK-21731.	2017-08-15 13:59:00 -07:00
Wenchen Fan	14bdb25fd7	[SPARK-18464][SQL][FOLLOWUP] support old table which doesn't store schema in table properties ## What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/15900 , to fix one more bug: When the table schema is empty and needs to be inferred at runtime, we should not resolve parent plans before the schema has been inferred, or the parent plans will be resolved against an empty schema and may get wrong results for something like `select *` The fix logic is: introduce `UnresolvedCatalogRelation` as a placeholder. Then we replace it with `LogicalRelation` or `HiveTableRelation` during analysis, so that it's guaranteed that we won't resolve parent plans until the schema has been inferred. ## How was this patch tested? regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #18907 from cloud-fan/bug.	2017-08-15 09:04:56 -07:00
hyukjinkwon	0422ce06df	[SPARK-21724][SQL][DOC] Adds since information in the documentation of date functions ## What changes were proposed in this pull request? This PR adds the `since` annotation in documentation so that the since information is rendered in the generated documentation. ## How was this patch tested? Manually checked the documentation by `cd sql && ./create-docs.sh`. Also, Jenkins tests are required. Author: hyukjinkwon <gurwls223@gmail.com> Closes #18939 from HyukjinKwon/add-sinces-date-functions.	2017-08-14 23:44:25 -07:00
aokolnychyi	5596ce83c4	[MINOR][SQL] Additional test case for CheckCartesianProducts rule ## What changes were proposed in this pull request? While discovering optimization rules and their test coverage, I did not find any tests for `CheckCartesianProducts` in the Catalyst folder. So, I decided to create a new test suite. Once I finished, I found a test in `JoinSuite` for this functionality so feel free to discard this change if it does not make much sense. The proposed test suite covers a few additional use cases. Author: aokolnychyi <anton.okolnychyi@sap.com> Closes #18909 from aokolnychyi/check-cartesian-join-tests.	2017-08-13 21:33:16 -07:00
Tejas Patil	94439997d5	[SPARK-21595] Separate thresholds for buffering and spilling in ExternalAppendOnlyUnsafeRowArray ## What changes were proposed in this pull request? [SPARK-21595](https://issues.apache.org/jira/browse/SPARK-21595) reported that there is excessive spilling to disk due to default spill threshold for `ExternalAppendOnlyUnsafeRowArray` being quite small for WINDOW operator. Old behaviour of WINDOW operator (pre https://github.com/apache/spark/pull/16909) would hold data in an array for first 4096 records post which it would switch to `UnsafeExternalSorter` and start spilling to disk after reaching `spark.shuffle.spill.numElementsForceSpillThreshold` (or earlier if there was paucity of memory due to excessive consumers). Currently the (switch from in-memory to `UnsafeExternalSorter`) and (`UnsafeExternalSorter` spilling to disk) for `ExternalAppendOnlyUnsafeRowArray` is controlled by a single threshold. This PR aims to separate that to have more granular control. ## How was this patch tested? Added unit tests Author: Tejas Patil <tejasp@fb.com> Closes #18843 from tejasapatil/SPARK-21595.	2017-08-11 22:01:00 +02:00
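The idea of two separate thresholds from SPARK-21595 can be sketched in plain Python. The class below is hypothetical: a list stands in for `UnsafeExternalSorter`, and a counter stands in for spilling to disk. It only shows that the switch away from the in-memory array and the spill are governed by different limits.

```python
class TwoThresholdBuffer:
    """Sketch: buffer rows in an array first, then in a sorter that may spill."""

    def __init__(self, in_memory_threshold, spill_threshold):
        self.in_memory_threshold = in_memory_threshold
        self.spill_threshold = spill_threshold
        self.array = []       # plain in-memory array used for the first rows
        self.sorter = None    # stand-in for UnsafeExternalSorter
        self.spill_count = 0  # number of simulated spills to disk

    def add(self, row):
        if self.sorter is None and len(self.array) < self.in_memory_threshold:
            self.array.append(row)
            return
        if self.sorter is None:
            # First threshold reached: move buffered rows into the sorter.
            self.sorter = list(self.array)
            self.array = []
        self.sorter.append(row)
        if len(self.sorter) >= self.spill_threshold:
            # Second threshold reached: simulate a spill and start afresh.
            self.spill_count += 1
            self.sorter = []
```

With an in-memory threshold of 4 and a spill threshold of 10, the fifth row triggers the switch to the sorter and the tenth row triggers the first spill.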
Reynold Xin	584c7f1437	[SPARK-21699][SQL] Remove unused getTableOption in ExternalCatalog ## What changes were proposed in this pull request? This patch removes the unused SessionCatalog.getTableMetadataOption and ExternalCatalog.getTableOption. ## How was this patch tested? Removed the test case. Author: Reynold Xin <rxin@databricks.com> Closes #18912 from rxin/remove-getTableOption.	2017-08-10 18:56:25 -07:00
Jose Torres	0fb73253fc	[SPARK-21587][SS] Added filter pushdown through watermarks. ## What changes were proposed in this pull request? Push filter predicates through EventTimeWatermark if they're deterministic and do not reference the watermarked attribute. (This is similar but not identical to the logic for pushing through UnaryNode.) ## How was this patch tested? unit tests Author: Jose Torres <joseph-torres@databricks.com> Closes #18790 from joseph-torres/SPARK-21587.	2017-08-09 12:50:04 -07:00
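The pushdown condition in SPARK-21587 can be sketched in plain Python. This assumes a predicate is modelled as a name, the set of attributes it references, and a determinism flag; the `Predicate` type and `split_predicates` function are hypothetical and do not mirror Catalyst classes.

```python
from collections import namedtuple

# Hypothetical predicate model: display name, referenced attributes, determinism.
Predicate = namedtuple("Predicate", ["name", "references", "deterministic"])


def split_predicates(predicates, watermark_attr):
    """Split predicates into those that may move below the watermark and the rest."""
    pushed, kept = [], []
    for predicate in predicates:
        if predicate.deterministic and watermark_attr not in predicate.references:
            pushed.append(predicate)
        else:
            kept.append(predicate)
    return pushed, kept


predicates = [
    Predicate("a > 1", frozenset({"a"}), True),
    Predicate("ts > x", frozenset({"ts", "x"}), True),
    Predicate("rand() < 0.5", frozenset(), False),
]
```

Only `a > 1` qualifies here: the second predicate references the watermarked attribute `ts`, and the third is non-deterministic.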
gatorsmile	2d799d0808	[SPARK-21504][SQL] Add spark version info into table metadata ## What changes were proposed in this pull request? This PR is to add the spark version info in the table metadata. When creating the table, this value is assigned. It can help users find which version of Spark was used to create the table. ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #18709 from gatorsmile/addVersion.	2017-08-09 08:46:25 -07:00
Xingbo Jiang	031910b0ec	[SPARK-21608][SPARK-9221][SQL] Window rangeBetween() API should allow literal boundary ## What changes were proposed in this pull request? Window rangeBetween() API should allow literal boundary, that means, the window range frame can calculate frame of double/date/timestamp. Example of the use case can be: ``` SELECT val_timestamp, cate, avg(val_timestamp) OVER(PARTITION BY cate ORDER BY val_timestamp RANGE BETWEEN CURRENT ROW AND interval 23 days 4 hours FOLLOWING) FROM testData ``` This PR refactors the Window `rangeBetween` and `rowsBetween` API, while the legacy user code should still be valid. ## How was this patch tested? Add new test cases both in `DataFrameWindowFunctionsSuite` and in `window.sql`. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #18814 from jiangxb1987/literal-boundary.	2017-08-09 13:23:49 +08:00
Liang-Chi Hsieh	ee1304199b	[SPARK-21567][SQL] Dataset should work with type alias ## What changes were proposed in this pull request? If we create a type alias for a type workable with Dataset, the type alias doesn't work with Dataset. A reproducible case looks like: object C { type TwoInt = (Int, Int) def tupleTypeAlias: TwoInt = (1, 1) } Seq(1).toDS().map(_ => ("", C.tupleTypeAlias)) It throws an exception like: type T1 is not a class scala.ScalaReflectionException: type T1 is not a class at scala.reflect.api.Symbols$SymbolApi$class.asClass(Symbols.scala:275) ... This patch accesses the dealias of type in many places in `ScalaReflection` to fix it. ## How was this patch tested? Added test case. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18813 from viirya/SPARK-21567.	2017-08-08 16:12:41 +08:00
zhoukang	8b69b17f3f	[SPARK-21544][DEPLOY][TEST-MAVEN] Tests jar of some module should not upload twice ## What changes were proposed in this pull request? For the modules below: common/network-common streaming sql/core sql/catalyst tests.jar will be installed or deployed twice. For example: `[DEBUG] Installing org.apache.spark:spark-streaming_2.11/maven-metadata.xml to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/maven-metadata-local.xml [INFO] Installing /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar [DEBUG] Skipped re-installing /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar to /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar, seems unchanged` The reason is below: `[DEBUG] (f) artifact = org.apache.spark:spark-streaming_2.11:jar:2.1.0-mdh2.1.0.1-SNAPSHOT [DEBUG] (f) attachedArtifacts = [org.apache.spark:spark-streaming_2.11:test-jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11:jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11:java-source:sources:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11:java-source:test-sources:2.1.0-mdh2.1.0.1-SNAPSHOT, org.apache.spark:spark-streaming_2.11:javadoc:javadoc:2.1.0-mdh2.1.0.1-SNAPSHOT]` When executing 'mvn deploy' to Nexus during a release, it will fail since a release Nexus repository cannot be overwritten. ## How was this patch tested? Execute 'mvn clean install -Pyarn -Phadoop-2.6 -Phadoop-provided -DskipTests' Author: zhoukang <zhoukang199191@gmail.com> Closes #18745 from caneGuy/zhoukang/fix-installtwice.	2017-08-07 12:51:39 +01:00
BartekH	438c381584	Add "full_outer" name to join types I have discovered that the "full_outer" join type name works in Spark 2.0, but it is not printed in the exception. Please verify. Author: BartekH <bartekhamielec@gmail.com> Closes #17985 from BartekH/patch-1.	2017-08-06 16:40:59 -07:00
Takeshi Yamamuro	74b47845ea	[SPARK-20963][SQL][FOLLOW-UP] Use UnresolvedSubqueryColumnAliases for visitTableName ## What changes were proposed in this pull request? This pr (follow-up of #18772) used `UnresolvedSubqueryColumnAliases` for `visitTableName` in `AstBuilder`, which is a new unresolved `LogicalPlan` implemented in #18185. ## How was this patch tested? Existing tests Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #18857 from maropu/SPARK-20963-FOLLOWUP.	2017-08-06 10:14:45 -07:00
vinodkc	1ba967b25e	[SPARK-21588][SQL] SQLContext.getConf(key, null) should return null ## What changes were proposed in this pull request? SQLContext.getConf(key, null) throws an NPE for a key that is not defined in the conf and doesn't have a default value defined. It happens only when the conf has a value converter. Added a null check on defaultValue inside SQLConf.getConfString to avoid calling entry.valueConverter(defaultValue) ## How was this patch tested? Added unit test Author: vinodkc <vinod.kc.in@gmail.com> Closes #18852 from vinodkc/br_Fix_SPARK-21588.	2017-08-05 23:04:39 -07:00
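The null check described in SPARK-21588 can be sketched in plain Python. The function and its `settings` and `converters` arguments are hypothetical stand-ins for the conf map and the per-key value converters; the point is only that the converter is skipped when the default is missing.

```python
def get_conf_string(settings, converters, key, default):
    """Return the configured value for key, or default when key is unset.

    The default is validated with the key's converter only when it is not
    None, which avoids calling the converter on a missing value.
    """
    converter = converters.get(key)
    if converter is not None and default is not None:
        converter(default)  # raises if the default is not a valid value
    return settings.get(key, default)


# Hypothetical conf entry whose values must parse as integers.
converters = {"some.int.option": int}
```

Calling it with a `None` default for an unset key returns `None` instead of failing inside the converter.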
Takeshi Yamamuro	990efad1c6	[SPARK-20963][SQL] Support column aliases for join relations in FROM clause ## What changes were proposed in this pull request? This pr added parsing rules to support column aliases for join relations in FROM clause. This pr is a sub-task of #18079. ## How was this patch tested? Added tests in `AnalysisSuite`, `PlanParserSuite`, and `SQLQueryTestSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #18772 from maropu/SPARK-20963-2.	2017-08-05 20:35:54 -07:00
hyukjinkwon	ba327ee54c	[SPARK-21485][FOLLOWUP][SQL][DOCS] Describes examples and arguments separately, and note/since in SQL built-in function documentation ## What changes were proposed in this pull request? This PR proposes to separate `extended` into `examples` and `arguments` internally so that both can be separately documented and add `since` and `note` for additional information. For `since`, it looks like users sometimes get confused by missing version information, to my knowledge. For example, see https://www.mail-archive.com/userspark.apache.org/msg64798.html For a few good examples to check the built documentation, please see both: `from_json` - https://spark-test.github.io/sparksqldoc/#from_json `like` - https://spark-test.github.io/sparksqldoc/#like For `DESCRIBE FUNCTION`, `note` and `since` are added as below: ``` > DESCRIBE FUNCTION EXTENDED rlike; ... Extended Usage: Arguments: ... Examples: ... Note: Use LIKE to match with simple string pattern ``` ``` > DESCRIBE FUNCTION EXTENDED to_json; ... Examples: ... Since: 2.2.0 ``` For the complete documentation, see https://spark-test.github.io/sparksqldoc/ ## How was this patch tested? Manual tests and existing tests. Please see https://spark-test.github.io/sparksqldoc Jenkins tests are needed to double check Author: hyukjinkwon <gurwls223@gmail.com> Closes #18749 from HyukjinKwon/followup-sql-doc-gen.	2017-08-05 10:10:56 -07:00
liuxian	894d5a453a	[SPARK-21580][SQL] Integers in aggregation expressions are wrongly taken as group-by ordinal ## What changes were proposed in this pull request? create temporary view data as select * from values (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2) as data(a, b); `select 3, 4, sum(b) from data group by 1, 2;` `select 3 as c, 4 as d, sum(b) from data group by c, d;` When running these two cases, the following exception occurred: `Error in query: GROUP BY position 4 is not in select list (valid range is [1, 3]); line 1 pos 10` The cause of this failure: If an aggregateExpression is an integer, then after the ordinal is replaced with this aggregateExpression, the groupExpression is still considered an ordinal. The solution: This bug is due to re-entrance of an analyzed plan. We can solve it by using `resolveOperators` in `SubstituteUnresolvedOrdinals`. ## How was this patch tested? Added unit test case Author: liuxian <liu.xian3@zte.com.cn> Closes #18779 from 10110346/groupby.	2017-08-04 22:55:06 -07:00
Reynold Xin	5ad1796b9f	[SPARK-21634][SQL] Change OneRowRelation from a case object to case class ## What changes were proposed in this pull request? OneRowRelation is the only plan that is a case object, which causes some issues with makeCopy using a 0-arg constructor. This patch changes it from a case object to a case class. This blocks SPARK-21619. ## How was this patch tested? Should be covered by existing test cases. Author: Reynold Xin <rxin@databricks.com> Closes #18839 from rxin/SPARK-21634.	2017-08-04 10:36:08 -07:00
Yuming Wang	231f67247b	[SPARK-21205][SQL] pmod(number, 0) should be null. ## What changes were proposed in this pull request? Hive `pmod(3.13, 0)`: ```sql hive> select pmod(3.13, 0); OK NULL Time taken: 2.514 seconds, Fetched: 1 row(s) hive> ``` Spark `mod(3.13, 0)`: ```sql spark-sql> select mod(3.13, 0); NULL spark-sql> ``` But the Spark `pmod(3.13, 0)`: ```sql spark-sql> select pmod(3.13, 0); 17/06/25 09:35:58 ERROR SparkSQLDriver: Failed in [select pmod(3.13, 0)] java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.Pmod.pmod(arithmetic.scala:504) at org.apache.spark.sql.catalyst.expressions.Pmod.nullSafeEval(arithmetic.scala:432) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:419) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:323) ... ``` This PR makes `pmod(number, 0)` return null. ## How was this patch tested? unit tests Author: Yuming Wang <wgyumg@gmail.com> Closes #18413 from wangyum/SPARK-21205.	2017-08-04 12:06:08 +02:00
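The null-on-zero behaviour from SPARK-21205 can be sketched in plain Python, with `None` standing in for SQL NULL. The sketch uses `math.fmod` to mimic a remainder that takes the sign of the dividend and always returns a float; it is an illustration, not the Catalyst `Pmod` expression.

```python
import math


def pmod(a, n):
    """Positive modulus that yields None (SQL NULL) for a zero divisor."""
    if a is None or n is None or n == 0:
        return None
    r = math.fmod(a, n)  # remainder with the sign of the dividend
    # Shift a negative remainder into the non-negative range.
    return math.fmod(r + n, n) if r < 0 else r
```

With this guard, `pmod(3.13, 0)` gives `None` rather than raising, which matches the Hive result shown in the message.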
bravo-zhang	6b186c9d60	[SPARK-18950][SQL] Report conflicting fields when merging two StructTypes ## What changes were proposed in this pull request? Currently, StructType.merge() only reports data types of conflicting fields when merging two incompatible schemas. It would be nice to also report the field names for easier debugging. ## How was this patch tested? Unit test in DataTypeSuite. Print exception message when conflict is triggered. Author: bravo-zhang <mzhang1230@gmail.com> Closes #16365 from bravo-zhang/spark-18950.	2017-07-31 17:19:55 -07:00
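The improvement requested in SPARK-18950 can be sketched in plain Python, assuming a schema is modelled as a mapping from field name to a type name. The function and the wording of the error message are hypothetical; the point is that the conflicting field is named alongside the two types.

```python
def merge_schemas(left, right):
    """Merge two {field_name: type_name} schemas, naming any conflicting field."""
    merged = dict(left)
    for name, dtype in right.items():
        if name in merged and merged[name] != dtype:
            raise ValueError(
                f"Failed to merge field '{name}': "
                f"incompatible types {merged[name]} and {dtype}"
            )
        merged[name] = dtype
    return merged
```

A conflict on field `b` then produces a message that mentions `b`, rather than only the two data types.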
Takeshi Yamamuro	6550086bbd	[SPARK-20962][SQL] Support subquery column aliases in FROM clause ## What changes were proposed in this pull request? This pr added parsing rules to support subquery column aliases in FROM clause. This pr is a sub-task of #18079. ## How was this patch tested? Added tests in `PlanParserSuite` and `SQLQueryTestSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #18185 from maropu/SPARK-20962.	2017-07-29 10:14:47 -07:00
Xingbo Jiang	92d85637e7	[SPARK-19451][SQL] rangeBetween method should accept Long value as boundary ## What changes were proposed in this pull request? Long values can be passed to `rangeBetween` as range frame boundaries, but we silently convert them to Int values; this can cause wrong results and we should fix it. Furthermore, we should accept any legal literal values as range frame boundaries. In this PR, we make it possible for Long values, and make accepting other DataTypes really easy to add. This PR is mostly based on Herman's previous amazing work: `596f53c339` After this has been merged, we can close #16818. ## How was this patch tested? Add new tests in `DataFrameWindowFunctionsSuite` and `TypeCoercionSuite`. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #18540 from jiangxb1987/rangeFrame.	2017-07-29 10:11:31 -07:00
Liang-Chi Hsieh	9c8109ef41	[SPARK-21555][SQL] RuntimeReplaceable should be compared semantically by its canonicalized child ## What changes were proposed in this pull request? When there are aliases (these aliases were added for nested fields) as parameters in `RuntimeReplaceable`, as they are not in the children expression, those aliases can't be cleaned up in analyzer rule `CleanupAliases`. An expression `nvl(foo.foo1, "value")` can be resolved to two semantically different expressions in a group by query because they contain different aliases. Because those aliases are not children of `RuntimeReplaceable` which is an `UnaryExpression`. So we can't trim the aliases out by simple transforming the expressions in `CleanupAliases`. If we want to replace the non-children aliases in `RuntimeReplaceable`, we need to add more codes to `RuntimeReplaceable` and modify all expressions of `RuntimeReplaceable`. It makes the interface ugly IMO. Consider those aliases will be replaced later at optimization and so they're no harm, this patch chooses to simply override `canonicalized` of `RuntimeReplaceable`. One concern is about `CleanupAliases`. Because it actually cannot clean up ALL aliases inside a plan. To make caller of this rule notice that, this patch adds a comment to `CleanupAliases`. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18761 from viirya/SPARK-21555.	2017-07-29 10:02:56 -07:00
Wenchen Fan	9f5647d62e	[SPARK-21319][SQL] Fix memory leak in sorter ## What changes were proposed in this pull request? `UnsafeExternalSorter.recordComparator` can be either `KVComparator` or `RowComparator`, and both of them will keep the reference to the input rows they compared last time. After sorting, we return the sorted iterator to upstream operators. However, the upstream operators may take a while to consume up the sorted iterator, and `UnsafeExternalSorter` is registered to `TaskContext` at [here](https://github.com/apache/spark/blob/v2.2.0/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L159-L161), which means we will keep the `UnsafeExternalSorter` instance and keep the last compared input rows in memory until the sorted iterator is consumed up. Things get worse if we sort within partitions of a dataset and coalesce all partitions into one, as we will keep a lot of input rows in memory and the time to consume up all the sorted iterators is long. This PR takes over https://github.com/apache/spark/pull/18543 , the idea is that, we do not keep the record comparator instance in `UnsafeExternalSorter`, but a generator of record comparator. close #18543 ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #18679 from cloud-fan/memory-leak.	2017-07-27 22:56:26 +08:00
Kazuaki Ishizaki	ebbe589d12	[SPARK-21271][SQL] Ensure Unsafe.sizeInBytes is a multiple of 8 ## What changes were proposed in this pull request? This PR ensures that `Unsafe.sizeInBytes` is a multiple of 8. If it is not, `Unsafe.hashCode` causes an assertion violation. ## How was this patch tested? Will add test cases Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #18503 from kiszk/SPARK-21271.	2017-07-27 15:27:24 +08:00
gatorsmile	ebc24a9b7f	[SPARK-20586][SQL] Add deterministic to ScalaUDF ### What changes were proposed in this pull request? Like [Hive UDFType](https://hive.apache.org/javadocs/r2.0.1/api/org/apache/hadoop/hive/ql/udf/UDFType.html), we should allow users to add the extra flags for ScalaUDF and JavaUDF too. _stateful_/_impliesOrder_ are not applicable to our Scala UDF. Thus, we only add the following two flags. - deterministic: Certain optimizations should not be applied if UDF is not deterministic. Deterministic UDF returns same result each time it is invoked with a particular input. This determinism just needs to hold within the context of a query. When the deterministic flag is not correctly set, the results could be wrong. For ScalaUDF in Dataset APIs, users can call the following extra APIs for `UserDefinedFunction` to make the corresponding changes. - `nonDeterministic`: Updates UserDefinedFunction to non-deterministic. Also fixed the Java UDF name loss issue. Will submit a separate PR for `distinctLike` for UDAF ### How was this patch tested? Added test cases for both ScalaUDF Author: gatorsmile <gatorsmile@gmail.com> Author: Wenchen Fan <cloud0fan@gmail.com> Closes #17848 from gatorsmile/udfRegister.	2017-07-25 17:19:44 -07:00
pj.fanning	2a53fbfce7	[SPARK-20871][SQL] limit logging of Janino code ## What changes were proposed in this pull request? When the code that is generated is greater than 64k, then Janino compile will fail and CodeGenerator.scala will log the entire code at Error level. SPARK-20871 suggests only logging the code at Debug level. Since the code is already logged at debug level, this Pull Request proposes not including the formatted code in the Error logging and exception message at all. When an exception occurs, the code will be logged at Info level but truncated if it is more than 1000 lines long. ## How was this patch tested? Existing tests were run. An extra test case was added to CodeFormatterSuite to test the new maxLines parameter. Author: pj.fanning <pj.fanning@workday.com> Closes #18658 from pjfanning/SPARK-20871.	2017-07-23 10:38:03 -07:00
Wenchen Fan	3ac6093086	[SPARK-10063] Follow-up: remove dead code related to an old output committer ## What changes were proposed in this pull request? DirectParquetOutputCommitter was removed from Spark as it was deemed unsafe to use. However, we still have some code that generates a warning. This patch removes that code as well. This is kind of a follow-up of https://github.com/apache/spark/pull/16796 ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #18689 from cloud-fan/minor.	2017-07-20 12:08:20 -07:00
gatorsmile	ae253e5a87	[SPARK-21273][SQL][FOLLOW-UP] Propagate logical plan stats using visitor pattern and mixin ## What changes were proposed in this pull request? This PR is to add back the stats propagation of `Window` and remove the stats calculation of the leaf node `Range`, which has been covered by `9c32d2507d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala (L56)` ## How was this patch tested? Added two test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #18677 from gatorsmile/visitStats.	2017-07-19 10:57:15 +08:00
Wenchen Fan	f18b905f6c	[SPARK-21457][SQL] ExternalCatalog.listPartitions should correctly handle partition values with dot ## What changes were proposed in this pull request? When we list partitions from hive metastore with a partial partition spec, we are expecting exact matching according to the partition values. However, Hive treats a dot specially and matches any single character for it. We should do an extra filter to drop unexpected partitions. ## How was this patch tested? new regression test. Author: Wenchen Fan <wenchen@databricks.com> Closes #18671 from cloud-fan/hive.	2017-07-18 15:56:16 -07:00
Sean Owen	e26dac5feb	[SPARK-21415] Triage scapegoat warnings, part 1 ## What changes were proposed in this pull request? Address scapegoat warnings for: - BigDecimal double constructor - Catching NPE - Finalizer without super - List.size is O(n) - Prefer Seq.empty - Prefer Set.empty - reverse.map instead of reverseMap - Type shadowing - Unnecessary if condition. - Use .log1p - Var could be val In some instances like Seq.empty, I avoided making the change even where valid in test code to keep the scope of the change smaller. Those issues are concerned with performance and it won't matter for tests. ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #18635 from srowen/Scapegoat1.	2017-07-18 08:47:17 +01:00
aokolnychyi	0be5fb41a6	[SPARK-21332][SQL] Incorrect result type inferred for some decimal expressions ## What changes were proposed in this pull request? This PR changes the direction of expression transformation in the DecimalPrecision rule. Previously, the expressions were transformed down, which led to incorrect result types when decimal expressions had other decimal expressions as their operands. The root cause of this issue was in visiting outer nodes before their children. Consider the example below: ``` val inputSchema = StructType(StructField("col", DecimalType(26, 6)) :: Nil) val sc = spark.sparkContext val rdd = sc.parallelize(1 to 2).map(_ => Row(BigDecimal(12))) val df = spark.createDataFrame(rdd, inputSchema) // Works correctly since no nested decimal expression is involved // Expected result type: (26, 6) * (26, 6) = (38, 12) df.select($"col" * $"col").explain(true) df.select($"col" * $"col").printSchema() // Gives a wrong result since there is a nested decimal expression that should be visited first // Expected result type: ((26, 6) * (26, 6)) * (26, 6) = (38, 12) * (26, 6) = (38, 18) df.select($"col" * $"col" * $"col").explain(true) df.select($"col" * $"col" * $"col").printSchema() ``` The example above gives the following output: ``` // Correct result without sub-expressions == Parsed Logical Plan == 'Project [('col * 'col) AS (col * col)#4] +- LogicalRDD [col#1] == Analyzed Logical Plan == (col * col): decimal(38,12) Project [CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS (col * col)#4] +- LogicalRDD [col#1] == Optimized Logical Plan == Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4] +- LogicalRDD [col#1] == Physical Plan == Project [CheckOverflow((col#1 col#1), DecimalType(38,12)) AS (col * col)#4] +- Scan ExistingRDD[col#1] // Schema root \|-- (col * col): decimal(38,12) (nullable = true) // Incorrect result with 
sub-expressions == Parsed Logical Plan == 'Project [(('col * 'col) * 'col) AS ((col * col) * col)#11] +- LogicalRDD [col#1] == Analyzed Logical Plan == ((col * col) * col): decimal(38,12) Project [CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS ((col * col) * col)#11] +- LogicalRDD [col#1] == Optimized Logical Plan == Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11] +- LogicalRDD [col#1] == Physical Plan == Project [CheckOverflow((cast(CheckOverflow((col#1 col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11] +- Scan ExistingRDD[col#1] // Schema root \|-- ((col * col) * col): decimal(38,12) (nullable = true) ``` ## How was this patch tested? This PR was tested with available unit tests. Moreover, there are tests to cover previously failing scenarios. Author: aokolnychyi <anton.okolnychyi@sap.com> Closes #18583 from aokolnychyi/spark-21332.	2017-07-17 21:07:50 -07:00
Sean Owen	fd52a747fd	[SPARK-19810][SPARK-19810][MINOR][FOLLOW-UP] Follow-ups from to remove Scala 2.10 ## What changes were proposed in this pull request? Follow up to a few comments on https://github.com/apache/spark/pull/17150#issuecomment-315020196 that couldn't be addressed before it was merged. ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #18646 from srowen/SPARK-19810.2.	2017-07-17 09:22:42 +08:00
Kazuaki Ishizaki	ac5d5d7959	[SPARK-21344][SQL] BinaryType comparison does signed byte array comparison ## What changes were proposed in this pull request? This PR fixes a wrong comparison for `BinaryType`. This PR enables unsigned comparison and unsigned prefix generation for an array for `BinaryType`. The previous implementation used signed operations. ## How was this patch tested? Added a test suite in `OrderingSuite`. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #18571 from kiszk/SPARK-21344.	2017-07-14 20:16:04 -07:00
Sean Owen	425c4ada4c	[SPARK-19810][BUILD][CORE] Remove support for Scala 2.10 ## What changes were proposed in this pull request? - Remove Scala 2.10 build profiles and support - Replace some 2.10 support in scripts with commented placeholders for 2.12 later - Remove deprecated API calls from 2.10 support - Remove usages of deprecated context bounds where possible - Remove Scala 2.10 workarounds like ScalaReflectionLock - Other minor Scala warning fixes ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #17150 from srowen/SPARK-19810.	2017-07-13 17:06:24 +08:00
liuxian	aaad34dc2f	[SPARK-21007][SQL] Add SQL function - RIGHT && LEFT ## What changes were proposed in this pull request? Add SQL function - RIGHT && LEFT, same as MySQL: https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_left https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_right ## How was this patch tested? unit test Author: liuxian <liu.xian3@zte.com.cn> Closes #18228 from 10110346/lx-wip-0607.	2017-07-12 18:51:19 +08:00

2517 commits