ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Liang-Chi Hsieh	fc43690d36	[SPARK-24749][SQL] Use sameType to compare Array's element type in ArrayContains ## What changes were proposed in this pull request? We should use `DataType.sameType` to compare element type in `ArrayContains`, otherwise nullability affects comparison result. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #21724 from viirya/SPARK-24749.	2018-07-07 11:34:30 +08:00
Liang-Chi Hsieh	4de0425df8	[SPARK-24569][SQL] Aggregator with output type Option should produce consistent schema ## What changes were proposed in this pull request? A SQL `Aggregator` with output type `Option[Boolean]` creates a column of type `StructType`, which is inconsistent with a Dataset of a similar Java class. This changes the way `definedByConstructorParams` checks the given type: for `Option[_]`, it checks the type argument instead. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #21611 from viirya/SPARK-24569.	2018-07-07 10:54:14 +08:00
Yuming Wang	bf67f70c48	[SPARK-24692][TESTS] Improvement FilterPushdownBenchmark ## What changes were proposed in this pull request? Refer to the [`WideSchemaBenchmark`](https://github.com/apache/spark/blob/v2.3.1/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/WideSchemaBenchmark.scala) update `FilterPushdownBenchmark`: 1. Write the result to `benchmarks/FilterPushdownBenchmark-results.txt` for easy maintenance. 2. Add more benchmark case: `StringStartsWith`, `Decimal`, `InSet -> InFilters` and `tinyint`. ## How was this patch tested? manual tests Author: Yuming Wang <yumwang@ebay.com> Closes #21677 from wangyum/SPARK-24692.	2018-07-06 11:13:57 +08:00
Takuya UESHIN	01fcba2c68	[SPARK-24737][SQL] Type coercion between StructTypes. ## What changes were proposed in this pull request? We can support type coercion between `StructType`s where all the internal types are compatible. ## How was this patch tested? Added tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #21713 from ueshin/issues/SPARK-24737/structtypecoercion.	2018-07-06 11:10:50 +08:00
Gengliang Wang	33952cfa81	[SPARK-24675][SQL] Rename table: validate existence of new location ## What changes were proposed in this pull request? If a table is renamed to a new location that already exists, its data won't show up. ``` scala> Seq("hello").toDF("a").write.format("parquet").saveAsTable("t") scala> sql("select * from t").show() +-----+ | a| +-----+ |hello| +-----+ scala> sql("alter table t rename to test") res2: org.apache.spark.sql.DataFrame = [] scala> sql("select * from test").show() +---+ | a| +---+ +---+ ``` The file layout looks like this: ``` $ tree test test ├── gabage └── t ├── _SUCCESS └── part-00000-856b0f10-08f1-42d6-9eb3-7719261f3d5e-c000.snappy.parquet ``` In Hive, if the new location exists, the rename fails even if the location is empty. We should have the same validation in Catalog, in case of unexpected bugs. ## How was this patch tested? New unit test. Author: Gengliang Wang <gengliang.wang@databricks.com> Closes #21655 from gengliangwang/validate_rename_table.	2018-07-05 09:25:19 -07:00
Liang-Chi Hsieh	32cfd3e75a	[SPARK-24361][SQL] Polish code block manipulation API ## What changes were proposed in this pull request? Current code block manipulation API is immature and hacky. We need a formal API to manipulate code blocks. The basic idea is making `JavaCode` as `TreeNode`. So we can use familiar `transform` API to manipulate code blocks and expressions in code blocks. For example, we can replace `SimpleExprValue` in a code block like this: ```scala code.transformExprValues { case SimpleExprValue("1 + 1", _) => aliasedParam } ``` The example use case is splitting code to methods. For example, we have an `ExprCode` containing generated code. But it is too long and we need to split it as method. Because statement-based expressions can't be directly passed into. We need to transform them as variables first: ```scala def getExprValues(block: Block): Set[ExprValue] = block match { case c: CodeBlock => c.blockInputs.collect { case e: ExprValue => e }.toSet case _ => Set.empty } def currentCodegenInputs(ctx: CodegenContext): Set[ExprValue] = { // Collects current variables in ctx.currentVars and ctx.INPUT_ROW. // It looks roughly like... ctx.currentVars.flatMap { v => getExprValues(v.code) ++ Set(v.value, v.isNull) }.toSet + ctx.INPUT_ROW } // A code block of an expression contains too long code, making it as method if (eval.code.length > 1024) { val setIsNull = if (!eval.isNull.isInstanceOf[LiteralValue]) { ... } else { "" } // Pick up variables and statements necessary to pass in. val currentVars = currentCodegenInputs(ctx) val varsPassIn = getExprValues(eval.code).intersect(currentVars) val aliasedExprs = HashMap.empty[SimpleExprValue, VariableValue] // Replace statement-based expressions which can't be directly passed in the method. 
val newCode = eval.code.transform { case block => block.transformExprValues { case s @ SimpleExprValue(_, javaType) if varsPassIn.contains(s) => if (aliasedExprs.contains(s)) { aliasedExprs(s) } else { val aliasedVariable = JavaCode.variable(ctx.freshName("aliasedVar"), javaType) aliasedExprs += s -> aliasedVariable varsPassIn += aliasedVariable aliasedVariable } } } val params = varsPassIn.filter(!_.isInstanceOf[SimpleExprValue]).map { variable => s"${variable.javaType.getName} ${variable.variableName}" }.mkString(", ") val funcName = ctx.freshName("nodeName") val javaType = CodeGenerator.javaType(dataType) val newValue = JavaCode.variable(ctx.freshName("value"), dataType) val funcFullName = ctx.addNewFunction(funcName, s""" |private $javaType $funcName($params) { | $newCode | $setIsNull | return ${eval.value}; |} """.stripMargin) eval.value = newValue val args = varsPassIn.filter(!_.isInstanceOf[SimpleExprValue]).map { variable => s"${variable.variableName}" } // Create a code block to assign statements to aliased variables. val createVariables = aliasedExprs.foldLeft(EmptyBlock) { case (block, (statement, variable)) => block + code"${statement.javaType.getName} $variable = $statement;" } eval.code = createVariables + code"$javaType $newValue = $funcFullName($args);" } ``` ## How was this patch tested? Added unit tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #21405 from viirya/codeblock-api.	2018-07-05 20:48:55 +08:00
Antonio Murgia	4be9f0c028	[SPARK-24673][SQL] scala sql function from_utc_timestamp second argument could be Column instead of String ## What changes were proposed in this pull request? Add an overloaded version to `from_utc_timestamp` and `to_utc_timestamp` having second argument as a `Column` instead of `String`. ## How was this patch tested? Unit testing, especially adding two tests to org.apache.spark.sql.DateFunctionsSuite.scala Author: Antonio Murgia <antonio.murgia@agilelab.it> Author: Antonio Murgia <antonio.murgia2@studio.unibo.it> Closes #21693 from tmnd1991/feature/SPARK-24673.	2018-07-05 16:10:34 +08:00
Xiao Li	489a5294d1	[SPARK-17213][SPARK-17213][FOLLOW-UP] Improve the test of ## What changes were proposed in this pull request? This is a minor improvement for the test of SPARK-17213 ## How was this patch tested? N/A Author: Xiao Li <gatorsmile@gmail.com> Closes #21716 from gatorsmile/testMaster23.	2018-07-05 09:56:48 +08:00
Wenchen Fan	bf764a33be	[SPARK-22384][SQL][FOLLOWUP] Refine partition pruning when attribute is wrapped in Cast ## What changes were proposed in this pull request? As mentioned in https://github.com/apache/spark/pull/21586 , `Cast.mayTruncate` is not 100% safe, string to boolean is allowed. Since changing `Cast.mayTruncate` also changes the behavior of Dataset, here I propose to add a new `Cast.canSafeCast` for partition pruning. ## How was this patch tested? new test cases Author: Wenchen Fan <wenchen@databricks.com> Closes #21712 from cloud-fan/safeCast.	2018-07-04 18:36:09 -07:00
Liang-Chi Hsieh	1a2655a9e7	[SPARK-24635][SQL] Remove Blocks class from JavaCode class hierarchy ## What changes were proposed in this pull request? The `Blocks` class in `JavaCode` class hierarchy is not necessary. Its function can be taken by `CodeBlock`. We should remove it to make simpler class hierarchy. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #21619 from viirya/SPARK-24635.	2018-07-04 20:42:08 +08:00
Yuming Wang	021145f364	[SPARK-24716][SQL] Refactor ParquetFilters ## What changes were proposed in this pull request? Replace DataFrame schema to Parquet file schema when create `ParquetFilters`. Thus we can easily implement `Decimal` and `Timestamp` push down. some thing like this: ```scala // DecimalType: 32BitDecimalType case ParquetSchemaType(DECIMAL, INT32, decimal) if pushDownDecimal => (n: String, v: Any) => FilterApi.eq( intColumn(n), Option(v).map(_.asInstanceOf[JBigDecimal].unscaledValue().intValue() .asInstanceOf[Integer]).orNull) // DecimalType: 64BitDecimalType case ParquetSchemaType(DECIMAL, INT64, decimal) if pushDownDecimal => (n: String, v: Any) => FilterApi.eq( longColumn(n), Option(v).map(_.asInstanceOf[JBigDecimal].unscaledValue().longValue() .asInstanceOf[java.lang.Long]).orNull) // DecimalType: LegacyParquetFormat 32BitDecimalType & 64BitDecimalType case ParquetSchemaType(DECIMAL, FIXED_LEN_BYTE_ARRAY, decimal) if pushDownDecimal && decimal.getPrecision <= Decimal.MAX_LONG_DIGITS => (n: String, v: Any) => FilterApi.eq( binaryColumn(n), Option(v).map(d => decimalToBinaryUsingUnscaledLong(decimal.getPrecision, d.asInstanceOf[JBigDecimal])).orNull) // DecimalType: ByteArrayDecimalType case ParquetSchemaType(DECIMAL, FIXED_LEN_BYTE_ARRAY, decimal) if pushDownDecimal && decimal.getPrecision > Decimal.MAX_LONG_DIGITS => (n: String, v: Any) => FilterApi.eq( binaryColumn(n), Option(v).map(d => decimalToBinaryUsingUnscaledBytes(decimal.getPrecision, d.asInstanceOf[JBigDecimal])).orNull) ``` ```scala // INT96 doesn't support pushdown case ParquetSchemaType(TIMESTAMP_MICROS, INT64, null) => (n: String, v: Any) => FilterApi.eq( longColumn(n), Option(v).map(t => DateTimeUtils.fromJavaTimestamp(t.asInstanceOf[Timestamp]) .asInstanceOf[java.lang.Long]).orNull) case ParquetSchemaType(TIMESTAMP_MILLIS, INT64, null) => (n: String, v: Any) => FilterApi.eq( longColumn(n), Option(v).map(_.asInstanceOf[Timestamp].getTime.asInstanceOf[java.lang.Long]).orNull) ``` ## 
How was this patch tested? unit tests Author: Yuming Wang <yumwang@ebay.com> Closes #21696 from wangyum/SPARK-24716.	2018-07-04 20:15:40 +08:00
Takeshi Yamamuro	b2deef64f6	[SPARK-24727][SQL] Add a static config to control cache size for generated classes ## What changes were proposed in this pull request? Since SPARK-24250 has been resolved, executors correctly reference user-defined configurations. So this PR adds a static config to control the cache size for generated classes in `CodeGenerator`. ## How was this patch tested? Added tests in `ExecutorSideSQLConfSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #21705 from maropu/SPARK-24727.	2018-07-04 20:04:18 +08:00
Takuya UESHIN	7c08eb6d61	[SPARK-24732][SQL] Type coercion between MapTypes. ## What changes were proposed in this pull request? Currently we don't allow type coercion between maps. We can support type coercion between MapTypes where both the key types and the value types are compatible. ## How was this patch tested? Added tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #21703 from ueshin/issues/SPARK-24732/maptypecoercion.	2018-07-04 12:21:26 +08:00
Maxim Gekk	776f299fc8	[SPARK-24709][SQL] schema_of_json() - schema inference from an example ## What changes were proposed in this pull request? This PR adds a new function, `schema_of_json()`, which infers the schema of a JSON string literal. The result of the function is a string containing the schema in DDL format. One use case is combining `schema_of_json()` with `from_json()`. Currently, `from_json()` requires a schema as a mandatory argument. With `schema_of_json()`, the caller can instead supply an example JSON string that has the same schema as the first argument of `from_json()`. For instance: ```sql select from_json(json_column, schema_of_json('{"c1": [0], "c2": [{"c3":0}]}')) from json_table; ``` ## How was this patch tested? Added new tests to `JsonFunctionsSuite` and `JsonExpressionsSuite`, and SQL tests to `json-functions.sql`. Author: Maxim Gekk <maxim.gekk@databricks.com> Closes #21686 from MaxGekk/infer_schema_json.	2018-07-04 09:38:18 +08:00
DB Tsai	5585c5765f	[SPARK-24420][BUILD] Upgrade ASM to 6.1 to support JDK9+ ## What changes were proposed in this pull request? Upgrade ASM to 6.1 to support JDK9+ ## How was this patch tested? Existing tests. Author: DB Tsai <d_tsai@apple.com> Closes #21459 from dbtsai/asm.	2018-07-03 10:13:48 -07:00
Marco Gaido	a7c8f0c8cb	[SPARK-24385][SQL] Resolve self-join condition ambiguity for EqualNullSafe ## What changes were proposed in this pull request? In Dataset.join we have a small hack for resolving ambiguity in the column name for self-joins. The current code supports only `EqualTo`. The PR extends the fix to `EqualNullSafe`. Credit for this PR should be given to daniel-shields. ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #21605 from mgaido91/SPARK-24385_2.	2018-07-03 12:20:03 +08:00
Yuanjian Li	8f91c697e2	[SPARK-24665][PYSPARK] Use SQLConf in PySpark to manage all sql configs ## What changes were proposed in this pull request? Use SQLConf for PySpark to manage all sql configs, drop all the hard code in config usage. ## How was this patch tested? Existing UT. Author: Yuanjian Li <xyliyuanjian@gmail.com> Closes #21648 from xuanyuanking/SPARK-24665.	2018-07-02 14:35:37 +08:00
Xiao Li	d54d8b8630	simplify rand in dsl/package.scala	2018-06-29 23:51:13 -07:00
maryannxue	797971ed42	[SPARK-24696][SQL] ColumnPruning rule fails to remove extra Project ## What changes were proposed in this pull request? The ColumnPruning rule tries adding an extra Project if an input node produces more fields than needed, but as a post-processing step, it needs to remove the lower Project in the form of "Project - Filter - Project"; otherwise it would conflict with PushPredicatesThroughProject and would thus cause an infinite optimization loop. The current post-processing method is defined as: ``` private def removeProjectBeforeFilter(plan: LogicalPlan): LogicalPlan = plan transform { case p1 @ Project(_, f @ Filter(_, p2 @ Project(_, child))) if p2.outputSet.subsetOf(child.outputSet) => p1.copy(child = f.copy(child = child)) } ``` This method works well when there is only one Filter but not when there are two or more Filters. In this case, there is a deterministic filter and a non-deterministic filter, so they stay as separate filter nodes and cannot be combined. A simplified illustration of the optimization process that forms the infinite loop is shown below (F1 stands for the 1st filter, F2 for the 2nd filter, P for project, S for scan of relation, PredicatePushDown as an abbreviation of PushPredicatesThroughProject): ``` F1 - F2 - P - S PredicatePushDown => F1 - P - F2 - S ColumnPruning => F1 - P - F2 - P - S => F1 - P - F2 - S (Project removed) PredicatePushDown => P - F1 - F2 - S ColumnPruning => P - F1 - P - F2 - S => P - F1 - P - F2 - P - S => P - F1 - F2 - P - S (only one Project removed) RemoveRedundantProject => F1 - F2 - P - S (goes back to the loop start) ``` So the problem is that the ColumnPruning rule adds a Project under a Filter (and fails to remove it in the end), and that new Project triggers PushPredicatesThroughProject. Once the filters have been pushed through the Project, a new Project is added by the ColumnPruning rule, and this goes on and on. 
The fix: the rule still applies top-down when adding Projects, but the later removal of extra Projects goes bottom-up to ensure all extra Projects can be matched. ## How was this patch tested? Added an optimization rule test in ColumnPruningSuite and an end-to-end test in SQLQuerySuite. Author: maryannxue <maryannxue@apache.org> Closes #21674 from maryannxue/spark-24696.	2018-06-29 23:46:12 -07:00
Yuming Wang	03545ce6de	[SPARK-24638][SQL] StringStartsWith support push down ## What changes were proposed in this pull request? `StringStartsWith` support push down. About 50% savings in compute time. ## How was this patch tested? unit tests, manual tests and performance test: ```scala cat <<EOF > SPARK-24638.scala def benchmark(func: () => Unit): Long = { val start = System.currentTimeMillis() for(i <- 0 until 100) { func() } val end = System.currentTimeMillis() end - start } val path = "/tmp/spark/parquet/string/" spark.range(10000000).selectExpr("concat(id, 'str', id) as id").coalesce(1).write.mode("overwrite").option("parquet.block.size", 1048576).parquet(path) val df = spark.read.parquet(path) spark.sql("set spark.sql.parquet.filterPushdown.string.startsWith=true") val pushdownEnable = benchmark(() => df.where("id like '999998%'").count()) spark.sql("set spark.sql.parquet.filterPushdown.string.startsWith=false") val pushdownDisable = benchmark(() => df.where("id like '999998%'").count()) val improvements = pushdownDisable - pushdownEnable println(s"improvements: $improvements") EOF bin/spark-shell -i SPARK-24638.scala ``` result: ```scala Loading SPARK-24638.scala... benchmark: (func: () => Unit)Long path: String = /tmp/spark/parquet/string/ df: org.apache.spark.sql.DataFrame = [id: string] res1: org.apache.spark.sql.DataFrame = [key: string, value: string] pushdownEnable: Long = 11608 res2: org.apache.spark.sql.DataFrame = [key: string, value: string] pushdownDisable: Long = 31981 improvements: Long = 20373 ``` Author: Yuming Wang <yumwang@ebay.com> Closes #21623 from wangyum/SPARK-24638.	2018-06-30 13:58:50 +08:00
Jose Torres	f6e6899a8b	[SPARK-24386][SS] coalesce(1) aggregates in continuous processing ## What changes were proposed in this pull request? Provide a continuous processing implementation of coalesce(1), as well as allowing aggregates on top of it. The changes in ContinuousQueuedDataReader and such are to use split.index (the ID of the partition within the RDD currently being compute()d) rather than context.partitionId() (the partition ID of the scheduled task within the Spark job - that is, the post coalesce writer). In the absence of a narrow dependency, these values were previously always the same, so there was no need to distinguish. ## How was this patch tested? new unit test Author: Jose Torres <torres.joseph.f+github@gmail.com> Closes #21560 from jose-torres/coalesce.	2018-06-28 16:25:40 -07:00
Jacek Laskowski	e1d3f80103	[SPARK-24408][SQL][DOC] Move abs function to math_funcs group ## What changes were proposed in this pull request? A few math functions (`abs` , `bitwiseNOT`, `isnan`, `nanvl`) are not in math_funcs group. They should really be. ## How was this patch tested? Awaiting Jenkins Author: Jacek Laskowski <jacek@japila.pl> Closes #21448 from jaceklaskowski/SPARK-24408-math-funcs-doc.	2018-06-28 13:22:52 -07:00
Xingbo Jiang	5b05966488	[SPARK-24564][TEST] Add test suite for RecordBinaryComparator ## What changes were proposed in this pull request? Add a new test suite to test RecordBinaryComparator. ## How was this patch tested? New test suite. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #21570 from jiangxb1987/rbc-test.	2018-06-28 14:19:50 +08:00
Fokko Driesprong	6a97e8eb31	[SPARK-24603][SQL] Fix findTightestCommonType reference in comments findTightestCommonTypeOfTwo has been renamed to findTightestCommonType. Author: Fokko Driesprong <fokkodriesprong@godatadriven.com> Closes #21597 from Fokko/fd-typo.	2018-06-28 09:59:00 +08:00
Takeshi Yamamuro	1c9acc2438	[SPARK-24206][SQL][FOLLOW-UP] Update DataSourceReadBenchmark benchmark results ## What changes were proposed in this pull request? This pr corrected the default configuration (`spark.master=local[1]`) for benchmarks. Also, this updated performance results on the AWS `r3.xlarge`. ## How was this patch tested? N/A Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #21625 from maropu/FixDataSourceReadBenchmark.	2018-06-28 09:21:10 +08:00
Takeshi Yamamuro	bd32b509a1	[SPARK-24645][SQL] Skip parsing when csvColumnPruning enabled and partitions scanned only ## What changes were proposed in this pull request? In the master, when `csvColumnPruning`(implemented in [this commit](`64fad0b519 (diff-d19881aceddcaa5c60620fdcda99b4c4)`)) enabled and partitions scanned only, it throws an exception below; ``` scala> val dir = "/tmp/spark-csv/csv" scala> spark.range(10).selectExpr("id % 2 AS p", "id").write.mode("overwrite").partitionBy("p").csv(dir) scala> spark.read.csv(dir).selectExpr("sum(p)").collect() 18/06/25 13:12:51 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 5) java.lang.NullPointerException at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:197) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:190) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:309) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:309) at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:61) ... ``` This pr modified code to skip CSV parsing in the case. ## How was this patch tested? Added tests in `CSVSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #21631 from maropu/SPARK-24645.	2018-06-28 09:19:25 +08:00
Kallman, Steven	c5aa54d54b	[SPARK-24553][WEB-UI] http 302 fixes for href redirect ## What changes were proposed in this pull request? Updated URL/href links to include a '/' before '?id' to make links consistent and avoid http 302 redirect errors within UI port 4040 tabs. ## How was this patch tested? Built a runnable distribution and executed jobs. Validated that http 302 redirects are no longer encountered when clicking on links within UI port 4040 tabs. Author: Steven Kallman <SJKallmangmail.com> Author: Kallman, Steven <Steven.Kallman@CapitalOne.com> Closes #21600 from SJKallman/{Spark-24553}{WEB-UI}-redirect-href-fixes.	2018-06-27 15:36:59 -07:00
Takeshi Yamamuro	893ea224cc	[SPARK-24204][SQL] Verify a schema in Json/Orc/ParquetFileFormat ## What changes were proposed in this pull request? This pr added code to verify a schema in Json/Orc/ParquetFileFormat along with CSVFileFormat. ## How was this patch tested? Added verification tests in `FileBasedDataSourceSuite` and `HiveOrcSourceSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #21389 from maropu/SPARK-24204.	2018-06-27 15:25:51 -07:00
debugger87	c04cb2d1b7	[SPARK-21687][SQL] Spark SQL should set createTime for Hive partition ## What changes were proposed in this pull request? Set createTime for every hive partition created in Spark SQL, which could be used to manage data lifecycle in Hive warehouse. We found that almost every partition modified by spark sql has not been set createTime. ``` mysql> select * from partitions where create_time=0 limit 1\G; ************************* 1. row ************************* PART_ID: 1028584 CREATE_TIME: 0 LAST_ACCESS_TIME: 1502203611 PART_NAME: date=20170130 SD_ID: 1543605 TBL_ID: 211605 LINK_TARGET_ID: NULL 1 row in set (0.27 sec) ``` ## How was this patch tested? N/A Author: debugger87 <yangchaozhong.2009@gmail.com> Author: Chaozhong Yang <yangchaozhong.2009@gmail.com> Closes #18900 from debugger87/fix/set-create-time-for-hive-partition.	2018-06-27 11:34:28 -07:00
Yuanjian Li	6a0b77a55d	[SPARK-24215][PYSPARK][FOLLOW UP] Implement eager evaluation for DataFrame APIs in PySpark ## What changes were proposed in this pull request? Address comments in #21370 and add more test. ## How was this patch tested? Enhance test in pyspark/sql/test.py and DataFrameSuite Author: Yuanjian Li <xyliyuanjian@gmail.com> Closes #21553 from xuanyuanking/SPARK-24215-follow.	2018-06-27 10:43:06 -07:00
Takuya UESHIN	9a76f23c6a	[SPARK-23927][SQL][FOLLOW-UP] Fix a build failure. ## What changes were proposed in this pull request? This pr is a follow-up pr of #21155. The #21155 removed unnecessary import at that time, but the import became necessary in another pr. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #21646 from ueshin/issues/SPARK-23927/fup1.	2018-06-27 11:52:48 +08:00
Vayda, Oleksandr: IT (PRG)	2669b4de3b	[SPARK-23927][SQL] Add "sequence" expression ## What changes were proposed in this pull request? The PR adds the SQL function ```sequence```. https://issues.apache.org/jira/browse/SPARK-23927 The behavior of the function is based on Presto's. Ref: https://prestodb.io/docs/current/functions/array.html - ```sequence(start, stop) → array<bigint>``` Generate a sequence of integers from ```start``` to ```stop```, incrementing by ```1``` if ```start``` is less than or equal to ```stop```, otherwise ```-1```. - ```sequence(start, stop, step) → array<bigint>``` Generate a sequence of integers from ```start``` to ```stop```, incrementing by ```step```. - ```sequence(start_date, stop_date) → array<date>``` Generate a sequence of dates from ```start_date``` to ```stop_date```, incrementing by ```interval 1 day``` if ```start_date``` is less than or equal to ```stop_date```, otherwise ```- interval 1 day```. - ```sequence(start_date, stop_date, step_interval) → array<date>``` Generate a sequence of dates from ```start_date``` to ```stop_date```, incrementing by ```step_interval```. The type of ```step_interval``` is ```CalendarInterval```. - ```sequence(start_timestamp, stop_timestamp) → array<timestamp>``` Generate a sequence of timestamps from ```start_timestamp``` to ```stop_timestamp```, incrementing by ```interval 1 day``` if ```start_timestamp``` is less than or equal to ```stop_timestamp```, otherwise ```- interval 1 day```. - ```sequence(start_timestamp, stop_timestamp, step_interval) → array<timestamp>``` Generate a sequence of timestamps from ```start_timestamp``` to ```stop_timestamp```, incrementing by ```step_interval```. The type of ```step_interval``` is ```CalendarInterval```. ## How was this patch tested? Added unit tests. Author: Vayda, Oleksandr: IT (PRG) <Oleksandr.Vayda@barclayscapital.com> Closes #21155 from wajda/feature/array-api-sequence.	2018-06-27 11:52:31 +09:00
Maxim Gekk	d08f53dc61	[SPARK-24605][SQL] size(null) returns null instead of -1 ## What changes were proposed in this pull request? This PR proposes new behavior for `size(null)` under the config flag `spark.sql.legacy.sizeOfNull`. If the flag is disabled, the `size()` function returns `null` for `null` input. By default, `spark.sql.legacy.sizeOfNull` is enabled to keep backward compatibility with previous versions. In that case, `size(null)` returns `-1`. ## How was this patch tested? Modified existing tests for the `size()` function to check the new behavior (`null`) and the old one (`-1`). Author: Maxim Gekk <maxim.gekk@databricks.com> Closes #21598 from MaxGekk/legacy-size-of-null.	2018-06-27 10:36:51 +08:00
Kris Mok	1b9368f7d4	[SPARK-24659][SQL] GenericArrayData.equals should respect element type differences ## What changes were proposed in this pull request? Fix `GenericArrayData.equals`, so that it respects the actual types of the elements. e.g. an instance that represents an `array<int>` and another instance that represents an `array<long>` should be considered incompatible, and thus should return false for `equals`. `GenericArrayData` doesn't keep any schema information by itself, and rather relies on the Java objects referenced by its `array` field's elements to keep track of their own object types. So, the most straightforward way to respect their types is to call `equals` on the elements, instead of using Scala's `==` operator, which can have semantics that are not always desirable: ``` new java.lang.Integer(123) == new java.lang.Long(123L) // true in Scala new java.lang.Integer(123).equals(new java.lang.Long(123L)) // false in Scala ``` ## How was this patch tested? Added unit test in `ComplexDataSuite` Author: Kris Mok <kris.mok@databricks.com> Closes #21643 from rednaxelafx/fix-genericarraydata-equals.	2018-06-27 10:27:40 +08:00
Dilip Biswal	02f8781fa2	[SPARK-24423][SQL] Add a new option for JDBC sources ## What changes were proposed in this pull request? Here is the description in the JIRA - Currently, our JDBC connector provides the option `dbtable` for users to specify the to-be-loaded JDBC source table. ```scala val jdbcDf = spark.read .format("jdbc") .option("dbtable", "dbName.tableName") .options(jdbcCredentials: Map) .load() ``` Normally, users do not fetch the whole JDBC table due to the poor performance/throughput of JDBC. Thus, they normally just fetch a small set of tables. Advanced users can pass a subquery as the option. ```scala val query = """ (select * from tableName limit 10) as tmp """ val jdbcDf = spark.read .format("jdbc") .option("dbtable", query) .options(jdbcCredentials: Map) .load() ``` However, this is not straightforward for end users. We should simply allow users to specify the query by a new option `query`. We will handle the complexity for them. ```scala val query = """select * from tableName limit 10""" val jdbcDf = spark.read .format("jdbc") .option("query", query) .options(jdbcCredentials: Map) .load() ``` ## How was this patch tested? Added tests in JDBCSuite and JDBCWriterSuite. Also tested against MySQL, PostgreSQL, Oracle, DB2 (using docker infrastructure) to make sure there are no syntax issues. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #21590 from dilipbiswal/SPARK-24423.	2018-06-26 15:17:00 -07:00
Yuming Wang	dcaa49ff1e	[SPARK-24658][SQL] Remove workaround for ANTLR bug ## What changes were proposed in this pull request? Issue antlr/antlr4#781 has already been fixed, so the workaround of extracting the pattern into a separate rule is no longer needed. Presto has already removed it: https://github.com/prestodb/presto/pull/10744. ## How was this patch tested? Existing tests Author: Yuming Wang <yumwang@ebay.com> Closes #21641 from wangyum/ANTLR-780.	2018-06-26 14:33:04 -07:00
Marek Novotny	e07aee2165	[SPARK-24636][SQL] Type coercion of arrays for array_join function ## What changes were proposed in this pull request? Presto's implementation accepts arbitrary arrays of primitive types as an input: ``` presto> SELECT array_join(ARRAY [1, 2, 3], ', '); _col0 --------- 1, 2, 3 (1 row) ``` This PR proposes to implement a type coercion rule for the ```array_join``` function that converts arrays of primitive as well as non-primitive types to arrays of strings. ## How was this patch tested? New test cases added to: - sql-tests/inputs/typeCoercion/native/arrayJoin.sql - DataFrameFunctionsSuite.scala Author: Marek Novotny <mn.mikke@gmail.com> Closes #21620 from mn-mikke/SPARK-24636.	2018-06-26 09:51:55 +08:00
Bryan Cutler	d48803bf64	[SPARK-24324][PYTHON][FOLLOWUP] Grouped Map positional conf should have deprecation note ## What changes were proposed in this pull request? Followup to the discussion of the added conf in SPARK-24324 which allows assignment by column position only. This conf is to preserve old behavior and will be removed in future releases, so it should have a note to indicate that. ## How was this patch tested? NA Author: Bryan Cutler <cutlerb@gmail.com> Closes #21637 from BryanCutler/arrow-groupedMap-conf-deprecate-followup-SPARK-24324.	2018-06-25 17:08:23 -07:00
Marcelo Vanzin	6d16b9885d	[SPARK-24552][CORE][SQL] Use task ID instead of attempt number for writes. This passes the unique task attempt id instead of attempt number to v2 data sources because attempt number is reused when stages are retried. When attempt numbers are reused, sources that track data by partition id and attempt number may incorrectly clean up data because the same attempt number can be both committed and aborted. For v1 / Hadoop writes, generate a unique ID based on available attempt numbers to avoid a similar problem. Closes #21558 Author: Marcelo Vanzin <vanzin@cloudera.com> Author: Ryan Blue <blue@apache.org> Closes #21606 from vanzin/SPARK-24552.2.	2018-06-25 16:54:57 -07:00
Stacy Kerkela	5264164a67	[SPARK-24648][SQL] SqlMetrics should be threadsafe Use LongAdder to make SQLMetrics thread safe. ## What changes were proposed in this pull request? Replace += with LongAdder.add() for concurrent counting ## How was this patch tested? Unit tests with local threads Author: Stacy Kerkela <stacy.kerkela@databricks.com> Closes #21634 from dbkerkela/sqlmetrics-concurrency-stacy.	2018-06-25 23:41:39 +02:00
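The `LongAdder` change above can be illustrated with a minimal, standalone Java sketch. `MetricSketch` and `countConcurrently` are hypothetical names used only for illustration; this is not Spark's `SQLMetric` class, just an assumption-laden picture of why `LongAdder.add()` stays correct under concurrent updates where a plain `+=` on a `long` field would not.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for a metric accumulator; not Spark's SQLMetric.
final class MetricSketch {
    // LongAdder keeps per-thread cells internally, so concurrent add() calls are not lost.
    private final LongAdder value = new LongAdder();

    void add(long v) {
        value.add(v);
    }

    long value() {
        return value.sum();
    }

    // Starts `threads` workers that each add 1 to a shared metric `perThread` times,
    // waits for all of them, and returns the final sum.
    static long countConcurrently(int threads, int perThread) {
        MetricSketch metric = new MetricSketch();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    metric.add(1);
                }
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return metric.value();
    }

    public static void main(String[] args) {
        System.out.println(countConcurrently(4, 10000));
    }
}
```

Reading `sum()` only after every worker has been joined gives an exact total; a read taken while updates are still in flight is only a snapshot.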
Marco Gaido	594ac4f7b8	[SPARK-24633][SQL] Fix codegen when split is required for arrays_zip ## What changes were proposed in this pull request? In function arrays_zip, when splitting is required because of a high number of arguments, a codegen error can happen. The PR fixes codegen for cases when splitting the code is required. ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #21621 from mgaido91/SPARK-24633.	2018-06-25 23:44:20 +08:00
Maryann Xue	bac50aa371	[SPARK-24596][SQL] Non-cascading Cache Invalidation ## What changes were proposed in this pull request? 1. Add parameter 'cascade' in CacheManager.uncacheQuery(). Under 'cascade=false' mode, only invalidate the current cache, and for other dependent caches, rebuild execution plan and reuse cached buffer. 2. Pass true/false from callers in different uncache scenarios: - Drop tables and regular (persistent) views: regular mode - Drop temporary views: non-cascading mode - Modify table contents (INSERT/UPDATE/MERGE/DELETE): regular mode - Call `Dataset.unpersist()`: non-cascading mode - Call `Catalog.uncacheTable()`: follow the same convention as drop tables/view, that is, use non-cascading mode for temporary views and regular mode for the rest Note that a regular (persistent) view is a database object just like a table, so after dropping a regular view (whether cached or not cached), any query referring to that view should no longer be valid. Hence if a cached persistent view is dropped, we need to invalidate all the dependent caches so that exceptions will be thrown for any later reference. On the other hand, a temporary view is in fact equivalent to an unnamed Dataset, and dropping a temporary view should have no impact on queries referencing that view. Thus we should do non-cascading uncaching for temporary views, which also guarantees a consistent uncaching behavior between temporary views and unnamed Datasets. ## How was this patch tested? New tests in CachedTableSuite and DatasetCacheSuite. Author: Maryann Xue <maryannxue@apache.org> Closes #21594 from maryannxue/noncascading-cache.	2018-06-25 07:17:30 -07:00
Takuya UESHIN	6e0596e263	[SPARK-23931][SQL][FOLLOW-UP] Make `arrays_zip` in function.scala `@scala.annotation.varargs`. ## What changes were proposed in this pull request? This is a follow-up pr of #21045 which added `arrays_zip`. The `arrays_zip` in functions.scala should've been `scala.annotation.varargs`. This pr makes it `scala.annotation.varargs`. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #21630 from ueshin/issues/SPARK-23931/fup1.	2018-06-24 23:56:47 -07:00
Takeshi Yamamuro	f596ebe4d3	[SPARK-24327][SQL] Verify and normalize a partition column name based on the JDBC resolved schema ## What changes were proposed in this pull request? This pr modified JDBC datasource code to verify and normalize a partition column based on the JDBC resolved schema before building `JDBCRelation`. Closes #20370 ## How was this patch tested? Added tests in `JDBCSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #21379 from maropu/SPARK-24327.	2018-06-24 23:14:42 -07:00
Bryan Cutler	a5849ad9a3	[SPARK-24324][PYTHON] Pandas Grouped Map UDF should assign result columns by name ## What changes were proposed in this pull request? Currently, a `pandas_udf` of type `PandasUDFType.GROUPED_MAP` will assign the resulting columns based on index of the return pandas.DataFrame. If a new DataFrame is returned and constructed using a dict, then the order of the columns could be arbitrary and be different than the defined schema for the UDF. If the schema types still match, then no error will be raised and the user will see column names and column data mixed up. This change will first try to assign columns using the return type field names. If a KeyError occurs, then the column index is checked if it is string based. If so, then the error is raised as it is most likely a naming mistake, else it will fallback to assign columns by position and raise a TypeError if the field types do not match. ## How was this patch tested? Added a test that returns a new DataFrame with column order different than the schema. Author: Bryan Cutler <cutlerb@gmail.com> Closes #21427 from BryanCutler/arrow-grouped-map-mixesup-cols-SPARK-24324.	2018-06-24 09:28:46 +08:00
Takeshi Yamamuro	98f363b774	[SPARK-24206][SQL] Improve FilterPushdownBenchmark benchmark code ## What changes were proposed in this pull request? This pr added benchmark code `FilterPushdownBenchmark` for string pushdown and updated performance results on the AWS `r3.xlarge`. ## How was this patch tested? N/A Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #21288 from maropu/UpdateParquetBenchmark.	2018-06-23 17:51:18 -07:00
Maxim Gekk	c7e2742f9b	[SPARK-24190][SQL] Allow saving of JSON files in UTF-16 and UTF-32 ## What changes were proposed in this pull request? Currently, restrictions in JSONOptions for `encoding` and `lineSep` are the same for read and for write. For example, the requirement for `lineSep` means that the code: ``` df.write.option("encoding", "UTF-32BE").json(file) ``` cannot omit `lineSep` and rely on its default value `\n`, because it throws the exception: ``` requirement failed: The lineSep option must be specified for the UTF-32BE encoding java.lang.IllegalArgumentException: requirement failed: The lineSep option must be specified for the UTF-32BE encoding ``` In the PR, I propose to separate JSONOptions in read and write, and make JSONOptions in write less restrictive. ## How was this patch tested? Added new test for blacklisted encodings in read. And the `lineSep` option was removed in write for some tests. Author: Maxim Gekk <maxim.gekk@databricks.com> Author: Maxim Gekk <max.gekk@gmail.com> Closes #21247 from MaxGekk/json-options-in-write.	2018-06-23 17:40:20 -07:00
Marek Novotny	92c2f00bd2	[SPARK-23934][SQL] Adding map_from_entries function ## What changes were proposed in this pull request? The PR adds the `map_from_entries` function that returns a map created from the given array of entries. ## How was this patch tested? New tests added into: - `CollectionExpressionSuite` - `DataFrameFunctionSuite` ## CodeGen Examples ### Primitive-type Keys and Values ``` val idf = Seq( Seq((1, 10), (2, 20), (3, 10)), Seq((1, 10), null, (2, 20)) ).toDF("a") idf.filter('a.isNotNull).select(map_from_entries('a)).debugCodegen ``` Result: ``` boolean project_isNull_0 = false; MapData project_value_0 = null; for (int project_idx_2 = 0; !project_isNull_0 && project_idx_2 < inputadapter_value_0.numElements(); project_idx_2++) { project_isNull_0 |= inputadapter_value_0.isNullAt(project_idx_2); } if (!project_isNull_0) { final int project_numEntries_0 = inputadapter_value_0.numElements(); final long project_keySectionSize_0 = UnsafeArrayData.calculateSizeOfUnderlyingByteArray(project_numEntries_0, 4); final long project_valueSectionSize_0 = UnsafeArrayData.calculateSizeOfUnderlyingByteArray(project_numEntries_0, 4); final long project_byteArraySize_0 = 8 + project_keySectionSize_0 + project_valueSectionSize_0; if (project_byteArraySize_0 > 2147483632) { final Object[] project_keys_0 = new Object[project_numEntries_0]; final Object[] project_values_0 = new Object[project_numEntries_0]; for (int project_idx_1 = 0; project_idx_1 < project_numEntries_0; project_idx_1++) { InternalRow project_entry_1 = inputadapter_value_0.getStruct(project_idx_1, 2); project_keys_0[project_idx_1] = project_entry_1.getInt(0); project_values_0[project_idx_1] = project_entry_1.getInt(1); } project_value_0 = org.apache.spark.sql.catalyst.util.ArrayBasedMapData.apply(project_keys_0, project_values_0); } else { final byte[] project_byteArray_0 = new byte[(int)project_byteArraySize_0]; UnsafeMapData project_unsafeMapData_0 = new UnsafeMapData(); Platform.putLong(project_byteArray_0, 16, project_keySectionSize_0); Platform.putLong(project_byteArray_0, 24, project_numEntries_0); Platform.putLong(project_byteArray_0, 24 + project_keySectionSize_0, project_numEntries_0); project_unsafeMapData_0.pointTo(project_byteArray_0, 16, (int)project_byteArraySize_0); ArrayData project_keyArrayData_0 = project_unsafeMapData_0.keyArray(); ArrayData project_valueArrayData_0 = project_unsafeMapData_0.valueArray(); for (int project_idx_0 = 0; project_idx_0 < project_numEntries_0; project_idx_0++) { InternalRow project_entry_0 = inputadapter_value_0.getStruct(project_idx_0, 2); project_keyArrayData_0.setInt(project_idx_0, project_entry_0.getInt(0)); project_valueArrayData_0.setInt(project_idx_0, project_entry_0.getInt(1)); } project_value_0 = project_unsafeMapData_0; } } ``` ### Non-primitive-type Keys and Values ``` val sdf = Seq( Seq(("a", null), ("b", "bb"), ("c", "aa")), Seq(("a", "aa"), null, (null, "bb")) ).toDF("a") sdf.filter('a.isNotNull).select(map_from_entries('a)).debugCodegen ``` Result: ``` boolean project_isNull_0 = false; MapData project_value_0 = null; for (int project_idx_1 = 0; !project_isNull_0 && project_idx_1 < inputadapter_value_0.numElements(); project_idx_1++) { project_isNull_0 |= inputadapter_value_0.isNullAt(project_idx_1); } if (!project_isNull_0) { final int project_numEntries_0 = inputadapter_value_0.numElements(); final Object[] project_keys_0 = new Object[project_numEntries_0]; final Object[] project_values_0 = new Object[project_numEntries_0]; for (int project_idx_0 = 0; project_idx_0 < project_numEntries_0; project_idx_0++) { InternalRow project_entry_0 = inputadapter_value_0.getStruct(project_idx_0, 2); if (project_entry_0.isNullAt(0)) { throw new RuntimeException("The first field from a struct (key) can't be null."); } project_keys_0[project_idx_0] = project_entry_0.getUTF8String(0); project_values_0[project_idx_0] = project_entry_0.getUTF8String(1); } project_value_0 = org.apache.spark.sql.catalyst.util.ArrayBasedMapData.apply(project_keys_0, project_values_0); } ``` Author: Marek Novotny <mn.mikke@gmail.com> Closes #21282 from mn-mikke/feature/array-api-map_from_entries-to-master.	2018-06-22 16:18:22 +09:00
Wenchen Fan	dc8a6befa5	[SPARK-24588][SS] streaming join should require HashClusteredPartitioning from children ## What changes were proposed in this pull request? In https://github.com/apache/spark/pull/19080 we simplified the distribution/partitioning framework, and made all the join-like operators require `HashClusteredDistribution` from children. Unfortunately the streaming join operator was missed. This can cause wrong results. Consider ``` val input1 = MemoryStream[Int] val input2 = MemoryStream[Int] val df1 = input1.toDF.select('value as 'a, 'value * 2 as 'b) val df2 = input2.toDF.select('value as 'a, 'value * 2 as 'b).repartition('b) val joined = df1.join(df2, Seq("a", "b")).select('a) ``` The physical plan is ``` (3) Project [a#5] +- StreamingSymmetricHashJoin [a#5, b#6], [a#10, b#11], Inner, condition = [ leftOnly = null, rightOnly = null, both = null, full = null ], state info [ checkpoint = <unknown>, runId = 54e31fce-f055-4686-b75d-fcd2b076f8d8, opId = 0, ver = 0, numPartitions = 5], 0, state cleanup [ left = null, right = null ] :- Exchange hashpartitioning(a#5, b#6, 5) : +- (1) Project [value#1 AS a#5, (value#1 * 2) AS b#6] : +- StreamingRelation MemoryStream[value#1], [value#1] +- Exchange hashpartitioning(b#11, 5) +- (2) Project [value#3 AS a#10, (value#3 * 2) AS b#11] +- StreamingRelation MemoryStream[value#3], [value#3] ``` The left table is hash partitioned by `a, b`, while the right table is hash partitioned by `b`. This means we may have matching records that are in different partitions, which should be in the output but are not. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #21587 from cloud-fan/join.	2018-06-21 15:38:46 -07:00
Maryann Xue	b9a6f7499a	[SPARK-24613][SQL] Cache with UDF could not be matched with subsequent dependent caches ## What changes were proposed in this pull request? Wrap the logical plan with an `AnalysisBarrier` for execution plan compilation in CacheManager, in order to avoid the plan being analyzed again. ## How was this patch tested? Added one test in `DatasetCacheSuite`. Author: Maryann Xue <maryannxue@apache.org> Closes #21602 from maryannxue/cache-mismatch.	2018-06-21 11:45:30 -07:00
Marcelo Vanzin	c8e909cd49	[SPARK-24589][CORE] Correctly identify tasks in output commit coordinator. When an output stage is retried, it's possible that tasks from the previous attempt are still running. In that case, there would be a new task for the same partition in the new attempt, and the coordinator would allow both tasks to commit their output since it did not keep track of stage attempts. The change adds more information to the stage state tracked by the coordinator, so that only one task is allowed to commit the output in the above case. The stage state in the coordinator is also maintained across stage retries, so that a stray speculative task from a previous stage attempt is not allowed to commit. This also removes some code added in SPARK-18113 that allowed for duplicate commit requests; with the RPC code used in Spark 2, that situation cannot happen, so there is no need to handle it. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #21577 from vanzin/SPARK-24552.	2018-06-21 13:25:15 -05:00
Chongguang LIU	7236e759c9	[SPARK-24574][SQL] array_contains, array_position, array_remove and element_at functions deal with Column type ## What changes were proposed in this pull request? For the function ```def array_contains(column: Column, value: Any): Column ``` , if we pass the `value` parameter as a Column type, it will yield a runtime exception. This PR proposes a pattern matching to detect if `value` is of type Column. If yes, it will use the .expr of the column, otherwise it will work as it used to. Same thing for ```array_position, array_remove and element_at``` functions ## How was this patch tested? Unit test modified to cover this code change. Ping ueshin Author: Chongguang LIU <chong@Chongguangs-MacBook-Pro.local> Closes #21581 from chongguang/SPARK-24574.	2018-06-21 14:58:57 +08:00
Maxim Gekk	54fcaafb09	[SPARK-24571][SQL] Support Char literals ## What changes were proposed in this pull request? In the PR, I propose to automatically convert a `Literal` with `Char` type to a `Literal` of `String` type. Currently, the following code: ```scala val df = Seq("Amsterdam", "San Francisco", "London").toDF("city") df.where($"city".contains('o')).show(false) ``` fails with the exception: ``` Unsupported literal type class java.lang.Character o java.lang.RuntimeException: Unsupported literal type class java.lang.Character o at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78) ``` The PR fixes this issue by converting `char` to `string` of length `1`. I believe it makes sense not to differentiate `char` and `string(1)` in _a unified, multi-language data platform_ like Spark which supports languages like Python/R. Author: Maxim Gekk <maxim.gekk@databricks.com> Author: Maxim Gekk <max.gekk@gmail.com> Closes #21578 from MaxGekk/support-char-literals.	2018-06-20 23:38:37 -07:00
Huaxin Gao	9de11d3f90	[SPARK-23912][SQL] add array_distinct ## What changes were proposed in this pull request? Add array_distinct to remove duplicate values from the array. ## How was this patch tested? Added unit tests Author: Huaxin Gao <huaxing@us.ibm.com> Closes #21050 from huaxingao/spark-23912.	2018-06-21 12:24:53 +09:00
aokolnychyi	c5a0d1132a	[SPARK-24575][SQL] Prohibit window expressions inside WHERE and HAVING clauses ## What changes were proposed in this pull request? As discussed [before](https://github.com/apache/spark/pull/19193#issuecomment-393726964), this PR prohibits window expressions inside WHERE and HAVING clauses. ## How was this patch tested? This PR comes with a dedicated unit test. Author: aokolnychyi <anton.okolnychyi@sap.com> Closes #21580 from aokolnychyi/spark-24575.	2018-06-20 18:57:13 +02:00
Jungtaek Lim	c8ef9232cf	[MINOR][SQL] Remove invalid comment from SparkStrategies ## What changes were proposed in this pull request? This patch removes an invalid comment from SparkStrategies, given that the approach described in the TODO-like comment is no longer the preferred one, as noted in this comment: https://github.com/apache/spark/pull/21388#issuecomment-396856235 Removing the invalid comment will prevent contributors from spending their time on work which is not going to be merged. ## How was this patch tested? N/A Author: Jungtaek Lim <kabhwan@gmail.com> Closes #21595 from HeartSaVioR/MINOR-remove-invalid-comment-on-spark-strategies.	2018-06-20 18:38:42 +02:00
Maryann Xue	bc0498d582	[SPARK-24583][SQL] Wrong schema type in InsertIntoDataSourceCommand ## What changes were proposed in this pull request? Change insert input schema type: "insertRelationType" -> "insertRelationType.asNullable", in order to avoid nullable being overridden. ## How was this patch tested? Added one test in InsertSuite. Author: Maryann Xue <maryannxue@apache.org> Closes #21585 from maryannxue/spark-24583.	2018-06-19 15:27:20 -07:00
Tathagata Das	2cb976355c	[SPARK-24565][SS] Add API in Structured Streaming for exposing output rows of each microbatch as a DataFrame ## What changes were proposed in this pull request? Currently, the micro-batches in the MicroBatchExecution are not exposed to the user through any public API. This was because we did not want to expose the micro-batches, so that all the APIs we expose can eventually be supported in the Continuous engine. But now that we have a better sense of building a ContinuousExecution, I am considering adding APIs which will run only in the MicroBatchExecution. I have quite a few use cases where exposing the microbatch output as a dataframe is useful. - Pass the output rows of each batch to a library that is designed only for batch jobs (for example, many ML libraries need to collect() while learning). - Reuse batch data sources for output whose streaming version does not exist (e.g. redshift data source). - Write the output rows to multiple places by writing twice for each batch. This is not the most elegant thing to do for multiple-output streaming queries but is likely to be better than running two streaming queries processing the same data twice. The proposal is to add a method `foreachBatch(f: Dataset[T] => Unit)` to Scala/Java/Python `DataStreamWriter`. ## How was this patch tested? New unit tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #21571 from tdas/foreachBatch.	2018-06-19 13:56:51 -07:00
yucai	9dbe53eb6b	[SPARK-24556][SQL] Always rewrite output partitioning in ReusedExchangeExec and InMemoryTableScanExec ## What changes were proposed in this pull request? Currently, ReusedExchange and InMemoryTableScanExec only rewrite output partitioning if child's partitioning is HashPartitioning and do nothing for other partitioning, e.g., RangePartitioning. We should always rewrite it, otherwise, unnecessary shuffle could be introduced like https://issues.apache.org/jira/browse/SPARK-24556. ## How was this patch tested? Add new tests. Author: yucai <yyu1@ebay.com> Closes #21564 from yucai/SPARK-24556.	2018-06-19 10:52:51 -07:00
Li Jin	a78a904641	[SPARK-24521][SQL][TEST] Fix ineffective test in CachedTableSuite ## What changes were proposed in this pull request? test("withColumn doesn't invalidate cached dataframe") in CachedTableSuite does not work because the UDF is executed and the test count incremented when "df.cache()" is called, so the subsequent "df.collect()" has no effect on the test result. This PR fixes this test and adds another test for caching UDF. ## How was this patch tested? Add new tests. Author: Li Jin <ice.xelloss@gmail.com> Closes #21531 from icexelloss/fix-cache-test.	2018-06-19 10:42:08 -07:00
Xiao Li	9a75c18290	[SPARK-24542][SQL] UDF series UDFXPathXXXX allow users to pass carefully crafted XML to access arbitrary files ## What changes were proposed in this pull request? UDF series UDFXPathXXXX allow users to pass carefully crafted XML to access arbitrary files. Spark does not have built-in access control. When users use the external access control library, users might bypass them and access the file contents. This PR basically patches the Hive fix to Apache Spark. https://issues.apache.org/jira/browse/HIVE-18879 ## How was this patch tested? A unit test case Author: Xiao Li <gatorsmile@gmail.com> Closes #21549 from gatorsmile/xpathSecurity.	2018-06-18 20:17:04 -07:00
Wenchen Fan	1737d45e08	[SPARK-24478][SQL][FOLLOWUP] Move projection and filter push down to physical conversion ## What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/21503, to completely move operator pushdown to the planner rule. The code is mostly from https://github.com/apache/spark/pull/21319 ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #21574 from cloud-fan/followup.	2018-06-18 20:15:01 -07:00
Liang-Chi Hsieh	8f225e055c	[SPARK-24548][SQL] Fix incorrect schema of Dataset with tuple encoders ## What changes were proposed in this pull request? When creating tuple expression encoders, we should give the serializer expressions of tuple items correct names, so we can have correct output schema when we use such tuple encoders. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #21576 from viirya/SPARK-24548.	2018-06-18 11:01:17 -07:00
Takeshi Yamamuro	e219e692ef	[SPARK-23772][SQL] Provide an option to ignore column of all null values or empty array during JSON schema inference ## What changes were proposed in this pull request? This pr added a new JSON option `dropFieldIfAllNull ` to ignore column of all null values or empty array/struct during JSON schema inference. ## How was this patch tested? Added tests in `JsonSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Author: Xiangrui Meng <meng@databricks.com> Closes #20929 from maropu/SPARK-23772.	2018-06-19 00:24:54 +08:00
James Yu	c7c0b086a0	add one supported type missing from the javadoc ## What changes were proposed in this pull request? The supported java.math.BigInteger type is not mentioned in the javadoc of Encoders.bean() ## How was this patch tested? only Javadoc fix Please review http://spark.apache.org/contributing.html before opening a pull request. Author: James Yu <james@ispot.tv> Closes #21544 from yuj/master.	2018-06-15 21:04:04 -07:00
Mukul Murthy	e4fee395ec	[SPARK-24525][SS] Provide an option to limit number of rows in a MemorySink ## What changes were proposed in this pull request? Provide an option to limit number of rows in a MemorySink. Currently, MemorySink and MemorySinkV2 have unbounded size, meaning that if they're used on big data, they can OOM the stream. This change adds a maxMemorySinkRows option to limit how many rows MemorySink and MemorySinkV2 can hold. By default, they are still unbounded. ## How was this patch tested? Added new unit tests. Author: Mukul Murthy <mukul.murthy@databricks.com> Closes #21559 from mukulmurthy/SPARK-24525.	2018-06-15 13:56:48 -07:00
Kazuaki Ishizaki	90da7dc241	[SPARK-24452][SQL][CORE] Avoid possible overflow in int add or multiply ## What changes were proposed in this pull request? This PR fixes possible overflow in int add or multiply. In particular, the overflows in multiply are detected by [Spotbugs](https://spotbugs.github.io/). The following assignments may overflow on the right-hand side. As a result, the result may be negative. ``` long = int * int long = int + int ``` To avoid this problem, this PR casts from int to long on the right-hand side. ## How was this patch tested? Existing UTs. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #21481 from kiszk/SPARK-24452.	2018-06-15 13:47:48 -07:00
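The overflow pattern described in SPARK-24452 can be reproduced in a few lines of plain Java. The class and method names below are hypothetical and chosen only for illustration; the sketch shows that `int * int` and `int + int` are evaluated in 32-bit arithmetic before being widened to `long`, and that casting one operand first avoids the wrap-around.

```java
// Hypothetical helper names; a sketch of the pattern, not code from the Spark patch.
final class OverflowSketch {
    // The multiplication happens in 32 bits; the already-wrapped result is then widened.
    static long narrowProduct(int a, int b) {
        return a * b;
    }

    // Casting one operand to long first makes the whole multiplication 64-bit.
    static long widenedProduct(int a, int b) {
        return (long) a * b;
    }

    // Same issue for addition: the sum wraps before the implicit widening.
    static long narrowSum(int a, int b) {
        return a + b;
    }

    static long widenedSum(int a, int b) {
        return (long) a + b;
    }

    public static void main(String[] args) {
        System.out.println(narrowProduct(65536, 65536));      // 0
        System.out.println(widenedProduct(65536, 65536));     // 4294967296
        System.out.println(narrowSum(Integer.MAX_VALUE, 1));  // -2147483648
        System.out.println(widenedSum(Integer.MAX_VALUE, 1)); // 2147483648
    }
}
```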
Tathagata Das	b5ccf0d395	[SPARK-24396][SS][PYSPARK] Add Structured Streaming ForeachWriter for python ## What changes were proposed in this pull request? This PR adds `foreach` for streaming queries in Python. Users will be able to specify their processing logic in two different ways. - As a function that takes a row as input. - As an object that has `open`, `process`, and `close` methods. See the python docs in this PR for more details. ## How was this patch tested? Added java and python unit tests Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #21477 from tdas/SPARK-24396.	2018-06-15 12:56:39 -07:00
Ryan Blue	22daeba59b	[SPARK-24478][SQL] Move projection and filter push down to physical conversion ## What changes were proposed in this pull request? This removes the v2 optimizer rule for push-down and instead pushes filters and required columns when converting to a physical plan, as suggested by marmbrus. This makes the v2 relation cleaner because the output and filters do not change in the logical plan. A side-effect of this change is that the stats from the logical (optimized) plan no longer reflect pushed filters and projection. This is a temporary state, until the planner gathers stats from the physical plan instead. An alternative to this approach is `9d3a11e68b`. The first commit was proposed in #21262. This PR replaces #21262. ## How was this patch tested? Existing tests. Author: Ryan Blue <blue@apache.org> Closes #21503 from rdblue/SPARK-24478-move-push-down-to-physical-conversion.	2018-06-14 20:59:42 -07:00
Maxim Gekk	b8f27ae3b3	[SPARK-24543][SQL] Support any type as DDL string for from_json's schema ## What changes were proposed in this pull request? In the PR, I propose to support any DataType represented as DDL string for the from_json function. After the changes, it will be possible to specify `MapType` in SQL like: ```sql select from_json('{"a":1, "b":2}', 'map<string, int>') ``` and in Scala (similar in other languages) ```scala val in = Seq("""{"a": {"b": 1}}""").toDS() val schema = "map<string, map<string, int>>" val out = in.select(from_json($"value", schema, Map.empty[String, String])) ``` ## How was this patch tested? Added a couple of SQL tests and modified existing tests for Python and Scala. The existing tests were modified because the format in which the schema for `from_json` is provided is not important for them. Author: Maxim Gekk <maxim.gekk@databricks.com> Closes #21550 from MaxGekk/from_json-ddl-schema.	2018-06-14 13:27:27 -07:00
Marco Gaido	fdadc4be08	[SPARK-24495][SQL] EnsureRequirement returns wrong plan when reordering equal keys ## What changes were proposed in this pull request? `EnsureRequirement` in its `reorder` method currently assumes that the same key appears only once in the join condition. This of course might not be the case, and when it is not satisfied, it returns a wrong plan which produces a wrong result of the query. ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #21529 from mgaido91/SPARK-24495.	2018-06-14 09:20:41 -07:00
Marco Gaido	3bf76918fb	[SPARK-24531][TESTS] Replace 2.3.0 version with 2.3.1 ## What changes were proposed in this pull request? The PR updates the 2.3 version tested to the new release 2.3.1. ## How was this patch tested? existing UTs Author: Marco Gaido <marcogaido91@gmail.com> Closes #21543 from mgaido91/patch-1.	2018-06-13 15:18:19 -07:00
Jose Torres	1b46f41c55	[SPARK-24235][SS] Implement continuous shuffle writer for single reader partition. ## What changes were proposed in this pull request? https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit Implement continuous shuffle write RDD for a single reader partition. (I don't believe any implementation changes are actually required for multiple reader partitions, but this PR is already very large, so I want to exclude those for now to keep the size down.) ## How was this patch tested? new unit tests Author: Jose Torres <torres.joseph.f+github@gmail.com> Closes #21428 from jose-torres/writerTask.	2018-06-13 13:13:01 -07:00
Herman van Hovell	299d297e25	[SPARK-24500][SQL] Make sure streams are materialized during Tree transforms. ## What changes were proposed in this pull request? If you construct catalyst trees using `scala.collection.immutable.Stream` you can run into situations where valid transformations do not seem to have any effect. There are two causes for this behavior: - `Stream` is evaluated lazily. Note that the default implementation will generally only evaluate a function for the first element (this makes testing a bit tricky). - `TreeNode` and `QueryPlan` use side effects to detect if a tree has changed. Mapping over a stream is lazy and does not need to trigger this side effect. If this happens the node will invalidly assume that it did not change and return itself instead of the newly created node (this is for GC reasons). This PR fixes this issue by forcing materialization on streams in `TreeNode` and `QueryPlan`. ## How was this patch tested? Unit tests were added to `TreeNodeSuite` and `LogicalPlanSuite`. An integration test was added to the `PlannerSuite` Author: Herman van Hovell <hvanhovell@databricks.com> Closes #21539 from hvanhovell/SPARK-24500.	2018-06-13 07:09:48 -07:00
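The interaction between lazy mapping and side-effect-based change detection can be sketched with `java.util.stream.Stream` as a rough analogy. This is an assumption-laden illustration, not the catalyst code: Java streams evaluate nothing until a terminal operation runs, whereas Scala's `Stream` evaluates the head eagerly, and `LazyMapSketch` with its `changed` flag is a hypothetical stand-in for the detection done in `TreeNode` and `QueryPlan`.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: a side-effect flag set inside a lazy map is not visible until the stream is forced.
final class LazyMapSketch {
    // Returns { flag before forcing, flag after forcing }.
    static boolean[] observe() {
        AtomicBoolean changed = new AtomicBoolean(false);
        Stream<Integer> mapped = Stream.of(1, 2, 3).map(x -> {
            changed.set(true); // side effect used as the "tree changed" signal
            return x + 1;
        });
        boolean before = changed.get(); // the mapping function has not run yet
        List<Integer> forced = mapped.collect(Collectors.toList()); // materialize the stream
        boolean after = changed.get();
        return new boolean[] { before, after };
    }

    public static void main(String[] args) {
        boolean[] flags = observe();
        System.out.println(flags[0] + " " + flags[1]); // false true
    }
}
```

Checking the flag before materialization is what would lead a caller to conclude, wrongly, that nothing changed; forcing the collection first removes that window.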
Arun Mahadevan	7703b46d28	[SPARK-24479][SS] Added config for registering streamingQueryListeners ## What changes were proposed in this pull request? Currently a "StreamingQueryListener" can only be registered programmatically. We could have a new config "spark.sql.streamingQueryListeners" similar to "spark.sql.queryExecutionListeners" and "spark.extraListeners" for users to register custom streaming listeners. ## How was this patch tested? New unit test and running example programs. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Arun Mahadevan <arunm@apache.org> Closes #21504 from arunmahadevan/SPARK-24480.	2018-06-13 20:43:16 +08:00
Jungtaek Lim	4c388bccf1	[SPARK-24485][SS] Measure and log elapsed time for filesystem operations in HDFSBackedStateStoreProvider ## What changes were proposed in this pull request? This patch measures and logs the elapsed time for each operation which communicates with the file system (mostly remote HDFS in production) in HDFSBackedStateStoreProvider to help investigate any latency issue. ## How was this patch tested? Manually tested. Author: Jungtaek Lim <kabhwan@gmail.com> Closes #21506 from HeartSaVioR/SPARK-24485.	2018-06-13 12:36:20 +08:00
Jungtaek Lim	3352d6fe9a	[SPARK-24466][SS] Fix TextSocketMicroBatchReader to be compatible with netcat again ## What changes were proposed in this pull request? TextSocketMicroBatchReader was no longer compatible with netcat because it launched a temporary reader for reading the schema, closed it, and then re-opened a reader. While a reliable socket server should be able to handle this without any issue, the nc command normally can't handle multiple connections and simply exits when the temporary reader is closed. This patch fixes TextSocketMicroBatchReader to be compatible with netcat again, by deferring opening the socket to the first call of planInputPartitions() instead of the constructor. ## How was this patch tested? Added unit test which fails on current and succeeds with the patch. And also manually tested. Author: Jungtaek Lim <kabhwan@gmail.com> Closes #21497 from HeartSaVioR/SPARK-24466.	2018-06-13 12:34:46 +08:00
Li Jin	9786ce66c5	[SPARK-22239][SQL][PYTHON] Enable grouped aggregate pandas UDFs as window functions with unbounded window frames ## What changes were proposed in this pull request? This PR enables using grouped aggregate pandas UDFs as window functions. The semantics are the same as using SQL aggregation functions as window functions. ``` >>> from pyspark.sql.functions import pandas_udf, PandasUDFType >>> from pyspark.sql import Window >>> df = spark.createDataFrame( ... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ... ("id", "v")) >>> @pandas_udf("double", PandasUDFType.GROUPED_AGG) ... def mean_udf(v): ... return v.mean() >>> w = Window.partitionBy('id') >>> df.withColumn('mean_v', mean_udf(df['v']).over(w)).show() +---+----+------+ \| id\| v\|mean_v\| +---+----+------+ \| 1\| 1.0\| 1.5\| \| 1\| 2.0\| 1.5\| \| 2\| 3.0\| 6.0\| \| 2\| 5.0\| 6.0\| \| 2\|10.0\| 6.0\| +---+----+------+ ``` The scope of this PR is somewhat limited: (1) It only supports unbounded windows, which act essentially as group by. (2) It only supports aggregation functions, not "transform"-like window functions (n -> n mapping). Both of these are left as future work. In particular, (1) needs careful thinking w.r.t. how to pass rolling window data to Python efficiently. (2) is a bit easier but does require more changes, therefore I think it's better to leave it as a separate PR. ## How was this patch tested? WindowPandasUDFTests Author: Li Jin <ice.xelloss@gmail.com> Closes #21082 from icexelloss/SPARK-22239-window-udf.	2018-06-13 09:10:52 +08:00
Kazuaki Ishizaki	ada28f2595	[SPARK-23933][SQL] Add map_from_arrays function ## What changes were proposed in this pull request? The PR adds the SQL function `map_from_arrays`. The behavior of the function is based on Presto's `map`. Since SparkSQL already had a `map` function, we chose a different name for this behavior. This function returns a map built from a pair of arrays holding the keys and the values. ## How was this patch tested? Added UTs Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #21258 from kiszk/SPARK-23933.	2018-06-12 12:31:22 -07:00
Fangshi Li	cc88d7fad1	[SPARK-24216][SQL] Spark TypedAggregateExpression uses getSimpleName that is not safe in scala ## What changes were proposed in this pull request? When a user creates an aggregator object in Scala and passes it to the Spark Dataset's agg() method, Spark initializes TypedAggregateExpression with the nodeName field set to aggregator.getClass.getSimpleName. However, getSimpleName is not safe in a Scala environment, depending on how the user creates the aggregator object. For example, if the aggregator class's fully qualified name is "com.my.company.MyUtils$myAgg$2$", getSimpleName throws java.lang.InternalError "Malformed class name". This has been reported in scalatest https://github.com/scalatest/scalatest/pull/1044 and discussed in many Scala upstream JIRAs such as SI-8110 and SI-5425. To fix this issue, we follow the solution in https://github.com/scalatest/scalatest/pull/1044 and add a safer version of getSimpleName as a util method; TypedAggregateExpression invokes this util method rather than getClass.getSimpleName. ## How was this patch tested? Added unit test. Author: Fangshi Li <fli@linkedin.com> Closes #21276 from fangshil/SPARK-24216.	2018-06-12 12:10:08 -07:00
DylanGuedes	f0ef1b311d	[SPARK-23931][SQL] Adds arrays_zip function to sparksql Signed-off-by: DylanGuedes <djmgguedes@gmail.com> ## What changes were proposed in this pull request? Addition of the arrays_zip function to Spark SQL functions. ## How was this patch tested? Unit tests that check that the results are correct. Author: DylanGuedes <djmgguedes@gmail.com> Closes #21045 from DylanGuedes/SPARK-23931.	2018-06-12 11:57:25 -07:00
Marco Gaido	2824f1436b	[SPARK-24531][TESTS] Remove version 2.2.0 from testing versions in HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? Removing version 2.2.0 from testing versions in HiveExternalCatalogVersionsSuite as it is not present anymore in the mirrors and this is blocking all the open PRs. ## How was this patch tested? running UTs Author: Marco Gaido <marcogaido91@gmail.com> Closes #21540 from mgaido91/SPARK-24531.	2018-06-12 09:56:35 -07:00
Tom Saleeba	1d7db65e96	docs: fix typo no => no[t] ## What changes were proposed in this pull request? Fixing a typo. ## How was this patch tested? Visual check of the docs. Author: Tom Saleeba <tom.saleeba@gmail.com> Closes #21496 from tomsaleeba/patch-1.	2018-06-12 09:22:52 -05:00
Wenchen Fan	01452ea9c7	[SPARK-24502][SQL] flaky test: UnsafeRowSerializerSuite ## What changes were proposed in this pull request? `UnsafeRowSerializerSuite` calls `UnsafeProjection.create` which accesses `SQLConf.get`, while the current active SparkSession may already be stopped, and we may hit exception like this ``` sbt.ForkMain$ForkError: java.lang.IllegalStateException: LiveListenerBus is stopped. at org.apache.spark.scheduler.LiveListenerBus.addToQueue(LiveListenerBus.scala:97) at org.apache.spark.scheduler.LiveListenerBus.addToStatusQueue(LiveListenerBus.scala:80) at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:93) at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120) at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:120) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:120) at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:119) at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:286) at org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42) at org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41) at org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95) at org.apache.spark.sql.SparkSession$$anonfun$1$$anonfun$apply$1.apply(SparkSession.scala:95) at scala.Option.map(Option.scala:146) at org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:95) at org.apache.spark.sql.SparkSession$$anonfun$1.apply(SparkSession.scala:94) at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:126) at org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:54) at org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:157) at 
org.apache.spark.sql.catalyst.expressions.UnsafeProjection$.create(Projection.scala:150) at org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$unsafeRowConverter(UnsafeRowSerializerSuite.scala:54) at org.apache.spark.sql.execution.UnsafeRowSerializerSuite.org$apache$spark$sql$execution$UnsafeRowSerializerSuite$$toUnsafeRow(UnsafeRowSerializerSuite.scala:49) at org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:63) at org.apache.spark.sql.execution.UnsafeRowSerializerSuite$$anonfun$2.apply(UnsafeRowSerializerSuite.scala:60) ... ``` ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #21518 from cloud-fan/test.	2018-06-11 22:08:44 -07:00
liutang123	048197749e	[SPARK-22144][SQL] ExchangeCoordinator combine the partitions of an 0 sized pre-shuffle to 0 ## What changes were proposed in this pull request? When the number of pre-shuffle partitions is 0, the number of post-shuffle partitions should be 0 instead of spark.sql.shuffle.partitions. ## How was this patch tested? ExchangeCoordinator converted a pre-shuffle with 0 partitions to a post-shuffle with 0 partitions instead of one with spark.sql.shuffle.partitions partitions. Author: liutang123 <liutang123@yeah.net> Closes #19364 from liutang123/SPARK-22144.	2018-06-11 17:48:07 -07:00
Marco Gaido	f07c5064a3	[SPARK-24468][SQL] Handle negative scale when adjusting precision for decimal operations ## What changes were proposed in this pull request? In SPARK-22036 we introduced the possibility to allow precision loss in arithmetic operations (according to the SQL standard). The implementation was drawn from Hive's one, where Decimals with a negative scale are not allowed in the operations. The PR handles the case when the scale is negative, removing the assertion that it is not. ## How was this patch tested? added UTs Author: Marco Gaido <marcogaido91@gmail.com> Closes #21499 from mgaido91/SPARK-24468.	2018-06-08 18:51:56 -07:00
Thiruvasakan Paramasivan	36a3409134	[SPARK-24412][SQL] Adding docs about automagical type casting in `isin` and `isInCollection` APIs ## What changes were proposed in this pull request? Update documentation for `isInCollection` API to clearly explain the "auto-casting" of elements if their types are different. ## How was this patch tested? No-Op Author: Thiruvasakan Paramasivan <thiru@apple.com> Closes #21519 from trvskn/sql-doc-update.	2018-06-08 17:17:43 -07:00
Bruce Robbins	1462bba4fd	[SPARK-24119][SQL] Add interpreted execution to SortPrefix expression ## What changes were proposed in this pull request? Implemented eval in SortPrefix expression. ## How was this patch tested? - ran existing sbt SQL tests - added unit test - ran existing Python SQL tests - manual tests: disabling codegen -- patching code to disable beyond what spark.sql.codegen.wholeStage=false can do -- and running sbt SQL tests Author: Bruce Robbins <bersprockets@gmail.com> Closes #21231 from bersprockets/sortprefixeval.	2018-06-08 13:27:52 +02:00
Asher Saban	e76b0124fb	[SPARK-23803][SQL] Support bucket pruning ## What changes were proposed in this pull request? support bucket pruning when filtering on a single bucketed column on the following predicates - EqualTo, EqualNullSafe, In, And/Or predicates ## How was this patch tested? refactored unit tests to test the above. based on gatorsmile work in `e3c75c6398` Author: Asher Saban <asaban@palantir.com> Author: asaban <asaban@palantir.com> Closes #20915 from sabanas/filter-prune-buckets.	2018-06-06 07:14:08 -07:00
jinxing	93df3cd035	[SPARK-22384][SQL] Refine partition pruning when attribute is wrapped in Cast ## What changes were proposed in this pull request? The SQL below fetches all partitions from the metastore, which puts a heavy burden on the metastore: ``` CREATE TABLE `partition_test`(`col` int) PARTITIONED BY (`pt` byte) SELECT * FROM partition_test WHERE CAST(pt AS INT)=1 ``` The reason is that the analyzed attribute `pt` is wrapped in `Cast` and `HiveShim` fails to generate a proper partition filter. This PR proposes to take `Cast` into consideration when generating the partition filter. ## How was this patch tested? Test added. This PR proposes to use analyzed expressions in `HiveClientSuite`. Author: jinxing <jinxing6042@126.com> Closes #19602 from jinxing64/SPARK-22384.	2018-06-05 11:32:42 -07:00
Tathagata Das	2c2a86b5d5	[SPARK-24453][SS] Fix error recovering from the failure in a no-data batch ## What changes were proposed in this pull request? The error occurs when we are recovering from a failure in a no-data batch (say X) that has been planned (i.e. written to offset log) but not executed (i.e. not written to commit log). Upon recovery the following sequence of events happens. 1. `MicroBatchExecution.populateStartOffsets` sets `currentBatchId` to X. Since there was no data in the batch, `availableOffsets` is the same as `committedOffsets`, so `isNewDataAvailable` is `false`. 2. When `MicroBatchExecution.constructNextBatch` is called, ideally it should immediately return true because the next batch has already been constructed. However, the check of whether the batch has been constructed was `if (isNewDataAvailable) return true`. Since the planned batch is a no-data batch, it escaped this check and proceeded to plan the same batch X once again. The solution is to have an explicit flag that signifies whether a batch has already been constructed or not. `populateStartOffsets` is going to set the flag appropriately. ## How was this patch tested? new unit test Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #21491 from tdas/SPARK-24453.	2018-06-05 01:08:55 -07:00
Yuanjian Li	dbb4d83829	[SPARK-24215][PYSPARK] Implement _repr_html_ for dataframes in PySpark ## What changes were proposed in this pull request? Implement `_repr_html_` for PySpark while in notebook and add config named "spark.sql.repl.eagerEval.enabled" to control this. The dev list thread for context: http://apache-spark-developers-list.1001551.n3.nabble.com/eager-execution-and-debuggability-td23928.html ## How was this patch tested? New ut in DataFrameSuite and manual test in jupyter. Some screenshot below. After: ![image](https://user-images.githubusercontent.com/4833765/40268422-8db5bef0-5b9f-11e8-80f1-04bc654a4f2c.png) Before: ![image](https://user-images.githubusercontent.com/4833765/40268431-9f92c1b8-5b9f-11e8-9db9-0611f0940b26.png) Author: Yuanjian Li <xyliyuanjian@gmail.com> Closes #21370 from xuanyuanking/SPARK-24215.	2018-06-05 08:23:08 +07:00
aokolnychyi	7297ae04d8	[SPARK-21896][SQL] Fix StackOverflow caused by window functions inside aggregate functions ## What changes were proposed in this pull request? This PR explicitly prohibits window functions inside aggregates. Currently, this will cause StackOverflow during analysis. See PR #19193 for previous discussion. ## How was this patch tested? This PR comes with a dedicated unit test. Author: aokolnychyi <anton.okolnychyi@sap.com> Closes #21473 from aokolnychyi/fix-stackoverflow-window-funcs.	2018-06-04 13:28:16 -07:00
Yuming Wang	0be5aa2746	[SPARK-23903][SQL] Add support for date extract ## What changes were proposed in this pull request? Add support for date `extract` function: ```sql spark-sql> SELECT EXTRACT(YEAR FROM TIMESTAMP '2000-12-16 12:21:13'); 2000 ``` Supported field same as [Hive](https://github.com/apache/hive/blob/rel/release-2.3.3/ql/src/java/org/apache/hadoop/hive/ql/parse/IdentifiersParser.g#L308-L316): `YEAR`, `QUARTER`, `MONTH`, `WEEK`, `DAY`, `DAYOFWEEK`, `HOUR`, `MINUTE`, `SECOND`. ## How was this patch tested? unit tests Author: Yuming Wang <yumwang@ebay.com> Closes #21479 from wangyum/SPARK-23903.	2018-06-04 10:16:13 -07:00
Maxim Gekk	1d9338bb10	[SPARK-23786][SQL] Checking column names of csv headers ## What changes were proposed in this pull request? Currently column names of headers in CSV files are not checked against the provided schema of the CSV data. This can cause errors like those shown in [SPARK-23786](https://issues.apache.org/jira/browse/SPARK-23786) and https://github.com/apache/spark/pull/20894#issuecomment-375957777. I introduced a new CSV option, `enforceSchema`. If it is enabled (by default `true`), Spark forcibly applies the provided or inferred schema to CSV files. In that case, CSV headers are ignored and not checked against the schema. If `enforceSchema` is set to `false`, additional checks can be performed. For example, if the columns in the CSV header and in the schema are ordered differently, the following exception is thrown: ``` java.lang.IllegalArgumentException: CSV file header does not contain the expected fields Header: depth, temperature Schema: temperature, depth CSV file: marina.csv ``` ## How was this patch tested? The changes were tested by existing tests of CSVSuite and by 2 new tests. Author: Maxim Gekk <maxim.gekk@databricks.com> Author: Maxim Gekk <max.gekk@gmail.com> Closes #20894 from MaxGekk/check-column-names.	2018-06-03 22:02:21 -07:00
Wenchen Fan	416cd1fd96	[SPARK-24369][SQL] Correct handling for multiple distinct aggregations having the same argument set ## What changes were proposed in this pull request? bring back https://github.com/apache/spark/pull/21443 This is a different approach: just change the check to count distinct columns with `toSet` ## How was this patch tested? a new test to verify the planner behavior. Author: Wenchen Fan <wenchen@databricks.com> Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #21487 from cloud-fan/back.	2018-06-03 21:57:42 -07:00
Xiao Li	d2c3de7efc	Revert "[SPARK-24369][SQL] Correct handling for multiple distinct aggregations having the same argument set" This reverts commit `1e46f92f95`.	2018-06-01 11:51:10 -07:00
Huang Tengfei	6039b13230	[SPARK-24351][SS] offsetLog/commitLog purge thresholdBatchId should be computed with current committed epoch but not currentBatchId in CP mode ## What changes were proposed in this pull request? Compute the thresholdBatchId to purge metadata based on current committed epoch instead of currentBatchId in CP mode to avoid cleaning all the committed metadata in some case as described in the jira [SPARK-24351](https://issues.apache.org/jira/browse/SPARK-24351). ## How was this patch tested? Add new unit test. Author: Huang Tengfei <tengfei.h@gmail.com> Closes #21400 from ivoson/branch-cp-meta.	2018-06-01 10:47:53 -07:00
Huaxin Gao	98909c398d	[SPARK-23920][SQL] add array_remove to remove all elements that equal element from array ## What changes were proposed in this pull request? add array_remove to remove all elements that equal element from array ## How was this patch tested? add unit tests Author: Huaxin Gao <huaxing@us.ibm.com> Closes #21069 from huaxingao/spark-23920.	2018-05-31 22:04:26 -07:00
Gengliang Wang	cbaa729132	[SPARK-24330][SQL] Refactor ExecuteWriteTask and Use `while` in writing files ## What changes were proposed in this pull request? 1. Refactor ExecuteWriteTask in FileFormatWriter to reduce common logic and improve readability. After the change, callers only need to call `commit()` or `abort` at the end of task. Also there is less code in `SingleDirectoryWriteTask` and `DynamicPartitionWriteTask`. Definitions of related classes are moved to a new file, and `ExecuteWriteTask` is renamed to `FileFormatDataWriter`. 2. As per code style guide: https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex , we avoid using `for` for looping in [FileFormatWriter](https://github.com/apache/spark/pull/21381/files#diff-3b69eb0963b68c65cfe8075f8a42e850L536) , or `foreach` in [WriteToDataSourceV2Exec](https://github.com/apache/spark/pull/21381/files#diff-6fbe10db766049a395bae2e785e9d56eL119). In such critical code path, using `while` is good for performance. ## How was this patch tested? Existing unit test. 
I tried the microbenchmark in https://github.com/apache/spark/pull/21409 \| Workload \| Before changes(Best/Avg Time(ms)) \| After changes(Best/Avg Time(ms)) \| \| --- \| --- \| --- \| \|Output Single Int Column\| 2018 / 2043 \| 2096 / 2236 \| \|Output Single Double Column\| 1978 / 2043 \| 2013 / 2018 \| \|Output Int and String Column\| 6332 / 6706 \| 6162 / 6298 \| \|Output Partitions\| 4458 / 5094 \| 3792 / 4008 \| \|Output Buckets\| 5695 / 6102 \| 5120 / 5154 \| Also a microbenchmark on my laptop for general comparison among while/foreach/for : ``` class Writer { var sum = 0L def write(l: Long): Unit = sum += l } def testWhile(iterator: Iterator[Long]): Long = { val w = new Writer while (iterator.hasNext) { w.write(iterator.next()) } w.sum } def testForeach(iterator: Iterator[Long]): Long = { val w = new Writer iterator.foreach(w.write) w.sum } def testFor(iterator: Iterator[Long]): Long = { val w = new Writer for (x <- iterator) { w.write(x) } w.sum } val data = 0L to 100000000L val start = System.nanoTime (0 to 10).foreach(_ => testWhile(data.iterator)) println("benchmark while: " + (System.nanoTime - start)/1000000) val start2 = System.nanoTime (0 to 10).foreach(_ => testForeach(data.iterator)) println("benchmark foreach: " + (System.nanoTime - start2)/1000000) val start3 = System.nanoTime (0 to 10).foreach(_ => testFor(data.iterator)) println("benchmark for: " + (System.nanoTime - start3)/1000000) ``` Benchmark result: `while`: 15401 ms `foreach`: 43034 ms `for`: 41279 ms Author: Gengliang Wang <gengliang.wang@databricks.com> Closes #21381 from gengliangwang/refactorExecuteWriteTask.	2018-06-01 10:01:15 +08:00
Yuming Wang	cc976f6cb8	[SPARK-23900][SQL] format_number support user specified format as argument ## What changes were proposed in this pull request? `format_number` supports a user-specified format as argument. For example: ```sql spark-sql> SELECT format_number(12332.123456, '##################.###'); 12332.123 ``` ## How was this patch tested? unit test Author: Yuming Wang <yumwang@ebay.com> Closes #21010 from wangyum/SPARK-23900.	2018-05-31 11:38:23 -07:00
Marco Gaido	24ef7fbfa9	[SPARK-24276][SQL] Order of literals in IN should not affect semantic equality ## What changes were proposed in this pull request? When two `In` operators are created with the same list of values, but different order, we are considering them as semantically different. This is wrong, since they have the same semantic meaning. The PR adds a canonicalization rule which orders the literals in the `In` operator so the semantic equality works properly. ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #21331 from mgaido91/SPARK-24276.	2018-05-30 15:31:40 -07:00
Marco Gaido	1b36f14889	[SPARK-23901][SQL] Add masking functions ## What changes were proposed in this pull request? The PR adds the masking function as they are described in Hive's documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions. This means that only `string`s are accepted as parameter for the masking functions. ## How was this patch tested? added UTs Author: Marco Gaido <marcogaido91@gmail.com> Closes #21246 from mgaido91/SPARK-23901.	2018-05-30 11:18:04 -07:00
Takeshi Yamamuro	1e46f92f95	[SPARK-24369][SQL] Correct handling for multiple distinct aggregations having the same argument set ## What changes were proposed in this pull request? This pr fixed an issue when having multiple distinct aggregations having the same argument set, e.g., ``` scala>: paste val df = sql( s"""SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) \| FROM (VALUES (1, 1), (2, 2), (2, 2)) t(x, y) """.stripMargin) java.lang.RuntimeException You hit a query analyzer bug. Please report your query to Spark user mailing list. ``` The root cause is that `RewriteDistinctAggregates` can't detect multiple distinct aggregations if they have the same argument set. This pr modified code so that `RewriteDistinctAggregates` could count the number of aggregate expressions with `isDistinct=true`. ## How was this patch tested? Added tests in `DataFrameAggregateSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #21443 from maropu/SPARK-24369.	2018-05-31 00:23:25 +08:00
Gengliang Wang	f48938800e	[SPARK-24365][SQL] Add Data Source write benchmark ## What changes were proposed in this pull request? Add Data Source write benchmark. So that it would be easier to measure the writer performance. Author: Gengliang Wang <gengliang.wang@databricks.com> Closes #21409 from gengliangwang/parquetWriteBenchmark.	2018-05-30 09:32:33 +08:00
DB Tsai	900bc1f7dc	[SPARK-24371][SQL] Added isInCollection in DataFrame API for Scala and Java. ## What changes were proposed in this pull request? Implemented `isInCollection ` in DataFrame API for both Scala and Java, so users can do ```scala val profileDF = Seq( Some(1), Some(2), Some(3), Some(4), Some(5), Some(6), Some(7), None ).toDF("profileID") val validUsers: Seq[Any] = Seq(6, 7.toShort, 8L, "3") val result = profileDF.withColumn("isValid", $"profileID". isInCollection(validUsers)) result.show(10) """ +---------+-------+ \|profileID\|isValid\| +---------+-------+ \| 1\| false\| \| 2\| false\| \| 3\| true\| \| 4\| false\| \| 5\| false\| \| 6\| true\| \| 7\| true\| \| null\| null\| +---------+-------+ """.stripMargin ``` ## How was this patch tested? Several unit tests are added. Author: DB Tsai <d_tsai@apple.com> Closes #21416 from dbtsai/optimize-set.	2018-05-29 10:22:18 -07:00
Xiao Li	23db600c95	[SPARK-24250][SQL][FOLLOW-UP] support accessing SQLConf inside tasks ## What changes were proposed in this pull request? We should not stop users from calling `getActiveSession` and `getDefaultSession` in executors. To not break the existing behaviors, we should simply return None. ## How was this patch tested? N/A Author: Xiao Li <gatorsmile@gmail.com> Closes #21436 from gatorsmile/followUpSPARK-24250.	2018-05-28 23:23:22 -07:00
Dongjoon Hyun	b31b587cd0	[SPARK-19613][SS][TEST] Random.nextString is not safe for directory namePrefix ## What changes were proposed in this pull request? `Random.nextString` is good for generating random string data, but it's not suitable as a directory name prefix in `Utils.createDirectory(tempDir, Random.nextString(10))`. This PR uses a safer directory namePrefix. ```scala scala> scala.util.Random.nextString(10) res0: String = 馨쭔ᎰႻ穚䃈兩㻞藑並 ``` ```scala StateStoreRDDSuite: - versioning and immutability - recovering from files - usage with iterators - only gets and only puts - preferred locations using StateStoreCoordinator * FAILED * java.io.IOException: Failed to create a temp directory (under /.../spark/sql/core/target/tmp/StateStoreRDDSuite8712796397908632676) after 10 attempts! at org.apache.spark.util.Utils$.createDirectory(Utils.scala:295) at org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite$$anonfun$13$$anonfun$apply$6.apply(StateStoreRDDSuite.scala:152) at org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite$$anonfun$13$$anonfun$apply$6.apply(StateStoreRDDSuite.scala:149) at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42) at org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite$$anonfun$13.apply(StateStoreRDDSuite.scala:149) at org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite$$anonfun$13.apply(StateStoreRDDSuite.scala:149) ... - distributed test * FAILED * java.io.IOException: Failed to create a temp directory (under /.../spark/sql/core/target/tmp/StateStoreRDDSuite8712796397908632676) after 10 attempts! at org.apache.spark.util.Utils$.createDirectory(Utils.scala:295) ``` ## How was this patch tested? Pass the existing tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #21446 from dongjoon-hyun/SPARK-19613.	2018-05-29 10:35:30 +08:00
Marco Gaido	de01a8d50c	[SPARK-24373][SQL] Add AnalysisBarrier to RelationalGroupedDataset's and KeyValueGroupedDataset's child ## What changes were proposed in this pull request? When we create a `RelationalGroupedDataset` or a `KeyValueGroupedDataset` we set its child to the `logicalPlan` of the `DataFrame` we need to aggregate. Since the `logicalPlan` is already analyzed, we should not analyze it again. But this happens when the new plan of the aggregate is analyzed. The current behavior in most of the cases is likely to produce no harm, but in other cases re-analyzing an analyzed plan can change it, since the analysis is not idempotent. This can cause issues like the one described in the JIRA (missing to find a cached plan). The PR adds an `AnalysisBarrier` to the `logicalPlan` which is used as child of `RelationalGroupedDataset` or a `KeyValueGroupedDataset`. ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #21432 from mgaido91/SPARK-24373.	2018-05-28 12:09:44 +08:00
Li Jin	672209f290	[SPARK-24334] Fix race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory allocator ## What changes were proposed in this pull request? There is a race condition when closing the Arrow VectorSchemaRoot and Allocator in the writer thread of ArrowPythonRunner. The race results in a memory leak exception when closing the allocator. This patch removes the closing routine from the TaskCompletionListener and makes the writer thread responsible for cleaning up the Arrow memory. This issue can be reproduced by this test: ``` def test_memory_leak(self): from pyspark.sql.functions import pandas_udf, col, PandasUDFType, array, lit, explode # Have all data in a single executor thread so it can trigger the race condition easier with self.sql_conf({'spark.sql.shuffle.partitions': 1}): df = self.spark.range(0, 1000) df = df.withColumn('id', array([lit(i) for i in range(0, 300)])) \ .withColumn('id', explode(col('id'))) \ .withColumn('v', array([lit(i) for i in range(0, 1000)])) @pandas_udf(df.schema, PandasUDFType.GROUPED_MAP) def foo(pdf): xxx return pdf result = df.groupby('id').apply(foo) with QuietTest(self.sc): with self.assertRaises(py4j.protocol.Py4JJavaError) as context: result.count() self.assertTrue('Memory leaked' not in str(context.exception)) ``` Note: Because of the race condition, the test case cannot reproduce the issue reliably so it's not added to test cases. ## How was this patch tested? Because of the race condition, the bug cannot easily be unit tested. So far it has only happened on large amounts of data. This is currently tested manually. Author: Li Jin <ice.xelloss@gmail.com> Closes #21397 from icexelloss/SPARK-24334-arrow-memory-leak.	2018-05-28 10:50:17 +08:00
Miles Yucht	d440699192	[SPARK-24381][TESTING] Add unit tests for NOT IN subquery around null values ## What changes were proposed in this pull request? This PR adds several unit tests along the `cols NOT IN (subquery)` pathway. There are a scattering of tests here and there which cover this codepath, but there doesn't seem to be a unified unit test of the correctness of null-aware anti joins anywhere. I have also added a brief explanation of how this expression behaves in SubquerySuite. Lastly, I made some clarifying changes in the NOT IN pathway in RewritePredicateSubquery. ## How was this patch tested? Added unit tests! There should be no behavioral change in this PR. Author: Miles Yucht <miles@databricks.com> Closes #21425 from mgyucht/spark-24381.	2018-05-26 20:42:23 -07:00
Maxim Gekk	1b1528a504	[SPARK-24366][SQL] Improving of error messages for type converting ## What changes were proposed in this pull request? Currently, users are getting the following error messages on type conversions: ``` scala.MatchError: test (of class java.lang.String) ``` The message doesn't give any clues to the users where in the schema the error happened. In this PR, I would like to improve the error message like: ``` The value (test) of the type (java.lang.String) cannot be converted to struct<f1:int> ``` ## How was this patch tested? Added tests for converting of wrong values to `struct`, `map`, `array`, `string` and `decimal`. Author: Maxim Gekk <maxim.gekk@databricks.com> Closes #21410 from MaxGekk/type-conv-error.	2018-05-25 15:42:46 -07:00
Maxim Gekk	64fad0b519	[SPARK-24244][SPARK-24368][SQL] Passing only required columns to the CSV parser ## What changes were proposed in this pull request? uniVocity parser allows to specify only required column names or indexes for [parsing](https://www.univocity.com/pages/parsers-tutorial) like: ``` // Here we select only the columns by their indexes. // The parser just skips the values in other columns parserSettings.selectIndexes(4, 0, 1); CsvParser parser = new CsvParser(parserSettings); ``` In this PR, I propose to extract indexes from required schema and pass them into the CSV parser. Benchmarks show the following improvements in parsing of 1000 columns: ``` Select 100 columns out of 1000: x1.76 Select 1 column out of 1000: x2 ``` Note: Comparing to current implementation, the changes can return different result for malformed rows in the `DROPMALFORMED` and `FAILFAST` modes if only subset of all columns is requested. To have previous behavior, set `spark.sql.csv.parser.columnPruning.enabled` to `false`. ## How was this patch tested? It was tested by new test which selects 3 columns out of 15, by existing tests and by new benchmarks. Author: Maxim Gekk <maxim.gekk@databricks.com> Author: Maxim Gekk <max.gekk@gmail.com> Closes #21415 from MaxGekk/csv-column-pruning2.	2018-05-24 21:38:04 -07:00
Gengliang Wang	3b20b34ab7	[SPARK-24367][SQL] Parquet: use JOB_SUMMARY_LEVEL instead of deprecated flag ENABLE_JOB_SUMMARY ## What changes were proposed in this pull request? In the current Parquet version, the conf ENABLE_JOB_SUMMARY is deprecated. When writing to Parquet files, the warning message ```WARN org.apache.parquet.hadoop.ParquetOutputFormat: Setting parquet.enable.summary-metadata is deprecated, please use parquet.summary.metadata.level``` keeps showing up. From https://github.com/apache/parquet-mr/blame/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L164 we can see that we should use JOB_SUMMARY_LEVEL. ## How was this patch tested? Unit test Author: Gengliang Wang <gengliang.wang@databricks.com> Closes #21411 from gengliangwang/summaryLevel.	2018-05-25 11:16:35 +08:00
Jose Torres	0fd68cb727	[SPARK-24234][SS] Support multiple row writers in continuous processing shuffle reader. ## What changes were proposed in this pull request? https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit#heading=h.8t3ci57f7uii Support multiple different row writers in continuous processing shuffle reader. Note that having multiple read-side buffers ended up being the natural way to do this. Otherwise it's hard to express the constraint of sending an epoch marker only when all writers have sent one. ## How was this patch tested? new unit tests Author: Jose Torres <torres.joseph.f+github@gmail.com> Closes #21385 from jose-torres/multipleWrite.	2018-05-24 17:08:52 -07:00
Yuming Wang	0d89943449	[SPARK-24378][SQL] Fix date_trunc function incorrect examples ## What changes were proposed in this pull request? Fix `date_trunc` function incorrect examples. ## How was this patch tested? N/A Author: Yuming Wang <yumwang@ebay.com> Closes #21423 from wangyum/SPARK-24378.	2018-05-24 23:38:50 +08:00
Maxim Gekk	13bedc05c2	[SPARK-24329][SQL] Test for skipping multi-space lines ## What changes were proposed in this pull request? The PR is a continue of https://github.com/apache/spark/pull/21380 . It checks cases that are handled by the code: `e3de6ab30d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala (L303-L304)` Basically the code skips lines with one or many whitespaces, and lines with comments (see [filterCommentAndEmpty](`e3de6ab30d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala (L47)`)) ```scala iter.filter { line => line.trim.nonEmpty && !line.startsWith(options.comment.toString) } ``` Closes #21380 ## How was this patch tested? Added a test for the case described above. Author: Maxim Gekk <maxim.gekk@databricks.com> Author: Maxim Gekk <max.gekk@gmail.com> Closes #21394 from MaxGekk/test-for-multi-space-lines.	2018-05-24 22:18:58 +08:00
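The filter quoted in this commit message can be tried outside Spark. The sketch below is a self-contained variant, assuming `#` as the comment character in place of `options.comment`; the object name is hypothetical.

```scala
object CommentFilterSketch {
  // Keeps lines that contain something other than whitespace and that do not
  // start with the comment character (assumed to be '#' in this sketch).
  def filterCommentAndEmpty(lines: Iterator[String], comment: Char = '#'): Iterator[String] =
    lines.filter { line =>
      line.trim.nonEmpty && !line.startsWith(comment.toString)
    }
}
```

Lines consisting of one or many spaces are dropped by the `trim.nonEmpty` check, which is the case the added test covers.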
Ryan Blue	3469f5c989	[SPARK-24230][SQL] Fix SpecificParquetRecordReaderBase with dictionary filters. ## What changes were proposed in this pull request? I missed this commit when preparing #21070. When Parquet is able to filter blocks with dictionary filtering, the expected total value count was too high in Spark, leading to an error when there were fewer row groups to process than expected. Spark should get the row groups from Parquet to pick up new filter schemes in Parquet like dictionary filtering. ## How was this patch tested? Using in production at Netflix. Added test case for dictionary-filtered blocks. Author: Ryan Blue <blue@apache.org> Closes #21295 from rdblue/SPARK-24230-fix-parquet-block-tracking.	2018-05-24 20:55:26 +08:00
hyukjinkwon	8a545822d0	[SPARK-24364][SS] Prevent InMemoryFileIndex from failing if file path doesn't exist ## What changes were proposed in this pull request? This PR proposes to follow up https://github.com/apache/spark/pull/15153 and complete SPARK-17599. `FileSystem` operation (`fs.getFileBlockLocations`) can still fail if the file path does not exist. For example see the exception message below: ``` Error occurred while processing: File does not exist: /rel/00171151/input/PJ/part-00136-b6403bac-a240-44f8-a792-fc2e174682b7-c000.csv ... java.io.FileNotFoundException: File does not exist: /rel/00171151/input/PJ/part-00136-b6403bac-a240-44f8-a792-fc2e174682b7-c000.csv ... org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:249) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:229) at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$listLeafFiles$3.apply(InMemoryFileIndex.scala:314) at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$listLeafFiles$3.apply(InMemoryFileIndex.scala:297) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$listLeafFiles(InMemoryFileIndex.scala:297) at 
org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$bulkListLeafFiles$1.apply(InMemoryFileIndex.scala:174) at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$bulkListLeafFiles$1.apply(InMemoryFileIndex.scala:173) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.org$apache$spark$sql$execution$datasources$InMemoryFileIndex$$bulkListLeafFiles(InMemoryFileIndex.scala:173) at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.listLeafFiles(InMemoryFileIndex.scala:126) at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:91) at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(InMemoryFileIndex.scala:67) at org.apache.spark.sql.execution.datasources.DataSource.tempFileIndex$lzycompute$1(DataSource.scala:161) at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$tempFileIndex$1(DataSource.scala:152) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:166) at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:261) at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:94) at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:94) at 
org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:33) at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:196) at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:206) at com.hwx.StreamTest$.main(StreamTest.scala:97) at com.hwx.StreamTest.main(StreamTest.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:906) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /rel/00171151/input/PJ/part-00136-b6403bac-a240-44f8-a792-fc2e174682b7-c000.csv ... ``` So, it fixes it to make a warning instead. ## How was this patch tested? It's hard to write a test. Manually tested multiple times. Author: hyukjinkwon <gurwls223@apache.org> Closes #21408 from HyukjinKwon/missing-files.	2018-05-24 13:21:02 +08:00
Dongjoon Hyun	486ecc680e	[SPARK-24322][BUILD] Upgrade Apache ORC to 1.4.4 ## What changes were proposed in this pull request? ORC 1.4.4 includes [nine fixes](https://issues.apache.org/jira/issues/?filter=12342568&jql=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4). One of the issues is about `Timestamp` bug (ORC-306) which occurs when `native` ORC vectorized reader reads ORC column vector's sub-vector `times` and `nanos`. ORC-306 fixes this according to the [original definition](https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/TimestampColumnVector.java#L45-L46) and this PR includes the updated interpretation on ORC column vectors. Note that `hive` ORC reader and ORC MR reader is not affected. ```scala scala> spark.version res0: String = 2.3.0 scala> spark.sql("set spark.sql.orc.impl=native") scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 12:34:56.000789")).toDF().write.orc("/tmp/orc") scala> spark.read.orc("/tmp/orc").show(false) +--------------------------+ \|value \| +--------------------------+ \|1900-05-05 12:34:55.000789\| +--------------------------+ ``` This PR aims to update Apache Spark to use it. FULL LIST ID \| TITLE -- \| -- ORC-281 \| Fix compiler warnings from clang 5.0 ORC-301 \| `extractFileTail` should open a file in `try` statement ORC-304 \| Fix TestRecordReaderImpl to not fail with new storage-api ORC-306 \| Fix incorrect workaround for bug in java.sql.Timestamp ORC-324 \| Add support for ARM and PPC arch ORC-330 \| Remove unnecessary Hive artifacts from root pom ORC-332 \| Add syntax version to orc_proto.proto ORC-336 \| Remove avro and parquet dependency management entries ORC-360 \| Implement error checking on subtype fields in Java ## How was this patch tested? Pass the Jenkins. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #21372 from dongjoon-hyun/SPARK_ORC144.	2018-05-24 11:34:13 +08:00
sychen	888340151f	[SPARK-24257][SQL] LongToUnsafeRowMap calculate the new size may be wrong LongToUnsafeRowMap has a mistake when growing its page array: it blindly grows to `oldSize * 2`, while the new record may be larger than `oldSize * 2`. Then we may have a malformed UnsafeRow when querying this map, whose actual data is smaller than its declared size, and the data is corrupted. Author: sychen <sychen@ctrip.com> Closes #21311 from cxzl25/fix_LongToUnsafeRowMap_page_size.	2018-05-24 11:18:07 +08:00
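A hedged sketch of the sizing rule this commit message describes, not the code in `LongToUnsafeRowMap`: doubling is kept as the default growth step, but the result is never smaller than what the new record needs. The helper name and its word-based parameters are assumptions.

```scala
object PageGrowthSketch {
  // Hypothetical helper: returns the new page length (in 8-byte words) so that
  // `neededWords` more words fit after the `usedWords` already written.
  def grownLength(currentLength: Int, usedWords: Int, neededWords: Int): Int = {
    val required = usedWords.toLong + neededWords
    require(required <= Int.MaxValue, s"Cannot grow page to $required words")
    if (required <= currentLength) currentLength
    else math.max(currentLength.toLong * 2, required).min(Int.MaxValue).toInt
  }
}
```

With a page of 4 words and 3 in use, a 2-word record grows the page to 8 words, while a 20-word record grows it to 23 words rather than stopping at `oldSize * 2`.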
Vayda, Oleksandr: IT (PRG)	230f144197	[SPARK-24350][SQL] Fixes ClassCastException in the "array_position" function ## What changes were proposed in this pull request? ### Fixes `ClassCastException` in the `array_position` function - [SPARK-24350](https://issues.apache.org/jira/browse/SPARK-24350) When calling `array_position` function with a wrong type of the 1st argument an `AnalysisException` should be thrown instead of `ClassCastException` Example: ```sql select array_position('foo', 'bar') ``` ``` java.lang.ClassCastException: org.apache.spark.sql.types.StringType$ cannot be cast to org.apache.spark.sql.types.ArrayType at org.apache.spark.sql.catalyst.expressions.ArrayPosition.inputTypes(collectionOperations.scala:1398) at org.apache.spark.sql.catalyst.expressions.ExpectsInputTypes$class.checkInputDataTypes(ExpectsInputTypes.scala:44) at org.apache.spark.sql.catalyst.expressions.ArrayPosition.checkInputDataTypes(collectionOperations.scala:1401) at org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:168) at org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:168) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:256) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:252) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) ``` ## How was this patch tested? unit test Author: Vayda, Oleksandr: IT (PRG) <Oleksandr.Vayda@barclayscapital.com> Closes #21401 from wajda/SPARK-24350-array_position-error-fix.	2018-05-23 17:22:52 -07:00
Jose Torres	f457933293	[SPARK-23416][SS] Add a specific stop method for ContinuousExecution. ## What changes were proposed in this pull request? Add a specific stop method for ContinuousExecution. The previous StreamExecution.stop() method had a race condition as applied to continuous processing: if the cancellation was round-tripped to the driver too quickly, the generic SparkException it caused would be reported as the query death cause. We earlier decided that SparkException should not be added to the StreamExecution.isInterruptionException() whitelist, so we need to ensure this never happens instead. ## How was this patch tested? Existing tests. I could consistently reproduce the previous flakiness by putting Thread.sleep(1000) between the first job cancellation and thread interruption in StreamExecution.stop(). Author: Jose Torres <torres.joseph.f+github@gmail.com> Closes #21384 from jose-torres/fixKafka.	2018-05-23 17:21:29 -07:00
jinxing	b7a036b75b	[SPARK-24294] Throw SparkException when OOM in BroadcastExchangeExec ## What changes were proposed in this pull request? When an OutOfMemoryError is thrown from BroadcastExchangeExec, scala.concurrent.Future hits a Scala bug (https://github.com/scala/bug/issues/9554) and hangs until the future times out. We could wrap the OOM inside SparkException to resolve this issue. ## How was this patch tested? Manually tested. Author: jinxing <jinxing6042@126.com> Closes #21342 from jinxing64/SPARK-24294.	2018-05-23 13:12:05 -07:00
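The wrapping approach can be illustrated without Spark. This is a sketch under assumptions: `BroadcastFailedException` is a hypothetical stand-in for `SparkException`, and `runGuarded` is an illustrative name. The point is that a non-fatal wrapper lets the `Future` complete with a failure, whereas a raw `OutOfMemoryError` is treated as fatal and may leave the future uncompleted.

```scala
import scala.concurrent.{ExecutionContext, Future}

object OomWrappingSketch {
  // Hypothetical exception type standing in for the wrapper used by the PR.
  final class BroadcastFailedException(message: String, cause: Throwable)
    extends RuntimeException(message, cause)

  // Runs `body` in a Future and converts OutOfMemoryError into a non-fatal
  // exception, so the Future fails promptly instead of never completing.
  def runGuarded[T](body: => T)(implicit ec: ExecutionContext): Future[T] =
    Future {
      try body
      catch {
        case oom: OutOfMemoryError =>
          throw new BroadcastFailedException("Not enough memory to build the broadcast", oom)
      }
    }
}
```

A caller waiting on the returned future then sees the wrapped exception right away rather than waiting for the timeout.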
Takeshi Yamamuro	84557bc9f8	[SPARK-24206][SQL] Improve DataSource read benchmark code ## What changes were proposed in this pull request? This PR added benchmark code `DataSourceReadBenchmark` for `orc`, `parquet`, `csv`, and `json` based on the existing `ParquetReadBenchmark` and `OrcReadBenchmark`. ## How was this patch tested? N/A Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #21266 from maropu/DataSourceReadBenchmark.	2018-05-23 13:02:32 -07:00
Xiao Li	5a5a868dc4	Revert "[SPARK-24244][SQL] Passing only required columns to the CSV parser" This reverts commit `8086acc2f6`.	2018-05-23 11:51:13 -07:00
Liang-Chi Hsieh	a40ffc656d	[SPARK-23711][SQL] Add fallback generator for UnsafeProjection ## What changes were proposed in this pull request? Add fallback logic for `UnsafeProjection`. In production we can try to create an unsafe projection using the codegen implementation. If any compile error happens, it falls back to the interpreted implementation. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #21106 from viirya/SPARK-23711.	2018-05-23 22:40:52 +08:00
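The fallback logic described here follows a common try-then-fall-back shape. The sketch below shows that shape only; `createWithFallback` and its two factory arguments are hypothetical and do not reproduce Spark's generator classes.

```scala
import scala.util.control.NonFatal

object FallbackSketch {
  // Tries the (hypothetical) code-generated factory first; if it fails with a
  // non-fatal error, builds the interpreted variant instead.
  def createWithFallback[T](codegen: () => T, interpreted: () => T): T =
    try codegen()
    catch {
      case NonFatal(_) => interpreted()
    }
}
```

Restricting the catch to `NonFatal` is a design choice of this sketch: fatal JVM errors still propagate instead of being masked by the fallback.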
Seth Fitzsimmons	00c13cfad7	Correct reference to Offset class This is a documentation-only correction; `org.apache.spark.sql.sources.v2.reader.Offset` is actually `org.apache.spark.sql.sources.v2.reader.streaming.Offset`. Author: Seth Fitzsimmons <seth@mojodna.net> Closes #21387 from mojodna/patch-1.	2018-05-23 09:14:03 +08:00
Vayda, Oleksandr: IT (PRG)	bc6ea614ad	[SPARK-24348][SQL] "element_at" error fix ## What changes were proposed in this pull request? ### Fixes a `scala.MatchError` in the `element_at` operation - [SPARK-24348](https://issues.apache.org/jira/browse/SPARK-24348) When calling `element_at` with a wrong first operand type an `AnalysisException` should be thrown instead of `scala.MatchError` Example: ```sql select element_at('foo', 1) ``` results in: ``` scala.MatchError: StringType (of class org.apache.spark.sql.types.StringType$) at org.apache.spark.sql.catalyst.expressions.ElementAt.inputTypes(collectionOperations.scala:1469) at org.apache.spark.sql.catalyst.expressions.ExpectsInputTypes$class.checkInputDataTypes(ExpectsInputTypes.scala:44) at org.apache.spark.sql.catalyst.expressions.ElementAt.checkInputDataTypes(collectionOperations.scala:1478) at org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:168) at org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:168) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:256) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:252) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) ``` ## How was this patch tested? unit tests Author: Vayda, Oleksandr: IT (PRG) <Oleksandr.Vayda@barclayscapital.com> Closes #21395 from wajda/SPARK-24348-element_at-error-fix.	2018-05-22 13:01:07 -07:00
Liang-Chi Hsieh	f9f055afa4	[SPARK-24121][SQL] Add API for handling expression code generation ## What changes were proposed in this pull request? This patch tries to implement this [proposal](https://github.com/apache/spark/pull/19813#issuecomment-354045400) to add an API for handling expression code generation. It should allow us to manipulate how code is generated for expressions. In detail, this adds a new abstraction `CodeBlock` to `JavaCode`. `CodeBlock` holds the code snippet and the inputs for generating actual Java code. For example, in the following Java code: ```java int ${variable} = 1; boolean ${isNull} = ${CodeGenerator.defaultValue(BooleanType)}; ``` `variable` and `isNull` are two `VariableValue`s and `CodeGenerator.defaultValue(BooleanType)` is a string. They are all inputs to this code block and are held by the `CodeBlock` representing this code. For codegen, we provide a dedicated string interpolator `code`, so you can define code like this: ```scala val codeBlock = code""" |int ${variable} = 1; |boolean ${isNull} = ${CodeGenerator.defaultValue(BooleanType)}; """.stripMargin // Generates actual java code. codeBlock.toString ``` Because those inputs are held separately in `CodeBlock` before generating code, we can safely manipulate them, e.g., replacing statements with aliased variables, etc. ## How was this patch tested? Added tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #21193 from viirya/SPARK-24121.	2018-05-23 01:50:22 +08:00
Maxim Gekk	8086acc2f6	[SPARK-24244][SQL] Passing only required columns to the CSV parser ## What changes were proposed in this pull request? uniVocity parser allows to specify only required column names or indexes for [parsing](https://www.univocity.com/pages/parsers-tutorial) like: ``` // Here we select only the columns by their indexes. // The parser just skips the values in other columns parserSettings.selectIndexes(4, 0, 1); CsvParser parser = new CsvParser(parserSettings); ``` In this PR, I propose to extract indexes from required schema and pass them into the CSV parser. Benchmarks show the following improvements in parsing of 1000 columns: ``` Select 100 columns out of 1000: x1.76 Select 1 column out of 1000: x2 ``` Note: Comparing to current implementation, the changes can return different result for malformed rows in the `DROPMALFORMED` and `FAILFAST` modes if only subset of all columns is requested. To have previous behavior, set `spark.sql.csv.parser.columnPruning.enabled` to `false`. ## How was this patch tested? It was tested by new test which selects 3 columns out of 15, by existing tests and by new benchmarks. Author: Maxim Gekk <maxim.gekk@databricks.com> Closes #21296 from MaxGekk/csv-column-pruning.	2018-05-22 22:07:32 +08:00
Yuming Wang	fc743f7b30	[SPARK-20120][SQL][FOLLOW-UP] Better way to support spark-sql silent mode. ## What changes were proposed in this pull request? `spark-sql` silent mode will be broken if `SPARK_HOME/jars` is missing `kubernetes-model-2.0.0.jar`. This PR uses `sc.setLogLevel(<logLevel>)` to implement silent mode. ## How was this patch tested? manual tests ``` build/sbt -Phive -Phive-thriftserver package export SPARK_PREPEND_CLASSES=true ./bin/spark-sql -S ``` Author: Yuming Wang <yumwang@ebay.com> Closes #20274 from wangyum/SPARK-20120-FOLLOW-UP.	2018-05-22 08:20:59 -05:00
Marco Gaido	d3d1807315	[SPARK-24313][SQL] Fix collection operations' interpreted evaluation for complex types ## What changes were proposed in this pull request? The interpreted evaluation of several collection operations works only for simple data types. For complex data types, `array_contains`, for instance, always returns `false`. The affected functions are `array_contains`, `array_position`, `element_at` and `GetMapValue`. The PR fixes the behavior for all data types. ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #21361 from mgaido91/SPARK-24313.	2018-05-22 21:08:49 +08:00
Kris Mok	952e4d1c83	[SPARK-24321][SQL] Extract common code from Divide/Remainder to a base trait ## What changes were proposed in this pull request? Extract common code from `Divide`/`Remainder` to a new base trait, `DivModLike`. Further refactoring to make `Pmod` work with `DivModLike` is to be done as a separate task. ## How was this patch tested? Existing tests in `ArithmeticExpressionSuite` covers the functionality. Author: Kris Mok <kris.mok@databricks.com> Closes #21367 from rednaxelafx/catalyst-divmod.	2018-05-22 19:12:30 +08:00
Marco Gaido	84d31aa5d4	[SPARK-24209][SHS] Automatic retrieve proxyBase from Knox headers ## What changes were proposed in this pull request? The PR retrieves the proxyBase automatically from the header `X-Forwarded-Context` (if available). This is the header used by Knox to inform the proxied service about the base path. This provides zero-configuration support for the Knox gateway (instead of having to set `spark.ui.proxyBase` properly) and it allows direct access to the SHS when it is proxied by Knox. In the previous scenario, after setting `spark.ui.proxyBase`, direct access to the SHS did not work properly (because bad links were generated). ## How was this patch tested? added UT + manual tests Author: Marco Gaido <marcogaido91@gmail.com> Closes #21268 from mgaido91/SPARK-24209.	2018-05-21 18:11:05 -07:00
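A rough sketch of the lookup described above, with several assumptions: request headers are modelled as a plain `Map`, the header takes precedence over the configured value, and `resolveProxyBase` is a hypothetical name. Real HTTP header names are case-insensitive, which this simplified map lookup ignores.

```scala
object ProxyBaseSketch {
  // Hypothetical helper: prefers the X-Forwarded-Context header (set by Knox),
  // then the configured value, and finally an empty base path.
  def resolveProxyBase(headers: Map[String, String], configured: Option[String]): String =
    headers.get("X-Forwarded-Context")
      .map(_.trim)
      .filter(_.nonEmpty)
      .orElse(configured)
      .getOrElse("")
}
```

When the header is absent, as with direct access to the history server, the sketch falls back to the configured base or to no prefix at all.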
Maxim Gekk	b550b2a1a1	[SPARK-24325] Tests for Hadoop's LinesReader ## What changes were proposed in this pull request? The tests cover basic functionality of [Hadoop LinesReader](`8d79113b81/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala (L42)`). In particular, the added tests check: - A split slices a line or delimiter - A split slices two consecutive lines and covers a delimiter between the lines - Two splits slice a line and there are no duplicates - Internal buffer size (`io.file.buffer.size`) is less than line length - Constraint on maximum line length - `mapreduce.input.linerecordreader.line.maxlength` Author: Maxim Gekk <maxim.gekk@databricks.com> Closes #21377 from MaxGekk/line-reader-tests.	2018-05-21 14:21:05 -07:00
Jose Torres	a33dcf4a0b	[SPARK-24234][SS] Reader for continuous processing shuffle ## What changes were proposed in this pull request? Read RDD for continuous processing shuffle, as well as the initial RPC-based row receiver. https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit#heading=h.8t3ci57f7uii ## How was this patch tested? new unit tests Author: Jose Torres <torres.joseph.f+github@gmail.com> Closes #21337 from jose-torres/readerRddMaster.	2018-05-21 12:58:05 -07:00
Wenchen Fan	03e90f65bf	[SPARK-24250][SQL] support accessing SQLConf inside tasks re-submit https://github.com/apache/spark/pull/21299 which broke build. A few new commits are added to fix the SQLConf problem in `JsonSchemaInference.infer`, and prevent us from accessing `SQLConf` in the DAGScheduler event loop thread. ## What changes were proposed in this pull request? Previously in #20136 we decided to forbid tasks from accessing `SQLConf`, because it doesn't work and always gives you the default conf value. In #21190 we fixed the check and all the places that violate it. Currently the pattern of accessing configs at the executor side is: read the configs at the driver side, then access the variables holding the config values in the RDD closure, so that they will be serialized to the executor side. Something like ``` val someConf = conf.getXXX child.execute().mapPartitions { if (someConf == ...) ... ... } ``` However, this pattern is hard to apply if the config needs to be propagated via a long call stack. An example is `DataType.sameType`, and see how many changes were made in #21190 . When it comes to code generation, it's even worse. I tried it locally and we need to change a ton of files to propagate configs to code generators. This PR proposes to allow tasks to access `SQLConf`. The idea is, we can save all the SQL configs to job properties when an SQL execution is triggered. At executor side we rebuild the `SQLConf` from job properties. ## How was this patch tested? a new test suite Author: Wenchen Fan <wenchen@databricks.com> Closes #21376 from cloud-fan/config.	2018-05-22 00:19:18 +08:00
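The save-then-rebuild idea in this commit message can be sketched with plain `java.util.Properties`. This is an illustration under assumptions, not Spark's implementation: the object and method names are hypothetical, and filtering on a `spark.sql.` key prefix is an assumed convention for telling SQL configs apart from other job properties.

```scala
import java.util.Properties

object ConfPropagationSketch {
  // Driver side: copy every SQL config entry into the job's properties.
  def toJobProperties(sqlConfs: Map[String, String]): Properties = {
    val props = new Properties()
    sqlConfs.foreach { case (k, v) => props.setProperty(k, v) }
    props
  }

  // Executor side: rebuild a config map from the job properties, keeping only
  // keys under the (assumed) "spark.sql." prefix.
  def fromJobProperties(props: Properties): Map[String, String] = {
    import scala.collection.JavaConverters._
    props.stringPropertyNames().asScala
      .filter(_.startsWith("spark.sql."))
      .map(k => k -> props.getProperty(k))
      .toMap
  }
}
```

Compared with capturing each value in the RDD closure, this keeps call sites deep in the stack free of extra config parameters.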
Marek Novotny	a6e883feb3	[SPARK-23935][SQL] Adding map_entries function ## What changes were proposed in this pull request? This PR adds `map_entries` function that returns an unordered array of all entries in the given map. ## How was this patch tested? New tests added into: - `CollectionExpressionSuite` - `DataFrameFunctionsSuite` ## CodeGen examples ### Primitive types ``` val df = Seq(Map(1 -> 5, 2 -> 6)).toDF("m") df.filter('m.isNotNull).select(map_entries('m)).debugCodegen ``` Result: ``` /* 042 */ boolean project_isNull_0 = false; /* 043 */ /* 044 */ ArrayData project_value_0 = null; /* 045 */ /* 046 */ final int project_numElements_0 = inputadapter_value_0.numElements(); /* 047 */ final ArrayData project_keys_0 = inputadapter_value_0.keyArray(); /* 048 */ final ArrayData project_values_0 = inputadapter_value_0.valueArray(); /* 049 */ /* 050 */ final long project_size_0 = UnsafeArrayData.calculateSizeOfUnderlyingByteArray( /* 051 */ project_numElements_0, /* 052 */ 32); /* 053 */ if (project_size_0 > 2147483632) { /* 054 */ final Object[] project_internalRowArray_0 = new Object[project_numElements_0]; /* 055 */ for (int z = 0; z < project_numElements_0; z++) { /* 056 */ project_internalRowArray_0[z] = new org.apache.spark.sql.catalyst.expressions.GenericInternalRow(new Object[]{project_keys_0.getInt(z), project_values_0.getInt(z)}); /* 057 */ } /* 058 */ project_value_0 = new org.apache.spark.sql.catalyst.util.GenericArrayData(project_internalRowArray_0); /* 059 */ /* 060 */ } else { /* 061 */ final byte[] project_arrayBytes_0 = new byte[(int)project_size_0]; /* 062 */ UnsafeArrayData project_unsafeArrayData_0 = new UnsafeArrayData(); /* 063 */ Platform.putLong(project_arrayBytes_0, 16, project_numElements_0); /* 064 */ project_unsafeArrayData_0.pointTo(project_arrayBytes_0, 16, (int)project_size_0); /* 065 */ /* 066 */ final int project_structsOffset_0 = UnsafeArrayData.calculateHeaderPortionInBytes(project_numElements_0) + project_numElements_0 * 8; /* 067 */ UnsafeRow project_unsafeRow_0 = new UnsafeRow(2); /* 068 */ for (int z = 0; z < project_numElements_0; z++) { /* 069 */ long offset = project_structsOffset_0 + z * 24L; /* 070 */ project_unsafeArrayData_0.setLong(z, (offset << 32) + 24L); /* 071 */ project_unsafeRow_0.pointTo(project_arrayBytes_0, 16 + offset, 24); /* 072 */ project_unsafeRow_0.setInt(0, project_keys_0.getInt(z)); /* 073 */ project_unsafeRow_0.setInt(1, project_values_0.getInt(z)); /* 074 */ } /* 075 */ project_value_0 = project_unsafeArrayData_0; /* 076 */ /* 077 */ } ``` ### Non-primitive types ``` val df = Seq(Map("a" -> "foo", "b" -> null)).toDF("m") df.filter('m.isNotNull).select(map_entries('m)).debugCodegen ``` Result: ``` /* 042 */ boolean project_isNull_0 = false; /* 043 */ /* 044 */ ArrayData project_value_0 = null; /* 045 */ /* 046 */ final int project_numElements_0 = inputadapter_value_0.numElements(); /* 047 */ final ArrayData project_keys_0 = inputadapter_value_0.keyArray(); /* 048 */ final ArrayData project_values_0 = inputadapter_value_0.valueArray(); /* 049 */ /* 050 */ final Object[] project_internalRowArray_0 = new Object[project_numElements_0]; /* 051 */ for (int z = 0; z < project_numElements_0; z++) { /* 052 */ project_internalRowArray_0[z] = new org.apache.spark.sql.catalyst.expressions.GenericInternalRow(new Object[]{project_keys_0.getUTF8String(z), project_values_0.getUTF8String(z)}); /* 053 */ } /* 054 */ project_value_0 = new org.apache.spark.sql.catalyst.util.GenericArrayData(project_internalRowArray_0); ``` Author: Marek Novotny <mn.mikke@gmail.com> Closes #21236 from mn-mikke/feature/array-api-map_entries-to-master.	2018-05-21 23:14:03 +09:00
Kazuaki Ishizaki	e480eccd97	[SPARK-24323][SQL] Fix lint-java errors ## What changes were proposed in this pull request? This PR fixes the following errors reported by `lint-java` ``` % dev/lint-java Using `mvn` from path: /usr/bin/mvn Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/sql/sources/v2/reader/InputPartition.java:[39] (sizes) LineLength: Line is longer than 100 characters (found 104). [ERROR] src/main/java/org/apache/spark/sql/sources/v2/reader/InputPartitionReader.java:[26] (sizes) LineLength: Line is longer than 100 characters (found 110). [ERROR] src/main/java/org/apache/spark/sql/sources/v2/reader/InputPartitionReader.java:[30] (sizes) LineLength: Line is longer than 100 characters (found 104). ``` ## How was this patch tested? Run `lint-java` manually. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #21374 from kiszk/SPARK-24323.	2018-05-21 15:42:04 +08:00
Liang-Chi Hsieh	6d7d45a1af	[SPARK-24242][SQL] RangeExec should have correct outputOrdering and outputPartitioning ## What changes were proposed in this pull request? Logical `Range` node has been added with `outputOrdering` recently. It's used to eliminate redundant `Sort` during optimization. However, this `outputOrdering` does not propagate to the physical `RangeExec` node. We also add correct `outputPartitioning` to `RangeExec` node. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #21291 from viirya/SPARK-24242.	2018-05-21 15:39:35 +08:00
Wenchen Fan	000e25ae79	Revert "[SPARK-24250][SQL] support accessing SQLConf inside tasks" This reverts commit `dd37529a8d`.	2018-05-20 16:13:42 +08:00
Wenchen Fan	dd37529a8d	[SPARK-24250][SQL] support accessing SQLConf inside tasks ## What changes were proposed in this pull request? Previously in #20136 we decided to forbid tasks from accessing `SQLConf`, because it doesn't work and always gives you the default conf value. In #21190 we fixed the check and all the places that violate it. Currently the pattern of accessing configs at the executor side is: read the configs at the driver side, then access the variables holding the config values in the RDD closure, so that they will be serialized to the executor side. Something like ``` val someConf = conf.getXXX child.execute().mapPartitions { if (someConf == ...) ... ... } ``` However, this pattern is hard to apply if the config needs to be propagated via a long call stack. An example is `DataType.sameType`, and see how many changes were made in #21190 . When it comes to code generation, it's even worse. I tried it locally and we need to change a ton of files to propagate configs to code generators. This PR proposes to allow tasks to access `SQLConf`. The idea is, we can save all the SQL configs to job properties when an SQL execution is triggered. At executor side we rebuild the `SQLConf` from job properties. ## How was this patch tested? a new test suite Author: Wenchen Fan <wenchen@databricks.com> Closes #21299 from cloud-fan/config.	2018-05-19 18:51:02 +08:00
Efim Poberezkin	434d74e337	[SPARK-23503][SS] Enforce sequencing of committed epochs for Continuous Execution ## What changes were proposed in this pull request? Made changes to EpochCoordinator so that it enforces a commit order. In case a message for epoch n is lost and epoch (n + 1) is ready for commit before epoch n is, epoch (n + 1) will wait for epoch n to be committed first. ## How was this patch tested? Existing tests in ContinuousSuite and EpochCoordinatorSuite. Author: Efim Poberezkin <efim@poberezkin.ru> Closes #20936 from efimpoberezkin/pr/sequence-commited-epochs.	2018-05-18 16:54:39 -07:00
Arun Mahadevan	710e4e81a8	[SPARK-24308][SQL] Handle DataReaderFactory to InputPartition rename in left over classes ## What changes were proposed in this pull request? SPARK-24073 renames DataReaderFactory -> InputPartition and DataReader -> InputPartitionReader. Some classes still reflect the old names and cause confusion. This patch renames the leftover classes to reflect the new interface and fixes a few comments. ## How was this patch tested? Existing unit tests. Author: Arun Mahadevan <arunm@apache.org> Closes #21355 from arunmahadevan/SPARK-24308.	2018-05-18 14:37:01 -07:00
Takeshi Yamamuro	a53ea70c1d	[SPARK-23856][SQL] Add an option `queryTimeout` in JDBCOptions ## What changes were proposed in this pull request? This PR adds an option `queryTimeout` for the number of seconds the driver will wait for a Statement object to execute. ## How was this patch tested? Added tests in `JDBCSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #21173 from maropu/SPARK-23856.	2018-05-18 13:38:36 -07:00
Dongjoon Hyun	7f82c4a47e	[SPARK-24312][SQL] Upgrade to 2.3.3 for Hive Metastore Client 2.3 ## What changes were proposed in this pull request? Hive 2.3.3 was [released on April 3rd](https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12342162&styleName=Text&projectId=12310843). This PR aims to upgrade Hive Metastore Client 2.3 from 2.3.2 to 2.3.3. ## How was this patch tested? Pass the Jenkins with the existing tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #21359 from dongjoon-hyun/SPARK-24312.	2018-05-18 12:54:19 -07:00
gatorsmile	1c4553d67d	Revert "[SPARK-24277][SQL] Code clean up in SQL module: HadoopMapReduceCommitProtocol" This reverts commit `7b2dca5b12`.	2018-05-18 12:51:09 -07:00
Marcelo Vanzin	ed7ba7db8f	[SPARK-23850][SQL] Add separate config for SQL options redaction. The old code was relying on a core configuration and extended its default value to include things that redact desired things in the app's environment. Instead, add a SQL-specific option for which options to redact, and apply both the core and SQL-specific rules when redacting the options in the save command. This is a little sub-optimal since it adds another config, but it retains the current default behavior. While there I also fixed a typo and a couple of minor config API usage issues in the related redaction option that SQL already had. Tested with existing unit tests, plus checking the env page on a shell UI. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #21158 from vanzin/SPARK-23850.	2018-05-18 11:14:22 -07:00
Tathagata Das	807ba44cb7	[SPARK-24159][SS] Enable no-data micro batches for streaming mapGroupswithState ## What changes were proposed in this pull request? Enabled no-data batches in flatMapGroupsWithState in following two cases. - When ProcessingTime timeout is used, then we always run a batch every trigger interval. - When event-time watermark is defined, then the user may be doing arbitrary logic against the watermark value even if timeouts are not set. In such cases, it's best to run batches whenever the watermark has changed, irrespective of whether timeouts (i.e. event-time timeout) have been explicitly enabled. ## How was this patch tested? updated tests Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #21345 from tdas/SPARK-24159.	2018-05-18 10:35:43 -07:00
Soham Aurangabadkar	7696b9de0d	[SPARK-20538][SQL] Wrap Dataset.reduce with withNewRddExecutionId. ## What changes were proposed in this pull request? Wrap Dataset.reduce with `withNewExecutionId`. Author: Soham Aurangabadkar <sohama4@gmail.com> Closes #21316 from sohama4/dataset_reduce_withexecutionid.	2018-05-18 10:29:34 -07:00
Gengliang Wang	7b2dca5b12	[SPARK-24277][SQL] Code clean up in SQL module: HadoopMapReduceCommitProtocol ## What changes were proposed in this pull request? In HadoopMapReduceCommitProtocol and FileFormatWriter, there are unnecessary settings in hadoop configuration. Also clean up some code in SQL module. ## How was this patch tested? Unit test Author: Gengliang Wang <gengliang.wang@databricks.com> Closes #21329 from gengliangwang/codeCleanWrite.	2018-05-18 15:32:29 +08:00
jinxing	8a837bf4f3	[SPARK-24193] create TakeOrderedAndProjectExec only when the limit number is below spark.sql.execution.topKSortFallbackThreshold. ## What changes were proposed in this pull request? Physical plan of `select colA from t order by colB limit M` is `TakeOrderedAndProject`; Currently `TakeOrderedAndProject` sorts data in memory, see https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala#L158 We can add a config – if the number of limit (M) is too big, we can sort by disk. Thus memory issue can be resolved. ## How was this patch tested? Test added Author: jinxing <jinxing6042@126.com> Closes #21252 from jinxing64/SPARK-24193.	2018-05-17 22:29:18 +08:00
Marco Gaido	69350aa2f0	[SPARK-23922][SQL] Add arrays_overlap function ## What changes were proposed in this pull request? The PR adds the function `arrays_overlap`. This function returns `true` if the input arrays contain a non-null common element; if not, it returns `null` if any of the arrays contains a `null` element, `false` otherwise. ## How was this patch tested? added UTs Author: Marco Gaido <marcogaido91@gmail.com> Closes #21028 from mgaido91/SPARK-23922.	2018-05-17 20:45:32 +08:00
Florent Pépin	3e66350c24	[SPARK-23925][SQL] Add array_repeat collection function ## What changes were proposed in this pull request? The PR adds a new collection function, array_repeat. As there already was a function repeat with the same signature, with the only difference being the expected return type (String instead of Array), the new function is called array_repeat to distinguish. The behaviour of the function is based on Presto's one. The function creates an array containing a given element repeated the requested number of times. ## How was this patch tested? New unit tests added into: - CollectionExpressionsSuite - DataFrameFunctionsSuite Author: Florent Pépin <florentpepin.92@gmail.com> Author: Florent Pépin <florent.pepin14@imperial.ac.uk> Closes #21208 from pepinoflo/SPARK-23925.	2018-05-17 13:31:14 +09:00
Tathagata Das	991726f31a	[SPARK-24158][SS] Enable no-data batches for streaming joins ## What changes were proposed in this pull request? This is a continuation of the larger task of enabling zero-data batches for more eager state cleanup. This PR enables it for stream-stream joins. ## How was this patch tested? - Updated join tests. Additionally, updated them to not use `CheckLastBatch` anywhere to set good precedence for future. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #21253 from tdas/SPARK-24158.	2018-05-16 14:55:02 -07:00
Gengliang Wang	6fb7d6c4f7	[SPARK-24275][SQL] Revise doc comments in InputPartition ## What changes were proposed in this pull request? In #21145, DataReaderFactory is renamed to InputPartition. This PR is to revise wording in the comments to make it more clear. ## How was this patch tested? None Author: Gengliang Wang <gengliang.wang@databricks.com> Closes #21326 from gengliangwang/revise_reader_comments.	2018-05-17 00:40:39 +08:00
Wenchen Fan	943493b165	Revert "[SPARK-22938][SQL][FOLLOWUP] Assert that SQLConf.get is accessed only on the driver" This reverts commit `a4206d58e0`. This is from https://github.com/apache/spark/pull/21299 and to ease the review of it. Author: Wenchen Fan <wenchen@databricks.com> Closes #21341 from cloud-fan/revert.	2018-05-16 22:01:24 +08:00
Jose Torres	3fabbc5762	[SPARK-24040][SS] Support single partition aggregates in continuous processing. ## What changes were proposed in this pull request? Support aggregates with exactly 1 partition in continuous processing. A few small tweaks are needed to make this work: * Replace currentEpoch tracking with a ThreadLocal. This means that current epoch is scoped to a task rather than a node, but I think that's sustainable even once we add shuffle. * Add a new testing-only flag to disable the UnsupportedOperationChecker whitelist of allowed continuous processing nodes. I think this is preferable to writing a pile of custom logic to enforce that there is in fact only 1 partition; we plan to support multi-partition aggregates before the next Spark release, so we'd just have to tear that logic back out. * Restart continuous processing queries from the first available uncommitted epoch, rather than one that's guaranteed to be unused. This is required for stateful operators to overwrite partial state from the previous attempt at the epoch, and there was no specific motivation for the original strategy. In another PR before stabilizing the StreamWriter API, we'll need to narrow down and document more precise semantic guarantees for the epoch IDs. * We need a single-partition ContinuousMemoryStream. The way MemoryStream is constructed means it can't be a text option like it is for rate source, unfortunately. ## How was this patch tested? new unit tests Author: Jose Torres <torres.joseph.f+github@gmail.com> Closes #21239 from jose-torres/withAggr.	2018-05-15 10:25:29 -07:00
Liang-Chi Hsieh	d610d2a3f5	[SPARK-24259][SQL] ArrayWriter for Arrow produces wrong output ## What changes were proposed in this pull request? Right now `ArrayWriter` used to output Arrow data for array type, doesn't do `clear` or `reset` after each batch. It produces wrong output. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #21312 from viirya/SPARK-24259.	2018-05-15 22:06:58 +08:00
maryannxue	80c6d35a3e	[SPARK-24035][SQL] SQL syntax for Pivot - fix antlr warning ## What changes were proposed in this pull request? 1. Change antlr rule to fix the warning. 2. Add PIVOT/LATERAL check in AstBuilder with a more meaningful error message. ## How was this patch tested? 1. Add a counter case in `PlanParserSuite.test("lateral view")` Author: maryannxue <maryann.xue@gmail.com> Closes #21324 from maryannxue/spark-24035-fix.	2018-05-14 23:34:42 -07:00
Goun Na	e29176fd7d	[SPARK-23627][SQL] Provide isEmpty in Dataset ## What changes were proposed in this pull request? This PR adds isEmpty() to Dataset. ## How was this patch tested? Unit tests added. Author: Goun Na <gounna@gmail.com> Author: goungoun <gounna@gmail.com> Closes #20800 from goungoun/SPARK-23627.	2018-05-15 14:11:20 +08:00
Henry Robinson	061e0084ce	[SPARK-23852][SQL] Add withSQLConf(...) to test case ## What changes were proposed in this pull request? Add a `withSQLConf(...)` wrapper to force Parquet filter pushdown for a test that relies on it. ## How was this patch tested? Test passes Author: Henry Robinson <henry@apache.org> Closes #21323 from henryr/spark-23582.	2018-05-14 14:35:08 -07:00
Maxim Gekk	8cd83acf40	[SPARK-24027][SQL] Support MapType with StringType for keys as the root type by from_json ## What changes were proposed in this pull request? Currently, the from_json function supports StructType or ArrayType as the root type. The PR allows specifying MapType(StringType, DataType) as the root type in addition to the mentioned types. For example: ```scala import org.apache.spark.sql.types._ val schema = MapType(StringType, IntegerType) val in = Seq("""{"a": 1, "b": 2, "c": 3}""").toDS() in.select(from_json($"value", schema, Map[String, String]())).collect() ``` ``` res1: Array[org.apache.spark.sql.Row] = Array([Map(a -> 1, b -> 2, c -> 3)]) ``` ## How was this patch tested? It was checked by new tests for the map type with integer type and struct type as value types. Also roundtrip tests like from_json(to_json) and to_json(from_json) for MapType are added. Author: Maxim Gekk <maxim.gekk@databricks.com> Author: Maxim Gekk <max.gekk@gmail.com> Closes #21108 from MaxGekk/from_json-map-type.	2018-05-14 14:05:42 -07:00
Shixiong Zhu	c26f673252	[SPARK-24246][SQL] Improve AnalysisException by setting the cause when it's available ## What changes were proposed in this pull request? If there is an exception, it's better to set it as the cause of AnalysisException since the exception may contain useful debug information. ## How was this patch tested? Jenkins Author: Shixiong Zhu <zsxwing@gmail.com> Closes #21297 from zsxwing/SPARK-24246.	2018-05-14 11:37:57 -07:00
Kazuaki Ishizaki	b6c50d7820	[SPARK-24228][SQL] Fix Java lint errors ## What changes were proposed in this pull request? This PR fixes the following Java lint errors caused by unused imports and lines longer than 100 characters ``` $ dev/lint-java Using `mvn` from path: /usr/bin/mvn Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/sql/sources/v2/reader/partitioning/Distribution.java:[25] (sizes) LineLength: Line is longer than 100 characters (found 109). [ERROR] src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/ContinuousReader.java:[38] (sizes) LineLength: Line is longer than 100 characters (found 102). [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[21,8] (imports) UnusedImports: Unused import - java.io.ByteArrayInputStream. [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java:[29,8] (imports) UnusedImports: Unused import - org.apache.spark.unsafe.Platform. [ERROR] src/test/java/test/org/apache/spark/sql/sources/v2/JavaAdvancedDataSourceV2.java:[110] (sizes) LineLength: Line is longer than 100 characters (found 101). ``` With this PR ``` $ dev/lint-java Using `mvn` from path: /usr/bin/mvn Checkstyle checks passed. ``` ## How was this patch tested? Existing UTs. Also manually run checkstyles against these two files. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #21301 from kiszk/SPARK-24228.	2018-05-14 10:57:10 +08:00
Maxim Gekk	7a2d4895c7	[SPARK-17916][SQL] Fix empty string being parsed as null when nullValue is set. ## What changes were proposed in this pull request? I propose to bump version of uniVocity parser up to 2.6.3 where quoted empty strings are replaced by the empty value (passed to `setEmptyValue`) instead of `null` values as in the current version 2.5.9: https://github.com/uniVocity/univocity-parsers/blob/v2.6.3/src/main/java/com/univocity/parsers/csv/CsvParser.java#L125 Empty value for writer is set to `""`. So, empty string in dataframe/dataset is stored as empty quoted string `""`. Empty value for reader is set to empty string (zero size). In this way, saved empty quoted string will be read as just empty string. Please, look at the tests for more details. Here are main changes made in [2.6.0](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.0), [2.6.1](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.1), [2.6.2](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.2), [2.6.3](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.6.3): - CSV parser now parses quoted values ~30% faster - CSV format detection process has option provide a list of possible delimiters, in order of priority ( i.e. settings.detectFormatAutomatically( '-', '.');) - https://github.com/uniVocity/univocity-parsers/issues/214 - Implemented trim quoted values support - https://github.com/uniVocity/univocity-parsers/issues/230 - NullPointer when stopping parser when nothing is parsed - https://github.com/uniVocity/univocity-parsers/issues/219 - Concurrency issue when calling stopParsing() - https://github.com/uniVocity/univocity-parsers/issues/231 Closes #20068 ## How was this patch tested? Added tests from the PR https://github.com/apache/spark/pull/20068 Author: Maxim Gekk <maxim.gekk@databricks.com> Closes #21273 from MaxGekk/univocity-2.6.	2018-05-14 10:01:06 +08:00
Cody Allen	32acfa78c6	Improve implicitNotFound message for Encoder The `implicitNotFound` message for `Encoder` doesn't mention the name of the type for which it can't find an encoder. Furthermore, it covers up the fact that `Encoder` is the name of the relevant type class. Hopefully this new message provides a little more specific type detail while still giving the general message about which types are supported. ## What changes were proposed in this pull request? Augment the existing message to mention that it's looking for an `Encoder` and what the type of the encoder is. For example instead of: ``` Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases. ``` return this message: ``` Unable to find encoder for type Exception. An implicit Encoder[Exception] is needed to store Exception instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases. ``` ## How was this patch tested? It was tested manually in the Scala REPL, since triggering this in a test would cause a compilation error. ``` scala> implicitly[Encoder[Exception]] <console>:51: error: Unable to find encoder for type Exception. An implicit Encoder[Exception] is needed to store Exception instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases. implicitly[Encoder[Exception]] ^ ``` Author: Cody Allen <ceedubs@gmail.com> Closes #20869 from ceedubs/encoder-implicit-msg.	2018-05-12 14:35:40 -05:00
Reynold Xin	e3dabdf6ef	[SPARK-23907] Removes regr_* functions in functions.scala ## What changes were proposed in this pull request? This patch removes the various regr_* functions in functions.scala. They are so uncommon that I don't think they deserve real estate in functions.scala. We can consider adding them later if more users need them. ## How was this patch tested? Removed the associated test case as well. Author: Reynold Xin <rxin@databricks.com> Closes #21309 from rxin/SPARK-23907.	2018-05-12 12:15:36 +08:00
aditkumar	92f6f52ff0	[MINOR][DOCS] Documenting months_between direction ## What changes were proposed in this pull request? It's useful to know what relationship between date1 and date2 results in a positive number. Author: aditkumar <aditkumar@gmail.com> Author: Adit Kumar <aditkumar@gmail.com> Closes #20787 from aditkumar/master.	2018-05-11 14:42:23 -05:00
Wenchen Fan	928845a422	[SPARK-24172][SQL] we should not apply operator pushdown to data source v2 many times ## What changes were proposed in this pull request? In `PushDownOperatorsToDataSource`, we use `transformUp` to match `PhysicalOperation` and apply pushdown. This is problematic if we have multiple `Filter` and `Project` above the data source v2 relation. e.g. for a query ``` Project Filter DataSourceV2Relation ``` The pattern match will be triggered twice and we will do operator pushdown twice. This is unnecessary, we can use `mapChildren` to only apply pushdown once. ## How was this patch tested? existing test Author: Wenchen Fan <wenchen@databricks.com> Closes #21230 from cloud-fan/step2.	2018-05-11 10:00:28 -07:00
Wenchen Fan	a4206d58e0	[SPARK-22938][SQL][FOLLOWUP] Assert that SQLConf.get is accessed only on the driver ## What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/20136 . #20136 didn't really work because in the test, we are using local backend, which shares the driver side `SparkEnv`, so `SparkEnv.get.executorId == SparkContext.DRIVER_IDENTIFIER` doesn't work. This PR changes the check to `TaskContext.get != null`, moves the check to `SQLConf.get`, and fixes all the places that violate this check: * `InMemoryTableScanExec#createAndDecompressColumn` is executed inside `rdd.map`, we can't access `conf.offHeapColumnVectorEnabled` there. https://github.com/apache/spark/pull/21223 merged * `DataType#sameType` may be executed in executor side, for things like json schema inference, so we can't call `conf.caseSensitiveAnalysis` there. This contributes to most of the code changes, as we need to add `caseSensitive` parameter to a lot of methods. * `ParquetFilters` is used in the file scan function, which is executed in executor side, so we can't call `conf.parquetFilterPushDownDate` there. https://github.com/apache/spark/pull/21224 merged * `WindowExec#createBoundOrdering` is called on executor side, so we can't use `conf.sessionLocalTimezone` there. https://github.com/apache/spark/pull/21225 merged * `JsonToStructs` can be serialized to executors and evaluate, we should not call `SQLConf.get.getConf(SQLConf.FROM_JSON_FORCE_NULLABLE_SCHEMA)` in the body. https://github.com/apache/spark/pull/21226 merged ## How was this patch tested? existing test Author: Wenchen Fan <wenchen@databricks.com> Closes #21190 from cloud-fan/minor.	2018-05-11 09:01:40 +08:00
Maxim Gekk	f4fed05121	[SPARK-24171] Adding a note for non-deterministic functions ## What changes were proposed in this pull request? I propose to add a clear statement for functions like `collect_list()` about non-deterministic behavior of such functions. The behavior must be taken into account by user while creating and running queries. Author: Maxim Gekk <maxim.gekk@databricks.com> Closes #21228 from MaxGekk/deterministic-comments.	2018-05-10 09:44:49 -07:00
Marco Gaido	94d6714482	[SPARK-23907][SQL] Add regr_* functions ## What changes were proposed in this pull request? The PR introduces regr_slope, regr_intercept, regr_r2, regr_sxx, regr_syy, regr_sxy, regr_avgx, regr_avgy, regr_count. The implementation of these functions mirrors Hive's in HIVE-15978. ## How was this patch tested? added UT (values compared with Hive) Author: Marco Gaido <marcogaido91@gmail.com> Closes #21054 from mgaido91/SPARK-23907.	2018-05-10 20:38:52 +09:00
Dongjoon Hyun	e3d4349947	[SPARK-22279][SQL] Enable `convertMetastoreOrc` by default ## What changes were proposed in this pull request? We reverted `spark.sql.hive.convertMetastoreOrc` at https://github.com/apache/spark/pull/20536 because we should not ignore the table-specific compression conf. Now, it's resolved via [SPARK-23355](`8aa1d7b0ed`). ## How was this patch tested? Pass the Jenkins. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #21186 from dongjoon-hyun/SPARK-24112.	2018-05-10 13:36:52 +08:00
Ryan Blue	62d01391fe	[SPARK-24073][SQL] Rename DataReaderFactory to InputPartition. ## What changes were proposed in this pull request? Renames: * `DataReaderFactory` to `InputPartition` * `DataReader` to `InputPartitionReader` * `createDataReaderFactories` to `planInputPartitions` * `createUnsafeDataReaderFactories` to `planUnsafeInputPartitions` * `createBatchDataReaderFactories` to `planBatchInputPartitions` This fixes the changes in SPARK-23219, which renamed ReadTask to DataReaderFactory. The intent of that change was to make the read and write API match (write side uses DataWriterFactory), but the underlying problem is that the two classes are not equivalent. ReadTask/DataReader function as Iterable/Iterator. One InputPartition is a specific partition of the data to be read, in contrast to DataWriterFactory where the same factory instance is used in all write tasks. InputPartition's purpose is to manage the lifecycle of the associated reader, which is now called InputPartitionReader, with an explicit create operation to mirror the close operation. This was no longer clear from the API because DataReaderFactory appeared to be more generic than it is and it isn't clear why a set of them is produced for a read. ## How was this patch tested? Existing tests, which have been updated to use the new name. Author: Ryan Blue <blue@apache.org> Closes #21145 from rdblue/SPARK-24073-revert-data-reader-factory-rename.	2018-05-09 21:48:54 -07:00
Henry Robinson	9341c951e8	[SPARK-23852][SQL] Add test that fails if PARQUET-1217 is not fixed ## What changes were proposed in this pull request? Add a new test that triggers if PARQUET-1217 - a predicate pushdown bug - is not fixed in Spark's Parquet dependency. ## How was this patch tested? New unit test passes. Author: Henry Robinson <henry@apache.org> Closes #21284 from henryr/spark-23852.	2018-05-09 19:56:03 -07:00
Shixiong Zhu	fd1179c172	[SPARK-24214][SS] Fix toJSON for StreamingRelationV2/StreamingExecutionRelation/ContinuousExecutionRelation ## What changes were proposed in this pull request? We should overwrite "otherCopyArgs" to provide the SparkSession parameter otherwise TreeNode.toJSON cannot get the full constructor parameter list. ## How was this patch tested? The new unit test. Author: Shixiong Zhu <zsxwing@gmail.com> Closes #21275 from zsxwing/SPARK-24214.	2018-05-09 11:32:17 -07:00
Marcelo Vanzin	cc613b552e	[PYSPARK] Update py4j to version 0.10.7.	2018-05-09 10:47:35 -07:00
DB Tsai	6ea582e36a	[SPARK-24181][SQL] Better error message for writing sorted data ## What changes were proposed in this pull request? The exception message should clearly distinguish sorting and bucketing in `save` and `jdbc` write. When a user tries to write a sorted data using save or insertInto, it will throw an exception with message that `s"'$operation' does not support bucketing right now""`. We should throw `s"'$operation' does not support sortBy right now""` instead. ## How was this patch tested? More tests in `DataFrameReaderWriterSuite.scala` Author: DB Tsai <d_tsai@apple.com> Closes #21235 from dbtsai/fixException.	2018-05-09 09:15:16 -07:00
Ryan Blue	cac9b1dea1	[SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0. ## What changes were proposed in this pull request? This updates Parquet to 1.10.0 and updates the vectorized path for buffer management changes. Parquet 1.10.0 uses ByteBufferInputStream instead of byte arrays in encoders. This allows Parquet to break allocations into smaller chunks that are better for garbage collection. ## How was this patch tested? Existing Parquet tests. Running in production at Netflix for about 3 months. Author: Ryan Blue <blue@apache.org> Closes #21070 from rdblue/SPARK-23972-update-parquet-to-1.10.0.	2018-05-09 12:27:32 +08:00
Maxim Gekk	e3de6ab30d	[SPARK-24068] Propagating DataFrameReader's options to Text datasource on schema inferring ## What changes were proposed in this pull request? While reading CSV or JSON files, DataFrameReader's options are converted to Hadoop's parameters, for example there: https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L302 but the options are not propagated to Text datasource on schema inferring, for instance: https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L184-L188 The PR proposes propagation of user's options to Text datasource on schema inferring in similar way as user's options are converted to Hadoop parameters if schema is specified. ## How was this patch tested? The changes were tested manually by using https://github.com/twitter/hadoop-lzo: ``` hadoop-lzo> mvn clean package hadoop-lzo> ln -s ./target/hadoop-lzo-0.4.21-SNAPSHOT.jar ./hadoop-lzo.jar ``` Create 2 test files in JSON and CSV format and compress them: ```shell $ cat test.csv col1|col2 a|1 $ lzop test.csv $ cat test.json {"col1":"a","col2":1} $ lzop test.json ``` Run `spark-shell` with hadoop-lzo: ``` bin/spark-shell --jars ~/hadoop-lzo/hadoop-lzo.jar ``` reading compressed CSV and JSON without schema: ```scala spark.read.option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").option("inferSchema",true).option("header",true).option("sep","|").csv("test.csv.lzo").show() +----+----+ |col1|col2| +----+----+ | a| 1| +----+----+ ``` ```scala spark.read.option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").option("multiLine", true).json("test.json.lzo").printSchema root |-- col1: string (nullable = true) |-- col2: long (nullable = true) ``` Author: Maxim Gekk <maxim.gekk@databricks.com> Author: Maxim Gekk <max.gekk@gmail.com> Closes #21182 from MaxGekk/text-options.	2018-05-09 08:32:20 +08:00
Yuming Wang	487faf17ab	[SPARK-24117][SQL] Unified the getSizePerRow ## What changes were proposed in this pull request? This PR unifies `getSizePerRow` because it is used in many places. For example: 1. [LocalRelation.scala#L80](`f70f46d1e5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LocalRelation.scala (L80)`) 2. [SizeInBytesOnlyStatsPlanVisitor.scala#L36](`76b8b840dd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala (L36)`) ## How was this patch tested? Existing tests Author: Yuming Wang <yumwang@ebay.com> Closes #21189 from wangyum/SPARK-24117.	2018-05-08 23:43:02 +08:00
gatorsmile	2f6fe7d679	[SPARK-23094][SPARK-23723][SPARK-23724][SQL][FOLLOW-UP] Support custom encoding for json files ## What changes were proposed in this pull request? This is to add a test case to check the behaviors when users write json in the specified UTF-16/UTF-32 encoding with multiline off. ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #21254 from gatorsmile/followupSPARK-23094.	2018-05-08 21:24:35 +08:00
yucai	e17567ca78	[SPARK-24076][SQL] Use different seed in HashAggregate to avoid hash conflict ## What changes were proposed in this pull request? HashAggregate uses the same hash algorithm and seed as ShuffleExchange, which may lead to bad hash conflicts when shuffle.partitions=8192*n. Consider the example below: ``` SET spark.sql.shuffle.partitions=8192; INSERT OVERWRITE TABLE target_xxx SELECT item_id, auct_end_dt FROM source_xxx GROUP BY item_id, auct_end_dt; ``` In the shuffle stage, if the user sets shuffle.partitions = 8192, all tuples in the same partition will meet the following relationship: ``` hash(tuple x) = hash(tuple y) + n * 8192 ``` Then in the next HashAggregate stage, all tuples from the same partition need to be put into a 16K BytesToBytesMap (unsafeRowAggBuffer). Here, the HashAggregate uses the same hash algorithm on the same expression as shuffle, and uses the same seed, and 16K = 8192 * 2, so actually, all tuples in the same partition will only be hashed to 2 different places in the BytesToBytesMap. This is a bad hash conflict. As the BytesToBytesMap grows, the conflict will always exist. Before change: <img width="334" alt="hash_conflict" src="https://user-images.githubusercontent.com/2989575/39250210-ed032d46-48d2-11e8-855a-c1afc2a0ceb5.png"> After change: <img width="334" alt="no_hash_conflict" src="https://user-images.githubusercontent.com/2989575/39250218-f1cb89e0-48d2-11e8-9244-5a93c1e8b60d.png"> ## How was this patch tested? Unit tests and production cases. Author: yucai <yyu1@ebay.com> Closes #21149 from yucai/SPARK-24076.	2018-05-08 11:34:27 +02:00
Henry Robinson	cd12c5c3ec	[SPARK-24128][SQL] Mention configuration option in implicit CROSS JOIN error ## What changes were proposed in this pull request? Mention `spark.sql.crossJoin.enabled` in error message when an implicit `CROSS JOIN` is detected. ## How was this patch tested? `CartesianProductSuite` and `JoinSuite`. Author: Henry Robinson <henry@apache.org> Closes #21201 from henryr/spark-24128.	2018-05-08 12:21:33 +08:00
Bruce Robbins	d83e963724	[SPARK-24043][SQL] Interpreted Predicate should initialize nondeterministic expressions ## What changes were proposed in this pull request? When creating an InterpretedPredicate instance, initialize any Nondeterministic expressions in the expression tree to avoid java.lang.IllegalArgumentException on later call to eval(). ## How was this patch tested? - sbt SQL tests - python SQL tests - new unit test Author: Bruce Robbins <bersprockets@gmail.com> Closes #21144 from bersprockets/interpretedpredicate.	2018-05-07 17:54:39 +02:00
Herman van Hovell	4e861db5f1	[SPARK-16406][SQL] Improve performance of LogicalPlan.resolve ## What changes were proposed in this pull request? `LogicalPlan.resolve(...)` uses linear searches to find an attribute matching a name. This is fine in normal cases, but gets problematic when you try to resolve a large number of columns on a plan with a large number of attributes. This PR adds an indexing structure to `resolve(...)` in order to find potential matches quicker. This PR improves the reference resolution time for the following code by 4x (11.8s -> 2.4s): ``` scala val n = 4000 val values = (1 to n).map(_.toString).mkString(", ") val columns = (1 to n).map("column" + _).mkString(", ") val query = s""" |SELECT $columns |FROM VALUES ($values) T($columns) |WHERE 1=2 AND 1 IN ($columns) |GROUP BY $columns |ORDER BY $columns |""".stripMargin spark.time(sql(query)) ``` ## How was this patch tested? Existing tests. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #14083 from hvanhovell/SPARK-16406.	2018-05-07 11:21:22 +02:00
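The idea behind SPARK-16406, replacing a linear scan with an index, can be sketched as follows. The function names and the case-insensitive matching are assumptions for illustration; the structure actually added to `resolve(...)` may differ.

```python
from collections import defaultdict

def resolve_linear(attributes, name):
    # Scans every attribute on each lookup: O(number of attributes).
    return [a for a in attributes if a.lower() == name.lower()]

def build_index(attributes):
    # Built once per plan; maps a lowercased name to its candidate attributes.
    index = defaultdict(list)
    for a in attributes:
        index[a.lower()].append(a)
    return index

def resolve_indexed(index, name):
    # A single dictionary lookup per name.
    return index.get(name.lower(), [])

attributes = ["column" + str(i) for i in range(1, 4001)]
index = build_index(attributes)
print(resolve_indexed(index, "COLUMN3"))
```

Resolving n names against n attributes costs on the order of n² comparisons with the linear version and roughly n dictionary lookups with the index.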
Marco Gaido	e35ad3cadd	[SPARK-23930][SQL] Add slice function ## What changes were proposed in this pull request? The PR adds the `slice` function. Its behavior is based on Presto's. The function slices an array according to the requested start index and length. ## How was this patch tested? added UTs Author: Marco Gaido <marcogaido91@gmail.com> Closes #21040 from mgaido91/SPARK-23930.	2018-05-07 16:57:37 +09:00
Gabor Somogyi	c5981976f1	[SPARK-23775][TEST] Make DataFrameRangeSuite not flaky ## What changes were proposed in this pull request? DataFrameRangeSuite.test("Cancelling stage in a query with Range.") sometimes gets stuck in an infinite loop and times out the build. There were multiple issues with the test: 1. The first valid stageId is zero when the test is started alone and not in a suite, and the following code waits until timeout: ``` eventually(timeout(10.seconds), interval(1.millis)) { assert(DataFrameRangeSuite.stageToKill > 0) } ``` 2. The `DataFrameRangeSuite.stageToKill` was overwritten by the task's thread after the reset, which ended up canceling the same stage 2 times. This caused the infinite wait. This PR solves the mentioned flakiness by removing the shared `DataFrameRangeSuite.stageToKill` and using `onTaskStart`, where the stage ID is provided. To make sure `cancelStage` is called for all stages, `waitUntilEmpty` is called on `ListenerBus`. In [PR20888](https://github.com/apache/spark/pull/20888) an attempt was made to solve this by: * Stopping the executor thread with `wait` * Waiting until `cancelStage` has been called for all stages * Killing the executor thread by setting `SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL` but the thread killing sometimes left the shared `SparkContext` in a state where further jobs can't be submitted. As a result, the DataFrameRangeSuite.test("Cancelling stage in a query with Range.") test passed properly but the next test inside the suite was hanging. ## How was this patch tested? Existing unit test executed 10k times. Author: Gabor Somogyi <gabor.g.somogyi@gmail.com> Closes #21214 from gaborgsomogyi/SPARK-23775_1.	2018-05-07 14:45:14 +08:00
Kazuaki Ishizaki	7564a9a706	[SPARK-23921][SQL] Add array_sort function ## What changes were proposed in this pull request? The PR adds the SQL function `array_sort`. The behavior of the function is based on Presto's one. The function sorts the input array in ascending order. The elements of the input array must be orderable. Null elements will be placed at the end of the returned array. ## How was this patch tested? Added UTs Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #21021 from kiszk/SPARK-23921.	2018-05-07 15:22:23 +09:00
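The ordering described for `array_sort` (ascending, with null elements placed last) can be sketched in plain Python. This models only the semantics stated in the commit message; it is not Spark code, and the function below is a hypothetical stand-in.

```python
def array_sort(values):
    # Sort the non-null elements in ascending order, then append the nulls.
    non_null = sorted(v for v in values if v is not None)
    nulls = [v for v in values if v is None]
    return non_null + nulls

print(array_sort([3, None, 1, 2]))
```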
gatorsmile	f38ea00e83	[SPARK-24017][SQL] Refactor ExternalCatalog to be an interface ## What changes were proposed in this pull request? This refactors the external catalog to be an interface. It can be easier for the future work in the catalog federation. After the refactoring, `ExternalCatalog` is much cleaner without mixing the listener event generation logic. ## How was this patch tested? The existing tests Author: gatorsmile <gatorsmile@gmail.com> Closes #21122 from gatorsmile/refactorExternalCatalog.	2018-05-06 20:41:32 -07:00
Tathagata Das	47b5b68528	[SPARK-24157][SS] Enabled no-data batches in MicroBatchExecution for streaming aggregation and deduplication. ## What changes were proposed in this pull request? This PR enables the MicroBatchExecution to run no-data batches if some SparkPlan requires running another batch to output results based on updated watermark / processing time. In this PR, I have enabled streaming aggregations and streaming deduplicates to automatically run an additional batch even if no new data is available. See https://issues.apache.org/jira/browse/SPARK-24156 for more context. Major changes/refactoring done in this PR. - Refactoring MicroBatchExecution - A major point of confusion in the MicroBatchExecution control flow (at least to me) was that `populateStartOffsets` internally called `constructNextBatch`, which was not obvious from just the name "populateStartOffsets" and made the control flow from the main trigger execution loop very confusing (the main loop in `runActivatedStream` called `constructNextBatch` but only if `populateStartOffsets` hadn't already called it). Instead, the refactoring makes it cleaner. - `populateStartOffsets` only updates `availableOffsets` and `committedOffsets`. It does not call `constructNextBatch`. - The main loop in `runActivatedStream` calls `constructNextBatch`, which returns true or false reflecting whether the next batch is ready for executing. This method is now idempotent; if a batch has already been constructed, then it will always return true until the batch has been executed. - If the next batch is ready, we call `runBatch`; otherwise we sleep. - That's it. - Refactoring watermark management logic - This has been refactored out from `MicroBatchExecution` into a separate class to simplify `MicroBatchExecution`. - New method `shouldRunAnotherBatch` in `IncrementalExecution` - This returns true if there is any stateful operation in the last execution plan that requires another batch for state cleanup, etc. 
This is used to decide whether to construct a batch or not in `constructNextBatch`. - Changes to stream testing framework - Many tests used CheckLastBatch to validate answers. This assumed that there will be no more batches after the last set of input has been processed, so the last batch is the one that has output corresponding to the last input. This is not true anymore. To account for that, I made two changes. - `CheckNewAnswer` is a new test action that verifies the new rows generated since the last time the answer was checked by `CheckAnswer`, `CheckNewAnswer` or `CheckLastBatch`. This is agnostic to how many batches occurred between the last check and now. To make this easier, I added a common trait between MemorySink and MemorySinkV2 to abstract out some common methods. - `assertNumStateRows` has been updated in the same way to be agnostic to batches while checking the total rows and how many state rows were updated (it sums up updates since the last check). ## How was this patch tested? - Changes made to existing tests - Tests have been changed in one of the following patterns. - Tests where the last input was given again to force another batch to be executed and state cleaned up / output generated were simplified by removing the extra input. - Tests using aggregation+watermark where CheckLastBatch was replaced with CheckNewAnswer to make them batch agnostic. - New tests added to check whether the flag works for streaming aggregation and deduplication Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #21220 from tdas/SPARK-24157.	2018-05-04 16:35:24 -07:00
Jose Torres	af4dc50280	[SPARK-24039][SS] Do continuous processing writes with multiple compute() calls ## What changes were proposed in this pull request? Do continuous processing writes with multiple compute() calls. The current strategy (before this PR) is hacky; we just call next() on an iterator which has already returned hasNext = false, knowing that all the nodes we whitelist handle this properly. This will have to be changed before we can support more complex query plans. (In particular, I have a WIP https://github.com/jose-torres/spark/pull/13 which should be able to support aggregates in a single partition with minimal additional work.) Most of the changes here are just refactoring to accommodate the new model. The behavioral changes are: * The writer now calls prev.compute(split, context) once per epoch within the epoch loop. * ContinuousDataSourceRDD now spawns a ContinuousQueuedDataReader which is shared across multiple calls to compute() for the same partition. ## How was this patch tested? existing unit tests Author: Jose Torres <torres.joseph.f+github@gmail.com> Closes #21200 from jose-torres/noAggr.	2018-05-04 14:14:40 -07:00
Arun Mahadevan	7f1b6b182e	[SPARK-24136][SS] Fix MemoryStreamDataReader.next to skip sleeping if record is available ## What changes were proposed in this pull request? Avoid unnecessary sleep (10 ms) in each invocation of MemoryStreamDataReader.next. ## How was this patch tested? Ran ContinuousSuite from IDE. Author: Arun Mahadevan <arunm@apache.org> Closes #21207 from arunmahadevan/memorystream.	2018-05-04 16:02:21 +08:00
Wenchen Fan	0c23e254c3	[SPARK-24167][SQL] ParquetFilters should not access SQLConf at executor side ## What changes were proposed in this pull request? This PR is extracted from #21190 , to make it easier to backport. `ParquetFilters` is used in the file scan function, which is executed in executor side, so we can't call `conf.parquetFilterPushDownDate` there. ## How was this patch tested? it's tested in #21190 Author: Wenchen Fan <wenchen@databricks.com> Closes #21224 from cloud-fan/minor2.	2018-05-04 09:27:14 +08:00
Wenchen Fan	e646ae67f2	[SPARK-24168][SQL] WindowExec should not access SQLConf at executor side ## What changes were proposed in this pull request? This PR is extracted from #21190 , to make it easier to backport. `WindowExec#createBoundOrdering` is called on executor side, so we can't use `conf.sessionLocalTimezone` there. ## How was this patch tested? tested in #21190 Author: Wenchen Fan <wenchen@databricks.com> Closes #21225 from cloud-fan/minor3.	2018-05-03 17:27:13 -07:00
maryannxue	e3201e165e	[SPARK-24035][SQL] SQL syntax for Pivot ## What changes were proposed in this pull request? Add SQL support for Pivot according to Pivot grammar defined by Oracle (https://docs.oracle.com/database/121/SQLRF/img_text/pivot_clause.htm) with some simplifications, based on our existing functionality and limitations for Pivot at the backend: 1. For pivot_for_clause (https://docs.oracle.com/database/121/SQLRF/img_text/pivot_for_clause.htm), the column list form is not supported, which means the pivot column can only be one single column. 2. For pivot_in_clause (https://docs.oracle.com/database/121/SQLRF/img_text/pivot_in_clause.htm), the sub-query form and "ANY" are not supported (this is only supported by Oracle for XML anyway). 3. For pivot_in_clause, aliases for the constant values are not supported. The code changes are: 1. Add parser support for Pivot. Note that according to https://docs.oracle.com/database/121/SQLRF/statements_10002.htm#i2076542, Pivot cannot be used together with lateral views in the from clause. This restriction has been implemented in the Parser rule. 2. Infer group-by expressions: group-by expressions are not explicitly specified in the SQL Pivot clause and need to be deduced based on this rule: https://docs.oracle.com/database/121/SQLRF/statements_10002.htm#CHDFAFIE, so we have to post-fix it at the query analysis stage. 3. Override Pivot.resolved as "false": for the reason mentioned in [2] and the fact that output attributes change after Pivot is replaced by Project or Aggregate, we avoid resolving parent references until after Pivot has been resolved and replaced. 4. Verify aggregate expressions: only aggregate expressions with or without aliases can appear in the first part of the Pivot clause, and this check is performed at the analysis stage. ## How was this patch tested? A new test suite PivotSuite is added. Author: maryannxue <maryann.xue@gmail.com> Closes #21187 from maryannxue/spark-24035.	2018-05-03 17:05:02 -07:00
Wenchen Fan	96a50016bb	[SPARK-24169][SQL] JsonToStructs should not access SQLConf at executor side ## What changes were proposed in this pull request? This PR is extracted from #21190 , to make it easier to backport. `JsonToStructs` can be serialized to executors and evaluated there, so we should not call `SQLConf.get.getConf(SQLConf.FROM_JSON_FORCE_NULLABLE_SCHEMA)` in the body. ## How was this patch tested? tested in #21190 Author: Wenchen Fan <wenchen@databricks.com> Closes #21226 from cloud-fan/minor4.	2018-05-03 23:36:09 +08:00
Wenchen Fan	991b526992	[SPARK-24166][SQL] InMemoryTableScanExec should not access SQLConf at executor side ## What changes were proposed in this pull request? This PR is extracted from https://github.com/apache/spark/pull/21190 , to make it easier to backport. `InMemoryTableScanExec#createAndDecompressColumn` is executed inside `rdd.map`, we can't access `conf.offHeapColumnVectorEnabled` there. ## How was this patch tested? it's tested in #21190 Author: Wenchen Fan <wenchen@databricks.com> Closes #21223 from cloud-fan/minor1.	2018-05-03 19:56:30 +08:00
Wenchen Fan	417ad92502	[SPARK-23715][SQL] the input of to/from_utc_timestamp can not have timezone ## What changes were proposed in this pull request? `from_utc_timestamp` assumes its input is in the UTC timezone and shifts it to the specified timezone. When the timestamp contains a timezone (e.g. `2018-03-13T06:18:23+00:00`), Spark breaks the semantics and respects the timezone in the string. This is not what the user expects, and the result is different from Hive/Impala. `to_utc_timestamp` has the same problem. For more details, please refer to the JIRA ticket. This PR fixes this by returning null if the input timestamp contains a timezone. ## How was this patch tested? new tests Author: Wenchen Fan <wenchen@databricks.com> Closes #21169 from cloud-fan/from_utc_timezone.	2018-05-03 19:27:01 +08:00
Dongjoon Hyun	c9bfd1c6f8	[SPARK-23489][SQL][TEST] HiveExternalCatalogVersionsSuite should verify the downloaded file ## What changes were proposed in this pull request? Although [SPARK-22654](https://issues.apache.org/jira/browse/SPARK-22654) made `HiveExternalCatalogVersionsSuite` download from Apache mirrors three times, it has been flaky because it didn't verify the downloaded file. Some Apache mirrors terminate the downloading abnormally, the corrupted file shows the following errors. ``` gzip: stdin: not in gzip format tar: Child returned status 1 tar: Error is not recoverable: exiting now 22:46:32.700 WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite: ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.hive.HiveExternalCatalogVersionsSuite, thread names: Keep-Alive-Timer ===== * RUN ABORTED * java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "/tmp/test-spark/spark-2.2.0"): error=2, No such file or directory ``` This has been reported weirdly in two ways. For example, the above case is reported as Case 2 `no failures`. - Case 1. [Test Result (1 failure / +1)](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4389/) - Case 2. [Test Result (no failures)](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/4811/) This PR aims to make `HiveExternalCatalogVersionsSuite` more robust by verifying the downloaded `tgz` file by extracting and checking the existence of `bin/spark-submit`. If it turns out that the file is empty or corrupted, `HiveExternalCatalogVersionsSuite` will do retry logic like the download failure. ## How was this patch tested? Pass the Jenkins. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #21210 from dongjoon-hyun/SPARK-23489.	2018-05-03 15:15:05 +08:00
jerryshao	bf4352ca6c	[SPARK-24110][THRIFT-SERVER] Avoid UGI.loginUserFromKeytab in STS ## What changes were proposed in this pull request? Spark ThriftServer calls UGI.loginUserFromKeytab twice during initialization. This is unnecessary and can cause various problems, such as Hadoop IPC failure after 7 days or RM failover issues. So here we remove all the unnecessary login logic and make sure the UGI in the context is never created again. Note that this is actually an HS2 issue. If we later upgrade the supported Hive version, the issue may already be fixed on the Hive side. ## How was this patch tested? Local verification in secure cluster. Author: jerryshao <sshao@hortonworks.com> Closes #21178 from jerryshao/SPARK-24110.	2018-05-03 09:28:14 +08:00
Takeshi Yamamuro	e4c91c089a	[SPARK-24111][SQL] Add the TPCDS v2.7 (latest) queries in TPCDSQueryBenchmark ## What changes were proposed in this pull request? This pr added the TPCDS v2.7 (latest) queries in `TPCDSQueryBenchmark`. These query files have been added in `SPARK-23167`. ## How was this patch tested? Manually checked. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #21177 from maropu/AddTpcdsV2_7InBenchmark.	2018-05-02 16:12:21 -07:00
Kazuaki Ishizaki	5be8aab144	[SPARK-23923][SQL] Add cardinality function ## What changes were proposed in this pull request? The PR adds the SQL function `cardinality`. The behavior of the function is based on Presto's one. The function returns the length of the array or map stored in the column as `int` while the Presto version returns the value as `BigInt` (`long` in Spark). The discussions regarding the difference of return type are [here](https://github.com/apache/spark/pull/21031#issuecomment-381284638) and [there](https://github.com/apache/spark/pull/21031#discussion_r181622107). ## How was this patch tested? Added UTs Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #21031 from kiszk/SPARK-23923.	2018-05-02 13:53:10 -07:00
Marco Gaido	504c9cfd21	[SPARK-24123][SQL] Fix precision issues in monthsBetween with more than 8 digits ## What changes were proposed in this pull request? SPARK-23902 introduced the ability to retrieve more than 8 digits in `monthsBetween`. Unfortunately, current implementation can cause precision loss in such a case. This was causing also a flaky UT. This PR mirrors Hive's implementation in order to avoid precision loss also when more than 8 digits are returned. ## How was this patch tested? running 10000000 times the flaky UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #21196 from mgaido91/SPARK-24123.	2018-05-02 13:49:15 -07:00
Ala Luszczak	8bd27025b7	[SPARK-24133][SQL] Check for integer overflows when resizing WritableColumnVectors ## What changes were proposed in this pull request? `ColumnVector`s store string data in one big byte array. Since the array size is capped at just under Integer.MAX_VALUE, a single `ColumnVector` cannot store more than 2GB of string data. But since Parquet files commonly contain large blobs stored as strings, and `ColumnVector`s by default carry 4096 values, it's entirely possible to go past that limit. In such cases a negative capacity is requested from `WritableColumnVector.reserve()`. The call succeeds (the requested capacity is smaller than the already allocated capacity), and consequently `java.lang.ArrayIndexOutOfBoundsException` is thrown when the reader actually attempts to put the data into the array. This change introduces a simple check for integer overflow to `WritableColumnVector.reserve()`, which should help catch the error earlier and provide a more informative exception. Additionally, the error message in `WritableColumnVector.throwUnsupportedException()` was corrected, as it previously encouraged users to increase rather than reduce the batch size. ## How was this patch tested? New unit tests were added. Author: Ala Luszczak <ala@databricks.com> Closes #21206 from ala/overflow-reserve.	2018-05-02 12:43:19 -07:00
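The failure mode in SPARK-24133 comes from 32-bit signed arithmetic wrapping around. The sketch below simulates that wraparound in Python, whose integers do not overflow on their own. `to_int32`, `reserve_unchecked`, and `reserve_checked` are hypothetical helpers written for illustration, not the real `WritableColumnVector` methods.

```python
def to_int32(x):
    # Simulate Java's 32-bit signed int wraparound.
    x &= 0xFFFFFFFF
    return x - 2**32 if x >= 2**31 else x

def reserve_unchecked(capacity, required):
    # Models the old behaviour: a negative request looks "already satisfied".
    return capacity if required <= capacity else required

def reserve_checked(capacity, required):
    # Models the added guard: reject a negative (overflowed) request up front.
    if required < 0:
        raise RuntimeError("requested capacity overflowed a 32-bit int")
    return capacity if required <= capacity else required

capacity = 2_000_000_000                           # bytes already allocated
required = to_int32(capacity + 500_000_000)        # wraps to a negative number

print(required)
print(reserve_unchecked(capacity, required))       # silently keeps the old capacity
try:
    reserve_checked(capacity, required)
except RuntimeError as e:
    print("rejected:", e)
```

The unchecked version returns without growing the buffer, so the error only surfaces later when data is written; the checked version fails at the point of the bad request.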
Marco Gaido	8dbf56c055	[SPARK-24013][SQL] Remove unneeded compress in ApproximatePercentile ## What changes were proposed in this pull request? `ApproximatePercentile` contains a workaround logic to compress the samples since at the beginning `QuantileSummaries` was ignoring the compression threshold. This problem was fixed in SPARK-17439, but the workaround logic was not removed. So we are compressing the samples many more times than needed: this could lead to critical performance degradation. This can create serious performance issues in queries like: ``` select approx_percentile(id, array(0.1)) from range(10000000) ``` ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #21133 from mgaido91/SPARK-24013.	2018-05-02 11:58:55 -07:00
wangyanlin01	7bbec0dced	[SPARK-24061][SS] Add TypedFilter support for continuous processing ## What changes were proposed in this pull request? Add TypedFilter support for continuous processing application. ## How was this patch tested? unit tests Author: wangyanlin01 <wangyanlin01@baidu.com> Closes #21136 from yanlin-Lynn/SPARK-24061.	2018-05-01 16:22:52 +08:00
Wenchen Fan	b42ad165bb	[SPARK-24072][SQL] clearly define pushed filters ## What changes were proposed in this pull request? filters like parquet row group filter, which is actually pushed to the data source but still to be evaluated by Spark, should also count as `pushedFilters`. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #21143 from cloud-fan/step1.	2018-04-30 09:13:32 -07:00
Maxim Gekk	3121b411f7	[SPARK-23846][SQL] The samplingRatio option for CSV datasource ## What changes were proposed in this pull request? I propose to support the `samplingRatio` option for schema inferring of CSV datasource similar to the same option of JSON datasource: `b14993e1fc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala (L49-L50)` ## How was this patch tested? Added 2 tests for json and 2 tests for csv datasources. The tests check that only a subset of the input dataset is used for schema inferring. Author: Maxim Gekk <maxim.gekk@databricks.com> Author: Maxim Gekk <max.gekk@gmail.com> Closes #20959 from MaxGekk/csv-sampling.	2018-04-30 09:45:22 +08:00
Maxim Gekk	bd14da6fd5	[SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support custom encoding for json files ## What changes were proposed in this pull request? I propose a new option for the JSON datasource which allows specifying the encoding (charset) of input and output files. Here is an example of using the option: ``` spark.read.schema(schema) .option("multiline", "true") .option("encoding", "UTF-16LE") .json(fileName) ``` If the option is not specified, the charset auto-detection mechanism is used by default. The option can also be used for saving datasets to JSON files. Currently Spark is able to save datasets into JSON files in the `UTF-8` charset only. The changes allow saving data in any supported charset. Here is the approximate list of charsets supported by Oracle Java SE: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html . A user can specify the charset of output JSON files via the charset option like `.option("charset", "UTF-16BE")`. By default the output charset is still `UTF-8` to keep backward compatibility. The solution has the following restrictions for per-line mode (`multiline = false`): - If the charset is different from UTF-8, the lineSep option must be specified. The option is required because Hadoop LineReader cannot detect the line separator correctly. Here is the ticket for solving the issue: https://issues.apache.org/jira/browse/SPARK-23725 - Encodings with [BOM](https://en.wikipedia.org/wiki/Byte_order_mark) are not supported. For example, the `UTF-16` and `UTF-32` encodings are blacklisted. The problem can be solved by https://github.com/MaxGekk/spark-1/pull/2 ## How was this patch tested? 
I added the following tests: - reading a json file in `UTF-16LE` encoding with BOM in `multiline` mode - reading a json file by using charset auto detection (`UTF-32BE` with BOM) - reading a json file using the user's charset (`UTF-16LE`) - saving in `UTF-32BE` and reading the result by standard library (not by Spark) - checking that default charset is `UTF-8` - handling wrong (unsupported) charset Author: Maxim Gekk <maxim.gekk@databricks.com> Author: Maxim Gekk <max.gekk@gmail.com> Closes #20937 from MaxGekk/json-encoding-line-sep.	2018-04-29 11:25:31 +08:00
Yuming Wang	4df51361a5	[SPARK-22732][SS][FOLLOW-UP] Fix MemorySinkV2 toString error ## What changes were proposed in this pull request? Fix `MemorySinkV2` toString() error ## How was this patch tested? N/A Author: Yuming Wang <yumwang@ebay.com> Closes #21170 from wangyum/SPARK-22732.	2018-04-28 16:57:41 +08:00
Marco Gaido	ad94e8592b	[SPARK-23736][SQL][FOLLOWUP] Error message should contains SQL types ## What changes were proposed in this pull request? In the error messages we should return the SQL types (like `string` rather than the internal types like `StringType`). ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #21181 from mgaido91/SPARK-23736_followup.	2018-04-28 10:47:43 +08:00
Jungtaek Lim	1fb46f30f8	[SPARK-23688][SS] Refactor tests away from rate source ## What changes were proposed in this pull request? Replace rate source with memory source in continuous mode test suite. Keep using "rate" source if the tests intend to put data periodically in background, or need to put short source name to load, since "memory" doesn't have provider for source. ## How was this patch tested? Ran relevant test suite from IDE. Author: Jungtaek Lim <kabhwan@gmail.com> Closes #21152 from HeartSaVioR/SPARK-23688.	2018-04-28 09:55:56 +08:00
Juliusz Sompolski	8614edd445	[SPARK-24104] SQLAppStatusListener overwrites metrics onDriverAccumUpdates instead of updating them ## What changes were proposed in this pull request? Event `SparkListenerDriverAccumUpdates` may happen multiple times in a query - e.g. every `FileSourceScanExec` and `BroadcastExchangeExec` call `postDriverMetricUpdates`. In Spark 2.2 `SQLListener` updated the map with new values. `SQLAppStatusListener` overwrites it. Unless `update` preserved it in the KV store (dependent on `exec.lastWriteTime`), only the metrics from the last operator that does `postDriverMetricUpdates` are preserved. ## How was this patch tested? Unit test added. Author: Juliusz Sompolski <julek@databricks.com> Closes #21171 from juliuszsompolski/SPARK-24104.	2018-04-27 14:14:28 -07:00
Dilip Biswal	3fd297af6d	[SPARK-24085][SQL] Query returns UnsupportedOperationException when scalar subquery is present in partitioning expression ## What changes were proposed in this pull request? In this case, the partition pruning happens before the planning phase of scalar subquery expressions. For scalar subquery expressions, the planning occurs late in the cycle (after the physical planning) in "PlanSubqueries" just before execution. Currently we try to execute the scalar subquery expression as part of partition pruning and fail as it implements Unevaluable. The fix ignores the Subquery expressions in the partition pruning computation. Another option would be to somehow plan the subqueries before the partition pruning. Since this may not be a commonly occurring expression, I am opting for a simpler fix. Repro ``` SQL CREATE TABLE test_prc_bug ( id_value string ) partitioned by (id_type string) location '/tmp/test_prc_bug' stored as parquet; insert into test_prc_bug values ('1','a'); insert into test_prc_bug values ('2','a'); insert into test_prc_bug values ('3','b'); insert into test_prc_bug values ('4','b'); select * from test_prc_bug where id_type = (select 'b'); ``` ## How was this patch tested? Added test in SubquerySuite and hive/SQLQuerySuite Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #21174 from dilipbiswal/spark-24085.	2018-04-27 11:43:29 -07:00
Patrick McGloin	2824f12b8b	[SPARK-23565][SS] New error message for structured streaming sources assertion ## What changes were proposed in this pull request? A more informative message to tell you why a structured streaming query cannot continue if you have added more sources than there are in the existing checkpoint offsets. ## How was this patch tested? I added a Unit Test. Author: Patrick McGloin <mcgloin.patrick@gmail.com> Closes #20946 from patrickmcgloin/master.	2018-04-27 23:04:14 +08:00
Dongjoon Hyun	8aa1d7b0ed	[SPARK-23355][SQL] convertMetastore should not ignore table properties ## What changes were proposed in this pull request? Previously, SPARK-22158 fixed for `USING hive` syntax. This PR aims to fix for `STORED AS` syntax. Although the test case covers ORC part, the patch considers both `convertMetastoreOrc` and `convertMetastoreParquet`. ## How was this patch tested? Pass newly added test cases. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #20522 from dongjoon-hyun/SPARK-22158-2.	2018-04-27 11:00:41 +08:00
gatorsmile	ce2f919f8d	[SPARK-23799][SQL][FOLLOW-UP] FilterEstimation.evaluateInSet produces wrong stats for STRING ## What changes were proposed in this pull request? `colStat.min` and `colStat.max` are empty for string type. Thus, `evaluateInSet` should not return zero when either `colStat.min` or `colStat.max` is empty. ## How was this patch tested? Added a test case. Author: gatorsmile <gatorsmile@gmail.com> Closes #21147 from gatorsmile/cached.	2018-04-26 19:07:13 +08:00
Tathagata Das	d1eb8d3ddc	[SPARK-24094][SS][MINOR] Change description strings of v2 streaming sources to reflect the change ## What changes were proposed in this pull request? This makes it easy to understand at runtime which version is running. Great for debugging production issues. ## How was this patch tested? Not necessary. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #21160 from tdas/SPARK-24094.	2018-04-25 23:24:05 -07:00
Marco Gaido	cd10f9df82	[SPARK-23916][SQL] Add array_join function ## What changes were proposed in this pull request? The PR adds the SQL function `array_join`. The behavior of the function is based on Presto's one. The function accepts an `array` of `string` which is to be joined, a `string` which is the delimiter to use between the items of the first argument and optionally a `string` which is used to replace `null` values. ## How was this patch tested? added UTs Author: Marco Gaido <marcogaido91@gmail.com> Closes #21011 from mgaido91/SPARK-23916.	2018-04-26 13:37:13 +09:00
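The three-argument shape of `array_join` can be sketched in plain Python. One point here is an assumption not stated in the commit message: when no replacement string is given, the sketch drops `null` elements rather than failing on them. The function is a hypothetical model, not Spark code.

```python
def array_join(items, delimiter, null_replacement=None):
    if null_replacement is None:
        # Assumption: without a replacement, null elements are skipped.
        kept = [s for s in items if s is not None]
    else:
        # With a replacement, every null becomes that string.
        kept = [null_replacement if s is None else s for s in items]
    return delimiter.join(kept)

print(array_join(["a", None, "b"], ","))
print(array_join(["a", None, "b"], ",", "-"))
```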
Marco Gaido	58c55cb4a6	[SPARK-23902][SQL] Add roundOff flag to months_between ## What changes were proposed in this pull request? HIVE-15511 introduced the `roundOff` flag in order to disable the rounding to 8 digits which is performed in `months_between`. Since this can be a computationally intensive operation, skipping it may improve performance when the rounding is not needed. ## How was this patch tested? modified existing UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #21008 from mgaido91/SPARK-23902.	2018-04-26 12:19:20 +09:00
Maxim Gekk	3f1e999d3d	[SPARK-23849][SQL] Tests for samplingRatio of json datasource ## What changes were proposed in this pull request? Added the `samplingRatio` option to the `json()` method of PySpark DataFrame Reader. Improving existing tests for Scala API according to review of the PR: https://github.com/apache/spark/pull/20959 ## How was this patch tested? Added new test for PySpark, updated 2 existing tests according to reviews of https://github.com/apache/spark/pull/20959 and added new negative test Author: Maxim Gekk <maxim.gekk@databricks.com> Closes #21056 from MaxGekk/json-sampling.	2018-04-26 09:14:24 +08:00
Wenchen Fan	ac4ca7c4dd	[SPARK-24012][SQL][TEST][FOLLOWUP] add unit test ## What changes were proposed in this pull request? a followup of https://github.com/apache/spark/pull/21100 ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #21154 from cloud-fan/test.	2018-04-25 13:42:44 -07:00
Tathagata Das	396938ef02	[SPARK-24050][SS] Calculate input / processing rates correctly for DataSourceV2 streaming sources ## What changes were proposed in this pull request? In some streaming queries, the input and processing rates are not calculated at all (they show up as zero) because MicroBatchExecution fails to associate metrics from the executed plan of a trigger with the sources in the logical plan of the trigger. The way this executed-plan-leaf-to-logical-source attribution works is as follows. With V1 sources, there was no way to identify which execution plan leaves were generated by a streaming source. So we made a best-effort attempt to match logical and execution plan leaves when the number of leaves was the same. In cases where the number of leaves is different, we just give up and report zero rates. An example where this may happen is as follows. ``` val cachedStaticDF = someStaticDF.union(anotherStaticDF).cache() val streamingInputDF = ... val query = streamingInputDF.join(cachedStaticDF).writeStream.... ``` In this case, the `cachedStaticDF` has multiple logical leaves, but in the trigger's execution plan it has only one leaf because a cached subplan is represented as a single InMemoryTableScanExec leaf. This leads to a mismatch in the number of leaves, causing the input rates to be computed as zero. With DataSourceV2, all inputs are represented in the executed plan using `DataSourceV2ScanExec`, each of which has a reference to the associated logical `DataSource` and `DataSourceReader`. So it's easy to associate the metrics with the original streaming sources. In this PR, the solution is as follows. If all the streaming sources in a streaming query are v2 sources, then use a new code path where the execution-metrics-to-source mapping is done directly. Otherwise we fall back to the existing mapping logic. ## How was this patch tested? 
- New unit tests using V2 memory source - Existing unit tests using V1 source Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #21126 from tdas/SPARK-24050.	2018-04-25 12:21:55 -07:00
Takeshi Yamamuro	20ca208bcd	[SPARK-23880][SQL] Do not trigger any jobs for caching data ## What changes were proposed in this pull request? This pr fixed the code so that `cache` does not trigger any jobs. For example, in the current master, the operation below triggers an actual job: ``` val df = spark.range(10000000000L) .filter('id > 1000) .orderBy('id.desc) .cache() ``` This triggers a job while the cache should be lazy. The problem is that, when creating `InMemoryRelation`, we build the RDD, which calls `SparkPlan.execute` and may trigger jobs, like a sampling job for the range partitioner, or a broadcast job. This pr removed the code to build a cached `RDD` in the constructor of `InMemoryRelation` and added `CachedRDDBuilder` to lazily build the `RDD` in `InMemoryRelation`. Then, the first call of `CachedRDDBuilder.cachedColumnBuffers` triggers a job to materialize the cache in `InMemoryTableScanExec`. ## How was this patch tested? Added tests in `CachedTableSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #21018 from maropu/SPARK-23880.	2018-04-25 19:06:18 +08:00
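The idea of deferring the expensive build until first access can be illustrated with a plain Scala `lazy val`. This is a minimal sketch, not Spark's `CachedRDDBuilder`: the `CachedBuilder` class and the `build` callback are hypothetical, and the counter merely stands in for "a job was triggered".

```scala
object LazyCacheSketch {
  // Hypothetical analogue of a lazily built cache: `build` stands for the expensive job.
  final class CachedBuilder[A](build: () => Seq[A]) {
    // Evaluated on first access only, never in the constructor.
    lazy val cachedBuffers: Seq[A] = build()
  }

  // Returns (jobs run right after construction, jobs run after two reads).
  def demo(): (Int, Int) = {
    var jobsRun = 0
    val builder = new CachedBuilder[Int](() => { jobsRun += 1; Seq(1, 2, 3) })
    val afterConstruction = jobsRun
    val firstRead = builder.cachedBuffers
    val secondRead = builder.cachedBuffers
    require(firstRead == secondRead)
    (afterConstruction, jobsRun)
  }
}
```

Constructing the builder runs nothing; the first read runs the build once and later reads reuse the result.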
liutang123	64e8408e6f	[SPARK-24012][SQL] Union of map and other compatible column ## What changes were proposed in this pull request? A union of a map column and other compatible columns results in an `unresolved operator 'Union;` exception. Reproduction: `spark-sql>select map(1,2), 'str' union all select map(1,2,3,null), 1` Output: ``` Error in query: unresolved operator 'Union;; 'Union :- Project [map(1, 2) AS map(1, 2)#106, str AS str#107] : +- OneRowRelation$ +- Project [map(1, cast(2 as int), 3, cast(null as int)) AS map(1, CAST(2 AS INT), 3, CAST(NULL AS INT))#109, 1 AS 1#108] +- OneRowRelation$ ``` So, we should cast some of the columns to make them compatible when appropriate. ## How was this patch tested? Added a test (query union of map and other columns) to SQLQueryTestSuite's union.sql. Author: liutang123 <liutang123@yeah.net> Closes #21100 from liutang123/SPARK-24012.	2018-04-25 18:10:51 +08:00
mn-mikke	5fea17b3be	[SPARK-23821][SQL] Collection function: flatten ## What changes were proposed in this pull request? This PR adds a new collection function that transforms an array of arrays into a single array. The PR comprises: - An expression for flattening array structure - Flatten function - A wrapper for PySpark ## How was this patch tested? New tests added into: - CollectionExpressionsSuite - DataFrameFunctionsSuite ## Codegen examples ### Primitive type ``` val df = Seq( Seq(Seq(1, 2), Seq(4, 5)), Seq(null, Seq(1)) ).toDF("i") df.filter($"i".isNotNull || $"i".isNull).select(flatten($"i")).debugCodegen ``` Result: ``` boolean inputadapter_isNull = inputadapter_row.isNullAt(0); ArrayData inputadapter_value = inputadapter_isNull ? null : (inputadapter_row.getArray(0)); boolean filter_value = true; if (!(!inputadapter_isNull)) { filter_value = inputadapter_isNull; } if (!filter_value) continue; ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1); boolean project_isNull = inputadapter_isNull; ArrayData project_value = null; if (!inputadapter_isNull) { for (int z = 0; !project_isNull && z < inputadapter_value.numElements(); z++) { project_isNull |= inputadapter_value.isNullAt(z); } if (!project_isNull) { long project_numElements = 0; for (int z = 0; z < inputadapter_value.numElements(); z++) { project_numElements += inputadapter_value.getArray(z).numElements(); } if (project_numElements > 2147483632) { throw new RuntimeException("Unsuccessful try to flatten an array of arrays with " + project_numElements + " elements due to exceeding the array size limit 2147483632."); } long project_size = UnsafeArrayData.calculateSizeOfUnderlyingByteArray( project_numElements, 4); if (project_size > 2147483632) { throw new RuntimeException("Unsuccessful try to flatten an array of arrays with " + project_size + " bytes of data due to exceeding the limit 2147483632" + " bytes for UnsafeArrayData."); } byte[] project_array = new byte[(int)project_size]; UnsafeArrayData project_tempArrayData = new UnsafeArrayData(); Platform.putLong(project_array, 16, project_numElements); project_tempArrayData.pointTo(project_array, 16, (int)project_size); int project_counter = 0; for (int k = 0; k < inputadapter_value.numElements(); k++) { ArrayData arr = inputadapter_value.getArray(k); for (int l = 0; l < arr.numElements(); l++) { if (arr.isNullAt(l)) { project_tempArrayData.setNullAt(project_counter); } else { project_tempArrayData.setInt( project_counter, arr.getInt(l) ); } project_counter++; } } project_value = project_tempArrayData; } } ``` ### Non-primitive type ``` val df = Seq( Seq(Seq("a", "b"), Seq(null, "d")), Seq(null, Seq("a")) ).toDF("s") df.filter($"s".isNotNull || $"s".isNull).select(flatten($"s")).debugCodegen ``` Result: ``` boolean inputadapter_isNull = inputadapter_row.isNullAt(0); ArrayData inputadapter_value = inputadapter_isNull ? null : (inputadapter_row.getArray(0)); boolean filter_value = true; if (!(!inputadapter_isNull)) { filter_value = inputadapter_isNull; } if (!filter_value) continue; ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1); boolean project_isNull = inputadapter_isNull; ArrayData project_value = null; if (!inputadapter_isNull) { for (int z = 0; !project_isNull && z < inputadapter_value.numElements(); z++) { project_isNull |= inputadapter_value.isNullAt(z); } if (!project_isNull) { long project_numElements = 0; for (int z = 0; z < inputadapter_value.numElements(); z++) { project_numElements += inputadapter_value.getArray(z).numElements(); } if (project_numElements > 2147483632) { throw new RuntimeException("Unsuccessful try to flatten an array of arrays with " + project_numElements + " elements due to exceeding the array size limit 2147483632."); } Object[] project_arrayObject = new Object[(int)project_numElements]; int project_counter = 0; for (int k = 0; k < inputadapter_value.numElements(); k++) { ArrayData arr = inputadapter_value.getArray(k); for (int l = 0; l < arr.numElements(); l++) { project_arrayObject[project_counter] = arr.getUTF8String(l); project_counter++; } } project_value = new org.apache.spark.sql.catalyst.util.GenericArrayData(project_arrayObject); } } ``` Author: mn-mikke <mrkAha12346github> Closes #20938 from mn-mikke/feature/array-api-flatten-to-master.	2018-04-25 11:19:08 +09:00
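The null handling visible in the generated code for `flatten` (the result is null when the outer array or any inner array is null) can be sketched in plain Scala. `FlattenSketch` is a hypothetical name and `None` models a SQL null; this is an illustration of the semantics, not the Catalyst expression.

```scala
object FlattenSketch {
  // None models SQL null. The result is null if the outer array
  // or any of the inner arrays is null, mirroring the generated null checks.
  def flatten[A](arrays: Option[Seq[Option[Seq[A]]]]): Option[Seq[A]] =
    arrays.flatMap { outer =>
      if (outer.exists(_.isEmpty)) None
      else Some(outer.flatMap(_.getOrElse(Nil)))
    }
}
```

With the first test row from the commit, `Seq(Seq(1, 2), Seq(4, 5))`, the sketch yields the four elements in order; with `Seq(null, Seq(1))` it yields null.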
Jose Torres	d6c26d1c9a	[SPARK-24038][SS] Refactor continuous writing to its own class ## What changes were proposed in this pull request? Refactor continuous writing to its own class. See WIP https://github.com/jose-torres/spark/pull/13 for the overall direction this is going, but I think this PR is very isolated and necessary anyway. ## How was this patch tested? existing unit tests - refactoring only Author: Jose Torres <torres.joseph.f+github@gmail.com> Closes #21116 from jose-torres/SPARK-24038.	2018-04-24 17:06:03 -07:00
Takeshi Yamamuro	4926a7c2f0	[SPARK-23589][SQL][FOLLOW-UP] Reuse InternalRow in ExternalMapToCatalyst eval ## What changes were proposed in this pull request? This pr is a follow-up of #20980 and fixes code to reuse `InternalRow` for converting input keys/values in `ExternalMapToCatalyst` eval. ## How was this patch tested? Existing tests. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #21137 from maropu/SPARK-23589-FOLLOWUP.	2018-04-24 17:52:05 +02:00
seancxmao	c303b1b676	[MINOR][DOCS] Fix comments of SQLExecution#withExecutionId ## What changes were proposed in this pull request? Fix comment. Change `BroadcastHashJoin.broadcastFuture` to `BroadcastExchangeExec.relationFuture`: `d28d5732ae/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala (L66)` ## How was this patch tested? N/A Author: seancxmao <seancxmao@gmail.com> Closes #21113 from seancxmao/SPARK-13136.	2018-04-24 16:16:07 +08:00
Marco Gaido	281c1ca0dc	[SPARK-23973][SQL] Remove consecutive Sorts ## What changes were proposed in this pull request? In SPARK-23375 we introduced the ability to remove a `Sort` operation during query optimization if the data is already sorted. In this follow-up we also remove a `Sort` which is followed by another `Sort`: in this case the first sort is not needed and can be safely removed. The PR starts from henryr's comment: https://github.com/apache/spark/pull/20560#discussion_r180601594. So credit should be given to him. ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #21072 from mgaido91/SPARK-23973.	2018-04-24 10:11:09 +08:00
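The consecutive-sort rule can be pictured on a toy plan tree. The sketch below is an assumption-laden illustration: `Plan`, `Scan`, `Filter`, `Sort`, and `optimize` are hypothetical types written for this example and are unrelated to Catalyst's actual classes or rule names.

```scala
object RemoveConsecutiveSorts {
  // Toy logical plan, written only for this illustration.
  sealed trait Plan
  final case class Scan(table: String) extends Plan
  final case class Filter(condition: String, child: Plan) extends Plan
  final case class Sort(keys: Seq[String], child: Plan) extends Plan

  // A Sort directly below another Sort is redundant: the upper one reorders everything anyway.
  def optimize(plan: Plan): Plan = plan match {
    case Sort(keys, Sort(_, grandChild)) => optimize(Sort(keys, grandChild))
    case Sort(keys, child)               => Sort(keys, optimize(child))
    case Filter(condition, child)        => Filter(condition, optimize(child))
    case scan: Scan                      => scan
  }
}
```

Only directly nested sorts are collapsed here; a sort separated from another by a different operator is kept by this sketch.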
Tathagata Das	770add81c3	[SPARK-23004][SS] Ensure StateStore.commit is called only once in a streaming aggregation task ## What changes were proposed in this pull request? A structured streaming query with a streaming aggregation can throw the following error in rare cases. ``` java.lang.IllegalStateException: Cannot commit after already committed or aborted at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$verify(HDFSBackedStateStoreProvider.scala:643) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:135) at org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3$$anon$2$$anonfun$hasNext$2.apply$mcV$sp(statefulOperators.scala:359) at org.apache.spark.sql.execution.streaming.StateStoreWriter$class.timeTakenMs(statefulOperators.scala:102) at org.apache.spark.sql.execution.streaming.StateStoreSaveExec.timeTakenMs(statefulOperators.scala:251) at org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3$$anon$2.hasNext(statefulOperators.scala:359) at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:188) at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.<init>(ObjectAggregationIterator.scala:78) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:114) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:105) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:830) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$apply$24.apply(RDD.scala:830) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:42) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:336) ``` This can happen when the following conditions are accidentally hit. - Streaming aggregation with aggregation function that is a subset of [`TypedImperativeAggregation`](`76b8b840dd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala (L473)`) (for example, `collect_set`, `collect_list`, `percentile`, etc.). - Query running in `update` mode - After the shuffle, a partition has exactly 128 records. This causes StateStore.commit to be called twice. See the [JIRA](https://issues.apache.org/jira/browse/SPARK-23004) for a more detailed explanation. The solution is to use `NextIterator` or `CompletionIterator`, each of which has a flag to prevent the "onCompletion" task from being called more than once. In this PR, I chose to implement using `NextIterator`. ## How was this patch tested? Added a unit test that I have confirmed will fail without the fix. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #21124 from tdas/SPARK-23004.	2018-04-23 13:20:32 -07:00
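The "flag that prevents the completion task from running twice" idea can be sketched with a small iterator wrapper. This is not Spark's `NextIterator` or `CompletionIterator`; `CompleteOnceIterator` is a hypothetical class, and the counter stands in for `StateStore.commit`.

```scala
object CommitOnceSketch {
  // Hypothetical wrapper: runs onCompletion at most once, however often hasNext is polled.
  final class CompleteOnceIterator[A](underlying: Iterator[A], onCompletion: () => Unit)
      extends Iterator[A] {
    private var completed = false

    override def hasNext: Boolean = {
      val more = underlying.hasNext
      if (!more && !completed) {
        completed = true // the flag guards against a second completion call
        onCompletion()
      }
      more
    }

    override def next(): A = underlying.next()
  }

  // Drains the iterator, polls hasNext two more times, and reports how often "commit" ran.
  def commitsAfterRepeatedHasNext(): Int = {
    var commits = 0
    val it = new CompleteOnceIterator[Int](Iterator(1, 2, 3), () => commits += 1)
    while (it.hasNext) it.next()
    it.hasNext
    it.hasNext
    commits
  }
}
```

Without the `completed` flag, each extra `hasNext` call after exhaustion would invoke the callback again, which is the shape of the double commit described above.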
Takeshi Yamamuro	afbdf42730	[SPARK-23589][SQL] ExternalMapToCatalyst should support interpreted execution ## What changes were proposed in this pull request? This pr supported interpreted mode for `ExternalMapToCatalyst`. ## How was this patch tested? Added tests in `ObjectExpressionsSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #20980 from maropu/SPARK-23589.	2018-04-23 14:28:28 +02:00
Wenchen Fan	d87d30e4fe	[SPARK-23564][SQL] infer additional filters from constraints for join's children ## What changes were proposed in this pull request? The existing query constraints framework has 2 steps: 1. propagate constraints bottom up. 2. use constraints to infer additional filters for better data pruning. For step 2, it mostly helps with Join, because we can connect the constraints from children to the join condition and infer powerful filters to prune the data of the join sides. e.g., the left side has constraints `a = 1`, the join condition is `left.a = right.a`, then we can infer `right.a = 1` to the right side and prune the right side a lot. However, the current logic of inferring filters from constraints for Join is pretty weak. It infers the filters from Join's constraints. Some joins like left semi/anti exclude output from right side and the right side constraints will be lost here. This PR proposes to check the left and right constraints individually, expand the constraints with the join condition and add filters to the children of the join directly, instead of adding to the join condition. This reverts https://github.com/apache/spark/pull/20670 , covers https://github.com/apache/spark/pull/20717 and https://github.com/apache/spark/pull/20816 This is inspired by the original PRs and the tests are all from these PRs. Thanks to the authors mgaido91 maryannxue KaiXinXiaoLei ! ## How was this patch tested? new tests Author: Wenchen Fan <wenchen@databricks.com> Closes #21083 from cloud-fan/join.	2018-04-23 20:21:01 +08:00
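The example in the message (`a = 1` on the left plus `left.a = right.a` gives `right.a = 1`) can be written as a tiny inference step. The case classes and `inferRightFilters` below are hypothetical and handle only equality with integer literals; the real optimizer works on arbitrary Catalyst expressions.

```scala
object InferJoinFilters {
  // Toy constraint and join-condition shapes, assumed for illustration only.
  final case class EqualsLiteral(column: String, value: Int)
  final case class JoinEquality(leftColumn: String, rightColumn: String)

  // For each left-side constraint on a join key, derive the same constraint on the right key.
  def inferRightFilters(
      leftConstraints: Seq[EqualsLiteral],
      condition: Seq[JoinEquality]): Seq[EqualsLiteral] =
    for {
      constraint <- leftConstraints
      equality <- condition
      if equality.leftColumn == constraint.column
    } yield EqualsLiteral(equality.rightColumn, constraint.value)
}
```

The derived filter would then be pushed to the right child of the join rather than added to the join condition, which is the change the PR describes.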
Wenchen Fan	f70f46d1e5	[SPARK-23877][SQL][FOLLOWUP] use PhysicalOperation to simplify the handling of Project and Filter over partitioned relation ## What changes were proposed in this pull request? A followup of https://github.com/apache/spark/pull/20988 `PhysicalOperation` can collect Project and Filters over a certain plan and substitute the alias with the original attributes in the bottom plan. We can use it in `OptimizeMetadataOnlyQuery` rule to handle the Project and Filter over partitioned relation. ## How was this patch tested? existing test Author: Wenchen Fan <wenchen@databricks.com> Closes #21111 from cloud-fan/refactor.	2018-04-23 20:18:50 +08:00
Mykhailo Shtelma	c48085aa91	[SPARK-23799][SQL] FilterEstimation.evaluateInSet produces division by zero in a case of empty table with analyzed statistics ## What changes were proposed in this pull request? During the evaluation of IN conditions, the FilterEstimation.evaluateInSet method produces NumberFormatException and ClassCastException if the source data frame is represented by a plan that uses a Hive table with previously analysed columns, and the plan has conditions on these fields that cannot be satisfied (which leads to an empty data frame). In order to fix this bug, FilterEstimation.evaluateInSet first checks that the distinct count is not zero and that colStat.min and colStat.max are defined, and only in this case proceeds with the calculation. If at least one of the conditions is not satisfied, zero is returned. ## How was this patch tested? In order to test the PR two tests were implemented: one in FilterEstimationSuite, which tests the plan with statistics that violate the conditions mentioned above, and another one in StatisticsCollectionSuite, which tests the whole process of analysis/optimisation of the query that leads to the problems mentioned in the first section. Author: Mykhailo Shtelma <mykhailo.shtelma@bearingpoint.com> Author: smikesh <mshtelma@gmail.com> Closes #21052 from mshtelma/filter_estimation_evaluateInSet_Bugs.	2018-04-21 23:33:57 -07:00
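The guard described above (return zero unless the distinct count is non-zero and both min and max are defined) can be sketched as follows. `ColumnStat` here is a hypothetical simplified type, and the selectivity formula (in-range values divided by distinct count, capped at 1) is an assumption made for the example, not a copy of `FilterEstimation`.

```scala
object InSetEstimateSketch {
  // Simplified, hypothetical column statistics for an integer column.
  final case class ColumnStat(distinctCount: BigInt, min: Option[Int], max: Option[Int])

  // Estimated fraction of rows matching `column IN (values)`.
  def inSetSelectivity(stat: ColumnStat, values: Set[Int]): Double =
    (stat.min, stat.max) match {
      // Proceed only when the stats are usable; this avoids dividing by a zero distinct count.
      case (Some(lo), Some(hi)) if stat.distinctCount > 0 =>
        val inRange = values.count(v => v >= lo && v <= hi)
        math.min(1.0, inRange.toDouble / stat.distinctCount.toDouble)
      // Empty table or missing min/max: report zero selectivity.
      case _ => 0.0
    }
}
```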
gatorsmile	7bc853d089	[SPARK-24033][SQL] Fix Mismatched of Window Frame specifiedwindowframe(RowFrame, -1, -1) ## What changes were proposed in this pull request? The OffsetWindowFunction's frame is `UnaryMinus(Literal(1))`, but the specified window frame may have been simplified to `Literal(-1)` by some optimizer rules, e.g., `ConstantFolding`. Thus, they do not match and cause the following error: ``` org.apache.spark.sql.AnalysisException: Window Frame specifiedwindowframe(RowFrame, -1, -1) must match the required frame specifiedwindowframe(RowFrame, -1, -1); at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) at ``` ## How was this patch tested? Added a test Author: gatorsmile <gatorsmile@gmail.com> Closes #21115 from gatorsmile/fixLag.	2018-04-21 10:45:12 -07:00
Marcelo Vanzin	1d758dc73b	Revert "[SPARK-23775][TEST] Make DataFrameRangeSuite not flaky" This reverts commit `0c94e48bc5`.	2018-04-20 10:23:01 -07:00
Takeshi Yamamuro	0dd97f6ea4	[SPARK-23595][SQL] ValidateExternalType should support interpreted execution ## What changes were proposed in this pull request? This pr supported interpreted mode for `ValidateExternalType`. ## How was this patch tested? Added tests in `ObjectExpressionsSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #20757 from maropu/SPARK-23595.	2018-04-20 15:02:27 +02:00
Takeshi Yamamuro	074a7f9053	[SPARK-23588][SQL][FOLLOW-UP] Resolve a map builder method per execution in CatalystToExternalMap ## What changes were proposed in this pull request? This pr is a follow-up pr of #20979 and fixes code to resolve a map builder method per execution instead of per row in `CatalystToExternalMap`. ## How was this patch tested? Existing tests. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #21112 from maropu/SPARK-23588-FOLLOWUP.	2018-04-20 14:43:47 +02:00
mn-mikke	e6b466084c	[SPARK-23736][SQL] Extending the concat function to support array columns ## What changes were proposed in this pull request? The PR adds logic for easy concatenation of multiple array columns and covers: - Concat expression has been extended to support array columns - A Python wrapper ## How was this patch tested? New tests added into: - CollectionExpressionsSuite - DataFrameFunctionsSuite - typeCoercion/native/concat.sql ## Codegen examples ### Primitive-type elements ``` val df = Seq( (Seq(1, 2), Seq(3, 4)), (Seq(1, 2, 3), null) ).toDF("a", "b") df.filter('a.isNotNull).select(concat('a, 'b)).debugCodegen() ``` Result: ``` boolean inputadapter_isNull = inputadapter_row.isNullAt(0); ArrayData inputadapter_value = inputadapter_isNull ? null : (inputadapter_row.getArray(0)); if (!(!inputadapter_isNull)) continue; ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1); ArrayData[] project_args = new ArrayData[2]; if (!false) { project_args[0] = inputadapter_value; } boolean inputadapter_isNull1 = inputadapter_row.isNullAt(1); ArrayData inputadapter_value1 = inputadapter_isNull1 ? null : (inputadapter_row.getArray(1)); if (!inputadapter_isNull1) { project_args[1] = inputadapter_value1; } ArrayData project_value = new Object() { public ArrayData concat(ArrayData[] args) { for (int z = 0; z < 2; z++) { if (args[z] == null) return null; } long project_numElements = 0L; for (int z = 0; z < 2; z++) { project_numElements += args[z].numElements(); } if (project_numElements > 2147483632) { throw new RuntimeException("Unsuccessful try to concat arrays with " + project_numElements + " elements due to exceeding the array size limit 2147483632."); } long project_size = UnsafeArrayData.calculateSizeOfUnderlyingByteArray( project_numElements, 4); if (project_size > 2147483632) { throw new RuntimeException("Unsuccessful try to concat arrays with " + project_size + " bytes of data due to exceeding the limit 2147483632 bytes" + " for UnsafeArrayData."); } byte[] project_array = new byte[(int)project_size]; UnsafeArrayData project_arrayData = new UnsafeArrayData(); Platform.putLong(project_array, 16, project_numElements); project_arrayData.pointTo(project_array, 16, (int)project_size); int project_counter = 0; for (int y = 0; y < 2; y++) { for (int z = 0; z < args[y].numElements(); z++) { if (args[y].isNullAt(z)) { project_arrayData.setNullAt(project_counter); } else { project_arrayData.setInt( project_counter, args[y].getInt(z) ); } project_counter++; } } return project_arrayData; } }.concat(project_args); boolean project_isNull = project_value == null; ``` ### Non-primitive-type elements ``` val df = Seq( (Seq("aa", "bb"), Seq("ccc", "ddd")), (Seq("x", "y"), null) ).toDF("a", "b") df.filter('a.isNotNull).select(concat('a, 'b)).debugCodegen() ``` Result: ``` boolean inputadapter_isNull = inputadapter_row.isNullAt(0); ArrayData inputadapter_value = inputadapter_isNull ? null : (inputadapter_row.getArray(0)); if (!(!inputadapter_isNull)) continue; ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1); ArrayData[] project_args = new ArrayData[2]; if (!false) { project_args[0] = inputadapter_value; } boolean inputadapter_isNull1 = inputadapter_row.isNullAt(1); ArrayData inputadapter_value1 = inputadapter_isNull1 ? null : (inputadapter_row.getArray(1)); if (!inputadapter_isNull1) { project_args[1] = inputadapter_value1; } ArrayData project_value = new Object() { public ArrayData concat(ArrayData[] args) { for (int z = 0; z < 2; z++) { if (args[z] == null) return null; } long project_numElements = 0L; for (int z = 0; z < 2; z++) { project_numElements += args[z].numElements(); } if (project_numElements > 2147483632) { throw new RuntimeException("Unsuccessful try to concat arrays with " + project_numElements + " elements due to exceeding the array size limit 2147483632."); } Object[] project_arrayObjects = new Object[(int)project_numElements]; int project_counter = 0; for (int y = 0; y < 2; y++) { for (int z = 0; z < args[y].numElements(); z++) { project_arrayObjects[project_counter] = args[y].getUTF8String(z); project_counter++; } } return new org.apache.spark.sql.catalyst.util.GenericArrayData(project_arrayObjects); } }.concat(project_args); boolean project_isNull = project_value == null; ``` Author: mn-mikke <mrkAha12346github> Closes #20858 from mn-mikke/feature/array-api-concat_arrays-to-master.	2018-04-20 14:58:11 +09:00
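The array `concat` semantics shown by the generated code (any null argument makes the result null, otherwise elements are appended in argument order) can be sketched in plain Scala. `ConcatSketch` is a hypothetical name and `None` models a SQL null array.

```scala
object ConcatSketch {
  // None models a SQL null array. Any null argument makes the whole result null,
  // matching the `if (args[z] == null) return null` check in the generated code.
  def concat[A](args: Option[Seq[A]]*): Option[Seq[A]] =
    if (args.exists(_.isEmpty)) None
    else Some(args.flatMap(_.getOrElse(Nil)))
}
```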
Ryan Blue	b3fde5a41e	[SPARK-23877][SQL] Use filter predicates to prune partitions in metadata-only queries ## What changes were proposed in this pull request? This updates the OptimizeMetadataOnlyQuery rule to use filter expressions when listing partitions, if there are filter nodes in the logical plan. This avoids listing all partitions for large tables on the driver. This also fixes a minor bug where the partitions returned from fsRelation cannot be serialized without hitting a stack level too deep error. This is caused by serializing a stream to executors, where the stream is a recursive structure. If the stream is too long, the serialization stack reaches the maximum level of depth. The fix is to create a LocalRelation using an Array instead of the incoming Seq. ## How was this patch tested? Existing tests for metadata-only queries. Author: Ryan Blue <blue@apache.org> Closes #20988 from rdblue/SPARK-23877-metadata-only-push-filters.	2018-04-20 12:06:41 +08:00
“attilapiros”	9ea8d3d31b	[SPARK-22362][SQL] Add unit test for Window Aggregate Functions ## What changes were proposed in this pull request? Improving the test coverage of window functions focusing on missing test for window aggregate functions. No new UDAF test is added as it has been tested already. ## How was this patch tested? Only new tests were added, automated tests were executed. Author: “attilapiros” <piros.attila.zsolt@gmail.com> Author: Attila Zsolt Piros <2017933+attilapiros@users.noreply.github.com> Closes #20046 from attilapiros/SPARK-22362.	2018-04-19 18:55:59 +02:00
Wenchen Fan	6e19f7683f	[SPARK-23989][SQL] exchange should copy data before non-serialized shuffle ## What changes were proposed in this pull request? In Spark SQL, we usually reuse the `UnsafeRow` instance and need to copy the data when a place buffers non-serialized objects. Shuffle may buffer objects if we don't make it to the bypass merge shuffle or unsafe shuffle. `ShuffleExchangeExec.needToCopyObjectsBeforeShuffle` misses the case that, if `spark.sql.shuffle.partitions` is large enough, we could fail to run unsafe shuffle and go with the non-serialized shuffle. This bug is very hard to hit since users wouldn't set such a large number of partitions(16 million) for Spark SQL exchange. TODO: test ## How was this patch tested? todo. Author: Wenchen Fan <wenchen@databricks.com> Closes #21101 from cloud-fan/shuffle.	2018-04-19 17:54:53 +02:00
Xingbo Jiang	d96c3e33cc	[SPARK-21811][SQL] Fix the inconsistent behavior when finding the widest common type ## What changes were proposed in this pull request? Currently we find the wider common type by comparing the two types from left to right. This can be a problem when you have two data types which don't have a common type but each can be promoted to StringType. For instance, if you have a table with the schema: [c1: date, c2: string, c3: int] The following succeeds: SELECT coalesce(c1, c2, c3) FROM table While the following produces an exception: SELECT coalesce(c1, c3, c2) FROM table This is only an issue when the seq of dataTypes contains `StringType` and all the types can do string promotion. close #19033 ## How was this patch tested? Add test in `TypeCoercionSuite` Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #21074 from jiangxb1987/typeCoercion.	2018-04-19 21:21:22 +08:00
jinxing	9e10f69df5	[SPARK-22676][FOLLOW-UP] fix code style for test. ## What changes were proposed in this pull request? This pr addresses comments in https://github.com/apache/spark/pull/19868 ; Fix the code style for `org.apache.spark.sql.hive.QueryPartitionSuite` by using: `withTempView`, `withTempDir`, `withTable`... Author: jinxing <jinxing6042@126.com> Closes #21091 from jinxing64/SPARK-22676-FOLLOW-UP.	2018-04-19 21:07:21 +08:00
Takeshi Yamamuro	e13416502f	[SPARK-23588][SQL] CatalystToExternalMap should support interpreted execution ## What changes were proposed in this pull request? This pr supported interpreted mode for `CatalystToExternalMap`. ## How was this patch tested? Added tests in `ObjectExpressionsSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #20979 from maropu/SPARK-23588.	2018-04-19 14:42:50 +02:00
Takeshi Yamamuro	1b08c4393c	[SPARK-23584][SQL] NewInstance should support interpreted execution ## What changes were proposed in this pull request? This pr supported interpreted mode for `NewInstance`. ## How was this patch tested? Added tests in `ObjectExpressionsSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #20778 from maropu/SPARK-23584.	2018-04-19 14:38:26 +02:00
Kazuaki Ishizaki	46bb2b5129	[SPARK-23924][SQL] Add element_at function ## What changes were proposed in this pull request? The PR adds the SQL function `element_at`. The behavior of the function is based on Presto's one. This function returns element of array at given index in value if column is array, or returns value for the given key in value if column is map. ## How was this patch tested? Added UTs Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #21053 from kiszk/SPARK-23924.	2018-04-19 21:00:10 +09:00
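A plain-Scala picture of the two `element_at` modes (array lookup by index, map lookup by key) is sketched below. The helper names are hypothetical. The 1-based indexing, the handling of negative indices as counting from the end, the rejection of index 0, and the use of `None` for out-of-range lookups are assumptions based on the function's documented behaviour, not on this commit message.

```scala
object ElementAtSketch {
  // Array mode. Assumptions: 1-based index, negative index counts from the end,
  // index 0 is rejected, out-of-range yields None (standing in for SQL null).
  def elementAtArray[A](items: Seq[A], index: Int): Option[A] = {
    require(index != 0, "index 0 is not a valid position in this sketch")
    val zeroBased = if (index > 0) index - 1 else items.length + index
    items.lift(zeroBased)
  }

  // Map mode: the value for the key, or None when the key is absent.
  def elementAtMap[K, V](entries: Map[K, V], key: K): Option[V] = entries.get(key)
}
```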
Kazuaki Ishizaki	d5bec48b9c	[SPARK-23919][SQL] Add array_position function ## What changes were proposed in this pull request? The PR adds the SQL function `array_position`. The behavior of the function is based on Presto's one. The function returns the position of the first occurrence of the element in array x (or 0 if not found) using 1-based index as BigInt. ## How was this patch tested? Added UTs Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #21037 from kiszk/SPARK-23919.	2018-04-19 11:59:17 +09:00
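The return convention stated above (1-based position of the first occurrence, or 0 if the element is absent, as a 64-bit integer) maps directly onto Scala's `indexOf`. `ArrayPositionSketch` is a hypothetical name used only for this illustration.

```scala
object ArrayPositionSketch {
  // indexOf is 0-based and returns -1 when the element is absent,
  // so adding 1 gives the 1-based position, or 0 for "not found".
  def arrayPosition[A](items: Seq[A], element: A): Long =
    items.indexOf(element).toLong + 1L
}
```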
Gabor Somogyi	0c94e48bc5	[SPARK-23775][TEST] Make DataFrameRangeSuite not flaky ## What changes were proposed in this pull request? DataFrameRangeSuite.test("Cancelling stage in a query with Range.") sometimes stays in an infinite loop and times out the build. There were multiple issues with the test: 1. The first valid stageId is zero when the test is started alone and not in a suite, and the following code waits until timeout: ``` eventually(timeout(10.seconds), interval(1.millis)) { assert(DataFrameRangeSuite.stageToKill > 0) } ``` 2. The `DataFrameRangeSuite.stageToKill` was overwritten by the task's thread after the reset, which ended up canceling the same stage 2 times. This caused the infinite wait. This PR solves the mentioned flakiness by removing the shared `DataFrameRangeSuite.stageToKill` and using `wait` and `CountDownLatch` for synchronization. ## How was this patch tested? Existing unit test. Author: Gabor Somogyi <gabor.g.somogyi@gmail.com> Closes #20888 from gaborgsomogyi/SPARK-23775.	2018-04-18 16:37:41 -07:00
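Replacing a polled shared variable with a `CountDownLatch` looks roughly like the sketch below. It is a generic illustration of the synchronization pattern, not the test code from the PR: `LatchSyncSketch` and `runAndAwait` are hypothetical, and the worker thread stands in for the Spark task that signals it has started.

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}

object LatchSyncSketch {
  // Returns true if the worker signalled within the timeout.
  def runAndAwait(): Boolean = {
    val started = new CountDownLatch(1)
    val worker = new Thread(new Runnable {
      // The worker signals exactly once; there is no shared mutable id to overwrite.
      override def run(): Unit = started.countDown()
    })
    worker.start()
    // Block until the signal arrives instead of polling a shared variable.
    val signalled = started.await(10, TimeUnit.SECONDS)
    worker.join()
    signalled
  }
}
```

Because the latch can only count down, a late write from the worker cannot reset the state the waiting thread observes, which is the failure mode described in item 2.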
Liang-Chi Hsieh	a9066478f6	[SPARK-23875][SQL][FOLLOWUP] Add IndexedSeq wrapper for ArrayData ## What changes were proposed in this pull request? Use specified accessor in `ArrayData.foreach` and `toArray`. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #21099 from viirya/SPARK-23875-followup.	2018-04-19 00:05:47 +02:00
Takuya UESHIN	f09a9e9418	[SPARK-24007][SQL] EqualNullSafe for FloatType and DoubleType might generate a wrong result by codegen. ## What changes were proposed in this pull request? `EqualNullSafe` for `FloatType` and `DoubleType` might generate a wrong result by codegen. ```scala scala> val df = Seq((Some(-1.0d), None), (None, Some(-1.0d))).toDF() df: org.apache.spark.sql.DataFrame = [_1: double, _2: double] scala> df.show() +----+----+ \| _1\| _2\| +----+----+ \|-1.0\|null\| \|null\|-1.0\| +----+----+ scala> df.filter("_1 <=> _2").show() +----+----+ \| _1\| _2\| +----+----+ \|-1.0\|null\| \|null\|-1.0\| +----+----+ ``` The result should be empty but the result remains two rows. ## How was this patch tested? Added a test. Author: Takuya UESHIN <ueshin@databricks.com> Closes #21094 from ueshin/issues/SPARK-24007/equalnullsafe.	2018-04-18 08:22:05 -07:00
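For reference, the null-safe equality that `<=>` is expected to implement can be written out in plain Python, with `None` standing in for SQL NULL. This is a sketch of the semantics only; `equal_null_safe` is a hypothetical name and has nothing to do with the generated code fixed in SPARK-24007.

```python
def equal_null_safe(left, right):
    """Null-safe equality: two nulls are equal, one null never equals a value."""
    if left is None and right is None:
        return True
    if left is None or right is None:
        return False
    return left == right


# The two rows from the commit message: neither should pass the filter.
rows = [(-1.0, None), (None, -1.0)]
kept = [row for row in rows if equal_null_safe(*row)]
print(kept)
```

Applied to the rows from the example, the filter keeps nothing, which matches the commit's statement that the result should be empty.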
mn-mikke	f81fa478ff	[SPARK-23926][SQL] Extending reverse function to support ArrayType arguments ## What changes were proposed in this pull request? This PR extends the `reverse` function so that it can operate on array columns. It covers: - Introduction of a `Reverse` expression that represents the logic for reversing arrays as well as strings - Removal of the `StringReverse` expression - A wrapper for PySpark ## How was this patch tested? New tests added into: - CollectionExpressionsSuite - DataFrameFunctionsSuite ## Codegen examples ### Primitive type ``` val df = Seq( Seq(1, 3, 4, 2), null ).toDF("i") df.filter($"i".isNotNull || $"i".isNull).select(reverse($"i")).debugCodegen ``` Result: ``` /* 032 */ boolean inputadapter_isNull = inputadapter_row.isNullAt(0); /* 033 */ ArrayData inputadapter_value = inputadapter_isNull ? /* 034 */ null : (inputadapter_row.getArray(0)); /* 035 */ /* 036 */ boolean filter_value = true; /* 037 */ /* 038 */ if (!(!inputadapter_isNull)) { /* 039 */ filter_value = inputadapter_isNull; /* 040 */ } /* 041 */ if (!filter_value) continue; /* 042 */ /* 043 */ ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1); /* 044 */ /* 045 */ boolean project_isNull = inputadapter_isNull; /* 046 */ ArrayData project_value = null; /* 047 */ /* 048 */ if (!inputadapter_isNull) { /* 049 */ final int project_length = inputadapter_value.numElements(); /* 050 */ project_value = inputadapter_value.copy(); /* 051 */ for(int k = 0; k < project_length / 2; k++) { /* 052 */ int l = project_length - k - 1; /* 053 */ boolean isNullAtK = project_value.isNullAt(k); /* 054 */ boolean isNullAtL = project_value.isNullAt(l); /* 055 */ if(!isNullAtK) { /* 056 */ int el = project_value.getInt(k); /* 057 */ if(!isNullAtL) { /* 058 */ project_value.setInt(k, project_value.getInt(l)); /* 059 */ } else { /* 060 */ project_value.setNullAt(k); /* 061 */ } /* 062 */ project_value.setInt(l, el); /* 063 */ } else if (!isNullAtL) { /* 064 */ project_value.setInt(k, project_value.getInt(l)); /* 065 */ project_value.setNullAt(l); /* 066 */ } /* 067 */ } /* 068 */ /* 069 */ } ``` ### Non-primitive type ``` val df = Seq( Seq("a", "c", "d", "b"), null ).toDF("s") df.filter($"s".isNotNull || $"s".isNull).select(reverse($"s")).debugCodegen ``` Result: ``` /* 032 */ boolean inputadapter_isNull = inputadapter_row.isNullAt(0); /* 033 */ ArrayData inputadapter_value = inputadapter_isNull ? /* 034 */ null : (inputadapter_row.getArray(0)); /* 035 */ /* 036 */ boolean filter_value = true; /* 037 */ /* 038 */ if (!(!inputadapter_isNull)) { /* 039 */ filter_value = inputadapter_isNull; /* 040 */ } /* 041 */ if (!filter_value) continue; /* 042 */ /* 043 */ ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1); /* 044 */ /* 045 */ boolean project_isNull = inputadapter_isNull; /* 046 */ ArrayData project_value = null; /* 047 */ /* 048 */ if (!inputadapter_isNull) { /* 049 */ final int project_length = inputadapter_value.numElements(); /* 050 */ project_value = new org.apache.spark.sql.catalyst.util.GenericArrayData(new Object[project_length]); /* 051 */ for(int k = 0; k < project_length; k++) { /* 052 */ int l = project_length - k - 1; /* 053 */ project_value.update(k, inputadapter_value.getUTF8String(l)); /* 054 */ } /* 055 */ /* 056 */ } ``` Author: mn-mikke <mrkAha12346github> Closes #21034 from mn-mikke/feature/array-api-reverse-to-master.	2018-04-18 18:41:55 +09:00
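The swap loop in the primitive-type codegen above can be followed more easily in a small Python transcription. This is an illustrative model only: `reverse_with_nulls` is a hypothetical name, a Python list with `None` stands in for `ArrayData`, and the explicit null branches are kept (although Python would not need them) to mirror how the generated code handles primitive slots that cannot hold null.

```python
def reverse_with_nulls(values):
    """Reverse a copy of values by swapping from both ends, null-aware."""
    if values is None:
        return None
    out = list(values)  # mirrors project_value = inputadapter_value.copy()
    length = len(out)
    for k in range(length // 2):
        l = length - k - 1
        null_at_k = out[k] is None
        null_at_l = out[l] is None
        if not null_at_k:
            el = out[k]
            # Move the right-hand slot (value or null) to the left.
            out[k] = out[l] if not null_at_l else None
            out[l] = el
        elif not null_at_l:
            # Left is null, right is not: move the value left, the null right.
            out[k] = out[l]
            out[l] = None
    return out


print(reverse_with_nulls([1, 3, 4, 2]))
print(reverse_with_nulls([1, None, 4, 2]))
```

Only half of the array is visited because each iteration settles two positions at once; when both slots are null there is nothing to do.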
gatorsmile	cce469435d	[SPARK-24002][SQL] Task not serializable caused by org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes ## What changes were proposed in this pull request? ``` Py4JJavaError: An error occurred while calling o153.sql. : org.apache.spark.SparkException: Job aborted. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:223) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:189) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) at org.apache.spark.sql.Dataset$$anonfun$59.apply(Dataset.scala:3021) at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:89) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:127) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3020) at org.apache.spark.sql.Dataset.<init>(Dataset.scala:190) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:646) at sun.reflect.GeneratedMethodAccessor153.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380) at py4j.Gateway.invoke(Gateway.java:293) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at 
py4j.GatewayConnection.run(GatewayConnection.java:226) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.spark.SparkException: Exception thrown in Future.get: at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:190) at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:267) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doConsume(BroadcastNestedLoopJoinExec.scala:530) at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155) at org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:37) at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:69) at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155) at org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:144) ... at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:190) ... 23 more Caused by: java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Task not serializable at java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.util.concurrent.FutureTask.get(FutureTask.java:206) at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:179) ... 
276 more Caused by: org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156) at org.apache.spark.SparkContext.clean(SparkContext.scala:2380) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:850) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:849) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:371) at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:849) at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:417) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:123) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$3.apply(SparkPlan.scala:152) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:149) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:118) at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:89) at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:125) at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:116) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52) at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:116) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:123) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$3.apply(SparkPlan.scala:152) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:149) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:118) at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:271) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:181) at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:414) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:123) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:118) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$3.apply(SparkPlan.scala:152) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:149) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:118) at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:61) at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:70) at org.apache.spark.sql.execution.SparkPlan.executeCollectResult(SparkPlan.scala:264) at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anon$1$$anonfun$call$1.apply(BroadcastExchangeExec.scala:93) at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anon$1$$anonfun$call$1.apply(BroadcastExchangeExec.scala:81) at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:150) at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anon$1.call(BroadcastExchangeExec.scala:80) at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anon$1.call(BroadcastExchangeExec.scala:76) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ... 1 more Caused by: java.nio.BufferUnderflowException at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:151) at java.nio.ByteBuffer.get(ByteBuffer.java:715) at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes(Binary.java:405) at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytesUnsafe(Binary.java:414) at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.writeObject(Binary.java:484) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1128) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) ``` The Parquet filters are serializable but not thread safe. SparkPlan.prepare() could be called in different threads (BroadcastExchange will call it in a thread pool). Thus, we could serialize the same Parquet filter at the same time. This is not easily reproduced. The fix is to avoid serializing these Parquet filters in the driver. This PR is to avoid serializing these Parquet filters by moving the parquet filter generation from the driver to executors. ## How was this patch tested? Having two queries one is a 1000-line SQL query and a 3000-line SQL query. Need to run at least one hour with a heavy write workload to reproduce once. 
Author: gatorsmile <gatorsmile@gmail.com> Closes #21086 from gatorsmile/taskNotSerializable.	2018-04-17 21:03:57 -07:00
Wenchen Fan	310a8cd062	[SPARK-23341][SQL] define some standard options for data source v2 ## What changes were proposed in this pull request? Each data source implementation can define its own options and teach its users how to set them. Spark doesn't have any restrictions about what options a data source should or should not have. It's possible that some options are very common and many data sources use them. However, different data sources may define the common options (key and meaning) differently, which is quite confusing to end users. This PR defines some standard options that data sources can optionally adopt: path, table and database. ## How was this patch tested? a new test case. Author: Wenchen Fan <wenchen@databricks.com> Closes #20535 from cloud-fan/options.	2018-04-18 11:51:10 +08:00
maryannxue	1e3b8762a8	[SPARK-21479][SQL] Outer join filter pushdown in null supplying table when condition is on one of the joined columns ## What changes were proposed in this pull request? Added `TransitPredicateInOuterJoin` optimization rule that transits constraints from the preserved side of an outer join to the null-supplying side. The constraints of the join operator will remain unchanged. ## How was this patch tested? Added 3 tests in `InferFiltersFromConstraintsSuite`. Author: maryannxue <maryann.xue@gmail.com> Closes #20816 from maryannxue/spark-21479.	2018-04-18 10:36:41 +08:00
Marco Gaido	f39e82ce15	[SPARK-23986][SQL] freshName can generate non-unique names ## What changes were proposed in this pull request? We are using `CodegenContext.freshName` to get a unique name for any new variable we are adding. Unfortunately, this method currently fails to create a unique name when we request more than one instance of variables with starting name `name1` and an instance with starting name `name11`. The PR changes the way a new name is generated by `CodegenContext.freshName` so that we generate unique names in this scenario too. ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #21080 from mgaido91/SPARK-23986.	2018-04-18 00:35:44 +08:00
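The `name1` / `name11` clash from SPARK-23986 is easy to reproduce with a toy name generator. The sketch below is an assumption about how such a collision can arise, not a copy of `CodegenContext.freshName`: both classes and their counter scheme are hypothetical, and the separator-based variant is just one possible way to avoid the clash.

```python
class SuffixNamer:
    """Hypothetical scheme: append a bare counter on repeated requests."""

    def __init__(self):
        self._ids = {}

    def fresh_name(self, name):
        if name in self._ids:
            counter = self._ids[name]
            self._ids[name] = counter + 1
            return f"{name}{counter}"
        self._ids[name] = 1
        return name


class SeparatedNamer:
    """Hypothetical alternative: always append "_<counter>"."""

    def __init__(self):
        self._ids = {}

    def fresh_name(self, name):
        counter = self._ids.get(name, 0)
        self._ids[name] = counter + 1
        return f"{name}_{counter}"


requests = ["name1", "name1", "name11"]
old = SuffixNamer()
old_names = [old.fresh_name(r) for r in requests]
new = SeparatedNamer()
new_names = [new.fresh_name(r) for r in requests]
print(old_names)
print(new_names)
```

In the first scheme the second request for `name1` produces `name11`, which is indistinguishable from a first request for `name11`; putting a separator before the counter removes that ambiguity for this input.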
jinxing	ed4101d29f	[SPARK-22676] Avoid iterating all partition paths when spark.sql.hive.verifyPartitionPath=true ## What changes were proposed in this pull request? The current code scans all partition paths when spark.sql.hive.verifyPartitionPath=true. For example, given the table below: ``` CREATE TABLE `test`( `id` int, `age` int, `name` string) PARTITIONED BY ( `A` string, `B` string) load data local inpath '/tmp/data0' into table test partition(A='00', B='00') load data local inpath '/tmp/data1' into table test partition(A='01', B='01') load data local inpath '/tmp/data2' into table test partition(A='10', B='10') load data local inpath '/tmp/data3' into table test partition(A='11', B='11') ``` the query "select * from test where A='00' and B='01'" makes the current code scan all partition paths, including '/data/A=00/B=00', '/data/A=01/B=01', '/data/A=10/B=10', '/data/A=11/B=11'. This costs a lot of time and memory. This PR proposes to avoid iterating over all partition paths: it adds a config `spark.files.ignoreMissingFiles` and ignores `file not found` errors in `getPartitions/compute` (for Hive table scans). This is much like the logic introduced by `spark.sql.files.ignoreMissingFiles` (which is for datasource scans). ## How was this patch tested? UT Author: jinxing <jinxing6042@126.com> Closes #19868 from jinxing64/SPARK-22676.	2018-04-17 21:52:33 +08:00
Marco Gaido	0a9172a05e	[SPARK-23835][SQL] Add not-null check to Tuples' arguments deserialization ## What changes were proposed in this pull request? There was no nullability check on the arguments of `Tuple`s. This could lead to surprising behavior when a null value had to be deserialized into a non-nullable Scala object: in those cases, the `null` was silently turned into a valid value (like `-1` for `Int`), corresponding to the default value used in the SQL codebase. This situation was very likely to happen when deserializing to a Tuple of primitive Scala types (like Double, Int, ...). The PR adds `AssertNotNull` to the arguments of tuples that are to be converted to non-nullable types. ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #20976 from mgaido91/SPARK-23835.	2018-04-17 21:45:20 +08:00
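The difference between silently substituting a default and asserting non-null, as described in SPARK-23835, can be shown with two small Python functions. Both function names are hypothetical, `None` stands in for SQL NULL, and the `-1` default is taken from the commit message's example; none of this is Spark code.

```python
def deserialize_with_defaults(row, default=-1):
    """Behavior before the check: a null quietly becomes the default value."""
    return tuple(default if value is None else value for value in row)


def deserialize_checked(row):
    """Behavior with a not-null assertion: fail loudly instead of guessing."""
    for position, value in enumerate(row):
        if value is None:
            raise ValueError(
                f"null value at tuple position {position} for a non-nullable field"
            )
    return tuple(row)


print(deserialize_with_defaults((1, None)))
try:
    deserialize_checked((1, None))
except ValueError as err:
    print("rejected:", err)
```

Failing at deserialization time keeps a made-up `-1` from flowing into later computations, where it would be indistinguishable from real data.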
Liang-Chi Hsieh	30ffb53cad	[SPARK-23875][SQL] Add IndexedSeq wrapper for ArrayData ## What changes were proposed in this pull request? We don't have a good way to sequentially access `UnsafeArrayData` with a common interface such as `Seq`. An example is `MapObject` where we need to access several sequence collection types together. But `UnsafeArrayData` doesn't implement `ArrayData.array`. Calling `toArray` will copy the entire array. We can provide an `IndexedSeq` wrapper for `ArrayData`, so we can avoid copying the entire array. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #20984 from viirya/SPARK-23875.	2018-04-17 15:09:36 +02:00
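The idea behind the `IndexedSeq` wrapper in SPARK-23875, exposing a sequence interface over packed storage without copying it, can be sketched in Python. Everything here is a stand-in: `PackedInts` is a hypothetical substitute for `UnsafeArrayData`, `ArrayDataView` for the wrapper, and the method names `num_elements` and `get_int` are chosen for illustration.

```python
import struct
from collections.abc import Sequence


class PackedInts:
    """Hypothetical packed storage: 32-bit ints in one bytes buffer."""

    def __init__(self, values):
        self._buf = struct.pack(f"<{len(values)}i", *values)
        self._count = len(values)

    def num_elements(self):
        return self._count

    def get_int(self, index):
        # Decode a single element directly from the buffer.
        return struct.unpack_from("<i", self._buf, 4 * index)[0]


class ArrayDataView(Sequence):
    """Read-only sequence view that decodes elements on access, without copying."""

    def __init__(self, data):
        self._data = data

    def __len__(self):
        return self._data.num_elements()

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError("only integer indexes are supported in this sketch")
        if not 0 <= index < len(self):
            raise IndexError(index)
        return self._data.get_int(index)


view = ArrayDataView(PackedInts([1, 2, 3]))
print(list(view), view[1], 3 in view)
```

Because `Sequence` derives iteration and membership tests from `__len__` and `__getitem__`, callers get the common interface while the underlying buffer is never materialized as a list.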
Efim Poberezkin	05ae74778a	[SPARK-23747][STRUCTURED STREAMING] Add EpochCoordinator unit tests ## What changes were proposed in this pull request? Unit tests for EpochCoordinator that test correct sequencing of committed epochs. Several tests are ignored since they test functionality implemented in SPARK-23503 which is not yet merged, otherwise they fail. Author: Efim Poberezkin <efim@poberezkin.ru> Closes #20983 from efimpoberezkin/pr/EpochCoordinator-tests.	2018-04-17 04:13:17 -07:00
Jose Torres	1cc66a072b	[SPARK-23687][SS] Add a memory source for continuous processing. ## What changes were proposed in this pull request? Add a memory source for continuous processing. Note that only one of the ContinuousSuite tests is migrated to minimize the diff here. I'll submit a second PR for SPARK-23688 to change the rest and get rid of waitForRateSourceTriggers. ## How was this patch tested? unit test Author: Jose Torres <torres.joseph.f+github@gmail.com> Closes #20828 from jose-torres/continuousMemory.	2018-04-17 01:59:38 -07:00
Marco Gaido	14844a62c0	[SPARK-23918][SQL] Add array_min function ## What changes were proposed in this pull request? The PR adds the SQL function `array_min`. It takes an array as argument and returns the minimum value in it. ## How was this patch tested? added UTs Author: Marco Gaido <marcogaido91@gmail.com> Closes #21025 from mgaido91/SPARK-23918.	2018-04-17 17:55:35 +09:00
Liang-Chi Hsieh	fd990a908b	[SPARK-23873][SQL] Use accessors in interpreted LambdaVariable ## What changes were proposed in this pull request? Currently, interpreted execution of `LambdaVariable` just uses `InternalRow.get` to access element. We should use specified accessors if possible. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #20981 from viirya/SPARK-23873.	2018-04-16 22:45:57 +02:00
Marco Gaido	6931022031	[SPARK-23917][SQL] Add array_max function ## What changes were proposed in this pull request? The PR adds the SQL function `array_max`. It takes an array as argument and returns the maximum value in it. ## How was this patch tested? added UTs Author: Marco Gaido <marcogaido91@gmail.com> Closes #21024 from mgaido91/SPARK-23917.	2018-04-15 21:45:55 -07:00
Liang-Chi Hsieh	73f28530d6	[SPARK-23979][SQL] MultiAlias should not be a CodegenFallback ## What changes were proposed in this pull request? `MultiAlias` is currently a `CodegenFallback`. It should not be, since it looks like `MultiAlias` is never evaluated. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #21065 from viirya/multialias-without-codegenfallback.	2018-04-14 08:59:04 +08:00
Tathagata Das	cbb41a0c5b	[SPARK-23966][SS] Refactoring all checkpoint file writing logic in a common CheckpointFileManager interface ## What changes were proposed in this pull request? Checkpoint files (offset log files, state store files) in Structured Streaming must be written atomically such that no partial files are generated (would break fault-tolerance guarantees). Currently, there are 3 locations which try to do this individually, and in some cases, incorrectly. 1. HDFSOffsetMetadataLog - This uses a FileManager interface to use any implementation of `FileSystem` or `FileContext` APIs. It preferably loads `FileContext` implementation as FileContext of HDFS has atomic renames. 1. HDFSBackedStateStore (aka in-memory state store) - Writing a version.delta file - This uses FileSystem APIs only to perform a rename. This is incorrect as rename is not atomic in HDFS FileSystem implementation. - Writing a snapshot file - Same as above. #### Current problems: 1. State Store behavior is incorrect - HDFS FileSystem implementation does not have atomic rename. 1. Inflexible - Some file systems provide mechanisms other than write-to-temp-file-and-rename for writing atomically and more efficiently. For example, with S3 you can write directly to the final file and it will be made visible only when the entire file is written and closed correctly. Any failure can be made to terminate the writing without making any partial files visible in S3. The current code does not abstract out this mechanism enough that it can be customized. #### Solution: 1. Introduce a common interface that all 3 cases above can use to write checkpoint files atomically. 2. This interface must provide the necessary interfaces that allow customization of the write-and-rename mechanism. This PR does that by introducing the interface `CheckpointFileManager` and modifying `HDFSMetadataLog` and `HDFSBackedStateStore` to use the interface. 
Similar to earlier `FileManager`, there are implementations based on `FileSystem` and `FileContext` APIs, and the latter implementation is preferred to make it work correctly with HDFS. The key method this interface has is `createAtomic(path, overwrite)` which returns a `CancellableFSDataOutputStream` that has the method `cancel()`. All users of this method need to either call `close()` to successfully write the file, or `cancel()` in case of an error. ## How was this patch tested? New tests in `CheckpointFileManagerSuite` and slightly modified existing tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #21048 from tdas/SPARK-23966.	2018-04-13 16:31:39 -07:00
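The `createAtomic` / `close()` / `cancel()` contract described in SPARK-23966 can be illustrated with a write-to-temp-file-and-rename sketch in Python. This is an assumption-laden model for a local filesystem, not the `CheckpointFileManager` API: the class name `CancellableAtomicWriter` is hypothetical, and it relies on `os.replace`, which is atomic on POSIX local filesystems but says nothing about HDFS or S3.

```python
import os
import tempfile


class CancellableAtomicWriter:
    """Write to a temp file; publish on close(), discard on cancel()."""

    def __init__(self, path, overwrite=False):
        if not overwrite and os.path.exists(path):
            raise FileExistsError(path)
        self._final_path = path
        directory = os.path.dirname(os.path.abspath(path))
        # The temp file lives in the target directory so the rename stays local.
        fd, self._temp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
        self._file = os.fdopen(fd, "wb")
        self._finished = False

    def write(self, data):
        self._file.write(data)

    def close(self):
        """Success path: make the complete file visible in one step."""
        if self._finished:
            return
        self._file.close()
        os.replace(self._temp_path, self._final_path)
        self._finished = True

    def cancel(self):
        """Error path: drop the partial file so it never becomes visible."""
        if self._finished:
            return
        self._file.close()
        os.remove(self._temp_path)
        self._finished = True
```

A caller would typically wrap the writes in `try`/`except`, calling `close()` after the last write succeeds and `cancel()` from the error handler, so readers only ever see complete files.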
Bruce Robbins	558f31b31c	[SPARK-23963][SQL] Properly handle large number of columns in query on text-based Hive table ## What changes were proposed in this pull request? TableReader would get disproportionately slower as the number of columns in the query increased. I fixed the way TableReader was looking up metadata for each column in the row. Previously, it had been looking up this data in linked lists, accessing each linked list by an index (column number). Now it looks up this data in arrays, where indexing by column number works better. ## How was this patch tested? - Manual testing - All sbt unit tests - Python SQL tests Author: Bruce Robbins <bersprockets@gmail.com> Closes #21043 from bersprockets/tabreadfix.	2018-04-13 14:05:04 -07:00
Marco Gaido	25892f3cc9	[SPARK-23375][SQL] Eliminate unneeded Sort in Optimizer ## What changes were proposed in this pull request? Added a new rule to remove Sort operation when its child is already sorted. For instance, this simple code: ``` spark.sparkContext.parallelize(Seq(("a", "b"))).toDF("a", "b").registerTempTable("table1") val df = sql(s"""SELECT b \| FROM ( \| SELECT a, b \| FROM table1 \| ORDER BY a \| ) t \| ORDER BY a""".stripMargin) df.explain(true) ``` before the PR produces this plan: ``` == Parsed Logical Plan == 'Sort ['a ASC NULLS FIRST], true +- 'Project ['b] +- 'SubqueryAlias t +- 'Sort ['a ASC NULLS FIRST], true +- 'Project ['a, 'b] +- 'UnresolvedRelation `table1` == Analyzed Logical Plan == b: string Project [b#7] +- Sort [a#6 ASC NULLS FIRST], true +- Project [b#7, a#6] +- SubqueryAlias t +- Sort [a#6 ASC NULLS FIRST], true +- Project [a#6, b#7] +- SubqueryAlias table1 +- Project [_1#3 AS a#6, _2#4 AS b#7] +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, true, false) AS _2#4] +- ExternalRDD [obj#2] == Optimized Logical Plan == Project [b#7] +- Sort [a#6 ASC NULLS FIRST], true +- Project [b#7, a#6] +- Sort [a#6 ASC NULLS FIRST], true +- Project [_1#3 AS a#6, _2#4 AS b#7] +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._1, true, false) AS _1#3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true, false) AS _2#4] +- ExternalRDD [obj#2] == Physical Plan == (3) Project [b#7] +- (3) Sort [a#6 ASC NULLS FIRST], true, 0 +- Exchange rangepartitioning(a#6 ASC NULLS FIRST, 200) 
+- (2) Project [b#7, a#6] +- (2) Sort [a#6 ASC NULLS FIRST], true, 0 +- Exchange rangepartitioning(a#6 ASC NULLS FIRST, 200) +- (1) Project [_1#3 AS a#6, _2#4 AS b#7] +- (1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._1, true, false) AS _1#3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true, false) AS _2#4] +- Scan ExternalRDDScan[obj#2] ``` while after the PR produces: ``` == Parsed Logical Plan == 'Sort ['a ASC NULLS FIRST], true +- 'Project ['b] +- 'SubqueryAlias t +- 'Sort ['a ASC NULLS FIRST], true +- 'Project ['a, 'b] +- 'UnresolvedRelation `table1` == Analyzed Logical Plan == b: string Project [b#7] +- Sort [a#6 ASC NULLS FIRST], true +- Project [b#7, a#6] +- SubqueryAlias t +- Sort [a#6 ASC NULLS FIRST], true +- Project [a#6, b#7] +- SubqueryAlias table1 +- Project [_1#3 AS a#6, _2#4 AS b#7] +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, true, false) AS _2#4] +- ExternalRDD [obj#2] == Optimized Logical Plan == Project [b#7] +- Sort [a#6 ASC NULLS FIRST], true +- Project [_1#3 AS a#6, _2#4 AS b#7] +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._1, true, false) AS _1#3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true, false) AS _2#4] +- ExternalRDD [obj#2] == Physical Plan == (2) Project [b#7] +- (2) Sort [a#6 ASC NULLS FIRST], true, 0 +- Exchange rangepartitioning(a#6 ASC NULLS FIRST, 
5) +- (1) Project [_1#3 AS a#6, _2#4 AS b#7] +- (1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._1, true, false) AS _1#3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true, false) AS _2#4] +- Scan ExternalRDDScan[obj#2] ``` this means that an unnecessary sort operation is not performed after the PR. ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #20560 from mgaido91/SPARK-23375.	2018-04-14 01:01:00 +08:00
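The effect of the rule in SPARK-23375 can be imitated on a toy plan tree. The sketch below is not Catalyst: the `Scan`, `Project` and `Sort` classes, the `output_ordering` helper and the rule itself are hypothetical, and the assumption that a projection keeps its child's ordering whenever the sort columns are still projected is a simplification.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: object


@dataclass(frozen=True)
class Sort:
    order: Tuple[str, ...]
    child: object


def output_ordering(plan):
    """Ordering a node is known to produce (empty tuple = unknown)."""
    if isinstance(plan, Sort):
        return plan.order
    if isinstance(plan, Project):
        inner = output_ordering(plan.child)
        # Simplification: ordering survives if all sort columns are kept.
        return inner if all(c in plan.columns for c in inner) else ()
    return ()


def remove_redundant_sorts(plan):
    """Drop a Sort whose child already produces the requested ordering."""
    if isinstance(plan, Scan):
        return plan
    child = remove_redundant_sorts(plan.child)
    if isinstance(plan, Project):
        return Project(plan.columns, child)
    if output_ordering(child)[: len(plan.order)] == plan.order:
        return child
    return Sort(plan.order, child)


def count_sorts(plan):
    if isinstance(plan, Scan):
        return 0
    return (1 if isinstance(plan, Sort) else 0) + count_sorts(plan.child)


# Shape of the analyzed plan in the example: two sorts on column a.
query = Project(("b",), Sort(("a",), Project(("b", "a"),
        Sort(("a",), Project(("a", "b"), Scan("table1"))))))
optimized = remove_redundant_sorts(query)
print(count_sorts(query), count_sorts(optimized))
```

As in the plans above, two sorts go in and one remains; which of the two a real optimizer keeps depends on its other rules, so only the count is meaningful here.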
Gengliang Wang	4dfd746de3	[SPARK-23896][SQL] Improve PartitioningAwareFileIndex ## What changes were proposed in this pull request? Currently `PartitioningAwareFileIndex` accepts an optional parameter `userPartitionSchema`. If provided, it will combine the inferred partition schema with the parameter. However, 1. to get `userPartitionSchema`, we need to combine inferred partition schema with `userSpecifiedSchema` 2. to get the inferred partition schema, we have to create a temporary file index. Only after that, a final version of `PartitioningAwareFileIndex` can be created. This can be improved by passing `userSpecifiedSchema` to `PartitioningAwareFileIndex`. With the improvement, we can reduce redundant code and avoid parsing the file partition twice. ## How was this patch tested? Unit test Author: Gengliang Wang <gengliang.wang@databricks.com> Closes #21004 from gengliangwang/PartitioningAwareFileIndex.	2018-04-14 00:22:38 +08:00
yucai	0323e61465	[SPARK-23905][SQL] Add UDF weekday ## What changes were proposed in this pull request? Add UDF weekday ## How was this patch tested? A new test Author: yucai <yyu1@ebay.com> Closes #21009 from yucai/SPARK-23905.	2018-04-13 00:00:04 -07:00
Eric Liang	1018be44d6	[SPARK-23971] Should not leak Spark sessions across test suites ## What changes were proposed in this pull request? Many suites currently leak Spark sessions (sometimes with stopped SparkContexts) via the thread-local active Spark session and default Spark session. We should attempt to clean these up and detect when this happens to improve the reproducibility of tests. ## How was this patch tested? Existing tests Author: Eric Liang <ekl@databricks.com> Closes #21058 from ericl/clear-session.	2018-04-12 22:30:59 -07:00
hyukjinkwon	ab7b961a4f	[SPARK-23942][PYTHON][SQL] Makes collect in PySpark as action for a query executor listener ## What changes were proposed in this pull request? This PR proposes to add `collect` to a query executor as an action. It seems that `collect` / `collect` with Arrow are not recognised via `QueryExecutionListener` as an action. For example, if we have a custom listener as below: ```scala package org.apache.spark.sql import org.apache.spark.internal.Logging import org.apache.spark.sql.execution.QueryExecution import org.apache.spark.sql.util.QueryExecutionListener class TestQueryExecutionListener extends QueryExecutionListener with Logging { override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = { logError("Look at me! I'm 'onSuccess'") } override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = { } } ``` and set `spark.sql.queryExecutionListeners` to `org.apache.spark.sql.TestQueryExecutionListener`. Other operations on the PySpark or Scala side seem fine: ```python >>> sql("SELECT * FROM range(1)").show() ``` ``` 18/04/09 17:02:04 ERROR TestQueryExecutionListener: Look at me! I'm 'onSuccess' +---+ | id| +---+ | 0| +---+ ``` ```scala scala> sql("SELECT * FROM range(1)").collect() ``` ``` 18/04/09 16:58:41 ERROR TestQueryExecutionListener: Look at me! I'm 'onSuccess' res1: Array[org.apache.spark.sql.Row] = Array([0]) ``` but .. Before ```python >>> sql("SELECT * FROM range(1)").collect() ``` ``` [Row(id=0)] ``` ```python >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true") >>> sql("SELECT * FROM range(1)").toPandas() ``` ``` id 0 0 ``` After ```python >>> sql("SELECT * FROM range(1)").collect() ``` ``` 18/04/09 16:57:58 ERROR TestQueryExecutionListener: Look at me! I'm 'onSuccess' [Row(id=0)] ``` ```python >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true") >>> sql("SELECT * FROM range(1)").toPandas() ``` ``` 18/04/09 17:53:26 ERROR TestQueryExecutionListener: Look at me! I'm 'onSuccess' id 0 0 ``` ## How was this patch tested? I have manually tested as described above and unit test was added. Author: hyukjinkwon <gurwls223@apache.org> Closes #21007 from HyukjinKwon/SPARK-23942.	2018-04-13 11:28:13 +08:00
jerryshao	14291b061b	[SPARK-23748][SS] Fix SS continuous process doesn't support SubqueryAlias issue ## What changes were proposed in this pull request? SS continuous processing currently doesn't support processing on a temp table or `df.as("xxx")`; SS throws an exception because the LogicalPlan is not supported, as described [here](https://issues.apache.org/jira/browse/SPARK-23748). This PR proposes to add this support. ## How was this patch tested? new UT. Author: jerryshao <sshao@hortonworks.com> Closes #21017 from jerryshao/SPARK-23748.	2018-04-12 20:00:25 -07:00
Kazuaki Ishizaki	0b19122d43	[SPARK-23762][SQL] UTF8StringBuffer uses MemoryBlock ## What changes were proposed in this pull request? This PR tries to use `MemoryBlock` in `UTF8StringBuffer`. In general, there are two advantages to using `MemoryBlock`: 1. It has clean API calls rather than using a Java array or `PlatformMemory`. 2. It improves runtime performance of memory access compared with using `Object`. ## How was this patch tested? Added `UTF8StringBufferSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #20871 from kiszk/SPARK-23762.	2018-04-12 22:21:30 +08:00
Imran Rashid	6a2289ecf0	[SPARK-23962][SQL][TEST] Fix race in currentExecutionIds(). SQLMetricsTestUtils.currentExecutionIds() was racing with the listener bus, which led to some flaky tests. We should wait until the listener bus is empty. I tested by adding some Thread.sleep()s in SQLAppStatusListener, which reproduced the exceptions I saw on Jenkins. With this change, they went away. Author: Imran Rashid <irashid@cloudera.com> Closes #21041 from squito/SPARK-23962.	2018-04-12 15:58:04 +08:00
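The race described in this commit can be illustrated with a minimal Java sketch. Everything here is hypothetical (`ListenerBusDemo`, `Bus`, `post`, `waitUntilEmpty` are stand-ins written for this example, not Spark's listener bus): events are handled asynchronously on one worker thread, so a reader has to wait for the queue to drain before the recorded execution ids are complete.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class ListenerBusDemo {
    // Hypothetical stand-in for an asynchronous listener bus.
    static final class Bus {
        private final ExecutorService worker = Executors.newSingleThreadExecutor();
        private final List<Long> seenIds = new CopyOnWriteArrayList<>();

        // Events are queued and handled later on the worker thread.
        void post(long executionId) {
            worker.execute(() -> {
                pause(20); // simulate a slow listener
                seenIds.add(executionId);
            });
        }

        // The single worker runs tasks in FIFO order, so once this empty
        // sentinel task has finished, every earlier event has been handled.
        void waitUntilEmpty(long timeoutMs) {
            try {
                worker.submit(() -> { }).get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (Exception e) {
                throw new IllegalStateException("listener bus did not drain", e);
            }
        }

        List<Long> currentExecutionIds() {
            return new ArrayList<>(seenIds);
        }

        void stop() {
            worker.shutdown();
        }
    }

    static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        Bus bus = new Bus();
        for (long id = 0; id < 3; id++) {
            bus.post(id);
        }
        // Reading here races with the worker and may see fewer than three ids.
        System.out.println("before wait: " + bus.currentExecutionIds());
        bus.waitUntilEmpty(5000);
        System.out.println("after wait:  " + bus.currentExecutionIds());
        bus.stop();
    }
}
```

The first read may or may not see all ids, depending on timing; only the read after `waitUntilEmpty` is deterministic, which is the property a test needs.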
gatorsmile	e904dfaf0d	Revert "[SPARK-23960][SQL][MINOR] Mark HashAggregateExec.bufVars as transient" This reverts commit `271c891b91`.	2018-04-11 17:04:34 -07:00
Kris Mok	271c891b91	[SPARK-23960][SQL][MINOR] Mark HashAggregateExec.bufVars as transient ## What changes were proposed in this pull request? Mark `HashAggregateExec.bufVars` as transient to prevent it from being serialized. Also manually null out this field at the end of `doProduceWithoutKeys()` to shorten its lifecycle, because it'll no longer be used after that. ## How was this patch tested? Existing tests. Author: Kris Mok <kris.mok@databricks.com> Closes #21039 from rednaxelafx/codegen-improve.	2018-04-11 21:52:48 +08:00
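As a rough illustration of what marking a field transient does, here is a plain-Java sketch. The `Operator` class and its `scratch` field are hypothetical stand-ins for this example, not `HashAggregateExec` or `bufVars`: a transient field is skipped by Java serialization and comes back as `null` after a round trip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class TransientFieldDemo {
    // Hypothetical stand-in for a serializable operator.
    static final class Operator implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        transient int[] scratch; // skipped by Java serialization

        Operator(String name, int[] scratch) {
            this.name = name;
            this.scratch = scratch;
        }
    }

    // Serializes the operator to a byte array and reads it back.
    static Operator roundTrip(Operator op) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(op);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Operator) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Operator copy = roundTrip(new Operator("agg", new int[] {1, 2, 3}));
        System.out.println(copy.name + " scratch=" + copy.scratch); // prints "agg scratch=null"
    }
}
```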
Herman van Hovell	c604d659e1	[SPARK-23951][SQL] Use actual java class instead of string representation. ## What changes were proposed in this pull request? This PR refactors the newly added `ExprValue` API by quite a bit. The following changes are introduced: 1. `ExprValue` now uses the actual class instead of the class name as its type. This should give some more flexibility with generating code in the future. 2. Renamed `StatementValue` to `SimpleExprValue`. The statement concept is broader than an expression (untyped and it cannot be on the right hand side of an assignment), and this was not really what we were using it for. I have added a top level `JavaCode` trait that can be used in the future to reinstate (no pun intended) a statement-like code fragment. 3. Added factory methods to the `JavaCode` companion object to make it slightly less verbose to create `JavaCode`/`ExprValue` objects. This is also what makes the diff quite large. 4. Added one more factory method to `ExprCode` to make it easier to create code-less expressions. ## How was this patch tested? Existing tests. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #21026 from hvanhovell/SPARK-23951.	2018-04-11 20:11:03 +08:00
Gengliang Wang	e179658914	[SPARK-19724][SQL][FOLLOW-UP] Check location of managed table when ignoreIfExists is true ## What changes were proposed in this pull request? In PR #20886, I mistakenly checked the table location only when `ignoreIfExists` is false, which followed the original deprecated PR. That was wrong. When `ignoreIfExists` is true and the target table doesn't exist, we should also check the table location. In other words, `ignoreIfExists` has nothing to do with table location validation. This is a follow-up PR to fix the mistake. ## How was this patch tested? Add one unit test. Author: Gengliang Wang <gengliang.wang@databricks.com> Closes #21001 from gengliangwang/SPARK-19724-followup.	2018-04-10 09:33:09 -07:00
Herman van Hovell	3323b156f9	[SPARK-23864][SQL] Add unsafe object writing to UnsafeWriter ## What changes were proposed in this pull request? This PR moves writing of `UnsafeRow`, `UnsafeArrayData` & `UnsafeMapData` out of the `GenerateUnsafeProjection`/`InterpretedUnsafeProjection` classes into the `UnsafeWriter` interface. This cleans up the code a little bit, and it should also result in less byte code for the code generated path. ## How was this patch tested? Existing tests Author: Herman van Hovell <hvanhovell@databricks.com> Closes #20986 from hvanhovell/SPARK-23864.	2018-04-10 17:32:00 +02:00
Herman van Hovell	6498884154	[SPARK-23898][SQL] Simplify add & subtract code generation ## What changes were proposed in this pull request? Code generation for the `Add` and `Subtract` expressions was not done using the `BinaryArithmetic.doCodeGen` method because these expressions also support `CalendarInterval`. This leads to a bit of duplication. This PR gets rid of that duplication by adding `calendarIntervalMethod` to `BinaryArithmetic` and doing the code generation for `CalendarInterval` in `BinaryArithmetic` instead. ## How was this patch tested? Existing tests. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #21005 from hvanhovell/SPARK-23898.	2018-04-09 21:49:49 -07:00
Kris Mok	f94f3624ea	[SPARK-23947][SQL] Add hashUTF8String convenience method to hasher classes ## What changes were proposed in this pull request? Add `hashUTF8String()` to the hasher classes to allow Spark SQL codegen to generate cleaner code for hashing `UTF8String`s. No change in behavior otherwise. Although with the introduction of SPARK-10399, the code size for hashing `UTF8String` is already smaller, it's still good to extract a separate function in the hasher classes so that the generated code can stay clean. ## How was this patch tested? Existing tests. Author: Kris Mok <kris.mok@databricks.com> Closes #21016 from rednaxelafx/hashutf8.	2018-04-09 21:07:28 -07:00
Liang-Chi Hsieh	7c1654e215	[SPARK-22856][SQL] Add wrappers for codegen output and nullability ## What changes were proposed in this pull request? The codegen output of `Expression`, aka `ExprCode`, now encapsulates only strings of output value (`value`) and nullability (`isNull`). This makes it difficult for us to know what the output really is. I think it is better if we add wrappers for the value and nullability that let us easily know that. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #20043 from viirya/SPARK-22856.	2018-04-09 11:54:35 -07:00
Kazuaki Ishizaki	8d40a79a07	[SPARK-23893][CORE][SQL] Avoid possible integer overflow in multiplication ## What changes were proposed in this pull request? This PR avoids possible overflow at an operation `long = (long)(int * int)`. The multiplication of large positive integer values may set the MSB to one. This leads to a negative value in the long where we expected a positive value (e.g. `0111_0000_0000_0000 * 0000_0000_0000_0010`). This PR performs the long cast before the multiplication to avoid this situation. ## How was this patch tested? Existing UTs Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #21002 from kiszk/SPARK-23893.	2018-04-08 20:40:27 +02:00
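The overflow pattern described in this commit can be reproduced in a few lines of plain Java. This is a minimal sketch with hypothetical method names and operand values chosen only to trigger the wrap-around; it is not code from the patch.

```java
final class MultiplyOverflowDemo {
    // Problematic shape: the product is computed in 32-bit int arithmetic
    // and wraps around before the cast widens it to long.
    static long castAfterMultiply(int a, int b) {
        return (long) (a * b);
    }

    // Fixed shape: casting one operand first makes the multiplication 64-bit.
    static long castBeforeMultiply(int a, int b) {
        return (long) a * b;
    }

    public static void main(String[] args) {
        int a = 0x70000000; // 1879048192, a large positive int
        int b = 2;
        System.out.println(castAfterMultiply(a, b));  // prints -536870912
        System.out.println(castBeforeMultiply(a, b)); // prints 3758096384
    }
}
```

The int product 0xE0000000 has its most significant bit set, so it is negative as an int and stays negative when widened; casting before multiplying keeps the full 64-bit result.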
Maxim Gekk	6a734575a8	[SPARK-23849][SQL] Tests for the samplingRatio option of JSON datasource ## What changes were proposed in this pull request? The proposed tests check that only a subset of the input dataset is touched during schema inference. Author: Maxim Gekk <maxim.gekk@databricks.com> Closes #20963 from MaxGekk/json-sampling-tests.	2018-04-07 21:44:32 -07:00
Huaxin Gao	2c1fe64757	[SPARK-23847][PYTHON][SQL] Add asc_nulls_first, asc_nulls_last to PySpark ## What changes were proposed in this pull request? Column.scala and Functions.scala have asc_nulls_first, asc_nulls_last, desc_nulls_first and desc_nulls_last. Add the corresponding Python APIs in column.py and functions.py. ## How was this patch tested? Add doctest Author: Huaxin Gao <huaxing@us.ibm.com> Closes #20962 from huaxingao/spark-23847.	2018-04-08 12:09:06 +08:00
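For readers unfamiliar with the four null orderings named in this commit, the following plain-Java analogue may help. It uses `java.util.Comparator` rather than the Spark or PySpark API, and the class and constant names are made up for this sketch; it only shows where nulls land under each ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class NullOrderingDemo {
    // Analogues of the four orderings, built from standard JDK comparators.
    static final Comparator<Integer> ASC_NULLS_FIRST =
        Comparator.nullsFirst(Comparator.<Integer>naturalOrder());
    static final Comparator<Integer> ASC_NULLS_LAST =
        Comparator.nullsLast(Comparator.<Integer>naturalOrder());
    static final Comparator<Integer> DESC_NULLS_FIRST =
        Comparator.nullsFirst(Comparator.<Integer>reverseOrder());
    static final Comparator<Integer> DESC_NULLS_LAST =
        Comparator.nullsLast(Comparator.<Integer>reverseOrder());

    // Returns a sorted copy so the input list is left untouched.
    static List<Integer> sorted(List<Integer> values, Comparator<Integer> order) {
        List<Integer> copy = new ArrayList<>(values);
        copy.sort(order);
        return copy;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(3, null, 1);
        System.out.println(sorted(values, ASC_NULLS_FIRST));  // [null, 1, 3]
        System.out.println(sorted(values, ASC_NULLS_LAST));   // [1, 3, null]
        System.out.println(sorted(values, DESC_NULLS_FIRST)); // [null, 3, 1]
        System.out.println(sorted(values, DESC_NULLS_LAST));  // [3, 1, null]
    }
}
```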
Kazuaki Ishizaki	b6935ffb4d	[SPARK-10399][SPARK-23879][HOTFIX] Fix Java lint errors ## What changes were proposed in this pull request? This PR fixes the following errors in [Java lint](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-lint/7717/console) after #19222 has been merged. These errors were pointed out by ueshin. ``` [ERROR] src/main/java/org/apache/spark/unsafe/array/ByteArrayMethods.java:[57] (sizes) LineLength: Line is longer than 100 characters (found 106). [ERROR] src/main/java/org/apache/spark/unsafe/memory/HeapMemoryAllocator.java:[26,8] (imports) UnusedImports: Unused import - org.apache.spark.unsafe.Platform. [ERROR] src/main/java/org/apache/spark/unsafe/memory/OffHeapMemoryBlock.java:[23,10] (modifier) ModifierOrder: 'public' modifier out of order with the JLS suggestions. [ERROR] src/main/java/org/apache/spark/unsafe/memory/OnHeapMemoryBlock.java:[64,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/OnHeapMemoryBlock.java:[69,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/OnHeapMemoryBlock.java:[74,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/OnHeapMemoryBlock.java:[79,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/OnHeapMemoryBlock.java:[84,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/OnHeapMemoryBlock.java:[89,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/OnHeapMemoryBlock.java:[94,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/OnHeapMemoryBlock.java:[99,10] (modifier) RedundantModifier: Redundant 'final' modifier. 
[ERROR] src/main/java/org/apache/spark/unsafe/memory/OnHeapMemoryBlock.java:[104,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/OnHeapMemoryBlock.java:[109,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/OnHeapMemoryBlock.java:[114,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/OnHeapMemoryBlock.java:[119,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/OnHeapMemoryBlock.java:[124,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/OnHeapMemoryBlock.java:[129,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/ByteArrayMemoryBlock.java:[60,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/ByteArrayMemoryBlock.java:[65,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/ByteArrayMemoryBlock.java:[70,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/ByteArrayMemoryBlock.java:[75,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/ByteArrayMemoryBlock.java:[80,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/ByteArrayMemoryBlock.java:[85,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/ByteArrayMemoryBlock.java:[90,10] (modifier) RedundantModifier: Redundant 'final' modifier. 
[ERROR] src/main/java/org/apache/spark/unsafe/memory/ByteArrayMemoryBlock.java:[95,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/ByteArrayMemoryBlock.java:[100,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/ByteArrayMemoryBlock.java:[105,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/ByteArrayMemoryBlock.java:[110,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/ByteArrayMemoryBlock.java:[115,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/ByteArrayMemoryBlock.java:[120,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/ByteArrayMemoryBlock.java:[125,10] (modifier) RedundantModifier: Redundant 'final' modifier. [ERROR] src/main/java/org/apache/spark/unsafe/memory/MemoryBlock.java:[114,16] (modifier) ModifierOrder: 'static' modifier out of order with the JLS suggestions. [ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/HiveHasher.java:[20,8] (imports) UnusedImports: Unused import - org.apache.spark.unsafe.Platform. [ERROR] src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java:[30,8] (imports) UnusedImports: Unused import - org.apache.spark.unsafe.memory.MemoryBlock. [ERROR] src/test/java/org/apache/spark/unsafe/memory/MemoryBlockSuite.java:[126,15] (naming) MethodName: Method name 'ByteArrayMemoryBlockTest' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'. [ERROR] src/test/java/org/apache/spark/unsafe/memory/MemoryBlockSuite.java:[143,15] (naming) MethodName: Method name 'OnHeapMemoryBlockTest' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'. 
[ERROR] src/test/java/org/apache/spark/unsafe/memory/MemoryBlockSuite.java:[160,15] (naming) MethodName: Method name 'OffHeapArrayMemoryBlockTest' must match pattern '^[a-z][a-z0-9][a-zA-Z0-9_]*$'. [ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/XXH64.java:[19,8] (imports) UnusedImports: Unused import - com.google.common.primitives.Ints. [ERROR] src/main/java/org/apache/spark/sql/catalyst/expressions/XXH64.java:[21,8] (imports) UnusedImports: Unused import - org.apache.spark.unsafe.Platform. [ERROR] src/test/java/org/apache/spark/sql/catalyst/expressions/HiveHasherSuite.java:[20,8] (imports) UnusedImports: Unused import - org.apache.spark.unsafe.Platform. ``` ## How was this patch tested? Existing UTs Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #20991 from kiszk/SPARK-10399-jlint.	2018-04-06 10:23:26 -07:00
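The MethodName errors in this lint output can be checked by hand with `java.util.regex`. The sketch below copies the pattern from the `OffHeapArrayMemoryBlockTest` message; the lower-camel-case replacement name is a hypothetical example, not necessarily the name used in the fix.

```java
import java.util.regex.Pattern;

final class MethodNamePatternDemo {
    // Pattern as quoted in the MethodName lint message.
    static final Pattern METHOD_NAME = Pattern.compile("^[a-z][a-z0-9][a-zA-Z0-9_]*$");

    static boolean isValid(String methodName) {
        return METHOD_NAME.matcher(methodName).matches();
    }

    public static void main(String[] args) {
        // Rejected: the name starts with an upper-case letter.
        System.out.println(isValid("ByteArrayMemoryBlockTest")); // false
        // Accepted: hypothetical lower-camel-case rename.
        System.out.println(isValid("testByteArrayMemoryBlock")); // true
    }
}
```

Note that the pattern also requires the second character to be a lower-case letter or digit, so a name such as `aB` would be rejected as well.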
Li Jin	d766ea2ff2	[SPARK-23861][SQL][DOC] Clarify default window frame with and without orderBy clause ## What changes were proposed in this pull request? Add docstring to clarify default window frame boundaries with and without orderBy clause ## How was this patch tested? Manually generate doc and check. Author: Li Jin <ice.xelloss@gmail.com> Closes #20978 from icexelloss/SPARK-23861-window-doc.	2018-04-07 00:15:54 +08:00
Yuchen Huo	9452401931	[SPARK-23822][SQL] Improve error message for Parquet schema mismatches ## What changes were proposed in this pull request? This pull request tries to improve the error message for Spark while reading Parquet files with different schemas, e.g. one with a STRING column and the other with an INT column. A new ParquetSchemaColumnConvertNotSupportedException is added to replace the old UnsupportedOperationException. The exception is again wrapped in FileScanRdd.scala to throw a more general QueryExecutionException with the actual Parquet file name which triggered the exception. ## How was this patch tested? Unit tests added to check the new exception and verify the error messages. Also manually tested with two Parquet files with different schemas to check the error message. <img width="1125" alt="screen shot 2018-03-30 at 4 03 04 pm" src="https://user-images.githubusercontent.com/37087310/38156580-dd58a140-3433-11e8-973a-b816d859fbe1.png"> Author: Yuchen Huo <yuchen.huo@databricks.com> Closes #20953 from yuchenhuo/SPARK-23822.	2018-04-06 08:35:20 -07:00
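The wrapping approach described in this commit, a specific low-level exception rethrown with the file name attached, might look roughly like the following in plain Java. All class, method, and message names here are hypothetical and chosen for the sketch; they are not the exception classes or messages the patch adds.

```java
final class SchemaMismatchDemo {
    // Hypothetical low-level exception carrying the column details.
    static final class ColumnConvertNotSupported extends RuntimeException {
        ColumnConvertNotSupported(String column, String physicalType, String logicalType) {
            super("column: " + column + ", physical type: " + physicalType
                + ", logical type: " + logicalType);
        }
    }

    // Hypothetical column reader that only understands INT32 data.
    static int readIntColumn(String column, String physicalType) {
        if (!"INT32".equals(physicalType)) {
            throw new ColumnConvertNotSupported(column, physicalType, "int");
        }
        return 0;
    }

    // The scan layer knows the file path, so it adds it while keeping the cause.
    static int scanFile(String path, String column, String physicalType) {
        try {
            return readIntColumn(column, physicalType);
        } catch (ColumnConvertNotSupported e) {
            throw new IllegalStateException(
                "Error while reading file " + path + ". " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        try {
            scanFile("/data/part-0.parquet", "id", "BINARY");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Keeping the original exception as the cause preserves the column-level detail while the outer message tells the user which file to inspect.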
Daniel Sakuma	6ade5cbb49	[MINOR][DOC] Fix some typos and grammar issues ## What changes were proposed in this pull request? Easy fix in the documentation. ## How was this patch tested? N/A Closes #20948 Author: Daniel Sakuma <dsakuma@gmail.com> Closes #20928 from dsakuma/fix_typo_configuration_docs.	2018-04-06 13:37:08 +08:00
Gengliang Wang	249007e37f	[SPARK-19724][SQL] create a managed table with an existed default table should throw an exception ## What changes were proposed in this pull request? This PR is to finish https://github.com/apache/spark/pull/17272. This JIRA is follow-up work after SPARK-19583. As we discussed in that PR, the following DDL for a managed table with an existing default location should throw an exception: CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ... CREATE TABLE ... (PARTITIONED BY ...) Currently there are some situations which are not consistent with the above logic: CREATE TABLE ... (PARTITIONED BY ...) succeeds with an existing default location; situation: for both hive/datasource (with HiveExternalCatalog/InMemoryCatalog). CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ... situation: hive table succeeds with an existing default location. This PR makes the above two situations consistent with the logic that an exception should be thrown when the default location already exists. ## How was this patch tested? unit test added Author: Gengliang Wang <gengliang.wang@databricks.com> Closes #20886 from gengliangwang/pr-17272.	2018-04-05 20:19:25 -07:00
JiahuiJiang	d65e531b44	[SPARK-23823][SQL] Keep origin in transformExpression Fixes https://issues.apache.org/jira/browse/SPARK-23823 Keep origin for all the methods using transformExpression ## What changes were proposed in this pull request? Keep origin in transformExpression ## How was this patch tested? Manually tested that this fixes https://issues.apache.org/jira/browse/SPARK-23823 and columns have correct origins after Analyzer.analyze Author: JiahuiJiang <jjiang@palantir.com> Author: Jiahui Jiang <jjiang@palantir.com> Closes #20961 from JiahuiJiang/jj/keep-origin.	2018-04-05 20:06:08 -07:00
Kazuaki Ishizaki	4807d381bb	[SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks to choose several types of memory block ## What changes were proposed in this pull request? This PR allows us to use one of several types of `MemoryBlock`, such as byte array, int array, long array, or `java.nio.DirectByteBuffer`. Using `java.nio.DirectByteBuffer` provides off-heap memory that is automatically deallocated by the JVM. The `MemoryBlock` class has primitive accessors like `Platform.getInt()`, `Platform.putInt()`, or `Platform.copyMemory()`. This PR uses `MemoryBlock` for `OffHeapColumnVector`, `UTF8String`, and other places. This PR can improve performance of operations involving memory accesses (e.g. `UTF8String.trim`) by 1.8x. For now, this PR does not use `MemoryBlock` for `BufferHolder` based on cloud-fan's [suggestion](https://github.com/apache/spark/pull/11494#issuecomment-309694290). Since this PR is a successor of #11494, close #11494. Much of the code was ported from #11494, and a lot of effort went into it. I think this PR should give credit to yzotov. This PR can achieve 1.1-1.4x performance improvements for operations in `UTF8String` or `Murmur3_x86_32`. Other operations show almost comparable performance. 
Without this PR ``` OpenJDK 64-Bit Server VM 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13 on Linux 4.4.0-22-generic Intel(R) Xeon(R) CPU E5-2667 v3 3.20GHz OpenJDK 64-Bit Server VM 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13 on Linux 4.4.0-22-generic Intel(R) Xeon(R) CPU E5-2667 v3 3.20GHz Hash byte arrays with length 268435487: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Murmur3_x86_32 526 / 536 0.0 131399881.5 1.0X UTF8String benchmark: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ hashCode 525 / 552 1022.6 1.0 1.0X substring 414 / 423 1298.0 0.8 1.3X ``` With this PR ``` OpenJDK 64-Bit Server VM 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13 on Linux 4.4.0-22-generic Intel(R) Xeon(R) CPU E5-2667 v3 3.20GHz Hash byte arrays with length 268435487: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Murmur3_x86_32 474 / 488 0.0 118552232.0 1.0X UTF8String benchmark: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ hashCode 476 / 480 1127.3 0.9 1.0X substring 287 / 291 1869.9 0.5 1.7X ``` Benchmark program ``` test("benchmark Murmur3_x86_32") { val length = 8192 * 32768 + 31 val seed = 42L val iters = 1 << 2 val random = new Random(seed) val arrays = Array.fill[MemoryBlock](numArrays) { val bytes = new Array[Byte](length) random.nextBytes(bytes) new ByteArrayMemoryBlock(bytes, Platform.BYTE_ARRAY_OFFSET, length) } val benchmark = new Benchmark("Hash byte arrays with length " + length, iters * numArrays, minNumIters = 20) benchmark.addCase("HiveHasher") { _: Int => var sum = 0L for (_ <- 0L until iters) { sum += HiveHasher.hashUnsafeBytesBlock( arrays(i), Platform.BYTE_ARRAY_OFFSET, length) } } 
benchmark.run() } test("benchmark UTF8String") { val N = 512 * 1024 * 1024 val iters = 2 val benchmark = new Benchmark("UTF8String benchmark", N, minNumIters = 20) val str0 = new java.io.StringWriter() { { for (i <- 0 until N) { write(" ") } } }.toString val s0 = UTF8String.fromString(str0) benchmark.addCase("hashCode") { _: Int => var h: Int = 0 for (_ <- 0L until iters) { h += s0.hashCode } } benchmark.addCase("substring") { _: Int => var s: UTF8String = null for (_ <- 0L until iters) { s = s0.substring(N / 2 - 5, N / 2 + 5) } } benchmark.run() } ``` I run [this benchmark program](https://gist.github.com/kiszk/94f75b506c93a663bbbc372ffe8f05de) using [the commit](`ee5a79861c`). I got the following results: ``` OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 on Linux 4.4.0-66-generic Intel(R) Xeon(R) CPU E5-2667 v3 3.20GHz Memory access benchmarks: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ ByteArrayMemoryBlock get/putInt() 220 / 221 609.3 1.6 1.0X Platform get/putInt(byte[]) 220 / 236 610.9 1.6 1.0X Platform get/putInt(Object) 492 / 494 272.8 3.7 0.4X OnHeapMemoryBlock get/putLong() 322 / 323 416.5 2.4 0.7X long[] 221 / 221 608.0 1.6 1.0X Platform get/putLong(long[]) 321 / 321 418.7 2.4 0.7X Platform get/putLong(Object) 561 / 563 239.2 4.2 0.4X ``` I also run [this benchmark program](https://gist.github.com/kiszk/5fdb4e03733a5d110421177e289d1fb5) for comparing performance of `Platform.copyMemory()`. 
``` OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 on Linux 4.4.0-66-generic Intel(R) Xeon(R) CPU E5-2667 v3 3.20GHz Platform copyMemory: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Object to Object 1961 / 1967 8.6 116.9 1.0X System.arraycopy Object to Object 1917 / 1921 8.8 114.3 1.0X byte array to byte array 1961 / 1968 8.6 116.9 1.0X System.arraycopy byte array to byte array 1909 / 1937 8.8 113.8 1.0X int array to int array 1921 / 1990 8.7 114.5 1.0X double array to double array 1918 / 1923 8.7 114.3 1.0X Object to byte array 1961 / 1967 8.6 116.9 1.0X Object to short array 1965 / 1972 8.5 117.1 1.0X Object to int array 1910 / 1915 8.8 113.9 1.0X Object to float array 1971 / 1978 8.5 117.5 1.0X Object to double array 1919 / 1944 8.7 114.4 1.0X byte array to Object 1959 / 1967 8.6 116.8 1.0X int array to Object 1961 / 1970 8.6 116.9 1.0X double array to Object 1917 / 1924 8.8 114.3 1.0X ``` These results show five facts: 1. According to the second/third or sixth/seventh results in the first experiment, if we use `Platform.get/putInt(Object)`, we achieve more than 2x worse performance than `Platform.get/putInt(byte[])` with concrete type (i.e. `byte[]`). 2. According to the second/third or fourth/fifth/sixth results in the first experiment, the fastest way to access an array element on Java heap is `array[]`. The drawback of `array[]` is that it cannot support unaligned 8-byte access. 3. According to the first/second/third or fourth/sixth/seventh results in the first experiment, `getInt()/putInt() or getLong()/putLong()` in subclasses of `MemoryBlock` can achieve comparable performance to `Platform.get/putInt()` or `Platform.get/putLong()` with concrete type (second or sixth result). There is no overhead regarding virtual call. 4. 
According to the results in the second experiment, for `Platform.copy()`, passing `Object` achieves the same performance as passing any type of primitive array as source or destination. 5. According to the second/fourth results in the second experiment, `Platform.copy()` can achieve the same performance as `System.arraycopy`. It would be good to use `Platform.copy()` since `Platform.copy()` can take any types for src and dst. We are incrementally replacing `Platform.get/putXXX` with `MemoryBlock.get/putXXX`, because this has two advantages: 1) better performance due to having a concrete type for an array, and 2) a simple OO design instead of passing `Object`. It is easy to use `MemoryBlock` in `InternalRow`, `BufferHolder`, `TaskMemoryManager`, and others that are already abstracted. It is not easy to use `MemoryBlock` in utility classes related to hashing or others. Other candidates are - UnsafeRow, UnsafeArrayData, UnsafeMapData, SpecificUnsafeRowJoiner - UTF8StringBuffer - BufferHolder - TaskMemoryManager - OnHeapColumnVector - BytesToBytesMap - CachedBatch - classes for hash - others. ## How was this patch tested? Added `UnsafeMemoryAllocator` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19222 from kiszk/SPARK-10399.	2018-04-06 10:13:59 +08:00
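The design idea argued for in this commit, typed accessors on a block abstraction with one subclass per backing store instead of passing `Object` around, can be sketched in plain Java as follows. This is a simplified illustration under stated assumptions: the `Block`, `ByteArrayBlock`, and `IntArrayBlock` names are hypothetical, the byte-backed variant uses `java.nio.ByteBuffer` rather than `Platform`, and none of it mirrors Spark's actual `MemoryBlock` API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class MemoryBlockSketch {
    // Hypothetical abstraction: callers use typed accessors with byte offsets.
    abstract static class Block {
        abstract int size();
        abstract int getInt(int offset);
        abstract void putInt(int offset, int value);
    }

    // Backed by a byte[]; any byte offset is allowed, aligned or not.
    static final class ByteArrayBlock extends Block {
        private final ByteBuffer buffer;

        ByteArrayBlock(byte[] bytes) {
            this.buffer = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        }

        @Override int size() { return buffer.capacity(); }
        @Override int getInt(int offset) { return buffer.getInt(offset); }
        @Override void putInt(int offset, int value) { buffer.putInt(offset, value); }
    }

    // Backed by an int[]; plain array indexing is simple, but it cannot
    // serve offsets that are not aligned to the element size.
    static final class IntArrayBlock extends Block {
        private final int[] ints;

        IntArrayBlock(int[] ints) { this.ints = ints; }

        private static int index(int offset) {
            if ((offset & 3) != 0) {
                throw new IllegalArgumentException("unaligned offset: " + offset);
            }
            return offset >>> 2;
        }

        @Override int size() { return ints.length * 4; }
        @Override int getInt(int offset) { return ints[index(offset)]; }
        @Override void putInt(int offset, int value) { ints[index(offset)] = value; }
    }

    public static void main(String[] args) {
        Block bytes = new ByteArrayBlock(new byte[16]);
        bytes.putInt(1, 42);                  // unaligned access works here
        System.out.println(bytes.getInt(1));  // prints 42

        Block ints = new IntArrayBlock(new int[4]);
        ints.putInt(4, 7);
        System.out.println(ints.getInt(4));   // prints 7
    }
}
```

The int-array variant makes the trade-off noted in the benchmark discussion concrete: direct array indexing is simple, but unaligned offsets have to be rejected.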
Liang-Chi Hsieh	d9ca1c906b	[SPARK-23593][SQL] Add interpreted execution for InitializeJavaBean expression ## What changes were proposed in this pull request? Add interpreted execution for `InitializeJavaBean` expression. ## How was this patch tested? Added unit test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #20985 from viirya/SPARK-23593-2.	2018-04-05 20:43:05 +02:00
Herman van Hovell	b2329fb1fc	Revert "[SPARK-23593][SQL] Add interpreted execution for InitializeJavaBean expression" This reverts commit `c5c8b54404`.	2018-04-05 13:57:41 +02:00
Kazuaki Ishizaki	1822ecda51	[SPARK-23582][SQL] StaticInvoke should support interpreted execution ## What changes were proposed in this pull request? This pr added interpreted execution for `StaticInvoke`. ## How was this patch tested? Added tests in `ObjectExpressionsSuite`. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #20753 from kiszk/SPARK-23582.	2018-04-05 13:47:06 +02:00
Liang-Chi Hsieh	c5c8b54404	[SPARK-23593][SQL] Add interpreted execution for InitializeJavaBean expression ## What changes were proposed in this pull request? Add interpreted execution for `InitializeJavaBean` expression. ## How was this patch tested? Added unit test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #20756 from viirya/SPARK-23593.	2018-04-05 13:39:45 +02:00
Gengliang Wang	d8379e5bc3	[SPARK-23838][WEBUI] Running SQL query is displayed as "completed" in SQL tab ## What changes were proposed in this pull request? A running SQL query would appear as completed in the Spark UI: ![image1](https://user-images.githubusercontent.com/1097932/38170733-3d7cb00c-35bf-11e8-994c-43f2d4fa285d.png) We can see the query in "Completed queries", while in the job page we see it's still running Job 132. ![image2](https://user-images.githubusercontent.com/1097932/38170735-48f2c714-35bf-11e8-8a41-6fae23543c46.png) After some time the query still appears in "Completed queries" (while it's still running), but the "Duration" gets increased. ![image3](https://user-images.githubusercontent.com/1097932/38170737-50f87ea4-35bf-11e8-8b60-000f6f918964.png) To reproduce, we can run a query with multiple jobs. E.g. Run TPCDS q6. The reason is that updates from executions are written into kvstore periodically, and the job start event may be missed. ## How was this patch tested? Manually run the job again and check the SQL Tab. The fix is pretty simple. Author: Gengliang Wang <gengliang.wang@databricks.com> Closes #20955 from gengliangwang/jobCompleted.	2018-04-04 15:43:58 -07:00
Kazuaki Ishizaki	a35523653c	[SPARK-23583][SQL] Invoke should support interpreted execution ## What changes were proposed in this pull request? This pr added interpreted execution for `Invoke`. ## How was this patch tested? Added tests in `ObjectExpressionsSuite`. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #20797 from kiszk/SPARK-28583.	2018-04-04 18:36:15 +02:00
Takeshi Yamamuro	5197562afe	[SPARK-21351][SQL] Update nullability based on children's output ## What changes were proposed in this pull request? This pr added a new optimizer rule `UpdateNullabilityInAttributeReferences` to update the nullability that `Filter` changes when having `IsNotNull`. In the master, optimized plans do not respect the nullability when `Filter` has `IsNotNull`. This wrongly generates unnecessary code. For example: ``` scala> val df = Seq((Some(1), Some(2))).toDF("a", "b") scala> val bIsNotNull = df.where($"b" =!= 2).select($"b") scala> val targetQuery = bIsNotNull.distinct scala> targetQuery.queryExecution.optimizedPlan.output(0).nullable res5: Boolean = true scala> targetQuery.debugCodegen Found 2 WholeStageCodegen subtrees. == Subtree 1 / 2 == HashAggregate(keys=[b#19], functions=[], output=[b#19]) +- Exchange hashpartitioning(b#19, 200) +- HashAggregate(keys=[b#19], functions=[], output=[b#19]) +- Project [_2#16 AS b#19] +- Filter isnotnull(_2#16) +- LocalTableScan [_1#15, _2#16] Generated code: ... /* 124 */ protected void processNext() throws java.io.IOException { ... /* 132 */ // output the result /* 133 */ /* 134 */ while (agg_mapIter.next()) { /* 135 */ wholestagecodegen_numOutputRows.add(1); /* 136 */ UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey(); /* 137 */ UnsafeRow agg_aggBuffer = (UnsafeRow) agg_mapIter.getValue(); /* 138 */ /* 139 */ boolean agg_isNull4 = agg_aggKey.isNullAt(0); /* 140 */ int agg_value4 = agg_isNull4 ? -1 : (agg_aggKey.getInt(0)); /* 141 */ agg_rowWriter1.zeroOutNullBytes(); /* 142 */ // We don't need this NULL check because NULL is filtered out in `$"b" =!=2` /* 143 */ if (agg_isNull4) { /* 144 */ agg_rowWriter1.setNullAt(0); /* 145 */ } else { /* 146 */ agg_rowWriter1.write(0, agg_value4); /* 147 */ } /* 148 */ append(agg_result1); /* 149 */ /* 150 */ if (shouldStop()) return; /* 151 */ } /* 152 */ /* 153 */ agg_mapIter.close(); /* 154 */ if (agg_sorter == null) { /* 155 */ agg_hashMap.free(); /* 156 */ } /* 157 */ } /* 158 */ /* 159 */ } ``` On line 143, we don't need this NULL check because NULL is filtered out in `$"b" =!=2`. This pr removes this NULL check: ``` scala> targetQuery.queryExecution.optimizedPlan.output(0).nullable res5: Boolean = false scala> targetQuery.debugCodegen ... Generated code: ... /* 144 */ protected void processNext() throws java.io.IOException { ... /* 152 */ // output the result /* 153 */ /* 154 */ while (agg_mapIter.next()) { /* 155 */ wholestagecodegen_numOutputRows.add(1); /* 156 */ UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey(); /* 157 */ UnsafeRow agg_aggBuffer = (UnsafeRow) agg_mapIter.getValue(); /* 158 */ /* 159 */ int agg_value4 = agg_aggKey.getInt(0); /* 160 */ agg_rowWriter1.write(0, agg_value4); /* 161 */ append(agg_result1); /* 162 */ /* 163 */ if (shouldStop()) return; /* 164 */ } /* 165 */ /* 166 */ agg_mapIter.close(); /* 167 */ if (agg_sorter == null) { /* 168 */ agg_hashMap.free(); /* 169 */ } /* 170 */ } ``` ## How was this patch tested? Added `UpdateNullabilityInAttributeReferencesSuite` for unit tests. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #18576 from maropu/SPARK-21351.	2018-04-04 14:39:19 +08:00
gatorsmile	16ef6baa36	[SPARK-23826][TEST] TestHiveSparkSession should set default session ## What changes were proposed in this pull request? In TestHive, the base spark session does this in getOrCreate(), we emulate that behavior for tests. ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #20969 from gatorsmile/setDefault.	2018-04-04 14:31:03 +08:00
Robert Kruszewski	5cfd5fabcd	[SPARK-23802][SQL] PropagateEmptyRelation can leave query plan in unresolved state ## What changes were proposed in this pull request? Add casts to the nulls introduced by PropagateEmptyRelation so that, when they are part of a coalesce, they do not break its type-checking rules. ## How was this patch tested? Added unit test Author: Robert Kruszewski <robertk@palantir.com> Closes #20914 from robert3005/rk/propagate-empty-fix.	2018-04-03 17:25:54 -07:00
Eric Liang	359375eff7	[SPARK-23809][SQL] Active SparkSession should be set by getOrCreate ## What changes were proposed in this pull request? Currently, the active spark session is set inconsistently (e.g., in createDataFrame, prior to query execution). Many places in spark also incorrectly query active session when they should be calling activeSession.getOrElse(defaultSession) and so might get None even if a Spark session exists. The semantics here can be cleaned up if we also set the active session when the default session is set. Related: https://github.com/apache/spark/pull/20926/files ## How was this patch tested? Unit test, existing test. Note that if https://github.com/apache/spark/pull/20926 merges first we should also update the tests there. Author: Eric Liang <ekl@databricks.com> Closes #20927 from ericl/active-session-cleanup.	2018-04-03 17:09:12 -07:00
Liang-Chi Hsieh	1035aaa617	[SPARK-23587][SQL] Add interpreted execution for MapObjects expression ## What changes were proposed in this pull request? Add interpreted execution for `MapObjects` expression. ## How was this patch tested? Added unit test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #20771 from viirya/SPARK-23587.	2018-04-04 01:36:58 +02:00
Jose Torres	66a3a5a2dc	[SPARK-23099][SS] Migrate foreach sink to DataSourceV2 ## What changes were proposed in this pull request? Migrate foreach sink to DataSourceV2. Since the previous attempt at this PR #20552, we've changed and strictly defined the lifecycle of writer components. This means we no longer need the complicated lifecycle shim from that PR; it just naturally works. ## How was this patch tested? existing tests Author: Jose Torres <torres.joseph.f+github@gmail.com> Closes #20951 from jose-torres/foreach.	2018-04-03 11:05:29 -07:00
Kazuaki Ishizaki	a7c19d9c21	[SPARK-23713][SQL] Cleanup UnsafeWriter and BufferHolder classes ## What changes were proposed in this pull request? This PR implemented the following cleanups related to `UnsafeWriter` class: - Remove code duplication between `UnsafeRowWriter` and `UnsafeArrayWriter` - Make `BufferHolder` class internal by delegating its accessor methods to `UnsafeWriter` - Replace `UnsafeRow.setTotalSize(...)` with `UnsafeRowWriter.setTotalSize()` ## How was this patch tested? Tested by existing UTs Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #20850 from kiszk/SPARK-23713.	2018-04-02 21:48:44 +02:00
Tathagata Das	15298b99ac	[SPARK-23827][SS] StreamingJoinExec should ensure that input data is partitioned into specific number of partitions ## What changes were proposed in this pull request? Currently, the requiredChildDistribution does not specify the partitions. This can cause the weird corner cases where the child's distribution is `SinglePartition` which satisfies the required distribution of `ClusterDistribution(no-num-partition-requirement)`, thus eliminating the shuffle needed to repartition input data into the required number of partitions (i.e. same as state stores). That can lead to "file not found" errors on the state store delta files as the micro-batch-with-no-shuffle will not run certain tasks and therefore not generate the expected state store delta files. This PR adds the required constraint on the number of partitions. ## How was this patch tested? Modified test harness to always check that ANY stateful operator should have a constraint on the number of partitions. As part of that, the existing opt-in checks on child output partitioning were removed, as they are redundant. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #20941 from tdas/SPARK-23827.	2018-03-30 16:48:26 -07:00
gatorsmile	bc8d093117	[SPARK-23500][SQL][FOLLOWUP] Fix complex type simplification rules to apply to entire plan ## What changes were proposed in this pull request? This PR is to improve the test coverage of the original PR https://github.com/apache/spark/pull/20687 ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #20911 from gatorsmile/addTests.	2018-03-30 23:21:07 +08:00
Jose Torres	5b5a36ed6d	Roll forward "[SPARK-23096][SS] Migrate rate source to V2" ## What changes were proposed in this pull request? Roll forward `c68ec4e` (#20688). There are two minor test changes required: * An error which used to be TreeNodeException[ArithmeticException] is no longer wrapped and is now just ArithmeticException. * The test framework simply does not set the active Spark session. (Or rather, it doesn't do so early enough - I think it only happens when a query is analyzed.) I've added the required logic to SQLTestUtils. ## How was this patch tested? existing tests Author: Jose Torres <torres.joseph.f+github@gmail.com> Author: jerryshao <sshao@hortonworks.com> Closes #20922 from jose-torres/ratefix.	2018-03-30 21:54:26 +08:00
yucai	b02e76cbff	[SPARK-23727][SQL] Support for pushing down filters for DateType in parquet ## What changes were proposed in this pull request? This PR adds support for pushing down filters for DateType in Parquet. ## How was this patch tested? Added UTs and tested locally. Author: yucai <yyu1@ebay.com> Closes #20851 from yucai/SPARK-23727.	2018-03-30 15:07:38 +08:00
Jongyoul Lee	df05fb63ab	[SPARK-23743][SQL] Changed a comparison logic from containing 'slf4j' to starting with 'org.slf4j' ## What changes were proposed in this pull request? `isSharedClass` returns whether a class can or should be shared. It checks whether the class name contains certain keywords or starts with certain prefixes. With this logic, unintended behavior can occur when a custom package has `slf4j` inside the package or class name. The original intention was presumably to identify classes under `org.slf4j`, so it would be better to change the comparison logic to `name.startsWith("org.slf4j")`. ## How was this patch tested? This patch should pass all of the current tests and keep all of the current behaviors. In my case, I'm using ProtobufDeserializer to get a table schema from Hive tables, so some Protobuf packages and names have `slf4j` inside. Without this patch, they cannot be resolved because of a ClassCastException from different classloaders. Author: Jongyoul Lee <jongyoul@gmail.com> Closes #20860 from jongyoul/SPARK-23743.	2018-03-30 14:07:35 +08:00
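The difference between the two comparisons in SPARK-23743 can be seen in a minimal sketch. This is not Spark's `isSharedClass` itself; the class `SharedClassCheck`, its method names, and the `com.example.proto.slf4j.Message` class name are hypothetical and only illustrate the `contains` versus `startsWith` behavior the commit describes.

```java
final class SharedClassCheck {
    // Older style of check: any class name that merely contains "slf4j" matches.
    static boolean matchesByContains(String name) {
        return name.contains("slf4j");
    }

    // Check proposed in the commit: only names that start with "org.slf4j" match.
    static boolean matchesByPrefix(String name) {
        return name.startsWith("org.slf4j");
    }

    public static void main(String[] args) {
        String logger = "org.slf4j.Logger";
        // Hypothetical user class whose package happens to include "slf4j".
        String custom = "com.example.proto.slf4j.Message";

        System.out.println(matchesByContains(logger)); // true
        System.out.println(matchesByContains(custom)); // true (unintended match)
        System.out.println(matchesByPrefix(logger));   // true
        System.out.println(matchesByPrefix(custom));   // false
    }
}
```

With the prefix check, the hypothetical user class is no longer treated as a shared class, which is the situation the ClassCastException report describes.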
Jose Torres	b348901192	[SPARK-23808][SQL] Set default Spark session in test-only spark sessions. ## What changes were proposed in this pull request? Set default Spark session in the TestSparkSession and TestHiveSparkSession constructors. ## How was this patch tested? new unit tests Author: Jose Torres <torres.joseph.f+github@gmail.com> Closes #20926 from jose-torres/test3.	2018-03-29 21:36:56 -07:00
Kent Yao	a7755fd8ce	[SPARK-23639][SQL] Obtain token before init metastore client in SparkSQL CLI ## What changes were proposed in this pull request? In SparkSQLCLI, SessionState is created before SparkContext is instantiated. When we use --proxy-user to impersonate, it is unable to initialize a metastore client to talk to the secured metastore because there is no Kerberos ticket. This PR uses the real user's UGI to obtain a token for the owner before talking to the kerberized metastore. ## How was this patch tested? Manually verified with a kerberized Hive metastore / HDFS. Author: Kent Yao <yaooqinn@hotmail.com> Closes #20784 from yaooqinn/SPARK-23639.	2018-03-29 10:46:28 -07:00
gatorsmile	761565a3cc	Revert "[SPARK-23096][SS] Migrate rate source to V2" This reverts commit `c68ec4e6a1`.	2018-03-28 09:11:52 -07:00
hyukjinkwon	34c4b9c57e	[SPARK-23765][SQL] Supports custom line separator for json datasource ## What changes were proposed in this pull request? This PR proposes to add a `lineSep` option for a configurable line separator in text datasource. It supports this option by using `LineRecordReader`'s functionality, passing the separator to its constructor. The approach is similar to that of https://github.com/apache/spark/pull/20727; however, one main difference is that it uses the text datasource's `lineSep` option to parse line by line in JSON's schema inference. ## How was this patch tested? Manually tested and unit tests were added. Author: hyukjinkwon <gurwls223@apache.org> Author: hyukjinkwon <gurwls223@gmail.com> Closes #20877 from HyukjinKwon/linesep-json.	2018-03-28 19:49:27 +08:00
jerryshao	c68ec4e6a1	[SPARK-23096][SS] Migrate rate source to V2 ## What changes were proposed in this pull request? This PR migrates the micro-batch rate source to the V2 API and rewrites the UTs to suit V2 tests. ## How was this patch tested? UTs. Author: jerryshao <sshao@hortonworks.com> Closes #20688 from jerryshao/SPARK-23096.	2018-03-27 14:39:05 -07:00
Liang-Chi Hsieh	35997b59f3	[SPARK-23794][SQL] Make UUID as stateful expression ## What changes were proposed in this pull request? The UUID() expression is stateful and should implement the `Stateful` trait instead of the `Nondeterministic` trait. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #20912 from viirya/SPARK-23794.	2018-03-27 14:49:50 +02:00
Kazuaki Ishizaki	e4bec7cb88	[SPARK-23549][SQL] Cast to timestamp when comparing timestamp with date ## What changes were proposed in this pull request? This PR fixes an incorrect comparison in SQL between timestamp and date. The comparison is incorrect because both values are cast to `string` and then compared lexicographically. The current implementation returns `false` for the query `spark.sql("select cast('2017-03-01 00:00:00' as timestamp) between cast('2017-02-28' as date) and cast('2017-03-01' as date)").show`. This PR returns `true` for this query by casting `date("2017-03-01")` to `timestamp("2017-03-01 00:00:00")`. ## How was this patch tested? Added new UTs to `TypeCoercionSuite`. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #20774 from kiszk/SPARK-23549.	2018-03-25 16:38:49 -07:00
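The lexicographic pitfall described in SPARK-23549 can be reproduced outside Spark. The sketch below is plain Java and does not use Spark's type coercion; the class `TimestampDateComparison` and its method names are hypothetical, and `java.time` stands in for Spark's timestamp and date types.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;

final class TimestampDateComparison {
    // Compares the values as strings, mimicking a comparison after casting both sides to string.
    static boolean betweenAsStrings(String value, String lower, String upper) {
        return value.compareTo(lower) >= 0 && value.compareTo(upper) <= 0;
    }

    // Promotes the date bounds to timestamps at midnight before comparing.
    static boolean betweenAsTimestamps(LocalDateTime value, LocalDate lower, LocalDate upper) {
        LocalDateTime lo = lower.atStartOfDay();
        LocalDateTime hi = upper.atStartOfDay();
        return !value.isBefore(lo) && !value.isAfter(hi);
    }

    public static void main(String[] args) {
        // "2017-03-01 00:00:00" sorts after "2017-03-01" because it is a longer string
        // with the same prefix, so the string comparison rejects the upper bound.
        System.out.println(betweenAsStrings("2017-03-01 00:00:00", "2017-02-28", "2017-03-01")); // false

        System.out.println(betweenAsTimestamps(
            LocalDateTime.of(2017, 3, 1, 0, 0, 0),
            LocalDate.of(2017, 2, 28),
            LocalDate.of(2017, 3, 1))); // true
    }
}
```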
Takeshi Yamamuro	5f653d4f7c	[SPARK-23167][SQL] Add TPCDS queries v2.7 in TPCDSQuerySuite ## What changes were proposed in this pull request? This PR adds TPCDS v2.7 (latest) queries to `TPCDSQuerySuite` because the current `TPCDSQuerySuite` tests an older version (v1.4) and some queries differ between v1.4 and v2.7. Since the original v2.7 queries use syntax that Spark cannot parse, I changed these queries in the following way: - [date] + 14 days -> date + `INTERVAL` 14 days - [column name] as "30 days" -> [column name] as \`30 days\` - Fix some syntax errors, e.g., missing brackets ## How was this patch tested? Added tests in `TPCDSQuerySuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #20343 from maropu/TPCDSV2_7.	2018-03-25 09:18:26 -07:00
Jose Torres	816a5496ba	[SPARK-23788][SS] Fix race in StreamingQuerySuite ## What changes were proposed in this pull request? The serializability test uses the same MemoryStream instance for 3 different queries. If any of those queries asks it to commit before the others have run, the rest will see empty dataframes. This can fail the test if q3 is affected. We should use one instance per query instead. ## How was this patch tested? Existing unit test. If I move q2.processAllAvailable() before starting q3, the test always fails without the fix. Author: Jose Torres <torres.joseph.f+github@gmail.com> Closes #20896 from jose-torres/fixrace.	2018-03-24 18:21:01 -07:00
Liang-Chi Hsieh	b2edc30db1	[SPARK-23614][SQL] Fix incorrect reuse exchange when caching is used ## What changes were proposed in this pull request? We should provide customized canonicalize plan for `InMemoryRelation` and `InMemoryTableScanExec`. Otherwise, we can wrongly treat two different cached plans as same result. It causes wrongly reused exchange then. For a test query like this: ```scala val cached = spark.createDataset(Seq(TestDataUnion(1, 2, 3), TestDataUnion(4, 5, 6))).cache() val group1 = cached.groupBy("x").agg(min(col("y")) as "value") val group2 = cached.groupBy("x").agg(min(col("z")) as "value") group1.union(group2) ``` Canonicalized plans before: First exchange: ``` Exchange hashpartitioning(none#0, 5) +- (1) HashAggregate(keys=[none#0], functions=[partial_min(none#1)], output=[none#0, none#4]) +- (1) InMemoryTableScan [none#0, none#1] +- InMemoryRelation [x#4253, y#4254, z#4255], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas) +- LocalTableScan [x#4253, y#4254, z#4255] ``` Second exchange: ``` Exchange hashpartitioning(none#0, 5) +- (3) HashAggregate(keys=[none#0], functions=[partial_min(none#1)], output=[none#0, none#4]) +- (3) InMemoryTableScan [none#0, none#1] +- InMemoryRelation [x#4253, y#4254, z#4255], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas) +- LocalTableScan [x#4253, y#4254, z#4255] ``` You can find that they have the canonicalized plans are the same, although we use different columns in two `InMemoryTableScan`s. 
Canonicalized plan after: First exchange: ``` Exchange hashpartitioning(none#0, 5) +- (1) HashAggregate(keys=[none#0], functions=[partial_min(none#1)], output=[none#0, none#4]) +- (1) InMemoryTableScan [none#0, none#1] +- InMemoryRelation [none#0, none#1, none#2], true, 10000, StorageLevel(memory, 1 replicas) +- LocalTableScan [none#0, none#1, none#2] ``` Second exchange: ``` Exchange hashpartitioning(none#0, 5) +- (3) HashAggregate(keys=[none#0], functions=[partial_min(none#1)], output=[none#0, none#4]) +- (3) InMemoryTableScan [none#0, none#2] +- InMemoryRelation [none#0, none#1, none#2], true, 10000, StorageLevel(memory, 1 replicas) +- LocalTableScan [none#0, none#1, none#2] ``` ## How was this patch tested? Added unit test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #20831 from viirya/SPARK-23614.	2018-03-22 21:23:25 -07:00
Liang-Chi Hsieh	4d37008c78	[SPARK-23599][SQL] Use RandomUUIDGenerator in Uuid expression ## What changes were proposed in this pull request? As stated in Jira, there are problems with current `Uuid` expression which uses `java.util.UUID.randomUUID` for UUID generation. This patch uses the newly added `RandomUUIDGenerator` for UUID generation. So we can make `Uuid` deterministic between retries. ## How was this patch tested? Added unit tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #20861 from viirya/SPARK-23599-2.	2018-03-22 19:57:32 +01:00
Dilip Biswal	5c9eaa6b58	[SPARK-23372][SQL] Writing empty struct in parquet fails during execution. It should fail earlier in the processing. ## What changes were proposed in this pull request? Currently we allow writing data frames with empty schema into a file based datasource for certain file formats such as JSON, ORC etc. For formats such as Parquet and Text, we raise error at different times of execution. For text format, we return error from the driver early on in processing where as for format such as parquet, the error is raised from executor. Example spark.emptyDataFrame.write.format("parquet").mode("overwrite").save(path) Results in ``` SQL org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: message spark_schema { } at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:27) at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:37) at org.apache.parquet.schema.MessageType.accept(MessageType.java:58) at org.apache.parquet.schema.TypeUtil.checkValidWriteSchema(TypeUtil.java:23) at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:225) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278) at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:276) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:281) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:206) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:205) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread. ``` In this PR, we unify the error processing and raise error on attempt to write empty schema based dataframes into file based datasource (orc, parquet, text , csv, json etc) early on in the processing. ## How was this patch tested? Unit tests added in FileBasedDatasourceSuite. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #20579 from dilipbiswal/spark-23372.	2018-03-21 21:49:02 -07:00
Kris Mok	95e51ff849	[SPARK-23760][SQL] CodegenContext.withSubExprEliminationExprs should save/restore CSE state correctly ## What changes were proposed in this pull request? Fixed `CodegenContext.withSubExprEliminationExprs()` so that it saves/restores CSE state correctly. ## How was this patch tested? Added new unit test to verify that the old CSE state is indeed saved and restored around the `withSubExprEliminationExprs()` call. Manually verified that this test fails without this patch. Author: Kris Mok <kris.mok@databricks.com> Closes #20870 from rednaxelafx/codegen-subexpr-fix.	2018-03-21 21:21:36 -07:00
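The save/restore pattern that SPARK-23760 describes can be sketched generically. This is not the actual `CodegenContext` code; `ScopedStateContext`, `withState`, and the string-keyed map are hypothetical stand-ins for the CSE state, and the `try`/`finally` is one possible way to guarantee restoration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

final class ScopedStateContext {
    // Stand-in for the subexpression-elimination state held by a codegen context.
    private Map<String, String> state = new HashMap<>();

    Map<String, String> currentState() {
        return state;
    }

    // Runs `body` with `temporary` installed as the current state,
    // then restores the previous state even if `body` throws.
    <T> T withState(Map<String, String> temporary, Supplier<T> body) {
        Map<String, String> saved = state;
        state = temporary;
        try {
            return body.get();
        } finally {
            state = saved;
        }
    }

    public static void main(String[] args) {
        ScopedStateContext ctx = new ScopedStateContext();
        Map<String, String> before = ctx.currentState();

        Map<String, String> temporary = new HashMap<>();
        temporary.put("a + b", "subExpr_0");

        String seen = ctx.withState(temporary, () -> ctx.currentState().get("a + b"));
        System.out.println(seen);                         // subExpr_0
        System.out.println(ctx.currentState() == before); // true
    }
}
```

A unit test in the spirit of the one the commit mentions would assert that the state object observed after the call is the same one that was installed before it.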
Gabor Somogyi	918c7e99af	[SPARK-23288][SS] Fix output metrics with parquet sink ## What changes were proposed in this pull request? Output metrics were not filled in when the parquet sink was used. This PR fixes the problem by passing a `BasicWriteJobStatsTracker` in `FileStreamSink`. ## How was this patch tested? Additional unit test added. Author: Gabor Somogyi <gabor.g.somogyi@gmail.com> Closes #20745 from gaborgsomogyi/SPARK-23288.	2018-03-21 10:06:26 -07:00
Takeshi Yamamuro	98d0ea3f60	[SPARK-23264][SQL] Fix scala.MatchError in literals.sql.out ## What changes were proposed in this pull request? To fix `scala.MatchError` in `literals.sql.out`, this pr added an entry for `CalendarIntervalType` in `QueryExecution.toHiveStructString`. ## How was this patch tested? Existing tests and added tests in `literals.sql` Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #20872 from maropu/FixIntervalTests.	2018-03-21 09:52:28 -07:00
hyukjinkwon	8d79113b81	[SPARK-23577][SQL] Supports custom line separator for text datasource ## What changes were proposed in this pull request? This PR proposes to add `lineSep` option for a configurable line separator in text datasource. It supports this option by using `LineRecordReader`'s functionality with passing it to the constructor. ## How was this patch tested? Manual tests and unit tests were added. Author: hyukjinkwon <gurwls223@gmail.com> Closes #20727 from HyukjinKwon/linesep-text.	2018-03-21 09:46:47 -07:00
Takeshi Yamamuro	983e8d9d64	[SPARK-23666][SQL] Do not display exprIds of Alias in user-facing info. ## What changes were proposed in this pull request? To drop `exprId`s for `Alias` in user-facing info., this pr added an entry for `Alias` in `NonSQLExpression.sql` ## How was this patch tested? Added tests in `UDFSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #20827 from maropu/SPARK-23666.	2018-03-20 23:17:49 -07:00
Henry Robinson	477d6bd726	[SPARK-23500][SQL] Fix complex type simplification rules to apply to entire plan ## What changes were proposed in this pull request? Complex type simplification optimizer rules were not applied to the entire plan, just the expressions reachable from the root node. This patch fixes the rules to transform the entire plan. ## How was this patch tested? New unit test + ran sql / core tests. Author: Henry Robinson <henry@apache.org> Author: Henry Robinson <henry@cloudera.com> Closes #20687 from henryr/spark-25000.	2018-03-20 13:27:50 -07:00
Jose Torres	2c4b9962fd	[SPARK-23574][SQL] Report SinglePartition in DataSourceV2ScanExec when there's exactly 1 data reader factory. ## What changes were proposed in this pull request? Report SinglePartition in DataSourceV2ScanExec when there's exactly 1 data reader factory. Note that this means reader factories end up being constructed as partitioning is checked; let me know if you think that could be a problem. ## How was this patch tested? existing unit tests Author: Jose Torres <jose@databricks.com> Author: Jose Torres <torres.joseph.f+github@gmail.com> Closes #20726 from jose-torres/SPARK-23574.	2018-03-20 11:46:51 -07:00
Liang-Chi Hsieh	4de638c197	[SPARK-23599][SQL] Add a UUID generator from Pseudo-Random Numbers ## What changes were proposed in this pull request? This patch adds a UUID generator from Pseudo-Random Numbers. We can use it later to have deterministic `UUID()` expression. ## How was this patch tested? Added unit tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #20817 from viirya/SPARK-23599.	2018-03-19 09:41:43 +01:00
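The idea behind SPARK-23599, producing UUIDs from a seeded pseudo-random source so that the same seed yields the same UUIDs, can be sketched as follows. This is not Spark's `RandomUUIDGenerator`; the class `SeededUuidGenerator` is hypothetical, and `java.util.Random` is an assumed stand-in for whatever generator Spark uses internally.

```java
import java.util.Random;
import java.util.UUID;

final class SeededUuidGenerator {
    private final Random random;

    SeededUuidGenerator(long seed) {
        this.random = new Random(seed);
    }

    // Builds a version 4 UUID from two pseudo-random longs.
    UUID next() {
        long mostSigBits = random.nextLong();
        long leastSigBits = random.nextLong();
        // Clear the four version bits and set them to 0100 (version 4).
        mostSigBits = (mostSigBits & 0xFFFFFFFFFFFF0FFFL) | 0x0000000000004000L;
        // Clear the two variant bits and set them to 10 (IETF variant).
        leastSigBits = (leastSigBits & 0x3FFFFFFFFFFFFFFFL) | 0x8000000000000000L;
        return new UUID(mostSigBits, leastSigBits);
    }

    public static void main(String[] args) {
        UUID first = new SeededUuidGenerator(42L).next();
        UUID second = new SeededUuidGenerator(42L).next();
        System.out.println(first.equals(second)); // true: same seed, same UUID
        System.out.println(first.version());      // 4
    }
}
```

Because the output depends only on the seed, a retried task that reuses its seed regenerates the same values, which is the property the later `Uuid` expression change relies on.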
Herman van Hovell	88d8de9260	[SPARK-23581][SQL] Add interpreted unsafe projection ## What changes were proposed in this pull request? We currently can only create unsafe rows using code generation. This is a problem for situations in which code generation fails. There is no fallback, and as a result we cannot execute the query. This PR adds an interpreted version of `UnsafeProjection`. The implementation is modeled after `InterpretedMutableProjection`. It stores the expression results in a `GenericInternalRow`, and it then uses a conversion function to convert the `GenericInternalRow` into an `UnsafeRow`. This PR does not implement the actual fallback logic from generated code to interpreted execution. This will be done in a follow-up. ## How was this patch tested? I am piggybacking on existing `UnsafeProjection` tests, and I have added an interpreted version for each of these. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #20750 from hvanhovell/SPARK-23581.	2018-03-16 18:28:16 +01:00
Dongjoon Hyun	5414abca4f	[SPARK-23553][TESTS] Tests should not assume the default value of `spark.sql.sources.default` ## What changes were proposed in this pull request? Currently, some tests assume that `spark.sql.sources.default=parquet`. That assumption is correct, but it makes it difficult to test new data source formats. This PR aims to - Make test suites more robust and make it easy to test new data sources in the future. - Test the new native ORC data source with the full existing Apache Spark test coverage. As an example, the PR uses `spark.sql.sources.default=orc` during reviews. The value should be `parquet` when this PR is accepted. ## How was this patch tested? Pass the Jenkins with updated tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #20705 from dongjoon-hyun/SPARK-23553.	2018-03-16 09:36:30 -07:00
myroslavlisniak	c2632edebd	[SPARK-23670][SQL] Fix memory leak on SparkPlanGraphWrapper Clean up SparkPlanGraphWrapper objects from InMemoryStore together with cleaning up SQLExecutionUIData existing unit test was extended to check also SparkPlanGraphWrapper object count vanzin Author: myroslavlisniak <acnipin@gmail.com> Closes #20813 from myroslavlisniak/master.	2018-03-15 17:20:59 -07:00
Yuming Wang	15c3c98300	[HOT-FIX] Fix SparkOutOfMemoryError: Unable to acquire 262144 bytes of memory, got 224631 ## What changes were proposed in this pull request? https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88263/testReport https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88260/testReport https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88257/testReport https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88224/testReport These test runs all failed with: ``` org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 262144 bytes of memory, got 224631 at org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:98) at org.apache.spark.unsafe.map.BytesToBytesMap.allocate(BytesToBytesMap.java:787) at org.apache.spark.unsafe.map.BytesToBytesMap.<init>(BytesToBytesMap.java:204) at org.apache.spark.unsafe.map.BytesToBytesMap.<init>(BytesToBytesMap.java:219) ... ``` This PR ignores the failing test. ## How was this patch tested? N/A Author: Yuming Wang <yumwang@ebay.com> Closes #20835 from wangyum/SPARK-23598.	2018-03-15 19:54:58 +01:00
Yuanjian Li	7c3e8995f1	[SPARK-23533][SS] Add support for changing ContinuousDataReader's startOffset ## What changes were proposed in this pull request? As discussed in #20675, we need to add a new interface, `ContinuousDataReaderFactory`, to support setting the start offset in Continuous Processing. ## How was this patch tested? Existing UT. Author: Yuanjian Li <xyliyuanjian@gmail.com> Closes #20689 from xuanyuanking/SPARK-23533.	2018-03-15 00:04:28 -07:00
Kazuaki Ishizaki	1098933b0a	[SPARK-23598][SQL] Make methods in BufferedRowIterator public to avoid runtime error for a large query ## What changes were proposed in this pull request? This PR fixes runtime error regarding a large query when a generated code has split classes. The issue is `append()`, `stopEarly()`, and other methods are not accessible from split classes that are not subclasses of `BufferedRowIterator`. This PR fixes this issue by making them `public`. Before applying the PR, we see the following exception by running the attached program with `CodeGenerator.GENERATED_CLASS_SIZE_THRESHOLD=-1`. ``` test("SPARK-23598") { // When set -1 to CodeGenerator.GENERATED_CLASS_SIZE_THRESHOLD, an exception is thrown val df_pet_age = Seq((8, "bat"), (15, "mouse"), (5, "horse")).toDF("age", "name") df_pet_age.groupBy("name").avg("age").show() } ``` Exception: ``` 19:40:52.591 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 19:41:32.319 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IllegalAccessError: tried to access method org.apache.spark.sql.execution.BufferedRowIterator.shouldStop()Z from class org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1$agg_NestedClass1 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1$agg_NestedClass1.agg_doAggregateWithKeys$(generated.java:203) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:160) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:616) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ... ``` Generated code (line 195 calles `stopEarly()`). ``` /* 001 / public Object generate(Object[] references) { / 002 / return new GeneratedIteratorForCodegenStage1(references); / 003 / } / 004 / / 005 / // codegenStageId=1 / 006 / final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator { / 007 / private Object[] references; / 008 / private scala.collection.Iterator[] inputs; / 009 / private boolean agg_initAgg; / 010 / private boolean agg_bufIsNull; / 011 / private double agg_bufValue; / 012 / private boolean agg_bufIsNull1; / 013 / private long agg_bufValue1; / 014 / private agg_FastHashMap agg_fastHashMap; / 015 / private org.apache.spark.unsafe.KVIterator<UnsafeRow, UnsafeRow> agg_fastHashMapIter; / 016 / private org.apache.spark.unsafe.KVIterator agg_mapIter; / 017 / private org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap agg_hashMap; / 018 / private org.apache.spark.sql.execution.UnsafeKVExternalSorter agg_sorter; / 019 / private scala.collection.Iterator inputadapter_input; / 020 / private boolean agg_agg_isNull11; / 021 / private boolean agg_agg_isNull25; / 022 / private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder[] agg_mutableStateArray1 = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder[2]; / 023 / private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] agg_mutableStateArray2 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[2]; / 024 / private UnsafeRow[] agg_mutableStateArray = new UnsafeRow[2]; / 025 / / 026 / public GeneratedIteratorForCodegenStage1(Object[] references) { / 027 / this.references = references; / 028 / } / 029 / / 030 / public void init(int index, scala.collection.Iterator[] inputs) { / 031 / partitionIndex = index; / 032 / this.inputs = inputs; / 033 / / 034 / agg_fastHashMap = new agg_FastHashMap(((org.apache.spark.sql.execution.aggregate.HashAggregateExec) references[0] / plan /).getTaskMemoryManager(), ((org.apache.spark.sql.execution.aggregate.HashAggregateExec) references[0] / plan /).getEmptyAggregationBuffer()); / 035 / agg_hashMap = ((org.apache.spark.sql.execution.aggregate.HashAggregateExec) references[0] / plan /).createHashMap(); / 036 / inputadapter_input = inputs[0]; / 037 / agg_mutableStateArray[0] = new UnsafeRow(1); / 038 / agg_mutableStateArray1[0] = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(agg_mutableStateArray[0], 32); / 039 / agg_mutableStateArray2[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(agg_mutableStateArray1[0], 1); / 040 / agg_mutableStateArray[1] = new UnsafeRow(3); / 041 / agg_mutableStateArray1[1] = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(agg_mutableStateArray[1], 32); / 042 / agg_mutableStateArray2[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(agg_mutableStateArray1[1], 3); / 043 / / 044 / } / 045 / / 046 / public class agg_FastHashMap { / 047 / private org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch batch; / 048 / private int[] buckets; / 049 / private int capacity = 1 << 16; / 050 / private double loadFactor = 0.5; / 051 / private int numBuckets = (int) (capacity / loadFactor); / 052 / private int maxSteps = 2; / 053 / private int numRows = 0; / 
054 / private org.apache.spark.sql.types.StructType keySchema = new org.apache.spark.sql.types.StructType().add(((java.lang.String) references[1] / keyName /), org.apache.spark.sql.types.DataTypes.StringType); / 055 / private org.apache.spark.sql.types.StructType valueSchema = new org.apache.spark.sql.types.StructType().add(((java.lang.String) references[2] / keyName /), org.apache.spark.sql.types.DataTypes.DoubleType) / 056 / .add(((java.lang.String) references[3] / keyName /), org.apache.spark.sql.types.DataTypes.LongType); / 057 / private Object emptyVBase; / 058 / private long emptyVOff; / 059 / private int emptyVLen; / 060 / private boolean isBatchFull = false; / 061 / / 062 / public agg_FastHashMap( / 063 / org.apache.spark.memory.TaskMemoryManager taskMemoryManager, / 064 / InternalRow emptyAggregationBuffer) { / 065 / batch = org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch / 066 / .allocate(keySchema, valueSchema, taskMemoryManager, capacity); / 067 / / 068 / final UnsafeProjection valueProjection = UnsafeProjection.create(valueSchema); / 069 / final byte[] emptyBuffer = valueProjection.apply(emptyAggregationBuffer).getBytes(); / 070 / / 071 / emptyVBase = emptyBuffer; / 072 / emptyVOff = Platform.BYTE_ARRAY_OFFSET; / 073 / emptyVLen = emptyBuffer.length; / 074 / / 075 / buckets = new int[numBuckets]; / 076 / java.util.Arrays.fill(buckets, -1); / 077 / } / 078 / / 079 / public org.apache.spark.sql.catalyst.expressions.UnsafeRow findOrInsert(UTF8String agg_key) { / 080 / long h = hash(agg_key); / 081 / int step = 0; / 082 / int idx = (int) h & (numBuckets - 1); / 083 / while (step < maxSteps) { / 084 / // Return bucket index if it's either an empty slot or already contains the key / 085 / if (buckets[idx] == -1) { / 086 / if (numRows < capacity && !isBatchFull) { / 087 / // creating the unsafe for new entry / 088 / UnsafeRow agg_result = new UnsafeRow(1); / 089 / org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder / 090 
/ = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(agg_result, / 091 / 32); / 092 / org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter agg_rowWriter / 093 / = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter( / 094 / agg_holder, / 095 / 1); / 096 / agg_holder.reset(); //TODO: investigate if reset or zeroout are actually needed / 097 / agg_rowWriter.zeroOutNullBytes(); / 098 / agg_rowWriter.write(0, agg_key); / 099 / agg_result.setTotalSize(agg_holder.totalSize()); / 100 / Object kbase = agg_result.getBaseObject(); / 101 / long koff = agg_result.getBaseOffset(); / 102 / int klen = agg_result.getSizeInBytes(); / 103 / / 104 / UnsafeRow vRow / 105 / = batch.appendRow(kbase, koff, klen, emptyVBase, emptyVOff, emptyVLen); / 106 / if (vRow == null) { / 107 / isBatchFull = true; / 108 / } else { / 109 / buckets[idx] = numRows++; / 110 / } / 111 / return vRow; / 112 / } else { / 113 / // No more space / 114 / return null; / 115 / } / 116 / } else if (equals(idx, agg_key)) { / 117 / return batch.getValueRow(buckets[idx]); / 118 / } / 119 / idx = (idx + 1) & (numBuckets - 1); / 120 / step++; / 121 / } / 122 / // Didn't find it / 123 / return null; / 124 / } / 125 / / 126 / private boolean equals(int idx, UTF8String agg_key) { / 127 / UnsafeRow row = batch.getKeyRow(buckets[idx]); / 128 / return (row.getUTF8String(0).equals(agg_key)); / 129 / } / 130 / / 131 / private long hash(UTF8String agg_key) { / 132 / long agg_hash = 0; / 133 / / 134 / int agg_result = 0; / 135 / byte[] agg_bytes = agg_key.getBytes(); / 136 / for (int i = 0; i < agg_bytes.length; i++) { / 137 / int agg_hash1 = agg_bytes[i]; / 138 / agg_result = (agg_result ^ (0x9e3779b9)) + agg_hash1 + (agg_result << 6) + (agg_result >>> 2); / 139 / } / 140 / / 141 / agg_hash = (agg_hash ^ (0x9e3779b9)) + agg_result + (agg_hash << 6) + (agg_hash >>> 2); / 142 / / 143 / return agg_hash; / 144 / } / 145 / / 146 / public org.apache.spark.unsafe.KVIterator<UnsafeRow, 
UnsafeRow> rowIterator() { / 147 / return batch.rowIterator(); / 148 / } / 149 / / 150 / public void close() { / 151 / batch.close(); / 152 / } / 153 / / 154 / } / 155 / / 156 / protected void processNext() throws java.io.IOException { / 157 / if (!agg_initAgg) { / 158 / agg_initAgg = true; / 159 / long wholestagecodegen_beforeAgg = System.nanoTime(); / 160 / agg_nestedClassInstance1.agg_doAggregateWithKeys(); / 161 / ((org.apache.spark.sql.execution.metric.SQLMetric) references[8] / aggTime /).add((System.nanoTime() - wholestagecodegen_beforeAgg) / 1000000); / 162 / } / 163 / / 164 / // output the result / 165 / / 166 / while (agg_fastHashMapIter.next()) { / 167 / UnsafeRow agg_aggKey = (UnsafeRow) agg_fastHashMapIter.getKey(); / 168 / UnsafeRow agg_aggBuffer = (UnsafeRow) agg_fastHashMapIter.getValue(); / 169 / wholestagecodegen_nestedClassInstance.agg_doAggregateWithKeysOutput(agg_aggKey, agg_aggBuffer); / 170 / / 171 / if (shouldStop()) return; / 172 / } / 173 / agg_fastHashMap.close(); / 174 / / 175 / while (agg_mapIter.next()) { / 176 / UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey(); / 177 / UnsafeRow agg_aggBuffer = (UnsafeRow) agg_mapIter.getValue(); / 178 / wholestagecodegen_nestedClassInstance.agg_doAggregateWithKeysOutput(agg_aggKey, agg_aggBuffer); / 179 / / 180 / if (shouldStop()) return; / 181 / } / 182 / / 183 / agg_mapIter.close(); / 184 / if (agg_sorter == null) { / 185 / agg_hashMap.free(); / 186 / } / 187 / } / 188 / / 189 / private wholestagecodegen_NestedClass wholestagecodegen_nestedClassInstance = new wholestagecodegen_NestedClass(); / 190 / private agg_NestedClass1 agg_nestedClassInstance1 = new agg_NestedClass1(); / 191 / private agg_NestedClass agg_nestedClassInstance = new agg_NestedClass(); / 192 / / 193 / private class agg_NestedClass1 { / 194 / private void agg_doAggregateWithKeys() throws java.io.IOException { / 195 / while (inputadapter_input.hasNext() && !stopEarly()) { / 196 / InternalRow inputadapter_row = (InternalRow) 
inputadapter_input.next(); / 197 / int inputadapter_value = inputadapter_row.getInt(0); / 198 / boolean inputadapter_isNull1 = inputadapter_row.isNullAt(1); / 199 / UTF8String inputadapter_value1 = inputadapter_isNull1 ? / 200 / null : (inputadapter_row.getUTF8String(1)); / 201 / / 202 / agg_nestedClassInstance.agg_doConsume(inputadapter_row, inputadapter_value, inputadapter_value1, inputadapter_isNull1); / 203 / if (shouldStop()) return; / 204 / } / 205 / / 206 / agg_fastHashMapIter = agg_fastHashMap.rowIterator(); / 207 / agg_mapIter = ((org.apache.spark.sql.execution.aggregate.HashAggregateExec) references[0] / plan /).finishAggregate(agg_hashMap, agg_sorter, ((org.apache.spark.sql.execution.metric.SQLMetric) references[4] / peakMemory /), ((org.apache.spark.sql.execution.metric.SQLMetric) references[5] / spillSize /), ((org.apache.spark.sql.execution.metric.SQLMetric) references[6] / avgHashProbe /)); / 208 / / 209 / } / 210 / / 211 / } / 212 / / 213 / private class wholestagecodegen_NestedClass { / 214 / private void agg_doAggregateWithKeysOutput(UnsafeRow agg_keyTerm, UnsafeRow agg_bufferTerm) / 215 / throws java.io.IOException { / 216 / ((org.apache.spark.sql.execution.metric.SQLMetric) references[7] / numOutputRows /).add(1); / 217 / / 218 / boolean agg_isNull35 = agg_keyTerm.isNullAt(0); / 219 / UTF8String agg_value37 = agg_isNull35 ? / 220 / null : (agg_keyTerm.getUTF8String(0)); / 221 / boolean agg_isNull36 = agg_bufferTerm.isNullAt(0); / 222 / double agg_value38 = agg_isNull36 ? / 223 / -1.0 : (agg_bufferTerm.getDouble(0)); / 224 / boolean agg_isNull37 = agg_bufferTerm.isNullAt(1); / 225 / long agg_value39 = agg_isNull37 ? 
/ 226 / -1L : (agg_bufferTerm.getLong(1)); / 227 / / 228 / agg_mutableStateArray1[1].reset(); / 229 / / 230 / agg_mutableStateArray2[1].zeroOutNullBytes(); / 231 / / 232 / if (agg_isNull35) { / 233 / agg_mutableStateArray2[1].setNullAt(0); / 234 / } else { / 235 / agg_mutableStateArray2[1].write(0, agg_value37); / 236 / } / 237 / / 238 / if (agg_isNull36) { / 239 / agg_mutableStateArray2[1].setNullAt(1); / 240 / } else { / 241 / agg_mutableStateArray2[1].write(1, agg_value38); / 242 / } / 243 / / 244 / if (agg_isNull37) { / 245 / agg_mutableStateArray2[1].setNullAt(2); / 246 / } else { / 247 / agg_mutableStateArray2[1].write(2, agg_value39); / 248 / } / 249 / agg_mutableStateArray[1].setTotalSize(agg_mutableStateArray1[1].totalSize()); / 250 / append(agg_mutableStateArray[1]); / 251 / / 252 / } / 253 / / 254 / } / 255 / / 256 / private class agg_NestedClass { / 257 / private void agg_doConsume(InternalRow inputadapter_row, int agg_expr_0, UTF8String agg_expr_1, boolean agg_exprIsNull_1) throws java.io.IOException { / 258 / UnsafeRow agg_unsafeRowAggBuffer = null; / 259 / UnsafeRow agg_fastAggBuffer = null; / 260 / / 261 / if (true) { / 262 / if (!agg_exprIsNull_1) { / 263 / agg_fastAggBuffer = agg_fastHashMap.findOrInsert( / 264 / agg_expr_1); / 265 / } / 266 / } / 267 / // Cannot find the key in fast hash map, try regular hash map. 
/ 268 / if (agg_fastAggBuffer == null) { / 269 / // generate grouping key / 270 / agg_mutableStateArray1[0].reset(); / 271 / / 272 / agg_mutableStateArray2[0].zeroOutNullBytes(); / 273 / / 274 / if (agg_exprIsNull_1) { / 275 / agg_mutableStateArray2[0].setNullAt(0); / 276 / } else { / 277 / agg_mutableStateArray2[0].write(0, agg_expr_1); / 278 / } / 279 / agg_mutableStateArray[0].setTotalSize(agg_mutableStateArray1[0].totalSize()); / 280 / int agg_value7 = 42; / 281 / / 282 / if (!agg_exprIsNull_1) { / 283 / agg_value7 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(agg_expr_1.getBaseObject(), agg_expr_1.getBaseOffset(), agg_expr_1.numBytes(), agg_value7); / 284 / } / 285 / if (true) { / 286 / // try to get the buffer from hash map / 287 / agg_unsafeRowAggBuffer = / 288 / agg_hashMap.getAggregationBufferFromUnsafeRow(agg_mutableStateArray[0], agg_value7); / 289 / } / 290 / // Can't allocate buffer from the hash map. Spill the map and fallback to sort-based / 291 / // aggregation after processing all input rows. / 292 / if (agg_unsafeRowAggBuffer == null) { / 293 / if (agg_sorter == null) { / 294 / agg_sorter = agg_hashMap.destructAndCreateExternalSorter(); / 295 / } else { / 296 / agg_sorter.merge(agg_hashMap.destructAndCreateExternalSorter()); / 297 / } / 298 / / 299 / // the hash map had be spilled, it should have enough memory now, / 300 / // try to allocate buffer again. 
/ 301 / agg_unsafeRowAggBuffer = agg_hashMap.getAggregationBufferFromUnsafeRow( / 302 / agg_mutableStateArray[0], agg_value7); / 303 / if (agg_unsafeRowAggBuffer == null) { / 304 / // failed to allocate the first page / 305 / throw new OutOfMemoryError("No enough memory for aggregation"); / 306 / } / 307 / } / 308 / / 309 / } / 310 / / 311 / if (agg_fastAggBuffer != null) { / 312 / // common sub-expressions / 313 / boolean agg_isNull21 = false; / 314 / long agg_value23 = -1L; / 315 / if (!false) { / 316 / agg_value23 = (long) agg_expr_0; / 317 / } / 318 / // evaluate aggregate function / 319 / boolean agg_isNull23 = true; / 320 / double agg_value25 = -1.0; / 321 / / 322 / boolean agg_isNull24 = agg_fastAggBuffer.isNullAt(0); / 323 / double agg_value26 = agg_isNull24 ? / 324 / -1.0 : (agg_fastAggBuffer.getDouble(0)); / 325 / if (!agg_isNull24) { / 326 / agg_agg_isNull25 = true; / 327 / double agg_value27 = -1.0; / 328 / do { / 329 / boolean agg_isNull26 = agg_isNull21; / 330 / double agg_value28 = -1.0; / 331 / if (!agg_isNull21) { / 332 / agg_value28 = (double) agg_value23; / 333 / } / 334 / if (!agg_isNull26) { / 335 / agg_agg_isNull25 = false; / 336 / agg_value27 = agg_value28; / 337 / continue; / 338 / } / 339 / / 340 / boolean agg_isNull27 = false; / 341 / double agg_value29 = -1.0; / 342 / if (!false) { / 343 / agg_value29 = (double) 0; / 344 / } / 345 / if (!agg_isNull27) { / 346 / agg_agg_isNull25 = false; / 347 / agg_value27 = agg_value29; / 348 / continue; / 349 / } / 350 / / 351 / } while (false); / 352 / / 353 / agg_isNull23 = false; // resultCode could change nullability. / 354 / agg_value25 = agg_value26 + agg_value27; / 355 / / 356 / } / 357 / boolean agg_isNull29 = false; / 358 / long agg_value31 = -1L; / 359 / if (!false && agg_isNull21) { / 360 / boolean agg_isNull31 = agg_fastAggBuffer.isNullAt(1); / 361 / long agg_value33 = agg_isNull31 ? 
/ 362 / -1L : (agg_fastAggBuffer.getLong(1)); / 363 / agg_isNull29 = agg_isNull31; / 364 / agg_value31 = agg_value33; / 365 / } else { / 366 / boolean agg_isNull32 = true; / 367 / long agg_value34 = -1L; / 368 / / 369 / boolean agg_isNull33 = agg_fastAggBuffer.isNullAt(1); / 370 / long agg_value35 = agg_isNull33 ? / 371 / -1L : (agg_fastAggBuffer.getLong(1)); / 372 / if (!agg_isNull33) { / 373 / agg_isNull32 = false; // resultCode could change nullability. / 374 / agg_value34 = agg_value35 + 1L; / 375 / / 376 / } / 377 / agg_isNull29 = agg_isNull32; / 378 / agg_value31 = agg_value34; / 379 / } / 380 / // update fast row / 381 / if (!agg_isNull23) { / 382 / agg_fastAggBuffer.setDouble(0, agg_value25); / 383 / } else { / 384 / agg_fastAggBuffer.setNullAt(0); / 385 / } / 386 / / 387 / if (!agg_isNull29) { / 388 / agg_fastAggBuffer.setLong(1, agg_value31); / 389 / } else { / 390 / agg_fastAggBuffer.setNullAt(1); / 391 / } / 392 / } else { / 393 / // common sub-expressions / 394 / boolean agg_isNull7 = false; / 395 / long agg_value9 = -1L; / 396 / if (!false) { / 397 / agg_value9 = (long) agg_expr_0; / 398 / } / 399 / // evaluate aggregate function / 400 / boolean agg_isNull9 = true; / 401 / double agg_value11 = -1.0; / 402 / / 403 / boolean agg_isNull10 = agg_unsafeRowAggBuffer.isNullAt(0); / 404 / double agg_value12 = agg_isNull10 ? 
/ 405 / -1.0 : (agg_unsafeRowAggBuffer.getDouble(0)); / 406 / if (!agg_isNull10) { / 407 / agg_agg_isNull11 = true; / 408 / double agg_value13 = -1.0; / 409 / do { / 410 / boolean agg_isNull12 = agg_isNull7; / 411 / double agg_value14 = -1.0; / 412 / if (!agg_isNull7) { / 413 / agg_value14 = (double) agg_value9; / 414 / } / 415 / if (!agg_isNull12) { / 416 / agg_agg_isNull11 = false; / 417 / agg_value13 = agg_value14; / 418 / continue; / 419 / } / 420 / / 421 / boolean agg_isNull13 = false; / 422 / double agg_value15 = -1.0; / 423 / if (!false) { / 424 / agg_value15 = (double) 0; / 425 / } / 426 / if (!agg_isNull13) { / 427 / agg_agg_isNull11 = false; / 428 / agg_value13 = agg_value15; / 429 / continue; / 430 / } / 431 / / 432 / } while (false); / 433 / / 434 / agg_isNull9 = false; // resultCode could change nullability. / 435 / agg_value11 = agg_value12 + agg_value13; / 436 / / 437 / } / 438 / boolean agg_isNull15 = false; / 439 / long agg_value17 = -1L; / 440 / if (!false && agg_isNull7) { / 441 / boolean agg_isNull17 = agg_unsafeRowAggBuffer.isNullAt(1); / 442 / long agg_value19 = agg_isNull17 ? / 443 / -1L : (agg_unsafeRowAggBuffer.getLong(1)); / 444 / agg_isNull15 = agg_isNull17; / 445 / agg_value17 = agg_value19; / 446 / } else { / 447 / boolean agg_isNull18 = true; / 448 / long agg_value20 = -1L; / 449 / / 450 / boolean agg_isNull19 = agg_unsafeRowAggBuffer.isNullAt(1); / 451 / long agg_value21 = agg_isNull19 ? / 452 / -1L : (agg_unsafeRowAggBuffer.getLong(1)); / 453 / if (!agg_isNull19) { / 454 / agg_isNull18 = false; // resultCode could change nullability. 
/ 455 / agg_value20 = agg_value21 + 1L; / 456 / / 457 / } / 458 / agg_isNull15 = agg_isNull18; / 459 / agg_value17 = agg_value20; / 460 / } / 461 / // update unsafe row buffer / 462 / if (!agg_isNull9) { / 463 / agg_unsafeRowAggBuffer.setDouble(0, agg_value11); / 464 / } else { / 465 / agg_unsafeRowAggBuffer.setNullAt(0); / 466 / } / 467 / / 468 / if (!agg_isNull15) { / 469 / agg_unsafeRowAggBuffer.setLong(1, agg_value17); / 470 / } else { / 471 / agg_unsafeRowAggBuffer.setNullAt(1); / 472 / } / 473 / / 474 / } / 475 / / 476 / } / 477 / / 478 / } / 479 / / 480 */ } ``` ## How was this patch tested? Added UT into `WholeStageCodegenSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #20779 from kiszk/SPARK-23598.	2018-03-13 23:04:16 +01:00
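The generated code above looks up each grouping key in `agg_FastHashMap` first (at most `maxSteps` probes) and falls back to the regular hash map when that returns null. A minimal Java sketch of that two-level pattern follows. The class `TwoLevelCounter`, its tiny sizes, and the use of `String.hashCode()` in place of the generated hash function are assumptions made for illustration; they are not part of the patch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the two-level lookup: a fixed-capacity
// open-addressing map probed for at most MAX_STEPS slots, plus a regular
// HashMap that takes every key the fast path cannot place.
final class TwoLevelCounter {
    private static final int CAPACITY = 4;     // deliberately tiny (the generated code uses 1 << 16)
    private static final int NUM_BUCKETS = 8;  // capacity / loadFactor (0.5); must be a power of two
    private static final int MAX_STEPS = 2;    // same probe limit as the generated code

    private final int[] buckets = new int[NUM_BUCKETS];
    private final String[] keys = new String[CAPACITY];
    private final long[] counts = new long[CAPACITY];
    private final Map<String, Long> fallback = new HashMap<>();
    private int numRows = 0;

    TwoLevelCounter() {
        Arrays.fill(buckets, -1); // -1 marks an empty bucket
    }

    // Returns the row index for key, or -1 if the fast map cannot serve it.
    private int probe(String key, boolean insert) {
        int idx = key.hashCode() & (NUM_BUCKETS - 1);
        for (int step = 0; step < MAX_STEPS; step++) {
            if (buckets[idx] == -1) {
                if (!insert || numRows >= CAPACITY) {
                    return -1; // key absent, or no more space
                }
                keys[numRows] = key;
                buckets[idx] = numRows;
                return numRows++;
            } else if (keys[buckets[idx]].equals(key)) {
                return buckets[idx];
            }
            idx = (idx + 1) & (NUM_BUCKETS - 1); // linear probing
        }
        return -1; // probe limit reached
    }

    void add(String key) {
        int row = probe(key, true);
        if (row >= 0) {
            counts[row]++;
        } else {
            fallback.merge(key, 1L, Long::sum); // try the regular map
        }
    }

    long count(String key) {
        int row = probe(key, false);
        return row >= 0 ? counts[row] : fallback.getOrDefault(key, 0L);
    }
}
```

Buckets are never freed in this sketch, so a key that misses the fast map once always misses it, and each key lives in exactly one of the two maps.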
zuotingbing	918fb9beee	[SPARK-23547][SQL] Cleanup the .pipeout file when the Hive Session closed ## What changes were proposed in this pull request? ![2018-03-07_121010](https://user-images.githubusercontent.com/24823338/37073232-922e10d2-2200-11e8-8172-6e03aa984b39.png) When the Hive session is closed, we should also clean up the .pipeout file. ## How was this patch tested? Added test cases. Author: zuotingbing <zuo.tingbing9@zte.com.cn> Closes #20702 from zuotingbing/SPARK-23547.	2018-03-13 11:31:32 -07:00
Xingbo Jiang	9ddd1e2cea	[MINOR][SQL][TEST] Create table using `dataSourceName` in `HadoopFsRelationTest` ## What changes were proposed in this pull request? This PR fixes a minor issue in `HadoopFsRelationTest`: the table should be created using `dataSourceName` instead of `parquet`. The issue does not affect correctness, but it produces a wrong error message if the test fails. ## How was this patch tested? Existing tests. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #20780 from jiangxb1987/dataSourceName.	2018-03-13 23:31:08 +09:00
Kazuaki Ishizaki	23370554d0	[SPARK-23656][TEST] Perform assertions in XXH64Suite.testKnownByteArrayInputs() on big endian platform, too ## What changes were proposed in this pull request? This PR enables the assertions in `XXH64Suite.testKnownByteArrayInputs()` on big-endian platforms, too. The current implementation performs them only on little-endian platforms. This PR increases test coverage on big-endian platforms. ## How was this patch tested? Updated `XXH64Suite`. Tested on a big-endian platform using the JIT compiler or the interpreter (`-Xint`). Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #20804 from kiszk/SPARK-23656.	2018-03-13 15:20:09 +01:00
Xiayun Sun	b304e07e06	[SPARK-23462][SQL] improve missing field error message in `StructType` ## What changes were proposed in this pull request? The error message ```s"""Field "$name" does not exist."""``` is thrown when looking up an unknown field in StructType. The error message should also include information about which columns/fields exist in this struct. ## How was this patch tested? Added new unit tests. Note: I created a new `StructTypeSuite.scala` as I couldn't find an existing suite that's suitable to place these tests. I may be missing something so feel free to propose new locations. Author: Xiayun Sun <xiayunsun@gmail.com> Closes #20649 from xysun/SPARK-23462.	2018-03-12 22:13:28 +09:00
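A minimal Java sketch of the idea in SPARK-23462: when a field lookup fails, the exception message names the fields that do exist. The `FieldLookup` class, the map-based schema, and the exact "Available fields" wording are hypothetical; the real change lives in Spark's `StructType`.

```java
import java.util.Map;

// Hypothetical stand-in for a struct schema: field name -> type name.
final class FieldLookup {
    private FieldLookup() {}

    // Returns the type of the named field, or throws an error that
    // also lists every field that is actually present.
    static String typeOf(Map<String, String> fields, String name) {
        String type = fields.get(name);
        if (type == null) {
            throw new IllegalArgumentException(
                "Field \"" + name + "\" does not exist. Available fields: "
                    + String.join(", ", fields.keySet()));
        }
        return type;
    }
}
```

Passing a `LinkedHashMap` keeps the listed fields in declaration order, which makes the message easier to compare against the schema.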
Wang Gengliang	10b0657b03	[SPARK-23624][SQL] Revise doc of method pushFilters in Datasource V2 ## What changes were proposed in this pull request? Revise the doc of method pushFilters in SupportsPushDownFilters/SupportsPushDownCatalystFilters. In `FileSourceStrategy`, all filters except `partitionKeyFilters` (whose references are a subset of the partition keys) need to be evaluated after scanning. Otherwise, Spark will get wrong results from data sources like ORC/Parquet. This PR improves the doc. Author: Wang Gengliang <gengliang.wang@databricks.com> Closes #20769 from gengliangwang/revise_pushdown_doc.	2018-03-09 15:41:19 -08:00
Michał Świtakowski	2ca9bb083c	[SPARK-23173][SQL] Avoid creating corrupt parquet files when loading data from JSON ## What changes were proposed in this pull request? The from_json() function accepts an additional parameter, where the user might specify the schema. The issue is that the specified schema might not be compatible with the data. In particular, the JSON data might be missing data for fields declared as non-nullable in the schema. The from_json() function does not verify the data against such errors. When data with missing fields is sent to the parquet encoder, there is no verification either. The end result is a corrupt parquet file. To avoid corruption, make sure that all fields in the user-specified schema are set to be nullable. Since this changes the behavior of a public function, we need to include it in release notes. The behavior can be reverted by setting `spark.sql.fromJsonForceNullableSchema=false` ## How was this patch tested? Added two new tests. Author: Michał Świtakowski <michal.switakowski@databricks.com> Closes #20694 from mswit-databricks/SPARK-23173.	2018-03-09 14:29:31 -08:00
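The fix in SPARK-23173 sets every field of the user-specified schema to nullable. A minimal Java sketch of such a recursive rewrite follows. `SchemaField` is a hypothetical stand-in for a schema node, not Spark's `StructField`, and the sketch only covers nested struct-like children.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical immutable schema node: a named field with optional nested fields.
final class SchemaField {
    final String name;
    final boolean nullable;
    final List<SchemaField> children;

    SchemaField(String name, boolean nullable, List<SchemaField> children) {
        this.name = name;
        this.nullable = nullable;
        this.children = Collections.unmodifiableList(new ArrayList<>(children));
    }

    // Returns a copy in which this field and all nested fields are nullable.
    // The original node is left untouched.
    SchemaField asNullable() {
        List<SchemaField> relaxed = new ArrayList<>();
        for (SchemaField child : children) {
            relaxed.add(child.asNullable());
        }
        return new SchemaField(name, true, relaxed);
    }
}
```

Returning a new tree instead of mutating the input keeps the caller's schema intact, so the original nullability is still available if the behavior is switched off.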
Dilip Biswal	d90e77bd0e	[SPARK-23271][SQL] Parquet output contains only _SUCCESS file after writing an empty dataframe ## What changes were proposed in this pull request? Below are the two cases. ``` SQL case 1 scala> List.empty[String].toDF().rdd.partitions.length res18: Int = 1 ``` When we write the above data frame as parquet, we create a parquet file containing just the schema of the data frame. Case 2 ``` SQL scala> val anySchema = StructType(StructField("anyName", StringType, nullable = false) :: Nil) anySchema: org.apache.spark.sql.types.StructType = StructType(StructField(anyName,StringType,false)) scala> spark.read.schema(anySchema).csv("/tmp/empty_folder").rdd.partitions.length res22: Int = 0 ``` For the 2nd case, since number of partitions = 0, we don't call the write task (the task has logic to create the empty metadata only parquet file) The fix is to create a dummy single partition RDD and set up the write task based on it to ensure the metadata-only file. ## How was this patch tested? A new test is added to DataframeReaderWriterSuite. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #20525 from dilipbiswal/spark-23271.	2018-03-08 14:58:40 -08:00
Marco Gaido	e7bbca8896	[SPARK-23602][SQL] PrintToStderr prints value also in interpreted mode ## What changes were proposed in this pull request? `PrintToStderr` did what it is supposed to do only when code generation was enabled. The PR adds the same behavior in interpreted mode too. ## How was this patch tested? Added UT. Author: Marco Gaido <marcogaido91@gmail.com> Closes #20773 from mgaido91/SPARK-23602.	2018-03-08 22:02:28 +01:00
Marco Gaido	ea480990e7	[SPARK-23628][SQL] calculateParamLength should not return 1 + num of expressions ## What changes were proposed in this pull request? There was a bug in `calculateParamLength` which caused it to always return 1 + the number of expressions. This could lead to exceptions, especially with expressions of type long. ## How was this patch tested? Added UT and fixed the previous UT. Author: Marco Gaido <marcogaido91@gmail.com> Closes #20772 from mgaido91/SPARK-23628.	2018-03-08 11:09:15 -08:00
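Why "1 + number of expressions" undercounts: in a JVM method descriptor, `long` and `double` parameters each take two units, and an instance method spends one unit on `this`. A minimal Java sketch of a slot-aware count follows. The `ParamLength` and `Param` names are hypothetical, and the extra slot for a nullable parameter (a separate boolean isNull flag) is an assumption about how the generated methods pass nulls, not something stated in the commit message.

```java
import java.util.List;

// Hypothetical helper that counts JVM parameter slots for a generated method.
final class ParamLength {
    private ParamLength() {}

    // One parameter of the generated method: its Java type name and nullability.
    static final class Param {
        final String javaType;
        final boolean nullable;

        Param(String javaType, boolean nullable) {
            this.javaType = javaType;
            this.nullable = nullable;
        }
    }

    static int slots(List<Param> params) {
        int total = 1; // the implicit `this` reference
        for (Param p : params) {
            boolean wide = p.javaType.equals("long") || p.javaType.equals("double");
            total += wide ? 2 : 1; // long and double occupy two slots
            if (p.nullable) {
                total += 1; // assumed: a boolean isNull flag travels with the value
            }
        }
        return total;
    }
}
```

For example, one non-nullable `int` plus one non-nullable `long` needs 4 slots, while the buggy count of 1 + number of expressions would report 3.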

7020 commits