ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Tathagata Das	70fe99dc62	[SPARK-21464][SS] Minimize deprecation warnings caused by ProcessingTime class ## What changes were proposed in this pull request? Use of `ProcessingTime` class was deprecated in favor of `Trigger.ProcessingTime` in Spark 2.2. However interval uses to ProcessingTime causes deprecation warnings during compilation. This cannot be avoided entirely as even though it is deprecated as a public API, ProcessingTime instances are used internally in TriggerExecutor. This PR is to minimize the warning by removing its uses from tests as much as possible. ## How was this patch tested? Existing tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #18678 from tdas/SPARK-21464.	2017-07-19 11:02:07 -07:00
donnyzone	6b6dd682e8	[SPARK-21441][SQL] Incorrect Codegen in SortMergeJoinExec results failures in some cases ## What changes were proposed in this pull request? https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21441 This issue can be reproduced by the following example: ``` val spark = SparkSession .builder() .appName("smj-codegen") .master("local") .config("spark.sql.autoBroadcastJoinThreshold", "1") .getOrCreate() val df1 = spark.createDataFrame(Seq((1, 1), (2, 2), (3, 3))).toDF("key", "int") val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"), (3, "3"))).toDF("key", "str") val df = df1.join(df2, df1("key") === df2("key")) .filter("int = 2 or reflect('java.lang.Integer', 'valueOf', str) = 1") .select("int") df.show() ``` To conclude, the issue happens when: (1) SortMergeJoin condition contains CodegenFallback expressions. (2) In PhysicalPlan tree, SortMergeJoin node is the child of root node, e.g., the Project in above example. This patch fixes the logic in `CollapseCodegenStages` rule. ## How was this patch tested? Unit test and manual verification in our cluster. Author: donnyzone <wellfengzhu@gmail.com> Closes #18656 from DonnyZone/Fix_SortMergeJoinExec.	2017-07-19 21:48:54 +08:00
jinxing	4eb081cc87	[SPARK-21414] Refine SlidingWindowFunctionFrame to avoid OOM. ## What changes were proposed in this pull request? In `SlidingWindowFunctionFrame`, it is now adding all rows to the buffer for which the input row value is equal to or less than the output row upper bound, then drop all rows from the buffer for which the input row value is smaller than the output row lower bound. This could result in the buffer is very big though the window is small. For example: ``` select a, b, sum(a) over (partition by b order by a range between 1000000 following and 1000001 following) from table ``` We can refine the logic and just add the qualified rows into buffer. ## How was this patch tested? Manual test: Run sql `select shop, shopInfo, district, sum(revenue) over(partition by district order by revenue range between 100 following and 200 following) from revenueList limit 10` against a table with 4 columns(shop: String, shopInfo: String, district: String, revenue: Int). The biggest partition is around 2G bytes, containing 200k lines. Configure the executor with 2G bytes memory. With the change in this pr, it works find. Without this change, below exception will be thrown. ``` MemoryError: Java heap space at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:504) at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:62) at org.apache.spark.sql.execution.window.SlidingWindowFunctionFrame.write(WindowFunctionFrame.scala:201) at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:365) at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:289) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ``` Author: jinxing <jinxing6042@126.com> Closes #18634 from jinxing64/SPARK-21414.	2017-07-19 21:35:26 +08:00
gatorsmile	ae253e5a87	[SPARK-21273][SQL][FOLLOW-UP] Propagate logical plan stats using visitor pattern and mixin ## What changes were proposed in this pull request? This PR is to add back the stats propagation of `Window` and remove the stats calculation of the leaf node `Range`, which has been covered by `9c32d2507d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala (L56)` ## How was this patch tested? Added two test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #18677 from gatorsmile/visitStats.	2017-07-19 10:57:15 +08:00
xuanyuanking	81c99a5b95	[SPARK-21435][SQL] Empty files should be skipped while write to file ## What changes were proposed in this pull request? Add EmptyDirectoryWriteTask for empty task while writing files. Fix the empty result for parquet format by leaving the first partition for meta writing. ## How was this patch tested? Add new test in `FileFormatWriterSuite ` Author: xuanyuanking <xyliyuanjian@gmail.com> Closes #18654 from xuanyuanking/SPARK-21435.	2017-07-19 10:27:42 +08:00
Tathagata Das	84f1b25f31	[SPARK-21462][SS] Added batchId to StreamingQueryProgress.json ## What changes were proposed in this pull request? - Added batchId to StreamingQueryProgress.json as that was missing from the generated json. - Also, removed recently added numPartitions from StatefulOperatorProgress as this value does not change through the query run, and there are other ways to find that. ## How was this patch tested? Updated unit tests Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #18675 from tdas/SPARK-21462.	2017-07-18 16:29:45 -07:00
Wenchen Fan	f18b905f6c	[SPARK-21457][SQL] ExternalCatalog.listPartitions should correctly handle partition values with dot ## What changes were proposed in this pull request? When we list partitions from hive metastore with a partial partition spec, we are expecting exact matching according to the partition values. However, hive treats dot specially and match any single character for dot. We should do an extra filter to drop unexpected partitions. ## How was this patch tested? new regression test. Author: Wenchen Fan <wenchen@databricks.com> Closes #18671 from cloud-fan/hive.	2017-07-18 15:56:16 -07:00
Sean Owen	e26dac5feb	[SPARK-21415] Triage scapegoat warnings, part 1 ## What changes were proposed in this pull request? Address scapegoat warnings for: - BigDecimal double constructor - Catching NPE - Finalizer without super - List.size is O(n) - Prefer Seq.empty - Prefer Set.empty - reverse.map instead of reverseMap - Type shadowing - Unnecessary if condition. - Use .log1p - Var could be val In some instances like Seq.empty, I avoided making the change even where valid in test code to keep the scope of the change smaller. Those issues are concerned with performance and it won't matter for tests. ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #18635 from srowen/Scapegoat1.	2017-07-18 08:47:17 +01:00
aokolnychyi	0be5fb41a6	[SPARK-21332][SQL] Incorrect result type inferred for some decimal expressions ## What changes were proposed in this pull request? This PR changes the direction of expression transformation in the DecimalPrecision rule. Previously, the expressions were transformed down, which led to incorrect result types when decimal expressions had other decimal expressions as their operands. The root cause of this issue was in visiting outer nodes before their children. Consider the example below: ``` val inputSchema = StructType(StructField("col", DecimalType(26, 6)) :: Nil) val sc = spark.sparkContext val rdd = sc.parallelize(1 to 2).map(_ => Row(BigDecimal(12))) val df = spark.createDataFrame(rdd, inputSchema) // Works correctly since no nested decimal expression is involved // Expected result type: (26, 6) * (26, 6) = (38, 12) df.select($"col" * $"col").explain(true) df.select($"col" * $"col").printSchema() // Gives a wrong result since there is a nested decimal expression that should be visited first // Expected result type: ((26, 6) * (26, 6)) * (26, 6) = (38, 12) * (26, 6) = (38, 18) df.select($"col" * $"col" * $"col").explain(true) df.select($"col" * $"col" * $"col").printSchema() ``` The example above gives the following output: ``` // Correct result without sub-expressions == Parsed Logical Plan == 'Project [('col * 'col) AS (col * col)#4] +- LogicalRDD [col#1] == Analyzed Logical Plan == (col * col): decimal(38,12) Project [CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS (col * col)#4] +- LogicalRDD [col#1] == Optimized Logical Plan == Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4] +- LogicalRDD [col#1] == Physical Plan == Project [CheckOverflow((col#1 col#1), DecimalType(38,12)) AS (col * col)#4] +- Scan ExistingRDD[col#1] // Schema root \|-- (col * col): decimal(38,12) (nullable = true) // Incorrect result with sub-expressions == Parsed Logical Plan == 'Project [(('col * 'col) * 'col) AS ((col * col) * col)#11] +- LogicalRDD [col#1] == Analyzed Logical Plan == ((col * col) * col): decimal(38,12) Project [CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS ((col * col) * col)#11] +- LogicalRDD [col#1] == Optimized Logical Plan == Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11] +- LogicalRDD [col#1] == Physical Plan == Project [CheckOverflow((cast(CheckOverflow((col#1 col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11] +- Scan ExistingRDD[col#1] // Schema root \|-- ((col * col) * col): decimal(38,12) (nullable = true) ``` ## How was this patch tested? This PR was tested with available unit tests. Moreover, there are tests to cover previously failing scenarios. Author: aokolnychyi <anton.okolnychyi@sap.com> Closes #18583 from aokolnychyi/spark-21332.	2017-07-17 21:07:50 -07:00
Tathagata Das	e9faae135c	[SPARK-21409][SS] Follow up PR to allow different types of custom metrics to be exposed ## What changes were proposed in this pull request? Implementation may expose both timing as well as size metrics. This PR enables that. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #18661 from tdas/SPARK-21409-2.	2017-07-17 19:28:55 -07:00
gatorsmile	a8c6d0f64e	[MINOR] Improve SQLConf messages ### What changes were proposed in this pull request? The current SQLConf messages of `spark.sql.hive.convertMetastoreParquet` and `spark.sql.hive.convertMetastoreOrc` are not very clear to end users. This PR is to improve them. ### How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #18657 from gatorsmile/msgUpdates.	2017-07-18 09:15:18 +08:00
Tathagata Das	9d8c83179a	[SPARK-21409][SS] Expose state store memory usage in SQL metrics and progress updates ## What changes were proposed in this pull request? Currently, there is no tracking of memory usage of state stores. This JIRA is to expose that through SQL metrics and StreamingQueryProgress. Additionally, added the ability to expose implementation-specific metrics through the StateStore APIs to the SQLMetrics. ## How was this patch tested? Added unit tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #18629 from tdas/SPARK-21409.	2017-07-17 16:48:15 -07:00
gatorsmile	e398c28146	[SPARK-21354][SQL] INPUT FILE related functions do not support more than one sources ### What changes were proposed in this pull request? The build-in functions `input_file_name`, `input_file_block_start`, `input_file_block_length` do not support more than one sources, like what Hive does. Currently, Spark does not block it and the outputs are ambiguous/non-deterministic. It could be from any side. ``` hive> select *, INPUT__FILE__NAME FROM t1, t2; FAILED: SemanticException Column INPUT__FILE__NAME Found in more than One Tables/Subqueries ``` This PR blocks it and issues an error. ### How was this patch tested? Added a test case Author: gatorsmile <gatorsmile@gmail.com> Closes #18580 from gatorsmile/inputFileName.	2017-07-17 14:58:14 +08:00
Sean Owen	fd52a747fd	[SPARK-19810][SPARK-19810][MINOR][FOLLOW-UP] Follow-ups from to remove Scala 2.10 ## What changes were proposed in this pull request? Follow up to a few comments on https://github.com/apache/spark/pull/17150#issuecomment-315020196 that couldn't be addressed before it was merged. ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #18646 from srowen/SPARK-19810.2.	2017-07-17 09:22:42 +08:00
Kazuaki Ishizaki	ac5d5d7959	[SPARK-21344][SQL] BinaryType comparison does signed byte array comparison ## What changes were proposed in this pull request? This PR fixes a wrong comparison for `BinaryType`. This PR enables unsigned comparison and unsigned prefix generation for an array for `BinaryType`. Previous implementations uses signed operations. ## How was this patch tested? Added a test suite in `OrderingSuite`. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #18571 from kiszk/SPARK-21344.	2017-07-14 20:16:04 -07:00
Shixiong Zhu	2d968a07d2	[SPARK-21421][SS] Add the query id as a local property to allow source and sink using it ## What changes were proposed in this pull request? Add the query id as a local property to allow source and sink using it. ## How was this patch tested? The new unit test. Author: Shixiong Zhu <shixiong@databricks.com> Closes #18638 from zsxwing/SPARK-21421.	2017-07-14 14:37:27 -07:00
Sean Owen	425c4ada4c	[SPARK-19810][BUILD][CORE] Remove support for Scala 2.10 ## What changes were proposed in this pull request? - Remove Scala 2.10 build profiles and support - Replace some 2.10 support in scripts with commented placeholders for 2.12 later - Remove deprecated API calls from 2.10 support - Remove usages of deprecated context bounds where possible - Remove Scala 2.10 workarounds like ScalaReflectionLock - Other minor Scala warning fixes ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #17150 from srowen/SPARK-19810.	2017-07-13 17:06:24 +08:00
Wenchen Fan	780586a9f2	[SPARK-17701][SQL] Refactor RowDataSourceScanExec so its sameResult call does not compare strings ## What changes were proposed in this pull request? Currently, `RowDataSourceScanExec` and `FileSourceScanExec` rely on a "metadata" string map to implement equality comparison, since the RDDs they depend on cannot be directly compared. This has resulted in a number of correctness bugs around exchange reuse, e.g. SPARK-17673 and SPARK-16818. To make these comparisons less brittle, we should refactor these classes to compare constructor parameters directly instead of relying on the metadata map. This PR refactors `RowDataSourceScanExec`, `FileSourceScanExec` will be fixed in the follow-up PR. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #18600 from cloud-fan/minor.	2017-07-12 09:23:54 -07:00
liuxian	aaad34dc2f	[SPARK-21007][SQL] Add SQL function - RIGHT && LEFT ## What changes were proposed in this pull request? Add SQL function - RIGHT && LEFT, same as MySQL: https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_left https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_right ## How was this patch tested? unit test Author: liuxian <liu.xian3@zte.com.cn> Closes #18228 from 10110346/lx-wip-0607.	2017-07-12 18:51:19 +08:00
Xiao Li	f587d2e3fa	[SPARK-20842][SQL] Upgrade to 1.2.2 for Hive Metastore Client 1.2 ### What changes were proposed in this pull request? Hive 1.2.2 release is available. Below is the list of bugs fixed in 1.2.2 https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12332952&styleName=Text&projectId=12310843 ### How was this patch tested? N/A Author: Xiao Li <gatorsmile@gmail.com> Closes #18063 from gatorsmile/upgradeHiveClientTo1.2.2.	2017-07-12 15:48:44 +08:00
Burak Yavuz	e0af76a36a	[SPARK-21370][SS] Add test for state reliability when one read-only state store aborts after read-write state store commits ## What changes were proposed in this pull request? During Streaming Aggregation, we have two StateStores per task, one used as read-only in `StateStoreRestoreExec`, and one read-write used in `StateStoreSaveExec`. `StateStore.abort` will be called for these StateStores if they haven't committed their results. We need to make sure that `abort` in read-only store after a `commit` in the read-write store doesn't accidentally lead to the deletion of state. This PR adds a test for this condition. ## How was this patch tested? This PR adds a test. Author: Burak Yavuz <brkyvz@gmail.com> Closes #18603 from brkyvz/ss-test.	2017-07-12 00:39:09 -07:00
Jane Wang	2cbfc975ba	[SPARK-12139][SQL] REGEX Column Specification ## What changes were proposed in this pull request? Hive interprets regular expression, e.g., `(a)?+.+` in query specification. This PR enables spark to support this feature when hive.support.quoted.identifiers is set to true. ## How was this patch tested? - Add unittests in SQLQuerySuite.scala - Run spark-shell tested the original failed query: scala> hc.sql("SELECT `(a\|b)?+.+` from test1").collect.foreach(println) Author: Jane Wang <janewang@fb.com> Closes #18023 from janewangfb/support_select_regex.	2017-07-11 22:00:36 -07:00
gatorsmile	d3e071658f	[SPARK-19285][SQL] Implement UDF0 ### What changes were proposed in this pull request? This PR is to implement UDF0. `UDF0` is needed when users need to implement a JAVA UDF with no argument. ### How was this patch tested? Added a test case Author: gatorsmile <gatorsmile@gmail.com> Closes #18598 from gatorsmile/udf0.	2017-07-11 15:44:29 -07:00
hyukjinkwon	ebc124d4c4	[SPARK-21365][PYTHON] Deduplicate logics parsing DDL type/schema definition ## What changes were proposed in this pull request? This PR deals with four points as below: - Reuse existing DDL parser APIs rather than reimplementing within PySpark - Support DDL formatted string, `field type, field type`. - Support case-insensitivity for parsing. - Support nested data types as below: Before ``` >>> spark.createDataFrame([[[1]]], "struct<a: struct<b: int>>").show() ... ValueError: The strcut field string format is: 'field_name:field_type', but got: a: struct<b: int> ``` ``` >>> spark.createDataFrame([[[1]]], "a: struct<b: int>").show() ... ValueError: The strcut field string format is: 'field_name:field_type', but got: a: struct<b: int> ``` ``` >>> spark.createDataFrame([[1]], "a int").show() ... ValueError: Could not parse datatype: a int ``` After ``` >>> spark.createDataFrame([[[1]]], "struct<a: struct<b: int>>").show() +---+ \| a\| +---+ \|[1]\| +---+ ``` ``` >>> spark.createDataFrame([[[1]]], "a: struct<b: int>").show() +---+ \| a\| +---+ \|[1]\| +---+ ``` ``` >>> spark.createDataFrame([[1]], "a int").show() +---+ \| a\| +---+ \| 1\| +---+ ``` ## How was this patch tested? Author: hyukjinkwon <gurwls223@gmail.com> Closes #18590 from HyukjinKwon/deduplicate-python-ddl.	2017-07-11 22:03:10 +08:00
Xingbo Jiang	66d2168655	[SPARK-21366][SQL][TEST] Add sql test for window functions ## What changes were proposed in this pull request? Add sql test for window functions, also remove uncecessary test cases in `WindowQuerySuite`. ## How was this patch tested? Added `window.sql` and the corresponding output file. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #18591 from jiangxb1987/window.	2017-07-11 21:52:54 +08:00
hyukjinkwon	7514db1dec	[SPARK-21263][SQL] Do not allow partially parsing double and floats via NumberFormat in CSV ## What changes were proposed in this pull request? This PR proposes to remove `NumberFormat.parse` use to disallow a case of partially parsed data. For example, ``` scala> spark.read.schema("a DOUBLE").option("mode", "FAILFAST").csv(Seq("10u12").toDS).show() +----+ \| a\| +----+ \|10.0\| +----+ ``` ## How was this patch tested? Unit tests added in `UnivocityParserSuite` and `CSVSuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #18532 from HyukjinKwon/SPARK-21263.	2017-07-11 11:11:08 +01:00
Michael Allman	a4baa8f48f	[SPARK-20331][SQL] Enhanced Hive partition pruning predicate pushdown (Link to Jira: https://issues.apache.org/jira/browse/SPARK-20331) ## What changes were proposed in this pull request? Spark 2.1 introduced scalable support for Hive tables with huge numbers of partitions. Key to leveraging this support is the ability to prune unnecessary table partitions to answer queries. Spark supports a subset of the class of partition pruning predicates that the Hive metastore supports. If a user writes a query with a partition pruning predicate that is not supported by Spark, Spark falls back to loading all partitions and pruning client-side. We want to broaden Spark's current partition pruning predicate pushdown capabilities. One of the key missing capabilities is support for disjunctions. For example, for a table partitioned by date, writing a query with a predicate like date = 20161011 or date = 20161014 will result in Spark fetching all partitions. For a table partitioned by date and hour, querying a range of hours across dates can be quite difficult to accomplish without fetching all partition metadata. The current partition pruning support supports only comparisons against literals. We can expand that to foldable expressions by evaluating them at planning time. We can also implement support for the "IN" comparison by expanding it to a sequence of "OR"s. ## How was this patch tested? The `HiveClientSuite` and `VersionsSuite` were refactored and simplified to make Hive client-based, version-specific testing more modular and conceptually simpler. There are now two Hive test suites: `HiveClientSuite` and `HivePartitionFilteringSuite`. These test suites have a single-argument constructor taking a `version` parameter. As such, these test suites cannot be run by themselves. Instead, they have been bundled into "aggregation" test suites which run each suite for each Hive client version. These aggregation suites are called `HiveClientSuites` and `HivePartitionFilteringSuites`. The `VersionsSuite` and `HiveClientSuite` have been refactored into each of these aggregation suites, respectively. `HiveClientSuite` and `HivePartitionFilteringSuite` subclass a new abstract class, `HiveVersionSuite`. `HiveVersionSuite` collects functionality related to testing a single Hive version and overrides relevant test suite methods to display version-specific information. A new trait, `HiveClientVersions`, has been added with a sequence of Hive test versions. Author: Michael Allman <michael@videoamp.com> Closes #17633 from mallman/spark-20331-enhanced_partition_pruning_pushdown.	2017-07-11 14:50:11 +08:00
jinxing	97a1aa2c70	[SPARK-21315][SQL] Skip some spill files when generateIterator(startIndex) in ExternalAppendOnlyUnsafeRowArray. ## What changes were proposed in this pull request? In current code, it is expensive to use `UnboundedFollowingWindowFunctionFrame`, because it is iterating from the start to lower bound every time calling `write` method. When traverse the iterator, it's possible to skip some spilled files thus to save some time. ## How was this patch tested? Added unit test Did a small test for benchmark: Put 2000200 rows into `UnsafeExternalSorter`-- 2 spill files(each contains 1000000 rows) and inMemSorter contains 200 rows. Move the iterator forward to index=2000001. With this change: `getIterator(2000001)`, it will cost almost 0ms~1ms; Without this change: `for(int i=0; i<2000001; i++)geIterator().loadNext()`, it will cost 300ms. Author: jinxing <jinxing6042@126.com> Closes #18541 from jinxing64/SPARK-21315.	2017-07-11 11:47:47 +08:00
gatorsmile	1471ee7af5	[SPARK-21350][SQL] Fix the error message when the number of arguments is wrong when invoking a UDF ### What changes were proposed in this pull request? Users get a very confusing error when users specify a wrong number of parameters. ```Scala val df = spark.emptyDataFrame spark.udf.register("foo", (_: String).length) df.selectExpr("foo(2, 3, 4)") ``` ``` org.apache.spark.sql.UDFSuite$$anonfun$9$$anonfun$apply$mcV$sp$12 cannot be cast to scala.Function3 java.lang.ClassCastException: org.apache.spark.sql.UDFSuite$$anonfun$9$$anonfun$apply$mcV$sp$12 cannot be cast to scala.Function3 at org.apache.spark.sql.catalyst.expressions.ScalaUDF.<init>(ScalaUDF.scala:109) ``` This PR is to capture the exception and issue an error message that is consistent with what we did for built-in functions. After the fix, the error message is improved to ``` Invalid number of arguments for function foo; line 1 pos 0 org.apache.spark.sql.AnalysisException: Invalid number of arguments for function foo; line 1 pos 0 at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:119) ``` ### How was this patch tested? Added a test case Author: gatorsmile <gatorsmile@gmail.com> Closes #18574 from gatorsmile/statsCheck.	2017-07-11 11:19:59 +08:00
Takeshi Yamamuro	a2bec6c92a	[SPARK-21043][SQL] Add unionByName in Dataset ## What changes were proposed in this pull request? This pr added `unionByName` in `DataSet`. Here is how to use: ``` val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2") val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0") df1.unionByName(df2).show // output: // +----+----+----+ // \|col0\|col1\|col2\| // +----+----+----+ // \| 1\| 2\| 3\| // \| 6\| 4\| 5\| // +----+----+----+ ``` ## How was this patch tested? Added tests in `DataFrameSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #18300 from maropu/SPARK-21043-2.	2017-07-10 20:16:29 -07:00
Bryan Cutler	d03aebbe65	[SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas ## What changes were proposed in this pull request? Integrate Apache Arrow with Spark to increase performance of `DataFrame.toPandas`. This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays where they are then served to the Python process. The Python DataFrame can then collect the Arrow payloads where they are combined and converted to a Pandas DataFrame. Data types except complex, date, timestamp, and decimal are currently supported, otherwise an `UnsupportedOperation` exception is thrown. Additions to Spark include a Scala package private method `Dataset.toArrowPayload` that will convert data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served. A package private class/object `ArrowConverters` that provide data type mappings and conversion routines. In Python, a private method `DataFrame._collectAsArrow` is added to collect Arrow payloads and a SQLConf "spark.sql.execution.arrow.enable" can be used in `toPandas()` to enable using Arrow (uses the old conversion by default). ## How was this patch tested? Added a new test suite `ArrowConvertersSuite` that will run tests on conversion of Datasets to Arrow payloads for supported types. The suite will generate a Dataset and matching Arrow JSON data, then the dataset is converted to an Arrow payload and finally validated against the JSON data. This will ensure that the schema and data has been converted correctly. Added PySpark tests to verify the `toPandas` method is producing equal DataFrames with and without pyarrow. A roundtrip test to ensure the pandas DataFrame produced by pyspark is equal to a one made directly with pandas. Author: Bryan Cutler <cutlerb@gmail.com> Author: Li Jin <ice.xelloss@gmail.com> Author: Li Jin <li.jin@twosigma.com> Author: Wes McKinney <wes.mckinney@twosigma.com> Closes #18459 from BryanCutler/toPandas_with_arrow-SPARK-13534.	2017-07-10 15:21:03 -07:00
hyukjinkwon	2bfd5accdc	[SPARK-21266][R][PYTHON] Support schema a DDL-formatted string in dapply/gapply/from_json ## What changes were proposed in this pull request? This PR supports schema in a DDL formatted string for `from_json` in R/Python and `dapply` and `gapply` in R, which are commonly used and/or consistent with Scala APIs. Additionally, this PR exposes `structType` in R to allow working around in other possible corner cases. Python `from_json` ```python from pyspark.sql.functions import from_json data = [(1, '''{"a": 1}''')] df = spark.createDataFrame(data, ("key", "value")) df.select(from_json(df.value, "a INT").alias("json")).show() ``` R `from_json` ```R df <- sql("SELECT named_struct('name', 'Bob') as people") df <- mutate(df, people_json = to_json(df$people)) head(select(df, from_json(df$people_json, "name STRING"))) ``` `structType.character` ```R structType("a STRING, b INT") ``` `dapply` ```R dapply(createDataFrame(list(list(1.0)), "a"), function(x) {x}, "a DOUBLE") ``` `gapply` ```R gapply(createDataFrame(list(list(1.0)), "a"), "a", function(key, x) { x }, "a DOUBLE") ``` ## How was this patch tested? Doc tests for `from_json` in Python and unit tests `test_sparkSQL.R` in R. Author: hyukjinkwon <gurwls223@gmail.com> Closes #18498 from HyukjinKwon/SPARK-21266.	2017-07-10 10:40:03 -07:00
Juliusz Sompolski	18b3b00ecf	[SPARK-21272] SortMergeJoin LeftAnti does not update numOutputRows ## What changes were proposed in this pull request? Updating numOutputRows metric was missing from one return path of LeftAnti SortMergeJoin. ## How was this patch tested? Non-zero output rows manually seen in metrics. Author: Juliusz Sompolski <julek@databricks.com> Closes #18494 from juliuszsompolski/SPARK-21272.	2017-07-10 09:26:42 -07:00
Takeshi Yamamuro	647963a26a	[SPARK-20460][SQL] Make it more consistent to handle column name duplication ## What changes were proposed in this pull request? This pr made it more consistent to handle column name duplication. In the current master, error handling is different when hitting column name duplication: ``` // json scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil) scala> Seq("""{"a":1, "a":1}"""""").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data") scala> spark.read.format("json").schema(schema).load("/tmp/data").show org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#12, a#13.; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153) scala> spark.read.format("json").load("/tmp/data").show org.apache.spark.sql.AnalysisException: Duplicate column(s) : "a" found, cannot save to JSON format; at org.apache.spark.sql.execution.datasources.json.JsonDataSource.checkConstraints(JsonDataSource.scala:81) at org.apache.spark.sql.execution.datasources.json.JsonDataSource.inferSchema(JsonDataSource.scala:63) at org.apache.spark.sql.execution.datasources.json.JsonFileFormat.inferSchema(JsonFileFormat.scala:57) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176) // csv scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil) scala> Seq("a,a", "1,1").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data") scala> spark.read.format("csv").schema(schema).option("header", false).load("/tmp/data").show org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#41, a#42.; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152) // If `inferSchema` is true, a CSV format is duplicate-safe (See SPARK-16896) scala> spark.read.format("csv").option("header", true).load("/tmp/data").show +---+---+ \| a0\| a1\| +---+---+ \| 1\| 1\| +---+---+ // parquet scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil) scala> Seq((1, 1)).toDF("a", "b").coalesce(1).write.mode("overwrite").parquet("/tmp/data") scala> spark.read.format("parquet").schema(schema).option("header", false).load("/tmp/data").show org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#110, a#111.; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) ``` When this patch applied, the results change to; ``` // json scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil) scala> Seq("""{"a":1, "a":1}"""""").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data") scala> spark.read.format("json").schema(schema).load("/tmp/data").show org.apache.spark.sql.AnalysisException: Found duplicate column(s) in datasource: "a"; at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtil.scala:47) at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtil.scala:33) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:186) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:368) scala> spark.read.format("json").load("/tmp/data").show org.apache.spark.sql.AnalysisException: Found duplicate column(s) in datasource: "a"; at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtil.scala:47) at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtil.scala:33) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:186) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:368) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156) // csv scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil) scala> Seq("a,a", "1,1").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data") scala> spark.read.format("csv").schema(schema).option("header", false).load("/tmp/data").show org.apache.spark.sql.AnalysisException: Found duplicate column(s) in datasource: "a"; at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtil.scala:47) at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtil.scala:33) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:186) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:368) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) scala> spark.read.format("csv").option("header", true).load("/tmp/data").show +---+---+ \| a0\| a1\| +---+---+ \| 1\| 1\| +---+---+ // parquet scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil) scala> Seq((1, 1)).toDF("a", "b").coalesce(1).write.mode("overwrite").parquet("/tmp/data") scala> spark.read.format("parquet").schema(schema).option("header", false).load("/tmp/data").show org.apache.spark.sql.AnalysisException: Found duplicate column(s) in datasource: "a"; at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtil.scala:47) at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtil.scala:33) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:186) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:368) ``` ## How was this patch tested? Added tests in `DataFrameReaderWriterSuite` and `SQLQueryTestSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #17758 from maropu/SPARK-20460.	2017-07-10 15:58:34 +08:00
Wenchen Fan	0e80ecae30	[SPARK-21100][SQL][FOLLOWUP] cleanup code and add more comments for Dataset.summary ## What changes were proposed in this pull request? Some code cleanup and adding comments to make the code more readable. Changed the way to generate result rows, to be more clear. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #18570 from cloud-fan/summary.	2017-07-09 22:53:27 -07:00
Wenchen Fan	680b33f166	[SPARK-18016][SQL][FOLLOWUP] merge declareAddedFunctions, initNestedClasses and declareNestedClasses ## What changes were proposed in this pull request? These 3 methods have to be used together, so it makes more sense to merge them into one method and then the caller side only need to call one method. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #18579 from cloud-fan/minor.	2017-07-09 16:30:35 -07:00
Xiao Li	c3712b77a9	[SPARK-21307][REVERT][SQL] Remove SQLConf parameters from the parser-related classes ## What changes were proposed in this pull request? Since we do not set active sessions when parsing the plan, we are unable to correctly use SQLConf.get to find the correct active session. Since https://github.com/apache/spark/pull/18531 breaks the build, I plan to revert it at first. ## How was this patch tested? The existing test cases Author: Xiao Li <gatorsmile@gmail.com> Closes #18568 from gatorsmile/revert18531.	2017-07-08 11:56:19 -07:00
Zhenhua Wang	9fccc3627f	[SPARK-21083][SQL] Store zero size and row count when analyzing empty table ## What changes were proposed in this pull request? We should be able to store zero size and row count after analyzing empty table. This pr also enhances the test cases for re-analyzing tables. ## How was this patch tested? Added a new test case and enhanced some test cases. Author: Zhenhua Wang <wangzhenhua@huawei.com> Closes #18292 from wzhfy/analyzeNewColumn.	2017-07-08 20:44:12 +08:00
Dongjoon Hyun	0b8dd2d084	[SPARK-21345][SQL][TEST][TEST-MAVEN] SparkSessionBuilderSuite should clean up stopped sessions. ## What changes were proposed in this pull request? `SparkSessionBuilderSuite` should clean up stopped sessions. Otherwise, it leaves behind some stopped `SparkContext`s interfereing with other test suites using `ShardSQLContext`. Recently, master branch fails consequtively. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/ ## How was this patch tested? Pass the Jenkins with a updated suite. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #18567 from dongjoon-hyun/SPARK-SESSION.	2017-07-08 20:16:47 +08:00
Michael Patterson	f5f02d213d	[SPARK-20456][DOCS] Add examples for functions collection for pyspark ## What changes were proposed in this pull request? This adds documentation to many functions in pyspark.sql.functions.py: `upper`, `lower`, `reverse`, `unix_timestamp`, `from_unixtime`, `rand`, `randn`, `collect_list`, `collect_set`, `lit` Add units to the trigonometry functions. Renames columns in datetime examples to be more informative. Adds links between some functions. ## How was this patch tested? `./dev/lint-python` `python python/pyspark/sql/functions.py` `./python/run-tests.py --module pyspark-sql` Author: Michael Patterson <map222@gmail.com> Closes #17865 from map222/spark-20456.	2017-07-07 23:59:34 -07:00
Takeshi Yamamuro	7896e7b99d	[SPARK-21281][SQL] Use string types by default if array and map have no argument ## What changes were proposed in this pull request? This pr modified code to use string types by default if `array` and `map` in functions have no argument. This behaviour is the same with Hive one; ``` hive> CREATE TEMPORARY TABLE t1 AS SELECT map(); hive> DESCRIBE t1; _c0 map<string,string> hive> CREATE TEMPORARY TABLE t2 AS SELECT array(); hive> DESCRIBE t2; _c0 array<string> ``` ## How was this patch tested? Added tests in `DataFrameFunctionsSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #18516 from maropu/SPARK-21281.	2017-07-07 23:05:38 -07:00
Andrew Ray	e1a172c201	[SPARK-21100][SQL] Add summary method as alternative to describe that gives quartiles similar to Pandas ## What changes were proposed in this pull request? Adds method `summary` that allows user to specify which statistics and percentiles to calculate. By default it include the existing statistics from `describe` and quartiles (25th, 50th, and 75th percentiles) similar to Pandas. Also changes the implementation of `describe` to delegate to `summary`. ## How was this patch tested? additional unit test Author: Andrew Ray <ray.andrew@gmail.com> Closes #18307 from aray/SPARK-21100.	2017-07-08 13:47:41 +08:00
Wang Gengliang	a0fe32a219	[SPARK-21336] Revise rand comparison in BatchEvalPythonExecSuite ## What changes were proposed in this pull request? Revise rand comparison in BatchEvalPythonExecSuite In BatchEvalPythonExecSuite, there are two cases using the case "rand() > 3" Rand() generates a random value in [0, 1), it is wired to be compared with 3, use 0.3 instead ## How was this patch tested? unit test Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Wang Gengliang <ltnwgl@gmail.com> Closes #18560 from gengliangwang/revise_BatchEvalPythonExecSuite.	2017-07-07 15:39:29 -07:00
Wenchen Fan	fef081309f	[SPARK-21335][SQL] support un-aliased subquery ## What changes were proposed in this pull request? un-aliased subquery is supported by Spark SQL for a long time. Its semantic was not well defined and had confusing behaviors, and it's not a standard SQL syntax, so we disallowed it in https://issues.apache.org/jira/browse/SPARK-20690 . However, this is a breaking change, and we do have existing queries using un-aliased subquery. We should add the support back and fix its semantic. This PR fixes the un-aliased subquery by assigning a default alias name. After this PR, there is no syntax change from branch 2.2 to master, but we invalid a weird use case: `SELECT v.i from (SELECT i FROM v)`. Now this query will throw analysis exception because users should not be able to use the qualifier inside a subquery. ## How was this patch tested? new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #18559 from cloud-fan/sub-query.	2017-07-07 20:04:30 +08:00
Jacek Laskowski	7fcbb9b57f	[SPARK-21313][SS] ConsoleSink's string representation ## What changes were proposed in this pull request? Add `toString` with options for `ConsoleSink` so it shows nicely in query progress. BEFORE ``` "sink" : { "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink4b340441" } ``` AFTER ``` "sink" : { "description" : "ConsoleSink[numRows=10, truncate=false]" } ``` /cc zsxwing tdas ## How was this patch tested? Local build Author: Jacek Laskowski <jacek@japila.pl> Closes #18539 from jaceklaskowski/SPARK-21313-ConsoleSink-toString.	2017-07-07 08:31:30 +01:00
Liang-Chi Hsieh	5df99bd364	[SPARK-20703][SQL][FOLLOW-UP] Associate metrics with data writes onto DataFrameWriter operations ## What changes were proposed in this pull request? Remove time metrics since it seems no way to measure it in non per-row tracking. ## How was this patch tested? Existing tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18558 from viirya/SPARK-20703-followup.	2017-07-07 13:12:20 +08:00
Kazuaki Ishizaki	c09b31eb8f	[SPARK-21217][SQL] Support ColumnVector.Array.to<type>Array() ## What changes were proposed in this pull request? This PR implements bulk-copy for `ColumnVector.Array.to<type>Array()` methods (e.g. `toIntArray()`) in `ColumnVector.Array` by using `System.arrayCopy()` or `Platform.copyMemory()`. Before this PR, when one of these method is called, the generic method in `ArrayData` is called. It is not fast since element-wise copy is performed. This PR can improve performance of a benchmark program by 1.9x and 3.2x. Without this PR ``` OpenJDK 64-Bit Server VM 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11 on Linux 4.4.0-66-generic Intel(R) Xeon(R) CPU E5-2667 v3 3.20GHz Int Array Best/Avg Time(ms) Rate(M/s) Per Row(ns) ------------------------------------------------------------------------------------------------ ON_HEAP 586 / 628 14.3 69.9 OFF_HEAP 893 / 902 9.4 106.5 ``` With this PR ``` OpenJDK 64-Bit Server VM 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11 on Linux 4.4.0-66-generic Intel(R) Xeon(R) CPU E5-2667 v3 3.20GHz Int Array Best/Avg Time(ms) Rate(M/s) Per Row(ns) ------------------------------------------------------------------------------------------------ ON_HEAP 306 / 331 27.4 36.4 OFF_HEAP 282 / 287 29.8 33.6 ``` Source program ``` (MemoryMode.ON_HEAP :: MemoryMode.OFF_HEAP :: Nil).foreach { memMode => { val len = 8 * 1024 * 1024 val column = ColumnVector.allocate(len * 2, new ArrayType(IntegerType, false), memMode) val data = column.arrayData var i = 0 while (i < len) { data.putInt(i, i) i += 1 } column.putArray(0, 0, len) val benchmark = new Benchmark("Int Array", len, minNumIters = 20) benchmark.addCase(s"$memMode") { iter => var i = 0 while (i < 50) { column.getArray(0).toIntArray i += 1 } } benchmark.run }} ``` ## How was this patch tested? Added test suite Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #18425 from kiszk/SPARK-21217.	2017-07-07 13:09:32 +08:00
Jacek Laskowski	e5bb26174d	[SPARK-21329][SS] Make EventTimeWatermarkExec explicitly UnaryExecNode ## What changes were proposed in this pull request? Making EventTimeWatermarkExec explicitly UnaryExecNode /cc tdas zsxwing ## How was this patch tested? Local build. Author: Jacek Laskowski <jacek@japila.pl> Closes #18509 from jaceklaskowski/EventTimeWatermarkExec-UnaryExecNode.	2017-07-06 18:11:41 -07:00
Wenchen Fan	40c7add3a4	[SPARK-20946][SQL] Do not update conf for existing SparkContext in SparkSession.getOrCreate ## What changes were proposed in this pull request? SparkContext is shared by all sessions, we should not update its conf for only one session. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #18536 from cloud-fan/config.	2017-07-07 08:44:31 +08:00
Tathagata Das	0217dfd26f	[SPARK-21267][SS][DOCS] Update Structured Streaming Documentation ## What changes were proposed in this pull request? Few changes to the Structured Streaming documentation - Clarify that the entire stream input table is not materialized - Add information for Ganglia - Add Kafka Sink to the main docs - Removed a couple of leftover experimental tags - Added more associated reading material and talk videos. In addition, https://github.com/apache/spark/pull/16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those sameeragarwal cloud-fan. - Added a redirection to avoid breaking internal and possible external links. - Removed unnecessary redirection pages that were there since the separate scala, java, and python programming guides were merged together in 2013 or 2014. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #18485 from tdas/SPARK-21267.	2017-07-06 17:28:20 -07:00
Wang Gengliang	bf66335aca	[SPARK-21323][SQL] Rename plans.logical.statsEstimation.Range to ValueInterval ## What changes were proposed in this pull request? Rename org.apache.spark.sql.catalyst.plans.logical.statsEstimation.Range to ValueInterval. The current naming is identical to logical operator "range". Refactoring it to ValueInterval is more accurate. ## How was this patch tested? unit test Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Wang Gengliang <ltnwgl@gmail.com> Closes #18549 from gengliangwang/ValueInterval.	2017-07-06 13:58:27 -07:00
Liang-Chi Hsieh	48e44b24a7	[SPARK-21204][SQL] Add support for Scala Set collection types in serialization ## What changes were proposed in this pull request? Currently we can't produce a `Dataset` containing `Set` in SparkSQL. This PR tries to support serialization/deserialization of `Set`. Because there's no corresponding internal data type in SparkSQL for a `Set`, the most proper choice for serializing a set should be an array. ## How was this patch tested? Added unit tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18416 from viirya/SPARK-21204.	2017-07-07 01:07:45 +08:00
Bogdan Raducanu	26ac085deb	[SPARK-21228][SQL] InSet incorrect handling of structs ## What changes were proposed in this pull request? When data type is struct, InSet now uses TypeUtils.getInterpretedOrdering (similar to EqualTo) to build a TreeSet. In other cases it will use a HashSet as before (which should be faster). Similarly, In.eval uses Ordering.equiv instead of equals. ## How was this patch tested? New test in SQLQuerySuite. Author: Bogdan Raducanu <bogdan@databricks.com> Closes #18455 from bogdanrdc/SPARK-21228.	2017-07-07 01:04:57 +08:00
Wang Gengliang	d540dfbff3	[SPARK-21273][SQL][FOLLOW-UP] Add missing test cases back and revise code style ## What changes were proposed in this pull request? Add missing test cases back and revise code style Follow up the previous PR: https://github.com/apache/spark/pull/18479 ## How was this patch tested? Unit test Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Wang Gengliang <ltnwgl@gmail.com> Closes #18548 from gengliangwang/stat_propagation_revise.	2017-07-06 19:12:15 +08:00
wangzhenhua	b8e4d567a7	[SPARK-21324][TEST] Improve statistics test suites ## What changes were proposed in this pull request? 1. move `StatisticsCollectionTestBase` to a separate file. 2. move some test cases to `StatisticsCollectionSuite` so that `hive/StatisticsSuite` only keeps tests that need hive support. 3. clear up some test cases. ## How was this patch tested? Existing tests. Author: wangzhenhua <wangzhenhua@huawei.com> Author: Zhenhua Wang <wzh_zju@163.com> Closes #18545 from wzhfy/cleanStatSuites.	2017-07-06 16:00:31 +08:00
Liang-Chi Hsieh	6ff05a66fe	[SPARK-20703][SQL] Associate metrics with data writes onto DataFrameWriter operations ## What changes were proposed in this pull request? Right now in the UI, after SPARK-20213, we can show the operations to write data out. However, there is no way to associate metrics with data writes. We should show relative metrics on the operations. #### Supported commands This change supports updating metrics for file-based data writing operations, including `InsertIntoHadoopFsRelationCommand`, `InsertIntoHiveTable`. Supported metrics: * number of written files * number of dynamic partitions * total bytes of written data * total number of output rows * average writing data out time (ms) * (TODO) min/med/max number of output rows per file/partition * (TODO) min/med/max bytes of written data per file/partition #### Commands not supported `InsertIntoDataSourceCommand`, `SaveIntoDataSourceCommand`: The two commands uses DataSource APIs to write data out, i.e., the logic of writing data out is delegated to the DataSource implementations, such as `InsertableRelation.insert` and `CreatableRelationProvider.createRelation`. So we can't obtain metrics from delegated methods for now. `CreateHiveTableAsSelectCommand`, `CreateDataSourceTableAsSelectCommand` : The two commands invokes other commands to write data out. The invoked commands can even write to non file-based data source. We leave them as future TODO. #### How to update metrics of writing files out A `RunnableCommand` which wants to update metrics, needs to override its `metrics` and provide the metrics data structure to `ExecutedCommandExec`. The metrics are prepared during the execution of `FileFormatWriter`. The callback function passed to `FileFormatWriter` will accept the metrics and update accordingly. There is a metrics updating function in `RunnableCommand`. In runtime, the function will be bound to the spark context and `metrics` of `ExecutedCommandExec` and pass to `FileFormatWriter`. ## How was this patch tested? Updated unit tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18159 from viirya/SPARK-20703-2.	2017-07-06 15:47:09 +08:00
Tathagata Das	60043f2245	[SS][MINOR] Fix flaky test in DatastreamReaderWriterSuite. temp checkpoint dir should be deleted ## What changes were proposed in this pull request? Stopping query while it is being initialized can throw interrupt exception, in which case temporary checkpoint directories will not be deleted, and the test will fail. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #18442 from tdas/DatastreamReaderWriterSuite-fix.	2017-07-06 00:20:26 -07:00
Sumedh Wale	14a3bb3a00	[SPARK-21312][SQL] correct offsetInBytes in UnsafeRow.writeToStream ## What changes were proposed in this pull request? Corrects offsetInBytes calculation in UnsafeRow.writeToStream. Known failures include writes to some DataSources that have own SparkPlan implementations and cause EXCHANGE in writes. ## How was this patch tested? Extended UnsafeRowSuite.writeToStream to include an UnsafeRow over byte array having non-zero offset. Author: Sumedh Wale <swale@snappydata.io> Closes #18535 from sumwale/SPARK-21312.	2017-07-06 14:47:22 +08:00
gatorsmile	75b168fd30	[SPARK-21308][SQL] Remove SQLConf parameters from the optimizer ### What changes were proposed in this pull request? This PR removes SQLConf parameters from the optimizer rules ### How was this patch tested? The existing test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #18533 from gatorsmile/rmSQLConfOptimizer.	2017-07-06 14:18:50 +08:00
Shixiong Zhu	ab866f1173	[SPARK-21248][SS] The clean up codes in StreamExecution should not be interrupted ## What changes were proposed in this pull request? This PR uses `runUninterruptibly` to avoid that the clean up codes in StreamExecution is interrupted. It also removes an optimization in `runUninterruptibly` to make sure this method never throw `InterruptedException`. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #18461 from zsxwing/SPARK-21248.	2017-07-05 18:26:28 -07:00
gatorsmile	c8e7f445b9	[SPARK-21307][SQL] Remove SQLConf parameters from the parser-related classes. ### What changes were proposed in this pull request? This PR is to remove SQLConf parameters from the parser-related classes. ### How was this patch tested? The existing test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #18531 from gatorsmile/rmSQLConfParser.	2017-07-05 11:06:15 -07:00
Jeff Zhang	742da08685	[SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFunction Should Support UDAFs ## What changes were proposed in this pull request? Support register Java UDAFs in PySpark so that user can use Java UDAF in PySpark. Besides that I also add api in `UDFRegistration` ## How was this patch tested? Unit test is added Author: Jeff Zhang <zjffdu@apache.org> Closes #17222 from zjffdu/SPARK-19439.	2017-07-05 10:59:10 -07:00
ouyangxiaochen	5787ace463	[SPARK-20383][SQL] Supporting Create [temporary] Function with the keyword 'OR REPLACE' and 'IF NOT EXISTS' ## What changes were proposed in this pull request? support to create [temporary] function with the keyword 'OR REPLACE' and 'IF NOT EXISTS' ## How was this patch tested? manual test and added test cases Please review http://spark.apache.org/contributing.html before opening a pull request. Author: ouyangxiaochen <ou.yangxiaochen@zte.com.cn> Closes #17681 from ouyangxiaochen/spark-419.	2017-07-05 20:46:42 +08:00
Takuya UESHIN	873f3ad2b8	[SPARK-16167][SQL] RowEncoder should preserve array/map type nullability. ## What changes were proposed in this pull request? Currently `RowEncoder` doesn't preserve nullability of `ArrayType` or `MapType`. It returns always `containsNull = true` for `ArrayType`, `valueContainsNull = true` for `MapType` and also the nullability of itself is always `true`. This pr fixes the nullability of them. ## How was this patch tested? Add tests to check if `RowEncoder` preserves array/map nullability. Author: Takuya UESHIN <ueshin@happy-camper.st> Author: Takuya UESHIN <ueshin@databricks.com> Closes #13873 from ueshin/issues/SPARK-16167.	2017-07-05 20:32:47 +08:00
Takuya UESHIN	a386432566	[SPARK-18623][SQL] Add `returnNullable` to `StaticInvoke` and modify it to handle properly. ## What changes were proposed in this pull request? Add `returnNullable` to `StaticInvoke` the same as #15780 is trying to add to `Invoke` and modify to handle properly. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@happy-camper.st> Author: Takuya UESHIN <ueshin@databricks.com> Closes #16056 from ueshin/issues/SPARK-18623.	2017-07-05 14:25:26 +08:00
Wenchen Fan	f2c3b1dd69	[SPARK-21304][SQL] remove unnecessary isNull variable for collection related encoder expressions ## What changes were proposed in this pull request? For these collection-related encoder expressions, we don't need to create `isNull` variable if the loop element is not nullable. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #18529 from cloud-fan/minor.	2017-07-05 14:17:26 +08:00
Takuya UESHIN	ce10545d34	[SPARK-21300][SQL] ExternalMapToCatalyst should null-check map key prior to converting to internal value. ## What changes were proposed in this pull request? `ExternalMapToCatalyst` should null-check map key prior to converting to internal value to throw an appropriate Exception instead of something like NPE. ## How was this patch tested? Added a test and existing tests. Author: Takuya UESHIN <ueshin@databricks.com> Closes #18524 from ueshin/issues/SPARK-21300.	2017-07-05 11:24:38 +08:00
gatorsmile	de14086e1f	[SPARK-21295][SQL] Use qualified names in error message for missing references ### What changes were proposed in this pull request? It is strange to see the following error message. Actually, the column is from another table. ``` cannot resolve '`right.a`' given input columns: [a, c, d]; ``` After the PR, the error message looks like ``` cannot resolve '`right.a`' given input columns: [left.a, right.c, right.d]; ``` ### How was this patch tested? Added a test case Author: gatorsmile <gatorsmile@gmail.com> Closes #18520 from gatorsmile/removeSQLConf.	2017-07-05 10:40:02 +08:00
Dongjoon Hyun	1b50e0e0d6	[SPARK-20256][SQL] SessionState should be created more lazily ## What changes were proposed in this pull request? `SessionState` is designed to be created lazily. However, in reality, it created immediately in `SparkSession.Builder.getOrCreate` ([here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L943)). This PR aims to recover the lazy behavior by keeping the options into `initialSessionOptions`. The benefit is like the following. Users can start `spark-shell` and use RDD operations without any problems. BEFORE ```scala $ bin/spark-shell java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder' ... Caused by: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.security.AccessControlException: Permission denied: user=spark, access=READ, inode="/apps/hive/warehouse":hive:hdfs:drwx------ ``` As reported in SPARK-20256, this happens when the warehouse directory is not allowed for this user. AFTER ```scala $ bin/spark-shell ... Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112) Type in expressions to have them evaluated. Type :help for more information. scala> sc.range(0, 10, 1).count() res0: Long = 10 ``` ## How was this patch tested? Manual. This closes #18512 . Author: Dongjoon Hyun <dongjoon@apache.org> Closes #18501 from dongjoon-hyun/SPARK-20256.	2017-07-04 09:48:40 -07:00
$YIHAODIAN\wangshuangshuang$ YIHAODIAN\wangshuangshuang	a3c29fcbbd	[SPARK-19726][SQL] Faild to insert null timestamp value to mysql using spark jdbc ## What changes were proposed in this pull request? when creating table like following: > create table timestamp_test(id int(11), time_stamp timestamp not null default current_timestamp); The result of Excuting "insert into timestamp_test values (111, null)" is different between Spark and JDBC. ``` mysql> select * from timestamp_test; +------+---------------------+ \| id \| time_stamp \| +------+---------------------+ \| 111 \| 1970-01-01 00:00:00 \| -> spark \| 111 \| 2017-06-27 19:32:38 \| -> mysql +------+---------------------+ 2 rows in set (0.00 sec) ``` Because in such case ```StructField.nullable``` is false, so the generated codes of ```InvokeLike``` and ```BoundReference``` don't check whether the field is null or not. Instead, they directly use ```CodegenContext.INPUT_ROW.getLong(1)```, however, ```UnsafeRow.setNullAt(1)``` will put 0 in the underlying memory. The PR will ```always``` set ```StructField.nullable``` true after obtaining metadata from jdbc connection, Since we can insert null to not null timestamp column in MySQL. In this way, spark will propagate null to underlying DB engine, and let DB to choose how to process NULL. ## How was this patch tested? Added tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: YIHAODIAN\wangshuangshuang <wangshuangshuang@yihaodian.com> Author: Shuangshuang Wang <wsszone@gmail.com> Closes #18445 from shuangshuangwang/SPARK-19726.	2017-07-04 09:44:27 -07:00
gatorsmile	29b1f6b09f	[SPARK-21256][SQL] Add withSQLConf to Catalyst Test ### What changes were proposed in this pull request? SQLConf is moved to Catalyst. We are adding more and more test cases for verifying the conf-specific behaviors. It is nice to add a helper function to simplify the test cases. ### How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #18469 from gatorsmile/withSQLConf.	2017-07-04 08:54:07 -07:00
gatorsmile	c79c10ebaf	[TEST] Different behaviors of SparkContext Conf when building SparkSession ## What changes were proposed in this pull request? If the created ACTIVE sparkContext is not EXPLICITLY passed through the Builder's API `sparkContext()`, the conf of this sparkContext will also contain the conf set through the API `config()`; otherwise, the conf of this sparkContext will NOT contain the conf set through the API `config()` ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #18517 from gatorsmile/fixTestCase2.	2017-07-03 16:18:54 -07:00
Wenchen Fan	f953ca56ec	[SPARK-21284][SQL] rename SessionCatalog.registerFunction parameter name ## What changes were proposed in this pull request? Looking at the code in `SessionCatalog.registerFunction`, the parameter `ignoreIfExists` is a wrong name. When `ignoreIfExists` is true, we will override the function if it already exists. So `overrideIfExists` should be the corrected name. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #18510 from cloud-fan/minor.	2017-07-03 10:51:44 -07:00
Takeshi Yamamuro	363bfe30ba	[SPARK-20073][SQL] Prints an explicit warning message in case of NULL-safe equals ## What changes were proposed in this pull request? This pr added code to print the same warning messages with `===` cases when using NULL-safe equals (`<=>`). ## How was this patch tested? Existing tests. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #18436 from maropu/SPARK-20073.	2017-07-03 10:14:03 -07:00
aokolnychyi	17bdc36ef1	[SPARK-21102][SQL] Refresh command is too aggressive in parsing ### Idea This PR adds validation to REFRESH sql statements. Currently, users can specify whatever they want as resource path. For example, spark.sql("REFRESH ! $ !") will be executed without any exceptions. ### Implementation I am not sure that my current implementation is the most optimal, so any feedback is appreciated. My first idea was to make the grammar as strict as possible. Unfortunately, there were some problems. I tried the approach below: SqlBase.g4 ``` ... \| REFRESH TABLE tableIdentifier #refreshTable \| REFRESH resourcePath #refreshResource ... resourcePath : STRING \| (IDENTIFIER \| number \| nonReserved \| '/' \| '-')+ // other symbols can be added if needed ; ``` It is not flexible enough and requires to explicitly mention all possible symbols. Therefore, I came up with the current approach that is implemented in the code. Let me know your opinion on which one is better. Author: aokolnychyi <anton.okolnychyi@sap.com> Closes #18368 from aokolnychyi/spark-21102.	2017-07-03 09:35:49 -07:00
Zhenhua Wang	eb7a5a66bb	[TEST] Load test table based on case sensitivity ## What changes were proposed in this pull request? It is strange that we will get "table not found" error if the first sql uses upper case table names, when developers write tests with `TestHiveSingleton`, although case insensitivity. This is because in `TestHiveQueryExecution`, test tables are loaded based on exact matching instead of case sensitivity. ## How was this patch tested? Added a new test case. Author: Zhenhua Wang <wzh_zju@163.com> Closes #18504 from wzhfy/testHive.	2017-07-03 09:01:42 -07:00
Rui Zha	d4107196d5	[SPARK-18004][SQL] Make sure the date or timestamp related predicate can be pushed down to Oracle correctly ## What changes were proposed in this pull request? Move `compileValue` method in JDBCRDD to JdbcDialect, and override the `compileValue` method in OracleDialect to rewrite the Oracle-specific timestamp and date literals in where clause. ## How was this patch tested? An integration test has been added. Author: Rui Zha <zrdt713@gmail.com> Author: Zharui <zrdt713@gmail.com> Closes #18451 from SharpRay/extend-compileValue-to-dialects.	2017-07-02 17:37:47 -07:00
Xingbo Jiang	c605fee01f	[SPARK-21260][SQL][MINOR] Remove the unused OutputFakerExec ## What changes were proposed in this pull request? OutputFakerExec was added long ago and is not used anywhere now so we should remove it. ## How was this patch tested? N/A Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #18473 from jiangxb1987/OutputFakerExec.	2017-07-02 08:50:48 +01:00
Reynold Xin	b1d719e7c9	[SPARK-21273][SQL] Propagate logical plan stats using visitor pattern and mixin ## What changes were proposed in this pull request? We currently implement statistics propagation directly in logical plan. Given we already have two different implementations, it'd make sense to actually decouple the two and add stats propagation using mixin. This would reduce the coupling between logical plan and statistics handling. This can also be a powerful pattern in the future to add additional properties (e.g. constraints). ## How was this patch tested? Should be covered by existing test cases. Author: Reynold Xin <rxin@databricks.com> Closes #18479 from rxin/stats-trait.	2017-06-30 21:10:23 -07:00
wangzhenhua	61b5df567e	[SPARK-21127][SQL] Update statistics after data changing commands ## What changes were proposed in this pull request? Update stats after the following data changing commands: - InsertIntoHadoopFsRelationCommand - InsertIntoHiveTable - LoadDataCommand - TruncateTableCommand - AlterTableSetLocationCommand - AlterTableDropPartitionCommand ## How was this patch tested? Added new test cases. Author: wangzhenhua <wangzhenhua@huawei.com> Author: Zhenhua Wang <wzh_zju@163.com> Closes #18334 from wzhfy/changeStatsForOperation.	2017-07-01 10:01:44 +08:00
Wenchen Fan	4eb41879ce	[SPARK-17528][SQL] data should be copied properly before saving into InternalRow ## What changes were proposed in this pull request? For performance reasons, `UnsafeRow.getString`, `getStruct`, etc. return a "pointer" that points to a memory region of this unsafe row. This makes the unsafe projection a little dangerous, because all of its output rows share one instance. When we implement SQL operators, we should be careful to not cache the input rows because they may be produced by unsafe projection from child operator and thus its content may change overtime. However, when we updating values of InternalRow(e.g. in mutable projection and safe projection), we only copy UTF8String, we should also copy InternalRow, ArrayData and MapData. This PR fixes this, and also fixes the copy of vairous InternalRow, ArrayData and MapData implementations. ## How was this patch tested? new regression tests Author: Wenchen Fan <wenchen@databricks.com> Closes #18483 from cloud-fan/fix-copy.	2017-07-01 09:25:29 +08:00
Liang-Chi Hsieh	fd13255225	[SPARK-21052][SQL][FOLLOW-UP] Add hash map metrics to join ## What changes were proposed in this pull request? Remove `numHashCollisions` in `BytesToBytesMap`. And change `getAverageProbesPerLookup()` to `getAverageProbesPerLookup` as suggested. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18480 from viirya/SPARK-21052-followup.	2017-06-30 15:11:27 -07:00
Xiao Li	eed9c4ef85	[SPARK-21129][SQL] Arguments of SQL function call should not be named expressions ### What changes were proposed in this pull request? Function argument should not be named expressions. It could cause two issues: - Misleading error message - Unexpected query results when the column name is `distinct`, which is not a reserved word in our parser. ``` spark-sql> select count(distinct c1, distinct c2) from t1; Error in query: cannot resolve '`distinct`' given input columns: [c1, c2]; line 1 pos 26; 'Project [unresolvedalias('count(c1#30, 'distinct), None)] +- SubqueryAlias t1 +- CatalogRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#30, c2#31] ``` After the fix, the error message becomes ``` spark-sql> select count(distinct c1, distinct c2) from t1; Error in query: extraneous input 'c2' expecting {')', ',', '.', '[', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '\|', '\|\|', '^'}(line 1, pos 35) == SQL == select count(distinct c1, distinct c2) from t1 -----------------------------------^^^ ``` ### How was this patch tested? Added a test case to parser suite. Author: Xiao Li <gatorsmile@gmail.com> Author: gatorsmile <gatorsmile@gmail.com> Closes #18338 from gatorsmile/parserDistinctAggFunc.	2017-06-30 14:23:56 -07:00
Herman van Hovell	e2f32ee45a	[SPARK-21258][SQL] Fix WindowExec complex object aggregation with spilling ## What changes were proposed in this pull request? `WindowExec` currently improperly stores complex objects (UnsafeRow, UnsafeArrayData, UnsafeMapData, UTF8String) during aggregation by keeping a reference in the buffer used by `GeneratedMutableProjections` to the actual input data. Things go wrong when the input object (or the backing bytes) are reused for other things. This could happen in window functions when it starts spilling to disk. When reading the back the spill files the `UnsafeSorterSpillReader` reuses the buffer to which the `UnsafeRow` points, leading to weird corruption scenario's. Note that this only happens for aggregate functions that preserve (parts of) their input, for example `FIRST`, `LAST`, `MIN` & `MAX`. This was not seen before, because the spilling logic was not doing actual spills as much and actually used an in-memory page. This page was not cleaned up during window processing and made sure unsafe objects point to their own dedicated memory location. This was changed by https://github.com/apache/spark/pull/16909, after this PR Spark spills more eagerly. This PR provides a surgical fix because we are close to releasing Spark 2.2. This change just makes sure that there cannot be any object reuse at the expensive of a little bit of performance. We will follow-up with a more subtle solution at a later point. ## How was this patch tested? Added a regression test to `DataFrameWindowFunctionsSuite`. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #18470 from hvanhovell/SPARK-21258.	2017-06-30 12:34:09 +08:00
Liang-Chi Hsieh	18066f2e61	[SPARK-21052][SQL] Add hash map metrics to join ## What changes were proposed in this pull request? This adds the average hash map probe metrics to join operator such as `BroadcastHashJoin` and `ShuffledHashJoin`. This PR adds the API to `HashedRelation` to get average hash map probe. ## How was this patch tested? Related test cases are added. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18301 from viirya/SPARK-21052.	2017-06-29 21:28:48 +08:00
jinxing	d106a74c53	[SPARK-21240] Fix code style for constructing and stopping a SparkContext in UT. ## What changes were proposed in this pull request? Same with SPARK-20985. Fix code style for constructing and stopping a `SparkContext`. Assure the context is stopped to avoid other tests complain that there's only one `SparkContext` can exist. Author: jinxing <jinxing6042@126.com> Closes #18454 from jinxing64/SPARK-21240.	2017-06-29 09:59:36 +01:00
Wenchen Fan	9f6b3e65cc	[SPARK-21238][SQL] allow nested SQL execution ## What changes were proposed in this pull request? This is kind of another follow-up for https://github.com/apache/spark/pull/18064 . In #18064 , we wrap every SQL command with SQL execution, which makes nested SQL execution very likely to happen. #18419 trid to improve it a little bit, by introduing `SQLExecition.ignoreNestedExecutionId`. However, this is not friendly to data source developers, they may need to update their code to use this `ignoreNestedExecutionId` API. This PR proposes a new solution, to just allow nested execution. The downside is that, we may have multiple executions for one query. We can improve this by updating the data organization in SQLListener, to have 1-n mapping from query to execution, instead of 1-1 mapping. This can be done in a follow-up. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #18450 from cloud-fan/execution-id.	2017-06-29 14:37:42 +08:00
wangzhenhua	82e24912d6	[SPARK-21237][SQL] Invalidate stats once table data is changed ## What changes were proposed in this pull request? Invalidate spark's stats after data changing commands: - InsertIntoHadoopFsRelationCommand - InsertIntoHiveTable - LoadDataCommand - TruncateTableCommand - AlterTableSetLocationCommand - AlterTableDropPartitionCommand ## How was this patch tested? Added test cases. Author: wangzhenhua <wangzhenhua@huawei.com> Closes #18449 from wzhfy/removeStats.	2017-06-29 11:32:29 +08:00
Wenchen Fan	25c2edf6f9	[SPARK-21229][SQL] remove QueryPlan.preCanonicalized ## What changes were proposed in this pull request? `QueryPlan.preCanonicalized` is only overridden in a few places, and it does introduce an extra concept to `QueryPlan` which may confuse people. This PR removes it and override `canonicalized` in these places ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #18440 from cloud-fan/minor.	2017-06-29 11:21:50 +08:00
hyukjinkwon	db44f5f3e8	[SPARK-21224][R] Specify a schema by using a DDL-formatted string when reading in R ## What changes were proposed in this pull request? This PR proposes to support a DDL-formetted string as schema as below: ```r mockLines <- c("{\"name\":\"Michael\"}", "{\"name\":\"Andy\", \"age\":30}", "{\"name\":\"Justin\", \"age\":19}") jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp") writeLines(mockLines, jsonPath) df <- read.df(jsonPath, "json", "name STRING, age DOUBLE") collect(df) ``` ## How was this patch tested? Tests added in `test_streaming.R` and `test_sparkSQL.R` and manual tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #18431 from HyukjinKwon/r-ddl-schema.	2017-06-28 19:36:00 -07:00
Wang Gengliang	b72b8521d9	[SPARK-21222] Move elimination of Distinct clause from analyzer to optimizer ## What changes were proposed in this pull request? Move elimination of Distinct clause from analyzer to optimizer Distinct clause is useless after MAX/MIN clause. For example, "Select MAX(distinct a) FROM src from" is equivalent of "Select MAX(a) FROM src from" However, this optimization is implemented in analyzer. It should be in optimizer. ## How was this patch tested? Unit test gatorsmile cloud-fan Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Wang Gengliang <ltnwgl@gmail.com> Closes #18429 from gengliangwang/distinct_opt.	2017-06-29 08:47:31 +08:00
Burak Yavuz	e68aed70fb	[SPARK-21216][SS] Hive strategies missed in Structured Streaming IncrementalExecution ## What changes were proposed in this pull request? If someone creates a HiveSession, the planner in `IncrementalExecution` doesn't take into account the Hive scan strategies. This causes joins of Streaming DataFrame's with Hive tables to fail. ## How was this patch tested? Regression test Author: Burak Yavuz <brkyvz@gmail.com> Closes #18426 from brkyvz/hive-join.	2017-06-28 10:45:45 -07:00
Wenchen Fan	838effb98a	Revert "[SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas" This reverts commit `e44697606f`.	2017-06-28 14:28:40 +08:00
Liang-Chi Hsieh	fd8c931a30	[SPARK-19104][SQL] Lambda variables in ExternalMapToCatalyst should be global ## What changes were proposed in this pull request? The issue happens in `ExternalMapToCatalyst`. For example, the following codes create `ExternalMapToCatalyst` to convert Scala Map to catalyst map format. val data = Seq.tabulate(10)(i => NestedData(1, Map("key" -> InnerData("name", i + 100)))) val ds = spark.createDataset(data) The `valueConverter` in `ExternalMapToCatalyst` looks like: if (isnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true))) null else named_struct(name, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).name, true), value, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).value) There is a `CreateNamedStruct` expression (`named_struct`) to create a row of `InnerData.name` and `InnerData.value` that are referred by `ExternalMapToCatalyst_value52`. Because `ExternalMapToCatalyst_value52` are local variable, when `CreateNamedStruct` splits expressions to individual functions, the local variable can't be accessed anymore. ## How was this patch tested? Jenkins tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18418 from viirya/SPARK-19104.	2017-06-28 00:57:05 +08:00
Wang Gengliang	3cb3ccce12	[SPARK-21196] Split codegen info of query plan into sequence codegen info of query plan can be very long. In debugging console / web page, it would be more readable if the subtrees and corresponding codegen are split into sequence. Example: ```java codegenStringSeq(sql("select 1").queryExecution.executedPlan) ``` The example will return Seq[(String, String)] of length 1, containing the subtree as string and the corresponding generated code. The subtree as string: > (Project [1 AS 1#0] > +- Scan OneRowRelation[] The generated code: ```java / 001 / public Object generate(Object[] references) { / 002 / return new GeneratedIterator(references); / 003 / } / 004 / / 005 / final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator { / 006 / private Object[] references; / 007 / private scala.collection.Iterator[] inputs; / 008 / private scala.collection.Iterator inputadapter_input; / 009 / private UnsafeRow project_result; / 010 / private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder project_holder; / 011 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter project_rowWriter; / 012 / / 013 / public GeneratedIterator(Object[] references) { / 014 / this.references = references; / 015 / } / 016 / / 017 / public void init(int index, scala.collection.Iterator[] inputs) { / 018 / partitionIndex = index; / 019 / this.inputs = inputs; / 020 / inputadapter_input = inputs[0]; / 021 / project_result = new UnsafeRow(1); / 022 / project_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result, 0); / 023 / project_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder, 1); / 024 / / 025 / } / 026 / / 027 / protected void processNext() throws java.io.IOException { / 028 / while (inputadapter_input.hasNext() && !stopEarly()) { / 029 / InternalRow inputadapter_row = (InternalRow) inputadapter_input.next(); / 030 / project_rowWriter.write(0, 1); / 031 / append(project_result); / 032 / if (shouldStop()) return; / 033 / } / 034 / } / 035 / / 036 */ } ``` ## What changes were proposed in this pull request? add method codegenToSeq: split codegen info of query plan into sequence ## How was this patch tested? unit test cloud-fan gatorsmile Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Wang Gengliang <ltnwgl@gmail.com> Closes #18409 from gengliangwang/codegen.	2017-06-27 17:24:46 +08:00
Wenchen Fan	c22810004f	[SPARK-20213][SQL][FOLLOW-UP] introduce SQLExecution.ignoreNestedExecutionId ## What changes were proposed in this pull request? in https://github.com/apache/spark/pull/18064, to work around the nested sql execution id issue, we introduced several internal methods in `Dataset`, like `collectInternal`, `countInternal`, `showInternal`, etc., to avoid nested execution id. However, this approach has poor expansibility. When we hit other nested execution id cases, we may need to add more internal methods in `Dataset`. Our goal is to ignore the nested execution id in some cases, and we can have a better approach to achieve this goal, by introducing `SQLExecution.ignoreNestedExecutionId`. Whenever we find a place which needs to ignore the nested execution, we can just wrap the action with `SQLExecution.ignoreNestedExecutionId`, and this is more expansible than the previous approach. The idea comes from https://github.com/apache/spark/pull/17540/files#diff-ab49028253e599e6e74cc4f4dcb2e3a8R57 by rdblue ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #18419 from cloud-fan/follow.	2017-06-27 02:35:51 +08:00
Burak Yavuz	5282bae040	[SPARK-21153] Use project instead of expand in tumbling windows ## What changes were proposed in this pull request? Time windowing in Spark currently performs an Expand + Filter, because there is no way to guarantee the amount of windows a timestamp will fall in, in the general case. However, for tumbling windows, a record is guaranteed to fall into a single bucket. In this case, doubling the number of records with Expand is wasteful, and can be improved by using a simple Projection instead. Benchmarks show that we get an order of magnitude performance improvement after this patch. ## How was this patch tested? Existing unit tests. Benchmarked using the following code: ```scala import org.apache.spark.sql.functions._ spark.time { spark.range(numRecords) .select(from_unixtime((current_timestamp().cast("long") * 1000 + 'id / 1000) / 1000) as 'time) .select(window('time, "10 seconds")) .count() } ``` Setup: - 1 c3.2xlarge worker (8 cores) ![image](https://user-images.githubusercontent.com/5243515/27348748-ed991b84-55a9-11e7-8f8b-6e7abc524417.png) 1 B rows ran in 287 seconds after this optimization. I didn't wait for it to finish without the optimization. Shows about 5x improvement for large number of records. Author: Burak Yavuz <brkyvz@gmail.com> Closes #18364 from brkyvz/opt-tumble.	2017-06-26 01:26:32 -07:00
Zhenhua Wang	884347e1f7	[HOT FIX] fix stats functions in the recent patch ## What changes were proposed in this pull request? Builds failed due to the recent [merge](`b449a1d6aa`). This is because [PR#18309](https://github.com/apache/spark/pull/18309) needed update after [this patch](`b803b66a81`) was merged. ## How was this patch tested? N/A Author: Zhenhua Wang <wzh_zju@163.com> Closes #18415 from wzhfy/hotfixStats.	2017-06-25 15:06:29 +01:00
Masha Basmanova	b449a1d6aa	[SPARK-21079][SQL] Calculate total size of a partition table as a sum of individual partitions ## What changes were proposed in this pull request? Storage URI of a partitioned table may or may not point to a directory under which individual partitions are stored. In fact, individual partitions may be located in totally unrelated directories. Before this change, ANALYZE TABLE table COMPUTE STATISTICS command calculated total size of a table by adding up sizes of files found under table's storage URI. This calculation could produce 0 if partitions are stored elsewhere. This change uses storage URIs of individual partitions to calculate the sizes of all partitions of a table and adds these up to produce the total size of a table. CC: wzhfy ## How was this patch tested? Added unit test. Ran ANALYZE TABLE xxx COMPUTE STATISTICS on a partitioned Hive table and verified that sizeInBytes is calculated correctly. Before this change, the size would be zero. Author: Masha Basmanova <mbasmanova@fb.com> Closes #18309 from mbasmanova/mbasmanova-analyze-part-table.	2017-06-24 22:49:35 -07:00
gatorsmile	2e1586f60a	[SPARK-21203][SQL] Fix wrong results of insertion of Array of Struct ### What changes were proposed in this pull request? ```SQL CREATE TABLE `tab1` (`custom_fields` ARRAY<STRUCT<`id`: BIGINT, `value`: STRING>>) USING parquet INSERT INTO `tab1` SELECT ARRAY(named_struct('id', 1, 'value', 'a'), named_struct('id', 2, 'value', 'b')) SELECT custom_fields.id, custom_fields.value FROM tab1 ``` The above query always return the last struct of the array, because the rule `SimplifyCasts` incorrectly rewrites the query. The underlying cause is we always use the same `GenericInternalRow` object when doing the cast. ### How was this patch tested? Author: gatorsmile <gatorsmile@gmail.com> Closes #18412 from gatorsmile/castStruct.	2017-06-24 22:35:59 +08:00
Gabor Feher	b837bf9ae9	[SPARK-20555][SQL] Fix mapping of Oracle DECIMAL types to Spark types in read path ## What changes were proposed in this pull request? This PR is to revert some code changes in the read path of https://github.com/apache/spark/pull/14377. The original fix is https://github.com/apache/spark/pull/17830 When merging this PR, please give the credit to gaborfeher ## How was this patch tested? Added a test case to OracleIntegrationSuite.scala Author: Gabor Feher <gabor.feher@lynxanalytics.com> Author: gatorsmile <gatorsmile@gmail.com> Closes #18408 from gatorsmile/OracleType.	2017-06-23 21:53:38 -07:00
hyukjinkwon	7525ce98b4	[SPARK-20431][SS][FOLLOWUP] Specify a schema by using a DDL-formatted string in DataStreamReader ## What changes were proposed in this pull request? This pr supported a DDL-formatted string in `DataStreamReader.schema`. This fix could make users easily define a schema without importing the type classes. For example, ```scala scala> spark.readStream.schema("col0 INT, col1 DOUBLE").load("/tmp/abc").printSchema() root \|-- col0: integer (nullable = true) \|-- col1: double (nullable = true) ``` ## How was this patch tested? Added tests in `DataStreamReaderWriterSuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #18373 from HyukjinKwon/SPARK-20431.	2017-06-24 11:39:41 +08:00
Xiao Li	03eb6117af	[SPARK-21164][SQL] Remove isTableSample from Sample and isGenerated from Alias and AttributeReference ## What changes were proposed in this pull request? `isTableSample` and `isGenerated ` were introduced for SQL Generation respectively by https://github.com/apache/spark/pull/11148 and https://github.com/apache/spark/pull/11050 Since SQL Generation is removed, we do not need to keep `isTableSample`. ## How was this patch tested? The existing test cases Author: Xiao Li <gatorsmile@gmail.com> Closes #18379 from gatorsmile/CleanSample.	2017-06-23 14:48:33 -07:00
Dilip Biswal	13c2a4f2f8	[SPARK-20417][SQL] Move subquery error handling to checkAnalysis from Analyzer ## What changes were proposed in this pull request? Currently we do a lot of validations for subquery in the Analyzer. We should move them to CheckAnalysis which is the framework to catch and report Analysis errors. This was mentioned as a review comment in SPARK-18874. ## How was this patch tested? Exists tests + A few tests added to SQLQueryTestSuite. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #17713 from dilipbiswal/subquery_checkanalysis.	2017-06-23 11:02:54 -07:00
Ong Ming Yang	4cc62951a2	[MINOR][DOCS] Docs in DataFrameNaFunctions.scala use wrong method ## What changes were proposed in this pull request? * Following the first few examples in this file, the remaining methods should also be methods of `df.na` not `df`. * Filled in some missing parentheses ## How was this patch tested? N/A Author: Ong Ming Yang <me@ongmingyang.com> Closes #18398 from ongmingyang/master.	2017-06-23 10:56:59 -07:00
Tathagata Das	2ebd0838d1	[SPARK-21192][SS] Preserve State Store provider class configuration across StreamingQuery restarts ## What changes were proposed in this pull request? If the SQL conf for StateStore provider class is changed between restarts (i.e. query started with providerClass1 and attempted to restart using providerClass2), then the query will fail in a unpredictable way as files saved by one provider class cannot be used by the newer one. Ideally, the provider class used to start the query should be used to restart the query, and the configuration in the session where it is being restarted should be ignored. This PR saves the provider class config to OffsetSeqLog, in the same way # shuffle partitions is saved and recovered. ## How was this patch tested? new unit tests Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #18402 from tdas/SPARK-21192.	2017-06-23 10:55:02 -07:00
wangzhenhua	b803b66a81	[SPARK-21180][SQL] Remove conf from stats functions since now we have conf in LogicalPlan ## What changes were proposed in this pull request? After wiring `SQLConf` in logical plan ([PR 18299](https://github.com/apache/spark/pull/18299)), we can remove the need of passing `conf` into `def stats` and `def computeStats`. ## How was this patch tested? Covered by existing tests, plus some modified existing tests. Author: wangzhenhua <wangzhenhua@huawei.com> Author: Zhenhua Wang <wzh_zju@163.com> Closes #18391 from wzhfy/removeConf.	2017-06-23 10:33:53 -07:00
Takeshi Yamamuro	f3dea60793	[SPARK-21144][SQL] Print a warning if the data schema and partition schema have the duplicate columns ## What changes were proposed in this pull request? The current master outputs unexpected results when the data schema and partition schema have the duplicate columns: ``` withTempPath { dir => val basePath = dir.getCanonicalPath spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=1").toString) spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=a").toString) spark.read.parquet(basePath).show() } +---+ \|foo\| +---+ \| 1\| \| 1\| \| a\| \| a\| \| 1\| \| a\| +---+ ``` This patch added code to print a warning when the duplication found. ## How was this patch tested? Manually checked. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #18375 from maropu/SPARK-21144-3.	2017-06-23 09:28:02 -07:00
jinxing	153dd49b74	[SPARK-21047] Add test suites for complicated cases in ColumnarBatchSuite ## What changes were proposed in this pull request? Current ColumnarBatchSuite has very simple test cases for `Array` and `Struct`. This pr wants to add some test suites for complicated cases in ColumnVector. Author: jinxing <jinxing6042@126.com> Closes #18327 from jinxing64/SPARK-21047.	2017-06-23 20:41:17 +08:00
Tathagata Das	fe24634d14	[SPARK-21145][SS] Added StateStoreProviderId with queryRunId to reload StateStoreProviders when query is restarted ## What changes were proposed in this pull request? StateStoreProvider instances are loaded on-demand in a executor when a query is started. When a query is restarted, the loaded provider instance will get reused. Now, there is a non-trivial chance, that the task of the previous query run is still running, while the tasks of the restarted run has started. So for a stateful partition, there may be two concurrent tasks related to the same stateful partition, and there for using the same provider instance. This can lead to inconsistent results and possibly random failures, as state store implementations are not designed to be thread-safe. To fix this, I have introduced a `StateStoreProviderId`, that unique identifies a provider loaded in an executor. It has the query run id in it, thus making sure that restarted queries will force the executor to load a new provider instance, thus avoiding two concurrent tasks (from two different runs) from reusing the same provider instance. Additional minor bug fixes - All state stores related to query run is marked as deactivated in the `StateStoreCoordinator` so that the executors can unload them and clear resources. - Moved the code that determined the checkpoint directory of a state store from implementation-specific code (`HDFSBackedStateStoreProvider`) to non-specific code (StateStoreId), so that implementation do not accidentally get it wrong. - Also added store name to the path, to support multiple stores per sql operator partition. Note: This change does not address the scenario where two tasks of the same run (e.g. speculative tasks) are concurrently running in the same executor. The chance of this very small, because ideally speculative tasks should never run in the same executor. ## How was this patch tested? Existing unit tests + new unit test. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #18355 from tdas/SPARK-21145.	2017-06-23 00:43:21 -07:00
Wang Gengliang	b8a743b6a5	[SPARK-21174][SQL] Validate sampling fraction in logical operator level ## What changes were proposed in this pull request? Currently the validation of sampling fraction in dataset is incomplete. As an improvement, validate sampling fraction in logical operator level: 1) if with replacement: fraction should be nonnegative 2) else: fraction should be on interval [0, 1] Also add test cases for the validation. ## How was this patch tested? integration tests gatorsmile cloud-fan Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Wang Gengliang <ltnwgl@gmail.com> Closes #18387 from gengliangwang/sample_ratio_validate.	2017-06-23 09:27:35 +08:00
Bryan Cutler	e44697606f	[SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas ## What changes were proposed in this pull request? Integrate Apache Arrow with Spark to increase performance of `DataFrame.toPandas`. This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays where they are then served to the Python process. The Python DataFrame can then collect the Arrow payloads where they are combined and converted to a Pandas DataFrame. All non-complex data types are currently supported, otherwise an `UnsupportedOperation` exception is thrown. Additions to Spark include a Scala package private method `Dataset.toArrowPayloadBytes` that will convert data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served. A package private class/object `ArrowConverters` that provide data type mappings and conversion routines. In Python, a public method `DataFrame.collectAsArrow` is added to collect Arrow payloads and an optional flag in `toPandas(useArrow=False)` to enable using Arrow (uses the old conversion by default). ## How was this patch tested? Added a new test suite `ArrowConvertersSuite` that will run tests on conversion of Datasets to Arrow payloads for supported types. The suite will generate a Dataset and matching Arrow JSON data, then the dataset is converted to an Arrow payload and finally validated against the JSON data. This will ensure that the schema and data has been converted correctly. Added PySpark tests to verify the `toPandas` method is producing equal DataFrames with and without pyarrow. A roundtrip test to ensure the pandas DataFrame produced by pyspark is equal to a one made directly with pandas. Author: Bryan Cutler <cutlerb@gmail.com> Author: Li Jin <ice.xelloss@gmail.com> Author: Li Jin <li.jin@twosigma.com> Author: Wes McKinney <wes.mckinney@twosigma.com> Closes #15821 from BryanCutler/wip-toPandas_with_arrow-SPARK-13534.	2017-06-23 09:01:13 +08:00
Lubo Zhang	e55a105ae0	[SPARK-20599][SS] ConsoleSink should work with (batch) ## What changes were proposed in this pull request? Currently, if we read a batch and want to display it on the console sink, it will lead a runtime exception. Changes: - In this PR, we add a match rule to check whether it is a ConsoleSinkProvider, we will display the Dataset if using console format. ## How was this patch tested? spark.read.schema().json(path).write.format("console").save Author: Lubo Zhang <lubo.zhang@intel.com> Author: lubozhan <lubo.zhang@intel.com> Closes #18347 from lubozhan/dev.	2017-06-22 11:18:58 -07:00
actuaryzhang	97b307c87c	[SQL][DOC] Fix documentation of lpad ## What changes were proposed in this pull request? Fix incomplete documentation for `lpad`. Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #18367 from actuaryzhang/SQLDoc.	2017-06-22 10:12:33 +01:00
Shixiong Zhu	d66b143eec	[SPARK-21167][SS] Decode the path generated by File sink to handle special characters ## What changes were proposed in this pull request? Decode the path generated by File sink to handle special characters. ## How was this patch tested? The added unit test. Author: Shixiong Zhu <shixiong@databricks.com> Closes #18381 from zsxwing/SPARK-21167.	2017-06-21 23:43:21 -07:00
hyukjinkwon	7a00c658d4	[SPARK-21147][SS] Throws an analysis exception when a user-specified schema is given in socket/rate sources ## What changes were proposed in this pull request? This PR proposes to throw an exception if a schema is provided by user to socket source as below: socket source ```scala import org.apache.spark.sql.types._ val userSpecifiedSchema = StructType( StructField("name", StringType) :: StructField("area", StringType) :: Nil) val df = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).schema(userSpecifiedSchema).load df.printSchema ``` Before ``` root \|-- value: string (nullable = true) ``` After ``` org.apache.spark.sql.AnalysisException: The socket source does not support a user-specified schema.; at org.apache.spark.sql.execution.streaming.TextSocketSourceProvider.sourceSchema(socket.scala:199) at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:192) at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:87) at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:87) at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30) at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:150) ... 50 elided ``` rate source ```scala spark.readStream.format("rate").schema(spark.range(1).schema).load().printSchema() ``` Before ``` root \|-- timestamp: timestamp (nullable = true) \|-- value: long (nullable = true)` ``` After ``` org.apache.spark.sql.AnalysisException: The rate source does not support a user-specified schema.; at org.apache.spark.sql.execution.streaming.RateSourceProvider.sourceSchema(RateSourceProvider.scala:57) at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:192) at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:87) at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:87) at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30) at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:150) ... 48 elided ``` ## How was this patch tested? Unit test in `TextSocketStreamSuite` and `RateSourceSuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #18365 from HyukjinKwon/SPARK-21147.	2017-06-21 10:51:17 -07:00
Xingbo Jiang	cad88f17e8	[SPARK-17851][SQL][TESTS] Make sure all test sqls in catalyst pass checkAnalysis ## What changes were proposed in this pull request? Currently we have several tens of test sqls in catalyst will fail at `SimpleAnalyzer.checkAnalysis`, we should make sure they are valid. This PR makes the following changes: 1. Apply `checkAnalysis` on plans that tests `Optimizer` rules, but don't require the testcases for `Parser`/`Analyzer` pass `checkAnalysis`; 2. Fix testcases for `Optimizer` that would have fall. ## How was this patch tested? Apply `SimpleAnalyzer.checkAnalysis` on plans in `PlanTest.comparePlans`, update invalid test cases. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Author: jiangxingbo <jiangxb1987@gmail.com> Closes #15417 from jiangxb1987/cptest.	2017-06-21 09:40:06 -07:00
Marcos P	e92befcb4b	[MINOR][DOC] modified issue link and updated status ## What changes were proposed in this pull request? This PR aims to clarify some outdated comments that i found at spark-catalyst and spark-sql pom files. Maven bug still happening and in order to track it I have updated the issue link and also the status of the issue. Author: Marcos P <mpenate@stratio.com> Closes #18374 from mpenate/fix/mng-3559-comment.	2017-06-21 15:34:10 +01:00
sureshthalamati	9ce714dca2	[SPARK-10655][SQL] Adding additional data type mappings to jdbc DB2dialect. This patch adds DB2 specific data type mappings for decfloat, real, xml , and timestamp with time zone (DB2Z specific type) types on read and for byte, short data types on write to the to jdbc data source DB2 dialect. Default mapping does not work for these types when reading/writing from DB2 database. Added docker test, and a JDBC unit test case. Author: sureshthalamati <suresh.thalamati@gmail.com> Closes #9162 from sureshthalamati/db2dialect_enhancements-spark-10655.	2017-06-20 22:35:42 -07:00
Reynold Xin	b6b108826a	[SPARK-21103][SQL] QueryPlanConstraints should be part of LogicalPlan ## What changes were proposed in this pull request? QueryPlanConstraints should be part of LogicalPlan, rather than QueryPlan, since the constraint framework is only used for query plan rewriting and not for physical planning. ## How was this patch tested? Should be covered by existing tests, since it is a simple refactoring. Author: Reynold Xin <rxin@databricks.com> Closes #18310 from rxin/SPARK-21103.	2017-06-20 11:34:22 -07:00
Wenchen Fan	e862dc9049	[SPARK-21150][SQL] Persistent view stored in Hive metastore should be case preserving ## What changes were proposed in this pull request? This is a regression in Spark 2.2. In Spark 2.2, we introduced a new way to resolve persisted view: https://issues.apache.org/jira/browse/SPARK-18209 , but this makes the persisted view non case-preserving because we store the schema in hive metastore directly. We should follow data source table and store schema in table properties. ## How was this patch tested? new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #18360 from cloud-fan/view.	2017-06-20 09:15:33 -07:00
Xianyang Liu	0a4b7e4f81	[MINOR] Fix some typo of the document ## What changes were proposed in this pull request? Fix some typo of the document. ## How was this patch tested? Existing tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Xianyang Liu <xianyang.liu@intel.com> Closes #18350 from ConeyLiu/fixtypo.	2017-06-19 20:35:58 +01:00
Dongjoon Hyun	ecc5631351	[MINOR][BUILD] Fix Java linter errors ## What changes were proposed in this pull request? This PR cleans up a few Java linter errors for Apache Spark 2.2 release. ## How was this patch tested? ```bash $ dev/lint-java Using `mvn` from path: /usr/local/bin/mvn Checkstyle checks passed. ``` We can check the result at Travis CI, [here](https://travis-ci.org/dongjoon-hyun/spark/builds/244297894). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #18345 from dongjoon-hyun/fix_lint_java_2.	2017-06-19 20:17:54 +01:00
Yong Tang	e5387018e7	[SPARK-19975][PYTHON][SQL] Add map_keys and map_values functions to Python ## What changes were proposed in this pull request? This fix tries to address the issue in SPARK-19975 where we have `map_keys` and `map_values` functions in SQL yet there is no Python equivalent functions. This fix adds `map_keys` and `map_values` functions to Python. ## How was this patch tested? This fix is tested manually (See Python docs for examples). Author: Yong Tang <yong.tang.github@outlook.com> Closes #17328 from yongtang/SPARK-19975.	2017-06-19 11:40:07 -07:00
Xiao Li	9413b84b5a	[SPARK-21132][SQL] DISTINCT modifier of function arguments should not be silently ignored ### What changes were proposed in this pull request? We should not silently ignore `DISTINCT` when they are not supported in the function arguments. This PR is to block these cases and issue the error messages. ### How was this patch tested? Added test cases for both regular functions and window functions Author: Xiao Li <gatorsmile@gmail.com> Closes #18340 from gatorsmile/firstCount.	2017-06-19 15:51:21 +08:00
Yuming Wang	f913f158ec	[SPARK-20948][SQL] Built-in SQL Function UnaryMinus/UnaryPositive support string type ## What changes were proposed in this pull request? Built-in SQL Function UnaryMinus/UnaryPositive support string type, if it's string type, convert it to double type, after this PR: ```sql spark-sql> select positive('-1.11'), negative('-1.11'); -1.11 1.11 spark-sql> ``` ## How was this patch tested? unit tests Author: Yuming Wang <wgyumg@gmail.com> Closes #18173 from wangyum/SPARK-20948.	2017-06-18 20:14:05 -07:00
Yuming Wang	ce49428ef7	[SPARK-20749][SQL][FOLLOWUP] Support character_length ## What changes were proposed in this pull request? The function `char_length` is shorthand for `character_length` function. Both Hive and Postgresql support `character_length`, This PR add support for `character_length`. Ref: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions https://www.postgresql.org/docs/current/static/functions-string.html ## How was this patch tested? unit tests Author: Yuming Wang <wgyumg@gmail.com> Closes #18330 from wangyum/SPARK-20749-character_length.	2017-06-18 18:56:53 -07:00
Yuming Wang	53e48f73e4	[SPARK-20931][SQL] ABS function support string type. ## What changes were proposed in this pull request? ABS function support string type. Hive/MySQL support this feature. Ref: `4ba713ccd8/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFAbs.java (L93)` ## How was this patch tested? unit tests Author: Yuming Wang <wgyumg@gmail.com> Closes #18153 from wangyum/SPARK-20931.	2017-06-16 09:40:58 -07:00
Wenchen Fan	d1c333ac77	[SPARK-21119][SQL] unset table properties should keep the table comment ## What changes were proposed in this pull request? Previous code mistakenly use `table.properties.get("comment")` to read the existing table comment, we should use `table.comment` ## How was this patch tested? new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #18325 from cloud-fan/unset.	2017-06-16 08:05:43 -07:00
Kazuaki Ishizaki	7a3e5dc28b	[SPARK-20749][SQL] Built-in SQL Function Support - all variants of LEN[GTH] ## What changes were proposed in this pull request? This PR adds built-in SQL function `BIT_LENGTH()`, `CHAR_LENGTH()`, and `OCTET_LENGTH()` functions. `BIT_LENGTH()` returns the bit length of the given string or binary expression. `CHAR_LENGTH()` returns the length of the given string or binary expression. (i.e. equal to `LENGTH()`) `OCTET_LENGTH()` returns the byte length of the given string or binary expression. ## How was this patch tested? Added new test suites for these three functions Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #18046 from kiszk/SPARK-20749.	2017-06-15 23:06:58 -07:00
Xianyang Liu	87ab0cec65	[SPARK-21072][SQL] TreeNode.mapChildren should only apply to the children node. ## What changes were proposed in this pull request? Just as the function name and comments of `TreeNode.mapChildren` mentioned, the function should be apply to all currently node children. So, the follow code should judge whether it is the children node. https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342 ## How was this patch tested? Existing tests. Author: Xianyang Liu <xianyang.liu@intel.com> Closes #18284 from ConeyLiu/treenode.	2017-06-16 12:10:09 +08:00
Xiao Li	5d35d5c15c	[SPARK-21112][SQL] ALTER TABLE SET TBLPROPERTIES should not overwrite COMMENT ### What changes were proposed in this pull request? `ALTER TABLE SET TBLPROPERTIES` should not overwrite `COMMENT` even if the input property does not have the property of `COMMENT`. This PR is to fix the issue. ### How was this patch tested? Covered by the existing tests. Author: Xiao Li <gatorsmile@gmail.com> Closes #18318 from gatorsmile/fixTableComment.	2017-06-16 10:11:23 +08:00
ALeksander Eskilson	b32b2123dd	[SPARK-18016][SQL][CATALYST] Code Generation: Constant Pool Limit - Class Splitting ## What changes were proposed in this pull request? This pull-request exclusively includes the class splitting feature described in #16648. When code for a given class would grow beyond 1600k bytes, a private, nested sub-class is generated into which subsequent functions are inlined. Additional sub-classes are generated as the code threshold is met subsequent times. This code includes 3 changes: 1. Includes helper maps, lists, and functions for keeping track of sub-classes during code generation (included in the `CodeGenerator` class). These helper functions allow nested classes and split functions to be initialized/declared/inlined to the appropriate locations in the various projection classes. 2. Changes `addNewFunction` to return a string to support instances where a split function is inlined to a nested class and not the outer class (and so must be invoked using the class-qualified name). Uses of `addNewFunction` throughout the codebase are modified so that the returned name is properly used. 3. Removes instances of the `this` keyword when used on data inside generated classes. All state declared in the outer class is by default global and accessible to the nested classes. However, if a reference to global state in a nested class is prepended with the `this` keyword, it would attempt to reference state belonging to the nested class (which would not exist), rather than the correct variable belonging to the outer class. ## How was this patch tested? Added a test case to the `GeneratedProjectionSuite` that increases the number of columns tested in various projections to a threshold that would previously have triggered a `JaninoRuntimeException` for the Constant Pool. Note: This PR does not address the second Constant Pool issue with code generation (also mentioned in #16648): excess global mutable state. A second PR may be opened to resolve that issue. Author: ALeksander Eskilson <alek.eskilson@cerner.com> Closes #18075 from bdrillard/class_splitting_only.	2017-06-15 13:45:08 +08:00
Xiao Li	2051428173	[SPARK-20980][SQL] Rename `wholeFile` to `multiLine` for both CSV and JSON ### What changes were proposed in this pull request? The current option name `wholeFile` is misleading for CSV users. Currently, it is not representing a record per file. Actually, one file could have multiple records. Thus, we should rename it. Now, the proposal is `multiLine`. ### How was this patch tested? N/A Author: Xiao Li <gatorsmile@gmail.com> Closes #18202 from gatorsmile/renameCVSOption.	2017-06-15 13:18:19 +08:00
Reynold Xin	fffeb6d7c3	[SPARK-21092][SQL] Wire SQLConf in logical plan and expressions ## What changes were proposed in this pull request? It is really painful to not have configs in logical plan and expressions. We had to add all sorts of hacks (e.g. pass SQLConf explicitly in functions). This patch exposes SQLConf in logical plan, using a thread local variable and a getter closure that's set once there is an active SparkSession. The implementation is a bit of a hack, since we didn't anticipate this need in the beginning (config was only exposed in physical plan). The implementation is described in `SQLConf.get`. In terms of future work, we should follow up to clean up CBO (remove the need for passing in config). ## How was this patch tested? Updated relevant tests for constraint propagation. Author: Reynold Xin <rxin@databricks.com> Closes #18299 from rxin/SPARK-21092.	2017-06-14 22:11:41 -07:00
Reynold Xin	e254e868f1	[SPARK-21091][SQL] Move constraint code into QueryPlanConstraints ## What changes were proposed in this pull request? This patch moves constraint related code into a separate trait QueryPlanConstraints, so we don't litter QueryPlan with a lot of constraint private functions. ## How was this patch tested? This is a simple move refactoring and should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #18298 from rxin/SPARK-21091.	2017-06-14 14:28:21 -07:00
Xiao Li	77a2fc5b52	Revert "[SPARK-20941][SQL] Fix SubqueryExec Reuse" This reverts commit `f7cf2096fd`.	2017-06-14 11:48:32 -07:00
Xiao Li	df766a4714	[SPARK-21089][SQL] Fix DESC EXTENDED/FORMATTED to Show Table Properties ### What changes were proposed in this pull request? Since both table properties and storage properties share the same key values, table properties are not shown in the output of DESC EXTENDED/FORMATTED when the storage properties are not empty. This PR is to fix the above issue by renaming them to different keys. ### How was this patch tested? Added test cases. Author: Xiao Li <gatorsmile@gmail.com> Closes #18294 from gatorsmile/tableProperties.	2017-06-14 11:13:16 -07:00
gatorsmile	0c88e8d372	[SPARK-21085][SQL] Failed to read the partitioned table created by Spark 2.1 ### What changes were proposed in this pull request? Before the PR, Spark is unable to read the partitioned table created by Spark 2.1 when the table schema does not put the partitioning column at the end of the schema. [assert(partitionFields.map(_.name) == partitionColumnNames)](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L234-L236) When reading the table metadata from the metastore, we also need to reorder the columns. ### How was this patch tested? Added test cases to check both Hive-serde and data source tables. Author: gatorsmile <gatorsmile@gmail.com> Closes #18295 from gatorsmile/reorderReadSchema.	2017-06-14 16:28:06 +08:00
Yuming Wang	4d01aa4648	[SPARK-20754][SQL][FOLLOWUP] Add Function Alias For MOD/POSITION. ## What changes were proposed in this pull request? https://github.com/apache/spark/pull/18106 Support TRUNC (number), We should also add function alias for `MOD `and `POSITION`. `POSITION(substr IN str) `is a synonym for `LOCATE(substr,str)`. same as MySQL: https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_position ## How was this patch tested? unit tests Author: Yuming Wang <wgyumg@gmail.com> Closes #18206 from wangyum/SPARK-20754-mod&position.	2017-06-13 23:39:06 -07:00
lianhuiwang	8b5b2e272f	[SPARK-20986][SQL] Reset table's statistics after PruneFileSourcePartitions rule. ## What changes were proposed in this pull request? After PruneFileSourcePartitions rule, It needs reset table's statistics because PruneFileSourcePartitions can filter some unnecessary partitions. So the statistics need to be changed. ## How was this patch tested? add unit test. Author: lianhuiwang <lianhuiwang09@gmail.com> Closes #18205 from lianhuiwang/SPARK-20986.	2017-06-14 09:57:56 +08:00
liuxian	7ba8bf288d	[SPARK-21016][CORE] Improve code fault tolerance for converting string to number ## What changes were proposed in this pull request? When converting `string` to `number`(int, long or double), if the string has a space before or after,will lead to unnecessary mistakes. ## How was this patch tested? unit test Author: liuxian <liu.xian3@zte.com.cn> Closes #18238 from 10110346/lx-wip-0608.	2017-06-13 10:12:28 -07:00
Liang-Chi Hsieh	bcf3643f94	[SPARK-21051][SQL] Add hash map metrics to aggregate ## What changes were proposed in this pull request? This adds the average hash map probe metrics to hash aggregate. `BytesToBytesMap` already has API to get the metrics, this PR adds an API to `UnsafeFixedWidthAggregationMap` to access it. Preparing a test for this metrics seems tricky, because we don't know what collision keys are. For now, the test case generates random data large enough to have desired probe. TODO in later PR: add hash map metrics to join. ## How was this patch tested? Added test to SQLMetricsSuite. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18258 from viirya/SPARK-20953.	2017-06-13 10:10:35 -07:00
Rishabh Bhardwaj	9b2c877bec	[SPARK-21039][SPARK CORE] Use treeAggregate instead of aggregate in DataFrame.stat.bloomFilter ## What changes were proposed in this pull request? To use treeAggregate instead of aggregate in DataFrame.stat.bloomFilter to parallelize the operation of merging the bloom filters (Please fill in changes proposed in this fix) ## How was this patch tested? unit tests passed (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Rishabh Bhardwaj <rbnext29@gmail.com> Author: Rishabh Bhardwaj <admin@rishabh.local> Author: Rishabh Bhardwaj <r0b00ko@rishabh.Dlink> Author: Rishabh Bhardwaj <admin@Admins-MacBook-Pro.local> Author: Rishabh Bhardwaj <r0b00ko@rishabh.local> Closes #18263 from rishabhbhardwaj/SPARK-21039.	2017-06-13 15:09:12 +01:00
Sean Owen	7b7c85ede3	[SPARK-20920][SQL] ForkJoinPool pools are leaked when writing hive tables with many partitions ## What changes were proposed in this pull request? Don't leave thread pool running from AlterTableRecoverPartitionsCommand DDL command ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #18216 from srowen/SPARK-20920.	2017-06-13 10:48:07 +01:00
Dongjoon Hyun	2639c3ed03	[SPARK-19910][SQL] `stack` should not reject NULL values due to type mismatch ## What changes were proposed in this pull request? Since `stack` function generates a table with nullable columns, it should allow mixed null values. ```scala scala> sql("select stack(3, 1, 2, 3)").printSchema root \|-- col0: integer (nullable = true) scala> sql("select stack(3, 1, 2, null)").printSchema org.apache.spark.sql.AnalysisException: cannot resolve 'stack(3, 1, 2, NULL)' due to data type mismatch: Argument 1 (IntegerType) != Argument 3 (NullType); line 1 pos 7; ``` ## How was this patch tested? Pass the Jenkins with a new test case. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #17251 from dongjoon-hyun/SPARK-19910.	2017-06-12 21:18:43 -07:00
Wenchen Fan	fc0e6944a5	Revert "[SPARK-21046][SQL] simplify the array offset and length in ColumnVector" This reverts commit `22dd65f58e`.	2017-06-13 09:15:14 +08:00
Shixiong Zhu	74a432d3a3	[SPARK-20979][SS] Add RateSource to generate values for tests and benchmark ## What changes were proposed in this pull request? This PR adds RateSource for Structured Streaming so that the user can use it to generate data for tests and benchmark easily. This source generates increment long values with timestamps. Each generated row has two columns: a timestamp column for the generated time and an auto increment long column starting with 0L. It supports the following options: - `rowsPerSecond` (e.g. 100, default: 1): How many rows should be generated per second. - `rampUpTime` (e.g. 5s, default: 0s): How long to ramp up before the generating speed becomes `rowsPerSecond`. Using finer granularities than seconds will be truncated to integer seconds. - `numPartitions` (e.g. 10, default: Spark's default parallelism): The partition number for the generated rows. The source will try its best to reach `rowsPerSecond`, but the query may be resource constrained, and `numPartitions` can be tweaked to help reach the desired speed. Here is a simple example that prints 10 rows per seconds: ``` spark.readStream .format("rate") .option("rowsPerSecond", "10") .load() .writeStream .format("console") .start() ``` The idea came from marmbrus and he did the initial work. ## How was this patch tested? The added tests. Author: Shixiong Zhu <shixiong@databricks.com> Author: Michael Armbrust <michael@databricks.com> Closes #18199 from zsxwing/rate.	2017-06-12 14:58:08 -07:00
Reynold Xin	b1436c7496	[SPARK-21059][SQL] LikeSimplification can NPE on null pattern ## What changes were proposed in this pull request? This patch fixes a bug that can cause NullPointerException in LikeSimplification, when the pattern for like is null. ## How was this patch tested? Added a new unit test case in LikeSimplificationSuite. Author: Reynold Xin <rxin@databricks.com> Closes #18273 from rxin/SPARK-21059.	2017-06-12 14:07:51 -07:00
Dongjoon Hyun	32818d9b37	[SPARK-20345][SQL] Fix STS error handling logic on HiveSQLException ## What changes were proposed in this pull request? [SPARK-5100](`343d3bfafd`) added Spark Thrift Server(STS) UI and the following logic to handle exceptions on case `Throwable`. ```scala HiveThriftServer2.listener.onStatementError( statementId, e.getMessage, SparkUtils.exceptionString(e)) ``` However, there occurred a missed case after implementing [SPARK-6964](`eb19d3f75c`)'s `Support Cancellation in the Thrift Server` by adding case `HiveSQLException` before case `Throwable`. ```scala case e: HiveSQLException => if (getStatus().getState() == OperationState.CANCELED) { return } else { setState(OperationState.ERROR) throw e } // Actually do need to catch Throwable as some failures don't inherit from Exception and // HiveServer will silently swallow them. case e: Throwable => val currentState = getStatus().getState() logError(s"Error executing query, currentState $currentState, ", e) setState(OperationState.ERROR) HiveThriftServer2.listener.onStatementError( statementId, e.getMessage, SparkUtils.exceptionString(e)) throw new HiveSQLException(e.toString) ``` Logically, we had better add `HiveThriftServer2.listener.onStatementError` on case `HiveSQLException`, too. ## How was this patch tested? N/A Author: Dongjoon Hyun <dongjoon@apache.org> Closes #17643 from dongjoon-hyun/SPARK-20345.	2017-06-12 14:05:03 -07:00

1 2 3 4 5 ...

5688 commits