ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Shixiong Zhu	fedf6961be	[SPARK-22094][SS] processAllAvailable should check the query state ## What changes were proposed in this pull request? `processAllAvailable` should also check the query state and if the query is stopped, it should return. ## How was this patch tested? The new unit test. Author: Shixiong Zhu <zsxwing@gmail.com> Closes #19314 from zsxwing/SPARK-22094.	2017-09-21 21:55:07 -07:00
Tathagata Das	f32a842505	[SPARK-22053][SS] Stream-stream inner join in Append Mode ## What changes were proposed in this pull request? #### Architecture This PR implements stream-stream inner join using a two-way symmetric hash join. At a high level, we want to do the following. 1. For each stream, we maintain the past rows as state in State Store. - For each joining key, there can be multiple rows that have been received. - So, we have to effectively maintain a key-to-list-of-values multimap as state for each stream. 2. In each batch, for each input row in each stream - Look up the other streams state to see if there are matching rows, and output them if they satisfy the joining condition - Add the input row to corresponding stream’s state. - If the data has a timestamp/window column with watermark, then we will use that to calculate the threshold for keys that are required to buffered for future matches and drop the rest from the state. Cleaning up old unnecessary state rows depends completely on whether watermark has been defined and what are join conditions. We definitely want to support state clean up two types of queries that are likely to be common. - Queries to time range conditions - E.g. `SELECT * FROM leftTable, rightTable ON leftKey = rightKey AND leftTime > rightTime - INTERVAL 8 MINUTES AND leftTime < rightTime + INTERVAL 1 HOUR` - Queries with windows as the matching key - E.g. `SELECT * FROM leftTable, rightTable ON leftKey = rightKey AND window(leftTime, "1 hour") = window(rightTime, "1 hour")` (pseudo-SQL) #### Implementation The stream-stream join is primarily implemented in three classes - `StreamingSymmetricHashJoinExec` implements the above symmetric join algorithm. - `SymmetricsHashJoinStateManagers` manages the streaming state for the join. This essentially is a fault-tolerant key-to-list-of-values multimap built on the StateStore APIs. `StreamingSymmetricHashJoinExec` instantiates two such managers, one for each join side. - `StreamingSymmetricHashJoinExecHelper` is a helper class to extract threshold for the state based on the join conditions and the event watermark. Refer to the scaladocs class for more implementation details. Besides the implementation of stream-stream inner join SparkPlan. Some additional changes are - Allowed inner join in append mode in UnsupportedOperationChecker - Prevented stream-stream join on an empty batch dataframe to be collapsed by the optimizer ## How was this patch tested? - New tests in StreamingJoinSuite - Updated tests UnsupportedOperationSuite Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #19271 from tdas/SPARK-22053.	2017-09-21 15:39:07 -07:00
Zheng RuiFeng	a8a5cd24e2	[SPARK-22009][ML] Using treeAggregate improve some algs ## What changes were proposed in this pull request? I test on a dataset of about 13M instances, and found that using `treeAggregate` give a speedup in following algs: \|Algs\| SpeedUp \| \|------\|-----------\| \|OneHotEncoder\| 5% \| \|StatFunctions.calculateCov\| 7% \| \|StatFunctions.multipleApproxQuantiles\| 9% \| \|RegressionEvaluator\| 8% \| ## How was this patch tested? existing tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #19232 from zhengruifeng/use_treeAggregate.	2017-09-21 20:06:42 +01:00
Sean Owen	f10cbf17dc	[SPARK-21977][HOTFIX] Adjust EnsureStatefulOpPartitioningSuite to use scalatest lifecycle normally instead of constructor ## What changes were proposed in this pull request? Adjust EnsureStatefulOpPartitioningSuite to use scalatest lifecycle normally instead of constructor; fixes: ``` * RUN ABORTED * org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at: org.apache.spark.sql.streaming.EnsureStatefulOpPartitioningSuite.<init>(EnsureStatefulOpPartitioningSuite.scala:35) ``` ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #19306 from srowen/SPARK-21977.2.	2017-09-21 18:00:19 +01:00
Sean Owen	e17901d6df	[SPARK-22049][DOCS] Confusing behavior of from_utc_timestamp and to_utc_timestamp ## What changes were proposed in this pull request? Clarify behavior of to_utc_timestamp/from_utc_timestamp with an example ## How was this patch tested? Doc only change / existing tests Author: Sean Owen <sowen@cloudera.com> Closes #19276 from srowen/SPARK-22049.	2017-09-20 20:47:17 +09:00
Sean Owen	3d4dd14cd5	[SPARK-22066][BUILD] Update checkstyle to 8.2, enable it, fix violations ## What changes were proposed in this pull request? Update plugins, including scala-maven-plugin, to latest versions. Update checkstyle to 8.2. Remove bogus checkstyle config and enable it. Fix existing and new Java checkstyle errors. ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #19282 from srowen/SPARK-22066.	2017-09-20 10:01:46 +01:00
Burak Yavuz	280ff523f4	[SPARK-21977] SinglePartition optimizations break certain Streaming Stateful Aggregation requirements ## What changes were proposed in this pull request? This is a bit hard to explain as there are several issues here, I'll try my best. Here are the requirements: 1. A StructuredStreaming Source that can generate empty RDDs with 0 partitions 2. A StructuredStreaming query that uses the above source, performs a stateful aggregation (mapGroupsWithState, groupBy.count, ...), and coalesce's by 1 The crux of the problem is that when a dataset has a `coalesce(1)` call, it receives a `SinglePartition` partitioning scheme. This scheme satisfies most required distributions used for aggregations such as HashAggregateExec. This causes a world of problems: Symptom 1. If the input RDD has 0 partitions, the whole lineage will receive 0 partitions, nothing will be executed, the state store will not create any delta files. When this happens, the next trigger fails, because the StateStore fails to load the delta file for the previous trigger Symptom 2. Let's say that there was data. Then in this case, if you stop your stream, and change `coalesce(1)` with `coalesce(2)`, then restart your stream, your stream will fail, because `spark.sql.shuffle.partitions - 1` number of StateStores will fail to find its delta files. To fix the issues above, we must check that the partitioning of the child of a `StatefulOperator` satisfies: If the grouping expressions are empty: a) AllTuple distribution b) Single physical partition If the grouping expressions are non empty: a) Clustered distribution b) spark.sql.shuffle.partition # of partitions whether or not `coalesce(1)` exists in the plan, and whether or not the input RDD for the trigger has any data. Once you fix the above problem by adding an Exchange to the plan, you come across the following bug: If you call `coalesce(1).groupBy().count()` on a Streaming DataFrame, and if you have a trigger with no data, `StateStoreRestoreExec` doesn't return the prior state. However, for this specific aggregation, `HashAggregateExec` after the restore returns a (0, 0) row, since we're performing a count, and there is no data. Then this data gets stored in `StateStoreSaveExec` causing the previous counts to be overwritten and lost. ## How was this patch tested? Regression tests Author: Burak Yavuz <brkyvz@gmail.com> Closes #19196 from brkyvz/sa-0.	2017-09-20 00:01:21 -07:00
Marcelo Vanzin	c6ff59a230	[SPARK-18838][CORE] Add separate listener queues to LiveListenerBus. This change modifies the live listener bus so that all listeners are added to queues; each queue has its own thread to dispatch events, making it possible to separate slow listeners from other more performance-sensitive ones. The public API has not changed - all listeners added with the existing "addListener" method, which after this change mostly means all user-defined listeners, end up in a default queue. Internally, there's an API allowing listeners to be added to specific queues, and that API is used to separate the internal Spark listeners into 3 categories: application status listeners (e.g. UI), executor management (e.g. dynamic allocation), and the event log. The queueing logic, while abstracted away in a separate class, is kept as much as possible hidden away from consumers. Aside from choosing their queue, there's no code change needed to take advantage of queues. Test coverage relies on existing tests; a few tests had to be tweaked because they relied on `LiveListenerBus.postToAll` being synchronous, and the change makes that method asynchronous. Other tests were simplified not to use the asynchronous LiveListenerBus. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #19211 from vanzin/SPARK-18838.	2017-09-20 13:41:29 +08:00
Bryan Cutler	718bbc9390	[SPARK-22067][SQL] ArrowWriter should use position when setting UTF8String ByteBuffer ## What changes were proposed in this pull request? The ArrowWriter StringWriter was setting Arrow data using a position of 0 instead of the actual position in the ByteBuffer. This was currently working because of a bug ARROW-1443, and has been fixed as of Arrow 0.7.0. Testing with this version revealed the error in ArrowConvertersSuite test string conversion. ## How was this patch tested? Existing tests, manually verified working with Arrow 0.7.0 Author: Bryan Cutler <cutlerb@gmail.com> Closes #19284 from BryanCutler/arrow-ArrowWriter-StringWriter-position-SPARK-22067.	2017-09-20 10:51:00 +09:00
aokolnychyi	ee13f3e3dc	[SPARK-21969][SQL] CommandUtils.updateTableStats should call refreshTable ## What changes were proposed in this pull request? Tables in the catalog cache are not invalidated once their statistics are updated. As a consequence, existing sessions will use the cached information even though it is not valid anymore. Consider and an example below. ``` // step 1 spark.range(100).write.saveAsTable("tab1") // step 2 spark.sql("analyze table tab1 compute statistics") // step 3 spark.sql("explain cost select distinct * from tab1").show(false) // step 4 spark.range(100).write.mode("append").saveAsTable("tab1") // step 5 spark.sql("explain cost select distinct * from tab1").show(false) ``` After step 3, the table will be present in the catalog relation cache. Step 4 will correctly update the metadata inside the catalog but will NOT invalidate the cache. By the way, ``spark.sql("analyze table tab1 compute statistics")`` between step 3 and step 4 would also solve the problem. ## How was this patch tested? Current and additional unit tests. Author: aokolnychyi <anton.okolnychyi@sap.com> Closes #19252 from aokolnychyi/spark-21969.	2017-09-19 14:19:13 -07:00
Huaxin Gao	d5aefa83ad	[SPARK-21338][SQL] implement isCascadingTruncateTable() method in AggregatedDialect ## What changes were proposed in this pull request? org.apache.spark.sql.jdbc.JdbcDialect's method: def isCascadingTruncateTable(): Option[Boolean] = None is not overriden in org.apache.spark.sql.jdbc.AggregatedDialect class. Because of this issue, when you add more than one dialect Spark doesn't truncate table because isCascadingTruncateTable always returns default None for Aggregated Dialect. Will implement isCascadingTruncateTable in AggregatedDialect class in this PR. ## How was this patch tested? In JDBCSuite, inside test("Aggregated dialects"), will add one line to test AggregatedDialect.isCascadingTruncateTable Author: Huaxin Gao <huaxing@us.ibm.com> Closes #19256 from huaxingao/spark-21338.	2017-09-19 09:27:05 -07:00
Taaffy	1bc17a6b8a	[SPARK-22052] Incorrect Metric assigned in MetricsReporter.scala Current implementation for processingRate-total uses wrong metric: mistakenly uses inputRowsPerSecond instead of processedRowsPerSecond ## What changes were proposed in this pull request? Adjust processingRate-total from using inputRowsPerSecond to processedRowsPerSecond ## How was this patch tested? Built spark from source with proposed change and tested output with correct parameter. Before change the csv metrics file for inputRate-total and processingRate-total displayed the same values due to the error. After changing MetricsReporter.scala the processingRate-total csv file displayed the correct metric. <img width="963" alt="processed rows per second" src="https://user-images.githubusercontent.com/32072374/30554340-82eea12c-9ca4-11e7-8370-8168526ff9a2.png"> Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Taaffy <32072374+Taaffy@users.noreply.github.com> Closes #19268 from Taaffy/patch-1.	2017-09-19 10:20:04 +01:00
Kevin Yu	c66d64b3df	[SPARK-14878][SQL] Trim characters string function support #### What changes were proposed in this pull request? This PR enhances the TRIM function support in Spark SQL by allowing the specification of trim characters set. Below is the SQL syntax : ``` SQL <trim function> ::= TRIM <left paren> <trim operands> <right paren> <trim operands> ::= [ [ <trim specification> ] [ <trim character set> ] FROM ] <trim source> <trim source> ::= <character value expression> <trim specification> ::= LEADING \| TRAILING \| BOTH <trim character set> ::= <characters value expression> ``` or ``` SQL LTRIM (source-exp [, trim-exp]) RTRIM (source-exp [, trim-exp]) ``` Here are the documentation link of support of this feature by other mainstream databases. - Oracle: [TRIM function](http://docs.oracle.com/cd/B28359_01/olap.111/b28126/dml_functions_2126.htm#OLADM704) - DB2: [TRIM scalar function](https://www.ibm.com/support/knowledgecenter/en/SSMKHH_10.0.0/com.ibm.etools.mft.doc/ak05270_.htm) - MySQL: [Trim function](http://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_trim) - Oracle: [ltrim](https://docs.oracle.com/cd/B28359_01/olap.111/b28126/dml_functions_2018.htm#OLADM594) - DB2: [ltrim](https://www.ibm.com/support/knowledgecenter/en/SSEPEK_11.0.0/sqlref/src/tpc/db2z_bif_ltrim.html) This PR is to implement the above enhancement. In the implementation, the design principle is to keep the changes to the minimum. Also, the exiting trim functions (which handles a special case, i.e., trimming space characters) are kept unchanged for performane reasons. #### How was this patch tested? The unit test cases are added in the following files: - UTF8StringSuite.java - StringExpressionsSuite.scala - sql/SQLQuerySuite.scala - StringFunctionsSuite.scala Author: Kevin Yu <qyu@us.ibm.com> Closes #12646 from kevinyu98/spark-14878.	2017-09-18 12:12:35 -07:00
Feng Liu	3b049abf10	[SPARK-22003][SQL] support array column in vectorized reader with UDF ## What changes were proposed in this pull request? The UDF needs to deserialize the `UnsafeRow`. When the column type is Array, the `get` method from the `ColumnVector`, which is used by the vectorized reader, is called, but this method is not implemented. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Feng Liu <fengliu@databricks.com> Closes #19230 from liufengdb/fix_array_open.	2017-09-18 08:49:32 -07:00
Jose Torres	0bad10d3e3	[SPARK-22017] Take minimum of all watermark execs in StreamExecution. ## What changes were proposed in this pull request? Take the minimum of all watermark exec nodes as the "real" watermark in StreamExecution, rather than picking one arbitrarily. ## How was this patch tested? new unit test Author: Jose Torres <jose@databricks.com> Closes #19239 from joseph-torres/SPARK-22017.	2017-09-15 21:10:07 -07:00
Wenchen Fan	c7307acdad	[SPARK-15689][SQL] data source v2 read path ## What changes were proposed in this pull request? This PR adds the infrastructure for data source v2, and implement features which Spark already have in data source v1, i.e. column pruning, filter push down, catalyst expression filter push down, InternalRow scan, schema inference, data size report. The write path is excluded to avoid making this PR growing too big, and will be added in follow-up PR. ## How was this patch tested? new tests Author: Wenchen Fan <wenchen@databricks.com> Closes #19136 from cloud-fan/data-source-v2.	2017-09-15 22:18:36 +08:00
Wenchen Fan	3c6198c86e	[SPARK-21987][SQL] fix a compatibility issue of sql event logs ## What changes were proposed in this pull request? In https://github.com/apache/spark/pull/18600 we removed the `metadata` field from `SparkPlanInfo`. This causes a problem when we replay event logs that are generated by older Spark versions. ## How was this patch tested? a regression test. Author: Wenchen Fan <wenchen@databricks.com> Closes #19237 from cloud-fan/event.	2017-09-15 00:47:44 -07:00
Yuming Wang	4decedfdbd	[SPARK-22002][SQL] Read JDBC table use custom schema support specify partial fields. ## What changes were proposed in this pull request? https://github.com/apache/spark/pull/18266 add a new feature to support read JDBC table use custom schema, but we must specify all the fields. For simplicity, this PR support specify partial fields. ## How was this patch tested? unit tests Author: Yuming Wang <wgyumg@gmail.com> Closes #19231 from wangyum/SPARK-22002.	2017-09-14 23:35:55 -07:00
goldmedal	a28728a9af	[SPARK-21513][SQL][FOLLOWUP] Allow UDF to_json support converting MapType to json for PySpark and SparkR ## What changes were proposed in this pull request? In previous work SPARK-21513, we has allowed `MapType` and `ArrayType` of `MapType`s convert to a json string but only for Scala API. In this follow-up PR, we will make SparkSQL support it for PySpark and SparkR, too. We also fix some little bugs and comments of the previous work in this follow-up PR. ### For PySpark ``` >>> data = [(1, {"name": "Alice"})] >>> df = spark.createDataFrame(data, ("key", "value")) >>> df.select(to_json(df.value).alias("json")).collect() [Row(json=u'{"name":"Alice")'] >>> data = [(1, [{"name": "Alice"}, {"name": "Bob"}])] >>> df = spark.createDataFrame(data, ("key", "value")) >>> df.select(to_json(df.value).alias("json")).collect() [Row(json=u'[{"name":"Alice"},{"name":"Bob"}]')] ``` ### For SparkR ``` # Converts a map into a JSON object df2 <- sql("SELECT map('name', 'Bob')) as people") df2 <- mutate(df2, people_json = to_json(df2$people)) # Converts an array of maps into a JSON array df2 <- sql("SELECT array(map('name', 'Bob'), map('name', 'Alice')) as people") df2 <- mutate(df2, people_json = to_json(df2$people)) ``` ## How was this patch tested? Add unit test cases. cc viirya HyukjinKwon Author: goldmedal <liugs963@gmail.com> Closes #19223 from goldmedal/SPARK-21513-fp-PySaprkAndSparkR.	2017-09-15 11:53:10 +09:00
Jose Torres	054ddb2f54	[SPARK-21988] Add default stats to StreamingExecutionRelation. ## What changes were proposed in this pull request? Add default stats to StreamingExecutionRelation. ## How was this patch tested? existing unit tests and an explain() test to be sure Author: Jose Torres <jose@databricks.com> Closes #19212 from joseph-torres/SPARK-21988.	2017-09-14 11:06:25 -07:00
Zhenhua Wang	ddd7f5e11d	[SPARK-17642][SQL][FOLLOWUP] drop test tables and improve comments ## What changes were proposed in this pull request? Drop test tables and improve comments. ## How was this patch tested? Modified existing test. Author: Zhenhua Wang <wangzhenhua@huawei.com> Closes #19213 from wzhfy/useless_comment.	2017-09-14 23:14:21 +08:00
gatorsmile	4e6fc69014	[SPARK-4131][FOLLOW-UP] Support "Writing data into the filesystem from queries" ## What changes were proposed in this pull request? This PR is clean the codes in https://github.com/apache/spark/pull/18975 ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #19225 from gatorsmile/refactorSPARK-4131.	2017-09-14 14:48:04 +08:00
Takeshi Yamamuro	8be7e6bb3c	[SPARK-21973][SQL] Add an new option to filter queries in TPC-DS ## What changes were proposed in this pull request? This pr added a new option to filter TPC-DS queries to run in `TPCDSQueryBenchmark`. By default, `TPCDSQueryBenchmark` runs all the TPC-DS queries. This change could enable developers to run some of the TPC-DS queries by this option, e.g., to run q2, q4, and q6 only: ``` spark-submit --class <this class> --conf spark.sql.tpcds.queryFilter="q2,q4,q6" --jars <spark sql test jar> ``` ## How was this patch tested? Manually checked. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #19188 from maropu/RunPartialQueriesInTPCDS.	2017-09-13 21:54:10 -07:00
Yuming Wang	17edfec59d	[SPARK-20427][SQL] Read JDBC table use custom schema ## What changes were proposed in this pull request? Auto generated Oracle schema some times not we expect: - `number(1)` auto mapped to BooleanType, some times it's not we expect, per [SPARK-20921](https://issues.apache.org/jira/browse/SPARK-20921). - `number` auto mapped to Decimal(38,10), It can't read big data, per [SPARK-20427](https://issues.apache.org/jira/browse/SPARK-20427). This PR fix this issue by custom schema as follows: ```scala val props = new Properties() props.put("customSchema", "ID decimal(38, 0), N1 int, N2 boolean") val dfRead = spark.read.schema(schema).jdbc(jdbcUrl, "tableWithCustomSchema", props) dfRead.show() ``` or ```sql CREATE TEMPORARY VIEW tableWithCustomSchema USING org.apache.spark.sql.jdbc OPTIONS (url '$jdbcUrl', dbTable 'tableWithCustomSchema', customSchema'ID decimal(38, 0), N1 int, N2 boolean') ``` ## How was this patch tested? unit tests Author: Yuming Wang <wgyumg@gmail.com> Closes #18266 from wangyum/SPARK-20427.	2017-09-13 16:34:17 -07:00
donnyzone	21c4450fb2	[SPARK-21980][SQL] References in grouping functions should be indexed with semanticEquals ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-21980 This PR fixes the issue in ResolveGroupingAnalytics rule, which indexes the column references in grouping functions without considering case sensitive configurations. The problem can be reproduced by: `val df = spark.createDataFrame(Seq((1, 1), (2, 1), (2, 2))).toDF("a", "b") df.cube("a").agg(grouping("A")).show()` ## How was this patch tested? unit tests Author: donnyzone <wellfengzhu@gmail.com> Closes #19202 from DonnyZone/ResolveGroupingAnalytics.	2017-09-13 10:06:53 -07:00
Armin	b6ef1f57bc	[SPARK-21970][CORE] Fix Redundant Throws Declarations in Java Codebase ## What changes were proposed in this pull request? 1. Removing all redundant throws declarations from Java codebase. 2. Removing dead code made visible by this from `ShuffleExternalSorter#closeAndGetSpills` ## How was this patch tested? Build still passes. Author: Armin <me@obrown.io> Closes #19182 from original-brownbear/SPARK-21970.	2017-09-13 14:04:26 +01:00
goldmedal	371e4e2053	[SPARK-21513][SQL] Allow UDF to_json support converting MapType to json # What changes were proposed in this pull request? UDF to_json only supports converting `StructType` or `ArrayType` of `StructType`s to a json output string now. According to the discussion of JIRA SPARK-21513, I allow to `to_json` support converting `MapType` and `ArrayType` of `MapType`s to a json output string. This PR is for SQL and Scala API only. # How was this patch tested? Adding unit test case. cc viirya HyukjinKwon Author: goldmedal <liugs963@gmail.com> Author: Jia-Xuan Liu <liugs963@gmail.com> Closes #18875 from goldmedal/SPARK-21513.	2017-09-13 09:43:00 +09:00
sarutak	b9b54b1c88	[SPARK-21368][SQL] TPCDSQueryBenchmark can't refer query files. ## What changes were proposed in this pull request? TPCDSQueryBenchmark packaged into a jar doesn't work with spark-submit. It's because of the failure of reference query files in the jar file. ## How was this patch tested? Ran the benchmark. Author: sarutak <sarutak@oss.nttdata.co.jp> Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #18592 from sarutak/fix-tpcds-benchmark.	2017-09-12 10:49:46 -07:00
Zhenhua Wang	515910e9bd	[SPARK-17642][SQL] support DESC EXTENDED/FORMATTED table column commands ## What changes were proposed in this pull request? Support DESC (EXTENDED \| FORMATTED) ? TABLE COLUMN command. Support DESC EXTENDED \| FORMATTED TABLE COLUMN command to show column-level statistics. Do NOT support describe nested columns. ## How was this patch tested? Added test cases. Author: Zhenhua Wang <wzh_zju@163.com> Author: Zhenhua Wang <wangzhenhua@huawei.com> Author: wangzhenhua <wangzhenhua@huawei.com> Closes #16422 from wzhfy/descColumn.	2017-09-12 08:59:52 -07:00
Jen-Ming Chung	7d0a3ef4ce	[SPARK-21610][SQL][FOLLOWUP] Corrupt records are not handled properly when creating a dataframe from a file ## What changes were proposed in this pull request? When the `requiredSchema` only contains `_corrupt_record`, the derived `actualSchema` is empty and the `_corrupt_record` are all null for all rows. This PR captures above situation and raise an exception with a reasonable workaround messag so that users can know what happened and how to fix the query. ## How was this patch tested? Added unit test in `CSVSuite`. Author: Jen-Ming Chung <jenmingisme@gmail.com> Closes #19199 from jmchung/SPARK-21610-FOLLOWUP.	2017-09-12 22:47:12 +09:00
caoxuewen	dc74c0e67d	[MINOR][SQL] remove unuse import class ## What changes were proposed in this pull request? this PR describe remove the import class that are unused. ## How was this patch tested? N/A Author: caoxuewen <cao.xuewen@zte.com.cn> Closes #19131 from heary-cao/unuse_import.	2017-09-11 10:09:20 +01:00
Jen-Ming Chung	6273a711b6	[SPARK-21610][SQL] Corrupt records are not handled properly when creating a dataframe from a file ## What changes were proposed in this pull request? ``` echo '{"field": 1} {"field": 2} {"field": "3"}' >/tmp/sample.json ``` ```scala import org.apache.spark.sql.types._ val schema = new StructType() .add("field", ByteType) .add("_corrupt_record", StringType) val file = "/tmp/sample.json" val dfFromFile = spark.read.schema(schema).json(file) scala> dfFromFile.show(false) +-----+---------------+ \|field\|_corrupt_record\| +-----+---------------+ \|1 \|null \| \|2 \|null \| \|null \|{"field": "3"} \| +-----+---------------+ scala> dfFromFile.filter($"_corrupt_record".isNotNull).count() res1: Long = 0 scala> dfFromFile.filter($"_corrupt_record".isNull).count() res2: Long = 3 ``` When the `requiredSchema` only contains `_corrupt_record`, the derived `actualSchema` is empty and the `_corrupt_record` are all null for all rows. This PR captures above situation and raise an exception with a reasonable workaround messag so that users can know what happened and how to fix the query. ## How was this patch tested? Added test case. Author: Jen-Ming Chung <jenmingisme@gmail.com> Closes #18865 from jmchung/SPARK-21610.	2017-09-10 17:26:43 -07:00
Jane Wang	f76790557b	[SPARK-4131] Support "Writing data into the filesystem from queries" ## What changes were proposed in this pull request? This PR implements the sql feature: INSERT OVERWRITE [LOCAL] DIRECTORY directory1 [ROW FORMAT row_format] [STORED AS file_format] SELECT ... FROM ... ## How was this patch tested? Added new unittests and also pulled the code to fb-spark so that we could test writing to hdfs directory. Author: Jane Wang <janewang@fb.com> Closes #18975 from janewangfb/port_local_directory.	2017-09-09 11:48:34 -07:00
Yanbo Liang	e4d8f9a36a	[MINOR][SQL] Correct DataFrame doc. ## What changes were proposed in this pull request? Correct DataFrame doc. ## How was this patch tested? Only doc change, no tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #19173 from yanboliang/df-doc.	2017-09-09 09:25:12 -07:00
Liang-Chi Hsieh	6b45d7e941	[SPARK-21954][SQL] JacksonUtils should verify MapType's value type instead of key type ## What changes were proposed in this pull request? `JacksonUtils.verifySchema` verifies if a data type can be converted to JSON. For `MapType`, it now verifies the key type. However, in `JacksonGenerator`, when converting a map to JSON, we only care about its values and create a writer for the values. The keys in a map are treated as strings by calling `toString` on the keys. Thus, we should change `JacksonUtils.verifySchema` to verify the value type of `MapType`. ## How was this patch tested? Added tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19167 from viirya/test-jacksonutils.	2017-09-09 19:10:52 +09:00
Andrew Ash	8a5eb50681	[SPARK-21941] Stop storing unused attemptId in SQLTaskMetrics ## What changes were proposed in this pull request? In a driver heap dump containing 390,105 instances of SQLTaskMetrics this would have saved me approximately 3.2MB of memory. Since we're not getting any benefit from storing this unused value, let's eliminate it until a future PR makes use of it. ## How was this patch tested? Existing unit tests Author: Andrew Ash <andrew@andrewash.com> Closes #19153 from ash211/aash/trim-sql-listener.	2017-09-08 23:33:15 -07:00
Kazuaki Ishizaki	8a4f228dc0	[SPARK-21946][TEST] fix flaky test: "alter table: rename cached table" in InMemoryCatalogedDDLSuite ## What changes were proposed in this pull request? This PR fixes flaky test `InMemoryCatalogedDDLSuite "alter table: rename cached table"`. Since this test validates distributed DataFrame, the result should be checked by using `checkAnswer`. The original version used `df.collect().Seq` method that does not guaranty an order of each element of the result. ## How was this patch tested? Use existing test case Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19159 from kiszk/SPARK-21946.	2017-09-08 09:39:20 -07:00
Dongjoon Hyun	c26976fe14	[SPARK-21939][TEST] Use TimeLimits instead of Timeouts Since ScalaTest 3.0.0, `org.scalatest.concurrent.Timeouts` is deprecated. This PR replaces the deprecated one with `org.scalatest.concurrent.TimeLimits`. ```scala -import org.scalatest.concurrent.Timeouts._ +import org.scalatest.concurrent.TimeLimits._ ``` Pass the existing test suites. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19150 from dongjoon-hyun/SPARK-21939. Change-Id: I1a1b07f1b97e51e2263dfb34b7eaaa099b2ded5e	2017-09-08 09:31:13 +08:00
Dongjoon Hyun	eea2b877cf	[SPARK-21912][SQL] ORC/Parquet table should not create invalid column names ## What changes were proposed in this pull request? Currently, users meet job abortions while creating or altering ORC/Parquet tables with invalid column names. We had better prevent this by raising AnalysisException with a guide to use aliases instead like Paquet data source tables. BEFORE ```scala scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`") 17/09/04 13:28:21 ERROR Utils: Aborting task java.lang.IllegalArgumentException: Error: : expected at the position 8 of 'struct<a b:int>' but ' ' is found. 17/09/04 13:28:21 ERROR FileFormatWriter: Job job_20170904132821_0001 aborted. 17/09/04 13:28:21 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) org.apache.spark.SparkException: Task failed while writing rows. ``` AFTER ```scala scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`") 17/09/04 13:27:40 ERROR CreateDataSourceTableAsSelectCommand: Failed to write to table orc1 org.apache.spark.sql.AnalysisException: Attribute name "a b" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.; ``` ## How was this patch tested? Pass the Jenkins with a new test case. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19124 from dongjoon-hyun/SPARK-21912.	2017-09-06 22:20:48 -07:00
Liang-Chi Hsieh	ce7293c150	[SPARK-21835][SQL][FOLLOW-UP] RewritePredicateSubquery should not produce unresolved query plans ## What changes were proposed in this pull request? This is a follow-up of #19050 to deal with `ExistenceJoin` case. ## How was this patch tested? Added test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19151 from viirya/SPARK-21835-followup.	2017-09-06 22:15:25 -07:00
Jacek Laskowski	fa0092bddf	[SPARK-21901][SS] Define toString for StateOperatorProgress ## What changes were proposed in this pull request? Just `StateOperatorProgress.toString` + few formatting fixes ## How was this patch tested? Local build. Waiting for OK from Jenkins. Author: Jacek Laskowski <jacek@japila.pl> Closes #19112 from jaceklaskowski/SPARK-21901-StateOperatorProgress-toString.	2017-09-06 15:48:48 -07:00
Jose Torres	acdf45fb52	[SPARK-21765] Check that optimization doesn't affect isStreaming bit. ## What changes were proposed in this pull request? Add an assert in logical plan optimization that the isStreaming bit stays the same, and fix empty relation rules where that wasn't happening. ## How was this patch tested? new and existing unit tests Author: Jose Torres <joseph.torres@databricks.com> Author: Jose Torres <joseph-torres@databricks.com> Closes #19056 from joseph-torres/SPARK-21765-followup.	2017-09-06 11:19:46 -07:00
Liang-Chi Hsieh	f2e22aebfe	[SPARK-21835][SQL] RewritePredicateSubquery should not produce unresolved query plans ## What changes were proposed in this pull request? Correlated predicate subqueries are rewritten into `Join` by the rule `RewritePredicateSubquery` during optimization. It is possibly that the two sides of the `Join` have conflicting attributes. The query plans produced by `RewritePredicateSubquery` become unresolved and break structural integrity. We should check if there are conflicting attributes in the `Join` and de-duplicate them by adding a `Project`. ## How was this patch tested? Added tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #19050 from viirya/SPARK-21835.	2017-09-06 07:42:19 -07:00
Xingbo Jiang	fd60d4fa6c	[SPARK-21652][SQL] Fix rule confliction between InferFiltersFromConstraints and ConstantPropagation ## What changes were proposed in this pull request? For the given example below, the predicate added by `InferFiltersFromConstraints` is folded by `ConstantPropagation` later, this leads to unconverged optimize iteration: ``` Seq((1, 1)).toDF("col1", "col2").createOrReplaceTempView("t1") Seq(1, 2).toDF("col").createOrReplaceTempView("t2") sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 = t2.col AND t1.col2 = t2.col") ``` We can fix this by adjusting the indent of the optimize rules. ## How was this patch tested? Add test case that would have failed in `SQLQuerySuite`. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #19099 from jiangxb1987/unconverge-optimization.	2017-09-05 13:12:39 -07:00
gatorsmile	2974406d17	[SPARK-21845][SQL][TEST-MAVEN] Make codegen fallback of expressions configurable ## What changes were proposed in this pull request? We should make codegen fallback of expressions configurable. So far, it is always on. We might hide it when our codegen have compilation bugs. Thus, we should also disable the codegen fallback when running test cases. ## How was this patch tested? Added test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #19119 from gatorsmile/fallbackCodegen.	2017-09-05 09:04:03 -07:00
hyukjinkwon	02a4386aec	[SPARK-20978][SQL] Bump up Univocity version to 2.5.4 ## What changes were proposed in this pull request? There was a bug in Univocity Parser that causes the issue in SPARK-20978. This was fixed as below: ```scala val df = spark.read.schema("a string, b string, unparsed string").option("columnNameOfCorruptRecord", "unparsed").csv(Seq("a").toDS()) df.show() ``` Before ``` java.lang.NullPointerException at scala.collection.immutable.StringLike$class.stripLineEnd(StringLike.scala:89) at scala.collection.immutable.StringOps.stripLineEnd(StringOps.scala:29) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$getCurrentInput(UnivocityParser.scala:56) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert$1.apply(UnivocityParser.scala:207) ... ``` After ``` +---+----+--------+ \| a\| b\|unparsed\| +---+----+--------+ \| a\|null\| a\| +---+----+--------+ ``` It was fixed in 2.5.0 and 2.5.4 was released. I guess it'd be safe to upgrade this. ## How was this patch tested? Unit test added in `CSVSuite.scala`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #19113 from HyukjinKwon/bump-up-univocity.	2017-09-05 23:21:43 +08:00
Dongjoon Hyun	4e7a29efdb	[SPARK-21913][SQL][TEST] withDatabase` should drop database with CASCADE ## What changes were proposed in this pull request? Currently, `withDatabase` fails if the database is not empty. It would be great if we drop cleanly with CASCADE. ## How was this patch tested? This is a change on test util. Pass the existing Jenkins. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #19125 from dongjoon-hyun/SPARK-21913.	2017-09-05 00:20:16 -07:00
Sean Owen	ca59445adb	[SPARK-21418][SQL] NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true ## What changes were proposed in this pull request? If no SparkConf is available to Utils.redact, simply don't redact. ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #19123 from srowen/SPARK-21418.	2017-09-04 23:02:59 +02:00
Liang-Chi Hsieh	9f30d92803	[SPARK-21654][SQL] Complement SQL predicates expression description ## What changes were proposed in this pull request? SQL predicates don't have complete expression description. This patch goes to complement the description by adding arguments, examples. This change also adds related test cases for the SQL predicate expressions. ## How was this patch tested? Existing tests. And added predicate test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #18869 from viirya/SPARK-21654.	2017-09-03 21:55:18 -07:00
gatorsmile	acb7fed237	[SPARK-21891][SQL] Add TBLPROPERTIES to DDL statement: CREATE TABLE USING ## What changes were proposed in this pull request? Add `TBLPROPERTIES` to the DDL statement `CREATE TABLE USING`. After this change, the DDL becomes ``` CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name USING table_provider [OPTIONS table_property_list] [PARTITIONED BY (col_name, col_name, ...)] [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC\|DESC], ...)] INTO num_buckets BUCKETS ] [LOCATION path] [COMMENT table_comment] [TBLPROPERTIES (property_name=property_value, ...)] [[AS] select_statement]; ``` ## How was this patch tested? Add a few tests Author: gatorsmile <gatorsmile@gmail.com> Closes #19100 from gatorsmile/addTablePropsToCreateTableUsing.	2017-09-02 14:53:41 -07:00

1 2 3 4 5 ...

4092 commits