ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Kazuaki Ishizaki	8d40a79a07	[SPARK-23893][CORE][SQL] Avoid possible integer overflow in multiplication ## What changes were proposed in this pull request? This PR avoids a possible overflow in operations of the form `long = (long)(int * int)`. Multiplying two large positive `int` values can set the most significant bit of the 32-bit product, which yields a negative `long` where a positive value was expected (e.g. `0111_0000_0000_0000 * 0000_0000_0000_0010`). This PR casts to `long` before the multiplication to avoid this situation. ## How was this patch tested? Existing UTs Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #21002 from kiszk/SPARK-23893.	2018-04-08 20:40:27 +02:00
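The overflow described in SPARK-23893 can be reproduced in plain Java. The following is a minimal sketch, not code from the patch; the class and method names are hypothetical and chosen only for illustration.

```java
final class IntOverflowDemo {
    // Problematic pattern: the product is computed in 32-bit int arithmetic
    // and only then widened, so the overflow has already happened.
    static long castAfterMultiply(int a, int b) {
        return (long) (a * b);
    }

    // Pattern applied by the fix: widen one operand first, so the whole
    // multiplication is carried out in 64-bit long arithmetic.
    static long castBeforeMultiply(int a, int b) {
        return (long) a * b;
    }

    public static void main(String[] args) {
        int a = 0x70000000; // large positive int, most significant bit clear
        int b = 2;
        System.out.println("cast after multiply:  " + castAfterMultiply(a, b));
        System.out.println("cast before multiply: " + castBeforeMultiply(a, b));
    }
}
```

With these inputs the first method returns a negative value because the 32-bit product wraps around, while the second returns the mathematically correct positive product.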
Maxim Gekk	6a734575a8	[SPARK-23849][SQL] Tests for the samplingRatio option of JSON datasource ## What changes were proposed in this pull request? The proposed tests check that only a subset of the input dataset is touched during schema inference. Author: Maxim Gekk <maxim.gekk@databricks.com> Closes #20963 from MaxGekk/json-sampling-tests.	2018-04-07 21:44:32 -07:00
Huaxin Gao	2c1fe64757	[SPARK-23847][PYTHON][SQL] Add asc_nulls_first, asc_nulls_last to PySpark ## What changes were proposed in this pull request? Column.scala and Functions.scala have asc_nulls_first, asc_nulls_last, desc_nulls_first and desc_nulls_last. This PR adds the corresponding Python APIs in column.py and functions.py. ## How was this patch tested? Added doctests. Author: Huaxin Gao <huaxing@us.ibm.com> Closes #20962 from huaxingao/spark-23847.	2018-04-08 12:09:06 +08:00
Li Jin	d766ea2ff2	[SPARK-23861][SQL][DOC] Clarify default window frame with and without orderBy clause ## What changes were proposed in this pull request? Add docstring to clarify default window frame boundaries with and without orderBy clause ## How was this patch tested? Manually generate doc and check. Author: Li Jin <ice.xelloss@gmail.com> Closes #20978 from icexelloss/SPARK-23861-window-doc.	2018-04-07 00:15:54 +08:00
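As an illustration of the default frames that SPARK-23861 documents, the sketch below emulates a `sum` over a single window partition in plain Java. It assumes the defaults as described in the Spark window documentation: without an ordering the frame spans the whole partition (unbounded preceding to unbounded following), and with an ordering it grows from unbounded preceding to the current row. The class and method names are hypothetical, and the ordering keys are assumed to be distinct so that row-based and range-based frames coincide.

```java
import java.util.Arrays;

final class DefaultFrameDemo {
    // No orderBy: every row sees the entire partition, so each row gets the total.
    static long[] sumWithoutOrderBy(long[] values) {
        long total = Arrays.stream(values).sum();
        long[] out = new long[values.length];
        Arrays.fill(out, total);
        return out;
    }

    // With orderBy: the frame ends at the current row, giving a running sum.
    // The input is assumed to be already sorted by a distinct ordering key.
    static long[] sumWithOrderBy(long[] sortedValues) {
        long[] out = new long[sortedValues.length];
        long running = 0;
        for (int i = 0; i < sortedValues.length; i++) {
            running += sortedValues[i];
            out[i] = running;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] partition = {1, 2, 3};
        System.out.println(Arrays.toString(sumWithoutOrderBy(partition)));
        System.out.println(Arrays.toString(sumWithOrderBy(partition)));
    }
}
```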
Yuchen Huo	9452401931	[SPARK-23822][SQL] Improve error message for Parquet schema mismatches ## What changes were proposed in this pull request? This pull request improves the error message Spark reports when reading Parquet files with different schemas, e.g. one with a STRING column and the other with an INT column. A new ParquetSchemaColumnConvertNotSupportedException is added to replace the old UnsupportedOperationException. The exception is then wrapped in FileScanRdd.scala to throw a more general QueryExecutionException that carries the name of the Parquet file that triggered the exception. ## How was this patch tested? Unit tests added to check the new exception and verify the error messages. Also manually tested with two Parquet files with different schemas to check the error message. <img width="1125" alt="screen shot 2018-03-30 at 4 03 04 pm" src="https://user-images.githubusercontent.com/37087310/38156580-dd58a140-3433-11e8-973a-b816d859fbe1.png"> Author: Yuchen Huo <yuchen.huo@databricks.com> Closes #20953 from yuchenhuo/SPARK-23822.	2018-04-06 08:35:20 -07:00
Gengliang Wang	249007e37f	[SPARK-19724][SQL] create a managed table with an existed default table should throw an exception ## What changes were proposed in this pull request? This PR finishes https://github.com/apache/spark/pull/17272. This JIRA is follow-up work after SPARK-19583. As discussed in that PR, the following DDL for a managed table with an existing default location should throw an exception: CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ... CREATE TABLE ... (PARTITIONED BY ...) Currently, some situations are not consistent with this logic: CREATE TABLE ... (PARTITIONED BY ...) succeeds with an existing default location, for both hive and datasource tables (with HiveExternalCatalog/InMemoryCatalog); CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ... succeeds for hive tables with an existing default location. This PR makes both situations consistent with the logic that an existing default location should cause an exception. ## How was this patch tested? unit test added Author: Gengliang Wang <gengliang.wang@databricks.com> Closes #20886 from gengliangwang/pr-17272.	2018-04-05 20:19:25 -07:00
Kazuaki Ishizaki	4807d381bb	[SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks to choose several types of memory block ## What changes were proposed in this pull request? This PR allows us to use one of several types of `MemoryBlock`, such as byte array, int array, long array, or `java.nio.DirectByteBuffer`. Using `java.nio.DirectByteBuffer` provides off-heap memory that is automatically deallocated by the JVM. The `MemoryBlock` class has primitive accessors like `Platform.getInt()`, `Platform.putInt()`, or `Platform.copyMemory()`. This PR uses `MemoryBlock` for `OffHeapColumnVector`, `UTF8String`, and other places. This PR can improve performance of operations involving memory accesses (e.g. `UTF8String.trim`) by 1.8x. For now, this PR does not use `MemoryBlock` for `BufferHolder` based on cloud-fan's [suggestion](https://github.com/apache/spark/pull/11494#issuecomment-309694290). Since this PR is a successor of #11494, close #11494. Much of the code was ported from #11494, and a lot of effort went into it; I think credit for this PR should go to yzotov. This PR can achieve 1.1-1.4x performance improvements for operations in `UTF8String` or `Murmur3_x86_32`. Other operations perform comparably. 
Without this PR ``` OpenJDK 64-Bit Server VM 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13 on Linux 4.4.0-22-generic Intel(R) Xeon(R) CPU E5-2667 v3 3.20GHz OpenJDK 64-Bit Server VM 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13 on Linux 4.4.0-22-generic Intel(R) Xeon(R) CPU E5-2667 v3 3.20GHz Hash byte arrays with length 268435487: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Murmur3_x86_32 526 / 536 0.0 131399881.5 1.0X UTF8String benchmark: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ hashCode 525 / 552 1022.6 1.0 1.0X substring 414 / 423 1298.0 0.8 1.3X ``` With this PR ``` OpenJDK 64-Bit Server VM 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13 on Linux 4.4.0-22-generic Intel(R) Xeon(R) CPU E5-2667 v3 3.20GHz Hash byte arrays with length 268435487: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Murmur3_x86_32 474 / 488 0.0 118552232.0 1.0X UTF8String benchmark: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ hashCode 476 / 480 1127.3 0.9 1.0X substring 287 / 291 1869.9 0.5 1.7X ``` Benchmark program ``` test("benchmark Murmur3_x86_32") { val length = 8192 * 32768 + 31 val seed = 42L val iters = 1 << 2 val random = new Random(seed) val arrays = Array.fill[MemoryBlock](numArrays) { val bytes = new Array[Byte](length) random.nextBytes(bytes) new ByteArrayMemoryBlock(bytes, Platform.BYTE_ARRAY_OFFSET, length) } val benchmark = new Benchmark("Hash byte arrays with length " + length, iters * numArrays, minNumIters = 20) benchmark.addCase("HiveHasher") { _: Int => var sum = 0L for (_ <- 0L until iters) { sum += HiveHasher.hashUnsafeBytesBlock( arrays(i), Platform.BYTE_ARRAY_OFFSET, length) } } 
benchmark.run() } test("benchmark UTF8String") { val N = 512 * 1024 * 1024 val iters = 2 val benchmark = new Benchmark("UTF8String benchmark", N, minNumIters = 20) val str0 = new java.io.StringWriter() { { for (i <- 0 until N) { write(" ") } } }.toString val s0 = UTF8String.fromString(str0) benchmark.addCase("hashCode") { _: Int => var h: Int = 0 for (_ <- 0L until iters) { h += s0.hashCode } } benchmark.addCase("substring") { _: Int => var s: UTF8String = null for (_ <- 0L until iters) { s = s0.substring(N / 2 - 5, N / 2 + 5) } } benchmark.run() } ``` I run [this benchmark program](https://gist.github.com/kiszk/94f75b506c93a663bbbc372ffe8f05de) using [the commit](`ee5a79861c`). I got the following results: ``` OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 on Linux 4.4.0-66-generic Intel(R) Xeon(R) CPU E5-2667 v3 3.20GHz Memory access benchmarks: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ ByteArrayMemoryBlock get/putInt() 220 / 221 609.3 1.6 1.0X Platform get/putInt(byte[]) 220 / 236 610.9 1.6 1.0X Platform get/putInt(Object) 492 / 494 272.8 3.7 0.4X OnHeapMemoryBlock get/putLong() 322 / 323 416.5 2.4 0.7X long[] 221 / 221 608.0 1.6 1.0X Platform get/putLong(long[]) 321 / 321 418.7 2.4 0.7X Platform get/putLong(Object) 561 / 563 239.2 4.2 0.4X ``` I also run [this benchmark program](https://gist.github.com/kiszk/5fdb4e03733a5d110421177e289d1fb5) for comparing performance of `Platform.copyMemory()`. 
``` OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 on Linux 4.4.0-66-generic Intel(R) Xeon(R) CPU E5-2667 v3 3.20GHz Platform copyMemory: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Object to Object 1961 / 1967 8.6 116.9 1.0X System.arraycopy Object to Object 1917 / 1921 8.8 114.3 1.0X byte array to byte array 1961 / 1968 8.6 116.9 1.0X System.arraycopy byte array to byte array 1909 / 1937 8.8 113.8 1.0X int array to int array 1921 / 1990 8.7 114.5 1.0X double array to double array 1918 / 1923 8.7 114.3 1.0X Object to byte array 1961 / 1967 8.6 116.9 1.0X Object to short array 1965 / 1972 8.5 117.1 1.0X Object to int array 1910 / 1915 8.8 113.9 1.0X Object to float array 1971 / 1978 8.5 117.5 1.0X Object to double array 1919 / 1944 8.7 114.4 1.0X byte array to Object 1959 / 1967 8.6 116.8 1.0X int array to Object 1961 / 1970 8.6 116.9 1.0X double array to Object 1917 / 1924 8.8 114.3 1.0X ``` These results show five facts: 1. According to the second/third or sixth/seventh results in the first experiment, using `Platform.get/putInt(Object)` is more than 2x slower than using `Platform.get/putInt(byte[])` with a concrete type (i.e. `byte[]`). 2. According to the second/third or fourth/fifth/sixth results in the first experiment, the fastest way to access an array element on the Java heap is `array[]`. The drawback of `array[]` is that it cannot support unaligned 8-byte access. 3. According to the first/second/third or fourth/sixth/seventh results in the first experiment, `getInt()/putInt() or getLong()/putLong()` in subclasses of `MemoryBlock` can achieve performance comparable to `Platform.get/putInt()` or `Platform.get/putLong()` with a concrete type (second or sixth result). There is no overhead from the virtual call. 4. 
According to the results in the second experiment, passing `Object` to `Platform.copyMemory()` achieves the same performance as passing any type of primitive array as source or destination. 5. According to the second/fourth results in the second experiment, `Platform.copyMemory()` achieves the same performance as `System.arraycopy`. It would be good to use `Platform.copyMemory()`, since it accepts any type for src and dst. We are incrementally replacing `Platform.get/putXXX` with `MemoryBlock.get/putXXX`, which has two advantages: 1) better performance due to having a concrete type for an array, and 2) a simple OO design instead of passing `Object`. It is easy to use `MemoryBlock` in `InternalRow`, `BufferHolder`, `TaskMemoryManager`, and others that are already abstracted. It is not easy to use `MemoryBlock` in utility classes related to hashing or others. Other candidates are - UnsafeRow, UnsafeArrayData, UnsafeMapData, SpecificUnsafeRowJoiner - UTF8StringBuffer - BufferHolder - TaskMemoryManager - OnHeapColumnVector - BytesToBytesMap - CachedBatch - classes for hash - others. ## How was this patch tested? Added `UnsafeMemoryAllocator` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #19222 from kiszk/SPARK-10399.	2018-04-06 10:13:59 +08:00
Gengliang Wang	d8379e5bc3	[SPARK-23838][WEBUI] Running SQL query is displayed as "completed" in SQL tab ## What changes were proposed in this pull request? A running SQL query would appear as completed in the Spark UI: ![image1](https://user-images.githubusercontent.com/1097932/38170733-3d7cb00c-35bf-11e8-994c-43f2d4fa285d.png) We can see the query in "Completed queries", while in the job page we see it's still running Job 132. ![image2](https://user-images.githubusercontent.com/1097932/38170735-48f2c714-35bf-11e8-8a41-6fae23543c46.png) After some time the query still appears in "Completed queries" (while it's still running), but the "Duration" has increased. ![image3](https://user-images.githubusercontent.com/1097932/38170737-50f87ea4-35bf-11e8-8b60-000f6f918964.png) To reproduce, we can run a query with multiple jobs, e.g. TPCDS q6. The reason is that updates from executions are written into kvstore periodically, and the job start event may be missed. ## How was this patch tested? Manually ran the job again and checked the SQL tab. The fix is pretty simple. Author: Gengliang Wang <gengliang.wang@databricks.com> Closes #20955 from gengliangwang/jobCompleted.	2018-04-04 15:43:58 -07:00
Takeshi Yamamuro	5197562afe	[SPARK-21351][SQL] Update nullability based on children's output ## What changes were proposed in this pull request? This PR adds a new optimizer rule, `UpdateNullabilityInAttributeReferences`, to update the nullability that `Filter` changes when it has `IsNotNull`. In master, optimized plans do not respect the nullability when `Filter` has `IsNotNull`, which generates unnecessary code. For example: ``` scala> val df = Seq((Some(1), Some(2))).toDF("a", "b") scala> val bIsNotNull = df.where($"b" =!= 2).select($"b") scala> val targetQuery = bIsNotNull.distinct scala> targetQuery.queryExecution.optimizedPlan.output(0).nullable res5: Boolean = true scala> targetQuery.debugCodegen Found 2 WholeStageCodegen subtrees. == Subtree 1 / 2 == HashAggregate(keys=[b#19], functions=[], output=[b#19]) +- Exchange hashpartitioning(b#19, 200) +- HashAggregate(keys=[b#19], functions=[], output=[b#19]) +- Project [_2#16 AS b#19] +- Filter isnotnull(_2#16) +- LocalTableScan [_1#15, _2#16] Generated code: ... /* 124 */ protected void processNext() throws java.io.IOException { ... /* 132 */ // output the result /* 133 */ /* 134 */ while (agg_mapIter.next()) { /* 135 */ wholestagecodegen_numOutputRows.add(1); /* 136 */ UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey(); /* 137 */ UnsafeRow agg_aggBuffer = (UnsafeRow) agg_mapIter.getValue(); /* 138 */ /* 139 */ boolean agg_isNull4 = agg_aggKey.isNullAt(0); /* 140 */ int agg_value4 = agg_isNull4 ? -1 : (agg_aggKey.getInt(0)); /* 141 */ agg_rowWriter1.zeroOutNullBytes(); /* 142 */ // We don't need this NULL check because NULL is filtered out in `$"b" =!=2` /* 143 */ if (agg_isNull4) { /* 144 */ agg_rowWriter1.setNullAt(0); /* 145 */ } else { /* 146 */ agg_rowWriter1.write(0, agg_value4); /* 147 */ } /* 148 */ append(agg_result1); /* 149 */ /* 150 */ if (shouldStop()) return; /* 151 */ } /* 152 */ /* 153 */ agg_mapIter.close(); /* 154 */ if (agg_sorter == null) { /* 155 */ agg_hashMap.free(); /* 156 */ } /* 157 */ } /* 158 */ /* 159 */ } ``` At line 143, the NULL check is unnecessary because NULL is filtered out by `$"b" =!=2`. This PR removes this NULL check: ``` scala> targetQuery.queryExecution.optimizedPlan.output(0).nullable res5: Boolean = false scala> targetQuery.debugCodegen ... Generated code: ... /* 144 */ protected void processNext() throws java.io.IOException { ... /* 152 */ // output the result /* 153 */ /* 154 */ while (agg_mapIter.next()) { /* 155 */ wholestagecodegen_numOutputRows.add(1); /* 156 */ UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey(); /* 157 */ UnsafeRow agg_aggBuffer = (UnsafeRow) agg_mapIter.getValue(); /* 158 */ /* 159 */ int agg_value4 = agg_aggKey.getInt(0); /* 160 */ agg_rowWriter1.write(0, agg_value4); /* 161 */ append(agg_result1); /* 162 */ /* 163 */ if (shouldStop()) return; /* 164 */ } /* 165 */ /* 166 */ agg_mapIter.close(); /* 167 */ if (agg_sorter == null) { /* 168 */ agg_hashMap.free(); /* 169 */ } /* 170 */ } ``` ## How was this patch tested? Added `UpdateNullabilityInAttributeReferencesSuite` for unit tests. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #18576 from maropu/SPARK-21351.	2018-04-04 14:39:19 +08:00
Eric Liang	359375eff7	[SPARK-23809][SQL] Active SparkSession should be set by getOrCreate ## What changes were proposed in this pull request? Currently, the active spark session is set inconsistently (e.g., in createDataFrame, prior to query execution). Many places in spark also incorrectly query active session when they should be calling activeSession.getOrElse(defaultSession) and so might get None even if a Spark session exists. The semantics here can be cleaned up if we also set the active session when the default session is set. Related: https://github.com/apache/spark/pull/20926/files ## How was this patch tested? Unit test, existing test. Note that if https://github.com/apache/spark/pull/20926 merges first we should also update the tests there. Author: Eric Liang <ekl@databricks.com> Closes #20927 from ericl/active-session-cleanup.	2018-04-03 17:09:12 -07:00
Jose Torres	66a3a5a2dc	[SPARK-23099][SS] Migrate foreach sink to DataSourceV2 ## What changes were proposed in this pull request? Migrate foreach sink to DataSourceV2. Since the previous attempt at this PR #20552, we've changed and strictly defined the lifecycle of writer components. This means we no longer need the complicated lifecycle shim from that PR; it just naturally works. ## How was this patch tested? existing tests Author: Jose Torres <torres.joseph.f+github@gmail.com> Closes #20951 from jose-torres/foreach.	2018-04-03 11:05:29 -07:00
Kazuaki Ishizaki	a7c19d9c21	[SPARK-23713][SQL] Cleanup UnsafeWriter and BufferHolder classes ## What changes were proposed in this pull request? This PR implemented the following cleanups related to `UnsafeWriter` class: - Remove code duplication between `UnsafeRowWriter` and `UnsafeArrayWriter` - Make `BufferHolder` class internal by delegating its accessor methods to `UnsafeWriter` - Replace `UnsafeRow.setTotalSize(...)` with `UnsafeRowWriter.setTotalSize()` ## How was this patch tested? Tested by existing UTs Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #20850 from kiszk/SPARK-23713.	2018-04-02 21:48:44 +02:00
Tathagata Das	15298b99ac	[SPARK-23827][SS] StreamingJoinExec should ensure that input data is partitioned into specific number of partitions ## What changes were proposed in this pull request? Currently, the requiredChildDistribution does not specify the partitions. This can cause the weird corner cases where the child's distribution is `SinglePartition` which satisfies the required distribution of `ClusterDistribution(no-num-partition-requirement)`, thus eliminating the shuffle needed to repartition input data into the required number of partitions (i.e. same as state stores). That can lead to "file not found" errors on the state store delta files as the micro-batch-with-no-shuffle will not run certain tasks and therefore not generate the expected state store delta files. This PR adds the required constraint on the number of partitions. ## How was this patch tested? Modified test harness to always check that ANY stateful operator should have a constraint on the number of partitions. As part of that, the existing opt-in checks on child output partitioning were removed, as they are redundant. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #20941 from tdas/SPARK-23827.	2018-03-30 16:48:26 -07:00
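The corner case described in SPARK-23827 can be sketched with a simplified model of how a child's partitioning is checked against a required distribution. This is an illustrative approximation, not Spark's actual `Distribution` API: the class, method names, and the use of `OptionalInt` are assumptions, and the clustering keys are assumed to match already.

```java
import java.util.OptionalInt;

final class DistributionDemo {
    // A requirement without a partition count is satisfied by any child
    // partitioning, including a single partition.
    static boolean satisfies(int childNumPartitions, OptionalInt requiredNumPartitions) {
        return !requiredNumPartitions.isPresent()
                || requiredNumPartitions.getAsInt() == childNumPartitions;
    }

    // A shuffle is planned only when the requirement is not satisfied.
    static boolean needsShuffle(int childNumPartitions, OptionalInt requiredNumPartitions) {
        return !satisfies(childNumPartitions, requiredNumPartitions);
    }

    public static void main(String[] args) {
        // Before the fix: no count is required, so a single-partition child skips the shuffle.
        System.out.println(needsShuffle(1, OptionalInt.empty()));
        // After the fix: the state store partition count (200 here, hypothetical) is required.
        System.out.println(needsShuffle(1, OptionalInt.of(200)));
    }
}
```

In this model, adding the partition count to the requirement is what forces the repartitioning that the stateful operator depends on.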
gatorsmile	bc8d093117	[SPARK-23500][SQL][FOLLOWUP] Fix complex type simplification rules to apply to entire plan ## What changes were proposed in this pull request? This PR is to improve the test coverage of the original PR https://github.com/apache/spark/pull/20687 ## How was this patch tested? N/A Author: gatorsmile <gatorsmile@gmail.com> Closes #20911 from gatorsmile/addTests.	2018-03-30 23:21:07 +08:00
Jose Torres	5b5a36ed6d	Roll forward "[SPARK-23096][SS] Migrate rate source to V2" ## What changes were proposed in this pull request? Roll forward `c68ec4e` (#20688). There are two minor test changes required: * An error which used to be TreeNodeException[ArithmeticException] is no longer wrapped and is now just ArithmeticException. * The test framework simply does not set the active Spark session. (Or rather, it doesn't do so early enough - I think it only happens when a query is analyzed.) I've added the required logic to SQLTestUtils. ## How was this patch tested? existing tests Author: Jose Torres <torres.joseph.f+github@gmail.com> Author: jerryshao <sshao@hortonworks.com> Closes #20922 from jose-torres/ratefix.	2018-03-30 21:54:26 +08:00
yucai	b02e76cbff	[SPARK-23727][SQL] Support for pushing down filters for DateType in parquet ## What changes were proposed in this pull request? This PR adds support for pushing down filters on DateType columns to Parquet. ## How was this patch tested? Added UTs and tested locally. Author: yucai <yyu1@ebay.com> Closes #20851 from yucai/SPARK-23727.	2018-03-30 15:07:38 +08:00
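To illustrate what pushing down a date filter involves: Parquet's DATE logical type stores a 32-bit count of days since the Unix epoch, so a date literal has to be converted to that representation before it can be compared against column statistics. The sketch below shows the conversion and a min/max check in plain Java; the class and method names are hypothetical and this is not the code added by the PR.

```java
import java.time.LocalDate;

final class DateFilterDemo {
    // Convert a calendar date to days since 1970-01-01, the DATE encoding in Parquet.
    static int toParquetDays(LocalDate date) {
        return Math.toIntExact(date.toEpochDay());
    }

    // Decide from min/max statistics whether a row group could contain
    // rows matching the predicate `dateColumn = target`.
    static boolean mightContain(int minDays, int maxDays, LocalDate target) {
        int days = toParquetDays(target);
        return minDays <= days && days <= maxDays;
    }

    public static void main(String[] args) {
        int min = toParquetDays(LocalDate.of(2018, 1, 1));
        int max = toParquetDays(LocalDate.of(2018, 1, 31));
        System.out.println(mightContain(min, max, LocalDate.of(2018, 1, 15)));
        System.out.println(mightContain(min, max, LocalDate.of(2018, 3, 30)));
    }
}
```

Row groups for which the check returns false can be skipped without reading their data, which is the benefit of pushing the filter down.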
Jose Torres	b348901192	[SPARK-23808][SQL] Set default Spark session in test-only spark sessions. ## What changes were proposed in this pull request? Set default Spark session in the TestSparkSession and TestHiveSparkSession constructors. ## How was this patch tested? new unit tests Author: Jose Torres <torres.joseph.f+github@gmail.com> Closes #20926 from jose-torres/test3.	2018-03-29 21:36:56 -07:00
gatorsmile	761565a3cc	Revert "[SPARK-23096][SS] Migrate rate source to V2" This reverts commit `c68ec4e6a1`.	2018-03-28 09:11:52 -07:00
hyukjinkwon	34c4b9c57e	[SPARK-23765][SQL] Supports custom line separator for json datasource ## What changes were proposed in this pull request? This PR proposes to add a `lineSep` option for a configurable line separator in the JSON datasource. It supports this option by using `LineRecordReader`'s functionality, passing the separator to its constructor. The approach is similar to https://github.com/apache/spark/pull/20727; however, one main difference is that it uses the text datasource's `lineSep` option to parse line by line in JSON's schema inference. ## How was this patch tested? Manually tested and unit tests were added. Author: hyukjinkwon <gurwls223@apache.org> Author: hyukjinkwon <gurwls223@gmail.com> Closes #20877 from HyukjinKwon/linesep-json.	2018-03-28 19:49:27 +08:00
jerryshao	c68ec4e6a1	[SPARK-23096][SS] Migrate rate source to V2 ## What changes were proposed in this pull request? This PR migrates the micro-batch rate source to the V2 API and rewrites the UTs to suit the V2 tests. ## How was this patch tested? UTs. Author: jerryshao <sshao@hortonworks.com> Closes #20688 from jerryshao/SPARK-23096.	2018-03-27 14:39:05 -07:00
Kazuaki Ishizaki	e4bec7cb88	[SPARK-23549][SQL] Cast to timestamp when comparing timestamp with date ## What changes were proposed in this pull request? This PR fixes an incorrect comparison in SQL between timestamp and date. The comparison was incorrect because both values were cast to `string` and then compared lexicographically. The previous implementation returns `false` for the query `spark.sql("select cast('2017-03-01 00:00:00' as timestamp) between cast('2017-02-28' as date) and cast('2017-03-01' as date)").show`. This PR returns `true` for this query by casting `date("2017-03-01")` to `timestamp("2017-03-01 00:00:00")`. ## How was this patch tested? Added new UTs to `TypeCoercionSuite`. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #20774 from kiszk/SPARK-23549.	2018-03-25 16:38:49 -07:00
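The difference between the two comparison strategies in SPARK-23549 can be shown with the values from the query above. This is a sketch in plain Java using `java.time`, not Spark's type coercion code; the class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;

final class TimestampDateCompareDemo {
    // Previous behaviour: both sides are rendered as strings and compared lexicographically.
    static boolean betweenAsStrings(String ts, String low, String high) {
        return ts.compareTo(low) >= 0 && ts.compareTo(high) <= 0;
    }

    // Fixed behaviour: each date is promoted to a timestamp at midnight before comparing.
    static boolean betweenAsTimestamps(LocalDateTime ts, LocalDate low, LocalDate high) {
        return !ts.isBefore(low.atStartOfDay()) && !ts.isAfter(high.atStartOfDay());
    }

    public static void main(String[] args) {
        // "2017-03-01 00:00:00" is longer than "2017-03-01" and shares its prefix,
        // so it sorts after the upper bound and the string comparison fails.
        System.out.println(betweenAsStrings("2017-03-01 00:00:00", "2017-02-28", "2017-03-01"));
        System.out.println(betweenAsTimestamps(
                LocalDateTime.of(2017, 3, 1, 0, 0, 0),
                LocalDate.of(2017, 2, 28),
                LocalDate.of(2017, 3, 1)));
    }
}
```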
Takeshi Yamamuro	5f653d4f7c	[SPARK-23167][SQL] Add TPCDS queries v2.7 in TPCDSQuerySuite ## What changes were proposed in this pull request? This PR adds the TPCDS v2.7 (latest) queries to `TPCDSQuerySuite`, because the current `TPCDSQuerySuite` tests the older version (v1.4) and some queries differ between v1.4 and v2.7. Since the original v2.7 queries use syntax that Spark cannot parse, I changed these queries as follows: - [date] + 14 days -> date + `INTERVAL` 14 days - [column name] as "30 days" -> [column name] as \`30 days\` - Fix some syntax errors, e.g., missing brackets ## How was this patch tested? Added tests in `TPCDSQuerySuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #20343 from maropu/TPCDSV2_7.	2018-03-25 09:18:26 -07:00
Jose Torres	816a5496ba	[SPARK-23788][SS] Fix race in StreamingQuerySuite ## What changes were proposed in this pull request? The serializability test uses the same MemoryStream instance for 3 different queries. If any of those queries ask it to commit before the others have run, the rest will see empty dataframes. This can fail the test if q3 is affected. We should use one instance per query instead. ## How was this patch tested? Existing unit test. If I move q2.processAllAvailable() before starting q3, the test always fails without the fix. Author: Jose Torres <torres.joseph.f+github@gmail.com> Closes #20896 from jose-torres/fixrace.	2018-03-24 18:21:01 -07:00
Liang-Chi Hsieh	b2edc30db1	[SPARK-23614][SQL] Fix incorrect reuse exchange when caching is used ## What changes were proposed in this pull request? We should provide a customized canonicalized plan for `InMemoryRelation` and `InMemoryTableScanExec`. Otherwise, two different cached plans can wrongly be treated as producing the same result, which leads to an incorrectly reused exchange. For a test query like this: ```scala val cached = spark.createDataset(Seq(TestDataUnion(1, 2, 3), TestDataUnion(4, 5, 6))).cache() val group1 = cached.groupBy("x").agg(min(col("y")) as "value") val group2 = cached.groupBy("x").agg(min(col("z")) as "value") group1.union(group2) ``` Canonicalized plans before: First exchange: ``` Exchange hashpartitioning(none#0, 5) +- (1) HashAggregate(keys=[none#0], functions=[partial_min(none#1)], output=[none#0, none#4]) +- (1) InMemoryTableScan [none#0, none#1] +- InMemoryRelation [x#4253, y#4254, z#4255], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas) +- LocalTableScan [x#4253, y#4254, z#4255] ``` Second exchange: ``` Exchange hashpartitioning(none#0, 5) +- (3) HashAggregate(keys=[none#0], functions=[partial_min(none#1)], output=[none#0, none#4]) +- (3) InMemoryTableScan [none#0, none#1] +- InMemoryRelation [x#4253, y#4254, z#4255], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas) +- LocalTableScan [x#4253, y#4254, z#4255] ``` The two canonicalized plans are identical, although the two `InMemoryTableScan`s use different columns. 
Canonicalized plan after: First exchange: ``` Exchange hashpartitioning(none#0, 5) +- (1) HashAggregate(keys=[none#0], functions=[partial_min(none#1)], output=[none#0, none#4]) +- (1) InMemoryTableScan [none#0, none#1] +- InMemoryRelation [none#0, none#1, none#2], true, 10000, StorageLevel(memory, 1 replicas) +- LocalTableScan [none#0, none#1, none#2] ``` Second exchange: ``` Exchange hashpartitioning(none#0, 5) +- (3) HashAggregate(keys=[none#0], functions=[partial_min(none#1)], output=[none#0, none#4]) +- (3) InMemoryTableScan [none#0, none#2] +- InMemoryRelation [none#0, none#1, none#2], true, 10000, StorageLevel(memory, 1 replicas) +- LocalTableScan [none#0, none#1, none#2] ``` ## How was this patch tested? Added unit test. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #20831 from viirya/SPARK-23614.	2018-03-22 21:23:25 -07:00
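The before/after plans of SPARK-23614 suggest why the canonical forms used to collide: the scan's attributes were renamed by their position in the scan's own output, which hides which relation columns are read. The sketch below models that idea with plain lists in Java. It is a simplified illustration under that assumption, not Spark's canonicalization code, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CanonicalizeDemo {
    // Naive normalization: rename attributes by their position in the scan's own output.
    static List<String> byOwnPosition(List<String> scanOutput) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < scanOutput.size(); i++) {
            out.add("none#" + i);
        }
        return out;
    }

    // Normalization relative to the cached relation: rename attributes by their
    // ordinal in the relation's output, so different columns stay distinguishable.
    static List<String> byRelationOrdinal(List<String> scanOutput, List<String> relationOutput) {
        List<String> out = new ArrayList<>();
        for (String attr : scanOutput) {
            out.add("none#" + relationOutput.indexOf(attr));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> relation = Arrays.asList("x#4253", "y#4254", "z#4255");
        List<String> scan1 = Arrays.asList("x#4253", "y#4254");
        List<String> scan2 = Arrays.asList("x#4253", "z#4255");
        System.out.println(byOwnPosition(scan1).equals(byOwnPosition(scan2)));
        System.out.println(byRelationOrdinal(scan1, relation) + " vs "
                + byRelationOrdinal(scan2, relation));
    }
}
```

With normalization relative to the relation, the second scan canonicalizes to `[none#0, none#2]`, matching the "after" plan above, so the two exchanges are no longer considered equal.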
Liang-Chi Hsieh	4d37008c78	[SPARK-23599][SQL] Use RandomUUIDGenerator in Uuid expression ## What changes were proposed in this pull request? As stated in Jira, there are problems with current `Uuid` expression which uses `java.util.UUID.randomUUID` for UUID generation. This patch uses the newly added `RandomUUIDGenerator` for UUID generation. So we can make `Uuid` deterministic between retries. ## How was this patch tested? Added unit tests. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #20861 from viirya/SPARK-23599-2.	2018-03-22 19:57:32 +01:00
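The idea behind making `Uuid` deterministic between retries in SPARK-23599 is to derive UUIDs from a seeded pseudo-random source instead of `java.util.UUID.randomUUID`. The sketch below shows one way to do that in plain Java. It is an assumption-laden illustration: it uses `java.util.Random` as the source and a hypothetical class name, and it is not the implementation of Spark's `RandomUUIDGenerator`.

```java
import java.util.Random;
import java.util.UUID;

final class SeededUuidDemo {
    // Build a version 4 (random) UUID from two longs drawn from the given generator.
    static UUID next(Random rng) {
        long msb = rng.nextLong();
        long lsb = rng.nextLong();
        // Set the version field (bits 12-15 of the most significant long) to 4.
        msb = (msb & 0xFFFFFFFFFFFF0FFFL) | 0x0000000000004000L;
        // Set the two top bits of the least significant long to the IETF variant (binary 10).
        lsb = (lsb & 0x3FFFFFFFFFFFFFFFL) | 0x8000000000000000L;
        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        // The same seed yields the same UUID, so a retried task reproduces its output.
        System.out.println(next(new Random(42L)));
        System.out.println(next(new Random(42L)));
    }
}
```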
Dilip Biswal	5c9eaa6b58	[SPARK-23372][SQL] Writing empty struct in parquet fails during execution. It should fail earlier in the processing. ## What changes were proposed in this pull request? Currently we allow writing data frames with empty schema into a file based datasource for certain file formats such as JSON, ORC etc. For formats such as Parquet and Text, we raise error at different times of execution. For text format, we return error from the driver early on in processing where as for format such as parquet, the error is raised from executor. Example spark.emptyDataFrame.write.format("parquet").mode("overwrite").save(path) Results in ``` SQL org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: message spark_schema { } at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:27) at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:37) at org.apache.parquet.schema.MessageType.accept(MessageType.java:58) at org.apache.parquet.schema.TypeUtil.checkValidWriteSchema(TypeUtil.java:23) at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:225) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278) at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:276) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:281) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:206) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:205) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread. ``` In this PR, we unify the error processing and raise error on attempt to write empty schema based dataframes into file based datasource (orc, parquet, text , csv, json etc) early on in the processing. ## How was this patch tested? Unit tests added in FileBasedDatasourceSuite. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #20579 from dilipbiswal/spark-23372.	2018-03-21 21:49:02 -07:00
Gabor Somogyi	918c7e99af	[SPARK-23288][SS] Fix output metrics with parquet sink ## What changes were proposed in this pull request? Output metrics were not filled in when the parquet sink was used. This PR fixes this problem by passing a `BasicWriteJobStatsTracker` in `FileStreamSink`. ## How was this patch tested? Additional unit test added. Author: Gabor Somogyi <gabor.g.somogyi@gmail.com> Closes #20745 from gaborgsomogyi/SPARK-23288.	2018-03-21 10:06:26 -07:00
Takeshi Yamamuro	98d0ea3f60	[SPARK-23264][SQL] Fix scala.MatchError in literals.sql.out ## What changes were proposed in this pull request? To fix `scala.MatchError` in `literals.sql.out`, this pr added an entry for `CalendarIntervalType` in `QueryExecution.toHiveStructString`. ## How was this patch tested? Existing tests and added tests in `literals.sql` Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #20872 from maropu/FixIntervalTests.	2018-03-21 09:52:28 -07:00
hyukjinkwon	8d79113b81	[SPARK-23577][SQL] Supports custom line separator for text datasource ## What changes were proposed in this pull request? This PR proposes to add a `lineSep` option for a configurable line separator in the text datasource. It supports this option by passing the separator to the `LineRecordReader` constructor. ## How was this patch tested? Manual tests and unit tests were added. Author: hyukjinkwon <gurwls223@gmail.com> Closes #20727 from HyukjinKwon/linesep-text.	2018-03-21 09:46:47 -07:00
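What a configurable line separator means for record boundaries can be sketched in a few lines. This is a plain-Python illustration, not the `LineRecordReader` implementation: `split_records` is a hypothetical helper, and its handling of a trailing separator (no empty final record) is an assumption of the sketch.

```python
import re


def split_records(text, line_sep=None):
    """Split text into records on a custom separator.

    With line_sep=None, fall back to the common newline conventions
    (\\r\\n, \\r, \\n); otherwise split on exactly the given string.
    """
    if line_sep is None:
        parts = re.split(r"\r\n|\r|\n", text)
    else:
        if line_sep == "":
            raise ValueError("line_sep must be non-empty")
        parts = text.split(line_sep)
    # Assumed here: a trailing separator terminates the last record
    # rather than starting a new, empty one.
    if parts and parts[-1] == "":
        parts.pop()
    return parts
```

With a custom separator, ordinary newlines stay inside the record, so `split_records("a\nb|c", "|")` yields two records, the first of which still contains the newline.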
Takeshi Yamamuro	983e8d9d64	[SPARK-23666][SQL] Do not display exprIds of Alias in user-facing info. ## What changes were proposed in this pull request? To drop `exprId`s for `Alias` in user-facing info., this pr added an entry for `Alias` in `NonSQLExpression.sql` ## How was this patch tested? Added tests in `UDFSuite`. Author: Takeshi Yamamuro <yamamuro@apache.org> Closes #20827 from maropu/SPARK-23666.	2018-03-20 23:17:49 -07:00
Jose Torres	2c4b9962fd	[SPARK-23574][SQL] Report SinglePartition in DataSourceV2ScanExec when there's exactly 1 data reader factory. ## What changes were proposed in this pull request? Report SinglePartition in DataSourceV2ScanExec when there's exactly 1 data reader factory. Note that this means reader factories end up being constructed as partitioning is checked; let me know if you think that could be a problem. ## How was this patch tested? existing unit tests Author: Jose Torres <jose@databricks.com> Author: Jose Torres <torres.joseph.f+github@gmail.com> Closes #20726 from jose-torres/SPARK-23574.	2018-03-20 11:46:51 -07:00
Dongjoon Hyun	5414abca4f	[SPARK-23553][TESTS] Tests should not assume the default value of `spark.sql.sources.default` ## What changes were proposed in this pull request? Currently, some tests assume that `spark.sql.sources.default=parquet`. That assumption is correct, but it makes it difficult to test a new data source format. This PR aims to - Make test suites more robust and make it easy to test new data sources in the future. - Test the new native ORC data source with the full existing Apache Spark test coverage. As an example, the PR uses `spark.sql.sources.default=orc` during reviews. The value should be `parquet` when this PR is accepted. ## How was this patch tested? Pass the Jenkins with updated tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #20705 from dongjoon-hyun/SPARK-23553.	2018-03-16 09:36:30 -07:00
myroslavlisniak	c2632edebd	[SPARK-23670][SQL] Fix memory leak on SparkPlanGraphWrapper Clean up SparkPlanGraphWrapper objects from InMemoryStore together with cleaning up SQLExecutionUIData. The existing unit test was extended to also check the SparkPlanGraphWrapper object count. vanzin Author: myroslavlisniak <acnipin@gmail.com> Closes #20813 from myroslavlisniak/master.	2018-03-15 17:20:59 -07:00
Yuming Wang	15c3c98300	[HOT-FIX] Fix SparkOutOfMemoryError: Unable to acquire 262144 bytes of memory, got 224631 ## What changes were proposed in this pull request? https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88263/testReport https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88260/testReport https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88257/testReport https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88224/testReport These tests all failed: ``` org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 262144 bytes of memory, got 224631 at org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:98) at org.apache.spark.unsafe.map.BytesToBytesMap.allocate(BytesToBytesMap.java:787) at org.apache.spark.unsafe.map.BytesToBytesMap.<init>(BytesToBytesMap.java:204) at org.apache.spark.unsafe.map.BytesToBytesMap.<init>(BytesToBytesMap.java:219) ... ``` This PR ignores this test. ## How was this patch tested? N/A Author: Yuming Wang <yumwang@ebay.com> Closes #20835 from wangyum/SPARK-23598.	2018-03-15 19:54:58 +01:00
Yuanjian Li	7c3e8995f1	[SPARK-23533][SS] Add support for changing ContinuousDataReader's startOffset ## What changes were proposed in this pull request? As discussed in #20675, we need to add a new interface `ContinuousDataReaderFactory` to support setting the start offset in Continuous Processing. ## How was this patch tested? Existing UT. Author: Yuanjian Li <xyliyuanjian@gmail.com> Closes #20689 from xuanyuanking/SPARK-23533.	2018-03-15 00:04:28 -07:00
Kazuaki Ishizaki	1098933b0a	[SPARK-23598][SQL] Make methods in BufferedRowIterator public to avoid runtime error for a large query ## What changes were proposed in this pull request? This PR fixes runtime error regarding a large query when a generated code has split classes. The issue is `append()`, `stopEarly()`, and other methods are not accessible from split classes that are not subclasses of `BufferedRowIterator`. This PR fixes this issue by making them `public`. Before applying the PR, we see the following exception by running the attached program with `CodeGenerator.GENERATED_CLASS_SIZE_THRESHOLD=-1`. ``` test("SPARK-23598") { // When set -1 to CodeGenerator.GENERATED_CLASS_SIZE_THRESHOLD, an exception is thrown val df_pet_age = Seq((8, "bat"), (15, "mouse"), (5, "horse")).toDF("age", "name") df_pet_age.groupBy("name").avg("age").show() } ``` Exception: ``` 19:40:52.591 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 19:41:32.319 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IllegalAccessError: tried to access method org.apache.spark.sql.execution.BufferedRowIterator.shouldStop()Z from class org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1$agg_NestedClass1 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1$agg_NestedClass1.agg_doAggregateWithKeys$(generated.java:203) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:160) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:616) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ... ``` Generated code (line 195 calles `stopEarly()`). ``` /* 001 / public Object generate(Object[] references) { / 002 / return new GeneratedIteratorForCodegenStage1(references); / 003 / } / 004 / / 005 / // codegenStageId=1 / 006 / final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator { / 007 / private Object[] references; / 008 / private scala.collection.Iterator[] inputs; / 009 / private boolean agg_initAgg; / 010 / private boolean agg_bufIsNull; / 011 / private double agg_bufValue; / 012 / private boolean agg_bufIsNull1; / 013 / private long agg_bufValue1; / 014 / private agg_FastHashMap agg_fastHashMap; / 015 / private org.apache.spark.unsafe.KVIterator<UnsafeRow, UnsafeRow> agg_fastHashMapIter; / 016 / private org.apache.spark.unsafe.KVIterator agg_mapIter; / 017 / private org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap agg_hashMap; / 018 / private org.apache.spark.sql.execution.UnsafeKVExternalSorter agg_sorter; / 019 / private scala.collection.Iterator inputadapter_input; / 020 / private boolean agg_agg_isNull11; / 021 / private boolean agg_agg_isNull25; / 022 / private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder[] agg_mutableStateArray1 = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder[2]; / 023 / private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] agg_mutableStateArray2 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[2]; / 024 / private UnsafeRow[] agg_mutableStateArray = new UnsafeRow[2]; / 025 / / 026 / public GeneratedIteratorForCodegenStage1(Object[] references) { / 027 / this.references = references; / 028 / } / 029 / / 030 / public void init(int index, scala.collection.Iterator[] inputs) { / 031 / partitionIndex = index; / 032 / this.inputs = inputs; / 033 / / 034 / agg_fastHashMap = new agg_FastHashMap(((org.apache.spark.sql.execution.aggregate.HashAggregateExec) references[0] / plan /).getTaskMemoryManager(), ((org.apache.spark.sql.execution.aggregate.HashAggregateExec) references[0] / plan /).getEmptyAggregationBuffer()); / 035 / agg_hashMap = ((org.apache.spark.sql.execution.aggregate.HashAggregateExec) references[0] / plan /).createHashMap(); / 036 / inputadapter_input = inputs[0]; / 037 / agg_mutableStateArray[0] = new UnsafeRow(1); / 038 / agg_mutableStateArray1[0] = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(agg_mutableStateArray[0], 32); / 039 / agg_mutableStateArray2[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(agg_mutableStateArray1[0], 1); / 040 / agg_mutableStateArray[1] = new UnsafeRow(3); / 041 / agg_mutableStateArray1[1] = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(agg_mutableStateArray[1], 32); / 042 / agg_mutableStateArray2[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(agg_mutableStateArray1[1], 3); / 043 / / 044 / } / 045 / / 046 / public class agg_FastHashMap { / 047 / private org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch batch; / 048 / private int[] buckets; / 049 / private int capacity = 1 << 16; / 050 / private double loadFactor = 0.5; / 051 / private int numBuckets = (int) (capacity / loadFactor); / 052 / private int maxSteps = 2; / 053 / private int numRows = 0; / 
054 / private org.apache.spark.sql.types.StructType keySchema = new org.apache.spark.sql.types.StructType().add(((java.lang.String) references[1] / keyName /), org.apache.spark.sql.types.DataTypes.StringType); / 055 / private org.apache.spark.sql.types.StructType valueSchema = new org.apache.spark.sql.types.StructType().add(((java.lang.String) references[2] / keyName /), org.apache.spark.sql.types.DataTypes.DoubleType) / 056 / .add(((java.lang.String) references[3] / keyName /), org.apache.spark.sql.types.DataTypes.LongType); / 057 / private Object emptyVBase; / 058 / private long emptyVOff; / 059 / private int emptyVLen; / 060 / private boolean isBatchFull = false; / 061 / / 062 / public agg_FastHashMap( / 063 / org.apache.spark.memory.TaskMemoryManager taskMemoryManager, / 064 / InternalRow emptyAggregationBuffer) { / 065 / batch = org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch / 066 / .allocate(keySchema, valueSchema, taskMemoryManager, capacity); / 067 / / 068 / final UnsafeProjection valueProjection = UnsafeProjection.create(valueSchema); / 069 / final byte[] emptyBuffer = valueProjection.apply(emptyAggregationBuffer).getBytes(); / 070 / / 071 / emptyVBase = emptyBuffer; / 072 / emptyVOff = Platform.BYTE_ARRAY_OFFSET; / 073 / emptyVLen = emptyBuffer.length; / 074 / / 075 / buckets = new int[numBuckets]; / 076 / java.util.Arrays.fill(buckets, -1); / 077 / } / 078 / / 079 / public org.apache.spark.sql.catalyst.expressions.UnsafeRow findOrInsert(UTF8String agg_key) { / 080 / long h = hash(agg_key); / 081 / int step = 0; / 082 / int idx = (int) h & (numBuckets - 1); / 083 / while (step < maxSteps) { / 084 / // Return bucket index if it's either an empty slot or already contains the key / 085 / if (buckets[idx] == -1) { / 086 / if (numRows < capacity && !isBatchFull) { / 087 / // creating the unsafe for new entry / 088 / UnsafeRow agg_result = new UnsafeRow(1); / 089 / org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder / 090 
/ = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(agg_result, / 091 / 32); / 092 / org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter agg_rowWriter / 093 / = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter( / 094 / agg_holder, / 095 / 1); / 096 / agg_holder.reset(); //TODO: investigate if reset or zeroout are actually needed / 097 / agg_rowWriter.zeroOutNullBytes(); / 098 / agg_rowWriter.write(0, agg_key); / 099 / agg_result.setTotalSize(agg_holder.totalSize()); / 100 / Object kbase = agg_result.getBaseObject(); / 101 / long koff = agg_result.getBaseOffset(); / 102 / int klen = agg_result.getSizeInBytes(); / 103 / / 104 / UnsafeRow vRow / 105 / = batch.appendRow(kbase, koff, klen, emptyVBase, emptyVOff, emptyVLen); / 106 / if (vRow == null) { / 107 / isBatchFull = true; / 108 / } else { / 109 / buckets[idx] = numRows++; / 110 / } / 111 / return vRow; / 112 / } else { / 113 / // No more space / 114 / return null; / 115 / } / 116 / } else if (equals(idx, agg_key)) { / 117 / return batch.getValueRow(buckets[idx]); / 118 / } / 119 / idx = (idx + 1) & (numBuckets - 1); / 120 / step++; / 121 / } / 122 / // Didn't find it / 123 / return null; / 124 / } / 125 / / 126 / private boolean equals(int idx, UTF8String agg_key) { / 127 / UnsafeRow row = batch.getKeyRow(buckets[idx]); / 128 / return (row.getUTF8String(0).equals(agg_key)); / 129 / } / 130 / / 131 / private long hash(UTF8String agg_key) { / 132 / long agg_hash = 0; / 133 / / 134 / int agg_result = 0; / 135 / byte[] agg_bytes = agg_key.getBytes(); / 136 / for (int i = 0; i < agg_bytes.length; i++) { / 137 / int agg_hash1 = agg_bytes[i]; / 138 / agg_result = (agg_result ^ (0x9e3779b9)) + agg_hash1 + (agg_result << 6) + (agg_result >>> 2); / 139 / } / 140 / / 141 / agg_hash = (agg_hash ^ (0x9e3779b9)) + agg_result + (agg_hash << 6) + (agg_hash >>> 2); / 142 / / 143 / return agg_hash; / 144 / } / 145 / / 146 / public org.apache.spark.unsafe.KVIterator<UnsafeRow, 
UnsafeRow> rowIterator() { / 147 / return batch.rowIterator(); / 148 / } / 149 / / 150 / public void close() { / 151 / batch.close(); / 152 / } / 153 / / 154 / } / 155 / / 156 / protected void processNext() throws java.io.IOException { / 157 / if (!agg_initAgg) { / 158 / agg_initAgg = true; / 159 / long wholestagecodegen_beforeAgg = System.nanoTime(); / 160 / agg_nestedClassInstance1.agg_doAggregateWithKeys(); / 161 / ((org.apache.spark.sql.execution.metric.SQLMetric) references[8] / aggTime /).add((System.nanoTime() - wholestagecodegen_beforeAgg) / 1000000); / 162 / } / 163 / / 164 / // output the result / 165 / / 166 / while (agg_fastHashMapIter.next()) { / 167 / UnsafeRow agg_aggKey = (UnsafeRow) agg_fastHashMapIter.getKey(); / 168 / UnsafeRow agg_aggBuffer = (UnsafeRow) agg_fastHashMapIter.getValue(); / 169 / wholestagecodegen_nestedClassInstance.agg_doAggregateWithKeysOutput(agg_aggKey, agg_aggBuffer); / 170 / / 171 / if (shouldStop()) return; / 172 / } / 173 / agg_fastHashMap.close(); / 174 / / 175 / while (agg_mapIter.next()) { / 176 / UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey(); / 177 / UnsafeRow agg_aggBuffer = (UnsafeRow) agg_mapIter.getValue(); / 178 / wholestagecodegen_nestedClassInstance.agg_doAggregateWithKeysOutput(agg_aggKey, agg_aggBuffer); / 179 / / 180 / if (shouldStop()) return; / 181 / } / 182 / / 183 / agg_mapIter.close(); / 184 / if (agg_sorter == null) { / 185 / agg_hashMap.free(); / 186 / } / 187 / } / 188 / / 189 / private wholestagecodegen_NestedClass wholestagecodegen_nestedClassInstance = new wholestagecodegen_NestedClass(); / 190 / private agg_NestedClass1 agg_nestedClassInstance1 = new agg_NestedClass1(); / 191 / private agg_NestedClass agg_nestedClassInstance = new agg_NestedClass(); / 192 / / 193 / private class agg_NestedClass1 { / 194 / private void agg_doAggregateWithKeys() throws java.io.IOException { / 195 / while (inputadapter_input.hasNext() && !stopEarly()) { / 196 / InternalRow inputadapter_row = (InternalRow) 
inputadapter_input.next(); / 197 / int inputadapter_value = inputadapter_row.getInt(0); / 198 / boolean inputadapter_isNull1 = inputadapter_row.isNullAt(1); / 199 / UTF8String inputadapter_value1 = inputadapter_isNull1 ? / 200 / null : (inputadapter_row.getUTF8String(1)); / 201 / / 202 / agg_nestedClassInstance.agg_doConsume(inputadapter_row, inputadapter_value, inputadapter_value1, inputadapter_isNull1); / 203 / if (shouldStop()) return; / 204 / } / 205 / / 206 / agg_fastHashMapIter = agg_fastHashMap.rowIterator(); / 207 / agg_mapIter = ((org.apache.spark.sql.execution.aggregate.HashAggregateExec) references[0] / plan /).finishAggregate(agg_hashMap, agg_sorter, ((org.apache.spark.sql.execution.metric.SQLMetric) references[4] / peakMemory /), ((org.apache.spark.sql.execution.metric.SQLMetric) references[5] / spillSize /), ((org.apache.spark.sql.execution.metric.SQLMetric) references[6] / avgHashProbe /)); / 208 / / 209 / } / 210 / / 211 / } / 212 / / 213 / private class wholestagecodegen_NestedClass { / 214 / private void agg_doAggregateWithKeysOutput(UnsafeRow agg_keyTerm, UnsafeRow agg_bufferTerm) / 215 / throws java.io.IOException { / 216 / ((org.apache.spark.sql.execution.metric.SQLMetric) references[7] / numOutputRows /).add(1); / 217 / / 218 / boolean agg_isNull35 = agg_keyTerm.isNullAt(0); / 219 / UTF8String agg_value37 = agg_isNull35 ? / 220 / null : (agg_keyTerm.getUTF8String(0)); / 221 / boolean agg_isNull36 = agg_bufferTerm.isNullAt(0); / 222 / double agg_value38 = agg_isNull36 ? / 223 / -1.0 : (agg_bufferTerm.getDouble(0)); / 224 / boolean agg_isNull37 = agg_bufferTerm.isNullAt(1); / 225 / long agg_value39 = agg_isNull37 ? 
/ 226 / -1L : (agg_bufferTerm.getLong(1)); / 227 / / 228 / agg_mutableStateArray1[1].reset(); / 229 / / 230 / agg_mutableStateArray2[1].zeroOutNullBytes(); / 231 / / 232 / if (agg_isNull35) { / 233 / agg_mutableStateArray2[1].setNullAt(0); / 234 / } else { / 235 / agg_mutableStateArray2[1].write(0, agg_value37); / 236 / } / 237 / / 238 / if (agg_isNull36) { / 239 / agg_mutableStateArray2[1].setNullAt(1); / 240 / } else { / 241 / agg_mutableStateArray2[1].write(1, agg_value38); / 242 / } / 243 / / 244 / if (agg_isNull37) { / 245 / agg_mutableStateArray2[1].setNullAt(2); / 246 / } else { / 247 / agg_mutableStateArray2[1].write(2, agg_value39); / 248 / } / 249 / agg_mutableStateArray[1].setTotalSize(agg_mutableStateArray1[1].totalSize()); / 250 / append(agg_mutableStateArray[1]); / 251 / / 252 / } / 253 / / 254 / } / 255 / / 256 / private class agg_NestedClass { / 257 / private void agg_doConsume(InternalRow inputadapter_row, int agg_expr_0, UTF8String agg_expr_1, boolean agg_exprIsNull_1) throws java.io.IOException { / 258 / UnsafeRow agg_unsafeRowAggBuffer = null; / 259 / UnsafeRow agg_fastAggBuffer = null; / 260 / / 261 / if (true) { / 262 / if (!agg_exprIsNull_1) { / 263 / agg_fastAggBuffer = agg_fastHashMap.findOrInsert( / 264 / agg_expr_1); / 265 / } / 266 / } / 267 / // Cannot find the key in fast hash map, try regular hash map. 
/ 268 / if (agg_fastAggBuffer == null) { / 269 / // generate grouping key / 270 / agg_mutableStateArray1[0].reset(); / 271 / / 272 / agg_mutableStateArray2[0].zeroOutNullBytes(); / 273 / / 274 / if (agg_exprIsNull_1) { / 275 / agg_mutableStateArray2[0].setNullAt(0); / 276 / } else { / 277 / agg_mutableStateArray2[0].write(0, agg_expr_1); / 278 / } / 279 / agg_mutableStateArray[0].setTotalSize(agg_mutableStateArray1[0].totalSize()); / 280 / int agg_value7 = 42; / 281 / / 282 / if (!agg_exprIsNull_1) { / 283 / agg_value7 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(agg_expr_1.getBaseObject(), agg_expr_1.getBaseOffset(), agg_expr_1.numBytes(), agg_value7); / 284 / } / 285 / if (true) { / 286 / // try to get the buffer from hash map / 287 / agg_unsafeRowAggBuffer = / 288 / agg_hashMap.getAggregationBufferFromUnsafeRow(agg_mutableStateArray[0], agg_value7); / 289 / } / 290 / // Can't allocate buffer from the hash map. Spill the map and fallback to sort-based / 291 / // aggregation after processing all input rows. / 292 / if (agg_unsafeRowAggBuffer == null) { / 293 / if (agg_sorter == null) { / 294 / agg_sorter = agg_hashMap.destructAndCreateExternalSorter(); / 295 / } else { / 296 / agg_sorter.merge(agg_hashMap.destructAndCreateExternalSorter()); / 297 / } / 298 / / 299 / // the hash map had be spilled, it should have enough memory now, / 300 / // try to allocate buffer again. 
/ 301 / agg_unsafeRowAggBuffer = agg_hashMap.getAggregationBufferFromUnsafeRow( / 302 / agg_mutableStateArray[0], agg_value7); / 303 / if (agg_unsafeRowAggBuffer == null) { / 304 / // failed to allocate the first page / 305 / throw new OutOfMemoryError("No enough memory for aggregation"); / 306 / } / 307 / } / 308 / / 309 / } / 310 / / 311 / if (agg_fastAggBuffer != null) { / 312 / // common sub-expressions / 313 / boolean agg_isNull21 = false; / 314 / long agg_value23 = -1L; / 315 / if (!false) { / 316 / agg_value23 = (long) agg_expr_0; / 317 / } / 318 / // evaluate aggregate function / 319 / boolean agg_isNull23 = true; / 320 / double agg_value25 = -1.0; / 321 / / 322 / boolean agg_isNull24 = agg_fastAggBuffer.isNullAt(0); / 323 / double agg_value26 = agg_isNull24 ? / 324 / -1.0 : (agg_fastAggBuffer.getDouble(0)); / 325 / if (!agg_isNull24) { / 326 / agg_agg_isNull25 = true; / 327 / double agg_value27 = -1.0; / 328 / do { / 329 / boolean agg_isNull26 = agg_isNull21; / 330 / double agg_value28 = -1.0; / 331 / if (!agg_isNull21) { / 332 / agg_value28 = (double) agg_value23; / 333 / } / 334 / if (!agg_isNull26) { / 335 / agg_agg_isNull25 = false; / 336 / agg_value27 = agg_value28; / 337 / continue; / 338 / } / 339 / / 340 / boolean agg_isNull27 = false; / 341 / double agg_value29 = -1.0; / 342 / if (!false) { / 343 / agg_value29 = (double) 0; / 344 / } / 345 / if (!agg_isNull27) { / 346 / agg_agg_isNull25 = false; / 347 / agg_value27 = agg_value29; / 348 / continue; / 349 / } / 350 / / 351 / } while (false); / 352 / / 353 / agg_isNull23 = false; // resultCode could change nullability. / 354 / agg_value25 = agg_value26 + agg_value27; / 355 / / 356 / } / 357 / boolean agg_isNull29 = false; / 358 / long agg_value31 = -1L; / 359 / if (!false && agg_isNull21) { / 360 / boolean agg_isNull31 = agg_fastAggBuffer.isNullAt(1); / 361 / long agg_value33 = agg_isNull31 ? 
/ 362 / -1L : (agg_fastAggBuffer.getLong(1)); / 363 / agg_isNull29 = agg_isNull31; / 364 / agg_value31 = agg_value33; / 365 / } else { / 366 / boolean agg_isNull32 = true; / 367 / long agg_value34 = -1L; / 368 / / 369 / boolean agg_isNull33 = agg_fastAggBuffer.isNullAt(1); / 370 / long agg_value35 = agg_isNull33 ? / 371 / -1L : (agg_fastAggBuffer.getLong(1)); / 372 / if (!agg_isNull33) { / 373 / agg_isNull32 = false; // resultCode could change nullability. / 374 / agg_value34 = agg_value35 + 1L; / 375 / / 376 / } / 377 / agg_isNull29 = agg_isNull32; / 378 / agg_value31 = agg_value34; / 379 / } / 380 / // update fast row / 381 / if (!agg_isNull23) { / 382 / agg_fastAggBuffer.setDouble(0, agg_value25); / 383 / } else { / 384 / agg_fastAggBuffer.setNullAt(0); / 385 / } / 386 / / 387 / if (!agg_isNull29) { / 388 / agg_fastAggBuffer.setLong(1, agg_value31); / 389 / } else { / 390 / agg_fastAggBuffer.setNullAt(1); / 391 / } / 392 / } else { / 393 / // common sub-expressions / 394 / boolean agg_isNull7 = false; / 395 / long agg_value9 = -1L; / 396 / if (!false) { / 397 / agg_value9 = (long) agg_expr_0; / 398 / } / 399 / // evaluate aggregate function / 400 / boolean agg_isNull9 = true; / 401 / double agg_value11 = -1.0; / 402 / / 403 / boolean agg_isNull10 = agg_unsafeRowAggBuffer.isNullAt(0); / 404 / double agg_value12 = agg_isNull10 ? 
/ 405 / -1.0 : (agg_unsafeRowAggBuffer.getDouble(0)); / 406 / if (!agg_isNull10) { / 407 / agg_agg_isNull11 = true; / 408 / double agg_value13 = -1.0; / 409 / do { / 410 / boolean agg_isNull12 = agg_isNull7; / 411 / double agg_value14 = -1.0; / 412 / if (!agg_isNull7) { / 413 / agg_value14 = (double) agg_value9; / 414 / } / 415 / if (!agg_isNull12) { / 416 / agg_agg_isNull11 = false; / 417 / agg_value13 = agg_value14; / 418 / continue; / 419 / } / 420 / / 421 / boolean agg_isNull13 = false; / 422 / double agg_value15 = -1.0; / 423 / if (!false) { / 424 / agg_value15 = (double) 0; / 425 / } / 426 / if (!agg_isNull13) { / 427 / agg_agg_isNull11 = false; / 428 / agg_value13 = agg_value15; / 429 / continue; / 430 / } / 431 / / 432 / } while (false); / 433 / / 434 / agg_isNull9 = false; // resultCode could change nullability. / 435 / agg_value11 = agg_value12 + agg_value13; / 436 / / 437 / } / 438 / boolean agg_isNull15 = false; / 439 / long agg_value17 = -1L; / 440 / if (!false && agg_isNull7) { / 441 / boolean agg_isNull17 = agg_unsafeRowAggBuffer.isNullAt(1); / 442 / long agg_value19 = agg_isNull17 ? / 443 / -1L : (agg_unsafeRowAggBuffer.getLong(1)); / 444 / agg_isNull15 = agg_isNull17; / 445 / agg_value17 = agg_value19; / 446 / } else { / 447 / boolean agg_isNull18 = true; / 448 / long agg_value20 = -1L; / 449 / / 450 / boolean agg_isNull19 = agg_unsafeRowAggBuffer.isNullAt(1); / 451 / long agg_value21 = agg_isNull19 ? / 452 / -1L : (agg_unsafeRowAggBuffer.getLong(1)); / 453 / if (!agg_isNull19) { / 454 / agg_isNull18 = false; // resultCode could change nullability. 
/ 455 / agg_value20 = agg_value21 + 1L; / 456 / / 457 / } / 458 / agg_isNull15 = agg_isNull18; / 459 / agg_value17 = agg_value20; / 460 / } / 461 / // update unsafe row buffer / 462 / if (!agg_isNull9) { / 463 / agg_unsafeRowAggBuffer.setDouble(0, agg_value11); / 464 / } else { / 465 / agg_unsafeRowAggBuffer.setNullAt(0); / 466 / } / 467 / / 468 / if (!agg_isNull15) { / 469 / agg_unsafeRowAggBuffer.setLong(1, agg_value17); / 470 / } else { / 471 / agg_unsafeRowAggBuffer.setNullAt(1); / 472 / } / 473 / / 474 / } / 475 / / 476 / } / 477 / / 478 / } / 479 / / 480 */ } ``` ## How was this patch tested? Added UT into `WholeStageCodegenSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #20779 from kiszk/SPARK-23598.	2018-03-13 23:04:16 +01:00
Wang Gengliang	10b0657b03	[SPARK-23624][SQL] Revise doc of method pushFilters in Datasource V2 ## What changes were proposed in this pull request? Revise doc of method pushFilters in SupportsPushDownFilters/SupportsPushDownCatalystFilters In `FileSourceStrategy`, all filters except `partitionKeyFilters` (whose references are a subset of the partition keys) need to be evaluated after scanning. Otherwise, Spark will get wrong results from data sources like ORC/Parquet. This PR improves the doc. Author: Wang Gengliang <gengliang.wang@databricks.com> Closes #20769 from gengliangwang/revise_pushdown_doc.	2018-03-09 15:41:19 -08:00
Michał Świtakowski	2ca9bb083c	[SPARK-23173][SQL] Avoid creating corrupt parquet files when loading data from JSON ## What changes were proposed in this pull request? The from_json() function accepts an additional parameter, where the user might specify the schema. The issue is that the specified schema might not be compatible with the data. In particular, the JSON data might be missing data for fields declared as non-nullable in the schema. The from_json() function does not verify the data against such errors. When data with missing fields is sent to the parquet encoder, there is no verification either. The end result is a corrupt parquet file. To avoid corruption, make sure that all fields in the user-specified schema are set to be nullable. Since this changes the behavior of a public function, we need to include it in release notes. The behavior can be reverted by setting `spark.sql.fromJsonForceNullableSchema=false` ## How was this patch tested? Added two new tests. Author: Michał Świtakowski <michal.switakowski@databricks.com> Closes #20694 from mswit-databricks/SPARK-23173.	2018-03-09 14:29:31 -08:00
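The "make every field nullable" step from SPARK-23173 can be sketched as a recursive rewrite of the schema. This is a plain-Python illustration under stated assumptions: the schema is modelled as a list of dicts with `name`, `type`, `nullable` and an optional nested `fields` list, and `as_nullable` / `force_nullable` are hypothetical helpers, not Spark APIs. Only struct-like nesting is handled here.

```python
def as_nullable(field):
    """Return a copy of a field description with nullable=True at every level."""
    result = dict(field, nullable=True)
    if "fields" in field:  # struct-like field with nested children
        result["fields"] = [as_nullable(child) for child in field["fields"]]
    return result


def force_nullable(schema):
    """Apply as_nullable to every top-level field; the input is not mutated."""
    return [as_nullable(field) for field in schema]
```

Returning a copy rather than mutating the user's schema keeps the original declaration available, which matters if the behavior is switched off again through a configuration flag.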
Dilip Biswal	d90e77bd0e	[SPARK-23271][SQL] Parquet output contains only _SUCCESS file after writing an empty dataframe ## What changes were proposed in this pull request? Below are the two cases. ``` SQL case 1 scala> List.empty[String].toDF().rdd.partitions.length res18: Int = 1 ``` When we write the above data frame as parquet, we create a parquet file containing just the schema of the data frame. Case 2 ``` SQL scala> val anySchema = StructType(StructField("anyName", StringType, nullable = false) :: Nil) anySchema: org.apache.spark.sql.types.StructType = StructType(StructField(anyName,StringType,false)) scala> spark.read.schema(anySchema).csv("/tmp/empty_folder").rdd.partitions.length res22: Int = 0 ``` For the 2nd case, since number of partitions = 0, we don't call the write task (the task has logic to create the empty metadata only parquet file) The fix is to create a dummy single partition RDD and set up the write task based on it to ensure the metadata-only file. ## How was this patch tested? A new test is added to DataframeReaderWriterSuite. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #20525 from dilipbiswal/spark-23271.	2018-03-08 14:58:40 -08:00
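The fix described for SPARK-23271 amounts to substituting one empty partition when the input has none, so that exactly one write task still runs. A minimal plain-Python sketch, assuming partitions are lists of rows and an output "file" is a dict; `plan_write_tasks` and `run_write` are hypothetical names, not Spark functions:

```python
def plan_write_tasks(partitions):
    """Return the partitions that write tasks should run over.

    If the input has no partitions, substitute a single empty partition so
    that one task still runs and can emit the metadata-only file.
    """
    if len(partitions) == 0:
        return [[]]
    return partitions


def run_write(partitions, schema):
    """Stand-in for the write job: one output 'file' per task."""
    files = []
    for index, rows in enumerate(plan_write_tasks(partitions)):
        files.append(
            {"name": f"part-{index:05d}", "schema": schema, "rows": list(rows)}
        )
    return files
```

With this substitution, case 2 from the commit message (zero partitions) produces one schema-only file, matching what case 1 (one empty partition) already did.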
Marco Gaido	ea480990e7	[SPARK-23628][SQL] calculateParamLength should not return 1 + num of expressions ## What changes were proposed in this pull request? There was a bug in `calculateParamLength` which caused it to always return 1 + the number of expressions. This could lead to exceptions, especially with expressions of type long. ## How was this patch tested? added UT + fixed previous UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #20772 from mgaido91/SPARK-23628.	2018-03-08 11:09:15 -08:00
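Why "1 + number of expressions" undercounts can be shown with a small sketch. On the JVM, `long` and `double` parameters each take two units of a method's parameter limit and `this` takes one, so a count that treats every expression as one unit is too low for those types. The function below is a plain-Python illustration, not the Spark implementation: `calculate_param_length` is a hypothetical name, parameters are modelled as `(java_type, nullable)` pairs, and the extra unit for a nullable parameter is an assumption about how generated code passes an is-null flag.

```python
# JVM primitive types that occupy two units in a method's parameter list.
TWO_SLOT_TYPES = {"long", "double"}


def calculate_param_length(params):
    """Count parameter units for a list of (java_type, nullable) pairs."""
    total = 1  # one unit for `this`
    for java_type, nullable in params:
        total += 2 if java_type in TWO_SLOT_TYPES else 1
        if nullable:
            total += 1  # assumed: an extra boolean is-null flag per nullable value
    return total
```

For a single non-nullable `long`, the naive count is 2 while this count is 3, and the gap grows with every `long` or `double` parameter.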
Li Jin	2cb23a8f51	[SPARK-23011][SQL][PYTHON] Support alternative function form with group aggregate pandas UDF ## What changes were proposed in this pull request? This PR proposes to support an alternative function form with group aggregate pandas UDF. The current form: ``` def foo(pdf): return ... ``` Takes a single arg that is a pandas DataFrame. With this PR, an alternative form is supported: ``` def foo(key, pdf): return ... ``` The alternative form takes two arguments: a tuple that represents the grouping key, and a pandas DataFrame that represents the data. ## How was this patch tested? GroupbyApplyTests Author: Li Jin <ice.xelloss@gmail.com> Closes #20295 from icexelloss/SPARK-23011-groupby-apply-key.	2018-03-08 20:29:07 +09:00
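One way to support both function forms is to dispatch on the number of parameters the user function declares. The sketch below is an assumption about the general technique, not the PySpark implementation: `apply_grouped` is a hypothetical helper, and plain lists stand in for pandas DataFrames so the example runs with the standard library only.

```python
import inspect


def apply_grouped(func, key, pdf):
    """Call func(pdf) or func(key, pdf), depending on its declared parameters."""
    n_params = len(inspect.signature(func).parameters)
    if n_params == 1:
        return func(pdf)
    if n_params == 2:
        return func(key, pdf)
    raise TypeError(f"expected a function of 1 or 2 arguments, got {n_params}")


def total(pdf):
    # Original form: receives only the group's data.
    return sum(pdf)


def keyed_total(key, pdf):
    # Alternative form: also receives the grouping key as a tuple.
    return (key, sum(pdf))
```

Here `apply_grouped(total, ("a",), [1, 2, 3])` calls the one-argument form, while `apply_grouped(keyed_total, ("a",), [1, 2, 3])` passes the key through as the first argument.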
Xingbo Jiang	ac76eff6a8	[SPARK-23525][SQL] Support ALTER TABLE CHANGE COLUMN COMMENT for external hive table ## What changes were proposed in this pull request? The following query doesn't work as expected: ``` CREATE EXTERNAL TABLE ext_table(a STRING, b INT, c STRING) PARTITIONED BY (d STRING) LOCATION 'sql/core/spark-warehouse/ext_table'; ALTER TABLE ext_table CHANGE a a STRING COMMENT "new comment"; DESC ext_table; ``` The comment of column `a` is not updated, that's because `HiveExternalCatalog.doAlterTable` ignores table schema changes. To fix the issue, we should call `doAlterTableDataSchema` instead of `doAlterTable`. ## How was this patch tested? Updated `DDLSuite.testChangeColumn`. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #20696 from jiangxb1987/alterColumnComment.	2018-03-07 13:51:44 -08:00
Marcelo Vanzin	c99fc9ad9b	[SPARK-23550][CORE] Cleanup `Utils`. A few different things going on: - Remove unused methods. - Move JSON methods to the only class that uses them. - Move test-only methods to TestUtils. - Make getMaxResultSize() a config constant. - Reuse functionality from existing libraries (JRE or JavaUtils) where possible. The change also includes changes to a few tests to call `Utils.createTempFile` correctly, so that temp dirs are created under the designated top-level temp dir instead of potentially polluting git index. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #20706 from vanzin/SPARK-23550.	2018-03-07 13:42:06 -08:00
Wenchen Fan	ad640a5aff	[SPARK-23303][SQL] improve the explain result for data source v2 relations ## What changes were proposed in this pull request? The proposed explain format: [streaming header] [RelationV2/ScanV2] [data source name] [output] [pushed filters] [options] streaming header: if it's a streaming relation, put a "Streaming" at the beginning. RelationV2/ScanV2: if it's a logical plan, put a "RelationV2", else, put a "ScanV2" data source name: the simple class name of the data source implementation output: a string of the plan output attributes pushed filters: a string of all the filters that have been pushed to this data source options: all the options to create the data source reader. The current explain result for data source v2 relation is unreadable: ``` == Parsed Logical Plan == 'Filter ('i > 6) +- AnalysisBarrier +- Project [j#1] +- DataSourceV2Relation [i#0, j#1], org.apache.spark.sql.sources.v2.AdvancedDataSourceV2$Reader3b415940 == Analyzed Logical Plan == j: int Project [j#1] +- Filter (i#0 > 6) +- Project [j#1, i#0] +- DataSourceV2Relation [i#0, j#1], org.apache.spark.sql.sources.v2.AdvancedDataSourceV2$Reader3b415940 == Optimized Logical Plan == Project [j#1] +- Filter isnotnull(i#0) +- DataSourceV2Relation [i#0, j#1], org.apache.spark.sql.sources.v2.AdvancedDataSourceV2$Reader3b415940 == Physical Plan == (1) Project [j#1] +- (1) Filter isnotnull(i#0) +- (1) DataSourceV2Scan [i#0, j#1], org.apache.spark.sql.sources.v2.AdvancedDataSourceV2$Reader3b415940 ``` after this PR ``` == Parsed Logical Plan == 'Project [unresolvedalias('j, None)] +- AnalysisBarrier +- RelationV2 AdvancedDataSourceV2[i#0, j#1] == Analyzed Logical Plan == j: int Project [j#1] +- RelationV2 AdvancedDataSourceV2[i#0, j#1] == Optimized Logical Plan == RelationV2 AdvancedDataSourceV2[j#1] == Physical Plan == (1) ScanV2 AdvancedDataSourceV2[j#1] ``` ------- ``` == Analyzed Logical Plan == i: int, j: int Filter (i#88 > 3) +- RelationV2 JavaAdvancedDataSourceV2[i#88, j#89] == 
Optimized Logical Plan == Filter isnotnull(i#88) +- RelationV2 JavaAdvancedDataSourceV2[i#88, j#89] (Pushed Filters: [GreaterThan(i,3)]) == Physical Plan == (1) Filter isnotnull(i#88) +- (1) ScanV2 JavaAdvancedDataSourceV2[i#88, j#89] (Pushed Filters: [GreaterThan(i,3)]) ``` an example for streaming query ``` == Parsed Logical Plan == Aggregate [value#6], [value#6, count(1) AS count(1)#11L] +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#6] +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#5: java.lang.String +- DeserializeToObject cast(value#25 as string).toString, obj#4: java.lang.String +- Streaming RelationV2 MemoryStreamDataSource[value#25] == Analyzed Logical Plan == value: string, count(1): bigint Aggregate [value#6], [value#6, count(1) AS count(1)#11L] +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#6] +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#5: java.lang.String +- DeserializeToObject cast(value#25 as string).toString, obj#4: java.lang.String +- Streaming RelationV2 MemoryStreamDataSource[value#25] == Optimized Logical Plan == Aggregate [value#6], [value#6, count(1) AS count(1)#11L] +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#6] +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#5: java.lang.String +- DeserializeToObject value#25.toString, obj#4: java.lang.String +- Streaming RelationV2 MemoryStreamDataSource[value#25] == Physical Plan == (4) HashAggregate(keys=[value#6], functions=[count(1)], output=[value#6, count(1)#11L]) +- StateStoreSave [value#6], state info [ 
checkpoint = *******(redacted)/cloud/dev/spark/target/tmp/temporary-549f264b-2531-4fcb-a52f-433c77347c12/state, runId = f84d9da9-2f8c-45c1-9ea1-70791be684de, opId = 0, ver = 0, numPartitions = 5], Complete, 0 +- (3) HashAggregate(keys=[value#6], functions=[merge_count(1)], output=[value#6, count#16L]) +- StateStoreRestore [value#6], state info [ checkpoint = ********(redacted)/cloud/dev/spark/target/tmp/temporary-549f264b-2531-4fcb-a52f-433c77347c12/state, runId = f84d9da9-2f8c-45c1-9ea1-70791be684de, opId = 0, ver = 0, numPartitions = 5] +- (2) HashAggregate(keys=[value#6], functions=[merge_count(1)], output=[value#6, count#16L]) +- Exchange hashpartitioning(value#6, 5) +- (1) HashAggregate(keys=[value#6], functions=[partial_count(1)], output=[value#6, count#16L]) +- (1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#6] +- (1) MapElements <function1>, obj#5: java.lang.String +- (1) DeserializeToObject value#25.toString, obj#4: java.lang.String +- *(1) ScanV2 MemoryStreamDataSource[value#25] ``` ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #20647 from cloud-fan/explain.	2018-03-05 20:35:14 -08:00
Henry Robinson	8c5b34c425	[SPARK-23604][SQL] Change Statistics.isEmpty to !Statistics.hasNonNullValue ## What changes were proposed in this pull request? Parquet 1.9 will change the semantics of Statistics.isEmpty slightly to reflect if the null value count has been set. That breaks a timestamp interoperability test that cares only about whether there are column values present in the statistics of a written file for an INT96 column. Fix by using Statistics.hasNonNullValue instead. ## How was this patch tested? Unit tests continue to pass against Parquet 1.8, and also pass against a Parquet build including PARQUET-1217. Author: Henry Robinson <henry@cloudera.com> Closes #20740 from henryr/spark-23604.	2018-03-05 16:49:24 -08:00
Jose Torres	b0f422c386	[SPARK-23559][SS] Add epoch ID to DataWriterFactory. ## What changes were proposed in this pull request? Add an epoch ID argument to DataWriterFactory for use in streaming. As a side effect of passing in this value, DataWriter will now have a consistent lifecycle; commit() or abort() ends the lifecycle of a DataWriter instance in any execution mode. I considered making a separate streaming interface and adding the epoch ID only to that one, but I think it requires a lot of extra work for no real gain. I think it makes sense to define epoch 0 as the one and only epoch of a non-streaming query. ## How was this patch tested? existing unit tests Author: Jose Torres <jose@databricks.com> Closes #20710 from jose-torres/api2.	2018-03-05 13:23:01 -08:00
Mihaly Toth	a366b950b9	[SPARK-23329][SQL] Fix documentation of trigonometric functions ## What changes were proposed in this pull request? Provide more details in the documentation of trigonometric functions. The descriptions reference `java.lang.Math` for further details. ## How was this patch tested? Ran full build, checked generated documentation manually Author: Mihaly Toth <misutoth@gmail.com> Closes #20618 from misutoth/trigonometric-doc.	2018-03-05 23:46:40 +09:00
Kazuaki Ishizaki	2ce37b50fc	[SPARK-23546][SQL] Refactor stateless methods/values in CodegenContext ## What changes were proposed in this pull request? The current `CodegenContext` class also contains immutable values and methods that do not use mutable state. This refactoring moves them to the `CodeGenerator` object, which can be accessed from anywhere in the program without an instantiated `CodegenContext`. ## How was this patch tested? Existing tests Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #20700 from kiszk/SPARK-23546.	2018-03-05 11:39:01 +01:00
Juliusz Sompolski	dea381dfaa	[SPARK-23514][FOLLOW-UP] Remove more places using sparkContext.hadoopConfiguration directly ## What changes were proposed in this pull request? In https://github.com/apache/spark/pull/20679 I missed a few places in SQL tests. For hygiene, they should also use the sessionState interface where possible. ## How was this patch tested? Modified existing tests. Author: Juliusz Sompolski <julek@databricks.com> Closes #20718 from juliuszsompolski/SPARK-23514-followup.	2018-03-03 09:10:48 +08:00
jerryshao	707e6506d0	[SPARK-23097][SQL][SS] Migrate text socket source to V2 ## What changes were proposed in this pull request? This PR moves structured streaming text socket source to V2. Questions: do we need to remove old "socket" source? ## How was this patch tested? Unit test and manual verification. Author: jerryshao <sshao@hortonworks.com> Closes #20382 from jerryshao/SPARK-23097.	2018-03-02 12:27:42 -08:00
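
Note on SPARK-23011 above: the two function forms it describes, `foo(pdf)` and `foo(key, pdf)`, can be illustrated with a small standalone sketch. This is not Spark's implementation. It assumes plain lists of dicts in place of pandas DataFrames, and the names `apply_per_group`, `total`, and `keyed_total` are hypothetical, introduced only for illustration. The sketch picks the calling form by counting the function's parameters.

```python
import inspect
from collections import defaultdict


def apply_per_group(rows, key_fields, func):
    """Group rows by key_fields and call func once per group.

    Hypothetical helper: func may take (group) or (key, group); the form
    is chosen by counting its parameters. Lists of dicts stand in for
    pandas DataFrames here.
    """
    n_params = len(inspect.signature(func).parameters)
    if n_params not in (1, 2):
        raise ValueError("func must take 1 or 2 arguments")

    # Collect rows per grouping key; the key is always a tuple,
    # even when there is a single grouping field.
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        groups[key].append(row)

    results = []
    for key, group in groups.items():
        out = func(group) if n_params == 1 else func(key, group)
        results.append(out)
    return results


def total(group):
    # One-argument form: only the group's data is available.
    return sum(r["v"] for r in group)


def keyed_total(key, group):
    # Two-argument form: the grouping key arrives as a tuple.
    return {"id": key[0], "total": sum(r["v"] for r in group)}


rows = [{"id": 1, "v": 1.0}, {"id": 1, "v": 2.0}, {"id": 2, "v": 5.0}]
```

Calling `apply_per_group(rows, ["id"], total)` uses the one-argument form, while passing `keyed_total` uses the two-argument form, so the function can read the grouping key directly instead of recovering it from the data.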

4541 commits