ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Wenchen Fan	18ee55dd5d	[SPARK-19148][SQL] do not expose the external table concept in Catalog ## What changes were proposed in this pull request? In https://github.com/apache/spark/pull/16296 , we reached a consensus that we should hide the external/managed table concept to users and only expose custom table path. This PR renames `Catalog.createExternalTable` to `createTable`(still keep the old versions for backward compatibility), and only set the table type to EXTERNAL if `path` is specified in options. ## How was this patch tested? new tests in `CatalogSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #16528 from cloud-fan/create-table.	2017-01-17 12:54:50 +08:00
Liang-Chi Hsieh	61e48f52d1	[SPARK-19082][SQL] Make ignoreCorruptFiles work for Parquet ## What changes were proposed in this pull request? We have a config `spark.sql.files.ignoreCorruptFiles` which can be used to ignore corrupt files when reading files in SQL. Currently the `ignoreCorruptFiles` config has two issues and can't work for Parquet: 1. We only ignore corrupt files in `FileScanRDD` . Actually, we begin to read those files as early as inferring data schema from the files. For corrupt files, we can't read the schema and fail the program. A related issue reported at http://apache-spark-developers-list.1001551.n3.nabble.com/Skip-Corrupted-Parquet-blocks-footer-tc20418.html 2. In `FileScanRDD`, we assume that we only begin to read the files when starting to consume the iterator. However, it is possibly the files are read before that. In this case, `ignoreCorruptFiles` config doesn't work too. This patch targets Parquet datasource. If this direction is ok, we can address the same issue for other datasources like Orc. Two main changes in this patch: 1. Replace `ParquetFileReader.readAllFootersInParallel` by implementing the logic to read footers in multi-threaded manner We can't ignore corrupt files if we use `ParquetFileReader.readAllFootersInParallel`. So this patch implements the logic to do the similar thing in `readParquetFootersInParallel`. 2. In `FileScanRDD`, we need to ignore corrupt file too when we call `readFunction` to return iterator. One thing to notice is: We read schema from Parquet file's footer. The method to read footer `ParquetFileReader.readFooter` throws `RuntimeException`, instead of `IOException`, if it can't successfully read the footer. Please check out `df9d8e4154/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java (L470)`. So this patch catches `RuntimeException`. One concern is that it might also shadow other runtime exceptions other than reading corrupt files. ## How was this patch tested? Jenkins tests. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #16474 from viirya/fix-ignorecorrupted-parquet-files.	2017-01-16 15:26:41 +08:00
gatorsmile	de62ddf7ff	[SPARK-19120] Refresh Metadata Cache After Loading Hive Tables ### What changes were proposed in this pull request? ```Scala sql("CREATE TABLE tab (a STRING) STORED AS PARQUET") // This table fetch is to fill the cache with zero leaf files spark.table("tab").show() sql( s""" \|LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE \|INTO TABLE tab """.stripMargin) spark.table("tab").show() ``` In the above example, the returned result is empty after table loading. The metadata cache could be out of dated after loading new data into the table, because loading/inserting does not update the cache. So far, the metadata cache is only used for data source tables. Thus, for Hive serde tables, only `parquet` and `orc` formats are facing such issues, because the Hive serde tables in the format of parquet/orc could be converted to data source tables when `spark.sql.hive.convertMetastoreParquet`/`spark.sql.hive.convertMetastoreOrc` is on. This PR is to refresh the metadata cache after processing the `LOAD DATA` command. In addition, Spark SQL does not convert partitioned Hive tables (orc/parquet) to data source tables in the write path, but the read path is using the metadata cache for both partitioned and non-partitioned Hive tables (orc/parquet). That means, writing the partitioned parquet/orc tables still use `InsertIntoHiveTable`, instead of `InsertIntoHadoopFsRelationCommand`. To avoid reading the out-of-dated cache, `InsertIntoHiveTable` needs to refresh the metadata cache for partitioned tables. Note, it does not need to refresh the cache for non-partitioned parquet/orc tables, because it does not call `InsertIntoHiveTable` at all. Based on the comments, this PR will keep the existing logics unchanged. That means, we always refresh the table no matter whether the table is partitioned or not. ### How was this patch tested? Added test cases in parquetSuites.scala Author: gatorsmile <gatorsmile@gmail.com> Closes #16500 from gatorsmile/refreshInsertIntoHiveTable.	2017-01-15 20:40:44 +08:00
Tsuyoshi Ozawa	9112f31bb8	[SPARK-19207][SQL] LocalSparkSession should use Slf4JLoggerFactory.INSTANCE ## What changes were proposed in this pull request? Using Slf4JLoggerFactory.INSTANCE instead of creating Slf4JLoggerFactory's object with constructor. It's deprecated. ## How was this patch tested? With running StateStoreRDDSuite. Author: Tsuyoshi Ozawa <ozawa@apache.org> Closes #16570 from oza/SPARK-19207.	2017-01-15 11:11:21 +00:00
windpiger	8942353905	[SPARK-19151][SQL] DataFrameWriter.saveAsTable support hive overwrite ## What changes were proposed in this pull request? After [SPARK-19107](https://issues.apache.org/jira/browse/SPARK-19107), we now can treat hive as a data source and create hive tables with DataFrameWriter and Catalog. However, the support is not completed, there are still some cases we do not support. This PR implement: DataFrameWriter.saveAsTable work with hive format with overwrite mode ## How was this patch tested? unit test added Author: windpiger <songjun@outlook.com> Closes #16549 from windpiger/saveAsTableWithHiveOverwrite.	2017-01-14 10:53:33 -08:00
Yucai Yu	ad0dadaa25	[SPARK-19180] [SQL] the offset of short should be 2 in OffHeapColumn ## What changes were proposed in this pull request? the offset of short is 4 in OffHeapColumnVector's putShorts, but actually it should be 2. ## How was this patch tested? unit test Author: Yucai Yu <yucai.yu@intel.com> Closes #16555 from yucai/offheap_short.	2017-01-13 13:40:53 -08:00
Andrew Ash	b040cef2ed	Fix missing close-parens for In filter's toString Otherwise the open parentheses isn't closed in query plan descriptions of batch scans. PushedFilters: [In(COL_A, [1,2,4,6,10,16,219,815], IsNotNull(COL_B), ... Author: Andrew Ash <andrew@andrewash.com> Closes #16558 from ash211/patch-9.	2017-01-12 23:14:07 -08:00
Wenchen Fan	6b34e745bb	[SPARK-19178][SQL] convert string of large numbers to int should return null ## What changes were proposed in this pull request? When we convert a string to integral, we will convert that string to `decimal(20, 0)` first, so that we can turn a string with decimal format to truncated integral, e.g. `CAST('1.2' AS int)` will return `1`. However, this brings problems when we convert a string with large numbers to integral, e.g. `CAST('1234567890123' AS int)` will return `1912276171`, while Hive returns null as we expected. This is a long standing bug(seems it was there the first day Spark SQL was created), this PR fixes this bug by adding the native support to convert `UTF8String` to integral. ## How was this patch tested? new regression tests Author: Wenchen Fan <wenchen@databricks.com> Closes #16550 from cloud-fan/string-to-int.	2017-01-12 22:52:34 -08:00
gatorsmile	3356b8b6a9	[SPARK-19092][SQL] Save() API of DataFrameWriter should not scan all the saved files ### What changes were proposed in this pull request? `DataFrameWriter`'s [save() API](`5d38f09f47/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala (L207)`) is performing a unnecessary full filesystem scan for the saved files. The save() API is the most basic/core API in `DataFrameWriter`. We should avoid it. The related PR: https://github.com/apache/spark/pull/16090 ### How was this patch tested? Updated the existing test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #16481 from gatorsmile/saveFileScan.	2017-01-13 13:05:53 +08:00
Takeshi YAMAMURO	5585ed93b0	[SPARK-17237][SQL] Remove backticks in a pivot result schema ## What changes were proposed in this pull request? Pivoting adds backticks (e.g. 3_count(\`c\`)) in column names and, in some cases, thes causes analysis exceptions like; ``` scala> val df = Seq((2, 3, 4), (3, 4, 5)).toDF("a", "x", "y") scala> df.groupBy("a").pivot("x").agg(count("y"), avg("y")).na.fill(0) org.apache.spark.sql.AnalysisException: syntax error in attribute name: `3_count(`y`)`; at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:134) at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:144) ... ``` So, this pr proposes to remove these backticks from column names. ## How was this patch tested? Added a test in `DataFrameAggregateSuite`. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #14812 from maropu/SPARK-17237.	2017-01-12 09:46:53 -08:00
Wenchen Fan	871d266649	[SPARK-18969][SQL] Support grouping by nondeterministic expressions ## What changes were proposed in this pull request? Currently nondeterministic expressions are allowed in `Aggregate`(see the [comment](https://github.com/apache/spark/blob/v2.0.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L249-L251)), but the `PullOutNondeterministic` analyzer rule failed to handle `Aggregate`, this PR fixes it. close https://github.com/apache/spark/pull/16379 There is still one remaining issue: `SELECT a + rand() FROM t GROUP BY a + rand()` is not allowed, because the 2 `rand()` are different(we generate random seed as the default seed for `rand()`). https://issues.apache.org/jira/browse/SPARK-19035 is tracking this issue. ## How was this patch tested? a new test suite Author: Wenchen Fan <wenchen@databricks.com> Closes #16404 from cloud-fan/groupby.	2017-01-12 20:21:04 +08:00
Eric Liang	c71b25481a	[SPARK-19183][SQL] Add deleteWithJob hook to internal commit protocol API ## What changes were proposed in this pull request? Currently in SQL we implement overwrites by calling fs.delete() directly on the original data. This is not ideal since we the original files end up deleted even if the job aborts. We should extend the commit protocol to allow file overwrites to be managed as well. ## How was this patch tested? Existing tests. I also fixed a bunch of tests that were depending on the commit protocol implementation being set to the legacy mapreduce one. cc rxin cloud-fan Author: Eric Liang <ekl@databricks.com> Author: Eric Liang <ekhliang@gmail.com> Closes #16554 from ericl/add-delete-protocol.	2017-01-12 17:45:55 +08:00
hyukjinkwon	24100f162d	[SPARK-16848][SQL] Check schema validation for user-specified schema in jdbc and table APIs ## What changes were proposed in this pull request? This PR proposes to throw an exception for both jdbc APIs when user specified schemas are not allowed or useless. DataFrameReader.jdbc(...) ``` scala spark.read.schema(StructType(Nil)).jdbc(...) ``` DataFrameReader.table(...) ```scala spark.read.schema(StructType(Nil)).table("usrdb.test") ``` ## How was this patch tested? Unit test in `JDBCSuite` and `DataFrameReaderWriterSuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #14451 from HyukjinKwon/SPARK-16848.	2017-01-11 21:03:48 -08:00
jiangxingbo	30a07071f0	[SPARK-18801][SQL] Support resolve a nested view ## What changes were proposed in this pull request? We should be able to resolve a nested view. The main advantage is that if you update an underlying view, the current view also gets updated. The new approach should be compatible with older versions of SPARK/HIVE, that means: 1. The new approach should be able to resolve the views that created by older versions of SPARK/HIVE; 2. The new approach should be able to resolve the views that are currently supported by SPARK SQL. The new approach mainly brings in the following changes: 1. Add a new operator called `View` to keep track of the CatalogTable that describes the view, and the output attributes as well as the child of the view; 2. Update the `ResolveRelations` rule to resolve the relations and views, note that a nested view should be resolved correctly; 3. Add `viewDefaultDatabase` variable to `CatalogTable` to keep track of the default database name used to resolve a view, if the `CatalogTable` is not a view, then the variable should be `None`; 4. Add `AnalysisContext` to enable us to still support a view created with CTE/Windows query; 5. Enables the view support without enabling Hive support (i.e., enableHiveSupport); 6. Fix a weird behavior: the result of a view query may have different schema if the referenced table has been changed. After this PR, we try to cast the child output attributes to that from the view schema, throw an AnalysisException if cast is not allowed. Note this is compatible with the views defined by older versions of Spark(before 2.2), which have empty `defaultDatabase` and all the relations in `viewText` have database part defined. ## How was this patch tested? 1. Add new tests in `SessionCatalogSuite` to test the function `lookupRelation`; 2. Add new test case in `SQLViewSuite` to test resolve a nested view. Author: jiangxingbo <jiangxb1987@gmail.com> Closes #16233 from jiangxb1987/resolve-view.	2017-01-11 13:44:07 -08:00
wangzhenhua	a615513569	[SPARK-19149][SQL] Unify two sets of statistics in LogicalPlan ## What changes were proposed in this pull request? Currently we have two sets of statistics in LogicalPlan: a simple stats and a stats estimated by cbo, but the computing logic and naming are quite confusing, we need to unify these two sets of stats. ## How was this patch tested? Just modify existing tests. Author: wangzhenhua <wangzhenhua@huawei.com> Author: Zhenhua Wang <wzh_zju@163.com> Closes #16529 from wzhfy/unifyStats.	2017-01-10 22:34:44 -08:00
Wenchen Fan	3b19c74e71	[SPARK-19157][SQL] should be able to change spark.sql.runSQLOnFiles at runtime ## What changes were proposed in this pull request? The analyzer rule that supports to query files directly will be added to `Analyzer.extendedResolutionRules` when SparkSession is created, according to the `spark.sql.runSQLOnFiles` flag. If the flag is off when we create `SparkSession`, this rule is not added and we can not query files directly even we turn on the flag later. This PR fixes this bug by always adding that rule to `Analyzer.extendedResolutionRules`. ## How was this patch tested? new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #16531 from cloud-fan/sql-on-files.	2017-01-10 21:33:44 -08:00
Shixiong Zhu	bc6c56e940	[SPARK-19140][SS] Allow update mode for non-aggregation streaming queries ## What changes were proposed in this pull request? This PR allow update mode for non-aggregation streaming queries. It will be same as the append mode if a query has no aggregations. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16520 from zsxwing/update-without-agg.	2017-01-10 17:58:11 -08:00
Dongjoon Hyun	d5b1dc934a	[SPARK-19137][SQL] Fix `withSQLConf` to reset `OptionalConfigEntry` correctly ## What changes were proposed in this pull request? `DataStreamReaderWriterSuite` makes test files in source folder like the followings. Interestingly, the root cause is `withSQLConf` fails to reset `OptionalConfigEntry` correctly. In other words, it resets the config into `Some(undefined)`. ```bash $ git status Untracked files: (use "git add <file>..." to include in what will be committed) sql/core/%253Cundefined%253E/ sql/core/%3Cundefined%3E/ ``` ## How was this patch tested? Manual. ``` build/sbt "project sql" test git status ``` Author: Dongjoon Hyun <dongjoon@apache.org> Closes #16522 from dongjoon-hyun/SPARK-19137.	2017-01-10 10:49:44 -08:00
Shixiong Zhu	3ef183a941	[SPARK-19113][SS][TESTS] Set UncaughtExceptionHandler in onQueryStarted to ensure catching fatal errors during query initialization ## What changes were proposed in this pull request? StreamTest sets `UncaughtExceptionHandler` after starting the query now. It may not be able to catch fatal errors during query initialization. This PR uses `onQueryStarted` callback to fix it. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16492 from zsxwing/SPARK-19113.	2017-01-10 14:24:45 +00:00
Dongjoon Hyun	a2c6adcc5d	[SPARK-18857][SQL] Don't use `Iterator.duplicate` for `incrementalCollect` in Thrift Server ## What changes were proposed in this pull request? To support `FETCH_FIRST`, SPARK-16563 used Scala `Iterator.duplicate`. However, Scala `Iterator.duplicate` uses a queue to buffer all items between both iterators, this causes GC and hangs for queries with large number of rows. We should not use this, especially for `spark.sql.thriftServer.incrementalCollect`. https://github.com/scala/scala/blob/2.12.x/src/library/scala/collection/Iterator.scala#L1262-L1300 ## How was this patch tested? Pass the existing tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #16440 from dongjoon-hyun/SPARK-18857.	2017-01-10 13:27:55 +00:00
hyukjinkwon	4e27578faa	[SPARK-18922][SQL][CORE][STREAMING][TESTS] Fix all identified tests failed due to path and resource-not-closed problems on Windows ## What changes were proposed in this pull request? This PR proposes to fix all the test failures identified by testing with AppVeyor. Scala - aborted tests ``` WindowQuerySuite: Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.hive.execution.WindowQuerySuite * ABORTED * (156 milliseconds) org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: C:projectssparksqlhive argetscala-2.11 est-classesdatafilespart_tiny.txt; OrcSourceSuite: Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.hive.orc.OrcSourceSuite * ABORTED * (62 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ParquetMetastoreSuite: Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.hive.ParquetMetastoreSuite * ABORTED * (4 seconds, 703 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ParquetSourceSuite: Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.hive.ParquetSourceSuite * ABORTED * (3 seconds, 907 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-581a6575-454f-4f21-a516-a07f95266143; KafkaRDDSuite: Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka.KafkaRDDSuite * ABORTED * (5 seconds, 212 milliseconds) java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-4722304d-213e-4296-b556-951df1a46807 DirectKafkaStreamSuite: Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka.DirectKafkaStreamSuite * ABORTED * (7 seconds, 127 milliseconds) java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-d0d3eba7-4215-4e10-b40e-bb797e89338e at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010) ReliableKafkaStreamSuite Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka.ReliableKafkaStreamSuite * ABORTED * (5 seconds, 498 milliseconds) java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-d33e45a0-287e-4bed-acae-ca809a89d888 KafkaStreamSuite: Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka.KafkaStreamSuite * ABORTED * (2 seconds, 892 milliseconds) java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-59c9d169-5a56-4519-9ef0-cefdbd3f2e6c KafkaClusterSuite: Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka.KafkaClusterSuite * ABORTED * (1 second, 690 milliseconds) java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-3ef402b0-8689-4a60-85ae-e41e274f179d DirectKafkaStreamSuite: Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka010.DirectKafkaStreamSuite * ABORTED * (59 seconds, 626 milliseconds) java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-426107da-68cf-4d94-b0d6-1f428f1c53f6 KafkaRDDSuite: Exception encountered when attempting to run a suite with class name: org.apache.spark.streaming.kafka010.KafkaRDDSuite * ABORTED * (2 minutes, 6 seconds) java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-b9ce7929-5dae-46ab-a0c4-9ef6f58fbc2 ``` Java - failed tests ``` Test org.apache.spark.streaming.kafka.JavaKafkaRDDSuite.testKafkaRDD failed: java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-1cee32f4-4390-4321-82c9-e8616b3f0fb0, took 9.61 sec Test org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream failed: java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-f42695dd-242e-4b07-847c-f299b8e4676e, took 11.797 sec Test org.apache.spark.streaming.kafka.JavaDirectKafkaStreamSuite.testKafkaStream failed: java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-85c0d062-78cf-459c-a2dd-7973572101ce, took 1.581 sec Test org.apache.spark.streaming.kafka010.JavaKafkaRDDSuite.testKafkaRDD failed: java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-49eb6b5c-8366-47a6-83f2-80c443c48280, took 17.895 sec org.apache.spark.streaming.kafka010.JavaDirectKafkaStreamSuite.testKafkaStream failed: java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-898cf826-d636-4b1c-a61a-c12a364c02e7, took 8.858 sec ``` Scala - failed tests ``` PartitionProviderCompatibilitySuite: - insert overwrite partition of new datasource table overwrites just partition * FAILED * (828 milliseconds) java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-bb6337b9-4f99-45ab-ad2c-a787ab965c09 - SPARK-18635 special chars in partition values - partition management true * FAILED * (5 seconds, 360 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - SPARK-18635 special chars in partition values - partition management false * FAILED * (141 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ``` ``` UtilsSuite: - reading offset bytes of a file (compressed) * FAILED * (0 milliseconds) java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-ecb2b7d5-db8b-43a7-b268-1bf242b5a491 - reading offset bytes across multiple files (compressed) * FAILED * (0 milliseconds) java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-25cc47a8-1faa-4da5-8862-cf174df63ce0 ``` ``` StatisticsSuite: - MetastoreRelations fallback to HDFS for size estimation * FAILED * (110 milliseconds) org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'csv_table' not found in database 'default'; ``` ``` SQLQuerySuite: - permanent UDTF * FAILED * (125 milliseconds) org.apache.spark.sql.AnalysisException: Undefined function: 'udtf_count_temp'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 24 - describe functions - user defined functions * FAILED * (125 milliseconds) org.apache.spark.sql.AnalysisException: Undefined function: 'udtf_count'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7 - CTAS without serde with location * FAILED * (16 milliseconds) java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:projectsspark%09arget%09mpspark-ed673d73-edfc-404e-829e-2e2b9725d94e/c1 - derived from Hive query file: drop_database_removes_partition_dirs.q * FAILED * (47 milliseconds) java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:projectsspark%09arget%09mpspark-d2ddf08e-699e-45be-9ebd-3dfe619680fe/drop_database_removes_partition_dirs_table - derived from Hive query file: drop_table_removes_partition_dirs.q * FAILED * (0 milliseconds) java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:projectsspark%09arget%09mpspark-d2ddf08e-699e-45be-9ebd-3dfe619680fe/drop_table_removes_partition_dirs_table2 - SPARK-17796 Support wildcard character in filename for LOAD DATA LOCAL INPATH * FAILED * (109 milliseconds) java.nio.file.InvalidPathException: Illegal char <:> at index 2: /C:/projects/spark/sql/hive/projectsspark arget mpspark-1a122f8c-dfb3-46c4-bab1-f30764baee0e/part-r ``` ``` HiveDDLSuite: - drop external tables in default database * FAILED * (16 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - add/drop partitions - external table * FAILED * (16 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - create/drop database - location without pre-created directory * FAILED * (16 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - create/drop database - location with pre-created directory * FAILED * (32 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - drop database containing tables - CASCADE * FAILED * (94 milliseconds) CatalogDatabase(db1,,file:/C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be/db1.db,Map()) did not equal CatalogDatabase(db1,,file:C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be\db1.db,Map()) (HiveDDLSuite.scala:675) - drop an empty database - CASCADE * FAILED * (63 milliseconds) CatalogDatabase(db1,,file:/C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be/db1.db,Map()) did not equal CatalogDatabase(db1,,file:C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be\db1.db,Map()) (HiveDDLSuite.scala:675) - drop database containing tables - RESTRICT * FAILED * (47 milliseconds) CatalogDatabase(db1,,file:/C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be/db1.db,Map()) did not equal CatalogDatabase(db1,,file:C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be\db1.db,Map()) (HiveDDLSuite.scala:675) - drop an empty database - RESTRICT * FAILED * (47 milliseconds) CatalogDatabase(db1,,file:/C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be/db1.db,Map()) did not equal CatalogDatabase(db1,,file:C:/projects/spark/target/tmp/warehouse-d0665ee0-1e39-4805-b471-0b764f7838be\db1.db,Map()) (HiveDDLSuite.scala:675) - CREATE TABLE LIKE an external data source table * FAILED * (140 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-c5eba16d-07ae-4186-95bb-21c5811cf888; - CREATE TABLE LIKE an external Hive serde table * FAILED * (16 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - desc table for data source table - no user-defined schema * FAILED * (125 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-e8bf5bf5-721a-4cbe-9d6 at scala.collection.immutable.List.foreach(List.scala:381)d-5543a8301c1d; ``` ``` MetastoreDataSourcesSuite - CTAS: persisted bucketed data source table * FAILED * (16 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string ``` ``` ShowCreateTableSuite: - simple external hive table * FAILED * (0 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ``` ``` PartitionedTablePerfStatsSuite: - hive table: partitioned pruned table reports only selected files * FAILED * (313 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - datasource table: partitioned pruned table reports only selected files * FAILED * (219 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-311f45f8-d064-4023-a4bb-e28235bff64d; - hive table: lazy partition pruning reads only necessary partition data * FAILED * (203 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - datasource table: lazy partition pruning reads only necessary partition data * FAILED * (187 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-fde874ca-66bd-4d0b-a40f-a043b65bf957; - hive table: lazy partition pruning with file status caching enabled * FAILED * (188 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - datasource table: lazy partition pruning with file status caching enabled * FAILED * (187 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-e6d20183-dd68-4145-acbe-4a509849accd; - hive table: file status caching respects refresh table and refreshByPath * FAILED * (172 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - datasource table: file status caching respects refresh table and refreshByPath * FAILED * (203 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-8b2c9651-2adf-4d58-874f-659007e21463; - hive table: file status cache respects size limit * FAILED * (219 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - datasource table: file status cache respects size limit * FAILED * (171 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-7835ab57-cb48-4d2c-bb1d-b46d5a4c47e4; - datasource table: table setup does not scan filesystem * FAILED * (266 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-20598d76-c004-42a7-8061-6c56f0eda5e2; - hive table: table setup does not scan filesystem * FAILED * (266 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - hive table: num hive client calls does not scale with partition count * FAILED * (2 seconds, 281 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - datasource table: num hive client calls does not scale with partition count * FAILED * (2 seconds, 422 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-4cfed321-4d1d-4b48-8d34-5c169afff383; - hive table: files read and cached when filesource partition management is off * FAILED * (234 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - datasource table: all partition data cached in memory when partition management is off * FAILED * (203 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-4bcc0398-15c9-4f6a-811e-12d40f3eec12; - SPARK-18700: table loaded only once even when resolved concurrently * FAILED * (1 second, 266 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ``` ``` HiveSparkSubmitSuite: - temporary Hive UDF: define a UDF and use it * FAILED * (2 seconds, 94 milliseconds) java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified - permanent Hive UDF: define a UDF and use it * FAILED * (281 milliseconds) java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified - permanent Hive UDF: use a already defined permanent function * FAILED * (718 milliseconds) java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified - SPARK-8368: includes jars passed in through --jars * FAILED * (3 seconds, 521 milliseconds) java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified - SPARK-8020: set sql conf in spark conf * FAILED * (0 milliseconds) java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified - SPARK-8489: MissingRequirementError during reflection * FAILED * (94 milliseconds) java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified - SPARK-9757 Persist Parquet relation with decimal column * FAILED * (16 milliseconds) java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified - SPARK-11009 fix wrong result of Window function in cluster mode * FAILED * (16 milliseconds) java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified - SPARK-14244 fix window partition size attribute binding failure * FAILED * (78 milliseconds) java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified - set spark.sql.warehouse.dir * FAILED * (16 milliseconds) java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified - set hive.metastore.warehouse.dir * FAILED * (15 milliseconds) java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified - SPARK-16901: set javax.jdo.option.ConnectionURL * FAILED * (16 milliseconds) java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified - SPARK-18360: default table path of tables in default database should depend on the location of default database * FAILED * (15 milliseconds) java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified ``` ``` UtilsSuite: - resolveURIs with multiple paths * FAILED * (0 milliseconds) ".../jar3,file:/C:/pi.py[%23]py.pi,file:/C:/path%..." did not equal ".../jar3,file:/C:/pi.py[#]py.pi,file:/C:/path%..." (UtilsSuite.scala:468) ``` ``` CheckpointSuite: - recovery with file input stream * FAILED * (10 seconds, 205 milliseconds) The code passed to eventually never returned normally. Attempted 660 times over 10.014272499999999 seconds. Last failure message: Unexpected internal error near index 1 \ ^. (CheckpointSuite.scala:680) ``` ## How was this patch tested? Manually via AppVeyor as below: Scala - aborted tests ``` WindowQuerySuite - all passed OrcSourceSuite: - SPARK-18220: read Hive orc table with varchar column * FAILED * (4 seconds, 417 milliseconds) org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:625) ParquetMetastoreSuite - all passed ParquetSourceSuite - all passed KafkaRDDSuite - all passed DirectKafkaStreamSuite - all passed ReliableKafkaStreamSuite - all passed KafkaStreamSuite - all passed KafkaClusterSuite - all passed DirectKafkaStreamSuite - all passed KafkaRDDSuite - all passed ``` Java - failed tests ``` org.apache.spark.streaming.kafka.JavaKafkaRDDSuite - all passed org.apache.spark.streaming.kafka.JavaDirectKafkaStreamSuite - all passed org.apache.spark.streaming.kafka.JavaKafkaStreamSuite - all passed org.apache.spark.streaming.kafka010.JavaDirectKafkaStreamSuite - all passed org.apache.spark.streaming.kafka010.JavaKafkaRDDSuite - all passed ``` Scala - failed tests ``` PartitionProviderCompatibilitySuite: - insert overwrite partition of new datasource table overwrites just partition (1 second, 953 milliseconds) - SPARK-18635 special chars in partition values - partition management true (6 seconds, 31 milliseconds) - SPARK-18635 special chars in partition values - partition management false (4 seconds, 578 milliseconds) ``` ``` UtilsSuite: - reading offset bytes of a file (compressed) (203 milliseconds) - reading offset bytes across multiple files (compressed) (0 milliseconds) ``` ``` StatisticsSuite: - MetastoreRelations fallback to HDFS for size estimation (94 milliseconds) ``` ``` SQLQuerySuite: - permanent UDTF (407 milliseconds) - describe functions - user defined functions (441 milliseconds) - CTAS without serde with location (2 seconds, 831 milliseconds) - derived from Hive query file: drop_database_removes_partition_dirs.q (734 milliseconds) - derived from Hive query file: drop_table_removes_partition_dirs.q (563 milliseconds) - SPARK-17796 Support wildcard character in filename for LOAD DATA LOCAL INPATH (453 milliseconds) ``` ``` HiveDDLSuite: - drop external tables in default database (3 seconds, 5 milliseconds) - add/drop partitions - external table (2 seconds, 750 milliseconds) - create/drop database - location without pre-created directory (500 milliseconds) - create/drop database - location with pre-created directory (407 milliseconds) - drop database containing tables - CASCADE (453 milliseconds) - drop an empty database - CASCADE (375 milliseconds) - drop database containing tables - RESTRICT (328 milliseconds) - drop an empty database - RESTRICT (391 milliseconds) - CREATE TABLE LIKE an external data source table (953 milliseconds) - CREATE TABLE LIKE an external Hive serde table (3 seconds, 782 milliseconds) - desc table for data source table - no user-defined schema (1 second, 150 milliseconds) ``` ``` MetastoreDataSourcesSuite - CTAS: persisted bucketed data source table (875 milliseconds) ``` ``` ShowCreateTableSuite: - simple external hive table (78 milliseconds) ``` ``` PartitionedTablePerfStatsSuite: - hive table: partitioned pruned table reports only selected files (1 second, 109 milliseconds) - datasource table: partitioned pruned table reports only selected files (860 milliseconds) - hive table: lazy partition pruning reads only necessary partition data (859 milliseconds) - datasource table: lazy partition pruning reads only necessary partition data (1 second, 219 milliseconds) - hive table: lazy partition pruning with file status caching enabled (875 milliseconds) - datasource table: lazy partition pruning with file status caching enabled (890 milliseconds) - hive table: file status caching respects refresh table and refreshByPath (922 milliseconds) - datasource table: file status caching respects refresh table and refreshByPath (640 milliseconds) - hive table: file status cache respects size limit (469 milliseconds) - datasource table: file status cache respects size limit (453 milliseconds) - datasource table: table setup does not scan filesystem (328 milliseconds) - hive table: table setup does not scan filesystem (313 milliseconds) - hive table: num hive client calls does not scale with partition count (5 seconds, 431 milliseconds) - datasource table: num hive client calls does not scale with partition count (4 seconds, 79 milliseconds) - hive table: files read and cached when filesource partition management is off (656 milliseconds) - datasource table: all partition data cached in memory when partition management is off (484 milliseconds) - SPARK-18700: table loaded only once even when resolved concurrently (2 seconds, 578 milliseconds) ``` ``` HiveSparkSubmitSuite: - temporary Hive UDF: define a UDF and use it (1 second, 745 milliseconds) - permanent Hive UDF: define a UDF and use it (406 milliseconds) - permanent Hive UDF: use a already defined permanent function (375 milliseconds) - SPARK-8368: includes jars passed in through --jars (391 milliseconds) - SPARK-8020: set sql conf in spark conf (156 milliseconds) - SPARK-8489: MissingRequirementError during reflection (187 milliseconds) - SPARK-9757 Persist Parquet relation with decimal column (157 milliseconds) - SPARK-11009 fix wrong result of Window function in cluster mode (156 milliseconds) - SPARK-14244 fix window partition size attribute binding failure (156 milliseconds) - set spark.sql.warehouse.dir (172 milliseconds) - set hive.metastore.warehouse.dir (156 milliseconds) - SPARK-16901: set javax.jdo.option.ConnectionURL (157 milliseconds) - SPARK-18360: default table path of tables in default database should depend on the location of default database (172 milliseconds) ``` ``` UtilsSuite: - resolveURIs with multiple paths (0 milliseconds) ``` ``` CheckpointSuite: - recovery with file input stream (4 seconds, 452 milliseconds) ``` Note: after resolving the aborted tests, there is a test failure identified as below: ``` OrcSourceSuite: - SPARK-18220: read Hive orc table with varchar column * FAILED * (4 seconds, 417 milliseconds) org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:625) ``` This does not look due to this problem so this PR does not fix it here. Author: hyukjinkwon <gurwls223@gmail.com> Closes #16451 from HyukjinKwon/all-path-resource-fixes.	2017-01-10 13:19:21 +00:00
Wenchen Fan	b0319c2ecb	[SPARK-19107][SQL] support creating hive table with DataFrameWriter and Catalog ## What changes were proposed in this pull request? After unifying the CREATE TABLE syntax in https://github.com/apache/spark/pull/16296, it's pretty easy to support creating hive table with `DataFrameWriter` and `Catalog` now. This PR basically just removes the hive provider check in `DataFrameWriter.saveAsTable` and `Catalog.createExternalTable`, and add tests. ## How was this patch tested? new tests in `HiveDDLSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #16487 from cloud-fan/hive-table.	2017-01-10 19:26:51 +08:00
Burak Yavuz	faabe69cc0	[SPARK-18952] Regex strings not properly escaped in codegen for aggregations ## What changes were proposed in this pull request? If I use the function regexp_extract, and then in my regex string, use `\`, i.e. escape character, this fails codegen, because the `\` character is not properly escaped when codegen'd. Example stack trace: ``` /* 059 / private int maxSteps = 2; / 060 / private int numRows = 0; / 061 / private org.apache.spark.sql.types.StructType keySchema = new org.apache.spark.sql.types.StructType().add("date_format(window#325.start, yyyy-MM-dd HH:mm)", org.apache.spark.sql.types.DataTypes.StringType) / 062 / .add("regexp_extract(source#310.description, ([a-zA-Z]+)\[., 1)", org.apache.spark.sql.types.DataTypes.StringType); /* 063 / private org.apache.spark.sql.types.StructType valueSchema = new org.apache.spark.sql.types.StructType().add("sum", org.apache.spark.sql.types.DataTypes.LongType); / 064 */ private Object emptyVBase; ... org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 62, Column 58: Invalid escape sequence at org.codehaus.janino.Scanner.scanLiteralCharacter(Scanner.java:918) at org.codehaus.janino.Scanner.produce(Scanner.java:604) at org.codehaus.janino.Parser.peekRead(Parser.java:3239) at org.codehaus.janino.Parser.parseArguments(Parser.java:3055) at org.codehaus.janino.Parser.parseSelector(Parser.java:2914) at org.codehaus.janino.Parser.parseUnaryExpression(Parser.java:2617) at org.codehaus.janino.Parser.parseMultiplicativeExpression(Parser.java:2573) at org.codehaus.janino.Parser.parseAdditiveExpression(Parser.java:2552) ``` In the codegend expression, the literal should use `\\` instead of `\` A similar problem was solved here: https://github.com/apache/spark/pull/15156. ## How was this patch tested? Regression test in `DataFrameAggregationSuite` Author: Burak Yavuz <brkyvz@gmail.com> Closes #16361 from brkyvz/reg-break.	2017-01-09 14:25:38 -08:00
anabranch	19d9d4c855	[SPARK-19126][DOCS] Update Join Documentation Across Languages ## What changes were proposed in this pull request? - [X] Make sure all join types are clearly mentioned - [X] Make join labeling/style consistent - [X] Make join label ordering docs the same - [X] Improve join documentation according to above for Scala - [X] Improve join documentation according to above for Python - [X] Improve join documentation according to above for R ## How was this patch tested? No tests b/c docs. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: anabranch <wac.chambers@gmail.com> Closes #16504 from anabranch/SPARK-19126.	2017-01-08 20:37:46 -08:00
anabranch	1f6ded6455	[SPARK-19127][DOCS] Update Rank Function Documentation ## What changes were proposed in this pull request? - [X] Fix inconsistencies in function reference for dense rank and dense - [X] Make all languages equivalent in their reference to `dense_rank` and `rank`. ## How was this patch tested? N/A for docs. Please review http://spark.apache.org/contributing.html before opening a pull request. Author: anabranch <wac.chambers@gmail.com> Closes #16505 from anabranch/SPARK-19127.	2017-01-08 17:53:53 -08:00
Dilip Biswal	4351e62207	[SPARK-19093][SQL] Cached tables are not used in SubqueryExpression ## What changes were proposed in this pull request? Consider the plans inside subquery expressions while looking up cache manager to make use of cached data. Currently CacheManager.useCachedData does not consider the subquery expressions in the plan. SQL ``` select * from rows where not exists (select * from rows) ``` Before the fix ``` == Optimized Logical Plan == Join LeftAnti :- InMemoryRelation [_1#3775, _2#3776], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas) : +- FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string,_2:string> +- Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002] +- Relation[_1#3775,_2#3776] parquet ``` After ``` == Optimized Logical Plan == Join LeftAnti :- InMemoryRelation [_1#256, _2#257], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas) : +- FileScan parquet [_1#256,_2#257] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/rows], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string,_2:string> +- Project [_1#256 AS _1#256#298, _2#257 AS _2#257#299] +- InMemoryRelation [_1#256, _2#257], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas) +- FileScan parquet [_1#256,_2#257] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/rows], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string,_2:string> ``` Query2 ``` SELECT FROM t1 WHERE c1 IN (SELECT c1 FROM t2 WHERE c1 IN (SELECT c1 FROM t3 WHERE c1 = 1)) ``` Before ``` == Analyzed Logical Plan == c1: int Project [c1#3] +- Filter predicate-subquery#47 [(c1#3 = c1#10)] : +- Project [c1#10] : +- Filter predicate-subquery#46 [(c1#10 = c1#17)] : : +- Project [c1#17] : : +- Filter (c1#17 = 1) : : +- SubqueryAlias t3, `t3` : : +- Project [value#15 AS c1#17] : : +- LocalRelation [value#15] : +- SubqueryAlias t2, `t2` : +- Project [value#8 AS c1#10] : +- LocalRelation [value#8] +- SubqueryAlias t1, `t1` +- Project [value#1 AS c1#3] +- LocalRelation [value#1] == Optimized Logical Plan == Join LeftSemi, (c1#3 = c1#10) :- InMemoryRelation [c1#3], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), t1 : +- LocalTableScan [c1#3] +- Project [value#8 AS c1#10] +- Join LeftSemi, (value#8 = c1#17) :- LocalRelation [value#8] +- Project [value#15 AS c1#17] +- Filter (value#15 = 1) +- LocalRelation [value#15] ``` After ``` == Analyzed Logical Plan == c1: int Project [c1#3] +- Filter predicate-subquery#47 [(c1#3 = c1#10)] : +- Project [c1#10] : +- Filter predicate-subquery#46 [(c1#10 = c1#17)] : : +- Project [c1#17] : : +- Filter (c1#17 = 1) : : +- SubqueryAlias t3, `t3` : : +- Project [value#15 AS c1#17] : : +- LocalRelation [value#15] : +- SubqueryAlias t2, `t2` : +- Project [value#8 AS c1#10] : +- LocalRelation [value#8] +- SubqueryAlias t1, `t1` +- Project [value#1 AS c1#3] +- LocalRelation [value#1] == Optimized Logical Plan == Join LeftSemi, (c1#3 = c1#10) :- InMemoryRelation [c1#3], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), t1 : +- LocalTableScan [c1#3] +- Join LeftSemi, (c1#10 = c1#17) :- InMemoryRelation [c1#10], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), t2 : +- LocalTableScan [c1#10] +- Filter (c1#17 = 1) +- InMemoryRelation [c1#17], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), t1 +- LocalTableScan [c1#3] ``` ## How was this patch tested? Added new tests in CachedTableSuite. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #16493 from dilipbiswal/SPARK-19093.	2017-01-08 23:09:07 +01:00
Wenchen Fan	b3d39620c5	[SPARK-19085][SQL] cleanup OutputWriterFactory and OutputWriter ## What changes were proposed in this pull request? `OutputWriterFactory`/`OutputWriter` are internal interfaces and we can remove some unnecessary APIs: 1. `OutputWriterFactory.newWriter(path: String)`: no one calls it and no one implements it. 2. `OutputWriter.write(row: Row)`: during execution we only call `writeInternal`, which is weird as `OutputWriter` is already an internal interface. We should rename `writeInternal` to `write` and remove `def write(row: Row)` and it's related converter code. All implementations should just implement `def write(row: InternalRow)` ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #16479 from cloud-fan/hive-writer.	2017-01-08 00:42:09 +08:00
Tathagata Das	b59cddaba0	[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options ## What changes were proposed in this pull request? Updates - Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure. - Updated Output Modes section with Update mode - Added options for all the sources and sinks --------------------------- --------------------------- ![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png) --------------------------- --------------------------- <img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png"> <img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png"> --------------------------- --------------------------- ![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png) ![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png) ![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png) Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #16468 from tdas/SPARK-19074.	2017-01-06 11:29:01 -08:00
Michal Senkyr	903bb8e8a2	[SPARK-16792][SQL] Dataset containing a Case Class with a List type causes a CompileException (converting sequence to list) ## What changes were proposed in this pull request? Added a `to` call at the end of the code generated by `ScalaReflection.deserializerFor` if the requested type is not a supertype of `WrappedArray[_]` that uses `CanBuildFrom[_, _, _]` to convert result into an arbitrary subtype of `Seq[_]`. Care was taken to preserve the original deserialization where it is possible to avoid the overhead of conversion in cases where it is not needed `ScalaReflection.serializerFor` could already be used to serialize any `Seq[_]` so it was not altered `SQLImplicits` had to be altered and new implicit encoders added to permit serialization of other sequence types Also fixes [SPARK-16815] Dataset[List[T]] leads to ArrayStoreException ## How was this patch tested? ```bash ./build/mvn -DskipTests clean package && ./dev/run-tests ``` Also manual execution of the following sets of commands in the Spark shell: ```scala case class TestCC(key: Int, letters: List[String]) val ds1 = sc.makeRDD(Seq( (List("D")), (List("S","H")), (List("F","H")), (List("D","L","L")) )).map(x=>(x.length,x)).toDF("key","letters").as[TestCC] val test1=ds1.map{_.key} test1.show ``` ```scala case class X(l: List[String]) spark.createDataset(Seq(List("A"))).map(X).show ``` ```scala spark.sqlContext.createDataset(sc.parallelize(List(1) :: Nil)).collect ``` After adding arbitrary sequence support also tested with the following commands: ```scala case class QueueClass(q: scala.collection.immutable.Queue[Int]) spark.createDataset(Seq(List(1,2,3))).map(x => QueueClass(scala.collection.immutable.Queue(x: _*))).map(_.q.dequeue).collect ``` Author: Michal Senkyr <mike.senkyr@gmail.com> Closes #16240 from michalsenkyr/sql-caseclass-list-fix.	2017-01-06 15:05:20 +08:00
Kevin Yu	bcc510b021	[SPARK-18871][SQL] New test cases for IN/NOT IN subquery ## What changes were proposed in this pull request? This PR extends the existing IN/NOT IN subquery test cases coverage, adds more test cases to the IN subquery test suite. Based on the discussion, we will create `subquery/in-subquery` sub structure under `sql/core/src/test/resources/sql-tests/inputs` directory. This is the high level grouping for IN subquery: `subquery/in-subquery/` `subquery/in-subquery/simple-in.sql` `subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)` `subquery/in-subquery/not-in-group-by.sql` `subquery/in-subquery/in-order-by.sql` `subquery/in-subquery/in-limit.sql` `subquery/in-subquery/in-having.sql` `subquery/in-subquery/in-joins.sql` `subquery/in-subquery/not-in-joins.sql` `subquery/in-subquery/in-set-operations.sql` `subquery/in-subquery/in-with-cte.sql` `subquery/in-subquery/not-in-with-cte.sql` subquery/in-subquery/in-multiple-columns.sql` We will deliver it through multiple prs, this is the first pr for the IN subquery, it has `subquery/in-subquery/simple-in.sql` `subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)` These are the results from running on DB2. [Modified test file of in-group-by.sql used to run on DB2](https://github.com/apache/spark/files/683367/in-group-by.sql.db2.txt) [Output of the run result on DB2](https://github.com/apache/spark/files/683362/in-group-by.sql.db2.out.txt) [Modified test file of simple-in.sql used to run on DB2](https://github.com/apache/spark/files/683378/simple-in.sql.db2.txt) [Output of the run result on DB2](https://github.com/apache/spark/files/683379/simple-in.sql.db2.out.txt) ## How was this patch tested? This patch is adding tests. Author: Kevin Yu <qyu@us.ibm.com> Closes #16337 from kevinyu98/spark-18871.	2017-01-05 19:00:39 -08:00
Wenchen Fan	cca945b6aa	[SPARK-18885][SQL] unify CREATE TABLE syntax for data source and hive serde tables ## What changes were proposed in this pull request? Today we have different syntax to create data source or hive serde tables, we should unify them to not confuse users and step forward to make hive a data source. Please read https://issues.apache.org/jira/secure/attachment/12843835/CREATE-TABLE.pdf for details. TODO(for follow-up PRs): 1. TBLPROPERTIES is not added to the new syntax, we should decide if we wanna add it later. 2. `SHOW CREATE TABLE` should be updated to use the new syntax. 3. we should decide if we wanna change the behavior of `SET LOCATION`. ## How was this patch tested? new tests Author: Wenchen Fan <wenchen@databricks.com> Closes #16296 from cloud-fan/create-table.	2017-01-05 17:40:27 -08:00
Wenchen Fan	30345c43b7	[SPARK-19058][SQL] fix partition related behaviors with DataFrameWriter.saveAsTable ## What changes were proposed in this pull request? When we append data to a partitioned table with `DataFrameWriter.saveAsTable`, there are 2 issues: 1. doesn't work when the partition has custom location. 2. will recover all partitions This PR fixes them by moving the special partition handling code from `DataSourceAnalysis` to `InsertIntoHadoopFsRelationCommand`, so that the `DataFrameWriter.saveAsTable` code path can also benefit from it. ## How was this patch tested? newly added regression tests Author: Wenchen Fan <wenchen@databricks.com> Closes #16460 from cloud-fan/append.	2017-01-05 14:11:05 +08:00
Herman van Hovell	4262fb0d55	[SPARK-19070] Clean-up dataset actions ## What changes were proposed in this pull request? Dataset actions currently spin off a new `Dataframe` only to track query execution. This PR simplifies this code path by using the `Dataset.queryExecution` directly. This PR also merges the typed and untyped action evaluation paths. ## How was this patch tested? Existing tests. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #16466 from hvanhovell/SPARK-19070.	2017-01-04 23:47:58 +08:00
Niranjan Padmanabhan	a1e40b1f5d	[MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo ## What changes were proposed in this pull request? There are many locations in the Spark repo where the same word occurs consecutively. Sometimes they are appropriately placed, but many times they are not. This PR removes the inappropriately duplicated words. ## How was this patch tested? N/A since only docs or comments were updated. Author: Niranjan Padmanabhan <niranjan.padmanabhan@gmail.com> Closes #16455 from neurons/np.structure_streaming_doc.	2017-01-04 15:07:29 +00:00
Wenchen Fan	101556d0fa	[SPARK-19060][SQL] remove the supportsPartial flag in AggregateFunction ## What changes were proposed in this pull request? Now all aggregation functions support partial aggregate, we can remove the `supportsPartual` flag in `AggregateFunction` ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #16461 from cloud-fan/partial.	2017-01-04 12:46:30 +01:00
gatorsmile	b67b35f76b	[SPARK-19048][SQL] Delete Partition Location when Dropping Managed Partitioned Tables in InMemoryCatalog ### What changes were proposed in this pull request? The data in the managed table should be deleted after table is dropped. However, if the partition location is not under the location of the partitioned table, it is not deleted as expected. Users can specify any location for the partition when they adding a partition. This PR is to delete partition location when dropping managed partitioned tables stored in `InMemoryCatalog`. ### How was this patch tested? Added test cases for both HiveExternalCatalog and InMemoryCatalog Author: gatorsmile <gatorsmile@gmail.com> Closes #16448 from gatorsmile/unsetSerdeProp.	2017-01-03 11:43:47 -08:00
Dongjoon Hyun	7a2b5f93bc	[SPARK-18877][SQL] `CSVInferSchema.inferField` on DecimalType should find a common type with `typeSoFar` ## What changes were proposed in this pull request? CSV type inferencing causes `IllegalArgumentException` on decimal numbers with heterogeneous precisions and scales because the current logic uses the last decimal type in a partition. Specifically, `inferRowType`, the seqOp of aggregate, returns the last decimal type. This PR fixes it to use `findTightestCommonType`. decimal.csv ``` 9.03E+12 1.19E+11 ``` BEFORE ```scala scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema root \|-- _c0: decimal(3,-9) (nullable = true) scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show 16/12/16 14:32:49 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4) java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 exceeds max precision 3 ``` AFTER ```scala scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema root \|-- _c0: decimal(4,-9) (nullable = true) scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show +---------+ \| _c0\| +---------+ \|9.030E+12\| \| 1.19E+11\| +---------+ ``` ## How was this patch tested? Pass the newly add test case. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #16320 from dongjoon-hyun/SPARK-18877.	2017-01-03 23:06:50 +08:00
Zhenhua Wang	ae83c21125	[SPARK-18998][SQL] Add a cbo conf to switch between default statistics and estimated statistics ## What changes were proposed in this pull request? We add a cbo configuration to switch between default stats and estimated stats. We also define a new statistics method `planStats` in LogicalPlan with conf as its parameter, in order to pass the cbo switch and other estimation related configurations in the future. `planStats` is used on the caller sides (i.e. in Optimizer and Strategies) to make transformation decisions based on stats. ## How was this patch tested? Add a test case using a dummy LogicalPlan. Author: Zhenhua Wang <wzh_zju@163.com> Closes #16401 from wzhfy/cboSwitch.	2017-01-03 12:19:52 +08:00
hyukjinkwon	f1330b1d9e	[SPARK-19022][TESTS] Fix tests dependent on OS due to different newline characters ## What changes were proposed in this pull request? There are two tests failing on Windows due to the different newlines. ``` - StreamingQueryProgress - prettyJson * FAILED * (0 milliseconds) "{ "id" : "39788670-6722-48b7-a248-df6ba08722ac", "runId" : "422282f1-3b81-4b47-a15d-82dda7e69390", "name" : "myName", ... }" did not equal "{ "id" : "39788670-6722-48b7-a248-df6ba08722ac", "runId" : "422282f1-3b81-4b47-a15d-82dda7e69390", "name" : "myName", ... }" ... ``` ``` - StreamingQueryStatus - prettyJson * FAILED * (0 milliseconds) "{ "message" : "active", "isDataAvailable" : true, "isTriggerActive" : false }" did not equal "{ "message" : "active", "isDataAvailable" : true, "isTriggerActive" : false }" ... ``` The reason is, `pretty` in `org.json4s.pretty` writes OS-dependent newlines but the string defined in the tests are `\n`. This ends up with test failures. This PR proposes to compare these regardless of newline concerns. ## How was this patch tested? Manually tested via AppVeyor. Before https://ci.appveyor.com/project/spark-test/spark/build/417-newlines-fix-before After https://ci.appveyor.com/project/spark-test/spark/build/418-newlines-fix Author: hyukjinkwon <gurwls223@gmail.com> Closes #16433 from HyukjinKwon/tests-StreamingQueryStatusAndProgressSuite.	2017-01-02 15:17:02 +00:00
Shixiong Zhu	2394047370	[SPARK-19050][SS][TESTS] Fix EventTimeWatermarkSuite 'delay in months and years handled correctly' ## What changes were proposed in this pull request? `monthsSinceEpoch` in this test is like `math.floor(num)`, so `monthDiff` has two possible values. ## How was this patch tested? Jenkins. Author: Shixiong Zhu <shixiong@databricks.com> Closes #16449 from zsxwing/watermark-test-hotfix.	2017-01-01 13:25:44 -08:00
Dongjoon Hyun	b85e29437d	[SPARK-18123][SQL] Use db column names instead of RDD column ones during JDBC Writing ## What changes were proposed in this pull request? Apache Spark supports the following cases by quoting RDD column names while saving through JDBC. - Allow reserved keyword as a column name, e.g., 'order'. - Allow mixed-case colume names like the following, e.g., `[a: int, A: int]`. ``` scala scala> val df = sql("select 1 a, 1 A") df: org.apache.spark.sql.DataFrame = [a: int, A: int] ... scala> df.write.mode("overwrite").format("jdbc").options(option).save() scala> df.write.mode("append").format("jdbc").options(option).save() ``` This PR aims to use database column names instead of RDD column ones in order to support the following additionally. Note that this case succeeds with `MySQL`, but fails on `Postgres`/`Oracle` before. ``` scala val df1 = sql("select 1 a") val df2 = sql("select 1 A") ... df1.write.mode("overwrite").format("jdbc").options(option).save() df2.write.mode("append").format("jdbc").options(option).save() ``` ## How was this patch tested? Pass the Jenkins test with a new testcase. Author: Dongjoon Hyun <dongjoon@apache.org> Author: gatorsmile <gatorsmile@gmail.com> Closes #15664 from dongjoon-hyun/SPARK-18123.	2016-12-30 10:27:14 -08:00
hyukjinkwon	852782b83c	[SPARK-18922][TESTS] Fix more path-related test failures on Windows ## What changes were proposed in this pull request? This PR proposes to fix the test failures due to different format of paths on Windows. Failed tests are as below: ``` ColumnExpressionSuite: - input_file_name, input_file_block_start, input_file_block_length - FileScanRDD * FAILED * (187 milliseconds) "file:///C:/projects/spark/target/tmp/spark-0b21b963-6cfa-411c-8d6f-e6a5e1e73bce/part-00001-c083a03a-e55e-4b05-9073-451de352d006.snappy.parquet" did not contain "C:\projects\spark\target\tmp\spark-0b21b963-6cfa-411c-8d6f-e6a5e1e73bce" (ColumnExpressionSuite.scala:545) - input_file_name, input_file_block_start, input_file_block_length - HadoopRDD * FAILED * (172 milliseconds) "file:/C:/projects/spark/target/tmp/spark-5d0afa94-7c2f-463b-9db9-2e8403e2bc5f/part-00000-f6530138-9ad3-466d-ab46-0eeb6f85ed0b.txt" did not contain "C:\projects\spark\target\tmp\spark-5d0afa94-7c2f-463b-9db9-2e8403e2bc5f" (ColumnExpressionSuite.scala:569) - input_file_name, input_file_block_start, input_file_block_length - NewHadoopRDD * FAILED * (156 milliseconds) "file:/C:/projects/spark/target/tmp/spark-a894c7df-c74d-4d19-82a2-a04744cb3766/part-00000-29674e3f-3fcf-4327-9b04-4dab1d46338d.txt" did not contain "C:\projects\spark\target\tmp\spark-a894c7df-c74d-4d19-82a2-a04744cb3766" (ColumnExpressionSuite.scala:598) ``` ``` DataStreamReaderWriterSuite: - source metadataPath * FAILED * (62 milliseconds) org.mockito.exceptions.verification.junit.ArgumentsAreDifferent: Argument(s) are different! Wanted: streamSourceProvider.createSource( org.apache.spark.sql.SQLContext3b04133b, "C:\projects\spark\target\tmp\streaming.metadata-b05db6ae-c8dc-4ce4-b0d9-1eb8c84876c0/sources/0", None, "org.apache.spark.sql.streaming.test", Map() ); -> at org.apache.spark.sql.streaming.test.DataStreamReaderWriterSuite$$anonfun$12.apply$mcV$sp(DataStreamReaderWriterSuite.scala:374) Actual invocation has different arguments: streamSourceProvider.createSource( org.apache.spark.sql.SQLContext3b04133b, "/C:/projects/spark/target/tmp/streaming.metadata-b05db6ae-c8dc-4ce4-b0d9-1eb8c84876c0/sources/0", None, "org.apache.spark.sql.streaming.test", Map() ); ``` ``` GlobalTempViewSuite: - CREATE GLOBAL TEMP VIEW USING * FAILED * (110 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-960398ba-a0a1-45f6-a59a-d98533f9f519; ``` ``` CreateTableAsSelectSuite: - CREATE TABLE USING AS SELECT * FAILED * (0 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string - create a table, drop it and create another one with the same name * FAILED * (16 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string - create table using as select - with partitioned by * FAILED * (0 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string - create table using as select - with non-zero buckets * FAILED * (0 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string ``` ``` HiveMetadataCacheSuite: - partitioned table is cached when partition pruning is true * FAILED * (532 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - partitioned table is cached when partition pruning is false * FAILED * (297 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ``` ``` MultiDatabaseSuite: - createExternalTable() to non-default database - with USE * FAILED * (954 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-0839d9a7-5e29-467a-9e3e-3e4cd618ee09; - createExternalTable() to non-default database - without USE * FAILED * (500 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-c7e24d73-1d8f-45e8-ab7d-53a83087aec3; - invalid database name and table names * FAILED * (31 milliseconds) "Path does not exist: file:/C:projectsspark arget mpspark-15a2a494-3483-4876-80e5-ec396e704b77;" did not contain "`t:a` is not a valid name for tables/databases. Valid names only contain alphabet characters, numbers and _." (MultiDatabaseSuite.scala:296) ``` ``` OrcQuerySuite: - SPARK-8501: Avoids discovery schema from empty ORC files * FAILED * (15 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - Verify the ORC conversion parameter: CONVERT_METASTORE_ORC * FAILED * (78 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - converted ORC table supports resolving mixed case field * FAILED * (297 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ``` ``` HadoopFsRelationTest - JsonHadoopFsRelationSuite, OrcHadoopFsRelationSuite, ParquetHadoopFsRelationSuite, SimpleTextHadoopFsRelationSuite: - Locality support for FileScanRDD * FAILED * (15 milliseconds) java.lang.IllegalArgumentException: Wrong FS: file://C:\projects\spark\target\tmp\spark-383d1f13-8783-47fd-964d-9c75e5eec50f, expected: file:/// ``` ``` HiveQuerySuite: - CREATE TEMPORARY FUNCTION * FAILED * (0 milliseconds) java.net.MalformedURLException: For input string: "%5Cprojects%5Cspark%5Csql%5Chive%5Ctarget%5Cscala-2.11%5Ctest-classes%5CTestUDTF.jar" - ADD FILE command * FAILED * (500 milliseconds) java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\sql\hive\target\scala-2.11\test-classes\data\files\v1.txt - ADD JAR command 2 * FAILED * (110 milliseconds) org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: C:projectssparksqlhive argetscala-2.11 est-classesdatafilessample.json; ``` ``` PruneFileSourcePartitionsSuite: - PruneFileSourcePartitions should not change the output of LogicalRelation * FAILED * (15 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ``` ``` HiveCommandSuite: - LOAD DATA LOCAL * FAILED * (109 milliseconds) org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: C:projectssparksqlhive argetscala-2.11 est-classesdatafilesemployee.dat; - LOAD DATA * FAILED * (93 milliseconds) java.net.URISyntaxException: Illegal character in opaque part at index 15: C:projectsspark arget mpemployee.dat7496657117354281006.tmp - Truncate Table * FAILED * (78 milliseconds) org.apache.spark.sql.AnalysisException: LOAD DATA input path does not exist: C:projectssparksqlhive argetscala-2.11 est-classesdatafilesemployee.dat; ``` ``` HiveExternalCatalogBackwardCompatibilitySuite: - make sure we can read table created by old version of Spark * FAILED * (0 milliseconds) "[/C:/projects/spark/target/tmp/]spark-0554d859-74e1-..." did not equal "[C:\projects\spark\target\tmp\]spark-0554d859-74e1-..." (HiveExternalCatalogBackwardCompatibilitySuite.scala:213) org.scalatest.exceptions.TestFailedException - make sure we can alter table location created by old version of Spark * FAILED * (110 milliseconds) java.net.URISyntaxException: Illegal character in opaque part at index 15: C:projectsspark arget mpspark-0e9b2c5f-49a1-4e38-a32a-c0ab1813a79f ``` ``` ExternalCatalogSuite: - create/drop/rename partitions should create/delete/rename the directory * FAILED * (610 milliseconds) java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\projects\spark\target\tmp\spark-4c24f010-18df-437b-9fed-990c6f9adece ``` ``` SQLQuerySuite: - describe functions - temporary user defined functions * FAILED * (16 milliseconds) java.net.URISyntaxException: Illegal character in opaque part at index 22: C:projectssparksqlhive argetscala-2.11 est-classesTestUDTF.jar - specifying database name for a temporary table is not allowed * FAILED * (125 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-a34c9814-a483-43f2-be29-37f616b6df91; ``` ``` PartitionProviderCompatibilitySuite: - convert partition provider to hive with repair table * FAILED * (281 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-ee5fc96d-8c7d-4ebf-8571-a1d62736473e; - when partition management is enabled, new tables have partition provider hive * FAILED * (187 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-803ad4d6-3e8c-498d-9ca5-5cda5d9b2a48; - when partition management is disabled, new tables have no partition provider * FAILED * (172 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-c9fda9e2-4020-465f-8678-52cd72d0a58f; - when partition management is disabled, we preserve the old behavior even for new tables * FAILED * (203 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-f4a518a6-c49d-43d3-b407-0ddd76948e13; - insert overwrite partition of legacy datasource table * FAILED * (188 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-f4a518a6-c49d-43d3-b407-0ddd76948e79; - insert overwrite partition of new datasource table overwrites just partition * FAILED * (219 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-6ba3a88d-6f6c-42c5-a9f4-6d924a0616ff; - SPARK-18544 append with saveAsTable - partition management true * FAILED * (173 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-cd234a6d-9cb4-4d1d-9e51-854ae9543bbd; - SPARK-18635 special chars in partition values - partition management true * FAILED * (2 seconds, 967 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - SPARK-18635 special chars in partition values - partition management false * FAILED * (62 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - SPARK-18659 insert overwrite table with lowercase - partition management true * FAILED * (63 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - SPARK-18544 append with saveAsTable - partition management false * FAILED * (266 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - SPARK-18659 insert overwrite table files - partition management false * FAILED * (63 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - SPARK-18659 insert overwrite table with lowercase - partition management false * FAILED * (78 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - sanity check table setup * FAILED * (31 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - insert into partial dynamic partitions * FAILED * (47 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - insert into fully dynamic partitions * FAILED * (62 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - insert into static partition * FAILED * (78 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - overwrite partial dynamic partitions * FAILED * (63 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - overwrite fully dynamic partitions * FAILED * (47 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - overwrite static partition * FAILED * (63 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ``` ``` MetastoreDataSourcesSuite: - check change without refresh * FAILED * (203 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-00713fe4-ca04-448c-bfc7-6c5e9a2ad2a1; - drop, change, recreate * FAILED * (78 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-2030a21b-7d67-4385-a65b-bb5e2bed4861; - SPARK-15269 external data source table creation * FAILED * (78 milliseconds) org.apache.spark.sql.AnalysisException: Path does not exist: file:/C:projectsspark arget mpspark-4d50fd4a-14bc-41d6-9232-9554dd233f86; - CTAS * FAILED * (109 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string - CTAS with IF NOT EXISTS * FAILED * (109 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string - CTAS: persisted partitioned bucketed data source table * FAILED * (0 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string - SPARK-15025: create datasource table with path with select * FAILED * (16 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string - CTAS: persisted partitioned data source table * FAILED * (47 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string ``` ``` HiveMetastoreCatalogSuite: - Persist non-partitioned parquet relation into metastore as managed table using CTAS * FAILED * (16 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string - Persist non-partitioned orc relation into metastore as managed table using CTAS * FAILED * (16 milliseconds) java.lang.IllegalArgumentException: Can not create a Path from an empty string ``` ``` HiveUDFSuite: - SPARK-11522 select input_file_name from non-parquet table * FAILED * (16 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ``` ``` QueryPartitionSuite: - SPARK-13709: reading partitioned Avro table with nested schema * FAILED * (250 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ``` ``` ParquetHiveCompatibilitySuite: - simple primitives * FAILED * (16 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - SPARK-10177 timestamp * FAILED * (0 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - array * FAILED * (16 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - map * FAILED * (16 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - struct * FAILED * (0 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); - SPARK-16344: array of struct with a single field named 'array_element' * FAILED * (15 milliseconds) org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: Can not create a Path from an empty string); ``` ## How was this patch tested? Manually tested via AppVeyor. ``` ColumnExpressionSuite: - input_file_name, input_file_block_start, input_file_block_length - FileScanRDD (234 milliseconds) - input_file_name, input_file_block_start, input_file_block_length - HadoopRDD (235 milliseconds) - input_file_name, input_file_block_start, input_file_block_length - NewHadoopRDD (203 milliseconds) ``` ``` DataStreamReaderWriterSuite: - source metadataPath (63 milliseconds) ``` ``` GlobalTempViewSuite: - CREATE GLOBAL TEMP VIEW USING (436 milliseconds) ``` ``` CreateTableAsSelectSuite: - CREATE TABLE USING AS SELECT (171 milliseconds) - create a table, drop it and create another one with the same name (422 milliseconds) - create table using as select - with partitioned by (141 milliseconds) - create table using as select - with non-zero buckets (125 milliseconds) ``` ``` HiveMetadataCacheSuite: - partitioned table is cached when partition pruning is true (3 seconds, 211 milliseconds) - partitioned table is cached when partition pruning is false (1 second, 781 milliseconds) ``` ``` MultiDatabaseSuite: - createExternalTable() to non-default database - with USE (797 milliseconds) - createExternalTable() to non-default database - without USE (640 milliseconds) - invalid database name and table names (62 milliseconds) ``` ``` OrcQuerySuite: - SPARK-8501: Avoids discovery schema from empty ORC files (703 milliseconds) - Verify the ORC conversion parameter: CONVERT_METASTORE_ORC (750 milliseconds) - converted ORC table supports resolving mixed case field (625 milliseconds) ``` ``` HadoopFsRelationTest - JsonHadoopFsRelationSuite, OrcHadoopFsRelationSuite, ParquetHadoopFsRelationSuite, SimpleTextHadoopFsRelationSuite: - Locality support for FileScanRDD (296 milliseconds) ``` ``` HiveQuerySuite: - CREATE TEMPORARY FUNCTION (125 milliseconds) - ADD FILE command (250 milliseconds) - ADD JAR command 2 (609 milliseconds) ``` ``` PruneFileSourcePartitionsSuite: - PruneFileSourcePartitions should not change the output of LogicalRelation (359 milliseconds) ``` ``` HiveCommandSuite: - LOAD DATA LOCAL (1 second, 829 milliseconds) - LOAD DATA (1 second, 735 milliseconds) - Truncate Table (1 second, 641 milliseconds) ``` ``` HiveExternalCatalogBackwardCompatibilitySuite: - make sure we can read table created by old version of Spark (32 milliseconds) - make sure we can alter table location created by old version of Spark (125 milliseconds) - make sure we can rename table created by old version of Spark (281 milliseconds) ``` ``` ExternalCatalogSuite: - create/drop/rename partitions should create/delete/rename the directory (625 milliseconds) ``` ``` SQLQuerySuite: - describe functions - temporary user defined functions (31 milliseconds) - specifying database name for a temporary table is not allowed (390 milliseconds) ``` ``` PartitionProviderCompatibilitySuite: - convert partition provider to hive with repair table (813 milliseconds) - when partition management is enabled, new tables have partition provider hive (562 milliseconds) - when partition management is disabled, new tables have no partition provider (344 milliseconds) - when partition management is disabled, we preserve the old behavior even for new tables (422 milliseconds) - insert overwrite partition of legacy datasource table (750 milliseconds) - SPARK-18544 append with saveAsTable - partition management true (985 milliseconds) - SPARK-18635 special chars in partition values - partition management true (3 seconds, 328 milliseconds) - SPARK-18635 special chars in partition values - partition management false (2 seconds, 891 milliseconds) - SPARK-18659 insert overwrite table with lowercase - partition management true (750 milliseconds) - SPARK-18544 append with saveAsTable - partition management false (656 milliseconds) - SPARK-18659 insert overwrite table files - partition management false (922 milliseconds) - SPARK-18659 insert overwrite table with lowercase - partition management false (469 milliseconds) - sanity check table setup (937 milliseconds) - insert into partial dynamic partitions (2 seconds, 985 milliseconds) - insert into fully dynamic partitions (1 second, 937 milliseconds) - insert into static partition (1 second, 578 milliseconds) - overwrite partial dynamic partitions (7 seconds, 561 milliseconds) - overwrite fully dynamic partitions (1 second, 766 milliseconds) - overwrite static partition (1 second, 797 milliseconds) ``` ``` MetastoreDataSourcesSuite: - check change without refresh (610 milliseconds) - drop, change, recreate (437 milliseconds) - SPARK-15269 external data source table creation (297 milliseconds) - CTAS with IF NOT EXISTS (437 milliseconds) - CTAS: persisted partitioned bucketed data source table (422 milliseconds) - SPARK-15025: create datasource table with path with select (265 milliseconds) - CTAS (438 milliseconds) - CTAS with IF NOT EXISTS (469 milliseconds) - CTAS: persisted partitioned bucketed data source table (406 milliseconds) ``` ``` HiveMetastoreCatalogSuite: - Persist non-partitioned parquet relation into metastore as managed table using CTAS (406 milliseconds) - Persist non-partitioned orc relation into metastore as managed table using CTAS (313 milliseconds) ``` ``` HiveUDFSuite: - SPARK-11522 select input_file_name from non-parquet table (3 seconds, 144 milliseconds) ``` ``` QueryPartitionSuite: - SPARK-13709: reading partitioned Avro table with nested schema (1 second, 67 milliseconds) ``` ``` ParquetHiveCompatibilitySuite: - simple primitives (745 milliseconds) - SPARK-10177 timestamp (375 milliseconds) - array (407 milliseconds) - map (409 milliseconds) - struct (437 milliseconds) - SPARK-16344: array of struct with a single field named 'array_element' (391 milliseconds) ``` Author: hyukjinkwon <gurwls223@gmail.com> Closes #16397 from HyukjinKwon/SPARK-18922-paths.	2016-12-30 11:16:03 +00:00
Dongjoon Hyun	752d9eeb9b	[SPARK-19012][SQL] Fix `createTempViewCommand` to throw AnalysisException instead of ParseException ## What changes were proposed in this pull request? Currently, `createTempView`, `createOrReplaceTempView`, and `createGlobalTempView` show `ParseExceptions` on invalid table names. We had better show better error message. Also, this PR also adds and updates the missing description on the API docs correctly. BEFORE ``` scala> spark.range(10).createOrReplaceTempView("11111") org.apache.spark.sql.catalyst.parser.ParseException: mismatched input '11111' expecting {'SELECT', 'FROM', 'ADD', ...}(line 1, pos 0) == SQL == 11111 ... ``` AFTER ``` scala> spark.range(10).createOrReplaceTempView("11111") org.apache.spark.sql.AnalysisException: Invalid view name: 11111; ... ``` ## How was this patch tested? Pass the Jenkins with updated a test case. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #16427 from dongjoon-hyun/SPARK-19012.	2016-12-29 21:22:13 +01:00
Wenchen Fan	7d19b6ab7d	[SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelectCommand ## What changes were proposed in this pull request? The `CreateDataSourceTableAsSelectCommand` is quite complex now, as it has a lot of work to do if the table already exists: 1. throw exception if we don't want to ignore it. 2. do some check and adjust the schema if we want to append data. 3. drop the table and create it again if we want to overwrite. The work 2 and 3 should be done by analyzer, so that we can also apply it to hive tables. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #15996 from cloud-fan/append.	2016-12-28 21:50:21 -08:00
Kazuaki Ishizaki	93f35569fd	[SPARK-16213][SQL] Reduce runtime overhead of a program that creates an primitive array in DataFrame ## What changes were proposed in this pull request? This PR reduces runtime overhead of a program the creates an primitive array in DataFrame by using the similar approach to #15044. Generated code performs boxing operation in an assignment from InternalRow to an `Object[]` temporary array (at Lines 051 and 061 in the generated code before without this PR). If we know that type of array elements is primitive, we apply the following optimizations: 1. Eliminate a pair of `isNullAt()` and a null assignment 2. Allocate an primitive array instead of `Object[]` (eliminate boxing operations) 3. Create `UnsafeArrayData` by using `UnsafeArrayWriter` to keep a primitive array in a row format instead of doing non-lightweight operations in constructor of `GenericArrayData` The PR also performs the same things for `CreateMap`. Here are performance results of [DataFrame programs](`6bf54ec5e2/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/PrimitiveArrayBenchmark.scala (L83-L112)`) by up to 17.9x over without this PR. ``` Without SPARK-16043 OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 4.4.11-200.fc22.x86_64 Intel Xeon E3-12xx v2 (Ivy Bridge) Read a primitive array in DataFrame: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Int 3805 / 4150 0.0 507308.9 1.0X Double 3593 / 3852 0.0 479056.9 1.1X With SPARK-16043 Read a primitive array in DataFrame: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Int 213 / 271 0.0 28387.5 1.0X Double 204 / 223 0.0 27250.9 1.0X ``` Note : #15780 is enabled for these measurements An motivating example ``` java val df = sparkContext.parallelize(Seq(0.0d, 1.0d), 1).toDF df.selectExpr("Array(value + 1.1d, value + 2.2d)").show ``` Generated code without this PR ``` java /* 005 / final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator { / 006 / private Object[] references; / 007 / private scala.collection.Iterator[] inputs; / 008 / private scala.collection.Iterator inputadapter_input; / 009 / private UnsafeRow serializefromobject_result; / 010 / private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder; / 011 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter; / 012 / private Object[] project_values; / 013 / private UnsafeRow project_result; / 014 / private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder project_holder; / 015 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter project_rowWriter; / 016 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter project_arrayWriter; / 017 / / 018 / public GeneratedIterator(Object[] references) { / 019 / this.references = references; / 020 / } / 021 / / 022 / public void init(int index, scala.collection.Iterator[] inputs) { / 023 / partitionIndex = index; / 024 / this.inputs = inputs; / 025 / inputadapter_input = inputs[0]; / 026 / serializefromobject_result = new UnsafeRow(1); / 027 / this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 0); / 028 / this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1); / 029 / this.project_values = null; / 030 / project_result = new UnsafeRow(1); / 031 / this.project_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result, 32); / 032 / this.project_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder, 1); / 033 / this.project_arrayWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter(); / 034 / / 035 / } / 036 / / 037 / protected void processNext() throws java.io.IOException { / 038 / while (inputadapter_input.hasNext()) { / 039 / InternalRow inputadapter_row = (InternalRow) inputadapter_input.next(); / 040 / double inputadapter_value = inputadapter_row.getDouble(0); / 041 / / 042 / final boolean project_isNull = false; / 043 / this.project_values = new Object[2]; / 044 / boolean project_isNull1 = false; / 045 / / 046 / double project_value1 = -1.0; / 047 / project_value1 = inputadapter_value + 1.1D; / 048 / if (false) { / 049 / project_values[0] = null; / 050 / } else { / 051 / project_values[0] = project_value1; / 052 / } / 053 / / 054 / boolean project_isNull4 = false; / 055 / / 056 / double project_value4 = -1.0; / 057 / project_value4 = inputadapter_value + 2.2D; / 058 / if (false) { / 059 / project_values[1] = null; / 060 / } else { / 061 / project_values[1] = project_value4; / 062 / } / 063 / / 064 / final ArrayData project_value = new org.apache.spark.sql.catalyst.util.GenericArrayData(project_values); / 065 / this.project_values = null; / 066 / project_holder.reset(); / 067 / / 068 / project_rowWriter.zeroOutNullBytes(); / 069 / / 070 / if (project_isNull) { / 071 / project_rowWriter.setNullAt(0); / 072 / } else { / 073 / // Remember the current cursor so that we can calculate how many bytes are / 074 / // written later. / 075 / final int project_tmpCursor = project_holder.cursor; / 076 / / 077 / if (project_value instanceof UnsafeArrayData) { / 078 / final int project_sizeInBytes = ((UnsafeArrayData) project_value).getSizeInBytes(); / 079 / // grow the global buffer before writing data. / 080 / project_holder.grow(project_sizeInBytes); / 081 / ((UnsafeArrayData) project_value).writeToMemory(project_holder.buffer, project_holder.cursor); / 082 / project_holder.cursor += project_sizeInBytes; / 083 / / 084 / } else { / 085 / final int project_numElements = project_value.numElements(); / 086 / project_arrayWriter.initialize(project_holder, project_numElements, 8); / 087 / / 088 / for (int project_index = 0; project_index < project_numElements; project_index++) { / 089 / if (project_value.isNullAt(project_index)) { / 090 / project_arrayWriter.setNullDouble(project_index); / 091 / } else { / 092 / final double project_element = project_value.getDouble(project_index); / 093 / project_arrayWriter.write(project_index, project_element); / 094 / } / 095 / } / 096 / } / 097 / / 098 / project_rowWriter.setOffsetAndSize(0, project_tmpCursor, project_holder.cursor - project_tmpCursor); / 099 / } / 100 / project_result.setTotalSize(project_holder.totalSize()); / 101 / append(project_result); / 102 / if (shouldStop()) return; / 103 / } / 104 / } / 105 / } ``` Generated code with this PR ``` java / 005 / final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator { / 006 / private Object[] references; / 007 / private scala.collection.Iterator[] inputs; / 008 / private scala.collection.Iterator inputadapter_input; / 009 / private UnsafeRow serializefromobject_result; / 010 / private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder; / 011 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter; / 012 / private UnsafeArrayData project_arrayData; / 013 / private UnsafeRow project_result; / 014 / private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder project_holder; / 015 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter project_rowWriter; / 016 / private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter project_arrayWriter; / 017 / / 018 / public GeneratedIterator(Object[] references) { / 019 / this.references = references; / 020 / } / 021 / / 022 / public void init(int index, scala.collection.Iterator[] inputs) { / 023 / partitionIndex = index; / 024 / this.inputs = inputs; / 025 / inputadapter_input = inputs[0]; / 026 / serializefromobject_result = new UnsafeRow(1); / 027 / this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 0); / 028 / this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1); / 029 / / 030 / project_result = new UnsafeRow(1); / 031 / this.project_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result, 32); / 032 / this.project_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder, 1); / 033 / this.project_arrayWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter(); / 034 / / 035 / } / 036 / / 037 / protected void processNext() throws java.io.IOException { / 038 / while (inputadapter_input.hasNext()) { / 039 / InternalRow inputadapter_row = (InternalRow) inputadapter_input.next(); / 040 / double inputadapter_value = inputadapter_row.getDouble(0); / 041 / / 042 / byte[] project_array = new byte[32]; / 043 / project_arrayData = new UnsafeArrayData(); / 044 / Platform.putLong(project_array, 16, 2); / 045 / project_arrayData.pointTo(project_array, 16, 32); / 046 / / 047 / boolean project_isNull1 = false; / 048 / / 049 / double project_value1 = -1.0; / 050 / project_value1 = inputadapter_value + 1.1D; / 051 / if (false) { / 052 / project_arrayData.setNullAt(0); / 053 / } else { / 054 / project_arrayData.setDouble(0, project_value1); / 055 / } / 056 / / 057 / boolean project_isNull4 = false; / 058 / / 059 / double project_value4 = -1.0; / 060 / project_value4 = inputadapter_value + 2.2D; / 061 / if (false) { / 062 / project_arrayData.setNullAt(1); / 063 / } else { / 064 / project_arrayData.setDouble(1, project_value4); / 065 / } / 066 / project_holder.reset(); / 067 / / 068 / // Remember the current cursor so that we can calculate how many bytes are / 069 / // written later. / 070 / final int project_tmpCursor = project_holder.cursor; / 071 / / 072 / if (project_arrayData instanceof UnsafeArrayData) { / 073 / final int project_sizeInBytes = ((UnsafeArrayData) project_arrayData).getSizeInBytes(); / 074 / // grow the global buffer before writing data. / 075 / project_holder.grow(project_sizeInBytes); / 076 / ((UnsafeArrayData) project_arrayData).writeToMemory(project_holder.buffer, project_holder.cursor); / 077 / project_holder.cursor += project_sizeInBytes; / 078 / / 079 / } else { / 080 / final int project_numElements = project_arrayData.numElements(); / 081 / project_arrayWriter.initialize(project_holder, project_numElements, 8); / 082 / / 083 / for (int project_index = 0; project_index < project_numElements; project_index++) { / 084 / if (project_arrayData.isNullAt(project_index)) { / 085 / project_arrayWriter.setNullDouble(project_index); / 086 / } else { / 087 / final double project_element = project_arrayData.getDouble(project_index); / 088 / project_arrayWriter.write(project_index, project_element); / 089 / } / 090 / } / 091 / } / 092 / / 093 / project_rowWriter.setOffsetAndSize(0, project_tmpCursor, project_holder.cursor - project_tmpCursor); / 094 / project_result.setTotalSize(project_holder.totalSize()); / 095 / append(project_result); / 096 / if (shouldStop()) return; / 097 / } / 098 / } / 099 */ } ``` ## How was this patch tested? Added unit tests into `DataFrameComplexTypeSuite` Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #13909 from kiszk/SPARK-16213.	2016-12-29 10:59:37 +08:00
Carson Wang	2a5f52a714	[MINOR][DOC] Fix doc of ForeachWriter to use writeStream ## What changes were proposed in this pull request? Fix the document of `ForeachWriter` to use `writeStream` instead of `write` for a streaming dataset. ## How was this patch tested? Docs only. Author: Carson Wang <carson.wang@intel.com> Closes #16419 from carsonwang/FixDoc.	2016-12-28 12:12:44 +00:00
uncleGen	76e9bd7488	[SPARK-18960][SQL][SS] Avoid double reading file which is being copied. ## What changes were proposed in this pull request? In HDFS, when we copy a file into target directory, there will a temporary `._COPY_` file for a period of time. The duration depends on file size. If we do not skip this file, we will may read the same data for two times. ## How was this patch tested? update unit test Author: uncleGen <hustyugm@gmail.com> Closes #16370 from uncleGen/SPARK-18960.	2016-12-28 10:42:47 +00:00
gatorsmile	5ac62043cf	[SPARK-18992][SQL] Move spark.sql.hive.thriftServer.singleSession to SQLConf ### What changes were proposed in this pull request? Since `spark.sql.hive.thriftServer.singleSession` is a configuration of SQL component, this conf can be moved from `SparkConf` to `StaticSQLConf`. When we introduced `spark.sql.hive.thriftServer.singleSession`, all the SQL configuration are session specific. They can be modified in different sessions. In Spark 2.1, static SQL configuration is added. It is a perfect fit for `spark.sql.hive.thriftServer.singleSession`. Previously, we did the same move for `spark.sql.warehouse.dir` from `SparkConf` to `StaticSQLConf` ### How was this patch tested? Added test cases in HiveThriftServer2Suites.scala Author: gatorsmile <gatorsmile@gmail.com> Closes #16392 from gatorsmile/hiveThriftServerSingleSession.	2016-12-28 10:16:22 +08:00
Yin Huai	2404d8e54b	Revert "[SPARK-18990][SQL] make DatasetBenchmark fairer for Dataset" This reverts commit `a05cc425a0`.	2016-12-27 10:03:52 -08:00
Wenchen Fan	a05cc425a0	[SPARK-18990][SQL] make DatasetBenchmark fairer for Dataset ## What changes were proposed in this pull request? Currently `DatasetBenchmark` use `case class Data(l: Long, s: String)` as the record type of `RDD` and `Dataset`, which introduce serialization overhead only to `Dataset` and is unfair. This PR use `Long` as the record type, to be fairer for `Dataset` ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #16391 from cloud-fan/benchmark.	2016-12-27 22:42:28 +08:00

1 2 3 4 5 ...

3330 commits