ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
gatorsmile	a59ab594ca	[SPARK-15184][SQL] Fix Silent Removal of An Existent Temp Table by Rename Table #### What changes were proposed in this pull request? Currently, if we rename a temp table `Tab1` to another existing temp table `Tab2`, `Tab2` is silently removed. This PR detects this case and issues an exception message. In addition, this PR also fixes another issue in the rename table command: when the destination table identifier has a database name, we should not ignore it. That might mean users could rename a regular table. #### How was this patch tested? Added two related test cases Author: gatorsmile <gatorsmile@gmail.com> Closes #12959 from gatorsmile/rewriteTable.	2016-05-09 13:05:18 +08:00
Herman van Hovell	df89f1d43d	[SPARK-15122] [SQL] Fix TPC-DS 41 - Normalize predicates before pulling them out ## What changes were proposed in this pull request? The official TPC-DS 41 query currently fails because it contains a scalar subquery with a disjunctive correlated predicate (the correlated predicates were nested in ORs). This makes the `Analyzer` pull out the entire predicate, which is wrong and causes the following (correct) analysis exception: `The correlated scalar subquery can only contain equality predicates` This PR fixes this by first simplifying (or normalizing) the correlated predicates before pulling them out of the subquery. ## How was this patch tested? Manual testing on TPC-DS 41, and added a test to SubquerySuite. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12954 from hvanhovell/SPARK-15122.	2016-05-06 21:06:03 -07:00
Kevin Yu	607a27a0d1	[SPARK-15051][SQL] Create a TypedColumn alias ## What changes were proposed in this pull request? Currently, when we create an alias against a TypedColumn from a user-defined Aggregator (for example: agg(aggSum.toColumn as "a")), Spark uses the alias function from Column (as). That function returns a column that contains a TypedAggregateExpression, which is unresolved because the inputDeserializer is not defined. Later the aggregator function (agg) injects the inputDeserializer back into the TypedAggregateExpression, but only if the aggregate columns are TypedColumns. In the above case, the TypedAggregateExpression remains unresolved because it is under a Column, which causes the problem reported in JIRA [15051](https://issues.apache.org/jira/browse/SPARK-15051?jql=project%20%3D%20SPARK). This PR proposes to create an alias function for TypedColumn that returns a TypedColumn. It uses a code path similar to Column's alias function. Spark's built-in aggregate functions, like max, already work with alias, for example val df1 = Seq(1 -> "a", 2 -> "b", 3 -> "b").toDF("i", "j") checkAnswer(df1.agg(max("j") as "b"), Row(3) :: Nil) Thanks for comments. ## How was this patch tested? Added test cases in DatasetAggregatorSuite.scala and ran the SQL-related queries against this patch. Author: Kevin Yu <qyu@us.ibm.com> Closes #12893 from kevinyu98/spark-15051.	2016-05-07 11:13:48 +08:00
Tathagata Das	f7b7ef4166	[SPARK-14997][SQL] Fixed FileCatalog to return correct set of files when there is no partitioning scheme in the given paths ## What changes were proposed in this pull request? Let's say there are JSON files in the following directory structure ``` xyz/file0.json xyz/subdir1/file1.json xyz/subdir2/file2.json xyz/subdir1/subsubdir1/file3.json ``` `sqlContext.read.json("xyz")` should read only file0.json according to the behavior in Spark 1.6.1. However, in current master, all 4 files are read. The fix is to make FileCatalog return only the child files of the given path if no partitioning is detected (instead of the full recursive list of files). Closes #12774 ## How was this patch tested? unit tests Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #12856 from tdas/SPARK-14997.	2016-05-06 15:04:16 -07:00
gatorsmile	5c8fad7b9b	[SPARK-15108][SQL] Describe Permanent UDTF #### What changes were proposed in this pull request? When describing a UDTF, the command returns a wrong result. The command is unable to find the function, which has been created and cataloged in the catalog but not in the functionRegistry. This PR corrects it. If the function is not in the functionRegistry, we check the catalog to collect the information about the UDTF. #### How was this patch tested? Added test cases to verify the results Author: gatorsmile <gatorsmile@gmail.com> Closes #12885 from gatorsmile/showFunction.	2016-05-06 11:43:07 -07:00
hyukjinkwon	fa928ff9a3	[SPARK-14962][SQL] Do not push down isnotnull/isnull on unsuportted types in ORC ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-14962 ORC filters were being pushed down for all types for both `IsNull` and `IsNotNull`. This is apparently OK because both `IsNull` and `IsNotNull` do not take a type as an argument (Hive 1.2.x) during building filters (`SearchArgument`) in Spark-side but they do not filter correctly because stored statistics always produces `null` for not supported types (eg `ArrayType`) in ORC-side. So, it is always `true` for `IsNull` which ends up with always `false` for `IsNotNull`. (Please see [RecordReaderImpl.java#L296-L318](https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java#L296-L318) and [RecordReaderImpl.java#L359-L365](https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java#L359-L365) in Hive 1.2) This looks prevented in Hive 1.3.x >= by forcing to give a type ([`PredicateLeaf.Type`](`e085b7e9bd/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/PredicateLeaf.java (L50-L56)`)) when building a filter ([`SearchArgument`](`26b5c7b56a/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/SearchArgument.java (L260)`)) but Hive 1.2.x seems not doing this. This PR prevents ORC filter creation for `IsNull` and `IsNotNull` on unsupported types. `OrcFilters` resembles `ParquetFilters`. ## How was this patch tested? Unittests in `OrcQuerySuite` and `OrcFilterSuite` and `sbt scalastyle`. Author: hyukjinkwon <gurwls223@gmail.com> Author: Hyukjin Kwon <gurwls223@gmail.com> Closes #12777 from HyukjinKwon/SPARK-14962.	2016-05-07 01:46:45 +08:00
Jacek Laskowski	bbb7773437	[SPARK-15152][DOC][MINOR] Scaladoc and Code style Improvements ## What changes were proposed in this pull request? Minor doc and code style fixes ## How was this patch tested? local build Author: Jacek Laskowski <jacek@japila.pl> Closes #12928 from jaceklaskowski/SPARK-15152.	2016-05-05 16:34:27 -07:00
Dongjoon Hyun	2c170dd3d7	[SPARK-15134][EXAMPLE] Indent SparkSession builder patterns and update binary_classification_metrics_example.py ## What changes were proposed in this pull request? This issue addresses the comments in SPARK-15031 and also fixes java-linter errors. - Use multiline format in SparkSession builder patterns. - Update `binary_classification_metrics_example.py` to use `SparkSession`. - Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far) ## How was this patch tested? Passed the Jenkins tests and ran `dev/lint-java` manually. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12911 from dongjoon-hyun/SPARK-15134.	2016-05-05 14:37:50 -07:00
Shixiong Zhu	bb9991dec5	[SPARK-15135][SQL] Make sure SparkSession thread safe ## What changes were proposed in this pull request? Went through SparkSession and its members and fixed non-thread-safe classes used by SparkSession ## How was this patch tested? Existing unit tests Author: Shixiong Zhu <shixiong@databricks.com> Closes #12915 from zsxwing/spark-session-thread-safe.	2016-05-05 14:36:47 -07:00
Sandeep Singh	ed6f3f8a5f	[SPARK-15072][SQL][REPL][EXAMPLES] Remove SparkSession.withHiveSupport ## What changes were proposed in this pull request? Removing the `withHiveSupport` method of `SparkSession`, instead use `enableHiveSupport` ## How was this patch tested? ran tests locally Author: Sandeep Singh <sandeep@techaddict.me> Closes #12851 from techaddict/SPARK-15072.	2016-05-05 14:35:15 -07:00
gatorsmile	8cba57a75c	[SPARK-14124][SQL][FOLLOWUP] Implement Database-related DDL Commands #### What changes were proposed in this pull request? First, a few test cases failed in mac OS X because the property value of `java.io.tmpdir` does not include a trailing slash on some platform. Hive always removes the last trailing slash. For example, what I got in the web: ``` Win NT --> C:\TEMP\ Win XP --> C:\TEMP Solaris --> /var/tmp/ Linux --> /var/tmp ``` Second, a couple of test cases are added to verify if the commands work properly. #### How was this patch tested? Added a test case for it and correct the previous test cases. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #12081 from gatorsmile/mkdir.	2016-05-05 14:34:24 -07:00
NarineK	22226fcc92	[SPARK-15110] [SPARKR] Implement repartitionByColumn for SparkR DataFrames ## What changes were proposed in this pull request? Implement repartitionByColumn on DataFrame. This will allow us to run R functions on each partition identified by column groups with dapply() method. ## How was this patch tested? Unit tests Author: NarineK <narine.kokhlikyan@us.ibm.com> Closes #12887 from NarineK/repartitionByColumns.	2016-05-05 12:00:55 -07:00
hyukjinkwon	ac12b35d31	[SPARK-15148][SQL] Upgrade Univocity library from 2.0.2 to 2.1.0 ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-15148 Mainly, it improves performance by roughly 30%-40% according to the [release note](https://github.com/uniVocity/univocity-parsers/releases/tag/v2.1.0). The details of the purpose are described in the JIRA. This PR upgrades the Univocity library from 2.0.2 to 2.1.0. ## How was this patch tested? Existing tests should cover this. Author: hyukjinkwon <gurwls223@gmail.com> Closes #12923 from HyukjinKwon/SPARK-15148.	2016-05-05 11:26:40 -07:00
Wenchen Fan	55cc1c991a	[SPARK-14139][SQL] RowEncoder should preserve schema nullability ## What changes were proposed in this pull request? The problem is: In `RowEncoder`, we use `Invoke` to get the field of an external row, which loses the nullability information. This PR creates a `GetExternalRowField` expression, so that we can preserve the nullability info. TODO: simplify the null handling logic in `RowEncoder`, to remove so many if branches, in a follow-up PR. ## How was this patch tested? new tests in `RowEncoderSuite` Note that this PR takes over https://github.com/apache/spark/pull/11980, with a little simplification, so all credit should go to koertkuipers Author: Wenchen Fan <wenchen@databricks.com> Author: Koert Kuipers <koert@tresata.com> Closes #12364 from cloud-fan/nullable.	2016-05-06 01:08:04 +08:00
Kousuke Saruta	1a9b341581	[SPARK-15132][MINOR][SQL] Debug log for generated code should be printed with proper indentation ## What changes were proposed in this pull request? Similar to #11990, GenerateOrdering and GenerateColumnAccessor should print debug log for generated code with proper indentation. ## How was this patch tested? Manually checked. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #12908 from sarutak/SPARK-15132.	2016-05-04 22:18:55 -07:00
Tathagata Das	bde27b89a2	[SPARK-15131][SQL] Shutdown StateStore management thread when SparkContext has been shutdown ## What changes were proposed in this pull request? Make sure that whenever the StateStoreCoordinator cannot be contacted, assume that the SparkContext and RpcEnv on the driver has been shutdown, and therefore stop the StateStore management thread, and unload all loaded stores. ## How was this patch tested? Updated unit tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #12905 from tdas/SPARK-15131.	2016-05-04 21:19:53 -07:00
gatorsmile	ef55e46c92	[SPARK-14993][SQL] Fix Partition Discovery Inconsistency when Input is a Path to Parquet File #### What changes were proposed in this pull request? When we load a dataset, if we set the path to ```/path/a=1```, we will not take `a` as the partitioning column. However, if we set the path to ```/path/a=1/file.parquet```, we take `a` as the partitioning column and it shows up in the schema. This PR is to fix the behavior inconsistency issue. The base path contains a set of paths that are considered as the base dirs of the input datasets. The partitioning discovery logic will make sure it will stop when it reaches any base path. By default, the paths of the dataset provided by users will be base paths. Below are three typical cases, Case 1```sqlContext.read.parquet("/path/something=true/")```: the base path will be `/path/something=true/`, and the returned DataFrame will not contain a column of `something`. Case 2```sqlContext.read.parquet("/path/something=true/a.parquet")```: the base path will be still `/path/something=true/`, and the returned DataFrame will also not contain a column of `something`. Case 3```sqlContext.read.parquet("/path/")```: the base path will be `/path/`, and the returned DataFrame will have the column of `something`. Users also can override the basePath by setting `basePath` in the options to pass the new base path to the data source. For example, ```sqlContext.read.option("basePath", "/path/").parquet("/path/something=true/")```, and the returned DataFrame will have the column of `something`. The related PRs: - https://github.com/apache/spark/pull/9651 - https://github.com/apache/spark/pull/10211 #### How was this patch tested? Added a couple of test cases Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #12828 from gatorsmile/readPartitionedTable.	2016-05-04 18:47:27 -07:00
Sean Zhong	8fb1463d6a	[SPARK-6339][SQL] Supports CREATE TEMPORARY VIEW tableIdentifier AS query ## What changes were proposed in this pull request? This PR supports the new SQL syntax CREATE TEMPORARY VIEW, for example: ``` CREATE TEMPORARY VIEW viewName AS SELECT * from xx CREATE OR REPLACE TEMPORARY VIEW viewName AS SELECT * from xx CREATE TEMPORARY VIEW viewName (c1 COMMENT 'blabla', c2 COMMENT 'blabla') AS SELECT * FROM xx ``` ## How was this patch tested? Unit tests. Author: Sean Zhong <clockfly@gmail.com> Closes #12872 from clockfly/spark-6399.	2016-05-04 18:27:25 -07:00
sethah	b281377647	[MINOR][SQL] Fix typo in DataFrameReader csv documentation ## What changes were proposed in this pull request? Typo fix ## How was this patch tested? No tests My apologies for the tiny PR, but I stumbled across this today and wanted to get it corrected for 2.0. Author: sethah <seth.hendrickson16@gmail.com> Closes #12912 from sethah/csv_typo.	2016-05-04 16:46:13 -07:00
Reynold Xin	6ae9fc00ed	[SPARK-15126][SQL] RuntimeConfig.set should return Unit ## What changes were proposed in this pull request? Currently we return RuntimeConfig itself to facilitate chaining. However, it makes the output in interactive environments (e.g. notebooks, scala repl) weird because it'd show the response of calling set as a RuntimeConfig itself. ## How was this patch tested? Updated unit tests. Author: Reynold Xin <rxin@databricks.com> Closes #12902 from rxin/SPARK-15126.	2016-05-04 14:26:05 -07:00
Tathagata Das	0fd3a47484	[SPARK-15103][SQL] Refactored FileCatalog class to allow StreamFileCatalog to infer partitioning ## What changes were proposed in this pull request? File Stream Sink writes the list of written files in a metadata log. StreamFileCatalog reads the list of the files for processing. However, StreamFileCatalog does not infer partitioning like HDFSFileCatalog. This PR enables that by refactoring HDFSFileCatalog to create an abstract class PartitioningAwareFileCatalog that has all the functionality to infer partitions from a list of leaf files. - HDFSFileCatalog has been renamed to ListingFileCatalog and it extends PartitioningAwareFileCatalog by providing a list of leaf files from recursive directory scanning. - StreamFileCatalog has been renamed to MetadataLogFileCatalog and it extends PartitioningAwareFileCatalog by providing a list of leaf files from the metadata log. - The above two classes have been moved into their own files as they are not interfaces that should be in fileSourceInterfaces.scala. ## How was this patch tested? - FileStreamSinkSuite was updated to check whether partitioning gets inferred, and on reading whether the partitions get pruned correctly based on the query. - Other unit tests are unchanged and pass as expected. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #12879 from tdas/SPARK-15103.	2016-05-04 11:02:48 -07:00
Reynold Xin	6274a520fa	[SPARK-15115][SQL] Reorganize whole stage codegen benchmark suites ## What changes were proposed in this pull request? We currently have a single suite that is very large, making it difficult to maintain and play with specific primitives. This patch reorganizes the file by creating multiple benchmark suites in a single package. Most of the changes are straightforward code moves. On top of the code moving, I did: 1. Use SparkSession instead of SQLContext. 2. Turned most benchmark scenarios into their own test cases, rather than having multiple scenarios in a single test case, which takes forever to run. ## How was this patch tested? This is a test only change. Author: Reynold Xin <rxin@databricks.com> Closes #12891 from rxin/SPARK-15115.	2016-05-04 11:00:01 -07:00
Liang-Chi Hsieh	b85d21fb9d	[SPARK-14951] [SQL] Support subexpression elimination in TungstenAggregate ## What changes were proposed in this pull request? We can support subexpression elimination in TungstenAggregate by using the current `EquivalentExpressions`, which is already used in subexpression elimination for expression codegen. However, in wholestage codegen, we can't wrap the common expressions' code in functions as before, so we simply generate the code snippets for common expressions. These code snippets are inserted before the common expressions are actually used in the generated Java code. When multiple `TypedAggregateExpression`s are used in an aggregation operator, their input type should be the same, so their `inputDeserializer` will be the same too. This patch can also reduce redundant input deserialization. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #12729 from viirya/subexpr-elimination-tungstenaggregate.	2016-05-04 10:54:51 -07:00
Reynold Xin	d864c55cf8	[SPARK-15109][SQL] Accept Dataset[_] in joins ## What changes were proposed in this pull request? This patch changes the join API in Dataset so they can accept any Dataset, rather than just DataFrames. ## How was this patch tested? N/A. Author: Reynold Xin <rxin@databricks.com> Closes #12886 from rxin/SPARK-15109.	2016-05-04 10:38:27 -07:00
Liwei Lin	e597ec6f1c	[SPARK-15022][SPARK-15023][SQL][STREAMING] Add support for testing against the `ProcessingTime(intervalMS > 0)` trigger and `ManualClock` ## What changes were proposed in this pull request? Currently in `StreamTest`, we have a `StartStream` which will start a streaming query against trigger `ProcessingTime(intervalMS = 0)` and `SystemClock`. We also need to test cases against `ProcessingTime(intervalMS > 0)`, which often requires `ManualClock`. This patch: - fixes an issue in `ProcessingTimeExecutor`, where for a batch it should run `batchRunner` only once but might run it multiple times under certain conditions; - adds support for testing against the `ProcessingTime(intervalMS > 0)` trigger and `AdvanceManualClock`, by specifying them as fields for `StartStream`, and by adding an `AdvanceClock` action; - adds a test, which takes advantage of the new `StartStream` and `AdvanceManualClock`, to test against [PR#[SPARK-14942] Reduce delay between batch construction and execution ](https://github.com/apache/spark/pull/12725). ## How was this patch tested? N/A Author: Liwei Lin <lwlin7@gmail.com> Closes #12797 from lw-lin/add-trigger-test-support.	2016-05-04 10:25:14 -07:00
Cheng Lian	f152fae306	[SPARK-14127][SQL] Native "DESC [EXTENDED \| FORMATTED] <table>" DDL command ## What changes were proposed in this pull request? This PR implements native `DESC [EXTENDED \| FORMATTED] <table>` DDL command. Sample output: ``` scala> spark.sql("desc extended src").show(100, truncate = false) +----------------------------+---------------------------------+-------+ \|col_name \|data_type \|comment\| +----------------------------+---------------------------------+-------+ \|key \|int \| \| \|value \|string \| \| \| \| \| \| \|# Detailed Table Information\|CatalogTable(`default`.`src`, ...\| \| +----------------------------+---------------------------------+-------+ scala> spark.sql("desc formatted src").show(100, truncate = false) +----------------------------+----------------------------------------------------------+-------+ \|col_name \|data_type \|comment\| +----------------------------+----------------------------------------------------------+-------+ \|key \|int \| \| \|value \|string \| \| \| \| \| \| \|# Detailed Table Information\| \| \| \|Database: \|default \| \| \|Owner: \|lian \| \| \|Create Time: \|Mon Jan 04 17:06:00 CST 2016 \| \| \|Last Access Time: \|Thu Jan 01 08:00:00 CST 1970 \| \| \|Location: \|hdfs://localhost:9000/user/hive/warehouse_hive121/src \| \| \|Table Type: \|MANAGED \| \| \|Table Parameters: \| \| \| \| transient_lastDdlTime \|1451898360 \| \| \| \| \| \| \|# Storage Information \| \| \| \|SerDe Library: \|org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe \| \| \|InputFormat: \|org.apache.hadoop.mapred.TextInputFormat \| \| \|OutputFormat: \|org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat\| \| \|Num Buckets: \|-1 \| \| \|Bucket Columns: \|[] \| \| \|Sort Columns: \|[] \| \| \|Storage Desc Parameters: \| \| \| \| serialization.format \|1 \| \| +----------------------------+----------------------------------------------------------+-------+ ``` ## How was this patch tested? 
A test case is added to `HiveDDLSuite` to check command output. Author: Cheng Lian <lian@databricks.com> Closes #12844 from liancheng/spark-14127-desc-table.	2016-05-04 16:44:09 +08:00
Wenchen Fan	6c12e801e8	[SPARK-15029] improve error message for Generate ## What changes were proposed in this pull request? This PR improves the error message for `Generate` in 3 cases: 1. the generator is nested in expressions, e.g. `SELECT explode(list) + 1 FROM tbl` 2. the generator appears more than once in SELECT, e.g. `SELECT explode(list), explode(list) FROM tbl` 3. the generator appears in an operator other than project, e.g. `SELECT * FROM tbl SORT BY explode(list)` ## How was this patch tested? new tests in `AnalysisErrorSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #12810 from cloud-fan/bug.	2016-05-04 00:10:20 -07:00
Cheng Lian	bc3760d405	[SPARK-14237][SQL] De-duplicate partition value appending logic in various buildReader() implementations ## What changes were proposed in this pull request? Currently, various `FileFormat` data sources share approximately the same code for partition value appending. This PR tries to eliminate this duplication. A new method `buildReaderWithPartitionValues()` is added to `FileFormat` with a default implementation that appends partition values to `InternalRow`s produced by the reader function returned by `buildReader()`. Special data sources like Parquet, which implements partition value appending inside `buildReader()` because of the vectorized reader, and the Text data source, which doesn't support partitioning, override `buildReaderWithPartitionValues()` and simply delegate to `buildReader()`. This PR brings two benefits: 1. Apparently, it de-duplicates partition value appending logic 2. Now the reader function returned by `buildReader()` is only required to produce `InternalRow`s rather than `UnsafeRow`s if the data source doesn't override `buildReaderWithPartitionValues()`. Because the safe-to-unsafe conversion is also performed while appending partition values. This makes 3rd-party data sources (e.g. spark-avro) easier to implement since they no longer need to access private APIs involving `UnsafeRow`. ## How was this patch tested? Existing tests should do the work. Author: Cheng Lian <lian@databricks.com> Closes #12866 from liancheng/spark-14237-simplify-partition-values-appending.	2016-05-04 14:16:57 +08:00
Reynold Xin	695f0e9195	[SPARK-15107][SQL] Allow varying # iterations by test case in Benchmark ## What changes were proposed in this pull request? This patch changes our micro-benchmark util to allow setting different iteration numbers for different test cases. For some of our benchmarks, turning off whole-stage codegen can make the runtime 20X slower, making it very difficult to run a large number of times without substantially shortening the input cardinality. With this change, I set the default num iterations to 2 for whole stage codegen off, and 5 for whole stage codegen on. I also updated some results. ## How was this patch tested? N/A - this is a test util. Author: Reynold Xin <rxin@databricks.com> Closes #12884 from rxin/SPARK-15107.	2016-05-03 22:56:40 -07:00
Andrew Or	6ba17cd147	[SPARK-14414][SQL] Make DDL exceptions more consistent ## What changes were proposed in this pull request? Just a bunch of small tweaks on DDL exception messages. ## How was this patch tested? `DDLCommandSuite` et al. Author: Andrew Or <andrew@databricks.com> Closes #12853 from andrewor14/make-exceptions-consistent.	2016-05-03 18:07:53 -07:00
Koert Kuipers	9e4928b7e0	[SPARK-15097][SQL] make Dataset.sqlContext a stable identifier for imports ## What changes were proposed in this pull request? Make Dataset.sqlContext a lazy val so that it's a stable identifier and can be used for imports. Now this works again: import someDataset.sqlContext.implicits._ ## How was this patch tested? Added a unit test to DatasetSuite that uses the import shown above. Author: Koert Kuipers <koert@tresata.com> Closes #12877 from koertkuipers/feat-sqlcontext-stable-import.	2016-05-03 18:06:35 -07:00
Sandeep Singh	a8d56f5388	[SPARK-14422][SQL] Improve handling of optional configs in SQLConf ## What changes were proposed in this pull request? Create a new API for handling optional configs in SQLConf. Right now `getConf` for `OptionalConfigEntry[T]` returns a value of type `T` and throws an exception if the value doesn't exist. This PR adds a new method `getOptionalConf` (suggestions on naming welcome) which returns a value of type `Option[T]`, so it returns `None` if the value doesn't exist. ## How was this patch tested? Added a test and ran tests locally. Author: Sandeep Singh <sandeep@techaddict.me> Closes #12846 from techaddict/SPARK-14422.	2016-05-03 18:02:57 -07:00
Andrew Or	588cac414a	[SPARK-15073][SQL] Hide SparkSession constructor from the public ## What changes were proposed in this pull request? Users should use the builder pattern instead. ## How was this patch tested? Jenkins. Author: Andrew Or <andrew@databricks.com> Closes #12873 from andrewor14/spark-session-constructor.	2016-05-03 13:47:58 -07:00
yzhou2001	a4aed71719	[SPARK-14521] [SQL] StackOverflowError in Kryo when executing TPC-DS ## What changes were proposed in this pull request? Observed a StackOverflowError in Kryo when executing TPC-DS Query27. Spark thrift server disables kryo reference tracking (if not specified in conf). When "spark.kryo.referenceTracking" is set to true explicitly in spark-defaults.conf, the query executes successfully. The root cause is that the TaskMemoryManager inside MemoryConsumer and LongToUnsafeRowMap were not transient and thus were serialized and broadcast around from within LongHashedRelation, which could potentially cause a circular reference inside Kryo. But the TaskMemoryManager is per task and should not be passed around in the first place. This fix makes it transient. ## How was this patch tested? core/test, hive/test, sql/test, catalyst/test, dev/lint-scala, org.apache.spark.sql.hive.execution.HiveCompatibilitySuite, dev/scalastyle, manual test of TPC-DS Query 27 with 1GB data but without the "limit 100" which would cause a NPE due to SPARK-14752. Author: yzhou2001 <yzhou_1999@yahoo.com> Closes #12598 from yzhou2001/master.	2016-05-03 13:41:04 -07:00
Sandeep Singh	ca813330c7	[SPARK-15087][CORE][SQL] Remove AccumulatorV2.localValue and keep only value ## What changes were proposed in this pull request? Remove AccumulatorV2.localValue and keep only value ## How was this patch tested? existing tests Author: Sandeep Singh <sandeep@techaddict.me> Closes #12865 from techaddict/SPARK-15087.	2016-05-03 11:38:43 -07:00
Shixiong Zhu	b545d75219	[SPARK-14860][TESTS] Create a new Waiter in reset to bypass an issue of ScalaTest's Waiter.wait ## What changes were proposed in this pull request? This PR updates `QueryStatusCollector.reset` to create a new Waiter instead of calling `await(1 milliseconds)`, to work around a ScalaTest issue where Waiter.await may block forever. ## How was this patch tested? I created a local stress test to call the code in `test("event ordering")` 100 times. It cannot pass without this patch. Author: Shixiong Zhu <shixiong@databricks.com> Closes #12623 from zsxwing/flaky-test.	2016-05-03 11:16:55 -07:00
Tathagata Das	4ad492c403	[SPARK-14716][SQL] Added support for partitioning in FileStreamSink # What changes were proposed in this pull request? Support partitioning in the file stream sink. This is implemented using a new, but simpler code path for writing parquet files - both unpartitioned and partitioned. This new code path does not use Output Committers, as we will eventually write the file names to the metadata log for "committing" them. This patch duplicates < 100 LOC from the WriterContainer. But it's far simpler than WriterContainer as it does not involve output committing. In addition, it introduces new APIs in FileFormat and OutputWriterFactory in an attempt to simplify the APIs (not have Job in the `FileFormat` API, not have bucket and other stuff in the `OutputWriterFactory.newInstance()` ). # Tests - New unit tests to test the FileStreamSinkWriter for partitioned and unpartitioned files - New unit test to partially test the FileStreamSink for partitioned files (does not test recovery of partition column data, as that requires a change in the StreamFileCatalog, future PR). - Updated FileStressSuite to test the number of records read from partitioned output files. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #12409 from tdas/streaming-partitioned-parquet.	2016-05-03 10:58:26 -07:00
Liwei Lin	5bd9a2f697	[SPARK-14884][SQL][STREAMING][WEBUI] Fix call site for continuous queries ## What changes were proposed in this pull request? Since we've been processing continuous queries in separate threads, the call sites are then `run at <unknown>:0`. It's not wrong but provides very little information; in addition, we can not distinguish two queries only from their call sites. This patch fixes this. ### Before [Jobs Tab] ![s1a](https://cloud.githubusercontent.com/assets/15843379/14766101/a47246b2-0a30-11e6-8d81-06a9a600113b.png) [SQL Tab] ![s1b](https://cloud.githubusercontent.com/assets/15843379/14766102/a4750226-0a30-11e6-9ada-773d977d902b.png) ### After [Jobs Tab] ![s2a](https://cloud.githubusercontent.com/assets/15843379/14766104/a89705b6-0a30-11e6-9830-0d40ec68527b.png) [SQL Tab] ![s2b](https://cloud.githubusercontent.com/assets/15843379/14766103/a8966728-0a30-11e6-8e4d-c2e326400478.png) ## How was this patch tested? Manually checks - see screenshots above. Author: Liwei Lin <lwlin7@gmail.com> Closes #12650 from lw-lin/fix-call-site.	2016-05-03 10:10:25 -07:00
Reynold Xin	5503e453ba	[SPARK-15088] [SQL] Remove SparkSqlSerializer ## What changes were proposed in this pull request? This patch removes SparkSqlSerializer. I believe this is now dead code. ## How was this patch tested? Removed a test case related to it. Author: Reynold Xin <rxin@databricks.com> Closes #12864 from rxin/SPARK-15088.	2016-05-03 09:43:47 -07:00
Reynold Xin	d557a5e01e	[SPARK-15081] Move AccumulatorV2 and subclasses into util package ## What changes were proposed in this pull request? This patch moves AccumulatorV2 and subclasses into util package. ## How was this patch tested? Updated relevant tests. Author: Reynold Xin <rxin@databricks.com> Closes #12863 from rxin/SPARK-15081.	2016-05-03 19:45:12 +08:00
Andrew Ray	d8f528ceb6	[SPARK-13749][SQL][FOLLOW-UP] Faster pivot implementation for many distinct values with two phase aggregation ## What changes were proposed in this pull request? This is a follow up PR for #11583. It makes 3 lazy vals into just vals and adds unit test coverage. ## How was this patch tested? Existing unit tests and additional unit tests. Author: Andrew Ray <ray.andrew@gmail.com> Closes #12861 from aray/fast-pivot-follow-up.	2016-05-02 22:47:32 -07:00
Shixiong Zhu	4e3685ae5e	[SPARK-15077][SQL] Use a fair lock to avoid thread starvation in StreamExecution ## What changes were proposed in this pull request? Right now `StreamExecution.awaitBatchLock` uses an unfair lock. `StreamExecution.awaitOffset` may run too long and fail some tests because `StreamExecution.constructNextBatch` keeps getting the lock. See: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/865/testReport/junit/org.apache.spark.sql.streaming/FileStreamSourceStressTestSuite/file_source_stress_test/ This PR uses a fair ReentrantLock to resolve the thread starvation issue. ## How was this patch tested? Modified `FileStreamSourceStressTestSuite.test("file source stress test")` to run the test code 100 times locally. It always fails because of timeout without this patch. Author: Shixiong Zhu <shixiong@databricks.com> Closes #12852 from zsxwing/SPARK-15077.	2016-05-02 18:27:49 -07:00
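The fair-lock idea in SPARK-15077 can be sketched in a few lines. This is a minimal illustration rather than the patch itself: the field name `awaitBatchLock` comes from the commit message, while the `FairLockSketch` object and the `withLock` helper are hypothetical names introduced here.

```scala
import java.util.concurrent.locks.ReentrantLock

object FairLockSketch {
  // Passing `true` requests the fair ordering policy: under contention the
  // lock favors the longest-waiting thread, so one thread that keeps
  // re-acquiring the lock cannot starve the others indefinitely.
  val awaitBatchLock = new ReentrantLock(true)

  // Hypothetical helper (not from the patch): run `body` while holding the
  // lock and always release it, even if `body` throws.
  def withLock[T](body: => T): T = {
    awaitBatchLock.lock()
    try body
    finally awaitBatchLock.unlock()
  }
}
```

Fair locks generally trade some throughput for predictable ordering, which seems a reasonable choice when the goal is to stop a waiting thread from timing out.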
Herman van Hovell	1c19c2769e	[SPARK-15047][SQL] Cleanup SQL Parser ## What changes were proposed in this pull request? This PR addresses a few minor issues in SQL parser: - Removes some unused rules and keywords in the grammar. - Removes code path for fallback SQL parsing (was needed for Hive native parsing). - Use `UnresolvedGenerator` instead of hard-coding `Explode` & `JsonTuple`. - Adds a more generic way of creating error messages for unsupported Hive features. - Use `visitFunctionName` as much as possible. - Interpret a `CatalogColumn`'s `DataType` directly instead of parsing it again. ## How was this patch tested? Existing tests. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12826 from hvanhovell/SPARK-15047.	2016-05-02 18:12:31 -07:00
Liwei Lin	35d9c8aa69	[SPARK-14747][SQL] Add assertStreaming/assertNoneStreaming checks in DataFrameWriter ## Problem If an end user happens to write code mixed with continuous-query-oriented methods and non-continuous-query-oriented methods: ```scala ctx.read .format("text") .stream("...") // continuous query .write .text("...") // non-continuous query; should be startStream() here ``` He/she would get this somehow confusing exception: > Exception in thread "main" java.lang.AssertionError: assertion failed: No plan for FileSource[./continuous_query_test_input] at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) at ... ## What changes were proposed in this pull request? This PR adds checks for continuous-query-oriented methods and non-continuous-query-oriented methods in `DataFrameWriter`: <table> <tr> <td align="center"></td> <td align="center"><strong>can be called on continuous query?</strong></td> <td align="center"><strong>can be called on non-continuous query?</strong></td> </tr> <tr> <td align="center">mode</td> <td align="center"></td> <td align="center">yes</td> </tr> <tr> <td align="center">trigger</td> <td align="center">yes</td> <td align="center"></td> </tr> <tr> <td align="center">format</td> <td align="center">yes</td> <td align="center">yes</td> </tr> <tr> <td align="center">option/options</td> <td align="center">yes</td> <td align="center">yes</td> </tr> <tr> <td align="center">partitionBy</td> <td align="center">yes</td> <td align="center">yes</td> </tr> <tr> <td align="center">bucketBy</td> <td align="center"></td> <td align="center">yes</td> </tr> <tr> <td align="center">sortBy</td> <td align="center"></td> <td align="center">yes</td> </tr> <tr> <td align="center">save</td> <td align="center"></td> <td align="center">yes</td> </tr> <tr> <td align="center">queryName</td> <td 
align="center">yes</td> <td align="center"></td> </tr> <tr> <td align="center">startStream</td> <td align="center">yes</td> <td align="center"></td> </tr> <tr> <td align="center">insertInto</td> <td align="center"></td> <td align="center">yes</td> </tr> <tr> <td align="center">saveAsTable</td> <td align="center"></td> <td align="center">yes</td> </tr> <tr> <td align="center">jdbc</td> <td align="center"></td> <td align="center">yes</td> </tr> <tr> <td align="center">json</td> <td align="center"></td> <td align="center">yes</td> </tr> <tr> <td align="center">parquet</td> <td align="center"></td> <td align="center">yes</td> </tr> <tr> <td align="center">orc</td> <td align="center"></td> <td align="center">yes</td> </tr> <tr> <td align="center">text</td> <td align="center"></td> <td align="center">yes</td> </tr> <tr> <td align="center">csv</td> <td align="center"></td> <td align="center">yes</td> </tr> </table> After this PR's change, the friendly exception would be: > Exception in thread "main" org.apache.spark.sql.AnalysisException: text() can only be called on non-continuous queries; at org.apache.spark.sql.DataFrameWriter.assertNotStreaming(DataFrameWriter.scala:678) at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:629) at ss.SSDemo$.main(SSDemo.scala:47) ## How was this patch tested? dedicated unit tests were added Author: Liwei Lin <lwlin7@gmail.com> Closes #12521 from lw-lin/dataframe-writer-check.	2016-05-02 16:48:20 -07:00
Herman van Hovell	f362363d14	[SPARK-14785] [SQL] Support correlated scalar subqueries ## What changes were proposed in this pull request? In this PR we add support for correlated scalar subqueries. An example of such a query is: ```SQL select * from tbl1 a where a.value > (select max(value) from tbl2 b where b.key = a.key) ``` The implementation adds the `RewriteCorrelatedScalarSubquery` rule to the Optimizer. This rule plans these subqueries using `LEFT OUTER` joins. It currently supports rewrites for `Project`, `Aggregate` & `Filter` logical plans. I could not find well-defined semantics for the use of scalar subqueries in an `Aggregate`. The current implementation evaluates the scalar subquery before aggregation. This means that you either have to make the scalar subquery part of the grouping expression, or that you have to aggregate it further on. I am open to suggestions on this. The implementation currently forces the uniqueness of a scalar subquery by enforcing that it is aggregated and that the resulting column is wrapped in an `AggregateExpression`. ## How was this patch tested? Added tests to `SubquerySuite`. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12822 from hvanhovell/SPARK-14785.	2016-05-02 16:32:31 -07:00
poolis	917d05f43b	[SPARK-12928][SQL] Oracle FLOAT datatype is not properly handled when reading via JDBC The contribution is my original work and that I license the work to the project under the project's open source license. Author: poolis <gmichalopoulos@gmail.com> Author: Greg Michalopoulos <gmichalopoulos@gmail.com> Closes #10899 from poolis/spark-12928.	2016-05-02 16:15:07 -07:00
Reynold Xin	ca1b219858	[SPARK-15052][SQL] Use builder pattern to create SparkSession ## What changes were proposed in this pull request? This patch creates a builder pattern for creating SparkSession. The new code is unused and mostly dead code. I'm putting it up here for feedback. There are a few TODOs that can be done as follow-up pull requests: - [ ] Update tests to use this - [ ] Update examples to use this - [ ] Clean up SQLContext code w.r.t. this one (i.e. SparkSession shouldn't call into SQLContext.getOrCreate; it should be the other way around) - [ ] Remove SparkSession.withHiveSupport - [ ] Disable the old constructor (by making it private) so the only way to start a SparkSession is through this builder pattern ## How was this patch tested? Part of the future pull request is to clean this up and switch existing tests to use this. Author: Reynold Xin <rxin@databricks.com> Closes #12830 from rxin/sparksession-builder.	2016-05-02 15:27:16 -07:00
Pete Robbins	8a1ce4899f	[SPARK-13745] [SQL] Support columnar in memory representation on Big Endian platforms ## What changes were proposed in this pull request? Parquet datasource and ColumnarBatch tests fail on big-endian platforms. This patch adds support for the little-endian byte arrays being correctly interpreted on a big-endian platform. ## How was this patch tested? Spark test builds ran on big endian z/Linux and regression build on little endian amd64 Author: Pete Robbins <robbinspg@gmail.com> Closes #12397 from robbinspg/master.	2016-05-02 13:16:46 -07:00
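To illustrate the problem SPARK-13745 addresses, here is a small sketch of decoding little-endian bytes independently of the host's native byte order. It uses only `java.nio` and is an assumption about the general technique, not a description of how the patch is implemented; `LittleEndianRead` and `readIntLE` are hypothetical names.

```scala
import java.nio.{ByteBuffer, ByteOrder}

object LittleEndianRead {
  // Decode a 4-byte little-endian integer starting at `offset`.
  // Setting the order explicitly makes the result identical on
  // little-endian (amd64) and big-endian (z/Linux) hosts.
  def readIntLE(bytes: Array[Byte], offset: Int): Int =
    ByteBuffer.wrap(bytes, offset, 4).order(ByteOrder.LITTLE_ENDIAN).getInt
}
```

Code that instead reinterprets raw memory in native order would read the same bytes as a different number on a big-endian machine, which is the kind of failure the commit message describes.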
Davies Liu	95e372141a	[SPARK-14781] [SQL] support nested predicate subquery ## What changes were proposed in this pull request? In order to support nested predicate subqueries, this PR introduces an internal join type, ExistenceJoin, which emits all the rows from the left, plus an additional column that indicates whether any rows from the right matched (it's not null-aware right now). This additional column could be used to replace the subquery in Filter. In theory, all predicate subqueries could use this join type, but it's slower than LeftSemi and LeftAnti, so it's only used for nested subqueries (subquery inside OR). For example, the following SQL: ```sql SELECT a FROM t WHERE EXISTS (select 0) OR EXISTS (select 1) ``` This PR also fixes a bug in predicate subquery push down through join (they should not be pushed down). Nested null-aware subquery is still not supported. For example, `a > 3 OR b NOT IN (select bb from t)` After this, we could run TPCDS query Q10, Q35, Q45 ## How was this patch tested? Added unit tests. Author: Davies Liu <davies@databricks.com> Closes #12820 from davies/or_exists.	2016-05-02 12:58:59 -07:00
Shixiong Zhu	a35a67a83d	[SPARK-14579][SQL] Fix the race condition in StreamExecution.processAllAvailable again ## What changes were proposed in this pull request? #12339 didn't fix the race condition. MemorySinkSuite is still flaky: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.2/814/testReport/junit/org.apache.spark.sql.streaming/MemorySinkSuite/registering_as_a_table/ Here is an execution order to reproduce it. \| Time \|Thread 1 \| MicroBatchThread \| \|:-------------:\|:-------------:\|:-----:\| \| 1 \| \| `MemorySink.getOffset` \| \| 2 \| \| availableOffsets ++= newData (availableOffsets is not changed here) \| \| 3 \| addData(newData) \| \| \| 4 \| Set `noNewData` to `false` in processAllAvailable \| \| \| 5 \| \| `dataAvailable` returns `false` \| \| 6 \| \| noNewData = true \| \| 7 \| `noNewData` is true so just return \| \| \| 8 \| assert results and fail \| \| \| 9 \| \| `dataAvailable` returns true so process the new batch \| This PR expands the scope of `awaitBatchLock.synchronized` to eliminate the above race. ## How was this patch tested? test("stress test"). It always failed before this patch. And it will pass after applying this patch. Ignore this test in the PR as it takes several minutes to finish. Author: Shixiong Zhu <shixiong@databricks.com> Closes #12582 from zsxwing/SPARK-14579-2.	2016-05-02 11:28:21 -07:00
Andrew Ray	9927441868	[SPARK-13749][SQL] Faster pivot implementation for many distinct values with two phase aggregation ## What changes were proposed in this pull request? The existing implementation of pivot translates into a single aggregation with one aggregate per distinct pivot value. When the number of distinct pivot values is large (say 1000+) this can get extremely slow since each input value gets evaluated on every aggregate even though it only affects the value of one of them. I'm proposing an alternate strategy for when there are 10+ (somewhat arbitrary threshold) distinct pivot values. We do two phases of aggregation. In the first we group by the grouping columns plus the pivot column and perform the specified aggregations (one or sometimes more). In the second aggregation we group by the grouping columns and use the new (non public) PivotFirst aggregate that rearranges the outputs of the first aggregation into an array indexed by the pivot value. Finally we do a project to extract the array entries into the appropriate output column. ## How was this patch tested? Additional unit tests in DataFramePivotSuite and manual larger scale testing. Author: Andrew Ray <ray.andrew@gmail.com> Closes #11583 from aray/fast-pivot.	2016-05-02 11:12:55 -07:00
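The two-phase strategy described in SPARK-13749 can be sketched on plain Scala collections. This is an illustration of the idea only, not Spark's `PivotFirst` implementation; the `Sale` case class, its fields, and `pivotSum` are hypothetical, and `sum` stands in for whatever aggregation is requested.

```scala
object TwoPhasePivot {
  // One input row: grouping column, pivot column, value to aggregate.
  final case class Sale(year: Int, course: String, earnings: Long)

  def pivotSum(rows: Seq[Sale], pivotValues: Seq[String]): Map[Int, Vector[Option[Long]]] = {
    // Phase 1: group by the grouping column plus the pivot column, so each
    // input row contributes to exactly one partial aggregate.
    val partial: Map[(Int, String), Long] =
      rows.groupBy(r => (r.year, r.course)).map { case (k, rs) => k -> rs.map(_.earnings).sum }

    // Phase 2: group by the grouping column only and place each partial
    // result into the array slot that belongs to its pivot value.
    val index = pivotValues.zipWithIndex.toMap
    partial.groupBy { case ((year, _), _) => year }.map { case (year, cells) =>
      val slots = Array.fill[Option[Long]](pivotValues.length)(None)
      cells.foreach { case ((_, course), total) =>
        index.get(course).foreach(i => slots(i) = Some(total))
      }
      // The final "project" step would read output columns out of these slots.
      year -> slots.toVector
    }
  }
}
```

The point of the split is that the first phase does work proportional to the number of rows, rather than rows times distinct pivot values as in the single-aggregation plan.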
Reynold Xin	44da8d8eab	[SPARK-15049] Rename NewAccumulator to AccumulatorV2 ## What changes were proposed in this pull request? NewAccumulator isn't the best name if we ever come up with v3 of the API. ## How was this patch tested? Updated tests to reflect the change. Author: Reynold Xin <rxin@databricks.com> Closes #12827 from rxin/SPARK-15049.	2016-05-01 20:21:02 -07:00
hyukjinkwon	a832cef112	[SPARK-13425][SQL] Documentation for CSV datasource options ## What changes were proposed in this pull request? This PR adds the explanation and documentation for CSV options for reading and writing. ## How was this patch tested? Style tests with `./dev/run_tests` for documentation style. Author: hyukjinkwon <gurwls223@gmail.com> Author: Hyukjin Kwon <gurwls223@gmail.com> Closes #12817 from HyukjinKwon/SPARK-13425.	2016-05-01 19:05:20 -07:00
Wenchen Fan	90787de864	[SPARK-15033][SQL] fix a flaky test in CachedTableSuite ## What changes were proposed in this pull request? This is caused by https://github.com/apache/spark/pull/12776, which removes the `synchronized` from all methods in `AccumulatorContext`. However, a test in `CachedTableSuite` synchronizes on `AccumulatorContext` and expects that no one else can change it, which is not true anymore. This PR updates that test so that it no longer needs to lock on `AccumulatorContext`. ## How was this patch tested? N/A Author: Wenchen Fan <wenchen@databricks.com> Closes #12811 from cloud-fan/flaky.	2016-04-30 20:28:22 -07:00
Hossein	507bea5ca6	[SPARK-14143] Options for parsing NaNs, Infinity and nulls for numeric types 1. Adds the following options for parsing NaNs: nanValue 2. Adds the following options for parsing infinity: positiveInf, negativeInf. `TypeCast.castTo` is unit tested and an end-to-end test is added to `CSVSuite` Author: Hossein <hossein@databricks.com> Closes #11947 from falaki/SPARK-14143.	2016-04-30 18:12:03 -07:00
Yin Huai	0182d9599d	[SPARK-15034][SPARK-15035][SPARK-15036][SQL] Use spark.sql.warehouse.dir as the warehouse location This PR contains three changes: 1. We will use spark.sql.warehouse.dir to set the warehouse location. We will not use hive.metastore.warehouse.dir. 2. SessionCatalog needs to set the location to default db. Otherwise, when creating a table in SparkSession without hive support, the default db's path will be an empty string. 3. When we create a database, we need to make the path qualified. Existing tests and new tests Author: Yin Huai <yhuai@databricks.com> Closes #12812 from yhuai/warehouse.	2016-04-30 18:04:42 -07:00
Reynold Xin	8dc3987d09	[SPARK-15028][SQL] Remove HiveSessionState.setDefaultOverrideConfs ## What changes were proposed in this pull request? This patch removes some code that is no longer relevant -- mainly HiveSessionState.setDefaultOverrideConfs. ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #12806 from rxin/SPARK-15028.	2016-04-30 01:32:00 -07:00
hyukjinkwon	4bac703eb9	[SPARK-13667][SQL] Support for specifying custom date format for date and timestamp types at CSV datasource. ## What changes were proposed in this pull request? This PR adds the support to specify custom date format for `DateType` and `TimestampType`. For `TimestampType`, this uses the given format to infer schema and also to convert the values. For `DateType`, this uses the given format to convert the values. If the `dateFormat` is not given, then it works with `DateTimeUtils.stringToTime()` for backwards compatibility. When it's given, then it uses `SimpleDateFormat` for parsing data. In addition, `IntegerType`, `DoubleType` and `LongType` have a higher priority than `TimestampType` in type inference. This means even if the given format is `yyyy` or `yyyy.MM`, it will be inferred as `IntegerType` or `DoubleType`. Since it is type inference, I think it is okay to give such precedences. In addition, I renamed `csv.CSVInferSchema` to `csv.InferSchema` as JSON datasource has `json.InferSchema`. Although they have the same names, I did this because I thought the parent package name can still differentiate each. Accordingly, the suite name was also changed from `CSVInferSchemaSuite` to `InferSchemaSuite`. ## How was this patch tested? unit tests are used and `./dev/run_tests` for coding style tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #11550 from HyukjinKwon/SPARK-13667.	2016-04-29 22:52:21 -07:00
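As a rough sketch of the `SimpleDateFormat` path mentioned in SPARK-13667, the snippet below parses a string with a user-supplied pattern. The object and method names are hypothetical, and pinning the time zone to UTC and disabling lenient parsing are assumptions made here so the example is deterministic; the commit message does not say how the patch configures the formatter.

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

object CustomDateFormatSketch {
  // Parse `value` with a user-supplied pattern and return epoch milliseconds.
  // A new formatter is created per call because SimpleDateFormat is not
  // thread-safe.
  def parseMillis(value: String, pattern: String): Long = {
    val fmt = new SimpleDateFormat(pattern)
    fmt.setTimeZone(TimeZone.getTimeZone("UTC")) // assumption: fixed zone for reproducibility
    fmt.setLenient(false)                        // assumption: reject out-of-range fields
    fmt.parse(value).getTime
  }
}
```

With a pattern such as `yyyy`, the raw string also looks like an integer, which is why the commit notes that numeric types take precedence over `TimestampType` during inference.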
Yin Huai	ac41fc648d	[SPARK-14591][SQL] Remove DataTypeParser and add more keywords to the nonReserved list. ## What changes were proposed in this pull request? CatalystSqlParser can parse data types. So, we do not need to have an individual DataTypeParser. ## How was this patch tested? Existing tests Author: Yin Huai <yhuai@databricks.com> Closes #12796 from yhuai/removeDataTypeParser.	2016-04-29 22:49:12 -07:00
Andrew Or	66773eb8a5	[SPARK-15012][SQL] Simplify configuration API further ## What changes were proposed in this pull request? 1. Remove all the `spark.setConf` etc. Just expose `spark.conf` 2. Make `spark.conf` take in things set in the core `SparkConf` as well, otherwise users may get confused This was done for both the Python and Scala APIs. ## How was this patch tested? `SQLConfSuite`, python tests. This one fixes the failed tests in #12787 Closes #12787 Author: Andrew Or <andrew@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #12798 from yhuai/conf-api.	2016-04-29 20:46:07 -07:00
Herman van Hovell	83061be697	[SPARK-14858] [SQL] Enable subquery pushdown The previous subquery PRs did not include support for pushing subqueries used in filters (`WHERE`/`HAVING`) down. This PR adds this support. For example : ```scala range(0, 10).registerTempTable("a") range(5, 15).registerTempTable("b") range(7, 25).registerTempTable("c") range(3, 12).registerTempTable("d") val plan = sql("select * from a join b on a.id = b.id left join c on c.id = b.id where a.id in (select id from d)") plan.explain(true) ``` Leads to the following Analyzed & Optimized plans: ``` == Parsed Logical Plan == ... == Analyzed Logical Plan == id: bigint, id: bigint, id: bigint Project [id#0L,id#4L,id#8L] +- Filter predicate-subquery#16 [(id#0L = id#12L)] : +- SubqueryAlias predicate-subquery#16 [(id#0L = id#12L)] : +- Project [id#12L] : +- SubqueryAlias d : +- Range 3, 12, 1, 8, [id#12L] +- Join LeftOuter, Some((id#8L = id#4L)) :- Join Inner, Some((id#0L = id#4L)) : :- SubqueryAlias a : : +- Range 0, 10, 1, 8, [id#0L] : +- SubqueryAlias b : +- Range 5, 15, 1, 8, [id#4L] +- SubqueryAlias c +- Range 7, 25, 1, 8, [id#8L] == Optimized Logical Plan == Join LeftOuter, Some((id#8L = id#4L)) :- Join Inner, Some((id#0L = id#4L)) : :- Join LeftSemi, Some((id#0L = id#12L)) : : :- Range 0, 10, 1, 8, [id#0L] : : +- Range 3, 12, 1, 8, [id#12L] : +- Range 5, 15, 1, 8, [id#4L] +- Range 7, 25, 1, 8, [id#8L] == Physical Plan == ... ``` I have also taken the opportunity to move quite a bit of code around: - Rewriting subqueris and pulling out correlated predicated from subqueries has been moved into the analyzer. The analyzer transforms `Exists` and `InSubQuery` into `PredicateSubquery` expressions. A PredicateSubquery exposes the 'join' expressions and the proper references. This makes things like type coercion, optimization and planning easier to do. - I have added support for `Aggregate` plans in subqueries. Any correlated expressions will be added to the grouping expressions. 
I have removed support for `Union` plans, since pulling in an outer reference from beneath a Union has no value (a filtered value could easily be part of another Union child). - Resolution of subqueries is now done using `OuterReference`s. These are used to wrap any outer reference; this makes the identification of these references easier, and also makes dealing with duplicate attributes in the outer and inner plans easier. The resolution of subqueries initially used a resolution loop which would alternate between calling the analyzer and trying to resolve the outer references. We now use a dedicated analyzer which uses a special rule for outer reference resolution. These changes are a stepping stone for enabling correlated scalar subqueries, enabling all Hive tests & allowing us to use predicate subqueries anywhere. Current tests and added test cases in FilterPushdownSuite. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12720 from hvanhovell/SPARK-14858.	2016-04-29 16:50:12 -07:00
Andrew Or	d33e3d572e	[SPARK-14988][PYTHON] SparkSession API follow-ups ## What changes were proposed in this pull request? Addresses comments in #12765. ## How was this patch tested? Python tests. Author: Andrew Or <andrew@databricks.com> Closes #12784 from andrewor14/python-followup.	2016-04-29 16:41:13 -07:00
Sun Rui	4ae9fe091c	[SPARK-12919][SPARKR] Implement dapply() on DataFrame in SparkR. ## What changes were proposed in this pull request? dapply() applies an R function on each partition of a DataFrame and returns a new DataFrame. The function signature is: dapply(df, function(localDF) {}, schema = NULL) R function input: local data.frame from the partition on local node R function output: local data.frame Schema specifies the Row format of the resulting DataFrame. It must match the R function's output. If schema is not specified, each partition of the result DataFrame will be serialized in R into a single byte array. Such resulting DataFrame can be processed by successive calls to dapply(). ## How was this patch tested? SparkR unit tests. Author: Sun Rui <rui.sun@intel.com> Author: Sun Rui <sunrui2016@gmail.com> Closes #12493 from sun-rui/SPARK-12919.	2016-04-29 16:41:07 -07:00
Cheng Lian	a04b1de5fa	[SPARK-14981][SQL] Throws exception if DESC is specified for sorting columns ## What changes were proposed in this pull request? Currently Spark SQL doesn't support sorting columns in descending order. However, the parser accepts the syntax and silently drops sorting directions. This PR fixes this by throwing an exception if `DESC` is specified as sorting direction of a sorting column. ## How was this patch tested? A test case is added to test the invalid sorting order by checking exception message. Author: Cheng Lian <lian@databricks.com> Closes #12759 from liancheng/spark-14981.	2016-04-29 14:52:32 -07:00
Andrew Or	a7d0fedc94	[SPARK-14988][PYTHON] SparkSession catalog and conf API ## What changes were proposed in this pull request? The `catalog` and `conf` APIs were exposed in `SparkSession` in #12713 and #12669. This patch adds those to the python API. ## How was this patch tested? Python tests. Author: Andrew Or <andrew@databricks.com> Closes #12765 from andrewor14/python-spark-session-more.	2016-04-29 09:34:10 -07:00
Reynold Xin	054f991c43	[SPARK-14994][SQL] Remove execution hive from HiveSessionState ## What changes were proposed in this pull request? This patch removes executionHive from HiveSessionState and HiveSharedState. ## How was this patch tested? Updated test cases. Author: Reynold Xin <rxin@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #12770 from rxin/SPARK-14994.	2016-04-29 01:14:02 -07:00
Sameer Agarwal	2057cbcb0b	[SPARK-14996][SQL] Add TPCDS Benchmark Queries for SparkSQL ## What changes were proposed in this pull request? This PR adds support for easily running and benchmarking a set of common TPCDS queries locally in SparkSQL. ## How was this patch tested? N/A Author: Sameer Agarwal <sameer@databricks.com> Closes #12771 from sameeragarwal/tpcds-2.	2016-04-29 00:52:42 -07:00
gatorsmile	222dcf7937	[SPARK-12660][SPARK-14967][SQL] Implement Except Distinct by Left Anti Join #### What changes were proposed in this pull request? Replaces a logical `Except` operator with a `Left-anti Join` operator. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins). ```SQL SELECT a1, a2 FROM Tab1 EXCEPT SELECT b1, b2 FROM Tab2 ==> SELECT DISTINCT a1, a2 FROM Tab1 LEFT ANTI JOIN Tab2 ON a1<=>b1 AND a2<=>b2 ``` Note: 1. This rule is only applicable to EXCEPT DISTINCT. Do not use it for EXCEPT ALL. 2. This rule has to be done after de-duplicating the attributes; otherwise, the generated join conditions will be incorrect. This PR also corrects the existing behavior in Spark. Before this PR, the behavior is like ```scala test("except") { val df_left = Seq(1, 2, 2, 3, 3, 4).toDF("id") val df_right = Seq(1, 3).toDF("id") checkAnswer( df_left.except(df_right), Row(2) :: Row(2) :: Row(4) :: Nil ) } ``` After this PR, the result is corrected. We strictly follow the SQL compliance of `Except Distinct`. #### How was this patch tested? Modified and added a few test cases to verify the optimization rule and the results of operators. Author: gatorsmile <gatorsmile@gmail.com> Closes #12736 from gatorsmile/exceptByAntiJoin.	2016-04-29 15:30:36 +08:00
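The `EXCEPT DISTINCT` rewrite in SPARK-12660 can be mimicked on plain Scala collections to show the intended semantics. This is a sketch of the logic only, not the optimizer rule; `ExceptDistinctSketch` and `exceptDistinct` are hypothetical names.

```scala
object ExceptDistinctSketch {
  // Left anti join: keep the left rows that have no equal row on the right.
  // DISTINCT: then drop duplicate survivors, keeping first occurrences.
  def exceptDistinct[A](left: Seq[A], right: Seq[A]): Seq[A] = {
    val rightKeys = right.toSet
    left.filterNot(rightKeys.contains).distinct
  }
}
```

Applied to the data from the commit message, `exceptDistinct(Seq(1, 2, 2, 3, 3, 4), Seq(1, 3))` yields `Seq(2, 4)`, without the duplicate `2` that the old behavior returned.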
Zheng RuiFeng	4f83e442b1	[MINOR][DOC] Minor typo fixes ## What changes were proposed in this pull request? Minor typo fixes ## How was this patch tested? local build Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #12755 from zhengruifeng/fix_doc_dataset.	2016-04-28 22:56:26 -07:00
Reynold Xin	4607f6e7f7	[SPARK-14991][SQL] Remove HiveNativeCommand ## What changes were proposed in this pull request? This patch removes HiveNativeCommand, so we can continue to remove the dependency on Hive. This pull request also removes the ability to generate golden result file using Hive. ## How was this patch tested? Updated tests to reflect this. Author: Reynold Xin <rxin@databricks.com> Closes #12769 from rxin/SPARK-14991.	2016-04-28 21:58:48 -07:00
Wenchen Fan	6f9a18fe31	[HOTFIX][CORE] fix a concurrence issue in NewAccumulator ## What changes were proposed in this pull request? `AccumulatorContext` is not thread-safe, that's why all of its methods are synchronized. However, there is one exception: the `AccumulatorContext.originals`. `NewAccumulator` uses it to check if it's registered, which is wrong as it's not synchronized. This PR marks `AccumulatorContext.originals` as `private` and now all access to `AccumulatorContext` is synchronized. ## How was this patch tested? I verified it locally. To be safe, we can let jenkins test it many times to make sure this problem is gone. Author: Wenchen Fan <wenchen@databricks.com> Closes #12773 from cloud-fan/debug.	2016-04-28 21:57:58 -07:00
Tathagata Das	0ee5419b6c	[SPARK-14970][SQL] Prevent DataSource from enumerates all files in a directory if there is user specified schema ## What changes were proposed in this pull request? The FileCatalog object gets created even if the user specifies a schema, which means files in the directory are enumerated even though it's not necessary. For large directories this is very slow. Users would want to specify a schema in such scenarios of large dirs, and this defeats the purpose quite a bit. ## How was this patch tested? Hard to test this with unit test. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #12748 from tdas/SPARK-14970.	2016-04-28 12:59:08 -07:00
Liang-Chi Hsieh	7c6937a885	[SPARK-14487][SQL] User Defined Type registration without SQLUserDefinedType annotation ## What changes were proposed in this pull request? Currently we use `SQLUserDefinedType` annotation to register UDTs for user classes. However, by doing this, we add a Spark dependency to user classes. For some user classes, it is unnecessary to add such a dependency, which will increase deployment difficulty. We should provide an alternative approach to register UDTs for user classes without the `SQLUserDefinedType` annotation. ## How was this patch tested? `UserDefinedTypeSuite` Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #12259 from viirya/improve-sql-usertype.	2016-04-28 01:14:49 -07:00
Wenchen Fan	bf5496dbda	[SPARK-14654][CORE] New accumulator API ## What changes were proposed in this pull request? This PR introduces a new accumulator API which is much simpler than before: 1. the type hierarchy is simplified, now we only have an `Accumulator` class 2. Combine `initialValue` and `zeroValue` concepts into just one concept: `zeroValue` 3. there is only one `register` method, the accumulator registration and cleanup registration are combined. 4. the `id`,`name` and `countFailedValues` are combined into an `AccumulatorMetadata`, which is provided during registration. `SQLMetric` is a good example to show the simplicity of this new API. What we break: 1. no `setValue` anymore. In the new API, the intermediate type can be different from the result type, it's very hard to implement a general `setValue` 2. accumulator can't be serialized before it is registered. Problems that need to be addressed in follow-ups: 1. with this new API, `AccumulatorInfo` doesn't make a lot of sense, the partial output is not partial updates, we need to expose the intermediate value. 2. `ExceptionFailure` should not carry the accumulator updates. Why do users care about accumulator updates for failed cases? It looks like we only use this feature to update the internal metrics, how about sending a heartbeat to update internal metrics after the failure event? 3. the public event `SparkListenerTaskEnd` carries a `TaskMetrics`. Ideally this `TaskMetrics` doesn't need to carry external accumulators, as the only method of `TaskMetrics` that can access external accumulators is `private[spark]`. However, `SQLListener` uses it to retrieve sql metrics. ## How was this patch tested? existing tests Author: Wenchen Fan <wenchen@databricks.com> Closes #12612 from cloud-fan/acc.	2016-04-28 00:26:39 -07:00
Davies Liu	ae4e3def5e	[SPARK-14961] Build HashedRelation larger than 1G ## What changes were proposed in this pull request? Currently, LongToUnsafeRowMap uses a byte array as the underlying page, which can't be larger than 1G. This PR improves LongToUnsafeRowMap to scale up to 8G bytes by using an array of Long instead of an array of byte. ## How was this patch tested? Manually ran a test to confirm that both UnsafeHashedRelation and LongHashedRelation could build a map that is larger than 2G. Author: Davies Liu <davies@databricks.com> Closes #12740 from davies/larger_broadcast.	2016-04-27 21:23:40 -07:00
Andrew Or	37575115b9	[SPARK-14940][SQL] Move ExternalCatalog to own file ## What changes were proposed in this pull request? `interfaces.scala` was getting big. This just moves the biggest class in there to a new file for cleanliness. ## How was this patch tested? Just moving things around. Author: Andrew Or <andrew@databricks.com> Closes #12721 from andrewor14/move-external-catalog.	2016-04-27 14:17:36 -07:00
Cheng Lian	24bea00047	[SPARK-14954] [SQL] Add PARTITION BY and BUCKET BY clause for data source CTAS syntax Currently, we can only create persisted partitioned and/or bucketed data source tables using the Dataset API but not using SQL DDL. This PR implements the following syntax to add partitioning and bucketing support to the SQL DDL: ``` CREATE TABLE <table-name> USING <provider> [OPTIONS (<key1> <value1>, <key2> <value2>, ...)] [PARTITIONED BY (col1, col2, ...)] [CLUSTERED BY (col1, col2, ...) [SORTED BY (col1, col2, ...)] INTO <n> BUCKETS] AS SELECT ... ``` Test cases are added in `MetastoreDataSourcesSuite` to check the newly added syntax. Author: Cheng Lian <lian@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #12734 from liancheng/spark-14954.	2016-04-27 13:55:13 -07:00
Dongjoon Hyun	af92299fdb	[SPARK-14664][SQL] Implement DecimalAggregates optimization for Window queries ## What changes were proposed in this pull request? This PR aims to implement decimal aggregation optimization for window queries by improving existing `DecimalAggregates`. Historically, `DecimalAggregates` optimizer is designed to transform general `sum/avg(decimal)`, but it breaks recently added windows queries like the followings. The following queries work well without the current `DecimalAggregates` optimizer. Sum ```scala scala> sql("select sum(a) over () from (select explode(array(1.0,2.0)) a) t").head java.lang.RuntimeException: Unsupported window function: MakeDecimal((sum(UnscaledValue(a#31)),mode=Complete,isDistinct=false),12,1) scala> sql("select sum(a) over () from (select explode(array(1.0,2.0)) a) t").explain() == Physical Plan == WholeStageCodegen : +- Project [sum(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#23] : +- INPUT +- Window [MakeDecimal((sum(UnscaledValue(a#21)),mode=Complete,isDistinct=false),12,1) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#23] +- Exchange SinglePartition, None +- Generate explode([1.0,2.0]), false, false, [a#21] +- Scan OneRowRelation[] ``` Average ```scala scala> sql("select avg(a) over () from (select explode(array(1.0,2.0)) a) t").head java.lang.RuntimeException: Unsupported window function: cast(((avg(UnscaledValue(a#40)),mode=Complete,isDistinct=false) / 10.0) as decimal(6,5)) scala> sql("select avg(a) over () from (select explode(array(1.0,2.0)) a) t").explain() == Physical Plan == WholeStageCodegen : +- Project [avg(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#44] : +- INPUT +- Window [cast(((avg(UnscaledValue(a#42)),mode=Complete,isDistinct=false) / 10.0) as decimal(6,5)) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS 
avg(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#44] +- Exchange SinglePartition, None +- Generate explode([1.0,2.0]), false, false, [a#42] +- Scan OneRowRelation[] ``` After this PR, those queries work fine and new optimized physical plans look like the followings. Sum ```scala scala> sql("select sum(a) over () from (select explode(array(1.0,2.0)) a) t").explain() == Physical Plan == WholeStageCodegen : +- Project [sum(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#35] : +- INPUT +- Window [MakeDecimal((sum(UnscaledValue(a#33)),mode=Complete,isDistinct=false) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),12,1) AS sum(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#35] +- Exchange SinglePartition, None +- Generate explode([1.0,2.0]), false, false, [a#33] +- Scan OneRowRelation[] ``` Average ```scala scala> sql("select avg(a) over () from (select explode(array(1.0,2.0)) a) t").explain() == Physical Plan == WholeStageCodegen : +- Project [avg(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#47] : +- INPUT +- Window [cast(((avg(UnscaledValue(a#45)),mode=Complete,isDistinct=false) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) / 10.0) as decimal(6,5)) AS avg(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#47] +- Exchange SinglePartition, None +- Generate explode([1.0,2.0]), false, false, [a#45] +- Scan OneRowRelation[] ``` In this PR, SUM over window pattern matching is based on the code of hvanhovell ; he should be credited for the work he did. ## How was this patch tested? Pass the Jenkins tests (with newly added testcases) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12421 from dongjoon-hyun/SPARK-14664.	2016-04-27 21:36:19 +02:00
Liwei Lin	a234cc6146	[SPARK-14874][SQL][STREAMING] Remove the obsolete Batch representation ## What changes were proposed in this pull request? The `Batch` class, which had been used to indicate progress in a stream, was abandoned by [[SPARK-13985][SQL] Deterministic batches with ids](`caea152145`) and then became useless. This patch: - removes the `Batch` class - ~~does some related renaming~~ (update: this has been reverted) - fixes some related comments ## How was this patch tested? N/A Author: Liwei Lin <lwlin7@gmail.com> Closes #12638 from lw-lin/remove-batch.	2016-04-27 10:25:33 -07:00
Herman van Hovell	7dd01d9c01	[SPARK-14950][SQL] Fix BroadcastHashJoin's unique key Anti-Joins ### What changes were proposed in this pull request? Anti-Joins using BroadcastHashJoin's unique key code path are broken; it currently returns Semi Join results. This PR fixes this bug. ### How was this patch tested? Added test cases to `ExistenceJoinSuite`. cc davies gatorsmile Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12730 from hvanhovell/SPARK-14950.	2016-04-27 19:15:17 +02:00
Yin Huai	54a3eb8312	[SPARK-14130][SQL] Throw exceptions for ALTER TABLE ADD/REPLACE/CHANGE COLUMN, ALTER TABLE SET FILEFORMAT, DFS, and transaction related commands ## What changes were proposed in this pull request? This PR will make Spark SQL not allow ALTER TABLE ADD/REPLACE/CHANGE COLUMN, ALTER TABLE SET FILEFORMAT, DFS, and transaction related commands. ## How was this patch tested? Existing tests. For those tests that I put in the blacklist, I am adding the useful parts back to SQLQuerySuite. Author: Yin Huai <yhuai@databricks.com> Closes #12714 from yhuai/banNativeCommand.	2016-04-27 00:30:54 -07:00
Reynold Xin	8fda5a73dc	[SPARK-14913][SQL] Simplify configuration API ## What changes were proposed in this pull request? We currently expose both Hadoop configuration and Spark SQL configuration in RuntimeConfig. I think we can remove the Hadoop configuration part, and simply generate Hadoop Configuration on the fly by passing all the SQL configurations into it. This way, there is a single interface (in Java/Scala/Python/SQL) for end-users. As part of this patch, I also removed some config options deprecated in Spark 1.x. ## How was this patch tested? Updated relevant tests. Author: Reynold Xin <rxin@databricks.com> Closes #12689 from rxin/SPARK-14913.	2016-04-26 22:02:28 -07:00
Andrew Or	d8a83a564f	[SPARK-13477][SQL] Expose new user-facing Catalog interface ## What changes were proposed in this pull request? #12625 exposed a new user-facing conf interface in `SparkSession`. This patch adds a catalog interface. ## How was this patch tested? See `CatalogSuite`. Author: Andrew Or <andrew@databricks.com> Closes #12713 from andrewor14/user-facing-catalog.	2016-04-26 21:29:25 -07:00
Dilip Biswal	d93976d866	[SPARK-14445][SQL] Support native execution of SHOW COLUMNS and SHOW PARTITIONS ## What changes were proposed in this pull request? This PR adds native execution of the SHOW COLUMNS and SHOW PARTITIONS commands. Command Syntax: ``` SQL SHOW COLUMNS (FROM | IN) table_identifier [(FROM | IN) database] ``` ``` SQL SHOW PARTITIONS [db_name.]table_name [PARTITION(partition_spec)] ``` ## How was this patch tested? Added test cases in HiveCommandSuite to verify execution and DDLCommandSuite to verify plans. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #12222 from dilipbiswal/dkb_show_columns.	2016-04-27 09:28:24 +08:00
Sameer Agarwal	9797cc20c0	[SPARK-14929] [SQL] Disable vectorized map for wide schemas & high-precision decimals ## What changes were proposed in this pull request? While the vectorized hash map in `TungstenAggregate` is currently supported for all primitive data types during partial aggregation, this patch only enables the hash map for a subset of cases that have been verified to show performance improvements on our benchmarks, subject to an internal conf that sets an upper limit on the maximum length of the aggregate key/value schema. This list of supported use-cases should be expanded over time. ## How was this patch tested? There is no change in functionality, so existing tests should suffice. Performance tests were done on TPCDS benchmarks. Author: Sameer Agarwal <sameer@databricks.com> Closes #12710 from sameeragarwal/vectorized-enable.	2016-04-26 14:51:14 -07:00
Davies Liu	7131b03bcf	[SPARK-14853] [SQL] Support LeftSemi/LeftAnti in SortMergeJoinExec ## What changes were proposed in this pull request? This PR updates SortMergeJoinExec to support LeftSemi/LeftAnti, so it can support all the join types, the same as the other three join implementations: BroadcastHashJoinExec, ShuffledHashJoinExec, and BroadcastNestedLoopJoinExec. This PR also simplifies the join selection in SparkStrategy. ## How was this patch tested? Added new tests. Author: Davies Liu <davies@databricks.com> Closes #12668 from davies/smj_semi.	2016-04-26 12:43:47 -07:00
Andrew Or	2a3d39f48b	[MINOR] Follow-up to #12625 ## What changes were proposed in this pull request? That patch mistakenly widened the visibility from `private[x]` to `protected[x]`. This patch reverts those changes. Author: Andrew Or <andrew@databricks.com> Closes #12686 from andrewor14/visibility.	2016-04-26 11:08:08 -07:00
Reynold Xin	5cb03220a0	[SPARK-14912][SQL] Propagate data source options to Hadoop configuration ## What changes were proposed in this pull request? We currently have no way for users to propagate options to underlying libraries that rely on Hadoop configurations to work. For example, there are various options in parquet-mr that users might want to set, but the data source API does not expose a per-job way to set them. This patch propagates the user-specified options into the Hadoop Configuration as well. ## How was this patch tested? Used a mock data source implementation to test both the read path and the write path. Author: Reynold Xin <rxin@databricks.com> Closes #12688 from rxin/SPARK-14912.	2016-04-26 10:58:56 -07:00
gatorsmile	162cf02efa	[SPARK-14910][SQL] Native DDL Command Support for Describe Function in Non-identifier Format #### What changes were proposed in this pull request? The existing `Describe Function` only supports a function name in `identifier` format. This differs from Hive's behavior. That is why many test cases `udf_abc` in `HiveCompatibilitySuite` are not using our native DDL support. For example, - udf_not.q - udf_bitwise_not.q This PR resolves the issue. Now, we can support the `Describe Function` command with function names in the following formats: - `qualifiedName` (e.g., `db.func1`) - `STRING` (e.g., `'func1'`) - `comparisonOperator` (e.g., `<`) - `arithmeticOperator` (e.g., `+`) - `predicateOperator` (e.g., `or`) Note, before this PR, we only have a native command support when the function name is in the format of `qualifiedName`. #### How was this patch tested? Added test cases in `DDLSuite.scala`. Also manually verified all the related test cases in `HiveCompatibilitySuite` passed. Author: gatorsmile <gatorsmile@gmail.com> Closes #12679 from gatorsmile/descFunction.	2016-04-26 19:29:34 +02:00
Jacek Laskowski	b208229ba1	[MINOR][DOCS] Minor typo fixes ## What changes were proposed in this pull request? Minor typo fixes (too minor to deserve a separate JIRA) ## How was this patch tested? local build Author: Jacek Laskowski <jacek@japila.pl> Closes #12469 from jaceklaskowski/minor-typo-fixes.	2016-04-26 11:51:12 +01:00
Azeem Jiva	de6e633420	[SPARK-14756][CORE] Use parseLong instead of valueOf ## What changes were proposed in this pull request? Use Long.parseLong, which returns a primitive. Using a series of append() calls reduces the creation of an extra StringBuilder. ## How was this patch tested? Unit tests Author: Azeem Jiva <azeemj@gmail.com> Closes #12520 from javawithjiva/minor.	2016-04-26 11:49:04 +01:00
Andrew Or	18c2c92580	[SPARK-14861][SQL] Replace internal usages of SQLContext with SparkSession ## What changes were proposed in this pull request? In Spark 2.0, `SparkSession` is the new thing. Internally we should stop using `SQLContext` everywhere since that's supposed to be not the main user-facing API anymore. In this patch I took care to not break any public APIs. The one place that's suspect is `o.a.s.ml.source.libsvm.DefaultSource`, but according to mengxr it's not supposed to be public so it's OK to change the underlying `FileFormat` trait. Reviewers: This is a big patch that may be difficult to review but the changes are actually really straightforward. If you prefer I can break it up into a few smaller patches, but it will delay the progress of this issue a little. ## How was this patch tested? No change in functionality intended. Author: Andrew Or <andrew@databricks.com> Closes #12625 from andrewor14/spark-session-refactor.	2016-04-25 20:54:31 -07:00
Sameer Agarwal	c71c6853fc	[SPARK-14870][SQL][FOLLOW-UP] Move decimalDataWithNulls in DataFrameAggregateSuite ## What changes were proposed in this pull request? Minor followup to https://github.com/apache/spark/pull/12651 ## How was this patch tested? Test-only change Author: Sameer Agarwal <sameer@databricks.com> Closes #12674 from sameeragarwal/tpcds-fix-2.	2016-04-25 18:22:06 -07:00
Andrew Or	cfa64882fc	[SPARK-14902][SQL] Expose RuntimeConfig in SparkSession ## What changes were proposed in this pull request? `RuntimeConfig` is the new user-facing API in 2.0 added in #11378. Until now, however, it's been dead code. This patch uses `RuntimeConfig` in `SessionState` and exposes that through the `SparkSession`. ## How was this patch tested? New test in `SQLContextSuite`. Author: Andrew Or <andrew@databricks.com> Closes #12669 from andrewor14/use-runtime-conf.	2016-04-25 17:52:25 -07:00
Reynold Xin	f36c9c8379	[SPARK-14888][SQL] UnresolvedFunction should use FunctionIdentifier ## What changes were proposed in this pull request? This patch changes UnresolvedFunction and UnresolvedGenerator to use a FunctionIdentifier rather than just a String for function name. Also changed SessionCatalog to accept FunctionIdentifier in lookupFunction. ## How was this patch tested? Updated related unit tests. Author: Reynold Xin <rxin@databricks.com> Closes #12659 from rxin/SPARK-14888.	2016-04-25 16:20:57 -07:00
Andrew Or	34336b6250	[SPARK-14828][SQL] Start SparkSession in REPL instead of SQLContext ## What changes were proposed in this pull request? ``` Spark context available as 'sc' (master = local[*], app id = local-1461283768192). Spark session available as 'spark'. Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.0.0-SNAPSHOT /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51) Type in expressions to have them evaluated. Type :help for more information. scala> sql("SHOW TABLES").collect() 16/04/21 17:09:39 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0 16/04/21 17:09:39 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException res0: Array[org.apache.spark.sql.Row] = Array([src,false]) scala> sql("SHOW TABLES").collect() res1: Array[org.apache.spark.sql.Row] = Array([src,false]) scala> spark.createDataFrame(Seq((1, 1), (2, 2), (3, 3))) res2: org.apache.spark.sql.DataFrame = [_1: int, _2: int] ``` Hive things are loaded lazily. ## How was this patch tested? Manual. Author: Andrew Or <andrew@databricks.com> Closes #12589 from andrewor14/spark-session-repl.	2016-04-25 15:30:18 -07:00
gatorsmile	0c47e274ab	[SPARK-13739][SQL] Push Predicate Through Window #### What changes were proposed in this pull request? For performance, predicates can be pushed through Window if and only if the following conditions are satisfied: 1. All the expressions are part of window partitioning key. The expressions can be compound. 2. Deterministic #### How was this patch tested? TODO: - [X] DSL needs to be modified for window - [X] more tests will be added. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #11635 from gatorsmile/pushPredicateThroughWindow.	2016-04-25 22:32:34 +02:00
Andrew Or	3c5e65c339	[SPARK-14721][SQL] Remove HiveContext (part 2) ## What changes were proposed in this pull request? This removes the class `HiveContext` itself along with all code usages associated with it. The bulk of the work was already done in #12485. This is mainly just code cleanup and actually removing the class. Note: A couple of things will break after this patch. These will be fixed separately. - the python HiveContext - all the documentation / comments referencing HiveContext - there will be no more HiveContext in the REPL (fixed by #12589) ## How was this patch tested? No change in functionality. Author: Andrew Or <andrew@databricks.com> Closes #12585 from andrewor14/delete-hive-context.	2016-04-25 13:23:05 -07:00
Cheng Lian	e66afd5c66	[SPARK-14875][SQL] Makes OutputWriterFactory.newInstance public ## What changes were proposed in this pull request? This method was accidentally made `private[sql]` in Spark 2.0. This PR makes it public again, since 3rd party data sources like spark-avro depend on it. ## How was this patch tested? N/A Author: Cheng Lian <lian@databricks.com> Closes #12652 from liancheng/spark-14875.	2016-04-25 20:42:49 +08:00
Sameer Agarwal	cbdcd4edab	[SPARK-14870] [SQL] Fix NPE in TPCDS q14a ## What changes were proposed in this pull request? This PR fixes a bug in `TungstenAggregate` that manifests while aggregating by keys over nullable `BigDecimal` columns. This causes a null pointer exception while executing TPCDS q14a. ## How was this patch tested? 1. Added regression test in `DataFrameAggregateSuite`. 2. Verified that TPCDS q14a works Author: Sameer Agarwal <sameer@databricks.com> Closes #12651 from sameeragarwal/tpcds-fix.	2016-04-24 22:52:50 -07:00

2358 commits