Commit graph

3041 commits

Author SHA1 Message Date
hyukjinkwon 37686539f5 [SPARK-17388] [SQL] Support for inferring type date/timestamp/decimal for partition column
## What changes were proposed in this pull request?

Currently, Spark only supports inferring `IntegerType`, `LongType`, `DoubleType` and `StringType`.

`DecimalType` is attempted, but it is never actually inferred because `DoubleType` is tried first. Also, it seems `DateType` and `TimestampType` could be inferred as well.

As far as I know, it is pretty common to use both for a partition column.

This PR fixes the incorrect `DecimalType` attempt and adds support for inferring both `DateType` and `TimestampType` for partition columns.
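
A minimal, hypothetical sketch of the resulting try-in-order inference (the real logic lives in Spark's partitioning utilities; the exact ordering and rules here are assumptions for illustration):

```scala
// Hypothetical try-in-order inference for a partition value string; names and
// ordering are illustrative, not Spark's actual implementation.
import java.sql.{Date, Timestamp}
import scala.util.Try

def inferPartitionType(raw: String): String =
  Try(raw.toInt).map(_ => "IntegerType")
    .orElse(Try(raw.toLong).map(_ => "LongType"))
    .orElse(Try(BigDecimal(raw)).map(_ => "DecimalType"))  // now tried meaningfully
    .orElse(Try(raw.toDouble).map(_ => "DoubleType"))
    .orElse(Try(Date.valueOf(raw)).map(_ => "DateType"))
    .orElse(Try(Timestamp.valueOf(raw)).map(_ => "TimestampType"))
    .getOrElse("StringType")

// e.g. a directory like /table/part=2016-10-18 would now infer "DateType".
```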

## How was this patch tested?

Unit tests in `ParquetPartitionDiscoverySuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14947 from HyukjinKwon/SPARK-17388.
2016-10-18 13:20:42 -07:00
Tathagata Das a9e79a41ee [SQL][STREAMING][TEST] Follow up to remove Option.contains for Scala 2.10 compatibility
## What changes were proposed in this pull request?

Scala 2.10 does not have `Option.contains`, which broke the Scala 2.10 build.
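
For reference, a minimal illustration of the Scala 2.10-compatible rewrite (the snippet is illustrative; the actual call site is in the PR):

```scala
// Scala 2.10-compatible equivalent of Option.contains (added in Scala 2.11):
val offset: Option[Long] = Some(100L)
assert(offset == Some(100L))   // works on both 2.10 and 2.11
// offset.contains(100L)       // compiles only on Scala 2.11+
```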

## How was this patch tested?
Locally compiled and ran the sql/core unit tests with Scala 2.10.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15531 from tdas/metrics-flaky-test-fix-1.
2016-10-18 02:29:55 -07:00
Liwei Lin 7d878cf2da [SQL][STREAMING][TEST] Fix flaky tests in StreamingQueryListenerSuite
This work has largely been done by lw-lin in his PR #15497. This is a slight refactoring of it.

## What changes were proposed in this pull request?
There were two sources of flakiness in StreamingQueryListener test.

- When testing with manual clock, consecutive attempts to advance the clock can occur without the stream execution thread being unblocked and doing some work between the two attempts. Hence the following can happen with the current ManualClock.
```
+-----------------------------------+--------------------------------+
|      StreamExecution thread       |         testing thread         |
+-----------------------------------+--------------------------------+
|  ManualClock.waitTillTime(100) {  |                                |
|        _isWaiting = true          |                                |
|            wait(10)               |                                |
|        still in wait(10)          |  if (_isWaiting) advance(100)  |
|        still in wait(10)          |  if (_isWaiting) advance(200)  | <- this should be disallowed !
|        still in wait(10)          |  if (_isWaiting) advance(300)  | <- this should be disallowed !
|      wake up from wait(10)        |                                |
|       current time is 600         |                                |
|       _isWaiting = false          |                                |
|  }                                |                                |
+-----------------------------------+--------------------------------+
```

- The second source of flakiness is that adding data to the memory stream may get processed in any trigger, not just the first one.

My fix is to make the manual clock wait for the stream execution thread to start waiting for the clock at the right wait start time. That is, `advance(200)` (see above) will wait for the stream execution thread to complete the wait that started at time 0, and start a new wait at time 200 (i.e. the timestamp after the previous `advance(100)`).

In addition, since this is a feature that is solely used by StreamExecution, I removed all the non-generic code from ManualClock and put it in StreamManualClock inside StreamTest.
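
A simplified, hypothetical sketch of that synchronization (the real StreamManualClock in StreamTest differs in detail, and the `expectedWaitStart` parameter here is an assumption):

```scala
// Hypothetical sketch: advance() blocks until the stream thread is actually
// waiting, and waiting from the expected start time, so advances cannot stack.
class StreamManualClockSketch(start: Long = 0L) {
  private var now = start
  private var waitStartTime: Option[Long] = None

  def waitTillTime(target: Long): Unit = synchronized {
    waitStartTime = Some(now)
    notifyAll()
    while (now < target) wait()
    waitStartTime = None
    notifyAll()
  }

  def advance(by: Long, expectedWaitStart: Long): Unit = synchronized {
    // Disallow the stacked advances shown in the table above.
    while (waitStartTime != Some(expectedWaitStart)) wait()
    now += by
    notifyAll()
  }
}
```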

## How was this patch tested?
Ran the existing unit tests MANY TIMES in Jenkins.

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Liwei Lin <lwlin7@gmail.com>

Closes #15519 from tdas/metrics-flaky-test-fix.
2016-10-18 00:49:57 -07:00
Reynold Xin 1c5a7d7f64 Revert "[SPARK-17974] Refactor FileCatalog classes to simplify the inheritance tree"
This reverts commit 8daa1a29b6.
2016-10-17 21:26:28 -07:00
Eric Liang 8daa1a29b6 [SPARK-17974] Refactor FileCatalog classes to simplify the inheritance tree
## What changes were proposed in this pull request?

This renames `BasicFileCatalog => FileCatalog`, combines `SessionFileCatalog` with `PartitioningAwareFileCatalog`, and removes the old `FileCatalog` trait.

In summary,
```
MetadataLogFileCatalog extends PartitioningAwareFileCatalog
ListingFileCatalog extends PartitioningAwareFileCatalog
PartitioningAwareFileCatalog extends FileCatalog
TableFileCatalog extends FileCatalog
```

cc cloud-fan mallman

## How was this patch tested?

Existing tests

Author: Eric Liang <ekl@databricks.com>

Closes #15518 from ericl/refactor-session-file-catalog.
2016-10-17 21:01:22 -07:00
Dilip Biswal 813ab5e025 [SPARK-17620][SQL] Determine Serde by hive.default.fileformat when Creating Hive Serde Tables
## What changes were proposed in this pull request?
Reopens the closed PR https://github.com/apache/spark/pull/15190
(Please refer to the above link for review comments on the PR)

Make sure the hive.default.fileformat is used when creating the storage format metadata.

Output
``` SQL
scala> spark.sql("SET hive.default.fileformat=orc")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("CREATE TABLE tmp_default(id INT)")
res2: org.apache.spark.sql.DataFrame = []
```
Before
```SQL
scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
..
[# Storage Information,,]
[SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
[InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
[OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
[Compressed:,No,]
[Storage Desc Parameters:,,]
[  serialization.format,1,]
```
After
```SQL
scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
..
[# Storage Information,,]
[SerDe Library:,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
[InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
[OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
[Compressed:,No,]
[Storage Desc Parameters:,,]
[  serialization.format,1,]

```
## How was this patch tested?

Added new tests to HiveDDLCommandSuite, SQLQuerySuite

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #15495 from dilipbiswal/orc2.
2016-10-17 20:46:30 -07:00
gatorsmile d88a1bae6a [SPARK-17751][SQL] Remove spark.sql.eagerAnalysis and Output the Plan if Existed in AnalysisException
### What changes were proposed in this pull request?
Dataset always does eager analysis now, so `spark.sql.eagerAnalysis` is no longer used and should be removed.

This PR also outputs the plan in the `AnalysisException`. Without the fix, the analysis error looks like:
```
cannot resolve '`k1`' given input columns: [k, v]; line 1 pos 12
```

After the fix, the analysis error becomes:
```
org.apache.spark.sql.AnalysisException: cannot resolve '`k1`' given input columns: [k, v]; line 1 pos 12;
'Project [unresolvedalias(CASE WHEN ('k1 = 2) THEN 22 WHEN ('k1 = 4) THEN 44 ELSE 0 END, None), v#6]
+- SubqueryAlias t
   +- Project [_1#2 AS k#5, _2#3 AS v#6]
      +- LocalRelation [_1#2, _2#3]
```

### How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes #15316 from gatorsmile/eagerAnalysis.
2016-10-17 11:33:06 -07:00
Sital Kedia c7ac027d5f [SPARK-17839][CORE] Use Nio's directbuffer instead of BufferedInputStream in order to avoid additional copy from os buffer cache to user buffer
## What changes were proposed in this pull request?

Currently we use BufferedInputStream to read the shuffle file, which copies the file content from the OS buffer cache to the user buffer. This adds additional latency when reading the spill files. We made a change to use Java NIO's direct buffer to read the spill files, and for certain pipelines that spill a significant amount of data we see up to a 7% speedup for the entire pipeline.
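
A minimal sketch of the direct-buffer read pattern (illustrative only; Spark's actual reader wraps this in its own input-stream abstraction):

```scala
// Read a file through a direct NIO buffer instead of BufferedInputStream,
// avoiding the extra copy from the OS buffer cache into a heap buffer.
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption}

def readWithDirectBuffer(path: String, bufferSize: Int = 1 << 16): Long = {
  val channel = FileChannel.open(Paths.get(path), StandardOpenOption.READ)
  val buf = ByteBuffer.allocateDirect(bufferSize)  // off-heap buffer
  var total = 0L
  try {
    var n = channel.read(buf)
    while (n != -1) {
      buf.flip()
      total += buf.remaining()  // a real reader would consume the bytes here
      buf.clear()
      n = channel.read(buf)
    }
  } finally channel.close()
  total
}
```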

## How was this patch tested?
Tested by running the job in the cluster and observed up to 7% speedup.

Author: Sital Kedia <skedia@fb.com>

Closes #15408 from sitalkedia/skedia/nio_spill_read.
2016-10-17 11:03:04 -07:00
gatorsmile e18d02c5a8 [SPARK-17947][SQL] Add Doc and Comment about spark.sql.debug
### What changes were proposed in this pull request?
Just document the impact of `spark.sql.debug`:

When debug is enabled, Spark SQL internal table properties are not filtered out; however, some related DDL commands (e.g., ANALYZE TABLE and CREATE TABLE LIKE) might not work properly.

### How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes #15494 from gatorsmile/addDocForSQLDebug.
2016-10-17 12:08:25 +08:00
Jun Kim 36d81c2c68 [SPARK-17953][DOCUMENTATION] Fix typo in SparkSession scaladoc
## What changes were proposed in this pull request?

### Before:
```scala
SparkSession.builder()
     .master("local")
     .appName("Word Count")
     .config("spark.some.config.option", "some-value").
     .getOrCreate()
```

### After:
```scala
SparkSession.builder()
     .master("local")
     .appName("Word Count")
     .config("spark.some.config.option", "some-value")
     .getOrCreate()
```

There was one unexpected dot!

Author: Jun Kim <i2r.jun@gmail.com>

Closes #15498 from tae-jun/SPARK-17953.
2016-10-15 00:36:55 -07:00
Michael Allman 6ce1b675ee [SPARK-16980][SQL] Load only catalog table partition metadata required to answer a query
(This PR addresses https://issues.apache.org/jira/browse/SPARK-16980.)

## What changes were proposed in this pull request?

In a new Spark session, when a partitioned Hive table is converted to use Spark's `HadoopFsRelation` in `HiveMetastoreCatalog`, metadata for every partition of that table are retrieved from the metastore and loaded into driver memory. In addition, every partition's metadata files are read from the filesystem to perform schema inference.

If a user queries such a table with predicates which prune that table's partitions, we would like to be able to answer that query without consulting partition metadata which are not involved in the query. When querying a table with a large number of partitions for some data from a small number of partitions (maybe even a single partition), the current conversion strategy is highly inefficient. I suspect this scenario is not uncommon in the wild.

In addition to being inefficient in running time, the current strategy is inefficient in its use of driver memory. When the sum of the number of partitions of all tables loaded in a driver reaches a certain level (somewhere in the tens of thousands), their cached data exhaust all driver heap memory in the default configuration. I suspect this scenario is less common (in that not too many deployments work with tables with tens of thousands of partitions), however this does illustrate how large the memory footprint of this metadata can be. With tables with hundreds or thousands of partitions, I would expect the `HiveMetastoreCatalog` table cache to represent a significant portion of the driver's heap space.

This PR proposes an alternative approach. Basically, it makes four changes:

1. It adds a new method, `listPartitionsByFilter` to the Catalyst `ExternalCatalog` trait which returns the partition metadata for a given sequence of partition pruning predicates.
1. It refactors the `FileCatalog` type hierarchy to include a new `TableFileCatalog` to efficiently return files only for partitions matching a sequence of partition pruning predicates.
1. It removes partition loading and caching from `HiveMetastoreCatalog`.
1. It adds a new Catalyst optimizer rule, `PruneFileSourcePartitions`, which applies a plan's partition-pruning predicates to prune out unnecessary partition files from a `HadoopFsRelation`'s underlying file catalog.

The net effect is that when a query over a partitioned Hive table is planned, the analyzer retrieves the table metadata from `HiveMetastoreCatalog`. As part of this operation, the `HiveMetastoreCatalog` builds a `HadoopFsRelation` with a `TableFileCatalog`. It does not load any partition metadata or scan any files. The optimizer prunes away unnecessary table partitions by sending the partition-pruning predicates to the relation's `TableFileCatalog`. The `TableFileCatalog` in turn calls the `listPartitionsByFilter` method on its external catalog. This queries the Hive metastore, passing along those filters.
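
A self-contained, simplified model of that pruning flow (plain Scala types stand in for Catalyst `Expression`s and the catalog classes; all names below are illustrative):

```scala
// Only partitions matching the pushed-down predicate are ever returned to the
// file catalog; the files of the remaining partitions are never listed.
case class PartitionSpec(values: Map[String, String], location: String)

class CatalogSketch(partitions: Map[(String, String), Seq[PartitionSpec]]) {
  def listPartitionsByFilter(
      db: String,
      table: String,
      predicate: PartitionSpec => Boolean): Seq[PartitionSpec] =
    partitions.getOrElse((db, table), Nil).filter(predicate)
}

val catalog = new CatalogSketch(Map(
  ("default", "events") -> Seq(
    PartitionSpec(Map("date" -> "2016-10-13"), "/warehouse/events/date=2016-10-13"),
    PartitionSpec(Map("date" -> "2016-10-14"), "/warehouse/events/date=2016-10-14"))))

val pruned = catalog.listPartitionsByFilter(
  "default", "events", _.values("date") == "2016-10-14")  // one partition returned
```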

As a bonus, performing partition pruning during optimization leads to a more accurate relation size estimate. This, along with c481bdf, can lead to automatic, safe application of the broadcast optimization in a join where it might previously have been omitted.

## Open Issues

1. This PR omits partition metadata caching. I can add this once the overall strategy for the cold path is established, perhaps in a future PR.
1. This PR removes and omits partitioned Hive table schema reconciliation. As a result, it fails to find Parquet schema columns with upper case letters because of the Hive metastore's case-insensitivity. This issue may be fixed by #14750, but that PR appears to have stalled. ericl has contributed to this PR a workaround for Parquet wherein schema reconciliation occurs at query execution time instead of planning. Whether ORC requires a similar patch is an open issue.
1. This PR omits an implementation of `listPartitionsByFilter` for the `InMemoryCatalog`.
1. This PR breaks parquet log output redirection during query execution. I can work around this by running `Class.forName("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$")` first thing in a Spark shell session, but I haven't figured out how to fix this properly.

## How was this patch tested?

The current Spark unit tests were run, and some ad-hoc tests were performed to validate that only the necessary partition metadata is loaded.

Author: Michael Allman <michael@videoamp.com>
Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>

Closes #14690 from mallman/spark-16980-lazy_partition_fetching.
2016-10-14 18:26:18 -07:00
Srinath Shankar 2d96d35dc0 [SPARK-17946][PYSPARK] Python crossJoin API similar to Scala
## What changes were proposed in this pull request?

Add a crossJoin function to the DataFrame API similar to that in Scala. Joins with no condition (cartesian products) must be specified with the crossJoin API.
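
For reference, the Scala API this mirrors (spark-shell session assumed, so `spark.implicits._` is in scope; the new Python crossJoin behaves analogously):

```scala
val left = Seq(1, 2).toDF("a")
val right = Seq("x", "y").toDF("b")
left.crossJoin(right).count()   // 4: explicit cartesian product is allowed
// left.join(right).count()     // AnalysisException: implicit cartesian product
```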

## How was this patch tested?
Added Python tests to ensure that an AnalysisException is raised if a cartesian product is specified without crossJoin(), and that cartesian products can execute if specified via crossJoin().

Author: Srinath Shankar <srinath@databricks.com>

Closes #15493 from srinathshankar/crosspython.
2016-10-14 18:24:47 -07:00
Reynold Xin 72adfbf94a [SPARK-17900][SQL] Graduate a list of Spark SQL APIs to stable
## What changes were proposed in this pull request?
This patch graduates a list of Spark SQL APIs and mark them stable.

The following are marked stable:

Dataset/DataFrame
- functions, since 1.3
- ColumnName, since 1.3
- DataFrameNaFunctions, since 1.3.1
- DataFrameStatFunctions, since 1.4
- UserDefinedFunction, since 1.3
- UserDefinedAggregateFunction, since 1.5
- Window and WindowSpec, since 1.4

Data sources:
- DataSourceRegister, since 1.5
- RelationProvider, since 1.3
- SchemaRelationProvider, since 1.3
- CreatableRelationProvider, since 1.3
- BaseRelation, since 1.3
- TableScan, since 1.3
- PrunedScan, since 1.3
- PrunedFilteredScan, since 1.3
- InsertableRelation, since 1.3

The following are kept experimental / evolving:

Data sources:
- CatalystScan (tied to internal logical plans so it is not stable by definition)

Structured streaming:
- all classes (introduced new in 2.0 and will likely change)

Dataset typed operations (introduced in 1.6 and 2.0; they might still change, although the probability is low)
- all typed methods on Dataset
- KeyValueGroupedDataset
- o.a.s.sql.expressions.javalang.typed
- o.a.s.sql.expressions.scalalang.typed
- methods that return typed Dataset in SparkSession

We should discuss more whether we want to mark Dataset typed operations stable in 2.1.

## How was this patch tested?
N/A - just annotation changes.

Author: Reynold Xin <rxin@databricks.com>

Closes #15469 from rxin/SPARK-17900.
2016-10-14 16:13:42 -07:00
Jeff Zhang f00df40cfe [SPARK-11775][PYSPARK][SQL] Allow PySpark to register Java UDF
Currently PySpark can only call built-in Java UDFs, but cannot call custom Java UDFs. It would be better to allow that. Two benefits:
* Leverage the power of rich third-party Java libraries
* Improve performance: if we use a Python UDF, Python daemons will be started on the workers, which hurts performance.

Author: Jeff Zhang <zjffdu@apache.org>

Closes #9766 from zjffdu/SPARK-11775.
2016-10-14 15:50:35 -07:00
Nick Pentreath 5aeb7384c7 [SPARK-16063][SQL] Add storageLevel to Dataset
[SPARK-11905](https://issues.apache.org/jira/browse/SPARK-11905) added support for `persist`/`cache` for `Dataset`. However, there is no user-facing API to check if a `Dataset` is cached and if so what the storage level is. This PR adds `getStorageLevel` to `Dataset`, analogous to `RDD.getStorageLevel`.
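
A brief usage sketch (spark-shell assumed; the method landed on `Dataset` as `storageLevel`):

```scala
// Check whether and how a Dataset is cached.
val ds = spark.range(100)
println(ds.storageLevel)   // StorageLevel.NONE (not cached yet)
ds.cache()
println(ds.storageLevel)   // MEMORY_AND_DISK, the default for Dataset.cache()
```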

Updated `DatasetCacheSuite`.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13780 from MLnick/ds-storagelevel.

Signed-off-by: Michael Armbrust <michael@databricks.com>
2016-10-14 15:09:49 -07:00
Davies Liu da9aeb0fde [SPARK-17863][SQL] should not add column into Distinct
## What changes were proposed in this pull request?

We try to resolve attributes in Sort by pulling some columns of the grandchild up into the child, but that is wrong when the child is Distinct: the added column changes the behavior of Distinct (e.g., `SELECT DISTINCT a FROM t ORDER BY b` would then deduplicate over both `a` and `b`), so we should not do that.

## How was this patch tested?

Added regression test.

Author: Davies Liu <davies@databricks.com>

Closes #15489 from davies/order_distinct.
2016-10-14 14:45:20 -07:00
Yin Huai 522dd0d0e5 Revert "[SPARK-17620][SQL] Determine Serde by hive.default.fileformat when Creating Hive Serde Tables"
This reverts commit 7ab86244e3.
2016-10-14 14:09:35 -07:00
Dilip Biswal 7ab86244e3 [SPARK-17620][SQL] Determine Serde by hive.default.fileformat when Creating Hive Serde Tables
## What changes were proposed in this pull request?
Make sure the hive.default.fileformat is used when creating the storage format metadata.

Output
``` SQL
scala> spark.sql("SET hive.default.fileformat=orc")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("CREATE TABLE tmp_default(id INT)")
res2: org.apache.spark.sql.DataFrame = []
```
Before
```SQL
scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
..
[# Storage Information,,]
[SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
[InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
[OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
[Compressed:,No,]
[Storage Desc Parameters:,,]
[  serialization.format,1,]
```
After
```SQL
scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
..
[# Storage Information,,]
[SerDe Library:,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
[InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
[OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
[Compressed:,No,]
[Storage Desc Parameters:,,]
[  serialization.format,1,]

```

## How was this patch tested?
Added new tests to HiveDDLCommandSuite

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #15190 from dilipbiswal/orc.
2016-10-14 13:22:59 -07:00
Tathagata Das 05800b4b4e [TEST] Ignore flaky test in StreamingQueryListenerSuite
## What changes were proposed in this pull request?

Ignoring the flaky test introduced in #15307

https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/1736/testReport/junit/org.apache.spark.sql.streaming/StreamingQueryListenerSuite/single_listener__check_trigger_statuses/

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15491 from tdas/metrics-flaky-test.
2016-10-14 12:39:25 -07:00
Andrew Ash fa37877af0
Typo: form -> from
## What changes were proposed in this pull request?

Minor typo fix

## How was this patch tested?

Existing unit tests on Jenkins

Author: Andrew Ash <andrew@andrewash.com>

Closes #15486 from ash211/patch-8.
2016-10-14 18:13:19 +01:00
wangzhenhua 7486442fe0 [SPARK-17073][SQL][FOLLOWUP] generate column-level statistics
## What changes were proposed in this pull request?
This PR adds test cases for statistics: case-sensitive column names, non-ASCII column names, and refresh table; it also improves some documentation.

## How was this patch tested?
add test cases

Author: wangzhenhua <wangzhenhua@huawei.com>

Closes #15360 from wzhfy/colStats2.
2016-10-14 21:18:49 +08:00
Reynold Xin 6c29b3de76 [SPARK-17925][SQL] Break fileSourceInterfaces.scala into multiple pieces
## What changes were proposed in this pull request?
This patch does a few changes to the file structure of data sources:

- Break fileSourceInterfaces.scala into multiple pieces (HadoopFsRelation, FileFormat, OutputWriter)
- Move ParquetOutputWriter into its own file

I created this as a separate patch so it'd be easier to review my future PRs that focus on refactoring this internal logic. This patch only moves code around, and has no logic changes.

## How was this patch tested?
N/A - should be covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #15473 from rxin/SPARK-17925.
2016-10-14 14:14:52 +08:00
Reynold Xin 8543996c3f [SPARK-17927][SQL] Remove dead code in WriterContainer.
## What changes were proposed in this pull request?
`speculationEnabled` and `DATASOURCE_OUTPUTPATH` appear to be dead code.

## How was this patch tested?
Tests should fail if they are not dead code.

Author: Reynold Xin <rxin@databricks.com>

Closes #15477 from rxin/SPARK-17927.
2016-10-14 12:35:59 +08:00
petermaxlee adc112429d [SPARK-17661][SQL] Consolidate various listLeafFiles implementations
## What changes were proposed in this pull request?
There are several listLeafFiles-related functions in Spark:

- ListingFileCatalog.listLeafFiles (which calls HadoopFsRelation.listLeafFilesInParallel if the number of paths passed in is greater than a threshold; if it is lower, then it has its own serial version implemented)
- HadoopFsRelation.listLeafFiles (called only by HadoopFsRelation.listLeafFilesInParallel)
- HadoopFsRelation.listLeafFilesInParallel (called only by ListingFileCatalog.listLeafFiles)

It is actually very confusing and error prone because there are effectively two distinct implementations of the serial version of listing leaf files. As an example, SPARK-17599 updated only one of the code paths and ignored the other.

This code can be improved by:

- Move all file listing code into ListingFileCatalog, since it is the only class that needs this.
- Keep only one function for listing files in serial.

## How was this patch tested?
This change should be covered by existing unit and integration tests. I also moved a test case for HadoopFsRelation.shouldFilterOut from HadoopFsRelationSuite to ListingFileCatalogSuite.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #15235 from petermaxlee/SPARK-17661.
2016-10-13 14:16:39 -07:00
Tathagata Das 7106866c22 [SPARK-17731][SQL][STREAMING] Metrics for structured streaming
## What changes were proposed in this pull request?

Metrics are needed for monitoring structured streaming apps. Here is the design doc for implementing the necessary metrics.
https://docs.google.com/document/d/1NIdcGuR1B3WIe8t7VxLrt58TJB4DtipWEbj5I_mzJys/edit?usp=sharing

Specifically, this PR adds the following public APIs changes.

### New APIs
- `StreamingQuery.status` returns a `StreamingQueryStatus` object (renamed from `StreamingQueryInfo`, see later)

- `StreamingQueryStatus` has the following important fields
  - inputRate - Current rate (rows/sec) at which data is being generated by all the sources
  - processingRate - Current rate (rows/sec) at which the query is processing data from all the sources
  - ~~outputRate~~ - *Does not work with wholestage codegen*
  - latency - Current average latency between the data being available in source and the sink writing the corresponding output
  - sourceStatuses: Array[SourceStatus] - Current statuses of the sources
  - sinkStatus: SinkStatus - Current status of the sink
  - triggerStatus - Low-level detailed status of the last completed/currently active trigger
    - latencies - getOffset, getBatch, full trigger, wal writes
    - timestamps - trigger start, finish, after getOffset, after getBatch
    - numRows - input, output, state total/updated rows for aggregations

- `SourceStatus` has the following important fields
  - inputRate - Current rate (rows/sec) at which data is being generated by the source
  - processingRate - Current rate (rows/sec) at which the query is processing data from the source
  - triggerStatus - Low-level detailed status of the last completed/currently active trigger

- Python API for `StreamingQuery.status()`

### Breaking changes to existing APIs
**Existing direct public facing APIs**
- Deprecated direct public-facing APIs `StreamingQuery.sourceStatuses` and `StreamingQuery.sinkStatus` in favour of `StreamingQuery.status.sourceStatuses/sinkStatus`.
  - Branch 2.0 should have it deprecated, master should have it removed.

**Existing advanced listener APIs**
- `StreamingQueryInfo` renamed to `StreamingQueryStatus` for consistency with `SourceStatus`, `SinkStatus`
   - Earlier, StreamingQueryInfo was used only in the advanced listener API, but now it is used in the direct public-facing API (StreamingQuery.status)

- The field `queryInfo` in the listener events `QueryStarted`, `QueryProgress` and `QueryTerminated` was renamed to `queryStatus` and its return type changed to `StreamingQueryStatus`.

- The field `offsetDesc` in `SourceStatus` was `Option[String]`; it was converted to `String`.

- For `SourceStatus` and `SinkStatus`, made the constructors private instead of private[sql] to make them more Java-safe. Instead added `private[sql] object SourceStatus/SinkStatus.apply()` methods, which are harder to accidentally use from Java.
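
A short usage sketch of the new direct API (field names are from the list above; assumes a spark-shell session with a streaming DataFrame `streamDF` already defined):

```scala
// Query the consolidated status object instead of the deprecated
// sourceStatuses/sinkStatus accessors.
val query = streamDF.writeStream
  .format("memory").queryName("q").outputMode("append").start()

val status = query.status                          // StreamingQueryStatus
println(s"input rate:      ${status.inputRate} rows/s")
println(s"processing rate: ${status.processingRate} rows/s")
status.sourceStatuses.foreach(s => println(s"source: ${s.inputRate} rows/s"))
println(status.sinkStatus)
```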

## How was this patch tested?

Old and new unit tests.
- Rate calculation and other internal logic of StreamMetrics tested by StreamMetricsSuite.
- New info in statuses returned through StreamingQueryListener is tested in StreamingQueryListenerSuite.
- New and old info returned through StreamingQuery.status is tested in StreamingQuerySuite.
- Source-specific tests making sure input rows are counted are in source-specific test suites.
- Additional tests to test minor additions in LocalTableScanExec, StateStore, etc.

Metrics also manually tested using Ganglia sink

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15307 from tdas/SPARK-17731.
2016-10-13 13:36:26 -07:00
Pete Robbins 84f149e414 [SPARK-17827][SQL] maxColLength type should be Int for String and Binary
## What changes were proposed in this pull request?
Correct the expected type from the Length function to be Int.

## How was this patch tested?
Test runs on little endian and big endian platforms

Author: Pete Robbins <robbinspg@gmail.com>

Closes #15464 from robbinspg/SPARK-17827.
2016-10-13 11:26:30 -07:00
Reynold Xin 04d417a7ca [SPARK-17830][SQL] Annotate remaining SQL APIs with InterfaceStability
## What changes were proposed in this pull request?
This patch annotates all the remaining APIs in SQL (excluding streaming) with InterfaceStability.

## How was this patch tested?
N/A - just annotation change.

Author: Reynold Xin <rxin@databricks.com>

Closes #15457 from rxin/SPARK-17830-2.
2016-10-13 11:12:30 -07:00
Wenchen Fan db8784feaa [SPARK-17899][SQL] add a debug mode to keep raw table properties in HiveExternalCatalog
## What changes were proposed in this pull request?

Currently `HiveExternalCatalog` will filter out the Spark SQL internal table properties, e.g. `spark.sql.sources.provider`, `spark.sql.sources.schema`, etc. This is reasonable for external users as they don't want to see these internal properties in `DESC TABLE`.

However, as a Spark developer, sometimes we do want to see the raw table properties. This PR adds a new internal SQL conf, `spark.sql.debug`, to enable debug mode and keep these raw table properties.

This config can also be used in similar places where we want to retain debug information in the future.

## How was this patch tested?

new test in MetastoreDataSourcesSuite

Author: Wenchen Fan <wenchen@databricks.com>

Closes #15458 from cloud-fan/debug.
2016-10-13 03:26:29 -04:00
Liang-Chi Hsieh 064d6650e9 [SPARK-17866][SPARK-17867][SQL] Fix Dataset.dropduplicates
## What changes were proposed in this pull request?

Two issues regarding Dataset.dropduplicates:

1. Dataset.dropDuplicates should consider the columns with same column name

    We find and use the first resolved attribute from the output with the given column name in `Dataset.dropDuplicates`. When there is more than one column with the same name, the other columns are put into the aggregation columns instead of the grouping columns.

2. Dataset.dropDuplicates should not change the output of child plan

    We currently create a new `Alias` with a new exprId in `Dataset.dropDuplicates`. However, this causes a problem when we want to select the columns as follows:

        val ds = Seq(("a", 1), ("a", 2), ("b", 1), ("a", 1)).toDS()
        // ds("_2") will cause analysis exception
        ds.dropDuplicates("_1").select(ds("_1").as[String], ds("_2").as[Int])

Because the two issues are both related to `Dataset.dropDuplicates` and the code changes are not big, they are submitted together as one PR.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #15427 from viirya/fix-dropduplicates.
2016-10-13 13:27:57 +08:00
Burak Yavuz edeb51a39d [SPARK-17876] Write StructuredStreaming WAL to a stream instead of materializing all at once
## What changes were proposed in this pull request?

The CompactibleFileStreamLog materializes the whole metadata log in memory as a String. This can cause issues when there are lots of files being committed, especially during a compaction batch.
You may come across stack traces that look like:
```
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.lang.StringCoding.encode(StringCoding.java:350)
at java.lang.String.getBytes(String.java:941)
at org.apache.spark.sql.execution.streaming.FileStreamSinkLog.serialize(FileStreamSinkLog.scala:127)

```
The safer way is to write to an output stream so that we don't have to materialize a huge string.
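
A minimal sketch of the streaming-serialization idea (names are illustrative; the actual change lives in CompactibleFileStreamLog's serialize path):

```scala
// Write each log entry straight to the output stream instead of first
// concatenating everything into one huge String.
import java.io.OutputStream
import java.nio.charset.StandardCharsets.UTF_8

def serializeToStream(entries: Iterator[String], out: OutputStream): Unit =
  entries.foreach { entry =>
    out.write(entry.getBytes(UTF_8))  // only one entry in memory at a time
    out.write('\n')
  }
```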

## How was this patch tested?

Existing unit tests

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #15437 from brkyvz/ser-to-stream.
2016-10-12 21:40:45 -07:00
Reynold Xin 6f20a92ca3 [SPARK-17845] [SQL] More self-evident window function frame boundary API
## What changes were proposed in this pull request?
This patch improves the window function frame boundary API to make it more obvious to read and to use. The two high level changes are:

1. Create Window.currentRow, Window.unboundedPreceding, Window.unboundedFollowing to indicate the special values in frame boundaries. These methods map to the special integral values so we are not breaking backward compatibility here. This change makes the frame boundaries more self-evident (instead of Long.MinValue, it becomes Window.unboundedPreceding).

2. In Python, for any value less than or equal to JVM's Long.MinValue, treat it as Window.unboundedPreceding. For any value larger than or equal to JVM's Long.MaxValue, treat it as Window.unboundedFollowing. Before this change, if the user specifies any value that is less than Long.MinValue but not -sys.maxsize (e.g. -sys.maxsize + 1), the number we pass over to the JVM would overflow, resulting in a frame that does not make sense.

Code example required to specify a frame before this patch:
```
Window.rowsBetween(Long.MinValue, 0)
```

While the above code should still work, the new way is more obvious to read:
```
Window.rowsBetween(Window.unboundedPreceding, Window.currentRow)
```

## How was this patch tested?
- Updated DataFrameWindowSuite (for Scala/Java)
- Updated test_window_functions_cumulative_sum (for Python)
- Renamed DataFrameWindowSuite to DataFrameWindowFunctionsSuite to better reflect its purpose

Author: Reynold Xin <rxin@databricks.com>

Closes #15438 from rxin/SPARK-17845.
2016-10-12 16:45:10 -07:00
Imran Rashid 9ce7d3e542 [SPARK-17675][CORE] Expand Blacklist for TaskSets
## What changes were proposed in this pull request?

This is a step along the way to SPARK-8425.

To enable incremental review, the first step proposed here is to expand the blacklisting within tasksets. In particular, this will enable blacklisting for
* (task, executor) pairs (this already exists via an undocumented config)
* (task, node)
* (taskset, executor)
* (taskset, node)

Adding (task, node) is critical to making Spark fault-tolerant to one bad disk in a cluster, without requiring careful tuning of "spark.task.maxFailures". The other additions are also important to avoid many misleading task failures and long scheduling delays when there is one bad node on a large cluster.

Note that some of the code changes here aren't really required for just this -- they put pieces in place for SPARK-8425 even though they are not used yet (eg. the `BlacklistTracker` helper is a little out of place, `TaskSetBlacklist` holds onto a little more info than it needs to for just this change, and `ExecutorFailuresInTaskSet` is more complex than it needs to be).

## How was this patch tested?

Added unit tests, run tests via jenkins.

Author: Imran Rashid <irashid@cloudera.com>
Author: mwws <wei.mao@intel.com>

Closes #15249 from squito/taskset_blacklist_only.
2016-10-12 16:43:03 -05:00
Shixiong Zhu 47776e7c0c [SPARK-17850][CORE] Add a flag to ignore corrupt files
## What changes were proposed in this pull request?

Add a flag to ignore corrupt files. For Spark core, the configuration is `spark.files.ignoreCorruptFiles`. For Spark SQL, it's `spark.sql.files.ignoreCorruptFiles`.
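
Usage sketch with the configuration keys named above (spark-shell assumed):

```scala
// SQL file sources: skip corrupt files instead of failing the query.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

// Spark core reads use the analogous key, set on SparkConf at startup:
// new SparkConf().set("spark.files.ignoreCorruptFiles", "true")
```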

## How was this patch tested?

The added unit tests

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15422 from zsxwing/SPARK-17850.
2016-10-12 13:51:53 -07:00
Wenchen Fan b9a147181d [SPARK-17720][SQL] introduce static SQL conf
## What changes were proposed in this pull request?

SQLConf is session-scoped and mutable. However, we do have the requirement for a static SQL conf, which is global and immutable, e.g. the `schemaStringThreshold` in `HiveExternalCatalog`, the flag to enable/disable hive support, the global temp view database in https://github.com/apache/spark/pull/14897.

Actually we've already implemented static SQL conf implicitly via `SparkConf`; this PR just makes it explicit and exposes it to users, so that they can see the config values via SQL commands or `SparkSession.conf`, and forbids users from setting/unsetting static SQL confs.
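
A hedged usage sketch (the config key shown is one example of a static conf; the exact failure behavior on set/unset is as described above):

```scala
// Static SQL confs are fixed at application start via SparkConf and are
// read-only afterwards.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .config("spark.sql.globalTempDatabase", "global_temp")  // set once, at startup
  .getOrCreate()

spark.conf.get("spark.sql.globalTempDatabase")  // visible via SparkSession.conf
// spark.conf.set("spark.sql.globalTempDatabase", "x")  // rejected: static conf
```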

## How was this patch tested?

new tests in SQLConfSuite

Author: Wenchen Fan <wenchen@databricks.com>

Closes #15295 from cloud-fan/global-conf.
2016-10-11 20:27:08 -07:00
Wenchen Fan 7388ad94d7 [SPARK-17338][SQL][FOLLOW-UP] add global temp view
## What changes were proposed in this pull request?

address post hoc review comments for https://github.com/apache/spark/pull/14897

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #15424 from cloud-fan/global-temp-view.
2016-10-11 15:21:28 +08:00
Reynold Xin b515768f26 [SPARK-17844] Simplify DataFrame API for defining frame boundaries in window functions
## What changes were proposed in this pull request?
When I was creating the example code for SPARK-10496, I realized it was pretty convoluted to define the frame boundaries for window functions when there is no partition column or ordering column. The reason is that we don't provide a way to create a WindowSpec directly with the frame boundaries. We can trivially improve this by adding rowsBetween and rangeBetween to Window object.

As an example, to compute cumulative sum using the natural ordering, before this pr:
```
df.select('key, sum("value").over(Window.partitionBy(lit(1)).rowsBetween(Long.MinValue, 0)))
```

After this pr:
```
df.select('key, sum("value").over(Window.rowsBetween(Long.MinValue, 0)))
```

Note that you could argue there is no point specifying a window frame without partitionBy/orderBy -- but it is strange that rowsBetween and rangeBetween are the only two APIs not available.

This also fixes https://issues.apache.org/jira/browse/SPARK-17656 (removing _root_.scala).

## How was this patch tested?
Added test cases to compute cumulative sum in DataFrameWindowSuite for Scala/Java and tests.py for Python.

Author: Reynold Xin <rxin@databricks.com>

Closes #15412 from rxin/SPARK-17844.
2016-10-10 22:33:20 -07:00
hyukjinkwon 0c0ad436ad [SPARK-17719][SPARK-17776][SQL] Unify and tie up options in a single place in JDBC datasource package
## What changes were proposed in this pull request?

This PR proposes to fix arbitrary usages among `Map[String, String]`, `Properties` and `JDBCOptions` instances for options in `execution/jdbc` package and make the connection properties exclude Spark-only options.

This PR includes some changes as below:

- Unify `Map[String, String]`, `Properties` and `JDBCOptions` in the `execution/jdbc` package to `JDBCOptions`.

- Move the `batchsize`, `fetchsize`, `driver` and `isolationlevel` options into the `JDBCOptions` instance.

- Document `batchsize` and `isolationlevel`, marking which options are read-only and which are write-only. This also fixes minor typos and adds detailed explanations for some statements such as url.

- Throw exceptions fast by checking arguments first rather than at execution time (e.g. for `fetchsize`).

- Exclude Spark-only options from the connection properties (see the sketch after this list).
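
A simplified, hypothetical sketch of that consolidation (names follow the PR text; the validation and the exact key set are assumptions):

```scala
// One options class parses and validates everything up front, and strips
// Spark-only keys from the properties handed to the JDBC driver.
class JDBCOptionsSketch(parameters: Map[String, String]) {
  val url: String = parameters.getOrElse("url", sys.error("Option 'url' is required"))

  // Fail fast: a bad fetchsize is rejected here, not during execution.
  val fetchSize: Int = parameters.getOrElse("fetchsize", "0").toInt
  require(fetchSize >= 0, s"Invalid `fetchsize` value: $fetchSize")

  private val sparkOnlyKeys =
    Set("url", "dbtable", "driver", "fetchsize", "batchsize", "isolationlevel")

  /** Connection properties for the driver, excluding Spark-only options. */
  def connectionProperties: java.util.Properties = {
    val props = new java.util.Properties()
    for ((k, v) <- parameters if !sparkOnlyKeys.contains(k.toLowerCase))
      props.setProperty(k, v)
    props
  }
}
```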

## How was this patch tested?

Existing tests should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15292 from HyukjinKwon/SPARK-17719.
2016-10-10 22:22:41 -07:00
hyukjinkwon 90217f9dee [SPARK-16896][SQL] Handle duplicated field names in header consistently with null or empty strings in CSV
## What changes were proposed in this pull request?

Currently, the CSV datasource allows loading duplicated empty-string fields or fields having `nullValue` in the header. It'd be great if this could deal with normal duplicated fields as well.

This PR proposes handling the duplicates consistently with the existing behaviour, taking case sensitivity (`spark.sql.caseSensitive`) into account, as below:

data below:

```
fieldA,fieldB,,FIELDA,fielda,,
1,2,3,4,5,6,7
```

is parsed as below:

```scala
spark.read.format("csv").option("header", "true").load("test.csv").show()
```

- when `spark.sql.caseSensitive` is `false` (by default).

  ```
  +-------+------+---+-------+-------+---+---+
  |fieldA0|fieldB|_c2|FIELDA3|fieldA4|_c5|_c6|
  +-------+------+---+-------+-------+---+---+
  |      1|     2|  3|      4|      5|  6|  7|
  +-------+------+---+-------+-------+---+---+
  ```

- when `spark.sql.caseSensitive` is `true`.

  ```
  +-------+------+---+-------+-------+---+---+
  |fieldA0|fieldB|_c2| FIELDA|fieldA4|_c5|_c6|
  +-------+------+---+-------+-------+---+---+
  |      1|     2|  3|      4|      5|  6|  7|
  +-------+------+---+-------+-------+---+---+
  ```

**In more details**,

There is a good reference for this problem, `read.csv()` in R. So, I initially wanted to propose a similar behaviour.

In case of R,  the CSV data below:

```
fieldA,fieldB,,fieldA,fieldA,,
1,2,3,4,5,6,7
```

is parsed as below:

```r
test <- read.csv(file="test.csv",header=TRUE,sep=",")
> test
  fieldA fieldB X fieldA.1 fieldA.2 X.1 X.2
1      1      2 3        4        5   6   7
```

However, the Spark CSV datasource already handles duplicated empty strings and `nullValue` as field names. So the data below:

```
,,,fieldA,,fieldB,
1,2,3,4,5,6,7
```

is parsed as below:

```scala
spark.read.format("csv").option("header", "true").load("test.csv").show()
```
```
+---+---+---+------+---+------+---+
|_c0|_c1|_c2|fieldA|_c4|fieldB|_c6|
+---+---+---+------+---+------+---+
|  1|  2|  3|     4|  5|     6|  7|
+---+---+---+------+---+------+---+
```

R numbers the duplicates incrementally (`.1`, `.2`, ...), whereas Spark appends each field's column position, for duplicates as well as for `nullValue` and empty-string fields.

In terms of case sensitivity, R appears to be case-sensitive, as below (it does not seem to be configurable):

```
a,a,a,A,A
1,2,3,4,5
```

is parsed as below:

```r
test <- read.csv(file="test.csv",header=TRUE,sep=",")
> test
  a a.1 a.2 A A.1
1 1   2   3 4   5
```

## How was this patch tested?

Unit test in `CSVSuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14745 from HyukjinKwon/SPARK-16896.
2016-10-11 10:21:22 +08:00
Davies Liu d5ec4a3e01 [SPARK-17738][TEST] Fix flaky test in ColumnTypeSuite
## What changes were proposed in this pull request?

The default buffer size is not big enough for randomly generated MapType.

## How was this patch tested?

Ran the tests 100 times; they never failed (they failed 8 times before the patch).

Author: Davies Liu <davies@databricks.com>

Closes #15395 from davies/flaky_map.
2016-10-10 19:14:01 -07:00
Reynold Xin 689de92005 [SPARK-17830] Annotate spark.sql package with InterfaceStability
## What changes were proposed in this pull request?
This patch annotates the InterfaceStability level for top level classes in o.a.spark.sql and o.a.spark.sql.util packages, to experiment with this new annotation.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #15392 from rxin/SPARK-17830.
2016-10-10 11:29:09 -07:00
jiangxingbo 7e16c94f18
[HOT-FIX][SQL][TESTS] Remove unused function in SparkSqlParserSuite
## What changes were proposed in this pull request?

The function `SparkSqlParserSuite.createTempViewUsing` is currently unused and causes a build failure, so this PR simply removes it.

## How was this patch tested?
N/A

Author: jiangxingbo <jiangxb1987@gmail.com>

Closes #15418 from jiangxb1987/parserSuite.
2016-10-10 13:49:25 +01:00
Wenchen Fan 23ddff4b2b [SPARK-17338][SQL] add global temp view
## What changes were proposed in this pull request?

Global temporary view is a cross-session temporary view, which means it's shared among all sessions. Its lifetime is the lifetime of the Spark application, i.e. it will be automatically dropped when the application terminates. It's tied to a system-preserved database `global_temp` (configurable via SparkConf), and we must use the qualified name to refer to a global temp view, e.g. SELECT * FROM global_temp.view1.

changes for `SessionCatalog`:

1. add a new field `globalTempViews: GlobalTempViewManager`, to access the shared global temp views and the global temp db name.
2. `createDatabase` will fail if users want to create `global_temp`, which is system preserved.
3. `setCurrentDatabase` will fail if users want to set `global_temp`, which is system preserved.
4. add `createGlobalTempView`, which is used in `CreateViewCommand` to create global temp views.
5. add `dropGlobalTempView`, which is used in `CatalogImpl` to drop global temp view.
6. add `alterTempViewDefinition`, which is used in `AlterViewAsCommand` to update the view definition for local/global temp views.
7. `renameTable`/`dropTable`/`isTemporaryTable`/`lookupRelation`/`getTempViewOrPermanentTableMetadata`/`refreshTable` will handle global temp views.

changes for SQL commands:

1. `CreateViewCommand`/`AlterViewAsCommand` is updated to support global temp views
2. `ShowTablesCommand` outputs a new column `database`, which is used to distinguish global and local temp views.
3. other commands can also handle global temp views if they call `SessionCatalog` APIs which accepts global temp views, e.g. `DropTableCommand`, `AlterTableRenameCommand`, `ShowColumnsCommand`, etc.

changes for other public API

1. add a new method `dropGlobalTempView` in `Catalog`
2. `Catalog.findTable` can find global temp view
3. add a new method `createGlobalTempView` in `Dataset`
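
A usage sketch exercising the new public APIs listed above (spark-shell assumed):

```scala
// Create, query across sessions, and drop a global temp view.
spark.range(5).createGlobalTempView("view1")           // lives in global_temp
spark.sql("SELECT * FROM global_temp.view1").show()    // qualified name required
spark.newSession().sql("SELECT * FROM global_temp.view1").show()  // cross-session
spark.catalog.dropGlobalTempView("view1")
```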

## How was this patch tested?

new tests in `SQLViewSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14897 from cloud-fan/global-temp-view.
2016-10-10 15:48:57 +08:00
jiangxingbo 16590030c1 [SPARK-17741][SQL] Grammar to parse top level and nested data fields separately
## What changes were proposed in this pull request?

Currently we use the same rule to parse top level and nested data fields. For example:
```
create table tbl_x(
  id bigint,
  nested struct<col1:string,col2:string>
)
```
This shows both syntaxes. In this PR we split this rule into a top-level rule and a nested rule.

Before this PR,
```
sql("CREATE TABLE my_tab(column1: INT)")
```
works fine.
After this PR, it will throw a `ParseException`:
```
scala> sql("CREATE TABLE my_tab(column1: INT)")
org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'CREATE TABLE my_tab(column1:'(line 1, pos 27)
```

## How was this patch tested?
Add new testcases in `SparkSqlParserSuite`.

Author: jiangxingbo <jiangxb1987@gmail.com>

Closes #15346 from jiangxb1987/cdt.
2016-10-09 22:00:54 -07:00
hyukjinkwon 24850c9415 [HOTFIX][BUILD] Do not use contains in Option in JdbcRelationProvider
## What changes were proposed in this pull request?

This PR proposes to fix the use of the `contains` API, which only exists from Scala 2.11.

## How was this patch tested?

Manually checked:

```scala
scala> val o: Option[Boolean] = None
o: Option[Boolean] = None

scala> o == Some(false)
res17: Boolean = false

scala> val o: Option[Boolean] = Some(true)
o: Option[Boolean] = Some(true)

scala> o == Some(false)
res18: Boolean = false

scala> val o: Option[Boolean] = Some(false)
o: Option[Boolean] = Some(false)

scala> o == Some(false)
res19: Boolean = true
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15393 from HyukjinKwon/hotfix.
2016-10-07 17:59:24 -07:00
Davies Liu 94b24b84a6 [SPARK-17806] [SQL] fix bug in join key rewritten in HashJoin
## What changes were proposed in this pull request?

In HashJoin, we try to rewrite the join key as a Long to improve the performance of finding a match. The rewriting part is not well tested and has a bug that could cause wrong results when there are at least three integral columns in the join key and the total length of the key exceeds 8 bytes.
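
An illustrative sketch of the packing idea and the overflow case it must reject (not Spark's actual code; the widths and names are assumptions):

```scala
// Pack narrow integral key columns into a single Long so hash-join lookups
// compare one value. Three 32-bit keys need 96 bits (> 8 bytes), so packing
// must be refused; mishandling that case is what produced wrong results.
def packKeys(keys: Seq[Int]): Option[Long] = {
  val bitsPerKey = 32
  if (keys.length.toLong * bitsPerKey > 64) None  // key too wide to pack
  else Some(keys.foldLeft(0L)((acc, k) => (acc << bitsPerKey) | (k.toLong & 0xFFFFFFFFL)))
}

// packKeys(Seq(1, 2))    == Some(0x0000000100000002L)
// packKeys(Seq(1, 2, 3)) == None: fall back to comparing the full keys
```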

## How was this patch tested?

Added unit tests covering the rewriting with different numbers of columns and different data types. Manually tested the reported case and confirmed that this PR fixes the bug.

Author: Davies Liu <davies@databricks.com>

Closes #15390 from davies/rewrite_key.
2016-10-07 15:03:47 -07:00
Herman van Hovell 97594c29b7 [SPARK-17761][SQL] Remove MutableRow
## What changes were proposed in this pull request?
In practice we cannot guarantee that an `InternalRow` is immutable. This makes the `MutableRow` almost redundant. This PR folds `MutableRow` into `InternalRow`.

The code below illustrates the immutability issue with InternalRow:
```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
val struct = new GenericMutableRow(1)
val row = InternalRow(struct, 1)
println(row)
scala> [[null], 1]
struct.setInt(0, 42)
println(row)
scala> [[42], 1]
```

This might be somewhat controversial, so feedback is appreciated.

## How was this patch tested?
Existing tests.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #15333 from hvanhovell/SPARK-17761.
2016-10-07 14:03:45 -07:00
Davies Liu 2badb58cdd [SPARK-15621][SQL] Support spilling for Python UDF
## What changes were proposed in this pull request?

When executing a Python UDF, we buffer the input rows into a queue, then pull them out to join with the results from the Python UDF. In the case that the Python UDF is slow or the input rows are too wide, we could run out of memory because of the queue. Since we can't flush all the buffers (sockets) between the JVM and the Python process from the JVM side, we can't limit the rows in the queue; otherwise it could deadlock.

This PR manages the memory used by the queue and spills it to disk when there is not enough memory (it also releases the memory and disk space as soon as possible).

## How was this patch tested?

Added unit tests. Also manually ran a workload with large input rows and a slow Python UDF (with a large broadcast) like this:

```
b = range(1<<24)
add = udf(lambda x: x + len(b), IntegerType())
df = sqlContext.range(1, 1<<26, 1, 4)
print df.select(df.id, lit("adf"*10000).alias("s"), add(df.id).alias("add")).groupBy(length("s")).sum().collect()
```

It ran out of memory (hung because of full GC) before the patch, and ran smoothly after the patch.

Author: Davies Liu <davies@databricks.com>

Closes #15089 from davies/spill_udf.
2016-10-07 13:45:00 -07:00
Prashant Sharma bb1aaf28ec [SPARK-16411][SQL][STREAMING] Add textFile to Structured Streaming.
## What changes were proposed in this pull request?

Adds the textFile API, which exists in DataFrameReader and serves the same purpose.
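
A brief usage sketch (spark-shell assumed; the path is illustrative):

```scala
// Streaming analogue of DataFrameReader.textFile: returns a Dataset[String].
val lines = spark.readStream.textFile("/tmp/streaming-input")
val query = lines.writeStream.format("console").start()
```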

## How was this patch tested?

Added a corresponding test case.

Author: Prashant Sharma <prashsh1@in.ibm.com>

Closes #14087 from ScrapCodes/textFile.
2016-10-07 11:16:24 -07:00
hyukjinkwon aa3a6841eb [SPARK-14525][SQL][FOLLOWUP] Clean up JdbcRelationProvider
## What changes were proposed in this pull request?

This PR proposes cleaning up the confusing part in `createRelation` as discussed in https://github.com/apache/spark/pull/12601/files#r80627940

Also, this PR proposes the changes below:

 - Add documentation for `batchsize` and `isolationLevel`.
 - Move the property names `fetchsize`, `batchsize`, `isolationLevel` and `driver` into `JDBCOptions` so that they can be managed in a single place.

## How was this patch tested?

Existing tests should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15263 from HyukjinKwon/SPARK-14525.
2016-10-07 10:52:32 -07:00
hyukjinkwon 2b01d3c701
[SPARK-16960][SQL] Deprecate approxCountDistinct, toDegrees and toRadians according to FunctionRegistry
## What changes were proposed in this pull request?

It seems `approxCountDistinct`, `toDegrees` and `toRadians` were also missed while matching the names to the ones in `FunctionRegistry` (please see `approx_count_distinct`, `degrees` and `radians` at commit 5c2ae79bfc in `sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala`, lines 244, 203 and 222).

I scanned `functions.scala` against `FunctionRegistry` and it seems these are the only ones left. `countDistinct` and `sumDistinct` are not registered in `FunctionRegistry`.

This PR deprecates `approxCountDistinct`, `toDegrees` and `toRadians` and introduces `approx_count_distinct`, `degrees` and `radians`.

## How was this patch tested?

Existing tests should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>

Closes #14538 from HyukjinKwon/SPARK-16588-followup.
2016-10-07 11:49:34 +01:00