ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Reynold Xin	79008e6cfd	[SPARK-14799][SQL] Remove MetastoreRelation dependency from AnalyzeTable - part 1 ## What changes were proposed in this pull request? This patch isolates AnalyzeTable's dependency on MetastoreRelation into a single line. After this we can work on converging MetastoreRelation and CatalogTable. ## How was this patch tested? Covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #12566 from rxin/SPARK-14799.	2016-04-21 10:57:16 -07:00
Josh Rosen	a70d40314c	[SPARK-14783] Preserve full exception stacktrace in IsolatedClientLoader In IsolatedClientLoader, we have a`catch` block which throws an exception without wrapping the original exception, causing the full exception stacktrace and any nested exceptions to be lost. This patch fixes this, improving the usefulness of classloading error messages. Author: Josh Rosen <joshrosen@databricks.com> Closes #12548 from JoshRosen/improve-logging-for-hive-classloader-issues.	2016-04-21 10:43:22 -07:00
Josh Rosen	649335d6c1	[SPARK-14797][BUILD] Spark SQL POM should not hardcode spark-sketch_2.11 dep. Spark SQL's POM hardcodes a dependency on `spark-sketch_2.11`, which causes Scala 2.10 builds to include the `_2.11` dependency. This is harmless since `spark-sketch` is a pure-Java module (see #12334 for a discussion of dropping the Scala version suffixes from these modules' artifactIds), but it's confusing to people looking at the published POMs. This patch fixes this by using `${scala.binary.version}` to substitute the correct suffix, and also adds a set of Maven Enforcer rules to ensure that `_2.11` artifacts are not used in 2.10 builds (and vice-versa). /cc ahirreddy, who spotted this issue. Author: Josh Rosen <joshrosen@databricks.com> Closes #12563 from JoshRosen/fix-sketch-scala-version.	2016-04-21 09:57:26 -07:00
Liang-Chi Hsieh	4ac6e75cd6	[HOTFIX] Remove wrong DDL tests ## What changes were proposed in this pull request? As we moved most parsing rules to `SparkSqlParser`, some tests expected to throw exception are not correct anymore. ## How was this patch tested? `DDLCommandSuite` Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #12572 from viirya/hotfix-ddl.	2016-04-21 13:18:39 +02:00
Wenchen Fan	cb51680d22	[SPARK-14753][CORE] remove internal flag in Accumulable ## What changes were proposed in this pull request? the `Accumulable.internal` flag is only used to avoid registering internal accumulators for 2 certain cases: 1. `TaskMetrics.createTempShuffleReadMetrics`: the accumulators in the temp shuffle read metrics should not be registered. 2. `TaskMetrics.fromAccumulatorUpdates`: the created task metrics is only used to post event, accumulators inside it should not be registered. For 1, we can create a `TempShuffleReadMetrics` that don't create accumulators, just keep the data and merge it at last. For 2, we can un-register these accumulators immediately. TODO: remove `internal` flag in `AccumulableInfo` with followup PR ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #12525 from cloud-fan/acc.	2016-04-21 01:06:22 -07:00
Reynold Xin	228128ce25	[SPARK-14794][SQL] Don't pass analyze command into Hive ## What changes were proposed in this pull request? We shouldn't pass analyze command to Hive because some of those would require running MapReduce jobs. For now, let's just always run the no scan analyze. ## How was this patch tested? Updated test case to reflect this change. Author: Reynold Xin <rxin@databricks.com> Closes #12558 from rxin/parser-analyze.	2016-04-21 00:31:06 -07:00
Reynold Xin	3b9fd51739	[HOTFIX] Disable flaky tests	2016-04-21 00:25:28 -07:00
Reynold Xin	77d847ddb2	[SPARK-14792][SQL] Move as many parsing rules as possible into SQL parser ## What changes were proposed in this pull request? This patch moves as many parsing rules as possible into SQL parser. There are only three more left after this patch: (1) run native command, (2) analyze, and (3) script IO. These 3 will be dealt with in a follow-up PR. ## How was this patch tested? No test change. This simply moves code around. Author: Reynold Xin <rxin@databricks.com> Closes #12556 from rxin/SPARK-14792.	2016-04-21 00:24:24 -07:00
Josh Rosen	cfe472a34e	[SPARK-14786] Remove hive-cli dependency from hive subproject The `hive` subproject currently depends on `hive-cli` in order to perform a check to see whether a `SessionState` is an instance of `org.apache.hadoop.hive.cli.CliSessionState` (see #9589). The introduction of this `hive-cli` dependency has caused problems for users whose Hive metastore JAR classpaths don't include the `hive-cli` classes (such as in #11495). This patch removes this dependency on `hive-cli` and replaces the `isInstanceOf` check by reflection. I added a Maven Enforcer rule to ban `hive-cli` from the `hive` subproject in order to make sure that this dependency is not accidentally reintroduced. /cc rxin yhuai adrian-wang preecet Author: Josh Rosen <joshrosen@databricks.com> Closes #12551 from JoshRosen/remove-hive-cli-dep-from-hive-subproject.	2016-04-20 22:50:27 -07:00
Reynold Xin	8045814114	[SPARK-14782][SPARK-14778][SQL] Remove HiveConf dependency from HiveSqlAstBuilder ## What changes were proposed in this pull request? The patch removes HiveConf dependency from HiveSqlAstBuilder. This is required in order to merge HiveSqlParser and SparkSqlAstBuilder, which would require getting rid of the Hive specific dependencies in HiveSqlParser. This patch also accomplishes [SPARK-14778] Remove HiveSessionState.substitutor. ## How was this patch tested? This should be covered by existing tests. Author: Reynold Xin <rxin@databricks.com> Closes #12550 from rxin/SPARK-14782.	2016-04-20 21:20:51 -07:00
Reynold Xin	24f338ba7b	[SPARK-14775][SQL] Remove TestHiveSparkSession.rewritePaths ## What changes were proposed in this pull request? The path rewrite in TestHiveSparkSession is pretty hacky. I think we can remove those complexity and just do a string replacement when we read the query files in. This would remove the overloading of runNativeSql in TestHive, which will simplify the removal of Hive specific variable substitution. ## How was this patch tested? This is a small test refactoring to simplify test infrastructure. Author: Reynold Xin <rxin@databricks.com> Closes #12543 from rxin/SPARK-14775.	2016-04-20 17:56:31 -07:00
Reynold Xin	334c293ec0	[SPARK-14769][SQL] Create built-in functionality for variable substitution ## What changes were proposed in this pull request? In order to fully merge the Hive parser and the SQL parser, we'd need to support variable substitution in Spark. The implementation of the substitute algorithm is mostly copied from Hive, but I simplified the overall structure quite a bit and added more comprehensive test coverage. Note that this pull request does not yet use this functionality anywhere. ## How was this patch tested? Added VariableSubstitutionSuite for unit tests. Author: Reynold Xin <rxin@databricks.com> Closes #12538 from rxin/SPARK-14769.	2016-04-20 16:32:38 -07:00
Reynold Xin	b28fe448d9	[SPARK-14770][SQL] Remove unused queries in hive module test resources ## What changes were proposed in this pull request? We currently have five folders in queries: clientcompare, clientnegative, clientpositive, negative, and positive. Only clientpositive is used. We can remove the rest. ## How was this patch tested? N/A - removing unused test resources. Author: Reynold Xin <rxin@databricks.com> Closes #12540 from rxin/SPARK-14770.	2016-04-20 16:29:26 -07:00
Subhobrata Dey	fd82681945	[SPARK-14749][SQL, TESTS] PlannerSuite failed when it run individually ## What changes were proposed in this pull request? 3 testcases namely, ``` "count is partially aggregated" "count distinct is partially aggregated" "mixed aggregates are partially aggregated" ``` were failing when running PlannerSuite individually. The PR provides a fix for this. ## How was this patch tested? unit tests (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Subhobrata Dey <sbcd90@gmail.com> Closes #12532 from sbcd90/plannersuitetestsfix.	2016-04-20 14:26:07 -07:00
Shixiong Zhu	7bc948557b	[SPARK-14678][SQL] Add a file sink log to support versioning and compaction ## What changes were proposed in this pull request? This PR adds a special log for FileStreamSink for two purposes: - Versioning. A future Spark version should be able to read the metadata of an old FileStreamSink. - Compaction. As reading from many small files is usually pretty slow, we should compact small metadata files into big files. FileStreamSinkLog has a new log format instead of Java serialization format. It will write one log file for each batch. The first line of the log file is the version number, and there are multiple JSON lines following. Each JSON line is a JSON format of FileLog. FileStreamSinkLog will compact log files every "spark.sql.sink.file.log.compactLen" batches into a big file. When doing a compact, it will read all history logs and merge them with the new batch. During the compaction, it will also delete the files that are deleted (marked by FileLog.action). When the reader uses allLogs to list all files, this method only returns the visible files (drops the deleted files). ## How was this patch tested? FileStreamSinkLogSuite Author: Shixiong Zhu <shixiong@databricks.com> Closes #12435 from zsxwing/sink-log.	2016-04-20 13:33:04 -07:00
Andrew Or	8fc267ab33	[SPARK-14720][SPARK-13643] Move Hive-specific methods into HiveSessionState and Create a SparkSession class ## What changes were proposed in this pull request? This PR has two main changes. 1. Move Hive-specific methods from HiveContext to HiveSessionState, which help the work of removing HiveContext. 2. Create a SparkSession Class, which will later be the entry point of Spark SQL users. ## How was this patch tested? Existing tests This PR is trying to fix test failures of https://github.com/apache/spark/pull/12485. Author: Andrew Or <andrew@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #12522 from yhuai/spark-session.	2016-04-20 12:58:48 -07:00
Tathagata Das	cb8ea9e1f3	[SPARK-14741][SQL] Fixed error in reading json file stream inside a partitioned directory ## What changes were proposed in this pull request? Consider the following directory structure dir/col=X/some-files If we create a text format streaming dataframe on `dir/col=X/` then it should not consider as partitioning in columns. Even though the streaming dataframe does not do so, the generated batch dataframes pick up col as a partitioning columns, causing mismatch streaming source schema and generated df schema. This leads to runtime failure: ``` 18:55:11.262 ERROR org.apache.spark.sql.execution.streaming.StreamExecution: Query query-0 terminated with error java.lang.AssertionError: assertion failed: Invalid batch: c#2 != c#7,type#8 ``` The reason is that the partition inferring code has no idea of a base path, above which it should not search of partitions. This PR makes sure that the batch DF is generated with the basePath set as the original path on which the file stream source is defined. ## How was this patch tested? New unit test Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #12517 from tdas/SPARK-14741.	2016-04-20 12:22:51 -07:00
Burak Yavuz	80bf48f437	[SPARK-14555] First cut of Python API for Structured Streaming ## What changes were proposed in this pull request? This patch provides a first cut of python APIs for structured streaming. This PR provides the new classes: - ContinuousQuery - Trigger - ProcessingTime in pyspark under `pyspark.sql.streaming`. In addition, it contains the new methods added under: - `DataFrameWriter` a) `startStream` b) `trigger` c) `queryName` - `DataFrameReader` a) `stream` - `DataFrame` a) `isStreaming` This PR doesn't contain all methods exposed for `ContinuousQuery`, for example: - `exception` - `sourceStatuses` - `sinkStatus` They may be added in a follow up. This PR also contains some very minor doc fixes in the Scala side. ## How was this patch tested? Python doc tests TODO: - [ ] verify Python docs look good Author: Burak Yavuz <brkyvz@gmail.com> Author: Burak Yavuz <burak@databricks.com> Closes #12320 from brkyvz/stream-python.	2016-04-20 10:32:01 -07:00
Liwei Lin	17db4bfeaa	[SPARK-14687][CORE][SQL][MLLIB] Call path.getFileSystem(conf) instead of call FileSystem.get(conf) ## What changes were proposed in this pull request? - replaced `FileSystem.get(conf)` calls with `path.getFileSystem(conf)` ## How was this patch tested? N/A Author: Liwei Lin <lwlin7@gmail.com> Closes #12450 from lw-lin/fix-fs-get.	2016-04-20 11:28:51 +01:00
Wenchen Fan	7abe9a6578	[SPARK-9013][SQL] generate MutableProjection directly instead of return a function `MutableProjection` is not thread-safe and we won't use it in multiple threads. I think the reason that we return `() => MutableProjection` is not about thread safety, but to save the costs of generating code when we need same but individual mutable projections. However, I only found one place that use this [feature](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Window.scala#L122-L123), and comparing to the troubles it brings, I think we should generate `MutableProjection` directly instead of return a function. Author: Wenchen Fan <wenchen@databricks.com> Closes #7373 from cloud-fan/project.	2016-04-20 00:44:02 -07:00
Dongjoon Hyun	6f1ec1f267	[MINOR] [SQL] Re-enable `explode()` and `json_tuple()` testcases in ExpressionToSQLSuite ## What changes were proposed in this pull request? Since [SPARK-12719: SQL Generation supports for generators](https://issues.apache.org/jira/browse/SPARK-12719) was resolved, this PR enables the related testcases: `explode()` and `json_tuple()`. ## How was this patch tested? Pass the Jenkins tests (with re-enabled test cases). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12329 from dongjoon-hyun/minor_enable_testcases.	2016-04-19 21:55:29 -07:00
Wenchen Fan	856bc465d5	[SPARK-14600] [SQL] Push predicates through Expand ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-14600 This PR makes `Expand.output` have different attributes from the grouping attributes produced by the underlying `Project`, as they have different meaning, so that we can safely push down filter through `Expand` ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #12496 from cloud-fan/expand.	2016-04-19 21:53:19 -07:00
Wenchen Fan	85d759ca3a	[SPARK-14704][CORE] create accumulators in TaskMetrics ## What changes were proposed in this pull request? Before this PR, we create accumulators at driver side(and register them) and send them to executor side, then we create `TaskMetrics` with these accumulators at executor side. After this PR, we will create `TaskMetrics` at driver side and send it to executor side, so that we can create accumulators inside `TaskMetrics` directly, which is cleaner. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #12472 from cloud-fan/acc.	2016-04-19 21:20:24 -07:00
Luciano Resende	78b38109ed	[SPARK-13419] [SQL] Update SubquerySuite to use checkAnswer for validation ## What changes were proposed in this pull request? Change SubquerySuite to validate test results utilizing checkAnswer helper method ## How was this patch tested? Existing tests Author: Luciano Resende <lresende@apache.org> Closes #12269 from lresende/SPARK-13419.	2016-04-19 21:02:10 -07:00
Joan	3ae25f244b	[SPARK-13929] Use Scala reflection for UDTs ## What changes were proposed in this pull request? Enable ScalaReflection and User Defined Types for plain Scala classes. This involves the move of `schemaFor` from `ScalaReflection` trait (which is Runtime and Compile time (macros) reflection) to the `ScalaReflection` object (runtime reflection only) as I believe this code wouldn't work at compile time anyway as it manipulates `Class`'s that are not compiled yet. ## How was this patch tested? Unit test Author: Joan <joan@goyeau.com> Closes #12149 from joan38/SPARK-13929-Scala-reflection.	2016-04-19 17:36:31 -07:00
Cheng Lian	10f273d8db	[SPARK-14407][SQL] Hides HadoopFsRelation related data source API into execution/datasources package #12178 ## What changes were proposed in this pull request? This PR moves `HadoopFsRelation` related data source API into `execution/datasources` package. Note that to avoid conflicts, this PR is based on #12153. Effective changes for this PR only consist of the last three commits. Will rebase after merging #12153. ## How was this patch tested? Existing tests. Author: Yin Huai <yhuai@databricks.com> Author: Cheng Lian <lian@databricks.com> Closes #12361 from liancheng/spark-14407-hide-hadoop-fs-relation.	2016-04-19 17:32:23 -07:00
Herman van Hovell	da8859226e	[SPARK-4226] [SQL] Support IN/EXISTS Subqueries ### What changes were proposed in this pull request? This PR adds support for in/exists predicate subqueries to Spark. Predicate sub-queries are used as a filtering condition in a query (this is the only supported use case). A predicate sub-query comes in two forms: - `[NOT] EXISTS(subquery)` - `[NOT] IN (subquery)` This PR is (loosely) based on the work of davies (https://github.com/apache/spark/pull/10706) and chenghao-intel (https://github.com/apache/spark/pull/9055). They should be credited for the work they did. ### How was this patch tested? Modified parsing unit tests. Added tests to `org.apache.spark.sql.SQLQuerySuite` cc rxin, davies & chenghao-intel Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12306 from hvanhovell/SPARK-4226.	2016-04-19 15:16:02 -07:00
Wenchen Fan	5cb2e33609	[SPARK-14675][SQL] ClassFormatError when use Seq as Aggregator buffer type ## What changes were proposed in this pull request? After https://github.com/apache/spark/pull/12067, we now use expressions to do the aggregation in `TypedAggregateExpression`. To implement buffer merge, we produce a new buffer deserializer expression by replacing `AttributeReference` with right-side buffer attribute, like other `DeclarativeAggregate`s do, and finally combine the left and right buffer deserializer with `Invoke`. However, after https://github.com/apache/spark/pull/12338, we will add loop variable to class members when codegen `MapObjects`. If the `Aggregator` buffer type is `Seq`, which is implemented by `MapObjects` expression, we will add the same loop variable to class members twice(by left and right buffer deserializer), which cause the `ClassFormatError`. This PR fixes this issue by calling `distinct` before declare the class menbers. ## How was this patch tested? new regression test in `DatasetAggregatorSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #12468 from cloud-fan/bug.	2016-04-19 10:51:58 -07:00
Josh Rosen	947b9020b0	[SPARK-14676] Wrap and re-throw Await.result exceptions in order to capture full stacktrace When `Await.result` throws an exception which originated from a different thread, the resulting stacktrace doesn't include the path leading to the `Await.result` call itself, making it difficult to identify the impact of these exceptions. For example, I've seen cases where broadcast cleaning errors propagate to the main thread and crash it but the resulting stacktrace doesn't include any of the main thread's code, making it difficult to pinpoint which exception crashed that thread. This patch addresses this issue by explicitly catching, wrapping, and re-throwing exceptions that are thrown by `Await.result`. I tested this manually using `16b31c8251`, a patch which reproduces an issue where an RPC exception which occurs while unpersisting RDDs manages to crash the main thread without any useful stacktrace, and verified that informative, full stacktraces were generated after applying the fix in this PR. /cc rxin nongli yhuai anabranch Author: Josh Rosen <joshrosen@databricks.com> Closes #12433 from JoshRosen/wrap-and-rethrow-await-exceptions.	2016-04-19 10:38:10 -07:00
gatorsmile	d9620e769e	[SPARK-12457] Fixed the Wrong Description and Missing Example in Collection Functions #### What changes were proposed in this pull request? https://github.com/apache/spark/pull/12185 contains the original PR I submitted in https://github.com/apache/spark/pull/10418 However, it misses one of the extended example, a wrong description and a few typos for collection functions. This PR is fix all these issues. #### How was this patch tested? The existing test cases already cover it. Author: gatorsmile <gatorsmile@gmail.com> Closes #12492 from gatorsmile/expressionUpdate.	2016-04-19 10:33:40 -07:00
Wenchen Fan	9ee95b6ecc	[SPARK-14491] [SQL] refactor object operator framework to make it easy to eliminate serializations ## What changes were proposed in this pull request? This PR tries to separate the serialization and deserialization logic from object operators, so that it's easier to eliminate unnecessary serializations in optimizer. Typed aggregate related operators are special, they will deserialize the input row to multiple objects and it's difficult to simply use a deserializer operator to abstract it, so we still mix the deserialization logic there. ## How was this patch tested? existing tests and new test in `EliminateSerializationSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #12260 from cloud-fan/encoder.	2016-04-19 10:00:44 -07:00
Cheng Lian	5e360c93be	[SPARK-13681][SPARK-14458][SPARK-14566][SQL] Add back once removed CommitFailureTestRelationSuite and SimpleTextHadoopFsRelationSuite ## What changes were proposed in this pull request? These test suites were removed while refactoring `HadoopFsRelation` related API. This PR brings them back. This PR also fixes two regressions: - SPARK-14458, which causes runtime error when saving partitioned tables using `FileFormat` data sources that are not able to infer their own schemata. This bug wasn't detected by any built-in data sources because all of them happen to have schema inference feature. - SPARK-14566, which happens to be covered by SPARK-14458 and causes wrong query result or runtime error when - appending a Dataset `ds` to a persisted partitioned data source relation `t`, and - partition columns in `ds` don't all appear after data columns ## How was this patch tested? `CommitFailureTestRelationSuite` uses a testing relation that always fails when committing write tasks to test write job cleanup. `SimpleTextHadoopFsRelationSuite` uses a testing relation to test general `HadoopFsRelation` and `FileFormat` interfaces. The two regressions are both covered by existing test cases. Author: Cheng Lian <lian@databricks.com> Closes #12179 from liancheng/spark-13681-commit-failure-test.	2016-04-19 09:37:00 -07:00
Dongjoon Hyun	3d46d796a3	[SPARK-14577][SQL] Add spark.sql.codegen.maxCaseBranches config option ## What changes were proposed in this pull request? We currently disable codegen for `CaseWhen` if the number of branches is greater than 20 (in CaseWhen.MAX_NUM_CASES_FOR_CODEGEN). It would be better if this value is a non-public config defined in SQLConf. ## How was this patch tested? Pass the Jenkins tests (including a new testcase `Support spark.sql.codegen.maxCaseBranches option`) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12353 from dongjoon-hyun/SPARK-14577.	2016-04-19 21:38:15 +08:00
bomeng	74fe235ab5	[SPARK-14398][SQL] Audit non-reserved keyword list in ANTLR4 parser ## What changes were proposed in this pull request? I have compared non-reserved list in Antlr3 and Antlr4 one by one as well as all the existing keywords defined in Antlr4, added the missing keywords to the non-reserved keywords list. If we need to support more syntax, we can add more keywords by then. Any recommendation for the above is welcome. ## How was this patch tested? I manually checked the keywords one by one. Please let me know if there is a better way to test. Another thought: I suggest to put all the keywords definition and non-reserved list in order, that will be much easier to check in the future. Author: bomeng <bmeng@us.ibm.com> Closes #12191 from bomeng/SPARK-14398.	2016-04-19 09:09:58 +02:00
Wenchen Fan	d4b94ead92	[SPARK-14595][SQL] add input metrics for FileScanRDD ## What changes were proposed in this pull request? This is roughly based on the input metrics logic in `SqlNewHadoopRDD` ## How was this patch tested? Not sure how to write a test, I manually verified it in Spark UI. Author: Wenchen Fan <wenchen@databricks.com> Closes #12352 from cloud-fan/metrics.	2016-04-18 23:48:22 -07:00
Sameer Agarwal	6f88006895	[SPARK-14722][SQL] Rename upstreams() -> inputRDDs() in WholeStageCodegen ## What changes were proposed in this pull request? Per rxin's suggestions, this patch renames `upstreams()` to `inputRDDs()` in `WholeStageCodegen` for better implied semantics ## How was this patch tested? N/A Author: Sameer Agarwal <sameer@databricks.com> Closes #12486 from sameeragarwal/codegen-cleanup.	2016-04-18 20:28:58 -07:00
Sameer Agarwal	4eae1dbd7c	[SPARK-14718][SQL] Avoid mutating ExprCode in doGenCode ## What changes were proposed in this pull request? The `doGenCode` method currently takes in an `ExprCode`, mutates it and returns the java code to evaluate the given expression. It should instead just return a new `ExprCode` to avoid passing around mutable objects during code generation. ## How was this patch tested? Existing Tests Author: Sameer Agarwal <sameer@databricks.com> Closes #12483 from sameeragarwal/new-exprcode-2.	2016-04-18 20:28:22 -07:00
Reynold Xin	5e92583d38	[SPARK-14667] Remove HashShuffleManager ## What changes were proposed in this pull request? The sort shuffle manager has been the default since Spark 1.2. It is time to remove the old hash shuffle manager. ## How was this patch tested? Removed some tests related to the old manager. Author: Reynold Xin <rxin@databricks.com> Closes #12423 from rxin/SPARK-14667.	2016-04-18 19:30:00 -07:00
Andrew Or	f1a11976db	[SPARK-14674][SQL] Move HiveContext.hiveconf to HiveSessionState ## What changes were proposed in this pull request? This is just cleanup. This allows us to remove HiveContext later without inflating the diff too much. This PR fixes the conflicts of https://github.com/apache/spark/pull/12431. It also removes the `def hiveConf` from `HiveSqlParser`. So, we will pass the HiveConf associated with a session explicitly instead of relying on Hive's `SessionState` to pass `HiveConf`. ## How was this patch tested? Existing tests. Closes #12431 Author: Andrew Or <andrew@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #12449 from yhuai/hiveconf.	2016-04-18 14:28:47 -07:00
Sameer Agarwal	8bd8121329	[SPARK-14710][SQL] Rename gen/genCode to genCode/doGenCode to better reflect the semantics ## What changes were proposed in this pull request? Per rxin's suggestions, this patch renames `s/gen/genCode` and `s/genCode/doGenCode` to better reflect the semantics of these 2 function calls. ## How was this patch tested? N/A (refactoring only) Author: Sameer Agarwal <sameer@databricks.com> Closes #12475 from sameeragarwal/gencode.	2016-04-18 14:03:40 -07:00
hyukjinkwon	6fc1e72d9b	[MINOR] Revert removing explicit typing (changed in some examples and StatFunctions) ## What changes were proposed in this pull request? This PR reverts some changes in https://github.com/apache/spark/pull/12413. (please see the discussion in that PR). from ```scala words.foreachRDD { (rdd, time) => ... ``` to ```scala words.foreachRDD { (rdd: RDD[String], time: Time) => ... ``` Also, this was discussed in dev-mailing list, [here](http://apache-spark-developers-list.1001551.n3.nabble.com/Question-about-Scala-style-explicit-typing-within-transformation-functions-and-anonymous-val-td17173.html) ## How was this patch tested? This was tested with `sbt scalastyle`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #12452 from HyukjinKwon/revert-explicit-typing.	2016-04-18 13:45:03 -07:00
Andrew Or	28ee15702d	[SPARK-14647][SQL] Group SQLContext/HiveContext state into SharedState ## What changes were proposed in this pull request? This patch adds a SharedState that groups state shared across multiple SQLContexts. This is analogous to the SessionState added in SPARK-13526 that groups session-specific state. This cleanup makes the constructors of the contexts simpler and ultimately allows us to remove HiveContext in the near future. ## How was this patch tested? Existing tests. Author: Yin Huai <yhuai@databricks.com> Closes #12463 from yhuai/sharedState.	2016-04-18 13:15:23 -07:00
Reynold Xin	e4ae974294	[HOTFIX] Fix Scala 2.10 compilation break.	2016-04-18 12:57:23 -07:00
Dongjoon Hyun	d280d1da1a	[SPARK-14580][SPARK-14655][SQL] Hive IfCoercion should preserve predicate. ## What changes were proposed in this pull request? Currently, `HiveTypeCoercion.IfCoercion` removes all predicates whose return-type are null. However, some UDFs need evaluations because they are designed to throw exceptions. This PR fixes that to preserve the predicates. Also, `assert_true` is implemented as Spark SQL function. Before ``` scala> sql("select if(assert_true(false),2,3)").head res2: org.apache.spark.sql.Row = [3] ``` After ``` scala> sql("select if(assert_true(false),2,3)").head ... ASSERT_TRUE ... ``` Hive ``` hive> select if(assert_true(false),2,3); OK Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: ASSERT_TRUE(): assertion failed. ``` ## How was this patch tested? Pass the Jenkins tests (including a new testcase in `HivePlanTest`) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12340 from dongjoon-hyun/SPARK-14580.	2016-04-18 12:26:56 -07:00
Tathagata Das	775cf17eaa	[SPARK-14473][SQL] Define analysis rules to catch operations not supported in streaming ## What changes were proposed in this pull request? There are many operations that are currently not supported in the streaming execution. For example: - joining two streams - unioning a stream and a batch source - sorting - window functions (not time windows) - distinct aggregates Furthermore, executing a query with a stream source as a batch query should also fail. This patch add an additional step after analysis in the QueryExecution which will check that all the operations in the analyzed logical plan is supported or not. ## How was this patch tested? unit tests. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #12246 from tdas/SPARK-14473.	2016-04-18 11:09:33 -07:00
Dongjoon Hyun	432d1399cb	[SPARK-14614] [SQL] Add `bround` function ## What changes were proposed in this pull request? This PR aims to add `bound` function (aka Banker's round) by extending current `round` implementation. [Hive supports `bround` since 1.3.0.](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) Hive (1.3 ~ 2.0) ``` hive> select round(2.5), bround(2.5); OK 3.0 2.0 ``` After this PR ```scala scala> sql("select round(2.5), bround(2.5)").head res0: org.apache.spark.sql.Row = [3,2] ``` ## How was this patch tested? Pass the Jenkins tests (with extended tests). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12376 from dongjoon-hyun/SPARK-14614.	2016-04-18 10:44:51 -07:00
Reynold Xin	1a3966472c	[SPARK-14696][SQL] Add implicit encoders for boxed primitive types ## What changes were proposed in this pull request? We currently only have implicit encoders for scala primitive types. We should also add implicit encoders for boxed primitives. Otherwise, the following code would not have an encoder: ```scala sqlContext.range(1000).map { i => i } ``` ## How was this patch tested? Added a unit test case for this. Author: Reynold Xin <rxin@databricks.com> Closes #12466 from rxin/SPARK-14696.	2016-04-18 17:03:15 +08:00
Wenchen Fan	2f1d0320c9	[SPARK-13363][SQL] support Aggregator in RelationalGroupedDataset ## What changes were proposed in this pull request? set the input encoder for `TypedColumn` in `RelationalGroupedDataset.agg`. ## How was this patch tested? new tests in `DatasetAggregatorSuite` close https://github.com/apache/spark/pull/11269 This PR brings https://github.com/apache/spark/pull/12359 up to date and fix the compile. Author: Wenchen Fan <wenchen@databricks.com> Closes #12451 from cloud-fan/agg.	2016-04-18 14:27:26 +08:00
Andrew Or	7de06a646d	Revert "[SPARK-14647][SQL] Group SQLContext/HiveContext state into SharedState" This reverts commit `5cefecc95a`.	2016-04-17 17:35:41 -07:00
Subhobrata Dey	699a4dfd89	[SPARK-14632] randomSplit method fails on dataframes with maps in schema ## What changes were proposed in this pull request? The patch fixes the issue with the randomSplit method which is not able to split dataframes which has maps in schema. The bug was introduced in spark 1.6.1. ## How was this patch tested? Tested with unit tests. (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Subhobrata Dey <sbcd90@gmail.com> Closes #12438 from sbcd90/randomSplitIssue.	2016-04-17 15:18:32 -07:00

1 2 3 4 5 ...

3318 commits