ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Cheng Hao	f728e0fe7e	[SPARK-2663] [SQL] Support the Grouping Set Add support for `GROUPING SETS`, `ROLLUP`, `CUBE` and the the virtual column `GROUPING__ID`. More details on how to use the `GROUPING SETS" can be found at: https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup https://issues.apache.org/jira/secure/attachment/12676811/grouping_set.pdf The generic idea of the implementations are : 1 Replace the `ROLLUP`, `CUBE` with `GROUPING SETS` 2 Explode each of the input row, and then feed them to `Aggregate` * Each grouping set are represented as the bit mask for the `GroupBy Expression List`, for each bit, `1` means the expression is selected, otherwise `0` (left is the lower bit, and right is the higher bit in the `GroupBy Expression List`) * Several of projections are constructed according to the grouping sets, and within each projection(Seq[Expression), we replace those expressions with `Literal(null)` if it's not selected in the grouping set (based on the bit mask) * Output Schema of `Explode` is `child.output :+ grouping__id` * GroupBy Expressions of `Aggregate` is `GroupBy Expression List :+ grouping__id` * Keep the `Aggregation expressions` the same for the `Aggregate` The expressions substitutions happen in Logic Plan analyzing, so we will benefit from the Logical Plan optimization (e.g. expression constant folding, and map side aggregation etc.), Only an `Explosive` operator added for Physical Plan, which will explode the rows according the pre-set projections. A known issue will be done in the follow up PR: * Optimization `ColumnPruning` is not supported yet for `Explosive` node. Author: Cheng Hao <hao.cheng@intel.com> Closes #1567 from chenghao-intel/grouping_sets and squashes the following commits: fe65fcc [Cheng Hao] Remove the extra space 3547056 [Cheng Hao] Add more doc and Simplify the Expand a7c869d [Cheng Hao] update code as feedbacks d23c672 [Cheng Hao] Add GroupingExpression to replace the Seq[Expression] 414b165 [Cheng Hao] revert the unnecessary changes ec276c6 [Cheng Hao] Support Rollup/Cube/GroupingSets	2014-12-18 18:58:29 -08:00
Venkata Ramana Gollamudi	f33d550464	[SPARK-3891][SQL] Add array support to percentile, percentile_approx and constant inspectors support Supported passing array to percentile and percentile_approx UDAFs To support percentile_approx, constant inspectors are supported for GenericUDAF Constant folding support added to CreateArray expression Avoided constant udf expression re-evaluation Author: Venkata Ramana G <ramana.gollamudihuawei.com> Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com> Closes #2802 from gvramana/percentile_array_support and squashes the following commits: a0182e5 [Venkata Ramana Gollamudi] fixed review comment a18f917 [Venkata Ramana Gollamudi] avoid constant udf expression re-evaluation - fixes failure due to return iterator and value type mismatch c46db0f [Venkata Ramana Gollamudi] Removed TestHive reset 4d39105 [Venkata Ramana Gollamudi] Unified inspector creation, style check fixes f37fd69 [Venkata Ramana Gollamudi] Fixed review comments 47f6365 [Venkata Ramana Gollamudi] fixed test cb7c61e [Venkata Ramana Gollamudi] Supported ConstantInspector for UDAF Fixed HiveUdaf wrap object issue. 7f94aff [Venkata Ramana Gollamudi] Added foldable support to CreateArray	2014-12-17 15:41:35 -08:00
Cheng Hao	8d0d2a65eb	[SPARK-4856] [SQL] NullType instead of StringType when sampling against empty string or nul... ``` TestSQLContext.sparkContext.parallelize( """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" :: """{"ip":"27.31.100.29","headers":{}}""" :: """{"ip":"27.31.100.29","headers":""}""" :: Nil) ``` As empty string (the "headers") will be considered as String in the beginning (in line 2 and 3), it ignores the real nested data type (struct type "headers" in line 1), and also take the line 1 (the "headers") as String Type, which is not our expected. Author: Cheng Hao <hao.cheng@intel.com> Closes #3708 from chenghao-intel/json and squashes the following commits: e7a72e9 [Cheng Hao] add more concise unit test 853de51 [Cheng Hao] NullType instead of StringType when sampling against empty string or null value	2014-12-17 15:01:59 -08:00
Michael Armbrust	19c0faad6d	[HOTFIX][SQL] Fix parquet filter suite Author: Michael Armbrust <michael@databricks.com> Closes #3727 from marmbrus/parquetNotEq and squashes the following commits: 2157bfc [Michael Armbrust] Fix parquet filter suite	2014-12-17 14:27:02 -08:00
Cheng Hao	636d9fc450	[SPARK-3739] [SQL] Update the split num base on block size for table scanning In local mode, Hadoop/Hive will ignore the "mapred.map.tasks", hence for small table file, it's always a single input split, however, SparkSQL doesn't honor that in table scanning, and we will get different result when do the Hive Compatibility test. This PR will fix that. Author: Cheng Hao <hao.cheng@intel.com> Closes #2589 from chenghao-intel/source_split and squashes the following commits: dff38e7 [Cheng Hao] Remove the extra blank line 160a2b6 [Cheng Hao] fix the compiling bug 04d67f7 [Cheng Hao] Keep 1 split for small file in table scanning	2014-12-17 13:39:36 -08:00
Daoyuan Wang	902e4d54ac	[SPARK-4755] [SQL] sqrt(negative value) should return null Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #3616 from adrian-wang/sqrt and squashes the following commits: d877439 [Daoyuan Wang] fix NULLTYPE 3effa2c [Daoyuan Wang] sqrt(negative value) should return null	2014-12-17 12:51:27 -08:00
Cheng Lian	6277135376	[SPARK-4493][SQL] Don't pushdown Eq, NotEq, Lt, LtEq, Gt and GtEq predicates with nulls for Parquet Predicates like `a = NULL` and `a < NULL` can't be pushed down since Parquet `Lt`, `LtEq`, `Gt`, `GtEq` doesn't accept null value. Note that `Eq` and `NotEq` can only be used with `null` to represent predicates like `a IS NULL` and `a IS NOT NULL`. However, normally this issue doesn't cause NPE because any value compared to `NULL` results `NULL`, and Spark SQL automatically optimizes out `NULL` predicate in the `SimplifyFilters` rule. Only testing code that intentionally disables the optimizer may trigger this issue. (That's why this issue is not marked as blocker and I do NOT think we need to backport this to branch-1.1 This PR restricts `Lt`, `LtEq`, `Gt` and `GtEq` to non-null values only, and only uses `Eq` with null value to pushdown `IsNull` and `IsNotNull`. Also, added support for Parquet `NotEq` filter for completeness and (tiny) performance gain, it's also used to pushdown `IsNotNull`. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3367) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #3367 from liancheng/filters-with-null and squashes the following commits: cc41281 [Cheng Lian] Fixes several styling issues de7de28 [Cheng Lian] Adds stricter rules for Parquet filters with null	2014-12-17 12:48:04 -08:00
Michael Armbrust	7ad579ee97	[SPARK-3698][SQL] Fix case insensitive resolution of GetField. Based on #2543. Author: Michael Armbrust <michael@databricks.com> Closes #3724 from marmbrus/resolveGetField and squashes the following commits: 0a47aae [Michael Armbrust] Fix case insensitive resolution of GetField.	2014-12-17 12:43:51 -08:00
carlmartin	4782def094	[SPARK-4694]Fix HiveThriftServer2 cann't stop In Yarn HA mode. HiveThriftServer2 can not exit automactic when changing the standy resource manager in Yarn HA mode. The scheduler backend was aware of the AM had been exited so it call sc.stop to exit the driver process but there was a user thread(HiveThriftServer2 ) which was still alive and cause this problem. To fix it, make a demo thread to detect the sparkContext is null or not.If the sc is stopped, call the ThriftServer.stop to stop the user thread. Author: carlmartin <carlmartinmax@gmail.com> Closes #3576 from SaintBacchus/ThriftServer2ExitBug and squashes the following commits: 2890b4a [carlmartin] Use SparkListener instead of the demo thread to stop the hive server. c15da0e [carlmartin] HiveThriftServer2 can not exit automactic when changing the standy resource manager in Yarn HA mode	2014-12-17 12:24:03 -08:00
Cheng Hao	5fdcbdc0c9	[SPARK-4625] [SQL] Add sort by for DSL & SimpleSqlParser Add `sort by` support for both DSL & SqlParser. This PR is relevant with #3386, either one merged, will cause the other rebased. Author: Cheng Hao <hao.cheng@intel.com> Closes #3481 from chenghao-intel/sortby and squashes the following commits: 041004f [Cheng Hao] Add sort by for DSL & SimpleSqlParser	2014-12-17 12:01:57 -08:00
scwf	60698801eb	[SPARK-4618][SQL] Make foreign DDL commands options case-insensitive Using lowercase for ```options``` key to make it case-insensitive, then we should use lower case to get value from parameters. So flowing cmd work ``` create temporary table normal_parquet USING org.apache.spark.sql.parquet OPTIONS ( PATH '/xxx/data' ) ``` Author: scwf <wangfei1@huawei.com> Author: wangfei <wangfei1@huawei.com> Closes #3470 from scwf/ddl-ulcase and squashes the following commits: ae78509 [scwf] address comments 8f4f585 [wangfei] address comments 3c132ef [scwf] minor fix a0fc20b [scwf] Merge branch 'master' of https://github.com/apache/spark into ddl-ulcase 4f86401 [scwf] adding CaseInsensitiveMap e244e8d [wangfei] using lower case in json e0cb017 [wangfei] make options in-casesensitive	2014-12-16 21:26:36 -08:00
Davies Liu	ec5c4279ed	[SPARK-4866] support StructType as key in MapType This PR brings support of using StructType(and other hashable types) as key in MapType. Author: Davies Liu <davies@databricks.com> Closes #3714 from davies/fix_struct_in_map and squashes the following commits: 68585d7 [Davies Liu] fix primitive types in MapType 9601534 [Davies Liu] support StructType as key in MapType	2014-12-16 21:23:28 -08:00
Cheng Hao	770d8153a5	[SPARK-4375] [SQL] Add 0 argument support for udf Author: Cheng Hao <hao.cheng@intel.com> Closes #3595 from chenghao-intel/udf0 and squashes the following commits: a858973 [Cheng Hao] Add 0 arguments support for udf	2014-12-16 21:21:11 -08:00
Takuya UESHIN	ddc7ba31cb	[SPARK-4720][SQL] Remainder should also return null if the divider is 0. This is a follow-up of SPARK-4593 (#3443). Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #3581 from ueshin/issues/SPARK-4720 and squashes the following commits: `c3959d4` [Takuya UESHIN] Make Remainder return null if the divider is 0.	2014-12-16 21:19:57 -08:00
Cheng Hao	0aa834adea	[SPARK-4744] [SQL] Short circuit evaluation for AND & OR in CodeGen Author: Cheng Hao <hao.cheng@intel.com> Closes #3606 from chenghao-intel/codegen_short_circuit and squashes the following commits: f466303 [Cheng Hao] short circuit for AND & OR	2014-12-16 21:18:39 -08:00
Cheng Lian	3b395e1051	[SPARK-4798][SQL] A new set of Parquet testing API and test suites This PR provides a set Parquet testing API (see trait `ParquetTest`) that enables developers to write more concise test cases. A new set of Parquet test suites built upon this API are added and aim to replace the old `ParquetQuerySuite`. To avoid potential merge conflicts, old testing code are not removed yet. The following classes can be safely removed after most Parquet related PRs are handled: - `ParquetQuerySuite` - `ParquetTestData` <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3644) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #3644 from liancheng/parquet-tests and squashes the following commits: 800e745 [Cheng Lian] Enforces ordering of test output 3bb8731 [Cheng Lian] Refactors HiveParquetSuite aa2cb2e [Cheng Lian] Decouples ParquetTest and TestSQLContext 7b43a68 [Cheng Lian] Updates ParquetTest Scaladoc 7f07af0 [Cheng Lian] Adds a new set of Parquet test suites	2014-12-16 21:16:03 -08:00
Jacky Li	fa66ef6c97	[SPARK-4269][SQL] make wait time configurable in BroadcastHashJoin In BroadcastHashJoin, currently it is using a hard coded value (5 minutes) to wait for the execution and broadcast of the small table. In my opinion, it should be a configurable value since broadcast may exceed 5 minutes in some case, like in a busy/congested network environment. Author: Jacky Li <jacky.likun@huawei.com> Closes #3133 from jackylk/timeout-config and squashes the following commits: 733ac08 [Jacky Li] add spark.sql.broadcastTimeout in SQLConf.scala 557acd4 [Jacky Li] switch to sqlContext.getConf 81a5e20 [Jacky Li] make wait time configurable in BroadcastHashJoin	2014-12-16 15:34:59 -08:00
Michael Armbrust	a66c23e134	[SPARK-4827][SQL] Fix resolution of deeply nested Project(attr, Project(Star,...)). Since `AttributeReference` resolution and `*` expansion are currently in separate rules, each pair requires a full iteration instead of being able to resolve in a single pass. Since its pretty easy to construct queries that have many of these in a row, I combine them into a single rule in this PR. Author: Michael Armbrust <michael@databricks.com> Closes #3674 from marmbrus/projectStars and squashes the following commits: d83d6a1 [Michael Armbrust] Fix resolution of deeply nested Project(attr, Project(Star,...)).	2014-12-16 15:31:19 -08:00
tianyi	30f6b85c81	[SPARK-4483][SQL]Optimization about reduce memory costs during the HashOuterJoin In `HashOuterJoin.scala`, spark read data from both side of join operation before zip them together. It is a waste for memory. We are trying to read data from only one side, put them into a hashmap, and then generate the `JoinedRow` with data from other side one by one. Currently, we could only do this optimization for `left outer join` and `right outer join`. For `full outer join`, we will do something in another issue. for table test_csv contains 1 million records table dim_csv contains 10 thousand records SQL: `select * from test_csv a left outer join dim_csv b on a.key = b.key` the result is: master: ``` CSV: 12671 ms CSV: 9021 ms CSV: 9200 ms Current Mem Usage:787788984 ``` after patch： ``` CSV: 10382 ms CSV: 7543 ms CSV: 7469 ms Current Mem Usage:208145728 ``` Author: tianyi <tianyi@asiainfo-linkage.com> Author: tianyi <tianyi.asiainfo@gmail.com> Closes #3375 from tianyi/SPARK-4483 and squashes the following commits: 72a8aec [tianyi] avoid having mutable state stored inside of the task 99c5c97 [tianyi] performance optimization d2f94d7 [tianyi] fix bug: missing output when the join-key is null. 2be45d1 [tianyi] fix spell bug 1f2c6f1 [tianyi] remove commented codes a676de6 [tianyi] optimize some codes 9e7d5b5 [tianyi] remove commented old codes 838707d [tianyi] Optimization about reduce memory costs during the HashOuterJoin	2014-12-16 15:22:29 -08:00
wangxiaojing	ea1315e3e2	[SPARK-4527][SQl]Add BroadcastNestedLoopJoin operator selection testsuite In `JoinSuite` add BroadcastNestedLoopJoin operator selection testsuite Author: wangxiaojing <u9jing@gmail.com> Closes #3395 from wangxiaojing/SPARK-4527 and squashes the following commits: ea0e495 [wangxiaojing] change style 53c3952 [wangxiaojing] Add BroadcastNestedLoopJoin operator selection testsuite	2014-12-16 14:45:56 -08:00
zsxwing	6530243a52	[SPARK-4812][SQL] Fix the initialization issue of 'codegenEnabled' The problem is `codegenEnabled` is `val`, but it uses a `val` `sqlContext`, which can be override by subclasses. Here is a simple example to show this issue. ```Scala scala> :paste // Entering paste mode (ctrl-D to finish) abstract class Foo { protected val sqlContext = "Foo" val codegenEnabled: Boolean = { println(sqlContext) // it will call subclass's `sqlContext` which has not yet been initialized. if (sqlContext != null) { true } else { false } } } class Bar extends Foo { override val sqlContext = "Bar" } println(new Bar().codegenEnabled) // Exiting paste mode, now interpreting. null false defined class Foo defined class Bar ``` We should make `sqlContext` `final` to prevent subclasses from overriding it incorrectly. Author: zsxwing <zsxwing@gmail.com> Closes #3660 from zsxwing/SPARK-4812 and squashes the following commits: 1cbb623 [zsxwing] Make `sqlContext` final to prevent subclasses from overriding it incorrectly	2014-12-16 14:13:40 -08:00
jerryshao	dc8280dcca	[SPARK-4847][SQL]Fix "extraStrategies cannot take effect in SQLContext" issue Author: jerryshao <saisai.shao@intel.com> Closes #3698 from jerryshao/SPARK-4847 and squashes the following commits: 4741130 [jerryshao] Make later added extraStrategies effect when calling strategies	2014-12-16 14:08:28 -08:00
Judy Nash	17688d1429	[SQL] SPARK-4700: Add HTTP protocol spark thrift server Add HTTP protocol support and test cases to spark thrift server, so users can deploy thrift server in both TCP and http mode. Author: Judy Nash <judynash@microsoft.com> Author: judynash <judynash@microsoft.com> Closes #3672 from judynash/master and squashes the following commits: 526315d [Judy Nash] correct spacing on startThriftServer method 31a6520 [Judy Nash] fix code style issues and update sql programming guide format issue 47bf87e [Judy Nash] modify withJdbcStatement method definition to meet less than 100 line length 2e9c11c [Judy Nash] add thrift server in http mode documentation on sql programming guide 1cbd305 [Judy Nash] Merge remote-tracking branch 'upstream/master' 2b1d312 [Judy Nash] updated http thrift server support based on feedback 377532c [judynash] add HTTP protocol spark thrift server	2014-12-16 12:37:26 -08:00
Sean Owen	81112e4b57	SPARK-4814 [CORE] Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger This enables assertions for the Maven and SBT build, but overrides the Hive module to not enable assertions. Author: Sean Owen <sowen@cloudera.com> Closes #3692 from srowen/SPARK-4814 and squashes the following commits: caca704 [Sean Owen] Disable assertions just for Hive f71e783 [Sean Owen] Enable assertions for SBT and Maven build	2014-12-15 17:12:05 -08:00
Daoyuan Wang	41a3f93438	[SPARK-4829] [SQL] add rule to fold count(expr) if expr is not null Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #3676 from adrian-wang/countexpr and squashes the following commits: dc5765b [Daoyuan Wang] add rule to fold count(expr) if expr is not null	2014-12-11 22:56:42 -08:00
Sasaki Toru	8091dd62ea	[SPARK-4742][SQL] The name of Parquet File generated by AppendingParquetOutputFormat should be zero padded When I use Parquet File as a output file using ParquetOutputFormat#getDefaultWorkFile, the file name is not zero padded while RDD#saveAsText does zero padding. Author: Sasaki Toru <sasakitoa@nttdata.co.jp> Closes #3602 from sasakitoa/parquet-zeroPadding and squashes the following commits: 6b0e58f [Sasaki Toru] Merge branch 'master' of git://github.com/apache/spark into parquet-zeroPadding 20dc79d [Sasaki Toru] Fixed the name of Parquet File generated by AppendingParquetOutputFormat	2014-12-11 22:54:21 -08:00
Cheng Hao	0abbff2862	[SPARK-4825] [SQL] CTAS fails to resolve when created using saveAsTable Fix bug when query like: ``` test("save join to table") { val testData = sparkContext.parallelize(1 to 10).map(i => TestData(i, i.toString)) sql("CREATE TABLE test1 (key INT, value STRING)") testData.insertInto("test1") sql("CREATE TABLE test2 (key INT, value STRING)") testData.insertInto("test2") testData.insertInto("test2") sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").saveAsTable("test") checkAnswer( table("test"), sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").collect().toSeq) } ``` Author: Cheng Hao <hao.cheng@intel.com> Closes #3673 from chenghao-intel/spark_4825 and squashes the following commits: e8cbd56 [Cheng Hao] alternate the pattern matching order for logical plan:CTAS e004895 [Cheng Hao] fix bug	2014-12-11 22:51:49 -08:00
Daoyuan Wang	cbb634ae69	[SQL] enable empty aggr test case This is fixed by SPARK-4318 #3184 Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #3445 from adrian-wang/emptyaggr and squashes the following commits: 982575e [Daoyuan Wang] enable empty aggr test case	2014-12-11 22:50:18 -08:00
Daoyuan Wang	acb3be6bc5	[SPARK-4828] [SQL] sum and avg on empty table should always return null So the optimizations are not valid. Also I think the optimization here is rarely encounter, so removing them will not have influence on performance. Can we merge #3445 before I add a comparison test case from this? Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #3675 from adrian-wang/sumempty and squashes the following commits: 42df763 [Daoyuan Wang] sum and avg on empty table should always return null	2014-12-11 22:49:27 -08:00
scwf	d8cf678589	[SQL] Remove unnecessary case in HiveContext.toHiveString a follow up of #3547 /cc marmbrus Author: scwf <wangfei1@huawei.com> Closes #3563 from scwf/rnc and squashes the following commits: 9395661 [scwf] remove unnecessary condition	2014-12-11 22:48:03 -08:00
Takuya UESHIN	334480362b	[SPARK-4293][SQL] Make Cast be able to handle complex types. Inserting data of type including `ArrayType.containsNull == false` or `MapType.valueContainsNull == false` or `StructType.fields.exists(_.nullable == false)` into Hive table will fail because `Cast` inserted by `HiveMetastoreCatalog.PreInsertionCasts` rule of `Analyzer` can't handle these types correctly. Complex type cast rule proposal: - Cast for non-complex types should be able to cast the same as before. - Cast for `ArrayType` can evaluate if - Element type can cast - Nullability rule doesn't break - Cast for `MapType` can evaluate if - Key type can cast - Nullability for casted key type is `false` - Value type can cast - Nullability rule for value type doesn't break - Cast for `StructType` can evaluate if - The field size is the same - Each field can cast - Nullability rule for each field doesn't break - The nested structure should be the same. Nullability rule: - If the casted type is `nullable == true`, the target nullability should be `true` Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #3150 from ueshin/issues/SPARK-4293 and squashes the following commits: e935939 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4293 ba14003 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4293 8999868 [Takuya UESHIN] Fix a test title. f677c30 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4293 287f410 [Takuya UESHIN] Add tests to insert data of types ArrayType / MapType / StructType with nullability is false into Hive table. 4f71bb8 [Takuya UESHIN] Make Cast be able to handle complex types.	2014-12-11 22:45:25 -08:00
Jacky Li	c152dde78f	[SPARK-4639] [SQL] Pass maxIterations in as a parameter in Analyzer fix a TODO in Analyzer: // TODO: pass this in as a parameter val fixedPoint = FixedPoint(100) Author: Jacky Li <jacky.likun@huawei.com> Closes #3499 from jackylk/config and squashes the following commits: 4c1252c [Jacky Li] fix scalastyle 820f460 [Jacky Li] pass maxIterations in as a parameter	2014-12-11 22:44:27 -08:00
Cheng Hao	a7f07f511c	[SPARK-4662] [SQL] Whitelist more unittest Whitelist more hive unit test: "create_like_tbl_props" "udf5" "udf_java_method" "decimal_1" "udf_pmod" "udf_to_double" "udf_to_float" "udf7" (this will fail in Hive 0.12) Author: Cheng Hao <hao.cheng@intel.com> Closes #3522 from chenghao-intel/unittest and squashes the following commits: f54e4c7 [Cheng Hao] work around to clean up the hive.table.parameters.default in reset 16fee22 [Cheng Hao] Whitelist more unittest	2014-12-11 22:43:02 -08:00
Cheng Hao	bf40cf89e3	[SPARK-4713] [SQL] SchemaRDD.unpersist() should not raise exception if it is not persisted Unpersist a uncached RDD, will not raise exception, for example: ``` val data = Array(1, 2, 3, 4, 5) val distData = sc.parallelize(data) distData.unpersist(true) ``` But the `SchemaRDD` will raise exception if the `SchemaRDD` is not cached. Since `SchemaRDD` is the subclasses of the `RDD`, we should follow the same behavior. Author: Cheng Hao <hao.cheng@intel.com> Closes #3572 from chenghao-intel/try_uncache and squashes the following commits: 50a7a89 [Cheng Hao] SchemaRDD.unpersist() should not raise exception if it is not persisted	2014-12-11 22:41:36 -08:00
Joseph K. Bradley	2a5b5fd4cc	[SPARK-4791] [sql] Infer schema from case class with multiple constructors Modified ScalaReflection.schemaFor to take primary constructor of Product when there are multiple constructors. Added test to suite which failed before but works now. Needed for [https://github.com/apache/spark/pull/3637] CC: marmbrus Author: Joseph K. Bradley <joseph@databricks.com> Closes #3646 from jkbradley/sql-reflection and squashes the following commits: 796b2e4 [Joseph K. Bradley] Modified ScalaReflection.schemaFor to take primary constructor of Product when there are multiple constructors. Added test to suite which failed before but works now.	2014-12-10 23:41:15 -08:00
Cheng Hao	383c5555c9	[SPARK-4785][SQL] Initilize Hive UDFs on the driver and serialize them with a wrapper Different from Hive 0.12.0, in Hive 0.13.1 UDF/UDAF/UDTF (aka Hive function) objects should only be initialized once on the driver side and then serialized to executors. However, not all function objects are serializable (e.g. GenericUDF doesn't implement Serializable). Hive 0.13.1 solves this issue with Kryo or XML serializer. Several utility ser/de methods are provided in class o.a.h.h.q.e.Utilities for this purpose. In this PR we chose Kryo for efficiency. The Kryo serializer used here is created in Hive. Spark Kryo serializer wasn't used because there's no available SparkConf instance. Author: Cheng Hao <hao.cheng@intel.com> Author: Cheng Lian <lian@databricks.com> Closes #3640 from chenghao-intel/udf_serde and squashes the following commits: 8e13756 [Cheng Hao] Update the comment 74466a3 [Cheng Hao] refactor as feedbacks 396c0e1 [Cheng Hao] avoid Simple UDF to be serialized e9c3212 [Cheng Hao] update the comment 19cbd46 [Cheng Hao] support udf instance ser/de after initialization	2014-12-09 10:28:33 -08:00
Cheng Hao	51b1fe1426	[SPARK-4769] [SQL] CTAS does not work when reading from temporary tables This is the code refactor and follow ups for #2570 Author: Cheng Hao <hao.cheng@intel.com> Closes #3336 from chenghao-intel/createtbl and squashes the following commits: 3563142 [Cheng Hao] remove the unused variable e215187 [Cheng Hao] eliminate the compiling warning 4f97f14 [Cheng Hao] fix bug in unittest 5d58812 [Cheng Hao] revert the API changes b85b620 [Cheng Hao] fix the regression of temp tabl not found in CTAS	2014-12-08 17:39:12 -08:00
Jacky Li	944384363d	[SQL] remove unnecessary import in spark-sql Author: Jacky Li <jacky.likun@huawei.com> Closes #3630 from jackylk/remove and squashes the following commits: 150e7e0 [Jacky Li] remove unnecessary import	2014-12-08 17:27:46 -08:00
Cheng Lian	6f61e1f961	[SPARK-4761][SQL] Enables Kryo by default in Spark SQL Thrift server Enables Kryo and disables reference tracking by default in Spark SQL Thrift server. Configurations explicitly defined by users in `spark-defaults.conf` are respected (the Thrift server is started by `spark-submit`, which handles configuration properties properly). <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3621) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #3621 from liancheng/kryo-by-default and squashes the following commits: 70c2775 [Cheng Lian] Enables Kryo by default in Spark SQL Thrift server	2014-12-05 10:27:40 -08:00
Michael Armbrust	f5801e813f	[SPARK-4753][SQL] Use catalyst for partition pruning in newParquet. Author: Michael Armbrust <michael@databricks.com> Closes #3613 from marmbrus/parquetPartitionPruning and squashes the following commits: 4f138f8 [Michael Armbrust] Use catalyst for partition pruning in newParquet.	2014-12-04 22:25:21 -08:00
Aaron Davidson	c6c7165e7e	[SQL] Minor: Avoid calling Seq#size in a loop Just found this instance while doing some jstack-based profiling of a Spark SQL job. It is very unlikely that this is causing much of a perf issue anywhere, but it is unnecessarily suboptimal. Author: Aaron Davidson <aaron@databricks.com> Closes #3593 from aarondav/seq-opt and squashes the following commits: 962cdfc [Aaron Davidson] [SQL] Minor: Avoid calling Seq#size in a loop	2014-12-04 00:58:42 -08:00
Jacky Li	ed88db4cb2	[SQL] remove unnecessary import Author: Jacky Li <jacky.likun@huawei.com> Closes #3585 from jackylk/remove and squashes the following commits: 045423d [Jacky Li] remove unnecessary import	2014-12-04 00:43:55 -08:00
Michael Armbrust	513ef82e85	[SPARK-4552][SQL] Avoid exception when reading empty parquet data through Hive This is a very small fix that catches one specific exception and returns an empty table. #3441 will address this in a more principled way. Author: Michael Armbrust <michael@databricks.com> Closes #3586 from marmbrus/fixEmptyParquet and squashes the following commits: 2781d9f [Michael Armbrust] Handle empty lists for newParquet 04dd376 [Michael Armbrust] Avoid exception when reading empty parquet data through Hive	2014-12-03 14:13:35 -08:00
wangfei	3ae0cda83c	[SPARK-4695][SQL] Get result using executeCollect Using ```executeCollect``` to collect the result, because executeCollect is a custom implementation of collect in spark sql which better than rdd's collect Author: wangfei <wangfei1@huawei.com> Closes #3547 from scwf/executeCollect and squashes the following commits: a5ab68e [wangfei] Revert "adding debug info" a60d680 [wangfei] fix test failure 0db7ce8 [wangfei] adding debug info 184c594 [wangfei] using executeCollect instead collect	2014-12-02 14:30:44 -08:00
Daoyuan Wang	1f5ddf17e8	[SPARK-4670] [SQL] wrong symbol for bitwise not We should use `~` instead of `-` for bitwise NOT. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #3528 from adrian-wang/symbol and squashes the following commits: affd4ad [Daoyuan Wang] fix code gen test case 56efb79 [Daoyuan Wang] ensure bitwise NOT over byte and short persist data type f55fbae [Daoyuan Wang] wrong symbol for bitwise not	2014-12-02 14:25:12 -08:00
Daoyuan Wang	f6df609dcc	[SPARK-4593][SQL] Return null when denominator is 0 SELECT max(1/0) FROM src would return a very large number, which is obviously not right. For hive-0.12, hive would return `Infinity` for 1/0, while for hive-0.13.1, it is `NULL` for 1/0. I think it is better to keep our behavior with newer Hive version. This PR ensures that when the divider is 0, the result of expression should be NULL, same with hive-0.13.1 Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #3443 from adrian-wang/div and squashes the following commits: 2e98677 [Daoyuan Wang] fix code gen for divide 0 85c28ba [Daoyuan Wang] temp 36236a5 [Daoyuan Wang] add test cases 6f5716f [Daoyuan Wang] fix comments cee92bd [Daoyuan Wang] avoid evaluation 2 times 22ecd9a [Daoyuan Wang] fix style cf28c58 [Daoyuan Wang] divide fix 2dfe50f [Daoyuan Wang] return null when divider is 0 of Double type	2014-12-02 14:21:47 -08:00
YanTangZhai	1066427600	[SPARK-4676][SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null val jsc = new org.apache.spark.api.java.JavaSparkContext(sc) val jhc = new org.apache.spark.sql.hive.api.java.JavaHiveContext(jsc) val nrdd = jhc.hql("select null from spark_test.for_test") println(nrdd.schema) Then the error is thrown as follows: scala.MatchError: NullType (of class org.apache.spark.sql.catalyst.types.NullType$) at org.apache.spark.sql.types.util.DataTypeConversions$.asJavaDataType(DataTypeConversions.scala:43) Author: YanTangZhai <hakeemzhai@tencent.com> Author: yantangzhai <tyz0303@163.com> Author: Michael Armbrust <michael@databricks.com> Closes #3538 from YanTangZhai/MatchNullType and squashes the following commits: e052dff [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null 4b4bb34 [yantangzhai] [SPARK-4676] [SQL] JavaSchemaRDD.schema may throw NullType MatchError if sql has null 896c7b7 [yantangzhai] fix NullType MatchError in JavaSchemaRDD when sql has null 6e643f8 [YanTangZhai] Merge pull request #11 from apache/master e249846 [YanTangZhai] Merge pull request #10 from apache/master d26d982 [YanTangZhai] Merge pull request #9 from apache/master 76d4027 [YanTangZhai] Merge pull request #8 from apache/master 03b62b0 [YanTangZhai] Merge pull request #7 from apache/master 8a00106 [YanTangZhai] Merge pull request #6 from apache/master cbcba66 [YanTangZhai] Merge pull request #3 from apache/master cdef539 [YanTangZhai] Merge pull request #1 from apache/master	2014-12-02 14:15:12 -08:00
baishuo	69b6fed206	[SPARK-4663][sql]add finally to avoid resource leak Author: baishuo <vc_java@hotmail.com> Closes #3526 from baishuo/master-trycatch and squashes the following commits: d446e14 [baishuo] correct the code style b36bf96 [baishuo] correct the code style ae0e447 [baishuo] add finally to avoid resource leak	2014-12-02 12:12:03 -08:00
Kousuke Saruta	e75e04f980	[SPARK-4536][SQL] Add sqrt and abs to Spark SQL DSL Spark SQL has embeded sqrt and abs but DSL doesn't support those functions. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #3401 from sarutak/dsl-missing-operator and squashes the following commits: 07700cf [Kousuke Saruta] Modified Literal(null, NullType) to Literal(null) in DslQuerySuite 8f366f8 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into dsl-missing-operator 1b88e2e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into dsl-missing-operator 0396f89 [Kousuke Saruta] Added sqrt and abs to Spark SQL DSL	2014-12-02 12:07:52 -08:00
Reynold Xin	b1f8fe316a	Indent license header properly for interfaces.scala. A very small nit update. Author: Reynold Xin <rxin@databricks.com> Closes #3552 from rxin/license-header and squashes the following commits: df8d1a4 [Reynold Xin] Indent license header properly for interfaces.scala.	2014-12-02 11:59:15 -08:00
zsxwing	d3e02dddf0	[SPARK-4268][SQL] Use #::: to get benefit from Stream in SqlLexical.allCaseVersions In addition, using `s.isEmpty` to eliminate the string comparison. Author: zsxwing <zsxwing@gmail.com> Closes #3132 from zsxwing/SPARK-4268 and squashes the following commits: 358e235 [zsxwing] Improvement of allCaseVersions	2014-12-01 16:39:54 -08:00
Daoyuan Wang	4df60a8cbc	[SPARK-4529] [SQL] support view with column alias Support view definition like CREATE VIEW view3(valoo) TBLPROPERTIES ("fear" = "factor") AS SELECT upper(value) FROM src WHERE key=86; [valoo as the alias of upper(value)]. This is missing part of SPARK-4239, for a fully view support. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #3396 from adrian-wang/viewcolumn and squashes the following commits: 4d001d0 [Daoyuan Wang] support view with column alias	2014-12-01 16:08:51 -08:00
wangfei	7b79957879	[SQL] Minor fix for doc and comment Author: wangfei <wangfei1@huawei.com> Closes #3533 from scwf/sql-doc1 and squashes the following commits: 962910b [wangfei] doc and comment fix	2014-12-01 14:02:02 -08:00
ravipesala	bc353819cc	[SPARK-4658][SQL] Code documentation issue in DDL of datasource API Author: ravipesala <ravindra.pesala@huawei.com> Closes #3516 from ravipesala/ddl_doc and squashes the following commits: d101fdf [ravipesala] Style issues fixed d2238cd [ravipesala] Corrected documentation	2014-12-01 13:31:27 -08:00
ravipesala	6a9ff19dc0	[SPARK-4650][SQL] Supporting multi column support in countDistinct function like count(distinct c1,c2..) in Spark SQL Supporting multi column support in countDistinct function like count(distinct c1,c2..) in Spark SQL Author: ravipesala <ravindra.pesala@huawei.com> Author: Michael Armbrust <michael@databricks.com> Closes #3511 from ravipesala/countdistinct and squashes the following commits: cc4dbb1 [ravipesala] style 070e12a [ravipesala] Supporting multi column support in count(distinct c1,c2..) in Spark SQL	2014-12-01 13:28:04 -08:00
Liang-Chi Hsieh	b57365a1ec	[SPARK-4358][SQL] Let BigDecimal do checking type compatibility Remove hardcoding max and min values for types. Let BigDecimal do checking type compatibility. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #3208 from viirya/more_numericLit and squashes the following commits: e9834b4 [Liang-Chi Hsieh] Remove byte and short types for number literal. 1bd1825 [Liang-Chi Hsieh] Fix Indentation and make the modification clearer. cf1a997 [Liang-Chi Hsieh] Modified for comment to add a rule of analysis that adds a cast. 91fe489 [Liang-Chi Hsieh] add Byte and Short. 1bdc69d [Liang-Chi Hsieh] Let BigDecimal do checking type compatibility.	2014-12-01 13:17:56 -08:00
Jacky Li	bafee67eba	[SQL] add @group tab in limit() and count() group tab is missing for scaladoc Author: Jacky Li <jacky.likun@gmail.com> Closes #3458 from jackylk/patch-7 and squashes the following commits: 0121a70 [Jacky Li] add @group tab in limit() and count()	2014-12-01 13:12:30 -08:00
zsxwing	30a86acdef	[SPARK-4661][Core] Minor code and docs cleanup Author: zsxwing <zsxwing@gmail.com> Closes #3521 from zsxwing/SPARK-4661 and squashes the following commits: 03cbe3f [zsxwing] Minor code and docs cleanup	2014-12-01 00:35:01 -08:00
Cheng Lian	5b99bf243e	[SPARK-4645][SQL] Disables asynchronous execution in Hive 0.13.1 HiveThriftServer2 This PR disables HiveThriftServer2 asynchronous execution by setting `runInBackground` argument in `ExecuteStatementOperation` to `false`, and reverting `SparkExecuteStatementOperation.run` in Hive 13 shim to Hive 12 version. This change makes Simba ODBC driver v1.0.0.1000 work. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3506) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #3506 from liancheng/disable-async-exec and squashes the following commits: 593804d [Cheng Lian] Disables asynchronous execution in Hive 0.13.1 HiveThriftServer2	2014-11-28 11:42:40 -05:00
w00228970	723be60e23	[SQL] Compute timeTaken correctly ```timeTaken``` should not count the time of printing result. Author: w00228970 <wangfei1@huawei.com> Closes #3423 from scwf/time-taken-bug and squashes the following commits: da7e102 [w00228970] compute time taken correctly	2014-11-24 21:17:24 -08:00
Davies Liu	6cf507685e	[SPARK-4548] []SPARK-4517] improve performance of python broadcast Re-implement the Python broadcast using file: 1) serialize the python object using cPickle, write into disks. 2) Create a wrapper in JVM (for the dumped file), it read data from during serialization 3) Using TorrentBroadcast or HttpBroadcast to transfer the data (compressed) into executors 4) During deserialization, writing the data into disk. 5) Passing the path into Python worker, read data from disk and unpickle it into python object, until the first access. It fixes the performance regression introduced in #2659, has similar performance as 1.1, but support object larger than 2G, also improve the memory efficiency (only one compressed copy in driver and executor). Testing with a 500M broadcast and 4 tasks (excluding the benefit from reused worker in 1.2): name \| 1.1 \| 1.2 with this patch \| improvement ---------\|--------\|---------\|-------- python-broadcast-w-bytes \| 25.20 \| 9.33 \| 170.13% \| python-broadcast-w-set \| 4.13 \| 4.50 \| -8.35% \| Testing with 100 tasks (16 CPUs): name \| 1.1 \| 1.2 with this patch \| improvement ---------\|--------\|---------\|-------- python-broadcast-w-bytes \| 38.16 \| 8.40 \| 353.98% python-broadcast-w-set \| 23.29 \| 9.59 \| 142.80% Author: Davies Liu <davies@databricks.com> Closes #3417 from davies/pybroadcast and squashes the following commits: 50a58e0 [Davies Liu] address comments b98de1d [Davies Liu] disable gc while unpickle e5ee6b9 [Davies Liu] support large string 09303b8 [Davies Liu] read all data into memory dde02dd [Davies Liu] improve performance of python broadcast	2014-11-24 17:17:03 -08:00
Kousuke Saruta	dd1c9cb36c	[SPARK-4487][SQL] Fix attribute reference resolution error when using ORDER BY. When we use ORDER BY clause, at first, attributes referenced by projection are resolved (1). And then, attributes referenced at ORDER BY clause are resolved (2). But when resolving attributes referenced at ORDER BY clause, the resolution result generated in (1) is discarded so for example, following query fails. SELECT c1 + c2 FROM mytable ORDER BY c1; The query above fails because when resolving the attribute reference 'c1', the resolution result of 'c2' is discarded. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #3363 from sarutak/SPARK-4487 and squashes the following commits: fd314f3 [Kousuke Saruta] Fixed attribute resolution logic in Analyzer 6e60c20 [Kousuke Saruta] Fixed conflicts cb5b7e9 [Kousuke Saruta] Added test case for SPARK-4487 282d529 [Kousuke Saruta] Fixed attributes reference resolution error b6123e6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into concat-feature 317b7fb [Kousuke Saruta] WIP	2014-11-24 12:54:37 -08:00
Daniel Darabos	d5834f0732	[SQL] Fix comment in HiveShim This file is for Hive 0.13.1 I think. Author: Daniel Darabos <darabos.daniel@gmail.com> Closes #3432 from darabos/patch-2 and squashes the following commits: 4fd22ed [Daniel Darabos] Fix comment. This file is for Hive 0.13.1.	2014-11-24 12:45:12 -08:00
Cheng Lian	a6d7b61f92	[SPARK-4479][SQL] Avoids unnecessary defensive copies when sort based shuffle is on This PR is a workaround for SPARK-4479. Two changes are introduced: when merge sort is bypassed in `ExternalSorter`, 1. also bypass RDD elements buffering as buffering is the reason that `MutableRow` backed row objects must be copied, and 2. avoids defensive copies in `Exchange` operator <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3422) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #3422 from liancheng/avoids-defensive-copies and squashes the following commits: 591f2e9 [Cheng Lian] Passes all shuffle suites 0c3c91e [Cheng Lian] Fixes shuffle write metrics when merge sort is bypassed ed5df3c [Cheng Lian] Fixes styling changes f75089b [Cheng Lian] Avoids unnecessary defensive copies when sort based shuffle is on	2014-11-24 12:43:45 -08:00
Michael Armbrust	90a6a46bd1	[SPARK-4522][SQL] Parse schema with missing metadata. This is just a quick fix for 1.2. SPARK-4523 describes a more complete solution. Author: Michael Armbrust <michael@databricks.com> Closes #3392 from marmbrus/parquetMetadata and squashes the following commits: bcc6626 [Michael Armbrust] Parse schema with missing metadata.	2014-11-20 20:34:43 -08:00
Michael Armbrust	02ec058efe	[SPARK-4413][SQL] Parquet support through datasource API Goals: - Support for accessing parquet using SQL but not requiring Hive (thus allowing support of parquet tables with decimal columns) - Support for folder based partitioning with automatic discovery of available partitions - Caching of file metadata See scaladoc of `ParquetRelation2` for more details. Author: Michael Armbrust <michael@databricks.com> Closes #3269 from marmbrus/newParquet and squashes the following commits: 1dd75f1 [Michael Armbrust] Pass all paths for FileInputFormat at once. 645768b [Michael Armbrust] Review comments. abd8e2f [Michael Armbrust] Alternative implementation of parquet based on the datasources API. 938019e [Michael Armbrust] Add an experimental interface to data sources that exposes catalyst expressions. e9d2641 [Michael Armbrust] logging / formatting improvements.	2014-11-20 18:31:02 -08:00
Cheng Hao	84d79ee9ec	[SPARK-4244] [SQL] Support Hive Generic UDFs with constant object inspector parameters Query `SELECT named_struct(lower("AA"), "12", lower("Bb"), "13") FROM src LIMIT 1` will throw exception, some of the Hive Generic UDF/UDAF requires the input object inspector is `ConstantObjectInspector`, however, we won't get that before the expression optimization executed. (Constant Folding). This PR is a work around to fix this. (As ideally, the `output` of LogicalPlan should be identical before and after Optimization). Author: Cheng Hao <hao.cheng@intel.com> Closes #3109 from chenghao-intel/optimized and squashes the following commits: 487ff79 [Cheng Hao] rebase to the latest master & update the unittest	2014-11-20 16:50:59 -08:00
Jacky Li	ad5f1f3ca2	[SQL] fix function description mistake Sample code in the description of SchemaRDD.where is not correct Author: Jacky Li <jacky.likun@gmail.com> Closes #3344 from jackylk/patch-6 and squashes the following commits: 62cd126 [Jacky Li] [SQL] fix function description mistake	2014-11-20 15:48:36 -08:00
Cheng Hao	6aa0fc9f4d	[SPARK-2918] [SQL] Support the CTAS in EXPLAIN command Hive supports the `explain` the CTAS, which was supported by Spark SQL previously, however, seems it was reverted after the code refactoring in HiveQL. Author: Cheng Hao <hao.cheng@intel.com> Closes #3357 from chenghao-intel/explain and squashes the following commits: 7aace63 [Cheng Hao] Support the CTAS in EXPLAIN command	2014-11-20 15:46:00 -08:00
Takuya UESHIN	2c2e7a44db	[SPARK-4318][SQL] Fix empty sum distinct. Executing sum distinct for empty table throws `java.lang.UnsupportedOperationException: empty.reduceLeft`. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #3184 from ueshin/issues/SPARK-4318 and squashes the following commits: 8168c42 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4318 66fdb0a [Takuya UESHIN] Re-refine aggregate functions. 6186eb4 [Takuya UESHIN] Fix Sum of GeneratedAggregate. d2975f6 [Takuya UESHIN] Refine Sum and Average of GeneratedAggregate. 1bba675 [Takuya UESHIN] Refine Sum, SumDistinct and Average functions. 917e533 [Takuya UESHIN] Use aggregate instead of groupBy(). 1a5f874 [Takuya UESHIN] Add tests to be executed as non-partial aggregation. a5a57d2 [Takuya UESHIN] Fix empty Average. 22799dc [Takuya UESHIN] Fix empty Sum and SumDistinct. 65b7dd2 [Takuya UESHIN] Fix empty sum distinct.	2014-11-20 15:41:24 -08:00
ravipesala	98e9419784	[SPARK-4513][SQL] Support relational operator '<=>' in Spark SQL The relational operator '<=>' is not working in Spark SQL. Same works in Spark HiveQL Author: ravipesala <ravindra.pesala@huawei.com> Closes #3387 from ravipesala/<=> and squashes the following commits: 7198e90 [ravipesala] Supporting relational operator '<=>' in Spark SQL	2014-11-20 15:34:03 -08:00
Dan McClary	b8e6886fb8	[SPARK-4228][SQL] SchemaRDD to JSON Here's a simple fix for SchemaRDD to JSON. Author: Dan McClary <dan.mcclary@gmail.com> Closes #3213 from dwmclary/SPARK-4228 and squashes the following commits: d714e1d [Dan McClary] fixed PEP 8 error cac2879 [Dan McClary] move pyspark comment and doctest to correct location f9471d3 [Dan McClary] added pyspark doc and doctest 6598cee [Dan McClary] adding complex type queries 1a5fd30 [Dan McClary] removing SPARK-4228 from SQLQuerySuite 4a651f0 [Dan McClary] cleaned PEP and Scala style failures. Moved tests to JsonSuite 47ceff6 [Dan McClary] cleaned up scala style issues 2ee1e70 [Dan McClary] moved rowToJSON to JsonRDD 4387dd5 [Dan McClary] Added UserDefinedType, cleaned up case formatting 8f7bfb6 [Dan McClary] Map type added to SchemaRDD.toJSON 1b11980 [Dan McClary] Map and UserDefinedTypes partially done 11d2016 [Dan McClary] formatting and unicode deserialization default fixed 6af72d1 [Dan McClary] deleted extaneous comment 4d11c0c [Dan McClary] JsonFactory rewrite of toJSON for SchemaRDD 149dafd [Dan McClary] wrapped scala toJSON in sql.py 5e5eb1b [Dan McClary] switched to Jackson for JSON processing 6c94a54 [Dan McClary] added toJSON to pyspark SchemaRDD aaeba58 [Dan McClary] added toJSON to pyspark SchemaRDD 1d171aa [Dan McClary] upated missing brace on if statement 319e3ba [Dan McClary] updated to upstream master with merged SPARK-4228 424f130 [Dan McClary] tests pass, ready for pull and PR 626a5b1 [Dan McClary] added toJSON to SchemaRDD f7d166a [Dan McClary] added toJSON method 5d34e37 [Dan McClary] merge resolved d6d19e9 [Dan McClary] pr example	2014-11-20 13:44:19 -08:00
Cheng Lian	abf29187f0	[SPARK-3938][SQL] Names in-memory columnar RDD with corresponding table name This PR enables the Web UI storage tab to show the in-memory table name instead of the mysterious query plan string as the name of the in-memory columnar RDD. Note that after #2501, a single columnar RDD can be shared by multiple in-memory tables, as long as their query results are the same. In this case, only the first cached table name is shown. For example: ```sql CACHE TABLE first AS SELECT * FROM src; CACHE TABLE second AS SELECT * FROM src; ``` The Web UI only shows "In-memory table first". <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3383) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #3383 from liancheng/columnar-rdd-name and squashes the following commits: 071907f [Cheng Lian] Fixes tests 12ddfa6 [Cheng Lian] Names in-memory columnar RDD with corresponding table name	2014-11-20 13:12:24 -08:00
Marcelo Vanzin	397d3aae5b	Bumping version to 1.3.0-SNAPSHOT. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #3277 from vanzin/version-1.3 and squashes the following commits: 7c3c396 [Marcelo Vanzin] Added temp repo to sbt build. 5f404ff [Marcelo Vanzin] Add another exclusion. 19457e7 [Marcelo Vanzin] Update old version to 1.2, add temporary 1.2 repo. 3c8d705 [Marcelo Vanzin] Workaround for MIMA checks. e940810 [Marcelo Vanzin] Bumping version to 1.3.0-SNAPSHOT.	2014-11-18 21:24:18 -08:00
Cheng Lian	423baea953	[SPARK-4468][SQL] Fixes Parquet filter creation for inequality predicates with literals on the left hand side For expressions like `10 < someVar`, we should create an `Operators.Gt` filter, but right now an `Operators.Lt` is created. This issue affects all inequality predicates with literals on the left hand side. (This bug existed before #3317 and affects branch-1.1. #3338 was opened to backport this to branch-1.1.) <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3334) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #3334 from liancheng/fix-parquet-comp-filter and squashes the following commits: 0130897 [Cheng Lian] Fixes Parquet comparison filter generation	2014-11-18 17:41:54 -08:00
Davies Liu	4a377aff2d	[SPARK-3721] [PySpark] broadcast objects larger than 2G This patch will bring support for broadcasting objects larger than 2G. pickle, zlib, FrameSerializer and Array[Byte] all can not support objects larger than 2G, so this patch introduce LargeObjectSerializer to serialize broadcast objects, the object will be serialized and compressed into small chunks, it also change the type of Broadcast[Array[Byte]]] into Broadcast[Array[Array[Byte]]]]. Testing for support broadcast objects larger than 2G is slow and memory hungry, so this is tested manually, could be added into SparkPerf. Author: Davies Liu <davies@databricks.com> Author: Davies Liu <davies.liu@gmail.com> Closes #2659 from davies/huge and squashes the following commits: 7b57a14 [Davies Liu] add more tests for broadcast 28acff9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge a2f6a02 [Davies Liu] bug fix 4820613 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge 5875c73 [Davies Liu] address comments 10a349b [Davies Liu] address comments 0c33016 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge 6182c8f [Davies Liu] Merge branch 'master' into huge d94b68f [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge 2514848 [Davies Liu] address comments fda395b [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge 1c2d928 [Davies Liu] fix scala style 091b107 [Davies Liu] broadcast objects larger than 2G	2014-11-18 16:17:51 -08:00
Michael Armbrust	90d72ec850	[SQL] Support partitioned parquet tables that have the key in both the directory and the file Author: Michael Armbrust <michael@databricks.com> Closes #3272 from marmbrus/keyInPartitionedTable and squashes the following commits: 447f08c [Michael Armbrust] Support partitioned parquet tables that have the key in both the directory and the file	2014-11-18 12:13:23 -08:00
Cheng Lian	36b0956a3e	[SPARK-4453][SPARK-4213][SQL] Simplifies Parquet filter generation code While reviewing PR #3083 and #3161, I noticed that Parquet record filter generation code can be simplified significantly according to the clue stated in [SPARK-4453](https://issues.apache.org/jira/browse/SPARK-4213). This PR addresses both SPARK-4453 and SPARK-4213 with this simplification. While generating `ParquetTableScan` operator, we need to remove all Catalyst predicates that have already been pushed down to Parquet. Originally, we first generate the record filter, and then call `findExpression` to traverse the generated filter to find out all pushed down predicates [[1](`64c6b9bad5/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala (L213-L228)`)]. In this way, we have to introduce the `CatalystFilter` class hierarchy to bind the Catalyst predicates together with their generated Parquet filter, and complicate the code base a lot. The basic idea of this PR is that, we don't need `findExpression` after filter generation, because we already know a predicate can be pushed down if we can successfully generate its corresponding Parquet filter. SPARK-4213 is fixed by returning `None` for any unsupported predicate type. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3317) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #3317 from liancheng/simplify-parquet-filters and squashes the following commits: d6a9499 [Cheng Lian] Fixes import styling issue 43760e8 [Cheng Lian] Simplifies Parquet filter generation logic	2014-11-17 16:55:12 -08:00
Cheng Hao	ef7c464eff	[SPARK-4448] [SQL] unwrap for the ConstantObjectInspector Author: Cheng Hao <hao.cheng@intel.com> Closes #3308 from chenghao-intel/unwrap_constant_oi and squashes the following commits: 156b500 [Cheng Hao] rebase the master c5b20ab [Cheng Hao] unwrap for the ConstantObjectInspector	2014-11-17 16:35:49 -08:00
w00228970	42389b1780	[SPARK-4443][SQL] Fix statistics for external table in spark sql hive The `totalSize` of external table is always zero, which will influence join strategy(always use broadcast join for external table). Author: w00228970 <wangfei1@huawei.com> Closes #3304 from scwf/statistics and squashes the following commits: 568f321 [w00228970] fix statistics for external table	2014-11-17 16:33:50 -08:00
Cheng Lian	6b7f2f753d	[SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and fixes for complex types This PR is exactly the same as #3178 except it reverts the `FileStatus.isDir` to `FileStatus.isDirectory` change, since it doesn't compile with Hadoop 1. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3298) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #3298 from liancheng/date-for-thriftserver and squashes the following commits: 866037e [Cheng Lian] Revers isDirectory to isDir (it breaks Hadoop 1 profile) 6f71d0b [Cheng Lian] Makes toHiveString static 26fa955 [Cheng Lian] Fixes complex type support in Hive 0.13.1 shim a92882a [Cheng Lian] Updates HiveShim for 0.13.1 73f442b [Cheng Lian] Adds Date support for HiveThriftServer2 (Hive 0.12.0)	2014-11-17 16:31:05 -08:00
Cheng Hao	69e858cc77	[SQL] Construct the MutableRow from an Array Author: Cheng Hao <hao.cheng@intel.com> Closes #3217 from chenghao-intel/mutablerow and squashes the following commits: e8a10bd [Cheng Hao] revert the change of Row object 4681aea [Cheng Hao] Add toMutableRow method in object Row a751838 [Cheng Hao] Construct the MutableRow from an existed row	2014-11-17 16:29:52 -08:00
Takuya UESHIN	566c791931	[SPARK-4425][SQL] Handle NaN or Infinity cast to Timestamp correctly. `Cast` from `NaN` or `Infinity` of `Double` or `Float` to `TimestampType` throws `NumberFormatException`. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #3283 from ueshin/issues/SPARK-4425 and squashes the following commits: 14def0c [Takuya UESHIN] Fix Cast to be able to handle NaN or Infinity to TimestampType.	2014-11-17 16:28:07 -08:00
Takuya UESHIN	3a81a1c9e0	[SPARK-4420][SQL] Change nullability of Cast from DoubleType/FloatType to DecimalType. This is follow-up of [SPARK-4390](https://issues.apache.org/jira/browse/SPARK-4390) (#3256). Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #3278 from ueshin/issues/SPARK-4420 and squashes the following commits: 7fea558 [Takuya UESHIN] Add some tests. cb2301a [Takuya UESHIN] Fix tests. 133bad5 [Takuya UESHIN] Change nullability of Cast from DoubleType/FloatType to DecimalType.	2014-11-17 16:26:48 -08:00
Cheng Lian	5ce7dae859	[SQL] Makes conjunction pushdown more aggressive for in-memory table This is inspired by the [Parquet record filter generation code](`64c6b9bad5/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetFilters.scala (L387-L400)`). <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3318) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #3318 from liancheng/aggresive-conj-pushdown and squashes the following commits: 78b69d2 [Cheng Lian] Makes conjunction pushdown more aggressive	2014-11-17 15:33:13 -08:00
Michael Armbrust	64c6b9bad5	[SPARK-4410][SQL] Add support for external sort Adds a new operator that uses Spark's `ExternalSort` class. It is off by default now, but we might consider making it the default if benchmarks show that it does not regress performance. Author: Michael Armbrust <michael@databricks.com> Closes #3268 from marmbrus/externalSort and squashes the following commits: 48b9726 [Michael Armbrust] comments b98799d [Michael Armbrust] Add test afd7562 [Michael Armbrust] Add support for external sort.	2014-11-16 21:55:57 -08:00
Michael Armbrust	45ce3273cb	Revert "[SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and fixes for complex types" Author: Michael Armbrust <michael@databricks.com> Closes #3292 from marmbrus/revert4309 and squashes the following commits: 808e96e [Michael Armbrust] Revert "[SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and fixes for complex types"	2014-11-16 15:05:08 -08:00
Cheng Lian	cb6bd83a91	[SPARK-4309][SPARK-4407][SQL] Date type support for Thrift server, and fixes for complex types SPARK-4407 was detected while working on SPARK-4309. Merged these two into a single PR since 1.2.0 RC is approaching. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3178) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #3178 from liancheng/date-for-thriftserver and squashes the following commits: 6f71d0b [Cheng Lian] Makes toHiveString static 26fa955 [Cheng Lian] Fixes complex type support in Hive 0.13.1 shim a92882a [Cheng Lian] Updates HiveShim for 0.13.1 73f442b [Cheng Lian] Adds Date support for HiveThriftServer2 (Hive 0.12.0)	2014-11-16 14:26:41 -08:00
Kousuke Saruta	84468b2e20	[SPARK-4426][SQL][Minor] The symbol of BitwiseOr is wrong, should not be '&' The symbol of BitwiseOr is defined as '&' but I think it's wrong. It should be '\|'. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #3284 from sarutak/bitwise-or-symbol-fix and squashes the following commits: aff4be5 [Kousuke Saruta] Fixed symbol of BitwiseOr	2014-11-15 22:23:47 -08:00
kai	cbddac2369	Added contains(key) to Metadata Add contains(key) to org.apache.spark.sql.catalyst.util.Metadata to test the existence of a key. Otherwise, Class Metadata's get methods may throw NoSuchElement exception if the key does not exist. Testcases are added to MetadataSuite as well. Author: kai <kaizeng@eecs.berkeley.edu> Closes #3273 from kai-zeng/metadata-fix and squashes the following commits: 74b3d03 [kai] Added contains(key) to Metadata	2014-11-14 23:44:23 -08:00
Jim Carroll	37482ce5a7	[SPARK-4412][SQL] Fix Spark's control of Parquet logging. The Spark ParquetRelation.scala code makes the assumption that the parquet.Log class has already been loaded. If ParquetRelation.enableLogForwarding executes prior to the parquet.Log class being loaded then the code in enableLogForwarding has no affect. ParquetRelation.scala attempts to override the parquet logger but, at least currently (and if your application simply reads a parquet file before it does anything else with Parquet), the parquet.Log class hasn't been loaded yet. Therefore the code in ParquetRelation.enableLogForwarding has no affect. If you look at the code in parquet.Log there's a static initializer that needs to be called prior to enableLogForwarding or whatever enableLogForwarding does gets undone by this static initializer. The "fix" would be to force the static initializer to get called in parquet.Log as part of enableForwardLogging. Author: Jim Carroll <jim@dontcallme.com> Closes #3271 from jimfcarroll/parquet-logging and squashes the following commits: 37bdff7 [Jim Carroll] Fix Spark's control of Parquet logging.	2014-11-14 15:33:21 -08:00
Yash Datta	63ca3af66f	[SPARK-4365][SQL] Remove unnecessary filter call on records returned from parquet library Since parquet library has been updated , we no longer need to filter the records returned from parquet library for null records , as now the library skips those : from parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java public boolean nextKeyValue() throws IOException, InterruptedException { boolean recordFound = false; while (!recordFound) { // no more records left if (current >= total) { return false; } try { checkRead(); currentValue = recordReader.read(); current ++; if (recordReader.shouldSkipCurrentRecord()) { // this record is being filtered via the filter2 package if (DEBUG) LOG.debug("skipping record"); continue; } if (currentValue == null) { // only happens with FilteredRecordReader at end of block current = totalCountLoadedSoFar; if (DEBUG) LOG.debug("filtered record reader reached end of block"); continue; } recordFound = true; if (DEBUG) LOG.debug("read value: " + currentValue); } catch (RuntimeException e) { throw new ParquetDecodingException(format("Can not read value at %d in block %d in file %s", current, currentBlock, file), e); } } return true; } Author: Yash Datta <Yash.Datta@guavus.com> Closes #3229 from saucam/remove_filter and squashes the following commits: 8909ae9 [Yash Datta] SPARK-4365: Remove unnecessary filter call on records returned from parquet library	2014-11-14 15:16:40 -08:00
Jim Carroll	f76b968370	[SPARK-4386] Improve performance when writing Parquet files. If you profile the writing of a Parquet file, the single worst time consuming call inside of org.apache.spark.sql.parquet.MutableRowWriteSupport.write is actually in the scala.collection.AbstractSequence.size call. This is because the size call actually ends up COUNTING the elements in a scala.collection.LinearSeqOptimized.length ("optimized?"). This doesn't need to be done. "size" is called repeatedly where needed rather than called once at the top of the method and stored in a 'val'. Author: Jim Carroll <jim@dontcallme.com> Closes #3254 from jimfcarroll/parquet-perf and squashes the following commits: 30cc0b5 [Jim Carroll] Improve performance when writing Parquet files.	2014-11-14 15:11:53 -08:00
Cheng Lian	0c7b66bd44	[SPARK-4322][SQL] Enables struct fields as sub expressions of grouping fields While resolving struct fields, the resulted `GetField` expression is wrapped with an `Alias` to make it a named expression. Assume `a` is a struct instance with a field `b`, then `"a.b"` will be resolved as `Alias(GetField(a, "b"), "b")`. Thus, for this following SQL query: ```sql SELECT a.b + 1 FROM t GROUP BY a.b + 1 ``` the grouping expression is ```scala Add(GetField(a, "b"), Literal(1, IntegerType)) ``` while the aggregation expression is ```scala Add(Alias(GetField(a, "b"), "b"), Literal(1, IntegerType)) ``` This mismatch makes the above SQL query fail during the both analysis and execution phases. This PR fixes this issue by removing the alias when substituting aggregation expressions. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3248) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #3248 from liancheng/spark-4322 and squashes the following commits: 23a46ea [Cheng Lian] Code simplification dd20a79 [Cheng Lian] Should only trim aliases around `GetField`s 7f46532 [Cheng Lian] Enables struct fields as sub expressions of grouping fields	2014-11-14 15:09:36 -08:00
Michael Armbrust	4b4b50c9e5	[SQL] Don't shuffle code generated rows When sort based shuffle and code gen are on we were trying to ship the code generated rows during a shuffle. This doesn't work because the classes don't exist on the other side. Instead we now copy into a generic row before shipping. Author: Michael Armbrust <michael@databricks.com> Closes #3263 from marmbrus/aggCodeGen and squashes the following commits: f6ba8cf [Michael Armbrust] fix and test	2014-11-14 15:03:23 -08:00
Michael Armbrust	f805025e8e	[SQL] Minor cleanup of comments, errors and override. Author: Michael Armbrust <michael@databricks.com> Closes #3257 from marmbrus/minorCleanup and squashes the following commits: d8b5abc [Michael Armbrust] Use interpolation. 2fdf903 [Michael Armbrust] Better error message when coalesce can't be resolved. f9fa6cf [Michael Armbrust] Methods in a final class do not also need to be final, use override. 199fd98 [Michael Armbrust] Fix typo	2014-11-14 15:00:42 -08:00
Michael Armbrust	e47c387639	[SPARK-4391][SQL] Configure parquet filters using SQLConf This is more uniform with the rest of SQL configuration and allows it to be turned on and off without restarting the SparkContext. In this PR I also turn off filter pushdown by default due to a number of outstanding issues (in particular SPARK-4258). When those are fixed we should turn it back on by default. Author: Michael Armbrust <michael@databricks.com> Closes #3258 from marmbrus/parquetFilters and squashes the following commits: 5655bfe [Michael Armbrust] Remove extra line. 15e9a98 [Michael Armbrust] Enable filters for tests 75afd39 [Michael Armbrust] Fix comments 78fa02d [Michael Armbrust] off by default e7f9e16 [Michael Armbrust] First draft of correctly configuring parquet filter pushdown	2014-11-14 14:59:35 -08:00
Michael Armbrust	a0300ea32a	[SPARK-4390][SQL] Handle NaN cast to decimal correctly Author: Michael Armbrust <michael@databricks.com> Closes #3256 from marmbrus/NanDecimal and squashes the following commits: 4c3ba46 [Michael Armbrust] fix style d360f83 [Michael Armbrust] Handle NaN cast to decimal	2014-11-14 14:56:57 -08:00
DoingDone9	0cbdb01e1c	[SPARK-4333][SQL] Correctly log number of iterations in RuleExecutor When iterator of RuleExecutor breaks, the num of iterator should be (iteration - 1) not (iteration ).Because log looks like "Fixed point reached for batch ${batch.name} after 3 iterations.", but it did 2 iterations really! Author: DoingDone9 <799203320@qq.com> Closes #3180 from DoingDone9/issue_01 and squashes the following commits: 571e2ed [DoingDone9] Update RuleExecutor.scala 46514b6 [DoingDone9] When iterator of RuleExecutor breaks, the num of iterator should be iteration - 1 not iteration.	2014-11-14 14:28:06 -08:00
Sandy Ryza	f5f757e4ed	SPARK-4375. no longer require -Pscala-2.10 It seems like the winds might have moved away from this approach, but wanted to post the PR anyway because I got it working and to show what it would look like. Author: Sandy Ryza <sandy@cloudera.com> Closes #3239 from sryza/sandy-spark-4375 and squashes the following commits: 0ffbe95 [Sandy Ryza] Enable -Dscala-2.11 in sbt cd42d94 [Sandy Ryza] Update doc f6644c3 [Sandy Ryza] SPARK-4375 take 2	2014-11-14 14:21:57 -08:00
Takuya UESHIN	bbd8f5bee8	[SPARK-4245][SQL] Fix containsNull of the result ArrayType of CreateArray expression. The `containsNull` of the result `ArrayType` of `CreateArray` should be `true` only if the children is empty or there exists nullable child. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #3110 from ueshin/issues/SPARK-4245 and squashes the following commits: 6f64746 [Takuya UESHIN] Move equalsIgnoreNullability method into DataType. 5a90e02 [Takuya UESHIN] Refine InsertIntoHiveType and add some comments. cbecba8 [Takuya UESHIN] Fix a test title. 884ec37 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4245 3c5274b [Takuya UESHIN] Add tests to insert data of types ArrayType / MapType / StructType with nullability is false into Hive table. 41a94a9 [Takuya UESHIN] Replace InsertIntoTable with InsertIntoHiveTable if data types ignoring nullability are same. 43e6ef5 [Takuya UESHIN] Fix containsNull for empty array. 778e997 [Takuya UESHIN] Fix containsNull of the result ArrayType of CreateArray expression.	2014-11-14 14:21:16 -08:00
Daoyuan Wang	ade72c4362	[SPARK-4239] [SQL] support view in HiveQl Currently still not support view like CREATE VIEW view3(valoo) TBLPROPERTIES ("fear" = "factor") AS SELECT upper(value) FROM src WHERE key=86; because the text in metastore for this view is like select \`_c0\` as \`valoo\` from (select upper(\`src\`.\`value\`) from \`default\`.\`src\` where ...) \`view3\` while catalyst cannot resolve \`_c0\` for this query. For view without colname definition in parentheses, it works fine. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #3131 from adrian-wang/view and squashes the following commits: 8a56fd6 [Daoyuan Wang] michael's comments e46c056 [Daoyuan Wang] add some golden file 079290a [Daoyuan Wang] remove useless import 88afcad [Daoyuan Wang] support view in HiveQl	2014-11-14 13:51:20 -08:00
Michael Armbrust	77e845ca77	[SPARK-4394][SQL] Data Sources API Improvements This PR adds two features to the data sources API: - Support for pushing down `IN` filters - The ability for relations to optionally provide information about their `sizeInBytes`. Author: Michael Armbrust <michael@databricks.com> Closes #3260 from marmbrus/sourcesImprovements and squashes the following commits: 9a5e171 [Michael Armbrust] Use method instead of configuration directly 99c0e6b [Michael Armbrust] Add support for sizeInBytes. 416f167 [Michael Armbrust] Support for IN in data sources API. 2a04ab3 [Michael Armbrust] Simplify implementation of InSet.	2014-11-14 12:00:08 -08:00
Prashant Sharma	daaca14c16	Support cross building for Scala 2.11 Let's give this another go using a version of Hive that shades its JLine dependency. Author: Prashant Sharma <prashant.s@imaginea.com> Author: Patrick Wendell <pwendell@gmail.com> Closes #3159 from pwendell/scala-2.11-prashant and squashes the following commits: e93aa3e [Patrick Wendell] Restoring -Phive-thriftserver profile and cleaning up build script. f65d17d [Patrick Wendell] Fixing build issue due to merge conflict a8c41eb [Patrick Wendell] Reverting dev/run-tests back to master state. 7a6eb18 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into scala-2.11-prashant 583aa07 [Prashant Sharma] REVERT ME: removed hive thirftserver 3680e58 [Prashant Sharma] Revert "REVERT ME: Temporarily removing some Cli tests." 935fb47 [Prashant Sharma] Revert "Fixed by disabling a few tests temporarily." 925e90f [Prashant Sharma] Fixed by disabling a few tests temporarily. 2fffed3 [Prashant Sharma] Exclude groovy from sbt build, and also provide a way for such instances in future. 8bd4e40 [Prashant Sharma] Switched to gmaven plus, it fixes random failures observer with its predecessor gmaven. 5272ce5 [Prashant Sharma] SPARK_SCALA_VERSION related bugs. 2121071 [Patrick Wendell] Migrating version detection to PySpark b1ed44d [Patrick Wendell] REVERT ME: Temporarily removing some Cli tests. 1743a73 [Patrick Wendell] Removing decimal test that doesn't work with Scala 2.11 f5cad4e [Patrick Wendell] Add Scala 2.11 docs 210d7e1 [Patrick Wendell] Revert "Testing new Hive version with shaded jline" 48518ce [Patrick Wendell] Remove association of Hive and Thriftserver profiles. e9d0a06 [Patrick Wendell] Revert "Enable thritfserver for Scala 2.10 only" 67ec364 [Patrick Wendell] Guard building of thriftserver around Scala 2.10 check 8502c23 [Patrick Wendell] Enable thritfserver for Scala 2.10 only e22b104 [Patrick Wendell] Small fix in pom file ec402ab [Patrick Wendell] Various fixes 0be5a9d [Patrick Wendell] Testing new Hive version with shaded jline 4eaec65 [Prashant Sharma] Changed scripts to ignore target. 5167bea [Prashant Sharma] small correction a4fcac6 [Prashant Sharma] Run against scala 2.11 on jenkins. 80285f4 [Prashant Sharma] MAven equivalent of setting spark.executor.extraClasspath during tests. 034b369 [Prashant Sharma] Setting test jars on executor classpath during tests from sbt. d4874cb [Prashant Sharma] Fixed Python Runner suite. null check should be first case in scala 2.11. 6f50f13 [Prashant Sharma] Fixed build after rebasing with master. We should use ${scala.binary.version} instead of just 2.10 e56ca9d [Prashant Sharma] Print an error if build for 2.10 and 2.11 is spotted. 937c0b8 [Prashant Sharma] SCALA_VERSION -> SPARK_SCALA_VERSION cb059b0 [Prashant Sharma] Code review 0476e5e [Prashant Sharma] Scala 2.11 support with repl and all build changes.	2014-11-11 21:36:48 -08:00
Cheng Hao	c764d0ac1c	[SPARK-4274] [SQL] Fix NPE in printing the details of the query plan Author: Cheng Hao <hao.cheng@intel.com> Closes #3139 from chenghao-intel/comparison_test and squashes the following commits: f5d7146 [Cheng Hao] avoid exception in printing the codegen enabled	2014-11-10 17:46:05 -08:00
Daoyuan Wang	a1fc059b69	[SPARK-4149][SQL] ISO 8601 support for json date time strings This implement the feature davies mentioned in https://github.com/apache/spark/pull/2901#discussion-diff-19313312 Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #3012 from adrian-wang/iso8601 and squashes the following commits: 50df6e7 [Daoyuan Wang] json data timestamp ISO8601 support	2014-11-10 17:26:03 -08:00
Cheng Hao	fa777833b5	[SPARK-4250] [SQL] Fix bug of constant null value mapping to ConstantObjectInspector Author: Cheng Hao <hao.cheng@intel.com> Closes #3114 from chenghao-intel/constant_null_oi and squashes the following commits: e603bda [Cheng Hao] fix the bug of null value for primitive types 50a13ba [Cheng Hao] fix the timezone issue f54f369 [Cheng Hao] fix bug of constant null value for ObjectInspector	2014-11-10 17:22:57 -08:00
Xiangrui Meng	d793d80c80	[SQL] remove a decimal case branch that has no effect at runtime it generates warnings at compile time marmbrus Author: Xiangrui Meng <meng@databricks.com> Closes #3192 from mengxr/dtc-decimal and squashes the following commits: 955e9fb [Xiangrui Meng] remove a decimal case branch that has no effect	2014-11-10 17:20:52 -08:00
Cheng Lian	acb55aeddb	[SPARK-4308][SQL] Sets SQL operation state to ERROR when exception is thrown In `HiveThriftServer2`, when an exception is thrown during a SQL execution, the SQL operation state should be set to `ERROR`, but now it remains `RUNNING`. This affects the result of the `GetOperationStatus` Thrift API. Author: Cheng Lian <lian@databricks.com> Closes #3175 from liancheng/fix-op-state and squashes the following commits: 6d4c1fe [Cheng Lian] Sets SQL operation state to ERROR when exception is thrown	2014-11-10 16:56:36 -08:00
Takuya UESHIN	dbf10588de	[SPARK-4319][SQL] Enable an ignored test "null count". Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #3185 from ueshin/issues/SPARK-4319 and squashes the following commits: a44a38e [Takuya UESHIN] Enable an ignored test "null count".	2014-11-10 15:55:15 -08:00
Xiangrui Meng	894a7245c3	[SQL] support udt to hive types conversion (hive->udt is not supported) marmbrus Author: Xiangrui Meng <meng@databricks.com> Closes #3164 from mengxr/hive-udt and squashes the following commits: 57c7519 [Xiangrui Meng] support udt->hive types (hive->udt is not supported)	2014-11-10 11:04:12 -08:00
Sean Owen	f8e5732307	SPARK-1209 [CORE] (Take 2) SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop andrewor14 Another try at SPARK-1209, to address https://github.com/apache/spark/pull/2814#issuecomment-61197619 I successfully tested with `mvn -Dhadoop.version=1.0.4 -DskipTests clean package; mvn -Dhadoop.version=1.0.4 test` I assume that is what failed Jenkins last time. I also tried `-Dhadoop.version1.2.1` and `-Phadoop-2.4 -Pyarn -Phive` for more coverage. So this is why the class was put in `org.apache.hadoop` to begin with, I assume. One option is to leave this as-is for now and move it only when Hadoop 1.0.x support goes away. This is the other option, which adds a call to force the constructor to be public at run-time. It's probably less surprising than putting Spark code in `org.apache.hadoop`, but, does involve reflection. A `SecurityManager` might forbid this, but it would forbid a lot of stuff Spark does. This would also only affect Hadoop 1.0.x it seems. Author: Sean Owen <sowen@cloudera.com> Closes #3048 from srowen/SPARK-1209 and squashes the following commits: 0d48f4b [Sean Owen] For Hadoop 1.0.x, make certain constructors public, which were public in later versions 466e179 [Sean Owen] Disable MIMA warnings resulting from moving the class -- this was also part of the PairRDDFunctions type hierarchy though? eb61820 [Sean Owen] Move SparkHadoopMapRedUtil / SparkHadoopMapReduceUtil from org.apache.hadoop to org.apache.spark	2014-11-09 22:11:20 -08:00
wangfei	d6e5552443	[SPARK-4292][SQL] Result set iterator bug in JDBC/ODBC select * from src, get the wrong result set as follows: ``` ... \| 309 \| val_309 \| \| 309 \| val_309 \| \| 309 \| val_309 \| \| 309 \| val_309 \| \| 309 \| val_309 \| \| 309 \| val_309 \| \| 309 \| val_309 \| \| 309 \| val_309 \| \| 309 \| val_309 \| \| 309 \| val_309 \| \| 97 \| val_97 \| \| 97 \| val_97 \| \| 97 \| val_97 \| \| 97 \| val_97 \| \| 97 \| val_97 \| \| 97 \| val_97 \| \| 97 \| val_97 \| \| 97 \| val_97 \| \| 97 \| val_97 \| \| 97 \| val_97 \| \| 97 \| val_97 \| ... ``` Author: wangfei <wangfei1@huawei.com> Closes #3149 from scwf/SPARK-4292 and squashes the following commits: 1574a43 [wangfei] using result.collect 8b2d845 [wangfei] adding test f64eddf [wangfei] result set iter bug	2014-11-07 12:55:11 -08:00
Matthew Taylor	ac70c972a5	[SPARK-4203][SQL] Partition directories in random order when inserting into hive table When doing an insert into hive table with partitions the folders written to the file system are in a random order instead of the order defined in table creation. Seems that the loadPartition method in Hive.java has a Map<String,String> parameter but expects to be called with a map that has a defined ordering such as LinkedHashMap. Working on a test but having intillij problems Author: Matthew Taylor <matthew.t@tbfe.net> Closes #3076 from tbfenet/partition_dir_order_problem and squashes the following commits: f1b9a52 [Matthew Taylor] Comment format fix bca709f [Matthew Taylor] review changes 0e50f6b [Matthew Taylor] test fix 99f1a31 [Matthew Taylor] partition ordering fix 369e618 [Matthew Taylor] partition ordering fix	2014-11-07 12:53:08 -08:00
Takuya UESHIN	a6405c5ddc	[SPARK-4270][SQL] Fix Cast from DateType to DecimalType. `Cast` from `DateType` to `DecimalType` throws `NullPointerException`. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #3134 from ueshin/issues/SPARK-4270 and squashes the following commits: 7394e4b [Takuya UESHIN] Fix Cast from DateType to DecimalType.	2014-11-07 12:30:47 -08:00
Cheng Hao	60ab80f501	[SPARK-4272] [SQL] Add more unwrapper functions for primitive type in TableReader Currently, the data "unwrap" only support couple of primitive types, not all, it will not cause exception, but may get some performance in table scanning for the type like binary, date, timestamp, decimal etc. Author: Cheng Hao <hao.cheng@intel.com> Closes #3136 from chenghao-intel/table_reader and squashes the following commits: fffb729 [Cheng Hao] fix bug for retrieving the timestamp object e9c97a4 [Cheng Hao] Add more unwrapper functions for primitive type in TableReader	2014-11-07 12:15:53 -08:00
Kousuke Saruta	14c54f1876	[SPARK-4213][SQL] ParquetFilters - No support for LT, LTE, GT, GTE operators Following description is quoted from JIRA: When I issue a hql query against a HiveContext where my predicate uses a column of string type with one of LT, LTE, GT, or GTE operator, I get the following error: scala.MatchError: StringType (of class org.apache.spark.sql.catalyst.types.StringType$) Looking at the code in org.apache.spark.sql.parquet.ParquetFilters, StringType is absent from the corresponding functions for creating these filters. To reproduce, in a Hive 0.13.1 shell, I created the following table (at a specified DB): create table sparkbug ( id int, event string ) stored as parquet; Insert some sample data: insert into table sparkbug select 1, '2011-06-18' from <some table> limit 1; insert into table sparkbug select 2, '2012-01-01' from <some table> limit 1; Launch a spark shell and create a HiveContext to the metastore where the table above is located. import org.apache.spark.sql._ import org.apache.spark.sql.SQLContext import org.apache.spark.sql.hive.HiveContext val hc = new HiveContext(sc) hc.setConf("spark.sql.shuffle.partitions", "10") hc.setConf("spark.sql.hive.convertMetastoreParquet", "true") hc.setConf("spark.sql.parquet.compression.codec", "snappy") import hc._ hc.hql("select * from <db>.sparkbug where event >= '2011-12-01'") A scala.MatchError will appear in the output. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #3083 from sarutak/SPARK-4213 and squashes the following commits: 4ab6e56 [Kousuke Saruta] WIP b6890c6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4213 9a1fae7 [Kousuke Saruta] Fixed ParquetFilters so that compare Strings	2014-11-07 11:56:40 -08:00
Jacky Li	68609c51ad	[SQL] Modify keyword val location according to ordering 'DOUBLE' should be moved before 'ELSE' according to the ordering convension Author: Jacky Li <jacky.likun@gmail.com> Closes #3080 from jackylk/patch-5 and squashes the following commits: 3c11df7 [Jacky Li] [SQL] Modify keyword val location according to ordering	2014-11-07 11:52:08 -08:00
Michael Armbrust	8154ed7df6	[SQL] Support ScalaReflection of schema in different universes Author: Michael Armbrust <michael@databricks.com> Closes #3096 from marmbrus/reflectionContext and squashes the following commits: adc221f [Michael Armbrust] Support ScalaReflection of schema in different universes	2014-11-07 11:51:20 -08:00
Cheng Lian	86e9eaa3f0	[SPARK-4225][SQL] Resorts to SparkContext.version to inspect Spark version This PR resorts to `SparkContext.version` rather than META-INF/MANIFEST.MF in the assembly jar to inspect Spark version. Currently, when built with Maven, the MANIFEST.MF file in the assembly jar is incorrectly replaced by Guava 15.0 MANIFEST.MF, probably because of the assembly/shading tricks. Another related PR is #3103, which tries to fix the MANIFEST issue. Author: Cheng Lian <lian@databricks.com> Closes #3105 from liancheng/spark-4225 and squashes the following commits: d9585e1 [Cheng Lian] Resorts to SparkContext.version to inspect Spark version	2014-11-07 11:45:25 -08:00
Xiangrui Meng	3d2b5bc5bb	[SPARK-4262][SQL] add .schemaRDD to JavaSchemaRDD marmbrus Author: Xiangrui Meng <meng@databricks.com> Closes #3125 from mengxr/SPARK-4262 and squashes the following commits: 307695e [Xiangrui Meng] add .schemaRDD to JavaSchemaRDD	2014-11-05 19:56:16 -08:00
Michael Armbrust	515abb9afa	[SQL] Add String option for DSL AS Author: Michael Armbrust <michael@databricks.com> Closes #3097 from marmbrus/asString and squashes the following commits: 6430520 [Michael Armbrust] Add String option for DSL AS	2014-11-04 18:14:28 -08:00
Davies Liu	e4f42631a6	[SPARK-3886] [PySpark] simplify serializer, use AutoBatchedSerializer by default. This PR simplify serializer, always use batched serializer (AutoBatchedSerializer as default), even batch size is 1. Author: Davies Liu <davies@databricks.com> This patch had conflicts when merged, resolved by Committer: Josh Rosen <joshrosen@databricks.com> Closes #2920 from davies/fix_autobatch and squashes the following commits: e544ef9 [Davies Liu] revert unrelated change 6880b14 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch 1d557fc [Davies Liu] fix tests 8180907 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch 76abdce [Davies Liu] clean up 53fa60b [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch d7ac751 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch 2cc2497 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch b4292ce [Davies Liu] fix bug in master d79744c [Davies Liu] recover hive tests be37ece [Davies Liu] refactor eb3938d [Davies Liu] refactor serializer in scala 8d77ef2 [Davies Liu] simplify serializer, use AutoBatchedSerializer by default.	2014-11-03 23:56:14 -08:00
Xiangrui Meng	04450d1154	[SPARK-4192][SQL] Internal API for Python UDT Following #2919, this PR adds Python UDT (for internal use only) with tests under "pyspark.tests". Before `SQLContext.applySchema`, we check whether we need to convert user-type instances into SQL recognizable data. In the current implementation, a Python UDT must be paired with a Scala UDT for serialization on the JVM side. A following PR will add VectorUDT in MLlib for both Scala and Python. marmbrus jkbradley davies Author: Xiangrui Meng <meng@databricks.com> Closes #3068 from mengxr/SPARK-4192-sql and squashes the following commits: acff637 [Xiangrui Meng] merge master dba5ea7 [Xiangrui Meng] only use pyClass for Python UDT output sqlType as well 2c9d7e4 [Xiangrui Meng] move import to global setup; update needsConversion 7c4a6a9 [Xiangrui Meng] address comments 75223db [Xiangrui Meng] minor update f740379 [Xiangrui Meng] remove UDT from default imports e98d9d0 [Xiangrui Meng] fix py style 4e84fce [Xiangrui Meng] remove local hive tests and add more tests 39f19e0 [Xiangrui Meng] add tests b7f666d [Xiangrui Meng] add Python UDT	2014-11-03 19:29:11 -08:00
Michael Armbrust	15b58a2234	[SQL] Convert arguments to Scala UDFs Author: Michael Armbrust <michael@databricks.com> Closes #3077 from marmbrus/udfsWithUdts and squashes the following commits: 34b5f27 [Michael Armbrust] style 504adef [Michael Armbrust] Convert arguments to Scala UDFs	2014-11-03 18:04:51 -08:00
Michael Armbrust	25bef7e695	[SQL] More aggressive defaults - Turns on compression for in-memory cached data by default - Changes the default parquet compression format back to gzip (we have seen more OOMs with production workloads due to the way Snappy allocates memory) - Ups the batch size to 10,000 rows - Increases the broadcast threshold to 10mb. - Uses our parquet implementation instead of the hive one by default. - Cache parquet metadata by default. Author: Michael Armbrust <michael@databricks.com> Closes #3064 from marmbrus/fasterDefaults and squashes the following commits: 97ee9f8 [Michael Armbrust] parquet codec docs e641694 [Michael Armbrust] Remote also a12866a [Michael Armbrust] Cache metadata. 2d73acc [Michael Armbrust] Update docs defaults. d63d2d5 [Michael Armbrust] document parquet option da373f9 [Michael Armbrust] More aggressive defaults	2014-11-03 14:08:27 -08:00
Cheng Hao	e83f13e8d3	[SPARK-4152] [SQL] Avoid data change in CTAS while table already existed CREATE TABLE t1 (a String); CREATE TABLE t1 AS SELECT key FROM src; – throw exception CREATE TABLE if not exists t1 AS SELECT key FROM src; – expect do nothing, currently it will overwrite the t1, which is incorrect. Author: Cheng Hao <hao.cheng@intel.com> Closes #3013 from chenghao-intel/ctas_unittest and squashes the following commits: 194113e [Cheng Hao] fix bug in CTAS when table already existed	2014-11-03 13:59:43 -08:00
Cheng Lian	c238fb423d	[SPARK-4202][SQL] Simple DSL support for Scala UDF This feature is based on an offline discussion with mengxr, hopefully can be useful for the new MLlib pipeline API. For the following test snippet ```scala case class KeyValue(key: Int, value: String) val testData = sc.parallelize(1 to 10).map(i => KeyValue(i, i.toString)).toSchemaRDD def foo(a: Int, b: String) => a.toString + b ``` the newly introduced DSL enables the following syntax ```scala import org.apache.spark.sql.catalyst.dsl._ testData.select(Star(None), foo.call('key, 'value) as 'result) ``` which is equivalent to ```scala testData.registerTempTable("testData") sqlContext.registerFunction("foo", foo) sql("SELECT *, foo(key, value) AS result FROM testData") ``` Author: Cheng Lian <lian@databricks.com> Closes #3067 from liancheng/udf-dsl and squashes the following commits: f132818 [Cheng Lian] Adds DSL support for Scala UDF	2014-11-03 13:20:33 -08:00
Davies Liu	24544fbce0	[SPARK-3594] [PySpark] [SQL] take more rows to infer schema or sampling This patch will try to infer schema for RDD which has empty value (None, [], {}) in the first row. It will try first 100 rows and merge the types into schema, also merge fields of StructType together. If there is still NullType in schema, then it will show an warning, tell user to try with sampling. If sampling is presented, it will infer schema from all the rows after sampling. Also, add samplingRatio for jsonFile() and jsonRDD() Author: Davies Liu <davies.liu@gmail.com> Author: Davies Liu <davies@databricks.com> Closes #2716 from davies/infer and squashes the following commits: e678f6d [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer 34b5c63 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer 567dc60 [Davies Liu] update docs 9767b27 [Davies Liu] Merge branch 'master' into infer e48d7fb [Davies Liu] fix tests 29e94d5 [Davies Liu] let NullType inherit from PrimitiveType ee5d524 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer 540d1d5 [Davies Liu] merge fields for StructType f93fd84 [Davies Liu] add more tests 3603e00 [Davies Liu] take more rows to infer schema, or infer the schema by sampling the RDD	2014-11-03 13:17:09 -08:00
ravipesala	2b6e1ce6ee	[SPARK-4207][SQL] Query which has syntax like 'not like' is not working in Spark SQL Queries which has 'not like' is not working spark sql. sql("SELECT * FROM records where value not like 'val%'") same query works in Spark HiveQL Author: ravipesala <ravindra.pesala@huawei.com> Closes #3075 from ravipesala/SPARK-4207 and squashes the following commits: 35c11e7 [ravipesala] Supported 'not like' syntax in sql	2014-11-03 13:07:41 -08:00
Joseph K. Bradley	ebd6480587	[SPARK-3572] [SQL] Internal API for User-Defined Types This PR adds User-Defined Types (UDTs) to SQL. It is a precursor to using SchemaRDD as a Dataset for the new MLlib API. Currently, the UDT API is private since there is incomplete support (e.g., no Java or Python support yet). Author: Joseph K. Bradley <joseph@databricks.com> Author: Michael Armbrust <michael@databricks.com> Author: Xiangrui Meng <meng@databricks.com> Closes #3063 from marmbrus/udts and squashes the following commits: 7ccfc0d [Michael Armbrust] remove println 46a3aee [Michael Armbrust] Slightly easier to read test output. 6cc434d [Michael Armbrust] Recursively convert rows. e369b91 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into udts 15c10a6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into sql-udt2 f3c72fe [Joseph K. Bradley] Fixing merge e13cd8a [Joseph K. Bradley] Removed Vector UDTs 5817b2b [Joseph K. Bradley] style edits 30ce5b2 [Joseph K. Bradley] updates based on code review d063380 [Joseph K. Bradley] Cleaned up Java UDT Suite, and added warning about element ordering when creating schema from Java Bean a571bb6 [Joseph K. Bradley] Removed old UDT code (registry and Java UDTs). Cleaned up other code. Extended JavaUserDefinedTypeSuite 6fddc1c [Joseph K. Bradley] Made MyLabeledPoint into a Java Bean 20630bc [Joseph K. Bradley] fixed scalastyle fa86b20 [Joseph K. Bradley] Removed Java UserDefinedType, and made UDTs private[spark] for now 8de957c [Joseph K. Bradley] Modified UserDefinedType to store Java class of user type so that registerUDT takes only the udt argument. 8b242ea [Joseph K. Bradley] Fixed merge error after last merge. Note: Last merge commit also removed SQL UDT examples from mllib. 7f29656 [Joseph K. Bradley] Moved udt case to top of all matches. Small cleanups b028675 [Xiangrui Meng] allow any type in UDT 4500d8a [Xiangrui Meng] update example code 87264a5 [Xiangrui Meng] remove debug code 3143ac3 [Xiangrui Meng] remove unnecessary changes cfbc321 [Xiangrui Meng] support UDT in parquet db16139 [Joseph K. Bradley] Added more doc for UserDefinedType. Removed unused code in Suite 759af7a [Joseph K. Bradley] Added more doc to UserDefineType 63626a4 [Joseph K. Bradley] Updated ScalaReflectionsSuite per @marmbrus suggestions 51e5282 [Joseph K. Bradley] fixed 1 test f025035 [Joseph K. Bradley] Cleanups before PR. Added new tests 85872f6 [Michael Armbrust] Allow schema calculation to be lazy, but ensure its available on executors. dff99d6 [Joseph K. Bradley] Added UDTs for Vectors in MLlib, plus DatasetExample using the UDTs cd60cb4 [Joseph K. Bradley] Trying to get other SQL tests to run 34a5831 [Joseph K. Bradley] Added MLlib dependency on SQL. e1f7b9c [Joseph K. Bradley] blah 2f40c02 [Joseph K. Bradley] renamed UDT types 3579035 [Joseph K. Bradley] udt annotation now working b226b9e [Joseph K. Bradley] Changing UDT to annotation fea04af [Joseph K. Bradley] more cleanups 964b32e [Joseph K. Bradley] some cleanups 893ee4c [Joseph K. Bradley] udt finallly working 50f9726 [Joseph K. Bradley] udts 04303c9 [Joseph K. Bradley] udts 39f8707 [Joseph K. Bradley] removed old udt suite 273ac96 [Joseph K. Bradley] basic UDT is working, but deserialization has yet to be done 8bebf24 [Joseph K. Bradley] commented out convertRowToScala for debugging 53de70f [Joseph K. Bradley] more udts... 982c035 [Joseph K. Bradley] still working on UDTs 19b2f60 [Joseph K. Bradley] still working on UDTs 0eaeb81 [Joseph K. Bradley] Still working on UDTs 105c5a3 [Joseph K. Bradley] Adding UserDefinedType to SQL, not done yet.	2014-11-02 17:56:00 -08:00
Cheng Lian	9081b9f9f7	[SPARK-2189][SQL] Adds dropTempTable API This PR adds an API for unregistering temporary tables. If a temporary table has been cached before, it's unpersisted as well. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #3039 from liancheng/unregister-temp-table and squashes the following commits: 54ae99f [Cheng Lian] Fixes Scala styling issue 1948c14 [Cheng Lian] Removes the unpersist argument aca41d3 [Cheng Lian] Ensures thread safety 7d4fb2b [Cheng Lian] Adds unregisterTempTable API	2014-11-02 16:00:24 -08:00
Yin Huai	06232d23ff	[SPARK-4185][SQL] JSON schema inference failed when dealing with type conflicts in arrays JIRA: https://issues.apache.org/jira/browse/SPARK-4185. This PR also has the fix of #3052. Author: Yin Huai <huai@cse.ohio-state.edu> Closes #3056 from yhuai/SPARK-4185 and squashes the following commits: ed3a5a8 [Yin Huai] Correctly handle type conflicts between structs and primitive types in an array.	2014-11-02 15:46:56 -08:00
wangfei	e749f5dedb	[SPARK-4191][SQL]move wrapperFor to HiveInspectors to reuse it Move wrapperFor in InsertIntoHiveTable to HiveInspectors to reuse them, this method can be reused when writing date with ObjectInspector(such as orc support) Author: wangfei <wangfei1@huawei.com> Author: scwf <wangfei1@huawei.com> Closes #3057 from scwf/reuse-wraperfor and squashes the following commits: 7ccf932 [scwf] fix conflicts d44f4da [wangfei] fix imports 9bf1b50 [wangfei] revert no related change 9a5276a [wangfei] move wrapfor to hiveinspector to reuse them	2014-11-02 15:45:55 -08:00
Cheng Lian	c9f840046f	[SPARK-3791][SQL] Provides Spark version and Hive version in HiveThriftServer2 This PR overrides the `GetInfo` Hive Thrift API to provide correct version information. Another property `spark.sql.hive.version` is added to reveal the underlying Hive version. These are generally useful for Spark SQL ODBC driver providers. The Spark version information is extracted from the jar manifest. Also took the chance to remove the `SET -v` hack, which was a workaround for Simba ODBC driver connectivity. TODO - [x] Find a general way to figure out Hive (or even any dependency) version. This [blog post](http://blog.soebes.de/blog/2014/01/02/version-information-into-your-appas-with-maven/) suggests several methods to inspect application version. In the case of Spark, this can be tricky because the chosen method: 1. must applies to both Maven build and SBT build For Maven builds, we can retrieve the version information from the META-INF/maven directory within the assembly jar. But this doesn't work for SBT builds. 2. must not rely on the original jars of dependencies to extract specific dependency version, because Spark uses assembly jar. This implies we can't read Hive version from Hive jar files since standard Spark distribution doesn't include them. 3. should play well with `SPARK_PREPEND_CLASSES` to ease local testing during development. `SPARK_PREPEND_CLASSES` prevents classes to be loaded from the assembly jar, thus we can't locate the jar file and read its manifest. Given these, maybe the only reliable method is to generate a source file containing version information at build time. pwendell Do you have any suggestions from the perspective of the build process? Update Hive version is now retrieved from the newly introduced `HiveShim` object. Author: Cheng Lian <lian.cs.zju@gmail.com> Author: Cheng Lian <lian@databricks.com> Closes #2843 from liancheng/get-info and squashes the following commits: a873d0f [Cheng Lian] Updates test case 53f43cd [Cheng Lian] Retrieves underlying Hive verson via HiveShim 1d282b8 [Cheng Lian] Removes the Simba ODBC "SET -v" hack f857fce [Cheng Lian] Overrides Hive GetInfo Thrift API and adds Hive version property	2014-11-02 15:18:29 -08:00
Cheng Lian	495a132031	[SQL] Fixes race condition in CliSuite `CliSuite` has been flaky for a while, this PR tries to improve this situation by fixing a race condition in `CliSuite`. The `captureOutput` function is used to capture both stdout and stderr output of the forked external process in two background threads and search for expected strings, but wasn't been properly synchronized before. Author: Cheng Lian <lian@databricks.com> Closes #3060 from liancheng/fix-cli-suite and squashes the following commits: a70569c [Cheng Lian] Fixes race condition in CliSuite	2014-11-02 15:15:52 -08:00
Cheng Lian	e4b80894bd	[SPARK-4182][SQL] Fixes ColumnStats classes for boolean, binary and complex data types `NoopColumnStats` was once used for binary, boolean and complex data types. This `ColumnStats` doesn't return properly shaped column statistics and causes caching failure if a table contains columns of the aforementioned types. This PR adds `BooleanColumnStats`, `BinaryColumnStats` and `GenericColumnStats`, used for boolean, binary and all complex data types respectively. In addition, `NoopColumnStats` returns properly shaped column statistics containing null count and row count, but this class is now used for testing purpose only. Author: Cheng Lian <lian@databricks.com> Closes #3059 from liancheng/spark-4182 and squashes the following commits: b398cfd [Cheng Lian] Fixes failed test case fb3ee85 [Cheng Lian] Fixes SPARK-4182	2014-11-02 15:14:44 -08:00
Michael Armbrust	9c0eb57c73	[SPARK-3247][SQL] An API for adding data sources to Spark SQL This PR introduces a new set of APIs to Spark SQL to allow other developers to add support for reading data from new sources in `org.apache.spark.sql.sources`. New sources must implement the interface `BaseRelation`, which is responsible for describing the schema of the data. BaseRelations have three `Scan` subclasses, which are responsible for producing an RDD containing row objects. The [various Scan interfaces](https://github.com/marmbrus/spark/blob/foreign/sql/core/src/main/scala/org/apache/spark/sql/sources/package.scala#L50) allow for optimizations such as column pruning and filter push down, when the underlying data source can handle these operations. By implementing a class that inherits from RelationProvider these data sources can be accessed using using pure SQL. I've used the functionality to update the JSON support so it can now be used in this way as follows: ```sql CREATE TEMPORARY TABLE jsonTableSQL USING org.apache.spark.sql.json OPTIONS ( path '/home/michael/data.json' ) ``` Further example usage can be found in the test cases: https://github.com/marmbrus/spark/tree/foreign/sql/core/src/test/scala/org/apache/spark/sql/sources There is also a library that uses this new API to read avro data available here: https://github.com/marmbrus/sql-avro Author: Michael Armbrust <michael@databricks.com> Closes #2475 from marmbrus/foreign and squashes the following commits: 1ed6010 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into foreign ab2c31f [Michael Armbrust] fix test 1d41bb5 [Michael Armbrust] unify argument names 5b47901 [Michael Armbrust] Remove sealed, more filter types fab154a [Michael Armbrust] Merge remote-tracking branch 'origin/master' into foreign e3e690e [Michael Armbrust] Add hook for extraStrategies a70d602 [Michael Armbrust] Fix style, more tests, FilteredSuite => PrunedFilteredSuite 70da6d9 [Michael Armbrust] Modify API to ease binary compatibility and interop with Java 7d948ae [Michael Armbrust] Fix equality of AttributeReference. 5545491 [Michael Armbrust] Address comments 5031ac3 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into foreign 22963ef [Michael Armbrust] package objects compile wierdly... b069146 [Michael Armbrust] traits => abstract classes 34f836a [Michael Armbrust] Make @DeveloperApi 0d74bcf [Michael Armbrust] Add documention on object life cycle 3e06776 [Michael Armbrust] remove line wraps de3b68c [Michael Armbrust] Remove empty file 360cb30 [Michael Armbrust] style and java api 2957875 [Michael Armbrust] add override 0fd3a07 [Michael Armbrust] Draft of data sources API	2014-11-02 15:08:35 -08:00
wangfei	f0a4b630ab	[HOTFIX][SQL] hive test missing some golden files cc marmbrus Author: wangfei <wangfei1@huawei.com> Closes #3055 from scwf/hotfix and squashes the following commits: d881bd7 [wangfei] miss golden files	2014-11-02 14:59:41 -08:00
Matei Zaharia	23f966f475	[SPARK-3930] [SPARK-3933] Support fixed-precision decimal in SQL, and some optimizations - Adds optional precision and scale to Spark SQL's decimal type, which behave similarly to those in Hive 13 (https://cwiki.apache.org/confluence/download/attachments/27362075/Hive_Decimal_Precision_Scale_Support.pdf) - Replaces our internal representation of decimals with a Decimal class that can store small values in a mutable Long, saving memory in this situation and letting some operations happen directly on Longs This is still marked WIP because there are a few TODOs, but I'll remove that tag when done. Author: Matei Zaharia <matei@databricks.com> Closes #2983 from mateiz/decimal-1 and squashes the following commits: 35e6b02 [Matei Zaharia] Fix issues after merge 227f24a [Matei Zaharia] Review comments 31f915e [Matei Zaharia] Implement Davies's suggestions in Python eb84820 [Matei Zaharia] Support reading/writing decimals as fixed-length binary in Parquet 4dc6bae [Matei Zaharia] Fix decimal support in PySpark d1d9d68 [Matei Zaharia] Fix compile error and test issues after rebase b28933d [Matei Zaharia] Support decimal precision/scale in Hive metastore 2118c0d [Matei Zaharia] Some test and bug fixes 81db9cb [Matei Zaharia] Added mutable Decimal that will be more efficient for small precisions 7af0c3b [Matei Zaharia] Add optional precision and scale to DecimalType, but use Unlimited for now ec0a947 [Matei Zaharia] Make the result of AVG on Decimals be Decimal, not Double	2014-11-01 19:29:14 -07:00
Cheng Lian	ad0fde10b2	[SPARK-4037][SQL] Removes the SessionState instance created in HiveThriftServer2 `HiveThriftServer2` creates a global singleton `SessionState` instance and overrides `HiveContext` to inject the `SessionState` object. This messes up `SessionState` initialization and causes problems. This PR replaces the global `SessionState` with `HiveContext.sessionState` to avoid the initialization conflict. Also `HiveContext` reuses existing started `SessionState` if any (this is required by `SparkSQLCLIDriver`, which uses specialized `CliSessionState`). Author: Cheng Lian <lian@databricks.com> Closes #2887 from liancheng/spark-4037 and squashes the following commits: 8446675 [Cheng Lian] Removes redundant Driver initialization a28fef5 [Cheng Lian] Avoid starting HiveContext.sessionState multiple times 49b1c5b [Cheng Lian] Reuses existing started SessionState if any 3cd6fab [Cheng Lian] Fixes SPARK-4037	2014-11-01 15:03:11 -07:00
Xiangrui Meng	1d4f355203	[SPARK-3569][SQL] Add metadata field to StructField Add `metadata: Metadata` to `StructField` to store extra information of columns. `Metadata` is a simple wrapper over `Map[String, Any]` with value types restricted to Boolean, Long, Double, String, Metadata, and arrays of those types. SerDe is via JSON. Metadata is preserved through simple operations like `SELECT`. marmbrus liancheng Author: Xiangrui Meng <meng@databricks.com> Author: Michael Armbrust <michael@databricks.com> Closes #2701 from mengxr/structfield-metadata and squashes the following commits: dedda56 [Xiangrui Meng] merge remote 5ef930a [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata c35203f [Xiangrui Meng] Merge pull request #1 from marmbrus/pr/2701 886b85c [Michael Armbrust] Expose Metadata and MetadataBuilder through the public scala and java packages. 589f314 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata 1e2abcf [Xiangrui Meng] change default value of metadata to None in python 611d3c2 [Xiangrui Meng] move metadata from Expr to NamedExpr ddfcfad [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata a438440 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata 4266f4d [Xiangrui Meng] add StructField.toString back for backward compatibility 3f49aab [Xiangrui Meng] remove StructField.toString 24a9f80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata 473a7c5 [Xiangrui Meng] merge master c9d7301 [Xiangrui Meng] organize imports 1fcbf13 [Xiangrui Meng] change metadata type in StructField for Scala/Java 60cc131 [Xiangrui Meng] add doc and header 60614c7 [Xiangrui Meng] add metadata e42c452 [Xiangrui Meng] merge master 93518fb [Xiangrui Meng] support metadata in python 905bb89 [Xiangrui Meng] java conversions 618e349 [Xiangrui Meng] make tests work in scala 61b8e0f [Xiangrui Meng] merge master 7e5a322 [Xiangrui Meng] do not output metadata in StructField.toString c41a664 [Xiangrui Meng] merge master d8af0ed [Xiangrui Meng] move tests to SQLQuerySuite 67fdebb [Xiangrui Meng] add test on join d65072e [Xiangrui Meng] remove Map.empty 367d237 [Xiangrui Meng] add test c194d5e [Xiangrui Meng] add metadata field to StructField and Attribute	2014-11-01 14:37:00 -07:00
Cheng Lian	23468e7e96	[SPARK-2220][SQL] Fixes remaining Hive commands This PR adds support for the `ADD FILE` Hive command, and removes `ShellCommand` and `SourceCommand`. The reason is described in [this SPARK-2220 comment](https://issues.apache.org/jira/browse/SPARK-2220?focusedCommentId=14191841&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14191841). Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #3038 from liancheng/hive-commands and squashes the following commits: 6db61e0 [Cheng Lian] Fixes remaining Hive commands	2014-10-31 11:34:51 -07:00
ravipesala	ea465af12d	[SPARK-4154][SQL] Query does not work if it has "not between " in Spark SQL and HQL if the query contains "not between" does not work like. SELECT * FROM src where key not between 10 and 20' Author: ravipesala <ravindra.pesala@huawei.com> Closes #3017 from ravipesala/SPARK-4154 and squashes the following commits: 65fc89e [ravipesala] Handled admin comments 32e6d42 [ravipesala] 'not between' is not working	2014-10-31 11:33:20 -07:00
Venkata Ramana Gollamudi	fa712b309c	[SPARK-4077][SQL] Spark SQL return wrong values for valid string timestamp values In org.apache.hadoop.hive.serde2.io.TimestampWritable.set , if the next entry is null then current time stamp object is being reset. However because of this hiveinspectors:unwrap cannot use the same timestamp object without creating a copy. Author: Venkata Ramana G <ramana.gollamudihuawei.com> Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com> Closes #3019 from gvramana/spark_4077 and squashes the following commits: 32d818f [Venkata Ramana Gollamudi] fixed check style fa01e71 [Venkata Ramana Gollamudi] cloned timestamp object as org.apache.hadoop.hive.serde2.io.TimestampWritable.set will reset current time object	2014-10-31 11:30:28 -07:00
wangfei	7c41d13570	[SPARK-3826][SQL]enable hive-thriftserver to support hive-0.13.1 In #2241 hive-thriftserver is not enabled. This patch enable hive-thriftserver to support hive-0.13.1 by using a shim layer refer to #2241. 1 A light shim layer(code in sql/hive-thriftserver/hive-version) for each different hive version to handle api compatibility 2 New pom profiles "hive-default" and "hive-versions"(copy from #2241) to activate different hive version 3 SBT cmd for different version as follows: hive-0.12.0 --- sbt/sbt -Phive,hadoop-2.3 -Phive-0.12.0 assembly hive-0.13.1 --- sbt/sbt -Phive,hadoop-2.3 -Phive-0.13.1 assembly 4 Since hive-thriftserver depend on hive subproject, this patch should be merged with #2241 to enable hive-0.13.1 for hive-thriftserver Author: wangfei <wangfei1@huawei.com> Author: scwf <wangfei1@huawei.com> Closes #2685 from scwf/shim-thriftserver1 and squashes the following commits: f26f3be [wangfei] remove clean to save time f5cac74 [wangfei] remove local hivecontext test 578234d [wangfei] use new shaded hive 18fb1ff [wangfei] exclude kryo in hive pom fa21d09 [wangfei] clean package assembly/assembly 8a4daf2 [wangfei] minor fix 0d7f6cf [wangfei] address comments f7c93ae [wangfei] adding build with hive 0.13 before running tests bcf943f [wangfei] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver1 c359822 [wangfei] reuse getCommandProcessor in hiveshim 52674a4 [scwf] sql/hive included since examples depend on it 3529e98 [scwf] move hive module to hive profile `f51ff4e` [wangfei] update and fix conflicts f48d3a5 [scwf] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver1 41f727b [scwf] revert pom changes `13afde0` [scwf] fix small bug 4b681f4 [scwf] enable thriftserver in profile hive-0.13.1 0bc53aa [scwf] fixed when result filed is null dfd1c63 [scwf] update run-tests to run hive-0.12.0 default now c6da3ce [scwf] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver 7c66b8e [scwf] update pom according spark-2706 ae47489 [scwf] update and fix conflicts	2014-10-31 11:27:59 -07:00
Cheng Hao	58a6077e56	[SPARK-4143] [SQL] Move inner class DeferredObjectAdapter to top level The class DeferredObjectAdapter is the inner class of HiveGenericUdf, which may cause some overhead in closure ser/de-ser. Move it to top level. Author: Cheng Hao <hao.cheng@intel.com> Closes #3007 from chenghao-intel/move_deferred and squashes the following commits: 3a139b1 [Cheng Hao] Move inner class DeferredObjectAdapter to top level	2014-10-30 23:59:46 -07:00
Anant	d31517a3cd	[SPARK-4108][SQL] Fixed usage of deprecated in sql/catalyst/types/datatypes Fixed usage of deprecated in sql/catalyst/types/datatypes to have versio...n parameter Author: Anant <anant.asty@gmail.com> Closes #2970 from anantasty/SPARK-4108 and squashes the following commits: e92cb01 [Anant] Fixed usage of deprecated in sql/catalyst/types/datatypes to have version parameter	2014-10-30 23:02:42 -07:00
Andrew Or	26d31d15fd	Revert "SPARK-1209 [CORE] SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop" This reverts commit `68cb69daf3`.	2014-10-30 17:56:10 -07:00
Yash Datta	2e35e24294	[SPARK-3968][SQL] Use parquet-mr filter2 api The parquet-mr project has introduced a new filter api (https://github.com/apache/incubator-parquet-mr/pull/4), along with several fixes . It can also eliminate entire RowGroups depending on certain statistics like min/max We can leverage that to further improve performance of queries with filters. Also filter2 api introduces ability to create custom filters. We can create a custom filter for the optimized In clause (InSet) , so that elimination happens in the ParquetRecordReader itself Author: Yash Datta <Yash.Datta@guavus.com> Closes #2841 from saucam/master and squashes the following commits: 8282ba0 [Yash Datta] SPARK-3968: fix scala code style and add some more tests for filtering on optional columns 515df1c [Yash Datta] SPARK-3968: Add a test case for filter pushdown on optional column 5f4530e [Yash Datta] SPARK-3968: Fix scala code style f304667 [Yash Datta] SPARK-3968: Using task metadata strategy for row group filtering ec53e92 [Yash Datta] SPARK-3968: No push down should result in case we are unable to create a record filter 48163c3 [Yash Datta] SPARK-3968: Code cleanup cc7b596 [Yash Datta] SPARK-3968: 1. Fix RowGroupFiltering not working 2. Use the serialization/deserialization from Parquet library for filter pushdown caed851 [Yash Datta] Revert "SPARK-3968: Not pushing the filters in case of OPTIONAL columns" since filtering on optional columns is now supported in filter2 api 49703c9 [Yash Datta] SPARK-3968: Not pushing the filters in case of OPTIONAL columns 9d09741 [Yash Datta] SPARK-3968: Change parquet filter pushdown to use filter2 api of parquet-mr	2014-10-30 17:17:31 -07:00
ravipesala	9b6ebe33db	[SPARK-4120][SQL] Join of multiple tables with syntax like SELECT .. FROM T1,T2,T3.. does not work in SparkSQL Right now it works for only 2 tables like below query. sql("SELECT * FROM records1 as a,records2 as b where a.key=b.key ") But it does not work for more than 2 tables like below query sql("SELECT * FROM records1 as a,records2 as b,records3 as c where a.key=b.key and a.key=c.key"). Author: ravipesala <ravindra.pesala@huawei.com> Closes #2987 from ravipesala/multijoin and squashes the following commits: 429b005 [ravipesala] Support multiple joins	2014-10-30 17:15:45 -07:00
Sean Owen	68cb69daf3	SPARK-1209 [CORE] SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop (This is just a look at what completely moving the classes would look like. I know Patrick flagged that as maybe not OK, although, it's private?) Author: Sean Owen <sowen@cloudera.com> Closes #2814 from srowen/SPARK-1209 and squashes the following commits: ead1115 [Sean Owen] Disable MIMA warnings resulting from moving the class -- this was also part of the PairRDDFunctions type hierarchy though? 2d42c1d [Sean Owen] Move SparkHadoopMapRedUtil / SparkHadoopMapReduceUtil from org.apache.hadoop to org.apache.spark	2014-10-30 15:54:53 -07:00
Daoyuan Wang	3535467663	[SPARK-4003] [SQL] add 3 types for java SQL context In JavaSqlContext, we need to let java program use big decimal, timestamp, date types. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #2850 from adrian-wang/javacontext and squashes the following commits: 4c4292c [Daoyuan Wang] change underlying type of JavaSchemaRDD as scala bb0508f [Daoyuan Wang] add test cases 3c58b0d [Daoyuan Wang] add 3 types for java SQL context	2014-10-29 12:10:58 -07:00
Davies Liu	8c0bfd08fc	[SPARK-4133] [SQL] [PySpark] type conversionfor python udf Call Python UDF on ArrayType/MapType/PrimitiveType, the returnType can also be ArrayType/MapType/PrimitiveType. For StructType, it will act as tuple (without attributes). If returnType is StructType, it also should be tuple. Author: Davies Liu <davies@databricks.com> Closes #2973 from davies/udf_array and squashes the following commits: 306956e [Davies Liu] Merge branch 'master' of github.com:apache/spark into udf_array 2c00e43 [Davies Liu] fix merge 11395fa [Davies Liu] Merge branch 'master' of github.com:apache/spark into udf_array 9df50a2 [Davies Liu] address comments 79afb4e [Davies Liu] type conversionfor python udf	2014-10-28 19:38:16 -07:00
Cheng Hao	b5e79bf889	[SPARK-3904] [SQL] add constant objectinspector support for udfs In HQL, we convert all of the data type into normal `ObjectInspector`s for UDFs, most of cases it works, however, some of the UDF actually requires its children `ObjectInspector` to be the `ConstantObjectInspector`, which will cause exception. e.g. select named_struct("x", "str") from src limit 1; I updated the method `wrap` by adding the one more parameter `ObjectInspector`(to describe what it expects to wrap to, for example: java.lang.Integer or IntWritable). As well as the `unwrap` method by providing the input `ObjectInspector`. Author: Cheng Hao <hao.cheng@intel.com> Closes #2762 from chenghao-intel/udf_coi and squashes the following commits: bcacfd7 [Cheng Hao] Shim for both Hive 0.12 & 0.13.1 2416e5d [Cheng Hao] revert to hive 0.12 5793c01 [Cheng Hao] add space before while 4e56e1b [Cheng Hao] style issue 683d3fd [Cheng Hao] Add golden files fe591e4 [Cheng Hao] update HiveGenericUdf for set the ObjectInspector while constructing the DeferredObject f6740fe [Cheng Hao] Support Constant ObjectInspector for Map & List 8814c3a [Cheng Hao] Passing ContantObjectInspector(when necessary) for UDF initializing	2014-10-28 19:11:57 -07:00
Cheng Hao	4b55482abf	[SPARK-3343] [SQL] Add serde support for CTAS Currently, `CTAS` (Create Table As Select) doesn't support specifying the `SerDe` in HQL. This PR will pass down the `ASTNode` into the physical operator `execution.CreateTableAsSelect`, which will extract the `CreateTableDesc` object via Hive `SemanticAnalyzer`. In the meantime, I also update the `HiveMetastoreCatalog.createTable` to optionally support the `CreateTableDesc` for table creation. Author: Cheng Hao <hao.cheng@intel.com> Closes #2570 from chenghao-intel/ctas_serde and squashes the following commits: e011ef5 [Cheng Hao] shim for both 0.12 & 0.13.1 cfb3662 [Cheng Hao] revert to hive 0.12 c8a547d [Cheng Hao] Support SerDe properties within CTAS	2014-10-28 14:36:06 -07:00
Daoyuan Wang	47a40f60d6	[SPARK-3988][SQL] add public API for date type Add json and python api for date type. By using Pickle, `java.sql.Date` was serialized as calendar, and recognized in python as `datetime.datetime`. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #2901 from adrian-wang/spark3988 and squashes the following commits: c51a24d [Daoyuan Wang] convert datetime to date 5670626 [Daoyuan Wang] minor line combine f760d8e [Daoyuan Wang] fix indent 444f100 [Daoyuan Wang] fix a typo 1d74448 [Daoyuan Wang] fix scala style 8d7dd22 [Daoyuan Wang] add json and python api for date type	2014-10-28 13:43:25 -07:00
ravipesala	5807cb40ae	[SPARK-3814][SQL] Support for Bitwise AND(&), OR(\|) ,XOR(^), NOT(~) in Spark HQL and SQL Currently there is no support of Bitwise & , \| in Spark HiveQl and Spark SQL as well. So this PR support the same. I am closing https://github.com/apache/spark/pull/2926 as it has conflicts to merge. And also added support for Bitwise AND(&), OR(\|) ,XOR(^), NOT(~) And I handled all review comments in that PR Author: ravipesala <ravindra.pesala@huawei.com> Closes #2961 from ravipesala/SPARK-3814-NEW4 and squashes the following commits: a391c7a [ravipesala] Rebase with master	2014-10-28 13:36:06 -07:00
wangxiaojing	0c34fa5b4b	[SPARK-3907][SQL] Add truncate table support JIRA issue: [SPARK-3907]https://issues.apache.org/jira/browse/SPARK-3907 Add turncate table support TRUNCATE TABLE table_name [PARTITION partition_spec]; partition_spec: : (partition_col = partition_col_value, partition_col = partiton_col_value, ...) Removes all rows from a table or partition(s). Currently target table should be native/managed table or exception will be thrown. User can specify partial partition_spec for truncating multiple partitions at once and omitting partition_spec will truncate all partitions in the table. Author: wangxiaojing <u9jing@gmail.com> Closes #2770 from wangxiaojing/spark-3907 and squashes the following commits: 63dbd81 [wangxiaojing] change hive scalastyle 7a03707 [wangxiaojing] add comment f6e710e [wangxiaojing] change truncate table a1f692c [wangxiaojing] Correct spelling mistakes 3b20007 [wangxiaojing] add truncate can not support column err message e483547 [wangxiaojing] add golden file 77b1f20 [wangxiaojing] add truncate table support	2014-10-27 22:02:52 -07:00
Yin Huai	27470d3406	[SQL] Correct a variable name in JavaApplySchemaSuite.applySchemaToJSON `schemaRDD2` is not tested because `schemaRDD1` is registered again. Author: Yin Huai <huai@cse.ohio-state.edu> Closes #2869 from yhuai/JavaApplySchemaSuite and squashes the following commits: 95fe894 [Yin Huai] Correct variable name.	2014-10-27 20:50:09 -07:00
wangfei	89af6dfc3a	[SPARK-4041][SQL] Attributes names in table scan should converted to lowercase when compare with relation attributes In ```MetastoreRelation``` the attributes name is lowercase because of hive using lowercase for fields name, so we should convert attributes name in table scan lowercase in ```indexWhere(_.name == a.name)```. ```neededColumnIDs``` may be not correct if not convert to lowercase. Author: wangfei <wangfei1@huawei.com> Author: scwf <wangfei1@huawei.com> Closes #2884 from scwf/fixColumnIds and squashes the following commits: 6174046 [scwf] use AttributeMap for this issue dc74a24 [wangfei] use lowerName and add a test case for this issue 3ff3a80 [wangfei] more safer change 294fcb7 [scwf] attributes names in table scan should convert lowercase in neededColumnsIDs	2014-10-27 20:46:26 -07:00
Alex Liu	698a7eab77	[SPARK-3816][SQL] Add table properties from storage handler to output jobConf ...ob conf in SparkHadoopWriter class Author: Alex Liu <alex_liu68@yahoo.com> Closes #2677 from alexliu68/SPARK-SQL-3816 and squashes the following commits: 79c269b [Alex Liu] [SPARK-3816][SQL] Add table properties from storage handler to job conf	2014-10-27 20:43:29 -07:00
Cheng Hao	418ad83fe1	[SPARK-3911] [SQL] HiveSimpleUdf can not be optimized in constant folding ``` explain extended select cos(null) from src limit 1; ``` outputs: ``` Project [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFCos(null) AS c_0#5] MetastoreRelation default, src, None == Optimized Logical Plan == Limit 1 Project [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFCos(null) AS c_0#5] MetastoreRelation default, src, None == Physical Plan == Limit 1 Project [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFCos(null) AS c_0#5] HiveTableScan [], (MetastoreRelation default, src, None), None ``` After patching this PR it outputs ``` == Parsed Logical Plan == Limit 1 Project ['cos(null) AS c_0#0] UnresolvedRelation None, src, None == Analyzed Logical Plan == Limit 1 Project [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFCos(null) AS c_0#0] MetastoreRelation default, src, None == Optimized Logical Plan == Limit 1 Project [null AS c_0#0] MetastoreRelation default, src, None == Physical Plan == Limit 1 Project [null AS c_0#0] HiveTableScan [], (MetastoreRelation default, src, None), None ``` Author: Cheng Hao <hao.cheng@intel.com> Closes #2771 from chenghao-intel/hive_udf_constant_folding and squashes the following commits: 1379c73 [Cheng Hao] duplicate the PlanTest with catalyst/plans/PlanTest 1e52dda [Cheng Hao] add unit test for hive simple udf constant folding 01609ff [Cheng Hao] support constant folding for HiveSimpleUdf	2014-10-27 20:42:05 -07:00
Cheng Lian	1d7bcc8840	[SQL] Fixes caching related JoinSuite failure PR #2860 refines in-memory table statistics and enables broader broadcasted hash join optimization for in-memory tables. This makes `JoinSuite` fail when some test suite caches test table `testData` and gets executed before `JoinSuite`. Because expected `ShuffledHashJoin`s are optimized to `BroadcastedHashJoin` according to collected in-memory table statistics. This PR fixes this issue by clearing the cache before testing join operator selection. A separate test case is also added to test broadcasted hash join operator selection. Author: Cheng Lian <lian@databricks.com> Closes #2960 from liancheng/fix-join-suite and squashes the following commits: 715b2de [Cheng Lian] Fixes caching related JoinSuite failure	2014-10-27 10:06:09 -07:00
scwf	f4e8c289d8	[SPARK-4042][SQL] Append columns ids and names before broadcast Append columns ids and names before broadcast ```hiveExtraConf``` in ```HadoopTableReader```. Author: scwf <wangfei1@huawei.com> Closes #2885 from scwf/HadoopTableReader and squashes the following commits: a8c498c [scwf] append columns ids and names before broadcast	2014-10-26 16:56:11 -07:00
Kousuke Saruta	3a9d66cf59	[SPARK-4061][SQL] We cannot use EOL character in the operand of LIKE predicate. We cannot use EOL character like \n or \r in the operand of LIKE predicate. So following condition is never true. -- someStr is 'hoge\nfuga' where someStr LIKE 'hoge_fuga' Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #2908 from sarutak/spark-sql-like-match-modification and squashes the following commits: d15798b [Kousuke Saruta] Remove test setting for thriftserver f99a2f4 [Kousuke Saruta] Fixed LIKE predicate so that we can use EOL character as in a operand	2014-10-26 16:54:07 -07:00
Kousuke Saruta	ace41e8bf2	[SPARK-3959][SPARK-3960][SQL] SqlParser fails to parse literal -9223372036854775808 (Long.MinValue). / We can apply unary minus only to literal. SqlParser fails to parse -9223372036854775808 (Long.MinValue) so we cannot write queries such like as follows. SELECT value FROM someTable WHERE value > -9223372036854775808 Additionally, because of the wrong syntax definition, we cannot apply unary minus only to literal. So, we cannot write such expressions. -(value1 + value2) // Parenthesized expressions -column // Columns -MAX(column) // Functions Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #2816 from sarutak/spark-sql-dsl-improvement2 and squashes the following commits: 32a5005 [Kousuke Saruta] Remove test setting for thriftserver c2bab5e [Kousuke Saruta] Fixed SPARK-3959 and SPARK-3960	2014-10-26 16:40:29 -07:00
ravipesala	974d7b238b	[SPARK-3483][SQL] Special chars in column names Supporting special chars in column names by using back ticks. Closed https://github.com/apache/spark/pull/2804 and created this PR as it has merge conflicts Author: ravipesala <ravindra.pesala@huawei.com> Closes #2927 from ravipesala/SPARK-3483-NEW and squashes the following commits: f6329f3 [ravipesala] Rebased with master	2014-10-26 16:36:11 -07:00
Yin Huai	0481aaa8d7	[SPARK-4068][SQL] NPE in jsonRDD schema inference Please refer to added tests for cases that can trigger the bug. JIRA: https://issues.apache.org/jira/browse/SPARK-4068 Author: Yin Huai <huai@cse.ohio-state.edu> Closes #2918 from yhuai/SPARK-4068 and squashes the following commits: d360eae [Yin Huai] Handle nulls when building key paths from elements of an array.	2014-10-26 16:32:02 -07:00
Yin Huai	05308426f0	[SPARK-4052][SQL] Use scala.collection.Map for pattern matching instead of using Predef.Map (it is scala.collection.immutable.Map) Please check https://issues.apache.org/jira/browse/SPARK-4052 for cases triggering this bug. Author: Yin Huai <huai@cse.ohio-state.edu> Closes #2899 from yhuai/SPARK-4052 and squashes the following commits: 1188f70 [Yin Huai] Address liancheng's comments. b6712be [Yin Huai] Use scala.collection.Map instead of Predef.Map (scala.collection.immutable.Map).	2014-10-26 16:30:15 -07:00
Kousuke Saruta	d518bc24af	[SPARK-3953][SQL][Minor] Confusable variable name. In SqlParser.scala, there is following code. case d ~ p ~ r ~ f ~ g ~ h ~ o ~ l => val base = r.getOrElse(NoRelation) val withFilter = f.map(f => Filter(f, base)).getOrElse(base) In the code above, there are 2 variables which have same name "f" in near place. One is receiver "f" and other is bound variable "f". Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #2807 from sarutak/SPARK-3953 and squashes the following commits: 4957c32 [Kousuke Saruta] Improved variable name in SqlParser.scala	2014-10-26 16:28:33 -07:00
Cheng Lian	2838bf8aad	[SPARK-3537][SPARK-3914][SQL] Refines in-memory columnar table statistics This PR refines in-memory columnar table statistics: 1. adds 2 more statistics for in-memory table columns: `count` and `sizeInBytes` 1. adds filter pushdown support for `IS NULL` and `IS NOT NULL`. 1. caches and propagates statistics in `InMemoryRelation` once the underlying cached RDD is materialized. Statistics are collected to driver side with an accumulator. This PR also fixes SPARK-3914 by properly propagating in-memory statistics. Author: Cheng Lian <lian@databricks.com> Closes #2860 from liancheng/propagates-in-mem-stats and squashes the following commits: 0cc5271 [Cheng Lian] Restricts visibility of o.a.s.s.c.p.l.Statistics c5ff904 [Cheng Lian] Fixes test table name conflict a8c818d [Cheng Lian] Refines tests 1d01074 [Cheng Lian] Bug fix: shouldn't call STRING.actualSize on null string value 7dc6a34 [Cheng Lian] Adds more in-memory table statistics and propagates them properly	2014-10-26 16:10:09 -07:00
Liang-Chi Hsieh	0af7e514c6	[SPARK-3925][SQL] Do not consider the ordering of qualifiers during comparison The orderings should not be considered during the comparison between old qualifiers and new qualifiers. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #2783 from viirya/full_qualifier_comp and squashes the following commits: 89f652c [Liang-Chi Hsieh] modification for comment. abb5762 [Liang-Chi Hsieh] More comprehensive comparison of qualifiers.	2014-10-26 14:29:13 -07:00
Josh Rosen	bf589fc717	[SPARK-3616] Add basic Selenium tests to WebUISuite This patch adds Selenium tests for Spark's web UI. To avoid adding extra dependencies to the test environment, the tests use Selenium's HtmlUnitDriver, which is pure-Java, instead of, say, ChromeDriver. I added new tests to try to reproduce a few UI bugs reported on JIRA, namely SPARK-3021, SPARK-2105, and SPARK-2527. I wasn't able to reproduce these bugs; I suspect that the older ones might have been fixed by other patches. In order to use HtmlUnitDriver, I added an explicit dependency on the org.apache.httpcomponents version of httpclient in order to prevent jets3t's older version from taking precedence on the classpath. I also upgraded ScalaTest to 2.2.1. Author: Josh Rosen <joshrosen@apache.org> Author: Josh Rosen <joshrosen@databricks.com> Closes #2474 from JoshRosen/webui-selenium-tests and squashes the following commits: fcc9e83 [Josh Rosen] scalautils -> scalactic package rename 510e54a [Josh Rosen] [SPARK-3616] Add basic Selenium tests to WebUISuite.	2014-10-26 11:29:27 -07:00
Sean Owen	df7974b8e5	SPARK-3359 [DOCS] sbt/sbt unidoc doesn't work with Java 8 This follows https://github.com/apache/spark/pull/2893 , but does not completely fix SPARK-3359 either. This fixes minor scaladoc/javadoc issues that Javadoc 8 will treat as errors. Author: Sean Owen <sowen@cloudera.com> Closes #2909 from srowen/SPARK-3359 and squashes the following commits: f62c347 [Sean Owen] Fix some javadoc issues that javadoc 8 considers errors. This is not all of the errors turned up when javadoc 8 runs on output of genjavadoc.	2014-10-25 23:18:02 -07:00
Michael Armbrust	3a845d3c04	[SQL] Update Hive test harness for Hive 12 and 13 As part of the upgrade I also copy the newest version of the query tests, and whitelist a bunch of new ones that are now passing. Author: Michael Armbrust <michael@databricks.com> Closes #2936 from marmbrus/fix13tests and squashes the following commits: d9cbdab [Michael Armbrust] Remove user specific tests 65801cd [Michael Armbrust] style and rat 8f6b09a [Michael Armbrust] Update test harness to work with both Hive 12 and 13. f044843 [Michael Armbrust] Update Hive query tests and golden files to 0.13	2014-10-24 18:36:35 -07:00
Michael Armbrust	3a906c6631	[HOTFIX][SQL] Remove sleep on reset() failure. Author: Michael Armbrust <michael@databricks.com> Closes #2934 from marmbrus/patch-2 and squashes the following commits: a96dab2 [Michael Armbrust] Remove sleep on reset() failure.	2014-10-24 14:03:03 -07:00
Zhan Zhang	7c89a8f0c8	[SPARK-2706][SQL] Enable Spark to support Hive 0.13 Given that a lot of users are trying to use hive 0.13 in spark, and the incompatibility between hive-0.12 and hive-0.13 on the API level I want to propose following approach, which has no or minimum impact on existing hive-0.12 support, but be able to jumpstart the development of hive-0.13 and future version support. Approach: Introduce “hive-version” property, and manipulate pom.xml files to support different hive version at compiling time through shim layer, e.g., hive-0.12.0 and hive-0.13.1. More specifically, 1. For each different hive version, there is a very light layer of shim code to handle API differences, sitting in sql/hive/hive-version, e.g., sql/hive/v0.12.0 or sql/hive/v0.13.1 2. Add a new profile hive-default active by default, which picks up all existing configuration and hive-0.12.0 shim (v0.12.0) if no hive.version is specified. 3. If user specifies different version (currently only 0.13.1 by -Dhive.version = 0.13.1), hive-versions profile will be activated, which pick up hive-version specific shim layer and configuration, mainly the hive jars and hive-version shim, e.g., v0.13.1. 4. With this approach, nothing is changed with current hive-0.12 support. No change by default: sbt/sbt -Phive For example: sbt/sbt -Phive -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 assembly To enable hive-0.13: sbt/sbt -Dhive.version=0.13.1 For example: sbt/sbt -Dhive.version=0.13.1 -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 assembly Note that in hive-0.13, hive-thriftserver is not enabled, which should be fixed by other Jira, and we don’t need -Phive with -Dhive.version in building (probably we should use -Phive -Dhive.version=xxx instead after thrift server is also supported in hive-0.13.1). Author: Zhan Zhang <zhazhan@gmail.com> Author: zhzhan <zhazhan@gmail.com> Author: Patrick Wendell <pwendell@gmail.com> Closes #2241 from zhzhan/spark-2706 and squashes the following commits: 3ece905 [Zhan Zhang] minor fix 410b668 [Zhan Zhang] solve review comments cbb4691 [Zhan Zhang] change run-test for new options 0d4d2ed [Zhan Zhang] rebase 497b0f4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 8fad1cf [Zhan Zhang] change the pom file and make hive-0.13.1 as the default ab028d1 [Zhan Zhang] rebase 4a2e36d [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 4cb1b93 [zhzhan] Merge pull request #1 from pwendell/pr-2241 b0478c0 [Patrick Wendell] Changes to simplify the build of SPARK-2706 2b50502 [Zhan Zhang] rebase a72c0d4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark cb22863 [Zhan Zhang] correct the typo 20f6cf7 [Zhan Zhang] solve compatability issue f7912a9 [Zhan Zhang] rebase and solve review feedback 301eb4a [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 10c3565 [Zhan Zhang] address review comments 6bc9204 [Zhan Zhang] rebase and remove temparory repo d3aa3f2 [Zhan Zhang] Merge branch 'master' into spark-2706 cedcc6f [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 3ced0d7 [Zhan Zhang] rebase d9b981d [Zhan Zhang] rebase and fix error due to rollback adf4924 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 3dd50e8 [Zhan Zhang] solve conflicts and remove unnecessary implicts d10bf00 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark dc7bdb3 [Zhan Zhang] solve conflicts 7e0cc36 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark d7c3e1e [Zhan Zhang] Merge branch 'master' into spark-2706 68deb11 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark d48bd18 [Zhan Zhang] address review comments 3ee3b2b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 57ea52e [Zhan Zhang] Merge branch 'master' into spark-2706 2b0d513 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 9412d24 [Zhan Zhang] address review comments f4af934 [Zhan Zhang] rebase 1ccd7cc [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 128b60b [Zhan Zhang] ignore 0.12.0 test cases for the time being af9feb9 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 5f5619f [Zhan Zhang] restructure the directory and different hive version support 05d3683 [Zhan Zhang] solve conflicts e4c1982 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark 94b4fdc [Zhan Zhang] Spark-2706: hive-0.13.1 support on spark 87ebf3b [Zhan Zhang] Merge branch 'master' into spark-2706 921e914 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark f896b2a [Zhan Zhang] Merge branch 'master' into spark-2706 789ea21 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark cb53a2c [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark f6a8a40 [Zhan Zhang] revert ba14f28 [Zhan Zhang] test dbedff3 [Zhan Zhang] Merge remote-tracking branch 'upstream/master' 70964fe [Zhan Zhang] revert fe0f379 [Zhan Zhang] Merge branch 'master' of https://github.com/zhzhan/spark 70ffd93 [Zhan Zhang] revert 42585ec [Zhan Zhang] test 7d5fce2 [Zhan Zhang] test	2014-10-24 11:03:17 -07:00
Michael Armbrust	0e886610ee	[SPARK-4050][SQL] Fix caching of temporary tables with projections. Previously cached data was found by `sameResult` plan matching on optimized plans. This technique however fails to locate the cached data when a temporary table with a projection is queried with a further reduced projection. The failure is due to the fact that optimization will collapse the projections, producing a plan that no longer produces the sameResult as the cached data (though the cached data still subsumes the desired data). For example consider the following previously failing test case. ```scala sql("CACHE TABLE tempTable AS SELECT key FROM testData") assertCached(sql("SELECT COUNT() FROM tempTable")) ``` In this PR I change the matching to occur after analysis instead of optimization, so that in the case of temporary tables, the plans will always match. I think this should work generally, however, this error does raise questions about the need to do more thorough subsumption checking when locating cached data. Another question is what sort of semantics we want to provide when uncaching data from temporary tables. For example consider the following sequence of commands: ```scala testData.select('key).registerTempTable("tempTable1") testData.select('key).registerTempTable("tempTable2") cacheTable("tempTable1") // This obviously works. assertCached(sql("SELECT COUNT() FROM tempTable1")) // It seems good that this works ... assertCached(sql("SELECT COUNT() FROM tempTable2")) // ... but is this valid? uncacheTable("tempTable2") // Should this still be cached? assertCached(sql("SELECT COUNT() FROM tempTable1"), 0) ``` Author: Michael Armbrust <michael@databricks.com> Closes #2912 from marmbrus/cachingBug and squashes the following commits: 9c822d4 [Michael Armbrust] remove commented out code 5c72fb7 [Michael Armbrust] Add a test case / question about uncaching semantics. 63a23e4 [Michael Armbrust] Perform caching on analyzed instead of optimized plan. 03f1cfe [Michael Armbrust] Clean-up / add tests to SameResult suite.	2014-10-24 10:52:25 -07:00
wangfei	856b081729	[SQL]redundant methods for broadcast redundant methods for broadcast in ```TableReader``` Author: wangfei <wangfei1@huawei.com> Closes #2862 from scwf/TableReader and squashes the following commits: 414cc24 [wangfei] unnecessary methods for broadcast	2014-10-21 16:20:58 -07:00
wangxiaojing	0fe1c09369	[SPARK-3940][SQL] Avoid console printing error messages three times If wrong sql,the console print error one times。 eg： <pre> spark-sql> show tabless; show tabless; 14/10/13 21:03:48 INFO ParseDriver: Parsing command: show tabless ............ at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:274) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:209) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) Caused by: org.apache.hadoop.hive.ql.parse.ParseException: line 1:5 cannot recognize input near 'show' 'tabless' '<EOF>' in ddl statement at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:193) at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:161) at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:218) at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:226) ... 47 more Time taken: 4.35 seconds 14/10/13 21:03:51 INFO CliDriver: Time taken: 4.35 seconds </pre> Author: wangxiaojing <u9jing@gmail.com> Closes #2790 from wangxiaojing/spark-3940 and squashes the following commits: e2e5c14 [wangxiaojing] sql Print the error code three times	2014-10-20 17:15:28 -07:00
Takuya UESHIN	7586e2e67a	[SPARK-3969][SQL] Optimizer should have a super class as an interface. Some developers want to replace `Optimizer` to fit their projects but can't do so because currently `Optimizer` is an `object`. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #2825 from ueshin/issues/SPARK-3969 and squashes the following commits: abbc53c [Takuya UESHIN] Re-rename Optimizer object. 4d2e1bc [Takuya UESHIN] Rename Optimizer object. 9547a23 [Takuya UESHIN] Extract abstract class from Optimizer for developers to be able to replace Optimizer.	2014-10-20 17:09:12 -07:00
luogankun	fce1d41611	[SPARK-3945]Properties of hive-site.xml is invalid in running the Thrift JDBC server Write properties of hive-site.xml to HiveContext when initilize session state in SparkSQLEnv.scala. The method of SparkSQLEnv.init() in HiveThriftServer2.scala can not write the properties of hive-site.xml to HiveContext. Such as: add configuration property spark.sql.shuffle.partititions in the hive-site.xml. Author: luogankun <luogankun@gmail.com> Closes #2800 from luogankun/SPARK-3945 and squashes the following commits: 3679efc [luogankun] [SPARK-3945]Write properties of hive-site.xml to HiveContext when initilize session state In SparkSQLEnv.scala	2014-10-20 16:50:51 -07:00
Takuya UESHIN	364d52b707	[SPARK-3966][SQL] Fix nullabilities of Cast related to DateType. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #2820 from ueshin/issues/SPARK-3966 and squashes the following commits: ca4a745 [Takuya UESHIN] Fix nullabilities of Cast related to DateType.	2014-10-20 15:51:05 -07:00
Michael Armbrust	e9c1afa87b	[SPARK-3800][SQL] Clean aliases from grouping expressions Author: Michael Armbrust <michael@databricks.com> Closes #2658 from marmbrus/nestedAggs and squashes the following commits: 862b763 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into nestedAggs 3234521 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into nestedAggs 8b06fdc [Michael Armbrust] possible fix for grouping on nested fields	2014-10-20 15:32:17 -07:00
Cheng Lian	1b3ce61ce9	[SPARK-3906][SQL] Adds multiple join support for SQLContext Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2767 from liancheng/multi-join and squashes the following commits: 9dc0d18 [Cheng Lian] Adds multiple join support for SQLContext	2014-10-20 15:29:54 -07:00
Takuya UESHIN	ea054e1fc7	[SPARK-3986][SQL] Fix package names to fit their directory names. Package names of 2 test suites are different from their directory names. - `GeneratedEvaluationSuite` - `GeneratedMutableEvaluationSuite` Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #2835 from ueshin/issues/SPARK-3986 and squashes the following commits: fa2cc05 [Takuya UESHIN] Fix package names to fit their directory names.	2014-10-20 11:31:51 -07:00
Sean Owen	f406a83918	SPARK-3926 [CORE] Result of JavaRDD.collectAsMap() is not Serializable Make JavaPairRDD.collectAsMap result Serializable since Java Maps generally are Author: Sean Owen <sowen@cloudera.com> Closes #2805 from srowen/SPARK-3926 and squashes the following commits: ecb78ee [Sean Owen] Fix conflict between java.io.Serializable and use of Scala's Serializable f4717f9 [Sean Owen] Oops, fix compile problem ae1b36f [Sean Owen] Expand to cover Maps returned from other Java API methods as well 51c26c2 [Sean Owen] Make JavaPairRDD.collectAsMap result Serializable since Java Maps generally are	2014-10-18 12:38:18 -07:00
Michael Armbrust	adcb7d3350	[SPARK-3855][SQL] Preserve the result attribute of python UDFs though transformations In the current implementation it was possible for the reference to change after analysis. Author: Michael Armbrust <michael@databricks.com> Closes #2717 from marmbrus/pythonUdfResults and squashes the following commits: `da14879` [Michael Armbrust] Fix test 6343bcb [Michael Armbrust] add test 9533286 [Michael Armbrust] Correctly preserve the result attribute of python UDFs though transformations	2014-10-17 14:12:07 -07:00
Prashant Sharma	2fe0ba9561	SPARK-3874: Provide stable TaskContext API This is a small number of clean-up changes on top of #2782. Closes #2782. Author: Prashant Sharma <prashant.s@imaginea.com> Author: Patrick Wendell <pwendell@gmail.com> Closes #2803 from pwendell/pr-2782 and squashes the following commits: 56d5b7a [Patrick Wendell] Minor clean-up 44089ec [Patrick Wendell] Clean-up the TaskContext API. ed551ce [Prashant Sharma] Fixed a typo df261d0 [Prashant Sharma] Josh's suggestion facf3b1 [Prashant Sharma] Fixed the mima issue. 7ecc2fe [Prashant Sharma] CR, Moved implementations to TaskContextImpl bbd9e05 [Prashant Sharma] adding missed out files to git. ef633f5 [Prashant Sharma] SPARK-3874, Provide stable TaskContext API	2014-10-16 21:38:45 -04:00
Cheng Lian	99e416b6d6	[SQL] Fixes the race condition that may cause test failure The removed `Future` was used to end the test case as soon as the Spark SQL CLI process exits. When the process exits prematurely, this mechanism prevents the test case to wait until timeout. But it also creates a race condition: when `foundAllExpectedAnswers.tryFailure` is called, there are chances that the last expected output line of the CLI process hasn't been caught by the main logics of the test code, thus fails the test case. Removing this `Future` doesn't affect correctness. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2823 from liancheng/clean-clisuite and squashes the following commits: 489a97c [Cheng Lian] Fixes the race condition that may cause test failure	2014-10-16 16:15:55 -07:00
Cheng Lian	9eb49d4134	[SPARK-3809][SQL] Fixes test suites in hive-thriftserver As scwf pointed out, `HiveThriftServer2Suite` isn't effective anymore after the Thrift server was made a daemon. On the other hand, these test suites were known flaky, PR #2214 tried to fix them but failed because of unknown Jenkins build error. This PR fixes both sets of issues. In this PR, instead of watching `start-thriftserver.sh` output, the test code start a `tail` process to watch the log file. A `Thread.sleep` has to be introduced because the `kill` command used in `stop-thriftserver.sh` is not synchronous. As for the root cause of the mysterious Jenkins build failure. Please refer to [this comment](https://github.com/apache/spark/pull/2675#issuecomment-58464189) below for details. ---- (Copied from PR description of #2214) This PR fixes two issues of `HiveThriftServer2Suite` and brings 1 enhancement: 1. Although metastore, warehouse directories and listening port are randomly chosen, all test cases share the same configuration. Due to parallel test execution, one of the two test case is doomed to fail 2. We caught any exceptions thrown from a test case and print diagnosis information, but forgot to re-throw the exception... 3. When the forked server process ends prematurely (e.g., fails to start), the `serverRunning` promise is completed with a failure, preventing the test code to keep waiting until timeout. So, embarrassingly, this test suite was failing continuously for several days but no one had ever noticed it... Fortunately no bugs in the production code were covered under the hood. Author: Cheng Lian <lian.cs.zju@gmail.com> Author: wangfei <wangfei1@huawei.com> Closes #2675 from liancheng/fix-thriftserver-tests and squashes the following commits: 1c384b7 [Cheng Lian] Minor code cleanup, restore the logging level hack in TestHive.scala 7805c33 [wangfei] reset SPARK_TESTING to avoid loading Log4J configurations in testing class paths af2b5a9 [Cheng Lian] Removes log level hacks from TestHiveContext d116405 [wangfei] make sure that log4j level is INFO ee92a82 [Cheng Lian] Relaxes timeout 7fd6757 [Cheng Lian] Fixes test suites in hive-thriftserver	2014-10-13 13:50:27 -07:00
Liquan Pei	9d9ca91fef	[SQL]Small bug in unresolved.scala name should throw exception with name instead of exprId. Author: Liquan Pei <liquanpei@gmail.com> Closes #2758 from Ishiihara/SparkSQL-bug and squashes the following commits: aa36a3b [Liquan Pei] small bug	2014-10-13 13:49:11 -07:00
chirag	e6e37701f1	SPARK-3807: SparkSql does not work for tables created using custom serde SparkSql crashes on selecting tables using custom serde. Example: ---------------- CREATE EXTERNAL TABLE table_name PARTITIONED BY ( a int) ROW FORMAT 'SERDE "org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer" with serdeproperties("serialization.format"="org.apache.thrift.protocol.TBinaryProtocol","serialization.class"="ser_class") STORED AS SEQUENCEFILE; The following exception is seen on running a query like 'select * from table_name limit 1': ERROR CliDriver: org.apache.hadoop.hive.serde2.SerDeException: java.lang.NullPointerException at org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer.initialize(ThriftDeserializer.java:68) at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializer(TableDesc.java:80) at org.apache.spark.sql.hive.execution.HiveTableScan.addColumnMetadataToConf(HiveTableScan.scala:86) at org.apache.spark.sql.hive.execution.HiveTableScan.<init>(HiveTableScan.scala:100) at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188) at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188) at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364) at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:280) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406) at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:406) at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.NullPointerException Author: chirag <chirag.aggarwal@guavus.com> Closes #2674 from chiragaggarwal/branch-1.1 and squashes the following commits: 370c31b [chirag] SPARK-3807: Add a test case to validate the fix. 1f26805 [chirag] SPARK-3807: SparkSql does not work for tables created using custom serde (Incorporated Review Comments) ba4bc0c [chirag] SPARK-3807: SparkSql does not work for tables created using custom serde 5c73b72 [chirag] SPARK-3807: SparkSql does not work for tables created using custom serde (cherry picked from commit `925e22d313`) Signed-off-by: Michael Armbrust <michael@databricks.com>	2014-10-13 13:47:51 -07:00
Michael Armbrust	371321cade	[SQL] Add type checking debugging functions Adds some functions that were very useful when trying to track down the bug from #2656. This change also changes the tree output for query plans to include the `'` prefix to unresolved nodes and `!` prefix to nodes that refer to non-existent attributes. Author: Michael Armbrust <michael@databricks.com> Closes #2657 from marmbrus/debugging and squashes the following commits: 654b926 [Michael Armbrust] Clean-up, add tests 763af15 [Michael Armbrust] Add typeChecking debugging functions 8c69303 [Michael Armbrust] Add inputSet, references to QueryPlan. Improve tree string with a prefix to denote invalid or unresolved nodes. fbeab54 [Michael Armbrust] Better toString, factories for AttributeSet.	2014-10-13 13:46:34 -07:00
Venkata Ramana Gollamudi	e10d71e7e5	[SPARK-3559][SQL] Remove unnecessary columns from List of needed Column Ids in Hive Conf Author: Venkata Ramana G <ramana.gollamudihuawei.com> Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com> Closes #2713 from gvramana/remove_unnecessary_columns and squashes the following commits: b7ba768 [Venkata Ramana Gollamudi] Added comment and checkstyle fix 6a93459 [Venkata Ramana Gollamudi] cloned hiveconf for each TableScanOperators so that only required columns are added	2014-10-13 13:45:34 -07:00
Takuya UESHIN	73da9c26b0	[SPARK-3771][SQL] AppendingParquetOutputFormat should use reflection to prevent from breaking binary-compatibility. Original problem is [SPARK-3764](https://issues.apache.org/jira/browse/SPARK-3764). `AppendingParquetOutputFormat` uses a binary-incompatible method `context.getTaskAttemptID`. This causes binary-incompatible of Spark itself, i.e. if Spark itself is built against hadoop-1, the artifact is for only hadoop-1, and vice versa. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #2638 from ueshin/issues/SPARK-3771 and squashes the following commits: efd3784 [Takuya UESHIN] Add a comment to explain the reason to use reflection. ec213c1 [Takuya UESHIN] Use reflection to prevent breaking binary-compatibility.	2014-10-13 13:43:41 -07:00
Cheng Hao	d3cdf9128a	[SPARK-3529] [SQL] Delete the temp files after test exit There are lots of temporal files created by TestHive under the /tmp by default, which may cause potential performance issue for testing. This PR will automatically delete them after test exit. Author: Cheng Hao <hao.cheng@intel.com> Closes #2393 from chenghao-intel/delete_temp_on_exit and squashes the following commits: 3a6511f [Cheng Hao] Remove the temp dir after text exit	2014-10-13 13:40:20 -07:00
Cheng Lian	56102dc2d8	[SPARK-2066][SQL] Adds checks for non-aggregate attributes with aggregation This PR adds a new rule `CheckAggregation` to the analyzer to provide better error message for non-aggregate attributes with aggregation. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #2774 from liancheng/non-aggregate-attr and squashes the following commits: 5246004 [Cheng Lian] Passes test suites bf1878d [Cheng Lian] Adds checks for non-aggregate attributes with aggregation	2014-10-13 13:36:39 -07:00
Daoyuan Wang	2ac40da3f9	[SPARK-3407][SQL]Add Date type support Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #2344 from adrian-wang/date and squashes the following commits: f15074a [Daoyuan Wang] remove outdated lines 2038085 [Daoyuan Wang] update return type 00fe81f [Daoyuan Wang] address lian cheng's comments 0df6ea1 [Daoyuan Wang] rebase and remove simple string bb1b1ef [Daoyuan Wang] remove failing test aa96735 [Daoyuan Wang] not cast for same type compare 30bf48b [Daoyuan Wang] resolve rebase conflict 617d1a8 [Daoyuan Wang] add date_udf case to white list c37e848 [Daoyuan Wang] comment update 5429212 [Daoyuan Wang] change to long f8f219f [Daoyuan Wang] revise according to Cheng Hao 0e0a4f5 [Daoyuan Wang] minor format 4ddcb92 [Daoyuan Wang] add java api for date 0e3110e [Daoyuan Wang] try to fix timezone issue 17fda35 [Daoyuan Wang] set test list 2dfbb5b [Daoyuan Wang] support date type	2014-10-13 13:33:12 -07:00

... 2 3 4 5 6 ...

824 commits