ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
gatorsmile	74ba84b64c	[HOT][BUILD] Changed the import order This PR is to fix the master's build break. The following tests failed due to the import order issues in the master. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49651/consoleFull https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49652/consoleFull https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49653/consoleFull Author: gatorsmile <gatorsmile@gmail.com> Closes #10823 from gatorsmile/importOrder.	2016-01-18 19:40:10 -08:00
Davies Liu	323d51f1da	[SPARK-12700] [SQL] embed condition into SMJ and BroadcastHashJoin Currently SortMergeJoin and BroadcastHashJoin do not support condition, the need a followed Filter for that, the result projection to generate UnsafeRow could be very expensive if they generate lots of rows and could be filtered mostly by condition. This PR brings the support of condition for SortMergeJoin and BroadcastHashJoin, just like other outer joins do. This could improve the performance of Q72 by 7x (from 120s to 16.5s). Author: Davies Liu <davies@databricks.com> Closes #10653 from davies/filter_join.	2016-01-18 17:29:54 -08:00
Reynold Xin	39ac56fc60	[SPARK-12889][SQL] Rename ParserDialect -> ParserInterface. Based on discussions in #10801, I'm submitting a pull request to rename ParserDialect to ParserInterface. Author: Reynold Xin <rxin@databricks.com> Closes #10817 from rxin/SPARK-12889.	2016-01-18 17:10:32 -08:00
Wenchen Fan	404190221a	[SPARK-12882][SQL] simplify bucket tests and add more comments Right now, the bucket tests are kind of hard to understand, this PR simplifies them and add more commetns. Author: Wenchen Fan <wenchen@databricks.com> Closes #10813 from cloud-fan/bucket-comment.	2016-01-18 15:10:04 -08:00
Wenchen Fan	4f11e3f2aa	[SPARK-12841][SQL] fix cast in filter In SPARK-10743 we wrap cast with `UnresolvedAlias` to give `Cast` a better alias if possible. However, for cases like `filter`, the `UnresolvedAlias` can't be resolved and actually we don't need a better alias for this case. This PR move the cast wrapping logic to `Column.named` so that we will only do it when we need a alias name. Author: Wenchen Fan <wenchen@databricks.com> Closes #10781 from cloud-fan/bug.	2016-01-18 14:15:27 -08:00
Reynold Xin	38c3c0e31a	[SPARK-12855][SQL] Remove parser dialect developer API This pull request removes the public developer parser API for external parsers. Given everything a parser depends on (e.g. logical plans and expressions) are internal and not stable, external parsers will break with every release of Spark. It is a bad idea to create the illusion that Spark actually supports pluggable parsers. In addition, this also reduces incentives for 3rd party projects to contribute parse improvements back to Spark. Author: Reynold Xin <rxin@databricks.com> Closes #10801 from rxin/SPARK-12855.	2016-01-18 13:55:42 -08:00
Reynold Xin	44fcf992aa	[SPARK-12873][SQL] Add more comment in HiveTypeCoercion for type widening I was reading this part of the analyzer code again and got confused by the difference between findWiderTypeForTwo and findTightestCommonTypeOfTwo. I also simplified WidenSetOperationTypes to make it a lot simpler. The easiest way to review this one is to just read the original code, and the new code. The logic is super simple. Author: Reynold Xin <rxin@databricks.com> Closes #10802 from rxin/SPARK-12873.	2016-01-18 11:08:44 -08:00
Dilip Biswal	db9a860589	[SPARK-12558][FOLLOW-UP] AnalysisException when multiple functions applied in GROUP BY clause Addresses the comments from Yin. https://github.com/apache/spark/pull/10520 Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #10758 from dilipbiswal/spark-12558-followup.	2016-01-18 10:28:01 -08:00
Wenchen Fan	cede7b2a11	[SPARK-12860] [SQL] speed up safe projection for primitive types The idea is simple, use `SpecificMutableRow` instead of `GenericMutableRow` as result row for safe projection. A simple benchmark shows about 1.5x speed up for primitive types, code: https://gist.github.com/cloud-fan/fa77713ccebf0823b2ab#file-safeprojectionbenchmark-scala Author: Wenchen Fan <wenchen@databricks.com> Closes #10790 from cloud-fan/safe-projection.	2016-01-17 09:11:43 -08:00
Davies Liu	3c0d2365d5	[SPARK-12796] [SQL] Whole stage codegen This is the initial work for whole stage codegen, it support Projection/Filter/Range, we will continue work on this to support more physical operators. A micro benchmark show that a query with range, filter and projection could be 3X faster then before. It's turned on by default. For a tree that have at least two chained plans, a WholeStageCodegen will be inserted into it, for example, the following plan ``` Limit 10 +- Project [(id#5L + 1) AS (id + 1)#6L] +- Filter ((id#5L & 1) = 1) +- Range 0, 1, 4, 10, [id#5L] ``` will be translated into ``` Limit 10 +- WholeStageCodegen +- Project [(id#1L + 1) AS (id + 1)#2L] +- Filter ((id#1L & 1) = 1) +- Range 0, 1, 4, 10, [id#1L] ``` Here is the call graph to generate Java source for A and B (A support codegen, but B does not): ``` * WholeStageCodegen Plan A FakeInput Plan B * ========================================================================= * * -> execute() * \| * doExecute() --------> produce() * \| * doProduce() -------> produce() * \| * doProduce() ---> execute() * \| * consume() * doConsume() ------------\| * \| * doConsume() <----- consume() ``` A SparkPlan that support codegen need to implement doProduce() and doConsume(): ``` def doProduce(ctx: CodegenContext): (RDD[InternalRow], String) def doConsume(ctx: CodegenContext, child: SparkPlan, input: Seq[ExprCode]): String ``` Author: Davies Liu <davies@databricks.com> Closes #10735 from davies/whole2.	2016-01-16 10:29:27 -08:00
Wenchen Fan	2f7d0b68a2	[SPARK-12856] [SQL] speed up hashCode of unsafe array We iterate the bytes to calculate hashCode before, but now we have `Murmur3_x86_32.hashUnsafeBytes` that don't require the bytes to be word algned, we should use that instead. A simple benchmark shows it's about 3 X faster, benchmark code: https://gist.github.com/cloud-fan/fa77713ccebf0823b2ab#file-arrayhashbenchmark-scala Author: Wenchen Fan <wenchen@databricks.com> Closes #10784 from cloud-fan/array-hashcode.	2016-01-16 00:38:17 -08:00
Davies Liu	242efb7546	[SPARK-12840] [SQL] Support passing arbitrary objects (not just expressions) into code generated classes This is a refactor to support codegen for aggregation and broadcast join. Author: Davies Liu <davies@databricks.com> Closes #10777 from davies/rename2.	2016-01-15 19:07:42 -08:00
Nong Li	9039333c0a	[SPARK-12644][SQL] Update parquet reader to be vectorized. This inlines a few of the Parquet decoders and adds vectorized APIs to support decoding in batch. There are a few particulars in the Parquet encodings that make this much more efficient. In particular, RLE encodings are very well suited for batch decoding. The Parquet 2.0 encodings are also very suited for this. This is a work in progress and does not affect the current execution. In subsequent patches, we will support more encodings and types before enabling this. Simple benchmarks indicate this can decode single ints about > 3x faster. Author: Nong Li <nong@databricks.com> Author: Nong <nongli@gmail.com> Closes #10593 from nongli/spark-12644.	2016-01-15 17:40:26 -08:00
Wenchen Fan	3b5ccb12b8	[SPARK-12649][SQL] support reading bucketed table This PR adds the support to read bucketed tables, and correctly populate `outputPartitioning`, so that we can avoid shuffle for some cases. TODO(follow-up PRs): * bucket pruning * avoid shuffle for bucketed table join when use any super-set of the bucketing key. (we should re-visit it after https://issues.apache.org/jira/browse/SPARK-12704 is fixed) * recognize hive bucketed table Author: Wenchen Fan <wenchen@databricks.com> Closes #10604 from cloud-fan/bucket-read.	2016-01-15 17:20:01 -08:00
Yin Huai	f6ddbb360a	[SPARK-12833][HOT-FIX] Reset the locale after we set it. Author: Yin Huai <yhuai@databricks.com> Closes #10778 from yhuai/resetLocale.	2016-01-15 16:03:05 -08:00
Herman van Hovell	7cd7f22025	[SPARK-12575][SQL] Grammar parity with existing SQL parser In this PR the new CatalystQl parser stack reaches grammar parity with the old Parser-Combinator based SQL Parser. This PR also replaces all uses of the old Parser, and removes it from the code base. Although the existing Hive and SQL parser dialects were mostly the same, some kinks had to be worked out: - The SQL Parser allowed syntax like ```APPROXIMATE(0.01) COUNT(DISTINCT a)```. In order to make this work we needed to hardcode approximate operators in the parser, or we would have to create an approximate expression. ```APPROXIMATE_COUNT_DISTINCT(a, 0.01)``` would also do the job and is much easier to maintain. So, this PR removes this keyword. - The old SQL Parser supports ```LIMIT``` clauses in nested queries. This is not supported anymore. See https://github.com/apache/spark/pull/10689 for the rationale for this. - Hive has a charset name char set literal combination it supports, for instance the following expression ```_ISO-8859-1 0x4341464562616265``` would yield this string: ```CAFEbabe```. Hive will only allow charset names to start with an underscore. This is quite annoying in spark because as soon as you use a tuple names will start with an underscore. In this PR we remove this feature from the parser. It would be quite easy to implement such a feature as an Expression later on. - Hive and the SQL Parser treat decimal literals differently. Hive will turn any decimal into a ```Double``` whereas the SQL Parser would convert a non-scientific decimal into a ```BigDecimal```, and would turn a scientific decimal into a Double. We follow Hive's behavior here. The new parser supports a big decimal literal, for instance: ```81923801.42BD```, which can be used when a big decimal is needed. cc rxin viirya marmbrus yhuai cloud-fan Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10745 from hvanhovell/SPARK-12575-2.	2016-01-15 15:19:10 -08:00
Wenchen Fan	3f1c58d60b	[SQL][MINOR] BoundReference do not need to be NamedExpression We made it a `NamedExpression` to workaroud some hacky cases long time ago, and now seems it's safe to remove it. Author: Wenchen Fan <wenchen@databricks.com> Closes #10765 from cloud-fan/minor.	2016-01-15 14:20:22 -08:00
Julien Baley	0bb73554a9	Fix typo disvoered => discovered Author: Julien Baley <julien.baley@gmail.com> Closes #10773 from julienbaley/patch-1.	2016-01-15 13:53:20 -08:00
Yin Huai	513266c042	[SPARK-12833][HOT-FIX] Fix scala 2.11 compilation. Seems `5f83c6991c` breaks scala 2.11 compilation. Author: Yin Huai <yhuai@databricks.com> Closes #10774 from yhuai/fixScala211Compile.	2016-01-15 13:17:29 -08:00
Hossein	5f83c6991c	[SPARK-12833][SQL] Initial import of spark-csv CSV is the most common data format in the "small data" world. It is often the first format people want to try when they see Spark on a single node. Having to rely on a 3rd party component for this leads to poor user experience for new users. This PR merges the popular spark-csv data source package (https://github.com/databricks/spark-csv) with SparkSQL. This is a first PR to bring the functionality to spark 2.0 master. We will complete items outlines in the design document (see JIRA attachment) in follow up pull requests. Author: Hossein <hossein@databricks.com> Author: Reynold Xin <rxin@databricks.com> Closes #10766 from rxin/csv.	2016-01-15 11:46:46 -08:00
Davies Liu	c5e7076da7	[MINOR] [SQL] GeneratedExpressionCode -> ExprCode GeneratedExpressionCode is too long Author: Davies Liu <davies@databricks.com> Closes #10767 from davies/renaming.	2016-01-15 08:26:20 -08:00
Reynold Xin	fe7246fea6	[SPARK-12830] Java style: disallow trailing whitespaces. Author: Reynold Xin <rxin@databricks.com> Closes #10764 from rxin/SPARK-12830.	2016-01-14 23:33:45 -08:00
Michael Armbrust	cc7af86afd	[SPARK-12813][SQL] Eliminate serialization for back to back operations The goal of this PR is to eliminate unnecessary translations when there are back-to-back `MapPartitions` operations. In order to achieve this I also made the following simplifications: - Operators no longer have hold encoders, instead they have only the expressions that they need. The benefits here are twofold: the expressions are visible to transformations so go through the normal resolution/binding process. now that they are visible we can change them on a case by case basis. - Operators no longer have type parameters. Since the engine is responsible for its own type checking, having the types visible to the complier was an unnecessary complication. We still leverage the scala compiler in the companion factory when constructing a new operator, but after this the types are discarded. Deferred to a follow up PR: - Remove as much of the resolution/binding from Dataset/GroupedDataset as possible. We should still eagerly check resolution and throw an error though in the case of mismatches for an `as` operation. - Eliminate serializations in more cases by adding more cases to `EliminateSerialization` Author: Michael Armbrust <michael@databricks.com> Closes #10747 from marmbrus/encoderExpressions.	2016-01-14 17:44:56 -08:00
Reynold Xin	902667fd27	[SPARK-12771][SQL] Simplify CaseWhen code generation The generated code for CaseWhen uses a control variable "got" to make sure we do not evaluate more branches once a branch is true. Changing that to generate just simple "if / else" would be slightly more efficient. This closes #10737. Author: Reynold Xin <rxin@databricks.com> Closes #10755 from rxin/SPARK-12771.	2016-01-14 10:09:03 -08:00
Wenchen Fan	962e9bcf94	[SPARK-12756][SQL] use hash expression in Exchange This PR makes bucketing and exchange share one common hash algorithm, so that we can guarantee the data distribution is same between shuffle and bucketed data source, which enables us to only shuffle one side when join a bucketed table and a normal one. This PR also fixes the tests that are broken by the new hash behaviour in shuffle. Author: Wenchen Fan <wenchen@databricks.com> Closes #10703 from cloud-fan/use-hash-expr-in-shuffle.	2016-01-13 22:43:28 -08:00
Reynold Xin	cbbcd8e425	[SPARK-12791][SQL] Simplify CaseWhen by breaking "branches" into "conditions" and "values" This pull request rewrites CaseWhen expression to break the single, monolithic "branches" field into a sequence of tuples (Seq[(condition, value)]) and an explicit optional elseValue field. Prior to this pull request, each even position in "branches" represents the condition for each branch, and each odd position represents the value for each branch. The use of them have been pretty confusing with a lot sliding windows or grouped(2) calls. Author: Reynold Xin <rxin@databricks.com> Closes #10734 from rxin/simplify-case.	2016-01-13 12:44:35 -08:00
Wenchen Fan	c2ea79f96a	[SPARK-12642][SQL] improve the hash expression to be decoupled from unsafe row https://issues.apache.org/jira/browse/SPARK-12642 Author: Wenchen Fan <wenchen@databricks.com> Closes #10694 from cloud-fan/hash-expr.	2016-01-13 12:29:02 -08:00
Liang-Chi Hsieh	63eee86cc6	[SPARK-9297] [SQL] Add covar_pop and covar_samp JIRA: https://issues.apache.org/jira/browse/SPARK-9297 Add two aggregation functions: covar_pop and covar_samp. Author: Liang-Chi Hsieh <viirya@gmail.com> Author: Liang-Chi Hsieh <viirya@appier.com> Closes #10029 from viirya/covar-funcs.	2016-01-13 10:26:55 -08:00
Kousuke Saruta	cb7b864a24	[SPARK-12692][BUILD][SQL] Scala style: Fix the style violation (Space before ",") Fix the style violation (space before , and :). This PR is a followup for #10643 and rework of #10685 . Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #10732 from sarutak/SPARK-12692-followup-sql.	2016-01-12 22:25:20 -08:00
Dilip Biswal	dc7b3870fc	[SPARK-12558][SQL] AnalysisException when multiple functions applied in GROUP BY clause cloud-fan Can you please take a look ? In this case, we are failing during check analysis while validating the aggregation expression. I have added a semanticEquals for HiveGenericUDF to fix this. Please let me know if this is the right way to address this issue. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #10520 from dilipbiswal/spark-12558.	2016-01-12 21:41:46 -08:00
Reynold Xin	b3b9ad23cf	[SPARK-12788][SQL] Simplify BooleanEquality by using casts. Author: Reynold Xin <rxin@databricks.com> Closes #10730 from rxin/SPARK-12788.	2016-01-12 18:45:55 -08:00
Nong Li	9247084962	[SPARK-12785][SQL] Add ColumnarBatch, an in memory columnar format for execution. There are many potential benefits of having an efficient in memory columnar format as an alternate to UnsafeRow. This patch introduces ColumnarBatch/ColumnarVector which starts this effort. The remaining implementation can be done as follow up patches. As stated in the in the JIRA, there are useful external components that operate on memory in a simple columnar format. ColumnarBatch would serve that purpose and could server as a zero-serialization/zero-copy exchange for this use case. This patch supports running the underlying data either on heap or off heap. On heap runs a bit faster but we would need offheap for zero-copy exchanges. Currently, this mode is hidden behind one interface (ColumnVector). This differs from Parquet or the existing columnar cache because this is not intended to be used as a storage format. The focus is entirely on CPU efficiency as we expect to only have 1 of these batches in memory per task. The layout of the values is just dense arrays of the value type. Author: Nong Li <nong@databricks.com> Author: Nong <nongli@gmail.com> Closes #10628 from nongli/spark-12635.	2016-01-12 18:21:04 -08:00
Cheng Lian	8ed5f12d2b	[SPARK-12724] SQL generation support for persisted data source tables This PR implements SQL generation support for persisted data source tables. A new field `metastoreTableIdentifier: Option[TableIdentifier]` is added to `LogicalRelation`. When a `LogicalRelation` representing a persisted data source relation is created, this field holds the database name and table name of the relation. Author: Cheng Lian <lian@databricks.com> Closes #10712 from liancheng/spark-12724-datasources-sql-gen.	2016-01-12 14:19:53 -08:00
Reynold Xin	0d543b98f3	Revert "[SPARK-12692][BUILD][SQL] Scala style: Fix the style violation (Space before "," or ":")" This reverts commit `8cfa218f4f`.	2016-01-12 12:56:52 -08:00
Reynold Xin	0ed430e315	[SPARK-12768][SQL] Remove CaseKeyWhen expression This patch removes CaseKeyWhen expression and replaces it with a factory method that generates the equivalent CaseWhen. This reduces the amount of code we'd need to maintain in the future for both code generation and optimizer. Note that we introduced CaseKeyWhen to avoid duplicate evaluations of the key. This is no longer a problem because we now have common subexpression elimination. Author: Reynold Xin <rxin@databricks.com> Closes #10722 from rxin/SPARK-12768.	2016-01-12 11:13:08 -08:00
Robert Kruszewski	508592b1ba	[SPARK-9843][SQL] Make catalyst optimizer pass pluggable at runtime Let me know whether you'd like to see it in other place Author: Robert Kruszewski <robertk@palantir.com> Closes #10210 from robert3005/feature/pluggable-optimizer.	2016-01-12 11:09:28 -08:00
Reynold Xin	1d88879530	[SPARK-12762][SQL] Add unit test for SimplifyConditionals optimization rule This pull request does a few small things: 1. Separated if simplification from BooleanSimplification and created a new rule SimplifyConditionals. In the future we can also simplify other conditional expressions here. 2. Added unit test for SimplifyConditionals. 3. Renamed SimplifyCaseConversionExpressionsSuite to SimplifyStringCaseConversionSuite Author: Reynold Xin <rxin@databricks.com> Closes #10716 from rxin/SPARK-12762.	2016-01-12 10:58:57 -08:00
Kousuke Saruta	8cfa218f4f	[SPARK-12692][BUILD][SQL] Scala style: Fix the style violation (Space before "," or ":") Fix the style violation (space before , and :). This PR is a followup for #10643. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #10718 from sarutak/SPARK-12692-followup-sql.	2016-01-12 00:51:00 -08:00
Yin Huai	aaa2c3b628	[SPARK-11823] Ignores HiveThriftBinaryServerSuite's test jdbc cancel https://issues.apache.org/jira/browse/SPARK-11823 This test often hangs and times out, leaving hanging processes. Let's ignore it for now and improve the test. Author: Yin Huai <yhuai@databricks.com> Closes #10715 from yhuai/SPARK-11823-ignore.	2016-01-11 19:59:15 -08:00
Cheng Lian	36d493509d	[SPARK-12498][SQL][MINOR] BooleanSimplication simplification Scala syntax allows binary case classes to be used as infix operator in pattern matching. This PR makes use of this syntax sugar to make `BooleanSimplification` more readable. Author: Cheng Lian <lian@databricks.com> Closes #10445 from liancheng/boolean-simplification-simplification.	2016-01-11 18:42:26 -08:00
wangfei	473907adf6	[SPARK-12742][SQL] org.apache.spark.sql.hive.LogicalPlanToSQLSuite failure due to Table already exists exception ``` [info] Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.hive.LogicalPlanToSQLSuite * ABORTED * (325 milliseconds) [info] org.apache.spark.sql.AnalysisException: Table `t1` already exists.; [info] at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:296) [info] at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:285) [info] at org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:33) [info] at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187) [info] at org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:23) [info] at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253) [info] at org.apache.spark.sql.hive.LogicalPlanToSQLSuite.run(LogicalPlanToSQLSuite.scala:23) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462) [info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671) [info] at sbt.ForkMain$Run$2.call(ForkMain.java:296) [info] at sbt.ForkMain$Run$2.call(ForkMain.java:286) [info] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [info] at java.lang.Thread.run(Thread.java:745) ``` /cc liancheng Author: wangfei <wangfei_hello@126.com> Closes #10682 from scwf/fix-test.	2016-01-11 18:18:44 -08:00
Herman van Hovell	fe9eb0b0ce	[SPARK-12576][SQL] Enable expression parsing in CatalystQl The PR allows us to use the new SQL parser to parse SQL expressions such as: ```1 + sin(x*x)``` We enable this functionality in this PR, but we will not start using this actively yet. This will be done as soon as we have reached grammar parity with the existing parser stack. cc rxin Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10649 from hvanhovell/SPARK-12576.	2016-01-11 16:29:37 -08:00
Anatoliy Plastinin	9559ac5f74	[SPARK-12744][SQL] Change parsing JSON integers to timestamps to treat integers as number of seconds JIRA: https://issues.apache.org/jira/browse/SPARK-12744 This PR makes parsing JSON integers to timestamps consistent with casting behavior. Author: Anatoliy Plastinin <anatoliy.plastinin@gmail.com> Closes #10687 from antlypls/fix-json-timestamp-parsing.	2016-01-11 10:28:57 -08:00
Wenchen Fan	f253feff62	[SPARK-12539][FOLLOW-UP] always sort in partitioning writer address comments in #10498 , especially https://github.com/apache/spark/pull/10498#discussion_r49021259 Author: Wenchen Fan <wenchen@databricks.com> This patch had conflicts when merged, resolved by Committer: Reynold Xin <rxin@databricks.com> Closes #10638 from cloud-fan/bucket-write.	2016-01-11 00:44:33 -08:00
Marcelo Vanzin	6439a82503	[SPARK-3873][BUILD] Enable import ordering error checking. Turn import ordering violations into build errors, plus a few adjustments to account for how the checker behaves. I'm a little on the fence about whether the existing code is right, but it's easier to appease the checker than to discuss what's the more correct order here. Plus a few fixes to imports that cropped in since my recent cleanups. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10612 from vanzin/SPARK-3873-enable.	2016-01-10 20:04:50 -08:00
Reynold Xin	b23c4521f5	[SPARK-12340] Fix overflow in various take functions. This is a follow-up for the original patch #10562. Author: Reynold Xin <rxin@databricks.com> Closes #10670 from rxin/SPARK-12340.	2016-01-09 11:21:58 -08:00
Liang-Chi Hsieh	95cd5d95ce	[SPARK-12577] [SQL] Better support of parentheses in partition by and order by clause of window function's over clause JIRA: https://issues.apache.org/jira/browse/SPARK-12577 Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10620 from viirya/fix-parentheses.	2016-01-08 21:48:06 -08:00
Cheng Lian	d9447cac74	[SPARK-12593][SQL] Converts resolved logical plan back to SQL This PR tries to enable Spark SQL to convert resolved logical plans back to SQL query strings. For now, the major use case is to canonicalize Spark SQL native view support. The major entry point is `SQLBuilder.toSQL`, which returns an `Option[String]` if the logical plan is recognized. The current version is still in WIP status, and is quite limited. Known limitations include: 1. The logical plan must be analyzed but not optimized The optimizer erases `Subquery` operators, which contain necessary scope information for SQL generation. Future versions should be able to recover erased scope information by inserting subqueries when necessary. 1. The logical plan must be created using HiveQL query string Query plans generated by composing arbitrary DataFrame API combinations are not supported yet. Operators within these query plans need to be rearranged into a canonical form that is more suitable for direct SQL generation. For example, the following query plan ``` Filter (a#1 < 10) +- MetastoreRelation default, src, None ``` need to be canonicalized into the following form before SQL generation: ``` Project [a#1, b#2, c#3] +- Filter (a#1 < 10) +- MetastoreRelation default, src, None ``` Otherwise, the SQL generation process will have to handle a large number of special cases. 1. Only a fraction of expressions and basic logical plan operators are supported in this PR Currently, 95.7% (1720 out of 1798) query plans in `HiveCompatibilitySuite` can be successfully converted to SQL query strings. Known unsupported components are: - Expressions - Part of math expressions - Part of string expressions (buggy?) - Null expressions - Calendar interval literal - Part of date time expressions - Complex type creators - Special `NOT` expressions, e.g. `NOT LIKE` and `NOT IN` - Logical plan operators/patterns - Cube, rollup, and grouping set - Script transformation - Generator - Distinct aggregation patterns that fit `DistinctAggregationRewriter` analysis rule - Window functions Support for window functions, generators, and cubes etc. will be added in follow-up PRs. This PR leverages `HiveCompatibilitySuite` for testing SQL generation in a "round-trip" manner: * For all select queries, we try to convert it back to SQL * If the query plan is convertible, we parse the generated SQL into a new logical plan * Run the new logical plan instead of the original one If the query plan is inconvertible, the test case simply falls back to the original logic. TODO - [x] Fix failed test cases - [x] Support for more basic expressions and logical plan operators (e.g. distinct aggregation etc.) - [x] Comments and documentation Author: Cheng Lian <lian@databricks.com> Closes #10541 from liancheng/sql-generation.	2016-01-08 14:08:13 -08:00
Liang-Chi Hsieh	cfe1ba56e4	[SPARK-12687] [SQL] Support from clause surrounded by `()`. JIRA: https://issues.apache.org/jira/browse/SPARK-12687 Some queries such as `(select 1 as a) union (select 2 as a)` can't work. This patch fixes it. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10660 from viirya/fix-union.	2016-01-08 09:50:41 -08:00
Sean Owen	b9c8353378	[SPARK-12618][CORE][STREAMING][SQL] Clean up build warnings: 2.0.0 edition Fix most build warnings: mostly deprecated API usages. I'll annotate some of the changes below. CC rxin who is leading the charge to remove the deprecated APIs. Author: Sean Owen <sowen@cloudera.com> Closes #10570 from srowen/SPARK-12618.	2016-01-08 17:47:44 +00:00
Reynold Xin	726bd3c4ec	Fix indentation for the previous patch.	2016-01-07 21:15:43 -08:00
Kevin Yu	5028a001d5	[SPARK-12317][SQL] Support units (m,k,g) in SQLConf This PR is continue from previous closed PR 10314. In this PR, SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE will be taken memory string conventions as input. For example, the user can now specify 10g for SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE in SQLConf file. marmbrus srowen : Can you help review this code changes ? Thanks. Author: Kevin Yu <qyu@us.ibm.com> Closes #10629 from kevinyu98/spark-12317.	2016-01-07 21:13:17 -08:00
Kazuaki Ishizaki	34dbc8af21	[SPARK-12580][SQL] Remove string concatenations from usage and extended in @ExpressionDescription Use multi-line string literals for ExpressionDescription with ``// scalastyle:off line.size.limit`` and ``// scalastyle:on line.size.limit`` The policy is here, as describe at https://github.com/apache/spark/pull/10488 Let's use multi-line string literals. If we have to have a line with more than 100 characters, let's use ``// scalastyle:off line.size.limit`` and ``// scalastyle:on line.size.limit`` to just bypass the line number requirement. Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #10524 from kiszk/SPARK-12580.	2016-01-07 13:56:34 -08:00
Jacek Laskowski	07b314a57a	[MINOR] Fix for BUILD FAILURE for Scala 2.11 It was introduced in `917d3fc069` /cc cloud-fan rxin Author: Jacek Laskowski <jacek@japila.pl> Closes #10636 from jaceklaskowski/fix-for-build-failure-2.11.	2016-01-07 10:39:46 -08:00
Sameer Agarwal	f194d9911a	[SPARK-12662][SQL] Fix DataFrame.randomSplit to avoid creating overlapping splits https://issues.apache.org/jira/browse/SPARK-12662 cc yhuai Author: Sameer Agarwal <sameer@databricks.com> Closes #10626 from sameeragarwal/randomsplit.	2016-01-07 10:37:15 -08:00
Davies Liu	fd1dcfaf26	[SPARK-12542][SQL] support except/intersect in HiveQl Parse the SQL query with except/intersect in FROM clause for HivQL. Author: Davies Liu <davies@databricks.com> Closes #10622 from davies/intersect.	2016-01-06 23:46:12 -08:00
Davies Liu	6a1c864ab6	[SPARK-12295] [SQL] external spilling for window functions This PR manage the memory used by window functions (buffered rows), also enable external spilling. After this PR, we can run window functions on a partition with hundreds of millions of rows with only 1G. Author: Davies Liu <davies@databricks.com> Closes #10605 from davies/unsafe_window.	2016-01-06 23:21:52 -08:00
Nong Li	a74d743cc7	[SPARK-12640][SQL] Add simple benchmarking utility class and add Parquet scan benchmarks. [SPARK-12640][SQL] Add simple benchmarking utility class and add Parquet scan benchmarks. We've run benchmarks ad hoc to measure the scanner performance. We will continue to invest in this and it makes sense to get these benchmarks into code. This adds a simple benchmarking utility to do this. Author: Nong Li <nong@databricks.com> Author: Nong <nongli@gmail.com> Closes #10589 from nongli/spark-12640.	2016-01-06 19:20:43 -08:00
Wenchen Fan	917d3fc069	[SPARK-12539][SQL] support writing bucketed table This PR adds bucket write support to Spark SQL. User can specify bucketing columns, numBuckets and sorting columns with or without partition columns. For example: ``` df.write.partitionBy("year").bucketBy(8, "country").sortBy("amount").saveAsTable("sales") ``` When bucketing is used, we will calculate bucket id for each record, and group the records by bucket id. For each group, we will create a file with bucket id in its name, and write data into it. For each bucket file, if sorting columns are specified, the data will be sorted before write. Note that there may be multiply files for one bucket, as the data is distributed. Currently we store the bucket metadata at hive metastore in a non-hive-compatible way. We use different bucketing hash function compared to hive, so we can't be compatible anyway. Limitations: * Can't write bucketed data without hive metastore. * Can't insert bucketed data into existing hive tables. Author: Wenchen Fan <wenchen@databricks.com> Closes #10498 from cloud-fan/bucket-write.	2016-01-06 16:58:10 -08:00
Davies Liu	6f7ba6409a	[SPARK-12681] [SQL] split IdentifiersParser.g into two files To avoid to have a huge Java source (over 64K loc), that can't be compiled. cc hvanhovell Author: Davies Liu <davies@databricks.com> Closes #10624 from davies/split_ident.	2016-01-06 15:54:00 -08:00
Herman van Hovell	ea489f14f1	[SPARK-12573][SPARK-12574][SQL] Move SQL Parser from Hive to Catalyst This PR moves a major part of the new SQL parser to Catalyst. This is a prelude to start using this parser for all of our SQL parsing. The following key changes have been made: The ANTLR Parser & Supporting classes have been moved to the Catalyst project. They are now part of the ```org.apache.spark.sql.catalyst.parser``` package. These classes contained quite a bit of code that was originally from the Hive project, I have added aknowledgements whenever this applied. All Hive dependencies have been factored out. I have also taken this chance to clean-up the ```ASTNode``` class, and to improve the error handling. The HiveQl object that provides the functionality to convert an AST into a LogicalPlan has been refactored into three different classes, one for every SQL sub-project: - ```CatalystQl```: This implements Query and Expression parsing functionality. - ```SparkQl```: This is a subclass of CatalystQL and provides SQL/Core only functionality such as Explain and Describe. - ```HiveQl```: This is a subclass of ```SparkQl``` and this adds Hive-only functionality to the parser such as Analyze, Drop, Views, CTAS & Transforms. This class still depends on Hive. cc rxin Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10583 from hvanhovell/SPARK-12575.	2016-01-06 11:16:53 -08:00
Yash Datta	9061e777fd	[SPARK-11878][SQL] Eliminate distribute by in case group by is present with exactly the same grouping expressi For queries like : select <> from table group by a distribute by a we can eliminate distribute by ; since group by will anyways do a hash partitioning Also applicable when user uses Dataframe API Author: Yash Datta <Yash.Datta@guavus.com> Closes #9858 from saucam/eliminatedistribute.	2016-01-06 10:37:53 -08:00
QiangCai	5d871ea43e	[SPARK-12340][SQL] fix Int overflow in the SparkPlan.executeTake, RDD.take and AsyncRDDActions.takeAsync I have closed pull request https://github.com/apache/spark/pull/10487. And I create this pull request to resolve the problem. spark jira https://issues.apache.org/jira/browse/SPARK-12340 Author: QiangCai <david.caiq@gmail.com> Closes #10562 from QiangCai/bugfix.	2016-01-06 18:13:07 +09:00
Liang-Chi Hsieh	b2467b3810	[SPARK-12578][SQL] Distinct should not be silently ignored when used in an aggregate function with OVER clause JIRA: https://issues.apache.org/jira/browse/SPARK-12578 Slightly update to Hive parser. We should keep the distinct keyword when used in an aggregate function with OVER clause. So the CheckAnalysis will detect it and throw exception later. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10557 from viirya/keep-distinct-hivesql.	2016-01-06 00:40:14 -08:00
Marcelo Vanzin	b3ba1be3b7	[SPARK-3873][TESTS] Import ordering fixes. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10582 from vanzin/SPARK-3873-tests.	2016-01-05 19:07:39 -08:00
sureshthalamati	0d42292f6a	[SPARK-12504][SQL] Masking credentials in the sql plan explain output for JDBC data sources. This fix masks JDBC credentials in the explain output. URL patterns to specify credential seems to be vary between different databases. Added a new method to dialect to mask the credentials according to the database specific URL pattern. While adding tests I noticed explain output includes array variable for partitions ([Lorg.apache.spark.Partition;3ff74546,). Modified the code to include the first, and last partition information. Author: sureshthalamati <suresh.thalamati@gmail.com> Closes #10452 from sureshthalamati/mask_jdbc_credentials_spark-12504.	2016-01-05 17:48:05 -08:00
Marcelo Vanzin	df8bd97520	[SPARK-3873][SQL] Import ordering fixes. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10573 from vanzin/SPARK-3873-sql.	2016-01-05 16:48:59 -08:00
Nong	c26d174265	[SPARK-12636] [SQL] Update UnsafeRowParquetRecordReader to support reading files directly. As noted in the code, this change is to make this component easier to test in isolation. Author: Nong <nongli@gmail.com> Closes #10581 from nongli/spark-12636.	2016-01-05 13:47:24 -08:00
Liang-Chi Hsieh	d202ad2fc2	[SPARK-12439][SQL] Fix toCatalystArray and MapObjects JIRA: https://issues.apache.org/jira/browse/SPARK-12439 In toCatalystArray, we should look at the data type returned by dataTypeFor instead of silentSchemaFor, to determine if the element is native type. An obvious problem is when the element is Option[Int] class, catalsilentSchemaFor will return Int, then we will wrongly recognize the element is native type. There is another problem when using Option as array element. When we encode data like Seq(Some(1), Some(2), None) with encoder, we will use MapObjects to construct an array for it later. But in MapObjects, we don't check if the return value of lambdaFunction is null or not. That causes a bug that the decoded data for Seq(Some(1), Some(2), None) would be Seq(1, 2, -1), instead of Seq(1, 2, null). Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10391 from viirya/fix-catalystarray.	2016-01-05 12:33:21 -08:00
Reynold Xin	8ce645d4ee	[SPARK-12615] Remove some deprecated APIs in RDD/SparkContext I looked at each case individually and it looks like they can all be removed. The only one that I had to think twice was toArray (I even thought about un-deprecating it, until I realized it was a problem in Java to have toArray returning java.util.List). Author: Reynold Xin <rxin@databricks.com> Closes #10569 from rxin/SPARK-12615.	2016-01-05 11:10:14 -08:00
Wenchen Fan	76768337be	[SPARK-12480][FOLLOW-UP] use a single column vararg for hash address comments in #10435 This makes the API easier to use if user programmatically generate the call to hash, and they will get analysis exception if the arguments of hash is empty. Author: Wenchen Fan <wenchen@databricks.com> Closes #10588 from cloud-fan/hash.	2016-01-05 10:23:36 -08:00
Liang-Chi Hsieh	b3c48e39f4	[SPARK-12438][SQL] Add SQLUserDefinedType support for encoder JIRA: https://issues.apache.org/jira/browse/SPARK-12438 ScalaReflection lacks the support of SQLUserDefinedType. We should add it. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10390 from viirya/encoder-udt.	2016-01-05 10:19:56 -08:00
Michael Armbrust	53beddc5bf	[SPARK-12568][SQL] Add BINARY to Encoders Author: Michael Armbrust <michael@databricks.com> Closes #10516 from marmbrus/datasetCleanup.	2016-01-04 23:23:41 -08:00
Reynold Xin	b634901bb2	[SPARK-12600][SQL] follow up: add range check for DecimalType This addresses davies' code review feedback in https://github.com/apache/spark/pull/10559 Author: Reynold Xin <rxin@databricks.com> Closes #10586 from rxin/remove-deprecated-sql-followup.	2016-01-04 21:05:27 -08:00
Wenchen Fan	b1a771231e	[SPARK-12480][SQL] add Hash expression that can calculate hash value for a group of expressions just write the arguments into unsafe row and use murmur3 to calculate hash code Author: Wenchen Fan <wenchen@databricks.com> Closes #10435 from cloud-fan/hash-expr.	2016-01-04 18:49:41 -08:00
Reynold Xin	77ab49b857	[SPARK-12600][SQL] Remove deprecated methods in Spark SQL Author: Reynold Xin <rxin@databricks.com> Closes #10559 from rxin/remove-deprecated-sql.	2016-01-04 18:02:38 -08:00
Narine Kokhlikyan	fdfac22d08	[SPARK-12509][SQL] Fixed error messages for DataFrame correlation and covariance Currently, when we call corr or cov on dataframe with invalid input we see these error messages for both corr and cov: - "Currently cov supports calculating the covariance between two columns" - "Covariance calculation for columns with dataType "[DataType Name]" not supported." I've fixed this issue by passing the function name as an argument. We could also do the input checks separately for each function. I avoided doing that because of code duplication. Thanks! Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com> Closes #10458 from NarineK/sparksqlstatsmessages.	2016-01-04 16:14:49 -08:00
Nong Li	34de24abb5	[SPARK-12589][SQL] Fix UnsafeRowParquetRecordReader to properly set the row length. The reader was previously not setting the row length meaning it was wrong if there were variable length columns. This problem does not manifest usually, since the value in the column is correct and projecting the row fixes the issue. Author: Nong Li <nong@databricks.com> Closes #10576 from nongli/spark-12589.	2016-01-04 14:58:24 -08:00
Davies Liu	d084a2de32	[SPARK-12541] [SQL] support cube/rollup as function This PR enable cube/rollup as function, so they can be used as this: ``` select a, b, sum(c) from t group by rollup(a, b) ``` Author: Davies Liu <davies@databricks.com> Closes #10522 from davies/rollup.	2016-01-04 14:26:56 -08:00
Herman van Hovell	0171b71e95	[SPARK-12421][SQL] Prevent Internal/External row from exposing state. It is currently possible to change the values of the supposedly immutable ```GenericRow``` and ```GenericInternalRow``` classes. This is caused by the fact that scala's ArrayOps ```toArray``` (returned by calling ```toSeq```) will return the backing array instead of a copy. This PR fixes this problem. This PR was inspired by https://github.com/apache/spark/pull/10374 by apo1. cc apo1 sarutak marmbrus cloud-fan nongli (everyone in the previous conversation). Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10553 from hvanhovell/SPARK-12421.	2016-01-04 12:41:57 -08:00
tedyu	40d03960d7	[DOC] Adjust coverage for partitionBy() This is the related thread: http://search-hadoop.com/m/q3RTtO3ReeJ1iF02&subj=Re+partitioning+json+data+in+spark Michael suggested fixing the doc. Please review. Author: tedyu <yuzhihong@gmail.com> Closes #10499 from ted-yu/master.	2016-01-04 12:38:04 -08:00
Xiu Guo	573ac55d74	[SPARK-12512][SQL] support column name with dot in withColumn() Author: Xiu Guo <xguo27@gmail.com> Closes #10500 from xguo27/SPARK-12512.	2016-01-04 12:34:04 -08:00
Pete Robbins	b504b6a90a	[SPARK-12470] [SQL] Fix size reduction calculation also only allocate required buffer size Author: Pete Robbins <robbinspg@gmail.com> Closes #10421 from robbinspg/master.	2016-01-04 10:43:21 -08:00
Josh Rosen	6c83d938cc	[SPARK-12579][SQL] Force user-specified JDBC driver to take precedence Spark SQL's JDBC data source allows users to specify an explicit JDBC driver to load (using the `driver` argument), but in the current code it's possible that the user-specified driver will not be used when it comes time to actually create a JDBC connection. In a nutshell, the problem is that you might have multiple JDBC drivers on the classpath that claim to be able to handle the same subprotocol, so simply registering the user-provided driver class with the our `DriverRegistry` and JDBC's `DriverManager` is not sufficient to ensure that it's actually used when creating the JDBC connection. This patch addresses this issue by first registering the user-specified driver with the DriverManager, then iterating over the driver manager's loaded drivers in order to obtain the correct driver and use it to create a connection (previously, we just called `DriverManager.getConnection()` directly). If a user did not specify a JDBC driver to use, then we call `DriverManager.getDriver` to figure out the class of the driver to use, then pass that class's name to executors; this guards against corner-case bugs in situations where the driver and executor JVMs might have different sets of JDBC drivers on their classpaths (previously, there was the (rare) potential for `DriverManager.getConnection()` to use different drivers on the driver and executors if the user had not explicitly specified a JDBC driver class and the classpaths were different). This patch is inspired by a similar patch that I made to the `spark-redshift` library (https://github.com/databricks/spark-redshift/pull/143), which contains its own modified fork of some of Spark's JDBC data source code (for cross-Spark-version compatibility reasons). Author: Josh Rosen <joshrosen@databricks.com> Closes #10519 from JoshRosen/jdbc-driver-precedence.	2016-01-04 10:39:42 -08:00
Xiu Guo	84f8492c15	[SPARK-12562][SQL] DataFrame.write.format(text) requires the column name to be called value Author: Xiu Guo <xguo27@gmail.com> Closes #10515 from xguo27/SPARK-12562.	2016-01-03 20:48:56 -08:00
Cazen	b8410ff9ce	[SPARK-12537][SQL] Add option to accept quoting of all character backslash quoting mechanism We can provides the option to choose JSON parser can be enabled to accept quoting of all character or not. Author: Cazen <Cazen@korea.com> Author: Cazen Lee <cazen.lee@samsung.com> Author: Cazen Lee <Cazen@korea.com> Author: cazen.lee <cazen.lee@samsung.com> Closes #10497 from Cazen/master.	2016-01-03 17:01:19 -08:00
thomastechs	c82924d564	[SPARK-12533][SQL] hiveContext.table() throws the wrong exception Avoiding the the No such table exception and throwing analysis exception as per the bug: SPARK-12533 Author: thomastechs <thomas.sebastian@tcs.com> Closes #10529 from thomastechs/topic-branch.	2016-01-03 11:09:30 -08:00
Reynold Xin	6c5bbd628a	Revert "Revert "[SPARK-12286][SPARK-12290][SPARK-12294][SPARK-12284][SQL] always output UnsafeRow"" This reverts commit `44ee920fd4`.	2016-01-02 22:39:25 -08:00
Reynold Xin	513e3b092c	[SPARK-12599][MLLIB][SQL] Remove the use of callUDF in MLlib callUDF has been deprecated. However, we do not have an alternative for users to specify the output data type without type tags. This pull request introduced a new API for that, and replaces the invocation of the deprecated callUDF with that. Author: Reynold Xin <rxin@databricks.com> Closes #10547 from rxin/SPARK-12599.	2016-01-02 22:31:39 -08:00
Sean Owen	15bd73627e	[SPARK-12481][CORE][STREAMING][SQL] Remove usage of Hadoop deprecated APIs and reflection that supported 1.x Remove use of deprecated Hadoop APIs now that 2.2+ is required Author: Sean Owen <sowen@cloudera.com> Closes #10446 from srowen/SPARK-12481.	2016-01-02 13:15:53 +00:00
hyukjinkwon	94f7a12b3c	[SPARK-10180][SQL] JDBC datasource are not processing EqualNullSafe filter This PR is followed by https://github.com/apache/spark/pull/8391. Previous PR fixes JDBCRDD to support null-safe equality comparison for JDBC datasource. This PR fixes the problem that it can actually return null as a result of the comparison resulting error as using the value of that comparison. Author: hyukjinkwon <gurwls223@gmail.com> Author: HyukjinKwon <gurwls223@gmail.com> Closes #8743 from HyukjinKwon/SPARK-10180.	2016-01-02 00:04:48 -08:00
Herman van Hovell	970635a9f8	[SPARK-12362][SQL][WIP] Inline Hive Parser This PR inlines the Hive SQL parser in Spark SQL. The previous (merged) incarnation of this PR passed all tests, but had and still has problems with the build. These problems are caused by a the fact that - for some reason - in some cases the ANTLR generated code is not included in the compilation fase. This PR is a WIP and should not be merged until we have sorted out the build issues. Author: Herman van Hovell <hvanhovell@questtec.nl> Author: Nong Li <nong@databricks.com> Author: Nong Li <nongli@gmail.com> Closes #10525 from hvanhovell/SPARK-12362.	2016-01-01 23:22:50 -08:00
Reynold Xin	44ee920fd4	Revert "[SPARK-12286][SPARK-12290][SPARK-12294][SPARK-12284][SQL] always output UnsafeRow" This reverts commit `0da7bd50dd`.	2016-01-01 19:23:06 -08:00
Davies Liu	0da7bd50dd	[SPARK-12286][SPARK-12290][SPARK-12294][SPARK-12284][SQL] always output UnsafeRow It's confusing that some operator output UnsafeRow but some not, easy to make mistake. This PR change to only output UnsafeRow for all the operators (SparkPlan), removed the rule to insert Unsafe/Safe conversions. For those that can't output UnsafeRow directly, added UnsafeProjection into them. Closes #10330 cc JoshRosen rxin Author: Davies Liu <davies@databricks.com> Closes #10511 from davies/unsafe_row.	2016-01-01 13:39:20 -08:00
Cheng Lian	01a29866b1	[SPARK-12592][SQL][TEST] Don't mute Spark loggers in TestHive.reset() There's a hack done in `TestHive.reset()`, which intended to mute noisy Hive loggers. However, Spark testing loggers are also muted. Author: Cheng Lian <lian@databricks.com> Closes #10540 from liancheng/spark-12592.dont-mute-spark-loggers.	2016-01-01 13:24:09 -08:00
Liang-Chi Hsieh	ad5b7cfcca	[SPARK-12409][SPARK-12387][SPARK-12391][SQL] Refactor filter pushdown for JDBCRDD and add few filters This patch refactors the filter pushdown for JDBCRDD and also adds few filters. Added filters are basically from #10468 with some refactoring. Test cases are from #10468. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10470 from viirya/refactor-jdbc-filter.	2016-01-01 00:54:25 -08:00
Liang-Chi Hsieh	c9dbfcc653	[SPARK-11743][SQL] Move the test for arrayOfUDT A following pr for #9712. Move the test for arrayOfUDT. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10538 from viirya/move-udt-test.	2015-12-31 23:48:05 -08:00
Yin Huai	5cdecb1841	[SPARK-12039][SQL] Re-enable HiveSparkSubmitSuite's SPARK-9757 Persist Parquet relation with decimal column https://issues.apache.org/jira/browse/SPARK-12039 since we do not support hadoop1, we can re-enable this test in master. Author: Yin Huai <yhuai@databricks.com> Closes #10533 from yhuai/SPARK-12039-enable.	2015-12-31 01:33:21 -08:00
Davies Liu	e6c77874b9	[SPARK-12585] [SQL] move numFields to constructor of UnsafeRow Right now, numFields will be passed in by pointTo(), then bitSetWidthInBytes is calculated, making pointTo() a little bit heavy. It should be part of constructor of UnsafeRow. Author: Davies Liu <davies@databricks.com> Closes #10528 from davies/numFields.	2015-12-30 22:16:37 -08:00
Herman van Hovell	f76ee109d8	[SPARK-8641][SPARK-12455][SQL] Native Spark Window functions - Follow-up (docs & tests) This PR is a follow-up for PR https://github.com/apache/spark/pull/9819. It adds documentation for the window functions and a couple of NULL tests. The documentation was largely based on the documentation in (the source of) Hive and Presto: * https://prestodb.io/docs/current/functions/window.html * https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics I am not sure if we need to add the licenses of these two projects to the licenses directory. They are both under the ASL. srowen any thoughts? cc yhuai Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10402 from hvanhovell/SPARK-8641-docs.	2015-12-30 16:51:07 -08:00
Takeshi YAMAMURO	5c2682b0c8	[SPARK-12409][SPARK-12387][SPARK-12391][SQL] Support AND/OR/IN/LIKE push-down filters for JDBC This is rework from #10386 and add more tests and LIKE push-down support. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #10468 from maropu/SupportMorePushdownInJdbc.	2015-12-30 13:34:37 -08:00
Wenchen Fan	aa48164a43	[SPARK-12495][SQL] use true as default value for propagateNull in NewInstance Most of cases we should propagate null when call `NewInstance`, and so far there is only one case we should stop null propagation: create product/java bean. So I think it makes more sense to propagate null by dafault. This also fixes a bug when encode null array/map, which is firstly discovered in https://github.com/apache/spark/pull/10401 Author: Wenchen Fan <wenchen@databricks.com> Closes #10443 from cloud-fan/encoder.	2015-12-30 10:56:08 -08:00
Reynold Xin	27af6157f9	Revert "[SPARK-12362][SQL][WIP] Inline Hive Parser" This reverts commit `b600bccf41` due to non-deterministic build breaks.	2015-12-30 00:08:44 -08:00
gatorsmile	4f75f785df	[SPARK-12564][SQL] Improve missing column AnalysisException ``` org.apache.spark.sql.AnalysisException: cannot resolve 'value' given input columns text; ``` lets put a `:` after `columns` and put the columns in `[]` so that they match the toString of DataFrame. Author: gatorsmile <gatorsmile@gmail.com> Closes #10518 from gatorsmile/improveAnalysisExceptionMsg.	2015-12-29 22:28:59 -08:00
Nong Li	b600bccf41	[SPARK-12362][SQL][WIP] Inline Hive Parser This is a WIP. The PR has been taken over from nongli (see https://github.com/apache/spark/pull/10420). I have removed some additional dead code, and fixed a few issues which were caused by the fact that the inlined Hive parser is newer than the Hive parser we currently use in Spark. I am submitting this PR in order to get some feedback and testing done. There is quite a bit of work to do: - [ ] Get it to pass jenkins build/test. - [ ] Aknowledge Hive-project for using their parser. - [ ] Refactorings between HiveQl and the java classes. - [ ] Create our own ASTNode and integrate the current implicit extentions. - [ ] Move remaining ```SemanticAnalyzer``` and ```ParseUtils``` functionality to ```HiveQl```. - [ ] Removing Hive dependencies from the parser. This will require some edits in the grammar files. - [ ] Introduce our own context which needs to contain a ```TokenRewriteStream```. - [ ] Add ```useSQL11ReservedKeywordsForIdentifier``` and ```allowQuotedId``` to the catalyst or sql configuration. - [ ] Remove ```HiveConf``` from grammar files &HiveQl, and pass in our own configuration. - [ ] Moving the parser into sql/core. cc nongli rxin Author: Herman van Hovell <hvanhovell@questtec.nl> Author: Nong Li <nong@databricks.com> Author: Nong Li <nongli@gmail.com> Closes #10509 from hvanhovell/SPARK-12362.	2015-12-29 18:47:41 -08:00
Reynold Xin	270a659584	[SPARK-12549][SQL] Take Option[Seq[DataType]] in UDF input type specification. In Spark we allow UDFs to declare its expected input types in order to apply type coercion. The expected input type parameter takes a Seq[DataType] and uses Nil when no type coercion is applied. It makes more sense to take Option[Seq[DataType]] instead, so we can differentiate a no-arg function vs function with no expected input type specified. Author: Reynold Xin <rxin@databricks.com> Closes #10504 from rxin/SPARK-12549.	2015-12-29 16:58:23 -08:00
Hossein	f6ecf14333	[SPARK-11199][SPARKR] Improve R context management story and add getOrCreate * Changes api.r.SQLUtils to use ```SQLContext.getOrCreate``` instead of creating a new context. * Adds a simple test [SPARK-11199] #comment link with JIRA Author: Hossein <hossein@databricks.com> Closes #9185 from falaki/SPARK-11199.	2015-12-29 11:44:20 -08:00
Kazuaki Ishizaki	8e629b10cb	[SPARK-12530][BUILD] Fix build break at Spark-Master-Maven-Snapshots from #1293 Compilation error caused due to string concatenations that are not a constant Use raw string literal to avoid string concatenations https://amplab.cs.berkeley.edu/jenkins/view/Spark-Packaging/job/Spark-Master-Maven-Snapshots/1293/ Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Closes #10488 from kiszk/SPARK-12530.	2015-12-29 10:35:23 -08:00
Takeshi YAMAMURO	73862a1eb9	[SPARK-11394][SQL] Throw IllegalArgumentException for unsupported types in postgresql If DataFrame has BYTE types, throws an exception: org.postgresql.util.PSQLException: ERROR: type "byte" does not exist Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #9350 from maropu/FixBugInPostgreJdbc.	2015-12-28 21:28:32 -08:00
Reynold Xin	1a91be8078	[SPARK-12547][SQL] Tighten scala style checker enforcement for UDF registration We use scalastyle:off to turn off style checks in certain places where it is not possible to follow the style guide. This is usually ok. However, in udf registration, we disable the checker for a large amount of code simply because some of them exceed 100 char line limit. It is better to just disable the line limit check rather than everything. In this pull request, I only disabled line length check, and fixed a problem (lack explicit types for public methods). Author: Reynold Xin <rxin@databricks.com> Closes #10501 from rxin/SPARK-12547.	2015-12-28 20:43:06 -08:00
gatorsmile	043135819c	[SPARK-12522][SQL][MINOR] Add the missing document strings for the SQL configuration Fixing the missing the document for the configuration. We can see the missing messages "TODO" when issuing the command "SET -V". ``` spark.sql.columnNameOfCorruptRecord spark.sql.hive.verifyPartitionPath spark.sql.sources.parallelPartitionDiscovery.threshold spark.sql.hive.convertMetastoreParquet.mergeSchema spark.sql.hive.convertCTAS spark.sql.hive.thriftServer.async ``` Author: gatorsmile <gatorsmile@gmail.com> Closes #10471 from gatorsmile/commandDesc.	2015-12-28 17:22:18 -08:00
Shixiong Zhu	710b411729	[SPARK-12489][CORE][SQL][MLIB] Fix minor issues found by FindBugs Include the following changes: 1. Close `java.sql.Statement` 2. Fix incorrect `asInstanceOf`. 3. Remove unnecessary `synchronized` and `ReentrantLock`. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10440 from zsxwing/findbugs.	2015-12-28 15:01:51 -08:00
gatorsmile	01ba95d8bf	[SPARK-12441][SQL] Fixing missingInput in Generate/MapPartitions/AppendColumns/MapGroups/CoGroup When explain any plan with Generate, we will see an exclamation mark in the plan. Normally, when we see this mark, it means the plan has an error. This PR is to correct the `missingInput` in `Generate`. For example, ```scala val df = Seq((1, "a b c"), (2, "a b"), (3, "a")).toDF("number", "letters") val df2 = df.explode('letters) { case Row(letters: String) => letters.split(" ").map(Tuple1(_)).toSeq } df2.explain(true) ``` Before the fix, the plan is like ``` == Parsed Logical Plan == 'Generate UserDefinedGenerator('letters), true, false, None +- Project [_1#0 AS number#2,_2#1 AS letters#3] +- LocalRelation [_1#0,_2#1], [[1,a b c],[2,a b],[3,a]] == Analyzed Logical Plan == number: int, letters: string, _1: string Generate UserDefinedGenerator(letters#3), true, false, None, [_1#8] +- Project [_1#0 AS number#2,_2#1 AS letters#3] +- LocalRelation [_1#0,_2#1], [[1,a b c],[2,a b],[3,a]] == Optimized Logical Plan == Generate UserDefinedGenerator(letters#3), true, false, None, [_1#8] +- LocalRelation [number#2,letters#3], [[1,a b c],[2,a b],[3,a]] == Physical Plan == !Generate UserDefinedGenerator(letters#3), true, false, [number#2,letters#3,_1#8] +- LocalTableScan [number#2,letters#3], [[1,a b c],[2,a b],[3,a]] ``` Updates: The same issues are also found in the other four Dataset operators: `MapPartitions`/`AppendColumns`/`MapGroups`/`CoGroup`. Fixed all these four. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #10393 from gatorsmile/generateExplain.	2015-12-28 12:48:30 -08:00
Stephan Kessler	a6a4812434	[SPARK-7727][SQL] Avoid inner classes in RuleExecutor Moved (case) classes Strategy, Once, FixedPoint and Batch to the companion object. This is necessary if we want to have the Optimizer easily extendable in the following sense: Usually a user wants to add additional rules, and just take the ones that are already there. However, inner classes made that impossible since the code did not compile This allows easy extension of existing Optimizers see the DefaultOptimizerExtendableSuite for a corresponding test case. Author: Stephan Kessler <stephan.kessler@sap.com> Closes #10174 from stephankessler/SPARK-7727.	2015-12-28 12:46:20 -08:00
gatorsmile	e01c6c8664	[SPARK-12287][SQL] Support UnsafeRow in MapPartitions/MapGroups/CoGroup Support Unsafe Row in MapPartitions/MapGroups/CoGroup. Added a test case for MapPartitions. Since MapGroups and CoGroup are built on AppendColumns, all the related dataset test cases already can verify the correctness when MapGroups and CoGroup processing unsafe rows. davies cloud-fan Not sure if my understanding is right, please correct me. Thank you! Author: gatorsmile <gatorsmile@gmail.com> Closes #10398 from gatorsmile/unsafeRowMapGroup.	2015-12-28 12:23:28 -08:00
Kevin Yu	fd50df413f	[SPARK-12231][SQL] create a combineFilters' projection when we call buildPartitionedTableScan Hello Michael & All: We have some issues to submit the new codes in the other PR(#10299), so we closed that PR and open this one with the fix. The reason for the previous failure is that the projection for the scan when there is a filter that is not pushed down (the "left-over" filter) could be different, in elements or ordering, from the original projection. With this new codes, the approach to solve this problem is: Insert a new Project if the "left-over" filter is nonempty and (the original projection is not empty and the projection for the scan has more than one elements which could otherwise cause different ordering in projection). We create 3 test cases to cover the otherwise failure cases. Author: Kevin Yu <qyu@us.ibm.com> Closes #10388 from kevinyu98/spark-12231.	2015-12-28 11:58:33 -08:00
Wenchen Fan	8543997f2d	[HOT-FIX] bypass hive test when parse logical plan to json https://github.com/apache/spark/pull/10311 introduces some rare, non-deterministic flakiness for hive udf tests, see https://github.com/apache/spark/pull/10311#issuecomment-166548851 I can't reproduce it locally, and may need more time to investigate, a quick solution is: bypass hive tests for json serialization. Author: Wenchen Fan <wenchen@databricks.com> Closes #10430 from cloud-fan/hot-fix.	2015-12-28 11:45:44 -08:00
Cheng Lian	8e23d8db7f	[SPARK-12218] Fixes ORC conjunction predicate push down This PR is a follow-up of PR #10362. Two major changes: 1. The fix introduced in #10362 is OK for Parquet, but may disable ORC PPD in many cases PR #10362 stops converting an `AND` predicate if any branch is inconvertible. On the other hand, `OrcFilters` combines all filters into a single big conjunction first and then tries to convert it into ORC `SearchArgument`. This means, if any filter is inconvertible, no filters can be pushed down. This PR fixes this issue by finding out all convertible filters first before doing the actual conversion. The reason behind the current implementation is mostly due to the limitation of ORC `SearchArgument` builder, which is documented in this PR in detail. 1. Copied the `AND` predicate fix for ORC from #10362 to avoid merge conflict. Same as #10362, this PR targets master (2.0.0-SNAPSHOT), branch-1.6, and branch-1.5. Author: Cheng Lian <lian@databricks.com> Closes #10377 from liancheng/spark-12218.fix-orc-conjunction-ppd.	2015-12-28 08:48:44 -08:00
felixcheung	5aa2710c1e	[SPARK-12515][SQL][DOC] minor doc update for read.jdbc Author: felixcheung <felixcheung_m@hotmail.com> Closes #10465 from felixcheung/dfreaderjdbcdoc.	2015-12-28 10:22:45 +00:00
CK50	502476e45c	[SPARK-12010][SQL] Spark JDBC requires support for column-name-free INSERT syntax In the past Spark JDBC write only worked with technologies which support the following INSERT statement syntax (JdbcUtils.scala: insertStatement()): INSERT INTO $table VALUES ( ?, ?, ..., ? ) But some technologies require a list of column names: INSERT INTO $table ( $colNameList ) VALUES ( ?, ?, ..., ? ) This was blocking the use of e.g. the Progress JDBC Driver for Cassandra. Another limitation is that syntax 1 relies no the dataframe field ordering match that of the target table. This works fine, as long as the target table has been created by writer.jdbc(). If the target table contains more columns (not created by writer.jdbc()), then the insert fails due mismatch of number of columns or their data types. This PR switches to the recommended second INSERT syntax. Column names are taken from datafram field names. Author: CK50 <christian.kurz@oracle.com> Closes #10380 from CK50/master-SPARK-12010-2.	2015-12-24 13:39:11 +00:00
pierre-borckmans	43b2a63900	[SPARK-12477][SQL] - Tungsten projection fails for null values in array fields Accessing null elements in an array field fails when tungsten is enabled. It works in Spark 1.3.1, and in Spark > 1.5 with Tungsten disabled. This PR solves this by checking if the accessed element in the array field is null, in the generated code. Example: ``` // Array of String case class AS( as: Seq[String] ) val dfAS = sc.parallelize( Seq( AS ( Seq("a",null,"b") ) ) ).toDF dfAS.registerTempTable("T_AS") for (i <- 0 to 2) { println(i + " = " + sqlContext.sql(s"select as[$i] from T_AS").collect.mkString(","))} ``` With Tungsten disabled: ``` 0 = [a] 1 = [null] 2 = [b] ``` With Tungsten enabled: ``` 0 = [a] 15/12/22 09:32:50 ERROR Executor: Exception in task 7.0 in stage 1.0 (TID 15) java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters$UTF8StringWriter.getSize(UnsafeRowWriters.java:90) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90) at org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) ``` Author: pierre-borckmans <pierre.borckmans@realimpactanalytics.com> Closes #10429 from pierre-borckmans/SPARK-12477_Tungsten-Projection-Null-Element-In-Array.	2015-12-22 23:00:42 -08:00
Liang-Chi Hsieh	50301c0a28	[SPARK-11164][SQL] Add InSet pushdown filter back for Parquet When the filter is ```"b in ('1', '2')"```, the filter is not pushed down to Parquet. Thanks! Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #10278 from gatorsmile/parquetFilterNot.	2015-12-23 14:08:29 +08:00
Cheng Lian	86761e10e1	[SPARK-12478][SQL] Bugfix: Dataset fields of product types can't be null When creating extractors for product types (i.e. case classes and tuples), a null check is missing, thus we always assume input product values are non-null. This PR adds a null check in the extractor expression for product types. The null check is stripped off for top level product fields, which are mapped to the outermost `Row`s, since they can't be null. Thanks cloud-fan for helping investigating this issue! Author: Cheng Lian <lian@databricks.com> Closes #10431 from liancheng/spark-12478.top-level-null-field.	2015-12-23 10:21:00 +08:00
Dilip Biswal	b374a25831	[SPARK-12102][SQL] Cast a non-nullable struct field to a nullable field during analysis Compare both left and right side of the case expression ignoring nullablity when checking for type equality. Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #10156 from dilipbiswal/spark-12102.	2015-12-22 15:21:49 -08:00
Nong Li	575a132797	[SPARK-12471][CORE] Spark daemons will log their pid on start up. Author: Nong Li <nong@databricks.com> Closes #10422 from nongli/12471-pids.	2015-12-22 13:27:28 -08:00
Xiu Guo	b5ce84a1bb	[SPARK-12456][SQL] Add ExpressionDescription to misc functions First try, not sure how much information we need to provide in the usage part. Author: Xiu Guo <xguo27@gmail.com> Closes #10423 from xguo27/SPARK-12456.	2015-12-22 10:44:01 -08:00
hyukjinkwon	364d244a50	[SPARK-11677][SQL][FOLLOW-UP] Add tests for checking the ORC filter creation against pushed down filters. https://issues.apache.org/jira/browse/SPARK-11677 Although it checks correctly the filters by the number of results if ORC filter-push-down is enabled, the filters themselves are not being tested. So, this PR includes the test similarly with `ParquetFilterSuite`. Since the results are checked by `OrcQuerySuite`, this `OrcFilterSuite` only checks if the appropriate filters are created. One thing different with `ParquetFilterSuite` here is, it does not check the results because that is checked in `OrcQuerySuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #10341 from HyukjinKwon/SPARK-11677-followup.	2015-12-23 00:39:49 +08:00
Cheng Lian	42bfde2983	[SPARK-12371][SQL] Runtime nullability check for NewInstance This PR adds a new expression `AssertNotNull` to ensure non-nullable fields of products and case classes don't receive null values at runtime. Author: Cheng Lian <lian@databricks.com> Closes #10331 from liancheng/dataset-nullability-check.	2015-12-22 19:41:44 +08:00
Takeshi YAMAMURO	8c1b867cee	[SPARK-12446][SQL] Add unit tests for JDBCRDD internal functions No tests done for JDBCRDD#compileFilter. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #10409 from maropu/AddTestsInJdbcRdd.	2015-12-22 00:50:05 -08:00
Josh Rosen	2235cd4440	[SPARK-11823][SQL] Fix flaky JDBC cancellation test in HiveThriftBinaryServerSuite This patch fixes a flaky "test jdbc cancel" test in HiveThriftBinaryServerSuite. This test is prone to a race-condition which causes it to block indefinitely with while waiting for an extremely slow query to complete, which caused many Jenkins builds to time out. For more background, see my comments on #6207 (the PR which introduced this test). Author: Josh Rosen <joshrosen@databricks.com> Closes #10425 from JoshRosen/SPARK-11823.	2015-12-21 23:12:05 -08:00
Reynold Xin	0a38637d05	[SPARK-11807] Remove support for Hadoop < 2.2 i.e. Hadoop 1 and Hadoop 2.0 Author: Reynold Xin <rxin@databricks.com> Closes #10404 from rxin/SPARK-11807.	2015-12-21 22:15:52 -08:00
Davies Liu	29cecd4a42	[SPARK-12388] change default compression to lz4 According the benchmark [1], LZ4-java could be 80% (or 30%) faster than Snappy. After changing the compressor to LZ4, I saw 20% improvement on end-to-end time for a TPCDS query (Q4). [1] https://github.com/ning/jvm-compressor-benchmark/wiki cc rxin Author: Davies Liu <davies@databricks.com> Closes #10342 from davies/lz4.	2015-12-21 14:21:43 -08:00
Alex Bozarth	b0849b8aea	[SPARK-12339][SPARK-11206][WEBUI] Added a null check that was removed in Updates made in SPARK-11206 missed an edge case which cause's a NullPointerException when a task is killed. In some cases when a task ends in failure taskMetrics is initialized as null (see JobProgressListener.onTaskEnd()). To address this a null check was added. Before the changes in SPARK-11206 this null check was called at the start of the updateTaskAccumulatorValues() function. Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #10405 from ajbozarth/spark12339.	2015-12-21 14:06:36 -08:00
gatorsmile	4883a5087d	[SPARK-12374][SPARK-12150][SQL] Adding logical/physical operators for Range Based on the suggestions from marmbrus , added logical/physical operators for Range for improving the performance. Also added another API for resolving the JIRA Spark-12150. Could you take a look at my implementation, marmbrus ? If not good, I can rework it. : ) Thank you very much! Author: gatorsmile <gatorsmile@gmail.com> Closes #10335 from gatorsmile/rangeOperators.	2015-12-21 13:46:58 -08:00
Wenchen Fan	7634fe9511	[SPARK-12321][SQL] JSON format for TreeNode (use reflection) An alternative solution for https://github.com/apache/spark/pull/10295 , instead of implementing json format for all logical/physical plans and expressions, use reflection to implement it in `TreeNode`. Here I use pre-order traversal to flattern a plan tree to a plan list, and add an extra field `num-children` to each plan node, so that we can reconstruct the tree from the list. example json: logical plan tree: ``` [ { "class" : "org.apache.spark.sql.catalyst.plans.logical.Sort", "num-children" : 1, "order" : [ [ { "class" : "org.apache.spark.sql.catalyst.expressions.SortOrder", "num-children" : 1, "child" : 0, "direction" : "Ascending" }, { "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference", "num-children" : 0, "name" : "i", "dataType" : "integer", "nullable" : true, "metadata" : { }, "exprId" : { "id" : 10, "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6" }, "qualifiers" : [ ] } ] ], "global" : false, "child" : 0 }, { "class" : "org.apache.spark.sql.catalyst.plans.logical.Project", "num-children" : 1, "projectList" : [ [ { "class" : "org.apache.spark.sql.catalyst.expressions.Alias", "num-children" : 1, "child" : 0, "name" : "i", "exprId" : { "id" : 10, "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6" }, "qualifiers" : [ ] }, { "class" : "org.apache.spark.sql.catalyst.expressions.Add", "num-children" : 2, "left" : 0, "right" : 1 }, { "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference", "num-children" : 0, "name" : "a", "dataType" : "integer", "nullable" : true, "metadata" : { }, "exprId" : { "id" : 0, "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6" }, "qualifiers" : [ ] }, { "class" : "org.apache.spark.sql.catalyst.expressions.Literal", "num-children" : 0, "value" : "1", "dataType" : "integer" } ], [ { "class" : "org.apache.spark.sql.catalyst.expressions.Alias", "num-children" : 1, "child" : 0, "name" : "j", "exprId" : { "id" : 11, "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6" }, "qualifiers" : [ ] }, { "class" : "org.apache.spark.sql.catalyst.expressions.Multiply", "num-children" : 2, "left" : 0, "right" : 1 }, { "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference", "num-children" : 0, "name" : "a", "dataType" : "integer", "nullable" : true, "metadata" : { }, "exprId" : { "id" : 0, "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6" }, "qualifiers" : [ ] }, { "class" : "org.apache.spark.sql.catalyst.expressions.Literal", "num-children" : 0, "value" : "2", "dataType" : "integer" } ] ], "child" : 0 }, { "class" : "org.apache.spark.sql.catalyst.plans.logical.LocalRelation", "num-children" : 0, "output" : [ [ { "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference", "num-children" : 0, "name" : "a", "dataType" : "integer", "nullable" : true, "metadata" : { }, "exprId" : { "id" : 0, "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6" }, "qualifiers" : [ ] } ] ], "data" : [ ] } ] ``` Author: Wenchen Fan <wenchen@databricks.com> Closes #10311 from cloud-fan/toJson-reflection.	2015-12-21 12:47:07 -08:00
Dilip Biswal	474eb21a30	[SPARK-12398] Smart truncation of DataFrame / Dataset toString When a DataFrame or Dataset has a long schema, we should intelligently truncate to avoid flooding the screen with unreadable information. // Standard output [a: int, b: int] // Truncate many top level fields [a: int, b, string ... 10 more fields] // Truncate long inner structs [a: struct<a: Int ... 10 more fields>] Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #10373 from dilipbiswal/spark-12398.	2015-12-21 12:46:06 -08:00
Reynold Xin	f496031bd2	Bump master version to 2.0.0-SNAPSHOT. Author: Reynold Xin <rxin@databricks.com> Closes #10387 from rxin/version-bump.	2015-12-19 15:13:05 -08:00
Kousuke Saruta	6eba655259	[SPARK-12404][SQL] Ensure objects passed to StaticInvoke is Serializable Now `StaticInvoke` receives `Any` as a object and `StaticInvoke` can be serialized but sometimes the object passed is not serializable. For example, following code raises Exception because `RowEncoder#extractorsFor` invoked indirectly makes `StaticInvoke`. ``` case class TimestampContainer(timestamp: java.sql.Timestamp) val rdd = sc.parallelize(1 to 2).map(_ => TimestampContainer(System.currentTimeMillis)) val df = rdd.toDF val ds = df.as[TimestampContainer] val rdd2 = ds.rdd <----------------- invokes extractorsFor indirectory ``` I'll add test cases. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Author: Michael Armbrust <michael@databricks.com> Closes #10357 from sarutak/SPARK-12404.	2015-12-18 14:05:06 -08:00
Yin Huai	41ee7c57ab	[SPARK-12218][SQL] Invalid splitting of nested AND expressions in Data Source filter API JIRA: https://issues.apache.org/jira/browse/SPARK-12218 When creating filters for Parquet/ORC, we should not push nested AND expressions partially. Author: Yin Huai <yhuai@databricks.com> Closes #10362 from yhuai/SPARK-12218.	2015-12-18 10:53:13 -08:00
Davies Liu	4af647c77d	[SPARK-12054] [SQL] Consider nullability of expression in codegen This could simplify the generated code for expressions that is not nullable. This PR fix lots of bugs about nullability. Author: Davies Liu <davies@databricks.com> Closes #10333 from davies/skip_nullable.	2015-12-18 10:09:17 -08:00
Dilip Biswal	ee444fe4b8	[SPARK-11619][SQL] cannot use UDTF in DataFrame.selectExpr Description of the problem from cloud-fan Actually this line: https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L689 When we use `selectExpr`, we pass in `UnresolvedFunction` to `DataFrame.select` and fall in the last case. A workaround is to do special handling for UDTF like we did for `explode`(and `json_tuple` in 1.6), wrap it with `MultiAlias`. Another workaround is using `expr`, for example, `df.select(expr("explode(a)").as(Nil))`, I think `selectExpr` is no longer needed after we have the `expr` function.... Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #9981 from dilipbiswal/spark-11619.	2015-12-18 09:54:30 -08:00
Shixiong Zhu	0370abdfd6	[MINOR] Hide the error logs for 'SQLListenerMemoryLeakSuite' Hide the error logs for 'SQLListenerMemoryLeakSuite' to avoid noises. Most of changes are space changes. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10363 from zsxwing/hide-log.	2015-12-17 18:18:12 -08:00
Herman van Hovell	658f66e620	[SPARK-8641][SQL] Native Spark Window functions This PR removes Hive windows functions from Spark and replaces them with (native) Spark ones. The PR is on par with Hive in terms of features. This has the following advantages: * Better memory management. * The ability to use spark UDAFs in Window functions. cc rxin / yhuai Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #9819 from hvanhovell/SPARK-8641-2.	2015-12-17 15:16:35 -08:00
Reynold Xin	e096a652b9	[SPARK-12397][SQL] Improve error messages for data sources when they are not found Point users to spark-packages.org to find them. Author: Reynold Xin <rxin@databricks.com> Closes #10351 from rxin/SPARK-12397.	2015-12-17 14:16:49 -08:00
Yanbo Liang	6e0771665b	[SQL] Update SQLContext.read.text doc Since we rename the column name from ```text``` to ```value``` for DataFrame load by ```SQLContext.read.text```, we need to update doc. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10349 from yanboliang/text-value.	2015-12-17 09:19:46 -08:00
Davies Liu	a170d34a1b	[SPARK-12395] [SQL] fix resulting columns of outer join For API DataFrame.join(right, usingColumns, joinType), if the joinType is right_outer or full_outer, the resulting join columns could be wrong (will be null). The order of columns had been changed to match that with MySQL and PostgreSQL [1]. This PR also fix the nullability of output for outer join. [1] http://www.postgresql.org/docs/9.2/static/queries-table-expressions.html Author: Davies Liu <davies@databricks.com> Closes #10353 from davies/fix_join.	2015-12-17 08:04:11 -08:00
Yin Huai	9d66c4216a	[SPARK-12057][SQL] Prevent failure on corrupt JSON records This PR makes JSON parser and schema inference handle more cases where we have unparsed records. It is based on #10043. The last commit fixes the failed test and updates the logic of schema inference. Regarding the schema inference change, if we have something like ``` {"f1":1} [1,2,3] ``` originally, we will get a DF without any column. After this change, we will get a DF with columns `f1` and `_corrupt_record`. Basically, for the second row, `[1,2,3]` will be the value of `_corrupt_record`. When merge this PR, please make sure that the author is simplyianm. JIRA: https://issues.apache.org/jira/browse/SPARK-12057 Closes #10043 Author: Ian Macalinao <me@ian.pw> Author: Yin Huai <yhuai@databricks.com> Closes #10288 from yhuai/handleCorruptJson.	2015-12-16 23:18:53 -08:00
tedyu	f590178d7a	[SPARK-12365][CORE] Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called SPARK-9886 fixed ExternalBlockStore.scala This PR fixes the remaining references to Runtime.getRuntime.addShutdownHook() Author: tedyu <yuzhihong@gmail.com> Closes #10325 from ted-yu/master.	2015-12-16 19:02:12 -08:00
hyukjinkwon	9657ee8788	[SPARK-11677][SQL] ORC filter tests all pass if filters are actually not pushed down. Currently ORC filters are not tested properly. All the tests pass even if the filters are not pushed down or disabled. In this PR, I add some logics for this. Since ORC does not filter record by record fully, this checks the count of the result and if it contains the expected values. Author: hyukjinkwon <gurwls223@gmail.com> Closes #9687 from HyukjinKwon/SPARK-11677.	2015-12-16 13:24:49 -08:00
gatorsmile	edf65cd961	[SPARK-12164][SQL] Decode the encoded values and then display Based on the suggestions from marmbrus cloud-fan in https://github.com/apache/spark/pull/10165 , this PR is to print the decoded values(user objects) in `Dataset.show` ```scala implicit val kryoEncoder = Encoders.kryo[KryoClassData] val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), KryoClassData("c", 3)).toDS() ds.show(20, false); ``` The current output is like ``` +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ \|value \| +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ \|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 97, 2]\| \|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 98, 4]\| \|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 99, 6]\| +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ ``` After the fix, it will be like the below if and only if the users override the `toString` function in the class `KryoClassData` ```scala override def toString: String = s"KryoClassData($a, $b)" ``` ``` +-------------------+ \|value \| +-------------------+ \|KryoClassData(a, 1)\| \|KryoClassData(b, 2)\| \|KryoClassData(c, 3)\| +-------------------+ ``` If users do not override the `toString` function, the results will be like ``` +---------------------------------------+ \|value \| +---------------------------------------+ \|org.apache.spark.sql.KryoClassData68ef\| \|org.apache.spark.sql.KryoClassData6915\| \|org.apache.spark.sql.KryoClassData693b\| +---------------------------------------+ ``` Question: Should we add another optional parameter in the function `show`? It will decide if the function `show` will display the hex values or the object values? Author: gatorsmile <gatorsmile@gmail.com> Closes #10215 from gatorsmile/showDecodedValue.	2015-12-16 13:22:34 -08:00

1 2 3 4 5 ...

2762 commits