We want to introduce a new IntervalType in 1.6 that is based only on the number of microseconds,
so intervals can be compared.
This renames the existing IntervalType to CalendarIntervalType so we can do that in the future.
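For intuition, a hedged sketch of the distinction (class names here are illustrative, not Spark's):
```scala
// A duration measured purely in microseconds is totally ordered, so such
// intervals can be compared and sorted.
case class MicrosecondInterval(us: Long) extends Ordered[MicrosecondInterval] {
  def compare(that: MicrosecondInterval): Int = us.compare(that.us)
}

// A calendar interval mixes units of varying length (a month spans 28-31
// days), so "1 month" and "30 days" have no well-defined ordering.
case class CalendarInterval(months: Int, microseconds: Long)
```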
Author: Reynold Xin <rxin@databricks.com>
Closes#7745 from rxin/calendarintervaltype and squashes the following commits:
99f64e8 [Reynold Xin] One more line ...
13466c8 [Reynold Xin] Fixed tests.
e20f24e [Reynold Xin] [SPARK-9430][SQL] Rename IntervalType to CalendarIntervalType.
Author: Reynold Xin <rxin@databricks.com>
Closes#7747 from rxin/SPARK-9127 and squashes the following commits:
e851418 [Reynold Xin] [SPARK-9127][SQL] Rand/Randn codegen fails with long seed.
Following an offline discussion with rxin: it's weird to be computing stuff while sorting; we should only order by bound references during execution.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7593 from cloud-fan/sort and squashes the following commits:
7b1bef7 [Wenchen Fan] add test
daf206d [Wenchen Fan] add more comments
289bee0 [Wenchen Fan] do not order by expressions which still need evaluation
Right now, we use double to parse all the float numbers in SQL. When such a number is used in an expression together with DecimalType, it turns the decimal into a double as well, and using double also loses some precision.
This PR changes the parser to parse a float number as decimal or double, based on whether it uses scientific notation, following https://msdn.microsoft.com/en-us/library/ms179899.aspx
This is a breaking change; should we document it somewhere?
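A hedged sketch of the parsing rule described above (the helper name is illustrative, not the actual SqlParser code):
```scala
// A numeric literal in scientific notation becomes a double; otherwise it
// is kept as an exact decimal.
def parseNumericLiteral(s: String): Any =
  if (s.exists(c => c == 'e' || c == 'E')) s.toDouble
  else BigDecimal(s)

// parseNumericLiteral("1.5e2") == 150.0          (double)
// parseNumericLiteral("1.5")   == BigDecimal(1.5) (exact decimal)
```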
Author: Davies Liu <davies@databricks.com>
Closes#7642 from davies/parse_decimal and squashes the following commits:
1f576d9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into parse_decimal
5e142b6 [Davies Liu] fix scala style
eca99de [Davies Liu] fix tests
2afe702 [Davies Liu] Merge branch 'master' of github.com:apache/spark into parse_decimal
f4a320b [Davies Liu] Update SqlParser.scala
1c48e34 [Davies Liu] use decimal or double when parsing SQL
We will do local projection for LocalRelation, and thus reuse the same Expression object across multiple evaluations. We should reset the mutable states of an Expression before evaluating it.
Fix `PullOutNondeterministic` rule to make it work for `Sort`.
Also took the chance to clean up the DataFrame test suite.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7674 from cloud-fan/show and squashes the following commits:
888934f [Wenchen Fan] fix sort
c0e93e8 [Wenchen Fan] local DataFrame with random columns should return same value when call `show`
https://issues.apache.org/jira/browse/SPARK-9422
Author: Yin Huai <yhuai@databricks.com>
Closes#7737 from yhuai/removePlaceHolder and squashes the following commits:
ec29b44 [Yin Huai] Remove placeholder attributes.
UnsafeRow.getDouble() and getFloat() return NaN when called on columns that are null, which is inconsistent with the behavior of other row classes (which return 0.0).
In addition, the generic get(ordinal, dataType) method should always return null for a null literal, but currently it handles nulls by calling the type-specific accessors.
This patch addresses both of these issues and adds a regression test.
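For illustration, a minimal standalone sketch of the intended contract (not the actual UnsafeRow code):
```scala
// Primitive getters return 0.0 (never NaN) for null columns, and the
// generic getter returns null, consistent with other row classes.
class SimpleRow(values: Array[Option[Double]]) {
  def isNullAt(i: Int): Boolean = values(i).isEmpty
  def getDouble(i: Int): Double = values(i).getOrElse(0.0)         // 0.0, never NaN
  def get(i: Int): Any = if (isNullAt(i)) null else values(i).get  // null for null
}
```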
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7736 from JoshRosen/unsafe-row-null-fixes and squashes the following commits:
c8eb2ee [Josh Rosen] Fix test in UnsafeRowConverterSuite
6214682 [Josh Rosen] Fixes to null handling in UnsafeRow
Sort-merge join is more robust in Spark since sorting can be done using the Tungsten sort operator.
Author: Reynold Xin <rxin@databricks.com>
Closes#7733 from rxin/smj and squashes the following commits:
61e4d34 [Reynold Xin] Fixed test case.
5ffd731 [Reynold Xin] Fixed JoinSuite.
a137dc0 [Reynold Xin] [SPARK-9418][SQL] Use sort-merge join as the default shuffle join.
Since catalyst package already depends on Spark core, we can move those expressions
into catalyst, and simplify function registry.
This is a followup of #7478.
Author: Reynold Xin <rxin@databricks.com>
Closes#7735 from rxin/SPARK-8003 and squashes the following commits:
2ffbdc3 [Reynold Xin] [SPARK-8003][SQL] Move expressions in sql/core package to catalyst.
SparkSQL's ScriptTransform operator has several serious bugs which make debugging fairly difficult:
- If exceptions are thrown in the writing thread then the child process will not be killed, leading to a deadlock because the reader thread will block while waiting for input that will never arrive.
- TaskContext is not propagated to the writer thread, which may cause errors in upstream pipelined operators.
- Exceptions which occur in the writer thread are not propagated to the main reader thread, which may cause upstream errors to be silently ignored instead of killing the job. This can lead to silently incorrect query results.
- The writer thread is not a daemon thread, but it should be.
In addition, the code in this file is extremely messy:
- Lots of fields are nullable but the nullability isn't clearly explained.
- Many confusing variable names: for instance, there are variables named `ite` and `iterator` that are defined in the same scope.
- Some code was misindented.
- The `*serdeClass` variables are actually expected to be single-quoted strings, which is really confusing: I feel that this parsing / extraction should be performed in the analyzer, not in the operator itself.
- There were no unit tests for the operator itself, only end-to-end tests.
This pull request addresses these issues, borrowing some error-handling techniques from PySpark's PythonRDD.
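A hedged sketch of the error-propagation pattern, modeled on PySpark's PythonRDD (names are illustrative, not the actual operator code):
```scala
// The writer thread remembers its failure and kills the child process, so
// the reader thread can re-throw instead of blocking or swallowing errors.
class WriterThread(input: Iterator[String], out: java.io.OutputStream,
                   proc: Process) extends Thread("script-transform-writer") {
  setDaemon(true) // the writer must be a daemon thread

  @volatile var exception: Option[Throwable] = None

  override def run(): Unit = {
    try {
      input.foreach(line => out.write((line + "\n").getBytes("UTF-8")))
      out.close()
    } catch {
      case t: Throwable =>
        exception = Some(t) // surface the failure to the reader thread
        proc.destroy()      // kill the child so the reader doesn't block forever
    }
  }
}

// In the reader thread, re-throw any writer failure instead of ignoring it:
// writerThread.exception.foreach(throw _)
```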
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7710 from JoshRosen/script-transform and squashes the following commits:
16c44e2 [Josh Rosen] Update some comments
983f200 [Josh Rosen] Use unescapeSQLString instead of stripQuotes
6a06a8c [Josh Rosen] Clean up handling of quotes in serde class name
494cde0 [Josh Rosen] Propagate TaskContext to writer thread
323bb2b [Josh Rosen] Fix error-swallowing bug
b31258d [Josh Rosen] Rename iterator variables to disambiguate.
88278de [Josh Rosen] Split ScriptTransformation writer thread into own class.
8b162b6 [Josh Rosen] Add failing test which demonstrates exception masking issue
4ee36a2 [Josh Rosen] Kill script transform subprocess when error occurs in input writer.
bd4c948 [Josh Rosen] Skip launching of external command for empty partitions.
b43e4ec [Josh Rosen] Clean up nullability in ScriptTransformation
fa18d26 [Josh Rosen] Add basic unit test for script transform with 'cat' command.
This PR introduces BytesToBytesMap in UnsafeHashedRelation and uses it in the executor for better performance.
It serializes all the keys and values from the Java HashMap and puts them into a BytesToBytesMap while deserializing. All the values for the same key are stored contiguously for better memory locality.
This PR also addresses the comments on #7480 and does some cleanup.
Author: Davies Liu <davies@databricks.com>
Closes#7592 from davies/unsafe_map2 and squashes the following commits:
42c578a [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_map2
fd09528 [Davies Liu] remove thread local cache and update docs
1c5ad8d [Davies Liu] fix test
5eb1b5a [Davies Liu] address comments in #7480
46f1f22 [Davies Liu] fix style
fc221e0 [Davies Liu] use BytesToBytesMap for broadcast join
This test is flaky. https://issues.apache.org/jira/browse/SPARK-9196 will track the fix of it. For now, let's disable this test.
Author: Yin Huai <yhuai@databricks.com>
Closes#7727 from yhuai/SPARK-9196-ignore and squashes the following commits:
f92bded [Yin Huai] Ignore current_timestamp.
Certain applications would benefit from being able to inspect DataFrames that are straightforwardly produced by file-based data sources and find out their source data. For example, one might want to display to a user the size of the data underlying a table, or to copy or mutate it.
This PR exposes an `inputFiles` method on DataFrame which attempts to discover the source data in a best-effort manner, by inspecting HadoopFsRelations and JSONRelations.
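Example usage of the new method (assuming a `sqlContext` in scope and JSON data at the given path):
```scala
val df = sqlContext.read.json("/data/events")
df.inputFiles.foreach(println) // best-effort list of underlying source files
```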
Author: Aaron Davidson <aaron@databricks.com>
Closes#7717 from aarondav/paths and squashes the following commits:
ff67430 [Aaron Davidson] inputFiles
0acd3ad [Aaron Davidson] [SPARK-9397] DataFrame should provide an API to find source data files if applicable
The original patch didn't handle nulls correctly for next_day.
Author: Reynold Xin <rxin@databricks.com>
Closes#7718 from rxin/next_day and squashes the following commits:
616a425 [Reynold Xin] Merged DatetimeExpressionsSuite into DateFunctionsSuite.
faa78cf [Reynold Xin] Merged DatetimeFunctionsSuite into DateExpressionsSuite.
6c4fb6a [Reynold Xin] [SPARK-8196][SQL] Fix null handling & documentation for next_day.
Author: Reynold Xin <rxin@databricks.com>
Closes#7720 from rxin/struct-followup and squashes the following commits:
d9757f5 [Reynold Xin] [SPARK-9373][SQL] follow up for StructType support in Tungsten projection.
Both expressions already implement code generation.
Author: Reynold Xin <rxin@databricks.com>
Closes#7723 from rxin/abs-formatnum and squashes the following commits:
31ed765 [Reynold Xin] [SPARK-9402][SQL] Remove CodegenFallback from Abs / FormatNumber.
Our CodeFormatter currently does not handle parentheses, and as a result, in code dumps we see code formatted this way:
```
foo(
a,
b,
c)
```
With this patch, it is formatted this way:
```
foo(
  a,
  b,
  c)
```
Author: Reynold Xin <rxin@databricks.com>
Closes#7712 from rxin/codeformat-parentheses and squashes the following commits:
c2b1c5f [Reynold Xin] Took square bracket out
3cfb174 [Reynold Xin] Code review feedback.
91f5bb1 [Reynold Xin] [SPARK-9394][SQL] Handle parentheses in CodeFormatter.
This PR actually contains 3 minor issues:
1) Enable the unit tests (codegen) for mutable expressions (FormatNumber, Regexp_Replace/Regexp_Extract)
2) Use `PlatformDependent.copyMemory` instead of `System.arrayCopy`
Author: Cheng Hao <hao.cheng@intel.com>
Closes#7566 from chenghao-intel/codegen_ut and squashes the following commits:
24f43ea [Cheng Hao] enable codegen for mutable expression & UTF8String performance
This pull request updates GenerateUnsafeProjection to support StructType. If an input struct type is already backed by an UnsafeRow, GenerateUnsafeProjection copies the bytes directly into its buffer space without any conversion. However, if the input is not an UnsafeRow, GenerateUnsafeProjection recursively runs the generated code to convert the input into an UnsafeRow and then copies it into the buffer space.
Also create a TungstenProject operator that projects data directly into UnsafeRow. Note that I'm not sure if this is the way we want to structure Unsafe+codegen operators, but we can defer that decision to follow-up pull requests.
Author: Reynold Xin <rxin@databricks.com>
Closes#7689 from rxin/tungsten-struct-type and squashes the following commits:
9162f42 [Reynold Xin] Support IntervalType in UnsafeRow's getter.
be9f377 [Reynold Xin] Fixed tests.
10c4b7c [Reynold Xin] Format generated code.
77e8d0e [Reynold Xin] Fixed NondeterministicSuite.
ac4951d [Reynold Xin] Yay.
ac203bf [Reynold Xin] More comments.
9f36216 [Reynold Xin] Updated comment.
6b781fe [Reynold Xin] Reset the change in DataFrameSuite.
525b95b [Reynold Xin] Merged with master, more documentation & test cases.
321859a [Reynold Xin] [SPARK-9373][SQL] Support StructType in Tungsten projection [WIP]
As we are adding more and more specialized getters to more classes (ArrayData is coming soon), this interface can help prevent us from missing a method in some interface.
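A hedged sketch of what such an interface looks like (the real one lives in Catalyst and covers more types):
```scala
trait SpecializedGetters {
  def isNullAt(ordinal: Int): Boolean
  def getBoolean(ordinal: Int): Boolean
  def getInt(ordinal: Int): Int
  def getLong(ordinal: Int): Long
  def getDouble(ordinal: Int): Double
  def getUTF8String(ordinal: Int): AnyRef // UTF8String in the real code
}
```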
Author: Reynold Xin <rxin@databricks.com>
Closes#7713 from rxin/SpecializedGetters and squashes the following commits:
3b39be1 [Reynold Xin] Added override modifier.
567ba9c [Reynold Xin] [SPARK-9395][SQL] Create a SpecializedGetters interface to track all the specialized getters.
next_day returns the next date that falls on the given day of the week.
last_day returns the last day of the month that the given date belongs to.
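Example usage via the DataFrame API (assuming a DataFrame `df` with a date column `d`):
```scala
import org.apache.spark.sql.functions.{last_day, next_day}

// The next Sunday after each date, and the month-end of each date.
df.select(next_day(df("d"), "Sunday"), last_day(df("d")))
```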
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#6986 from adrian-wang/udfnlday and squashes the following commits:
ef7e3da [Daoyuan Wang] fix
02b3426 [Daoyuan Wang] address 2 comments
dc69630 [Daoyuan Wang] address comments from rxin
8846086 [Daoyuan Wang] address comments from rxin
d09bcce [Daoyuan Wang] multi fix
1a9de3d [Daoyuan Wang] function next_day and last_day
Since we have been seeing a lot of failures related to this new feature, let's put it behind a flag and turn it off by default.
Author: Michael Armbrust <michael@databricks.com>
Closes#7703 from marmbrus/optionalMetastorePruning and squashes the following commits:
6ad128c [Michael Armbrust] style
8447835 [Michael Armbrust] [SPARK-9386][SQL] Feature flag for metastore partition pruning
fd37b87 [Michael Armbrust] add config flag
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7673 from cloud-fan/row-generic-getter-columnar and squashes the following commits:
88b1170 [Wenchen Fan] fix style
eeae712 [Wenchen Fan] Remove Internal.get generic getter call in columnar cache code
This is a proper version of PR #7693 authored by viirya.
The reason why "CTAS with serde" fails is that the `MetastoreRelation` gets converted to a Parquet data source relation by default.
Author: Cheng Lian <lian@databricks.com>
Closes#7700 from liancheng/spark-9378-fix-ctas-test and squashes the following commits:
4413af0 [Cheng Lian] Fixes test case "CTAS with serde"
https://issues.apache.org/jira/browse/SPARK-9349
With this PR, we only expose `UserDefinedAggregateFunction` (an abstract class) and `MutableAggregationBuffer` (an interface). Other internal wrappers and helper classes are moved to `org.apache.spark.sql.execution.aggregate` and marked as `private[sql]`.
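For illustration, a minimal UDAF written against the public interface, in the spirit of the `MyDoubleSum` example (a sketch, assuming the classes live under `org.apache.spark.sql.expressions` as in released versions; null handling kept minimal):
```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// A minimal sum-of-doubles UDAF.
class DoubleSum extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("sum", DoubleType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0.0

  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getDouble(0) + input.getDouble(0)

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)

  def evaluate(buffer: Row): Any = buffer.getDouble(0)
}
```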
Author: Yin Huai <yhuai@databricks.com>
Closes#7687 from yhuai/UDAF-cleanup and squashes the following commits:
db36542 [Yin Huai] Add comments to UDAF examples.
ae17f66 [Yin Huai] Address comments.
9c9fa5f [Yin Huai] UDAF cleanup.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7688 from cloud-fan/interval and squashes the following commits:
5b36b17 [Wenchen Fan] fix codegen
a99ed50 [Wenchen Fan] address comment
9e6d319 [Wenchen Fan] Support IntervalType in UnsafeRow
Literals in grouping expressions have no effect at all; they only make the grouping key bigger, so we should remove them in the Optimizer.
I also make the old and new aggregation code paths consistent about literals in grouping here: in the old aggregation, literals in grouping are already removed, but in the new aggregation they are not, so I make the removal an explicit Optimizer rule.
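A hedged sketch of such a rule (the object name is illustrative, not the exact Catalyst rule):
```scala
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Foldable (literal) expressions cannot affect group membership, so they
// can be dropped from the grouping key.
object RemoveLiteralsFromGrouping extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case a @ Aggregate(grouping, _, _) if grouping.exists(_.foldable) =>
      a.copy(groupingExpressions = grouping.filter(!_.foldable))
  }
}
```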
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7583 from cloud-fan/minor and squashes the following commits:
471adff [Wenchen Fan] add test
0839925 [Wenchen Fan] use transformDown when rewrite final result expressions
Make this test deterministic, i.e. make sure it passes no matter how many times we run it.
The original implementation uses a random seed, which gives a chance that we may break the null check assertion `assert(Iterator.fill(100)(generator()).contains(null))`.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7691 from cloud-fan/seed and squashes the following commits:
eae7281 [Wenchen Fan] use a seed in RandomDataGeneratorSuite
This patch fixes two bugs in UnsafeExternalSorter and UnsafeExternalRowSorter:
- UnsafeExternalSorter does not properly update freeSpaceInCurrentPage, which can cause it to write past the end of memory pages and trigger segfaults.
- UnsafeExternalRowSorter has a use-after-free bug when returning the last row from an iterator.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7680 from JoshRosen/SPARK-9364 and squashes the following commits:
590f311 [Josh Rosen] null out row
f4cf91d [Josh Rosen] Fix use-after-free bug in UnsafeExternalRowSorter.
8abcf82 [Josh Rosen] Properly decrement freeSpaceInCurrentPage in UnsafeExternalSorter
This PR is based on #6796 authored by rtreffer.
To support large decimal precisions (> 18), we do the following things in this PR:
1. Making `CatalystSchemaConverter` support large decimal precision
Decimal types with large precision are always converted to fixed-length byte array.
2. Making `CatalystRowConverter` support reading decimal values with large precision
When the precision is > 18, we construct `Decimal` values with an unscaled `BigInteger` rather than an unscaled `Long`.
3. Making `RowWriteSupport` support writing decimal values with large precision
In this PR we always write decimals as fixed-length byte arrays, because the Parquet write path hasn't been refactored to conform to the Parquet format spec (see SPARK-6774 & SPARK-8848).
Two follow-up tasks should be done in future PRs:
- [ ] Writing decimals as `INT32`, `INT64` when possible while fixing SPARK-8848
- [ ] Adding compatibility tests as part of SPARK-5463
Author: Cheng Lian <lian@databricks.com>
Closes#7455 from liancheng/spark-4176 and squashes the following commits:
a543d10 [Cheng Lian] Fixes errors introduced while rebasing
9e31cdf [Cheng Lian] Supports decimals with precision > 18 for Parquet
This PR fixes a set of issues related to multi-database. A new data structure `TableIdentifier` is introduced to identify a table among multiple databases. We should stop using a single `String` (table name without database name), or `Seq[String]` (optional database name plus table name) to identify tables internally.
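A hedged sketch of the idea (the real `TableIdentifier` lives in Catalyst and has more helpers):
```scala
// A small identifier type instead of raw strings or string sequences.
case class TableIdentifier(table: String, database: Option[String] = None) {
  override def toString: String =
    database.map(db => s"$db.$table").getOrElse(table)
}
```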
Author: Cheng Lian <lian@databricks.com>
Closes#7623 from liancheng/spark-8131-multi-db and squashes the following commits:
f3bcd4b [Cheng Lian] Addresses PR comments
e0eb76a [Cheng Lian] Fixes styling issues
41e2207 [Cheng Lian] Fixes multi-database support
d4d1ec2 [Cheng Lian] Adds multi-database test cases
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7684 from cloud-fan/hive and squashes the following commits:
da21ffe [Wenchen Fan] fix the support for special chars in column names for hive context
Author: Reynold Xin <rxin@databricks.com>
Closes#7682 from rxin/unsaferow-generic-getter and squashes the following commits:
3063788 [Reynold Xin] Reset the change for real this time.
0f57c55 [Reynold Xin] Reset the changes in ExpressionEvalHelper.
fb6ca30 [Reynold Xin] Support BinaryType.
24a3e46 [Reynold Xin] Added support for DateType/TimestampType.
9989064 [Reynold Xin] JoinedRow.
11f80a3 [Reynold Xin] [SPARK-9368][SQL] Support get(ordinal, dataType) generic getter in UnsafeRow.
JIRA: https://issues.apache.org/jira/browse/SPARK-9306
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes#7645 from viirya/smj_unsortable and squashes the following commits:
a240707 [Liang-Chi Hsieh] Use forall instead of exists for readability.
55221fa [Liang-Chi Hsieh] Shouldn't use SortMergeJoin when joining on unsortable columns.
As Hive does, we need to list all of the registered UDFs and their usage for users.
We add an annotation to describe a UDF, so we can get the literal description info while registering the UDF.
e.g.
```scala
@ExpressionDescription(
  usage = "_FUNC_(expr) - Returns the absolute value of the numeric value",
  extended = """> SELECT _FUNC_('-1')
                1""")
case class Abs(child: Expression) extends UnaryArithmetic {
  ...
```
Author: Cheng Hao <hao.cheng@intel.com>
Closes#7259 from chenghao-intel/desc_function and squashes the following commits:
cf29bba [Cheng Hao] fixing the code style issue
5193855 [Cheng Hao] Add more powerful parser for show functions
c645a6b [Cheng Hao] fix bug in unit test
78d40f1 [Cheng Hao] update the padding issue for usage
48ee4b3 [Cheng Hao] update as feedback
70eb4e9 [Cheng Hao] add show/describe function support
This PR removes the old Parquet support:
- Removes the old `ParquetRelation` together with related SQL configuration, plan nodes, strategies, utility classes, and test suites.
- Renames `ParquetRelation2` to `ParquetRelation`
- Renames `RowReadSupport` and `RowRecordMaterializer` to `CatalystReadSupport` and `CatalystRecordMaterializer` respectively, and moved them to separate files.
This follows the naming convention used in other Parquet data models implemented in parquet-mr, and should make it easier for developers who are familiar with Parquet to follow.
There's still some other code that can be cleaned up. Especially `RowWriteSupport`. But I'd like to leave this part to SPARK-8848.
Author: Cheng Lian <lian@databricks.com>
Closes#7441 from liancheng/spark-9095 and squashes the following commits:
c7b6e38 [Cheng Lian] Removes WriteToFile
2d688d6 [Cheng Lian] Renames ParquetRelation2 to ParquetRelation
ca9e1b7 [Cheng Lian] Removes old Parquet support
JIRA: https://issues.apache.org/jira/browse/SPARK-9356
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes#7671 from yjshen/deprecated_unlimit and squashes the following commits:
c707f56 [Yijie Shen] remove pattern matching in changePrecision
4a1823c [Yijie Shen] remove internal occurrence of Decimal.Unlimited
Replaced them with get(ordinal, datatype) so we can use UnsafeRow here.
I passed the data types throughout.
Author: Reynold Xin <rxin@databricks.com>
Closes#7669 from rxin/row-generic-getter-hive and squashes the following commits:
3467d8e [Reynold Xin] [SPARK-9354][SQL] Remove Internal.get generic getter call in Hive integration code.
Currently UnsafeRow cannot support a generic getter. However, if the data type is known, we can support a generic getter.
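A hedged sketch of the type-directed dispatch (`SpecializedRow` is a hypothetical stand-in for UnsafeRow, covering only a few types):
```scala
import org.apache.spark.sql.types._

trait SpecializedRow {
  def isNullAt(ordinal: Int): Boolean
  def getInt(ordinal: Int): Int
  def getLong(ordinal: Int): Long
  def getDouble(ordinal: Int): Double

  // Given the DataType, the generic getter can route to the matching
  // specialized accessor instead of needing boxed storage.
  def get(ordinal: Int, dataType: DataType): AnyRef =
    if (isNullAt(ordinal)) null
    else dataType match {
      case IntegerType => java.lang.Integer.valueOf(getInt(ordinal))
      case LongType    => java.lang.Long.valueOf(getLong(ordinal))
      case DoubleType  => java.lang.Double.valueOf(getDouble(ordinal))
      case other       => throw new UnsupportedOperationException(other.toString)
    }
}
```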
Author: Reynold Xin <rxin@databricks.com>
Closes#7666 from rxin/generic-getter-with-datatype and squashes the following commits:
ee2874c [Reynold Xin] Add a default implementation for getStruct.
1e109a0 [Reynold Xin] [SPARK-9350][SQL] Introduce an InternalRow generic getter that requires a DataType.
033ee88 [Reynold Xin] Removed getAs in non test code.
Author: Reynold Xin <rxin@databricks.com>
Closes#7665 from rxin/remove-row-apply and squashes the following commits:
0b43001 [Reynold Xin] support getString in UnsafeRow.
176d633 [Reynold Xin] apply -> get.
2941324 [Reynold Xin] [SPARK-9348][SQL] Remove apply method on InternalRow.
Currently, nondeterministic expressions are broken without an explicit initialization phase.
Take `MonotonicallyIncreasingID` as an example. This expression needs mutable state to remember how many times it has been evaluated, so we use `transient var count: Long` there. Being transient, `count` is reset to 0, and only to 0, when the expression is serialized and deserialized, since deserializing a transient variable yields its default value. There is no way to use another initial value for `count` until we add an explicit initialization phase.
Another use case is local execution for `LocalRelation`: there is no serialize/deserialize phase, and thus we can't reset mutable states at all.
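A hedged sketch of what an explicit initialization phase looks like (trait and method names are illustrative):
```scala
trait ExplicitInit {
  private var initialized = false
  protected def initInternal(): Unit // e.g. reset a counter or seed an RNG

  // Called once before any evaluation on a partition, instead of relying
  // on serialization to reset transient state to its default value.
  def initialize(): Unit = {
    initInternal()
    initialized = true
  }

  def assertInitialized(): Unit =
    require(initialized, "expression must be initialized before evaluation")
}
```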
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7535 from cloud-fan/init and squashes the following commits:
6c6f332 [Wenchen Fan] add test
ef68ff4 [Wenchen Fan] fix comments
9eac85e [Wenchen Fan] move init code to interpreted class
bb7d838 [Wenchen Fan] pulls out nondeterministic expressions into a project
b4a4fc7 [Wenchen Fan] revert a refactor
86fee36 [Wenchen Fan] add initialization phase for nondeterministic expression
This is a follow-up of #7626. It fixes `Row`/`InternalRow` conversion for data sources extending `HadoopFsRelation` with `needConversion` being `true`.
Author: Cheng Lian <lian@databricks.com>
Closes#7649 from liancheng/spark-9285-conversion-fix and squashes the following commits:
036a50c [Cheng Lian] Addresses PR comment
f6d7c6a [Cheng Lian] Fixes Row/InternalRow conversion for HadoopFsRelation
The two are redundant.
Once this patch is merged, I plan to remove the inbound conversions from unsafe aggregates.
Author: Reynold Xin <rxin@databricks.com>
Closes#7658 from rxin/unsafeconverters and squashes the following commits:
ed19e6c [Reynold Xin] Updated support types.
2a56d7e [Reynold Xin] [SPARK-9334][SQL] Remove UnsafeRowConverter in favor of UnsafeProjection.
They were added to improve performance (so the JIT can inline the JoinedRow calls). However, we can also just improve performance by projecting the output to UnsafeRow in the Tungsten variants of the operators.
Author: Reynold Xin <rxin@databricks.com>
Closes#7659 from rxin/remove-joinedrows and squashes the following commits:
7510447 [Reynold Xin] [SPARK-9336][SQL] Remove extra JoinedRows
Author: JD <jd@csh.rit.edu>
Author: Joseph Batchik <josephbatchik@gmail.com>
Closes#7606 from JDrit/expr and squashes the following commits:
ad7f607 [Joseph Batchik] fixing python linter error
9d6daea [Joseph Batchik] removed order by per @rxin's comment
707d5c6 [Joseph Batchik] Added expr to fuctions.py
79df83c [JD] added example to the docs
b89eec8 [JD] moved function up as per @rxin's comment
4960909 [JD] updated per @JoshRosen's comment
2cb329c [JD] updated per @rxin's comment
9a9ad0c [JD] removing unused import
6dc26d0 [JD] removed split
7f2222c [JD] Adding expr function as per SPARK-8668
The generated expression code can be hard to read since it is not indented well. This patch adds a code formatter that formats code automatically when we output it to the screen.
Author: Reynold Xin <rxin@databricks.com>
Closes#7656 from rxin/codeformatter and squashes the following commits:
5ba0e90 [Reynold Xin] [SPARK-9331][SQL] Add a code formatter to auto-format generated code.
Also took the chance to rearrange some of the methods in UnsafeRow to group static/private/public things together.
Author: Reynold Xin <rxin@databricks.com>
Closes#7654 from rxin/getStruct and squashes the following commits:
b491a09 [Reynold Xin] Fixed typo.
48d77e5 [Reynold Xin] [SPARK-9330][SQL] Create specialized getStruct getter in InternalRow.
JIRA: https://issues.apache.org/jira/browse/SPARK-9067
According to the description in the JIRA ticket, calling `reader.close()` only after the task is finished causes memory and open-file-limit problems, since these resources stay occupied even when we no longer need them.
This PR simply closes the reader early when we know there is no more data to read.
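A hedged sketch of the early-close pattern (names are illustrative, not the actual code):
```scala
// Release the underlying reader as soon as the iterator is exhausted,
// rather than waiting for task completion.
class EarlyClosingIterator[T](underlying: Iterator[T], close: () => Unit)
  extends Iterator[T] {
  private var closed = false
  def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more && !closed) { close(); closed = true }
    more
  }
  def next(): T = underlying.next()
}
```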
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes#7424 from viirya/close_reader and squashes the following commits:
3ff64e5 [Liang-Chi Hsieh] For comments.
3d20267 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into close_reader
e152182 [Liang-Chi Hsieh] For comments.
5116cbe [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into close_reader
3ceb755 [Liang-Chi Hsieh] For comments.
e34d98e [Liang-Chi Hsieh] For comments.
50ed729 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into close_reader
216912f [Liang-Chi Hsieh] Fix it.
f429016 [Liang-Chi Hsieh] Release reader if we don't need it.
a305621 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into close_reader
67569da [Liang-Chi Hsieh] Close reader early if there is no more data.
This patch extends CheckAnalysis to throw errors for queries that try to sort on unsupported column types, such as ArrayType.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7633 from JoshRosen/SPARK-9295 and squashes the following commits:
23b2fbf [Josh Rosen] Embed function in foreach
bfe1451 [Josh Rosen] Update to allow sorting by null literals
2f1b802 [Josh Rosen] Add analysis rule to detect sorting on unsupported column types (SPARK-9295)
This patch adds an analysis check to ensure that join conditions' data types are BooleanType. This check is necessary in order to report proper errors for non-boolean DataFrame join conditions.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7630 from JoshRosen/SPARK-9292 and squashes the following commits:
aec6c7b [Josh Rosen] Check condition type in resolved()
75a3ea6 [Josh Rosen] Fix SPARK-9292.
It's a thing used in test cases, but it's named Row. Pretty annoying because every time I search for Row, it shows up before the Spark SQL Row, which is what a developer wants most of the time.
Author: Reynold Xin <rxin@databricks.com>
Closes#7638 from rxin/remove-row and squashes the following commits:
aeda52d [Reynold Xin] [SPARK-9305] Rename org.apache.spark.Row to Item.
I also changed InternalRow's size/length function to numFields, to make it more obvious that it is not about bytes, but the number of fields.
Author: Reynold Xin <rxin@databricks.com>
Closes#7626 from rxin/internalRow and squashes the following commits:
e124daf [Reynold Xin] Fixed test case.
805ceb7 [Reynold Xin] Commented out the failed test suite.
f8a9ca5 [Reynold Xin] Fixed more bugs. Still at least one more remaining.
76d9081 [Reynold Xin] Fixed data sources.
7807f70 [Reynold Xin] Fixed DataFrameSuite.
cb60cd2 [Reynold Xin] Code review & small bug fixes.
0a2948b [Reynold Xin] Fixed style.
3280d03 [Reynold Xin] [SPARK-9285][SQL] Remove InternalRow's inheritance from Row.
Address comments for #7605
cc rxin
Author: Davies Liu <davies@databricks.com>
Closes#7634 from davies/decimal_unlimited2 and squashes the following commits:
b2d8b0d [Davies Liu] add doc and test for DecimalType.isWiderThan
65b251c [Davies Liu] fix test
6a91f32 [Davies Liu] fix style
ca9c973 [Davies Liu] address comments
JIRA: https://issues.apache.org/jira/browse/SPARK-8756
Currently, in ParquetRelation2, footers are re-read every time refresh() is called, but we can check whether they have possibly changed before reading them, because reading all footers is expensive when there are many partitions. This PR fixes this by keeping cached information for the check.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes#7154 from viirya/cached_footer_parquet_relation and squashes the following commits:
92e9347 [Liang-Chi Hsieh] Fix indentation.
ae0ec64 [Liang-Chi Hsieh] Fix wrong assignment.
c8fdfb7 [Liang-Chi Hsieh] Fix it.
a52b6d1 [Liang-Chi Hsieh] For comments.
c2a2420 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cached_footer_parquet_relation
fa5458f [Liang-Chi Hsieh] Use Map to cache FileStatus and do merging previously loaded schema and newly loaded one.
6ae0911 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cached_footer_parquet_relation
21bbdec [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cached_footer_parquet_relation
12a0ed9 [Liang-Chi Hsieh] Add check of FileStatus's modification time.
186429d [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cached_footer_parquet_relation
0ef8caf [Liang-Chi Hsieh] Keep cached information and avoid re-calculating footers.
Author: Reynold Xin <rxin@databricks.com>
Closes#7636 from rxin/complex-string-implicit-cast and squashes the following commits:
3e67327 [Reynold Xin] [SPARK-9200][SQL] Don't implicitly cast non-atomic types to string type.
Fix some comments and code style for https://github.com/apache/spark/pull/7458.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7619 from cloud-fan/agg-clean and squashes the following commits:
3925457 [Wenchen Fan] one more...
cc78357 [Wenchen Fan] one more cleanup
26f6a93 [Wenchen Fan] some minor cleanup for the new aggregation
Remove Decimal.Unlimited (change to support precision up to 38, to match Hive and other databases).
In order to keep backward source compatibility, Decimal.Unlimited is still there, but it now maps to Decimal(38, 18).
If no precision and scale are provided, it's Decimal(10, 0) as before.
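Illustrative summary of the defaults described above:
```scala
import org.apache.spark.sql.types.DecimalType

val default   = DecimalType(10, 0)  // used when no precision/scale is given, as before
val unlimited = DecimalType(38, 18) // what the deprecated Decimal.Unlimited now maps to
```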
Author: Davies Liu <davies@databricks.com>
Closes#7605 from davies/decimal_unlimited and squashes the following commits:
aa3f115 [Davies Liu] fix tests and style
fb0d20d [Davies Liu] address comments
bfaae35 [Davies Liu] fix style
df93657 [Davies Liu] address comments and clean up
06727fd [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_unlimited
4c28969 [Davies Liu] fix tests
8d783cc [Davies Liu] fix tests
788631c [Davies Liu] fix double with decimal in Union/except
1779bde [Davies Liu] fix scala style
c9c7c78 [Davies Liu] remove Decimal.Unlimited
PARQUET-136 and PARQUET-173 have been fixed in parquet-mr 1.7.0. It's time to enable filter push-down by default now.
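The behavior remains configurable via its SQL option; to opt out (assuming a `sqlContext` in scope):
```scala
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
```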
Author: Cheng Lian <lian@databricks.com>
Closes#7612 from liancheng/spark-9207 and squashes the following commits:
77e6b5e [Cheng Lian] Enables Parquet filter push-down by default
This patch marks the Unevaluable.eval() and Unevaluable.genCode() methods as final and fixes two cases where they were overridden. It also updates AggregateFunction2 to extend Unevaluable.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7627 from JoshRosen/unevaluable-fix and squashes the following commits:
8d9ed22 [Josh Rosen] AlgebraicAggregate should extend Unevaluable
65329c2 [Josh Rosen] Do not have AggregateFunction1 inherit from AggregateExpression1
fa68a22 [Josh Rosen] Make eval() and genCode() final
Author: David Arroyo Cazorla <darroyo@stratio.com>
Closes#7618 from darroyocazorla/master and squashes the following commits:
5f91379 [David Arroyo Cazorla] [SPARK-5447][SQL] Replace reference 'schema rdd' with DataFrame
We forgot to update the doc. brkyvz
Author: Xiangrui Meng <meng@databricks.com>
Closes#7608 from mengxr/SPARK-9243 and squashes the following commits:
0ea3236 [Xiangrui Meng] null -> zero in crosstab doc
1. When building the latest code with sbt, it throws an exception like:
[error] /home/hcheng/git/catalyst/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala:78: match may not be exhaustive.
[error] It would fail on the following input: UNKNOWN
[error] val classNameByStatus = status match {
[error]
2. Potential performance issue when implicitly converting an Array[Any] to Seq[Any]
Author: Cheng Hao <hao.cheng@intel.com>
Closes#7611 from chenghao-intel/toseq and squashes the following commits:
cab75c5 [Cheng Hao] remove the toArray
24df682 [Cheng Hao] fix building error & performance
A follow-up of https://github.com/apache/spark/pull/7446.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7607 from cloud-fan/tmp and squashes the following commits:
7106989 [Wenchen Fan] use `partition` in `PushPredicateThroughProject`
Reverts ObjectPool. As it stands, it has a few problems:
1. ObjectPool doesn't work with spilling and memory accounting.
2. I don't think the idea of an object pool is what we want to support in the long run, since it essentially goes back to unmanaged memory, creates pressure on the GC, and makes it hard to account for the total in-memory size.
3. The ObjectPool patch removed the specialized getters for strings and binary, and as a result actually introduced branches when reading non-primitive data types.
If we do want to support arbitrary user defined types in the future, I think we can just add an object array in UnsafeRow, rather than relying on indirect memory addressing through a pool. We also need to pick execution strategies that are optimized for those, rather than keeping a lot of unserialized JVM objects in memory during aggregation.
This is probably the hardest thing I had to revert in Spark, due to recent patches that also change the same part of the code. Would be great to get a careful look.
Author: Reynold Xin <rxin@databricks.com>
Closes#7591 from rxin/revert-object-pool and squashes the following commits:
01db0bc [Reynold Xin] Scala style.
eda89fc [Reynold Xin] Fixed describe.
2967118 [Reynold Xin] Fixed accessor for JoinedRow.
e3294eb [Reynold Xin] Merge branch 'master' into revert-object-pool
657855f [Reynold Xin] Temp commit.
c20f2c8 [Reynold Xin] Style fix.
fe37079 [Reynold Xin] Revert "[SPARK-8579] [SQL] support arbitrary object in UnsafeRow"
JIRA: https://issues.apache.org/jira/browse/SPARK-8935
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes#7365 from yjshen/cast_codegen and squashes the following commits:
ef6e8b5 [Yijie Shen] getColumn and setColumn in struct cast, autounboxing in array and map
eaece18 [Yijie Shen] remove null case in cast code gen
fd7eba4 [Yijie Shen] resolve comments
80378a5 [Yijie Shen] the missing self cast
611d66e [Yijie Shen] Bug fix: NullType & primitive object unboxing
6d5c0fe [Yijie Shen] rebase and add Interval codegen
9424b65 [Yijie Shen] tiny style fix
4a1c801 [Yijie Shen] remove CodeHolder class, use function instead.
3f5df88 [Yijie Shen] CodeHolder for complex dataTypes
c286f13 [Yijie Shen] moved all the cast code into class body
4edfd76 [Yijie Shen] [WIP] finished primitive part
Spark has an option called spark.localExecution.enabled; according to the docs:
> Enables Spark to run certain jobs, such as first() or take() on the driver, without sending tasks to the cluster. This can make certain jobs execute very quickly, but may require shipping a whole partition of data to the driver.
This feature ends up adding quite a bit of complexity to DAGScheduler, especially in the runLocallyWithinThread method, but as far as I know nobody uses this feature (I searched the mailing list and haven't seen any recent mentions of the configuration nor stacktraces including the runLocally method). As a step towards scheduler complexity reduction, I propose that we remove this feature and all code related to it for Spark 1.5.
This pull request simply brings #7484 up to date.
Author: Josh Rosen <joshrosen@databricks.com>
Author: Reynold Xin <rxin@databricks.com>
Closes#7585 from rxin/remove-local-exec and squashes the following commits:
84bd10e [Reynold Xin] Python fix.
1d9739a [Reynold Xin] Merge pull request #7484 from JoshRosen/remove-localexecution
eec39fa [Josh Rosen] Remove allowLocal(); deprecate user-facing uses of it.
b0835dc [Josh Rosen] Remove local execution code in DAGScheduler
8975d96 [Josh Rosen] Remove local execution tests.
ffa8c9b [Josh Rosen] Remove documentation for configuration
I've seen a few cases in the past few weeks where the compiler throws warnings that are caused by legitimate bugs. This patch upgrades warnings to errors, except deprecation warnings.
Note that ideally we should be able to mark deprecation warnings as errors as well. However, due to the lack of ability to suppress individual warning messages in the Scala compiler, we cannot do that (since we do need to access deprecated APIs in Hadoop).
Most of the work was done by ericl.
Author: Reynold Xin <rxin@databricks.com>
Author: Eric Liang <ekl@databricks.com>
Closes#7598 from rxin/warnings and squashes the following commits:
beb311b [Reynold Xin] Fixed tests.
542c031 [Reynold Xin] Fixed one more warning.
87c354a [Reynold Xin] Fixed all non-deprecation warnings.
78660ac [Eric Liang] first effort to fix warnings
There are a few memory limits that people hit often and that we could
make higher, especially now that memory sizes have grown.
- spark.akka.frameSize: This defaults to 10 but is often hit for map
output statuses in large shuffles. This memory is not fully allocated
up-front, so we can just make this larger and still not affect jobs
that never sent a status that large. We increase it to 128.
- spark.executor.memory: Defaults to 512m, which is really small. We
increase it to 1g.
Author: Matei Zaharia <matei@databricks.com>
Closes#7586 from mateiz/configs and squashes the following commits:
ce0038a [Matei Zaharia] [SPARK-9244] Increase some memory defaults
Author: Yin Huai <yhuai@databricks.com>
Closes#7588 from yhuai/SPARK-4366-update1 and squashes the following commits:
25f5f36 [Yin Huai] Fix SqlParser Warning.
This PR introduces unsafe versions (using UnsafeRow) of HashJoin, HashOuterJoin and HashSemiJoin, including both the broadcast and shuffle variants (except FullOuterJoin, which is better implemented using SortMergeJoin).
It uses a HashMap to store UnsafeRows right now; this will change to BytesToBytesMap for better performance (in another PR).
Author: Davies Liu <davies@databricks.com>
Closes#7480 from davies/unsafe_join and squashes the following commits:
6294b1e [Davies Liu] fix projection
10583f1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join
dede020 [Davies Liu] fix test
84c9807 [Davies Liu] address comments
a05b4f6 [Davies Liu] support UnsafeRow in LeftSemiJoinBNL and BroadcastNestedLoopJoin
611d2ed [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join
9481ae8 [Davies Liu] return UnsafeRow after join()
ca2b40f [Davies Liu] revert unrelated change
68f5cd9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join
0f4380d [Davies Liu] ada a comment
69e38f5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join
1a40f02 [Davies Liu] refactor
ab1690f [Davies Liu] address comments
60371f2 [Davies Liu] use UnsafeRow in SemiJoin
a6c0b7d [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_join
184b852 [Davies Liu] fix style
6acbb11 [Davies Liu] fix tests
95d0762 [Davies Liu] remove println
bea4a50 [Davies Liu] Unsafe HashJoin
JIRA: https://issues.apache.org/jira/browse/SPARK-9165
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes#7537 from yjshen/array_struct_codegen and squashes the following commits:
3a6dce6 [Yijie Shen] use infix notion in createArray test
5e90f0a [Yijie Shen] resolve comments: classOf
39cefb8 [Yijie Shen] codegen for createArray createStruct & createNamedStruct
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7446 from cloud-fan/filter and squashes the following commits:
330021e [Wenchen Fan] add exists to tree node
2cab68c [Wenchen Fan] more enhance
949be07 [Wenchen Fan] push down part of predicate if possible
3912f84 [Wenchen Fan] address comments
8ce15ca [Wenchen Fan] fix bug
557158e [Wenchen Fan] Filter using non-deterministic expressions should not be pushed down
This is the first PR for the aggregation improvement, which is tracked by https://issues.apache.org/jira/browse/SPARK-4366 (umbrella JIRA). This PR contains work for its subtasks, SPARK-3056, SPARK-3947, SPARK-4233, and SPARK-4367.
This PR introduces a new code path for evaluating aggregate functions. This code path is guarded by `spark.sql.useAggregate2` and by default the value of this flag is true.
This new code path contains:
* A new aggregate function interface (`AggregateFunction2`) and 7 built-in aggregate functions based on this new interface (`AVG`, `COUNT`, `FIRST`, `LAST`, `MAX`, `MIN`, `SUM`)
* A UDAF interface (`UserDefinedAggregateFunction`) based on the new code path and two example UDAFs (`MyDoubleAvg` and `MyDoubleSum`).
* A sort-based aggregate operator (`Aggregate2Sort`) for the new aggregate function interface.
* A sort-based aggregate operator (`FinalAndCompleteAggregate2Sort`) for distinct aggregations (for distinct aggregations the query plan will use `Aggregate2Sort` and `FinalAndCompleteAggregate2Sort` together).
With this change, when `spark.sql.useAggregate2` is `true`, the flow of compiling an aggregation query is:
1. Our analyzer looks up functions and returns aggregate functions built based on the old aggregate function interface.
2. When our planner is compiling the physical plan, it tries to convert all aggregate functions to ones built based on the new interface. The planner will fall back to the old code path if any of the following conditions is true:
* code-gen is disabled.
* there is any function that cannot be converted (right now, Hive UDAFs).
* the schema of grouping expressions contain any complex data type.
* There are multiple distinct columns.
Right now, the new code path handles a single distinct column in the query (you can have multiple aggregate functions using that distinct column). For a query that has an aggregate function with DISTINCT alongside regular aggregate functions, the generated plan will do partial aggregations for those regular aggregate functions.
Thanks chenghao-intel for his initial work on it.
Author: Yin Huai <yhuai@databricks.com>
Author: Michael Armbrust <michael@databricks.com>
Closes#7458 from yhuai/UDAF and squashes the following commits:
7865f5e [Yin Huai] Put the catalyst expression in the comment of the generated code for it.
b04d6c8 [Yin Huai] Remove unnecessary change.
f1d5901 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
35b0520 [Yin Huai] Use semanticEquals to replace grouping expressions in the output of the aggregate operator.
3b43b24 [Yin Huai] bug fix.
00eb298 [Yin Huai] Make it compile.
a3ca551 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
e0afca3 [Yin Huai] Gracefully fallback to old aggregation code path.
8a8ac4a [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
88c7d4d [Yin Huai] Enable spark.sql.useAggregate2 by default for testing purpose.
dc96fd1 [Yin Huai] Many updates:
85c9c4b [Yin Huai] newline.
43de3de [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
c3614d7 [Yin Huai] Handle single distinct column.
68b8ee9 [Yin Huai] Support single distinct column set. WIP
3013579 [Yin Huai] Format.
d678aee [Yin Huai] Remove AggregateExpressionSuite.scala since our built-in aggregate functions will be based on AlgebraicAggregate and we need to have another way to test it.
e243ca6 [Yin Huai] Add aggregation iterators.
a101960 [Yin Huai] Change MyJavaUDAF to MyDoubleSum.
594cdf5 [Yin Huai] Change existing AggregateExpression to AggregateExpression1 and add an AggregateExpression as the common interface for both AggregateExpression1 and AggregateExpression2.
380880f [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
0a827b3 [Yin Huai] Add comments and doc. Move some classes to the right places.
a19fea6 [Yin Huai] Add UDAF interface.
262d4c4 [Yin Huai] Make it compile.
b2e358e [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
6edb5ac [Yin Huai] Format update.
70b169c [Yin Huai] Remove groupOrdering.
4721936 [Yin Huai] Add CheckAggregateFunction to extendedCheckRules.
d821a34 [Yin Huai] Cleanup.
32aea9c [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
5b46d41 [Yin Huai] Bug fix.
aff9534 [Yin Huai] Make Aggregate2Sort work with both algebraic AggregateFunctions and non-algebraic AggregateFunctions.
2857b55 [Yin Huai] Merge remote-tracking branch 'upstream/master' into UDAF
4435f20 [Yin Huai] Add ConvertAggregateFunction to HiveContext's analyzer.
1b490ed [Michael Armbrust] make hive test
8cfa6a9 [Michael Armbrust] add test
1b0bb3f [Yin Huai] Do not bind references in AlgebraicAggregate and use code gen for all places.
072209f [Yin Huai] Bug fix: Handle expressions in grouping columns that are not attribute references.
f7d9e54 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into UDAF
39ee975 [Yin Huai] Code cleanup: Remove unnecesary AttributeReferences.
b7720ba [Yin Huai] Add an analysis rule to convert aggregate function to the new version.
5c00f3f [Michael Armbrust] First draft of codegen
6bbc6ba [Michael Armbrust] now with correct answers!
f7996d0 [Michael Armbrust] Add AlgebraicAggregate
dded1c5 [Yin Huai] wip
Author: Andrew Or <andrew@databricks.com>
Closes#7576 from andrewor14/clean-up-json-relation and squashes the following commits:
ea80803 [Andrew Or] Clean up duplicate code
Also make format_string the canonical form, rather than printf.
Author: Reynold Xin <rxin@databricks.com>
Closes#7579 from rxin/format_strings and squashes the following commits:
53ee54f [Reynold Xin] Fixed unit tests.
52357e1 [Reynold Xin] Add format_string alias.
b40a42a [Reynold Xin] [SPARK-9154][SQL] Rename formatString to format_string.
Jira: https://issues.apache.org/jira/browse/SPARK-9154
Fixes a bug in #7546.
marmbrus I can't reopen the other PR because I wasn't the one who closed it. Can you trigger Jenkins?
Author: Tarek Auel <tarek.auel@googlemail.com>
Closes#7571 from tarekauel/SPARK-9154 and squashes the following commits:
dcae272 [Tarek Auel] [SPARK-9154][SQL] build fix
1487602 [Tarek Auel] Merge remote-tracking branch 'upstream/master' into SPARK-9154
f512c5f [Tarek Auel] [SPARK-9154][SQL] build fix
a943d3e [Tarek Auel] [SPARK-9154] implicit input cast, added tests for null, support for null primitives
10b4de8 [Tarek Auel] [SPARK-9154][SQL] codegen removed fallback trait
cd8322b [Tarek Auel] [SPARK-9154][SQL] codegen string format
086caba [Tarek Auel] [SPARK-9154][SQL] codegen string format
IsolatedClientLoader.isSharedClass includes all of com.google.*, presumably
for Guava, protobuf, and/or other shared Google libraries, but needs to
count com.google.cloud.* as "hive classes" when determining which ClassLoader
to use. Otherwise, things like HiveContext.parquetFile will throw a
ClassCastException when fs.defaultFS is set to a Google Cloud Storage (gs://)
path. On StackOverflow: http://stackoverflow.com/questions/31478955
EDIT: Adding yhuai who worked on the relevant classloading isolation pieces.
Author: Dennis Huo <dhuo@google.com>
Closes#7549 from dennishuo/dhuo-fix-hivecontext-gcs and squashes the following commits:
1f8db07 [Dennis Huo] Fix HiveContext classloading for GCS connector.
This way, the sources package contains only public facing interfaces.
Author: Reynold Xin <rxin@databricks.com>
Closes#7565 from rxin/move-ds and squashes the following commits:
7661aff [Reynold Xin] Mima
9d5196a [Reynold Xin] Rearranged imports.
3dd7174 [Reynold Xin] [SPARK-8906][SQL] Move all internal data source classes into execution.datasources.
This patch fixes a managed memory leak in GeneratedAggregate. The leak occurs when the unsafe aggregation path is used to perform grouped aggregation on an empty input; in this case, GeneratedAggregate allocates an UnsafeFixedWidthAggregationMap that is never cleaned up because `next()` is never called on the aggregate result iterator.
This patch fixes this by short-circuiting on empty inputs.
This patch is an updated version of #6810.
Closes#6810.
Author: navis.ryu <navis@apache.org>
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7560 from JoshRosen/SPARK-8357 and squashes the following commits:
3486ce4 [Josh Rosen] Some minor cleanup
c649310 [Josh Rosen] Revert SparkPlan change:
3c7db0f [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-8357
adc8239 [Josh Rosen] Back out Projection changes.
c5419b3 [navis.ryu] addressed comments
143e1ef [navis.ryu] fixed format & added test for CCE case
735972f [navis.ryu] used new conf apis
1a02a55 [navis.ryu] Rolled-back test-conf cleanup & fixed possible CCE & added more tests
51178e8 [navis.ryu] addressed comments
4d326b9 [navis.ryu] fixed test fails
15c5afc [navis.ryu] added a test as suggested by JoshRosen
d396589 [navis.ryu] added comments
1b07556 [navis.ryu] [SPARK-8357] [SQL] Memory leakage on unsafe aggregation path with empty input
This reverts commit 7f072c3d5e.
Revert #7546
Author: Michael Armbrust <michael@databricks.com>
Closes#7570 from marmbrus/revert9154 and squashes the following commits:
ed2c32a [Michael Armbrust] Revert "[SPARK-9154] [SQL] codegen StringFormat"
JIRA:
https://issues.apache.org/jira/browse/SPARK-9081
https://issues.apache.org/jira/browse/SPARK-9168
This PR targets the following modifications (a usage sketch follows the list):
1. Change `isNaN` to return `false` on `null` input
2. Make `dropna` and `fillna` to fill/drop NaN values as well
3. Implement `nanvl`
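With these changes, NaN can be handled much like null (assuming a DataFrame `df` with double columns `a` and `b`):
```scala
import org.apache.spark.sql.functions.nanvl

df.na.drop()                       // now drops rows containing NaN as well
df.na.fill(0.0)                    // now replaces NaN (and null) with 0.0
df.select(nanvl(df("a"), df("b"))) // a unless a is NaN, in which case b
```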
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes#7523 from yjshen/fillna_dropna and squashes the following commits:
f0a51db [Yijie Shen] make coalesce untouched and implement nanvl
1d3e35f [Yijie Shen] make Coalesce aware of NaN in order to support fillna
2760cbc [Yijie Shen] change isNaN(null) to false as well as implement dropna
JIRA: https://issues.apache.org/jira/browse/SPARK-9173
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes#7540 from yjshen/union_pushdown and squashes the following commits:
278510a [Yijie Shen] rename UnionPushDown to SetOperationPushDown
91741c1 [Yijie Shen] Add UnionPushDown support for intersect and except
Pull Request for: https://issues.apache.org/jira/browse/SPARK-8230
The primary issue resolved is implementing array/map size for Spark SQL; a usage sketch follows the review list below. The code is ready for review by a committer. Cheng Hao is on the JIRA ticket, but I don't know his GitHub username; rxin is also on the JIRA ticket.
Things to review:
1. Where to put the added functions namespace-wise: they seem to be part of a few operations on collections, which include `sort_array` and `array_contains`; hence the names `collectionOperations.scala` and `_collection_functions` in Python.
2. In Python code, should it be in a `1.5.0` function array or in a collections array?
3. Are there any missing methods on the `Size` case class? Looks like many of these functions have generated Java code, is that also needed in this case?
4. Something else?
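As mentioned above, example usage of the new function (assuming a DataFrame `df` with an array column `xs` and a map column `m`):
```scala
import org.apache.spark.sql.functions.size

df.select(size(df("xs")), size(df("m"))) // element count per array / map
```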
Author: Pedro Rodriguez <ski.rodriguez@gmail.com>
Author: Pedro Rodriguez <prodriguez@trulia.com>
Closes#7462 from EntilZha/SPARK-8230 and squashes the following commits:
9a442ae [Pedro Rodriguez] fixed functions and sorted __all__
9aea3bb [Pedro Rodriguez] removed imports from python docs
15d4bf1 [Pedro Rodriguez] Added null test case and changed to nullSafeCodeGen
d88247c [Pedro Rodriguez] removed python code
bd5f0e4 [Pedro Rodriguez] removed duplicate function from rebase/merge
59931b4 [Pedro Rodriguez] fixed compile bug instroduced when merging
c187175 [Pedro Rodriguez] updated code to add size to __all__ directly and removed redundent pretty print
130839f [Pedro Rodriguez] fixed failing test
aa9bade [Pedro Rodriguez] fix style
e093473 [Pedro Rodriguez] updated python code with docs, switched classes/traits implemented, added (failing) expression tests
0449377 [Pedro Rodriguez] refactored code to use better abstract classes/traits and implementations
9a1a2ff [Pedro Rodriguez] added unit tests for map size
2bfbcb6 [Pedro Rodriguez] added unit test for size
20df2b4 [Pedro Rodriguez] Finished working version of size function and added it to python
b503e75 [Pedro Rodriguez] First attempt at implementing size for maps and arrays
99a6a5c [Pedro Rodriguez] fixed failing test
cac75ac [Pedro Rodriguez] fix style
933d843 [Pedro Rodriguez] updated python code with docs, switched classes/traits implemented, added (failing) expression tests
42bb7d4 [Pedro Rodriguez] refactored code to use better abstract classes/traits and implementations
f9c3b8a [Pedro Rodriguez] added unit tests for map size
2515d9f [Pedro Rodriguez] added documentation
0e60541 [Pedro Rodriguez] added unit test for size
acf9853 [Pedro Rodriguez] Finished working version of size function and added it to python
84a5d38 [Pedro Rodriguez] First attempt at implementing size for maps and arrays
Add expressions `regexp_extract` & `regexp_replace`.
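Example usage via the DataFrame API (assuming a DataFrame `df` with a string column `s`):
```scala
import org.apache.spark.sql.functions.{regexp_extract, regexp_replace}

df.select(
  regexp_extract(df("s"), "(\\d+)-(\\d+)", 1), // first capture group
  regexp_replace(df("s"), "\\s+", " "))        // collapse whitespace
```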
Author: Cheng Hao <hao.cheng@intel.com>
Closes#7468 from chenghao-intel/regexp and squashes the following commits:
e5ea476 [Cheng Hao] minor update for documentation
ef96fd6 [Cheng Hao] update the code gen
72cf28f [Cheng Hao] Add more log for compilation error
4e11381 [Cheng Hao] Add regexp_replace / regexp_extract support
This PR adds DataFrame reader/writer shortcut methods for ORC in both Scala and Python.
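Example usage (assuming a `hiveContext`, since ORC support lives in the Hive module):
```scala
val df = hiveContext.read.orc("/data/events.orc")
df.write.orc("/data/events_copy.orc")
```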
Author: Cheng Lian <lian@databricks.com>
Closes#7444 from liancheng/spark-9100 and squashes the following commits:
284d043 [Cheng Lian] Fixes PySpark test cases and addresses PR comments
e0b09fb [Cheng Lian] Adds DataFrame reader/writer shortcut methods for ORC
Jira https://issues.apache.org/jira/browse/SPARK-9161
Author: Tarek Auel <tarek.auel@googlemail.com>
Closes#7545 from tarekauel/SPARK-9161 and squashes the following commits:
21425c8 [Tarek Auel] [SPARK-9161][SQL] codegen FormatNumber
This patch addresses code review feedback from #7456.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7551 from JoshRosen/unsafe-exchange-followup and squashes the following commits:
76dbdf8 [Josh Rosen] Add comments + more methods to UnsafeRowSerializer
3d7a1f2 [Josh Rosen] Add writeToStream() method to UnsafeRow
It can be ambiguous whether that is a string literal or a column name.
cc marmbrus
Author: Reynold Xin <rxin@databricks.com>
Closes#7556 from rxin/str-exprs and squashes the following commits:
92afa83 [Reynold Xin] [SPARK-9208][SQL] Remove variant of DataFrame string functions that accept column names.
This patch addresses an issue where queries that sorted float or double columns containing NaN values could fail with "Comparison method violates its general contract!" errors from TimSort. The root of this problem is that `NaN > anything`, `NaN == anything`, and `NaN < anything` all return `false`.
Per the design specified in SPARK-9079, we have decided that `NaN = NaN` should return true and that NaN should appear last when sorting in ascending order (i.e. it is larger than any other numeric value).
In addition to implementing these semantics, this patch also adds canonicalization of NaN values in UnsafeRow, which is necessary in order to be able to do binary equality comparisons on equal NaNs that might have different bit representations (see SPARK-9147).
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7194 from JoshRosen/nan and squashes the following commits:
983d4fc [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan
88bd73c [Josh Rosen] Fix Row.equals()
a702e2e [Josh Rosen] normalization -> canonicalization
a7267cf [Josh Rosen] Normalize NaNs in UnsafeRow
fe629ae [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan
fbb2a29 [Josh Rosen] Fix NaN comparisons in BinaryComparison expressions
c1fd4fe [Josh Rosen] Fold NaN test into existing test framework
b31eb19 [Josh Rosen] Uncomment failing tests
7fe67af [Josh Rosen] Support NaN == NaN (SPARK-9145)
58bad2c [Josh Rosen] Revert "Compare rows' string representations to work around NaN incomparability."
fc6b4d2 [Josh Rosen] Update CodeGenerator
3998ef2 [Josh Rosen] Remove unused code
a2ba2e7 [Josh Rosen] Fix prefix comparision for NaNs
a30d371 [Josh Rosen] Compare rows' string representations to work around NaN incomparability.
6f03f85 [Josh Rosen] Fix bug in Double / Float ordering
42a1ad5 [Josh Rosen] Stop filtering NaNs in UnsafeExternalSortSuite
bfca524 [Josh Rosen] Change ordering so that NaN is maximum value.
8d7be61 [Josh Rosen] Update randomized test to use ScalaTest's assume()
b20837b [Josh Rosen] Add failing test for new NaN comparision ordering
5b88b2b [Josh Rosen] Fix compilation of CodeGenerationSuite
d907b5b [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan
630ebc5 [Josh Rosen] Specify an ordering for NaN values.
9bf195a [Josh Rosen] Re-enable NaNs in CodeGenerationSuite to produce more regression tests
13fc06a [Josh Rosen] Add regression test for NaN sorting issue
f9efbb5 [Josh Rosen] Fix ORDER BY NULL
e7dc4fb [Josh Rosen] Add very generic test for ordering
7d5c13e [Josh Rosen] Add regression test for SPARK-8782 (ORDER BY NULL)
b55875a [Josh Rosen] Generate doubles and floats over entire possible range.
5acdd5c [Josh Rosen] Infinity and NaN are interesting.
ab76cbd [Josh Rosen] Move code to Catalyst package.
d2b4a4a [Josh Rosen] Add random data generator test utilities to Spark SQL.
Jira: https://issues.apache.org/jira/browse/SPARK-9132 and https://issues.apache.org/jira/browse/SPARK-9163
rxin, as you proposed in the Jira ticket, I just moved the logic to a separate object. I haven't changed any of the logic in `NumberConverter`.
Author: Tarek Auel <tarek.auel@googlemail.com>
Closes#7552 from tarekauel/SPARK-9163 and squashes the following commits:
40dcde9 [Tarek Auel] [SPARK-9132][SPARK-9163][SQL] style fix
fa985bd [Tarek Auel] [SPARK-9132][SPARK-9163][SQL] codegen conv
Jira: https://issues.apache.org/jira/browse/SPARK-9164
The diff looks heavy, but I just moved the `hex` and `unhex` methods to `object Hex`. This allows me to call them from `eval` and `codeGen`.
Author: Tarek Auel <tarek.auel@googlemail.com>
Closes#7548 from tarekauel/SPARK-9164 and squashes the following commits:
dd91c57 [Tarek Auel] [SPARK-9164][SQL] codegen hex/unhex
Also added documentation to expressions to explain the important traits and abstract classes.
Author: Reynold Xin <rxin@databricks.com>
Closes#7550 from rxin/remove-self-types and squashes the following commits:
b2a3ec1 [Reynold Xin] [SPARK-9142][SQL] Removing unnecessary self types in expressions.
This PR tries to accelerate Parquet schema discovery and `HadoopFsRelation` partition discovery. The acceleration is done by the following means:
- Turning off schema merging by default
Schema merging is not the most common case, but requires reading footers of all Parquet part-files and can be very slow.
- Avoiding `FileSystem.globStatus()` call when possible
`FileSystem.globStatus()` may issue multiple synchronous RPC calls, and can be very slow (esp. on S3). This PR adds `SparkHadoopUtil.globPathIfNecessary()`, which only issues RPC calls when the path contains glob-pattern specific character(s) (`{}[]*?\`); see the sketch after this list.
This is especially useful when converting a metastore Parquet table with lots of partitions, since Spark SQL adds all partition directories as the input paths, and currently we do a `globStatus` call on each input path sequentially.
- Listing leaf files in parallel when the number of input paths exceeds a threshold
Listing leaf files is required by partition discovery. Currently it is done on the driver side, and can be slow when there are lots of (nested) directories, since each `FileSystem.listStatus()` call issues an RPC. In this PR, we list leaf files in a BFS style, and resort to a Spark job once we find that the number of directories to be listed exceeds a threshold.
The threshold is controlled by `SQLConf` option `spark.sql.sources.parallelPartitionDiscovery.threshold`, which defaults to 32.
- Discovering Parquet schema in parallel
Currently, schema merging is also done on driver side, and needs to read footers of all part-files. This PR uses a Spark job to do schema merging. Together with task side metadata reading in Parquet 1.7.0, we never read any footers on driver side now.
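A rough sketch of the `globPathIfNecessary()` idea from the second item above (the method name comes from this PR, but the body and simplified signature here are illustrative):
```
import org.apache.hadoop.fs.{FileSystem, Path}

def globPathIfNecessary(fs: FileSystem, pattern: Path): Seq[Path] = {
  // Only pay for globStatus()'s RPCs when a glob character is actually present.
  val hasGlob = pattern.toString.exists("{}[]*?\\".toSet)
  if (hasGlob) Option(fs.globStatus(pattern)).toSeq.flatten.map(_.getPath)
  else Seq(pattern) // plain path: skip the expensive call entirely
}
```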
Author: Cheng Lian <lian@databricks.com>
Closes#7396 from liancheng/accel-parquet and squashes the following commits:
5598efc [Cheng Lian] Uses ParquetInputFormat[InternalRow] instead of ParquetInputFormat[Row]
ff32cd0 [Cheng Lian] Excludes directories while listing leaf files
3c580f1 [Cheng Lian] Fixes test failure caused by making "mergeSchema" default to "false"
b1646aa [Cheng Lian] Should allow empty input paths
32e5f0d [Cheng Lian] Moves schema merging to executor side
Jira https://issues.apache.org/jira/browse/SPARK-9155
Author: Tarek Auel <tarek.auel@googlemail.com>
Closes#7531 from tarekauel/SPARK-9155 and squashes the following commits:
423c426 [Tarek Auel] [SPARK-9155] language typo fix
e34bd1b [Tarek Auel] [SPARK-9155] moved creation of blank string to UTF8String
4bc33e6 [Tarek Auel] [SPARK-9155] codegen StringSpace
This PR forks PR #7421 authored by piaozhexiu and adds [a workaround] [1] for fixing the occasional test failures that occurred in PR #7421. Please refer to these [two] [2] [comments] [3] for details.
[1]: 536ac41a7e
[2]: https://github.com/apache/spark/pull/7421#issuecomment-122527391
[3]: https://github.com/apache/spark/pull/7421#issuecomment-122528059
Author: Cheolsoo Park <cheolsoop@netflix.com>
Author: Cheng Lian <lian@databricks.com>
Author: Michael Armbrust <michael@databricks.com>
Closes#7492 from liancheng/pr-7421-workaround and squashes the following commits:
5599cc4 [Cheolsoo Park] Predicate pushdown to hive metastore
536ac41 [Cheng Lian] Sets hive.metastore.integral.jdo.pushdown to true to workaround test failures caused by in #7421
This PR also remove the duplicated code between registerFunction and UserDefinedFunction.
cc JoshRosen
Author: Davies Liu <davies@databricks.com>
Closes#7450 from davies/fix_return_type and squashes the following commits:
e80bf9f [Davies Liu] remove debugging code
f94b1f6 [Davies Liu] fix mima
8f9c58b [Davies Liu] convert returned object from UDF into internal type
I don't think this function is useful at all in Scala/Java, since users can easily compute n * space.
Author: Reynold Xin <rxin@databricks.com>
Closes#7530 from rxin/remove-space and squashes the following commits:
c147873 [Reynold Xin] [SQL] Remove space from DataFrame Scala/Java API.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7525 from cloud-fan/deterministic and squashes the following commits:
4189bfa [Wenchen Fan] make deterministic describing the tree rather than the expression
https://issues.apache.org/jira/browse/SPARK-9177
rxin Are we sure that this is thread safe? chenghao-intel explained in another PR that every partition (if I remember correctly) uses one expression instance. This instance isn't used by multiple threads, is it? If not, we are fine.
Author: Tarek Auel <tarek.auel@googlemail.com>
Closes#7516 from tarekauel/SPARK-9177 and squashes the following commits:
0c1313a [Tarek Auel] [SPARK-9177] utilize more powerful addMutableState
6e2f03f [Tarek Auel] Merge branch 'master' into SPARK-9177
a69ec92 [Tarek Auel] [SPARK-9177] address comment
6cfb180 [Tarek Auel] [SPARK-9177] calendar as lazy transient val
ff97b09 [Tarek Auel] [SPARK-9177] Reuse calendar object in interpreted code and codegen
This pull request aims to improve the performance of SQL's Exchange operator when shuffling UnsafeRows. It also makes several general efficiency improvements to Exchange.
Key changes:
- When performing hash partitioning, the old Exchange projected the partitioning columns into a new row and then passed a `(partitioningColumnRow: InternalRow, row: InternalRow)` pair into the shuffle. This is very inefficient because it ends up redundantly serializing the partitioning columns only to immediately discard them after the shuffle. After this patch's changes, Exchange now shuffles `(partitionId: Int, row: InternalRow)` pairs (see the sketch after this list). This still isn't optimal, since we're still shuffling extra data that we don't need, but it's significantly more efficient than the old implementation; in the future, we may be able to further optimize this once we implement a new shuffle write interface that accepts non-key-value-pair inputs.
- Exchange's `compute()` method has been significantly simplified; the new code has less duplication and thus is easier to understand.
- When the Exchange's input operator produces UnsafeRows, Exchange will use a specialized `UnsafeRowSerializer` to serialize these rows. This serializer is significantly more efficient since it simply copies the UnsafeRow's underlying bytes. Note that this approach does not work for UnsafeRows that use the ObjectPool mechanism; I did not add support for this because we are planning to remove ObjectPool in the next few weeks.
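A minimal sketch of the new shuffle input shape from the first item above (generic `Row` and `hashKey` are stand-ins, not the patch's code):
```
// Attach a non-negative partition id to each row instead of serializing a
// projected row of partitioning columns alongside it.
def addPartitionIds[Row](rows: Iterator[Row], numPartitions: Int,
                         hashKey: Row => Int): Iterator[(Int, Row)] =
  rows.map { row =>
    val pid = ((hashKey(row) % numPartitions) + numPartitions) % numPartitions
    (pid, row) // only an Int travels with the row through the shuffle
  }
```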
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7456 from JoshRosen/unsafe-exchange and squashes the following commits:
7e75259 [Josh Rosen] Fix cast in SparkSqlSerializer2Suite
0082515 [Josh Rosen] Some additional comments + small cleanup to remove an unused parameter
a27cfc1 [Josh Rosen] Add missing newline
741973c [Josh Rosen] Add simple test of UnsafeRow shuffling in Exchange.
359c6a4 [Josh Rosen] Remove println() and add comments
93904e7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-exchange
8dd3ff2 [Josh Rosen] Exchange outputs UnsafeRows when its child outputs them
dd9c66d [Josh Rosen] Fix for copying logic
035af21 [Josh Rosen] Add logic for choosing when to use UnsafeRowSerializer
7876f31 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-shuffle
cbea80b [Josh Rosen] Add UnsafeRowSerializer
0f2ac86 [Josh Rosen] Import ordering
3ca8515 [Josh Rosen] Big code simplification in Exchange
3526868 [Josh Rosen] Iniitial cut at removing shuffle on KV pairs
Caught this while reading the code.
Author: Jacky Li <lee.unreal@gmail.com>
Author: Jacky Li <jackylk@users.noreply.github.com>
Closes#7524 from jackylk/patch-11 and squashes the following commits:
b679011 [Jacky Li] fix doc
e10e211 [Jacky Li] [SQL] Minor document fix in HadoopFsRelationProvider
Sometimes we need more than one step to initialize the mutable states in code gen, as in https://github.com/apache/spark/pull/7516
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7521 from cloud-fan/init and squashes the following commits:
2106445 [Wenchen Fan] improve code gen for mutable states
JIRA: https://issues.apache.org/jira/browse/SPARK-9172
Simply make `DecimalPrecision` support `Intersect` and `Except` in addition to `Union`.
Also add unit tests for `DecimalPrecision`.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes#7511 from viirya/more_decimalprecieion and squashes the following commits:
4d29d10 [Liang-Chi Hsieh] Fix code comment.
9fb0d49 [Liang-Chi Hsieh] Make DecimalPrecision support for Intersect and Except.
I also changed the semantics of concat w.r.t. null back to the same behavior as Hive.
That is to say, concat now returns null if any input is null.
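Hedged examples of the resulting semantics (output comments assume Hive-compatible behavior):
```
sqlContext.sql("SELECT concat('a', NULL, 'b')").show()         // null: any null input nullifies concat
sqlContext.sql("SELECT concat_ws('-', 'a', NULL, 'b')").show() // "a-b": concat_ws skips nulls
```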
Author: Reynold Xin <rxin@databricks.com>
Closes#7504 from rxin/concat_ws and squashes the following commits:
83fd950 [Reynold Xin] Fixed type casting.
3ae85f7 [Reynold Xin] Write null better.
cdc7be6 [Reynold Xin] Added code generation for pure string mode.
a61c4e4 [Reynold Xin] Updated comments.
2d51406 [Reynold Xin] [SPARK-8241][SQL] string function: concat_ws.
This PR contains a few clean-ups that are a part of SPARK-8638: a few style issues got fixed, and a few tests were moved.
Git commit message is wrong BTW :(...
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#7513 from hvanhovell/SPARK-8638-cleanup and squashes the following commits:
4e69d08 [Herman van Hovell] Fixed Perfomance Regression for Shrinking Window Frames (+Rebase)
PR #7506 breaks the master build because of a compilation error. Note that #7506 itself looks good, but it seems that `git merge` did something stupid.
Author: Cheng Lian <lian@databricks.com>
Closes#7510 from liancheng/hotfix-for-pr-7506 and squashes the following commits:
7ea7e89 [Cheng Lian] Fixes compilation error
This pull request fixes some of the problems in #6981.
- Added date functions to `__all__` so they get exposed
- Rename day_of_month -> dayofmonth
- Rename day_in_year -> dayofyear
- Rename week_of_year -> weekofyear
- Removed "day" from Scala/Python API since it is ambiguous. Only leaving the alias in SQL.
Author: Reynold Xin <rxin@databricks.com>
This patch had conflicts when merged, resolved by
Committer: Reynold Xin <rxin@databricks.com>
Closes#7506 from rxin/datetime and squashes the following commits:
0cb24d9 [Reynold Xin] Export all functions in Python.
e44a4a0 [Reynold Xin] Removed day function from Scala and Python.
9c08fdc [Reynold Xin] [SQL] Make date/time functions more consistent with other database systems.
rxin / davies
Sorry for that unnecessary change. And thanks again for all your support!
Author: Tarek Auel <tarek.auel@googlemail.com>
Closes#7505 from tarekauel/SPARK-8199-FollowUp and squashes the following commits:
d09321c [Tarek Auel] [SPARK-8199] follow up; revert change in test
c17397f [Tarek Auel] [SPARK-8199] follow up; revert change in test
67acfe6 [Tarek Auel] [SPARK-8199] follow up; revert change in test
## Description
Performance improvements for Spark Window functions. This PR will also serve as the basis for moving away from Hive UDAFs to Spark UDAFs. See JIRA tickets SPARK-8638 and SPARK-7712 for more information.
## Improvements
* Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and UNBOUNDED FOLLOWING cases. The current implementation in Spark uses a sliding-window approach in these cases. This means that an aggregate is maintained for every row, so space usage is N (N being the number of rows). It also means that all these aggregates need to be updated separately, which takes N*(N-1)/2 updates. The running case differs from the sliding case because we are only adding data to an aggregate function (no reset is required): we only need to maintain one aggregate (as in the UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING case), update the aggregate for each row, and get the aggregate value after each update (see the sketch after this list). This is what the new implementation does. This approach only uses one buffer and only requires N updates; I am currently working on data with window sizes of 500-1000 doing running sums and this saves a lot of time. The CURRENT ROW AND UNBOUNDED FOLLOWING case also uses this approach plus the fact that aggregate operations are commutative; there is one twist though: it will process the input buffer in reverse.
* Fewer comparisons in the sliding case. The current implementation determines frame boundaries for every input row. The new implementation makes more use of the fact that the window is sorted, maintains the boundaries, and only moves them when the current row order changes. This is a minor improvement.
* A single Window node is able to process all types of Frames for the same Partitioning/Ordering. This saves a little time/memory spent buffering and managing partitions. This will be enabled in a follow-up PR.
* A lot of the staging code is moved from the execution phase to the initialization phase. Minor performance improvement, and improves readability of the execution code.
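A toy sketch of the running-frame idea from the first bullet (an assumed simplification, not the PR's code):
```
// One aggregate buffer per partition, one update per row, and the aggregate
// value emitted after each update: N updates instead of N*(N-1)/2.
def runningSum(partition: Seq[Double]): Seq[Double] = {
  var acc = 0.0
  partition.map { v => acc += v; acc }
}
```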
## Benchmarking
I have done a small benchmark using [on time performance](http://www.transtats.bts.gov) data for the month of April. I have used the origin as a partitioning key, so there is quite some variation in window sizes. The code for the benchmark can be found in the JIRA ticket. These are the results per frame type:
Frame | Master | SPARK-8638
----- | ------ | ----------
Entire Frame | 2 s | 1 s
Sliding | 18 s | 1 s
Growing | 14 s | 0.9 s
Shrinking | 13 s | 1 s
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#7057 from hvanhovell/SPARK-8638 and squashes the following commits:
3bfdc49 [Herman van Hovell] Fixed Perfomance Regression for Shrinking Window Frames (+Rebase)
2eb3b33 [Herman van Hovell] Corrected reverse range frame processing.
2cd2d5b [Herman van Hovell] Corrected reverse range frame processing.
b0654d7 [Herman van Hovell] Tests for exotic frame specifications.
e75b76e [Herman van Hovell] More docs, added support for reverse sliding range frames, and some reorganization of code.
1fdb558 [Herman van Hovell] Changed Data In HiveDataFrameWindowSuite.
ac2f682 [Herman van Hovell] Added a few more comments.
1938312 [Herman van Hovell] Added Documentation to the createBoundOrdering methods.
bb020e6 [Herman van Hovell] Major overhaul of Window operator.
By grouping projection calls into multiple apply functions, we are able to push the number of projections codegen can handle from ~1k to ~60k. I have set the unit test to test against 5k, as 60k took 15s for the unit test to complete.
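A rough sketch of the splitting idea (the helper names and batch size are hypothetical):
```
// Group per-projection statements into numbered helper methods so no single
// generated method exceeds the JVM's 64KB per-method bytecode limit.
def splitProjectionCode(statements: Seq[String], batchSize: Int = 100): String = {
  val helpers = statements.grouped(batchSize).zipWithIndex.map { case (batch, i) =>
    s"private void apply$i(InternalRow i) {\n  ${batch.mkString("\n  ")}\n}"
  }.toSeq
  val calls = helpers.indices.map(i => s"apply$i(i);").mkString("\n  ")
  s"${helpers.mkString("\n")}\n\n// body of apply():\n  $calls"
}
```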
Author: Forest Fang <forest.fang@outlook.com>
Closes#7076 from saurfang/codegen_size_limit and squashes the following commits:
b7a7635 [Forest Fang] [SPARK-8443][SQL] Execute and verify split projections in test
adef95a [Forest Fang] [SPARK-8443][SQL] Use safer factor and rewrite splitting code
1b5aa7e [Forest Fang] [SPARK-8443][SQL] inline execution if one block only
9405680 [Forest Fang] [SPARK-8443][SQL] split projection code by size limit
It is very hard to track which expressions have code gen implemented or not. This patch removes the default fallback gencode implementation from Expression, and moves that into a new trait called CodegenFallback. Each concrete expression needs to either implement code generation, or mix in CodegenFallback. This makes it very easy to track which expressions have code generation implemented already.
Additionally, this patch creates an Unevaluable trait that can be used to track expressions that don't support evaluation (e.g. Star).
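Simplified stand-ins (not Spark's actual definitions) for the two traits described above:
```
// Toy model: Unevaluable forbids interpreted evaluation; CodegenFallback
// marks expressions whose generated code just calls back into eval().
trait Expression { def eval(input: Any): Any }

trait Unevaluable extends Expression {
  override def eval(input: Any): Any =
    throw new UnsupportedOperationException(s"Cannot evaluate expression: $this")
}

trait CodegenFallback extends Expression {
  // genCode would emit a call back into eval(), keeping interpreted
  // evaluation as the fallback path for expressions without codegen.
}
```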
Author: Reynold Xin <rxin@databricks.com>
Closes#7487 from rxin/codegenfallback and squashes the following commits:
14ebf38 [Reynold Xin] Fixed Conv
6c1c882 [Reynold Xin] Fixed Alias.
b42611b [Reynold Xin] [SPARK-9150][SQL] Create a trait to track code generation for expressions.
cb5c066 [Reynold Xin] Removed extra import.
39cbe40 [Reynold Xin] [SPARK-8240][SQL] string function: concat
Author: Reynold Xin <rxin@databricks.com>
Closes#7500 from rxin/sqlconf and squashes the following commits:
a5726c8 [Reynold Xin] [SPARK-9174][SQL] Add documentation for all public SQLConfs.
JIRA: https://issues.apache.org/jira/browse/SPARK-9055
cc rxin
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes#7491 from yijieshen/widen and squashes the following commits:
079fa52 [Yijie Shen] widenType support for intersect and expect
JIRA: https://issues.apache.org/jira/browse/SPARK-9151
Add codegen support for `Abs`.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes#7498 from viirya/abs_codegen and squashes the following commits:
0c8410f [Liang-Chi Hsieh] Implement code generation for Abs.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7496 from cloud-fan/tests and squashes the following commits:
0958f90 [Wenchen Fan] improve test for nondeterministic expressions
fix 2 bugs introduced in https://github.com/apache/spark/pull/7353
1. We should use a UTC Calendar when casting string to date. Before #7353, we used `DateTimeUtils.fromJavaDate(Date.valueOf(s.toString))` to cast string to date, and `fromJavaDate` calls `millisToDays` to avoid the time zone issue. Now that we use `DateTimeUtils.stringToDate(s)`, we should create a Calendar with UTC from the beginning (see the sketch after this list).
2. We should not change the default time zone in test cases. The `threadLocalLocalTimeZone` and `threadLocalTimestampFormat` in `DateTimeUtils` are only evaluated once per thread, so we can't set the default time zone back anymore.
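An illustrative sketch of point 1 (not the patch's code):
```
import java.util.{Calendar, TimeZone}

// Build the calendar in UTC rather than the JVM default zone, so the
// resulting epoch day does not shift with the local time zone.
val c = Calendar.getInstance(TimeZone.getTimeZone("UTC"))
c.clear() // zero out time-of-day fields, including milliseconds
c.set(2015, Calendar.JULY, 20)
val days = (c.getTimeInMillis / (24L * 60 * 60 * 1000)).toInt // millisToDays-style
```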
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7488 from cloud-fan/datetime and squashes the following commits:
9cd6005 [Wenchen Fan] address comments
21ef293 [Wenchen Fan] fix 2 bugs in datetime
a follow up of https://github.com/apache/spark/pull/7479.
`TreeNode` is the root cause of the `self: Product =>` requirement, so why not make `TreeNode` extend `Product`?
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7495 from cloud-fan/self-type and squashes the following commits:
8676af7 [Wenchen Fan] remove more self type
Now that we have two different internal row formats, UnsafeRow and the old Java-object-based row format, we end up having to perform conversions between these two formats. These conversions should not be performed by the operators themselves; instead, the planner should be responsible for inserting appropriate format conversions when they are needed.
This patch makes the following changes:
- Add two new physical operators for performing row format conversions, `ConvertToUnsafe` and `ConvertFromUnsafe`.
- Add new methods to `SparkPlan` to allow operators to express whether they output UnsafeRows and whether they can handle safe or unsafe rows as inputs.
- Implement an `EnsureRowFormats` rule to automatically insert converter operators where necessary.
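A toy model of the converter-insertion rule described above (stand-in classes, not Spark's):
```
sealed trait Plan { def unsafeOut: Boolean; def children: Seq[Plan] }
case class Op(unsafeOut: Boolean, wantsUnsafeIn: Boolean, children: Seq[Plan]) extends Plan
case class ConvertToUnsafe(child: Plan) extends Plan { val unsafeOut = true; def children = Seq(child) }
case class ConvertToSafe(child: Plan) extends Plan { val unsafeOut = false; def children = Seq(child) }

// Wrap any child whose row format doesn't match what the parent expects.
def ensureRowFormats(p: Plan): Plan = p match {
  case op @ Op(_, wantsUnsafe, cs) =>
    op.copy(children = cs.map(ensureRowFormats).map {
      case c if wantsUnsafe && !c.unsafeOut => ConvertToUnsafe(c)
      case c if !wantsUnsafe && c.unsafeOut => ConvertToSafe(c)
      case c => c
    })
  case other => other
}
```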
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7482 from JoshRosen/unsafe-converter-planning and squashes the following commits:
7450fa5 [Josh Rosen] Resolve conflicts in favor of choosing UnsafeRow
5220cce [Josh Rosen] Add roundtrip converter test
2bb8da8 [Josh Rosen] Add Union unsafe support + tests to bump up test coverage
6f79449 [Josh Rosen] Add even more assertions to execute()
08ce199 [Josh Rosen] Rename ConvertFromUnsafe -> ConvertToSafe
0e2d548 [Josh Rosen] Add assertion if operators' input rows are in different formats
cabb703 [Josh Rosen] Add tests for Filter
3b11ce3 [Josh Rosen] Add missing test file.
ae2195a [Josh Rosen] Fixes
0fef0f8 [Josh Rosen] Rename file.
d5f9005 [Josh Rosen] Finish writing EnsureRowFormats planner rule
b5df19b [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-converter-planning
9ba3038 [Josh Rosen] WIP
Author: Reynold Xin <rxin@databricks.com>
Closes#7490 from rxin/unit-test-null-funcs and squashes the following commits:
7b276f0 [Reynold Xin] Move isNaN.
8307287 [Reynold Xin] [SPARK-9169][SQL] Improve unit test coverage for null expressions.
When the `condition` extracted by `ExtractEquiJoinKeys` contains a join predicate for a left semi join, we cannot plan it as a semi join. For example:
```
SELECT * FROM testData2 x
LEFT SEMI JOIN testData2 y
ON x.b = y.b
AND x.a >= y.a + 2
```
The condition `x.a >= y.a + 2` cannot be evaluated on table `x` alone, so it throws errors.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#5643 from adrian-wang/spark7026 and squashes the following commits:
cc09809 [Daoyuan Wang] refactor semijoin and add plan test
575a7c8 [Daoyuan Wang] fix notserializable
27841de [Daoyuan Wang] fix rebase
10bf124 [Daoyuan Wang] fix style
72baa02 [Daoyuan Wang] fix style
8e0afca [Daoyuan Wang] merge commits for rebase
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7452 from cloud-fan/boolean-simplify and squashes the following commits:
2a6e692 [Wenchen Fan] fix style
d3cfd26 [Wenchen Fan] fix BooleanSimplification in case-insensitive
The check was unreachable before, as `case operator: LogicalPlan` catches everything already.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7449 from cloud-fan/tmp and squashes the following commits:
2bb6637 [Wenchen Fan] add test
5493aea [Wenchen Fan] add the check back
27221a7 [Wenchen Fan] remove unnecessary analysis check code for self join
JIRA: https://issues.apache.org/jira/browse/SPARK-9080
cc rxin
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes#7464 from yijieshen/isNaN and squashes the following commits:
11ae039 [Yijie Shen] add isNaN in functions
666718e [Yijie Shen] add isNaN predicate expression
Just a small change to add Product type to the base expression/plan abstract classes, based on suggestions on #7434 and offline discussions.
Author: Reynold Xin <rxin@databricks.com>
Closes#7479 from rxin/remove-self-types and squashes the following commits:
e407ffd [Reynold Xin] [SPARK-9142][SQL] Removing unnecessary self types in Catalyst.
a follow up of https://github.com/apache/spark/pull/7353
1. We should use `Calendar.HOUR_OF_DAY` instead of `Calendar.HOUR` (the latter is for AM/PM).
2. we should call `c.set(Calendar.MILLISECOND, 0)` after `Calendar.getInstance`
I'm not sure why the tests didn't fail in Jenkins, but I ran the latest Spark master branch locally and `DateTimeUtilsSuite` failed.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7473 from cloud-fan/datetime and squashes the following commits:
66cdaf2 [Wenchen Fan] fix several bugs in DateTimeUtils.stringToTimestamp
cc chenghao-intel adrian-wang
Author: zhichao.li <zhichao.li@intel.com>
Closes#6872 from zhichao-li/conv and squashes the following commits:
6ef3b37 [zhichao.li] add unittest and comments
78d9836 [zhichao.li] polish dataframe api and add unittest
e2bace3 [zhichao.li] update to use ImplicitCastInputTypes
cbcad3f [zhichao.li] add function conv
Instead of returning false, it is better to throw an exception when checking equality between an external and an internal row.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7460 from cloud-fan/row-compare and squashes the following commits:
8a20911 [Wenchen Fan] improve equals
402daa8 [Wenchen Fan] throw exception when check equality between external and internal row
Added two projections: GenerateUnsafeProjection and FromUnsafeProjection, which could be used to convert UnsafeRow from/to GenericInternalRow.
They will re-use the buffer during projection, similar to MutableProjection (without the full interface MutableProjection has).
cc rxin JoshRosen
Author: Davies Liu <davies@databricks.com>
Closes#7437 from davies/unsafe_proj2 and squashes the following commits:
dbf538e [Davies Liu] test with all the expression (only for supported types)
dc737b2 [Davies Liu] address comment
e424520 [Davies Liu] fix scala style
70e231c [Davies Liu] address comments
729138d [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_proj2
5a26373 [Davies Liu] unsafe projections
Currently we stop project collapsing when the lower projection has nondeterministic expressions. However, this is sometimes overkill; we should be able to optimize `df.select(Rand(10)).select('a)` to `df.select('a)`.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7445 from cloud-fan/non-deterministic and squashes the following commits:
0deaef6 [Wenchen Fan] Improve project collapse with nondeterministic expressions
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7291 from cloud-fan/row and squashes the following commits:
a11addf [Wenchen Fan] move hashCode back to internal row
2de6180 [Wenchen Fan] making apply() call to get()
fbe1b24 [Wenchen Fan] add null check
ebdf148 [Wenchen Fan] address comments
25ef087 [Wenchen Fan] remove duplicated equals method for Row
This builds on #7433 but also removes LeafNode/UnaryNode. These are slightly more complicated to remove. I had to change some abstract classes to traits in order for it to work.
The problem with LeafNode/UnaryNode is that they are often mixed in at the end of an Expression, and then the toString function actually gets resolved to the ones defined in TreeNode, rather than in Expression.
Author: Reynold Xin <rxin@databricks.com>
Closes#7434 from rxin/remove-binary-unary-leaf-node and squashes the following commits:
9e8a4de [Reynold Xin] Generator should not be foldable.
3135a8b [Reynold Xin] SortOrder should not be foldable.
9c589cf [Reynold Xin] Fixed one more test case...
2225331 [Reynold Xin] Aggregate expressions should not be foldable.
16b5c90 [Reynold Xin] [SPARK-9085][SQL] Remove LeafNode, UnaryNode, BinaryNode from TreeNode.
JIRA: https://issues.apache.org/jira/browse/SPARK-6941
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes#7342 from yijieshen/SPARK-6941 and squashes the following commits:
f82cbe7 [Yijie Shen] reorder import
dd67e40 [Yijie Shen] resolve comments
09518af [Yijie Shen] fix import order in DataframeSuite
0c635d4 [Yijie Shen] make match more specific
9df388d [Yijie Shen] move check into PreWriteCheck
847ab20 [Yijie Shen] Detect insertion error in DataSourceStrategy
Clean up Maven for a clean import in Scala IDE / Eclipse.
* remove groovy plugin which is really not needed at all
* add-source from build-helper-maven-plugin is not needed, as recent versions of scala-maven-plugin do it automatically
* add lifecycle-mapping plugin to hide a few useless warnings from ide
Author: Jan Prach <jendap@gmail.com>
Closes#7375 from jendap/clean-project-import-in-scala-ide and squashes the following commits:
c4b4c0f [Jan Prach] fix whitespaces
5a83e07 [Jan Prach] Revert "remove java compiler warnings from java tests"
312007e [Jan Prach] scala-maven-plugin itself add scala sources by default
f47d856 [Jan Prach] remove spark-1.4-staging repository
c8a54db [Jan Prach] remove java compiler warnings from java tests
999a068 [Jan Prach] remove some maven warnings in scala ide
80fbdc5 [Jan Prach] remove groovy and gmavenplus plugin
Jira https://issues.apache.org/jira/browse/SPARK-8995
In PR #6981 we noticed that we cannot cast date strings that contain a time, like '2015-03-18 12:39:40', to date. Besides, it's not possible to cast a string like '18:03:20' to a timestamp.
If a time is passed without a date, today is inferred as the date.
Author: Tarek Auel <tarek.auel@googlemail.com>
Author: Tarek Auel <tarek.auel@gmail.com>
Closes#7353 from tarekauel/SPARK-8995 and squashes the following commits:
14f333b [Tarek Auel] [SPARK-8995] added tests for daylight saving time
ca1ae69 [Tarek Auel] [SPARK-8995] style fix
d20b8b4 [Tarek Auel] [SPARK-8995] bug fix: distinguish between 0 and null
ef05753 [Tarek Auel] [SPARK-8995] added check for year >= 1000
01c9ff3 [Tarek Auel] [SPARK-8995] support for time strings
34ec573 [Tarek Auel] fixed style
71622c0 [Tarek Auel] improved timestamp and date parsing
0e30c0a [Tarek Auel] Hive compatibility
cfbaed7 [Tarek Auel] fixed wrong checks
71f89c1 [Tarek Auel] [SPARK-8995] minor style fix
f7452fa [Tarek Auel] [SPARK-8995] removed old timestamp parsing
30e5aec [Tarek Auel] [SPARK-8995] date and timestamp cast
c1083fb [Tarek Auel] [SPARK-8995] cast date strings like '2015-01-01 12:15:31' to date or timestamp
We don't support complex expression keys in rollup/cube, and we don't even report it when the GROUP BY keys are complex, which can produce very confusing/incorrect results.
e.g. `SELECT key%100 FROM src GROUP BY key %100 with ROLLUP`
This PR adds an additional projection during analysis for complex GROUP BY keys, and that projection will be the child of `Expand`, so from `Expand`'s perspective the GROUP BY keys are always simple keys (attribute references).
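Conceptually, the added projection rewrites the query as below (the `_gby0` alias is hypothetical; the analyzer's actual generated names may differ):
```
// With this PR, a complex GROUP BY key under ROLLUP is first projected out,
// so Expand only ever sees a simple attribute. Conceptually:
sqlContext.sql("SELECT key % 100 FROM src GROUP BY key % 100 WITH ROLLUP")
// behaves like:
sqlContext.sql(
  "SELECT _gby0 FROM (SELECT key % 100 AS _gby0 FROM src) t GROUP BY _gby0 WITH ROLLUP")
```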
Author: Cheng Hao <hao.cheng@intel.com>
Closes#7343 from chenghao-intel/expand and squashes the following commits:
1ebbb59 [Cheng Hao] update the comment
827873f [Cheng Hao] update as feedback
34def69 [Cheng Hao] Add more unit test and comments
c695760 [Cheng Hao] fix bug of incorrect result for rollup
based on https://github.com/apache/spark/pull/7348
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7420 from cloud-fan/type-check and squashes the following commits:
7633fa9 [Wenchen Fan] revert
fe169b0 [Wenchen Fan] improve test
03b70da [Wenchen Fan] enhance implicit type cast
This PR adds support for:
- `BinaryType` for `Length`
- `FormatNumber`
Author: Cheng Hao <hao.cheng@intel.com>
Closes#7034 from chenghao-intel/expression and squashes the following commits:
e534b87 [Cheng Hao] python api style issue
601bbf5 [Cheng Hao] add python API support
3ebe288 [Cheng Hao] update as feedback
52274f7 [Cheng Hao] add support for udf_format_number and length for binary
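Hedged examples of the added behavior (output comments assume Hive-compatible semantics):
```
sqlContext.sql("SELECT format_number(1234567.891, 2)").show()   // 1,234,567.89
sqlContext.sql("SELECT length(CAST('spark' AS BINARY))").show() // 5: length now accepts BinaryType
```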
These traits are not super useful, and yet cause problems with toString in expressions due to the order in which they are mixed in.
Author: Reynold Xin <rxin@databricks.com>
Closes#7433 from rxin/remove-binary-node and squashes the following commits:
1881f78 [Reynold Xin] [SPARK-9086][SQL] Remove BinaryNode from TreeNode.
I also took the chance to more explicitly define the semantics of deterministic.
Author: Reynold Xin <rxin@databricks.com>
Closes#7428 from rxin/non-deterministic and squashes the following commits:
a760827 [Reynold Xin] [SPARK-9071][SQL] MonotonicallyIncreasingID and SparkPartitionID should be marked as nondeterministic.
Fix teardown to skip table delete if the Hive context is null.
Author: Steve Loughran <stevel@hortonworks.com>
Closes#7425 from steveloughran/stevel/patches/SPARK-9070-JavaDataFrameSuite-NPE and squashes the following commits:
1982d38 [Steve Loughran] SPARK-9070 JavaDataFrameSuite teardown NPEs if setup failed
https://issues.apache.org/jira/browse/SPARK-8221
One concern is that the result will be negative if the divisor is not positive (e.g. pmod(7, -3)), but the behavior is the same as Hive's.
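A minimal sketch of the usual pmod definition (assumed here to match the PR's Hive-compatible semantics):
```
// Push the raw remainder toward the divisor's sign when signs differ.
def pmod(a: Int, n: Int): Int = {
  val r = a % n
  if (r != 0 && (r ^ n) < 0) r + n else r
}
pmod(7, 3)  // 1
pmod(-7, 3) // 2
pmod(7, -3) // -2: negative because the divisor is negative, matching Hive
```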
Author: zhichao.li <zhichao.li@intel.com>
Closes#6783 from zhichao-li/pmod2 and squashes the following commits:
7083eb9 [zhichao.li] update to the latest type checking
d26dba7 [zhichao.li] add pmod
We can keep expressions' mutable states in the generated class (like `SpecificProjection`) as member variables, so that we can read and modify them inside codegened expressions.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7392 from cloud-fan/mutable-state and squashes the following commits:
eb3a221 [Wenchen Fan] fix order
73144d8 [Wenchen Fan] naming improvement
318f41d [Wenchen Fan] address more comments
d43b65d [Wenchen Fan] address comments
fd45c7a [Wenchen Fan] Support mutable state in code gen expressions
JIRA: https://issues.apache.org/jira/browse/SPARK-8840
Currently the type coercion rules don't include float type. This PR simply adds it.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes#7280 from viirya/add_r_float_coercion and squashes the following commits:
c86dc0e [Liang-Chi Hsieh] For comments.
dbf0c1b [Liang-Chi Hsieh] Implicitly convert Double to Float based on provided schema.
733015a [Liang-Chi Hsieh] Add test case for DataFrame with float type.
30c2a40 [Liang-Chi Hsieh] Update test case.
52b5294 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_r_float_coercion
6f9159d [Liang-Chi Hsieh] Add another test case.
8db3244 [Liang-Chi Hsieh] schema also needs to support float. add test case.
0dcc992 [Liang-Chi Hsieh] Add float coercion on SparkR.
Reverts #7216 and #7386. These patches seem to be causing quite a few test failures:
```
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedMethodAccessor322.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:351)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getPartitionsByFilter$1.apply(ClientWrapper.scala:320)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getPartitionsByFilter$1.apply(ClientWrapper.scala:318)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:180)
at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:135)
at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:172)
at org.apache.spark.sql.hive.client.ClientWrapper.getPartitionsByFilter(ClientWrapper.scala:318)
at org.apache.spark.sql.hive.client.HiveTable.getPartitions(ClientInterface.scala:78)
at org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(HiveMetastoreCatalog.scala:670)
at org.apache.spark.sql.hive.execution.HiveTableScan.doExecute(HiveTableScan.scala:137)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:90)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:90)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:89)
at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:164)
at org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1.apply(Exchange.scala:151)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
... 85 more
Caused by: MetaException(message:Filtering is supported only on partition keys of type string)
at org.apache.hadoop.hive.metastore.parser.ExpressionTree$FilterBuilder.setError(ExpressionTree.java:185)
at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.getJdoFilterPushdownParam(ExpressionTree.java:452)
at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilterOverPartitions(ExpressionTree.java:357)
at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilter(ExpressionTree.java:279)
at org.apache.hadoop.hive.metastore.parser.ExpressionTree$TreeNode.generateJDOFilter(ExpressionTree.java:243)
at org.apache.hadoop.hive.metastore.parser.ExpressionTree.generateJDOFilterFragment(ExpressionTree.java:590)
at org.apache.hadoop.hive.metastore.ObjectStore.makeQueryFilterString(ObjectStore.java:2417)
at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsViaOrmFilter(ObjectStore.java:2029)
at org.apache.hadoop.hive.metastore.ObjectStore.access$500(ObjectStore.java:146)
at org.apache.hadoop.hive.metastore.ObjectStore$4.getJdoResult(ObjectStore.java:2332)
```
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-Maven-with-YARN/2945/HADOOP_PROFILE=hadoop-2.4,label=centos/testReport/junit/org.apache.spark.sql.hive.execution/SortMergeCompatibilitySuite/auto_sortmerge_join_16/
Author: Michael Armbrust <michael@databricks.com>
Closes#7409 from marmbrus/revertMetastorePushdown and squashes the following commits:
92fabd3 [Michael Armbrust] Revert SPARK-6910 and SPARK-9027
5d3bdf2 [Michael Armbrust] Revert "[SPARK-9027] [SQL] Generalize metastore predicate pushdown"
This patch makes the following changes:
1. ExpectsInputTypes only defines expected input types, but does not perform any implicit type casting.
2. ImplicitCastInputTypes is a new trait that defines expected input types and also performs implicit type casting.
3. BinaryOperator has a new abstract function "inputType", which defines the expected input type for both left/right. Concrete BinaryOperator expressions no longer perform any implicit type casting.
4. For BinaryOperators, convert NullType (i.e. null literals) into some accepted type so BinaryOperators don't need to handle NullTypes.
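Simplified stand-ins (not Spark's exact classes) for the separation described in points 1-3:
```
// Declares expected input types only; the analyzer just checks and reports.
trait ExpectsInputTypes {
  def inputTypes: Seq[String] // e.g. Seq("IntegerType", "IntegerType"); String is a stand-in
}

// The analyzer inserts casts only for expressions mixing in this trait.
trait ImplicitCastInputTypes extends ExpectsInputTypes

// One expected type for both the left and right operands.
abstract class BinaryOperator {
  def inputType: String
}
```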
TODOs needed: fix unit tests for error reporting.
I'm intentionally not changing anything in aggregate expressions because yhuai is doing a big refactoring on that right now.
Author: Reynold Xin <rxin@databricks.com>
Closes#7348 from rxin/typecheck and squashes the following commits:
8fcf814 [Reynold Xin] Fixed ordering of cases.
3bb63e7 [Reynold Xin] Style fix.
f45408f [Reynold Xin] Comment update.
aa7790e [Reynold Xin] Moved RemoveNullTypes into ImplicitTypeCasts.
438ea07 [Reynold Xin] space
d55c9e5 [Reynold Xin] Removes NullTypes.
360d124 [Reynold Xin] Fixed the rule.
fb66657 [Reynold Xin] Convert NullType into some accepted type for BinaryOperators.
2e22330 [Reynold Xin] Fixed unit tests.
4932d57 [Reynold Xin] Style fix.
d061691 [Reynold Xin] Rename existing ExpectsInputTypes -> ImplicitCastInputTypes.
e4727cc [Reynold Xin] BinaryOperator should not be doing implicit cast.
d017861 [Reynold Xin] Improve expression type checking.
SPARK-8317 changed the SQL Exchange operator so that it no longer pushed sorting into Spark's shuffle layer, a change which allowed more efficient SQL-specific sorters to be used.
This patch performs some leftover cleanup based on those changes:
- Exchange's constructor should no longer accept a `newOrdering` since it's no longer used and no longer works as expected.
- `addOperatorsIfNecessary` looked at shuffle input's output ordering to decide whether to sort, but this is the wrong node to be examining: it needs to look at whether the post-shuffle node has the right ordering, since shuffling will not preserve row orderings. Thanks to davies for spotting this.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7407 from JoshRosen/SPARK-9050 and squashes the following commits:
e70be50 [Josh Rosen] No need to wrap line
e866494 [Josh Rosen] Refactor addOperatorsIfNecessary to make code clearer
2e467da [Josh Rosen] Remove `newOrdering` from Exchange.
This fixes a compilation break under Scala 2.11:
```
[error] /home/jenkins/workspace/Spark-Master-Scala211-Compile/sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java:135: error: <anonymous org.apache.spark.sql.execution.UnsafeExternalRowSorter$1> is not abstract and does not override abstract method <B>minBy(Function1<InternalRow,B>,Ordering<B>) in TraversableOnce
[error] return new AbstractScalaRowIterator() {
[error] ^
[error] where B,A are type-variables:
[error] B extends Object declared in method <B>minBy(Function1<A,B>,Ordering<B>)
[error] A extends Object declared in interface TraversableOnce
[error] 1 error
```
The workaround for this is to make `AbstractScalaRowIterator` into a concrete class.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7405 from JoshRosen/SPARK-9045 and squashes the following commits:
cbcbb4c [Josh Rosen] Forgot that we can't use the ??? operator anymore
577ba60 [Josh Rosen] [SPARK-9045] Fix Scala 2.11 build break in UnsafeExternalRowSorter.
This pull request adds a Scalastyle regex rule which fails the style check if `Class.forName` is used directly. `Class.forName` always loads classes from the default / system classloader, but in a majority of cases, we should be using Spark's own `Utils.classForName` instead, which tries to load classes from the current thread's context classloader and falls back to the classloader which loaded Spark when the context classloader is not defined.
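A sketch of the context-classloader-first pattern described above (`Utils.classForName`'s real body and signature may differ):
```
object ClassLoading {
  def classForName(className: String): Class[_] = {
    // Try the thread's context classloader first, then fall back to the
    // classloader that loaded this class (i.e. the one that loaded Spark).
    val loader = Option(Thread.currentThread().getContextClassLoader)
      .getOrElse(getClass.getClassLoader)
    Class.forName(className, true, loader)
  }
}
```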
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7350 from JoshRosen/ban-Class.forName and squashes the following commits:
e3e96f7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into ban-Class.forName
c0b7885 [Josh Rosen] Hopefully fix the last two cases
d707ba7 [Josh Rosen] Fix uses of Class.forName that I missed in my first cleanup pass
046470d [Josh Rosen] Merge remote-tracking branch 'origin/master' into ban-Class.forName
62882ee [Josh Rosen] Fix uses of Class.forName or add exclusion.
d9abade [Josh Rosen] Add stylechecker rule to ban uses of Class.forName
JIRA: https://issues.apache.org/jira/browse/SPARK-8800
Previously, we turn to Java BigDecimal's divide with specified ROUNDING_MODE to avoid non-terminating decimal expansion problem. However, as JihongMA reported, for the division operation on some specific values, we get inaccurate results.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#7212 from viirya/fix_decimal4 and squashes the following commits:
4205a0a [Liang-Chi Hsieh] Fix inaccuracy precision/scale of Decimal division operation.
Add support for pushing down metastore filters that are in different orders and add some unit tests.
Author: Michael Armbrust <michael@databricks.com>
Closes#7386 from marmbrus/metastoreFilters and squashes the following commits:
05a4524 [Michael Armbrust] [SPARK-9027][SQL] Generalize metastore predicate pushdown
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7389 from cloud-fan/case-when and squashes the following commits:
ea4b6ba [Wenchen Fan] shortcut for case key when
This is a follow up of remaining comments from #6851
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#7387 from adrian-wang/udflgfollow and squashes the following commits:
6163e62 [Daoyuan Wang] add skipping null values
e8c2e09 [Daoyuan Wang] use seq
8362966 [Daoyuan Wang] pr6851 follow up
This PR supersedes my old one #6921. Since my patch has changed quite a bit, I am opening a new PR to make it easier to review.
The changes include:
* Implement `toMetastoreFilter()` function in `HiveShim` that takes `Seq[Expression]` and converts them into a filter string for Hive metastore.
* This function matches all the `AttributeReference` + `BinaryComparisonOp` + `Integral/StringType` patterns in `Seq[Expression]` and folds them into a string (see the sketch after this list).
* Change `hiveQlPartitions` field in `MetastoreRelation` to `getHiveQlPartitions()` function that takes a filter string parameter.
* Call `getHiveQlPartitions()` in `HiveTableScan` with a filter string.
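A toy version of the expression-folding idea from the first item (assumed, heavily simplified, with stand-in expression classes):
```
sealed trait Expr
case class Attr(name: String) extends Expr
case class Lit(value: Any) extends Expr
case class Cmp(op: String, left: Expr, right: Expr) extends Expr

// Fold supported AttributeReference-vs-Literal comparisons into a filter
// string; unsupported predicates are simply dropped from the pushdown.
def toMetastoreFilter(preds: Seq[Expr]): String =
  preds.collect {
    case Cmp(op, Attr(n), Lit(v)) => s"$n $op $v"
    case Cmp(op, Lit(v), Attr(n)) => s"$v $op $n"
  }.mkString(" and ")
```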
But there are some cases in which predicate pushdown is disabled:
Case | Predicate pushdown
------- | -----------------------------
Hive integral and string types | Yes
Hive varchar type | No
Hive 0.13 and newer | Yes
Hive 0.12 and older | No
convertMetastoreParquet=false | Yes
convertMetastoreParquet=true | No
In case of `convertMetastoreParquet=true`, predicates are not pushed down because this conversion happens in an `Analyzer` rule (`HiveMetastoreCatalog.ParquetConversions`). At this point, `HiveTableScan` hasn't run, so predicates are not available. But reading the source code, I think it is intentional to convert the entire Hive table w/ all the partitions into `ParquetRelation` because then `ParquetRelation` can be cached and reused for any query against that table. Please correct me if I am wrong.
cc marmbrus
Author: Cheolsoo Park <cheolsoop@netflix.com>
Closes#7216 from piaozhexiu/SPARK-6910-2 and squashes the following commits:
aa1490f [Cheolsoo Park] Fix ordering of imports
c212c4d [Cheolsoo Park] Incorporate review comments
5e93f9d [Cheolsoo Park] Predicate pushdown into Hive metastore
Author: Vinod K C <vinod.kc@huawei.com>
Closes#7040 from vinodkc/fix_CaseKeyWhen_equalNullSafe and squashes the following commits:
be5e641 [Vinod K C] Renamed equalNullSafe to threeValueEquals
aac9f67 [Vinod K C] Updated test suite and genCode method
f2d0b53 [Vinod K C] Fix equalNullSafe comparison
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7355 from cloud-fan/fromString and squashes the following commits:
3bbb9d6 [Wenchen Fan] fix code gen
7dab957 [Wenchen Fan] naming fix
0fbbe19 [Wenchen Fan] address comments
ac1f3d1 [Wenchen Fan] Support casting between IntervalType and StringType
chenghao-intel zhichao-li qiansl127
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#6851 from adrian-wang/udflg and squashes the following commits:
0f1bff2 [Daoyuan Wang] address comments from davis
7a6bdbb [Daoyuan Wang] add '.' for hex()
c1f6824 [Daoyuan Wang] add codegen, test for all types
ec625b0 [Daoyuan Wang] conditional function: least/greatest
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7339 from cloud-fan/minor and squashes the following commits:
84a2128 [Wenchen Fan] remove unapply
6a37c12 [Wenchen Fan] remove unnecessary abstraction for ExtractValue
This is a follow-up of [SPARK-8888] [1], which also aims to optimize writing dynamic partitions.
Three more changes can be made here:
1. Using `InternalRow` instead of `Row` in `BaseWriterContainer.outputWriterForRow`
2. Using `Cast` expressions to convert partition columns to strings, so that we can leverage code generation.
3. Replacing the FP-style `zip` and `map` calls with a faster imperative `while` loop (see the sketch after this list).
[1]: https://issues.apache.org/jira/browse/SPARK-8888
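An illustrative rewrite for point 3 (names are assumed, not the patch's code):
```
// FP style allocates a tuple and intermediate collections per row:
//   cols.zip(values).map { case (c, v) => convert(c, v) }
// The imperative loop avoids those allocations:
def convertAll(values: Array[Any], convert: Any => String): Array[String] = {
  val out = new Array[String](values.length)
  var i = 0
  while (i < values.length) { out(i) = convert(values(i)); i += 1 }
  out
}
```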
Author: Cheng Lian <lian@databricks.com>
Closes#7331 from liancheng/spark-8961 and squashes the following commits:
b5ab9ae [Cheng Lian] Casts Java iterator to Scala iterator explicitly
719e63b [Cheng Lian] Makes BaseWriterContainer.outputWriterForRow accepts InternalRow instead of Row
Author: Cheng Lian <lian@databricks.com>
Closes#7347 from liancheng/spark-8990 and squashes the following commits:
045698c [Cheng Lian] SPARK-8990 DataFrameReader.parquet() should respect user specified options
This patch adds a cache-friendly external sorter which operates on serialized bytes and uses this sorter to implement a new sort operator for Spark SQL and DataFrames.
### Overview of the new sorter
The new sorter design is inspired by [Alphasort](http://research.microsoft.com/pubs/68249/alphasort.doc) and implements a key-prefix optimization in order to improve the cache friendliness of the sort. In naive sort implementations, the sorting algorithm operates on an array of record pointers. To compare two records for ordering, the sorter must dereference these pointers, which likely involves random memory access, then compare the objects themselves.
![image](https://cloud.githubusercontent.com/assets/50748/8611390/3b1402ae-2675-11e5-8308-1a10bf347e6e.png)
In a key-prefix sort, the sort operates on an array which stores the record pointer alongside a prefix of the record's key. When comparing two records for ordering, the sorter first compares the stored key prefixes. If the ordering can be determined from the key prefixes (i.e. the prefixes are unequal), then the sort can avoid directly comparing the records, avoiding random memory accesses and full record comparisons. For example, if we're sorting a list of strings then we can store the first 8 bytes of the UTF-8 encoded string as the key-prefix and can perform unsigned byte-at-a-time comparisons to determine the ordering of strings based on their prefixes, only resorting to full comparisons for strings that share a common prefix. In cases where the sort key can fit entirely in the space allotted for the key prefix (e.g. the sorting key is an integer), we completely avoid direct record comparison.
In this patch's implementation of key-prefix sorting, our sorter's internal array stores a 64-bit long and 64-bit pointer for each record being sorted. The key prefixes are generated by the user when inserting records into the sorter, which uses a user-defined comparison function for comparing them. The `PrefixComparators` object implements a set of comparators for many common types, including primitive numeric types and UTF-8 strings.
The actual sorting is implemented by `UnsafeInMemorySorter`. Most consumers will not use this directly, but instead will use `UnsafeExternalSorter`, a class which implements a sort that can spill to disk in response to memory pressure. Internally, `UnsafeExternalSorter` creates `UnsafeInMemorySorters` to perform sorting and uses `UnsafeSortSpillReader/Writer` to spill and read back runs of sorted records and `UnsafeSortSpillMerger` to merge multiple sorted spills into a single sorted iterator. This external sorter integrates with Spark's existing ShuffleMemoryManager for controlling spilling.
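A compact sketch of the key-prefix comparison described above (illustrative, not the PR's code; the comparator functions are assumed inputs):
```
final case class SortEntry(prefix: Long, recordPointer: Long)

def compare(a: SortEntry, b: SortEntry,
            prefixCmp: (Long, Long) => Int,
            fullCmp: (Long, Long) => Int): Int = {
  val c = prefixCmp(a.prefix, b.prefix)
  if (c != 0) c // decided from prefixes alone; no random memory access
  else fullCmp(a.recordPointer, b.recordPointer) // fall back to full record comparison
}
```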
Many parts of this sorter's design are based on / copied from the more specialized external sort implementation that I designed for the new UnsafeShuffleManager write path; see #5868 for more details on that patch.
### Sorting rows in Spark SQL
For now, `UnsafeExternalSorter` is only used by Spark SQL, which uses it to implement a new sort operator, `UnsafeExternalSort`. This sort operator uses a SQL-specific class called `UnsafeExternalRowSorter` that configures an `UnsafeExternalSorter` to use prefix generators and comparators that operate on rows encoded in the UnsafeRow format that was designed for Project Tungsten.
I used some interesting unit-testing techniques to test this patch's SQL-specific components. `UnsafeExternalSortSuite` uses the SQL random data generators introduced in #7176 to test the UnsafeSort operator with all atomic types both with and without nullability and in both ascending and descending sort orders. `PrefixComparatorsSuite` contains a cool use of ScalaCheck + ScalaTest's `GeneratorDrivenPropertyChecks` in order to test UTF8String prefix comparison.
### Misc. additional improvements made in this patch
This patch made several miscellaneous improvements to related code in Spark SQL:
- The logic for selecting physical sort operator implementations, which was partially duplicated in both `Exchange` and `SparkStrategies`, has now been consolidated into a `getSortOperator()` helper function in `SparkStrategies`.
- The `SparkPlanTest` unit testing helper trait has been extended with new methods for comparing the output produced by two different physical plans. This makes it easy to write tests which assert that two physical operator implementations should produce the same output. I also added a method for disabling the implicit sorting of outputs prior to comparing them, a change which is necessary in order to be able to write proper SparkPlan tests for sort operators.
### Tasks deferred to followup patches
While most of this patch's features are reasonably well-tested and complete, there are a number of tasks that are intentionally being deferred to followup patches:
- Add tests which mock the ShuffleMemoryManager to check that memory pressure properly triggers spilling (there are examples of this type of test in #5868).
- Add tests to ensure that spill files are properly cleaned up after errors. I'd like to do this in the context of a patch which introduces more general metrics for ensuring proper cleanup of tasks' temporary files; see https://issues.apache.org/jira/browse/SPARK-8966 for more details.
- Metrics integration: there are some open questions regarding how to track / report spill metrics for non-shuffle operations, so I've deferred most of the IO / shuffle metrics integration for now.
- Performance profiling.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#6444 from JoshRosen/sql-external-sort and squashes the following commits:
6beb467 [Josh Rosen] Remove a bunch of overloaded methods to avoid default args. issue
2bbac9c [Josh Rosen] Merge remote-tracking branch 'origin/master' into sql-external-sort
35dad9f [Josh Rosen] Make sortAnswers = false the default in SparkPlanTest
5135200 [Josh Rosen] Fix spill reading for large rows; add test
2f48777 [Josh Rosen] Add test and fix bug for sorting empty arrays
d1e28bc [Josh Rosen] Merge remote-tracking branch 'origin/master' into sql-external-sort
cd05866 [Josh Rosen] Fix scalastyle
3947fc1 [Josh Rosen] Merge remote-tracking branch 'origin/master' into sql-external-sort
d13ac55 [Josh Rosen] Hacky approach to copying of UnsafeRows for sort followed by limit.
845bea3 [Josh Rosen] Remove unnecessary zeroing of row conversion buffer
c56ec18 [Josh Rosen] Clean up final row copying code.
d31f180 [Josh Rosen] Re-enable NullType sorting test now that SPARK-8868 is fixed
844f4ca [Josh Rosen] Merge remote-tracking branch 'origin/master' into sql-external-sort
293f109 [Josh Rosen] Add missing license header.
f99a612 [Josh Rosen] Fix bugs in string prefix comparison.
9d00afc [Josh Rosen] Clean up prefix comparators for integral types
88aff18 [Josh Rosen] NULL_PREFIX has to be negative infinity for floating point types
613e16f [Josh Rosen] Test with larger data.
1d7ffaa [Josh Rosen] Somewhat hacky fix for descending sorts
08701e7 [Josh Rosen] Fix prefix comparison of null primitives.
b86e684 [Josh Rosen] Set global = true in UnsafeExternalSortSuite.
1c7bad8 [Josh Rosen] Make sorting of answers explicit in SparkPlanTest.checkAnswer().
b81a920 [Josh Rosen] Temporarily enable only the passing sort tests
5d6109d [Josh Rosen] Fix inconsistent handling / encoding of record lengths.
87b6ed9 [Josh Rosen] Fix critical issues in test which led to false negatives.
8d7fbe7 [Josh Rosen] Fixes to multiple spilling-related bugs.
82e21c1 [Josh Rosen] Force spilling in UnsafeExternalSortSuite.
88b72db [Josh Rosen] Test ascending and descending sort orders.
f27be09 [Josh Rosen] Fix tests by binding attributes.
0a79d39 [Josh Rosen] Revert "Undo part of a SparkPlanTest change in #7162 that broke my test."
7c3c864 [Josh Rosen] Undo part of a SparkPlanTest change in #7162 that broke my test.
9969c14 [Josh Rosen] Merge remote-tracking branch 'origin/master' into sql-external-sort
5822e6f [Josh Rosen] Fix test compilation issue
939f824 [Josh Rosen] Remove code gen experiment.
0dfe919 [Josh Rosen] Implement prefix sort for strings (albeit inefficiently).
66a813e [Josh Rosen] Prefix comparators for float and double
b310c88 [Josh Rosen] Integrate prefix comparators for Int and Long (others coming soon)
95058d9 [Josh Rosen] Add missing SortPrefixUtils file
4c37ba6 [Josh Rosen] Add tests for sorting on all primitive types.
6890863 [Josh Rosen] Fix memory leak on empty inputs.
d246e29 [Josh Rosen] Fix consideration of column types when choosing sort implementation.
6b156fb [Josh Rosen] Some WIP work on prefix comparison.
7f875f9 [Josh Rosen] Commit failing test demonstrating bug in handling objects in spills
41b8881 [Josh Rosen] Get UnsafeInMemorySorterSuite to pass (WIP)
90c2b6a [Josh Rosen] Update test name
6d6a1e6 [Josh Rosen] Centralize logic for picking sort operator implementations
9869ec2 [Josh Rosen] Clean up Exchange code a bit
82bb0ec [Josh Rosen] Fix IntelliJ complaint due to negated if condition
1db845a [Josh Rosen] Many more changes to harmonize with shuffle sorter
ebf9eea [Josh Rosen] Harmonization with shuffle's unsafe sorter
206bfa2 [Josh Rosen] Add some missing newlines at the ends of files
26c8931 [Josh Rosen] Back out some Hive changes that aren't needed anymore
62f0bb8 [Josh Rosen] Update to reflect SparkPlanTest changes
21d7d93 [Josh Rosen] Back out of BlockObjectWriter change
7eafecf [Josh Rosen] Port test to SparkPlanTest
d468a88 [Josh Rosen] Update for InternalRow refactoring
269cf86 [Josh Rosen] Back out SMJ operator change; isolate changes to selection of sort op.
1b841ca [Josh Rosen] WIP towards copying
b420a71 [Josh Rosen] Move most of the existing SMJ code into Java.
dfdb93f [Josh Rosen] SparkFunSuite change
73cc761 [Josh Rosen] Fix whitespace
9cc98f5 [Josh Rosen] Move more code to Java; fix bugs in UnsafeRowConverter length type.
c8792de [Josh Rosen] Remove some debug logging
dda6752 [Josh Rosen] Commit some missing code from an old git stash.
58f36d0 [Josh Rosen] Merge in a sketch of a unit test for the new sorter (now failing).
2bd8c9a [Josh Rosen] Import my original tests and get them to pass.
d5d3106 [Josh Rosen] WIP towards external sorter for Spark SQL.
Author: Jonathan Alter <jonalter@users.noreply.github.com>
Closes#7093 from jonalter/SPARK-7977 and squashes the following commits:
ccd44cc [Jonathan Alter] Changed println to log in ThreadingSuite
7fcac3e [Jonathan Alter] Reverting to println in ThreadingSuite
10724b6 [Jonathan Alter] Changing some printlns to logs in tests
eeec1e7 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
0b1dcb4 [Jonathan Alter] More println cleanup
aedaf80 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
925fd98 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
0c16fa3 [Jonathan Alter] Replacing some printlns with logs
45c7e05 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
5c8e283 [Jonathan Alter] Allowing println in audit-release examples
5b50da1 [Jonathan Alter] Allowing printlns in example files
ca4b477 [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
83ab635 [Jonathan Alter] Fixing new printlns
54b131f [Jonathan Alter] Merge branch 'master' of github.com:apache/spark into SPARK-7977
1cd8a81 [Jonathan Alter] Removing some unnecessary comments and printlns
b837c3a [Jonathan Alter] Disallowing println
In my test, the number of `sessions` and `executions` in ThriftServer2 does not match the number of connections.
For example, with 200 clients connected to the server, there can be more than 200 `sessions` and `executions`.
So when the `retainedStatements` limit is reached, the server has to remove objects that are not yet finished,
which may cause the exception described in the [JIRA](https://issues.apache.org/jira/browse/SPARK-8839).
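Based on the commit messages below ("Add a filter before take"), a minimal sketch of the fix, with hypothetical field and collection names:
```
// Sketch only: evict finished statements first, never running ones
val numToRemove = executionList.size - retainedStatements
executionList.values
  .filter(_.finishTimestamp > 0)   // keep statements that are still running
  .take(numToRemove)
  .foreach(e => executionList.remove(e.execId))
```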
Author: huangzhaowei <carlmartinmax@gmail.com>
Closes#7239 from SaintBacchus/SPARK-8839 and squashes the following commits:
cf7ef40 [huangzhaowei] Remove the a meanless funciton call
3e9a5a6 [huangzhaowei] Add a filter before take
9d5ceb8 [huangzhaowei] [SPARK-8839][SQL]ThriftServer2 will remove session and execution no matter it's finished or not.
These two dependencies were introduced in #7231 to help testing Parquet compatibility with `parquet-thrift`. However, they somehow crash the Scala compiler in Maven builds.
This PR fixes this issue by:
1. Removing these two dependencies, and
2. Instead of generating the testing Parquet file programmatically, checking in an actual testing Parquet file generated by `parquet-thrift` as a test resource.
This is just a quick fix to bring back Maven builds. We still need to figure out the root cause, as binary Parquet files are harder to maintain.
Author: Cheng Lian <lian@databricks.com>
Closes#7330 from liancheng/spark-8959 and squashes the following commits:
cf69512 [Cheng Lian] Brings back Maven builds
This PR fixes the long-standing serialization issue between Python RDDs and DataFrames by switching to a customized Pickler for InternalRow, which enables customized unpickling (type conversion, especially for UDTs). Now we can support UDTs for UDFs, cc mengxr.
There is no generated `Row` anymore.
Author: Davies Liu <davies@databricks.com>
Closes#7301 from davies/sql_ser and squashes the following commits:
81bef71 [Davies Liu] address comments
e9217bd [Davies Liu] add regression tests
db34167 [Davies Liu] Refactor of serialization for Python DataFrame
Author: Cheng Hao <hao.cheng@intel.com>
Closes#6762 from chenghao-intel/str_funcs and squashes the following commits:
b09a909 [Cheng Hao] update the code as feedback
7ebbf4c [Cheng Hao] Add more string expressions
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7315 from cloud-fan/toString and squashes the following commits:
4fc8d80 [Wenchen Fan] Implement toString for Interval data type
I merged https://github.com/apache/spark/pull/7303 so it unblocks another PR. This addresses my own code review comment for that PR.
Author: Reynold Xin <rxin@databricks.com>
Closes#7313 from rxin/adt and squashes the following commits:
7ade82b [Reynold Xin] Fixed unit tests.
f8d5533 [Reynold Xin] [SPARK-8926][SQL] Code review followup.
Also added more documentation for the file.
Author: Reynold Xin <rxin@databricks.com>
Closes#7316 from rxin/extract-value and squashes the following commits:
069cb7e [Reynold Xin] Removed ExtractValueWithOrdinal.
621b705 [Reynold Xin] Reverted a line.
11ebd6c [Reynold Xin] [Minor][SQL] Improve documentation for complex type extractors.
Exceptions will not be caught during tests.
cc marmbrus rxin
Author: Davies Liu <davies@databricks.com>
Closes#7309 from davies/fallback and squashes the following commits:
969a612 [Davies Liu] throw exception during tests
f844f77 [Davies Liu] fallback
a3091bc [Davies Liu] Merge branch 'master' of github.com:apache/spark into fallback
364a0d6 [Davies Liu] fallback to interpret mode if failed to compile
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7312 from cloud-fan/minor and squashes the following commits:
a4589fa [Wenchen Fan] use double not decimal when cast double and float to timestamp
This PR is based on #7209 authored by Sephiroth-Lin.
Author: Weizhong Lin <linweizhong@huawei.com>
Closes#7314 from liancheng/spark-8928 and squashes the following commits:
75267fe [Cheng Lian] Makes CatalystSchemaConverter sticking to 1.4.x- when handling LISTs in compatible mode
This PR is based on #7209 authored by Sephiroth-Lin.
Author: Weizhong Lin <linweizhong@huawei.com>
Closes#7304 from liancheng/spark-8928 and squashes the following commits:
75267fe [Cheng Lian] Makes CatalystSchemaConverter sticking to 1.4.x- when handling LISTs in compatible mode
For example: `cannot resolve 'testfunction(null)' due to data type mismatch: argument 1 is expected to be of type int, however, null is of type datetype.`
Author: Michael Armbrust <michael@databricks.com>
Closes#7303 from marmbrus/expectsTypeErrors and squashes the following commits:
c654a0e [Michael Armbrust] fix udts and make errors pretty
137160d [Michael Armbrust] style
5428fda [Michael Armbrust] style
10fac82 [Michael Armbrust] [SPARK-8926][SQL] Good errors for ExpectsInputType expressions
Due to the way MiMa works, we currently start a `SQLContext` pretty early on. This causes us to start a `SparkUI` that attempts to bind to port 4040. Because many tests run in parallel on the Jenkins machines, this causes port contention sometimes and fails the MiMa tests.
Note that we already disabled the SparkUI for scalatests. However, the MiMa test is run before we even have a chance to load the default scalatest settings, so we need to explicitly disable the UI ourselves.
Author: Andrew Or <andrew@databricks.com>
Closes#7300 from andrewor14/mima-flaky and squashes the following commits:
b55a547 [Andrew Or] Do not enable SparkUI during tests
We call Row.copy() in many places throughout SQL but UnsafeRow currently throws UnsupportedOperationException when copy() is called.
Supporting copying when ObjectPool is used may be difficult, since we may need to handle deep-copying of objects in the pool. In addition, this copy() method needs to produce a self-contained row object which may be passed around / buffered by downstream code which does not understand the UnsafeRow format.
In the long run, we'll need to figure out how to handle the ObjectPool corner cases, but this may be unnecessary if other changes are made. Therefore, in order to unblock my sort patch (#6444) I propose that we support copy() for the cases where UnsafeRow does not use an ObjectPool and continue to throw UnsupportedOperationException when an ObjectPool is used.
This patch accomplishes this by modifying UnsafeRow so that it knows the size of the row's backing data in order to be able to copy it into a byte array.
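A minimal sketch of the idea, assuming hypothetical accessors for the row's backing buffer and ObjectPool:
```
// Sketch: copy the row's backing region into a fresh, self-contained byte array
def copy(): UnsafeRow = {
  require(pool == null, "copy() is not supported when an ObjectPool is used")
  val bytes = new Array[Byte](sizeInBytes)
  PlatformDependent.copyMemory(
    baseObject, baseOffset, bytes, PlatformDependent.BYTE_ARRAY_OFFSET, sizeInBytes)
  val copied = new UnsafeRow
  copied.pointTo(bytes, PlatformDependent.BYTE_ARRAY_OFFSET, numFields, sizeInBytes)
  copied
}
```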
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7306 from JoshRosen/SPARK-8932 and squashes the following commits:
338e6bf [Josh Rosen] Support copy for UnsafeRows that do not use ObjectPools.
JIRA: https://issues.apache.org/jira/browse/SPARK-8866
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes#7283 from yijieshen/micro_timestamp and squashes the following commits:
dc735df [Yijie Shen] update CastSuite to avoid round error
714eaea [Yijie Shen] add timestamp_udf into blacklist due to precision lose
c3ca2f4 [Yijie Shen] fix unhandled case in CurrentTimestamp
8d4aa6b [Yijie Shen] use 1us precision for timestamp type
This PR fixes the converter for Python DataFrames, especially for DecimalType.
Closes#7106
Author: Davies Liu <davies@databricks.com>
Closes#7131 from davies/decimal_python and squashes the following commits:
4d3c234 [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python
20531d6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python
7d73168 [Davies Liu] fix conflit
6cdd86a [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python
7104e97 [Davies Liu] improve type infer
9cd5a21 [Davies Liu] run python tests with SPARK_PREPEND_CLASSES
829a05b [Davies Liu] fix UDT in python
c99e8c5 [Davies Liu] fix mima
c46814a [Davies Liu] convert decimal for Python DataFrames
As rxin suggested in #7298, we should consider removing `RDDApi`.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#7302 from sarutak/remove-rddapi and squashes the following commits:
e495d35 [Kousuke Saruta] Fixed mima
cb7ebb9 [Kousuke Saruta] Removed overriding RDDApi
This PR is a follow-up of #6617 and is part of [SPARK-6774] [2], which aims to ensure interoperability and backwards-compatibility for Spark SQL Parquet support. And this one fixes the read path. Now Spark SQL is expected to be able to read legacy Parquet data files generated by most (if not all) common libraries/tools like parquet-thrift, parquet-avro, and parquet-hive. However, we still need to refactor the write path to write standard Parquet LISTs and MAPs ([SPARK-8848] [4]).
### Major changes
1. `CatalystConverter` class hierarchy refactoring
- Replaces `CatalystConverter` trait with a much simpler `ParentContainerUpdater`.
Now instead of extending the original `CatalystConverter` trait, every converter class accepts an updater which is responsible for propagating the converted value to some parent container. For example, appending array elements to a parent array buffer, appending key-value pairs to a parent mutable map, or setting a converted value to some specific field of a parent row. The root converter doesn't have a parent and thus uses a `NoopUpdater`.
This simplifies the design since converters don't need to care about details of their parent converters anymore.
- Unifies `CatalystRootConverter`, `CatalystGroupConverter` and `CatalystPrimitiveRowConverter` into `CatalystRowConverter`
Specifically, now all row objects are represented by `SpecificMutableRow` during conversion.
- Refactors `CatalystArrayConverter`, and removes `CatalystArrayContainsNullConverter` and `CatalystNativeArrayConverter`
`CatalystNativeArrayConverter` was probably designed with the intention of avoiding boxing costs. However, the way it uses Scala generics actually doesn't achieve this goal.
The new `CatalystArrayConverter` handles both nullable and non-nullable array elements in a consistent way.
- Implements backwards-compatibility rules in `CatalystArrayConverter`
When Parquet records are being converted, schema of Parquet files should have already been verified. So we only need to care about the structure rather than field names in the Parquet schema. Since all map objects represented in legacy systems have the same structure as the standard one (see [backwards-compatibility rules for MAP] [1]), we only need to deal with LIST (namely array) in `CatalystArrayConverter`.
2. Requested columns handling
When specifying requested columns in `RowReadSupport`, we used to use a Parquet `MessageType` converted from a Catalyst `StructType` which contains all requested columns. This is not preferable when taking compatibility and interoperability into consideration, because the actual Parquet file may have a different physical structure from the converted schema.
In this PR, the schema for requested columns is constructed using the following method:
- For a column that exists in the target Parquet file, we extract the column type by name from the full file schema, and construct a single-field `MessageType` for that column.
- For a column that doesn't exist in the target Parquet file, we create a single-field `StructType` and convert it to a `MessageType` using `CatalystSchemaConverter`.
- Unions all single-field `MessageType`s into a full schema containing all requested fields.
With this change, we also fix [SPARK-6123] [3] by validating the global schema against each individual Parquet part-file.
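A rough sketch of this assembly; `requestedFields`, `fileSchema`, and `converter` are assumed names, not the actual ones:
```
// Sketch: one single-field MessageType per requested column, then union them all
import scala.collection.JavaConverters._
val perColumn = requestedFields.map { field =>
  fileSchema.getFields.asScala.find(_.getName == field.name) match {
    case Some(parquetField) => new MessageType("root", parquetField) // present in the file
    case None => converter.convert(StructType(field :: Nil))        // synthesized column
  }
}
val requestedSchema = perColumn.reduce(_ union _)
```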
### Testing
This PR also adds compatibility tests for parquet-avro, parquet-thrift, and parquet-hive. Please refer to `README.md` under `sql/core/src/test` for more information about these tests. To avoid build time code generation and adding extra complexity to the build system, Java code generated from testing Thrift schema and Avro IDL is also checked in.
[1]: https://github.com/apache/incubator-parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1
[2]: https://issues.apache.org/jira/browse/SPARK-6774
[3]: https://issues.apache.org/jira/browse/SPARK-6123
[4]: https://issues.apache.org/jira/browse/SPARK-8848
Author: Cheng Lian <lian@databricks.com>
Closes#7231 from liancheng/spark-6776 and squashes the following commits:
360fe18 [Cheng Lian] Adds ParquetHiveCompatibilitySuite
c6fbc06 [Cheng Lian] Removes WIP file committed by mistake
b8c1295 [Cheng Lian] Excludes the whole parquet package from MiMa
598c3e8 [Cheng Lian] Adds extra Maven repo for hadoop-lzo, which is a transitive dependency of parquet-thrift
926af87 [Cheng Lian] Simplifies Parquet compatibility test suites
7946ee1 [Cheng Lian] Fixes Scala styling issues
3d7ab36 [Cheng Lian] Fixes .rat-excludes
a8f13bb [Cheng Lian] Using Parquet writer API to do compatibility tests
f2208cd [Cheng Lian] Adds README.md for Thrift/Avro code generation
1d390aa [Cheng Lian] Adds parquet-thrift compatibility test
440f7b3 [Cheng Lian] Adds generated files to .rat-excludes
13b9121 [Cheng Lian] Adds ParquetAvroCompatibilitySuite
06cfe9d [Cheng Lian] Adds comments about TimestampType handling
a099d3e [Cheng Lian] More comments
0cc1b37 [Cheng Lian] Fixes MiMa checks
884d3e6 [Cheng Lian] Fixes styling issue and reverts unnecessary changes
802cbd7 [Cheng Lian] Fixes bugs related to schema merging and empty requested columns
38fe1e7 [Cheng Lian] Adds explicit return type
7fb21f1 [Cheng Lian] Reverts an unnecessary debugging change
1781dff [Cheng Lian] Adds test case for SPARK-8811
6437d4b [Cheng Lian] Assembles requested schema from Parquet file schema
bcac49f [Cheng Lian] Removes the 16-byte restriction of decimals
a74fb2c [Cheng Lian] More comments
0525346 [Cheng Lian] Removes old Parquet record converters
03c3bd9 [Cheng Lian] Refactors Parquet read path to implement backwards-compatibility rules
Adding `()` to the definition of `distinct` in DataFrame allows distinct to be called with parentheses, which is consistent with `dropDuplicates`.
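Both call styles now compile, mirroring `dropDuplicates`:
```
df.distinct    // worked before
df.distinct()  // now also works, consistent with df.dropDuplicates()
```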
Author: Cheolsoo Park <cheolsoop@netflix.com>
Closes#7298 from piaozhexiu/SPARK-8908 and squashes the following commits:
7f0d923 [Cheolsoo Park] Add () to distinct definition in dataframe
Currently, CTESubstitution only handles the case where WITH is at the top of the plan.
It should also handle the case where WITH is a child of CTAS.
This patch simply changes 'match' to 'transform' to search recursively for WITH in the plan.
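For illustration, a CTAS query of this shape (hypothetical table names) should now resolve:
```
// Illustrative only: a WITH clause as the child of CTAS
sqlContext.sql("""
  CREATE TABLE t AS
  WITH q AS (SELECT key FROM src)
  SELECT key FROM q
""")
```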
Author: Keuntae Park <sirpkt@apache.org>
Closes#7180 from sirpkt/SPARK-8783 and squashes the following commits:
e4428f0 [Keuntae Park] Merge remote-tracking branch 'upstream/master' into CTASwithWITH
1671c77 [Keuntae Park] WITH clause can be inside CTAS
Just a baby step towards making it more efficient.
Author: Reynold Xin <rxin@databricks.com>
Closes#7282 from rxin/SPARK-8888 and squashes the following commits:
3da51ae [Reynold Xin] [SPARK-8888][SQL] Use java.util.HashMap in DynamicPartitionWriterContainer.
We need a new data type to represent time intervals. Because we can't determine how many days are in a month, we need two values for an interval: an int `months` and a long `microseconds`.
The interval literal syntax looks like:
`interval 3 years -4 month 4 weeks 3 second`
Because we use the number of 100ns units as the value of `TimestampType`, it may not make sense to support a nanosecond unit.
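A minimal sketch of the two-field representation described above (illustrative, not the actual class):
```
// months cannot be normalized into microseconds, hence the two separate fields
case class Interval(months: Int, microseconds: Long)
```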
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7226 from cloud-fan/interval and squashes the following commits:
632062d [Wenchen Fan] address comments
ac348c3 [Wenchen Fan] use case class
0342d2e [Wenchen Fan] use array byte
df9256c [Wenchen Fan] fix style
fd6f18a [Wenchen Fan] address comments
1856af3 [Wenchen Fan] support interval type
Author: Davies Liu <davies@databricks.com>
Closes#7272 from davies/fix_projection and squashes the following commits:
075ef76 [Davies Liu] fix codegen with BroadcastHashJion
To help UDF developers understand the failure, throw an exception when an unsupported Map<K,V> type is used in a Hive UDF. This fix is the same as #7248.
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes#7257 from maropu/ThrowExceptionWhenMapUsed and squashes the following commits:
916099a [Takeshi YAMAMURO] Fix style errors
7886dcc [Takeshi YAMAMURO] Throw an exception when Map<> used in Hive UDF
JIRA: https://issues.apache.org/jira/browse/SPARK-8785
Currently, Parquet schema merging (`ParquetRelation2.readSchema`) may spend a lot of time merging duplicate schemas. We can select only the distinct schemas and merge them later.
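A sketch of the idea, assuming a hypothetical `merge` helper on schemas:
```
// Collapse duplicates first, then merge each distinct schema only once
val merged = schemas.distinct.reduceLeft((a, b) => a.merge(b))
```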
Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes#7182 from viirya/improve_parquet_merging and squashes the following commits:
5cf934f [Liang-Chi Hsieh] Refactor it to make it faster.
f3411ea [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into improve_parquet_merging
a63c3ff [Liang-Chi Hsieh] Improve Parquet schema merging.
Remove `OverrideFunctionRegistry` from Spark SQL, as the subclasses of `FunctionRegistry` have their own way of delegating to the right underlying `FunctionRegistry`.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#7260 from chenghao-intel/override and squashes the following commits:
164d093 [Cheng Hao] enable the function registry
2ca8459 [Cheng Hao] remove the OverrideFunctionRegistry
As a baby step towards a non-megamorphic InternalRow.
Author: Reynold Xin <rxin@databricks.com>
Closes#7277 from rxin/remove-empty-row and squashes the following commits:
594100e [Reynold Xin] [SPARK-8879][SQL] Remove EmptyRow class.
Author: Reynold Xin <rxin@databricks.com>
Closes#7273 from rxin/bitwise-unittest and squashes the following commits:
60c5667 [Reynold Xin] [SPARK-8878][SQL] Improve unit test coverage for bitwise expressions.
https://issues.apache.org/jira/browse/SPARK-8868
Author: Yin Huai <yhuai@databricks.com>
Closes#7262 from yhuai/SPARK-8868 and squashes the following commits:
cb58780 [Yin Huai] Andrew's comment.
e456857 [Yin Huai] Josh's comments.
5122e65 [Yin Huai] If types of all columns are NullTypes, do not use serializer2.
Let UTF8String work with a binary buffer. Until we have a better idea of how to manage the lifecycle of UTF8String in Row, we still copy when calling `UnsafeRow.get()` for StringType.
cc rxin JoshRosen
Author: Davies Liu <davies@databricks.com>
Closes#7197 from davies/unsafe_string and squashes the following commits:
51b0ea0 [Davies Liu] fix test
50c1ebf [Davies Liu] remove optimization for upper/lower case
315d491 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_string
93fce17 [Davies Liu] address comment
e9ff7ba [Davies Liu] clean up
67ec266 [Davies Liu] fix bug
7b74b1f [Davies Liu] fallback to String if local dependent
ab7857c [Davies Liu] address comments
7da92f5 [Davies Liu] handle local in toUpperCase/toLowerCase
59dbb23 [Davies Liu] revert python change
d1e0716 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_string
002e35f [Davies Liu] rollback hashCode change
a87b7a8 [Davies Liu] improve toLowerCase and toUpperCase
76e794a [Davies Liu] fix test
8b2d5ce [Davies Liu] fix tests
fd3f0a6 [Davies Liu] bug fix
c4e9c88 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_string
c45d921 [Davies Liu] address comments
175405f [Davies Liu] unsafe UTF8String
The type alias was there because, initially, when I moved Row around, I didn't want to make massive changes to the expression code. But now it should be pretty easy to just remove it. One less concept to worry about.
Author: Reynold Xin <rxin@databricks.com>
Closes#7270 from rxin/internalrow and squashes the following commits:
72fc842 [Reynold Xin] [SPARK-8876][SQL] Remove InternalRow type alias in expressions package.
JIRA: https://issues.apache.org/jira/browse/SPARK-8794
Currently `PrunedScan` works only when followed by project or filter operations. However, even if there is a `Sample` between these operations and `PrunedScan`, `PrunedScan` should work too.
Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#7228 from viirya/sample_prunedscan and squashes the following commits:
ede7cd8 [Liang-Chi Hsieh] Keep PrunedScanSuite untouched.
6f05d30 [Liang-Chi Hsieh] Move unit test to FilterPushdownSuite.
5f32473 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into sample_prunedscan
7e4ba76 [Liang-Chi Hsieh] Use Optimzier for push down projection and filter.
0686830 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into sample_prunedscan
df82785 [Liang-Chi Hsieh] Make PrunedScan work on Sample.
We have `nullSafeCodeGen` to provide default code generation for binary and unary expressions, and we can do the same thing for `eval`.
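A sketch of what such a default null-safe `eval` for binary expressions could look like (assumed names):
```
// Short-circuit on null children; only the non-null case is expression-specific
def defaultEval(input: InternalRow)(f: (Any, Any) => Any): Any = {
  val l = left.eval(input)
  if (l == null) null
  else {
    val r = right.eval(input)
    if (r == null) null else f(l, r)
  }
}
```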
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7157 from cloud-fan/refactor and squashes the following commits:
f3987c6 [Wenchen Fan] refactor Expression
The current implementation can't handle List<> as a return type in Hive UDFs and throws a meaningless MatchError.
Assume a UDF like the one below:
```
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.hive.ql.exec.UDF;

public class UDFToListString extends UDF {
  public List<String> evaluate(Object o) {
    return Arrays.asList("xxx", "yyy", "zzz");
  }
}
```
When the UDF is used, a scala.MatchError is thrown as follows:
scala.MatchError: interface java.util.List (of class java.lang.Class)
at org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174)
at org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76)
at org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106)
at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106)
at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131)
at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95)
at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
...
To make the failure clearer to UDF developers, we need to throw a more suitable exception.
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes#7248 from maropu/FixBugInHiveInspectors and squashes the following commits:
1c3df2a [Takeshi YAMAMURO] Fix comments
56305de [Takeshi YAMAMURO] Fix conflicts
92ed7a6 [Takeshi YAMAMURO] Throw an exception when java list type used
2844a8e [Takeshi YAMAMURO] Apply comments
7114a47 [Takeshi YAMAMURO] Add TODO comments in UDFToListString of HiveUdfSuite
fdb2ae4 [Takeshi YAMAMURO] Add StringToUtf8 to comvert String into UTF8String
af61f2e [Takeshi YAMAMURO] Remove a new type
7f812fd [Takeshi YAMAMURO] Fix code-style errors
6984bf4 [Takeshi YAMAMURO] Apply review comments
93e3d4e [Takeshi YAMAMURO] Add a blank line at the end of UDFToListString
ee232db [Takeshi YAMAMURO] Support List as a return type in Hive UDF
1e82316 [Takeshi YAMAMURO] Apply comments
21e8763 [Takeshi YAMAMURO] Add TODO comments in UDFToListString of HiveUdfSuite
a488712 [Takeshi YAMAMURO] Add StringToUtf8 to comvert String into UTF8String
1c7b9d1 [Takeshi YAMAMURO] Remove a new type
f965c34 [Takeshi YAMAMURO] Fix code-style errors
9406416 [Takeshi YAMAMURO] Apply review comments
e21ce7e [Takeshi YAMAMURO] Add a blank line at the end of UDFToListString
e553f10 [Takeshi YAMAMURO] Support List as a return type in Hive UDF
JIRA: https://issues.apache.org/jira/browse/SPARK-8463
Currently, on the read path, `DriverRegistry` is used to load the needed JDBC driver on executors. However, the write path also needs `DriverRegistry` to load the JDBC driver.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#6900 from viirya/jdbc_write_driver and squashes the following commits:
16cd04b [Liang-Chi Hsieh] Use DriverRegistry to load jdbc driver at writing path.
Adds a function `checkConstraints` which checks the constraints to be applied on the dataframe / dataframe schema. The function is called before storing the dataframe to external storage, and is added to the corresponding datasource API.
cc rxin marmbrus
Author: animesh <animesh@apache.spark>
This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>
Closes#7013 from animeshbaranawal/8072 and squashes the following commits:
f70dd0e [animesh] Change IO exception to Analysis Exception
fd45e1b [animesh] 8072: Fix Style Issues
a8a964f [animesh] 8072: Improving on previous commits
3cc4d2c [animesh] Fix Style Issues
1a89115 [animesh] Fix Style Issues
98b4399 [animesh] 8072 : Moved the exception handling to ResolvedDataSource specific to parquet format
7c3d928 [animesh] 8072: Adding check to DataFrameWriter.scala
This PR adds regression test for https://issues.apache.org/jira/browse/SPARK-8588 (fixed by 457d07eaa0).
Author: Yin Huai <yhuai@databricks.com>
This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>
Closes#7103 from yhuai/SPARK-8588-test and squashes the following commits:
eb5f418 [Yin Huai] Add a query test.
c61a173 [Yin Huai] Regression test for SPARK-8588.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#7234 from adrian-wang/exchangeclean and squashes the following commits:
b093ec9 [Daoyuan Wang] remove unused code
This pull request
(1) extracts common functions used by hash outer joins and puts them in the HashOuterJoin interface,
(2) adds ShuffledHashOuterJoin and BroadcastHashOuterJoin,
(3) adds test cases for shuffled and broadcast hash outer join, and
(4) makes SparkPlanTest support binary or more complex operators, and fixes bugs in plan composition in SparkPlanTest.
Author: kai <kaizeng@eecs.berkeley.edu>
Closes#7162 from kai-zeng/outer and squashes the following commits:
3742359 [kai] Fix not-serializable exception for code-generated keys in broadcasted relations
14e4bf8 [kai] Use CanBroadcast in broadcast outer join planning
dc5127e [kai] code style fixes
b5a4efa [kai] (1) Add broadcast hash outer join, (2) Fix SparkPlanTest
Add a Python API for hex/unhex; also clean up Hex/Unhex.
Author: Davies Liu <davies@databricks.com>
Closes#7223 from davies/hex and squashes the following commits:
6f1249d [Davies Liu] no explicit rule to cast string into binary
711a6ed [Davies Liu] fix test
f9fe5a3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into hex
f032fbb [Davies Liu] Merge branch 'hex' of github.com:davies/spark into hex
49e325f [Davies Liu] Merge branch 'master' of github.com:apache/spark into hex
b31fc9a [Davies Liu] Update math.scala
25156b7 [Davies Liu] address comments and fix test
c3af78c [Davies Liu] address commments
1a24082 [Davies Liu] Add Python API for hex and unhex
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7237 from cloud-fan/parser and squashes the following commits:
e7b49bb [Wenchen Fan] support using keyword in column name
When pruning partitions for a query plan, a message is logged indicating how many partitions were selected based on predicate criteria, and what percent were pruned.
The current release erroneously uses `1 - total/selected` to compute this quantity, leading to nonsense messages like "pruned -1000% partitions". For example, selecting 20 of 100 partitions should report 80% pruned (`1 - selected/total`), but the old expression yields -400%. The fix is simple and obvious.
Author: Steve Lindemann <steve.lindemann@engineersgatelp.com>
Closes#7227 from srlindemann/master and squashes the following commits:
c788061 [Steve Lindemann] fix percentPruned log message
Otherwise it is impossible to declare an expression supporting DecimalType.
Author: Reynold Xin <rxin@databricks.com>
Closes#7232 from rxin/typecollection-adt and squashes the following commits:
934d3d1 [Reynold Xin] [SPARK-8831][SQL] Support AbstractDataType in TypeCollection.
This is a follow-up of #6843.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#7230 from chenghao-intel/str_funcs2_followup and squashes the following commits:
52cc553 [Cheng Hao] update the code as comment
ping liancheng
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#7224 from viirya/few_fix_catalystschema and squashes the following commits:
d994330 [Liang-Chi Hsieh] Minor fix for CatalystSchemaConverter.
Author: Reynold Xin <rxin@databricks.com>
Closes#7220 from rxin/SPARK-8822 and squashes the following commits:
0cda076 [Reynold Xin] Test cases.
22d0463 [Reynold Xin] Fixed type precedence.
beb2a97 [Reynold Xin] [SPARK-8822][SQL] clean up type checking in math.scala.
Author: Reynold Xin <rxin@databricks.com>
Closes#7221 from rxin/implicit-cast-tests and squashes the following commits:
64b13bd [Reynold Xin] Fixed a bug ..
489b732 [Reynold Xin] [SQL] More unit tests for implicit type cast & add simpleString to AbstractDataType.
Jira: https://issues.apache.org/jira/browse/SPARK-8270
Info: I cannot build the latest master; it gets stuck during the build process: `[INFO] Dependency-reduced POM written at: /Users/tarek/test/spark/bagel/dependency-reduced-pom.xml`
Author: Tarek Auel <tarek.auel@googlemail.com>
Closes#7214 from tarekauel/SPARK-8270 and squashes the following commits:
ab348b9 [Tarek Auel] Merge branch 'master' into SPARK-8270
a2ad318 [Tarek Auel] [SPARK-8270] changed order of fields
d91b12c [Tarek Auel] [SPARK-8270] python fix
adbd075 [Tarek Auel] [SPARK-8270] fixed typo
23185c9 [Tarek Auel] [SPARK-8270] levenshtein distance
This commit adds a set of random data generation utilities to Spark SQL, for use in its own unit tests.
- `RandomDataGenerator.forType(DataType)` returns an `Option[() => Any]` that, if defined, contains a function for generating random values for the given DataType. The random values use the external representations for the given DataType (for example, for DateType we return `java.sql.Date` instances instead of longs).
- `DateTypeTestUtilities` defines some convenience fields for looping over instances of data types. For example, `numericTypes` holds `DataType` instances for all supported numeric types. These constants will help us to raise the level of abstraction in our tests. For example, it's now very easy to write a test which is parameterized by all common data types.
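Hypothetical usage, following the signature described above:
```
// forType returns Some(generator) only for supported data types
RandomDataGenerator.forType(DateType).foreach { generate =>
  println(generate()) // e.g. a java.sql.Date, per the external representation
}
```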
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7176 from JoshRosen/sql-random-data-generators and squashes the following commits:
f71634d [Josh Rosen] Roll back ScalaCheck usage
e0d7d49 [Josh Rosen] Bump ScalaCheck version in LICENSE
89d86b1 [Josh Rosen] Bump ScalaCheck version.
0c20905 [Josh Rosen] Initial attempt at using ScalaCheck.
b55875a [Josh Rosen] Generate doubles and floats over entire possible range.
5acdd5c [Josh Rosen] Infinity and NaN are interesting.
ab76cbd [Josh Rosen] Move code to Catalyst package.
d2b4a4a [Josh Rosen] Add random data generator test utilities to Spark SQL.
Implemented type coercion for UDF arguments in Scala. The changes include:
* Add `with ExpectsInputTypes` to the `ScalaUDF` class.
* Pass down argument type info from `UDFRegistration` and `functions`.
With this patch, the example query in [SPARK-8572](https://issues.apache.org/jira/browse/SPARK-8572) no longer throws a type cast error at runtime.
Also added a unit test to `UDFSuite` in which a decimal type is passed to a udf that expects an int.
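An illustration of the behavior, with hypothetical names:
```
// Register a UDF that expects an Int
sqlContext.udf.register("plusOne", (i: Int) => i + 1)
// With this patch, a decimal argument is coerced to Int instead of failing at runtime
sqlContext.sql("SELECT plusOne(CAST(1 AS DECIMAL(10, 0)))")
```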
Author: Cheolsoo Park <cheolsoop@netflix.com>
Closes#7203 from piaozhexiu/SPARK-8572 and squashes the following commits:
2d0ed15 [Cheolsoo Park] Incorporate comments
dce1efd [Cheolsoo Park] Fix unit tests and update the codegen script
066deed [Cheolsoo Park] Type coercion for udf inputs
One test for each of the GROUP BY, WHERE and HAVING clauses, and one that combines all three with an additional UDF in the SELECT.
(Since this is my first attempt at contributing to SPARK, meta-level guidance on anything I've screwed up would be greatly appreciated, whether important or minor.)
Author: Spiro Michaylov <spiro@michaylov.com>
Closes#7207 from spirom/udf-test-branch and squashes the following commits:
6bbba9e [Spiro Michaylov] Responded to review comments on UDF unit tests
1a3c5ff [Spiro Michaylov] Added several UDF unit tests for Spark SQL
"NaN" from string to double is already handled by Cast expression itself.
Author: Reynold Xin <rxin@databricks.com>
Closes#7206 from rxin/convertnans and squashes the following commits:
3d99c33 [Reynold Xin] [SPARK-8809][SQL] Remove ConvertNaNs analyzer rule.
cc rxin
Having backticks or null as elements causes problems.
Since elements become column names, we have to drop backticks from the elements, as backticks are special characters.
Having null throws exceptions; we can replace nulls with empty strings.
Handling of backticks should be improved for 1.5.
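A sketch of the cleaning step (hypothetical helper):
```
// Backticks would break column names, so drop them; nulls become empty strings
def cleanElement(element: Any): String =
  if (element == null) "" else element.toString.replace("`", "")
```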
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#7201 from brkyvz/weird-ct-elements and squashes the following commits:
e06b840 [Burak Yavuz] fix scalastyle
93a0d3f [Burak Yavuz] added tests for NaN and Infinity
9dba6ce [Burak Yavuz] address cr1
db71dbd [Burak Yavuz] handle special characters in elements in crosstab
This patch adds a new TypeCollection AbstractDataType that can be used by expressions to specify more than one expected input type.
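A hypothetical use in an expression's expected input types:
```
// The first argument may be either an Int or a Long
override def inputTypes: Seq[AbstractDataType] =
  Seq(TypeCollection(IntegerType, LongType))
```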
Author: Reynold Xin <rxin@databricks.com>
Closes#7202 from rxin/type-collection and squashes the following commits:
c714ca1 [Reynold Xin] Fixed style.
a0c0d12 [Reynold Xin] Fixed bugs and unit tests.
d8b8ae7 [Reynold Xin] Added TypeCollection.
ORC writes empty schema (`struct<>`) to ORC files containing zero rows. This is OK for Hive since the table schema is managed by the metastore. But it causes trouble when reading raw ORC files via Spark SQL since we have to discover the schema from the files.
Notice that the ORC data source always avoids writing empty ORC files, but it's still problematic when reading Hive tables which contain empty part-files.
Author: Cheng Lian <lian@databricks.com>
Closes#7199 from liancheng/spark-8501 and squashes the following commits:
bb8cd95 [Cheng Lian] Addresses comments
a290221 [Cheng Lian] Avoids reading schema from empty ORC files
This fixes code generation for queries containing `ORDER BY NULL`. Previously, the generated code would fail to compile.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7179 from JoshRosen/generate-order-fixes and squashes the following commits:
6ef49a6 [Josh Rosen] Fix ORDER BY NULL
0036696 [Josh Rosen] Add regression test for SPARK-8782 (ORDER BY NULL)
Also improve the performance of hex/unhex
Author: Davies Liu <davies@databricks.com>
Closes#7181 from davies/hex and squashes the following commits:
f032fbb [Davies Liu] Merge branch 'hex' of github.com:davies/spark into hex
49e325f [Davies Liu] Merge branch 'master' of github.com:apache/spark into hex
b31fc9a [Davies Liu] Update math.scala
25156b7 [Davies Liu] address comments and fix test
c3af78c [Davies Liu] address commments
1a24082 [Davies Liu] Add Python API for hex and unhex
Author: Reynold Xin <rxin@databricks.com>
Closes#7175 from rxin/implicitCast and squashes the following commits:
88080a2 [Reynold Xin] Clearer definition of implicit type cast.
f0ff97f [Reynold Xin] Added missing file.
c65e532 [Reynold Xin] [SPARK-8772][SQL] Implement implicit type cast for expressions that defines input types.
This is a follow up of [SPARK-8283](https://issues.apache.org/jira/browse/SPARK-8283) ([PR-6828](https://github.com/apache/spark/pull/6828)), to support both `struct` and `named_struct` in Spark SQL.
After [#6725](https://github.com/apache/spark/pull/6828), the semantics of the [`CreateStruct`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypes.scala#L56) method changed a little: it is no longer limited to columns of `NamedExpressions`, and it names non-NamedExpression fields following the Hive convention (col1, col2, ...).
This PR both loosens [`struct`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L723) to take children of `Expression` type and adds `named_struct` support.
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes#6874 from yijieshen/SPARK-8283 and squashes the following commits:
4cd3375ac [Yijie Shen] change struct documentation
d599d0b [Yijie Shen] rebase code
9a7039e [Yijie Shen] fix reviews and regenerate golden answers
b487354 [Yijie Shen] replace assert using checkAnswer
f07e114 [Yijie Shen] tiny fix
9613be9 [Yijie Shen] review fix
7fef712 [Yijie Shen] Fix checkInputTypes' implementation using foldable and nullable
60812a7 [Yijie Shen] Fix type check
828d694 [Yijie Shen] remove unnecessary resolved assertion inside dataType method
fd3cd8e [Yijie Shen] remove type check from eval
7a71255 [Yijie Shen] tiny fix
ccbbd86 [Yijie Shen] Fix reviews
47da332 [Yijie Shen] remove nameStruct API from DataFrame
917e680 [Yijie Shen] Fix reviews
4bd75ad [Yijie Shen] loosen struct method in functions.scala to take Expression children
0acb7be [Yijie Shen] Add CreateNamedStruct in both DataFrame function API and FunctionRegistery
also improve tests for binary comparison.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7143 from cloud-fan/binary and squashes the following commits:
28a5b76 [Wenchen Fan] improve test
04ef4b0 [Wenchen Fan] fix equalNullSafe
Jira:
https://issues.apache.org/jira/browse/SPARK-8223
https://issues.apache.org/jira/browse/SPARK-8224
~~I am aware of #7174 and will update this pr, if it's merged.~~ Done
I don't know if #7034 can simplify this, but we can have a look at it if it gets merged.
rxin In the JIRA ticket the function has no second argument. I added a `numBits` argument that allows specifying the number of bits; I guess this improves usability. I wanted to add `shiftleft(value)` as well, but the `selectExpr` dataframe tests crash if I have both. In order to do this, I added the following to functions.scala: `def shiftRight(e: Column): Column = ShiftRight(e.expr, lit(1).expr)`, but as I mentioned, this doesn't pass tests like `df.selectExpr("shiftRight(a)", ...)` (not enough arguments exception).
If we need the bitwise shift in order to be Hive compatible, I suggest adding `shiftLeft` and something like `shiftLeftX`.
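Hypothetical usage of the new functions with the `numBits` argument:
```
// DataFrame API and SQL expression forms
df.select(shiftLeft($"a", 2), shiftRight($"b", 1))
df.selectExpr("shiftLeft(a, 2)", "shiftRight(b, 1)")
```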
Author: Tarek Auel <tarek.auel@googlemail.com>
Closes#7178 from tarekauel/8223 and squashes the following commits:
8023bb5 [Tarek Auel] [SPARK-8223][SPARK-8224] fixed test
f3f64e6 [Tarek Auel] [SPARK-8223][SPARK-8224] Integer -> Int
f628706 [Tarek Auel] [SPARK-8223][SPARK-8224] removed toString; updated function description
3b56f2a [Tarek Auel] Merge remote-tracking branch 'origin/master' into 8223
5189690 [Tarek Auel] [SPARK-8223][SPARK-8224] minor fix and style fix
9434a28 [Tarek Auel] Merge remote-tracking branch 'origin/master' into 8223
44ee324 [Tarek Auel] [SPARK-8223][SPARK-8224] docu fix
ac7fe9d [Tarek Auel] [SPARK-8223][SPARK-8224] right and left bit shift
The detailed problem description is in https://issues.apache.org/jira/browse/SPARK-8690
Generally speaking, I added a config spark.sql.parquet.mergeSchema so that `sqlContext.load("parquet", Map("path" -> "...", "mergeSchema" -> "false"))` works.
It is a simple flag without any side effects.
Author: Wisely Chen <wiselychen@appier.com>
Closes#7070 from thegiive/SPARK8690 and squashes the following commits:
c6f3e86 [Wisely Chen] Refactor some code style and merge the test case to ParquetSchemaMergeConfigSuite
94c9307 [Wisely Chen] Remove some style problem
db8ef1b [Wisely Chen] Change config to SQLConf and add test case
b6806fb [Wisely Chen] remove text
c0edb8c [Wisely Chen] [SPARK-8690] add a config spark.sql.parquet.mergeSchema to disable datasource API schema merge feature.
updated the [Hive 0.13.1](https://archive.apache.org/dist/hive/hive-0.13.1) download link in `sql/README.md`
Author: Christian Kadner <ckadner@us.ibm.com>
Closes#7144 from ckadner/SPARK-8746 and squashes the following commits:
65d80f7 [Christian Kadner] [SPARK-8746][SQL] update download link for Hive 0.13.1
The parameter order of the deprecated annotation in package object sql is wrong:
`@deprecated("1.3.0", "use DataFrame")`
This has to be changed to `@deprecated("use DataFrame", "1.3.0")`.
Author: Vinod K C <vinod.kc@huawei.com>
Closes#7183 from vinodkc/fix_deprecated_param_order and squashes the following commits:
1cbdbe8 [Vinod K C] Modified the message
700911c [Vinod K C] Changed order of parameters
It's a really minor issue but there is an example with wrong lambda-expression usage in `SQLContext.scala` like as follows.
```
sqlContext.udf().register("myUDF",
(Integer arg1, String arg2) -> arg2 + arg1), <- We have an extra `)` here.
DataTypes.StringType);
```
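For reference, the corrected registration (with the stray `)` removed) would be:
```
sqlContext.udf().register("myUDF",
    (Integer arg1, String arg2) -> arg2 + arg1,
    DataTypes.StringType);
```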
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#7187 from sarutak/fix-minor-wrong-lambda-expression and squashes the following commits:
a13196d [Kousuke Saruta] Fixed minor wrong lambda expression example.
cc chenghao-intel adrian-wang
Author: zhichao.li <zhichao.li@intel.com>
Closes#7113 from zhichao-li/unhex and squashes the following commits:
379356e [zhichao.li] remove exception checking
a4ae6dc [zhichao.li] add udf_unhex to whitelist
fe5c14a [zhichao.li] add todigit
607d7a3 [zhichao.li] use checkInputTypes
bffd37f [zhichao.li] change to use Hex in apache common package
cde73f5 [zhichao.li] update to use AutoCastInputTypes
11945c7 [zhichao.li] style
c852d46 [zhichao.li] Add function unhex
Our current BinaryExpression abstract class is not for generic binary expressions, i.e. it requires the left/right children to have the same type. However, due to its name, contributors build new binary expressions that don't have that assumption (e.g. Sha) and still extend BinaryExpression.
This patch creates a new BinaryOperator abstract class and updates the analyzer to only apply the type casting rule there. It also adds the notion of "prettyName" to expressions, which defines the user-facing name for the expression.
Author: Reynold Xin <rxin@databricks.com>
Closes#7174 from rxin/binary-opterator and squashes the following commits:
f31900d [Reynold Xin] [SPARK-8770][SQL] Create BinaryOperator abstract class.
fceb216 [Reynold Xin] Merge branch 'master' of github.com:apache/spark into binary-opterator
d8518cf [Reynold Xin] Updated Python tests.
Our current BinaryExpression abstract class is not for generic binary expressions, i.e. it requires the left/right children to have the same type. However, due to its name, contributors build new binary expressions that don't have that assumption (e.g. Sha) and still extend BinaryExpression.
This patch creates a new BinaryOperator abstract class and updates the analyzer to only apply the type casting rule there. It also adds the notion of "prettyName" to expressions, which defines the user-facing name for the expression.
Author: Reynold Xin <rxin@databricks.com>
Closes#7170 from rxin/binaryoperator and squashes the following commits:
51264a5 [Reynold Xin] [SPARK-8770][SQL] Create BinaryOperator abstract class.
copy() of generated Row doesn't check nullability of columns
Author: Davies Liu <davies@databricks.com>
Closes#7163 from davies/fix_copy and squashes the following commits:
661a206 [Davies Liu] fix copy of generated row
Improve the empty check in `parseAttributeName` so that we can allow an empty string as a column name.
Close https://github.com/apache/spark/pull/7117
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7149 from cloud-fan/8621 and squashes the following commits:
efa9e3e [Wenchen Fan] support empty string
This patch doesn't actually introduce any code that uses the new ExpectsInputTypes. It just adds the trait so others can use it. Also renamed the old expectsInputTypes function to just inputTypes.
We should also add implicit type casting in the future.
Author: Reynold Xin <rxin@databricks.com>
Closes#7151 from rxin/expects-input-types and squashes the following commits:
16cf07b [Reynold Xin] [SPARK-8752][SQL] Add ExpectsInputTypes trait for defining expected input types.
Author: Reynold Xin <rxin@databricks.com>
Closes#7148 from rxin/calludf-closure and squashes the following commits:
00df372 [Reynold Xin] Fixed index out of bound exception.
4beba76 [Reynold Xin] [SPARK-8750][SQL] Remove the closure in functions.callUdf.
Developers are already familiar with `queryExecution.toRDD` as the internal row RDD, and we should not add a new concept.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7116 from cloud-fan/internal-rdd and squashes the following commits:
24756ca [Wenchen Fan] remove internalRowRDD
Moved all the rules into the companion object.
Author: Reynold Xin <rxin@databricks.com>
Closes#7147 from rxin/SPARK-8749 and squashes the following commits:
c1c6dc0 [Reynold Xin] [SPARK-8749][SQL] Remove HiveTypeCoercion trait.
This patch moves the resolve function in the Cast case class into the companion object and renames it canCast. We can then use this in the analyzer without a Cast expression.
Author: Reynold Xin <rxin@databricks.com>
Closes#7145 from rxin/cast and squashes the following commits:
cd086a9 [Reynold Xin] Whitespace changes.
4d2d989 [Reynold Xin] [SPARK-8748][SQL] Move castability test out from Cast case class into Cast object.
Author: Reynold Xin <rxin@databricks.com>
Closes#7137 from rxin/SPARK-8741 and squashes the following commits:
32c7e75 [Reynold Xin] [SPARK-8741][SQL] Remove e and pi from DataFrame functions.
Made lexical initialization a lazy val.
Author: Vinod K C <vinod.kc@huawei.com>
Closes#7015 from vinodkc/handle_lexical_initialize_schronization and squashes the following commits:
b6d1c74 [Vinod K C] Avoided repeated lexical initialization
5863cf7 [Vinod K C] Removed space
e27c66c [Vinod K C] Avoid reinitialization of lexical in parse method
ef4f60f [Vinod K C] Reverted import order
e9fc49a [Vinod K C] handle synchronization in SqlLexical.initialize
Hi Michael,
this Pull-Request is a follow-up to [PR-6242](https://github.com/apache/spark/pull/6242). I removed the two obsolete test cases from the HiveQuerySuite and deleted the corresponding golden answer files.
Thanks for your review!
Author: Christian Kadner <ckadner@us.ibm.com>
Closes#6983 from ckadner/SPARK-6785 and squashes the following commits:
ab1e79b [Christian Kadner] Merge remote-tracking branch 'origin/SPARK-6785' into SPARK-6785
1fed877 [Christian Kadner] [SPARK-6785][SQL] failed Scala style test, remove spaces on empty line DateTimeUtils.scala:61
9d8021d [Christian Kadner] [SPARK-6785][SQL] merge recent changes in DateTimeUtils & MiscFunctionsSuite
b97c3fb [Christian Kadner] [SPARK-6785][SQL] move test case for DateTimeUtils to DateTimeUtilsSuite
a451184 [Christian Kadner] [SPARK-6785][SQL] fix DateTimeUtils.fromJavaDate(java.util.Date) for Dates before 1970
Codegen takes three steps:
1. Take a list of expressions and convert them into Java source code, together with a list of expressions that don't support codegen (falling back to interpreted mode).
2. Compile the Java source into a Java class (bytecode).
3. Use the Java class and the list of expressions to build a Projection.
Currently, we cache all three steps: the key is a list of expressions and the result is a projection. Because some expressions (which may not be thread-safe, for example, Random) will be held by the Projection, the projection may not be thread-safe.
This PR changes the caching to cover only the second step. Then we can build projections using codegen even when some expressions are not thread-safe, because the cache no longer holds any expressions.
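A rough sketch of caching only step 2, keyed by the generated source; `generateSource` and `compile` are hypothetical hooks:
```
import com.google.common.cache.{CacheBuilder, CacheLoader}

// Key: generated Java source. Value: compiled class. No Expression is retained.
val classCache = CacheBuilder.newBuilder().maximumSize(100).build(
  new CacheLoader[String, Class[_]]() {
    override def load(source: String): Class[_] = compile(source)
  })

def newProjection(expressions: Seq[Expression]): Projection = {
  val source = generateSource(expressions)     // step 1: generate source
  val clazz = classCache.get(source)           // step 2: compiled class, cached
  clazz.newInstance().asInstanceOf[Projection] // step 3: fresh instance per call
}
```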
cc marmbrus rxin JoshRosen
Author: Davies Liu <davies@databricks.com>
Closes#7101 from davies/codegen_safe and squashes the following commits:
7dd41f1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into codegen_safe
847bd08 [Davies Liu] don't use scala.refect
4ddaaed [Davies Liu] Merge branch 'master' of github.com:apache/spark into codegen_safe
1793cf1 [Davies Liu] make codegen thread safe
https://issues.apache.org/jira/browse/SPARK-8236
Author: Shilei <shilei.qian@intel.com>
Closes#7108 from qiansl127/Crc32 and squashes the following commits:
5477352 [Shilei] Change to AutoCastInputTypes
5f16e5d [Shilei] Add misc function crc32
JIRA: https://issues.apache.org/jira/browse/SPARK-8680
This PR slightly improves `PropagateTypes` in `HiveTypeCoercion`. It moves `q.inputSet` outside `q transformExpressions` instead of calling `inputSet` multiple times. It also builds a map of attributes so attributes can be looked up easily.
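A sketch of the described change (assumed types):
```
// Compute inputSet once per plan node and index attributes by exprId
val inputMap: Map[ExprId, Attribute] =
  q.inputSet.toSeq.map(a => a.exprId -> a).toMap
```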
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#7087 from viirya/improve_propagatetypes and squashes the following commits:
5c314c1 [Liang-Chi Hsieh] For comments.
913f6ad [Liang-Chi Hsieh] Slightly improve PropagateTypes.
We can avoid executing both the left and right expressions by adding null and zero checks.
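The interpreted equivalent of the generated checks (illustrative only):
```
// Evaluate the divisor first; if it is null or zero, skip evaluating the dividend
val divisor = right.eval(input)
if (divisor == null || divisor == 0) null
else {
  val dividend = left.eval(input)
  if (dividend == null) null else divide(dividend, divisor)
}
```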
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7111 from cloud-fan/cg and squashes the following commits:
d6b12ef [Wenchen Fan] improve divide and remainder code gen
TODO: use array instead of Seq as internal representation for `ArrayType`
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#6982 from cloud-fan/extract-value and squashes the following commits:
e203bc1 [Wenchen Fan] address comments
4da0f0b [Wenchen Fan] some clean up
f679969 [Wenchen Fan] fix bug
e64f942 [Wenchen Fan] remove generic
e3f8427 [Wenchen Fan] fix style and address comments
fc694e8 [Wenchen Fan] add code gen for extract value
Sometimes the user may want to show the complete content of cells. Now `sql("set -v").show()` displays:
![screen shot 2015-06-18 at 4 34 51 pm](https://cloud.githubusercontent.com/assets/1000778/8227339/14d3c5ea-15d9-11e5-99b9-f00b7e93beef.png)
The user needs to use something like `sql("set -v").collect().foreach(r => r.toSeq.mkString("\t"))` to show the complete content.
This PR adds a `pretty` parameter to show. If `pretty` is false, `show` won't truncate strings or align cells right.
![screen shot 2015-06-18 at 4 21 44 pm](https://cloud.githubusercontent.com/assets/1000778/8227407/b6f8dcac-15d9-11e5-8219-8079280d76fc.png)
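With the new parameter (renamed from `pretty` to `truncate` in the commits below), the full cell content can be shown with something like:
```
sql("set -v").show(truncate = false)
```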
Author: zsxwing <zsxwing@gmail.com>
Closes#6877 from zsxwing/show and squashes the following commits:
22e28e9 [zsxwing] pretty -> truncate
e582628 [zsxwing] Add pretty parameter to the show method in R
a3cd55b [zsxwing] Fix calling showString in R
923cee4 [zsxwing] Add a "pretty" parameter to show to display long strings
Author: Reynold Xin <rxin@databricks.com>
Closes#7109 from rxin/auto-cast and squashes the following commits:
a914cc3 [Reynold Xin] [SPARK-8721][SQL] Rename ExpectsInputTypes => AutoCastInputTypes.
Patch to fix crash with BINARY fields with ENUM original types.
Author: Steven She <steven@canopylabs.com>
Closes#7048 from stevencanopy/SPARK-8669 and squashes the following commits:
2e72979 [Steven She] [SPARK-8669] [SQL] Fix crash with BINARY (ENUM) fields with Parquet 1.7
cc yhuai
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#7100 from brkyvz/ct-flakiness-fix and squashes the following commits:
abc299a [Burak Yavuz] change 'to' to until
7e96d7c [Burak Yavuz] ArrayOutOfBoundsException fixed for DataFrameStatSuite.crosstab
Hopefully, this suite will not be flaky anymore.
Author: Yin Huai <yhuai@databricks.com>
Closes#7027 from yhuai/SPARK-8567 and squashes the following commits:
c0167e2 [Yin Huai] Add sc.stop().
Move date/time related operations into `DateTimeUtils` and rename some methods to make them clearer.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#6980 from cloud-fan/datetime and squashes the following commits:
9373a9d [Wenchen Fan] cleanup DateTimeUtil
jira: https://issues.apache.org/jira/browse/SPARK-8710
Author: Yin Huai <yhuai@databricks.com>
Closes#7094 from yhuai/SPARK-8710 and squashes the following commits:
c854baa [Yin Huai] Change ScalaReflection.mirror from a val to a def.
This PR brings arbitrary object support to UnsafeRow (both in grouping keys and aggregation buffers).
Two object pools will be created to hold those non-primitive objects, and the index of each object is put into the UnsafeRow. In order to compare grouping keys as bytes, the objects in a key are stored in a unique object pool, to make sure the same objects have the same index (used as the hashCode).
For StringType and BinaryType, we still put them as var-length values in UnsafeRow when initializing, for better performance. But on update, they become objects inside the object pools (so some garbage is left in the buffer).
BTW: Will create a JIRA once issue.apache.org is available.
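A rough sketch of the pooling scheme (hypothetical API):
```
// Identical key objects map to identical indices, so keys still compare as bytes
val idx: Long = keyPool.putIfAbsent(obj)
row.setLong(ordinal, idx) // store the pool index in the fixed-width slot
```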
cc JoshRosen rxin
Author: Davies Liu <davies@databricks.com>
Closes#6959 from davies/unsafe_obj and squashes the following commits:
5ce39da [Davies Liu] fix comment
5e797bf [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_obj
5803d64 [Davies Liu] fix conflict
461d304 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_obj
2f41c90 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_obj
b04d69c [Davies Liu] address comments
4859b80 [Davies Liu] fix comments
f38011c [Davies Liu] add a test for grouping by decimal
d2cf7ab [Davies Liu] add more tests for null checking
71983c5 [Davies Liu] add test for timestamp
e8a1649 [Davies Liu] reuse buffer for string
39f09ca [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_obj
035501e [Davies Liu] fix style
236d6de [Davies Liu] support arbitrary object in UnsafeRow
Follow-up of #6902 for consistency between `Udf` and `UDF`.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes#6920 from BenFradet/SPARK-8478 and squashes the following commits:
c500f29 [BenFradet] renamed a few variables in functions to use UDF
8ab0f2d [BenFradet] renamed idUdf to idUDF in SQLQuerySuite
98696c2 [BenFradet] renamed originalUdfs in TestHive to originalUDFs
7738f74 [BenFradet] modified HiveUDFSuite to use only UDF
c52608d [BenFradet] renamed HiveUdfSuite to HiveUDFSuite
e51b9ac [BenFradet] renamed ExtractPythonUdfs to ExtractPythonUDFs
8c756f1 [BenFradet] renamed Hive UDF related code
2a1ca76 [BenFradet] renamed pythonUdfs to pythonUDFs
261e6fb [BenFradet] renamed ScalaUdf to ScalaUDF
I've added functionality to create a new StructType similar to how we add parameters to a new SparkContext.
I've also added tests for this type of creation.
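Illustrative usage of the builder-style construction (exact signatures may differ):
```
val schema = new StructType()
  .add("name", StringType)
  .add("age", IntegerType, nullable = true)
```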
Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
Closes#6686 from ilganeli/SPARK-8056B and squashes the following commits:
27c1de1 [Ilya Ganelin] Rename
467d836 [Ilya Ganelin] Removed from_string in favor of _parse_Datatype_json_value
5fef5a4 [Ilya Ganelin] Updates for type parsing
4085489 [Ilya Ganelin] Style errors
3670cf5 [Ilya Ganelin] added string to DataType conversion
8109e00 [Ilya Ganelin] Fixed error in tests
41ab686 [Ilya Ganelin] Fixed style errors
e7ba7e0 [Ilya Ganelin] Moved some python tests to tests.py. Added cleaner handling of null data type and added test for correctness of input format
15868fa [Ilya Ganelin] Fixed python errors
b79b992 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-8056B
a3369fc [Ilya Ganelin] Fixing space errors
e240040 [Ilya Ganelin] Style
bab7823 [Ilya Ganelin] Constructor error
73d4677 [Ilya Ganelin] Style
4ed00d9 [Ilya Ganelin] Fixed default arg
67df57a [Ilya Ganelin] Removed Foo
04cbf0c [Ilya Ganelin] Added comments for single object
0484d7a [Ilya Ganelin] Restored second method
6aeb740 [Ilya Ganelin] Style
689e54d [Ilya Ganelin] Style
f497e9e [Ilya Ganelin] Got rid of old code
e3c7a88 [Ilya Ganelin] Fixed doctest failure
a62ccde [Ilya Ganelin] Style
966ac06 [Ilya Ganelin] style checks
dabb7e6 [Ilya Ganelin] Added Python tests
a3f4152 [Ilya Ganelin] added python bindings and better comments
e6e536c [Ilya Ganelin] Added extra space
7529a2e [Ilya Ganelin] Fixed formatting
d388f86 [Ilya Ganelin] Fixed small bug
c4e3bf5 [Ilya Ganelin] Reverted to using parse. Updated parse to support long
d7634b6 [Ilya Ganelin] Reverted to fromString to properly support types
22c39d5 [Ilya Ganelin] replaced FromString with DataTypeParser.parse. Replaced empty constructor initializing a null to have it instead create a new array to allow appends to it.
faca398 [Ilya Ganelin] [SPARK-8056] Replaced default argument usage. Updated usage and code for DataType.fromString
1acf76e [Ilya Ganelin] Scala style
e31c674 [Ilya Ganelin] Fixed bug in test
8dc0795 [Ilya Ganelin] Added tests for creation of StructType object with new methods
fdf7e9f [Ilya Ganelin] [SPARK-8056] Created add methods to facilitate building new StructType objects.
I specifically randomized the test. What crosstab does is equivalent to a countByKey, therefore if this test fails again for any reason, we will know that we hit a corner case or something.
cc rxin marmbrus
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#7060 from brkyvz/crosstab-fixes and squashes the following commits:
0a65234 [Burak Yavuz] addressed comments v1
d96da7e [Burak Yavuz] fixed wrong ordering of columns in crosstab
This is a follow-up of #6404: the ScriptTransformation prints the error msg into stderr directly, which would probably be a disaster for the application log.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#6882 from chenghao-intel/verbose and squashes the following commits:
bfedd77 [Cheng Hao] revert the write
76ff46b [Cheng Hao] update the CircularBuffer
692b19e [Cheng Hao] check the process exitValue for ScriptTransform
47e0970 [Cheng Hao] Use the RedirectThread instead
1de771d [Cheng Hao] naming the threads in ScriptTransformation
8536e81 [Cheng Hao] disable the error message redirection for stderr
cc chenghao-intel adrian-wang
Author: zhichao.li <zhichao.li@intel.com>
Closes#6976 from zhichao-li/hex and squashes the following commits:
e218d1b [zhichao.li] turn off scalastyle for non-ascii
de3f5ea [zhichao.li] non-ascii char
cf9c936 [zhichao.li] give separated buffer for each hex method
967ec90 [zhichao.li] Make 'value' as a feild of Hex
3b2fa13 [zhichao.li] tiny fix
a647641 [zhichao.li] remove duplicate null check
7cab020 [zhichao.li] tiny refactoring
35ecfe5 [zhichao.li] add function hex
In DataFrame.scala, there are examples like as follows.
```
* // The following are equivalent:
* peopleDf.filter($"age" > 15)
* peopleDf.where($"age" > 15)
* peopleDf($"age" > 15)
```
But, I think the last example doesn't work.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#6977 from sarutak/fix-dataframe-example and squashes the following commits:
46efbd7 [Kousuke Saruta] Removed wrong example
Jira: https://issues.apache.org/jira/browse/SPARK-8235
I added the support for sha1. If I understood rxin correctly, sha and sha1 should execute the same algorithm, shouldn't they?
Please take a close look on the Python part. This is adopted from #6934
Author: Tarek Auel <tarek.auel@gmail.com>
Author: Tarek Auel <tarek.auel@googlemail.com>
Closes#6963 from tarekauel/SPARK-8235 and squashes the following commits:
f064563 [Tarek Auel] change to shaHex
7ce3cdc [Tarek Auel] rely on automatic cast
a1251d6 [Tarek Auel] Merge remote-tracking branch 'upstream/master' into SPARK-8235
68eb043 [Tarek Auel] added docstring
be5aff1 [Tarek Auel] improved error message
7336c96 [Tarek Auel] added type check
cf23a80 [Tarek Auel] simplified example
ebf75ef [Tarek Auel] [SPARK-8301] updated the python documentation. Removed sha in python and scala
6d6ff0d [Tarek Auel] [SPARK-8233] added docstring
ea191a9 [Tarek Auel] [SPARK-8233] fixed signatureof python function. Added expected type to misc
e3fd7c3 [Tarek Auel] SPARK[8235] added sha to the list of __all__
e5dad4e [Tarek Auel] SPARK[8235] sha / sha1
Allow HiveContext to connect to metastores of those versions; some new shims
had to be added to account for changing internal APIs.
A new test was added to exercise the "reset()" path which now also requires
a shim; and the test code was changed to use a directory under the build's
target to store ivy dependencies. Without that, at least I consistently run
into issues with Ivy messing up (or being confused by) my existing caches.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#7026 from vanzin/SPARK-8067 and squashes the following commits:
3e2e67b [Marcelo Vanzin] [SPARK-8066, SPARK-8067] [hive] Add support for Hive 1.0, 1.1 and 1.2.
use the same order: boolean, byte, short, int, date, long, timestamp, float, double, string, binary, decimal.
Then we can easily check at a glance whether some data types are missing, and make sure we handle date/timestamp just like int/long.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7073 from cloud-fan/fix-date and squashes the following commits:
463044d [Wenchen Fan] fix style
51cd347 [Wenchen Fan] refactor handling of date and timestmap
Follow up of [SPARK-8356](https://issues.apache.org/jira/browse/SPARK-8356) and #6902.
Removes the unit test for the now deprecated ```callUdf```
Unit test in SQLQuerySuite now uses ```udf``` instead of ```callUDF```
Replaced ```callUDF``` by ```udf``` where possible in mllib
Author: BenFradet <benjamin.fradet@gmail.com>
Closes#6993 from BenFradet/SPARK-8575 and squashes the following commits:
26f5a7a [BenFradet] 2 spaces instead of 1
1ddb452 [BenFradet] renamed initUDF in order to be consistent in OneVsRest
48ca15e [BenFradet] used vector type tag for udf call in VectorIndexer
0ebd0da [BenFradet] replace the now deprecated callUDF by udf in VectorIndexer
8013409 [BenFradet] replaced the now deprecated callUDF by udf in Predictor
94345b5 [BenFradet] unifomized udf calls in ProbabilisticClassifier
1305492 [BenFradet] uniformized udf calls in Classifier
a672228 [BenFradet] uniformized udf calls in OneVsRest
49e4904 [BenFradet] Revert "removal of the unit test for the now deprecated callUdf"
bbdeaf3 [BenFradet] fixed syntax for init udf in OneVsRest
fe2a10b [BenFradet] callUDF => udf in ProbabilisticClassifier
0ea30b3 [BenFradet] callUDF => udf in Classifier where possible
197ec82 [BenFradet] callUDF => udf in OneVsRest
84d6780 [BenFradet] modified unit test in SQLQuerySuite to use udf instead of callUDF
477709f [BenFradet] removal of the unit test for the now deprecated callUdf
JIRA: https://issues.apache.org/jira/browse/SPARK-8677
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#7056 from viirya/fix_decimal3 and squashes the following commits:
34d7419 [Liang-Chi Hsieh] Fix Non-terminating decimal expansion for decimal divide operation.
DataFrame supports the `filter` function with two types of argument, `Column` and `String`, but `where` doesn't.
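Illustrative usage, assuming a `peopleDf` DataFrame as in the docs examples elsewhere in this log:
```scala
// Equivalent after this PR: `where` now accepts a String predicate too.
val adults1 = peopleDf.where($"age" > 15)   // Column argument
val adults2 = peopleDf.where("age > 15")    // String argument, like filter
```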
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#7063 from sarutak/SPARK-8686 and squashes the following commits:
180f9a4 [Kousuke Saruta] Added test
d61aec4 [Kousuke Saruta] Add "where" method with String argument to DataFrame
Currently, we use GenericRow both for Row and InternalRow, which is confusing because it could contain Scala types as well as Catalyst types.
This PR changes to use GenericInternalRow for InternalRow (contains catalyst types), GenericRow for Row (contains Scala types).
Also fixes some incorrect use of InternalRow or Row.
Author: Davies Liu <davies@databricks.com>
Closes#7003 from davies/internalrow and squashes the following commits:
d05866c [Davies Liu] fix test: rollback changes for pyspark
72878dd [Davies Liu] Merge branch 'master' of github.com:apache/spark into internalrow
efd0b25 [Davies Liu] fix copy of MutableRow
87b13cf [Davies Liu] fix test
d2ebd72 [Davies Liu] fix style
eb4b473 [Davies Liu] mark expensive API as final
bd4e99c [Davies Liu] Merge branch 'master' of github.com:apache/spark into internalrow
bdfb78f [Davies Liu] remove BaseMutableRow
6f99a97 [Davies Liu] fix catalyst test
defe931 [Davies Liu] remove BaseRow
288b31f [Davies Liu] Merge branch 'master' of github.com:apache/spark into internalrow
9d24350 [Davies Liu] separate Row and InternalRow (part 2)
In `CatalystTypeConverters.createToCatalystConverter`, we add special handling for primitive types. We can apply this strategy to more places to improve performance.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7018 from cloud-fan/converter and squashes the following commits:
8b16630 [Wenchen Fan] another fix
326c82c [Wenchen Fan] optimize type converter
fix docs, remove nativeTypes, use the Java type to get the boxed type, default value, etc., to avoid handling `DateType` and `TimestampType` as int and long again and again.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7010 from cloud-fan/cg and squashes the following commits:
aa01cf9 [Wenchen Fan] cleanup CodeGenContext
`HadoopFsRelation` subclasses, especially `ParquetRelation2`, should set their own output format class, so that the default output committer can be set up correctly when doing appends (where we ignore user-defined output committers).
Author: Cheng Lian <lian@databricks.com>
Closes#6998 from liancheng/spark-8604 and squashes the following commits:
9be51d1 [Cheng Lian] Adds more comments
6db1368 [Cheng Lian] HadoopFsRelation subclasses should set their output format class
Author: Reynold Xin <rxin@databricks.com>
Closes#7000 from rxin/minor-cleanup and squashes the following commits:
046044c [Reynold Xin] Two minor SQL cleanup (compiler warning & indent).
a follow up of https://github.com/apache/spark/pull/6405.
Note: It's not a big change; a lot of the diff is due to swapping some code in `aggregates.scala` to put aggregate functions right below their corresponding aggregate expressions.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#6723 from cloud-fan/type-check and squashes the following commits:
2124301 [Wenchen Fan] fix tests
5a658bb [Wenchen Fan] add tests
287d3bb [Wenchen Fan] apply type check interface to more expressions
This PR introduces `CatalystSchemaConverter` for converting Parquet schemas to Spark SQL schemas and vice versa. The original conversion code in `ParquetTypesConverter` is removed. Benefits of the new version are:
1. When converting Spark SQL schemas, it generates standard Parquet schemas conforming to [the most updated Parquet format spec] [1]. Converting to old style Parquet schemas is also supported via the feature flag `spark.sql.parquet.followParquetFormatSpec` (which is set to `false` for now, and should be set to `true` after both read and write paths are fixed).
Note that although this version of the Parquet format spec hasn't been officially released yet, Parquet MR 1.7.0 already sticks to it. So it should be safe to follow.
1. It implements the backwards-compatibility rules described in the most updated Parquet format spec. Thus it can recognize more schema patterns generated by other/legacy systems/tools.
1. Code organization follows convention used in [parquet-mr] [2], which is easier to follow. (Structure of `CatalystSchemaConverter` is similar to `AvroSchemaConverter`).
To fully implement backwards-compatibility rules in both read and write path, we also need to update `CatalystRowConverter` (which is responsible for converting Parquet records to `Row`s), `RowReadSupport`, and `RowWriteSupport`. These would be done in follow-up PRs.
TODO
- [x] More schema conversion test cases for legacy schema patterns.
[1]: ea09522659/LogicalTypes.md
[2]: https://github.com/apache/parquet-mr/
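For instance, opting into the new conversion behavior presumably amounts to setting the flag named above (a sketch, not from the PR itself):
```scala
// Generate Parquet schemas that follow the most recent format spec
// (default is false for now, per the description above).
sqlContext.setConf("spark.sql.parquet.followParquetFormatSpec", "true")
```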
Author: Cheng Lian <lian@databricks.com>
Closes#6617 from liancheng/spark-6777 and squashes the following commits:
2a2062d [Cheng Lian] Don't convert decimals without precision information
b60979b [Cheng Lian] Adds a constructor which accepts a Configuration, and fixes default value of assumeBinaryIsString
743730f [Cheng Lian] Decimal scale shouldn't be larger than precision
a104a9e [Cheng Lian] Fixes Scala style issue
1f71d8d [Cheng Lian] Adds feature flag to allow falling back to old style Parquet schema conversion
ba84f4b [Cheng Lian] Fixes MapType schema conversion bug
13cb8d5 [Cheng Lian] Fixes MiMa failure
81de5b0 [Cheng Lian] Fixes UDT, workaround read path, and add tests
28ef95b [Cheng Lian] More AnalysisExceptions
b10c322 [Cheng Lian] Replaces require() with analysisRequire() which throws AnalysisException
cceaf3f [Cheng Lian] Implements backwards compatibility rules in CatalystSchemaConverter
make the `TakeOrdered` strategy and operator more general, such that it can optionally handle a projection when necessary
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#6780 from cloud-fan/limit and squashes the following commits:
34aa07b [Wenchen Fan] revert
07d5456 [Wenchen Fan] clean closure
20821ec [Wenchen Fan] fix
3676a82 [Wenchen Fan] address comments
b558549 [Wenchen Fan] address comments
214842b [Wenchen Fan] fix style
2d8be83 [Wenchen Fan] add LimitPushDown
948f740 [Wenchen Fan] fix existing
ResolveReferences analysis rule now does not throw when it cannot resolve references in a self-join.
Author: Santiago M. Mola <smola@stratio.com>
Closes#6853 from smola/SPARK-7088 and squashes the following commits:
af71ac7 [Santiago M. Mola] [SPARK-7088] Fix analysis for 3rd party logical plan.
https://issues.apache.org/jira/browse/SPARK-8578
It is not very safe to use a custom output committer when appending data to an existing dir. This change adds logic to check whether we are appending data, and if so, to use the output committer associated with the file output format.
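A minimal sketch of that check, with assumed names (not the actual Spark code):
```scala
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

// When appending, ignore any user-configured committer and fall back to the
// committer associated with the file output format.
def chooseCommitterClass(
    isAppend: Boolean,
    userCommitter: Option[Class[_]]): Class[_] = {
  if (isAppend) classOf[FileOutputCommitter]
  else userCommitter.getOrElse(classOf[FileOutputCommitter])
}
```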
Author: Yin Huai <yhuai@databricks.com>
Closes#6964 from yhuai/SPARK-8578 and squashes the following commits:
43544c4 [Yin Huai] Do not use a custom output commiter when appendiing data.
Using a similar approach to the one used in `HiveThriftServer2Suite`, this prints the stdout/stderr of the spawned process instead of logging them, to see what happens on Jenkins. (This test suite only fails on Jenkins and doesn't spill out any log...)
cc yhuai
Author: Cheng Lian <lian@databricks.com>
Closes#6978 from liancheng/debug-hive-spark-submit-suite and squashes the following commits:
b031647 [Cheng Lian] Prints process stdout/stderr instead of logging them
This PR improves the error message shown when conflicting partition column names are detected. This can be particularly annoying and confusing when there are a large number of partitions while a handful of them happen to contain unexpected temporary file(s). Now all suspicious directories are listed as below:
```
java.lang.AssertionError: assertion failed: Conflicting partition column names detected:
Partition column name list #0: b, c, d
Partition column name list #1: b, c
Partition column name list #2: b
For partitioned table directories, data files should only live in leaf directories. Please check the following directories for unexpected files:
file:/tmp/foo/b=0
file:/tmp/foo/b=1
file:/tmp/foo/b=1/c=1
file:/tmp/foo/b=0/c=0
```
Author: Cheng Lian <lian@databricks.com>
Closes#6610 from liancheng/part-errmsg and squashes the following commits:
7d05f2c [Cheng Lian] Fixes Scala style issue
a149250 [Cheng Lian] Adds test case for the error message
6b74dd8 [Cheng Lian] Also lists suspicious non-leaf partition directories
a935eb8 [Cheng Lian] Improves error message when conflicting partition columns are found
a follow up of https://github.com/apache/spark/pull/6813
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#6825 from cloud-fan/cg and squashes the following commits:
43170cc [Wenchen Fan] fix bugs in code gen
This works around a bug in the underlying RetryingMetaStoreClient (HIVE-10384) by refreshing the metastore client on thrift exceptions. We attempt to emulate the proper Hive behavior by retrying only as configured by hiveconf.
Author: Eric Liang <ekl@databricks.com>
Closes#6912 from ericl/spark-6749 and squashes the following commits:
2d54b55 [Eric Liang] use conf from state
0e3a74e [Eric Liang] use shim properly
980b3e5 [Eric Liang] Fix conf parsing hive 0.14 conf.
92459b6 [Eric Liang] Work around RetryingMetaStoreClient bug
Add `sampleBy` to DataFrame. rxin
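Illustrative usage, assuming a DataFrame `df` with an integer column named `key` (fractions and seed are made up):
```scala
// Stratified sampling: keep ~10% of rows with key 0 and ~20% with key 1.
val sampled = df.stat.sampleBy("key", fractions = Map(0 -> 0.1, 1 -> 0.2), seed = 36L)
```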
Author: Xiangrui Meng <meng@databricks.com>
Closes#6769 from mengxr/SPARK-7157 and squashes the following commits:
991f26f [Xiangrui Meng] fix seed
4a14834 [Xiangrui Meng] move sampleBy to stat
832f7cc [Xiangrui Meng] add sampleBy to DataFrame
This PR only applies to master branch (1.5.0-SNAPSHOT) since it references `org.apache.parquet` classes which only appear in Parquet 1.7.0.
Author: Cheng Lian <lian@databricks.com>
Closes#6683 from liancheng/output-committer-docs and squashes the following commits:
b4648b8 [Cheng Lian] Removes spark.sql.sources.outputCommitterClass as it's not a public option
ee63923 [Cheng Lian] Updates docs and comments of data sources and Parquet output committer options
Also added more tests in LiteralExpressionSuite
Author: Davies Liu <davies@databricks.com>
Closes#6876 from davies/fix_hashcode and squashes the following commits:
429c2c0 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_hashcode
32d9811 [Davies Liu] fix test
a0626ed [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_hashcode
89c2432 [Davies Liu] fix style
bd20780 [Davies Liu] check with catalyst types
41caec6 [Davies Liu] change for to while
d96929b [Davies Liu] address comment
6ad2a90 [Davies Liu] fix style
5819d33 [Davies Liu] unify equals() and hashCode()
0fff25d [Davies Liu] fix style
53c38b1 [Davies Liu] fix hashCode() and equals() of BinaryType in Row
The logical plan `Expand` takes the `output` as a constructor argument, which breaks the references chain. We need to refactor the code, as well as the column pruning.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#5780 from chenghao-intel/expand and squashes the following commits:
76e4aa4 [Cheng Hao] revert the change for case insenstive
7c10a83 [Cheng Hao] refactor the grouping sets
The syntax was incorrect in the example for `explode`.
Author: lockwobr <lockwobr@gmail.com>
Closes#6943 from lockwobr/master and squashes the following commits:
3d864d1 [lockwobr] updated the documentation for explode
Users can now do
```scala
left.join(broadcast(right), "joinKey")
```
to give the query planner a hint that "right" DataFrame is small and should be broadcasted.
Author: Reynold Xin <rxin@databricks.com>
Closes#6751 from rxin/broadcastjoin-hint and squashes the following commits:
953eec2 [Reynold Xin] Code review feedback.
88752d8 [Reynold Xin] Fixed import.
8187b88 [Reynold Xin] [SPARK-8300] DataFrame hint for broadcast join.
JIRA: https://issues.apache.org/jira/browse/SPARK-8359
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#6814 from viirya/fix_decimal2 and squashes the following commits:
071a757 [Liang-Chi Hsieh] Remove maximum precision and use MathContext.UNLIMITED.
df217d4 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into fix_decimal2
a43bfc3 [Liang-Chi Hsieh] Add MathContext with maximum supported precision.
72eeb3f [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into fix_decimal2
44c9348 [Liang-Chi Hsieh] Fix incorrect decimal precision after multiplication.
To reproduce that:
```
JAVA_HOME=/home/hcheng/Java/jdk1.8.0_45 | build/sbt -Phadoop-2.3 -Phive 'test-only org.apache.spark.sql.hive.execution.HiveWindowFunctionQueryWithoutCodeGenSuite'
```
A simple workaround to fix that is to update the original query to get the output size instead of the exact elements of the array (output by collect_set())
Author: Cheng Hao <hao.cheng@intel.com>
Closes#6402 from chenghao-intel/windowing and squashes the following commits:
99312ad [Cheng Hao] add order by for the select clause
edf8ce3 [Cheng Hao] update the code as suggested
7062da7 [Cheng Hao] fix the collect_set() behaviour differences under different versions of JDK
first convert `ordinal` to `Number`, then convert to int type.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#5706 from cloud-fan/7153 and squashes the following commits:
915db79 [Wenchen Fan] fix 7153
Support BinaryType in UnsafeRow, just like StringType.
Also change the layout of StringType and BinaryType in UnsafeRow, by combining offset and size together as a Long, which limits the size of a Row to under 2G (given the fact that any single buffer cannot be bigger than 2G in the JVM).
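A sketch of the combined offset-and-size encoding (illustrative helpers, not the actual UnsafeRow code):
```scala
// Pack a 32-bit offset and a 32-bit size into a single 8-byte slot.
def pack(offset: Int, size: Int): Long = (offset.toLong << 32) | (size & 0xFFFFFFFFL)
def offsetOf(packed: Long): Int = (packed >>> 32).toInt // upper 32 bits
def sizeOf(packed: Long): Int = packed.toInt            // lower 32 bits
```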
Author: Davies Liu <davies@databricks.com>
Closes#6911 from davies/unsafe_bin and squashes the following commits:
d68706f [Davies Liu] update comment
519f698 [Davies Liu] address comment
98a964b [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_bin
180b49d [Davies Liu] fix zero-out
22e4c0a [Davies Liu] zero-out padding bytes
6abfe93 [Davies Liu] fix style
447dea0 [Davies Liu] support binaryType in UnsafeRow
Deprecates ```callUdf``` in favor of ```callUDF```.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes#6902 from BenFradet/SPARK-8356 and squashes the following commits:
ef4e9d8 [BenFradet] deprecated callUDF, use udf instead
9b1de4d [BenFradet] reinstated unit test for the deprecated callUdf
cbd80a5 [BenFradet] deprecated callUdf in favor of callUDF
Currently we auto-alias expressions in the parser. However, during the parse phase we don't have enough information to do the right aliasing. For example, a Generator that has more than one kind of element needs MultiAlias, and ExtractValue doesn't need an Alias if it's in the middle of an ExtractValue chain.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#6647 from cloud-fan/alias and squashes the following commits:
552eba4 [Wenchen Fan] fix python
5b5786d [Wenchen Fan] fix agg
73a90cb [Wenchen Fan] fix case-preserve of ExtractValue
4cfd23c [Wenchen Fan] fix order by
d18f401 [Wenchen Fan] refine
9f07359 [Wenchen Fan] address comments
39c1aef [Wenchen Fan] small fix
33640ec [Wenchen Fan] auto alias expressions in analyzer
This PR fixes a Parquet output file name collision bug which may cause data loss. Changes made:
1. Identify each write job issued by `InsertIntoHadoopFsRelation` with a UUID
All concrete data sources which extend `HadoopFsRelation` (Parquet and ORC for now) must use this UUID to generate task output file paths to avoid name collisions.
2. Make `TestHive` use a local mode `SparkContext` with 32 threads to increase parallelism
The major reason for this is that the original parallelism of 2 is too low to reproduce the data loss issue. Also, higher concurrency may potentially catch more concurrency bugs during the testing phase. (It did help us spot SPARK-8501.)
3. `OrcSourceSuite` was updated to workaround SPARK-8501, which we detected along the way.
NOTE: This PR is made a little bit more complicated than expected because we hit two other bugs on the way and had to work around them. See [SPARK-8501] [1] and [SPARK-8513] [2].
[1]: https://github.com/liancheng/spark/tree/spark-8501
[2]: https://github.com/liancheng/spark/tree/spark-8513
----
Some background and a summary of offline discussion with yhuai about this issue for better understanding:
In 1.4.0, we added `HadoopFsRelation` to abstract partition support of all data sources that are based on Hadoop `FileSystem` interface. Specifically, this makes partition discovery, partition pruning, and writing dynamic partitions for data sources much easier.
To support appending, the Parquet data source tries to find out the max part number of part-files in the destination directory (i.e., `<id>` in output file name `part-r-<id>.gz.parquet`) at the beginning of the write job. In 1.3.0, this step happens on the driver side before any files are written. However, in 1.4.0, this is moved to the task side. Unfortunately, tasks scheduled later may see the wrong max part number, generated from files newly written by other finished tasks within the same job. This is effectively a race condition. In most cases, it only causes nonconsecutive part numbers in output file names. But when the DataFrame contains thousands of RDD partitions, it's likely that two tasks choose the same part number, and one of them gets overwritten by the other.
Before `HadoopFsRelation`, Spark SQL already supports appending data to Hive tables. From a user's perspective, these two look similar. However, they differ a lot internally. When data are inserted into Hive tables via Spark SQL, `InsertIntoHiveTable` simulates Hive's behaviors:
1. Write data to a temporary location
2. Move data in the temporary location to the final destination location using
- `Hive.loadTable()` for non-partitioned table
- `Hive.loadPartition()` for static partitions
- `Hive.loadDynamicPartitions()` for dynamic partitions
The important part is that `Hive.copyFiles()` is invoked in step 2 to move the data to the destination directory (I found the name kinda confusing since no "copying" occurs here; we are just moving and renaming stuff). If a file in the source directory and another file in the destination directory happen to have the same name, say `part-r-00001.parquet`, the former is moved to the destination directory and renamed with a `_copy_N` postfix (`part-r-00001_copy_1.parquet`). That's how Hive handles appending and avoids name collisions between different write jobs.
Some alternative fixes considered for this issue:
1. Use a similar approach as Hive
This approach is not preferred in Spark 1.4.0 mainly because file metadata operations in S3 tend to be slow, especially for tables with lots of files and/or partitions. That's why `InsertIntoHadoopFsRelation` just inserts into the destination directory directly, and is often used together with `DirectParquetOutputCommitter` to reduce latency when working with S3. This means we don't have the chance to do renaming, and must avoid name collisions from the very beginning.
2. Same as 1.3, just move max part number detection back to driver side
This isn't doable because unlike 1.3, 1.4 also takes dynamic partitioning into account. When inserting into dynamic partitions, we don't know which partition directories will be touched on driver side before issuing the write job. Checking all partition directories is simply too expensive for tables with thousands of partitions.
3. Add extra component to output file names to avoid name collision
This seems to be the only reasonable solution for now. To be more specific, we need a JOB level unique identifier to identify all write jobs issued by `InsertIntoHadoopFile`. Notice that TASK level unique identifiers can NOT be used, because then a speculative task would write to a different output file from the original task; if both tasks succeed, duplicate output is left behind. Currently, the ORC data source adds `System.currentTimeMillis` to the output file name for uniqueness. This doesn't work for exactly the same reason.
That's why this PR adds a job level random UUID in `BaseWriterContainer` (which is used by `InsertIntoHadoopFsRelation` to issue write jobs). The drawback is that record order is not preserved any more (output files of a later job may be listed before those of an earlier job). However, we never promise to preserve record order when writing data, and Hive doesn't promise this either because the `_copy_N` trick breaks the order.
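A sketch of the job-level UUID idea with illustrative names: the UUID is generated once per write job and shared by all of its tasks, including speculative attempts, so files from different jobs can never collide.
```scala
import java.util.UUID

// Generated once on the driver when the write job is issued.
val jobUUID = UUID.randomUUID().toString

// Every task derives its file name from its own part number plus the shared
// job UUID; a speculative attempt of the same task produces the same name.
def outputFileName(partNumber: Int): String =
  f"part-r-$partNumber%05d-$jobUUID.gz.parquet"
```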
Author: Cheng Lian <lian@databricks.com>
Closes#6864 from liancheng/spark-8406 and squashes the following commits:
db7a46a [Cheng Lian] More comments
f5c1133 [Cheng Lian] Addresses comments
85c478e [Cheng Lian] Workarounds SPARK-8513
088c76c [Cheng Lian] Adds comment about SPARK-8501
99a5e7e [Cheng Lian] Uses job level UUID in SimpleTextRelation and avoids double task abortion
4088226 [Cheng Lian] Works around SPARK-8501
1d7d206 [Cheng Lian] Adds more logs
8966bbb [Cheng Lian] Fixes Scala style issue
18b7003 [Cheng Lian] Uses job level UUID to take speculative tasks into account
3806190 [Cheng Lian] Lets TestHive use all cores by default
748dbd7 [Cheng Lian] Adding UUID to output file name to avoid accidental overwriting
Currently [the test case for SPARK-7862] [1] writes 100,000 lines of integer triples to stderr, which makes the Jenkins build output unnecessarily large and makes it hard to debug other build errors. A proper fix is on the way in #6882. This PR ignores this test case temporarily until #6882 is merged.
[1]: https://github.com/apache/spark/pull/6404/files#diff-1ea02a6fab84e938582f7f87cc4d9ea1R641
Author: Cheng Lian <lian@databricks.com>
Closes#6925 from liancheng/spark-8508 and squashes the following commits:
41e5b47 [Cheng Lian] Ignores the test case until #6882 is merged
The issue link [SPARK-8379](https://issues.apache.org/jira/browse/SPARK-8379)
Currently, when we insert data into a dynamic partition with speculative tasks, we will get the exception:
```
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
Lease mismatch on /tmp/hive-jeanlyn/hive_2015-06-15_15-20-44_734_8801220787219172413-1/-ext-10000/ds=2015-06-15/type=2/part-00301.lzo
owned by DFSClient_attempt_201506031520_0011_m_000189_0_-1513487243_53
but is accessed by DFSClient_attempt_201506031520_0011_m_000042_0_-1275047721_57
```
This PR tries to write the data to a temporary dir when using dynamic partitioning, to avoid speculative tasks writing the same file.
Author: jeanlyn <jeanlyn92@gmail.com>
Closes#6833 from jeanlyn/speculation and squashes the following commits:
64bbfab [jeanlyn] use FileOutputFormat.getTaskOutputPath to get the path
8860af0 [jeanlyn] remove the never using code
e19a3bd [jeanlyn] avoid speculative tasks write same file
Jira: https://issues.apache.org/jira/browse/SPARK-8301
Added the private method startsWith(prefix, offset) to implement startsWith, endsWith and contains without copying the array
I hope that the SQL component is still correct. I copied it from the Jira ticket.
Author: Tarek Auel <tarek.auel@googlemail.com>
Author: Tarek Auel <tarek.auel@gmail.com>
Closes#6804 from tarekauel/SPARK-8301 and squashes the following commits:
f5d6b9a [Tarek Auel] fixed parentheses and annotation
6d7b068 [Tarek Auel] [SPARK-8301] removed null checks
9ca0473 [Tarek Auel] [SPARK-8301] removed null checks
1c327eb [Tarek Auel] [SPARK-8301] removed new
9f17cc8 [Tarek Auel] [SPARK-8301] fixed conversion byte to string in codegen
3a0040f [Tarek Auel] [SPARK-8301] changed call of UTF8String.set to UTF8String.from
e4530d2 [Tarek Auel] [SPARK-8301] changed call of UTF8String.set to UTF8String.from
a5f853a [Tarek Auel] [SPARK-8301] changed visibility of set to protected. Changed annotation of bytes from Nullable to Nonnull
d2fb05f [Tarek Auel] [SPARK-8301] added additional null checks
79cb55b [Tarek Auel] [SPARK-8301] null check. Added test cases for null check.
b17909e [Tarek Auel] [SPARK-8301] removed unnecessary copying of UTF8String. Added a private function startsWith(prefix, offset) to implement the check for startsWith, endsWith and contains.
**Summary of the problem in SPARK-8470.** When using `HiveContext` to create a data frame of a user case class, Spark throws `scala.reflect.internal.MissingRequirementError` when it tries to infer the schema using reflection. This is caused by `HiveContext` silently overwriting the context class loader containing the user classes.
**What this issue is about.** This issue adds regression tests for SPARK-8470, which is already fixed in #6891. We closed SPARK-8470 as a duplicate because it is a different manifestation of the same problem in SPARK-8368. Due to the complexity of the reproduction, this requires us to pre-package a special test jar and include it in the Spark project itself.
I tested this with and without the fix in #6891 and verified that it passes only if the fix is present.
Author: Andrew Or <andrew@databricks.com>
Closes#6909 from andrewor14/SPARK-8498 and squashes the following commits:
5e9d688 [Andrew Or] Add regression test for SPARK-8470
In earlier versions of Spark SQL we cast `TimestampType` and `DateType` to `StringType` when they were involved in a binary comparison with a `StringType`. This allowed comparing a timestamp with a partial date as a user would expect.
- `time > "2014-06-10"`
- `time > "2014"`
In 1.4.0 we tried to cast the String into a Timestamp instead. However, since partial dates are not valid complete timestamps, this results in `null`, which results in the tuple being filtered.
This PR restores the earlier behavior. Note that we still special case equality so that these comparisons are not affected by not printing zeros for subsecond precision.
Author: Michael Armbrust <michael@databricks.com>
Closes#6888 from marmbrus/timeCompareString and squashes the following commits:
bdef29c [Michael Armbrust] test partial date
1f09adf [Michael Armbrust] special handling of equality
1172c60 [Michael Armbrust] more test fixing
4dfc412 [Michael Armbrust] fix tests
aaa9508 [Michael Armbrust] newline
04d908f [Michael Armbrust] [SPARK-8420][SQL] Fix comparision of timestamps/dates with strings
Author: Nathan Howell <nhowell@godaddy.com>
Closes#6799 from NathanHowell/spark-8093 and squashes the following commits:
76ac3e8 [Nathan Howell] [SPARK-8093] [SQL] Remove empty structs inferred from JSON documents
The ExecutorClassLoader for the REPL causes Janino to fail to find classes in java.lang, so switch to the default class loader for Janino, which will also help performance.
cc liancheng yhuai
Author: Davies Liu <davies@databricks.com>
Closes#6898 from davies/fix_class_loader and squashes the following commits:
24276d4 [Davies Liu] add regression test
4ff0457 [Davies Liu] address comment, refactor
7f5ffbe [Davies Liu] fix REPL class loader with codegen
https://issues.apache.org/jira/browse/SPARK-8368
Also, I added tests according to https://issues.apache.org/jira/browse/SPARK-8058.
Author: Yin Huai <yhuai@databricks.com>
Closes#6891 from yhuai/SPARK-8368 and squashes the following commits:
37bb3db [Yin Huai] Update test timeout and comment.
8762eec [Yin Huai] Style.
695cd2d [Yin Huai] Correctly set the class loader in the conf of the state in client wrapper.
b3378fe [Yin Huai] Failed tests.
Author: Shilei <shilei.qian@intel.com>
Closes#6779 from qiansl127/MD5 and squashes the following commits:
11fcdb2 [Shilei] Fix the indent
04bd27b [Shilei] Add codegen
da60eb3 [Shilei] Remove checkInputDataTypes function
9509ad0 [Shilei] Format code
12c61f4 [Shilei] Accept only BinaryType for Md5
1df0b5b [Shilei] format to scala type
60ccde1 [Shilei] Add more test case
b8c73b4 [Shilei] Rewrite the type check for Md5
c166167 [Shilei] Add md5 function
I have added it only for Scala.
TODO: we should also support `in` operator in Python.
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes#6824 from yu-iskw/SPARK-8348 and squashes the following commits:
e76d02f [Yu ISHIKAWA] Not use infix notation
6f744ac [Yu ISHIKAWA] Fit the test cases because these used the old test data set.
00077d3 [Yu ISHIKAWA] [SPARK-8348][SQL] Add in operator to DataFrame Column
`Path.toUri.getPath` strips the scheme part of the output path (from `file:///foo` to `/foo`), which causes the ORC data source to write only to the file system configured in the Hadoop configuration. We should use `Path.toString` instead.
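A small demonstration of the difference (the exact rendering of `toString` may vary, but it preserves the scheme):
```scala
import org.apache.hadoop.fs.Path

val p = new Path("file:///foo")
println(p.toUri.getPath) // "/foo"  -- the file: scheme is stripped
println(p.toString)      // keeps the scheme, e.g. "file:/foo"
```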
Author: Cheng Lian <lian@databricks.com>
Closes#6892 from liancheng/spark-8458 and squashes the following commits:
87f8199 [Cheng Lian] Don't strip scheme of output path when writing ORC files
Author: Sandy Ryza <sandy@cloudera.com>
Closes#6679 from sryza/sandy-spark-8135 and squashes the following commits:
c5554ff [Sandy Ryza] SPARK-8135. In SerializableWritable, don't load defaults when instantiating Configuration
Some minor updates based on after merging #6725.
Author: Reynold Xin <rxin@databricks.com>
Closes#6871 from rxin/log and squashes the following commits:
ab51542 [Reynold Xin] Use JVM log
76fc8de [Reynold Xin] Fixed arg.
a7c1522 [Reynold Xin] [SPARK-8218][SQL] Binary log math function update.
This patch introduces `SparkPlanTest`, a base class for unit tests of SparkPlan physical operators. This is analogous to Spark SQL's existing `QueryTest`, which does something similar for end-to-end tests with actual queries.
These helper methods provide nicer error output when tests fail and help developers to avoid writing lots of boilerplate in order to execute manually constructed physical plans.
Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <rosenville@gmail.com>
Author: Michael Armbrust <michael@databricks.com>
Closes#6885 from JoshRosen/spark-plan-test and squashes the following commits:
f8ce275 [Josh Rosen] Fix some IntelliJ inspections and delete some dead code
84214be [Josh Rosen] Add an extra column which isn't part of the sort
ae1896b [Josh Rosen] Provide implicits automatically
a80f9b0 [Josh Rosen] Merge pull request #4 from marmbrus/pr/6885
d9ab1e4 [Michael Armbrust] Add simple resolver
c60a44d [Josh Rosen] Manually bind references
996332a [Josh Rosen] Add types so that tests compile
a46144a [Josh Rosen] WIP
JIRA: https://issues.apache.org/jira/browse/SPARK-8363
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#6823 from viirya/move_sqrt and squashes the following commits:
8977e11 [Liang-Chi Hsieh] Remove unnecessary old tests.
d23e79e [Liang-Chi Hsieh] Explicitly indicate sqrt value sequence.
699f48b [Liang-Chi Hsieh] Use correct @since tag.
8dff6d1 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into move_sqrt
bc2ed77 [Liang-Chi Hsieh] Remove/move arithmetic expression test and expression type checking test. Remove unnecessary Sqrt type rule.
d38492f [Liang-Chi Hsieh] Now sqrt accepts boolean because type casting is handled by HiveTypeCoercion.
297cc90 [Liang-Chi Hsieh] Sqrt only accepts double input.
ef4a21a [Liang-Chi Hsieh] Move sqrt to math.
This PR aims to resolve the udf_struct test failure in HiveCompatibilitySuite.
Currently, this is done by loosening CreateStruct's children type from NamedExpression to Expression and automatically generating a StructField name for non-NamedExpression children.
The naming convention for unnamed children follows the udf's counterpart in Hive:
`col1, col2, col3, ...`
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes#6828 from yijieshen/SPARK-8283 and squashes the following commits:
6052b73 [Yijie Shen] Doc fix
677e0b7 [Yijie Shen] Resolve udf_struct test failure by automatically generate structField name for non-NamedExpression children
1. Add `SQLConfEntry` to store the information about a configuration (a sketch follows this list). For those configurations that cannot be found in `sql-programming-guide.md`, I left the doc as `<TODO>`.
2. Verify the value when setting a configuration if this is in SQLConf.
3. Use `SET -v` to display all public configurations.
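A minimal sketch of what such a typed entry might look like (shapes and names are assumed, not the exact `SQLConfEntry` API):
```scala
// Each entry knows its key, default, doc, and how to parse/validate a value.
case class ConfEntry[T](
    key: String,
    defaultValue: Option[T],
    valueConverter: String => T,
    doc: String,
    isPublic: Boolean = true)

val shufflePartitions = ConfEntry[Int](
  "spark.sql.shuffle.partitions",
  defaultValue = Some(200),
  valueConverter = _.toInt,
  doc = "The default number of partitions to use when shuffling data.")
```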
Author: zsxwing <zsxwing@gmail.com>
Closes#6747 from zsxwing/sqlconf and squashes the following commits:
7d09bad [zsxwing] Use SQLConfEntry in HiveContext
49f6213 [zsxwing] Add getConf, setConf to SQLContext and HiveContext
e014f53 [zsxwing] Merge branch 'master' into sqlconf
93dad8e [zsxwing] Fix the unit tests
cf950c1 [zsxwing] Fix the code style and tests
3c5f03e [zsxwing] Add unsetConf(SQLConfEntry) and fix the code style
a2f4add [zsxwing] getConf will return the default value if a config is not set
037b1db [zsxwing] Add schema to SetCommand
0520c3c [zsxwing] Merge branch 'master' into sqlconf
7afb0ec [zsxwing] Fix the configurations about HiveThriftServer
7e728e3 [zsxwing] Add doc for SQLConfEntry and fix 'toString'
5e95b10 [zsxwing] Add enumConf
c6ba76d [zsxwing] setRawString => setConfString, getRawString => getConfString
4abd807 [zsxwing] Fix the test for 'set -v'
6e47e56 [zsxwing] Fix the compilation error
8973ced [zsxwing] Remove floatConf
1fc3a8b [zsxwing] Remove the 'conf' command and use 'set -v' instead
99c9c16 [zsxwing] Fix tests that use SQLConfEntry as a string
88a03cc [zsxwing] Add new lines between confs and return types
ce7c6c8 [zsxwing] Remove seqConf
f3c1b33 [zsxwing] Refactor SQLConf to display better error message
We encourage people to use TestHive in unit tests, because it's
impossible to create more than one HiveContext within one process. The
current implementation locks people into using a local[2] SparkContext
underlying their HiveContext. We should make it possible to override
this using a system property so that people can test against
local-cluster or remote spark clusters to make their tests more
realistic.
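A sketch of the proposed override; the property name here is an assumption for illustration, not necessarily the one the PR uses:
```scala
import org.apache.spark.{SparkConf, SparkContext}

// Fall back to local[2] unless a system property overrides the master.
val master = System.getProperty("spark.sql.test.master", "local[2]")
val sparkContext = new SparkContext(master, "TestSQLContext", new SparkConf())
```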
Author: Punya Biswal <pbiswal@palantir.com>
Closes#6844 from punya/feature/SPARK-8397 and squashes the following commits:
97ef394 [Punya Biswal] [SPARK-8397][SQL] Allow custom configuration for TestHive
https://issues.apache.org/jira/browse/SPARK-8306
I will try to add a test later.
marmbrus aarondav
Author: Yin Huai <yhuai@databricks.com>
Closes#6758 from yhuai/SPARK-8306 and squashes the following commits:
1292346 [Yin Huai] [SPARK-8306] AddJar command needs to set the new class loader to the HiveConf inside executionHive.state.
This PR is an improvement on https://github.com/apache/spark/pull/5189.
The resolution rule for ORDER BY is: first resolve based on what comes from the select clause and then fall back on its child only when this fails.
There are 2 steps. First, try to resolve `Sort` in `ResolveReferences` based on select clause, and ignore exceptions. Second, try to resolve `Sort` in `ResolveSortReferences` and add missing projection.
However, the way we resolve `SortOrder` is wrong. We just resolve `UnresolvedAttribute` and use the result to indicate whether we can resolve `SortOrder`. But `UnresolvedAttribute` is only part of a `GetField` chain (broken by `GetItem`), so we need to go through the whole chain to determine whether we can resolve `SortOrder`.
With this change, we can also avoid re-throwing the GetField exception in `CheckAnalysis`, which is a little ugly.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#5659 from cloud-fan/order-by and squashes the following commits:
cfa79f8 [Wenchen Fan] update test
3245d28 [Wenchen Fan] minor improve
465ee07 [Wenchen Fan] address comment
1fc41a2 [Wenchen Fan] fix SPARK-7067
1. Given a query
`select coalesce(null, 1, '1') from dual` will cause an exception:
java.lang.RuntimeException: Could not determine return type of Coalesce for IntegerType,StringType
2. Given a query:
`select case when true then 1 else '1' end from dual` will cause an exception:
java.lang.RuntimeException: Types in CASE WHEN must be the same or coercible to a common type: StringType != IntegerType
I checked the code; the main cause is that HiveTypeCoercion doesn't do an implicit conversion when there is an IntegerType and a StringType.
Numeric types can be promoted to string type
Hive will always do this implicit conversion.
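A self-contained sketch of that promotion rule using toy types (not Spark's actual HiveTypeCoercion code):
```scala
sealed trait DType
case object IntT extends DType
case object StringT extends DType

// When no tighter common type exists between a numeric type and a string
// type, promote the numeric side to string, as Hive does implicitly.
def widen(a: DType, b: DType): Option[DType] = (a, b) match {
  case (x, y) if x == y                  => Some(x)
  case (IntT, StringT) | (StringT, IntT) => Some(StringT)
  case _                                 => None
}
```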
Author: OopsOutOfMemory <victorshengli@126.com>
Closes#6551 from OopsOutOfMemory/pnts and squashes the following commits:
7a209d7 [OopsOutOfMemory] rebase master
6018613 [OopsOutOfMemory] convert function to method
4cd5618 [OopsOutOfMemory] limit the data type to primitive type
df365d2 [OopsOutOfMemory] refine
95cbd58 [OopsOutOfMemory] fix style
403809c [OopsOutOfMemory] promote non-string to string when can not found tighestCommonTypeOfTwo
For example, large IN clauses are parsed very slowly. The SQL below (10K items in the IN list) takes 45-50s.
s"""SELECT * FROM Person WHERE ForeName IN ('${(1 to 10000).map("n" + _).mkString("','")}')"""
This is principally due to TreeNode, which repeatedly calls contains on children, where children in this case is a List that is 10K long. In effect, parsing large IN clauses is O(N^2).
A lazily initialised Set based on children for contains reduces parse time to around 2.5s.
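A minimal sketch of the optimization under an assumed node shape (not Spark's actual TreeNode):
```scala
// Materialize children into a Set once, lazily, so each membership check is
// O(1) instead of an O(N) scan of a 10K-element List.
abstract class Node {
  def children: Seq[Node]
  private lazy val containsChild: Set[Node] = children.toSet
  def isChild(candidate: Node): Boolean = containsChild.contains(candidate)
}
```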
Author: Michael Davies <Michael.BellDavies@gmail.com>
Closes#6673 from MickDavies/SPARK-8077 and squashes the following commits:
38cd425 [Michael Davies] SPARK-8077: Optimization for TreeNodes with large numbers of children
d80103b [Michael Davies] SPARK-8077: Optimization for TreeNodes with large numbers of children
e6be8be [Michael Davies] SPARK-8077: Optimization for TreeNodes with large numbers of children
JIRA: https://issues.apache.org/jira/browse/SPARK-7199
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#5984 from viirya/add_date_timestamp and squashes the following commits:
7f21ce9 [Liang-Chi Hsieh] For comment.
0b89698 [Liang-Chi Hsieh] Add timestamp to settableFieldTypes.
c30d490 [Liang-Chi Hsieh] Use default IntUnsafeColumnWriter and LongUnsafeColumnWriter.
672ef17 [Liang-Chi Hsieh] Remove getter/setter for Date and Timestamp and use Int and Long for them.
9f3e577 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_date_timestamp
281e844 [Liang-Chi Hsieh] Fix scala style.
fb532b5 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_date_timestamp
80af342 [Liang-Chi Hsieh] Fix compiling error.
f4f5de6 [Liang-Chi Hsieh] Fix scala style.
a463e83 [Liang-Chi Hsieh] Use Long to store timestamp for rows.
635388a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_date_timestamp
46946c6 [Liang-Chi Hsieh] Adapt for moved DateUtils.
b16994e [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_date_timestamp
752251f [Liang-Chi Hsieh] Support setDate. Fix failed test.
fcf8db9 [Liang-Chi Hsieh] Add functions for Date and Timestamp to SpecificRow.
e42a809 [Liang-Chi Hsieh] Fix style.
4c07b57 [Liang-Chi Hsieh] Add date and timestamp support to UnsafeRow.
chenghao-intel adrian-wang
Author: dragonli <lisurprise@gmail.com>
Author: zhichao.li <zhichao.li@intel.com>
Closes#6838 from zhichao-li/positive and squashes the following commits:
e1032a0 [dragonli] remove useless import and refactor code
624d438 [zhichao.li] add positive identify function
When I test the following code:
```scala
hiveContext.sql("""use testdb""")
val df = (1 to 3).map(i => (i, s"val_$i", i * 2)).toDF("a", "b", "c")
df.write
  .format("parquet")
  .mode(SaveMode.Overwrite)
  .saveAsTable("ttt3")
hiveContext.sql("show TABLES in default")
```
I found that the table ttt3 is created under the database "default".
Author: baishuo <vc_java@hotmail.com>
Closes#6695 from baishuo/SPARK-8516-use-database and squashes the following commits:
9e155f9 [baishuo] remove no use comment
cb9f027 [baishuo] modify testcase
00a7a2d [baishuo] modify testcase
4df48c7 [baishuo] modify testcase
b742e69 [baishuo] modify testcase
3d19ad9 [baishuo] create table to specific database
[SQL][DOC] I found it a bit confusing when I came across it for the first time in the docs
Author: Radek Ostrowski <dest.hawaii@gmail.com>
Author: radek <radek@radeks-MacBook-Pro-2.local>
Closes#6332 from radek1st/master and squashes the following commits:
dae3347 [Radek Ostrowski] fixed typo
c76bb3a [radek] improved a comment
In order to have better performance out of the box, this PR turns on codegen by default; codegen can then be tested by sql/test and hive/test.
This PR also fixes some corner cases for codegen.
Before the 1.5 release, we should revisit this and turn it off if it's not stable or is causing regressions.
cc rxin JoshRosen
Author: Davies Liu <davies@databricks.com>
Closes#6726 from davies/enable_codegen and squashes the following commits:
f3b25a5 [Davies Liu] fix warning
73750ea [Davies Liu] fix long overflow when compare
3017a47 [Davies Liu] Merge branch 'master' of github.com:apache/spark into enable_codegen
a7d75da [Davies Liu] Merge branch 'master' of github.com:apache/spark into enable_codegen
ff5b75a [Davies Liu] Merge branch 'master' of github.com:apache/spark into enable_codegen
f4cf2c2 [Davies Liu] fix style
99fc139 [Davies Liu] Merge branch 'enable_codegen' of github.com:davies/spark into enable_codegen
91fc7a2 [Davies Liu] disable codegen for ScalaUDF
207e339 [Davies Liu] Update CodeGenerator.scala
44573a3 [Davies Liu] check thread safety of expression
f3886fa [Davies Liu] don't inline primitiveTerm for null literal
c8e7cd2 [Davies Liu] address comment
a8618c9 [Davies Liu] enable codegen by default
This PR fixes the problem reported by Justin Yip in the thread 'NullPointerException with functions.rand()'
Tested using spark-shell and verified that the following works:
```scala
sqlContext.createDataFrame(Seq((1,2), (3, 100))).withColumn("index", rand(30)).show()
```
Author: tedyu <yuzhihong@gmail.com>
Closes#6793 from tedyu/master and squashes the following commits:
62fd97b [tedyu] Create RandomSuite
750f92c [tedyu] Add test for Rand() with seed
a1d66c5 [tedyu] Fix NullPointerException with functions.rand()
Add aggregates in ORDER BY clauses to the `Aggregate` operator beneath. Project these results away after the Sort.
Based on work by watermen. Also Closes#5290.
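An illustrative query of the kind this enables (table and column names assumed): the aggregate in the ORDER BY is computed in the `Aggregate` below and projected away after the `Sort`.
```scala
// Order groups by an aggregate that is not in the select list.
sqlContext.sql("SELECT key FROM src GROUP BY key ORDER BY max(value)")
```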
Author: Yadong Qi <qiyadong2010@gmail.com>
Author: Michael Armbrust <michael@databricks.com>
Closes#6816 from marmbrus/pr/5290 and squashes the following commits:
3226a97 [Michael Armbrust] consistent ordering
eb8938d [Michael Armbrust] no vars
c8b25c1 [Yadong Qi] move the test data.
7f9b736 [Yadong Qi] delete Substring case
a1e87c1 [Yadong Qi] fix conflict
f119849 [Yadong Qi] order by aggregated function
This change has two parts.
The first one gets rid of "ReflectionMagic". That worked well for the differences between 0.12 and
0.13, but breaks in 0.14, since some of the APIs that need to be used have primitive types. I could
not figure out a way to make that class work with primitive types. So instead I wrote some shims
(I can already hear the collective sigh) that find the appropriate methods via reflection. This should
be faster since the method instances are cached, and the code is not much uglier than before,
with the advantage that all the ugliness is local to one file (instead of multiple switch statements on
the version being used scattered in ClientWrapper).
The second part is simple: add code to handle Hive 0.14. A few new methods had to be added
to the new shims.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#6627 from vanzin/SPARK-8065 and squashes the following commits:
3fa4270 [Marcelo Vanzin] Indentation style.
4b8a3d4 [Marcelo Vanzin] Fix dep exclusion.
be3d0cc [Marcelo Vanzin] Merge branch 'master' into SPARK-8065
ca3fb1e [Marcelo Vanzin] Merge branch 'master' into SPARK-8065
b43f13e [Marcelo Vanzin] Since exclusions seem to work, clean up some of the code.
73bd161 [Marcelo Vanzin] Botched merge.
d2ddf01 [Marcelo Vanzin] Comment about excluded dep.
0c929d1 [Marcelo Vanzin] Merge branch 'master' into SPARK-8065
2c3c02e [Marcelo Vanzin] Try to fix tests by adding support for exclusions.
0a03470 [Marcelo Vanzin] Try to fix tests by upgrading calcite dependency.
13b2dfa [Marcelo Vanzin] Fix NPE.
6439d88 [Marcelo Vanzin] Minor style thing.
69b017b [Marcelo Vanzin] Style.
a21cad8 [Marcelo Vanzin] Part II: Add shims / version for Hive 0.14.
ae98c87 [Marcelo Vanzin] PART I: Get rid of reflection magic.
Added unit tests for all supported data types for:
- Add
- Subtract
- Multiply
- Divide
- UnaryMinus
- Remainder
Fixed bugs caught by the unit tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#6813 from rxin/SPARK-8362 and squashes the following commits:
fb3fe62 [Reynold Xin] Added Remainder.
3b266ba [Reynold Xin] [SPARK-8362] Add unit tests for +, -, *, /.
Author: Michael Armbrust <michael@databricks.com>
Closes#6811 from marmbrus/aliasExplodeStar and squashes the following commits:
fbd2065 [Michael Armbrust] more style
806a373 [Michael Armbrust] fix style
7cbb530 [Michael Armbrust] [SPARK-8358][SQL] Wait for child resolution when resolving generatorsa
UnsafeFixedWidthAggregationMap contains an off-by-factor-of-8 error when allocating row conversion scratch space: we take a size requirement, measured in bytes, then allocate a long array of that size. This means that we end up allocating 8x too much conversion space.
This patch fixes this by allocating a `byte[]` array instead. This doesn't impose any new limitations on the maximum sizes of UnsafeRows, since UnsafeRowConverter already used integers when calculating the size requirements for rows.
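The bug in a nutshell (illustrative):
```scala
val sizeInBytes = 1024
val buggy = new Array[Long](sizeInBytes) // 1024 longs = 8192 bytes: 8x too much
val fixed = new Array[Byte](sizeInBytes) // exactly 1024 bytes
```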
Author: Josh Rosen <joshrosen@databricks.com>
Closes#6809 from JoshRosen/sql-bytes-vs-words-fix and squashes the following commits:
6520339 [Josh Rosen] Updates to reflect fact that UnsafeRow max size is constrained by max byte[] size
Author: Reynold Xin <rxin@databricks.com>
Closes#6806 from rxin/gs and squashes the following commits:
ed1aebb [Reynold Xin] Fixed style.
c7fc3e6 [Reynold Xin] [SPARK-8349][SQL] Use expression constructors (rather than apply) in FunctionRegistry
Also addressed code review feedback from #6754
Author: Reynold Xin <rxin@databricks.com>
Closes#6803 from rxin/abs and squashes the following commits:
d07beba [Reynold Xin] [SPARK-8347] Add unit tests for abs.
JIRA: https://issues.apache.org/jira/browse/SPARK-8052
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#6645 from viirya/cast_string_integraltype and squashes the following commits:
e19c6a3 [Liang-Chi Hsieh] For comment.
c3e472a [Liang-Chi Hsieh] Add test.
7ced9b0 [Liang-Chi Hsieh] Use java.math.BigDecimal for casting String to Decimal instead of using toDouble.
This patch updates two pieces of logic that are related to handling of keyOrderings in ShuffleDependencies:
- The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever the shuffle dependency specifies a key ordering, but technically we only need to fall back when an aggregator is also specified. This patch updates the fallback logic to reflect this so that the Tungsten optimizations can apply to more workloads.
- The SQL Exchange operator performs defensive copying of shuffle inputs when a key ordering is specified, but this is unnecessary. The copying was added to guard against cases where ExternalSorter would buffer non-serialized records in memory. When ExternalSorter is configured without an aggregator, it uses the following logic to determine whether to buffer records in a serialized or deserialized format:
```scala
private val useSerializedPairBuffer =
ordering.isEmpty &&
conf.getBoolean("spark.shuffle.sort.serializeMapOutputs", true) &&
ser.supportsRelocationOfSerializedObjects
```
The `newOrdering.isDefined` branch in `ExternalSorter.needToCopyObjectsBeforeShuffle`, removed by this patch, is not necessary:
- It was checked even if we weren't using sort-based shuffle, but this was unnecessary because only SortShuffleManager performs map-side sorting.
- Map-side sorting during shuffle writing is only performed for shuffles that perform map-side aggregation as part of the shuffle (to see this, look at how SortShuffleWriter constructs ExternalSorter). Since SQL never pushes aggregation into Spark's shuffle, we can guarantee that both the aggregator and ordering will be empty and Spark SQL always uses serializers that support relocation, so sort-shuffle will use the serialized pair buffer unless the user has explicitly disabled it via the SparkConf feature-flag. Therefore, I think my optimization in Exchange should be safe.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#6773 from JoshRosen/SPARK-8319 and squashes the following commits:
7a14129 [Josh Rosen] Revise comments; add handler to guard against future ShuffleManager implementations
07bb2c9 [Josh Rosen] Update comment to clarify circumstances under which shuffle operates on serialized records
269089a [Josh Rosen] Avoid unnecessary copy in SQL Exchange
34e526e [Josh Rosen] Enable Tungsten shuffle for non-agg shuffles w/ key orderings
cc rxin marmbrus
Author: Davies Liu <davies@databricks.com>
Closes#6802 from davies/cleanup_internalrow and squashes the following commits:
769d2aa [Davies Liu] remove not needed cast
4acbbe4 [Davies Liu] catalyst.Internal -> InternalRow
The original fix uses DecimalType.Unlimited, which is harder to
handle afterwards. There is no scale and most data should fit into
a long, thus DecimalType(20,0) should be better.
Author: Rene Treffer <treffer@measite.de>
Closes#6789 from rtreffer/spark-7897-unsigned-bigint-as-decimal and squashes the following commits:
2006613 [Rene Treffer] Fix type for "unsigned bigint" jdbc loading.
Author: Michael Armbrust <michael@databricks.com>
Closes#6786 from marmbrus/optionsParser and squashes the following commits:
e7d18ef [Michael Armbrust] add dots
99a3452 [Michael Armbrust] [SPARK-8329][SQL] Allow _ in DataSource options
Currently, we use o.a.s.sql.Row both internally and externally. The external interface is wider than what the internals need because it is designed to facilitate end-user programming. This design has proven to be very error prone and cumbersome for internal Row implementations.
As a first step, we create an InternalRow interface in the catalyst module, which is identical to the current Row interface. And we switch all internal operators/expressions to use this InternalRow instead. When we need to expose Row, we convert the InternalRow implementation into Row for users.
For all public API, we use Row (for example, data source APIs), which will be converted into/from InternalRow by CatalystTypeConverters.
For all internal data sources (Json, Parquet, JDBC, Hive), we use InternalRow for better performance, cast into Row in buildScan() (without changing the public API). When creating a PhysicalRDD, we cast them back to InternalRow.
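The division of labor can be pictured with a toy sketch (assumed shapes only; the real `Row` and `InternalRow` live in `org.apache.spark.sql` and the catalyst module):
```scala
// Toy stand-ins for the real classes: operators traffic in InternalRow;
// users only ever see Row, converted at the public API boundary.
final case class InternalRow(values: Array[Any]) // catalyst-internal
final case class Row(values: Seq[Any])           // user-facing

object Converters {
  // Leaving the engine: internal representation -> external Row.
  def toExternal(row: InternalRow): Row = Row(row.values.toSeq)
  // Entering the engine: user-supplied Row -> internal representation.
  def toInternal(row: Row): InternalRow = InternalRow(row.values.toArray)
}
```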
cc rxin marmbrus JoshRosen
Author: Davies Liu <davies@databricks.com>
Closes#6792 from davies/internal_row and squashes the following commits:
f2abd13 [Davies Liu] fix scalastyle
a7e025c [Davies Liu] move InternalRow into catalyst
30db8ba [Davies Liu] Merge branch 'master' of github.com:apache/spark into internal_row
7cbced8 [Davies Liu] separate Row and InternalRow
It's a follow-up of https://github.com/apache/spark/pull/6173: for expressions like `Coalesce` that have a `Seq[Expression]`, when we do a semantic equality check we need to do the semantic equality check for all of its children.
Also we can just use `Seq[(Expression, NamedExpression)]` instead of `Map[Expression, NamedExpression]` as we only search it with `find`.
chenghao-intel, I agree that we probably can never know `semanticEquals` in a fully general way, but I think we have done that in `TreeNode`, so we can use similar logic. Then we can handle something like `Coalesce(children: Seq[Expression])` correctly.
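A hedged toy of the recursive check being described (not Spark's actual `Expression` API):
```scala
// Two nodes are semantically equal when they have the same shape and all
// children are pairwise semantically equal, so Coalesce's whole child
// sequence is compared rather than the Seq reference itself.
sealed trait Expr { def children: Seq[Expr] }
case class Attr(id: Long, name: String) extends Expr {
  val children: Seq[Expr] = Seq.empty
}
case class Coalesce(children: Seq[Expr]) extends Expr

def semanticEquals(a: Expr, b: Expr): Boolean = (a, b) match {
  case (Attr(i, _), Attr(j, _)) => i == j // names are cosmetic; ids matter
  case _ =>
    a.getClass == b.getClass &&
      a.children.length == b.children.length &&
      a.children.zip(b.children).forall { case (x, y) => semanticEquals(x, y) }
}
```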
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#6261 from cloud-fan/tmp and squashes the following commits:
4daef88 [Wenchen Fan] address comments
dd8fbd9 [Wenchen Fan] correct semanticEquals
This brings in a major improvement in that footers are no longer read on the driver. It also cleans up the code in parquetTableOperations, where we had to override getSplits to eliminate multiple listStatus calls.
cc liancheng
Are there any other changes we need for this?
Author: Yash Datta <Yash.Datta@guavus.com>
Closes#5889 from saucam/parquet_1.6 and squashes the following commits:
d1bf41e [Yash Datta] SPARK-7340: Fix scalastyle and incorporate review comments
c9aa042 [Yash Datta] SPARK-7340: Use the new user defined filter predicate for pushing down inset into parquet
56bc750 [Yash Datta] SPARK-7340: Change parquet version to latest release
[Related PR SPARK-7044](https://github.com/apache/spark/pull/5671)
Author: zhichao.li <zhichao.li@intel.com>
Closes#6404 from zhichao-li/transform and squashes the following commits:
8418c97 [zhichao.li] add comments and remove useless failAfter logic
d9677e1 [zhichao.li] redirect the error desitination to be the same as the current process
In some cases, Spark SQL pushes sorting operations into the shuffle layer by specifying a key ordering as part of the shuffle dependency. I think that we should not do this:
- Since we do not delegate aggregation to Spark's shuffle, specifying the keyOrdering as part of the shuffle has no effect on the shuffle map side.
- By performing the sort ourselves (by inserting a sort operator after the shuffle instead), we can use the Exchange planner to choose specialized sorting implementations based on the types of rows being sorted.
- We can remove some complexity from SqlSerializer2 by not requiring it to know about sort orderings, since SQL's own sort operators will already perform the necessary defensive copying.
This patch removes Exchange's `canSortWithShuffle` path and the associated code in `SqlSerializer2`. Shuffles that used to go through the `canSortWithShuffle` path would always wind up using Spark's `ExternalSorter` (inside of `HashShuffleReader`); to avoid a performance regression as a result of handling these shuffles ourselves, I've changed the SQLConf defaults so that external sorting is enabled by default.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#6772 from JoshRosen/SPARK-8317 and squashes the following commits:
ebf9c0f [Josh Rosen] Do not push sort into shuffle in Exchange operator
bf3b4c8 [Josh Rosen] Enable external sort by default
When the df.cache() method is called, the `withCachedData` plan of `QueryExecution` has already been created, which means it will not look up the cached tables when an action method is called afterward.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#5714 from chenghao-intel/SPARK-7158 and squashes the following commits:
58ea8aa [Cheng Hao] style issue
2bf740f [Cheng Hao] create new QueryExecution instance for CacheManager
a5647d9 [Cheng Hao] hide the queryExecution of DataFrame
fbfd3c5 [Cheng Hao] make the DataFrame.queryExecution mutable for cache/persist/unpersist
Unit test is still in Scala.
Author: Reynold Xin <rxin@databricks.com>
Closes#6738 from rxin/utf8string-java and squashes the following commits:
562dc6e [Reynold Xin] Flag...
98e600b [Reynold Xin] Another try with encoding setting ..
cfa6bdf [Reynold Xin] Merge branch 'master' into utf8string-java
a3b124d [Reynold Xin] Try different UTF-8 encoded characters.
1ff7c82 [Reynold Xin] Enable UTF-8 encoding.
82d58cc [Reynold Xin] Reset run-tests.
2cb3c69 [Reynold Xin] Use utf-8 encoding in set bytes.
53f8ef4 [Reynold Xin] Hack Jenkins to run one test.
9a48e8d [Reynold Xin] Fixed runtime compilation error.
911c450 [Reynold Xin] Moved unit test also to Java.
4eff7bd [Reynold Xin] Improved unit test coverage.
8e89a3c [Reynold Xin] Fixed tests.
77c64bd [Reynold Xin] Fixed string type codegen.
ffedb62 [Reynold Xin] Code review feedback.
0967ce6 [Reynold Xin] Fixed import ordering.
45a123d [Reynold Xin] [SPARK-8286] Rewrite UTF8String in Java and move it into unsafe package.
```
create table t1 (a int, b string) as select key, value from src;
desc t1;
key int NULL
value string NULL
```
Thus Hive doesn't support specifying the column list for the target table in CTAS. However, we should either throw an exception explicitly or support this feature; we pick the latter, which seems useful and straightforward.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#6458 from chenghao-intel/ctas_column and squashes the following commits:
d1fa9b6 [Cheng Hao] bug in unittest
4e701aa [Cheng Hao] update as feedback
f305ec1 [Cheng Hao] support specifying the column list for target table in CTAS
This PR fixes a few small issues in codegen:
1. cast decimal to boolean
2. do not inline literals with null
3. improve SpecificRow.equals()
4. test expressions with optimized expressions
5. fix comparison with BinaryType
cc rxin chenghao-intel
Author: Davies Liu <davies@databricks.com>
Closes#6755 from davies/fix_codegen and squashes the following commits:
ef27343 [Davies Liu] address comments
6617ea6 [Davies Liu] fix scala tyle
70b7dda [Davies Liu] improve codegen
Spark SQL does not support timezones, and Pyrolite does not handle timezones well. This patch converts datetimes into POSIX timestamps (avoiding timezone confusion), which is what SQL uses. If a datetime object does not have a timezone, it's treated as local time.
The timezone in an RDD will be lost after one round trip; all datetimes coming from SQL will be local time.
Because of Pyrolite, datetimes from SQL only have millisecond precision.
This PR also drops the timezone in dates, converting them to the number of days since the epoch (as used in SQL).
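For intuition, here is a hedged Scala sketch of the two conversions described (the actual patch does the equivalent on the Python side via Pyrolite):
```scala
import java.time.{LocalDate, ZoneOffset, ZonedDateTime}

// A timezone-aware datetime collapses to a POSIX timestamp: the offset is
// folded into the instant, so no timezone survives the round trip.
val dt = ZonedDateTime.of(2015, 6, 1, 12, 0, 0, 0, ZoneOffset.ofHours(8))
val posixSeconds: Long = dt.toEpochSecond

// A date becomes the number of days since the Unix epoch.
val epochDays: Long = LocalDate.of(2015, 6, 1).toEpochDay
```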
Author: Davies Liu <davies@databricks.com>
Closes#6250 from davies/tzone and squashes the following commits:
44d8497 [Davies Liu] add timezone support for DateType
99d9d9c [Davies Liu] use int for timestamp
10aa7ca [Davies Liu] Merge branch 'master' of github.com:apache/spark into tzone
6a29aa4 [Davies Liu] support datetime with timezone
Author: Daoyuan Wang <daoyuan.wang@intel.com>
This patch had conflicts when merged, resolved by
Committer: Reynold Xin <rxin@databricks.com>
Closes#6718 from adrian-wang/udflog2 and squashes the following commits:
3909f48 [Daoyuan Wang] math function: log2
Author: Cheng Hao <hao.cheng@intel.com>
Closes#6724 from chenghao-intel/length and squashes the following commits:
aaa3c31 [Cheng Hao] revert the additional change
97148a9 [Cheng Hao] remove the codegen testing temporally
ae08003 [Cheng Hao] update the comments
1eb1fd1 [Cheng Hao] simplify the code as commented
3e92d32 [Cheng Hao] use the selectExpr in unit test intead of SQLQuery
3c729aa [Cheng Hao] fix bug for constant null value in codegen
3641f06 [Cheng Hao] keep the length() method for registered function
8e30171 [Cheng Hao] update the code as comment
db604ae [Cheng Hao] Add code gen support
548d2ef [Cheng Hao] register the length()
09a0738 [Cheng Hao] add length support
Currently we only support `Seq[Expression]`; we should handle cases like `Seq[Seq[Expression]]` so that we can remove the unnecessary `GroupExpression`.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#6706 from cloud-fan/clean and squashes the following commits:
60a1193 [Wenchen Fan] support nested expression sequence and remove GroupExpression
```scala
case cs @ CombineSum(expr) =>
  val calcType = expr.dataType
    expr.dataType match {
      case DecimalType.Fixed(_, _) =>
        DecimalType.Unlimited
      case _ =>
        expr.dataType
    }
```
calcType is always expr.dataType: the match expression dangles and its result is discarded. Credit belongs entirely to IntelliJ.
Author: navis.ryu <navis@apache.org>
Closes#6736 from navis/SPARK-8285 and squashes the following commits:
20382c1 [navis.ryu] [SPARK-8285] [SQL] CombineSum should be calculated as unlimited decimal first
This PR changes TimestampType to use Long as its internal type, for efficiency, which means it will lose precision below 100ns.
Author: Davies Liu <davies@databricks.com>
Closes#6733 from davies/timestamp and squashes the following commits:
d9565fa [Davies Liu] remove print
65cf2f1 [Davies Liu] fix Timestamp in SparkR
86fecfb [Davies Liu] disable two timestamp tests
8f77ee0 [Davies Liu] fix scala style
246ee74 [Davies Liu] address comments
309d2e1 [Davies Liu] use Long for TimestampType in SQL
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#6716 from adrian-wang/epi and squashes the following commits:
e2e8dbd [Daoyuan Wang] move tests
11b351c [Daoyuan Wang] add tests and remove pu
db331c9 [Daoyuan Wang] py style
599ddd8 [Daoyuan Wang] add py
e6783ef [Daoyuan Wang] register function
82d426e [Daoyuan Wang] add function entry
dbf3ab5 [Daoyuan Wang] add PI and E
This is a followup to #6712.
Author: Reynold Xin <rxin@databricks.com>
Closes#6739 from rxin/6712-followup and squashes the following commits:
fd9acfb [Reynold Xin] [SPARK-7886] Added unit test for HAVING aggregate pushdown.
This builds on #6710 and also uses FunctionRegistry for function lookup in HiveContext.
Author: Reynold Xin <rxin@databricks.com>
Closes#6712 from rxin/udf-registry-hive and squashes the following commits:
f4c2df0 [Reynold Xin] Fixed style violation.
0bd4127 [Reynold Xin] Fixed Python UDFs.
f9a0378 [Reynold Xin] Disable one more test.
5609494 [Reynold Xin] Disable some failing tests.
4efea20 [Reynold Xin] Don't check children resolved for UDF resolution.
2ebe549 [Reynold Xin] Removed more hardcoded functions.
aadce78 [Reynold Xin] [SPARK-7886] Use FunctionRegistry for built-in expressions in HiveContext.
Just replaced mutable.HashMap with ConcurrentHashMap.
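A tiny illustration of the swap (toy values; the real map holds table names to logical plans):
```scala
import java.util.concurrent.ConcurrentHashMap

// ConcurrentHashMap tolerates concurrent registerTempTable-style calls
// without external synchronization, unlike scala.collection.mutable.HashMap.
val tempTables = new ConcurrentHashMap[String, String]()
tempTables.put("t1", "logical plan for t1")
tempTables.put("t2", "logical plan for t2")
println(tempTables.get("t1"))
```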
Author: navis.ryu <navis@apache.org>
Closes#6699 from navis/SPARK-7792 and squashes the following commits:
f03654a [navis.ryu] [SPARK-7792] [SQL] HiveContext registerTempTable not thread safe
This patch switches to using FunctionRegistry for built-in expressions. It is based on #6463, but with some work to simplify it along with unit tests.
TODOs for future pull requests:
- Use static registration so we don't need to register all functions every time we start a new SQLContext
- Switch to using this in HiveContext
Author: Reynold Xin <rxin@databricks.com>
Author: Santiago M. Mola <santi@mola.io>
Closes#6710 from rxin/udf-registry and squashes the following commits:
6930822 [Reynold Xin] Fixed Python test.
b802c9a [Reynold Xin] Made UDF case insensitive.
e60d815 [Reynold Xin] Made UDF case insensitive.
852f9c0 [Reynold Xin] Fixed style violation.
e76a3c1 [Reynold Xin] Fixed parser.
52ddaba [Reynold Xin] Fixed compilation.
ee7854f [Reynold Xin] Improved error reporting.
ff906f2 [Reynold Xin] More robust constructor calling.
77b46f1 [Reynold Xin] Simplified the code.
2a2a149 [Reynold Xin] Merge pull request #6463 from smola/SPARK-7886
8616924 [Santiago M. Mola] [SPARK-7886] Add built-in expressions to FunctionRegistry.
Use DoubleType instead to be more stable and robust.
Author: Reynold Xin <rxin@databricks.com>
Closes#6692 from rxin/SPARK-8148 and squashes the following commits:
6742ecc [Reynold Xin] [SPARK-8148] Do not use FloatType in partition column inference.
We already have a rule to do type coercion for fixed decimal and unlimited decimal in `WidenTypes`, so we don't need to handle them in `DecimalPrecision`.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#6698 from cloud-fan/fix and squashes the following commits:
413ad4a [Wenchen Fan] remove duplicated cases
For Hadoop 1.x, the `TaskAttemptContext` constructor clones the `Configuration` argument, so configurations made in `HadoopFsRelation.prepareForWriteJob()` are not propagated to the *driver*-side `TaskAttemptContext` (executor-side configurations are propagated properly). Currently this should only affect the Parquet output committer class configuration.
Author: Cheng Lian <lian@databricks.com>
Closes#6669 from liancheng/spark-8121 and squashes the following commits:
73819e8 [Cheng Lian] Minor logging fix
fce089c [Cheng Lian] Adds more logging
b6f78a6 [Cheng Lian] Fixes compilation error introduced while rebasing
963a1aa [Cheng Lian] Addresses @yhuai's comment
c3a0b1a [Cheng Lian] Fixes InsertIntoHadoopFsRelation job initialization
1. explicitly import implicit conversion support.
2. use .nonEmpty instead of .size > 0
3. use val instead of var
4. fix comment indentation
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#6700 from adrian-wang/shimsimprove and squashes the following commits:
d22e108 [Daoyuan Wang] several fix for HiveShim
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#6434 from adrian-wang/joinerr and squashes the following commits:
ee1b64f [Daoyuan Wang] break line
f7c53e9 [Daoyuan Wang] to IllegalArgumentException
f8dea2d [Daoyuan Wang] sys.err to IllegalStateException
be82259 [Daoyuan Wang] change new exception to sys.err
JIRA: https://issues.apache.org/jira/browse/SPARK-7939
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#6503 from viirya/disable_partition_type_inference and squashes the following commits:
3e90470 [Liang-Chi Hsieh] Default to enable type inference and update docs.
455edb1 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into disable_partition_type_inference
9a57933 [Liang-Chi Hsieh] Add conf to enable/disable partition column type inference.
From my perspective as a code reviewer, I find them more confusing than using String directly.
Author: Reynold Xin <rxin@databricks.com>
Closes#6694 from rxin/SPARK-8154 and squashes the following commits:
4e5056c [Reynold Xin] [SPARK-8154][SQL] Remove Term/Code type aliases in code generation.
Also moved a few files in expressions package around to match test suites.
Author: Reynold Xin <rxin@databricks.com>
Closes#6693 from rxin/expr-refactoring and squashes the following commits:
857599f [Reynold Xin] Fixed style violation.
c0eb74b [Reynold Xin] Fixed compilation.
b3a40f8 [Reynold Xin] Refactored expression test suites.
This PR moves the codegen implementation of expressions into the Expression class itself, making it easier to manage.
It introduces two APIs in Expression:
```
def gen(ctx: CodeGenContext): GeneratedExpressionCode
def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): Code
```
gen(ctx) will call genCode(ctx, ev) to generate Java source code for the current expression. An expression needs to override genCode().
Here are the types:
```scala
type Term = String
type Code = String

/**
 * Java source for evaluating an [[Expression]] given a [[Row]] of input.
 */
case class GeneratedExpressionCode(
    var code: Code,
    nullTerm: Term,
    primitiveTerm: Term,
    objectTerm: Term)

/**
 * A context for codegen, used to bookkeep expressions that are not supported
 * by codegen so that they can be evaluated directly. An unsupported expression
 * is appended at the end of `references`; its position is kept in the
 * generated code and used to access and evaluate it.
 */
class CodeGenContext {
  /**
   * Holds all the expressions that do not support codegen; they will be
   * evaluated directly.
   */
  val references: Seq[Expression] = new mutable.ArrayBuffer[Expression]()
}
```
This is basically #6660, but fixed style violation and compilation failure.
Author: Davies Liu <davies@databricks.com>
Author: Reynold Xin <rxin@databricks.com>
Closes#6690 from rxin/codegen and squashes the following commits:
e1368c2 [Reynold Xin] Fixed tests.
73db80e [Reynold Xin] Fixed compilation failure.
19d6435 [Reynold Xin] Fixed style violation.
9adaeaf [Davies Liu] address comments
f42c732 [Davies Liu] improve coverage and tests
bad6828 [Davies Liu] address comments
e03edaa [Davies Liu] consts fold
86fac2c [Davies Liu] fix style
02262c9 [Davies Liu] address comments
b5d3617 [Davies Liu] Merge pull request #5 from rxin/codegen
48c454f [Reynold Xin] Some code gen update.
2344bc0 [Davies Liu] fix test
12ff88a [Davies Liu] fix build
c5fb514 [Davies Liu] rename
8c6d82d [Davies Liu] update docs
b145047 [Davies Liu] fix style
e57959d [Davies Liu] add type alias
3ff25f8 [Davies Liu] refactor
593d617 [Davies Liu] pushing codegen into Expression
This PR fixes a bug introduced in https://github.com/apache/spark/pull/6505.
A decimal literal's value is not a `java.math.BigDecimal`, but Spark SQL's internal type `Decimal`.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#6574 from cloud-fan/fix and squashes the following commits:
b0e3549 [Wenchen Fan] rename to BooleanEquality
1987b37 [Wenchen Fan] use Decimal instead of java.math.BigDecimal
f93c420 [Wenchen Fan] compare literal
This is a follow-up patch to #6577 that replaces columnEnclosing with quoteIdentifier.
I also did some minor cleanup to the JdbcDialect file.
Author: Reynold Xin <rxin@databricks.com>
Closes#6689 from rxin/jdbc-quote and squashes the following commits:
bad365f [Reynold Xin] Fixed test compilation...
e39e14e [Reynold Xin] Fixed compilation.
db9a8e0 [Reynold Xin] [SPARK-8004][SQL] Quote identifier in JDBC data source.
Author: Cheng Lian <lian@databricks.com>
Closes#6670 from liancheng/spark-8118 and squashes the following commits:
b6e85a6 [Cheng Lian] Suppresses unnecesary ParquetRecordReader log message (PARQUET-220)
385603c [Cheng Lian] Mutes noisy Parquet log output reappeared after upgrading Parquet to 1.7.0
JIRA: https://issues.apache.org/jira/browse/SPARK-8141
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#6687 from viirya/reuse_partition_column_types and squashes the following commits:
dab0688 [Liang-Chi Hsieh] Reuse partitionColumnTypes.
JIRA: https://issues.apache.org/jira/browse/SPARK-8004
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#6577 from viirya/enclose_jdbc_columns and squashes the following commits:
614606a [Liang-Chi Hsieh] For comment.
bc50182 [Liang-Chi Hsieh] Enclose column names by JDBC Dialect.
As described in SPARK-8079, when writing a DataFrame to a `HadoopFsRelation`, if `HadoopFsRelation.prepareForWriteJob` throws an exception, an unexpected NPE will be thrown during job abortion. (This issue doesn't do much damage since the job is failing anyway.)
This PR makes the job/task abortion logic in `InsertIntoHadoopFsRelation` more robust to avoid such confusing exceptions.
Author: Cheng Lian <lian@databricks.com>
Closes#6612 from liancheng/spark-8079 and squashes the following commits:
87cd81e [Cheng Lian] Addresses @rxin's comment
1864c75 [Cheng Lian] Addresses review comments
9e6dbb3 [Cheng Lian] Makes InsertIntoHadoopFsRelation job/task abortion more robust
Author: Reynold Xin <rxin@databricks.com>
Closes#6677 from rxin/test-wildcard and squashes the following commits:
8a17b33 [Reynold Xin] Fixed line length.
6663813 [Reynold Xin] [SPARK-8114][SQL] Remove some wildcard import on TestSQLContext._ round 3.
Support runInBackground in SparkExecuteStatementOperation, and add cancellation
Author: Dong Wang <dong@databricks.com>
Closes#6207 from dongwang218/SPARK-6964-jdbc-cancel and squashes the following commits:
687c113 [Dong Wang] fix 100 characters
7bfa2a7 [Dong Wang] fix merge
380480f [Dong Wang] fix for liancheng's comments
eb3e385 [Dong Wang] small nit
341885b [Dong Wang] small fix
3d8ebf8 [Dong Wang] add spark.sql.hive.thriftServer.async flag
04142c3 [Dong Wang] set SQLSession for async execution
184ec35 [Dong Wang] keep hive conf
819ae03 [Dong Wang] [SPARK-6964][SQL][WIP] Support Cancellation in the Thrift Server
Fixed the following packages:
sql.columnar
sql.jdbc
sql.json
sql.parquet
Author: Reynold Xin <rxin@databricks.com>
Closes#6667 from rxin/testsqlcontext_wildcard and squashes the following commits:
134a776 [Reynold Xin] Fixed compilation break.
6da7b69 [Reynold Xin] [SPARK-8114][SQL] Remove some wildcard import on TestSQLContext._ cont'd.
cc davies sun-rui
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes#6620 from shivaram/sparkr-read-schema and squashes the following commits:
16a6726 [Shivaram Venkataraman] Fix loadDF to pass schema Also add a unit test
a229877 [Shivaram Venkataraman] Use wrapper function to DataFrameReader
ee70ba8 [Shivaram Venkataraman] Support user-specified schema in read.df
This PR is a simpler version of #2764, and adds `unapply` methods to the following binary nodes for simpler pattern matching:
- `BinaryExpression`
- `BinaryComparison`
- `BinaryArithmetic`
This enables nested pattern matching for binary nodes. For example, the following pattern matching
```scala
case p: BinaryComparison if p.left.dataType == StringType &&
    p.right.dataType == DateType =>
  p.makeCopy(Array(p.left, Cast(p.right, StringType)))
```
can be simplified to
```scala
case p @ BinaryComparison(l @ StringType(), r @ DateType()) =>
  p.makeCopy(Array(l, Cast(r, StringType)))
```
Author: Cheng Lian <lian@databricks.com>
Closes#6537 from liancheng/binary-node-patmat and squashes the following commits:
a3bf5fe [Cheng Lian] Fixes compilation error introduced while rebasing
b738986 [Cheng Lian] Renames `l`/`r` to `left`/`right` or `lhs`/`rhs`
14900ae [Cheng Lian] Simplifies binary node pattern matching
I kept some of the sql imports there to avoid changing too many lines.
Author: Reynold Xin <rxin@databricks.com>
Closes#6661 from rxin/remove-wildcard-import-sqlcontext and squashes the following commits:
c265347 [Reynold Xin] Fixed ListTablesSuite failure.
de9d491 [Reynold Xin] Fixed tests.
73b5365 [Reynold Xin] Mima.
8f6b642 [Reynold Xin] Fixed style violation.
443f6e8 [Reynold Xin] [SPARK-8113][SQL] Remove some wildcard import on TestSQLContext._
This patch replaces Distinct with Aggregate in the optimizer, so Distinct will become more efficient over time as we optimize Aggregate (via Tungsten).
Author: Reynold Xin <rxin@databricks.com>
Closes#6637 from rxin/replace-distinct and squashes the following commits:
b3cc50e [Reynold Xin] Mima excludes.
93d6117 [Reynold Xin] Code review feedback.
87e4741 [Reynold Xin] [SPARK-7440][SQL] Remove physical Distinct operator in favor of Aggregate.
This is a follow-up on #6393. I am removing the following files in this PR.
```
./sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim13.scala
./sql/hive-thriftserver/v0.13.1/src/main/scala/org/apache/spark/sql/hive/thriftserver/Shim13.scala
```
Basically, I refactored the shim code as follows:
* Rewrote code directly with Hive 0.13 methods, or
* Converted code into private methods, or
* Extracted code into separate classes
But for leftover code that didn't fit any of these cases, I created a HiveShim object. For example, helper functions which wrap Hive 0.13 methods to work around Hive bugs are placed there.
Author: Cheolsoo Park <cheolsoop@netflix.com>
Closes#6604 from piaozhexiu/SPARK-6909 and squashes the following commits:
5dccc20 [Cheolsoo Park] Remove hive shim code
Resolves [SPARK-7743](https://issues.apache.org/jira/browse/SPARK-7743).
Trivial changes to versions and package names, plus a fix for a small issue in `ParquetTableOperations.scala`:
```diff
- val readContext = getReadSupport(configuration).init(
+ val readContext = ParquetInputFormat.getReadSupportInstance(configuration).init(
```
This change is needed because ParquetInputFormat.getReadSupport was made package-private in the latest release.
Thanks
-- Thomas Omans
Author: Thomas Omans <tomans@cj.com>
Closes#6597 from eggsby/SPARK-7743 and squashes the following commits:
2df0d1b [Thomas Omans] [SPARK-7743] [SQL] Upgrading parquet version to 1.7.0
Added a `DataFrame.drop` function that accepts a `Column` reference rather than a `String`, and added associated unit tests. It basically iterates through the `DataFrame`'s columns to find one whose expression is equivalent to that of the `Column` argument supplied to the function.
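A hedged usage sketch of the new overload (`left` and `right` are assumed, already-constructed DataFrames):
```scala
// Dropping by Column reference can disambiguate columns that share a name
// after a join, which the String-based drop("id") cannot do.
val joined = left.join(right, left("id") === right("id"))
val cleaned = joined.drop(right("id")) // keeps left("id")
```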
Author: Mike Dusenberry <dusenberrymw@gmail.com>
Closes#6585 from dusenberrymw/SPARK-7969_Drop_method_on_Dataframes_should_handle_Column and squashes the following commits:
514727a [Mike Dusenberry] Updating the @since tag of the drop(Column) function doc to reflect version 1.4.1 instead of 1.4.0.
2f1bb4e [Mike Dusenberry] Adding an additional assert statement to the 'drop column after join' unit test in order to make sure the correct column was indeed left over.
6bf7c0e [Mike Dusenberry] Minor code formatting change.
e583888 [Mike Dusenberry] Adding more Python doctests for the df.drop with column reference function to test joined datasets that have columns with the same name.
5f74401 [Mike Dusenberry] Updating DataFrame.drop with column reference function to use logicalPlan.output to prevent ambiguities resulting from columns with the same name. Also added associated unit tests for joined datasets with duplicate column names.
4b8bbe8 [Mike Dusenberry] Adding Python support for Dataframe.drop with a Column reference.
986129c [Mike Dusenberry] Added a DataFrame.drop function that accepts a Column reference rather than a String, and added associated unit tests. Basically iterates through the DataFrame to find a column with an expression that is equivalent to one supplied to the function.
In order to reduce the overhead of codegen, this PR switches to using Janino to compile SQL expressions into bytecode.
After this, the time used to compile a SQL expression decreases from 100ms to 5ms, which is necessary to turn on codegen for general workloads as well as tests.
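As a rough illustration of the approach, here is a minimal, hedged sketch of runtime compilation with Janino (a toy class body, not Spark's actual generated code or its CodeGenerator):
```scala
import org.codehaus.janino.ClassBodyEvaluator

// Compile a tiny Java class body at runtime; Spark's generator feeds its
// generated expression-evaluator code through the same kind of entry point.
val evaluator = new ClassBodyEvaluator()
evaluator.cook("public int addOne(int x) { return x + 1; }")

val clazz = evaluator.getClazz
val instance = clazz.getDeclaredConstructor().newInstance()
val result = clazz.getMethod("addOne", classOf[Int]).invoke(instance, Int.box(41))
println(result) // 42
```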
cc rxin
Author: Davies Liu <davies@databricks.com>
Closes#6479 from davies/janino and squashes the following commits:
cc689f5 [Davies Liu] remove globalLock
262d848 [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
eec3a33 [Davies Liu] address comments from Josh
f37c8c3 [Davies Liu] fix DecimalType and cast to String
202298b [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
a21e968 [Davies Liu] fix style
0ed3dc6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
551a851 [Davies Liu] fix tests
c3bdffa [Davies Liu] remove print
6089ce5 [Davies Liu] change logging level
7e46ac3 [Davies Liu] fix style
d8f0f6c [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
da4926a [Davies Liu] fix tests
03660f3 [Davies Liu] WIP: use Janino to compile Java source
f2629cd [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
f7d66cf [Davies Liu] use template based string for codegen
Author: Reynold Xin <rxin@databricks.com>
Closes#6608 from rxin/parquet-analysis and squashes the following commits:
b5dc8e2 [Reynold Xin] Code review feedback.
5617cf6 [Reynold Xin] [SPARK-8074] Parquet should throw AnalysisException during setup for data type/name related failures.
1. range() overloaded in SQLContext.scala
2. range() modified in python sql context.py
3. Tests added accordingly in DataFrameSuite.scala and python sql tests.py
Author: animesh <animesh@apache.spark>
Closes#6609 from animeshbaranawal/SPARK-7980 and squashes the following commits:
935899c [animesh] SPARK-7980:python+scala changes
Author: Patrick Wendell <patrick@databricks.com>
Closes#6328 from pwendell/spark-1.5-update and squashes the following commits:
2f42d02 [Patrick Wendell] A few more excludes
4bebcf0 [Patrick Wendell] Update to RC4
61aaf46 [Patrick Wendell] Using new release candidate
55f1610 [Patrick Wendell] Another exclude
04b4f04 [Patrick Wendell] More issues with transient 1.4 changes
36f549b [Patrick Wendell] [SPARK-7801] [BUILD] Updating versions to SPARK 1.5.0
https://issues.apache.org/jira/browse/SPARK-7973
Author: Yin Huai <yhuai@databricks.com>
Closes#6525 from yhuai/SPARK-7973 and squashes the following commits:
763b821 [Yin Huai] Also change the timeout of "Single command with -e" to 2 minutes.
e598a08 [Yin Huai] Increase the timeout to 3 minutes.
It seems hard to find a common pattern for checking types in `Expression`. Sometimes we know exactly what input types we need (like `And`, which needs two booleans); sometimes we just have some rules (like `Add`, which needs two numeric types that are equal). So I defined a general interface `checkInputDataTypes` in `Expression` which returns a `TypeCheckResult`. `TypeCheckResult` tells whether the expression passes the type check, or what the type mismatch is.
This PR mainly applies input type checking to arithmetic and predicate expressions; a sketch of the interface appears after the TODO below.
TODO: apply type checking interface to more expressions.
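Here is that sketch (names follow the PR; the types are toy stand-ins, not Spark's actual API):
```scala
// Result of an expression's input type check.
sealed trait TypeCheckResult
case object TypeCheckSuccess extends TypeCheckResult
case class TypeCheckFailure(message: String) extends TypeCheckResult

// Toy expression trait: each expression states its own input-type rule.
trait Expression {
  def inputTypes: Seq[String] // stand-in for the children's data types
  def checkInputDataTypes(): TypeCheckResult
}

// `And` knows exactly what it needs: two booleans.
case class And(inputTypes: Seq[String]) extends Expression {
  def checkInputDataTypes(): TypeCheckResult =
    if (inputTypes == Seq("boolean", "boolean")) TypeCheckSuccess
    else TypeCheckFailure(s"And expects (boolean, boolean), got $inputTypes")
}

// `Add` only has a rule: two equal numeric types.
case class Add(inputTypes: Seq[String]) extends Expression {
  private val numeric = Set("int", "long", "double", "decimal")
  def checkInputDataTypes(): TypeCheckResult = inputTypes match {
    case Seq(a, b) if a == b && numeric(a) => TypeCheckSuccess
    case _ => TypeCheckFailure(s"Add expects two equal numeric types, got $inputTypes")
  }
}
```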
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#6405 from cloud-fan/6444 and squashes the following commits:
b5ff31b [Wenchen Fan] address comments
b917275 [Wenchen Fan] rebase
39929d9 [Wenchen Fan] add todo
0808fd2 [Wenchen Fan] make constrcutor of TypeCheckResult private
3bee157 [Wenchen Fan] and decimal type coercion rule for binary comparison
8883025 [Wenchen Fan] apply type check interface to CaseWhen
cffb67c [Wenchen Fan] to have resolved call the data type check function
6eaadff [Wenchen Fan] add equal type constraint to EqualTo
3affbd8 [Wenchen Fan] more fixes
654d46a [Wenchen Fan] improve tests
e0a3628 [Wenchen Fan] improve error message
1524ff6 [Wenchen Fan] fix style
69ca3fe [Wenchen Fan] add error message and tests
c71d02c [Wenchen Fan] fix hive tests
6491721 [Wenchen Fan] use value class TypeCheckResult
7ae76b9 [Wenchen Fan] address comments
cb77e4f [Wenchen Fan] Improve error reporting for expression data type mismatch
This patch significantly refactors CatalystTypeConverters to both clean up the code and enable these conversions to work with future Project Tungsten features.
At a high level, I've reorganized the code so that all functions dealing with the same type are grouped together into type-specific subclasses of `CatalystTypeConverter`. In addition, I've added new methods that allow the Catalyst Row -> Scala Row conversions to access the Catalyst row's fields through type-specific `getTYPE()` methods rather than the generic `get()` / `Row.apply` methods. This refactoring is a blocker to being able to unit test new operators that I'm developing as part of Project Tungsten, since those operators may output `UnsafeRow` instances which don't support the generic `get()`.
The stricter usage of types here has uncovered some bugs in other parts of Spark SQL:
- #6217: DescribeCommand is assigned wrong output attributes in SparkStrategies
- #6218: DataFrame.describe() should cast all aggregates to String
- #6400: Use output schema, not relation schema, for data source input conversion
Spark SQL currently has undefined behavior for what happens when you try to create a DataFrame from user-specified rows whose values don't match the declared schema. According to the `createDataFrame()` Scaladoc:
> It is important to make sure that the structure of every [[Row]] of the provided RDD matches the provided schema. Otherwise, there will be runtime exception.
Given this, it sounds like it's technically not a break of our API contract to fail fast when the data types don't match. However, there appear to be many cases where we don't fail even though the types don't match. For example, `JavaHashingTFSuite.hashingTF` passes a column of integer values for a "label" column which is supposed to contain floats. This column isn't actually read or modified as part of query processing, so its actual concrete type doesn't seem to matter. In other cases, there could be situations where we have generic numeric aggregates that tolerate being called with different numeric types than the schema specified, which can be okay due to numeric conversions.
In the long run, we will probably want to come up with precise semantics for implicit type conversions / widening when converting Java / Scala rows to Catalyst rows. Until then, though, I think that failing fast with a ClassCastException is a reasonable behavior; this is the approach taken in this patch. Note that certain optimizations in the inbound conversion functions for primitive types mean that we'll probably preserve the old undefined behavior in a majority of cases.
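A toy illustration of the fail-fast behavior under these assumptions (not the real converter hierarchy):
```scala
// Each type-specific converter rejects values that don't match the declared
// type, instead of letting them flow through as undefined behavior.
abstract class CatalystTypeConverter[T] {
  def toCatalyst(value: Any): T
}

object DoubleConverter extends CatalystTypeConverter[Double] {
  def toCatalyst(value: Any): Double = value match {
    case d: Double => d
    case other =>
      throw new ClassCastException(s"${other.getClass.getName} cannot be cast to double")
  }
}
```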
Author: Josh Rosen <joshrosen@databricks.com>
Closes#6222 from JoshRosen/catalyst-converters-refactoring and squashes the following commits:
740341b [Josh Rosen] Optimize method dispatch for primitive type conversions
befc613 [Josh Rosen] Add tests to document Option-handling behavior.
5989593 [Josh Rosen] Use new SparkFunSuite base in CatalystTypeConvertersSuite
6edf7f8 [Josh Rosen] Re-add convertToScala(), since a Hive test still needs it
3f7b2d8 [Josh Rosen] Initialize converters lazily so that the attributes are resolved first
6ad0ebb [Josh Rosen] Fix JavaHashingTFSuite ClassCastException
677ff27 [Josh Rosen] Fix null handling bug; add tests.
8033d4c [Josh Rosen] Fix serialization error in UserDefinedGenerator.
85bba9d [Josh Rosen] Fix wrong input data in InMemoryColumnarQuerySuite
9c0e4e1 [Josh Rosen] Remove last use of convertToScala().
ae3278d [Josh Rosen] Throw ClassCastException errors during inbound conversions.
7ca7fcb [Josh Rosen] Comments and cleanup
1e87a45 [Josh Rosen] WIP refactoring of CatalystTypeConverters
This is a follow-up of PR #6493, which has been reverted in branch-1.4 because it uses Java 7 specific APIs and breaks Java 6 build. This PR replaces those APIs with equivalent Guava ones to ensure Java 6 friendliness.
cc andrewor14 pwendell, this should also be backported to branch-1.4.
Author: Cheng Lian <lian@databricks.com>
Closes#6547 from liancheng/override-log4j and squashes the following commits:
c900cfd [Cheng Lian] Addresses Shixiong's comment
72da795 [Cheng Lian] Uses Guava API to ensure Java 6 friendliness
The current code references the schema of the DataFrame to be written before checking save mode. This triggers expensive metadata discovery prematurely. For save mode other than `Append`, this metadata discovery is useless since we either ignore the result (for `Ignore` and `ErrorIfExists`) or delete existing files (for `Overwrite`) later.
This PR fixes this issue by deferring metadata discovery until after the save mode check.
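A toy sketch of the reordering (assumed signatures; the by-name schema function stands in for deferred, expensive metadata discovery):
```scala
sealed trait SaveMode
case object Append extends SaveMode
case object Overwrite extends SaveMode
case object ErrorIfExists extends SaveMode
case object Ignore extends SaveMode

def write(schema: String): Unit = println(s"writing with schema: $schema")

// Save mode is checked first; discoverSchema() only runs when a write
// actually proceeds, so Ignore/ErrorIfExists never pay for discovery.
def save(mode: SaveMode, pathExists: Boolean, discoverSchema: () => String): Unit =
  (mode, pathExists) match {
    case (ErrorIfExists, true) => sys.error("path already exists")
    case (Ignore, true)        => () // no discovery at all
    case _                     => write(discoverSchema())
  }
```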
Author: Cheng Lian <lian@databricks.com>
Closes#6583 from liancheng/spark-8014 and squashes the following commits:
1aafabd [Cheng Lian] Updates comments
088abaa [Cheng Lian] Avoids schema merging and partition discovery when data schema and partition schema are defined
8fbd93f [Cheng Lian] Fixes SPARK-8014
Author: Cheng Lian <lian@databricks.com>
Closes#6581 from liancheng/spark-8037 and squashes the following commits:
d08e97b [Cheng Lian] Ignores files whose name starts with dot in HadoopFsRelation
This closes#6570.
Author: Yin Huai <yhuai@databricks.com>
Author: Reynold Xin <rxin@databricks.com>
Closes#6573 from rxin/deterministic and squashes the following commits:
356cd22 [Reynold Xin] Added unit test for the optimizer.
da3fde1 [Reynold Xin] Merge pull request #6570 from yhuai/SPARK-8023
da56200 [Yin Huai] Comments.
e38f264 [Yin Huai] Comment.
f9d6a73 [Yin Huai] Add a deterministic method to Expression.
https://issues.apache.org/jira/browse/SPARK-8020
Author: Yin Huai <yhuai@databricks.com>
Closes#6571 from yhuai/SPARK-8020-1 and squashes the following commits:
0398f5b [Yin Huai] First populate the SQLConf and then construct executionHive and metadataHive.
cc yhuai
Author: Davies Liu <davies@databricks.com>
Closes#6558 from davies/decimalType and squashes the following commits:
c877ca8 [Davies Liu] Update ParquetConverter.scala
48cc57c [Davies Liu] Update ParquetConverter.scala
b43845c [Davies Liu] add test
3b4a94f [Davies Liu] DecimalType is not read back when non-native type exists
https://issues.apache.org/jira/browse/SPARK-8020
Author: Yin Huai <yhuai@databricks.com>
Closes#6563 from yhuai/SPARK-8020 and squashes the following commits:
4e5addc [Yin Huai] style
bf766c6 [Yin Huai] Failed test.
0398f5b [Yin Huai] First populate the SQLConf and then construct executionHive and metadataHive.
Author: Reynold Xin <rxin@databricks.com>
Closes#6569 from rxin/freqItemsWarning and squashes the following commits:
7eec145 [Reynold Xin] [minor doc] Add exploratory data analysis warning for DataFrame.stat.freqItem API.
Author: Reynold Xin <rxin@databricks.com>
Closes#6565 from rxin/alias and squashes the following commits:
286d880 [Reynold Xin] [SPARK-8026][SQL] Add Column.alias to Scala/Java DataFrame API
Author: Reynold Xin <rxin@databricks.com>
Closes#6566 from rxin/crosstab and squashes the following commits:
e0ace1c [Reynold Xin] [SPARK-7982][SQL] DataFrame.stat.crosstab should use 0 instead of null for pairs that don't appear
The original code has several problems (a toy sketch of the coercion follows the list):
* `true <=> 1` will return false, as we didn't set a rule to handle it.
* `true = a`, where `a` is not a `Literal` and its value is 1, will return false, as we only handle literal values.
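Under assumed shapes (a toy, not the actual `BooleanEquality` rule), the rewrite could look like:
```scala
sealed trait Expr
case class BoolExpr(name: String) extends Expr
case class NumExpr(name: String) extends Expr
case class CastToBoolean(e: Expr) extends Expr
case class EqualTo(left: Expr, right: Expr) extends Expr

// Rewrite boolean-vs-numeric equality so both sides agree; this covers
// `true = a` even when `a` is a non-literal numeric expression.
def coerce(e: Expr): Expr = e match {
  case EqualTo(b: BoolExpr, n: NumExpr) => EqualTo(b, CastToBoolean(n))
  case EqualTo(n: NumExpr, b: BoolExpr) => EqualTo(CastToBoolean(n), b)
  case other => other
}
```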
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#6505 from cloud-fan/tmp1 and squashes the following commits:
77f0f39 [Wenchen Fan] minor fix
b6401ba [Wenchen Fan] add type coercion for CaseKeyWhen and address comments
ebc8c61 [Wenchen Fan] use SQLTestUtils and If
625973c [Wenchen Fan] improve
9ba2130 [Wenchen Fan] address comments
fc0d741 [Wenchen Fan] fix style
2846a04 [Wenchen Fan] fix 7952
Author: Reynold Xin <rxin@databricks.com>
Closes#6541 from rxin/trailing-whitespace-on and squashes the following commits:
f72ebe4 [Reynold Xin] [SPARK-3850] Turn style checker on for trailing whitespaces.
Author: Reynold Xin <rxin@databricks.com>
Closes#6535 from rxin/whitespace-sql and squashes the following commits:
de50316 [Reynold Xin] [SPARK-3850] Trim trailing spaces for SQL.
Author: Reynold Xin <rxin@databricks.com>
This patch had conflicts when merged, resolved by
Committer: Reynold Xin <rxin@databricks.com>
Closes#6527 from rxin/covariant-equals and squashes the following commits:
e7d7784 [Reynold Xin] [SPARK-7975] Enforce CovariantEqualsChecker
Author: Cheng Lian <lian@databricks.com>
Closes#6529 from liancheng/schemardd-deprecation-fix and squashes the following commits:
49765c2 [Cheng Lian] Adds @deprecated Scaladoc entry for SchemaRDD
Author: Cheng Lian <lian@databricks.com>
Closes#6521 from liancheng/classloader-comment-fix and squashes the following commits:
fc09606 [Cheng Lian] Addresses @srowen's comment
59945c5 [Cheng Lian] Fixes a minor comment mistake in IsolatedClientLoader
Scala's deprecated annotation doesn't actually show up in JavaDoc.
Author: Reynold Xin <rxin@databricks.com>
Closes#6523 from rxin/df-deprecated-javadoc and squashes the following commits:
26da2b2 [Reynold Xin] [SPARK-7971] Add JavaDoc style deprecation for deprecated DataFrame methods.
I went through all the JavaDocs and tightened up visibility.
Author: Reynold Xin <rxin@databricks.com>
Closes#6526 from rxin/sql-1.4-visibility-for-docs and squashes the following commits:
bc37d1e [Reynold Xin] Tighten up visibility for JavaDoc.
We have already defined this logic in `Cast`; I think we should remove this rule.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#6516 from cloud-fan/tmp2 and squashes the following commits:
d5035a4 [Wenchen Fan] remove useless rule
Right now `unit-tests.log` is not of much value because we can't easily tell where the test boundaries are. This patch adds log statements before and after each test to outline the test boundaries, e.g.:
```
===== TEST OUTPUT FOR o.a.s.serializer.KryoSerializerSuite: 'kryo with parallelize for primitive arrays' =====
15/05/27 12:36:39.596 pool-1-thread-1-ScalaTest-running-KryoSerializerSuite INFO SparkContext: Starting job: count at KryoSerializerSuite.scala:230
15/05/27 12:36:39.596 dag-scheduler-event-loop INFO DAGScheduler: Got job 3 (count at KryoSerializerSuite.scala:230) with 4 output partitions (allowLocal=false)
15/05/27 12:36:39.596 dag-scheduler-event-loop INFO DAGScheduler: Final stage: ResultStage 3(count at KryoSerializerSuite.scala:230)
15/05/27 12:36:39.596 dag-scheduler-event-loop INFO DAGScheduler: Parents of final stage: List()
15/05/27 12:36:39.597 dag-scheduler-event-loop INFO DAGScheduler: Missing parents: List()
15/05/27 12:36:39.597 dag-scheduler-event-loop INFO DAGScheduler: Submitting ResultStage 3 (ParallelCollectionRDD[5] at parallelize at KryoSerializerSuite.scala:230), which has no missing parents
...
15/05/27 12:36:39.624 pool-1-thread-1-ScalaTest-running-KryoSerializerSuite INFO DAGScheduler: Job 3 finished: count at KryoSerializerSuite.scala:230, took 0.028563 s
15/05/27 12:36:39.625 pool-1-thread-1-ScalaTest-running-KryoSerializerSuite INFO KryoSerializerSuite:
***** FINISHED o.a.s.serializer.KryoSerializerSuite: 'kryo with parallelize for primitive arrays' *****
...
```
Author: Andrew Or <andrew@databricks.com>
Closes#6441 from andrewor14/demarcate-tests and squashes the following commits:
879b060 [Andrew Or] Fix compile after rebase
d622af7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
017c8ba [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
7790b6c [Andrew Or] Fix tests after logical merge conflict
c7460c0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
c43ffc4 [Andrew Or] Fix tests?
8882581 [Andrew Or] Fix tests
ee22cda [Andrew Or] Fix log message
fa9450e [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
12d1e1b [Andrew Or] Various whitespace changes (minor)
69cbb24 [Andrew Or] Make all test suites extend SparkFunSuite instead of FunSuite
bbce12e [Andrew Or] Fix manual things that cannot be covered through automation
da0b12f [Andrew Or] Add core tests as dependencies in all modules
f7d29ce [Andrew Or] Introduce base abstract class for all test suites
The `HiveThriftServer2Test` relies on proper logging behavior to assert whether the Thrift server daemon process is started successfully. However, some other jar files listed in the classpath may potentially contain an unexpected Log4J configuration file which overrides the logging behavior.
This PR writes a temporary `log4j.properties` and prepends it to the driver classpath before starting the testing Thrift server process, to ensure proper logging behavior.
cc andrewor14 yhuai
Author: Cheng Lian <lian@databricks.com>
Closes#6493 from liancheng/override-log4j and squashes the following commits:
c489e0e [Cheng Lian] Fixes minor Scala styling issue
b46ef0d [Cheng Lian] Uses a temporary log4j.properties in HiveThriftServer2Test to ensure expected logging behavior
When starting `HiveThriftServer2` via `startWithContext`, property `spark.sql.hive.version` isn't set. This causes Simba ODBC driver 1.0.8.1006 behaves differently and fails simple queries.
The Hive2 JDBC driver works fine in this case. Also, when starting the server with `start-thriftserver.sh`, both the Hive2 JDBC driver and the Simba ODBC driver work fine.
Please refer to [SPARK-7950] [1] for details.
[1]: https://issues.apache.org/jira/browse/SPARK-7950
Author: Cheng Lian <lian@databricks.com>
Closes#6500 from liancheng/odbc-bugfix and squashes the following commits:
051e3a3 [Cheng Lian] Fixes import order
3a97376 [Cheng Lian] Sets spark.sql.hive.version in HiveThriftServer2.startWithContext()
So we can enable a whitespace enforcement rule in the style checker to save code review time.
Author: Reynold Xin <rxin@databricks.com>
Closes#6476 from rxin/whitespace-catalyst and squashes the following commits:
650409d [Reynold Xin] Fixed tests.
51a9e5d [Reynold Xin] [SPARK-7927] whitespace fixes for Catalyst module.
So we can enable a whitespace enforcement rule in the style checker to save code review time.
Author: Reynold Xin <rxin@databricks.com>
Closes#6477 from rxin/whitespace-sql-core and squashes the following commits:
ce6e369 [Reynold Xin] Fixed tests.
6095fed [Reynold Xin] [SPARK-7927] whitespace fixes for SQL core.
So we can enable a whitespace enforcement rule in the style checker to save code review time.
Author: Reynold Xin <rxin@databricks.com>
Closes#6478 from rxin/whitespace-hive and squashes the following commits:
e01b0e0 [Reynold Xin] Fixed tests.
a3bba22 [Reynold Xin] [SPARK-7927] whitespace fixes for Hive and ThriftServer.
https://issues.apache.org/jira/browse/SPARK-7853
This fixes the problem introduced by my change in https://github.com/apache/spark/pull/6435, which causes that Hive Context fails to create in spark shell because of the class loader issue.
Author: Yin Huai <yhuai@databricks.com>
Closes#6459 from yhuai/SPARK-7853 and squashes the following commits:
37ad33e [Yin Huai] Do not use hiveQlTable at all.
47cdb6d [Yin Huai] Move hiveconf.set to the end of setConf.
005649b [Yin Huai] Update comment.
35d86f3 [Yin Huai] Access TTable directly to make sure Hive will not internally use any metastore utility functions.
3737766 [Yin Huai] Recursively find all jars.
This PR has three changes:
1. Renaming the tab of `ThriftServer` to `SQL`;
2. Renaming the title of the tab from `ThriftServer` to `JDBC/ODBC Server`; and
3. Renaming the title of the session page from `ThriftServer` to `JDBC/ODBC Session`.
https://issues.apache.org/jira/browse/SPARK-7907
Author: Yin Huai <yhuai@databricks.com>
Closes#6448 from yhuai/JDBCServer and squashes the following commits:
eadcc3d [Yin Huai] Update test.
9168005 [Yin Huai] Use SQL as the tab name.
221831e [Yin Huai] Rename ThriftServer to JDBCServer.
JIRA: https://issues.apache.org/jira/browse/SPARK-7897
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#6438 from viirya/jdbc_unsigned_bigint and squashes the following commits:
ccb3c3f [Liang-Chi Hsieh] Use DecimalType to represent unsigned bigint.
This PR is based on PR #6396 authored by chenghao-intel. Essentially, Spark SQL should use context classloader to load SerDe classes.
yhuai helped update the test case, and I fixed a bug in the original `CliSuite`: while testing the CLI tool with `runCliWithin`, we didn't append `\n` to the last query, so the last query was never executed.
Original PR description is pasted below.
----
```
bin/spark-sql --jars ./sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar
CREATE TABLE t1(a string, b string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
```
Throws exception like
```
15/05/26 00:16:33 ERROR SparkSQLDriver: Failed in [CREATE TABLE t1(a string, b string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe']
org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.apache.hive.hcatalog.data.JsonSerDe
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:333)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:310)
at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:139)
at org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:310)
at org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:300)
at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:457)
at org.apache.spark.sql.hive.execution.HiveNativeCommand.run(HiveNativeCommand.scala:33)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:68)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:922)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:922)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:147)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:727)
at org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:57)
```
Author: Cheng Hao <hao.cheng@intel.com>
Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#6435 from liancheng/classLoader and squashes the following commits:
d4c4845 [Cheng Lian] Fixes CliSuite
75e80e2 [Yin Huai] Update the fix.
fd26533 [Cheng Hao] scalastyle
dd78775 [Cheng Hao] workaround for classloader of IsolatedClientLoader
As stated in SPARK-7684, currently `TestHive.reset` has an execution-order-specific bug, which makes running specific test suites locally pretty frustrating. This PR refactors `MetastoreDataSourcesSuite` (which relies heavily on `TestHive.reset`) using the various `withXxx` utility methods in `SQLTestUtils`, asking each test case to clean up its own mess so that we can avoid calling `TestHive.reset`.
Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#6353 from liancheng/workaround-spark-7684 and squashes the following commits:
26939aa [Yin Huai] Move the initialization of jsonFilePath to beforeAll.
a423d48 [Cheng Lian] Fixes Scala style issue
dfe45d0 [Cheng Lian] Refactors MetastoreDataSourcesSuite to workaround SPARK-7684
92a116d [Cheng Lian] Fixes minor styling issues
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#6318 from adrian-wang/dynpart and squashes the following commits:
ad73b61 [Daoyuan Wang] not use sqlTestUtils for try catch because dont have sqlcontext here
6c33b51 [Daoyuan Wang] fix according to liancheng
f0f8074 [Daoyuan Wang] some specific types as dynamic partition
This should also close#6243.
Author: Reynold Xin <rxin@databricks.com>
Closes#6431 from rxin/JavaTypeInference-guava and squashes the following commits:
e58df3c [Reynold Xin] Removed Gauva dependency from JavaTypeInference's type signature.
Please refer to [SPARK-7847] [1] for details.
[1]: https://issues.apache.org/jira/browse/SPARK-7847
Author: Cheng Lian <lian@databricks.com>
Closes#6389 from liancheng/spark-7847 and squashes the following commits:
935c652 [Cheng Lian] Adds test case for writing various data types as dynamic partition value
f4fc398 [Cheng Lian] Converts partition columns to Scala type when writing dynamic partitions
d0aeca0 [Cheng Lian] Fixes dynamic partition directory escaping
Two minor changes.
cc brkyvz
Author: Reynold Xin <rxin@databricks.com>
Closes#6428 from rxin/math-func-cleanup and squashes the following commits:
5910df5 [Reynold Xin] [SQL] Rename MathematicalExpression UnaryMathExpression, and specify BinaryMathExpression's output data type as DoubleType.
This type is not really used. Might as well remove it.
Author: Reynold Xin <rxin@databricks.com>
Closes#6427 from rxin/evalutedType and squashes the following commits:
51a319a [Reynold Xin] [SPARK-7887][SQL] Remove EvaluatedType from SQL Expression.
JIRA: https://issues.apache.org/jira/browse/SPARK-7697
The reported problem case is MySQL, but the H2 database has no unsigned int, so we are unable to add a corresponding test.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#6229 from viirya/unsignedint_as_long and squashes the following commits:
dc4b5d8 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into unsignedint_as_long
608695b [Liang-Chi Hsieh] Use LongType for unsigned int in JDBCRDD.
I grep'ed hive-0.12.0 in the source code and removed all the profiles and doc references.
Author: Cheolsoo Park <cheolsoop@netflix.com>
Closes#6393 from piaozhexiu/SPARK-7850 and squashes the following commits:
fb429ce [Cheolsoo Park] Remove hive-0.13.1 profile
82bf09a [Cheolsoo Park] Remove hive 0.12.0 shim code
f3722da [Cheolsoo Park] Remove hive-0.12.0 profile and references from POM and build docs
So that potential partial/corrupted data files left by failed tasks/jobs won't affect normal data scans.
Author: Cheng Lian <lian@databricks.com>
Closes#6411 from liancheng/spark-7868 and squashes the following commits:
273ea36 [Cheng Lian] Ignores _temporary directories
In `DataSourceStrategy.createPhysicalRDD`, we use the relation schema as the target schema for converting incoming rows into Catalyst rows. However, we should be using the output schema instead, since our scan might return a subset of the relation's columns.
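A toy illustration of the misalignment (hypothetical column names):
```scala
val relationSchema = Seq("id", "name", "age") // what the relation declares
val outputSchema   = Seq("name", "age")       // what this scan actually returns
val scannedRow     = Seq("alice", 42)         // shaped like outputSchema

// Converting against the relation schema pairs "alice" with "id":
val wrong = relationSchema.zip(scannedRow) // Seq(("id","alice"), ("name",42))
// The output schema is the correct conversion target:
val right = outputSchema.zip(scannedRow)   // Seq(("name","alice"), ("age",42))
```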
This patch incorporates #6414 by liancheng, which fixes an issue in `SimpleTextRelation` that prevented this bug from being caught by our old tests:
> In `SimpleTextRelation`, we specified `needsConversion` to `true`, indicating that values produced by this testing relation should be of Scala types, and need to be converted to Catalyst types when necessary. However, we also used `Cast` to convert strings to expected data types. And `Cast` always produces values of Catalyst types, thus no conversion is done at all. This PR makes `SimpleTextRelation` produce Scala values so that data conversion code paths can be properly tested.
Closes#5986.
Author: Josh Rosen <joshrosen@databricks.com>
Author: Cheng Lian <lian@databricks.com>
Author: Cheng Lian <liancheng@users.noreply.github.com>
Closes#6400 from JoshRosen/SPARK-7858 and squashes the following commits:
e71c866 [Josh Rosen] Re-fix bug so that the tests pass again
56b13e5 [Josh Rosen] Add regression test to hadoopFsRelationSuites
2169a0f [Josh Rosen] Remove use of SpecificMutableRow and BufferedIterator
6cd7366 [Josh Rosen] Fix SPARK-7858 by using output types for conversion.
5a00e66 [Josh Rosen] Add assertions in order to reproduce SPARK-7858
8ba195c [Cheng Lian] Merge 9968fba9979287aaa1f141ba18bfb9d4c116a3b3 into 61664732b2
9968fba [Cheng Lian] Tests the data type conversion code paths
Contribution is my original work and I license the work to the project under the project's open source license.
Author: rowan <rowan.chattaway@googlemail.com>
Closes#6259 from rowan000/SPARK-7637 and squashes the following commits:
c479df4 [rowan] SPARK-7637: rename mapFields to fieldsMap as per comments on github.
8d2e419 [rowan] SPARK-7637: fix up whitespace changes
0e9d662 [rowan] SPARK-7637: O(N) merge implementatio for StructType merge
The Catalyst DSL is no longer used as a public facing API. This pull request removes the UDF and writeToFile feature from it since they are not used in unit tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#6350 from rxin/unused-logical-dsl and squashes the following commits:
90b3de6 [Reynold Xin] [SQL][minor] Removed unused Catalyst logical plan DSL.
When committing/aborting a write task issued in `InsertIntoHadoopFsRelation`, if an exception is thrown from `OutputWriter.close()`, the committing/aborting process will be interrupted, and leaves messy stuff behind (e.g., the `_temporary` directory created by `FileOutputCommitter`).
This PR makes these two processes more robust by catching potential exceptions and falling back to the normal task commit/abort path.
Author: Cheng Lian <lian@databricks.com>
Closes#6378 from liancheng/spark-7838 and squashes the following commits:
f18253a [Cheng Lian] Makes task committing/aborting in InsertIntoHadoopFsRelation more robust
The "Database does not exist" error reported in SPARK-7684 was caused by `HiveContext.newTemporaryConfiguration()`, which always creates a new temporary metastore directory and returns a metastore configuration pointing that directory. This makes `TestHive.reset()` always replaces old temporary metastore with an empty new one.
Author: Cheng Lian <lian@databricks.com>
Closes#6359 from liancheng/spark-7684 and squashes the following commits:
95d2eb8 [Cheng Lian] Addresses @marmbrust's comment
042769d [Cheng Lian] Don't create new temp directory in HiveContext.newTemporaryConfiguration()
https://issues.apache.org/jira/browse/SPARK-7805
Because `sql/hive`'s tests depend on the test jar of `sql/core`, we do not need to store `SQLTestUtils` and `ParquetTest` in `src/main`. We should only add stuff that will be needed by `sql/console` or Python tests (for Python, we need it in `src/main`, right? davies).
Author: Yin Huai <yhuai@databricks.com>
Closes#6334 from yhuai/SPARK-7805 and squashes the following commits:
af6d0c9 [Yin Huai] mima
b86746a [Yin Huai] Move SQLTestUtils.scala and ParquetTest.scala to src/test.
This one continues the work of https://github.com/apache/spark/pull/6216.
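A hedged usage sketch of the writer-based API this moves to (table name illustrative):
```
import org.apache.spark.sql.{DataFrame, SaveMode}

// insertInto is reached through the DataFrameWriter interface; when the
// table exists and Append is requested, saveAsTable is routed through it.
def appendToExisting(df: DataFrame): Unit = {
  df.write.mode(SaveMode.Append).insertInto("existing_table")
}
```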
Author: Yin Huai <yhuai@databricks.com>
Author: Reynold Xin <rxin@databricks.com>
Closes#6366 from yhuai/insert and squashes the following commits:
3d717fb [Yin Huai] Use insertInto to handle the case when table exists and Append is used for saveAsTable.
56d2540 [Yin Huai] Add PreWriteCheck to HiveContext's analyzer.
c636e35 [Yin Huai] Remove unnecessary empty lines.
cf83837 [Yin Huai] Move insertInto to write. Also, remove the partition columns from InsertIntoHadoopFsRelation.
0841a54 [Reynold Xin] Removed experimental tag for deprecated methods.
33ed8ef [Reynold Xin] [SPARK-7654][SQL] Move insertInto into reader/writer interface.
Author: Michael Armbrust <michael@databricks.com>
Closes#6363 from marmbrus/windowErrors and squashes the following commits:
516b02d [Michael Armbrust] [SPARK-7834] [SQL] Better window error messages
JIRA: https://issues.apache.org/jira/browse/SPARK-7270
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#5864 from viirya/dyn_partition_insert and squashes the following commits:
b5627df [Liang-Chi Hsieh] For comments.
3b21e4b [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into dyn_partition_insert
8a4352d [Liang-Chi Hsieh] Consider dynamic partition when inserting into hive table.
Author: Santiago M. Mola <santi@mola.io>
Closes#6327 from smola/feature/catalyst-dsl-set-ops and squashes the following commits:
11db778 [Santiago M. Mola] [SPARK-7724] [SQL] Support Intersect/Except in Catalyst DSL.
https://issues.apache.org/jira/browse/SPARK-7758
When initializing `executionHive`, we only mask `javax.jdo.option.ConnectionURL` to override the metastore location. However, other properties that relate to the actual Hive metastore data source are not masked. For example, when using Spark SQL with a PostgreSQL-backed Hive metastore, `executionHive` actually tries to use settings read from `hive-site.xml`, which point to PostgreSQL, to connect to the temporary Derby metastore, thus causing errors.
To fix this, we need to mask all metastore data source properties. Specifically, according to the code of the [Hive `ObjectStore.getDataSourceProps()` method][1], all properties whose names mention "jdo" or "datanucleus" must be included.
[1]: https://github.com/apache/hive/blob/release-0.13.1/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L288
Tested using PostgreSQL as the metastore; it worked fine.
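A minimal sketch of the masking rule this implies (the helper name is hypothetical):
```
// Per Hive's ObjectStore.getDataSourceProps(), any property whose name
// mentions "jdo" or "datanucleus" configures the metastore data source and
// must be overridden when pointing executionHive at the temporary Derby
// metastore.
def isMetastoreDataSourceProperty(name: String): Boolean = {
  val lower = name.toLowerCase
  lower.contains("jdo") || lower.contains("datanucleus")
}

// All of these hive-site.xml keys would be masked:
assert(Seq(
  "javax.jdo.option.ConnectionURL",
  "javax.jdo.option.ConnectionDriverName",
  "datanucleus.autoCreateSchema").forall(isMetastoreDataSourceProperty))
```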
Author: WangTaoTheTonic <wangtao111@huawei.com>
Closes#6314 from WangTaoTheTonic/SPARK-7758 and squashes the following commits:
ca7ae7c [WangTaoTheTonic] add comments
86caf2c [WangTaoTheTonic] delete unused import
e4f0feb [WangTaoTheTonic] block more data source related property
92a81fa [WangTaoTheTonic] fix style check
e3e683d [WangTaoTheTonic] override more configs to avoid failure connecting to PostgreSQL
Author: Michael Armbrust <michael@databricks.com>
Closes#6165 from marmbrus/wrongColumn and squashes the following commits:
4fad158 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into wrongColumn
aad7eab [Michael Armbrust] rxins comments
f1e8df1 [Michael Armbrust] [SPARK-6743][SQL] Fix empty projections of cached data
This Selenium test case has been flaky for a while and has led to frequent Jenkins build failures. Let's disable it temporarily until we figure out a proper solution.
Author: Cheng Lian <lian@databricks.com>
Closes#6345 from liancheng/ignore-selenium-test and squashes the following commits:
09996fe [Cheng Lian] Ignores Thrift server UISeleniumSuite
This closes#6104.
Author: Cheng Hao <hao.cheng@intel.com>
Author: Reynold Xin <rxin@databricks.com>
Closes#6343 from rxin/window-df and squashes the following commits:
026d587 [Reynold Xin] Address code review feedback.
dc448fe [Reynold Xin] Fixed Hive tests.
9794d9d [Reynold Xin] Moved Java test package.
9331605 [Reynold Xin] Refactored API.
3313e2a [Reynold Xin] Merge pull request #6104 from chenghao-intel/df_window
d625a64 [Cheng Hao] Update the dataframe window API as suggested
c141fb1 [Cheng Hao] hide all of properties of the WindowFunctionDefinition
3b1865f [Cheng Hao] scaladoc typos
f3fd2d0 [Cheng Hao] polish the unit test
6847825 [Cheng Hao] Add additional analytics functions
57e3bc0 [Cheng Hao] typos
24a08ec [Cheng Hao] scaladoc
28222ed [Cheng Hao] fix bug of range/row Frame
1d91865 [Cheng Hao] style issue
53f89f2 [Cheng Hao] remove the over from the functions.scala
964c013 [Cheng Hao] add more unit tests and window functions
64e18a7 [Cheng Hao] Add Window Function support for DataFrame
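A hedged usage sketch of the resulting DataFrame window API (column names illustrative):
```
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

def withMovingAvg(df: DataFrame): DataFrame = {
  // Build a WindowSpec with partitioning, ordering, and a row frame,
  // then apply an aggregate over it via Column.over.
  val w = Window.partitionBy("category").orderBy("revenue").rowsBetween(-1, 1)
  df.withColumn("movingAvg", avg("revenue").over(w))
}
```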
https://issues.apache.org/jira/browse/SPARK-7737
cc liancheng
Author: Yin Huai <yhuai@databricks.com>
Closes#6329 from yhuai/spark-7737 and squashes the following commits:
7e0dfc7 [Yin Huai] Use leaf dirs having data files to discover partitions.
According to yhuai, we spent 6-7 seconds cleaning closures in a partitioning job that takes 12 seconds. Since we provide these closures in Spark, we know for sure they are serializable, so we can bypass the cleaning.
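A Spark-internal sketch of the idea (`SparkContext.clean` is `private[spark]`, so this only compiles inside Spark itself; the `trusted` flag is hypothetical):
```
import org.apache.spark.SparkContext

// Closure cleaning reflectively walks the closure's object graph. For
// closures Spark itself constructs and already knows to be serializable,
// that walk is pure overhead and can be skipped.
def prepareClosure[T, U](sc: SparkContext, f: T => U, trusted: Boolean): T => U = {
  if (trusted) f else sc.clean(f)
}
```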
Author: Andrew Or <andrew@databricks.com>
Closes#6256 from andrewor14/sql-partition-speed-up and squashes the following commits:
a82b451 [Andrew Or] Fix style
10f7e3e [Andrew Or] Avoid getting call sites and cleaning closures
17e2943 [Andrew Or] Merge branch 'master' of github.com:apache/spark into sql-partition-speed-up
523f042 [Andrew Or] Skip unnecessary Utils.getCallSites too
f7fe143 [Andrew Or] Avoid unnecessary closure cleaning
Having a SQLContext singleton would make it easier for applications to use a lazily instantiated, single shared instance of SQLContext when needed. It would avoid problems like
1. In a REPL/notebook environment, rerunning the line `val sqlContext = new SQLContext` multiple times creates different contexts while overriding the reference to the previous context, leading to issues like registered temp tables going missing.
2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. Also, to get around this problem, I had to suggest creating a singleton instance: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala
This can be solved by `SQLContext.getOrCreate`, which gets or creates a new singleton instance of SQLContext using either a given SparkContext or a given SparkConf.
rxin marmbrus
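A minimal sketch of the singleton pattern the PR bakes in as `SQLContext.getOrCreate` (the object name here is illustrative):
```
import java.util.concurrent.atomic.AtomicReference
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Lazily create one shared SQLContext per JVM; subsequent calls return it,
// so REPL re-execution and DStream checkpoint recovery see the same context.
object SQLContextSingleton {
  private val instance = new AtomicReference[SQLContext]()

  def getOrCreate(sc: SparkContext): SQLContext = {
    if (instance.get() == null) {
      instance.compareAndSet(null, new SQLContext(sc))
    }
    instance.get()
  }
}
```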
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes#6006 from tdas/SPARK-7478 and squashes the following commits:
25f4da9 [Tathagata Das] Addressed comments.
79fe069 [Tathagata Das] Added comments.
c66ca76 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7478
48adb14 [Tathagata Das] Removed HiveContext.getOrCreate
bf8cf50 [Tathagata Das] Fix more bugs
dec5594 [Tathagata Das] Fixed bug
b4e9721 [Tathagata Das] Remove unnecessary import
4ef513b [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7478
d3ea8e4 [Tathagata Das] Added HiveContext
83bc950 [Tathagata Das] Updated tests
f82ae81 [Tathagata Das] Fixed test
bc72868 [Tathagata Das] Added SQLContext.getOrCreate
Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Lian <lian@databricks.com>
Closes#6285 from liancheng/spark-7763 and squashes the following commits:
bb2829d [Yin Huai] Fix hashCode.
d677f7d [Cheng Lian] Fixes Scala style issue
44b283f [Cheng Lian] Adds test case for SPARK-7616
6733276 [Yin Huai] Fix a bug that potentially causes https://issues.apache.org/jira/browse/SPARK-7616.
6cabf3c [Yin Huai] Update unit test.
7e02910 [Yin Huai] Use metastore partition columns and do not hijack maybePartitionSpec.
e9a03ec [Cheng Lian] Persists partition columns into metastore
`java.lang.Math.exp(1.0)` has different results across JDK versions, so do not use `createQueryTest`; write a separate test for it.
```
jdk version    result
1.7.0_11       2.7182818284590455
1.7.0_05       2.7182818284590455
1.7.0_71       2.718281828459045
```
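A minimal sketch of a JDK-tolerant check for such a test (the tolerance value is an assumption):
```
// Compare against the expected value with an epsilon instead of golden-file
// string equality, which differs in the last digit across JDK builds.
val result = java.lang.Math.exp(1.0)
assert(math.abs(result - 2.718281828459045) < 1e-12,
  s"exp(1.0) drifted beyond tolerance: $result")
```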
Author: scwf <wangfei1@huawei.com>
Closes#6274 from scwf/java_method and squashes the following commits:
3dd2516 [scwf] address comments
5fa1459 [scwf] style
df46445 [scwf] fix test error
fcb6d22 [scwf] fix udf_java_method
When no partition columns can be found, we should have an empty `PartitionSpec`, rather than a `PartitionSpec` with empty partition columns.
This PR together with #6285 should fix SPARK-7749.
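A simplified sketch of the shape involved (Spark's actual `PartitionSpec` is `private[sql]`; these case classes are illustrative):
```
import org.apache.spark.sql.types.StructType

case class Partition(values: Seq[Any], path: String)
case class PartitionSpec(partitionColumns: StructType, partitions: Seq[Partition])

// When nothing can be inferred, parsePartitions should yield a truly empty
// spec: no partition columns and no partitions, rather than partitions that
// carry empty column lists.
val emptySpec = PartitionSpec(StructType(Nil), Seq.empty)
```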
Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Closes#6287 from liancheng/spark-7749 and squashes the following commits:
a799ff3 [Cheng Lian] Adds test cases for SPARK-7749
c4949be [Cheng Lian] Minor refactoring, and tolerant _TEMPORARY directory name
5aa87ea [Yin Huai] Make parsePartitions more robust.
fc56656 [Cheng Lian] Returns empty PartitionSpec if no partition columns can be inferred
19ae41e [Cheng Lian] Don't list base directory as leaf directory
The keys of Map values in JsonRDD should be converted into UTF8String (as should failed records). Thanks to yhuai and viirya.
Closes#6084
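A minimal sketch of the conversion (the `UTF8String` package location moved across Spark versions, so the import is an assumption):
```
import org.apache.spark.unsafe.types.UTF8String

// JsonRDD must convert a MapType value's String keys (and the raw record
// stored for corrupt rows) into Spark SQL's internal string type.
def convertMapKeys(m: Map[String, Any]): Map[UTF8String, Any] =
  m.map { case (k, v) => UTF8String.fromString(k) -> v }
```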
Author: Davies Liu <davies@databricks.com>
Closes#6299 from davies/string_in_json and squashes the following commits:
0dbf559 [Davies Liu] improve test, fix corrupt record
6836a80 [Davies Liu] move unit tests into Scala
b97af11 [Davies Liu] fix MapType in JsonRDD
Follow-up of #6340, to avoid the test report going missing once a test fails.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#6312 from chenghao-intel/rollup_minor and squashes the following commits:
b03a25f [Cheng Hao] simplify the testData instantiation
09b7e8b [Cheng Hao] move the testData into beforeAll()
JIRA: https://issues.apache.org/jira/browse/SPARK-7746
This looks like an easy-to-add parameter, but it can show a significant performance improvement if the JDBC driver accepts it.
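A hedged usage sketch (the URL, table name, and exact property key casing are assumptions based on the PR's `fetchSize` parameter):
```
import java.util.Properties
import org.apache.spark.sql.SQLContext

// Pass a fetch size hint so the driver pulls rows in larger batches per
// round trip instead of its (often tiny) default.
def readWithFetchSize(sqlContext: SQLContext): Unit = {
  val props = new Properties()
  props.setProperty("fetchSize", "10000")
  sqlContext.read.jdbc("jdbc:postgresql://host/db", "my_table", props).show()
}
```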
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#6283 from viirya/jdbc_fetchsize and squashes the following commits:
de47f94 [Liang-Chi Hsieh] Don't keep fetchSize as single parameter.
b7bff2f [Liang-Chi Hsieh] Add FetchSize parameter for JDBC driver.