This PR is based on #6988 , thanks to adrian-wang .
This brings two SQL functions: to_date() and trunc().
Closes#6988
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: Davies Liu <davies@databricks.com>
Closes#7805 from davies/to_date and squashes the following commits:
2c7beba [Davies Liu] Merge branch 'master' of github.com:apache/spark into to_date
310dd55 [Daoyuan Wang] remove dup test in rebase
980b092 [Daoyuan Wang] resolve rebase conflict
a476c5a [Daoyuan Wang] address comments from davies
d44ea5f [Daoyuan Wang] function to_date, trunc
This PR brings the support of DecimalType in UnsafeRow, for precision <= 18, it's settable, otherwise it's not settable.
Author: Davies Liu <davies@databricks.com>
Closes#7758 from davies/unsafe_decimal and squashes the following commits:
478b1ba [Davies Liu] address comments
536314c [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_decimal
7c2e77a [Davies Liu] fix JoinedRow
76d6fa4 [Davies Liu] fix tests
99d3151 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_decimal
d49c6ae [Davies Liu] support DecimalType in UnsafeRow
This PR is based on #7589 , thanks to adrian-wang
Added SQL function date_add, date_sub, add_months, month_between, also add a rule for
add/subtract of date/timestamp and interval.
Closes#7589
cc rxin
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: Davies Liu <davies@databricks.com>
Closes#7754 from davies/date_add and squashes the following commits:
e8c633a [Davies Liu] Merge branch 'master' of github.com:apache/spark into date_add
9e8e085 [Davies Liu] Merge branch 'master' of github.com:apache/spark into date_add
6224ce4 [Davies Liu] fix conclict
bd18cd4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into date_add
e47ff2c [Davies Liu] add python api, fix date functions
01943d0 [Davies Liu] Merge branch 'master' into date_add
522e91a [Daoyuan Wang] fix
e8a639a [Daoyuan Wang] fix
42df486 [Daoyuan Wang] fix style
87c4b77 [Daoyuan Wang] function add_months, months_between and some fixes
1a68e03 [Daoyuan Wang] poc of time interval calculation
c506661 [Daoyuan Wang] function date_add , date_sub
unix_timestamp(): long
Gets current Unix timestamp in seconds.
unix_timestamp(string|date): long
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return null if fail: unix_timestamp('2009-03-20 11:30:01') = 1237573801
unix_timestamp(string date, string pattern): long
Convert time string with given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to Unix time stamp (in seconds), return null if fail: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400.
from_unixtime(bigint unixtime[, string format]): string
Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of "1970-01-01 00:00:00".
Jira:
https://issues.apache.org/jira/browse/SPARK-8174https://issues.apache.org/jira/browse/SPARK-8175
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#7644 from adrian-wang/udfunixtime and squashes the following commits:
2fe20c4 [Daoyuan Wang] util.Date
ea2ec16 [Daoyuan Wang] use util.Date for better performance
a2cf929 [Daoyuan Wang] doc return null instead of 0
f6f070a [Daoyuan Wang] address comments from davies
6a4cbb3 [Daoyuan Wang] temp
56ded53 [Daoyuan Wang] rebase and address comments
14a8b37 [Daoyuan Wang] function unix_timestamp, from_unixtime
Users can now get the file name of the partition being read in. A thread local variable is in `SQLNewHadoopRDD` and is set when the partition is computed. `SQLNewHadoopRDD` is moved to core so that the catalyst package can reach it.
This supports:
`df.select(inputFileName())`
and
`sqlContext.sql("select input_file_name() from table")`
Author: Joseph Batchik <josephbatchik@gmail.com>
Closes#7743 from JDrit/input_file_name and squashes the following commits:
abb8609 [Joseph Batchik] fixed failing test and changed the default value to be an empty string
d2f323d [Joseph Batchik] updates per review
102061f [Joseph Batchik] updates per review
75313f5 [Joseph Batchik] small fixes
c7f7b5a [Joseph Batchik] addeding input file name to Spark SQL
JIRA: https://issues.apache.org/jira/browse/SPARK-9428
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes#7748 from yjshen/string_cleanup and squashes the following commits:
e0c2b3d [Yijie Shen] update codegen in RegExpExtract and RegExpReplace
26614d2 [Yijie Shen] MathFunctionSuite
a402859 [Yijie Shen] complex_create, conditional and cast
6e4e608 [Yijie Shen] arithmetic and cast
52593c1 [Yijie Shen] null input test cases for StringExpressionSuite
Author: Reynold Xin <rxin@databricks.com>
Closes#7767 from rxin/SPARK-9462 and squashes the following commits:
ef3e2d9 [Reynold Xin] Removed println
713ac3a [Reynold Xin] More unit tests.
bb5c334 [Reynold Xin] [SPARK-9462][SQL] Initialize nondeterministic expressions in code gen fallback mode.
We accidentally moved the list of expressions from the generated code instance to the class wrapper, and as a result, different threads are sharing the same set of expressions, which cause problems for expressions with mutable state.
This pull request fixed that problem, and also added unit tests for all codegen classes, except GeneratedOrdering (which will never need any expressions since sort now only accepts bound references.
Author: Reynold Xin <rxin@databricks.com>
Closes#7759 from rxin/SPARK-9448 and squashes the following commits:
c09b50f [Reynold Xin] [SPARK-9448][SQL] GenerateUnsafeProjection should not share expressions across instances.
We need to make page sizes configurable so we can reduce them in unit tests and increase them in real production workloads. These sizes are now controlled by a new configuration, `spark.buffer.pageSize`. The new default is 64 megabytes.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7741 from JoshRosen/SPARK-9411 and squashes the following commits:
a43c4db [Josh Rosen] Fix pow
2c0eefc [Josh Rosen] Fix MAXIMUM_PAGE_SIZE_BYTES comment + value
bccfb51 [Josh Rosen] Lower page size to 4MB in TestHive
ba54d4b [Josh Rosen] Make UnsafeExternalSorter's page size configurable
0045aa2 [Josh Rosen] Make UnsafeShuffle's page size configurable
bc734f0 [Josh Rosen] Rename configuration
e614858 [Josh Rosen] Makes BytesToBytesMap page size configurable
We want to introduce a new IntervalType in 1.6 that is based on only the number of microseoncds,
so interval can be compared.
Renaming the existing IntervalType to CalendarIntervalType so we can do that in the future.
Author: Reynold Xin <rxin@databricks.com>
Closes#7745 from rxin/calendarintervaltype and squashes the following commits:
99f64e8 [Reynold Xin] One more line ...
13466c8 [Reynold Xin] Fixed tests.
e20f24e [Reynold Xin] [SPARK-9430][SQL] Rename IntervalType to CalendarIntervalType.
Author: Reynold Xin <rxin@databricks.com>
Closes#7747 from rxin/SPARK-9127 and squashes the following commits:
e851418 [Reynold Xin] [SPARK-9127][SQL] Rand/Randn codegen fails with long seed.
as an offline discussion with rxin , it's weird to be computing stuff while doing sorting, we should only order by bound reference during execution.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7593 from cloud-fan/sort and squashes the following commits:
7b1bef7 [Wenchen Fan] add test
daf206d [Wenchen Fan] add more comments
289bee0 [Wenchen Fan] do not order by expressions which still need evaluation
Right now, we use double to parse all the float number in SQL. When it's used in expression together with DecimalType, it will turn the decimal into double as well. Also it will loss some precision when using double.
This PR change to parse float number to decimal or double, based on it's using scientific notation or not, see https://msdn.microsoft.com/en-us/library/ms179899.aspx
This is a break change, should we doc it somewhere?
Author: Davies Liu <davies@databricks.com>
Closes#7642 from davies/parse_decimal and squashes the following commits:
1f576d9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into parse_decimal
5e142b6 [Davies Liu] fix scala style
eca99de [Davies Liu] fix tests
2afe702 [Davies Liu] Merge branch 'master' of github.com:apache/spark into parse_decimal
f4a320b [Davies Liu] Update SqlParser.scala
1c48e34 [Davies Liu] use decimal or double when parsing SQL
We will do local projection for LocalRelation, and thus reuse the same Expression object among multiply evaluations. We should reset the mutable states of Expression before evaluate it.
Fix `PullOutNondeterministic` rule to make it work for `Sort`.
Also got a chance to cleanup the dataframe test suite.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7674 from cloud-fan/show and squashes the following commits:
888934f [Wenchen Fan] fix sort
c0e93e8 [Wenchen Fan] local DataFrame with random columns should return same value when call `show`
UnsafeRow.getDouble and getFloat() return NaN when called on columns that are null, which is inconsistent with the behavior of other row classes (which is to return 0.0).
In addition, the generic get(ordinal, dataType) method should always return null for a null literal, but currently it handles nulls by calling the type-specific accessors.
This patch addresses both of these issues and adds a regression test.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7736 from JoshRosen/unsafe-row-null-fixes and squashes the following commits:
c8eb2ee [Josh Rosen] Fix test in UnsafeRowConverterSuite
6214682 [Josh Rosen] Fixes to null handling in UnsafeRow
Since catalyst package already depends on Spark core, we can move those expressions
into catalyst, and simplify function registry.
This is a followup of #7478.
Author: Reynold Xin <rxin@databricks.com>
Closes#7735 from rxin/SPARK-8003 and squashes the following commits:
2ffbdc3 [Reynold Xin] [SPARK-8003][SQL] Move expressions in sql/core package to catalyst.
The original patch didn't handle nulls correctly for next_day.
Author: Reynold Xin <rxin@databricks.com>
Closes#7718 from rxin/next_day and squashes the following commits:
616a425 [Reynold Xin] Merged DatetimeExpressionsSuite into DateFunctionsSuite.
faa78cf [Reynold Xin] Merged DatetimeFunctionsSuite into DateExpressionsSuite.
6c4fb6a [Reynold Xin] [SPARK-8196][SQL] Fix null handling & documentation for next_day.
Our CodeFormatter currently does not handle parentheses, and as a result in code dump, we see code formatted this way:
```
foo(
a,
b,
c)
```
With this patch, it is formatted this way:
```
foo(
a,
b,
c)
```
Author: Reynold Xin <rxin@databricks.com>
Closes#7712 from rxin/codeformat-parentheses and squashes the following commits:
c2b1c5f [Reynold Xin] Took square bracket out
3cfb174 [Reynold Xin] Code review feedback.
91f5bb1 [Reynold Xin] [SPARK-9394][SQL] Handle parentheses in CodeFormatter.
This pull request updates GenerateUnsafeProjection to support StructType. If an input struct type is backed already by an UnsafeRow, GenerateUnsafeProjection copies the bytes directly into its buffer space without any conversion. However, if the input is not an UnsafeRow, GenerateUnsafeProjection runs the code generated recursively to convert the input into an UnsafeRow and then copies it into the buffer space.
Also create a TungstenProject operator that projects data directly into UnsafeRow. Note that I'm not sure if this is the way we want to structure Unsafe+codegen operators, but we can defer that decision to follow-up pull requests.
Author: Reynold Xin <rxin@databricks.com>
Closes#7689 from rxin/tungsten-struct-type and squashes the following commits:
9162f42 [Reynold Xin] Support IntervalType in UnsafeRow's getter.
be9f377 [Reynold Xin] Fixed tests.
10c4b7c [Reynold Xin] Format generated code.
77e8d0e [Reynold Xin] Fixed NondeterministicSuite.
ac4951d [Reynold Xin] Yay.
ac203bf [Reynold Xin] More comments.
9f36216 [Reynold Xin] Updated comment.
6b781fe [Reynold Xin] Reset the change in DataFrameSuite.
525b95b [Reynold Xin] Merged with master, more documentation & test cases.
321859a [Reynold Xin] [SPARK-9373][SQL] Support StructType in Tungsten projection [WIP]
next_day, returns next certain dayofweek.
last_day, returns the last day of the month which given date belongs to.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#6986 from adrian-wang/udfnlday and squashes the following commits:
ef7e3da [Daoyuan Wang] fix
02b3426 [Daoyuan Wang] address 2 comments
dc69630 [Daoyuan Wang] address comments from rxin
8846086 [Daoyuan Wang] address comments from rxin
d09bcce [Daoyuan Wang] multi fix
1a9de3d [Daoyuan Wang] function next_day and last_day
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7688 from cloud-fan/interval and squashes the following commits:
5b36b17 [Wenchen Fan] fix codegen
a99ed50 [Wenchen Fan] address comment
9e6d319 [Wenchen Fan] Support IntervalType in UnsafeRow
literals in grouping expressions have no effect at all, only make our grouping key bigger, so we should remove them in Optimizer.
I also make old and new aggregation code consistent about literals in grouping here. In old aggregation, actually literals in grouping are already removed but new aggregation is not. So I explicitly make it a rule in Optimizer.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7583 from cloud-fan/minor and squashes the following commits:
471adff [Wenchen Fan] add test
0839925 [Wenchen Fan] use transformDown when rewrite final result expressions
Make this test deterministic, i.e. make sure this test can be passed no matter how many times we run it.
The origin implementation uses a random seed and gives a chance that we may break the null check assertion `assert(Iterator.fill(100)(generator()).contains(null))`.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7691 from cloud-fan/seed and squashes the following commits:
eae7281 [Wenchen Fan] use a seed in RandomDataGeneratorSuite
JIRA: https://issues.apache.org/jira/browse/SPARK-9356
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes#7671 from yjshen/deprecated_unlimit and squashes the following commits:
c707f56 [Yijie Shen] remove pattern matching in changePrecision
4a1823c [Yijie Shen] remove internal occurrence of Decimal.Unlimited
Currently UnsafeRow cannot support a generic getter. However, if the data type is known, we can support a generic getter.
Author: Reynold Xin <rxin@databricks.com>
Closes#7666 from rxin/generic-getter-with-datatype and squashes the following commits:
ee2874c [Reynold Xin] Add a default implementation for getStruct.
1e109a0 [Reynold Xin] [SPARK-9350][SQL] Introduce an InternalRow generic getter that requires a DataType.
033ee88 [Reynold Xin] Removed getAs in non test code.
Author: Reynold Xin <rxin@databricks.com>
Closes#7665 from rxin/remove-row-apply and squashes the following commits:
0b43001 [Reynold Xin] support getString in UnsafeRow.
176d633 [Reynold Xin] apply -> get.
2941324 [Reynold Xin] [SPARK-9348][SQL] Remove apply method on InternalRow.
Currently nondeterministic expression is broken without a explicit initialization phase.
Let me take `MonotonicallyIncreasingID` as an example. This expression need a mutable state to remember how many times it has been evaluated, so we use `transient var count: Long` there. By being transient, the `count` will be reset to 0 and **only** to 0 when serialize and deserialize it, as deserialize transient variable will result to default value. There is *no way* to use another initial value for `count`, until we add the explicit initialization phase.
Another use case is local execution for `LocalRelation`, there is no serialize and deserialize phase and thus we can't reset mutable states for it.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7535 from cloud-fan/init and squashes the following commits:
6c6f332 [Wenchen Fan] add test
ef68ff4 [Wenchen Fan] fix comments
9eac85e [Wenchen Fan] move init code to interpreted class
bb7d838 [Wenchen Fan] pulls out nondeterministic expressions into a project
b4a4fc7 [Wenchen Fan] revert a refactor
86fee36 [Wenchen Fan] add initialization phase for nondeterministic expression
The two are redundant.
Once this patch is merged, I plan to remove the inbound conversions from unsafe aggregates.
Author: Reynold Xin <rxin@databricks.com>
Closes#7658 from rxin/unsafeconverters and squashes the following commits:
ed19e6c [Reynold Xin] Updated support types.
2a56d7e [Reynold Xin] [SPARK-9334][SQL] Remove UnsafeRowConverter in favor of UnsafeProjection.
The generated expression code can be hard to read since they are not indented well. This patch adds a code formatter that formats code automatically when we output them to the screen.
Author: Reynold Xin <rxin@databricks.com>
Closes#7656 from rxin/codeformatter and squashes the following commits:
5ba0e90 [Reynold Xin] [SPARK-9331][SQL] Add a code formatter to auto-format generated code.
This patch extends CheckAnalysis to throw errors for queries that try to sort on unsupported column types, such as ArrayType.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7633 from JoshRosen/SPARK-9295 and squashes the following commits:
23b2fbf [Josh Rosen] Embed function in foreach
bfe1451 [Josh Rosen] Update to allow sorting by null literals
2f1b802 [Josh Rosen] Add analysis rule to detect sorting on unsupported column types (SPARK-9295)
This patch adds an analysis check to ensure that join conditions' data types are BooleanType. This check is necessary in order to report proper errors for non-boolean DataFrame join conditions.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7630 from JoshRosen/SPARK-9292 and squashes the following commits:
aec6c7b [Josh Rosen] Check condition type in resolved()
75a3ea6 [Josh Rosen] Fix SPARK-9292.
It's a thing used in test cases, but named Row. Pretty annoying because everytime I search for Row, it shows up before the Spark SQL Row, which is what a developer wants most of the time.
Author: Reynold Xin <rxin@databricks.com>
Closes#7638 from rxin/remove-row and squashes the following commits:
aeda52d [Reynold Xin] [SPARK-9305] Rename org.apache.spark.Row to Item.
I also changed InternalRow's size/length function to numFields, to make it more obvious that it is not about bytes, but the number of fields.
Author: Reynold Xin <rxin@databricks.com>
Closes#7626 from rxin/internalRow and squashes the following commits:
e124daf [Reynold Xin] Fixed test case.
805ceb7 [Reynold Xin] Commented out the failed test suite.
f8a9ca5 [Reynold Xin] Fixed more bugs. Still at least one more remaining.
76d9081 [Reynold Xin] Fixed data sources.
7807f70 [Reynold Xin] Fixed DataFrameSuite.
cb60cd2 [Reynold Xin] Code review & small bug fixes.
0a2948b [Reynold Xin] Fixed style.
3280d03 [Reynold Xin] [SPARK-9285][SQL] Remove InternalRow's inheritance from Row.
Address comments for #7605
cc rxin
Author: Davies Liu <davies@databricks.com>
Closes#7634 from davies/decimal_unlimited2 and squashes the following commits:
b2d8b0d [Davies Liu] add doc and test for DecimalType.isWiderThan
65b251c [Davies Liu] fix test
6a91f32 [Davies Liu] fix style
ca9c973 [Davies Liu] address comments
Author: Reynold Xin <rxin@databricks.com>
Closes#7636 from rxin/complex-string-implicit-cast and squashes the following commits:
3e67327 [Reynold Xin] [SPARK-9200][SQL] Don't implicitly cast non-atomic types to string type.
Romove Decimal.Unlimited (change to support precision up to 38, to match with Hive and other databases).
In order to keep backward source compatibility, Decimal.Unlimited is still there, but change to Decimal(38, 18).
If no precision and scale is provide, it's Decimal(10, 0) as before.
Author: Davies Liu <davies@databricks.com>
Closes#7605 from davies/decimal_unlimited and squashes the following commits:
aa3f115 [Davies Liu] fix tests and style
fb0d20d [Davies Liu] address comments
bfaae35 [Davies Liu] fix style
df93657 [Davies Liu] address comments and clean up
06727fd [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_unlimited
4c28969 [Davies Liu] fix tests
8d783cc [Davies Liu] fix tests
788631c [Davies Liu] fix double with decimal in Union/except
1779bde [Davies Liu] fix scala style
c9c7c78 [Davies Liu] remove Decimal.Unlimited
Reverts ObjectPool. As it stands, it has a few problems:
1. ObjectPool doesn't work with spilling and memory accounting.
2. I don't think in the long run the idea of an object pool is what we want to support, since it essentially goes back to unmanaged memory, and creates pressure on GC, and is hard to account for the total in memory size.
3. The ObjectPool patch removed the specialized getters for strings and binary, and as a result, actually introduced branches when reading non primitive data types.
If we do want to support arbitrary user defined types in the future, I think we can just add an object array in UnsafeRow, rather than relying on indirect memory addressing through a pool. We also need to pick execution strategies that are optimized for those, rather than keeping a lot of unserialized JVM objects in memory during aggregation.
This is probably the hardest thing I had to revert in Spark, due to recent patches that also change the same part of the code. Would be great to get a careful look.
Author: Reynold Xin <rxin@databricks.com>
Closes#7591 from rxin/revert-object-pool and squashes the following commits:
01db0bc [Reynold Xin] Scala style.
eda89fc [Reynold Xin] Fixed describe.
2967118 [Reynold Xin] Fixed accessor for JoinedRow.
e3294eb [Reynold Xin] Merge branch 'master' into revert-object-pool
657855f [Reynold Xin] Temp commit.
c20f2c8 [Reynold Xin] Style fix.
fe37079 [Reynold Xin] Revert "[SPARK-8579] [SQL] support arbitrary object in UnsafeRow"
JIRA: https://issues.apache.org/jira/browse/SPARK-8935
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes#7365 from yjshen/cast_codegen and squashes the following commits:
ef6e8b5 [Yijie Shen] getColumn and setColumn in struct cast, autounboxing in array and map
eaece18 [Yijie Shen] remove null case in cast code gen
fd7eba4 [Yijie Shen] resolve comments
80378a5 [Yijie Shen] the missing self cast
611d66e [Yijie Shen] Bug fix: NullType & primitive object unboxing
6d5c0fe [Yijie Shen] rebase and add Interval codegen
9424b65 [Yijie Shen] tiny style fix
4a1c801 [Yijie Shen] remove CodeHolder class, use function instead.
3f5df88 [Yijie Shen] CodeHolder for complex dataTypes
c286f13 [Yijie Shen] moved all the cast code into class body
4edfd76 [Yijie Shen] [WIP] finished primitive part
JIRA: https://issues.apache.org/jira/browse/SPARK-9165
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes#7537 from yjshen/array_struct_codegen and squashes the following commits:
3a6dce6 [Yijie Shen] use infix notion in createArray test
5e90f0a [Yijie Shen] resolve comments: classOf
39cefb8 [Yijie Shen] codegen for createArray createStruct & createNamedStruct
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7446 from cloud-fan/filter and squashes the following commits:
330021e [Wenchen Fan] add exists to tree node
2cab68c [Wenchen Fan] more enhance
949be07 [Wenchen Fan] push down part of predicate if possible
3912f84 [Wenchen Fan] address comments
8ce15ca [Wenchen Fan] fix bug
557158e [Wenchen Fan] Filter using non-deterministic expressions should not be pushed down
Also make format_string the canonical form, rather than printf.
Author: Reynold Xin <rxin@databricks.com>
Closes#7579 from rxin/format_strings and squashes the following commits:
53ee54f [Reynold Xin] Fixed unit tests.
52357e1 [Reynold Xin] Add format_string alias.
b40a42a [Reynold Xin] [SPARK-9154][SQL] Rename formatString to format_string.
Jira: https://issues.apache.org/jira/browse/SPARK-9154
fixes bug of #7546
marmbrus I can't reopen the other PR, because I didn't closed it. Can you trigger Jenkins?
Author: Tarek Auel <tarek.auel@googlemail.com>
Closes#7571 from tarekauel/SPARK-9154 and squashes the following commits:
dcae272 [Tarek Auel] [SPARK-9154][SQL] build fix
1487602 [Tarek Auel] Merge remote-tracking branch 'upstream/master' into SPARK-9154
f512c5f [Tarek Auel] [SPARK-9154][SQL] build fix
a943d3e [Tarek Auel] [SPARK-9154] implicit input cast, added tests for null, support for null primitives
10b4de8 [Tarek Auel] [SPARK-9154][SQL] codegen removed fallback trait
cd8322b [Tarek Auel] [SPARK-9154][SQL] codegen string format
086caba [Tarek Auel] [SPARK-9154][SQL] codegen string format
This reverts commit 7f072c3d5e.
Revert #7546
Author: Michael Armbrust <michael@databricks.com>
Closes#7570 from marmbrus/revert9154 and squashes the following commits:
ed2c32a [Michael Armbrust] Revert "[SPARK-9154] [SQL] codegen StringFormat"
JIRA:
https://issues.apache.org/jira/browse/SPARK-9081https://issues.apache.org/jira/browse/SPARK-9168
This PR target at two modifications:
1. Change `isNaN` to return `false` on `null` input
2. Make `dropna` and `fillna` to fill/drop NaN values as well
3. Implement `nanvl`
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes#7523 from yjshen/fillna_dropna and squashes the following commits:
f0a51db [Yijie Shen] make coalesce untouched and implement nanvl
1d3e35f [Yijie Shen] make Coalesce aware of NaN in order to support fillna
2760cbc [Yijie Shen] change isNaN(null) to false as well as implement dropna
JIRA: https://issues.apache.org/jira/browse/SPARK-9173
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes#7540 from yjshen/union_pushdown and squashes the following commits:
278510a [Yijie Shen] rename UnionPushDown to SetOperationPushDown
91741c1 [Yijie Shen] Add UnionPushDown support for intersect and except
Pull Request for: https://issues.apache.org/jira/browse/SPARK-8230
Primary issue resolved is to implement array/map size for Spark SQL. Code is ready for review by a committer. Chen Hao is on the JIRA ticket, but I don't know his username on github, rxin is also on JIRA ticket.
Things to review:
1. Where to put added functions namespace wise, they seem to be part of a few operations on collections which includes `sort_array` and `array_contains`. Hence the name given `collectionOperations.scala` and `_collection_functions` in python.
2. In Python code, should it be in a `1.5.0` function array or in a collections array?
3. Are there any missing methods on the `Size` case class? Looks like many of these functions have generated Java code, is that also needed in this case?
4. Something else?
Author: Pedro Rodriguez <ski.rodriguez@gmail.com>
Author: Pedro Rodriguez <prodriguez@trulia.com>
Closes#7462 from EntilZha/SPARK-8230 and squashes the following commits:
9a442ae [Pedro Rodriguez] fixed functions and sorted __all__
9aea3bb [Pedro Rodriguez] removed imports from python docs
15d4bf1 [Pedro Rodriguez] Added null test case and changed to nullSafeCodeGen
d88247c [Pedro Rodriguez] removed python code
bd5f0e4 [Pedro Rodriguez] removed duplicate function from rebase/merge
59931b4 [Pedro Rodriguez] fixed compile bug instroduced when merging
c187175 [Pedro Rodriguez] updated code to add size to __all__ directly and removed redundent pretty print
130839f [Pedro Rodriguez] fixed failing test
aa9bade [Pedro Rodriguez] fix style
e093473 [Pedro Rodriguez] updated python code with docs, switched classes/traits implemented, added (failing) expression tests
0449377 [Pedro Rodriguez] refactored code to use better abstract classes/traits and implementations
9a1a2ff [Pedro Rodriguez] added unit tests for map size
2bfbcb6 [Pedro Rodriguez] added unit test for size
20df2b4 [Pedro Rodriguez] Finished working version of size function and added it to python
b503e75 [Pedro Rodriguez] First attempt at implementing size for maps and arrays
99a6a5c [Pedro Rodriguez] fixed failing test
cac75ac [Pedro Rodriguez] fix style
933d843 [Pedro Rodriguez] updated python code with docs, switched classes/traits implemented, added (failing) expression tests
42bb7d4 [Pedro Rodriguez] refactored code to use better abstract classes/traits and implementations
f9c3b8a [Pedro Rodriguez] added unit tests for map size
2515d9f [Pedro Rodriguez] added documentation
0e60541 [Pedro Rodriguez] added unit test for size
acf9853 [Pedro Rodriguez] Finished working version of size function and added it to python
84a5d38 [Pedro Rodriguez] First attempt at implementing size for maps and arrays
Add expressions `regex_extract` & `regex_replace`
Author: Cheng Hao <hao.cheng@intel.com>
Closes#7468 from chenghao-intel/regexp and squashes the following commits:
e5ea476 [Cheng Hao] minor update for documentation
ef96fd6 [Cheng Hao] update the code gen
72cf28f [Cheng Hao] Add more log for compilation error
4e11381 [Cheng Hao] Add regexp_replace / regexp_extract support
This patch addresses an issue where queries that sorted float or double columns containing NaN values could fail with "Comparison method violates its general contract!" errors from TimSort. The root of this problem is that `NaN > anything`, `NaN == anything`, and `NaN < anything` all return `false`.
Per the design specified in SPARK-9079, we have decided that `NaN = NaN` should return true and that NaN should appear last when sorting in ascending order (i.e. it is larger than any other numeric value).
In addition to implementing these semantics, this patch also adds canonicalization of NaN values in UnsafeRow, which is necessary in order to be able to do binary equality comparisons on equal NaNs that might have different bit representations (see SPARK-9147).
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7194 from JoshRosen/nan and squashes the following commits:
983d4fc [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan
88bd73c [Josh Rosen] Fix Row.equals()
a702e2e [Josh Rosen] normalization -> canonicalization
a7267cf [Josh Rosen] Normalize NaNs in UnsafeRow
fe629ae [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan
fbb2a29 [Josh Rosen] Fix NaN comparisons in BinaryComparison expressions
c1fd4fe [Josh Rosen] Fold NaN test into existing test framework
b31eb19 [Josh Rosen] Uncomment failing tests
7fe67af [Josh Rosen] Support NaN == NaN (SPARK-9145)
58bad2c [Josh Rosen] Revert "Compare rows' string representations to work around NaN incomparability."
fc6b4d2 [Josh Rosen] Update CodeGenerator
3998ef2 [Josh Rosen] Remove unused code
a2ba2e7 [Josh Rosen] Fix prefix comparision for NaNs
a30d371 [Josh Rosen] Compare rows' string representations to work around NaN incomparability.
6f03f85 [Josh Rosen] Fix bug in Double / Float ordering
42a1ad5 [Josh Rosen] Stop filtering NaNs in UnsafeExternalSortSuite
bfca524 [Josh Rosen] Change ordering so that NaN is maximum value.
8d7be61 [Josh Rosen] Update randomized test to use ScalaTest's assume()
b20837b [Josh Rosen] Add failing test for new NaN comparision ordering
5b88b2b [Josh Rosen] Fix compilation of CodeGenerationSuite
d907b5b [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan
630ebc5 [Josh Rosen] Specify an ordering for NaN values.
9bf195a [Josh Rosen] Re-enable NaNs in CodeGenerationSuite to produce more regression tests
13fc06a [Josh Rosen] Add regression test for NaN sorting issue
f9efbb5 [Josh Rosen] Fix ORDER BY NULL
e7dc4fb [Josh Rosen] Add very generic test for ordering
7d5c13e [Josh Rosen] Add regression test for SPARK-8782 (ORDER BY NULL)
b55875a [Josh Rosen] Generate doubles and floats over entire possible range.
5acdd5c [Josh Rosen] Infinity and NaN are interesting.
ab76cbd [Josh Rosen] Move code to Catalyst package.
d2b4a4a [Josh Rosen] Add random data generator test utilities to Spark SQL.
Jira: https://issues.apache.org/jira/browse/SPARK-9132https://issues.apache.org/jira/browse/SPARK-9163
rxin as you proposed in the Jira ticket, I just moved the logic to a separate object. I haven't changed anything of the logic of `NumberConverter`.
Author: Tarek Auel <tarek.auel@googlemail.com>
Closes#7552 from tarekauel/SPARK-9163 and squashes the following commits:
40dcde9 [Tarek Auel] [SPARK-9132][SPARK-9163][SQL] style fix
fa985bd [Tarek Auel] [SPARK-9132][SPARK-9163][SQL] codegen conv
JIRA: https://issues.apache.org/jira/browse/SPARK-9172
Simply make `DecimalPrecision` support for `Intersect` and `Except` in addition to `Union`.
Besides, add unit test for `DecimalPrecision` as well.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes#7511 from viirya/more_decimalprecieion and squashes the following commits:
4d29d10 [Liang-Chi Hsieh] Fix code comment.
9fb0d49 [Liang-Chi Hsieh] Make DecimalPrecision support for Intersect and Except.
I also changed the semantics of concat w.r.t. null back to the same behavior as Hive.
That is to say, concat now returns null if any input is null.
Author: Reynold Xin <rxin@databricks.com>
Closes#7504 from rxin/concat_ws and squashes the following commits:
83fd950 [Reynold Xin] Fixed type casting.
3ae85f7 [Reynold Xin] Write null better.
cdc7be6 [Reynold Xin] Added code generation for pure string mode.
a61c4e4 [Reynold Xin] Updated comments.
2d51406 [Reynold Xin] [SPARK-8241][SQL] string function: concat_ws.
PR #7506 breaks master build because of compilation error. Note that #7506 itself looks good, but it seems that `git merge` did something stupid.
Author: Cheng Lian <lian@databricks.com>
Closes#7510 from liancheng/hotfix-for-pr-7506 and squashes the following commits:
7ea7e89 [Cheng Lian] Fixes compilation error
This pull request fixes some of the problems in #6981.
- Added date functions to `__all__` so they get exposed
- Rename day_of_month -> dayofmonth
- Rename day_in_year -> dayofyear
- Rename week_of_year -> weekofyear
- Removed "day" from Scala/Python API since it is ambiguous. Only leaving the alias in SQL.
Author: Reynold Xin <rxin@databricks.com>
This patch had conflicts when merged, resolved by
Committer: Reynold Xin <rxin@databricks.com>
Closes#7506 from rxin/datetime and squashes the following commits:
0cb24d9 [Reynold Xin] Export all functions in Python.
e44a4a0 [Reynold Xin] Removed day function from Scala and Python.
9c08fdc [Reynold Xin] [SQL] Make date/time functions more consistent with other database systems.
rxin / davies
Sorry for that unnecessary change. And thanks again for all your support!
Author: Tarek Auel <tarek.auel@googlemail.com>
Closes#7505 from tarekauel/SPARK-8199-FollowUp and squashes the following commits:
d09321c [Tarek Auel] [SPARK-8199] follow up; revert change in test
c17397f [Tarek Auel] [SPARK-8199] follow up; revert change in test
67acfe6 [Tarek Auel] [SPARK-8199] follow up; revert change in test
By grouping projection calls into multiple apply function, we are able to push the number of projections codegen can handle from ~1k to ~60k. I have set the unit test to test against 5k as 60k took 15s for the unit test to complete.
Author: Forest Fang <forest.fang@outlook.com>
Closes#7076 from saurfang/codegen_size_limit and squashes the following commits:
b7a7635 [Forest Fang] [SPARK-8443][SQL] Execute and verify split projections in test
adef95a [Forest Fang] [SPARK-8443][SQL] Use safer factor and rewrite splitting code
1b5aa7e [Forest Fang] [SPARK-8443][SQL] inline execution if one block only
9405680 [Forest Fang] [SPARK-8443][SQL] split projection code by size limit
It is very hard to track which expressions have code gen implemented or not. This patch removes the default fallback gencode implementation from Expression, and moves that into a new trait called CodegenFallback. Each concrete expression needs to either implement code generation, or mix in CodegenFallback. This makes it very easy to track which expressions have code generation implemented already.
Additionally, this patch creates an Unevaluable trait that can be used to track expressions that don't support evaluation (e.g. Star).
Author: Reynold Xin <rxin@databricks.com>
Closes#7487 from rxin/codegenfallback and squashes the following commits:
14ebf38 [Reynold Xin] Fixed Conv
6c1c882 [Reynold Xin] Fixed Alias.
b42611b [Reynold Xin] [SPARK-9150][SQL] Create a trait to track code generation for expressions.
cb5c066 [Reynold Xin] Removed extra import.
39cbe40 [Reynold Xin] [SPARK-8240][SQL] string function: concat
JIRA: https://issues.apache.org/jira/browse/SPARK-9055
cc rxin
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes#7491 from yijieshen/widen and squashes the following commits:
079fa52 [Yijie Shen] widenType support for intersect and expect
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7496 from cloud-fan/tests and squashes the following commits:
0958f90 [Wenchen Fan] improve test for nondeterministic expressions
fix 2 bugs introduced in https://github.com/apache/spark/pull/7353
1. we should use UTC Calendar when cast string to date . Before #7353 , we use `DateTimeUtils.fromJavaDate(Date.valueOf(s.toString))` to cast string to date, and `fromJavaDate` will call `millisToDays` to avoid the time zone issue. Now we use `DateTimeUtils.stringToDate(s)`, we should create a Calendar with UTC in the begging.
2. we should not change the default time zone in test cases. The `threadLocalLocalTimeZone` and `threadLocalTimestampFormat` in `DateTimeUtils` will only be evaluated once for each thread, so we can't set the default time zone back anymore.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7488 from cloud-fan/datetime and squashes the following commits:
9cd6005 [Wenchen Fan] address comments
21ef293 [Wenchen Fan] fix 2 bugs in datetime
Author: Reynold Xin <rxin@databricks.com>
Closes#7490 from rxin/unit-test-null-funcs and squashes the following commits:
7b276f0 [Reynold Xin] Move isNaN.
8307287 [Reynold Xin] [SPARK-9169][SQL] Improve unit test coverage for null expressions.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7452 from cloud-fan/boolean-simplify and squashes the following commits:
2a6e692 [Wenchen Fan] fix style
d3cfd26 [Wenchen Fan] fix BooleanSimplification in case-insensitive
The check was unreachable before, as `case operator: LogicalPlan` catches everything already.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7449 from cloud-fan/tmp and squashes the following commits:
2bb6637 [Wenchen Fan] add test
5493aea [Wenchen Fan] add the check back
27221a7 [Wenchen Fan] remove unnecessary analysis check code for self join
JIRA: https://issues.apache.org/jira/browse/SPARK-9080
cc rxin
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes#7464 from yijieshen/isNaN and squashes the following commits:
11ae039 [Yijie Shen] add isNaN in functions
666718e [Yijie Shen] add isNaN predicate expression
a follow up of https://github.com/apache/spark/pull/7353
1. we should use `Calendar.HOUR_OF_DAY` instead of `Calendar.HOUR`(this is for AM, PM).
2. we should call `c.set(Calendar.MILLISECOND, 0)` after `Calendar.getInstance`
I'm not sure why the tests didn't fail in jenkins, but I ran latest spark master branch locally and `DateTimeUtilsSuite` failed.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7473 from cloud-fan/datetime and squashes the following commits:
66cdaf2 [Wenchen Fan] fix several bugs in DateTimeUtils.stringToTimestamp
cc chenghao-intel adrian-wang
Author: zhichao.li <zhichao.li@intel.com>
Closes#6872 from zhichao-li/conv and squashes the following commits:
6ef3b37 [zhichao.li] add unittest and comments
78d9836 [zhichao.li] polish dataframe api and add unittest
e2bace3 [zhichao.li] update to use ImplicitCastInputTypes
cbcad3f [zhichao.li] add function conv
instead of return false, throw exception when check equality between external and internal row is better.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7460 from cloud-fan/row-compare and squashes the following commits:
8a20911 [Wenchen Fan] improve equals
402daa8 [Wenchen Fan] throw exception when check equality between external and internal row
Added two projections: GenerateUnsafeProjection and FromUnsafeProjection, which could be used to convert UnsafeRow from/to GenericInternalRow.
They will re-use the buffer during projection, similar to MutableProjection (without all the interface MutableProjection has).
cc rxin JoshRosen
Author: Davies Liu <davies@databricks.com>
Closes#7437 from davies/unsafe_proj2 and squashes the following commits:
dbf538e [Davies Liu] test with all the expression (only for supported types)
dc737b2 [Davies Liu] address comment
e424520 [Davies Liu] fix scala style
70e231c [Davies Liu] address comments
729138d [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_proj2
5a26373 [Davies Liu] unsafe projections
Currently we will stop project collapse when the lower projection has nondeterministic expressions. However it's overkill sometimes, we should be able to optimize `df.select(Rand(10)).select('a)` to `df.select('a)`
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7445 from cloud-fan/non-deterministic and squashes the following commits:
0deaef6 [Wenchen Fan] Improve project collapse with nondeterministic expressions
Jira https://issues.apache.org/jira/browse/SPARK-8995
In PR #6981we noticed that we cannot cast date strings that contains a time, like '2015-03-18 12:39:40' to date. Besides it's not possible to cast a string like '18:03:20' to a timestamp.
If a time is passed without a date, today is inferred as date.
Author: Tarek Auel <tarek.auel@googlemail.com>
Author: Tarek Auel <tarek.auel@gmail.com>
Closes#7353 from tarekauel/SPARK-8995 and squashes the following commits:
14f333b [Tarek Auel] [SPARK-8995] added tests for daylight saving time
ca1ae69 [Tarek Auel] [SPARK-8995] style fix
d20b8b4 [Tarek Auel] [SPARK-8995] bug fix: distinguish between 0 and null
ef05753 [Tarek Auel] [SPARK-8995] added check for year >= 1000
01c9ff3 [Tarek Auel] [SPARK-8995] support for time strings
34ec573 [Tarek Auel] fixed style
71622c0 [Tarek Auel] improved timestamp and date parsing
0e30c0a [Tarek Auel] Hive compatibility
cfbaed7 [Tarek Auel] fixed wrong checks
71f89c1 [Tarek Auel] [SPARK-8995] minor style fix
f7452fa [Tarek Auel] [SPARK-8995] removed old timestamp parsing
30e5aec [Tarek Auel] [SPARK-8995] date and timestamp cast
c1083fb [Tarek Auel] [SPARK-8995] cast date strings like '2015-01-01 12:15:31' to date or timestamp
based on https://github.com/apache/spark/pull/7348
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7420 from cloud-fan/type-check and squashes the following commits:
7633fa9 [Wenchen Fan] revert
fe169b0 [Wenchen Fan] improve test
03b70da [Wenchen Fan] enhance implicit type cast
- `BinaryType` for `Length`
- `FormatNumber`
Author: Cheng Hao <hao.cheng@intel.com>
Closes#7034 from chenghao-intel/expression and squashes the following commits:
e534b87 [Cheng Hao] python api style issue
601bbf5 [Cheng Hao] add python API support
3ebe288 [Cheng Hao] update as feedback
52274f7 [Cheng Hao] add support for udf_format_number and length for binary
https://issues.apache.org/jira/browse/SPARK-8221
One concern is the result would be negative if the divisor is not positive( i.e pmod(7, -3) ), but the behavior is the same as hive.
Author: zhichao.li <zhichao.li@intel.com>
Closes#6783 from zhichao-li/pmod2 and squashes the following commits:
7083eb9 [zhichao.li] update to the latest type checking
d26dba7 [zhichao.li] add pmod
This patch makes the following changes:
1. ExpectsInputTypes only defines expected input types, but does not perform any implicit type casting.
2. ImplicitCastInputTypes is a new trait that defines both expected input types, as well as performs implicit type casting.
3. BinaryOperator has a new abstract function "inputType", which defines the expected input type for both left/right. Concrete BinaryOperator expressions no longer perform any implicit type casting.
4. For BinaryOperators, convert NullType (i.e. null literals) into some accepted type so BinaryOperators don't need to handle NullTypes.
TODOs needed: fix unit tests for error reporting.
I'm intentionally not changing anything in aggregate expressions because yhuai is doing a big refactoring on that right now.
Author: Reynold Xin <rxin@databricks.com>
Closes#7348 from rxin/typecheck and squashes the following commits:
8fcf814 [Reynold Xin] Fixed ordering of cases.
3bb63e7 [Reynold Xin] Style fix.
f45408f [Reynold Xin] Comment update.
aa7790e [Reynold Xin] Moved RemoveNullTypes into ImplicitTypeCasts.
438ea07 [Reynold Xin] space
d55c9e5 [Reynold Xin] Removes NullTypes.
360d124 [Reynold Xin] Fixed the rule.
fb66657 [Reynold Xin] Convert NullType into some accepted type for BinaryOperators.
2e22330 [Reynold Xin] Fixed unit tests.
4932d57 [Reynold Xin] Style fix.
d061691 [Reynold Xin] Rename existing ExpectsInputTypes -> ImplicitCastInputTypes.
e4727cc [Reynold Xin] BinaryOperator should not be doing implicit cast.
d017861 [Reynold Xin] Improve expression type checking.
JIRA: https://issues.apache.org/jira/browse/SPARK-8800
Previously, we turn to Java BigDecimal's divide with specified ROUNDING_MODE to avoid non-terminating decimal expansion problem. However, as JihongMA reported, for the division operation on some specific values, we get inaccurate results.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#7212 from viirya/fix_decimal4 and squashes the following commits:
4205a0a [Liang-Chi Hsieh] Fix inaccuracy precision/scale of Decimal division operation.
This is a follow up of remaining comments from #6851
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#7387 from adrian-wang/udflgfollow and squashes the following commits:
6163e62 [Daoyuan Wang] add skipping null values
e8c2e09 [Daoyuan Wang] use seq
8362966 [Daoyuan Wang] pr6851 follow up
Author: Vinod K C <vinod.kc@huawei.com>
Closes#7040 from vinodkc/fix_CaseKeyWhen_equalNullSafe and squashes the following commits:
be5e641 [Vinod K C] Renamed equalNullSafe to threeValueEquals
aac9f67 [Vinod K C] Updated test suite and genCode method
f2d0b53 [Vinod K C] Fix equalNullSafe comparison
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7355 from cloud-fan/fromString and squashes the following commits:
3bbb9d6 [Wenchen Fan] fix code gen
7dab957 [Wenchen Fan] naming fix
0fbbe19 [Wenchen Fan] address comments
ac1f3d1 [Wenchen Fan] Support casting between IntervalType and StringType
chenghao-intel zhichao-li qiansl127
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#6851 from adrian-wang/udflg and squashes the following commits:
0f1bff2 [Daoyuan Wang] address comments from davis
7a6bdbb [Daoyuan Wang] add '.' for hex()
c1f6824 [Daoyuan Wang] add codegen, test for all types
ec625b0 [Daoyuan Wang] conditional function: least/greatest
Author: Cheng Hao <hao.cheng@intel.com>
Closes#6762 from chenghao-intel/str_funcs and squashes the following commits:
b09a909 [Cheng Hao] update the code as feedback
7ebbf4c [Cheng Hao] Add more string expressions
I merged https://github.com/apache/spark/pull/7303 so it unblocks another PR. This addresses my own code review comment for that PR.
Author: Reynold Xin <rxin@databricks.com>
Closes#7313 from rxin/adt and squashes the following commits:
7ade82b [Reynold Xin] Fixed unit tests.
f8d5533 [Reynold Xin] [SPARK-8926][SQL] Code review followup.
For example: `cannot resolve 'testfunction(null)' due to data type mismatch: argument 1 is expected to be of type int, however, null is of type datetype.`
Author: Michael Armbrust <michael@databricks.com>
Closes#7303 from marmbrus/expectsTypeErrors and squashes the following commits:
c654a0e [Michael Armbrust] fix udts and make errors pretty
137160d [Michael Armbrust] style
5428fda [Michael Armbrust] style
10fac82 [Michael Armbrust] [SPARK-8926][SQL] Good errors for ExpectsInputType expressions
We call Row.copy() in many places throughout SQL but UnsafeRow currently throws UnsupportedOperationException when copy() is called.
Supporting copying when ObjectPool is used may be difficult, since we may need to handle deep-copying of objects in the pool. In addition, this copy() method needs to produce a self-contained row object which may be passed around / buffered by downstream code which does not understand the UnsafeRow format.
In the long run, we'll need to figure out how to handle the ObjectPool corner cases, but this may be unnecessary if other changes are made. Therefore, in order to unblock my sort patch (#6444) I propose that we support copy() for the cases where UnsafeRow does not use an ObjectPool and continue to throw UnsupportedOperationException when an ObjectPool is used.
This patch accomplishes this by modifying UnsafeRow so that it knows the size of the row's backing data in order to be able to copy it into a byte array.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7306 from JoshRosen/SPARK-8932 and squashes the following commits:
338e6bf [Josh Rosen] Support copy for UnsafeRows that do not use ObjectPools.
JIRA: https://issues.apache.org/jira/browse/SPARK-8866
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes#7283 from yijieshen/micro_timestamp and squashes the following commits:
dc735df [Yijie Shen] update CastSuite to avoid round error
714eaea [Yijie Shen] add timestamp_udf into blacklist due to precision lose
c3ca2f4 [Yijie Shen] fix unhandled case in CurrentTimestamp
8d4aa6b [Yijie Shen] use 1us precision for timestamp type
Author: Reynold Xin <rxin@databricks.com>
Closes#7273 from rxin/bitwise-unittest and squashes the following commits:
60c5667 [Reynold Xin] [SPARK-8878][SQL] Improve unit test coverage for bitwise expressions.
The type alias was there because initially when I moved Row around, I didn't want to do massive changes to the expression code. But now it should be pretty easy to just remove it. One less concept to worry about.
Author: Reynold Xin <rxin@databricks.com>
Closes#7270 from rxin/internalrow and squashes the following commits:
72fc842 [Reynold Xin] [SPARK-8876][SQL] Remove InternalRow type alias in expressions package.
JIRA: https://issues.apache.org/jira/browse/SPARK-8794
Currently `PrunedScan` works only when followed by project or filter operations. However, even if there is a `Sample` between these operations and `PrunedScan`, `PrunedScan` should work too.
Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#7228 from viirya/sample_prunedscan and squashes the following commits:
ede7cd8 [Liang-Chi Hsieh] Keep PrunedScanSuite untouched.
6f05d30 [Liang-Chi Hsieh] Move unit test to FilterPushdownSuite.
5f32473 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into sample_prunedscan
7e4ba76 [Liang-Chi Hsieh] Use Optimzier for push down projection and filter.
0686830 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into sample_prunedscan
df82785 [Liang-Chi Hsieh] Make PrunedScan work on Sample.
This PR adds regression test for https://issues.apache.org/jira/browse/SPARK-8588 (fixed by 457d07eaa0).
Author: Yin Huai <yhuai@databricks.com>
This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>
Closes#7103 from yhuai/SPARK-8588-test and squashes the following commits:
eb5f418 [Yin Huai] Add a query test.
c61a173 [Yin Huai] Regression test for SPARK-8588.
Add Python API for hex/unhex, also cleanup Hex/Unhex
Author: Davies Liu <davies@databricks.com>
Closes#7223 from davies/hex and squashes the following commits:
6f1249d [Davies Liu] no explicit rule to cast string into binary
711a6ed [Davies Liu] fix test
f9fe5a3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into hex
f032fbb [Davies Liu] Merge branch 'hex' of github.com:davies/spark into hex
49e325f [Davies Liu] Merge branch 'master' of github.com:apache/spark into hex
b31fc9a [Davies Liu] Update math.scala
25156b7 [Davies Liu] address comments and fix test
c3af78c [Davies Liu] address commments
1a24082 [Davies Liu] Add Python API for hex and unhex
Otherwise it is impossible to declare an expression supporting DecimalType.
Author: Reynold Xin <rxin@databricks.com>
Closes#7232 from rxin/typecollection-adt and squashes the following commits:
934d3d1 [Reynold Xin] [SPARK-8831][SQL] Support AbstractDataType in TypeCollection.
Author: Reynold Xin <rxin@databricks.com>
Closes#7220 from rxin/SPARK-8822 and squashes the following commits:
0cda076 [Reynold Xin] Test cases.
22d0463 [Reynold Xin] Fixed type precedence.
beb2a97 [Reynold Xin] [SPARK-8822][SQL] clean up type checking in math.scala.
Author: Reynold Xin <rxin@databricks.com>
Closes#7221 from rxin/implicit-cast-tests and squashes the following commits:
64b13bd [Reynold Xin] Fixed a bug ..
489b732 [Reynold Xin] [SQL] More unit tests for implicit type cast & add simpleString to AbstractDataType.
Jira: https://issues.apache.org/jira/browse/SPARK-8270
Info: I can not build the latest master, it stucks during the build process: `[INFO] Dependency-reduced POM written at: /Users/tarek/test/spark/bagel/dependency-reduced-pom.xml`
Author: Tarek Auel <tarek.auel@googlemail.com>
Closes#7214 from tarekauel/SPARK-8270 and squashes the following commits:
ab348b9 [Tarek Auel] Merge branch 'master' into SPARK-8270
a2ad318 [Tarek Auel] [SPARK-8270] changed order of fields
d91b12c [Tarek Auel] [SPARK-8270] python fix
adbd075 [Tarek Auel] [SPARK-8270] fixed typo
23185c9 [Tarek Auel] [SPARK-8270] levenshtein distance
This commit adds a set of random data generation utilities to Spark SQL, for use in its own unit tests.
- `RandomDataGenerator.forType(DataType)` returns an `Option[() => Any]` that, if defined, contains a function for generating random values for the given DataType. The random values use the external representations for the given DataType (for example, for DateType we return `java.sql.Date` instances instead of longs).
- `DateTypeTestUtilities` defines some convenience fields for looping over instances of data types. For example, `numericTypes` holds `DataType` instances for all supported numeric types. These constants will help us to raise the level of abstraction in our tests. For example, it's now very easy to write a test which is parameterized by all common data types.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#7176 from JoshRosen/sql-random-data-generators and squashes the following commits:
f71634d [Josh Rosen] Roll back ScalaCheck usage
e0d7d49 [Josh Rosen] Bump ScalaCheck version in LICENSE
89d86b1 [Josh Rosen] Bump ScalaCheck version.
0c20905 [Josh Rosen] Initial attempt at using ScalaCheck.
b55875a [Josh Rosen] Generate doubles and floats over entire possible range.
5acdd5c [Josh Rosen] Infinity and NaN are interesting.
ab76cbd [Josh Rosen] Move code to Catalyst package.
d2b4a4a [Josh Rosen] Add random data generator test utilities to Spark SQL.
This patch adds a new TypeCollection AbstractDataType that can be used by expressions to specify more than one expected input types.
Author: Reynold Xin <rxin@databricks.com>
Closes#7202 from rxin/type-collection and squashes the following commits:
c714ca1 [Reynold Xin] Fixed style.
a0c0d12 [Reynold Xin] Fixed bugs and unit tests.
d8b8ae7 [Reynold Xin] Added TypeCollection.
Also improve the performance of hex/unhex
Author: Davies Liu <davies@databricks.com>
Closes#7181 from davies/hex and squashes the following commits:
f032fbb [Davies Liu] Merge branch 'hex' of github.com:davies/spark into hex
49e325f [Davies Liu] Merge branch 'master' of github.com:apache/spark into hex
b31fc9a [Davies Liu] Update math.scala
25156b7 [Davies Liu] address comments and fix test
c3af78c [Davies Liu] address commments
1a24082 [Davies Liu] Add Python API for hex and unhex
Author: Reynold Xin <rxin@databricks.com>
Closes#7175 from rxin/implicitCast and squashes the following commits:
88080a2 [Reynold Xin] Clearer definition of implicit type cast.
f0ff97f [Reynold Xin] Added missing file.
c65e532 [Reynold Xin] [SPARK-8772][SQL] Implement implicit type cast for expressions that defines input types.
This is a follow up of [SPARK-8283](https://issues.apache.org/jira/browse/SPARK-8283) ([PR-6828](https://github.com/apache/spark/pull/6828)), to support both `struct` and `named_struct` in Spark SQL.
After [#6725](https://github.com/apache/spark/pull/6828), the semantic of [`CreateStruct`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypes.scala#L56) methods have changed a little and do not limited to cols of `NamedExpressions`, it will name non-NamedExpression fields following the hive convention, col1, col2 ...
This PR would both loosen [`struct`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L723) to take children of `Expression` type and add `named_struct` support.
Author: Yijie Shen <henry.yijieshen@gmail.com>
Closes#6874 from yijieshen/SPARK-8283 and squashes the following commits:
4cd3375ac [Yijie Shen] change struct documentation
d599d0b [Yijie Shen] rebase code
9a7039e [Yijie Shen] fix reviews and regenerate golden answers
b487354 [Yijie Shen] replace assert using checkAnswer
f07e114 [Yijie Shen] tiny fix
9613be9 [Yijie Shen] review fix
7fef712 [Yijie Shen] Fix checkInputTypes' implementation using foldable and nullable
60812a7 [Yijie Shen] Fix type check
828d694 [Yijie Shen] remove unnecessary resolved assertion inside dataType method
fd3cd8e [Yijie Shen] remove type check from eval
7a71255 [Yijie Shen] tiny fix
ccbbd86 [Yijie Shen] Fix reviews
47da332 [Yijie Shen] remove nameStruct API from DataFrame
917e680 [Yijie Shen] Fix reviews
4bd75ad [Yijie Shen] loosen struct method in functions.scala to take Expression children
0acb7be [Yijie Shen] Add CreateNamedStruct in both DataFrame function API and FunctionRegistery
also improve tests for binary comparison.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#7143 from cloud-fan/binary and squashes the following commits:
28a5b76 [Wenchen Fan] improve test
04ef4b0 [Wenchen Fan] fix equalNullSafe
Jira:
https://issues.apache.org/jira/browse/SPARK-8223https://issues.apache.org/jira/browse/SPARK-8224
~~I am aware of #7174 and will update this pr, if it's merged.~~ Done
I don't know if #7034 can simplify this, but we can have a look on it, if it gets merged
rxin In the Jira ticket the function as no second argument. I added a `numBits` argument that allows to specify the number of bits. I guess this improves the usability. I wanted to add `shiftleft(value)` as well, but the `selectExpr` dataframe tests crashes, if I have both. I order to do this, I added the following to the functions.scala `def shiftRight(e: Column): Column = ShiftRight(e.expr, lit(1).expr)`, but as I mentioned this doesn't pass tests like `df.selectExpr("shiftRight(a)", ...` (not enough arguments exception).
If we need the bitwise shift in order to be hive compatible, I suggest to add `shiftLeft` and something like `shiftLeftX`
Author: Tarek Auel <tarek.auel@googlemail.com>
Closes#7178 from tarekauel/8223 and squashes the following commits:
8023bb5 [Tarek Auel] [SPARK-8223][SPARK-8224] fixed test
f3f64e6 [Tarek Auel] [SPARK-8223][SPARK-8224] Integer -> Int
f628706 [Tarek Auel] [SPARK-8223][SPARK-8224] removed toString; updated function description
3b56f2a [Tarek Auel] Merge remote-tracking branch 'origin/master' into 8223
5189690 [Tarek Auel] [SPARK-8223][SPARK-8224] minor fix and style fix
9434a28 [Tarek Auel] Merge remote-tracking branch 'origin/master' into 8223
44ee324 [Tarek Auel] [SPARK-8223][SPARK-8224] docu fix
ac7fe9d [Tarek Auel] [SPARK-8223][SPARK-8224] right and left bit shift
cc chenghao-intel adrian-wang
Author: zhichao.li <zhichao.li@intel.com>
Closes#7113 from zhichao-li/unhex and squashes the following commits:
379356e [zhichao.li] remove exception checking
a4ae6dc [zhichao.li] add udf_unhex to whitelist
fe5c14a [zhichao.li] add todigit
607d7a3 [zhichao.li] use checkInputTypes
bffd37f [zhichao.li] change to use Hex in apache common package
cde73f5 [zhichao.li] update to use AutoCastInputTypes
11945c7 [zhichao.li] style
c852d46 [zhichao.li] Add function unhex
Our current BinaryExpression abstract class is not for generic binary expressions, i.e. it requires left/right children to have the same type. However, due to its name, contributors build new binary expressions that don't have that assumption (e.g. Sha) and still extend BinaryExpression.
This patch creates a new BinaryOperator abstract class, and update the analyzer o only apply type casting rule there. This patch also adds the notion of "prettyName" to expressions, which defines the user-facing name for the expression.
Author: Reynold Xin <rxin@databricks.com>
Closes#7174 from rxin/binary-opterator and squashes the following commits:
f31900d [Reynold Xin] [SPARK-8770][SQL] Create BinaryOperator abstract class.
fceb216 [Reynold Xin] Merge branch 'master' of github.com:apache/spark into binary-opterator
d8518cf [Reynold Xin] Updated Python tests.
Our current BinaryExpression abstract class is not for generic binary expressions, i.e. it requires left/right children to have the same type. However, due to its name, contributors build new binary expressions that don't have that assumption (e.g. Sha) and still extend BinaryExpression.
This patch creates a new BinaryOperator abstract class, and update the analyzer o only apply type casting rule there. This patch also adds the notion of "prettyName" to expressions, which defines the user-facing name for the expression.
Author: Reynold Xin <rxin@databricks.com>
Closes#7170 from rxin/binaryoperator and squashes the following commits:
51264a5 [Reynold Xin] [SPARK-8770][SQL] Create BinaryOperator abstract class.
copy() of generated Row doesn't check nullability of columns
Author: Davies Liu <davies@databricks.com>
Closes#7163 from davies/fix_copy and squashes the following commits:
661a206 [Davies Liu] fix copy of generated row
Moved all the rules into the companion object.
Author: Reynold Xin <rxin@databricks.com>
Closes#7147 from rxin/SPARK-8749 and squashes the following commits:
c1c6dc0 [Reynold Xin] [SPARK-8749][SQL] Remove HiveTypeCoercion trait.
Hi Michael,
this Pull-Request is a follow-up to [PR-6242](https://github.com/apache/spark/pull/6242). I removed the two obsolete test cases from the HiveQuerySuite and deleted the corresponding golden answer files.
Thanks for your review!
Author: Christian Kadner <ckadner@us.ibm.com>
Closes#6983 from ckadner/SPARK-6785 and squashes the following commits:
ab1e79b [Christian Kadner] Merge remote-tracking branch 'origin/SPARK-6785' into SPARK-6785
1fed877 [Christian Kadner] [SPARK-6785][SQL] failed Scala style test, remove spaces on empty line DateTimeUtils.scala:61
9d8021d [Christian Kadner] [SPARK-6785][SQL] merge recent changes in DateTimeUtils & MiscFunctionsSuite
b97c3fb [Christian Kadner] [SPARK-6785][SQL] move test case for DateTimeUtils to DateTimeUtilsSuite
a451184 [Christian Kadner] [SPARK-6785][SQL] fix DateTimeUtils.fromJavaDate(java.util.Date) for Dates before 1970
https://issues.apache.org/jira/browse/SPARK-8236
Author: Shilei <shilei.qian@intel.com>
Closes#7108 from qiansl127/Crc32 and squashes the following commits:
5477352 [Shilei] Change to AutoCastInputTypes
5f16e5d [Shilei] Add misc function crc32
TODO: use array instead of Seq as internal representation for `ArrayType`
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#6982 from cloud-fan/extract-value and squashes the following commits:
e203bc1 [Wenchen Fan] address comments
4da0f0b [Wenchen Fan] some clean up
f679969 [Wenchen Fan] fix bug
e64f942 [Wenchen Fan] remove generic
e3f8427 [Wenchen Fan] fix style and address comments
fc694e8 [Wenchen Fan] add code gen for extract value
This PR brings arbitrary object support in UnsafeRow (both in grouping key and aggregation buffer).
Two object pools will be created to hold those non-primitive objects, and put the index of them into UnsafeRow. In order to compare the grouping key as bytes, the objects in key will be stored in a unique object pool, to make sure same objects will have same index (used as hashCode).
For StringType and BinaryType, we still put them as var-length in UnsafeRow when initializing for better performance. But for update, they will be an object inside object pools (there will be some garbages left in the buffer).
BTW: Will create a JIRA once issue.apache.org is available.
cc JoshRosen rxin
Author: Davies Liu <davies@databricks.com>
Closes#6959 from davies/unsafe_obj and squashes the following commits:
5ce39da [Davies Liu] fix comment
5e797bf [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_obj
5803d64 [Davies Liu] fix conflict
461d304 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_obj
2f41c90 [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_obj
b04d69c [Davies Liu] address comments
4859b80 [Davies Liu] fix comments
f38011c [Davies Liu] add a test for grouping by decimal
d2cf7ab [Davies Liu] add more tests for null checking
71983c5 [Davies Liu] add test for timestamp
e8a1649 [Davies Liu] reuse buffer for string
39f09ca [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_obj
035501e [Davies Liu] fix style
236d6de [Davies Liu] support arbitrary object in UnsafeRow
I've added functionality to create new StructType similar to how we add parameters to a new SparkContext.
I've also added tests for this type of creation.
Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
Closes#6686 from ilganeli/SPARK-8056B and squashes the following commits:
27c1de1 [Ilya Ganelin] Rename
467d836 [Ilya Ganelin] Removed from_string in favor of _parse_Datatype_json_value
5fef5a4 [Ilya Ganelin] Updates for type parsing
4085489 [Ilya Ganelin] Style errors
3670cf5 [Ilya Ganelin] added string to DataType conversion
8109e00 [Ilya Ganelin] Fixed error in tests
41ab686 [Ilya Ganelin] Fixed style errors
e7ba7e0 [Ilya Ganelin] Moved some python tests to tests.py. Added cleaner handling of null data type and added test for correctness of input format
15868fa [Ilya Ganelin] Fixed python errors
b79b992 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-8056B
a3369fc [Ilya Ganelin] Fixing space errors
e240040 [Ilya Ganelin] Style
bab7823 [Ilya Ganelin] Constructor error
73d4677 [Ilya Ganelin] Style
4ed00d9 [Ilya Ganelin] Fixed default arg
67df57a [Ilya Ganelin] Removed Foo
04cbf0c [Ilya Ganelin] Added comments for single object
0484d7a [Ilya Ganelin] Restored second method
6aeb740 [Ilya Ganelin] Style
689e54d [Ilya Ganelin] Style
f497e9e [Ilya Ganelin] Got rid of old code
e3c7a88 [Ilya Ganelin] Fixed doctest failure
a62ccde [Ilya Ganelin] Style
966ac06 [Ilya Ganelin] style checks
dabb7e6 [Ilya Ganelin] Added Python tests
a3f4152 [Ilya Ganelin] added python bindings and better comments
e6e536c [Ilya Ganelin] Added extra space
7529a2e [Ilya Ganelin] Fixed formatting
d388f86 [Ilya Ganelin] Fixed small bug
c4e3bf5 [Ilya Ganelin] Reverted to using parse. Updated parse to support long
d7634b6 [Ilya Ganelin] Reverted to fromString to properly support types
22c39d5 [Ilya Ganelin] replaced FromString with DataTypeParser.parse. Replaced empty constructor initializing a null to have it instead create a new array to allow appends to it.
faca398 [Ilya Ganelin] [SPARK-8056] Replaced default argument usage. Updated usage and code for DataType.fromString
1acf76e [Ilya Ganelin] Scala style
e31c674 [Ilya Ganelin] Fixed bug in test
8dc0795 [Ilya Ganelin] Added tests for creation of StructType object with new methods
fdf7e9f [Ilya Ganelin] [SPARK-8056] Created add methods to facilitate building new StructType objects.
cc chenghao-intel adrian-wang
Author: zhichao.li <zhichao.li@intel.com>
Closes#6976 from zhichao-li/hex and squashes the following commits:
e218d1b [zhichao.li] turn off scalastyle for non-ascii
de3f5ea [zhichao.li] non-ascii char
cf9c936 [zhichao.li] give separated buffer for each hex method
967ec90 [zhichao.li] Make 'value' as a feild of Hex
3b2fa13 [zhichao.li] tiny fix
a647641 [zhichao.li] remove duplicate null check
7cab020 [zhichao.li] tiny refactoring
35ecfe5 [zhichao.li] add function hex
Jira: https://issues.apache.org/jira/browse/SPARK-8235
I added the support for sha1. If I understood rxin correctly, sha and sha1 should execute the same algorithm, shouldn't they?
Please take a close look on the Python part. This is adopted from #6934
Author: Tarek Auel <tarek.auel@gmail.com>
Author: Tarek Auel <tarek.auel@googlemail.com>
Closes#6963 from tarekauel/SPARK-8235 and squashes the following commits:
f064563 [Tarek Auel] change to shaHex
7ce3cdc [Tarek Auel] rely on automatic cast
a1251d6 [Tarek Auel] Merge remote-tracking branch 'upstream/master' into SPARK-8235
68eb043 [Tarek Auel] added docstring
be5aff1 [Tarek Auel] improved error message
7336c96 [Tarek Auel] added type check
cf23a80 [Tarek Auel] simplified example
ebf75ef [Tarek Auel] [SPARK-8301] updated the python documentation. Removed sha in python and scala
6d6ff0d [Tarek Auel] [SPARK-8233] added docstring
ea191a9 [Tarek Auel] [SPARK-8233] fixed signatureof python function. Added expected type to misc
e3fd7c3 [Tarek Auel] SPARK[8235] added sha to the list of __all__
e5dad4e [Tarek Auel] SPARK[8235] sha / sha1
JIRA: https://issues.apache.org/jira/browse/SPARK-8677
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#7056 from viirya/fix_decimal3 and squashes the following commits:
34d7419 [Liang-Chi Hsieh] Fix Non-terminating decimal expansion for decimal divide operation.
Currently, we use GenericRow both for Row and InternalRow, which is confusing because it could contain Scala type also Catalyst types.
This PR changes to use GenericInternalRow for InternalRow (contains catalyst types), GenericRow for Row (contains Scala types).
Also fixes some incorrect use of InternalRow or Row.
Author: Davies Liu <davies@databricks.com>
Closes#7003 from davies/internalrow and squashes the following commits:
d05866c [Davies Liu] fix test: rollback changes for pyspark
72878dd [Davies Liu] Merge branch 'master' of github.com:apache/spark into internalrow
efd0b25 [Davies Liu] fix copy of MutableRow
87b13cf [Davies Liu] fix test
d2ebd72 [Davies Liu] fix style
eb4b473 [Davies Liu] mark expensive API as final
bd4e99c [Davies Liu] Merge branch 'master' of github.com:apache/spark into internalrow
bdfb78f [Davies Liu] remove BaseMutableRow
6f99a97 [Davies Liu] fix catalyst test
defe931 [Davies Liu] remove BaseRow
288b31f [Davies Liu] Merge branch 'master' of github.com:apache/spark into internalrow
9d24350 [Davies Liu] separate Row and InternalRow (part 2)
a follow up of https://github.com/apache/spark/pull/6405.
Note: It's not a big change, a lot of changing is due to I swap some code in `aggregates.scala` to make aggregate functions right below its corresponding aggregate expressions.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#6723 from cloud-fan/type-check and squashes the following commits:
2124301 [Wenchen Fan] fix tests
5a658bb [Wenchen Fan] add tests
287d3bb [Wenchen Fan] apply type check interface to more expressions
make the `TakeOrdered` strategy and operator more general, such that it can optionally handle a projection when necessary
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#6780 from cloud-fan/limit and squashes the following commits:
34aa07b [Wenchen Fan] revert
07d5456 [Wenchen Fan] clean closure
20821ec [Wenchen Fan] fix
3676a82 [Wenchen Fan] address comments
b558549 [Wenchen Fan] address comments
214842b [Wenchen Fan] fix style
2d8be83 [Wenchen Fan] add LimitPushDown
948f740 [Wenchen Fan] fix existing
a follow up of https://github.com/apache/spark/pull/6813
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#6825 from cloud-fan/cg and squashes the following commits:
43170cc [Wenchen Fan] fix bugs in code gen
Also added more tests in LiteralExpressionSuite
Author: Davies Liu <davies@databricks.com>
Closes#6876 from davies/fix_hashcode and squashes the following commits:
429c2c0 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_hashcode
32d9811 [Davies Liu] fix test
a0626ed [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_hashcode
89c2432 [Davies Liu] fix style
bd20780 [Davies Liu] check with catalyst types
41caec6 [Davies Liu] change for to while
d96929b [Davies Liu] address comment
6ad2a90 [Davies Liu] fix style
5819d33 [Davies Liu] unify equals() and hashCode()
0fff25d [Davies Liu] fix style
53c38b1 [Davies Liu] fix hashCode() and equals() of BinaryType in Row
JIRA: https://issues.apache.org/jira/browse/SPARK-8359
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#6814 from viirya/fix_decimal2 and squashes the following commits:
071a757 [Liang-Chi Hsieh] Remove maximum precision and use MathContext.UNLIMITED.
df217d4 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into fix_decimal2
a43bfc3 [Liang-Chi Hsieh] Add MathContext with maximum supported precision.
72eeb3f [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into fix_decimal2
44c9348 [Liang-Chi Hsieh] Fix incorrect decimal precision after multiplication.
first convert `ordinal` to `Number`, then convert to int type.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#5706 from cloud-fan/7153 and squashes the following commits:
915db79 [Wenchen Fan] fix 7153
Support BinaryType in UnsafeRow, just like StringType.
Also change the layout of StringType and BinaryType in UnsafeRow, by combining offset and size together as Long, which will limit the size of Row to under 2G (given that fact that any single buffer can not be bigger than 2G in JVM).
Author: Davies Liu <davies@databricks.com>
Closes#6911 from davies/unsafe_bin and squashes the following commits:
d68706f [Davies Liu] update comment
519f698 [Davies Liu] address comment
98a964b [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_bin
180b49d [Davies Liu] fix zero-out
22e4c0a [Davies Liu] zero-out padding bytes
6abfe93 [Davies Liu] fix style
447dea0 [Davies Liu] support binaryType in UnsafeRow
Author: Shilei <shilei.qian@intel.com>
Closes#6779 from qiansl127/MD5 and squashes the following commits:
11fcdb2 [Shilei] Fix the indent
04bd27b [Shilei] Add codegen
da60eb3 [Shilei] Remove checkInputDataTypes function
9509ad0 [Shilei] Format code
12c61f4 [Shilei] Accept only BinaryType for Md5
1df0b5b [Shilei] format to scala type
60ccde1 [Shilei] Add more test case
b8c73b4 [Shilei] Rewrite the type check for Md5
c166167 [Shilei] Add md5 function
JIRA: https://issues.apache.org/jira/browse/SPARK-8363
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#6823 from viirya/move_sqrt and squashes the following commits:
8977e11 [Liang-Chi Hsieh] Remove unnecessary old tests.
d23e79e [Liang-Chi Hsieh] Explicitly indicate sqrt value sequence.
699f48b [Liang-Chi Hsieh] Use correct @since tag.
8dff6d1 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into move_sqrt
bc2ed77 [Liang-Chi Hsieh] Remove/move arithmetic expression test and expression type checking test. Remove unnecessary Sqrt type rule.
d38492f [Liang-Chi Hsieh] Now sqrt accepts boolean because type casting is handled by HiveTypeCoercion.
297cc90 [Liang-Chi Hsieh] Sqrt only accepts double input.
ef4a21a [Liang-Chi Hsieh] Move sqrt to math.
JIRA: https://issues.apache.org/jira/browse/SPARK-7199
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#5984 from viirya/add_date_timestamp and squashes the following commits:
7f21ce9 [Liang-Chi Hsieh] For comment.
0b89698 [Liang-Chi Hsieh] Add timestamp to settableFieldTypes.
c30d490 [Liang-Chi Hsieh] Use default IntUnsafeColumnWriter and LongUnsafeColumnWriter.
672ef17 [Liang-Chi Hsieh] Remove getter/setter for Date and Timestamp and use Int and Long for them.
9f3e577 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_date_timestamp
281e844 [Liang-Chi Hsieh] Fix scala style.
fb532b5 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_date_timestamp
80af342 [Liang-Chi Hsieh] Fix compiling error.
f4f5de6 [Liang-Chi Hsieh] Fix scala style.
a463e83 [Liang-Chi Hsieh] Use Long to store timestamp for rows.
635388a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_date_timestamp
46946c6 [Liang-Chi Hsieh] Adapt for moved DateUtils.
b16994e [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into add_date_timestamp
752251f [Liang-Chi Hsieh] Support setDate. Fix failed test.
fcf8db9 [Liang-Chi Hsieh] Add functions for Date and Timestamp to SpecificRow.
e42a809 [Liang-Chi Hsieh] Fix style.
4c07b57 [Liang-Chi Hsieh] Add date and timestamp support to UnsafeRow.
In order to have better performance out of box, this PR turn on codegen by default, then codegen can be tested by sql/test and hive/test.
This PR also fix some corner cases for codegen.
Before 1.5 release, we should re-visit this, turn it off if it's not stable or causing regressions.
cc rxin JoshRosen
Author: Davies Liu <davies@databricks.com>
Closes#6726 from davies/enable_codegen and squashes the following commits:
f3b25a5 [Davies Liu] fix warning
73750ea [Davies Liu] fix long overflow when compare
3017a47 [Davies Liu] Merge branch 'master' of github.com:apache/spark into enable_codegen
a7d75da [Davies Liu] Merge branch 'master' of github.com:apache/spark into enable_codegen
ff5b75a [Davies Liu] Merge branch 'master' of github.com:apache/spark into enable_codegen
f4cf2c2 [Davies Liu] fix style
99fc139 [Davies Liu] Merge branch 'enable_codegen' of github.com:davies/spark into enable_codegen
91fc7a2 [Davies Liu] disable codegen for ScalaUDF
207e339 [Davies Liu] Update CodeGenerator.scala
44573a3 [Davies Liu] check thread safety of expression
f3886fa [Davies Liu] don't inline primitiveTerm for null literal
c8e7cd2 [Davies Liu] address comment
a8618c9 [Davies Liu] enable codegen by default
This PR fixes the problem reported by Justin Yip in the thread 'NullPointerException with functions.rand()'
Tested using spark-shell and verified that the following works:
sqlContext.createDataFrame(Seq((1,2), (3, 100))).withColumn("index", rand(30)).show()
Author: tedyu <yuzhihong@gmail.com>
Closes#6793 from tedyu/master and squashes the following commits:
62fd97b [tedyu] Create RandomSuite
750f92c [tedyu] Add test for Rand() with seed
a1d66c5 [tedyu] Fix NullPointerException with functions.rand()
Added unit tests for all supported data types for:
- Add
- Subtract
- Multiply
- Divide
- UnaryMinus
- Remainder
Fixed bugs caught by the unit tests.
Author: Reynold Xin <rxin@databricks.com>
Closes#6813 from rxin/SPARK-8362 and squashes the following commits:
fb3fe62 [Reynold Xin] Added Remainder.
3b266ba [Reynold Xin] [SPARK-8362] Add unit tests for +, -, *, /.
Also addressed code review feedback from #6754
Author: Reynold Xin <rxin@databricks.com>
Closes#6803 from rxin/abs and squashes the following commits:
d07beba [Reynold Xin] [SPARK-8347] Add unit tests for abs.
Currently, we use o.a.s.sql.Row both internally and externally. The external interface is wider than what the internal needs because it is designed to facilitate end-user programming. This design has proven to be very error prone and cumbersome for internal Row implementations.
As a first step, we create an InternalRow interface in the catalyst module, which is identical to the current Row interface. And we switch all internal operators/expressions to use this InternalRow instead. When we need to expose Row, we convert the InternalRow implementation into Row for users.
For all public API, we use Row (for example, data source APIs), which will be converted into/from InternalRow by CatalystTypeConverters.
For all internal data sources (Json, Parquet, JDBC, Hive), we use InternalRow for better performance, casted into Row in buildScan() (without change the public API). When create a PhysicalRDD, we cast them back to InternalRow.
cc rxin marmbrus JoshRosen
Author: Davies Liu <davies@databricks.com>
Closes#6792 from davies/internal_row and squashes the following commits:
f2abd13 [Davies Liu] fix scalastyle
a7e025c [Davies Liu] move InternalRow into catalyst
30db8ba [Davies Liu] Merge branch 'master' of github.com:apache/spark into internal_row
7cbced8 [Davies Liu] separate Row and InternalRow
Unit test is still in Scala.
Author: Reynold Xin <rxin@databricks.com>
Closes#6738 from rxin/utf8string-java and squashes the following commits:
562dc6e [Reynold Xin] Flag...
98e600b [Reynold Xin] Another try with encoding setting ..
cfa6bdf [Reynold Xin] Merge branch 'master' into utf8string-java
a3b124d [Reynold Xin] Try different UTF-8 encoded characters.
1ff7c82 [Reynold Xin] Enable UTF-8 encoding.
82d58cc [Reynold Xin] Reset run-tests.
2cb3c69 [Reynold Xin] Use utf-8 encoding in set bytes.
53f8ef4 [Reynold Xin] Hack Jenkins to run one test.
9a48e8d [Reynold Xin] Fixed runtime compilation error.
911c450 [Reynold Xin] Moved unit test also to Java.
4eff7bd [Reynold Xin] Improved unit test coverage.
8e89a3c [Reynold Xin] Fixed tests.
77c64bd [Reynold Xin] Fixed string type codegen.
ffedb62 [Reynold Xin] Code review feedback.
0967ce6 [Reynold Xin] Fixed import ordering.
45a123d [Reynold Xin] [SPARK-8286] Rewrite UTF8String in Java and move it into unsafe package.
This PR fix a few small issues about codgen:
1. cast decimal to boolean
2. do not inline literal with null
3. improve SpecificRow.equals()
4. test expressions with optimized express
5. fix compare with BinaryType
cc rxin chenghao-intel
Author: Davies Liu <davies@databricks.com>
Closes#6755 from davies/fix_codegen and squashes the following commits:
ef27343 [Davies Liu] address comments
6617ea6 [Davies Liu] fix scala tyle
70b7dda [Davies Liu] improve codegen
Author: Daoyuan Wang <daoyuan.wang@intel.com>
This patch had conflicts when merged, resolved by
Committer: Reynold Xin <rxin@databricks.com>
Closes#6718 from adrian-wang/udflog2 and squashes the following commits:
3909f48 [Daoyuan Wang] math function: log2
Author: Cheng Hao <hao.cheng@intel.com>
Closes#6724 from chenghao-intel/length and squashes the following commits:
aaa3c31 [Cheng Hao] revert the additional change
97148a9 [Cheng Hao] remove the codegen testing temporally
ae08003 [Cheng Hao] update the comments
1eb1fd1 [Cheng Hao] simplify the code as commented
3e92d32 [Cheng Hao] use the selectExpr in unit test intead of SQLQuery
3c729aa [Cheng Hao] fix bug for constant null value in codegen
3641f06 [Cheng Hao] keep the length() method for registered function
8e30171 [Cheng Hao] update the code as comment
db604ae [Cheng Hao] Add code gen support
548d2ef [Cheng Hao] register the length()
09a0738 [Cheng Hao] add length support
Currently we only support `Seq[Expression]`, we should handle cases like `Seq[Seq[Expression]]` so that we can remove the unnecessary `GroupExpression`.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#6706 from cloud-fan/clean and squashes the following commits:
60a1193 [Wenchen Fan] support nested expression sequence and remove GroupExpression
This PR change to use Long as internal type for TimestampType for efficiency, which means it will the precision below 100ns.
Author: Davies Liu <davies@databricks.com>
Closes#6733 from davies/timestamp and squashes the following commits:
d9565fa [Davies Liu] remove print
65cf2f1 [Davies Liu] fix Timestamp in SparkR
86fecfb [Davies Liu] disable two timestamp tests
8f77ee0 [Davies Liu] fix scala style
246ee74 [Davies Liu] address comments
309d2e1 [Davies Liu] use Long for TimestampType in SQL
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes#6716 from adrian-wang/epi and squashes the following commits:
e2e8dbd [Daoyuan Wang] move tests
11b351c [Daoyuan Wang] add tests and remove pu
db331c9 [Daoyuan Wang] py style
599ddd8 [Daoyuan Wang] add py
e6783ef [Daoyuan Wang] register function
82d426e [Daoyuan Wang] add function entry
dbf3ab5 [Daoyuan Wang] add PI and E
Also moved a few files in expressions package around to match test suites.
Author: Reynold Xin <rxin@databricks.com>
Closes#6693 from rxin/expr-refactoring and squashes the following commits:
857599f [Reynold Xin] Fixed style violation.
c0eb74b [Reynold Xin] Fixed compilation.
b3a40f8 [Reynold Xin] Refactored expression test suites.
This PR move codegen implementation of expressions into Expression class itself, make it easy to manage.
It introduces two APIs in Expression:
```
def gen(ctx: CodeGenContext): GeneratedExpressionCode
def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): Code
```
gen(ctx) will call genSource(ctx, ev) to generate Java source code for the current expression. A expression needs to override genSource().
Here are the types:
```
type Term String
type Code String
/**
* Java source for evaluating an [[Expression]] given a [[Row]] of input.
*/
case class GeneratedExpressionCode(var code: Code,
nullTerm: Term,
primitiveTerm: Term,
objectTerm: Term)
/**
* A context for codegen, which is used to bookkeeping the expressions those are not supported
* by codegen, then they are evaluated directly. The unsupported expression is appended at the
* end of `references`, the position of it is kept in the code, used to access and evaluate it.
*/
class CodeGenContext {
/**
* Holding all the expressions those do not support codegen, will be evaluated directly.
*/
val references: Seq[Expression] = new mutable.ArrayBuffer[Expression]()
}
```
This is basically #6660, but fixed style violation and compilation failure.
Author: Davies Liu <davies@databricks.com>
Author: Reynold Xin <rxin@databricks.com>
Closes#6690 from rxin/codegen and squashes the following commits:
e1368c2 [Reynold Xin] Fixed tests.
73db80e [Reynold Xin] Fixed compilation failure.
19d6435 [Reynold Xin] Fixed style violation.
9adaeaf [Davies Liu] address comments
f42c732 [Davies Liu] improve coverage and tests
bad6828 [Davies Liu] address comments
e03edaa [Davies Liu] consts fold
86fac2c [Davies Liu] fix style
02262c9 [Davies Liu] address comments
b5d3617 [Davies Liu] Merge pull request #5 from rxin/codegen
48c454f [Reynold Xin] Some code gen update.
2344bc0 [Davies Liu] fix test
12ff88a [Davies Liu] fix build
c5fb514 [Davies Liu] rename
8c6d82d [Davies Liu] update docs
b145047 [Davies Liu] fix style
e57959d [Davies Liu] add type alias
3ff25f8 [Davies Liu] refactor
593d617 [Davies Liu] pushing codegen into Expression
This PR fixes a bug introduced in https://github.com/apache/spark/pull/6505.
Decimal literal's value is not `java.math.BigDecimal`, but Spark SQL internal type: `Decimal`.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#6574 from cloud-fan/fix and squashes the following commits:
b0e3549 [Wenchen Fan] rename to BooleanEquality
1987b37 [Wenchen Fan] use Decimal instead of java.math.BigDecimal
f93c420 [Wenchen Fan] compare literal
This patch replaces Distinct with Aggregate in the optimizer, so Distinct will become
more efficient over time as we optimize Aggregate (via Tungsten).
Author: Reynold Xin <rxin@databricks.com>
Closes#6637 from rxin/replace-distinct and squashes the following commits:
b3cc50e [Reynold Xin] Mima excludes.
93d6117 [Reynold Xin] Code review feedback.
87e4741 [Reynold Xin] [SPARK-7440][SQL] Remove physical Distinct operator in favor of Aggregate.
In order to reduce the overhead of codegen, this PR switch to use Janino to compile SQL expressions into bytecode.
After this, the time used to compile a SQL expression is decreased from 100ms to 5ms, which is necessary to turn on codegen for general workload, also tests.
cc rxin
Author: Davies Liu <davies@databricks.com>
Closes#6479 from davies/janino and squashes the following commits:
cc689f5 [Davies Liu] remove globalLock
262d848 [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
eec3a33 [Davies Liu] address comments from Josh
f37c8c3 [Davies Liu] fix DecimalType and cast to String
202298b [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
a21e968 [Davies Liu] fix style
0ed3dc6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
551a851 [Davies Liu] fix tests
c3bdffa [Davies Liu] remove print
6089ce5 [Davies Liu] change logging level
7e46ac3 [Davies Liu] fix style
d8f0f6c [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
da4926a [Davies Liu] fix tests
03660f3 [Davies Liu] WIP: use Janino to compile Java source
f2629cd [Davies Liu] Merge branch 'master' of github.com:apache/spark into janino
f7d66cf [Davies Liu] use template based string for codegen
It seems hard to find a common pattern of checking types in `Expression`. Sometimes we know what input types we need(like `And`, we know we need two booleans), sometimes we just have some rules(like `Add`, we need 2 numeric types which are equal). So I defined a general interface `checkInputDataTypes` in `Expression` which returns a `TypeCheckResult`. `TypeCheckResult` can tell whether this expression passes the type checking or what the type mismatch is.
This PR mainly works on apply input types checking for arithmetic and predicate expressions.
TODO: apply type checking interface to more expressions.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#6405 from cloud-fan/6444 and squashes the following commits:
b5ff31b [Wenchen Fan] address comments
b917275 [Wenchen Fan] rebase
39929d9 [Wenchen Fan] add todo
0808fd2 [Wenchen Fan] make constrcutor of TypeCheckResult private
3bee157 [Wenchen Fan] and decimal type coercion rule for binary comparison
8883025 [Wenchen Fan] apply type check interface to CaseWhen
cffb67c [Wenchen Fan] to have resolved call the data type check function
6eaadff [Wenchen Fan] add equal type constraint to EqualTo
3affbd8 [Wenchen Fan] more fixes
654d46a [Wenchen Fan] improve tests
e0a3628 [Wenchen Fan] improve error message
1524ff6 [Wenchen Fan] fix style
69ca3fe [Wenchen Fan] add error message and tests
c71d02c [Wenchen Fan] fix hive tests
6491721 [Wenchen Fan] use value class TypeCheckResult
7ae76b9 [Wenchen Fan] address comments
cb77e4f [Wenchen Fan] Improve error reporting for expression data type mismatch
This patch significantly refactors CatalystTypeConverters to both clean up the code and enable these conversions to work with future Project Tungsten features.
At a high level, I've reorganized the code so that all functions dealing with the same type are grouped together into type-specific subclasses of `CatalystTypeConveter`. In addition, I've added new methods that allow the Catalyst Row -> Scala Row conversions to access the Catalyst row's fields through type-specific `getTYPE()` methods rather than the generic `get()` / `Row.apply` methods. This refactoring is a blocker to being able to unit test new operators that I'm developing as part of Project Tungsten, since those operators may output `UnsafeRow` instances which don't support the generic `get()`.
The stricter type usage of types here has uncovered some bugs in other parts of Spark SQL:
- #6217: DescribeCommand is assigned wrong output attributes in SparkStrategies
- #6218: DataFrame.describe() should cast all aggregates to String
- #6400: Use output schema, not relation schema, for data source input conversion
Spark SQL current has undefined behavior for what happens when you try to create a DataFrame from user-specified rows whose values don't match the declared schema. According to the `createDataFrame()` Scaladoc:
> It is important to make sure that the structure of every [[Row]] of the provided RDD matches the provided schema. Otherwise, there will be runtime exception.
Given this, it sounds like it's technically not a break of our API contract to fail-fast when the data types don't match. However, there appear to be many cases where we don't fail even though the types don't match. For example, `JavaHashingTFSuite.hasingTF` passes a column of integers values for a "label" column which is supposed to contain floats. This column isn't actually read or modified as part of query processing, so its actual concrete type doesn't seem to matter. In other cases, there could be situations where we have generic numeric aggregates that tolerate being called with different numeric types than the schema specified, but this can be okay due to numeric conversions.
In the long run, we will probably want to come up with precise semantics for implicit type conversions / widening when converting Java / Scala rows to Catalyst rows. Until then, though, I think that failing fast with a ClassCastException is a reasonable behavior; this is the approach taken in this patch. Note that certain optimizations in the inbound conversion functions for primitive types mean that we'll probably preserve the old undefined behavior in a majority of cases.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#6222 from JoshRosen/catalyst-converters-refactoring and squashes the following commits:
740341b [Josh Rosen] Optimize method dispatch for primitive type conversions
befc613 [Josh Rosen] Add tests to document Option-handling behavior.
5989593 [Josh Rosen] Use new SparkFunSuite base in CatalystTypeConvertersSuite
6edf7f8 [Josh Rosen] Re-add convertToScala(), since a Hive test still needs it
3f7b2d8 [Josh Rosen] Initialize converters lazily so that the attributes are resolved first
6ad0ebb [Josh Rosen] Fix JavaHashingTFSuite ClassCastException
677ff27 [Josh Rosen] Fix null handling bug; add tests.
8033d4c [Josh Rosen] Fix serialization error in UserDefinedGenerator.
85bba9d [Josh Rosen] Fix wrong input data in InMemoryColumnarQuerySuite
9c0e4e1 [Josh Rosen] Remove last use of convertToScala().
ae3278d [Josh Rosen] Throw ClassCastException errors during inbound conversions.
7ca7fcb [Josh Rosen] Comments and cleanup
1e87a45 [Josh Rosen] WIP refactoring of CatalystTypeConverters
This closes#6570.
Author: Yin Huai <yhuai@databricks.com>
Author: Reynold Xin <rxin@databricks.com>
Closes#6573 from rxin/deterministic and squashes the following commits:
356cd22 [Reynold Xin] Added unit test for the optimizer.
da3fde1 [Reynold Xin] Merge pull request #6570 from yhuai/SPARK-8023
da56200 [Yin Huai] Comments.
e38f264 [Yin Huai] Comment.
f9d6a73 [Yin Huai] Add a deterministic method to Expression.
The origin code has several problems:
* `true <=> 1` will return false as we didn't set a rule to handle it.
* `true = a` where `a` is not `Literal` and its value is 1, will return false as we only handle literal values.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#6505 from cloud-fan/tmp1 and squashes the following commits:
77f0f39 [Wenchen Fan] minor fix
b6401ba [Wenchen Fan] add type coercion for CaseKeyWhen and address comments
ebc8c61 [Wenchen Fan] use SQLTestUtils and If
625973c [Wenchen Fan] improve
9ba2130 [Wenchen Fan] address comments
fc0d741 [Wenchen Fan] fix style
2846a04 [Wenchen Fan] fix 7952
Author: Reynold Xin <rxin@databricks.com>
Closes#6535 from rxin/whitespace-sql and squashes the following commits:
de50316 [Reynold Xin] [SPARK-3850] Trim trailing spaces for SQL.
We have defined these logics in `Cast` already, I think we should remove this rule.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#6516 from cloud-fan/tmp2 and squashes the following commits:
d5035a4 [Wenchen Fan] remove useless rule
Right now `unit-tests.log` are not of much value because we can't tell where the test boundaries are easily. This patch adds log statements before and after each test to outline the test boundaries, e.g.:
```
===== TEST OUTPUT FOR o.a.s.serializer.KryoSerializerSuite: 'kryo with parallelize for primitive arrays' =====
15/05/27 12:36:39.596 pool-1-thread-1-ScalaTest-running-KryoSerializerSuite INFO SparkContext: Starting job: count at KryoSerializerSuite.scala:230
15/05/27 12:36:39.596 dag-scheduler-event-loop INFO DAGScheduler: Got job 3 (count at KryoSerializerSuite.scala:230) with 4 output partitions (allowLocal=false)
15/05/27 12:36:39.596 dag-scheduler-event-loop INFO DAGScheduler: Final stage: ResultStage 3(count at KryoSerializerSuite.scala:230)
15/05/27 12:36:39.596 dag-scheduler-event-loop INFO DAGScheduler: Parents of final stage: List()
15/05/27 12:36:39.597 dag-scheduler-event-loop INFO DAGScheduler: Missing parents: List()
15/05/27 12:36:39.597 dag-scheduler-event-loop INFO DAGScheduler: Submitting ResultStage 3 (ParallelCollectionRDD[5] at parallelize at KryoSerializerSuite.scala:230), which has no missing parents
...
15/05/27 12:36:39.624 pool-1-thread-1-ScalaTest-running-KryoSerializerSuite INFO DAGScheduler: Job 3 finished: count at KryoSerializerSuite.scala:230, took 0.028563 s
15/05/27 12:36:39.625 pool-1-thread-1-ScalaTest-running-KryoSerializerSuite INFO KryoSerializerSuite:
***** FINISHED o.a.s.serializer.KryoSerializerSuite: 'kryo with parallelize for primitive arrays' *****
...
```
Author: Andrew Or <andrew@databricks.com>
Closes#6441 from andrewor14/demarcate-tests and squashes the following commits:
879b060 [Andrew Or] Fix compile after rebase
d622af7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
017c8ba [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
7790b6c [Andrew Or] Fix tests after logical merge conflict
c7460c0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
c43ffc4 [Andrew Or] Fix tests?
8882581 [Andrew Or] Fix tests
ee22cda [Andrew Or] Fix log message
fa9450e [Andrew Or] Merge branch 'master' of github.com:apache/spark into demarcate-tests
12d1e1b [Andrew Or] Various whitespace changes (minor)
69cbb24 [Andrew Or] Make all test suites extend SparkFunSuite instead of FunSuite
bbce12e [Andrew Or] Fix manual things that cannot be covered through automation
da0b12f [Andrew Or] Add core tests as dependencies in all modules
f7d29ce [Andrew Or] Introduce base abstract class for all test suites
So we can enable a whitespace enforcement rule in the style checker to save code review time.
Author: Reynold Xin <rxin@databricks.com>
Closes#6476 from rxin/whitespace-catalyst and squashes the following commits:
650409d [Reynold Xin] Fixed tests.
51a9e5d [Reynold Xin] [SPARK-7927] whitespace fixes for Catalyst module.
This type is not really used. Might as well remove it.
Author: Reynold Xin <rxin@databricks.com>
Closes#6427 from rxin/evalutedType and squashes the following commits:
51a319a [Reynold Xin] [SPARK-7887][SQL] Remove EvaluatedType from SQL Expression.
Contribution is my original work and I license the work to the project under the projects open source license.
Author: rowan <rowan.chattaway@googlemail.com>
Closes#6259 from rowan000/SPARK-7637 and squashes the following commits:
c479df4 [rowan] SPARK-7637: rename mapFields to fieldsMap as per comments on github.
8d2e419 [rowan] SPARK-7637: fix up whitespace changes
0e9d662 [rowan] SPARK-7637: O(N) merge implementatio for StructType merge
Author: Michael Armbrust <michael@databricks.com>
Closes#6363 from marmbrus/windowErrors and squashes the following commits:
516b02d [Michael Armbrust] [SPARK-7834] [SQL] Better window error messages
Add an `explode` function for dataframes and modify the analyzer so that single table generating functions can be present in a select clause along with other expressions. There are currently the following restrictions:
- only top level TGFs are allowed (i.e. no `select(explode('list) + 1)`)
- only one may be present in a single select to avoid potentially confusing implicit Cartesian products.
TODO:
- [ ] Python
Author: Michael Armbrust <michael@databricks.com>
Closes#6107 from marmbrus/explodeFunction and squashes the following commits:
7ee2c87 [Michael Armbrust] whitespace
6f80ba3 [Michael Armbrust] Update dataframe.py
c176c89 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explodeFunction
81b5da3 [Michael Armbrust] style
d3faa05 [Michael Armbrust] fix self join case
f9e1e3e [Michael Armbrust] fix python, add since
4f0d0a9 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explodeFunction
e710fe4 [Michael Armbrust] add java and python
52ca0dc [Michael Armbrust] [SPARK-7548][SQL] Add explode function for dataframes.
JavaTypeInference into catalyst
types.DateUtils into catalyst
CacheManager into execution
DefaultParserDialect into catalyst
Author: Reynold Xin <rxin@databricks.com>
Closes#6108 from rxin/sql-rename and squashes the following commits:
3fc9613 [Reynold Xin] Fixed import ordering.
83d9ff4 [Reynold Xin] Fixed codegen tests.
e271e86 [Reynold Xin] mima
f4e24a6 [Reynold Xin] [SQL] Move some classes into packages that are more appropriate.
Optimize the case of `project(_, sort)` , a example is:
`select key from (select * from testData order by key) t`
before this PR:
```
== Parsed Logical Plan ==
'Project ['key]
'Subquery t
'Sort ['key ASC], true
'Project [*]
'UnresolvedRelation [testData], None
== Analyzed Logical Plan ==
Project [key#0]
Subquery t
Sort [key#0 ASC], true
Project [key#0,value#1]
Subquery testData
LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
== Optimized Logical Plan ==
Project [key#0]
Sort [key#0 ASC], true
LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
== Physical Plan ==
Project [key#0]
Sort [key#0 ASC], true
Exchange (RangePartitioning [key#0 ASC], 5), []
PhysicalRDD [key#0,value#1], MapPartitionsRDD[1]
```
after this PR
```
== Parsed Logical Plan ==
'Project ['key]
'Subquery t
'Sort ['key ASC], true
'Project [*]
'UnresolvedRelation [testData], None
== Analyzed Logical Plan ==
Project [key#0]
Subquery t
Sort [key#0 ASC], true
Project [key#0,value#1]
Subquery testData
LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
== Optimized Logical Plan ==
Sort [key#0 ASC], true
Project [key#0]
LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
== Physical Plan ==
Sort [key#0 ASC], true
Exchange (RangePartitioning [key#0 ASC], 5), []
Project [key#0]
PhysicalRDD [key#0,value#1], MapPartitionsRDD[1]
```
with this rule we will first do column pruning on the table and then do sorting.
Author: scwf <wangfei1@huawei.com>
This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>
Closes#5838 from scwf/pruning and squashes the following commits:
b00d833 [scwf] address michael's comment
e230155 [scwf] fix tests failure
b09b895 [scwf] improve column pruning
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#5831 from cloud-fan/7276 and squashes the following commits:
ee4a1e1 [Wenchen Fan] fix rebase mistake
a3b565d [Wenchen Fan] refactor
99deb5d [Wenchen Fan] add test
f1f67ad [Wenchen Fan] fix 7276
It's the first step: generalize UnresolvedGetField to support all map, struct, and array
TODO: add `apply` in Scala and `__getitem__` in Python, and unify the `getItem` and `getField` methods to one single API(or should we keep them for compatibility?).
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#5744 from cloud-fan/generalize and squashes the following commits:
715c589 [Wenchen Fan] address comments
7ea5b31 [Wenchen Fan] fix python test
4f0833a [Wenchen Fan] add python test
f515d69 [Wenchen Fan] add apply method and test cases
8df6199 [Wenchen Fan] fix python test
239730c [Wenchen Fan] fix test compile
2a70526 [Wenchen Fan] use _bin_op in dataframe.py
6bf72bc [Wenchen Fan] address comments
3f880c3 [Wenchen Fan] add java doc
ab35ab5 [Wenchen Fan] fix python test
b5961a9 [Wenchen Fan] fix style
c9d85f5 [Wenchen Fan] generalize UnresolvedGetField to support all map, struct, and array
This PR switches Spark SQL's Hive support to use the isolated hive client interface introduced by #5851, instead of directly interacting with the client. By using this isolated client we can now allow users to dynamically configure the version of Hive that they are connecting to by setting `spark.sql.hive.metastore.version` without the need recompile. This also greatly reduces the surface area for our interaction with the hive libraries, hopefully making it easier to support other versions in the future.
Jars for the desired hive version can be configured using `spark.sql.hive.metastore.jars`, which accepts the following options:
- a colon-separated list of jar files or directories for hive and hadoop.
- `builtin` - attempt to discover the jars that were used to load Spark SQL and use those. This
option is only valid when using the execution version of Hive.
- `maven` - download the correct version of hive on demand from maven.
By default, `builtin` is used for Hive 13.
This PR also removes the test step for building against Hive 12, as this will no longer be required to talk to Hive 12 metastores. However, the full removal of the Shim is deferred until a later PR.
Remaining TODOs:
- Remove the Hive Shims and inline code for Hive 13.
- Several HiveCompatibility tests are not yet passing.
- `nullformatCTAS` - As detailed below, we now are handling CTAS parsing ourselves instead of hacking into the Hive semantic analyzer. However, we currently only handle the common cases and not things like CTAS where the null format is specified.
- `combine1` now leaks state about compression somehow, breaking all subsequent tests. As such we currently add it to the blacklist
- `part_inherit_tbl_props` and `part_inherit_tbl_props_with_star` do not work anymore. We are correctly propagating the information
- "load_dyn_part14.*" - These tests pass when run on their own, but fail when run with all other tests. It seems our `RESET` mechanism may not be as robust as it used to be?
Other required changes:
- `CreateTableAsSelect` no longer carries parts of the HiveQL AST with it through the query execution pipeline. Instead, we parse CTAS during the HiveQL conversion and construct a `HiveTable`. The full parsing here is not yet complete as detailed above in the remaining TODOs. Since the operator is Hive specific, it is moved to the hive package.
- `Command` is simplified to be a trait that simply acts as a marker for a LogicalPlan that should be eagerly evaluated.
Author: Michael Armbrust <michael@databricks.com>
Closes#5876 from marmbrus/useIsolatedClient and squashes the following commits:
258d000 [Michael Armbrust] really really correct path handling
e56fd4a [Michael Armbrust] getAbsolutePath
5a259f5 [Michael Armbrust] fix typos
81bb366 [Michael Armbrust] comments from vanzin
5f3945e [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
4b5cd41 [Michael Armbrust] yin's comments
f5de7de [Michael Armbrust] cleanup
11e9c72 [Michael Armbrust] better coverage in versions suite
7e8f010 [Michael Armbrust] better error messages and jar handling
e7b3941 [Michael Armbrust] more permisive checking for function registration
da91ba7 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
5fe5894 [Michael Armbrust] fix serialization suite
81711c4 [Michael Armbrust] Initial support for running without maven
1d8ae44 [Michael Armbrust] fix final tests?
1c50813 [Michael Armbrust] more comments
a3bee70 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into useIsolatedClient
a6f5df1 [Michael Armbrust] style
ab07f7e [Michael Armbrust] WIP
4d8bf02 [Michael Armbrust] Remove hive 12 compilation
8843a25 [Michael Armbrust] [SPARK-6908] [SQL] Use isolated Hive client
Avoid translating to CaseWhen and evaluate the key expression many times.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes#5979 from cloud-fan/condition and squashes the following commits:
3ce54e1 [Wenchen Fan] add CaseKeyWhen
Address marmbrus and scwf's comments in #5604.
Author: Yin Huai <yhuai@databricks.com>
Closes#5945 from yhuai/windowFollowup and squashes the following commits:
0ef879d [Yin Huai] Add collectFirst to TreeNode.
2373968 [Yin Huai] wip
4a16df9 [Yin Huai] Address minor comments for [SPARK-1442].
Adding more information about the implementation...
This PR is adding the support of window functions to Spark SQL (specifically OVER and WINDOW clause). For every expression having a OVER clause, we use a WindowExpression as the container of a WindowFunction and the corresponding WindowSpecDefinition (the definition of a window frame, i.e. partition specification, order specification, and frame specification appearing in a OVER clause).
# Implementation #
The high level work flow of the implementation is described as follows.
* Query parsing: In the query parse process, all WindowExpressions are originally placed in the projectList of a Project operator or the aggregateExpressions of an Aggregate operator. It makes our changes to simple and keep all of parsing rules for window functions at a single place (nodesToWindowSpecification). For the WINDOWclause in a query, we use a WithWindowDefinition as the container as the mapping from the name of a window specification to a WindowSpecDefinition. This changes is similar with our common table expression support.
* Analysis: The query analysis process has three steps for window functions.
* Resolve all WindowSpecReferences by replacing them with WindowSpecReferences according to the mapping table stored in the node of WithWindowDefinition.
* Resolve WindowFunctions in the projectList of a Project operator or the aggregateExpressions of an Aggregate operator. For this PR, we use Hive's functions for window functions because we will have a major refactoring of our internal UDAFs and it is better to switch our UDAFs after that refactoring work.
* Once we have resolved all WindowFunctions, we will use ResolveWindowFunction to extract WindowExpressions from projectList and aggregateExpressions and then create a Window operator for every distinct WindowSpecDefinition. With this choice, at the execution time, we can rely on the Exchange operator to do all of work on reorganizing the table and we do not need to worry about it in the physical Window operator. An example analyzed plan is shown as follows
```
sql("""
SELECT
year, country, product, sales,
avg(sales) over(partition by product) avg_product,
sum(sales) over(partition by country) sum_country
FROM sales
ORDER BY year, country, product
""").explain(true)
== Analyzed Logical Plan ==
Sort [year#34 ASC,country#35 ASC,product#36 ASC], true
Project [year#34,country#35,product#36,sales#37,avg_product#27,sum_country#28]
Window [year#34,country#35,product#36,sales#37,avg_product#27], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum(sales#37) WindowSpecDefinition [country#35], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS sum_country#28], WindowSpecDefinition [country#35], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Window [year#34,country#35,product#36,sales#37], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFAverage(sales#37) WindowSpecDefinition [product#36], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS avg_product#27], WindowSpecDefinition [product#36], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Project [year#34,country#35,product#36,sales#37]
MetastoreRelation default, sales, None
```
* Query planning: In the process of query planning, we simple generate the physical Window operator based on the logical Window operator. Then, to prepare the executedPlan, the EnsureRequirements rule will add Exchange and Sort operators if necessary. The EnsureRequirements rule will analyze the data properties and try to not add unnecessary shuffle and sort. The physical plan for the above example query is shown below.
```
== Physical Plan ==
Sort [year#34 ASC,country#35 ASC,product#36 ASC], true
Exchange (RangePartitioning [year#34 ASC,country#35 ASC,product#36 ASC], 200), []
Window [year#34,country#35,product#36,sales#37,avg_product#27], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum(sales#37) WindowSpecDefinition [country#35], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS sum_country#28], WindowSpecDefinition [country#35], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Exchange (HashPartitioning [country#35], 200), [country#35 ASC]
Window [year#34,country#35,product#36,sales#37], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFAverage(sales#37) WindowSpecDefinition [product#36], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS avg_product#27], WindowSpecDefinition [product#36], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Exchange (HashPartitioning [product#36], 200), [product#36 ASC]
HiveTableScan [year#34,country#35,product#36,sales#37], (MetastoreRelation default, sales, None), None
```
* Execution time: At execution time, a physical Window operator buffers all rows in a partition specified in the partition spec of a OVER clause. If necessary, it also maintains a sliding window frame. The current implementation tries to buffer the input parameters of a window function according to the window frame to avoid evaluating a row multiple times.
# Future work #
Here are three improvements that are not hard to add:
* Taking advantage of the window frame specification to reduce the number of rows buffered in the physical Window operator. For some cases, we only need to buffer the rows appearing in the sliding window. But for other cases, we will not be able to reduce the number of rows buffered (e.g. ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).
* When aRAGEN frame is used, for <value> PRECEDING and <value> FOLLOWING, it will be great if the <value> part is an expression (we can start with Literal). So, when the data type of ORDER BY expression is a FractionalType, we can support FractionalType as the type <value> (<value> still needs to be evaluated as a positive value).
* When aRAGEN frame is used, we need to support DateType and TimestampType as the data type of the expression appearing in the order specification. Then, the <value> part of <value> PRECEDING and <value> FOLLOWING can support interval types (once we support them).
This is a joint work with guowei2 and yhuai
Thanks hbutani hvanhovell for his comments
Thanks scwf for his comments and unit tests
Author: Yin Huai <yhuai@databricks.com>
Closes#5604 from guowei2/windowImplement and squashes the following commits:
76fe1c8 [Yin Huai] Implementation.
aa2b0ae [Yin Huai] Tests.
After a discussion on the user mailing list, it was decided to put all UDF's under `o.a.s.sql.functions`
cc rxin
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#5923 from brkyvz/move-math-funcs and squashes the following commits:
a8dc3f7 [Burak Yavuz] address comments
cf7a7bb [Burak Yavuz] [SPARK-7358] Move DataFrame mathfunctions into functions
Adds the functions `rand` (Uniform Dist) and `randn` (Normal Dist.) as expressions to DataFrames.
cc mengxr rxin
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#5819 from brkyvz/df-rng and squashes the following commits:
50d69d4 [Burak Yavuz] add seed for test that failed
4234c3a [Burak Yavuz] fix Rand expression
13cad5c [Burak Yavuz] couple fixes
7d53953 [Burak Yavuz] waiting for hive tests
b453716 [Burak Yavuz] move radn with seed down
03637f0 [Burak Yavuz] fix broken hive func
c5909eb [Burak Yavuz] deleted old implementation of Rand
6d43895 [Burak Yavuz] implemented random generators
Now in spark sql optimizer we only push down right side filter for left semi join, actually we can push down left side filter because left semi join is doing filter on left table essentially.
Author: wangfei <wangfei1@huawei.com>
Author: scwf <wangfei1@huawei.com>
Closes#5677 from scwf/leftsemi and squashes the following commits:
483d205 [wangfei] update with master to fix compile issue
82df0e1 [wangfei] Merge branch 'master' of https://github.com/apache/spark into leftsemi
d68a053 [wangfei] added apply
8f48a3d [scwf] added test
ebadaa9 [wangfei] left filter push down for left semi join
This patch adds managed-memory-based aggregation to Spark SQL / DataFrames. Instead of working with Java objects, this new aggregation path uses `sun.misc.Unsafe` to manipulate raw memory. This reduces the memory footprint for aggregations, resulting in fewer spills, OutOfMemoryErrors, and garbage collection pauses. As a result, this allows for higher memory utilization. It can also result in better cache locality since objects will be stored closer together in memory.
This feature can be eanbled by setting `spark.sql.unsafe.enabled=true`. For now, this feature is only supported when codegen is enabled and only supports aggregations for which the grouping columns are primitive numeric types or strings and aggregated values are numeric.
### Managing memory with sun.misc.Unsafe
This patch supports both on- and off-heap managed memory.
- In on-heap mode, memory addresses are identified by the combination of a base Object and an offset within that object.
- In off-heap mode, memory is addressed directly with 64-bit long addresses.
To support both modes, functions that manipulate memory accept both `baseObject` and `baseOffset` fields. In off-heap mode, we simply pass `null` as `baseObject`.
We allocate memory in large chunks, so memory fragmentation and allocation speed are not significant bottlenecks.
By default, we use on-heap mode. To enable off-heap mode, set `spark.unsafe.offHeap=true`.
To track allocated memory, this patch extends `SparkEnv` with an `ExecutorMemoryManager` and supplies each `TaskContext` with a `TaskMemoryManager`. These classes work together to track allocations and detect memory leaks.
### Compact tuple format
This patch introduces `UnsafeRow`, a compact row layout. In this format, each tuple has three parts: a null bit set, fixed length values, and variable-length values:
![image](https://cloud.githubusercontent.com/assets/50748/7328538/2fdb65ce-ea8b-11e4-9743-6c0f02bb7d1f.png)
- Rows are always 8-byte word aligned (so their sizes will always be a multiple of 8 bytes)
- The bit set is used for null tracking:
- Position _i_ is set if and only if field _i_ is null
- The bit set is aligned to an 8-byte word boundary.
- Every field appears as an 8-byte word in the fixed-length values part:
- If a field is null, we zero out the values.
- If a field is variable-length, the word stores a relative offset (w.r.t. the base of the tuple) that points to the beginning of the field's data in the variable-length part.
- Each variable-length data type can have its own encoding:
- For strings, the first word stores the length of the string and is followed by UTF-8 encoded bytes. If necessary, the end of the string is padded with empty bytes in order to ensure word-alignment.
For example, a tuple that consists 3 fields of type (int, string, string), with value (null, “data”, “bricks”) would look like this:
![image](https://cloud.githubusercontent.com/assets/50748/7328526/1e21959c-ea8b-11e4-9a28-a4350fe4a7b5.png)
This format allows us to compare tuples for equality by directly comparing their raw bytes. This also enables fast hashing of tuples.
### Hash map for performing aggregations
This patch introduces `UnsafeFixedWidthAggregationMap`, a hash map for performing aggregations where the aggregation result columns are fixed-with. This map's keys and values are `Row` objects. `UnsafeFixedWidthAggregationMap` is implemented on top of `BytesToBytesMap`, an append-only map which supports byte-array keys and values.
`BytesToBytesMap` stores pointers to key and value tuples. For each record with a new key, we copy the key and create the aggregation value buffer for that key and put them in a buffer. The hash table then simply stores pointers to the key and value. For each record with an existing key, we simply run the aggregation function to update the values in place.
This map is implemented using open hashing with triangular sequence probing. Each entry stores two words in a long array: the first word stores the address of the key and the second word stores the relative offset from the key tuple to the value tuple, as well as the key's 32-bit hashcode. By storing the full hashcode, we reduce the number of equality checks that need to be performed to handle position collisions ()since the chance of hashcode collision is much lower than position collision).
`UnsafeFixedWidthAggregationMap` allows regular Spark SQL `Row` objects to be used when probing the map. Internally, it encodes these rows into `UnsafeRow` format using `UnsafeRowConverter`. This conversion has a small overhead that can be eliminated in the future once we use UnsafeRows in other operators.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/5725)
<!-- Reviewable:end -->
Author: Josh Rosen <joshrosen@databricks.com>
Closes#5725 from JoshRosen/unsafe and squashes the following commits:
eeee512 [Josh Rosen] Add converters for Null, Boolean, Byte, and Short columns.
81f34f8 [Josh Rosen] Follow 'place children last' convention for GeneratedAggregate
1bc36cc [Josh Rosen] Refactor UnsafeRowConverter to avoid unnecessary boxing.
017b2dc [Josh Rosen] Remove BytesToBytesMap.finalize()
50e9671 [Josh Rosen] Throw memory leak warning even in case of error; add warning about code duplication
70a39e4 [Josh Rosen] Split MemoryManager into ExecutorMemoryManager and TaskMemoryManager:
6e4b192 [Josh Rosen] Remove an unused method from ByteArrayMethods.
de5e001 [Josh Rosen] Fix debug vs. trace in logging message.
a19e066 [Josh Rosen] Rename unsafe Java test suites to match Scala test naming convention.
78a5b84 [Josh Rosen] Add logging to MemoryManager
ce3c565 [Josh Rosen] More comments, formatting, and code cleanup.
529e571 [Josh Rosen] Measure timeSpentResizing in nanoseconds instead of milliseconds.
3ca84b2 [Josh Rosen] Only zero the used portion of groupingKeyConversionScratchSpace
162caf7 [Josh Rosen] Fix test compilation
b45f070 [Josh Rosen] Don't redundantly store the offset from key to value, since we can compute this from the key size.
a8e4a3f [Josh Rosen] Introduce MemoryManager interface; add to SparkEnv.
0925847 [Josh Rosen] Disable MiMa checks for new unsafe module
cde4132 [Josh Rosen] Add missing pom.xml
9c19fc0 [Josh Rosen] Add configuration options for heap vs. offheap
6ffdaa1 [Josh Rosen] Null handling improvements in UnsafeRow.
31eaabc [Josh Rosen] Lots of TODO and doc cleanup.
a95291e [Josh Rosen] Cleanups to string handling code
afe8dca [Josh Rosen] Some Javadoc cleanup
f3dcbfe [Josh Rosen] More mod replacement
854201a [Josh Rosen] Import and comment cleanup
06e929d [Josh Rosen] More warning cleanup
ef6b3d3 [Josh Rosen] Fix a bunch of FindBugs and IntelliJ inspections
29a7575 [Josh Rosen] Remove debug logging
49aed30 [Josh Rosen] More long -> int conversion.
b26f1d3 [Josh Rosen] Fix bug in murmur hash implementation.
765243d [Josh Rosen] Enable optional performance metrics for hash map.
23a440a [Josh Rosen] Bump up default hash map size
628f936 [Josh Rosen] Use ints intead of longs for indexing.
92d5a06 [Josh Rosen] Address a number of minor code review comments.
1f4b716 [Josh Rosen] Merge Unsafe code into the regular GeneratedAggregate, guarded by a configuration flag; integrate planner support and re-enable all tests.
d85eeff [Josh Rosen] Add basic sanity test for UnsafeFixedWidthAggregationMap
bade966 [Josh Rosen] Comment update (bumping to refresh GitHub cache...)
b3eaccd [Josh Rosen] Extract aggregation map into its own class.
d2bb986 [Josh Rosen] Update to implement new Row methods added upstream
58ac393 [Josh Rosen] Use UNSAFE allocator in GeneratedAggregate (TODO: make this configurable)
7df6008 [Josh Rosen] Optimizations related to zeroing out memory:
c1b3813 [Josh Rosen] Fix bug in UnsafeMemoryAllocator.free():
738fa33 [Josh Rosen] Add feature flag to guard UnsafeGeneratedAggregate
c55bf66 [Josh Rosen] Free buffer once iterator has been fully consumed.
62ab054 [Josh Rosen] Optimize for fact that get() is only called on String columns.
c7f0b56 [Josh Rosen] Reuse UnsafeRow pointer in UnsafeRowConverter
ae39694 [Josh Rosen] Add finalizer as "cleanup method of last resort"
c754ae1 [Josh Rosen] Now that the store*() contract has been stregthened, we can remove an extra lookup
f764d13 [Josh Rosen] Simplify address + length calculation in Location.
079f1bf [Josh Rosen] Some clarification of the BytesToBytesMap.lookup() / set() contract.
1a483c5 [Josh Rosen] First version that passes some aggregation tests:
fc4c3a8 [Josh Rosen] Sketch how the converters will be used in UnsafeGeneratedAggregate
53ba9b7 [Josh Rosen] Start prototyping Java Row -> UnsafeRow converters
1ff814d [Josh Rosen] Add reminder to free memory on iterator completion
8a8f9df [Josh Rosen] Add skeleton for GeneratedAggregate integration.
5d55cef [Josh Rosen] Add skeleton for Row implementation.
f03e9c1 [Josh Rosen] Play around with Unsafe implementations of more string methods.
ab68e08 [Josh Rosen] Begin merging the UTF8String implementations.
480a74a [Josh Rosen] Initial import of code from Databricks unsafe utils repo.
Adds support for the math functions for DataFrames in PySpark.
rxin I love Davies.
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#5750 from brkyvz/python-math-udfs and squashes the following commits:
7c4f563 [Burak Yavuz] removed is_math
3c4adde [Burak Yavuz] cleanup imports
d5dca3f [Burak Yavuz] moved math functions to mathfunctions
25e6534 [Burak Yavuz] addressed comments v2.0
d3f7e0f [Burak Yavuz] addressed comments and added tests
7b7d7c4 [Burak Yavuz] remove tests for removed methods
33c2c15 [Burak Yavuz] fixed python style
3ee0c05 [Burak Yavuz] added python functions
Implemented almost all math functions found in scala.math (max, min and abs were already present).
cc mengxr marmbrus
Author: Burak Yavuz <brkyvz@gmail.com>
Closes#5616 from brkyvz/math-udfs and squashes the following commits:
fb27153 [Burak Yavuz] reverted exception message
836a098 [Burak Yavuz] fixed test and addressed small comment
e5f0d13 [Burak Yavuz] addressed code review v2.2
b26c5fb [Burak Yavuz] addressed review v2.1
2761f08 [Burak Yavuz] addressed review v2
6588a5b [Burak Yavuz] fixed merge conflicts
b084e10 [Burak Yavuz] Addressed code review
029e739 [Burak Yavuz] fixed atan2 test
534cc11 [Burak Yavuz] added more tests, addressed comments
fa68dbe [Burak Yavuz] added double specific test data
937d5a5 [Burak Yavuz] use doubles instead of ints
8e28fff [Burak Yavuz] Added apache header
7ec8f7f [Burak Yavuz] Added math functions for DataFrames
rename DataTypeParser.apply to DataTypeParser.parse to make it more clear and readable.
/cc rxin
Author: wangfei <wangfei1@huawei.com>
Closes#5710 from scwf/apply and squashes the following commits:
c319977 [wangfei] rename apply to parse