Commit graph

12165 commits

Author SHA1 Message Date
Josh Rosen c032b0bf92 [SPARK-8797] [SPARK-9146] [SPARK-9145] [SPARK-9147] Support NaN ordering and equality comparisons in Spark SQL
This patch addresses an issue where queries that sorted float or double columns containing NaN values could fail with "Comparison method violates its general contract!" errors from TimSort.  The root of this problem is that `NaN > anything`, `NaN == anything`, and `NaN < anything` all return `false`.

Per the design specified in SPARK-9079, we have decided that `NaN = NaN` should return true and that NaN should appear last when sorting in ascending order (i.e. it is larger than any other numeric value).

In addition to implementing these semantics, this patch also adds canonicalization of NaN values in UnsafeRow, which is necessary in order to be able to do binary equality comparisons on equal NaNs that might have different bit representations (see SPARK-9147).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7194 from JoshRosen/nan and squashes the following commits:

983d4fc [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan
88bd73c [Josh Rosen] Fix Row.equals()
a702e2e [Josh Rosen] normalization -> canonicalization
a7267cf [Josh Rosen] Normalize NaNs in UnsafeRow
fe629ae [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan
fbb2a29 [Josh Rosen] Fix NaN comparisons in BinaryComparison expressions
c1fd4fe [Josh Rosen] Fold NaN test into existing test framework
b31eb19 [Josh Rosen] Uncomment failing tests
7fe67af [Josh Rosen] Support NaN == NaN (SPARK-9145)
58bad2c [Josh Rosen] Revert "Compare rows' string representations to work around NaN incomparability."
fc6b4d2 [Josh Rosen] Update CodeGenerator
3998ef2 [Josh Rosen] Remove unused code
a2ba2e7 [Josh Rosen] Fix prefix comparision for NaNs
a30d371 [Josh Rosen] Compare rows' string representations to work around NaN incomparability.
6f03f85 [Josh Rosen] Fix bug in Double / Float ordering
42a1ad5 [Josh Rosen] Stop filtering NaNs in UnsafeExternalSortSuite
bfca524 [Josh Rosen] Change ordering so that NaN is maximum value.
8d7be61 [Josh Rosen] Update randomized test to use ScalaTest's assume()
b20837b [Josh Rosen] Add failing test for new NaN comparision ordering
5b88b2b [Josh Rosen] Fix compilation of CodeGenerationSuite
d907b5b [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan
630ebc5 [Josh Rosen] Specify an ordering for NaN values.
9bf195a [Josh Rosen] Re-enable NaNs in CodeGenerationSuite to produce more regression tests
13fc06a [Josh Rosen] Add regression test for NaN sorting issue
f9efbb5 [Josh Rosen] Fix ORDER BY NULL
e7dc4fb [Josh Rosen] Add very generic test for ordering
7d5c13e [Josh Rosen] Add regression test for SPARK-8782 (ORDER BY NULL)
b55875a [Josh Rosen] Generate doubles and floats over entire possible range.
5acdd5c [Josh Rosen] Infinity and NaN are interesting.
ab76cbd [Josh Rosen] Move code to Catalyst package.
d2b4a4a [Josh Rosen] Add random data generator test utilities to Spark SQL.
2015-07-20 22:38:05 -07:00
Holden Karau 4d97be9530 [SPARK-9204][ML] Add default params test for linearyregression suite
Author: Holden Karau <holden@pigscanfly.ca>

Closes #7553 from holdenk/SPARK-9204-add-default-params-test-to-linear-regression and squashes the following commits:

630ba19 [Holden Karau] style fix
faa08a3 [Holden Karau] Add default params test for linearyregression suite
2015-07-20 22:15:10 -07:00
Tarek Auel a3c7a3ce32 [SPARK-9132][SPARK-9163][SQL] codegen conv
Jira: https://issues.apache.org/jira/browse/SPARK-9132
https://issues.apache.org/jira/browse/SPARK-9163

rxin as you proposed in the Jira ticket, I just moved the logic to a separate object. I haven't changed anything of the logic of `NumberConverter`.

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7552 from tarekauel/SPARK-9163 and squashes the following commits:

40dcde9 [Tarek Auel] [SPARK-9132][SPARK-9163][SQL] style fix
fa985bd [Tarek Auel] [SPARK-9132][SPARK-9163][SQL] codegen conv
2015-07-20 22:08:12 -07:00
Eric Liang 1cbdd89918 [SPARK-9201] [ML] Initial integration of MLlib + SparkR using RFormula
This exposes the SparkR:::glm() and SparkR:::predict() APIs. It was necessary to change RFormula to silently drop the label column if it was missing from the input dataset, which is kind of a hack but necessary to integrate with the Pipeline API.

The umbrella design doc for MLlib + SparkR integration can be viewed here: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit

mengxr

Author: Eric Liang <ekl@databricks.com>

Closes #7483 from ericl/spark-8774 and squashes the following commits:

3dfac0c [Eric Liang] update
17ef516 [Eric Liang] more comments
1753a0f [Eric Liang] make glm generic
b0f50f8 [Eric Liang] equivalence test
550d56d [Eric Liang] export methods
c015697 [Eric Liang] second pass
117949a [Eric Liang] comments
5afbc67 [Eric Liang] test label columns
6b7f15f [Eric Liang] Fri Jul 17 14:20:22 PDT 2015
3a63ae5 [Eric Liang] Fri Jul 17 13:41:52 PDT 2015
ce61367 [Eric Liang] Fri Jul 17 13:41:17 PDT 2015
0299c59 [Eric Liang] Fri Jul 17 13:40:32 PDT 2015
e37603f [Eric Liang] Fri Jul 17 12:15:03 PDT 2015
d417d0c [Eric Liang] Merge remote-tracking branch 'upstream/master' into spark-8774
29a2ce7 [Eric Liang] Merge branch 'spark-8774-1' into spark-8774
d1959d2 [Eric Liang] clarify comment
2db68aa [Eric Liang] second round of comments
dc3c943 [Eric Liang] address comments
5765ec6 [Eric Liang] fix style checks
1f361b0 [Eric Liang] doc
d33211b [Eric Liang] r support
fb0826b [Eric Liang] [SPARK-8774] Add R model formula with basic support as a transformer
2015-07-20 20:49:38 -07:00
Yu ISHIKAWA 2bdf9914ab [SPARK-9052] [SPARKR] Fix comments after curly braces
[[SPARK-9052] Fix comments after curly braces - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9052)

This is the full result of lintr at the rivision:011551620faa87107a787530f074af3d9be7e695.
[[SPARK-9052] the result of lint-r at the revision:011551620faa87107a787530f074af3d9be7e695](https://gist.github.com/yu-iskw/e7246041b173a3f29482)

This is the difference of the result between before and after.
https://gist.github.com/yu-iskw/e7246041b173a3f29482/revisions

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #7440 from yu-iskw/SPARK-9052 and squashes the following commits:

015d738 [Yu ISHIKAWA] Fix the indentations and move the placement of commna
5cc30fe [Yu ISHIKAWA] Fix the indentation in a condition
4ead0e5 [Yu ISHIKAWA] [SPARK-9052][SparkR] Fix comments after curly braces
2015-07-21 11:38:22 +09:00
Tarek Auel 936a96cb31 [SPARK-9164] [SQL] codegen hex/unhex
Jira: https://issues.apache.org/jira/browse/SPARK-9164

The diff looks heavy, but I just moved the `hex` and `unhex` methods to `object Hex`.  This allows me to call them from `eval` and `codeGen`

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7548 from tarekauel/SPARK-9164 and squashes the following commits:

dd91c57 [Tarek Auel] [SPARK-9164][SQL] codegen hex/unhex
2015-07-20 19:17:59 -07:00
Reynold Xin e90543e536 [SPARK-9142][SQL] Removing unnecessary self types in expressions.
Also added documentation to expressions to explain the important traits and abstract classes.

Author: Reynold Xin <rxin@databricks.com>

Closes #7550 from rxin/remove-self-types and squashes the following commits:

b2a3ec1 [Reynold Xin] [SPARK-9142][SQL] Removing unnecessary self types in expressions.
2015-07-20 18:23:51 -07:00
Tarek Auel 6853ac7c8c [SPARK-9156][SQL] codegen StringSplit
Jira: https://issues.apache.org/jira/browse/SPARK-9156

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7547 from tarekauel/SPARK-9156 and squashes the following commits:

0be2700 [Tarek Auel] [SPARK-9156][SQL] indention fix
b860eaf [Tarek Auel] [SPARK-9156][SQL] codegen StringSplit
5ad6a1f [Tarek Auel] [SPARK-9156] codegen StringSplit
2015-07-20 18:21:05 -07:00
Tarek Auel 047ccc8c9a [SPARK-9178][SQL] Add an empty string constant to UTF8String
Jira: https://issues.apache.org/jira/browse/SPARK-9178

In order to avoid calls of `UTF8String.fromString("")` this pr adds an `EMPTY_STRING` constant to `UTF8String`. An `UTF8String` is immutable, so we can use a constant, isn't it?

I searched for current usage of `UTF8String.fromString("")` with
`grep -R  "UTF8String.fromString(\"\")" .`

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7509 from tarekauel/SPARK-9178 and squashes the following commits:

8d6c405 [Tarek Auel] [SPARK-9178] revert intellij indents
3627b80 [Tarek Auel] [SPARK-9178] revert concat tests changes
3f5fbf5 [Tarek Auel] [SPARK-9178] rebase and add final to UTF8String.EMPTY_UTF8
47cda68 [Tarek Auel] Merge branch 'master' into SPARK-9178
4a37344 [Tarek Auel] [SPARK-9178] changed name to EMPTY_UTF8, added tests
748b87a [Tarek Auel] [SPARK-9178] Add empty string constant to UTF8String
2015-07-20 18:16:49 -07:00
Carson Wang 66bb8003b9 [SPARK-9187] [WEBUI] Timeline view may show negative value for running tasks
For running tasks, the executorRunTime metrics is 0 which causes negative executorComputingTime in the timeline. It also causes an incorrect SchedulerDelay time.
![timelinenegativevalue](https://cloud.githubusercontent.com/assets/9278199/8770953/f4362378-2eec-11e5-81e6-a06a07c04794.png)

Author: Carson Wang <carson.wang@intel.com>

Closes #7526 from carsonwang/timeline-negValue and squashes the following commits:

7b17db2 [Carson Wang] Fix negative value in timeline view
2015-07-20 18:08:59 -07:00
Meihua Wu ff3c72dbaf [SPARK-9175] [MLLIB] BLAS.gemm fails to update matrix C when alpha==0 and beta!=1
Fix BLAS.gemm to update matrix C when alpha==0 and beta!=1
Also include unit tests to verify the fix.

mengxr brkyvz

Author: Meihua Wu <meihuawu@umich.edu>

Closes #7503 from rotationsymmetry/fix_BLAS_gemm and squashes the following commits:

fce199c [Meihua Wu] Fix BLAS.gemm to update C when alpha==0 and beta!=1
2015-07-20 17:03:46 -07:00
Joseph K. Bradley a5d05819af [SPARK-9198] [MLLIB] [PYTHON] Fixed typo in pyspark sparsevector doc tests
Several places in the PySpark SparseVector docs have one defined as:
```
SparseVector(4, [2, 4], [1.0, 2.0])
```
The index 4 goes out of bounds (but this is not checked).

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7541 from jkbradley/sparsevec-doc-typo-fix and squashes the following commits:

c806a65 [Joseph K. Bradley] fixed doc test
e2dcb23 [Joseph K. Bradley] Fixed typo in pyspark sparsevector doc tests
2015-07-20 16:49:55 -07:00
Cheng Lian a1064df0ee [SPARK-8125] [SQL] Accelerates Parquet schema merging and partition discovery
This PR tries to accelerate Parquet schema discovery and `HadoopFsRelation` partition discovery.  The acceleration is done by the following means:

- Turning off schema merging by default

  Schema merging is not the most common case, but requires reading footers of all Parquet part-files and can be very slow.

- Avoiding `FileSystem.globStatus()` call when possible

  `FileSystem.globStatus()` may issue multiple synchronous RPC calls, and can be very slow (esp. on S3).  This PR adds `SparkHadoopUtil.globPathIfNecessary()`, which only issues RPC calls when the path contain glob-pattern specific character(s) (`{}[]*?\`).

  This is especially useful when converting a metastore Parquet table with lots of partitions, since Spark SQL adds all partition directories as the input paths, and currently we do a `globStatus` call on each input path sequentially.

- Listing leaf files in parallel when the number of input paths exceeds a threshold

  Listing leaf files is required by partition discovery.  Currently it is done on driver side, and can be slow when there are lots of (nested) directories, since each `FileSystem.listStatus()` call issues an RPC.  In this PR, we list leaf files in a BFS style, and resort to a Spark job once we found that the number of directories need to be listed exceed a threshold.

  The threshold is controlled by `SQLConf` option `spark.sql.sources.parallelPartitionDiscovery.threshold`, which defaults to 32.

- Discovering Parquet schema in parallel

  Currently, schema merging is also done on driver side, and needs to read footers of all part-files.  This PR uses a Spark job to do schema merging.  Together with task side metadata reading in Parquet 1.7.0, we never read any footers on driver side now.

Author: Cheng Lian <lian@databricks.com>

Closes #7396 from liancheng/accel-parquet and squashes the following commits:

5598efc [Cheng Lian] Uses ParquetInputFormat[InternalRow] instead of ParquetInputFormat[Row]
ff32cd0 [Cheng Lian] Excludes directories while listing leaf files
3c580f1 [Cheng Lian] Fixes test failure caused by making "mergeSchema" default to "false"
b1646aa [Cheng Lian] Should allow empty input paths
32e5f0d [Cheng Lian] Moves schema merging to executor side
2015-07-20 16:42:43 -07:00
Tarek Auel dac7dbf5a6 [SPARK-9160][SQL] codegen encode, decode
Jira: https://issues.apache.org/jira/browse/SPARK-9160

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7543 from tarekauel/SPARK-9160 and squashes the following commits:

7528f0e [Tarek Auel] [SPARK-9160][SQL] codegen encode, decode
2015-07-20 16:11:56 -07:00
Tarek Auel c9db8eaa42 [SPARK-9159][SQL] codegen ascii, base64, unbase64
Jira: https://issues.apache.org/jira/browse/SPARK-9159

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7542 from tarekauel/SPARK-9159 and squashes the following commits:

772e6bc [Tarek Auel] [SPARK-9159][SQL] codegen ascii, base64, unbase64
2015-07-20 15:32:46 -07:00
Tarek Auel 4863c11ea9 [SPARK-9155][SQL] codegen StringSpace
Jira https://issues.apache.org/jira/browse/SPARK-9155

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7531 from tarekauel/SPARK-9155 and squashes the following commits:

423c426 [Tarek Auel] [SPARK-9155] language typo fix
e34bd1b [Tarek Auel] [SPARK-9155] moved creation of blank string to UTF8String
4bc33e6 [Tarek Auel] [SPARK-9155] codegen StringSpace
2015-07-20 15:23:28 -07:00
Cheng Lian dde0e12f32 [SPARK-6910] [SQL] Support for pushing predicates down to metastore for partition pruning
This PR forks PR #7421 authored by piaozhexiu and adds [a workaround] [1] for fixing the occasional test failures occurred in PR #7421. Please refer to these [two] [2] [comments] [3] for details.

[1]: 536ac41a7e
[2]: https://github.com/apache/spark/pull/7421#issuecomment-122527391
[3]: https://github.com/apache/spark/pull/7421#issuecomment-122528059

Author: Cheolsoo Park <cheolsoop@netflix.com>
Author: Cheng Lian <lian@databricks.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #7492 from liancheng/pr-7421-workaround and squashes the following commits:

5599cc4 [Cheolsoo Park] Predicate pushdown to hive metastore
536ac41 [Cheng Lian] Sets hive.metastore.integral.jdo.pushdown to true to workaround test failures caused by in #7421
2015-07-20 15:12:14 -07:00
Davies Liu 9f913c4fd6 [SPARK-9114] [SQL] [PySpark] convert returned object from UDF into internal type
This PR also remove the duplicated code between registerFunction and UserDefinedFunction.

cc JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #7450 from davies/fix_return_type and squashes the following commits:

e80bf9f [Davies Liu] remove debugging code
f94b1f6 [Davies Liu] fix mima
8f9c58b [Davies Liu] convert returned object from UDF into internal type
2015-07-20 12:14:47 -07:00
Mateusz Buśkiewicz 02181fb6d1 [SPARK-9101] [PySpark] Add missing NullType
JIRA: https://issues.apache.org/jira/browse/SPARK-9101

Author: Mateusz Buśkiewicz <mateusz.buskiewicz@getbase.com>

Closes #7499 from sixers/spark-9101 and squashes the following commits:

dd75aa6 [Mateusz Buśkiewicz] [SPARK-9101] [PySpark] Test for selecting null literal
97e3f2f [Mateusz Buśkiewicz] [SPARK-9101] [PySpark] Add missing NullType to _atomic_types in pyspark.sql.types
2015-07-20 12:00:48 -07:00
Imran Rashid 80e2568b25 [SPARK-8103][core] DAGScheduler should not submit multiple concurrent attempts for a stage
https://issues.apache.org/jira/browse/SPARK-8103

cc kayousterhout (thanks for the extra test case)

Author: Imran Rashid <irashid@cloudera.com>
Author: Kay Ousterhout <kayousterhout@gmail.com>
Author: Imran Rashid <squito@users.noreply.github.com>

Closes #6750 from squito/SPARK-8103 and squashes the following commits:

fb3acfc [Imran Rashid] fix log msg
e01b7aa [Imran Rashid] fix some comments, style
584acd4 [Imran Rashid] simplify going from taskId to taskSetMgr
e43ac25 [Imran Rashid] Merge branch 'master' into SPARK-8103
6bc23af [Imran Rashid] update log msg
4470fa1 [Imran Rashid] rename
c04707e [Imran Rashid] style
88b61cc [Imran Rashid] add tests to make sure that TaskSchedulerImpl schedules correctly with zombie attempts
d7f1ef2 [Imran Rashid] get rid of activeTaskSets
a21c8b5 [Imran Rashid] Merge branch 'master' into SPARK-8103
906d626 [Imran Rashid] fix merge
109900e [Imran Rashid] Merge branch 'master' into SPARK-8103
c0d4d90 [Imran Rashid] Revert "Index active task sets by stage Id rather than by task set id"
f025154 [Imran Rashid] Merge pull request #2 from kayousterhout/imran_SPARK-8103
baf46e1 [Kay Ousterhout] Index active task sets by stage Id rather than by task set id
19685bb [Imran Rashid] switch to using latestInfo.attemptId, and add comments
a5f7c8c [Imran Rashid] remove comment for reviewers
227b40d [Imran Rashid] style
517b6e5 [Imran Rashid] get rid of SparkIllegalStateException
b2faef5 [Imran Rashid] faster check for conflicting task sets
6542b42 [Imran Rashid] remove extra stageAttemptId
ada7726 [Imran Rashid] reviewer feedback
d8eb202 [Imran Rashid] Merge branch 'master' into SPARK-8103
46bc26a [Imran Rashid] more cleanup of debug garbage
cb245da [Imran Rashid] finally found the issue ... clean up debug stuff
8c29707 [Imran Rashid] Merge branch 'master' into SPARK-8103
89a59b6 [Imran Rashid] more printlns ...
9601b47 [Imran Rashid] more debug printlns
ecb4e7d [Imran Rashid] debugging printlns
b6bc248 [Imran Rashid] style
55f4a94 [Imran Rashid] get rid of more random test case since kays tests are clearer
7021d28 [Imran Rashid] update test since listenerBus.waitUntilEmpty now throws an exception instead of returning a boolean
883fe49 [Kay Ousterhout] Unit tests for concurrent stages issue
6e14683 [Imran Rashid] unit test just to make sure we fail fast on concurrent attempts
06a0af6 [Imran Rashid] ignore for jenkins
c443def [Imran Rashid] better fix and simpler test case
28d70aa [Imran Rashid] wip on getting a better test case ...
a9bf31f [Imran Rashid] wip
2015-07-20 10:28:32 -07:00
Reynold Xin c6fe9b4a17 [SQL] Remove space from DataFrame Scala/Java API.
I don't think this function is useful at all in Scala/Java, since users can easily compute n * space easily.

Author: Reynold Xin <rxin@databricks.com>

Closes #7530 from rxin/remove-space and squashes the following commits:

c147873 [Reynold Xin] [SQL] Remove space from DataFrame Scala/Java API.
2015-07-20 09:43:25 -07:00
Wenchen Fan 04db58ae30 [SPARK-9186][SQL] make deterministic describing the tree rather than the expression
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7525 from cloud-fan/deterministic and squashes the following commits:

4189bfa [Wenchen Fan] make deterministic describing the tree rather than the expression
2015-07-20 09:42:18 -07:00
Tarek Auel a15ecd057a [SPARK-9177][SQL] Reuse of calendar object in WeekOfYear
https://issues.apache.org/jira/browse/SPARK-9177

rxin Are we sure that this is thread safe? chenghao-intel explained in another PR that every partition (if I remember correctly) uses one expression instance. This instance isn't used by multiple threads, is it? If not, we are fine.

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7516 from tarekauel/SPARK-9177 and squashes the following commits:

0c1313a [Tarek Auel] [SPARK-9177] utilize more powerful addMutableState
6e2f03f [Tarek Auel] Merge branch 'master' into SPARK-9177
a69ec92 [Tarek Auel] [SPARK-9177] address comment
6cfb180 [Tarek Auel] [SPARK-9177] calendar as lazy transient val
ff97b09 [Tarek Auel] [SPARK-9177] Reuse calendar object in interpreted code and codegen
2015-07-20 09:41:25 -07:00
Tarek Auel 5112b7f58b [SPARK-9153][SQL] codegen StringLPad/StringRPad
Jira: https://issues.apache.org/jira/browse/SPARK-9153

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7527 from tarekauel/SPARK-9153 and squashes the following commits:

3840c6b [Tarek Auel] [SPARK-9153] removed codegen fallback
92b6a5d [Tarek Auel] [SPARK-9153] codegen lpad/rpad
2015-07-20 09:35:45 -07:00
MechCoder d0b4e93f7e [SPARK-8996] [MLLIB] [PYSPARK] Python API for Kolmogorov-Smirnov Test
Python API for the KS-test

Statistics.kolmogorovSmirnovTest(data, distName, *params)
I'm not quite sure how to support the callable function since it is not serializable.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7430 from MechCoder/spark-8996 and squashes the following commits:

2dd009d [MechCoder] minor
021d233 [MechCoder] Remove one wrapper and other minor stuff
49d07ab [MechCoder] [SPARK-8996] [MLlib] Python API for Kolmogorov-Smirnov Test
2015-07-20 09:00:01 -07:00
George Dittmar 3f7de7db4c [SPARK-7422] [MLLIB] Add argmax to Vector, SparseVector
Modifying Vector, DenseVector, and SparseVector to implement argmax functionality. This work is to set the stage for changes to be done in Spark-7423.

Author: George Dittmar <georgedittmar@gmail.com>
Author: George <dittmar@Georges-MacBook-Pro.local>
Author: dittmarg <george.dittmar@webtrends.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #6112 from GeorgeDittmar/SPARK-7422 and squashes the following commits:

3e0a939 [George Dittmar] Merge pull request #1 from mengxr/SPARK-7422
127dec5 [Xiangrui Meng] update argmax impl
2ea6a55 [George Dittmar] Added MimaExcludes for Vectors.argmax
98058f4 [George Dittmar] Merge branch 'master' of github.com:apache/spark into SPARK-7422
5fd9380 [George Dittmar] fixing style check error
42341fb [George Dittmar] refactoring arg max check to better handle zero values
b22af46 [George Dittmar] Fixing spaces between commas in unit test
f2eba2f [George Dittmar] Cleaning up unit tests to be fewer lines
aa330e3 [George Dittmar] Fixing some last if else spacing issues
ac53c55 [George Dittmar] changing dense vector argmax unit test to be one line call vs 2
d5b5423 [George Dittmar] Fixing code style and updating if logic on when to check for zero values
ee1a85a [George Dittmar] Cleaning up unit tests a bit and modifying a few cases
3ee8711 [George Dittmar] Fixing corner case issue with zeros in the active values of the sparse vector. Updated unit tests
b1f059f [George Dittmar] Added comment before we start arg max calculation. Updated unit tests to cover corner cases
f21dcce [George Dittmar] commit
af17981 [dittmarg] Initial work fixing bug that was made clear in pr
eeda560 [George] Fixing SparseVector argmax function to ignore zero values while doing the calculation.
4526acc [George] Merge branch 'master' of github.com:apache/spark into SPARK-7422
df9538a [George] Added argmax to sparse vector and added unit test
3cffed4 [George] Adding unit tests for argmax functions for Dense and Sparse vectors
04677af [George] initial work on adding argmax to Vector and SparseVector
2015-07-20 08:55:37 -07:00
Josh Rosen 79ec07290d [SPARK-9023] [SQL] Efficiency improvements for UnsafeRows in Exchange
This pull request aims to improve the performance of SQL's Exchange operator when shuffling UnsafeRows.  It also makes several general efficiency improvements to Exchange.

Key changes:

- When performing hash partitioning, the old Exchange projected the partitioning columns into a new row then passed a `(partitioningColumRow: InternalRow, row: InternalRow)` pair into the shuffle. This is very inefficient because it ends up redundantly serializing the partitioning columns only to immediately discard them after the shuffle.  After this patch's changes, Exchange now shuffles `(partitionId: Int, row: InternalRow)` pairs.  This still isn't optimal, since we're still shuffling extra data that we don't need, but it's significantly more efficient than the old implementation; in the future, we may be able to further optimize this once we implement a new shuffle write interface that accepts non-key-value-pair inputs.
- Exchange's `compute()` method has been significantly simplified; the new code has less duplication and thus is easier to understand.
- When the Exchange's input operator produces UnsafeRows, Exchange will use a specialized `UnsafeRowSerializer` to serialize these rows.  This serializer is significantly more efficient since it simply copies the UnsafeRow's underlying bytes.  Note that this approach does not work for UnsafeRows that use the ObjectPool mechanism; I did not add support for this because we are planning to remove ObjectPool in the next few weeks.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7456 from JoshRosen/unsafe-exchange and squashes the following commits:

7e75259 [Josh Rosen] Fix cast in SparkSqlSerializer2Suite
0082515 [Josh Rosen] Some additional comments + small cleanup to remove an unused parameter
a27cfc1 [Josh Rosen] Add missing newline
741973c [Josh Rosen] Add simple test of UnsafeRow shuffling in Exchange.
359c6a4 [Josh Rosen] Remove println() and add comments
93904e7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-exchange
8dd3ff2 [Josh Rosen] Exchange outputs UnsafeRows when its child outputs them
dd9c66d [Josh Rosen] Fix for copying logic
035af21 [Josh Rosen] Add logic for choosing when to use UnsafeRowSerializer
7876f31 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-shuffle
cbea80b [Josh Rosen] Add UnsafeRowSerializer
0f2ac86 [Josh Rosen] Import ordering
3ca8515 [Josh Rosen] Big code simplification in Exchange
3526868 [Josh Rosen] Iniitial cut at removing shuffle on KV pairs
2015-07-19 23:41:28 -07:00
Jacky Li 972d8900a1 [SQL][DOC] Minor document fix in HadoopFsRelationProvider
Catch this while reading the code

Author: Jacky Li <lee.unreal@gmail.com>
Author: Jacky Li <jackylk@users.noreply.github.com>

Closes #7524 from jackylk/patch-11 and squashes the following commits:

b679011 [Jacky Li] fix doc
e10e211 [Jacky Li] [SQL] Minor document fix in HadoopFsRelationProvider
2015-07-19 23:19:17 -07:00
Reynold Xin 5bdf16da90 Code review feedback for the previous patch. 2015-07-19 22:45:56 -07:00
Wenchen Fan 930253e076 [SPARK-9185][SQL] improve code gen for mutable states to support complex initialization
Sometimes we need more than one step to initialize the mutable states in code gen like https://github.com/apache/spark/pull/7516

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7521 from cloud-fan/init and squashes the following commits:

2106445 [Wenchen Fan] improve code gen for mutable states
2015-07-19 22:42:44 -07:00
Liang-Chi Hsieh d743bec645 [SPARK-9172][SQL] Make DecimalPrecision support for Intersect and Except
JIRA: https://issues.apache.org/jira/browse/SPARK-9172

Simply make `DecimalPrecision` support for `Intersect` and `Except` in addition to `Union`.

Besides, add unit test for `DecimalPrecision` as well.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7511 from viirya/more_decimalprecieion and squashes the following commits:

4d29d10 [Liang-Chi Hsieh] Fix code comment.
9fb0d49 [Liang-Chi Hsieh] Make DecimalPrecision support for Intersect and Except.
2015-07-19 20:53:18 -07:00
Tathagata Das 93eb2acfb2 [SPARK-9030] [STREAMING] [HOTFIX] Make sure that no attempts to create Kinesis streams is made when no enabled
Problem: Even when the environment variable to enable tests are disabled, the `beforeAll` of the KinesisStreamSuite attempted to find AWS credentials to create Kinesis stream, and failed.

Solution: Made sure all accesses to KinesisTestUtils, that created streams, is under `testOrIgnore`

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #7519 from tdas/kinesis-tests and squashes the following commits:

64d6d06 [Tathagata Das] Removed empty lines.
7c18473 [Tathagata Das] Putting all access to KinesisTestUtils inside testOrIgnore
2015-07-19 20:34:30 -07:00
Reynold Xin 163e3f1df9 [SPARK-8241][SQL] string function: concat_ws.
I also changed the semantics of concat w.r.t. null back to the same behavior as Hive.
That is to say, concat now returns null if any input is null.

Author: Reynold Xin <rxin@databricks.com>

Closes #7504 from rxin/concat_ws and squashes the following commits:

83fd950 [Reynold Xin] Fixed type casting.
3ae85f7 [Reynold Xin] Write null better.
cdc7be6 [Reynold Xin] Added code generation for pure string mode.
a61c4e4 [Reynold Xin] Updated comments.
2d51406 [Reynold Xin] [SPARK-8241][SQL] string function: concat_ws.
2015-07-19 16:48:47 -07:00
Herman van Hovell 7a81245345 [SPARK-8638] [SQL] Window Function Performance Improvements - Cleanup
This PR contains a few clean-ups that are a part of SPARK-8638: a few style issues got fixed, and a few tests were moved.

Git commit message is wrong BTW :(...

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #7513 from hvanhovell/SPARK-8638-cleanup and squashes the following commits:

4e69d08 [Herman van Hovell] Fixed Perfomance Regression for Shrinking Window Frames (+Rebase)
2015-07-19 16:29:50 -07:00
Nicholas Hwang a803ac3e06 [SPARK-9021] [PYSPARK] Change RDD.aggregate() to do reduce(mapPartitions()) instead of mapPartitions.fold()
I'm relatively new to Spark and functional programming, so forgive me if this pull request is just a result of my misunderstanding of how Spark should be used.

Currently, if one happens to use a mutable object as `zeroValue` for `RDD.aggregate()`, possibly unexpected behavior can occur.

This is because pyspark's current implementation of `RDD.aggregate()` does not serialize or make a copy of `zeroValue` before handing it off to `RDD.mapPartitions(...).fold(...)`. This results in a single reference to `zeroValue` being used for both `RDD.mapPartitions()` and `RDD.fold()` on each partition. This can result in strange accumulator values being fed into each partition's call to `RDD.fold()`, as the `zeroValue` may have been changed in-place during the `RDD.mapPartitions()` call.

As an illustrative example, submit the following to `spark-submit`:
```
from pyspark import SparkConf, SparkContext
import collections

def updateCounter(acc, val):
    print 'update acc:', acc
    print 'update val:', val
    acc[val] += 1
    return acc

def comboCounter(acc1, acc2):
    print 'combo acc1:', acc1
    print 'combo acc2:', acc2
    acc1.update(acc2)
    return acc1

def main():
    conf = SparkConf().setMaster("local").setAppName("Aggregate with Counter")
    sc = SparkContext(conf = conf)

    print '======= AGGREGATING with ONE PARTITION ======='
    print sc.parallelize(range(1,10), 1).aggregate(collections.Counter(), updateCounter, comboCounter)

    print '======= AGGREGATING with TWO PARTITIONS ======='
    print sc.parallelize(range(1,10), 2).aggregate(collections.Counter(), updateCounter, comboCounter)

if __name__ == "__main__":
    main()
```

One probably expects this to output the following:
```
Counter({1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1})
```

But it instead outputs this (regardless of the number of partitions):
```
Counter({1: 2, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2})
```

This is because (I believe) `zeroValue` gets passed correctly to each partition, but after `RDD.mapPartitions()` completes, the `zeroValue` object has been updated and is then passed to `RDD.fold()`, which results in all items being double-counted within each partition before being finally reduced at the calling node.

I realize that this type of calculation is typically done by `RDD.mapPartitions(...).reduceByKey(...)`, but hopefully this illustrates some potentially confusing behavior. I also noticed that other `RDD` methods use this `deepcopy` approach to creating unique copies of `zeroValue` (i.e., `RDD.aggregateByKey()` and `RDD.foldByKey()`), and that the Scala implementations do seem to serialize the `zeroValue` object appropriately to prevent this type of behavior.

Author: Nicholas Hwang <moogling@gmail.com>

Closes #7378 from njhwang/master and squashes the following commits:

659bb27 [Nicholas Hwang] Fixed RDD.aggregate() to perform a reduce operation on collected mapPartitions results, similar to how fold currently is implemented. This prevents an initial combOp being performed on each partition with zeroValue (which leads to unexpected behavior if zeroValue is a mutable object) before being combOp'ed with other partition results.
8d8d694 [Nicholas Hwang] Changed dict construction to be compatible with Python 2.6 (cannot use list comprehensions to make dicts)
56eb2ab [Nicholas Hwang] Fixed whitespace after colon to conform with PEP8
391de4a [Nicholas Hwang] Removed used of collections.Counter from RDD tests for Python 2.6 compatibility; used defaultdict(int) instead. Merged treeAggregate test with mutable zero value into aggregate test to reduce code duplication.
2fa4e4b [Nicholas Hwang] Merge branch 'master' of https://github.com/njhwang/spark
ba528bd [Nicholas Hwang] Updated comments regarding protection of zeroValue from mutation in RDD.aggregate(). Added regression tests for aggregate(), fold(), aggregateByKey(), foldByKey(), and treeAggregate(), all with both 1 and 2 partition RDDs. Confirmed that aggregate() is the only problematic implementation as of commit 257236c3e1. Also replaced some parallelizations of ranges with xranges, per the documentation's recommendations of preferring xrange over range.
7820391 [Nicholas Hwang] Updated comments regarding protection of zeroValue from mutation in RDD.aggregate(). Added regression tests for aggregate(), fold(), aggregateByKey(), foldByKey(), and treeAggregate(), all with both 1 and 2 partition RDDs. Confirmed that aggregate() is the only problematic implementation as of commit 257236c3e1.
90d1544 [Nicholas Hwang] Made sure RDD.aggregate() makes a deepcopy of zeroValue for all partitions; this ensures that the mapPartitions call works with unique copies of zeroValue in each partition, and prevents a single reference to zeroValue being used for both map and fold calls on each partition (resulting in possibly unexpected behavior).
2015-07-19 10:30:28 -07:00
Cheng Lian 34ed82bb44 [HOTFIX] [SQL] Fixes compilation error introduced by PR #7506
PR #7506 breaks master build because of compilation error. Note that #7506 itself looks good, but it seems that `git merge` did something stupid.

Author: Cheng Lian <lian@databricks.com>

Closes #7510 from liancheng/hotfix-for-pr-7506 and squashes the following commits:

7ea7e89 [Cheng Lian] Fixes compilation error
2015-07-19 18:58:19 +08:00
Cheng Lian bc24289f5d [SPARK-9179] [BUILD] Allows committers to specify primary author of the PR to be merged
It's a common case that some contributor contributes an initial version of a feature/bugfix, and later on some other people (mostly committers) fork and add more improvements. When merging these PRs, we probably want to specify the original author as the primary author. Currently we can only do this by running

```
$ git commit --amend --author="name <email>"
```

manually right before the merge script pushes to Apache Git repo. It would be nice if the script accepts user specified primary author information.

Author: Cheng Lian <lian@databricks.com>

Closes #7508 from liancheng/spark-9179 and squashes the following commits:

218d88e [Cheng Lian] Allows committers to specify primary author of the PR to be merged
2015-07-19 17:37:25 +08:00
Reynold Xin 3427937ea2 [SQL] Make date/time functions more consistent with other database systems.
This pull request fixes some of the problems in #6981.

- Added date functions to `__all__` so they get exposed
- Rename day_of_month -> dayofmonth
- Rename day_in_year -> dayofyear
- Rename week_of_year -> weekofyear
- Removed "day" from Scala/Python API since it is ambiguous. Only leaving the alias in SQL.

Author: Reynold Xin <rxin@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Reynold Xin <rxin@databricks.com>

Closes #7506 from rxin/datetime and squashes the following commits:

0cb24d9 [Reynold Xin] Export all functions in Python.
e44a4a0 [Reynold Xin] Removed day function from Scala and Python.
9c08fdc [Reynold Xin] [SQL] Make date/time functions more consistent with other database systems.
2015-07-19 01:17:22 -07:00
Tarek Auel a53d13f7aa [SPARK-8199][SQL] follow up; revert change in test
rxin / davies

Sorry for that unnecessary change. And thanks again for all your support!

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7505 from tarekauel/SPARK-8199-FollowUp and squashes the following commits:

d09321c [Tarek Auel] [SPARK-8199] follow up; revert change in test
c17397f [Tarek Auel] [SPARK-8199] follow up; revert change in test
67acfe6 [Tarek Auel] [SPARK-8199] follow up; revert change in test
2015-07-19 01:16:01 -07:00
Carl Anders Düvel 344d1567e5 [SPARK-9094] [PARENT] Increased io.dropwizard.metrics from 3.1.0 to 3.1.2
We are running Spark 1.4.0 in production and ran into problems because after a network hiccup (which happens often in our current environment) no more metrics were reported to graphite leaving us blindfolded about the current state of our spark applications. [This problem](70559816f1) was fixed in the current version of the metrics library. We run spark with this change  in production now and have seen no problems. We also had a look at the commit history since 3.1.0 and did not detect any potentially  incompatible changes but many fixes which could potentially help other users as well.

Author: Carl Anders Düvel <c.a.duevel@gmail.com>

Closes #7493 from hackbert/bump-metrics-lib-version and squashes the following commits:

6677565 [Carl Anders Düvel] [SPARK-9094] [PARENT] Increased io.dropwizard.metrics from 3.1.0 to 3.1.2 in order to get this fix 70559816f1
2015-07-19 09:14:55 +01:00
Liang-Chi Hsieh 9b644c4130 [SPARK-9166][SQL][PYSPARK] Capture and hide IllegalArgumentException in Python API
JIRA: https://issues.apache.org/jira/browse/SPARK-9166

Simply capture and hide `IllegalArgumentException` in Python API.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7497 from viirya/hide_illegalargument and squashes the following commits:

8324dce [Liang-Chi Hsieh] Fix python style.
9ace67d [Liang-Chi Hsieh] Also check exception message.
8b2ce5c [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into hide_illegalargument
7be016a [Liang-Chi Hsieh] Capture and hide IllegalArgumentException in Python.
2015-07-19 00:32:56 -07:00
Reynold Xin 89d135851d Closes #6775 since it is subsumbed by other patches. 2015-07-18 23:47:40 -07:00
Herman van Hovell a9a0d0cebf [SPARK-8638] [SQL] Window Function Performance Improvements
## Description
Performance improvements for Spark Window functions. This PR will also serve as the basis for moving away from Hive UDAFs to Spark UDAFs. See JIRA tickets SPARK-8638 and SPARK-7712 for more information.

## Improvements
* Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and UNBOUDED FOLLOWING cases. The current implementation in spark uses a sliding window approach in these cases. This means that an aggregate is maintained for every row, so space usage is N (N being the number of rows). This also means that all these aggregates all need to be updated separately, this takes N*(N-1)/2 updates. The running case differs from the Sliding case because we are only adding data to an aggregate function (no reset is required), we only need to maintain one aggregate (like in the UNBOUNDED PRECEDING AND UNBOUNDED case), update the aggregate for each row, and get the aggregate value after each update. This is what the new implementation does. This approach only uses 1 buffer, and only requires N updates; I am currently working on data with window sizes of 500-1000 doing running sums and this saves a lot of time. The CURRENT ROW AND UNBOUNDED FOLLOWING case also uses this approach and the fact that aggregate operations are communitative, there is one twist though it will process the input buffer in reverse.
* Fewer comparisons in the sliding case. The current implementation determines frame boundaries for every input row. The new implementation makes more use of the fact that the window is sorted, maintains the boundaries, and only moves them when the current row order changes. This is a minor improvement.
* A single Window node is able to process all types of Frames for the same Partitioning/Ordering. This saves a little time/memory spent buffering and managing partitions. This will be enabled in a follow-up PR.
* A lot of the staging code is moved from the execution phase to the initialization phase. Minor performance improvement, and improves readability of the execution code.

## Benchmarking
I have done a small benchmark using [on time performance](http://www.transtats.bts.gov) data of the month april. I have used the origin as a partioning key, as a result there is quite some variation in window sizes. The code for the benchmark can be found in the JIRA ticket. These are the results per Frame type:

Frame | Master | SPARK-8638
----- | ------ | ----------
Entire Frame | 2 s | 1 s
Sliding | 18 s | 1 s
Growing | 14 s | 0.9 s
Shrinking | 13 s | 1 s

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #7057 from hvanhovell/SPARK-8638 and squashes the following commits:

3bfdc49 [Herman van Hovell] Fixed Perfomance Regression for Shrinking Window Frames (+Rebase)
2eb3b33 [Herman van Hovell] Corrected reverse range frame processing.
2cd2d5b [Herman van Hovell] Corrected reverse range frame processing.
b0654d7 [Herman van Hovell] Tests for exotic frame specifications.
e75b76e [Herman van Hovell] More docs, added support for reverse sliding range frames, and some reorganization of code.
1fdb558 [Herman van Hovell] Changed Data In HiveDataFrameWindowSuite.
ac2f682 [Herman van Hovell] Added a few more comments.
1938312 [Herman van Hovell] Added Documentation to the createBoundOrdering methods.
bb020e6 [Herman van Hovell] Major overhaul of Window operator.
2015-07-18 23:44:38 -07:00
Reynold Xin 04c1b49f5e Fixed test cases. 2015-07-18 22:50:34 -07:00
Tarek Auel 83b682beec [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-8182][SPARK-8181][SPARK-8180][SPARK-8179][SPARK-8177][SPARK-8178][SPARK-9115][SQL] date functions
Jira:
https://issues.apache.org/jira/browse/SPARK-8199
https://issues.apache.org/jira/browse/SPARK-8184
https://issues.apache.org/jira/browse/SPARK-8183
https://issues.apache.org/jira/browse/SPARK-8182
https://issues.apache.org/jira/browse/SPARK-8181
https://issues.apache.org/jira/browse/SPARK-8180
https://issues.apache.org/jira/browse/SPARK-8179
https://issues.apache.org/jira/browse/SPARK-8177
https://issues.apache.org/jira/browse/SPARK-8179
https://issues.apache.org/jira/browse/SPARK-9115

Regarding `day`and `dayofmonth` are both necessary?

~~I am going to add `Quarter` to this PR as well.~~ Done.

~~As soon as the Scala coding is reviewed and discussed, I'll add the python api.~~ Done

Author: Tarek Auel <tarek.auel@googlemail.com>
Author: Tarek Auel <tarek.auel@gmail.com>

Closes #6981 from tarekauel/SPARK-8199 and squashes the following commits:

f7b4c8c [Tarek Auel] [SPARK-8199] fixed bug in tests
bb567b6 [Tarek Auel] [SPARK-8199] fixed test
3e095ba [Tarek Auel] [SPARK-8199] style and timezone fix
256c357 [Tarek Auel] [SPARK-8199] code cleanup
5983dcc [Tarek Auel] [SPARK-8199] whitespace fix
6e0c78f [Tarek Auel] [SPARK-8199] removed setTimeZone in tests, according to cloud-fans comment in #7488
4afc09c [Tarek Auel] [SPARK-8199] concise leap year handling
ea6c110 [Tarek Auel] [SPARK-8199] fix after merging master
70238e0 [Tarek Auel] Merge branch 'master' into SPARK-8199
3c6ae2e [Tarek Auel] [SPARK-8199] removed binary search
fb98ba0 [Tarek Auel] [SPARK-8199] python docstring fix
cdfae27 [Tarek Auel] [SPARK-8199] cleanup & python docstring fix
746b80a [Tarek Auel] [SPARK-8199] build fix
0ad6db8 [Tarek Auel] [SPARK-8199] minor fix
523542d [Tarek Auel] [SPARK-8199] address comments
2259299 [Tarek Auel] [SPARK-8199] day_of_month alias
d01b977 [Tarek Auel] [SPARK-8199] python underscore
56c4a92 [Tarek Auel] [SPARK-8199] update python docu
e223bc0 [Tarek Auel] [SPARK-8199] refactoring
d6aa14e [Tarek Auel] [SPARK-8199] fixed Hive compatibility
b382267 [Tarek Auel] [SPARK-8199] fixed bug in day calculation; removed set TimeZone in HiveCompatibilitySuite for test purposes; removed Hive tests for second and minute, because we can cast '2015-03-18' to a timestamp and extract a minute/second from it
1b2e540 [Tarek Auel] [SPARK-8119] style fix
0852655 [Tarek Auel] [SPARK-8119] changed from ExpectsInputTypes to implicit casts
ec87c69 [Tarek Auel] [SPARK-8119] bug fixing and refactoring
1358cdc [Tarek Auel] Merge remote-tracking branch 'origin/master' into SPARK-8199
740af0e [Tarek Auel] implement date function using a calculation based on days
4fb66da [Tarek Auel] WIP: date functions on calculation only
1a436c9 [Tarek Auel] wip
f775f39 [Tarek Auel] fixed return type
ad17e96 [Tarek Auel] improved implementation
c42b444 [Tarek Auel] Removed merge conflict file
ccb723c [Tarek Auel] [SPARK-8199] style and fixed merge issues
10e4ad1 [Tarek Auel] Merge branch 'master' into date-functions-fast
7d9f0eb [Tarek Auel] [SPARK-8199] git renaming issue
f3e7a9f [Tarek Auel] [SPARK-8199] revert change in DataFrameFunctionsSuite
6f5d95c [Tarek Auel] [SPARK-8199] fixed year interval
d9f8ac3 [Tarek Auel] [SPARK-8199] implement fast track
7bc9d93 [Tarek Auel] Merge branch 'master' into SPARK-8199
5a105d9 [Tarek Auel] [SPARK-8199] rebase after #6985 got merged
eb6760d [Tarek Auel] Merge branch 'master' into SPARK-8199
f120415 [Tarek Auel] improved runtime
a8edebd [Tarek Auel] use Calendar instead of SimpleDateFormat
5fe74e1 [Tarek Auel] fixed python style
3bfac90 [Tarek Auel] fixed style
356df78 [Tarek Auel] rely on cast mechanism of Spark. Simplified implementation
02efc5d [Tarek Auel] removed doubled code
a5ea120 [Tarek Auel] added python api; changed test to be more meaningful
b680db6 [Tarek Auel] added codegeneration to all functions
c739788 [Tarek Auel] added support for quarter SPARK-8178
849fb41 [Tarek Auel] fixed stupid test
638596f [Tarek Auel] improved codegen
4d8049b [Tarek Auel] fixed tests and added type check
5ebb235 [Tarek Auel] resolved naming conflict
d0e2f99 [Tarek Auel] date functions
2015-07-18 22:48:05 -07:00
Forest Fang 6cb6096c01 [SPARK-8443][SQL] Split GenerateMutableProjection Codegen due to JVM Code Size Limits
By grouping projection calls into multiple apply function, we are able to push the number of projections codegen can handle from ~1k to ~60k. I have set the unit test to test against 5k as 60k took 15s for the unit test to complete.

Author: Forest Fang <forest.fang@outlook.com>

Closes #7076 from saurfang/codegen_size_limit and squashes the following commits:

b7a7635 [Forest Fang] [SPARK-8443][SQL] Execute and verify split projections in test
adef95a [Forest Fang] [SPARK-8443][SQL] Use safer factor and rewrite splitting code
1b5aa7e [Forest Fang] [SPARK-8443][SQL] inline execution if one block only
9405680 [Forest Fang] [SPARK-8443][SQL] split projection code by size limit
2015-07-18 21:05:44 -07:00
Reynold Xin 45d798c323 [SPARK-8278] Remove non-streaming JSON reader.
Author: Reynold Xin <rxin@databricks.com>

Closes #7501 from rxin/jsonrdd and squashes the following commits:

767ec55 [Reynold Xin] More Mima
51f456e [Reynold Xin] Mima exclude.
789cb80 [Reynold Xin] Fixed compilation error.
b4cf50d [Reynold Xin] [SPARK-8278] Remove non-streaming JSON reader.
2015-07-18 20:27:55 -07:00
Reynold Xin 9914b1b2c5 [SPARK-9150][SQL] Create CodegenFallback and Unevaluable trait
It is very hard to track which expressions have code gen implemented or not. This patch removes the default fallback gencode implementation from Expression, and moves that into a new trait called CodegenFallback. Each concrete expression needs to either implement code generation, or mix in CodegenFallback. This makes it very easy to track which expressions have code generation implemented already.

Additionally, this patch creates an Unevaluable trait that can be used to track expressions that don't support evaluation (e.g. Star).

Author: Reynold Xin <rxin@databricks.com>

Closes #7487 from rxin/codegenfallback and squashes the following commits:

14ebf38 [Reynold Xin] Fixed Conv
6c1c882 [Reynold Xin] Fixed Alias.
b42611b [Reynold Xin] [SPARK-9150][SQL] Create a trait to track code generation for expressions.
cb5c066 [Reynold Xin] Removed extra import.
39cbe40 [Reynold Xin] [SPARK-8240][SQL] string function: concat
2015-07-18 18:18:19 -07:00
Reynold Xin e16a19a39e [SPARK-9174][SQL] Add documentation for all public SQLConfs.
Author: Reynold Xin <rxin@databricks.com>

Closes #7500 from rxin/sqlconf and squashes the following commits:

a5726c8 [Reynold Xin] [SPARK-9174][SQL] Add documentation for all public SQLConfs.
2015-07-18 15:29:38 -07:00
Reynold Xin 6e1e2eba69 [SPARK-8240][SQL] string function: concat
Author: Reynold Xin <rxin@databricks.com>

Closes #7486 from rxin/concat and squashes the following commits:

5217d6e [Reynold Xin] Removed Hive's concat test.
f5cb7a3 [Reynold Xin] Concat is never nullable.
ae4e61f [Reynold Xin] Removed extra import.
fddcbbd [Reynold Xin] Fixed NPE.
22e831c [Reynold Xin] Added missing file.
57a2352 [Reynold Xin] [SPARK-8240][SQL] string function: concat
2015-07-18 14:07:56 -07:00