Commit graph

11558 commits

Author SHA1 Message Date
Wenchen Fan b71d3254e5 [SPARK-8075] [SQL] apply type check interface to more expressions
A follow-up to https://github.com/apache/spark/pull/6405.
Note: this is not a big change. Most of the diff comes from moving code in `aggregates.scala` so that each aggregate function sits right below its corresponding aggregate expression.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6723 from cloud-fan/type-check and squashes the following commits:

2124301 [Wenchen Fan] fix tests
5a658bb [Wenchen Fan] add tests
287d3bb [Wenchen Fan] apply type check interface to more expressions
2015-06-24 16:26:00 -07:00
Yin Huai 7daa70292e [SPARK-8567] [SQL] Increase the timeout of HiveSparkSubmitSuite
https://issues.apache.org/jira/browse/SPARK-8567

Author: Yin Huai <yhuai@databricks.com>

Closes #6957 from yhuai/SPARK-8567 and squashes the following commits:

62dff5b [Yin Huai] Increase the timeout.
2015-06-24 15:52:58 -07:00
fe2s dca21a83ac [SPARK-8558] [BUILD] Script /dev/run-tests fails when _JAVA_OPTIONS env var set
Author: fe2s <aka.fe2s@gmail.com>
Author: Oleksiy Dyagilev <oleksiy_dyagilev@epam.com>

Closes #6956 from fe2s/fix-run-tests and squashes the following commits:

31b6edc [fe2s] str is a built-in function, so using it as a variable name will lead to spurious warnings in some Python linters
7d781a0 [fe2s] fixing for openjdk/IBM, seems like they have slightly different wording, but all have 'version' word. Surrounding with spaces for the case if version word appears in _JAVA_OPTIONS
cd455ef [fe2s] address comment, looking for java version string rather than expecting to have on a certain line number
ad577d7 [Oleksiy Dyagilev] [SPARK-8558][BUILD] Script /dev/run-tests fails when _JAVA_OPTIONS env var set
2015-06-24 15:12:23 -07:00
Cheng Lian 8ab50765cd [SPARK-6777] [SQL] Implements backwards compatibility rules in CatalystSchemaConverter
This PR introduces `CatalystSchemaConverter` for converting Parquet schema to Spark SQL schema and vice versa.  Original conversion code in `ParquetTypesConverter` is removed. Benefits of the new version are:

1. When converting Spark SQL schemas, it generates standard Parquet schemas conforming to [the most updated Parquet format spec] [1]. Converting to old style Parquet schemas is also supported via feature flag `spark.sql.parquet.followParquetFormatSpec` (which is set to `false` for now, and should be set to `true` after both read and write paths are fixed).

   Note that although this version of the Parquet format spec hasn't been officially released yet, Parquet MR 1.7.0 already sticks to it, so it should be safe to follow.

1. It implements the backwards-compatibility rules described in the most updated Parquet format spec, so it can recognize more schema patterns generated by other/legacy systems/tools.
1. Code organization follows convention used in [parquet-mr] [2], which is easier to follow. (Structure of `CatalystSchemaConverter` is similar to `AvroSchemaConverter`).

To fully implement backwards-compatibility rules in both read and write path, we also need to update `CatalystRowConverter` (which is responsible for converting Parquet records to `Row`s), `RowReadSupport`, and `RowWriteSupport`. These would be done in follow-up PRs.

TODO

- [x] More schema conversion test cases for legacy schema patterns.

[1]: ea09522659/LogicalTypes.md
[2]: https://github.com/apache/parquet-mr/

Author: Cheng Lian <lian@databricks.com>

Closes #6617 from liancheng/spark-6777 and squashes the following commits:

2a2062d [Cheng Lian] Don't convert decimals without precision information
b60979b [Cheng Lian] Adds a constructor which accepts a Configuration, and fixes default value of assumeBinaryIsString
743730f [Cheng Lian] Decimal scale shouldn't be larger than precision
a104a9e [Cheng Lian] Fixes Scala style issue
1f71d8d [Cheng Lian] Adds feature flag to allow falling back to old style Parquet schema conversion
ba84f4b [Cheng Lian] Fixes MapType schema conversion bug
13cb8d5 [Cheng Lian] Fixes MiMa failure
81de5b0 [Cheng Lian] Fixes UDT, workaround read path, and add tests
28ef95b [Cheng Lian] More AnalysisExceptions
b10c322 [Cheng Lian] Replaces require() with analysisRequire() which throws AnalysisException
cceaf3f [Cheng Lian] Implements backwards compatibility rules in CatalystSchemaConverter
2015-06-24 15:03:43 -07:00
MechCoder fb32c38898 [SPARK-7633] [MLLIB] [PYSPARK] Python bindings for StreamingLogisticRegressionwithSGD
Add Python bindings to StreamingLogisticRegressionwithSGD.

No Java wrappers are needed as models are updated directly using train.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6849 from MechCoder/spark-3258 and squashes the following commits:

b4376a5 [MechCoder] minor
d7e5fc1 [MechCoder] Refactor into StreamingLinearAlgorithm Better docs
9c09d4e [MechCoder] [SPARK-7633] Python bindings for StreamingLogisticRegressionwithSGD
2015-06-24 14:58:43 -07:00
Wenchen Fan f04b5672c5 [SPARK-7289] handle project -> limit -> sort efficiently
Makes the `TakeOrdered` strategy and operator more general, so that it can optionally handle a projection when necessary.
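
As a rough illustration of the idea only (this is not Spark's operator), a take-ordered step that applies an optional projection to just the rows it keeps could be sketched in Python as follows. The function name `take_ordered_and_project` and its parameters are hypothetical.

```python
import heapq
from typing import Any, Callable, Iterable, List, Optional


def take_ordered_and_project(
    rows: Iterable[Any],
    limit: int,
    key: Callable[[Any], Any],
    project: Optional[Callable[[Any], Any]] = None,
) -> List[Any]:
    """Return the `limit` smallest rows by `key`, projecting only the survivors."""
    # heapq.nsmallest avoids a full sort when only `limit` rows are needed.
    top = heapq.nsmallest(limit, rows, key=key)
    # The projection runs after the limit, so it touches at most `limit` rows.
    return top if project is None else [project(r) for r in top]


rows = [{"name": "a", "score": 3}, {"name": "b", "score": 1}, {"name": "c", "score": 2}]
result = take_ordered_and_project(
    rows, 2, key=lambda r: r["score"], project=lambda r: r["name"]
)
```

Projecting after the limit is the point of the sketch: the projection cost is bounded by the limit rather than by the input size.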

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6780 from cloud-fan/limit and squashes the following commits:

34aa07b [Wenchen Fan] revert
07d5456 [Wenchen Fan] clean closure
20821ec [Wenchen Fan] fix
3676a82 [Wenchen Fan] address comments
b558549 [Wenchen Fan] address comments
214842b [Wenchen Fan] fix style
2d8be83 [Wenchen Fan] add LimitPushDown
948f740 [Wenchen Fan] fix existing
2015-06-24 13:28:50 -07:00
Santiago M. Mola b84d4b4dfe [SPARK-7088] [SQL] Fix analysis for 3rd party logical plan.
ResolveReferences analysis rule now does not throw when it cannot resolve references in a self-join.

Author: Santiago M. Mola <smola@stratio.com>

Closes #6853 from smola/SPARK-7088 and squashes the following commits:

af71ac7 [Santiago M. Mola] [SPARK-7088] Fix analysis for 3rd party logical plan.
2015-06-24 12:29:07 -07:00
Holden Karau 43e66192f4 [SPARK-8506] Add packages to R context created through init.
Author: Holden Karau <holden@pigscanfly.ca>

Closes #6928 from holdenk/SPARK-8506-sparkr-does-not-provide-an-easy-way-to-depend-on-spark-packages-when-performing-init-from-inside-of-r and squashes the following commits:

b60dd63 [Holden Karau] Add an example with the spark-csv package
fa8bc92 [Holden Karau] typo: sparm -> spark
865a90c [Holden Karau] strip spaces for comparison
c7a4471 [Holden Karau] Add some documentation
c1a9233 [Holden Karau] refactor for testing
c818556 [Holden Karau] Add packages to R
2015-06-24 11:55:20 -07:00
BenFradet 1173483f3f [SPARK-8399] [STREAMING] [WEB UI] Overlap between histograms and axis' name in Spark Streaming UI
Moved where the X axis' name (#batches) is written in histograms in the spark streaming web ui so the histograms and the axis' name do not overlap.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #6845 from BenFradet/SPARK-8399 and squashes the following commits:

b63695f [BenFradet] adjusted inner histograms
eb610ee [BenFradet] readjusted #batches on the x axis
dd46f98 [BenFradet] aligned all unit labels and ticks
0564b62 [BenFradet] readjusted #batches placement
edd0936 [BenFradet] moved where the X axis' name (#batches) is written in histograms in the spark streaming web ui
2015-06-24 11:53:03 -07:00
Nicholas Chammas 31f48e5af8 [SPARK-8576] Add spark-ec2 options to set IAM roles and instance-initiated shutdown behavior
Both of these options are useful when spark-ec2 is being used as part of an automated pipeline and the engineers want to minimize the need to pass around AWS keys for access to things like S3 (keys are replaced by the IAM role) and to be able to launch a cluster that can terminate itself cleanly.

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #6962 from nchammas/additional-ec2-options and squashes the following commits:

fcf252e [Nicholas Chammas] PEP8 fixes
efba9ee [Nicholas Chammas] add help for --instance-initiated-shutdown-behavior
598aecf [Nicholas Chammas] option to launch instances into IAM role
2743632 [Nicholas Chammas] add option for instance initiated shutdown
2015-06-24 11:20:51 -07:00
Yin Huai bba6699d0e [SPARK-8578] [SQL] Should ignore user defined output committer when appending data
https://issues.apache.org/jira/browse/SPARK-8578

It is not very safe to use a custom output committer when appending data to an existing directory. This change adds logic to check whether we are appending data and, if so, uses the output committer associated with the file output format.

Author: Yin Huai <yhuai@databricks.com>

Closes #6964 from yhuai/SPARK-8578 and squashes the following commits:

43544c4 [Yin Huai] Do not use a custom output committer when appending data.
2015-06-24 09:50:03 -07:00
Cheng Lian 9d36ec2431 [SPARK-8567] [SQL] Debugging flaky HiveSparkSubmitSuite
Uses an approach similar to the one in `HiveThriftServer2Suite`: prints stdout/stderr of the spawned process instead of logging them, to see what happens on Jenkins. (This test suite only fails on Jenkins and doesn't emit any logs...)

cc yhuai

Author: Cheng Lian <lian@databricks.com>

Closes #6978 from liancheng/debug-hive-spark-submit-suite and squashes the following commits:

b031647 [Cheng Lian] Prints process stdout/stderr instead of logging them
2015-06-24 09:49:20 -07:00
Cheng Lian cc465fd924 [SPARK-8138] [SQL] Improves error message when conflicting partition columns are found
This PR improves the error message shown when conflicting partition column names are detected.  This can be particularly annoying and confusing when there are a large number of partitions while a handful of them happened to contain unexpected temporary file(s).  Now all suspicious directories are listed as below:

```
java.lang.AssertionError: assertion failed: Conflicting partition column names detected:

        Partition column name list #0: b, c, d
        Partition column name list #1: b, c
        Partition column name list #2: b

For partitioned table directories, data files should only live in leaf directories. Please check the following directories for unexpected files:

        file:/tmp/foo/b=0
        file:/tmp/foo/b=1
        file:/tmp/foo/b=1/c=1
        file:/tmp/foo/b=0/c=0
```
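
The check behind this message can be sketched roughly as below, assuming partition directories use `key=value` path segments. This is an illustrative Python sketch, not the Spark implementation; `partition_columns`, `find_conflicts`, and the sample paths are hypothetical.

```python
from typing import Dict, List, Tuple


def partition_columns(path: str, base: str) -> Tuple[str, ...]:
    """Extract partition column names from `key=value` segments below `base`."""
    relative = path[len(base):].strip("/")
    return tuple(seg.split("=", 1)[0] for seg in relative.split("/") if "=" in seg)


def find_conflicts(base: str, data_dirs: List[str]) -> Dict[Tuple[str, ...], List[str]]:
    """Group directories that contain data files by their partition column list."""
    groups: Dict[Tuple[str, ...], List[str]] = {}
    for d in data_dirs:
        groups.setdefault(partition_columns(d, base), []).append(d)
    # More than one distinct column list means the layout is inconsistent.
    return groups if len(groups) > 1 else {}


conflicts = find_conflicts(
    "file:/tmp/foo",
    ["file:/tmp/foo/b=0/c=0/d=0", "file:/tmp/foo/b=1/c=1", "file:/tmp/foo/b=0"],
)
```

Returning the grouped directories, rather than only a flag, is what makes it possible to list every suspicious directory in the error message.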

Author: Cheng Lian <lian@databricks.com>

Closes #6610 from liancheng/part-errmsg and squashes the following commits:

7d05f2c [Cheng Lian] Fixes Scala style issue
a149250 [Cheng Lian] Adds test case for the error message
6b74dd8 [Cheng Lian] Also lists suspicious non-leaf partition directories
a935eb8 [Cheng Lian] Improves error message when conflicting partition columns are found
2015-06-24 02:17:12 -07:00
Wenchen Fan 09fcf96b8f [SPARK-8371] [SQL] improve unit test for MaxOf and MinOf and fix bugs
A follow-up to https://github.com/apache/spark/pull/6813

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6825 from cloud-fan/cg and squashes the following commits:

43170cc [Wenchen Fan] fix bugs in code gen
2015-06-23 23:11:42 -07:00
Josh Rosen 13ae806b25 [HOTFIX] [BUILD] Fix MiMa checks in master branch; enable MiMa for launcher project
This commit changes the MiMa tests to test against the released 1.4.0 artifacts rather than 1.4.0-rc4; this change is necessary to fix a Jenkins build break since it seems that the RC4 snapshot is no longer available via Maven.

I also enabled MiMa checks for the `launcher` subproject, which we should have done right after 1.4.0 was released.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6974 from JoshRosen/mima-hotfix and squashes the following commits:

4b4175a [Josh Rosen] [HOTFIX] [BUILD] Fix MiMa checks in master branch; enable MiMa for launcher project
2015-06-23 23:03:59 -07:00
Eric Liang 50c3a86f42 [SPARK-6749] [SQL] Make metastore client robust to underlying socket connection loss
This works around a bug in the underlying RetryingMetaStoreClient (HIVE-10384) by refreshing the metastore client on thrift exceptions. We attempt to emulate the proper hive behavior by retrying only as configured by hiveconf.
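
The refresh-and-retry pattern described here can be sketched generically in Python. Everything in this sketch is an assumption made for illustration: `call_with_refresh`, the fake client, and the use of `ConnectionError` as a stand-in for a Thrift exception are hypothetical, and `max_retries` stands in for whatever the Hive configuration specifies.

```python
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")
C = TypeVar("C")


def call_with_refresh(
    make_client: Callable[[], C],
    action: Callable[[C], T],
    max_retries: int,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
) -> T:
    """Run `action`, rebuilding the client after each retryable failure."""
    client = make_client()
    for attempt in range(max_retries + 1):
        try:
            return action(client)
        except retry_on:
            if attempt == max_retries:
                raise  # retry budget exhausted: surface the original error
            client = make_client()  # drop the broken connection and reconnect
    raise AssertionError("unreachable when max_retries >= 0")


# Tiny demonstration with a fake client whose first connection is broken.
created: List[int] = []


def make_fake_client() -> int:
    created.append(len(created))
    return created[-1]


def list_tables(client: int) -> str:
    if client == 0:
        raise ConnectionError("socket closed")
    return f"tables via client {client}"


outcome = call_with_refresh(make_fake_client, list_tables, max_retries=1)
```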

Author: Eric Liang <ekl@databricks.com>

Closes #6912 from ericl/spark-6749 and squashes the following commits:

2d54b55 [Eric Liang] use conf from state
0e3a74e [Eric Liang] use shim properly
980b3e5 [Eric Liang] Fix conf parsing hive 0.14 conf.
92459b6 [Eric Liang] Work around RetryingMetaStoreClient bug
2015-06-23 22:27:17 -07:00
Reynold Xin a458efc66c Revert "[SPARK-7157][SQL] add sampleBy to DataFrame"
This reverts commit 0401cbaa8e.

The new test case on Jenkins is failing.
2015-06-23 19:30:25 -07:00
Xiangrui Meng 0401cbaa8e [SPARK-7157][SQL] add sampleBy to DataFrame
Add `sampleBy` to DataFrame. rxin

Author: Xiangrui Meng <meng@databricks.com>

Closes #6769 from mengxr/SPARK-7157 and squashes the following commits:

991f26f [Xiangrui Meng] fix seed
4a14834 [Xiangrui Meng] move sampleBy to stat
832f7cc [Xiangrui Meng] add sampleBy to DataFrame
2015-06-23 17:46:29 -07:00
Cheng Lian 111d6b9b8a [SPARK-8139] [SQL] Updates docs and comments of data sources and Parquet output committer options
This PR only applies to master branch (1.5.0-SNAPSHOT) since it references `org.apache.parquet` classes which only appear in Parquet 1.7.0.

Author: Cheng Lian <lian@databricks.com>

Closes #6683 from liancheng/output-committer-docs and squashes the following commits:

b4648b8 [Cheng Lian] Removes spark.sql.sources.outputCommitterClass as it's not a public option
ee63923 [Cheng Lian] Updates docs and comments of data sources and Parquet output committer options
2015-06-23 17:24:26 -07:00
Davies Liu 7fb5ae5024 [SPARK-8573] [SPARK-8568] [SQL] [PYSPARK] raise Exception if column is used in boolean expression
It's a common mistake for users to put a Column in a boolean expression (together with `and`, `or`), which does not work as expected. We should raise an exception in that case and suggest that the user use `&`, `|` instead.
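
A minimal sketch of the mechanism, using a hypothetical `Col` class rather than PySpark's actual `Column`: Python's `and` / `or` call `bool()` on their left operand, so raising from `__bool__` turns the silent misuse into an error, while `&` and `|` remain overloadable.

```python
class Col:
    """Hypothetical stand-in for a column expression; not PySpark's Column class."""

    def __init__(self, expr: str) -> None:
        self.expr = expr

    def __and__(self, other: "Col") -> "Col":
        return Col(f"({self.expr} AND {other.expr})")

    def __or__(self, other: "Col") -> "Col":
        return Col(f"({self.expr} OR {other.expr})")

    def __bool__(self) -> bool:
        # `and` / `or` evaluate bool(left operand), so fail loudly here.
        raise ValueError("Cannot convert column into bool: use '&' for 'and', '|' for 'or'.")


a, b = Col("x > 1"), Col("y < 2")
combined = a & b  # builds a combined expression

try:
    a and b  # triggers __bool__ and raises
    raised = False
except ValueError:
    raised = True
```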

Author: Davies Liu <davies@databricks.com>

Closes #6961 from davies/column_bool and squashes the following commits:

9f19beb [Davies Liu] update message
af74bd6 [Davies Liu] fix tests
07dff84 [Davies Liu] address comments, fix tests
f70c08e [Davies Liu] raise Exception if column is used in boolean expression
2015-06-23 15:51:16 -07:00
Cheng Lian d96d7b5574 [DOC] [SQL] Adds Hive metastore Parquet table conversion section
This PR adds a section about Hive metastore Parquet table conversion. It documents:

1. Schema reconciliation rules introduced in #5214 (see [this comment] [1] in #5188)
2. Metadata refreshing requirement introduced in #5339

[1]: https://github.com/apache/spark/pull/5188#issuecomment-86531248

Author: Cheng Lian <lian@databricks.com>

Closes #5348 from liancheng/sql-doc-parquet-conversion and squashes the following commits:

42ae0d0 [Cheng Lian] Adds Python `refreshTable` snippet
4c9847d [Cheng Lian] Resorts to SQL for Python metadata refreshing snippet
756e660 [Cheng Lian] Adds Python snippet for metadata refreshing
50675db [Cheng Lian] Adds Hive metastore Parquet table conversion section
2015-06-23 14:19:21 -07:00
Oleksiy Dyagilev a8031183af [SPARK-8525] [MLLIB] fix LabeledPoint parser when there is a whitespace between label and features vector
Fixes the LabeledPoint parser when there is whitespace between the label and the features vector, e.g.
(y, [x1, x2, x3])
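
For illustration, a whitespace-tolerant parser for this dense `(label, [features])` form could look like the Python sketch below. It is not MLlib's parser; `parse_labeled_point` and the regular expression are hypothetical and cover only the format shown above.

```python
import re
from typing import List, Tuple

# Optional whitespace is allowed around the label, the comma, and the brackets.
_LABELED_POINT = re.compile(r"^\s*\(\s*([^,\s]+)\s*,\s*\[(.*)\]\s*\)\s*$")


def parse_labeled_point(text: str) -> Tuple[float, List[float]]:
    """Parse '(y, [x1, x2, x3])' into (label, features)."""
    match = _LABELED_POINT.match(text)
    if match is None:
        raise ValueError(f"Cannot parse labeled point: {text!r}")
    label = float(match.group(1))
    body = match.group(2).strip()
    # float() ignores surrounding whitespace, so "0.5, 2.0" splits cleanly.
    features = [float(x) for x in body.split(",")] if body else []
    return label, features


with_space = parse_labeled_point("(1.0, [0.5, 2.0, 3.0])")
without_space = parse_labeled_point("(1.0,[0.5,2.0,3.0])")
```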

Author: Oleksiy Dyagilev <oleksiy_dyagilev@epam.com>

Closes #6954 from fe2s/SPARK-8525 and squashes the following commits:

0755b9d [Oleksiy Dyagilev] [SPARK-8525][MLLIB] addressing comment, removing dep on commons-lang
c1abc2b [Oleksiy Dyagilev] [SPARK-8525][MLLIB] fix LabeledPoint parser when there is a whitespace on specific position
2015-06-23 13:12:19 -07:00
Alok Singh f2fb0285ab [SPARK-8111] [SPARKR] SparkR shell should display Spark logo and version banner on startup.
spark version is taken from the environment variable SPARK_VERSION

Author: Alok  Singh <singhal@Aloks-MacBook-Pro.local>
Author: Alok  Singh <singhal@aloks-mbp.usca.ibm.com>

Closes #6944 from aloknsingh/aloknsingh_spark_jiras and squashes the following commits:

ed607bd [Alok  Singh] [SPARK-8111][SparkR] As per suggestion, 1) using the version from sparkContext rather than the Sys.env. 2) change "Welcome to SparkR!" to "Welcome to" followed by Spark logo and version
acd5b85 [Alok  Singh] fix the jira SPARK-8111 to add the spark version and logo. Currently spark version is taken from the environment variable SPARK_VERSION
2015-06-23 12:47:55 -07:00
MechCoder f2022fa0d3 [SPARK-8265] [MLLIB] [PYSPARK] Add LinearDataGenerator to pyspark.mllib.utils
It is useful to generate linear data for easy testing of linear models and in general. Scala already has it. This is just a wrapper around the Scala code.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6715 from MechCoder/generate_linear_input and squashes the following commits:

6182884 [MechCoder] Minor changes
8bda047 [MechCoder] Minor style fixes
0f1053c [MechCoder] [SPARK-8265] Add LinearDataGenerator to pyspark.mllib.utils
2015-06-23 12:43:32 -07:00
Holden Karau 2b1111dd0b [SPARK-7888] Be able to disable intercept in linear regression in ml package
Author: Holden Karau <holden@pigscanfly.ca>

Closes #6927 from holdenk/SPARK-7888-Be-able-to-disable-intercept-in-Linear-Regression-in-ML-package and squashes the following commits:

0ad384c [Holden Karau] Add MiMa excludes
4016fac [Holden Karau] Switch to wild card import, remove extra blank lines
ae5baa8 [Holden Karau] CR feedback, move the fitIntercept down rather than changing ymean and etc above
f34971c [Holden Karau] Fix some more long lines
319bd3f [Holden Karau] Fix long lines
3bb9ee1 [Holden Karau] Update the regression suite tests
7015b9f [Holden Karau] Our code performs the same with R, except we need more than one data point but that seems reasonable
0b0c8c0 [Holden Karau] fix the issue with the sample R code
e2140ba [Holden Karau] Add a test, it fails!
5e84a0b [Holden Karau] Write out thoughts and use the correct trait
91ffc0a [Holden Karau] more murh
006246c [Holden Karau] murp?
2015-06-23 12:42:17 -07:00
Davies Liu 6f4cadf5ee [SPARK-8432] [SQL] fix hashCode() and equals() of BinaryType in Row
Also added more tests in LiteralExpressionSuite

Author: Davies Liu <davies@databricks.com>

Closes #6876 from davies/fix_hashcode and squashes the following commits:

429c2c0 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_hashcode
32d9811 [Davies Liu] fix test
a0626ed [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_hashcode
89c2432 [Davies Liu] fix style
bd20780 [Davies Liu] check with catalyst types
41caec6 [Davies Liu] change for to while
d96929b [Davies Liu] address comment
6ad2a90 [Davies Liu] fix style
5819d33 [Davies Liu] unify equals() and hashCode()
0fff25d [Davies Liu] fix style
53c38b1 [Davies Liu] fix hashCode() and equals() of BinaryType in Row
2015-06-23 11:55:47 -07:00
Cheng Hao 7b1450b666 [SPARK-7235] [SQL] Refactor the grouping sets
The logical plan `Expand` takes the `output` as a constructor argument, which breaks the reference chain. We need to refactor the code, as well as the column pruning.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #5780 from chenghao-intel/expand and squashes the following commits:

76e4aa4 [Cheng Hao] revert the change for case insensitive
7c10a83 [Cheng Hao] refactor the grouping sets
2015-06-23 10:52:17 -07:00
lockwobr 4f7fbefb8d [SQL] [DOCS] updated the documentation for explode
The syntax in the `explode` example was incorrect.

Author: lockwobr <lockwobr@gmail.com>

Closes #6943 from lockwobr/master and squashes the following commits:

3d864d1 [lockwobr] updated the documentation for explode
2015-06-24 02:48:56 +09:00
Holden Karau 0f92be5b5f [SPARK-8498] [TUNGSTEN] fix npe in errorhandling path in unsafeshuffle writer
Author: Holden Karau <holden@pigscanfly.ca>

Closes #6918 from holdenk/SPARK-8498-fix-npe-in-errorhandling-path-in-unsafeshuffle-writer and squashes the following commits:

f807832 [Holden Karau] Log error if we can't throw it
855f9aa [Holden Karau] Spelling - not my strongest suite. Fix Propegates to Propagates.
039d620 [Holden Karau] Add missing closeandwriteoutput
30e558d [Holden Karau] go back to try/finally
e503b8c [Holden Karau] Improve the test to ensure we aren't masking the underlying exception
ae0b7a7 [Holden Karau] Fix the test
2e6abf7 [Holden Karau] Be more cautious when cleaning up during failed write and re-throw user exceptions
2015-06-23 09:08:11 -07:00
Reynold Xin 6ceb169608 [SPARK-8300] DataFrame hint for broadcast join.
Users can now do
```scala
left.join(broadcast(right), "joinKey")
```
to give the query planner a hint that the `right` DataFrame is small and should be broadcast.

Author: Reynold Xin <rxin@databricks.com>

Closes #6751 from rxin/broadcastjoin-hint and squashes the following commits:

953eec2 [Reynold Xin] Code review feedback.
88752d8 [Reynold Xin] Fixed import.
8187b88 [Reynold Xin] [SPARK-8300] DataFrame hint for broadcast join.
2015-06-23 01:50:31 -07:00
Scott Taylor f0dcbe8a7c [SPARK-8541] [PYSPARK] test the absolute error in approx doctests
A minor change but one which is (presumably) visible on the public api docs webpage.

Author: Scott Taylor <github@megatron.me.uk>

Closes #6942 from megatron-me-uk/patch-3 and squashes the following commits:

fbed000 [Scott Taylor] test the absolute error in approx doctests
2015-06-22 23:37:56 -07:00
Hari Shreedharan 9b618fb0d2 [SPARK-8483] [STREAMING] Remove commons-lang3 dependency from Flume Sink. Also bump Flume version to 1.6.0

Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #6910 from harishreedharan/remove-commons-lang3 and squashes the following commits:

9875f7d [Hari Shreedharan] Revert back to Flume 1.4.0
ca35eb0 [Hari Shreedharan] [SPARK-8483][Streaming] Remove commons-lang3 dependency from Flume Sink. Also bump Flume version to 1.6.0
2015-06-22 23:34:17 -07:00
Liang-Chi Hsieh 31bd30687b [SPARK-8359] [SQL] Fix incorrect decimal precision after multiplication
JIRA: https://issues.apache.org/jira/browse/SPARK-8359

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6814 from viirya/fix_decimal2 and squashes the following commits:

071a757 [Liang-Chi Hsieh] Remove maximum precision and use MathContext.UNLIMITED.
df217d4 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into fix_decimal2
a43bfc3 [Liang-Chi Hsieh] Add MathContext with maximum supported precision.
72eeb3f [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into fix_decimal2
44c9348 [Liang-Chi Hsieh] Fix incorrect decimal precision after multiplication.
2015-06-22 23:11:56 -07:00
Yu ISHIKAWA d4f633514a [SPARK-8431] [SPARKR] Add in operator to DataFrame Column in SparkR
[[SPARK-8431] Add in operator to DataFrame Column in SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8431)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6941 from yu-iskw/SPARK-8431 and squashes the following commits:

1f64423 [Yu ISHIKAWA] Modify the comment
f4309a7 [Yu ISHIKAWA] Make a `setMethod` for `%in%` be independent
6e37936 [Yu ISHIKAWA] Modify a variable name
c196173 [Yu ISHIKAWA] [SPARK-8431][SparkR] Add in operator to DataFrame Column in SparkR
2015-06-22 23:04:36 -07:00
Holden Karau 164fe2aa44 [SPARK-7781] [MLLIB] gradient boosted trees.train regressor missing max bins
Author: Holden Karau <holden@pigscanfly.ca>

Closes #6331 from holdenk/SPARK-7781-GradientBoostedTrees.trainRegressor-missing-max-bins and squashes the following commits:

2894695 [Holden Karau] remove extra blank line
2573e8d [Holden Karau] Update the scala side of the pythonmllibapi and make the test a bit nicer too
3a09170 [Holden Karau] add maxBins to the train method as well
af7f274 [Holden Karau] Add maxBins to GradientBoostedTrees.trainRegressor and correctly mention the default of 32 in other places where it mentioned 100
2015-06-22 22:40:19 -07:00
Yu ISHIKAWA 44fa7df64d [SPARK-8548] [SPARKR] Remove the trailing whitespaces from the SparkR files
[[SPARK-8548] Remove the trailing whitespaces from the SparkR files - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8548)

- This is the result of `lint-r`
    https://gist.github.com/yu-iskw/0019b37a2c1167f33986

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6945 from yu-iskw/SPARK-8548 and squashes the following commits:

0bd567a [Yu ISHIKAWA] [SPARK-8548][SparkR] Remove the trailing whitespaces from the SparkR files
2015-06-22 20:55:38 -07:00
Patrick Wendell c4d2343966 MAINTENANCE: Automated closing of pull requests.
This commit exists to close the following pull requests on Github:

Closes #2849 (close requested by 'srowen')
Closes #2786 (close requested by 'andrewor14')
Closes #4678 (close requested by 'JoshRosen')
Closes #5457 (close requested by 'andrewor14')
Closes #3346 (close requested by 'andrewor14')
Closes #6518 (close requested by 'andrewor14')
Closes #5403 (close requested by 'pwendell')
Closes #2110 (close requested by 'srowen')
2015-06-22 20:25:32 -07:00
Cheng Hao 13321e6555 [SPARK-7859] [SQL] Collect_set() behavior differences which fails the unit test under jdk8
To reproduce that:
```
JAVA_HOME=/home/hcheng/Java/jdk1.8.0_45 build/sbt -Phadoop-2.3 -Phive 'test-only org.apache.spark.sql.hive.execution.HiveWindowFunctionQueryWithoutCodeGenSuite'
```

A simple workaround is to update the original query so that it returns the output size instead of the exact elements of the array produced by collect_set().

Author: Cheng Hao <hao.cheng@intel.com>

Closes #6402 from chenghao-intel/windowing and squashes the following commits:

99312ad [Cheng Hao] add order by for the select clause
edf8ce3 [Cheng Hao] update the code as suggested
7062da7 [Cheng Hao] fix the collect_set() behaviour differences under different versions of JDK
2015-06-22 20:04:49 -07:00
Davies Liu 6b7f2ceafd [SPARK-8307] [SQL] improve timestamp from parquet
This PR converts Julian days to Unix timestamps directly (without `Calendar` and `Timestamp`).
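
The direct conversion is plain arithmetic. The Python sketch below assumes the timestamp arrives as a (Julian day, nanoseconds within the day) pair and that the target is microseconds since the Unix epoch; both the input layout and the output unit are assumptions for illustration, as is the function name. Julian day 2440588 is taken as 1970-01-01.

```python
JULIAN_DAY_OF_EPOCH = 2440588  # Julian day number assigned to 1970-01-01
SECONDS_PER_DAY = 86400
MICROS_PER_SECOND = 1_000_000
NANOS_PER_MICRO = 1000


def julian_to_unix_micros(julian_day: int, nanos_of_day: int) -> int:
    """Convert (Julian day, nanoseconds in day) to microseconds since the Unix epoch."""
    seconds = (julian_day - JULIAN_DAY_OF_EPOCH) * SECONDS_PER_DAY
    # Integer arithmetic only: no calendar or timestamp objects are created.
    return seconds * MICROS_PER_SECOND + nanos_of_day // NANOS_PER_MICRO


epoch = julian_to_unix_micros(2440588, 0)
next_day_plus_one_second = julian_to_unix_micros(2440589, 1_000_000_000)
```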

cc adrian-wang rxin

Author: Davies Liu <davies@databricks.com>

Closes #6759 from davies/improve_ts and squashes the following commits:

849e301 [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
b0e4cad [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
8e2d56f [Davies Liu] address comments
634b9f5 [Davies Liu] fix mima
4891efb [Davies Liu] address comment
bfc437c [Davies Liu] fix build
ae5979c [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
602b969 [Davies Liu] remove jodd
2f2e48c [Davies Liu] fix test
8ace611 [Davies Liu] fix mima
212143b [Davies Liu] fix mina
c834108 [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
a3171b8 [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
5233974 [Davies Liu] fix scala style
361fd62 [Davies Liu] address comments
ea196d4 [Davies Liu] improve timestamp from parquet
2015-06-22 18:03:59 -07:00
Wenchen Fan 860a49ef20 [SPARK-7153] [SQL] support all integral type ordinal in GetArrayItem
First convert `ordinal` to `Number`, then convert it to an int.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #5706 from cloud-fan/7153 and squashes the following commits:

915db79 [Wenchen Fan] fix 7153
2015-06-22 17:37:35 -07:00
Andrew Or 1dfb0f7b2a [HOTFIX] [TESTS] Typo mqqt -> mqtt
This was introduced in #6866.
2015-06-22 16:16:26 -07:00
Davies Liu 96aa01378e [SPARK-8492] [SQL] support binaryType in UnsafeRow
Support BinaryType in UnsafeRow, just like StringType.

Also changes the layout of StringType and BinaryType in UnsafeRow by combining offset and size into a single Long, which limits the size of a Row to under 2G (given that no single buffer can be larger than 2G in the JVM).
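
A sketch of packing an offset and a size into one 64-bit word, in Python for brevity. Which half holds the offset and which holds the size is an assumption here, and `pack` / `unpack` are hypothetical names; the sketch only shows why each value is capped at what fits in a signed 32-bit int.

```python
from typing import Tuple

MAX_INT32 = 2**31 - 1  # largest value of a signed 32-bit int (just under 2G)
LOW_32_BITS = 0xFFFFFFFF


def pack(offset: int, size: int) -> int:
    """Combine offset (upper 32 bits) and size (lower 32 bits) into one word."""
    if not (0 <= offset <= MAX_INT32 and 0 <= size <= MAX_INT32):
        raise ValueError("offset and size must each fit in a non-negative 32-bit int")
    return (offset << 32) | size


def unpack(word: int) -> Tuple[int, int]:
    """Split a packed word back into (offset, size)."""
    return word >> 32, word & LOW_32_BITS


word = pack(24, 5)
offset, size = unpack(word)
```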

Author: Davies Liu <davies@databricks.com>

Closes #6911 from davies/unsafe_bin and squashes the following commits:

d68706f [Davies Liu] update comment
519f698 [Davies Liu] address comment
98a964b [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_bin
180b49d [Davies Liu] fix zero-out
22e4c0a [Davies Liu] zero-out padding bytes
6abfe93 [Davies Liu] fix style
447dea0 [Davies Liu] support binaryType in UnsafeRow
2015-06-22 15:22:17 -07:00
BenFradet 50d3242d6a [SPARK-8356] [SQL] Reconcile callUDF and callUdf
Deprecates `callUdf` in favor of `callUDF`.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #6902 from BenFradet/SPARK-8356 and squashes the following commits:

ef4e9d8 [BenFradet] deprecated callUDF, use udf instead
9b1de4d [BenFradet] reinstated unit test for the deprecated callUdf
cbd80a5 [BenFradet] deprecated callUdf in favor of callUDF
2015-06-22 15:06:47 -07:00
Yu ISHIKAWA b1f3a489ef [SPARK-8537] [SPARKR] Add a validation rule about the curly braces in SparkR to .lintr
[[SPARK-8537] Add a validation rule about the curly braces in SparkR to `.lintr` - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8537)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6940 from yu-iskw/SPARK-8537 and squashes the following commits:

7eec1a0 [Yu ISHIKAWA] [SPARK-8537][SparkR] Add a validation rule about the curly braces in SparkR to `.lintr`
2015-06-22 14:35:38 -07:00
Feynman Liang afe35f0519 [SPARK-8455] [ML] Implement n-gram feature transformer
Implementation of n-gram feature transformer for ML.
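
As a language-neutral illustration of what such a transformer computes (this is a Python sketch, not the ML transformer itself), n-grams over a token sequence can be built with a sliding window. The function name `ngrams` and the choice to join tokens with a space are assumptions. When `n` exceeds the input length the result is empty, matching the behavior noted in the squashed commits below.

```python
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> List[str]:
    """Return all contiguous n-token windows, each joined by a single space."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # range() is empty when n > len(tokens), so the output is an empty list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = ngrams(["a", "b", "c"], 2)
too_long = ngrams(["a"], 2)
```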

Author: Feynman Liang <fliang@databricks.com>

Closes #6887 from feynmanliang/ngram-featurizer and squashes the following commits:

d2c839f [Feynman Liang] Make n > input length yield empty output
9fadd36 [Feynman Liang] Add empty and corner test cases, fix names and spaces
fe93873 [Feynman Liang] Implement n-gram feature transformer
2015-06-22 14:15:35 -07:00
Yin Huai 5ab9fcfb01 [SPARK-8532] [SQL] In Python's DataFrameWriter, save/saveAsTable/json/parquet/jdbc always override mode
https://issues.apache.org/jira/browse/SPARK-8532

This PR has two changes. First, it fixes the bug that save actions (i.e. `save/saveAsTable/json/parquet/jdbc`) always override mode. Second, it adds input argument `partitionBy` to `save/saveAsTable/parquet`.

Author: Yin Huai <yhuai@databricks.com>

Closes #6937 from yhuai/SPARK-8532 and squashes the following commits:

f972d5d [Yin Huai] davies's comment.
d37abd2 [Yin Huai] style.
d21290a [Yin Huai] Python doc.
889eb25 [Yin Huai] Minor refactoring and add partitionBy to save, saveAsTable, and parquet.
7fbc24b [Yin Huai] Use None instead of "error" as the default value of mode since JVM-side already uses "error" as the default value.
d696dff [Yin Huai] Python style.
88eb6c4 [Yin Huai] If mode is "error", do not call mode method.
c40c461 [Yin Huai] Regression test.
2015-06-22 13:51:23 -07:00
Wenchen Fan da7bbb9435 [SPARK-8104] [SQL] auto alias expressions in analyzer
Currently we auto-alias expressions in the parser. However, during the parsing phase we don't have enough information to choose the right alias. For example, a Generator that has more than one kind of element needs MultiAlias, and an ExtractValue doesn't need an Alias if it's in the middle of an ExtractValue chain.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6647 from cloud-fan/alias and squashes the following commits:

552eba4 [Wenchen Fan] fix python
5b5786d [Wenchen Fan] fix agg
73a90cb [Wenchen Fan] fix case-preserve of ExtractValue
4cfd23c [Wenchen Fan] fix order by
d18f401 [Wenchen Fan] refine
9f07359 [Wenchen Fan] address comments
39c1aef [Wenchen Fan] small fix
33640ec [Wenchen Fan] auto alias expressions in analyzer
2015-06-22 12:13:00 -07:00
Yu ISHIKAWA 5d89d9f00b [SPARK-8511] [PYSPARK] Modify a test to remove a saved model in regression.py
[[SPARK-8511] Modify a test to remove a saved model in `regression.py` - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8511)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6926 from yu-iskw/SPARK-8511 and squashes the following commits:

7cd0948 [Yu ISHIKAWA] Use `shutil.rmtree()` to remove temporary directories used in model-saving tests, instead of `os.removedirs()`
4a01c9e [Yu ISHIKAWA] [SPARK-8511][pyspark] Modify a test to remove a saved model in `regression.py`
2015-06-22 11:53:11 -07:00
Pradeep Chhetri ba8a4537fe [SPARK-8482] Added M4 instances to the list.
AWS recently added M4 instances (https://aws.amazon.com/blogs/aws/the-new-m4-instance-type-bonus-price-reduction-on-m3-c4/).

Author: Pradeep Chhetri <pradeep.chhetri89@gmail.com>

Closes #6899 from pradeepchhetri/master and squashes the following commits:

4f4ea79 [Pradeep Chhetri] Added t2.large instance
3d2bb6c [Pradeep Chhetri] Added M4 instances to the list
2015-06-22 11:45:31 -07:00
Stefano Parmesan 42a1f716fa [SPARK-8429] [EC2] Add ability to set additional tags
Adds the `--additional-tags` parameter, which sets additional tags on all the created instances (masters and slaves).

The user can specify multiple tags by separating them with a comma (`,`), while each tag name and value should be separated by a colon (`:`); for example, `Task:MySparkProject,Env:production` would add two tags, `Task` and `Env`, with the given values.
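
Parsing a value in this format could be sketched as follows. This is an illustrative Python sketch, not the code in spark-ec2; the function name `parse_additional_tags` and its error handling are assumptions.

```python
from typing import Dict


def parse_additional_tags(spec: str) -> Dict[str, str]:
    """Parse 'Name:Value,Name2:Value2' into a dict of tag names to values."""
    tags: Dict[str, str] = {}
    for item in spec.split(","):
        if not item.strip():
            continue  # tolerate empty input and stray commas
        name, sep, value = item.partition(":")
        if not sep:
            raise ValueError(f"Tag {item!r} is missing a ':' separator")
        tags[name.strip()] = value.strip()
    return tags


tags = parse_additional_tags("Task:MySparkProject,Env:production")
```

Using `partition` splits on the first colon only, so a value that itself contains a colon is kept whole in this sketch.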

Author: Stefano Parmesan <s.parmesan@gmail.com>

Closes #6857 from armisael/patch-1 and squashes the following commits:

c5ac92c [Stefano Parmesan] python style (pep8)
8e614f1 [Stefano Parmesan] Set multiple tags in a single request
bfc56af [Stefano Parmesan] Address SPARK-7900 by increasing sleep time
daf8615 [Stefano Parmesan] Add ability to set additional tags
2015-06-22 11:43:10 -07:00