Commit graph

11914 commits

Author SHA1 Message Date
Feynman Liang afe35f0519 [SPARK-8455] [ML] Implement n-gram feature transformer
Implementation of n-gram feature transformer for ML.

Author: Feynman Liang <fliang@databricks.com>

Closes #6887 from feynmanliang/ngram-featurizer and squashes the following commits:

d2c839f [Feynman Liang] Make n > input length yield empty output
9fadd36 [Feynman Liang] Add empty and corner test cases, fix names and spaces
fe93873 [Feynman Liang] Implement n-gram feature transformer
2015-06-22 14:15:35 -07:00
Yin Huai 5ab9fcfb01 [SPARK-8532] [SQL] In Python's DataFrameWriter, save/saveAsTable/json/parquet/jdbc always override mode
https://issues.apache.org/jira/browse/SPARK-8532

This PR has two changes. First, it fixes the bug that save actions (i.e. `save/saveAsTable/json/parquet/jdbc`) always override mode. Second, it adds input argument `partitionBy` to `save/saveAsTable/parquet`.

Author: Yin Huai <yhuai@databricks.com>

Closes #6937 from yhuai/SPARK-8532 and squashes the following commits:

f972d5d [Yin Huai] davies's comment.
d37abd2 [Yin Huai] style.
d21290a [Yin Huai] Python doc.
889eb25 [Yin Huai] Minor refactoring and add partitionBy to save, saveAsTable, and parquet.
7fbc24b [Yin Huai] Use None instead of "error" as the default value of mode since JVM-side already uses "error" as the default value.
d696dff [Yin Huai] Python style.
88eb6c4 [Yin Huai] If mode is "error", do not call mode method.
c40c461 [Yin Huai] Regression test.
2015-06-22 13:51:23 -07:00
Wenchen Fan da7bbb9435 [SPARK-8104] [SQL] auto alias expressions in analyzer
Currently we auto alias expression in parser. However, during parser phase we don't have enough information to do the right alias. For example, Generator that has more than 1 kind of element need MultiAlias, ExtractValue don't need Alias if it's in middle of a ExtractValue chain.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6647 from cloud-fan/alias and squashes the following commits:

552eba4 [Wenchen Fan] fix python
5b5786d [Wenchen Fan] fix agg
73a90cb [Wenchen Fan] fix case-preserve of ExtractValue
4cfd23c [Wenchen Fan] fix order by
d18f401 [Wenchen Fan] refine
9f07359 [Wenchen Fan] address comments
39c1aef [Wenchen Fan] small fix
33640ec [Wenchen Fan] auto alias expressions in analyzer
2015-06-22 12:13:00 -07:00
Yu ISHIKAWA 5d89d9f00b [SPARK-8511] [PYSPARK] Modify a test to remove a saved model in regression.py
[[SPARK-8511] Modify a test to remove a saved model in `regression.py` - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8511)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6926 from yu-iskw/SPARK-8511 and squashes the following commits:

7cd0948 [Yu ISHIKAWA] Use `shutil.rmtree()` to temporary directories for saving model testings, instead of `os.removedirs()`
4a01c9e [Yu ISHIKAWA] [SPARK-8511][pyspark] Modify a test to remove a saved model in `regression.py`
2015-06-22 11:53:11 -07:00
Pradeep Chhetri ba8a4537fe [SPARK-8482] Added M4 instances to the list.
AWS recently added M4 instances (https://aws.amazon.com/blogs/aws/the-new-m4-instance-type-bonus-price-reduction-on-m3-c4/).

Author: Pradeep Chhetri <pradeep.chhetri89@gmail.com>

Closes #6899 from pradeepchhetri/master and squashes the following commits:

4f4ea79 [Pradeep Chhetri] Added t2.large instance
3d2bb6c [Pradeep Chhetri] Added M4 instances to the list
2015-06-22 11:45:31 -07:00
Stefano Parmesan 42a1f716fa [SPARK-8429] [EC2] Add ability to set additional tags
Add the `--additional-tags` parameter that allows to set additional tags to all the created instances (masters and slaves).

The user can specify multiple tags by separating them with a comma (`,`), while each tag name and value should be separated by a colon (`:`); for example, `Task:MySparkProject,Env:production` would add two tags, `Task` and `Env`, with the given values.

Author: Stefano Parmesan <s.parmesan@gmail.com>

Closes #6857 from armisael/patch-1 and squashes the following commits:

c5ac92c [Stefano Parmesan] python style (pep8)
8e614f1 [Stefano Parmesan] Set multiple tags in a single request
bfc56af [Stefano Parmesan] Address SPARK-7900 by inceasing sleep time
daf8615 [Stefano Parmesan] Add ability to set additional tags
2015-06-22 11:43:10 -07:00
Cheng Lian 0818fdec37 [SPARK-8406] [SQL] Adding UUID to output file name to avoid accidental overwriting
This PR fixes a Parquet output file name collision bug which may cause data loss.  Changes made:

1.  Identify each write job issued by `InsertIntoHadoopFsRelation` with a UUID

    All concrete data sources which extend `HadoopFsRelation` (Parquet and ORC for now) must use this UUID to generate task output file path to avoid name collision.

2.  Make `TestHive` use a local mode `SparkContext` with 32 threads to increase parallelism

    The major reason for this is that, the original parallelism of 2 is too low to reproduce the data loss issue.  Also, higher concurrency may potentially caught more concurrency bugs during testing phase. (It did help us spotted SPARK-8501.)

3. `OrcSourceSuite` was updated to workaround SPARK-8501, which we detected along the way.

NOTE: This PR is made a little bit more complicated than expected because we hit two other bugs on the way and have to work them around. See [SPARK-8501] [1] and [SPARK-8513] [2].

[1]: https://github.com/liancheng/spark/tree/spark-8501
[2]: https://github.com/liancheng/spark/tree/spark-8513

----

Some background and a summary of offline discussion with yhuai about this issue for better understanding:

In 1.4.0, we added `HadoopFsRelation` to abstract partition support of all data sources that are based on Hadoop `FileSystem` interface.  Specifically, this makes partition discovery, partition pruning, and writing dynamic partitions for data sources much easier.

To support appending, the Parquet data source tries to find out the max part number of part-files in the destination directory (i.e., `<id>` in output file name `part-r-<id>.gz.parquet`) at the beginning of the write job.  In 1.3.0, this step happens on driver side before any files are written.  However, in 1.4.0, this is moved to task side.  Unfortunately, for tasks scheduled later, they may see wrong max part number generated of files newly written by other finished tasks within the same job.  This actually causes a race condition.  In most cases, this only causes nonconsecutive part numbers in output file names.  But when the DataFrame contains thousands of RDD partitions, it's likely that two tasks may choose the same part number, then one of them gets overwritten by the other.

Before `HadoopFsRelation`, Spark SQL already supports appending data to Hive tables.  From a user's perspective, these two look similar.  However, they differ a lot internally.  When data are inserted into Hive tables via Spark SQL, `InsertIntoHiveTable` simulates Hive's behaviors:

1.  Write data to a temporary location

2.  Move data in the temporary location to the final destination location using

    -   `Hive.loadTable()` for non-partitioned table
    -   `Hive.loadPartition()` for static partitions
    -   `Hive.loadDynamicPartitions()` for dynamic partitions

The important part is that, `Hive.copyFiles()` is invoked in step 2 to move the data to the destination directory (I found the name is kinda confusing since no "copying" occurs here, we are just moving and renaming stuff).  If a file in the source directory and another file in the destination directory happen to have the same name, say `part-r-00001.parquet`, the former is moved to the destination directory and renamed with a `_copy_N` postfix (`part-r-00001_copy_1.parquet`).  That's how Hive handles appending and avoids name collision between different write jobs.

Some alternatives fixes considered for this issue:

1.  Use a similar approach as Hive

    This approach is not preferred in Spark 1.4.0 mainly because file metadata operations in S3 tend to be slow, especially for tables with lots of file and/or partitions.  That's why `InsertIntoHadoopFsRelation` just inserts to destination directory directly, and is often used together with `DirectParquetOutputCommitter` to reduce latency when working with S3.  This means, we don't have the chance to do renaming, and must avoid name collision from the very beginning.

2.  Same as 1.3, just move max part number detection back to driver side

    This isn't doable because unlike 1.3, 1.4 also takes dynamic partitioning into account.  When inserting into dynamic partitions, we don't know which partition directories will be touched on driver side before issuing the write job.  Checking all partition directories is simply too expensive for tables with thousands of partitions.

3.  Add extra component to output file names to avoid name collision

    This seems to be the only reasonable solution for now.  To be more specific, we need a JOB level unique identifier to identify all write jobs issued by `InsertIntoHadoopFile`.  Notice that TASK level unique identifiers can NOT be used.  Because in this way a speculative task will write to a different output file from the original task.  If both tasks succeed, duplicate output will be left behind.  Currently, the ORC data source adds `System.currentTimeMillis` to the output file name for uniqueness.  This doesn't work because of exactly the same reason.

    That's why this PR adds a job level random UUID in `BaseWriterContainer` (which is used by `InsertIntoHadoopFsRelation` to issue write jobs).  The drawback is that record order is not preserved any more (output files of a later job may be listed before those of a earlier job).  However, we never promise to preserve record order when writing data, and Hive doesn't promise this either because the `_copy_N` trick breaks the order.

Author: Cheng Lian <lian@databricks.com>

Closes #6864 from liancheng/spark-8406 and squashes the following commits:

db7a46a [Cheng Lian] More comments
f5c1133 [Cheng Lian] Addresses comments
85c478e [Cheng Lian] Workarounds SPARK-8513
088c76c [Cheng Lian] Adds comment about SPARK-8501
99a5e7e [Cheng Lian] Uses job level UUID in SimpleTextRelation and avoids double task abortion
4088226 [Cheng Lian] Works around SPARK-8501
1d7d206 [Cheng Lian] Adds more logs
8966bbb [Cheng Lian] Fixes Scala style issue
18b7003 [Cheng Lian] Uses job level UUID to take speculative tasks into account
3806190 [Cheng Lian] Lets TestHive use all cores by default
748dbd7 [Cheng Lian] Adding UUID to output file name to avoid accidental overwriting
2015-06-22 10:03:57 -07:00
Mike Dusenberry 47c1d56293 [SPARK-7426] [MLLIB] [ML] Updated Attribute.fromStructField to allow any NumericType.
Updated `Attribute.fromStructField` to allow any `NumericType`, rather than just `DoubleType`, and added unit tests for a few of the other NumericTypes.

Author: Mike Dusenberry <dusenberrymw@gmail.com>

Closes #6540 from dusenberrymw/SPARK-7426_AttributeFactory.fromStructField_Should_Allow_NumericTypes and squashes the following commits:

87fecb3 [Mike Dusenberry] Updated Attribute.fromStructField to allow any NumericType, rather than just DoubleType, and added unit tests for a few of the other NumericTypes.
2015-06-21 18:25:36 -07:00
Joseph K. Bradley a1894422ad [SPARK-7715] [MLLIB] [ML] [DOC] Updated MLlib programming guide for release 1.4
Reorganized docs a bit.  Added migration guides.

**Q**: Do we want to say more for the 1.3 -> 1.4 migration guide for ```spark.ml```?  It would be a lot.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6897 from jkbradley/ml-guide-1.4 and squashes the following commits:

4bf26d6 [Joseph K. Bradley] tiny fix
8085067 [Joseph K. Bradley] fixed spacing/layout issues in ml guide from previous commit in this PR
6cd5c78 [Joseph K. Bradley] Updated MLlib programming guide for release 1.4
2015-06-21 16:25:25 -07:00
Cheng Lian 83cdfd84f8 [SPARK-8508] [SQL] Ignores a test case to cleanup unnecessary testing output until #6882 is merged
Currently [the test case for SPARK-7862] [1] writes 100,000 lines of integer triples to stderr and makes Jenkins build output unnecessarily large and it's hard to debug other build errors. A proper fix is on the way in #6882. This PR ignores this test case temporarily until #6882 is merged.

[1]: https://github.com/apache/spark/pull/6404/files#diff-1ea02a6fab84e938582f7f87cc4d9ea1R641

Author: Cheng Lian <lian@databricks.com>

Closes #6925 from liancheng/spark-8508 and squashes the following commits:

41e5b47 [Cheng Lian] Ignores the test case until #6882 is merged
2015-06-21 13:20:28 -07:00
Yanbo Liang 32e3cdaa64 [SPARK-7604] [MLLIB] Python API for PCA and PCAModel
Python API for PCA and PCAModel

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #6315 from yanboliang/spark-7604 and squashes the following commits:

1d58734 [Yanbo Liang] remove transform() in PCAModel, use default behavior
4d9d121 [Yanbo Liang] Python API for PCA and PCAModel
2015-06-21 12:04:20 -07:00
jeanlyn a1e3649c87 [SPARK-8379] [SQL] avoid speculative tasks write to the same file
The issue link [SPARK-8379](https://issues.apache.org/jira/browse/SPARK-8379)
Currently,when we insert data to the dynamic partition with speculative tasks we will get the Exception
```
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
Lease mismatch on /tmp/hive-jeanlyn/hive_2015-06-15_15-20-44_734_8801220787219172413-1/-ext-10000/ds=2015-06-15/type=2/part-00301.lzo
owned by DFSClient_attempt_201506031520_0011_m_000189_0_-1513487243_53
but is accessed by DFSClient_attempt_201506031520_0011_m_000042_0_-1275047721_57
```
This pr try to write the data to temporary dir when using dynamic parition  avoid the speculative tasks writing the same file

Author: jeanlyn <jeanlyn92@gmail.com>

Closes #6833 from jeanlyn/speculation and squashes the following commits:

64bbfab [jeanlyn] use FileOutputFormat.getTaskOutputPath to get the path
8860af0 [jeanlyn] remove the never using code
e19a3bd [jeanlyn] avoid speculative tasks write same file
2015-06-21 00:13:40 -07:00
Tarek Auel 41ab2853f4 [SPARK-8301] [SQL] Improve UTF8String substring/startsWith/endsWith/contains performance
Jira: https://issues.apache.org/jira/browse/SPARK-8301

Added the private method startsWith(prefix, offset) to implement startsWith, endsWith and contains without copying the array

I hope that the component SQL is still correct. I copied it from the Jira ticket.

Author: Tarek Auel <tarek.auel@googlemail.com>
Author: Tarek Auel <tarek.auel@gmail.com>

Closes #6804 from tarekauel/SPARK-8301 and squashes the following commits:

f5d6b9a [Tarek Auel] fixed parentheses and annotation
6d7b068 [Tarek Auel] [SPARK-8301] removed null checks
9ca0473 [Tarek Auel] [SPARK-8301] removed null checks
1c327eb [Tarek Auel] [SPARK-8301] removed new
9f17cc8 [Tarek Auel] [SPARK-8301] fixed conversion byte to string in codegen
3a0040f [Tarek Auel] [SPARK-8301] changed call of UTF8String.set to UTF8String.from
e4530d2 [Tarek Auel] [SPARK-8301] changed call of UTF8String.set to UTF8String.from
a5f853a [Tarek Auel] [SPARK-8301] changed visibility of set to protected. Changed annotation of bytes from Nullable to Nonnull
d2fb05f [Tarek Auel] [SPARK-8301] added additional null checks
79cb55b [Tarek Auel] [SPARK-8301] null check. Added test cases for null check.
b17909e [Tarek Auel] [SPARK-8301] removed unnecessary copying of UTF8String. Added a private function startsWith(prefix, offset) to implement the check for startsWith, endsWith and contains.
2015-06-20 20:03:59 -07:00
Yu ISHIKAWA 004f57374b [SPARK-8495] [SPARKR] Add a .lintr file to validate the SparkR files and the lint-r script
Thank Shivaram Venkataraman for your support. This is a prototype script to validate the R files.

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6922 from yu-iskw/SPARK-6813 and squashes the following commits:

c1ffe6b [Yu ISHIKAWA] Modify to save result to a log file and add a rule to validate
5520806 [Yu ISHIKAWA] Exclude the .lintr file not to check Apache lincence
8f94680 [Yu ISHIKAWA] [SPARK-8495][SparkR] Add a `.lintr` file to validate the SparkR files and the `lint-r` script
2015-06-20 16:10:14 -07:00
Josh Rosen 7a3c424ecf [SPARK-8422] [BUILD] [PROJECT INFRA] Add a module abstraction to dev/run-tests
This patch builds upon #5694 to add a 'module' abstraction to the `dev/run-tests` script which groups together the per-module test logic, including the mapping from file paths to modules, the mapping from modules to test goals and build profiles, and the dependencies / relationships between modules.

This refactoring makes it much easier to increase the granularity of test modules, which will let us skip even more tests.  It's also a prerequisite for other changes that will reduce test time, such as running subsets of the Python tests based on which files / modules have changed.

This patch also adds doctests for the new graph traversal / change mapping code.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6866 from JoshRosen/more-dev-run-tests-refactoring and squashes the following commits:

75de450 [Josh Rosen] Use module system to determine which build profiles to enable.
4224da5 [Josh Rosen] Add documentation to Module.
a86a953 [Josh Rosen] Clean up modules; add new modules for streaming external projects
e46539f [Josh Rosen] Fix camel-cased endswith()
35a3052 [Josh Rosen] Enable Hive tests when running all tests
df10e23 [Josh Rosen] update to reflect fact that no module depends on root
3670d50 [Josh Rosen] mllib should depend on streaming
dc6f1c6 [Josh Rosen] Use changed files' extensions to decide whether to run style checks
7092d3e [Josh Rosen] Skip SBT tests if no test goals are specified
43a0ced [Josh Rosen] Minor fixes
3371441 [Josh Rosen] Test everything if nothing has changed (needed for non-PRB builds)
37f3fb3 [Josh Rosen] Remove doc profiles option, since it's not actually needed (see #6865)
f53864b [Josh Rosen] Finish integrating module changes
f0249bd [Josh Rosen] WIP
2015-06-20 16:05:54 -07:00
Liang-Chi Hsieh 0b8995168f [SPARK-8468] [ML] Take the negative of some metrics in RegressionEvaluator to get correct cross validation
JIRA: https://issues.apache.org/jira/browse/SPARK-8468

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6905 from viirya/cv_min and squashes the following commits:

930d3db [Liang-Chi Hsieh] Fix python unit test and add document.
d632135 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cv_min
16e3b2c [Liang-Chi Hsieh] Take the negative instead of reciprocal.
c3dd8d9 [Liang-Chi Hsieh] For comments.
b5f52c1 [Liang-Chi Hsieh] Add param to CrossValidator for choosing whether to maximize evaulation value.
2015-06-20 13:01:59 -07:00
cody koeninger 1b6fe9b1a7 [SPARK-8127] [STREAMING] [KAFKA] KafkaRDD optimize count() take() isEmpty()
…ed KafkaRDD methods.  Possible fix for [SPARK-7122], but probably a worthwhile optimization regardless.

Author: cody koeninger <cody@koeninger.org>

Closes #6632 from koeninger/kafka-rdd-count and squashes the following commits:

321340d [cody koeninger] [SPARK-8127][Streaming][Kafka] additional test of ordering of take()
5a05d0f [cody koeninger] [SPARK-8127][Streaming][Kafka] additional test of isEmpty
f68bd32 [cody koeninger] [Streaming][Kafka][SPARK-8127] code cleanup
9555b73 [cody koeninger] Merge branch 'master' into kafka-rdd-count
253031d [cody koeninger] [Streaming][Kafka][SPARK-8127] mima exclusion for change to private method
8974b9e [cody koeninger] [Streaming][Kafka][SPARK-8127] check offset ranges before constructing KafkaRDD
c3768c5 [cody koeninger] [Streaming][Kafka] Take advantage of offset range info for size-related KafkaRDD methods.  Possible fix for [SPARK-7122], but probably a worthwhile optimization regardless.
2015-06-19 18:54:07 -07:00
Andrew Or bec40e52be [HOTFIX] [SPARK-8489] Correct JIRA number in previous commit
It should be SPARK-8489, not SPARK-8498.
2015-06-19 17:39:26 -07:00
Andrew Or 093c34838d [SPARK-8498] [SQL] Add regression test for SPARK-8470
**Summary of the problem in SPARK-8470.** When using `HiveContext` to create a data frame of a user case class, Spark throws `scala.reflect.internal.MissingRequirementError` when it tries to infer the schema using reflection. This is caused by `HiveContext` silently overwriting the context class loader containing the user classes.

**What this issue is about.** This issue adds regression tests for SPARK-8470, which is already fixed in #6891. We closed SPARK-8470 as a duplicate because it is a different manifestation of the same problem in SPARK-8368. Due to the complexity of the reproduction, this requires us to pre-package a special test jar and include it in the Spark project itself.

I tested this with and without the fix in #6891 and verified that it passes only if the fix is present.

Author: Andrew Or <andrew@databricks.com>

Closes #6909 from andrewor14/SPARK-8498 and squashes the following commits:

5e9d688 [Andrew Or] Add regression test for SPARK-8470
2015-06-19 17:34:09 -07:00
cody koeninger b305e377fb [SPARK-8390] [STREAMING] [KAFKA] fix docs related to HasOffsetRanges
Author: cody koeninger <cody@koeninger.org>

Closes #6863 from koeninger/SPARK-8390 and squashes the following commits:

26a06bd [cody koeninger] Merge branch 'master' into SPARK-8390
3744492 [cody koeninger] [Streaming][Kafka][SPARK-8390] doc changes per TD, test to make sure approach shown in docs actually compiles + runs
b108c9d [cody koeninger] [Streaming][Kafka][SPARK-8390] further doc fixes, clean up spacing
bb4336b [cody koeninger] [Streaming][Kafka][SPARK-8390] fix docs related to HasOffsetRanges, cleanup
3f3c57a [cody koeninger] [Streaming][Kafka][SPARK-8389] Example of getting offset ranges out of the existing java direct stream api
2015-06-19 17:18:31 -07:00
Michael Armbrust a333a72e02 [SPARK-8420] [SQL] Fix comparision of timestamps/dates with strings
In earlier versions of Spark SQL we casted `TimestampType` and `DataType` to `StringType` when it was involved in a binary comparison with a `StringType`.  This allowed comparing a timestamp with a partial date as a user would expect.
 - `time > "2014-06-10"`
 - `time > "2014"`

In 1.4.0 we tried to cast the String instead into a Timestamp.  However, since partial dates are not a valid complete timestamp this results in `null` which results in the tuple being filtered.

This PR restores the earlier behavior.  Note that we still special case equality so that these comparisons are not affected by not printing zeros for subsecond precision.

Author: Michael Armbrust <michael@databricks.com>

Closes #6888 from marmbrus/timeCompareString and squashes the following commits:

bdef29c [Michael Armbrust] test partial date
1f09adf [Michael Armbrust] special handling of equality
1172c60 [Michael Armbrust] more test fixing
4dfc412 [Michael Armbrust] fix tests
aaa9508 [Michael Armbrust] newline
04d908f [Michael Armbrust] [SPARK-8420][SQL] Fix comparision of timestamps/dates with strings
2015-06-19 16:54:51 -07:00
Nathan Howell 9814b971f0 [SPARK-8093] [SQL] Remove empty structs inferred from JSON documents
Author: Nathan Howell <nhowell@godaddy.com>

Closes #6799 from NathanHowell/spark-8093 and squashes the following commits:

76ac3e8 [Nathan Howell] [SPARK-8093] [SQL] Remove empty structs inferred from JSON documents
2015-06-19 16:19:28 -07:00
Hossein 1fa29c2df2 [SPARK-8452] [SPARKR] expose jobGroup API in SparkR
This pull request adds following methods to SparkR:

```R
setJobGroup()
cancelJobGroup()
clearJobGroup()
```
For each method, the spark context is passed as the first argument. There does not seem to be a good way to test these in R.

cc shivaram and davies

Author: Hossein <hossein@databricks.com>

Closes #6889 from falaki/SPARK-8452 and squashes the following commits:

9ce9f1e [Hossein] Added basic tests to verify methods can be called and won't throw errors
c706af9 [Hossein] Added examples
a2c19af [Hossein] taking spark context as first argument
343ca77 [Hossein] Added setJobGroup, cancelJobGroup and clearJobGroup to SparkR
2015-06-19 15:51:59 -07:00
MechCoder 54976e55e3 [SPARK-4118] [MLLIB] [PYSPARK] Python bindings for StreamingKMeans
Python bindings for StreamingKMeans

Will change status to MRG once docs, tests and examples are updated.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6499 from MechCoder/spark-4118 and squashes the following commits:

7722d16 [MechCoder] minor style fixes
51052d3 [MechCoder] Doc fixes
2061a76 [MechCoder] Add tests for simultaneous training and prediction Minor style fixes
81482fd [MechCoder] minor
5d9fe61 [MechCoder] predictOn should take into account the latest model
8ab9e89 [MechCoder] Fix Python3 error
a9817df [MechCoder] Better tests and minor fixes
c80e451 [MechCoder] Add ignore_unicode_prefix
ee8ce16 [MechCoder] Update tests, doc and examples
4b1481f [MechCoder] Some changes and tests
d8b066a [MechCoder] [SPARK-4118] [MLlib] [PySpark] Python bindings for StreamingKMeans
2015-06-19 12:23:15 -07:00
Davies Liu e41e2fd6c6 [SPARK-8461] [SQL] fix codegen with REPL class loader
The ExecutorClassLoader for REPL will cause Janino failed to find class for those in java.lang, so switch to use default class loader for Janino, which will also help performance.

cc liancheng yhuai

Author: Davies Liu <davies@databricks.com>

Closes #6898 from davies/fix_class_loader and squashes the following commits:

24276d4 [Davies Liu] add regression test
4ff0457 [Davies Liu] address comment, refactor
7f5ffbe [Davies Liu] fix REPL class loader with codegen
2015-06-19 11:40:04 -07:00
Liang-Chi Hsieh 4a462c282c [HOTFIX] Fix scala style in DFSReadWriteTest that causes tests failed
This scala style problem causes tested failed.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6907 from viirya/hotfix_style and squashes the following commits:

c53f188 [Liang-Chi Hsieh] Fix scala style.
2015-06-19 11:36:59 -07:00
Yin Huai c5876e529b [SPARK-8368] [SPARK-8058] [SQL] HiveContext may override the context class loader of the current thread
https://issues.apache.org/jira/browse/SPARK-8368

Also, I add tests according https://issues.apache.org/jira/browse/SPARK-8058.

Author: Yin Huai <yhuai@databricks.com>

Closes #6891 from yhuai/SPARK-8368 and squashes the following commits:

37bb3db [Yin Huai] Update test timeout and comment.
8762eec [Yin Huai] Style.
695cd2d [Yin Huai] Correctly set the class loader in the conf of the state in client wrapper.
b3378fe [Yin Huai] Failed tests.
2015-06-19 11:11:58 -07:00
Sean Owen 4be53d0395 [SPARK-5836] [DOCS] [STREAMING] Clarify what may cause long-running Spark apps to preserve shuffle files
Clarify what may cause long-running Spark apps to preserve shuffle files

Author: Sean Owen <sowen@cloudera.com>

Closes #6901 from srowen/SPARK-5836 and squashes the following commits:

a9faef0 [Sean Owen] Clarify what may cause long-running Spark apps to preserve shuffle files
2015-06-19 11:03:04 -07:00
Andrew Or 68a2dca292 [SPARK-8451] [SPARK-7287] SparkSubmitSuite should check exit code
This patch also reenables the tests. Now that we have access to the log4j logs it should be easier to debug the flakiness.

yhuai brkyvz

Author: Andrew Or <andrew@databricks.com>

Closes #6886 from andrewor14/spark-submit-suite-fix and squashes the following commits:

3f99ff1 [Andrew Or] Move destroy to finally block
9a62188 [Andrew Or] Re-enable ignored tests
2382672 [Andrew Or] Check for exit code
2015-06-19 10:56:19 -07:00
Tathagata Das 866816eb97 [SPARK-7180] [SPARK-8090] [SPARK-8091] Fix a number of SerializationDebugger bugs and limitations
This PR solves three SerializationDebugger issues.
* SPARK-7180 - SerializationDebugger fails with ArrayOutOfBoundsException
* SPARK-8090 - SerializationDebugger does not handle classes with writeReplace correctly
* SPARK-8091 - SerializationDebugger does not handle classes with writeObject method

The solutions for each are explained as follows
* SPARK-7180 - The wrong slot desc was used for getting the value of the fields in the object being tested.
* SPARK-8090 - Test the type of the replaced object.
* SPARK-8091 - Use a dummy ObjectOutputStream to collect all the objects written by the writeObject() method, and then test those objects as usual.

I also added more tests in the testsuite to increase code coverage. For example, added tests for cases where there are not serializability issues.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #6625 from tdas/SPARK-7180 and squashes the following commits:

c7cb046 [Tathagata Das] Addressed comments on docs
ae212c8 [Tathagata Das] Improved docs
304c97b [Tathagata Das] Fixed build error
26b5179 [Tathagata Das] more tests.....92% line coverage
7e2fdcf [Tathagata Das] Added more tests
d1967fb [Tathagata Das] Added comments.
da75d34 [Tathagata Das] Removed unnecessary lines.
50a608d [Tathagata Das] Fixed bugs and added support for writeObject
2015-06-19 10:52:30 -07:00
RJ Nowling a9858036bf Add example that reads a local file, writes to a DFS path provided by th...
...e user, reads the file back from the DFS, and compares word counts on the local and DFS versions. Useful for verifying DFS correctness.

Author: RJ Nowling <rnowling@gmail.com>

Closes #3347 from rnowling/dfs_read_write_test and squashes the following commits:

af8ccb7 [RJ Nowling] Don't use java.io.File since DFS may not be POSIX-compatible
b0ef9ea [RJ Nowling] Fix string style
07c6132 [RJ Nowling] Fix string style
7d9a8df [RJ Nowling] Fix string style
f74c160 [RJ Nowling] Fix else statement style
b9edf12 [RJ Nowling] Fix spark wc style
44415b9 [RJ Nowling] Fix local wc style
94a4691 [RJ Nowling] Fix space
df59b65 [RJ Nowling] Fix if statements
1b314f0 [RJ Nowling] Add scaladoc
a931d70 [RJ Nowling] Fix import order
0c89558 [RJ Nowling] Add example that reads a local file, writes to a DFS path provided by the user, reads the file back from the DFS, and compares word counts on the local and DFS versions. Useful for verifying DFS correctness.
2015-06-19 10:51:37 -07:00
Shilei 0c32fc125c [SPARK-8234][SQL] misc function: md5
Author: Shilei <shilei.qian@intel.com>

Closes #6779 from qiansl127/MD5 and squashes the following commits:

11fcdb2 [Shilei] Fix the indent
04bd27b [Shilei] Add codegen
da60eb3 [Shilei] Remove checkInputDataTypes function
9509ad0 [Shilei] Format code
12c61f4 [Shilei] Accept only BinaryType for Md5
1df0b5b [Shilei] format to scala type
60ccde1 [Shilei] Add more test case
b8c73b4 [Shilei] Rewrite the type check for Md5
c166167 [Shilei] Add md5 function
2015-06-19 10:49:27 -07:00
Takuya UESHIN fe08561e2e [SPARK-8476] [CORE] Setters inc/decDiskBytesSpilled in TaskMetrics should also be private.
This is a follow-up of [SPARK-3288](https://issues.apache.org/jira/browse/SPARK-3288).

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #6896 from ueshin/issues/SPARK-8476 and squashes the following commits:

89251d8 [Takuya UESHIN] Make inc/decDiskBytesSpilled in TaskMetrics private[spark].
2015-06-19 10:48:16 -07:00
Lianhui Wang 9baf093014 [SPARK-8430] ExternalShuffleBlockResolver of shuffle service should support UnsafeShuffleManager
andrewor14 can you take a look?thanks

Author: Lianhui Wang <lianhuiwang09@gmail.com>

Closes #6873 from lianhuiwang/SPARK-8430 and squashes the following commits:

51c47ca [Lianhui Wang] update andrewor's comments
2b27b19 [Lianhui Wang] support UnsafeShuffleManager
2015-06-19 10:47:07 -07:00
Liang-Chi Hsieh 2c59d5c12a [SPARK-8207] [SQL] Add math function bin
JIRA: https://issues.apache.org/jira/browse/SPARK-8207

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6721 from viirya/expr_bin and squashes the following commits:

07e1c8f [Liang-Chi Hsieh] Remove AbstractUnaryMathExpression and let BIN inherit UnaryExpression.
0677f1a [Liang-Chi Hsieh] For comments.
cf62b95 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin
0cf20f2 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin
dea9c12 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin
d4f4774 [Liang-Chi Hsieh] Add @ignore_unicode_prefix.
7a0196f [Liang-Chi Hsieh] Fix python style.
ac2bacd [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin
a0a2d0f [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin
4cb764d [Liang-Chi Hsieh] For comments.
0f78682 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin
c0c3197 [Liang-Chi Hsieh] Add bin to FunctionRegistry.
824f761 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin
50e0c3b [Liang-Chi Hsieh] Add math function bin(a: long): string.
2015-06-19 10:09:31 -07:00
Xiangrui Meng 43c7ec6384 [SPARK-8151] [MLLIB] pipeline components should correctly implement copy
Otherwise, extra params get ignored in `PipelineModel.transform`. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6622 from mengxr/SPARK-8087 and squashes the following commits:

0e4c8c4 [Xiangrui Meng] fix merge issues
26fc1f0 [Xiangrui Meng] address comments
e607a04 [Xiangrui Meng] merge master
b85b57e [Xiangrui Meng] fix examples/compile
d6f7891 [Xiangrui Meng] rename defaultCopyWithParams to defaultCopy
84ec278 [Xiangrui Meng] remove setter checks due to generics
2cf2ed0 [Xiangrui Meng] snapshot
291814f [Xiangrui Meng] OneVsRest.copy
1dfe3bd [Xiangrui Meng] PipelineModel.copy should copy stages
2015-06-19 09:46:51 -07:00
cody koeninger 47af7c1ebf [SPARK-8389] [STREAMING] [KAFKA] Example of getting offset ranges out o…
…f the existing java direct stream api

Author: cody koeninger <cody@koeninger.org>

Closes #6846 from koeninger/SPARK-8389 and squashes the following commits:

3f3c57a [cody koeninger] [Streaming][Kafka][SPARK-8389] Example of getting offset ranges out of the existing java direct stream api
2015-06-19 14:51:19 +02:00
Jihong MA ebd363aecd [SPARK-7265] Improving documentation for Spark SQL Hive support
Please review this pull request.

Author: Jihong MA <linlin200605@gmail.com>

Closes #5933 from JihongMA/SPARK-7265 and squashes the following commits:

dfaa971 [Jihong MA] SPARK-7265 minor fix of the content
ace454d [Jihong MA] SPARK-7265 take out PySpark on YARN limitation
9ea0832 [Jihong MA] Merge remote-tracking branch 'upstream/master'
d5bf3f5 [Jihong MA] Merge remote-tracking branch 'upstream/master'
7b842e6 [Jihong MA] Merge remote-tracking branch 'upstream/master'
9c84695 [Jihong MA] SPARK-7265 address review comment
a399aa6 [Jihong MA] SPARK-7265 Improving documentation for Spark SQL Hive support
2015-06-19 14:06:49 +02:00
zsxwing 93360dc3cd [SPARK-7913] [CORE] Make AppendOnlyMap use the same growth strategy of OpenHashSet and consistent exception message
This is a follow up PR for #6456 to make AppendOnlyMap consistent with OpenHashSet.

/cc srowen andrewor14

Author: zsxwing <zsxwing@gmail.com>

Closes #6879 from zsxwing/append-only-map and squashes the following commits:

912c0ad [zsxwing] Fix the doc
dd4385b [zsxwing] Make AppendOnlyMap use the same growth strategy of OpenHashSet and consistent exception message
2015-06-19 11:58:07 +02:00
Carson Wang 54557f353e [SPARK-8387] [FOLLOWUP ] [WEBUI] Update driver log URL to show only 4096 bytes
This is to follow up #6834 , update the driver log URL as well for consistency.

Author: Carson Wang <carson.wang@intel.com>

Closes #6878 from carsonwang/logUrl and squashes the following commits:

13be948 [Carson Wang] update log URL in YarnClusterSuite
a0004f4 [Carson Wang] Update driver log URL to show only 4096 bytes
2015-06-19 09:57:12 +02:00
Kevin Conor fdf63f1249 [SPARK-8339] [PYSPARK] integer division for python 3
Itertools islice requires an integer for the stop argument.  Switching to integer division here prevents a ValueError when vs is evaluated above.

davies

This is my original work, and I license it to the project.

Author: Kevin Conor <kevin@discoverybayconsulting.com>

Closes #6794 from kconor/kconor-patch-1 and squashes the following commits:

da5e700 [Kevin Conor] Integer division for batch size
2015-06-19 00:12:20 -07:00
Bryan Cutler a2016b4bc4 [SPARK-8444] [STREAMING] Adding Python streaming example for queueStream
A Python example similar to the existing one for Scala.

Author: Bryan Cutler <bjcutler@us.ibm.com>

Closes #6884 from BryanCutler/streaming-queueStream-example-8444 and squashes the following commits:

435ba7e [Bryan Cutler] [SPARK-8444] Fixed style checks, increased sleep time to show empty queue
257abb0 [Bryan Cutler] [SPARK-8444] Stop context gracefully, Removed unused import, Added description comment
376ef6e [Bryan Cutler] [SPARK-8444] Fixed bug causing DStream.pprint to append empty parenthesis to output instead of blank line
1ff5f8b [Bryan Cutler] [SPARK-8444] Adding Python streaming example for queue_stream
2015-06-19 00:07:53 -07:00
Yu ISHIKAWA 754929b153 [SPARK-8348][SQL] Add in operator to DataFrame Column
I have added it for only Scala.

TODO: we should also support `in` operator in Python.

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6824 from yu-iskw/SPARK-8348 and squashes the following commits:

e76d02f [Yu ISHIKAWA] Not use infix notation
6f744ac [Yu ISHIKAWA] Fit the test cases because these used the old test data set.
00077d3 [Yu ISHIKAWA] [SPARK-8348][SQL] Add in operator to DataFrame Column
2015-06-18 23:13:05 -07:00
Cheng Lian a71cbbdea5 [SPARK-8458] [SQL] Don't strip scheme part of output path when writing ORC files
`Path.toUri.getPath` strips scheme part of output path (from `file:///foo` to `/foo`), which causes ORC data source only writes to the file system configured in Hadoop configuration. Should use `Path.toString` instead.

Author: Cheng Lian <lian@databricks.com>

Closes #6892 from liancheng/spark-8458 and squashes the following commits:

87f8199 [Cheng Lian] Don't strip scheme of output path when writing ORC files
2015-06-18 22:01:52 -07:00
Dibyendu Bhattacharya 3eaed8769c [SPARK-8080] [STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
tdas  zsxwing this is the new PR for Spark-8080

I have merged https://github.com/apache/spark/pull/6659

Also to mention , for MEMORY_ONLY settings , when Block is not able to unrollSafely to memory if enough space is not there, BlockManager won't try to put the block and ReceivedBlockHandler will throw SparkException as it could not find the block id in PutResult. Thus number of records in block won't be counted if Block failed to unroll in memory. Which is fine.

For MEMORY_DISK settings , if BlockManager not able to unroll block to memory, block will still get deseralized to Disk. Same for WAL based store. So for those cases ( storage level = memory + disk )  number of records will be counted even though the block not able to unroll to memory.

thus I added the isFullyConsumed in the CountingIterator but have not used it as such case will never happen that block not fully consumed and ReceivedBlockHandler still get the block ID.

I have added few test cases to cover those block unrolling scenarios also.

Author: Dibyendu Bhattacharya <dibyendu.bhattacharya1@pearson.com>
Author: U-PEROOT\UBHATD1 <UBHATD1@PIN-L-PI046.PEROOT.com>

Closes #6707 from dibbhatt/master and squashes the following commits:

f6cb6b5 [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
f37cfd8 [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
5a8344a [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI Count ByteBufferBlock as 1 count
fceac72 [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
0153e7e [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI Fixed comments given by @zsxwing
4c5931d [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
01e6dc8 [U-PEROOT\UBHATD1] A
2015-06-18 20:00:05 -07:00
Lars Francke 4ce3bab89f [SPARK-8462] [DOCS] Documentation fixes for Spark SQL
This fixes various minor documentation issues on the Spark SQL page

Author: Lars Francke <lars.francke@gmail.com>

Closes #6890 from lfrancke/SPARK-8462 and squashes the following commits:

dd7e302 [Lars Francke] Merge branch 'master' into SPARK-8462
34eff2c [Lars Francke] Minor documentation fixes
2015-06-18 19:40:32 -07:00
Sandy Ryza 43f50decdd [SPARK-8135] Don't load defaults when reconstituting Hadoop Configurations
Author: Sandy Ryza <sandy@cloudera.com>

Closes #6679 from sryza/sandy-spark-8135 and squashes the following commits:

c5554ff [Sandy Ryza] SPARK-8135. In SerializableWritable, don't load defaults when instantiating Configuration
2015-06-18 19:36:05 -07:00
Reynold Xin dc41313899 [SPARK-8218][SQL] Binary log math function update.
Some minor updates based on after merging #6725.

Author: Reynold Xin <rxin@databricks.com>

Closes #6871 from rxin/log and squashes the following commits:

ab51542 [Reynold Xin] Use JVM log
76fc8de [Reynold Xin] Fixed arg.
a7c1522 [Reynold Xin] [SPARK-8218][SQL] Binary log math function update.
2015-06-18 18:41:15 -07:00
Josh Rosen 207a98ca59 [SPARK-8446] [SQL] Add helper functions for testing SparkPlan physical operators
This patch introduces `SparkPlanTest`, a base class for unit tests of SparkPlan physical operators.  This is analogous to Spark SQL's existing `QueryTest`, which does something similar for end-to-end tests with actual queries.

These helper methods provide nicer error output when tests fail and help developers to avoid writing lots of boilerplate in order to execute manually constructed physical plans.

Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <rosenville@gmail.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #6885 from JoshRosen/spark-plan-test and squashes the following commits:

f8ce275 [Josh Rosen] Fix some IntelliJ inspections and delete some dead code
84214be [Josh Rosen] Add an extra column which isn't part of the sort
ae1896b [Josh Rosen] Provide implicits automatically
a80f9b0 [Josh Rosen] Merge pull request #4 from marmbrus/pr/6885
d9ab1e4 [Michael Armbrust] Add simple resolver
c60a44d [Josh Rosen] Manually bind references
996332a [Josh Rosen] Add types so that tests compile
a46144a [Josh Rosen] WIP
2015-06-18 16:45:14 -07:00
zsxwing 24e53793b4 [SPARK-8376] [DOCS] Add common lang3 to the Spark Flume Sink doc
Commons Lang 3 has been added as one of the dependencies of Spark Flume Sink since #5703. This PR updates the doc for it.

Author: zsxwing <zsxwing@gmail.com>

Closes #6829 from zsxwing/flume-sink-dep and squashes the following commits:

f8617f0 [zsxwing] Add common lang3 to the Spark Flume Sink doc
2015-06-18 16:00:27 -07:00