Commit graph

11385 commits

Author SHA1 Message Date
lockwobr 27693e1757 [SQL] [DOCS] updated the documentation for explode
the syntax was incorrect in the example in explode

Author: lockwobr <lockwobr@gmail.com>

Closes #6943 from lockwobr/master and squashes the following commits:

3d864d1 [lockwobr] updated the documentation for explode

(cherry picked from commit 4f7fbefb8d)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
2015-06-24 02:51:36 +09:00
Josh Rosen 77cb1d5ed1 Revert "[SPARK-8498] [TUNGSTEN] fix npe in errorhandling path in unsafeshuffle writer"
This reverts commit 3348245055.

Reverting because `catch (Exception e) ... throw e` doesn't compile under
Java 6 unless the method declares that it throws Exception.
2015-06-23 09:19:11 -07:00
Holden Karau 3348245055 [SPARK-8498] [TUNGSTEN] fix npe in errorhandling path in unsafeshuffle writer
Author: Holden Karau <holden@pigscanfly.ca>

Closes #6918 from holdenk/SPARK-8498-fix-npe-in-errorhandling-path-in-unsafeshuffle-writer and squashes the following commits:

f807832 [Holden Karau] Log error if we can't throw it
855f9aa [Holden Karau] Spelling - not my strongest suite. Fix Propegates to Propagates.
039d620 [Holden Karau] Add missing closeandwriteoutput
30e558d [Holden Karau] go back to try/finally
e503b8c [Holden Karau] Improve the test to ensure we aren't masking the underlying exception
ae0b7a7 [Holden Karau] Fix the test
2e6abf7 [Holden Karau] Be more cautious when cleaning up during failed write and re-throw user exceptions

(cherry picked from commit 0f92be5b5f)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
2015-06-23 09:08:49 -07:00
Hari Shreedharan 9294796750 [SPARK-8483] [STREAMING] Remove commons-lang3 dependency from Flume Sink
Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #6910 from harishreedharan/remove-commons-lang3 and squashes the following commits:

9875f7d [Hari Shreedharan] Revert back to Flume 1.4.0
ca35eb0 [Hari Shreedharan] [SPARK-8483][Streaming] Remove commons-lang3 dependency from Flume Sink. Also bump Flume version to 1.6.0
2015-06-22 23:41:35 -07:00
Scott Taylor d0943afbcf [SPARK-8541] [PYSPARK] test the absolute error in approx doctests
A minor change but one which is (presumably) visible on the public api docs webpage.

Author: Scott Taylor <github@megatron.me.uk>

Closes #6942 from megatron-me-uk/patch-3 and squashes the following commits:

fbed000 [Scott Taylor] test the absolute error in approx doctests

(cherry picked from commit f0dcbe8a7c)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
2015-06-22 23:38:21 -07:00
Holden Karau 22cc1ab66e [SPARK-7781] [MLLIB] gradient boosted trees.train regressor missing max bins
Author: Holden Karau <holden@pigscanfly.ca>

Closes #6331 from holdenk/SPARK-7781-GradientBoostedTrees.trainRegressor-missing-max-bins and squashes the following commits:

2894695 [Holden Karau] remove extra blank line
2573e8d [Holden Karau] Update the scala side of the pythonmllibapi and make the test a bit nicer too
3a09170 [Holden Karau] add maxBins to to the train method as well
af7f274 [Holden Karau] Add maxBins to GradientBoostedTrees.trainRegressor and correctly mention the default of 32 in other places where it mentioned 100

(cherry picked from commit 164fe2aa44)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
2015-06-22 22:40:31 -07:00
Patrick Wendell 1cfa7302ee Preparing development version 1.4.2-SNAPSHOT 2015-06-22 22:21:31 -07:00
Patrick Wendell d0a5560ce4 Preparing Spark release v1.4.1-rc1 2015-06-22 22:21:26 -07:00
Patrick Wendell 48d6830144 [BUILD] Preparing Spark release 1.4.1 2015-06-22 22:18:52 -07:00
Yu ISHIKAWA 250179485b [SPARK-8548] [SPARKR] Remove the trailing whitespaces from the SparkR files
[[SPARK-8548] Remove the trailing whitespaces from the SparkR files - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8548)

- This is the result of `lint-r`
    https://gist.github.com/yu-iskw/0019b37a2c1167f33986

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6945 from yu-iskw/SPARK-8548 and squashes the following commits:

0bd567a [Yu ISHIKAWA] [SPARK-8548][SparkR] Remove the trailing whitespaces from the SparkR files

(cherry picked from commit 44fa7df64d)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
2015-06-22 20:55:55 -07:00
Cheng Hao d73900a903 [SPARK-7859] [SQL] Collect_set() behavior differences which fails the unit test under jdk8
To reproduce that:
```
JAVA_HOME=/home/hcheng/Java/jdk1.8.0_45 build/sbt -Phadoop-2.3 -Phive 'test-only org.apache.spark.sql.hive.execution.HiveWindowFunctionQueryWithoutCodeGenSuite'
```

A simple workaround is to update the original query so that it checks the output size instead of the exact elements of the array returned by collect_set().

Author: Cheng Hao <hao.cheng@intel.com>

Closes #6402 from chenghao-intel/windowing and squashes the following commits:

99312ad [Cheng Hao] add order by for the select clause
edf8ce3 [Cheng Hao] update the code as suggested
7062da7 [Cheng Hao] fix the collect_set() behaviour differences under different versions of JDK

(cherry picked from commit 13321e6555)
Signed-off-by: Yin Huai <yhuai@databricks.com>
2015-06-22 20:05:00 -07:00
Yin Huai 994abbaeb3 [SPARK-8532] [SQL] In Python's DataFrameWriter, save/saveAsTable/json/parquet/jdbc always override mode
https://issues.apache.org/jira/browse/SPARK-8532

This PR has two changes. First, it fixes the bug that save actions (i.e. `save/saveAsTable/json/parquet/jdbc`) always override mode. Second, it adds input argument `partitionBy` to `save/saveAsTable/parquet`.
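
The first change can be pictured with a toy writer. This is a minimal sketch in plain Python, not the PySpark implementation: `ToyWriter`, `save_buggy`, and `save_fixed` are hypothetical names. The only behavior taken from this PR is that the default of `"error"` was replaced by `None`, so an earlier `mode(...)` call is no longer overwritten.

```python
class ToyWriter:
    """Hypothetical stand-in for a writer with a chained mode() setter."""

    def __init__(self):
        self._mode = "error"  # assumed writer-side default

    def mode(self, save_mode):
        self._mode = save_mode
        return self

    def save_buggy(self, path, mode="error"):
        # The default argument is always applied, so it silently
        # replaces whatever an earlier mode(...) call configured.
        self.mode(mode)
        return (path, self._mode)

    def save_fixed(self, path, mode=None):
        # Only touch the mode when the caller passed one explicitly.
        if mode is not None:
            self.mode(mode)
        return (path, self._mode)


buggy_result = ToyWriter().mode("append").save_buggy("/tmp/out")
fixed_result = ToyWriter().mode("append").save_fixed("/tmp/out")
```

With the buggy variant the chained `mode("append")` is lost; with the fixed variant it survives unless `save_fixed` is given its own `mode`.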

Author: Yin Huai <yhuai@databricks.com>

Closes #6937 from yhuai/SPARK-8532 and squashes the following commits:

f972d5d [Yin Huai] davies's comment.
d37abd2 [Yin Huai] style.
d21290a [Yin Huai] Python doc.
889eb25 [Yin Huai] Minor refactoring and add partitionBy to save, saveAsTable, and parquet.
7fbc24b [Yin Huai] Use None instead of "error" as the default value of mode since JVM-side already uses "error" as the default value.
d696dff [Yin Huai] Python style.
88eb6c4 [Yin Huai] If mode is "error", do not call mode method.
c40c461 [Yin Huai] Regression test.

(cherry picked from commit 5ab9fcfb01)
Signed-off-by: Yin Huai <yhuai@databricks.com>
2015-06-22 13:51:34 -07:00
Yu ISHIKAWA 507381d393 [SPARK-8511] [PYSPARK] Modify a test to remove a saved model in regression.py
[[SPARK-8511] Modify a test to remove a saved model in `regression.py` - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8511)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6926 from yu-iskw/SPARK-8511 and squashes the following commits:

7cd0948 [Yu ISHIKAWA] Use `shutil.rmtree()` to temporary directories for saving model testings, instead of `os.removedirs()`
4a01c9e [Yu ISHIKAWA] [SPARK-8511][pyspark] Modify a test to remove a saved model in `regression.py`

(cherry picked from commit 5d89d9f00b)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>

Conflicts:
	python/pyspark/mllib/tests.py
2015-06-22 11:59:53 -07:00
Michael Armbrust 65981619b2 [SPARK-8420] [SQL] Fix comparision of timestamps/dates with strings (branch-1.4)
This is branch 1.4 backport of https://github.com/apache/spark/pull/6888.

Below is the original description.

In earlier versions of Spark SQL we cast `TimestampType` and `DateType` to `StringType` when either was involved in a binary comparison with a `StringType`.  This allowed comparing a timestamp with a partial date as a user would expect.
 - `time > "2014-06-10"`
 - `time > "2014"`

In 1.4.0 we tried to cast the String instead into a Timestamp.  However, since partial dates are not a valid complete timestamp this results in `null` which results in the tuple being filtered.

This PR restores the earlier behavior.  Note that we still special case equality so that these comparisons are not affected by not printing zeros for subsecond precision.
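
The difference between the two casting directions can be sketched in plain Python. This is an illustration under assumptions, not Spark SQL code: the helper names are hypothetical, `None` stands in for SQL `null`, and the timestamp format string is assumed.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed string form of a timestamp


def cast_string_to_timestamp(s):
    # A partial date is not a complete timestamp, so the cast yields
    # None (standing in for SQL null).
    try:
        return datetime.strptime(s, FMT)
    except ValueError:
        return None


def greater_than_casting_string(t, s):
    # 1.4.0-style direction: cast the string; a null operand makes the
    # comparison null, so the row would be filtered out.
    ts = cast_string_to_timestamp(s)
    return None if ts is None else t > ts


def greater_than_casting_timestamp(t, s):
    # Earlier/restored direction: cast the timestamp to a string and
    # compare lexicographically.
    return t.strftime(FMT) > s


row_time = datetime(2014, 6, 15, 8, 30, 0)
as_timestamp = greater_than_casting_string(row_time, "2014-06-10")
as_string = greater_than_casting_timestamp(row_time, "2014-06-10")
```

Here `as_timestamp` is `None` while `as_string` is `True`, which matches the user's expectation for `time > "2014-06-10"`.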

Author: Michael Armbrust <michael@databricks.com>

Closes #6888 from marmbrus/timeCompareString and squashes the following commits:

bdef29c [Michael Armbrust] test partial date
1f09adf [Michael Armbrust] special handling of equality
1172c60 [Michael Armbrust] more test fixing
4dfc412 [Michael Armbrust] fix tests
aaa9508 [Michael Armbrust] newline
04d908f [Michael Armbrust] [SPARK-8420][SQL] Fix comparision of timestamps/dates with strings

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

Author: Michael Armbrust <michael@databricks.com>

Closes #6914 from yhuai/timeCompareString-1.4 and squashes the following commits:

9882915 [Michael Armbrust] [SPARK-8420] [SQL] Fix comparision of timestamps/dates with strings
2015-06-22 10:45:33 -07:00
Cheng Lian 451c8722af [SPARK-8406] [SQL] Backports SPARK-8406 and PR #6864 to branch-1.4
Author: Cheng Lian <lian@databricks.com>

Closes #6932 from liancheng/spark-8406-for-1.4 and squashes the following commits:

a0168fe [Cheng Lian] Backports SPARK-8406 and PR #6864 to branch-1.4
2015-06-22 10:04:29 -07:00
Liang-Chi Hsieh b836bac3fe [HOTFIX] Hotfix branch-1.4 building by removing avgMetrics in CrossValidatorSuite
Ref. #6905
ping yhuai

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6929 from viirya/hot_fix_cv_test and squashes the following commits:

b1aec53 [Liang-Chi Hsieh] Hotfix branch-1.4 by removing avgMetrics in CrossValidatorSuite.
2015-06-21 22:25:08 -07:00
Joseph K. Bradley 2a7ea31a9e [SPARK-7715] [MLLIB] [ML] [DOC] Updated MLlib programming guide for release 1.4
Reorganized docs a bit.  Added migration guides.

**Q**: Do we want to say more for the 1.3 -> 1.4 migration guide for ```spark.ml```?  It would be a lot.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6897 from jkbradley/ml-guide-1.4 and squashes the following commits:

4bf26d6 [Joseph K. Bradley] tiny fix
8085067 [Joseph K. Bradley] fixed spacing/layout issues in ml guide from previous commit in this PR
6cd5c78 [Joseph K. Bradley] Updated MLlib programming guide for release 1.4

(cherry picked from commit a1894422ad)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-06-21 16:27:14 -07:00
jeanlyn f0e4040202 [SPARK-8379] [SQL] avoid speculative tasks write to the same file
The issue link: [SPARK-8379](https://issues.apache.org/jira/browse/SPARK-8379)
Currently, when we insert data into a dynamic partition with speculative tasks, we get the following exception:
```
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
Lease mismatch on /tmp/hive-jeanlyn/hive_2015-06-15_15-20-44_734_8801220787219172413-1/-ext-10000/ds=2015-06-15/type=2/part-00301.lzo
owned by DFSClient_attempt_201506031520_0011_m_000189_0_-1513487243_53
but is accessed by DFSClient_attempt_201506031520_0011_m_000042_0_-1275047721_57
```
This PR writes the data to a temporary directory when using dynamic partitioning, so that speculative tasks do not write to the same file.

Author: jeanlyn <jeanlyn92@gmail.com>

Closes #6833 from jeanlyn/speculation and squashes the following commits:

64bbfab [jeanlyn] use FileOutputFormat.getTaskOutputPath to get the path
8860af0 [jeanlyn] remove the never using code
e19a3bd [jeanlyn] avoid speculative tasks write same file

(cherry picked from commit a1e3649c87)
Signed-off-by: Cheng Lian <lian@databricks.com>
2015-06-21 00:13:55 -07:00
Liang-Chi Hsieh fe59a4a5f5 [SPARK-8468] [ML] Take the negative of some metrics in RegressionEvaluator to get correct cross validation
JIRA: https://issues.apache.org/jira/browse/SPARK-8468

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6905 from viirya/cv_min and squashes the following commits:

930d3db [Liang-Chi Hsieh] Fix python unit test and add document.
d632135 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cv_min
16e3b2c [Liang-Chi Hsieh] Take the negative instead of reciprocal.
c3dd8d9 [Liang-Chi Hsieh] For comments.
b5f52c1 [Liang-Chi Hsieh] Add param to CrossValidator for choosing whether to maximize evaulation value.

(cherry picked from commit 0b8995168f)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
2015-06-20 13:02:14 -07:00
Andrew Or 9b16508d2c [HOTFIX] [SPARK-8489] Correct JIRA number in previous commit
It should be SPARK-8489, not SPARK-8498.
2015-06-19 17:40:21 -07:00
cody koeninger a7b773a8b5 [SPARK-8390] [STREAMING] [KAFKA] fix docs related to HasOffsetRanges
Author: cody koeninger <cody@koeninger.org>

Closes #6863 from koeninger/SPARK-8390 and squashes the following commits:

26a06bd [cody koeninger] Merge branch 'master' into SPARK-8390
3744492 [cody koeninger] [Streaming][Kafka][SPARK-8390] doc changes per TD, test to make sure approach shown in docs actually compiles + runs
b108c9d [cody koeninger] [Streaming][Kafka][SPARK-8390] further doc fixes, clean up spacing
bb4336b [cody koeninger] [Streaming][Kafka][SPARK-8390] fix docs related to HasOffsetRanges, cleanup
3f3c57a [cody koeninger] [Streaming][Kafka][SPARK-8389] Example of getting offset ranges out of the existing java direct stream api
2015-06-19 17:36:59 -07:00
cody koeninger 78d0ceea82 [SPARK-8389] [STREAMING] [KAFKA] Example of getting offset ranges out of the existing java direct stream api

Author: cody koeninger <cody@koeninger.org>

Closes #6846 from koeninger/SPARK-8389 and squashes the following commits:

3f3c57a [cody koeninger] [Streaming][Kafka][SPARK-8389] Example of getting offset ranges out of the existing java direct stream api
2015-06-19 17:36:54 -07:00
Andrew Or 2248ad8b70 [SPARK-8498] [SQL] Add regression test for SPARK-8470
**Summary of the problem in SPARK-8470.** When using `HiveContext` to create a data frame of a user case class, Spark throws `scala.reflect.internal.MissingRequirementError` when it tries to infer the schema using reflection. This is caused by `HiveContext` silently overwriting the context class loader containing the user classes.

**What this issue is about.** This issue adds regression tests for SPARK-8470, which is already fixed in #6891. We closed SPARK-8470 as a duplicate because it is a different manifestation of the same problem in SPARK-8368. Due to the complexity of the reproduction, this requires us to pre-package a special test jar and include it in the Spark project itself.

I tested this with and without the fix in #6891 and verified that it passes only if the fix is present.

Author: Andrew Or <andrew@databricks.com>

Closes #6909 from andrewor14/SPARK-8498 and squashes the following commits:

5e9d688 [Andrew Or] Add regression test for SPARK-8470

(cherry picked from commit 093c34838d)
Signed-off-by: Yin Huai <yhuai@databricks.com>
2015-06-19 17:34:36 -07:00
Yin Huai 2510365faa [HOT-FIX] Fix compilation (caused by 0131142d98)
Author: Yin Huai <yhuai@databricks.com>

Closes #6913 from yhuai/branch-1.4-hotfix and squashes the following commits:

7f91fa0 [Yin Huai] [HOT-FIX] Fix compilation (caused by 0131142d98).
2015-06-19 17:29:51 -07:00
Nathan Howell 0131142d98 [SPARK-8093] [SQL] Remove empty structs inferred from JSON documents
Author: Nathan Howell <nhowell@godaddy.com>

Closes #6799 from NathanHowell/spark-8093 and squashes the following commits:

76ac3e8 [Nathan Howell] [SPARK-8093] [SQL] Remove empty structs inferred from JSON documents

(cherry picked from commit 9814b971f0)
Signed-off-by: Yin Huai <yhuai@databricks.com>

Conflicts:
	sql/core/src/test/scala/org/apache/spark/sql/json/TestJsonData.scala
2015-06-19 16:23:11 -07:00
Hossein 1a6b510784 [SPARK-8452] [SPARKR] expose jobGroup API in SparkR
This pull request adds the following methods to SparkR:

```R
setJobGroup()
cancelJobGroup()
clearJobGroup()
```
For each method, the spark context is passed as the first argument. There does not seem to be a good way to test these in R.

cc shivaram and davies

Author: Hossein <hossein@databricks.com>

Closes #6889 from falaki/SPARK-8452 and squashes the following commits:

9ce9f1e [Hossein] Added basic tests to verify methods can be called and won't throw errors
c706af9 [Hossein] Added examples
a2c19af [Hossein] taking spark context as first argument
343ca77 [Hossein] Added setJobGroup, cancelJobGroup and clearJobGroup to SparkR

(cherry picked from commit 1fa29c2df2)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
2015-06-19 15:52:27 -07:00
Yin Huai 9ac8393663 [SPARK-8368] [SPARK-8058] [SQL] HiveContext may override the context class loader of the current thread (branch 1.4)
This is for 1.4 branch (based on https://github.com/apache/spark/pull/6891).

Author: Yin Huai <yhuai@databricks.com>

Closes #6895 from yhuai/SPARK-8368-1.4 and squashes the following commits:

adbbbc9 [Yin Huai] Minor update.
3cca0e9 [Yin Huai] Correctly set the class loader in the conf of the state in client wrapper.
b1e14a9 [Yin Huai] Failed tests.
2015-06-19 11:15:28 -07:00
Tathagata Das 4b2c793a27 [SPARK-7180] [SPARK-8090] [SPARK-8091] Fix a number of SerializationDebugger bugs and limitations
This PR solves three SerializationDebugger issues.
* SPARK-7180 - SerializationDebugger fails with ArrayOutOfBoundsException
* SPARK-8090 - SerializationDebugger does not handle classes with writeReplace correctly
* SPARK-8091 - SerializationDebugger does not handle classes with writeObject method

The solutions for each are explained as follows
* SPARK-7180 - The wrong slot desc was used for getting the value of the fields in the object being tested.
* SPARK-8090 - Test the type of the replaced object.
* SPARK-8091 - Use a dummy ObjectOutputStream to collect all the objects written by the writeObject() method, and then test those objects as usual.

I also added more tests in the testsuite to increase code coverage. For example, added tests for cases where there are not serializability issues.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #6625 from tdas/SPARK-7180 and squashes the following commits:

c7cb046 [Tathagata Das] Addressed comments on docs
ae212c8 [Tathagata Das] Improved docs
304c97b [Tathagata Das] Fixed build error
26b5179 [Tathagata Das] more tests.....92% line coverage
7e2fdcf [Tathagata Das] Added more tests
d1967fb [Tathagata Das] Added comments.
da75d34 [Tathagata Das] Removed unnecessary lines.
50a608d [Tathagata Das] Fixed bugs and added support for writeObject
2015-06-19 11:06:32 -07:00
Sean Owen 3415fb978b [SPARK-5836] [DOCS] [STREAMING] Clarify what may cause long-running Spark apps to preserve shuffle files
Clarify what may cause long-running Spark apps to preserve shuffle files

Author: Sean Owen <sowen@cloudera.com>

Closes #6901 from srowen/SPARK-5836 and squashes the following commits:

a9faef0 [Sean Owen] Clarify what may cause long-running Spark apps to preserve shuffle files

(cherry picked from commit 4be53d0395)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-19 11:03:12 -07:00
Andrew Or aedd893b42 [SPARK-8451] [SPARK-7287] SparkSubmitSuite should check exit code
This patch also reenables the tests. Now that we have access to the log4j logs it should be easier to debug the flakiness.

yhuai brkyvz

Author: Andrew Or <andrew@databricks.com>

Closes #6886 from andrewor14/spark-submit-suite-fix and squashes the following commits:

3f99ff1 [Andrew Or] Move destroy to finally block
9a62188 [Andrew Or] Re-enable ignored tests
2382672 [Andrew Or] Check for exit code

(cherry picked from commit 68a2dca292)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-19 10:56:36 -07:00
Lianhui Wang 6f2e411084 [SPARK-8430] ExternalShuffleBlockResolver of shuffle service should support UnsafeShuffleManager
andrewor14 can you take a look? Thanks.

Author: Lianhui Wang <lianhuiwang09@gmail.com>

Closes #6873 from lianhuiwang/SPARK-8430 and squashes the following commits:

51c47ca [Lianhui Wang] update andrewor's comments
2b27b19 [Lianhui Wang] support UnsafeShuffleManager

(cherry picked from commit 9baf093014)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-19 10:47:15 -07:00
Xiangrui Meng 1f2dafb77f [SPARK-8151] [MLLIB] pipeline components should correctly implement copy
Otherwise, extra params get ignored in `PipelineModel.transform`. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6622 from mengxr/SPARK-8087 and squashes the following commits:

0e4c8c4 [Xiangrui Meng] fix merge issues
26fc1f0 [Xiangrui Meng] address comments
e607a04 [Xiangrui Meng] merge master
b85b57e [Xiangrui Meng] fix examples/compile
d6f7891 [Xiangrui Meng] rename defaultCopyWithParams to defaultCopy
84ec278 [Xiangrui Meng] remove setter checks due to generics
2cf2ed0 [Xiangrui Meng] snapshot
291814f [Xiangrui Meng] OneVsRest.copy
1dfe3bd [Xiangrui Meng] PipelineModel.copy should copy stages

(cherry picked from commit 43c7ec6384)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-06-19 10:05:07 -07:00
Kevin Conor 164b9d32e7 [SPARK-8339] [PYSPARK] integer division for python 3
Itertools islice requires an integer for the stop argument.  Switching to integer division here prevents a ValueError when vs is evaluated above.
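
A minimal sketch of the failure and the fix, assuming Python 3. The names `batch`, `float_stop_accepted`, and `first_chunk` are illustrative and not taken from the patched file.

```python
from itertools import islice

batch = 5  # hypothetical batch size

# In Python 3, "/" is true division, so batch / 2 is the float 2.5,
# which islice rejects as a stop argument.
try:
    list(islice(iter(range(10)), batch / 2))
    float_stop_accepted = True
except ValueError:
    float_stop_accepted = False

# "//" is floor division and yields the int 2, which islice accepts.
first_chunk = list(islice(iter(range(10)), batch // 2))
```

Under Python 2 the `/` operator on two ints already performed integer division, which is why the problem only appears on Python 3.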

davies

This is my original work, and I license it to the project.

Author: Kevin Conor <kevin@discoverybayconsulting.com>

Closes #6794 from kconor/kconor-patch-1 and squashes the following commits:

da5e700 [Kevin Conor] Integer division for batch size

(cherry picked from commit fdf63f1249)
Signed-off-by: Davies Liu <davies@databricks.com>
2015-06-19 00:12:43 -07:00
Cheng Lian f48f3a2e2f [SPARK-8458] [SQL] Don't strip scheme part of output path when writing ORC files
`Path.toUri.getPath` strips the scheme part of the output path (from `file:///foo` to `/foo`), which causes the ORC data source to write only to the file system configured in the Hadoop configuration. We should use `Path.toString` instead.
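
The same effect can be seen with any URI parser. The sketch below uses Python's `urllib.parse` purely as an analogy for Hadoop's `Path`; the `hdfs://namenode:8020/...` location is a made-up example.

```python
from urllib.parse import urlparse

local_output = "file:///foo"
remote_output = "hdfs://namenode:8020/warehouse/orc_table"  # hypothetical

# Taking only the path component drops the scheme and authority, so
# both locations become indistinguishable from a path on whatever
# file system happens to be the default.
local_path_only = urlparse(local_output).path
remote_path_only = urlparse(remote_output).path
```

After this step `remote_path_only` is `/warehouse/orc_table`, with no trace of the `hdfs` scheme or the namenode, whereas keeping the full string preserves both.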

Author: Cheng Lian <lian@databricks.com>

Closes #6892 from liancheng/spark-8458 and squashes the following commits:

87f8199 [Cheng Lian] Don't strip scheme of output path when writing ORC files

(cherry picked from commit a71cbbdea5)
Signed-off-by: Cheng Lian <lian@databricks.com>
2015-06-18 22:02:13 -07:00
Dibyendu Bhattacharya b55e4b9a52 [SPARK-8080] [STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
tdas  zsxwing this is the new PR for Spark-8080

I have merged https://github.com/apache/spark/pull/6659

Note that for MEMORY_ONLY settings, when a block cannot be unrolled safely to memory because there is not enough space, BlockManager won't try to put the block, and ReceivedBlockHandler will throw a SparkException because it cannot find the block id in PutResult. Thus the number of records in the block won't be counted if the block failed to unroll in memory, which is fine.

For MEMORY_DISK settings, if BlockManager is not able to unroll the block to memory, the block will still get deserialized to disk. The same holds for the WAL-based store. So in those cases (storage level = memory + disk) the number of records will be counted even though the block could not be unrolled to memory.

Thus I added isFullyConsumed to the CountingIterator but have not used it, since it can never happen that a block is not fully consumed and ReceivedBlockHandler still gets the block ID.

I have also added a few test cases to cover those block unrolling scenarios.

Author: Dibyendu Bhattacharya <dibyendu.bhattacharya1@pearson.com>
Author: U-PEROOT\UBHATD1 <UBHATD1@PIN-L-PI046.PEROOT.com>

Closes #6707 from dibbhatt/master and squashes the following commits:

f6cb6b5 [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
f37cfd8 [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
5a8344a [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI Count ByteBufferBlock as 1 count
fceac72 [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
0153e7e [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI Fixed comments given by @zsxwing
4c5931d [Dibyendu Bhattacharya] [SPARK-8080][STREAMING] Receiver.store with Iterator does not give correct count at Spark UI
01e6dc8 [U-PEROOT\UBHATD1] A
2015-06-18 20:04:29 -07:00
Lars Francke bd9bbd6119 [SPARK-8462] [DOCS] Documentation fixes for Spark SQL
This fixes various minor documentation issues on the Spark SQL page

Author: Lars Francke <lars.francke@gmail.com>

Closes #6890 from lfrancke/SPARK-8462 and squashes the following commits:

dd7e302 [Lars Francke] Merge branch 'master' into SPARK-8462
34eff2c [Lars Francke] Minor documentation fixes

(cherry picked from commit 4ce3bab89f)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
2015-06-18 19:40:55 -07:00
Josh Rosen 152f4465d3 [SPARK-8446] [SQL] Add helper functions for testing SparkPlan physical operators
This patch introduces `SparkPlanTest`, a base class for unit tests of SparkPlan physical operators.  This is analogous to Spark SQL's existing `QueryTest`, which does something similar for end-to-end tests with actual queries.

These helper methods provide nicer error output when tests fail and help developers to avoid writing lots of boilerplate in order to execute manually constructed physical plans.

Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <rosenville@gmail.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #6885 from JoshRosen/spark-plan-test and squashes the following commits:

f8ce275 [Josh Rosen] Fix some IntelliJ inspections and delete some dead code
84214be [Josh Rosen] Add an extra column which isn't part of the sort
ae1896b [Josh Rosen] Provide implicits automatically
a80f9b0 [Josh Rosen] Merge pull request #4 from marmbrus/pr/6885
d9ab1e4 [Michael Armbrust] Add simple resolver
c60a44d [Josh Rosen] Manually bind references
996332a [Josh Rosen] Add types so that tests compile
a46144a [Josh Rosen] WIP

(cherry picked from commit 207a98ca59)
Signed-off-by: Michael Armbrust <michael@databricks.com>
2015-06-18 16:45:27 -07:00
zsxwing 9f293a9eb6 [SPARK-8376] [DOCS] Add common lang3 to the Spark Flume Sink doc
Commons Lang 3 has been added as one of the dependencies of Spark Flume Sink since #5703. This PR updates the doc for it.

Author: zsxwing <zsxwing@gmail.com>

Closes #6829 from zsxwing/flume-sink-dep and squashes the following commits:

f8617f0 [zsxwing] Add common lang3 to the Spark Flume Sink doc

(cherry picked from commit 24e53793b4)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2015-06-18 16:00:42 -07:00
Josh Rosen c1da5cf029 [SPARK-8353] [DOCS] Show anchor links when hovering over documentation headers
This patch uses [AnchorJS](https://bryanbraun.github.io/anchorjs/) to show deep anchor links when hovering over headers in the Spark documentation. For example:

![image](https://cloud.githubusercontent.com/assets/50748/8240800/1502f85c-15ba-11e5-819a-97b231370a39.png)

This makes it easier for users to link to specific sections of the documentation.

I also removed some dead Javascript which isn't used in our current docs (it was introduced for the old AMPCamp training, but isn't used anymore).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6808 from JoshRosen/SPARK-8353 and squashes the following commits:

e59d8a7 [Josh Rosen] Suppress underline on hover
f518b6a [Josh Rosen] Turn on for all headers, since we use H1s in a bunch of places
a9fec01 [Josh Rosen] Add anchor links when hovering over headers; remove some dead JS code

(cherry picked from commit 44c931f006)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
2015-06-18 15:10:33 -07:00
Davies Liu ca23c3b014 [SPARK-8202] [PYSPARK] fix infinite loop during external sort in PySpark
The batch size during external sort grows up to a maximum of 10000, then shrinks down to zero, causing an infinite loop.
Assuming that items usually have similar sizes, we don't need to adjust the batch size after the first spill.

cc JoshRosen rxin angelini

Author: Davies Liu <davies@databricks.com>

Closes #6714 from davies/batch_size and squashes the following commits:

b170dfb [Davies Liu] update test
b9be832 [Davies Liu] Merge branch 'batch_size' of github.com:davies/spark into batch_size
6ade745 [Davies Liu] update test
5c21777 [Davies Liu] Update shuffle.py
e746aec [Davies Liu] fix batch size during sort
2015-06-18 13:49:32 -07:00
Burak Yavuz 9dabc12936 [SPARK-8095] Resolve dependencies of --packages in local ivy cache
Dependencies of artifacts in the local ivy cache were not being resolved properly. The dependencies were not being picked up. Now they should be.

cc andrewor14

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #6788 from brkyvz/local-ivy-fix and squashes the following commits:

2875bf4 [Burak Yavuz] fix temp dir bug
48cc648 [Burak Yavuz] improve deletion
a69e3e6 [Burak Yavuz] delete cache before test as well
0037197 [Burak Yavuz] fix merge conflicts
f60772c [Burak Yavuz] use different folder for m2 cache during testing
b6ef038 [Burak Yavuz] [SPARK-8095] Resolve dependencies of Spark Packages in local ivy cache

Conflicts:
	core/src/test/scala/org/apache/spark/deploy/SparkSubmitUtilsSuite.scala
2015-06-17 22:36:28 -07:00
xutingjun 67ad12d793 [SPARK-8392] RDDOperationGraph: getting cached nodes is slow
```scala
def getAllNodes: Seq[RDDOperationNode] =
{ _childNodes ++ _childClusters.flatMap(_.childNodes) }
```

When `_childClusters` contains a large number of nodes, the process hangs. I think we can improve the efficiency here.

Author: xutingjun <xutingjun@huawei.com>

Closes #6839 from XuTingjun/DAGImprove and squashes the following commits:

53b03ea [xutingjun] change code to more concise and easier to read
f98728b [xutingjun] fix words: node -> nodes
f87c663 [xutingjun] put the filter inside
81f9fd2 [xutingjun] put the filter inside

(cherry picked from commit e2cdb0568b)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-17 22:31:39 -07:00
Yin Huai 73cf5def06 [SPARK-8306] [SQL] AddJar command needs to set the new class loader to the HiveConf inside executionHive.state.
https://issues.apache.org/jira/browse/SPARK-8306

I will try to add a test later.

marmbrus aarondav

Author: Yin Huai <yhuai@databricks.com>

Closes #6758 from yhuai/SPARK-8306 and squashes the following commits:

1292346 [Yin Huai] [SPARK-8306] AddJar command needs to set the new class loader to the HiveConf inside executionHive.state.

(cherry picked from commit 302556ff99)
Signed-off-by: Michael Armbrust <michael@databricks.com>

Conflicts:
	sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala
2015-06-17 15:14:42 -07:00
zsxwing 5aedfa2ceb [SPARK-8404] [STREAMING] [TESTS] Use thread-safe collections to make the tests more reliable
KafkaStreamSuite, DirectKafkaStreamSuite, JavaKafkaStreamSuite and JavaDirectKafkaStreamSuite use non-thread-safe collections to collect data in one thread and check it in another thread. This may cause the tests to fail.

This PR changes them to thread-safe collections.

Note: I cannot reproduce the test failures in my environment. But at least, this PR should make the tests more reliable.

Author: zsxwing <zsxwing@gmail.com>

Closes #6852 from zsxwing/fix-KafkaStreamSuite and squashes the following commits:

d464211 [zsxwing] Use thread-safe collections to make the tests more reliable

(cherry picked from commit a06d9c8e76)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2015-06-17 15:00:17 -07:00
zsxwing 5e7973df0e [SPARK-8373] [PYSPARK] Add emptyRDD to pyspark and fix the issue when calling sum on an empty RDD
This PR fixes the sum issue and also adds `emptyRDD` so that it's easy to create a test case.
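
Why summing an empty collection can fail is easy to reproduce without Spark. The sketch below uses `functools.reduce` over a plain list as an assumed stand-in for combining per-partition sums; it is not the PySpark code, and both function names are hypothetical.

```python
import operator
from functools import reduce


def sum_with_reduce(partition_sums):
    # No initial value: there is nothing to return for empty input,
    # so reduce raises.
    return reduce(operator.add, partition_sums)


def sum_with_zero(partition_sums):
    # Supplying the identity element 0 makes the empty case well defined.
    return reduce(operator.add, partition_sums, 0)


try:
    sum_with_reduce([])
    empty_reduce_failed = False
except TypeError:
    empty_reduce_failed = True

empty_total = sum_with_zero([])
```

An `emptyRDD` helper makes this edge case straightforward to cover in a test, which is the motivation given above.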

Author: zsxwing <zsxwing@gmail.com>

Closes #6826 from zsxwing/python-emptyRDD and squashes the following commits:

b36993f [zsxwing] Update the return type to JavaRDD[T]
71df047 [zsxwing] Add emptyRDD to pyspark and fix the issue when calling sum on an empty RDD

(cherry picked from commit 0fc4b96f3e)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-17 13:59:47 -07:00
Carson Wang f0513733d4 [SPARK-8372] History server shows incorrect information for application not started
The history server may show an incorrect App ID for an incomplete application like <App ID>.inprogress. This app info will never disappear even after the app is completed.
![incorrectappinfo](https://cloud.githubusercontent.com/assets/9278199/8156147/2a10fdbe-137d-11e5-9620-c5b61d93e3c1.png)

The cause of the issue is that the log path name is used as the app id when the app id cannot be obtained during replay.

Author: Carson Wang <carson.wang@intel.com>

Closes #6827 from carsonwang/SPARK-8372 and squashes the following commits:

cdbb089 [Carson Wang] Fix code style
3e46b35 [Carson Wang] Update code style
90f5dde [Carson Wang] Add a unit test
d8c9cd0 [Carson Wang] Replaying events only return information when app is started

(cherry picked from commit 2837e06709)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-17 13:42:46 -07:00
Mingfei d75c53d88d [SPARK-8161] Set externalBlockStoreInitialized to be true, after ExternalBlockStore is initialized
externalBlockStoreInitialized is never set to true, which means the blocks stored in ExternalBlockStore cannot be removed.

Author: Mingfei <mingfei.shi@intel.com>

Closes #6702 from shimingfei/SetTrue and squashes the following commits:

add61d8 [Mingfei] Set externalBlockStoreInitialized to be true, after ExternalBlockStore is initialized

(cherry picked from commit 7ad8c5d869)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-17 13:40:16 -07:00
Punya Biswal a7f6979d0f [SPARK-7515] [DOC] Update documentation for PySpark on YARN with cluster mode
Now PySpark on YARN with cluster mode is supported so let's update doc.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #6040 from sarutak/update-doc-for-pyspark-on-yarn and squashes the following commits:

ad9f88c [Kousuke Saruta] Brushed up sentences
469fd2e [Kousuke Saruta] Merge branch 'master' of https://github.com/apache/spark into update-doc-for-pyspark-on-yarn
fcfdb92 [Kousuke Saruta] Updated doc for PySpark on YARN with cluster mode

Author: Punya Biswal <pbiswal@palantir.com>
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #6842 from punya/feature/SPARK-7515 and squashes the following commits:

0b83648 [Punya Biswal] Merge remote-tracking branch 'origin/branch-1.4' into feature/SPARK-7515
de025cd [Kousuke Saruta] [SPARK-7515] [DOC] Update documentation for PySpark on YARN with cluster mode
2015-06-17 13:37:20 -07:00
Sean Owen 320c4420b9 [SPARK-8395] [DOCS] start-slave.sh docs incorrect
start-slave.sh no longer takes a worker # param in 1.4+

Author: Sean Owen <sowen@cloudera.com>

Closes #6855 from srowen/SPARK-8395 and squashes the following commits:

300278e [Sean Owen] start-slave.sh no longer takes a worker # param in 1.4+

(cherry picked from commit f005be0273)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-06-17 13:31:17 -07:00
Vyacheslav Baranov a5f602efcf [SPARK-8309] [CORE] Support for more than 12M items in OpenHashMap
The problem occurs because the position mask `0xEFFFFFF` is incorrect. Its 25th bit is zero, so when capacity grows beyond 2^24, `OpenHashMap` calculates an incorrect index of the value in the `_values` array.

I've also added a size check in `rehash()`, so that it fails instead of reporting invalid item indices.
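
The mask problem can be checked with a few lines of bit arithmetic. In this sketch `0xEFFFFFF` is the value quoted above; `CONTIGUOUS_MASK` is an assumption chosen only for comparison and is not claimed to be the value used in the fix.

```python
BROKEN_MASK = 0xEFFFFFF            # mask quoted in the description
CONTIGUOUS_MASK = (1 << 28) - 1    # assumed comparison mask, 0xFFFFFFF


def bit_is_set(mask, index):
    """Return True if the zero-based bit `index` is set in `mask`."""
    return (mask >> index) & 1 == 1


# The 25th bit is bit index 24 when counting from zero.
broken_has_bit_24 = bit_is_set(BROKEN_MASK, 24)

# A position just past 2^24 loses its high bit under the broken mask
# and therefore aliases a much smaller index.
position = (1 << 24) + 5
index_with_broken_mask = position & BROKEN_MASK
index_with_contiguous_mask = position & CONTIGUOUS_MASK
```

With the broken mask the position `2^24 + 5` collapses to index 5, while a contiguous mask leaves it unchanged.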

Author: Vyacheslav Baranov <slavik.baranov@gmail.com>

Closes #6763 from SlavikBaranov/SPARK-8309 and squashes the following commits:

8557445 [Vyacheslav Baranov] Resolved review comments
4d5b954 [Vyacheslav Baranov] Resolved review comments
eaf1e68 [Vyacheslav Baranov] Fixed failing test
f9284fd [Vyacheslav Baranov] Resolved review comments
3920656 [Vyacheslav Baranov] SPARK-8309: Support for more than 12M items in OpenHashMap

(cherry picked from commit c13da20a55)
Signed-off-by: Sean Owen <sowen@cloudera.com>
2015-06-17 09:42:41 +01:00