Commit graph

4467 commits

Author SHA1 Message Date
Andrew Or 31e0ae9e1d [MINOR] [UI] Improve confusing message on log page
It's good practice to check if the input path is in the directory
we expect to avoid potentially confusing error messages.
2015-06-03 14:48:15 -07:00
Shivaram Venkataraman cbfb682ab9 [SPARK-8028] [SPARKR] Use addJar instead of setJars in SparkR
This prevents the spark.jars from being cleared while using `--packages` or `--jars`

cc pwendell davies brkyvz

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #6568 from shivaram/SPARK-8028 and squashes the following commits:

3a9cf1f [Shivaram Venkataraman] Use addJar instead of setJars in SparkR This prevents the spark.jars from being cleared

(cherry picked from commit 6b44278ef7)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
2015-06-01 21:01:26 -07:00
Andrew Or f5a9833f3f [MINOR] [UI] Improve error message on log page
Currently if a bad log type is specified, then we get blank.
We should provide a more informative error message.
2015-06-01 20:11:38 -07:00
Josh Rosen df0bf71ee0 [HOTFIX] Remove trailing whitespace to fix Scalastyle checks
866652c903 enabled this check.
2015-05-31 16:34:20 -07:00
Sun Rui f1d4e7e311 [SPARK-7227] [SPARKR] Support fillna / dropna in R DataFrame.
Author: Sun Rui <rui.sun@intel.com>

Closes #6183 from sun-rui/SPARK-7227 and squashes the following commits:

dd6f5b3 [Sun Rui] Rename readEnv() back to readMap(). Add alias na.omit() for dropna().
41cf725 [Sun Rui] [SPARK-7227][SPARKR] Support fillna / dropna in R DataFrame.

(cherry picked from commit 46576ab303)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
2015-05-31 15:02:16 -07:00
Reynold Xin 01f38f75d9 [SPARK-7979] Enforce structural type checker.
Author: Reynold Xin <rxin@databricks.com>

Closes #6536 from rxin/structural-type-checker and squashes the following commits:

f833151 [Reynold Xin] Fixed compilation.
633f9a1 [Reynold Xin] Fixed typo.
d1fa804 [Reynold Xin] [SPARK-7979] Enforce structural type checker.

(cherry picked from commit 4b5f12bac9)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-31 01:40:57 -07:00
Reynold Xin a7c217166b [SPARK-3850] Trim trailing spaces for core.
Author: Reynold Xin <rxin@databricks.com>

Closes #6533 from rxin/whitespace-2 and squashes the following commits:

038314c [Reynold Xin] [SPARK-3850] Trim trailing spaces for core.

(cherry picked from commit 74fdc97c72)
Signed-off-by: Reynold Xin <rxin@databricks.com>

Conflicts:
	core/src/main/scala/org/apache/spark/storage/TachyonBlockManager.scala
	core/src/test/scala/org/apache/spark/serializer/KryoSerializerSuite.scala
2015-05-31 00:17:47 -07:00
Reynold Xin adfc9d1fa0 [SPARK-7976] Add style checker to disallow overriding finalize.
Author: Reynold Xin <rxin@databricks.com>

Closes #6528 from rxin/style-finalizer and squashes the following commits:

a2211ca [Reynold Xin] [SPARK-7976] Enable NoFinalizeChecker.

(cherry picked from commit 084fef76e9)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-30 23:36:37 -07:00
Timothy Chen 8938a74893 [SPARK-7962] [MESOS] Fix master url parsing in rest submission client.
Only parse standalone master url when master url starts with spark://
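
The rule above can be sketched as follows. This is a minimal illustration rather than the client's actual code: the function name `candidate_masters` is hypothetical, and the comma-separated multi-master form of a `spark://` URL is an assumption.

```python
STANDALONE_PREFIX = "spark://"


def candidate_masters(master_url):
    """Split a standalone URL into one URL per master; pass anything else through.

    Hypothetical helper: only URLs starting with spark:// are parsed.
    """
    if master_url.startswith(STANDALONE_PREFIX):
        # Assumed form: spark://host1:port1,host2:port2
        hosts = master_url[len(STANDALONE_PREFIX):].split(",")
        return [STANDALONE_PREFIX + host for host in hosts]
    # Non-standalone URLs (for example mesos://...) are left untouched.
    return [master_url]


standalone = candidate_masters("spark://a:7077,b:7077")
mesos = candidate_masters("mesos://host:5050")
```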

Author: Timothy Chen <tnachen@gmail.com>

Closes #6517 from tnachen/fix_mesos_client and squashes the following commits:

61a1198 [Timothy Chen] Fix master url parsing in rest submission client.

(cherry picked from commit 78657d53d7)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-29 23:56:27 -07:00
Burak Yavuz 1513cffa35 [SPARK-7957] Preserve partitioning when using randomSplit
cc JoshRosen
Thanks for noticing this!

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #6509 from brkyvz/sample-perf-reg and squashes the following commits:

497465d [Burak Yavuz] addressed code review
293f95f [Burak Yavuz] [SPARK-7957] Preserve partitioning when using randomSplit

(cherry picked from commit 7ed06c3992)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-29 22:19:23 -07:00
Reynold Xin f40605f064 [SPARK-7940] Enforce whitespace checking for DO, TRY, CATCH, FINALLY, MATCH, LARROW, RARROW in style checker.
…

Author: Reynold Xin <rxin@databricks.com>

Closes #6491 from rxin/more-whitespace and squashes the following commits:

f6e63dc [Reynold Xin] [SPARK-7940] Enforce whitespace checking for DO, TRY, CATCH, FINALLY, MATCH, LARROW, RARROW in style checker.

(cherry picked from commit 94f62a4979)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-29 13:39:02 -07:00
Reynold Xin 23bd05fff7 HOTFIX: Scala style checker failure due to a missing space in TachyonBlockManager.scala. 2015-05-29 09:37:46 -07:00
Tim Ellison 459c3d22e0 [SPARK-7756] [CORE] Use testing cipher suites common to Oracle and IBM security providers
Add alias names for supported cipher suites to the sample SSL configuration.

The IBM JSSE provider reports its cipher suite with an SSL_ prefix, but accepts TLS_ prefixed suite names as an alias.  However, Jetty filters the requested ciphers based on the provider's reported supported suites, so the TLS_ versions are never passed through to JSSE causing an SSL handshake failure.
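
A rough model of the filtering described above, assuming the filter is a plain membership test against the names the provider reports. The function name and the specific suite names are illustrative, not taken from the patch.

```python
def filter_ciphers(requested, provider_reported):
    """Keep only the requested suites that the provider reports as supported."""
    reported = set(provider_reported)
    return [cipher for cipher in requested if cipher in reported]


# Illustrative names: the provider reports the SSL_ form only.
ibm_reported = ["SSL_RSA_WITH_AES_128_CBC_SHA"]

tls_only = ["TLS_RSA_WITH_AES_128_CBC_SHA"]
with_alias = tls_only + ["SSL_RSA_WITH_AES_128_CBC_SHA"]

dropped = filter_ciphers(tls_only, ibm_reported)    # nothing survives the filter
kept = filter_ciphers(with_alias, ibm_reported)     # the SSL_ alias survives
```

With only TLS_ names configured, the filtered list is empty, which is consistent with the handshake failure described; listing the alias as well leaves one usable suite.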

Author: Tim Ellison <t.p.ellison@gmail.com>

Closes #6282 from tellison/SSLFailure and squashes the following commits:

8de8a3e [Tim Ellison] Update SecurityManagerSuite with new expected suite names
96158b2 [Tim Ellison] Update the sample configs to use ciphers that are common to both the Oracle and IBM security providers.
705421b [Tim Ellison] Merge branch 'master' of github.com:tellison/spark into SSLFailure
68b9425 [Tim Ellison] Merge branch 'master' of https://github.com/apache/spark into SSLFailure
b0c35f6 [Tim Ellison] [CORE] Add aliases used for cipher suites in IBM provider

(cherry picked from commit bf46580708)
Signed-off-by: Sean Owen <sowen@cloudera.com>
2015-05-29 05:15:00 -04:00
Tathagata Das f7cb272b7c [SPARK-7930] [CORE] [STREAMING] Fixed shutdown hook priorities
Shutdown hook for temp directories had priority 100 while SparkContext was 50. So the local root directory was deleted before SparkContext was shutdown. This leads to scary errors on running jobs, at the time of shutdown. This is especially a problem when running streaming examples, where Ctrl-C is the only way to shutdown.

The fix in this PR is to make the temp directory shutdown priority lower than SparkContext, so that the temp dirs are the last thing to get deleted, after the SparkContext has been shut down. Also, the DiskBlockManager shutdown priority is changed from the default 100 to temp_dir_prio + 1, so that it gets invoked just before all temp dirs are cleared.
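
The ordering can be sketched as below, assuming hooks with a higher priority run first (which is what the description implies). The hook names and the value chosen for `TEMP_DIR_PRIO` are illustrative assumptions.

```python
def shutdown_order(priorities):
    """Return hook names in execution order: highest priority first."""
    return [name for name, _ in sorted(priorities.items(), key=lambda kv: -kv[1])]


# Before: temp dirs (100) are deleted before SparkContext (50) shuts down.
before = shutdown_order({"temp_dirs": 100, "spark_context": 50})

# After: temp dirs get a priority below SparkContext, and the
# DiskBlockManager runs just before them. 25 is an illustrative value.
TEMP_DIR_PRIO = 25
after = shutdown_order({
    "spark_context": 50,
    "disk_block_manager": TEMP_DIR_PRIO + 1,
    "temp_dirs": TEMP_DIR_PRIO,
})
```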

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #6482 from tdas/SPARK-7930 and squashes the following commits:

d7cbeb5 [Tathagata Das] Removed unnecessary line
1514d0b [Tathagata Das] Fixed shutdown hook priorities

(cherry picked from commit cd3d9a5c0c)
Signed-off-by: Patrick Wendell <patrick@databricks.com>
2015-05-28 22:28:31 -07:00
Kay Ousterhout aee046dfa1 [SPARK-7932] Fix misleading scheduler delay visualization
The existing code rounds down to the nearest percent when computing the proportion
of a task's time that was spent on each phase of execution, and then computes
the scheduler delay proportion as 100 - sum(all other proportions).  As a result,
a few extra percent can end up in the scheduler delay. This commit eliminates
the rounding so that the time visualizations correspond properly to the real times.
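
A small worked example of the effect, using made-up phase durations. The function names are hypothetical and the numbers are chosen only to make the arithmetic easy to follow.

```python
import math


def delay_floored(phases_ms, total_ms):
    """Scheduler delay percent when each phase is rounded down first."""
    floored = [math.floor(100 * t / total_ms) for t in phases_ms.values()]
    return 100 - sum(floored)


def delay_exact(phases_ms, total_ms):
    """Scheduler delay percent without rounding the other phases."""
    return 100 - sum(100 * t / total_ms for t in phases_ms.values())


# Made-up task: 200 ms total, of which 198 ms is accounted for by phases,
# so the real scheduler delay is 2 ms, i.e. 1%.
phases = {"deserialize": 3, "compute": 189, "serialize": 3, "fetch": 3}
total = 200

floored_pct = delay_floored(phases, total)  # 100 - (1 + 94 + 1 + 1) = 3
exact_pct = delay_exact(phases, total)      # 100 - 99.0 = 1.0
```

Each floor discards up to one percent, and all of the discarded fractions end up attributed to the scheduler delay.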

sarutak If you could take a look at this, that would be great! Not sure if there's a good
reason to round here that I missed.

cc shivaram

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #6484 from kayousterhout/SPARK-7932 and squashes the following commits:

1723cc4 [Kay Ousterhout] [SPARK-7932] Fix misleading scheduler delay visualization

(cherry picked from commit 04ddcd4db7)
Signed-off-by: Kay Ousterhout <kayousterhout@gmail.com>
2015-05-28 22:09:59 -07:00
Reynold Xin e3dd2802f6 [SPARK-7927] whitespace fixes for core.
So we can enable a whitespace enforcement rule in the style checker to save code review time.

Author: Reynold Xin <rxin@databricks.com>

Closes #6473 from rxin/whitespace-core and squashes the following commits:

058195d [Reynold Xin] Fixed tests.
fce11e9 [Reynold Xin] [SPARK-7927] whitespace fixes for core.

(cherry picked from commit 7f7505d8db)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-28 20:16:35 -07:00
Reynold Xin 9c2c6b4a67 Remove SizeEstimator from o.a.spark package.
See comments on https://github.com/apache/spark/pull/3913

Author: Reynold Xin <rxin@databricks.com>

Closes #6471 from rxin/sizeestimator and squashes the following commits:

c057095 [Reynold Xin] Fixed import.
2da478b [Reynold Xin] Remove SizeEstimator from o.a.spark package.

(cherry picked from commit 0077af22ca)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-28 16:57:06 -07:00
zuxqoj bd568df224 [SPARK-7782] fixed sort arrow issue
Current behaviour::
In spark UI
![screen shot 2015-05-27 at 3 27 51 pm](https://cloud.githubusercontent.com/assets/3919211/7837541/47d330ba-04a5-11e5-89d1-e5b11da1a513.png)

In YARN
![screen shot 2015-05-27 at 3](https://cloud.githubusercontent.com/assets/3919211/7837594/aebd1d36-04a5-11e5-8216-86e03c07d2bd.png)

In jira
![screen shot 2015-05-27 at 3_2](https://cloud.githubusercontent.com/assets/3919211/7837616/d3fedce2-04a5-11e5-9e68-960ed54e5d83.png)

Author: zuxqoj <sbshekhar@gmail.com>

Closes #6437 from zuxqoj/SPARK-7782_PR and squashes the following commits:

cd068b9 [zuxqoj] [SPARK-7782] fixed sort arrow issue

(cherry picked from commit e838a25bdb)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-27 23:13:19 -07:00
Sandy Ryza d83c2ee848 [SPARK-7896] Allow ChainedBuffer to store more than 2 GB
Author: Sandy Ryza <sandy@cloudera.com>

Closes #6440 from sryza/sandy-spark-7896 and squashes the following commits:

49d8a0d [Sandy Ryza] Fix bug introduced when reading over record boundaries
6006856 [Sandy Ryza] Fix overflow issues
006b4b2 [Sandy Ryza] Fix scalastyle by removing non ascii characters
8b000ca [Sandy Ryza] Add ascii art to describe layout of data in metaBuffer
f2053c0 [Sandy Ryza] Fix negative overflow issue
0368c78 [Sandy Ryza] Initialize size as 0
a5a4820 [Sandy Ryza] Use explicit types for all numbers in ChainedBuffer
b7e0213 [Sandy Ryza] SPARK-7896. Allow ChainedBuffer to store more than 2 GB

(cherry picked from commit bd11b01eba)
Signed-off-by: Patrick Wendell <patrick@databricks.com>
2015-05-27 22:29:10 -07:00
Josh Rosen 9da4b6bcbb [SPARK-7873] Allow KryoSerializerInstance to create multiple streams at the same time
This is a somewhat obscure bug, but I think that it will seriously impact KryoSerializer users who use custom registrators which disable auto-reset. When auto-reset is disabled, this breaks things in some of our shuffle paths, which actually end up creating multiple OutputStreams from the same shared SerializerInstance (which is unsafe).

This was introduced by a patch (SPARK-3386) which enables serializer re-use in some of the shuffle paths, since constructing new serializer instances is actually pretty costly for KryoSerializer.  We had already fixed another corner-case (SPARK-7766) bug related to this, but missed this one.

I think that the root problem here is that KryoSerializerInstance can be used in a way which is unsafe even within a single thread, e.g. by creating multiple open OutputStreams from the same instance or by interleaving deserialize and deserializeStream calls. I considered a smaller patch which adds assertions to guard against this type of "misuse" but abandoned that approach after I realized how convoluted the Scaladoc became.

This patch fixes this bug by making it legal to create multiple streams from the same KryoSerializerInstance.  Internally, KryoSerializerInstance now implements a  `borrowKryo()` / `releaseKryo()` API that's backed by a "pool" of capacity 1. Each call to a KryoSerializerInstance method will borrow the Kryo, do its work, then release the serializer instance back to the pool. If the pool is empty and we need an instance, it will allocate a new Kryo on-demand. This makes it safe for multiple OutputStreams to be opened from the same serializer. If we try to release a Kryo back to the pool but the pool already contains a Kryo, then we'll just discard the new Kryo. I don't think there's a clear benefit to having a larger pool since our usages tend to fall into two cases, a) where we only create a single OutputStream and b) where we create a huge number of OutputStreams with the same lifecycle, then destroy the KryoSerializerInstance (this is what's happening in the bypassMergeSort code path that my regression test hits).
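
The borrow/release scheme can be sketched as a pool of capacity 1. This is a language-neutral illustration in Python, not the Scala implementation; `KryoLike` and `PooledInstance` are hypothetical stand-ins, and the pool here starts empty for simplicity.

```python
class KryoLike:
    """Hypothetical stand-in for a Kryo object."""


class PooledInstance:
    """Serializer instance backed by a pool that holds at most one KryoLike."""

    def __init__(self):
        self._cached = None  # the single pool slot

    def borrow(self):
        if self._cached is not None:
            kryo, self._cached = self._cached, None
            return kryo
        # Pool is empty: allocate a new object on demand.
        return KryoLike()

    def release(self, kryo):
        if self._cached is None:
            self._cached = kryo
        # Otherwise the slot is already taken and this object is discarded.


inst = PooledInstance()
a = inst.borrow()   # pool empty, so a new object is created
b = inst.borrow()   # still empty, so a second object is created
inst.release(a)     # a goes back into the single slot
inst.release(b)     # slot is full, so b is dropped
c = inst.borrow()   # the cached object comes back out
```

Two streams opened at the same time each get their own object, so neither can corrupt the other's state.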

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6415 from JoshRosen/SPARK-7873 and squashes the following commits:

00b402e [Josh Rosen] Initialize eagerly to fix a failing test
ba55d20 [Josh Rosen] Add explanatory comments
3f1da96 [Josh Rosen] Guard against duplicate close()
ab457ca [Josh Rosen] Sketch a loan/release based solution.
9816e8f [Josh Rosen] Add a failing test showing how deserialize() and deserializeStream() can interfere.
7350886 [Josh Rosen] Add failing regression test for SPARK-7873

(cherry picked from commit 852f4de2d3)
Signed-off-by: Patrick Wendell <patrick@databricks.com>
2015-05-27 20:20:01 -07:00
Kousuke Saruta 13044b0460 [SPARK-7864] [UI] Fix the logic grabbing the link from table in AllJobPage
This issue is related to #6419 .
AllJobPage doesn't currently have a "kill link", but I think we should fix the issue mentioned in #6419 anyway, to avoid accidents in the future.

So it's a minor issue for now, and I haven't filed it in JIRA.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #6432 from sarutak/remove-ambiguity-of-link and squashes the following commits:

cd1a503 [Kousuke Saruta] Fixed ambiguity link issue in AllJobPage

(cherry picked from commit 0db76c90ad)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-27 11:42:10 -07:00
scwf 90525c9ba1 [CORE] [TEST] HistoryServerSuite failed due to timezone issue
follow up for #6377
Change time to the equivalent in GMT
/cc squito

Author: scwf <wangfei1@huawei.com>

Closes #6425 from scwf/fix-HistoryServerSuite and squashes the following commits:

4d37935 [scwf] fix HistoryServerSuite

(cherry picked from commit 4615081d7a)
Signed-off-by: Imran Rashid <irashid@cloudera.com>
2015-05-27 09:12:31 -05:00
Andrew Or f9dfa4d0f0 [SPARK-7864] [UI] Do not kill innocent stages from visualization
**Reproduction.** Run a long-running job, go to the job page, expand the DAG visualization, and click into a stage. Your stage is now killed. Why? This is because the visualization code just reaches into the stage table and grabs the first link it finds. In our case, this first link happens to be the kill link instead of the one to the stage page.

**Fix.** Use proper CSS selectors to avoid ambiguity.

This is an alternative to #6407. Thanks carsonwang for catching this.

Author: Andrew Or <andrew@databricks.com>

Closes #6419 from andrewor14/fix-ui-viz-kill and squashes the following commits:

25203bd [Andrew Or] Do not kill innocent stages

(cherry picked from commit 8f20824268)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-26 16:31:44 -07:00
scwf 79bb7dceca [CORE] [TEST] Fix SimpleDateParamTest
```
sbt.ForkMain$ForkError: 1424424077190 was not equal to 1424474477190
	at org.scalatest.MatchersHelper$.newTestFailedException(MatchersHelper.scala:160)
	at org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6231)
	at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6265)
	at org.apache.spark.status.api.v1.SimpleDateParamTest$$anonfun$1.apply$mcV$sp(SimpleDateParamTest.scala:25)
	at org.apache.spark.status.api.v1.SimpleDateParamTest$$anonfun$1.apply(SimpleDateParamTest.scala:23)
	at org.apache.spark.status.api.v1.SimpleDateParamTest$$anonfun$1.apply(SimpleDateParamTest.scala:23)
	at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
	at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
	at org.scalatest.Transformer.apply(Transformer.scala:22)
	at org.scalatest.Transformer.apply(Transformer.scala:20)
	at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
	at org.scalatest.Suite$class.withFixture(Suite.scala:
```

Set timezone to fix SimpleDateParamTest
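
As a quick check on the failure above, the two timestamps differ by a whole number of hours, which is what a timezone mismatch would produce. The second half is a sketch, with hypothetical names, of how the same wall-clock time maps to different epoch values under different UTC offsets.

```python
from datetime import datetime, timedelta, timezone

# The two values from the assertion failure, in milliseconds.
left_ms = 1424424077190
right_ms = 1424474477190
diff_hours = (right_ms - left_ms) / (60 * 60 * 1000)  # 14.0


def epoch_ms(naive, utc_offset_hours):
    """Interpret a naive wall-clock time at a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return int(naive.replace(tzinfo=tz).timestamp() * 1000)


# Illustrative wall-clock time; only the offset changes between the calls.
wall_clock = datetime(2015, 2, 20, 12, 0, 0)
shift_ms = epoch_ms(wall_clock, 0) - epoch_ms(wall_clock, 5)
```

Pinning the timezone in the test removes the dependence on the machine's local setting.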

Author: scwf <wangfei1@huawei.com>
Author: Fei Wang <wangfei1@huawei.com>

Closes #6377 from scwf/fix-SimpleDateParamTest and squashes the following commits:

b8df1e5 [Fei Wang] Update SimpleDateParamSuite.scala
8bb74f0 [scwf] fix SimpleDateParamSuite

(cherry picked from commit bf49c22130)
Signed-off-by: Imran Rashid <irashid@cloudera.com>
2015-05-26 08:43:36 -05:00
Patrick Wendell 641edc99fc [SPARK-7287] [HOTFIX] Disable o.a.s.deploy.SparkSubmitSuite --packages 2015-05-23 19:44:23 -07:00
Burak Yavuz 17a51c8879 [SPARK-7224] [SPARK-7306] mock repository generator for --packages tests without nio.Path
The previous PR for SPARK-7224 (#5790) broke JDK 6, because it used java.nio.Path, which is available in JDK 7 but not in JDK 6. This PR uses Guava's `Files` to handle directory creation and similar tasks.

The description from the previous PR:
> This patch contains an `IvyTestUtils` file, which dynamically generates jars and pom files to test the `--packages` feature without having to rely on the internet, and Maven Central.

cc pwendell

I also ran the flaky test about 20 times locally and it didn't fail a single time, but I think it may fail about once every 100 builds. I still haven't figured out the cause, but the test before it, `--jars`, was also failing after we turned off the `--packages` test in `SparkSubmitSuite`. It may be related to the launch of SparkSubmit.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #5892 from brkyvz/maven-utils and squashes the following commits:

e9b1903 [Burak Yavuz] fix merge conflict
68214e0 [Burak Yavuz] remove ignore for test(neglect spark dependencies)
e632381 [Burak Yavuz] fix ignore
9ef1408 [Burak Yavuz] re-enable --packages test
22eea62 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into maven-utils
05cd0de [Burak Yavuz] added mock repository generator

(cherry picked from commit 8014e1f6bb)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
2015-05-22 17:48:19 -07:00
Andrew Or 0be6e3b3e6 [SPARK-7771] [SPARK-7779] Dynamic allocation: lower default timeouts further
The default add time of 5s is still too slow for small jobs. Also, the current default remove time of 10 minutes seems rather high. This patch lowers both and rephrases a few log messages.

Author: Andrew Or <andrew@databricks.com>

Closes #6301 from andrewor14/da-minor and squashes the following commits:

6d614a6 [Andrew Or] Lower log level
2811492 [Andrew Or] Log information when requests are canceled
5fcd3eb [Andrew Or] Fix tests
3320710 [Andrew Or] Lower timeouts + rephrase a few log messages

(cherry picked from commit 3d8760d76e)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-22 17:38:09 -07:00
Imran Rashid afde4019b8 [SPARK-7760] add /json back into master & worker pages; add test
Author: Imran Rashid <irashid@cloudera.com>

Closes #6284 from squito/SPARK-7760 and squashes the following commits:

5e02d8a [Imran Rashid] style; increase timeout
9987399 [Imran Rashid] comment
8c7ed63 [Imran Rashid] add /json back into master & worker pages; add test

(cherry picked from commit 821254fb94)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
2015-05-22 16:05:23 -07:00
WangTaoTheTonic 40989cea0d [SPARK-7758] [SQL] Override more configs to avoid failure when connect to a postgre sql
https://issues.apache.org/jira/browse/SPARK-7758

When initializing `executionHive`, we only mask
`javax.jdo.option.ConnectionURL` to override the metastore location.  However,
other properties that relate to the actual Hive metastore data source are not
masked.  For example, when using Spark SQL with a PostgreSQL-backed Hive
metastore, `executionHive` actually tries to use settings read from
`hive-site.xml`, which describe PostgreSQL, to connect to the temporary
Derby metastore, which causes an error.

To fix this, we need to mask all metastore data source properties.
Specifically, according to the code of [Hive `ObjectStore.getDataSourceProps()`
method] [1], all properties whose name mentions "jdo" or "datanucleus" must be
included.
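
A minimal sketch of the masking rule, under two assumptions: a property matches if its name contains either substring, and "masking" means blanking the inherited value. The function name, the sample property values, and the Derby URL are all illustrative.

```python
def mask_data_source_props(props, derby_url):
    """Blank every data source property, then point the URL at Derby."""
    masked = {}
    for key, value in props.items():
        if "jdo" in key or "datanucleus" in key:
            masked[key] = ""       # assumption: masking blanks the value
        else:
            masked[key] = value    # unrelated settings pass through
    masked["javax.jdo.option.ConnectionURL"] = derby_url
    return masked


# Illustrative settings, as they might be read from hive-site.xml.
hive_site = {
    "javax.jdo.option.ConnectionURL": "jdbc:postgresql://db/metastore",
    "javax.jdo.option.ConnectionDriverName": "org.postgresql.Driver",
    "datanucleus.autoCreateSchema": "false",
    "hive.metastore.warehouse.dir": "/user/hive/warehouse",
}
execution_conf = mask_data_source_props(
    hive_site, "jdbc:derby:;databaseName=/tmp/metastore;create=true"
)
```

Without the extra masking, the PostgreSQL driver name would survive and be used against the Derby URL.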

[1]: https://github.com/apache/hive/blob/release-0.13.1/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L288

Tested using PostgreSQL as the metastore; it worked fine.

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #6314 from WangTaoTheTonic/SPARK-7758 and squashes the following commits:

ca7ae7c [WangTaoTheTonic] add comments
86caf2c [WangTaoTheTonic] delete unused import
e4f0feb [WangTaoTheTonic] block more data source related property
92a81fa [WangTaoTheTonic] fix style check
e3e683d [WangTaoTheTonic] override more configs to avoid failuer connecting to postgre sql

(cherry picked from commit 31d5d463e7)
Signed-off-by: Michael Armbrust <michael@databricks.com>
2015-05-22 14:44:29 -07:00
Josh Rosen 2904d3f8bd [SPARK-7766] KryoSerializerInstance reuse is unsafe when auto-reset is disabled
SPARK-3386 / #5606 modified the shuffle write path to re-use serializer instances across multiple calls to DiskBlockObjectWriter. It turns out that this introduced a very rare bug when using `KryoSerializer`: if auto-reset is disabled and reference-tracking is enabled, then we'll end up re-using the same serializer instance to write multiple output streams without calling `reset()` between write calls, which can lead to cases where objects in one file may contain references to objects that are in previous files, causing errors during deserialization.

This patch fixes this bug by calling `reset()` at the start of `serialize()` and `serializeStream()`. I also added a regression test which demonstrates that this problem only occurs when auto-reset is disabled and reference-tracking is enabled.
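
The failure mode can be reproduced with a toy reference-tracking serializer. This is not Kryo: `RefTrackingSerializer` and its record format are hypothetical, and only show why a missing `reset()` lets one file refer to an object stored in another.

```python
class RefTrackingSerializer:
    """Toy serializer that writes a back-reference for objects it has seen."""

    def __init__(self):
        self._seen = {}  # id(obj) -> handle

    def reset(self):
        self._seen.clear()

    def write(self, stream, obj):
        key = id(obj)
        if key in self._seen:
            stream.append(("ref", self._seen[key]))
        else:
            handle = len(self._seen)
            self._seen[key] = handle
            stream.append(("obj", handle, obj))


def has_dangling_ref(stream):
    """True if the stream refers to a handle it never defines itself."""
    defined = {rec[1] for rec in stream if rec[0] == "obj"}
    return any(rec[0] == "ref" and rec[1] not in defined for rec in stream)


def write_two_files(reset_between):
    serializer = RefTrackingSerializer()
    shared = ["payload"]
    file1, file2 = [], []
    serializer.write(file1, shared)
    if reset_between:
        serializer.reset()
    serializer.write(file2, shared)
    return file2


broken = write_two_files(reset_between=False)  # file2 holds only a reference
fixed = write_two_files(reset_between=True)    # file2 holds the object itself
```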

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6293 from JoshRosen/kryo-instance-reuse-bug and squashes the following commits:

e19726d [Josh Rosen] Add fix for SPARK-7766.
71845e3 [Josh Rosen] Add failing regression test to trigger Kryo re-use bug

(cherry picked from commit eac00691da)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
2015-05-22 13:29:02 -07:00
Andrew Or ba04b52360 [SPARK-7718] [SQL] Speed up partitioning by avoiding closure cleaning
According to yhuai we spent 6-7 seconds cleaning closures in a partitioning job that takes 12 seconds. Since we provide these closures in Spark we know for sure they are serializable, so we can bypass the cleaning.

Author: Andrew Or <andrew@databricks.com>

Closes #6256 from andrewor14/sql-partition-speed-up and squashes the following commits:

a82b451 [Andrew Or] Fix style
10f7e3e [Andrew Or] Avoid getting call sites and cleaning closures
17e2943 [Andrew Or] Merge branch 'master' of github.com:apache/spark into sql-partition-speed-up
523f042 [Andrew Or] Skip unnecessary Utils.getCallSites too
f7fe143 [Andrew Or] Avoid unnecessary closure cleaning

(cherry picked from commit 5287eec5a6)
Signed-off-by: Yin Huai <yhuai@databricks.com>
2015-05-21 14:33:24 -07:00
Sean Owen 0df461e083 [SPARK-6416] [DOCS] RDD.fold() requires the operator to be commutative
Document current limitation of rdd.fold.
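
A plain-Python model of the limitation, assuming each partition is folded on its own and the partial results are then merged in whatever order they arrive. `distributed_fold` and `merge_order` are hypothetical names used only for this sketch.

```python
from functools import reduce


def distributed_fold(partitions, zero, op, merge_order):
    """Fold each partition, then merge the partial results in the given order."""
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, (partials[i] for i in merge_order), zero)


def concat(a, b):
    return a + b  # associative, but not commutative for strings


words = [["a", "b"], ["c", "d"]]
in_order = distributed_fold(words, "", concat, merge_order=[0, 1])
swapped = distributed_fold(words, "", concat, merge_order=[1, 0])

numbers = [[1, 2], [3, 4]]
sum_in_order = distributed_fold(numbers, 0, lambda a, b: a + b, [0, 1])
sum_swapped = distributed_fold(numbers, 0, lambda a, b: a + b, [1, 0])
```

String concatenation gives a different answer when the merge order changes, while integer addition does not, which is the difference from folding a non-distributed collection.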

This does not resolve SPARK-6416 but just documents the issue.
CC JoshRosen

Author: Sean Owen <sowen@cloudera.com>

Closes #6231 from srowen/SPARK-6416 and squashes the following commits:

9fef39f [Sean Owen] Add comment to other languages; reword to highlight the difference from non-distributed collections and to not suggest it is a bug that is to be fixed
da40d84 [Sean Owen] Document current limitation of rdd.fold.

(cherry picked from commit 6e53402696)
Signed-off-by: Sean Owen <sowen@cloudera.com>
2015-05-21 19:43:09 +01:00
Hari Shreedharan 0d061ff9e7 [SPARK-7750] [WEBUI] Rename endpoints from json to api to allow further extension to non-json outputs too.

Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #6273 from harishreedharan/json-to-api and squashes the following commits:

e14b73b [Hari Shreedharan] Rename `getJsonServlet` to `getServletHandler` i
42f8acb [Hari Shreedharan] Import order fixes.
2ef852f [Hari Shreedharan] [SPARK-7750][WebUI] Rename endpoints from `json` to `api` to allow further extension to non-json outputs too.

(cherry picked from commit a70bf06b79)
Signed-off-by: Imran Rashid <irashid@cloudera.com>
2015-05-20 21:14:13 -05:00
Josh Rosen e1f7de33bf [SPARK-7719] Re-add UnsafeShuffleWriterSuite test that was removed for Java 6 compat
This patch re-adds a test which was removed in 9ebb44f8ab due to a Java 6 compatibility issue.  We now use Guava's `Iterators.emptyIterator()` in place of `Collections.emptyIterator()`, which isn't present in all Java 6 versions.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6298 from JoshRosen/SPARK-7719-fix-java-6-test-code and squashes the following commits:

5c9bd85 [Josh Rosen] Re-add UnsafeShuffleWriterSuite.emptyIterator() test which was removed due to Java 6 issue

(cherry picked from commit 5196efff53)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
2015-05-20 17:53:11 -07:00
Tathagata Das a502e4b845 [SPARK-7767] [STREAMING] Added test for checkpoint serialization in StreamingContext.start()
Currently, the background checkpointing thread fails silently if the checkpoint is not serializable. It is hard to debug and therefore it's best to fail fast at `start()` when checkpointing is enabled and the checkpoint is not serializable.
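
The fail-fast idea can be sketched as follows, using Python's `pickle` as a stand-in for the serializer. The `Context` class and its fields are hypothetical and are not the StreamingContext API.

```python
import pickle
import threading


class Context:
    """Hypothetical context that validates its checkpoint state on start()."""

    def __init__(self, checkpoint_state, checkpointing_enabled=True):
        self.checkpoint_state = checkpoint_state
        self.checkpointing_enabled = checkpointing_enabled
        self.started = False

    def start(self):
        if self.checkpointing_enabled:
            try:
                # Try a serialization up front instead of letting a
                # background thread fail silently later.
                pickle.dumps(self.checkpoint_state)
            except Exception as exc:
                raise ValueError("checkpoint is not serializable") from exc
        self.started = True


good = Context({"batch": 1})
good.start()

bad = Context({"lock": threading.Lock()})  # locks cannot be pickled
try:
    bad.start()
    failed_fast = False
except ValueError:
    failed_fast = True
```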

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #6292 from tdas/SPARK-7767 and squashes the following commits:

51304e6 [Tathagata Das] Addressed comments.
c35237b [Tathagata Das] Added test for checkpoint serialization in StreamingContext.start()

(cherry picked from commit 3c434cbfd0)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2015-05-20 16:21:31 -07:00
Andrew Or 23356dd0d9 [SPARK-7237] [SPARK-7741] [CORE] [STREAMING] Clean more closures that need cleaning
SPARK-7741 is the equivalent of SPARK-7237 in streaming. This is an alternative to #6268.

Author: Andrew Or <andrew@databricks.com>

Closes #6269 from andrewor14/clean-moar and squashes the following commits:

c51c9ab [Andrew Or] Add periods (trivial)
6c686ac [Andrew Or] Merge branch 'master' of github.com:apache/spark into clean-moar
79a435b [Andrew Or] Fix tests
d18c9f9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into clean-moar
65ef07b [Andrew Or] Fix tests?
4b487a3 [Andrew Or] Add tests for closures passed to DStream operations
328139b [Andrew Or] Do not forget foreachRDD
5431f61 [Andrew Or] Clean streaming closures
72b7b73 [Andrew Or] Clean core closures

(cherry picked from commit 9b84443dd4)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2015-05-20 15:39:47 -07:00
Davies Liu 87fa8ccd2b [SPARK-7738] [SQL] [PySpark] add reader and writer API in Python
cc rxin, please take a quick look, I'm working on tests.

Author: Davies Liu <davies@databricks.com>

Closes #6238 from davies/readwrite and squashes the following commits:

c7200eb [Davies Liu] update tests
9cbf01b [Davies Liu] Merge branch 'master' of github.com:apache/spark into readwrite
f0c5a04 [Davies Liu] use sqlContext.read.load
5f68bc8 [Davies Liu] update tests
6437e9a [Davies Liu] Merge branch 'master' of github.com:apache/spark into readwrite
bcc6668 [Davies Liu] add reader amd writer API in Python

(cherry picked from commit 4de74d2602)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-19 14:23:35 -07:00
Patrick Wendell be1fc938f8 [HOTFIX]: Java 6 Build Breaks
These were blocking RC1 so I fixed them manually.
2015-05-19 06:01:39 +00:00
Daoyuan Wang 7fcbb2ccaf [SPARK-7150] SparkContext.range() and SQLContext.range()
This PR is based on #6081, thanks adrian-wang.

Closes #6081

Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #6230 from davies/range and squashes the following commits:

d3ce5fe [Davies Liu] add tests
789eda5 [Davies Liu] add range() in Python
4590208 [Davies Liu] Merge commit 'refs/pull/6081/head' of github.com:apache/spark into range
cbf5200 [Daoyuan Wang] let's add python support in a separate PR
f45e3b2 [Daoyuan Wang] remove redundant toLong
617da76 [Daoyuan Wang] fix safe marge for corner cases
867c417 [Daoyuan Wang] fix
13dbe84 [Daoyuan Wang] update
bd998ba [Daoyuan Wang] update comments
d3a0c1b [Daoyuan Wang] add range api()

(cherry picked from commit c2437de189)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-18 21:43:25 -07:00
Patrick Wendell 9d0b7fb714 Version updates for Spark 1.4.0 2015-05-18 21:38:37 -07:00
Davies Liu 60cb33d12f [SPARK-7624] Revert #4147
Author: Davies Liu <davies@databricks.com>

Closes #6172 from davies/revert_4147 and squashes the following commits:

3bfbbde [Davies Liu] Revert #4147

(cherry picked from commit 4fb52f9545)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
2015-05-18 16:56:01 -07:00
Andrew Or a475cbc978 [SPARK-7501] [STREAMING] DAG visualization: show DStream operations
This is similar to #5999, but for streaming. Roughly 200 lines are tests.

One thing to note here is that we already do some kind of scoping thing for call sites, so this patch adds the new RDD operation scoping logic in the same place. Also, this patch adds a `try finally` block to set the relevant variables in a safer way.

tdas zsxwing

------------------------
**Before**
<img src="https://cloud.githubusercontent.com/assets/2133137/7625996/d88211b8-f9b4-11e4-90b9-e11baa52d6d7.png" width="450px"/>

--------------------------
**After**
<img src="https://cloud.githubusercontent.com/assets/2133137/7625997/e0878f8c-f9b4-11e4-8df3-7dd611b13c87.png" width="650px"/>

Author: Andrew Or <andrew@databricks.com>

Closes #6034 from andrewor14/dag-viz-streaming and squashes the following commits:

932a64a [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
e685df9 [Andrew Or] Rename createRDDWith
84d0656 [Andrew Or] Review feedback
697c086 [Andrew Or] Fix tests
53b9936 [Andrew Or] Set scopes for foreachRDD properly
1881802 [Andrew Or] Refactor DStream scope names again
af4ba8d [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
fd07d22 [Andrew Or] Make MQTT lower case
f6de871 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
0ca1801 [Andrew Or] Remove a few unnecessary withScopes on aliases
fa4e5fb [Andrew Or] Pass in input stream name rather than defining it from within
1af0b0e [Andrew Or] Fix style
074c00b [Andrew Or] Review comments
d25a324 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
e4a93ac [Andrew Or] Fix tests?
25416dc [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
9113183 [Andrew Or] Add tests for DStream scopes
b3806ab [Andrew Or] Fix test
bb80bbb [Andrew Or] Fix MIMA?
5c30360 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
5703939 [Andrew Or] Rename operations that create InputDStreams
7c4513d [Andrew Or] Group RDDs by DStream operations and batches
bf0ab6e [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
05c2676 [Andrew Or] Wrap many more methods in withScope
c121047 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
65ef3e9 [Andrew Or] Fix NPE
a0d3263 [Andrew Or] Scope streaming operations instead of RDD operations

(cherry picked from commit b93c97d79b)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-18 14:33:45 -07:00
Davies Liu a8332098ce [SPARK-6216] [PYSPARK] check python version of worker with driver
This PR reverts #5404 and instead passes the driver's Python version into the JVM and checks it in the worker before deserializing the closure, so that it works with different major versions of Python.
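
A minimal sketch of such a check, assuming the version is exchanged as a "major.minor" string (one of the squashed commits mentions using a string). The function names are hypothetical.

```python
import sys


def python_version_string():
    """Version of the running interpreter as 'major.minor'."""
    return "%d.%d" % sys.version_info[:2]


def check_worker_version(driver_version, worker_version=None):
    """Raise before any closure is deserialized if the versions differ."""
    if worker_version is None:
        worker_version = python_version_string()
    if worker_version != driver_version:
        raise RuntimeError(
            "Python in worker has different version %s than that in driver %s"
            % (worker_version, driver_version)
        )


# Same interpreter on both sides: the check passes.
check_worker_version(python_version_string())

# Mismatch: the worker refuses to continue.
try:
    check_worker_version("2.7", worker_version="3.4")
    mismatch_detected = False
except RuntimeError:
    mismatch_detected = True
```

Comparing plain strings keeps the check independent of how either interpreter would deserialize the other's objects.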

Author: Davies Liu <davies@databricks.com>

Closes #6203 from davies/py_version and squashes the following commits:

b8fb76e [Davies Liu] fix test
6ce5096 [Davies Liu] use string for version
47c6278 [Davies Liu] check python version of worker with driver

(cherry picked from commit 32fbd297dd)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
2015-05-18 12:55:37 -07:00
Andrew Or a0ae8ce013 [SPARK-7627] [SPARK-7472] DAG visualization: style skipped stages
This patch fixes two things:

**SPARK-7627.** Cached RDDs no longer light up on the job page. This is a simple fix.
**SPARK-7472.** Display skipped stages differently from normal stages.

The latter is a major UX issue. Because we link the job viz to the stage viz even for skipped stages, the user may inadvertently click into the stage page of a skipped stage, which is empty.

-------------------
<img src="https://cloud.githubusercontent.com/assets/2133137/7675241/de1a3da6-fcea-11e4-8101-88055cef78c5.png" width="300px" />

Author: Andrew Or <andrew@databricks.com>

Closes #6171 from andrewor14/dag-viz-skipped and squashes the following commits:

f261797 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-skipped
0eda358 [Andrew Or] Tweak skipped stage border color
c604150 [Andrew Or] Tweak grayscale colors
7010676 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-skipped
762b541 [Andrew Or] Use special prefix for stage clusters to avoid collisions
51c95b9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-skipped
b928cd4 [Andrew Or] Fix potential leak + write tests for it
7c4c364 [Andrew Or] Show skipped stages differently
7cc34ce [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-skipped
c121fa2 [Andrew Or] Fix cache color

(cherry picked from commit 563bfcc1ab)
Signed-off-by: Andrew Or <andrew@databricks.com>
2015-05-18 10:59:46 -07:00
zsxwing 2a42d2d8f2 [SPARK-7693][Core] Remove "import scala.concurrent.ExecutionContext.Implicits.global"
Lesson learned from SPARK-7655: Spark should avoid using `scala.concurrent.ExecutionContext.Implicits.global`, because the user may submit blocking actions to it and exhaust all of its threads. This could crash Spark, so Spark should always use its own thread pools for safety.

This PR removes all usages of `scala.concurrent.ExecutionContext.Implicits.global` and uses proper thread pools to replace them.
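
The idea can be sketched in Python (not Spark's Scala code): a component owns a private, bounded pool rather than borrowing a process-wide one. The class name `BroadcastWorker` is hypothetical; the cap of 128 mirrors the squashed commit that lowers the maximum thread number.

```python
from concurrent.futures import ThreadPoolExecutor


class BroadcastWorker:
    """Hypothetical component that owns its pool instead of sharing a global one."""

    def __init__(self, max_threads=128):
        # Private, bounded pool: blocking work submitted elsewhere cannot
        # exhaust these threads, and work here cannot starve other components.
        self._pool = ThreadPoolExecutor(
            max_workers=max_threads, thread_name_prefix="broadcast"
        )

    def submit(self, fn, *args):
        # Returns a concurrent.futures.Future for the scheduled call.
        return self._pool.submit(fn, *args)

    def shutdown(self):
        # Wait for in-flight work, then release the threads.
        self._pool.shutdown(wait=True)
```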

Author: zsxwing <zsxwing@gmail.com>

Closes #6223 from zsxwing/SPARK-7693 and squashes the following commits:

a33ff06 [zsxwing] Decrease the max thread number from 1024 to 128
cf4b3fc [zsxwing] Remove "import scala.concurrent.ExecutionContext.Implicits.global"

(cherry picked from commit ff71d34e00)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-17 20:37:27 -07:00
Josh Rosen 6df71eb8c1 [SPARK-7660] Wrap SnappyOutputStream to work around snappy-java bug
This patch wraps `SnappyOutputStream` to ensure that `close()` is idempotent and to guard against write-after-`close()` bugs. This is a workaround for https://github.com/xerial/snappy-java/issues/107, a bug where a non-idempotent `close()` method can lead to stream corruption. We can remove this workaround if we upgrade to a snappy-java version that contains my fix for this bug, but in the meantime this patch offers a backportable Spark fix.
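
A rough Python sketch of the wrapping technique, assuming only that the inner stream exposes `write`, `flush`, and `close`. The class name `IdempotentCloseStream` is hypothetical and this is not the patch's actual code.

```python
class IdempotentCloseStream:
    """Wrap a stream so close() runs once and later writes are rejected."""

    def __init__(self, inner):
        self._inner = inner
        self._closed = False

    def write(self, data):
        # Guard against write-after-close instead of passing it through.
        if self._closed:
            raise ValueError("write to closed stream")
        return self._inner.write(data)

    def flush(self):
        if self._closed:
            raise ValueError("flush of closed stream")
        self._inner.flush()

    def close(self):
        # Only the first call reaches the inner stream; repeats are no-ops.
        if not self._closed:
            self._closed = True
            self._inner.close()
```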

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6176 from JoshRosen/SPARK-7660-wrap-snappy and squashes the following commits:

8b77aae [Josh Rosen] Wrap SnappyOutputStream to fix SPARK-7660

(cherry picked from commit f2cc6b5bcc)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
2015-05-17 09:33:49 -07:00
zsxwing 84949104c9 [SPARK-7655][Core] Deserializing value should not hold the TaskSchedulerImpl lock
We should not call `DirectTaskResult.value` while holding the `TaskSchedulerImpl` lock, because deserializing a large object may take tens of seconds.
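
The pattern, sketched in Python with `pickle` standing in for Spark's serializer: do the slow deserialization first, then take the lock only for the cheap bookkeeping. The `Scheduler` class and its fields are hypothetical.

```python
import pickle
import threading


class Scheduler:
    """Hypothetical scheduler that records task results under a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.results = {}

    def handle_result(self, task_id, serialized_value):
        # Potentially slow for large objects, so it runs before the lock
        # is taken and other threads are not blocked meanwhile.
        value = pickle.loads(serialized_value)
        with self._lock:
            # Only the quick dictionary update happens while locked.
            self.results[task_id] = value
```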

Author: zsxwing <zsxwing@gmail.com>

Closes #6195 from zsxwing/SPARK-7655 and squashes the following commits:

21f502e [zsxwing] Add more comments
e25fa88 [zsxwing] Add comments
15010b5 [zsxwing] Deserialize value should not hold the TaskSchedulerImpl lock

(cherry picked from commit 3b6ef2c539)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-16 21:03:28 -07:00
zsxwing ad5b0b1ce2 [SPARK-7655][Core][SQL] Remove 'scala.concurrent.ExecutionContext.Implicits.global' in 'ask' and 'BroadcastHashJoin'
Both `AkkaRpcEndpointRef.ask` and `BroadcastHashJoin` use `scala.concurrent.ExecutionContext.Implicits.global`. However, the tasks in `BroadcastHashJoin` are usually long-running and can occupy all threads in `global`, so `ask` never gets a chance to process its replies.

For `ask`, the tasks are very simple, so we can use `MoreExecutors.sameThreadExecutor()`. For `BroadcastHashJoin`, it is better to use `ThreadUtils.newDaemonCachedThreadPool`.
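
To illustrate what a same-thread executor does, here is a small Python analogue. It is an assumption-laden sketch, not Guava's or Spark's implementation: `SameThreadExecutor` is a hypothetical class that runs each task inline on the caller's thread, so trivial callbacks never wait for a pool thread.

```python
from concurrent.futures import Executor, Future


class SameThreadExecutor(Executor):
    """Run each submitted callable immediately on the calling thread."""

    def submit(self, fn, *args, **kwargs):
        future = Future()
        try:
            # No hand-off to a pool: the work is done before submit() returns.
            future.set_result(fn(*args, **kwargs))
        except Exception as exc:
            # Failures surface through the future, as with a real pool.
            future.set_exception(exc)
        return future
```

This only suits short, non-blocking tasks; long-running work would block the caller, which is why a separate cached pool is the better fit for the join.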

Author: zsxwing <zsxwing@gmail.com>

Closes #6200 from zsxwing/SPARK-7655-2 and squashes the following commits:

cfdc605 [zsxwing] Remove redundant import and minor doc fix
cf83153 [zsxwing] Add "sameThread" and "newDaemonCachedThreadPool with maxThreadNumber" to ThreadUtils
08ad0ee [zsxwing] Remove 'scala.concurrent.ExecutionContext.Implicits.global' in 'ask' and 'BroadcastHashJoin'

(cherry picked from commit 47e7ffe36b)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-05-16 00:44:36 -07:00
Nishkam Ravi e7607e5cbc [SPARK-7672] [CORE] Use int conversion in translating kryoserializer.buffer.mb to kryoserializer.buffer
In translating spark.kryoserializer.buffer.mb to spark.kryoserializer.buffer, use of toDouble will lead to "Fractional values not supported" error even when spark.kryoserializer.buffer.mb is an integer.
ilganeli, andrewor14
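
A toy Python illustration of the failure mode, not Spark's translation code. The `m` suffix, `translate_buffer_mb`, and `parse_size_mb` are all assumptions made for the example: formatting a float leaves a trailing `.0` that an integer-only size parser rejects, while converting to an integer first does not.

```python
def translate_buffer_mb(value):
    """Translate a deprecated megabyte count into a size string such as "64m"."""
    # float("64") would format as "64.0"; converting to int drops the
    # spurious fraction before the unit suffix is appended.
    return "%dm" % int(float(value))


def parse_size_mb(size):
    """Toy parser (an assumption) that accepts only whole-number sizes."""
    number = size[:-1]  # strip the one-letter unit suffix
    if not number.isdigit():
        raise ValueError("Fractional values are not supported: " + size)
    return int(number)
```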

Author: Nishkam Ravi <nravi@cloudera.com>
Author: nishkamravi2 <nishkamravi@gmail.com>
Author: nravi <nravi@c1704.halxg.cloudera.com>

Closes #6198 from nishkamravi2/master_nravi and squashes the following commits:

171a53c [nishkamravi2] Update SparkConfSuite.scala
5261bf6 [Nishkam Ravi] Add a test for deprecated config spark.kryoserializer.buffer.mb
5190f79 [Nishkam Ravi] In translating from deprecated spark.kryoserializer.buffer.mb to spark.kryoserializer.buffer use int conversion since fractions are not permissible
059ce82 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
eaa13b5 [nishkamravi2] Update Client.scala
981afd2 [Nishkam Ravi] Check for read permission before initiating copy
1b81383 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
0f1abd0 [nishkamravi2] Update Utils.scala
474e3bf [nishkamravi2] Update DiskBlockManager.scala
97c383e [nishkamravi2] Update Utils.scala
8691e0c [Nishkam Ravi] Add a try/catch block around Utils.removeShutdownHook
2be1e76 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
1c13b79 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
bad4349 [nishkamravi2] Update Main.java
36a6f87 [Nishkam Ravi] Minor changes and bug fixes
b7f4ae7 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
4a45d6a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
458af39 [Nishkam Ravi] Locate the jar using getLocation, obviates the need to pass assembly path as an argument
d9658d6 [Nishkam Ravi] Changes for SPARK-6406
ccdc334 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
3faa7a4 [Nishkam Ravi] Launcher library changes (SPARK-6406)
345206a [Nishkam Ravi] spark-class merge Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi
ac58975 [Nishkam Ravi] spark-class changes
06bfeb0 [nishkamravi2] Update spark-class
35af990 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
32c3ab3 [nishkamravi2] Update AbstractCommandBuilder.java
4bd4489 [nishkamravi2] Update AbstractCommandBuilder.java
746f35b [Nishkam Ravi] "hadoop" string in the assembly name should not be mandatory (everywhere else in spark we mandate spark-assembly*hadoop*.jar)
bfe96e0 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
ee902fa [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
d453197 [nishkamravi2] Update NewHadoopRDD.scala
6f41a1d [nishkamravi2] Update NewHadoopRDD.scala
0ce2c32 [nishkamravi2] Update HadoopRDD.scala
f7e33c2 [Nishkam Ravi] Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi
ba1eb8b [Nishkam Ravi] Try-catch block around the two occurrences of removeShutDownHook. Deletion of semi-redundant occurrences of expensive operation inShutDown.
71d0e17 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
494d8c0 [nishkamravi2] Update DiskBlockManager.scala
3c5ddba [nishkamravi2] Update DiskBlockManager.scala
f0d12de [Nishkam Ravi] Workaround for IllegalStateException caused by recent changes to BlockManager.stop
79ea8b4 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
b446edc [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
5c9a4cb [nishkamravi2] Update TaskSetManagerSuite.scala
535295a [nishkamravi2] Update TaskSetManager.scala
3e1b616 [Nishkam Ravi] Modify test for maxResultSize
9f6583e [Nishkam Ravi] Changes to maxResultSize code (improve error message and add condition to check if maxResultSize > 0)
5f8f9ed [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
636a9ff [nishkamravi2] Update YarnAllocator.scala
8f76c8b [Nishkam Ravi] Doc change for yarn memory overhead
35daa64 [Nishkam Ravi] Slight change in the doc for yarn memory overhead
5ac2ec1 [Nishkam Ravi] Remove out
dac1047 [Nishkam Ravi] Additional documentation for yarn memory overhead issue
42c2c3d [Nishkam Ravi] Additional changes for yarn memory overhead issue
362da5e [Nishkam Ravi] Additional changes for yarn memory overhead
c726bd9 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
f00fa31 [Nishkam Ravi] Improving logging for AM memoryOverhead
1cf2d1e [nishkamravi2] Update YarnAllocator.scala
ebcde10 [Nishkam Ravi] Modify default YARN memory_overhead-- from an additive constant to a multiplier (redone to resolve merge conflicts)
2e69f11 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
efd688a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark
2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles

(cherry picked from commit 0ac8b01a07)
Signed-off-by: Sean Owen <sowen@cloudera.com>
2015-05-16 08:24:34 +01:00
Josh Rosen ed75cc02bc [SPARK-7563] OutputCommitCoordinator.stop() should only run on the driver
This fixes a bug where an executor that exits can cause the driver's OutputCommitCoordinator to stop. To fix this, we use an `isDriver` flag and check it in `stop()`.
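
A simplified Python stand-in for the guard, assuming the driver and executors each hold a coordinator object that points at the same driver-side endpoint. The class bodies are hypothetical and only model the `isDriver` check.

```python
class CoordinatorEndpoint:
    """Stand-in for the driver-side endpoint shared by all coordinators."""

    def __init__(self):
        self.running = True

    def stop(self):
        self.running = False


class OutputCommitCoordinator:
    """Simplified model: only the driver's instance may stop the endpoint."""

    def __init__(self, endpoint, is_driver):
        self._endpoint = endpoint
        self._is_driver = is_driver

    def stop(self):
        # An exiting executor calls stop() too; without this check it would
        # shut down the driver's endpoint.
        if self._is_driver:
            self._endpoint.stop()
```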

See https://issues.apache.org/jira/browse/SPARK-7563 for more details.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6197 from JoshRosen/SPARK-7563 and squashes the following commits:

04b2cc5 [Josh Rosen] [SPARK-7563] OutputCommitCoordinator.stop() should only be executed on the driver

(cherry picked from commit 2c04c8a1ae)
Signed-off-by: Patrick Wendell <patrick@databricks.com>
2015-05-15 18:06:12 -07:00