Commit graph

8931 commits

Author SHA1 Message Date
Jongyoul Lee f0afb623dc [SPARK-4525] Mesos should decline unused offers
Functionally, this is just a small change on top of #3393 (by jongyoul). The issue being addressed is discussed in the comments there. I have not yet added a test for the bug there. I will add one shortly.

I've also done some minor renaming/clean-up of variables in this class and tests.

Author: Patrick Wendell <pwendell@gmail.com>
Author: Jongyoul Lee <jongyoul@gmail.com>

Closes #3436 from pwendell/mesos-issue and squashes the following commits:

58c35b5 [Patrick Wendell] Adding unit test for this situation
c4f0697 [Patrick Wendell] Additional clean-up and fixes on top of existing fix
f20f1b3 [Jongyoul Lee] [SPARK-4525] MesosSchedulerBackend.resourceOffers cannot decline unused offers from acceptedOffers - Added code for declining unused offers among acceptedOffers - Edited testCase for checking declining unused offers
2014-11-24 19:19:16 -08:00
Patrick Wendell a68d442270 Revert "[SPARK-4525] Mesos should decline unused offers"
This reverts commit b043c27424.

I accidentally committed this using my own authorship credential. However,
I should have given authoriship to the original author: Jongyoul Lee.
2014-11-24 19:18:15 -08:00
Patrick Wendell b043c27424 [SPARK-4525] Mesos should decline unused offers
Functionally, this is just a small change on top of #3393 (by jongyoul). The issue being addressed is discussed in the comments there. I have not yet added a test for the bug there. I will add one shortly.

I've also done some minor renaming/clean-up of variables in this class and tests.

Author: Patrick Wendell <pwendell@gmail.com>
Author: Jongyoul Lee <jongyoul@gmail.com>

Closes #3436 from pwendell/mesos-issue and squashes the following commits:

58c35b5 [Patrick Wendell] Adding unit test for this situation
c4f0697 [Patrick Wendell] Additional clean-up and fixes on top of existing fix
f20f1b3 [Jongyoul Lee] [SPARK-4525] MesosSchedulerBackend.resourceOffers cannot decline unused offers from acceptedOffers - Added code for declining unused offers among acceptedOffers - Edited testCase for checking declining unused offers
2014-11-24 19:14:14 -08:00
Kay Ousterhout d24d5bf064 [SPARK-4266] [Web-UI] Reduce stage page load time.
The commit changes the java script used to show/hide additional
metrics in order to reduce page load time. SPARK-4016 significantly
increased page load time for the stage page when stages had a lot
(thousands or tens of thousands) of tasks, due to the additional
Javascript to hide some metrics by default and stripe the tables.
This commit reduces page load time in two ways:

(1) Now, all of the metrics that are hidden by default are
hidden by setting "display: none;" using CSS for the page,
rather than hiding them using javascript after the page loads.
Without this change, for stages with thousands of tasks, there
was a few second delay after page load, where first the additional
metrics were shown, and then after a delay were hidden once the
relevant JS finished running.

(2) CSS is used to stripe all of the tables except for the summary
table. The summary table needs javascript to do the striping because
some rows are hidden, but the javascript striping is slower, which
again resulted in a delay when it was used for the task table (where
for a few seconds after page load, all of the rows in the task table
would be white, while the browser finished running the JS to stripe
the table).

cc pwendell

This change is intended to be backported to 1.2 to avoid a regression in
UI performance when users run large jobs.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #3328 from kayousterhout/SPARK-4266 and squashes the following commits:

f964091 [Kay Ousterhout] [SPARK-4266] [Web-UI] Reduce stage page load time.
2014-11-24 18:03:10 -08:00
Davies Liu 6cf507685e [SPARK-4548] []SPARK-4517] improve performance of python broadcast
Re-implement the Python broadcast using file:

1) serialize the python object using cPickle, write into disks.
2) Create a wrapper in JVM (for the dumped file), it read data from during serialization
3) Using TorrentBroadcast or HttpBroadcast to transfer the data (compressed) into executors
4) During deserialization, writing the data into disk.
5) Passing the path into Python worker, read data from disk and unpickle it into python object, until the first access.

It fixes the performance regression introduced in #2659, has similar performance as 1.1, but support object larger than 2G, also improve the memory efficiency (only one compressed copy in driver and executor).

Testing with a 500M broadcast and 4 tasks (excluding the benefit from reused worker in 1.2):

         name |   1.1   | 1.2 with this patch |  improvement
---------|--------|---------|--------
      python-broadcast-w-bytes  |	25.20  |	9.33   |	170.13% |
        python-broadcast-w-set	  |     4.13	   |    4.50  |	-8.35%  |

Testing with 100 tasks (16 CPUs):

         name |   1.1   | 1.2 with this patch |  improvement
---------|--------|---------|--------
     python-broadcast-w-bytes	| 38.16	| 8.40	 | 353.98%
        python-broadcast-w-set	| 23.29	| 9.59 |	142.80%

Author: Davies Liu <davies@databricks.com>

Closes #3417 from davies/pybroadcast and squashes the following commits:

50a58e0 [Davies Liu] address comments
b98de1d [Davies Liu] disable gc while unpickle
e5ee6b9 [Davies Liu] support large string
09303b8 [Davies Liu] read all data into memory
dde02dd [Davies Liu] improve performance of python broadcast
2014-11-24 17:17:03 -08:00
Davies Liu 050616b408 [SPARK-4578] fix asDict() with nested Row()
The Row object is created on the fly once the field is accessed, so we should access them by getattr() in asDict(0

Author: Davies Liu <davies@databricks.com>

Closes #3434 from davies/fix_asDict and squashes the following commits:

b20f1e7 [Davies Liu] fix asDict() with nested Row()
2014-11-24 16:41:23 -08:00
Davies Liu b660de7a9c [SPARK-4562] [MLlib] speedup vector
This PR change the underline array of DenseVector to numpy.ndarray to avoid the conversion, because most of the users will using numpy.array.

It also improve the serialization of DenseVector.

Before this change:

trial	| trainingTime | 	testTime
-------|--------|--------
0	| 5.126 | 	1.786
1	|2.698	|1.693

After the change:

trial	| trainingTime |	testTime
-------|--------|--------
0	|4.692	|0.554
1	|2.307	|0.525

This could partially fix the performance regression during test.

Author: Davies Liu <davies@databricks.com>

Closes #3420 from davies/ser2 and squashes the following commits:

0e1e6f3 [Davies Liu] fix tests
426f5db [Davies Liu] impove toArray()
44707ec [Davies Liu] add name for ISO-8859-1
fa7d791 [Davies Liu] address comments
1cfb137 [Davies Liu] handle zero sparse vector
2548ee2 [Davies Liu] fix tests
9e6389d [Davies Liu] bugfix
470f702 [Davies Liu] speed up DenseMatrix
f0d3c40 [Davies Liu] speedup SparseVector
ef6ce70 [Davies Liu] speed up dense vector
2014-11-24 16:37:14 -08:00
Tathagata Das cb0e9b0980 [SPARK-4518][SPARK-4519][Streaming] Refactored file stream to prevent files from being processed multiple times
Because of a corner case, a file already selected for batch t can get considered again for batch t+2. This refactoring fixes it by remembering all the files selected in the last 1 minute, so that this corner case does not arise. Also uses spark context's hadoop configuration to access the file system API for listing directories.

pwendell Please take look. I still have not run long-running integration tests, so I cannot say for sure whether this has indeed solved the issue. You could do a first pass on this in the meantime.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #3419 from tdas/filestream-fix2 and squashes the following commits:

c19dd8a [Tathagata Das] Addressed PR comments.
513b608 [Tathagata Das] Updated docs.
d364faf [Tathagata Das] Added the current time condition back
5526222 [Tathagata Das] Removed unnecessary imports.
38bb736 [Tathagata Das] Fix long line.
203bbc7 [Tathagata Das] Un-ignore tests.
eaef4e1 [Tathagata Das] Fixed SPARK-4519
9dbd40a [Tathagata Das] Refactored FileInputDStream to remember last few batches.
2014-11-24 13:50:20 -08:00
Josh Rosen 4a90276ab2 [SPARK-4145] Web UI job pages
This PR adds two new pages to the Spark Web UI:

- A jobs overview page, which shows details on running / completed / failed jobs.
- A job details page, which displays information on an individual job's stages.

The jobs overview page is now the default UI homepage; the old homepage is still accessible at `/stages`.

### Screenshots

#### New UI homepage

![image](https://cloud.githubusercontent.com/assets/50748/5119035/fd0a69e6-701f-11e4-89cb-db7e9705714f.png)

#### Job details page

(This is effectively a per-job version of the stages page that can be extended later with other things, such as DAG visualizations)

![image](https://cloud.githubusercontent.com/assets/50748/5134910/50b340d4-70c7-11e4-88e1-6b73237ea7c8.png)

### Key changes in this PR

- Rename `JobProgressPage` to `AllStagesPage`
- Expose `StageInfo` objects in the ``SparkListenerJobStart` event; add backwards-compatibility tests to JsonProtocol.
- Add additional data structures to `JobProgressListener` to map from stages to jobs.
- Add several fields to `JobUIData`.

I also added ~150 lines of Selenium tests as I uncovered UI issues while developing this patch.

### Limitations

If a job contains stages that aren't run, then its overall job progress bar may be an underestimate of the total job progress; in other words, a completed job may appear to have a progress bar that's not at 100%.

If stages or tasks fail, then the progress bar will not go backwards to reflect the true amount of remaining work.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3009 from JoshRosen/job-page and squashes the following commits:

eb05e90 [Josh Rosen] Disable kill button in completed stages tables.
f00c851 [Josh Rosen] Fix JsonProtocol compatibility
b89c258 [Josh Rosen] More JSON protocol backwards-compatibility fixes.
ff804cd [Josh Rosen] Don't write "Stage Ids" field in JobStartEvent JSON.
6f17f3f [Josh Rosen] Only store StageInfos in SparkListenerJobStart event.
2bbf41a [Josh Rosen] Update job progress bar to reflect skipped tasks/stages.
61c265a [Josh Rosen] Add “skipped stages” table; only display non-empty tables.
1f45d44 [Josh Rosen] Incorporate a bunch of minor review feedback.
0b77e3e [Josh Rosen] More bug fixes for phantom stages.
034aa8d [Josh Rosen] Use `.max()` to find result stage for job.
eebdc2c [Josh Rosen] Don’t display pending stages for completed jobs.
67080ba [Josh Rosen] Ensure that "phantom stages" don't cause memory leaks.
7d10b97 [Josh Rosen] Merge remote-tracking branch 'apache/master' into job-page
d69c775 [Josh Rosen] Fix table sorting on all jobs page.
5eb39dc [Josh Rosen] Add pending stages table to job page.
f2a15da [Josh Rosen] Add status field to job details page.
171b53c [Josh Rosen] Move `startTime` to the start of SparkContext.
e2f2c43 [Josh Rosen] Fix sorting of stages in job details page.
8955f4c [Josh Rosen] Display information for pending stages on jobs page.
8ab6c28 [Josh Rosen] Compute numTasks from job start stage infos.
5884f91 [Josh Rosen] Add StageInfos to SparkListenerJobStart event.
79793cd [Josh Rosen] Track indices of completed stage to avoid overcounting when failures occur.
d62ea7b [Josh Rosen] Add failing Selenium test for stage overcounting issue.
1145c60 [Josh Rosen] Display text instead of progress bar for stages.
3d0a007 [Josh Rosen] Merge remote-tracking branch 'origin/master' into job-page
8a2351b [Josh Rosen] Add help tooltip to Spark Jobs page.
b7bf30e [Josh Rosen] Add stages progress bar; fix bug where active stages show as completed.
4846ce4 [Josh Rosen] Hide "(Job Group") if no jobs were submitted in job groups.
4d58e55 [Josh Rosen] Change label to "Tasks (for all stages)"
85e9c85 [Josh Rosen] Extract startTime into separate variable.
1cf4987 [Josh Rosen] Fix broken kill links; add Selenium test to avoid future regressions.
56701fa [Josh Rosen] Move last stage name / description logic out of markup.
a475ea1 [Josh Rosen] Add progress bars to jobs page.
45343b8 [Josh Rosen] More comments
4b206fb [Josh Rosen] Merge remote-tracking branch 'origin/master' into job-page
bfce2b9 [Josh Rosen] Address review comments, except for progress bar.
4487dcb [Josh Rosen] [SPARK-4145] Web UI job pages
2568a6c [Josh Rosen] Rename JobProgressPage to AllStagesPage:
2014-11-24 13:18:14 -08:00
Kousuke Saruta dd1c9cb36c [SPARK-4487][SQL] Fix attribute reference resolution error when using ORDER BY.
When we use ORDER BY clause, at first, attributes referenced by projection are resolved (1).
And then, attributes referenced at ORDER BY clause are resolved (2).
 But when resolving attributes referenced at ORDER BY clause, the resolution result generated in (1) is discarded so for example, following query fails.

    SELECT c1 + c2 FROM mytable ORDER BY c1;

The query above fails because when resolving the attribute reference 'c1', the resolution result of 'c2' is discarded.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #3363 from sarutak/SPARK-4487 and squashes the following commits:

fd314f3 [Kousuke Saruta] Fixed attribute resolution logic in Analyzer
6e60c20 [Kousuke Saruta] Fixed conflicts
cb5b7e9 [Kousuke Saruta] Added test case for SPARK-4487
282d529 [Kousuke Saruta] Fixed attributes reference resolution error
b6123e6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into concat-feature
317b7fb [Kousuke Saruta] WIP
2014-11-24 12:54:37 -08:00
scwf b384119304 [SQL] Fix path in HiveFromSpark
It require us to run ```HiveFromSpark``` in specified dir because ```HiveFromSpark``` use relative path, this leads to ```run-example``` error(http://apache-spark-developers-list.1001551.n3.nabble.com/src-main-resources-kv1-txt-not-found-in-example-of-HiveFromSpark-td9100.html).

Author: scwf <wangfei1@huawei.com>

Closes #3415 from scwf/HiveFromSpark and squashes the following commits:

ed3d6c9 [scwf] revert no need change
b00e20c [scwf] fix path usring spark_home
dbd321b [scwf] fix path in hivefromspark
2014-11-24 12:49:08 -08:00
Daniel Darabos d5834f0732 [SQL] Fix comment in HiveShim
This file is for Hive 0.13.1 I think.

Author: Daniel Darabos <darabos.daniel@gmail.com>

Closes #3432 from darabos/patch-2 and squashes the following commits:

4fd22ed [Daniel Darabos] Fix comment. This file is for Hive 0.13.1.
2014-11-24 12:45:12 -08:00
Cheng Lian a6d7b61f92 [SPARK-4479][SQL] Avoids unnecessary defensive copies when sort based shuffle is on
This PR is a workaround for SPARK-4479. Two changes are introduced: when merge sort is bypassed in `ExternalSorter`,

1. also bypass RDD elements buffering as buffering is the reason that `MutableRow` backed row objects must be copied, and
2. avoids defensive copies in `Exchange` operator

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3422)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #3422 from liancheng/avoids-defensive-copies and squashes the following commits:

591f2e9 [Cheng Lian] Passes all shuffle suites
0c3c91e [Cheng Lian] Fixes shuffle write metrics when merge sort is bypassed
ed5df3c [Cheng Lian] Fixes styling changes
f75089b [Cheng Lian] Avoids unnecessary defensive copies when sort based shuffle is on
2014-11-24 12:43:45 -08:00
Sandy Ryza 29372b6318 SPARK-4457. Document how to build for Hadoop versions greater than 2.4
Author: Sandy Ryza <sandy@cloudera.com>

Closes #3322 from sryza/sandy-spark-4457 and squashes the following commits:

5e72b77 [Sandy Ryza] Feedback
0cf05c1 [Sandy Ryza] Caveat
be8084b [Sandy Ryza] SPARK-4457. Document how to build for Hadoop versions greater than 2.4
2014-11-24 13:28:48 -06:00
Prashant Sharma 9b2a3c6126 [SPARK-4377] Fixed serialization issue by switching to akka provided serializer.
... - there is no way around this for deserializing actorRef(s).

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #3402 from ScrapCodes/SPARK-4377/troubleDeserializing and squashes the following commits:

77233fd [Prashant Sharma] Style fixes
9b35c6e [Prashant Sharma] Scalastyle fixes
29880da [Prashant Sharma] [SPARK-4377] Fixed serialization issue by switching to akka provided serializer - there is no way around this for deserializing actorRef(s).
2014-11-22 14:05:38 -08:00
DB Tsai b5d17ef10e [SPARK-4431][MLlib] Implement efficient foreachActive for dense and sparse vector
Previously, we were using Breeze's activeIterator to access the non-zero elements
in dense/sparse vector. Due to the overhead, we switched back to native `while loop`
in #SPARK-4129.

However, #SPARK-4129 requires de-reference the dv.values/sv.values in
each access to the value, which is very expensive. Also, in MultivariateOnlineSummarizer,
we're using Breeze's dense vector to store the partial stats, and this is very expensive compared
with using primitive scala array.

In this PR, efficient foreachActive is implemented to unify the code path for dense and sparse
vector operation which makes codebase easier to maintain. Breeze dense vector is replaced
by primitive array to reduce the overhead further.

Benchmarking with mnist8m dataset on single JVM
with first 200 samples loaded in memory, and repeating 5000 times.

Before change:
Sparse Vector - 30.02
Dense Vector - 38.27

With this PR:
Sparse Vector - 6.29
Dense Vector - 11.72

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #3288 from dbtsai/activeIterator and squashes the following commits:

844b0e6 [DB Tsai] formating
03dd693 [DB Tsai] futher performance tunning.
1907ae1 [DB Tsai] address feedback
98448bb [DB Tsai] Made the override final, and had a local copy of variables which made the accessing a single step operation.
c0cbd5a [DB Tsai] fix a bug
6441f92 [DB Tsai] Finished SPARK-4431
2014-11-21 18:15:07 -08:00
Davies Liu ce95bd8e13 [SPARK-4531] [MLlib] cache serialized java object
The Pyrolite is pretty slow (comparing to the adhoc serializer in 1.1), it cause much performance regression in 1.2, because we cache the serialized Python object in JVM, deserialize them into Java object in each step.

This PR change to cache the deserialized JavaRDD instead of PythonRDD to avoid the deserialization of Pyrolite. It should have similar memory usage as before, but much faster.

Author: Davies Liu <davies@databricks.com>

Closes #3397 from davies/cache and squashes the following commits:

7f6e6ce [Davies Liu] Update -> Updater
4b52edd [Davies Liu] using named argument
63b984e [Davies Liu] fix
7da0332 [Davies Liu] add unpersist()
dff33e1 [Davies Liu] address comments
c2bdfc2 [Davies Liu] refactor
d572f00 [Davies Liu] Merge branch 'master' into cache
f1063e1 [Davies Liu] cache serialized java object
2014-11-21 15:02:31 -08:00
Patrick Wendell a81918c5a6 SPARK-4532: Fix bug in detection of Hive in Spark 1.2
Because the Hive profile is no longer defined in the root pom,
we need to check specifically in the sql/hive pom when we
perform the check in make-distribtion.sh.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #3398 from pwendell/make-distribution and squashes the following commits:

8a58279 [Patrick Wendell] Fix bug in detection of Hive in Spark 1.2
2014-11-21 12:10:04 -08:00
zsxwing 65b987c3ed [SPARK-4397][Core] Reorganize 'implicit's to improve the API convenience
This PR moved `implicit`s to `package object` and `companion object` to enable the Scala compiler search them automatically without explicit importing.

It should not break any API. A test project for backforward compatibility is [here](https://github.com/zsxwing/SPARK-4397-Backforward-Compatibility). It proves the codes compiled with Spark 1.1.0 can run with this PR.

To summarize, the changes are:

* Deprecated the old implicit conversion functions: this preserves binary compatibility for code compiled against earlier versions of Spark.
* Removed "implicit" from them so they are just normal functions: this made sure the compiler doesn't get confused and warn about multiple implicits in scope.
* Created new implicit functions in package rdd object, which is part of the scope that scalac will search when looking for implicit conversions on various RDD objects.

The disadvantage is there are duplicated codes in SparkContext for backforward compatibility.

Author: zsxwing <zsxwing@gmail.com>

Closes #3262 from zsxwing/SPARK-4397 and squashes the following commits:

fc30314 [zsxwing] Update the comments
9c27aff [zsxwing] Move implicit functions to object RDD and forward old functions to new implicit ones directly
2b5f5a4 [zsxwing] Comments for the deprecated functions
52353de [zsxwing] Remove private[spark] from object WritableConverter
34641d4 [zsxwing] Move ImplicitSuite to org.apache.sparktest
7266218 [zsxwing] Add comments to warn the duplicate codes in SparkContext
185c12f [zsxwing] Remove simpleWritableConverter from SparkContext
3bdcae2 [zsxwing] Move WritableConverter implicits to object WritableConverter
9b73188 [zsxwing] Fix the code style issue
3ac4f07 [zsxwing] Add license header
1eda9e4 [zsxwing] Reorganize 'implicit's to improve the API convenience
2014-11-21 10:06:30 -08:00
zsxwing f1069b84b8 [SPARK-4472][Shell] Print "Spark context available as sc." only when SparkContext is created...
... successfully

It's weird that printing "Spark context available as sc" when creating SparkContext unsuccessfully.

Author: zsxwing <zsxwing@gmail.com>

Closes #3341 from zsxwing/SPARK-4472 and squashes the following commits:

4850093 [zsxwing] Print "Spark context available as sc." only when SparkContext is created successfully
2014-11-21 00:42:43 -08:00
Reynold Xin 28fdc6f682 [Doc][GraphX] Remove unused png files. 2014-11-21 00:30:58 -08:00
Reynold Xin b97070ec78 [Doc][GraphX] Remove Motivation section and did some minor update. 2014-11-21 00:29:02 -08:00
Michael Armbrust 90a6a46bd1 [SPARK-4522][SQL] Parse schema with missing metadata.
This is just a quick fix for 1.2.  SPARK-4523 describes a more complete solution.

Author: Michael Armbrust <michael@databricks.com>

Closes #3392 from marmbrus/parquetMetadata and squashes the following commits:

bcc6626 [Michael Armbrust] Parse schema with missing metadata.
2014-11-20 20:34:43 -08:00
Davies Liu 8cd6eea629 add Sphinx as a dependency of building docs
Author: Davies Liu <davies@databricks.com>

Closes #3388 from davies/doc_readme and squashes the following commits:

daa1482 [Davies Liu] add Sphinx dependency
2014-11-20 19:13:16 -08:00
Michael Armbrust 02ec058efe [SPARK-4413][SQL] Parquet support through datasource API
Goals:
 - Support for accessing parquet using SQL but not requiring Hive (thus allowing support of parquet tables with decimal columns)
 - Support for folder based partitioning with automatic discovery of available partitions
 - Caching of file metadata

See scaladoc of `ParquetRelation2` for more details.

Author: Michael Armbrust <michael@databricks.com>

Closes #3269 from marmbrus/newParquet and squashes the following commits:

1dd75f1 [Michael Armbrust] Pass all paths for FileInputFormat at once.
645768b [Michael Armbrust] Review comments.
abd8e2f [Michael Armbrust] Alternative implementation of parquet based on the datasources API.
938019e [Michael Armbrust] Add an experimental interface to data sources that exposes catalyst expressions.
e9d2641 [Michael Armbrust] logging / formatting improvements.
2014-11-20 18:31:02 -08:00
Cheng Hao 84d79ee9ec [SPARK-4244] [SQL] Support Hive Generic UDFs with constant object inspector parameters
Query `SELECT named_struct(lower("AA"), "12", lower("Bb"), "13") FROM src LIMIT 1` will throw exception, some of the Hive Generic UDF/UDAF requires the input object inspector is `ConstantObjectInspector`, however, we won't get that before the expression optimization executed. (Constant Folding).

This PR is a work around to fix this. (As ideally, the `output` of LogicalPlan should be identical before and after Optimization).

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3109 from chenghao-intel/optimized and squashes the following commits:

487ff79 [Cheng Hao] rebase to the latest master & update the unittest
2014-11-20 16:50:59 -08:00
Davies Liu d39f2e9c68 [SPARK-4477] [PySpark] remove numpy from RDDSampler
In RDDSampler, it try use numpy to gain better performance for possion(), but the number of call of random() is only (1+faction) * N in the pure python implementation of possion(), so there is no much performance gain from numpy.

numpy is not a dependent of pyspark, so it maybe introduce some problem, such as there is no numpy installed in slaves, but only installed master, as reported in SPARK-927.

It also complicate the code a lot, so we may should remove numpy from RDDSampler.

I also did some benchmark to verify that:
```
>>> from pyspark.mllib.random import RandomRDDs
>>> rdd = RandomRDDs.uniformRDD(sc, 1 << 20, 1).cache()
>>> rdd.count()  # cache it
>>> rdd.sample(True, 0.9).count()    # measure this line
```
the results:

|withReplacement      |  random  | numpy.random |
 ------- | ------------ |  -------
|True | 1.5 s|  1.4 s|
|False|  0.6 s | 0.8 s|

closes #2313

Note: this patch including some commits that not mirrored to github, it will be OK after it catches up.

Author: Davies Liu <davies@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #3351 from davies/numpy and squashes the following commits:

5c438d7 [Davies Liu] fix comment
c5b9252 [Davies Liu] Merge pull request #1 from mengxr/SPARK-4477
98eb31b [Xiangrui Meng] make poisson sampling slightly faster
ee17d78 [Davies Liu] remove = for float
13f7b05 [Davies Liu] Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into numpy
f583023 [Davies Liu] fix tests
51649f5 [Davies Liu] remove numpy in RDDSampler
78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no performance gain
f5fdf63 [Davies Liu] fix bug with int in weights
4dfa2cd [Davies Liu] refactor
f866bcf [Davies Liu] remove unneeded change
c7a2007 [Davies Liu] switch to python implementation
95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into randomSplit
0d9b256 [Davies Liu] refactor
1715ee3 [Davies Liu] address comments
41fce54 [Davies Liu] randomSplit()
2014-11-20 16:40:25 -08:00
Jacky Li ad5f1f3ca2 [SQL] fix function description mistake
Sample code in the description of SchemaRDD.where is not correct

Author: Jacky Li <jacky.likun@gmail.com>

Closes #3344 from jackylk/patch-6 and squashes the following commits:

62cd126 [Jacky Li] [SQL] fix function description mistake
2014-11-20 15:48:36 -08:00
Cheng Hao 6aa0fc9f4d [SPARK-2918] [SQL] Support the CTAS in EXPLAIN command
Hive supports the `explain` the CTAS, which was supported by Spark SQL previously, however, seems it was reverted after the code refactoring in HiveQL.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3357 from chenghao-intel/explain and squashes the following commits:

7aace63 [Cheng Hao] Support the CTAS in EXPLAIN command
2014-11-20 15:46:00 -08:00
Takuya UESHIN 2c2e7a44db [SPARK-4318][SQL] Fix empty sum distinct.
Executing sum distinct for empty table throws `java.lang.UnsupportedOperationException: empty.reduceLeft`.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #3184 from ueshin/issues/SPARK-4318 and squashes the following commits:

8168c42 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4318
66fdb0a [Takuya UESHIN] Re-refine aggregate functions.
6186eb4 [Takuya UESHIN] Fix Sum of GeneratedAggregate.
d2975f6 [Takuya UESHIN] Refine Sum and Average of GeneratedAggregate.
1bba675 [Takuya UESHIN] Refine Sum, SumDistinct and Average functions.
917e533 [Takuya UESHIN] Use aggregate instead of groupBy().
1a5f874 [Takuya UESHIN] Add tests to be executed as non-partial aggregation.
a5a57d2 [Takuya UESHIN] Fix empty Average.
22799dc [Takuya UESHIN] Fix empty Sum and SumDistinct.
65b7dd2 [Takuya UESHIN] Fix empty sum distinct.
2014-11-20 15:41:24 -08:00
ravipesala 98e9419784 [SPARK-4513][SQL] Support relational operator '<=>' in Spark SQL
The relational operator '<=>' is not working in Spark SQL. Same works in Spark HiveQL

Author: ravipesala <ravindra.pesala@huawei.com>

Closes #3387 from ravipesala/<=> and squashes the following commits:

7198e90 [ravipesala] Supporting relational operator '<=>' in Spark SQL
2014-11-20 15:34:03 -08:00
Davies Liu 1c53a5db99 [SPARK-4439] [MLlib] add python api for random forest
```
    class RandomForestModel
     |  A model trained by RandomForest
     |
     |  numTrees(self)
     |      Get number of trees in forest.
     |
     |  predict(self, x)
     |      Predict values for a single data point or an RDD of points using the model trained.
     |
     |  toDebugString(self)
     |      Full model
     |
     |  totalNumNodes(self)
     |      Get total number of nodes, summed over all trees in the forest.
     |

    class RandomForest
     |  trainClassifier(cls, data, numClassesForClassification, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='gini', maxDepth=4, maxBins=32, seed=None):
     |      Method to train a decision tree model for binary or multiclass classification.
     |
     |      :param data: Training dataset: RDD of LabeledPoint.
     |                   Labels should take values {0, 1, ..., numClasses-1}.
     |      :param numClassesForClassification: number of classes for classification.
     |      :param categoricalFeaturesInfo: Map storing arity of categorical features.
     |                                  E.g., an entry (n -> k) indicates that feature n is categorical
     |                                  with k categories indexed from 0: {0, 1, ..., k-1}.
     |      :param numTrees: Number of trees in the random forest.
     |      :param featureSubsetStrategy: Number of features to consider for splits at each node.
     |                                Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
     |                                If "auto" is set, this parameter is set based on numTrees:
     |                                  if numTrees == 1, set to "all";
     |                                  if numTrees > 1 (forest) set to "sqrt".
     |      :param impurity: Criterion used for information gain calculation.
     |                   Supported values: "gini" (recommended) or "entropy".
     |      :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means
     |                       1 internal node + 2 leaf nodes. (default: 4)
     |      :param maxBins: maximum number of bins used for splitting features (default: 100)
     |      :param seed:  Random seed for bootstrapping and choosing feature subsets.
     |      :return: RandomForestModel that can be used for prediction
     |
     |   trainRegressor(cls, data, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='variance', maxDepth=4, maxBins=32, seed=None):
     |      Method to train a decision tree model for regression.
     |
     |      :param data: Training dataset: RDD of LabeledPoint.
     |                   Labels are real numbers.
     |      :param categoricalFeaturesInfo: Map storing arity of categorical features.
     |                                   E.g., an entry (n -> k) indicates that feature n is categorical
     |                                   with k categories indexed from 0: {0, 1, ..., k-1}.
     |      :param numTrees: Number of trees in the random forest.
     |      :param featureSubsetStrategy: Number of features to consider for splits at each node.
     |                                 Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
     |                                 If "auto" is set, this parameter is set based on numTrees:
     |                                 if numTrees == 1, set to "all";
     |                                 if numTrees > 1 (forest) set to "onethird".
     |      :param impurity: Criterion used for information gain calculation.
     |                       Supported values: "variance".
     |      :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means
     |                       1 internal node + 2 leaf nodes.(default: 4)
     |      :param maxBins: maximum number of bins used for splitting features (default: 100)
     |      :param seed:  Random seed for bootstrapping and choosing feature subsets.
     |      :return: RandomForestModel that can be used for prediction
     |
```

Author: Davies Liu <davies@databricks.com>

Closes #3320 from davies/forest and squashes the following commits:

8003dfc [Davies Liu] reorder
53cf510 [Davies Liu] fix docs
4ca593d [Davies Liu] fix docs
e0df852 [Davies Liu] fix docs
0431746 [Davies Liu] rebased
2b6f239 [Davies Liu] Merge branch 'master' of github.com:apache/spark into forest
885abee [Davies Liu] address comments
dae7fc0 [Davies Liu] address comments
89a000f [Davies Liu] fix docs
565d476 [Davies Liu] add python api for random forest
2014-11-20 15:31:28 -08:00
Dan McClary b8e6886fb8 [SPARK-4228][SQL] SchemaRDD to JSON
Here's a simple fix for SchemaRDD to JSON.

Author: Dan McClary <dan.mcclary@gmail.com>

Closes #3213 from dwmclary/SPARK-4228 and squashes the following commits:

d714e1d [Dan McClary] fixed PEP 8 error
cac2879 [Dan McClary] move pyspark comment and doctest to correct location
f9471d3 [Dan McClary] added pyspark doc and doctest
6598cee [Dan McClary] adding complex type queries
1a5fd30 [Dan McClary] removing SPARK-4228 from SQLQuerySuite
4a651f0 [Dan McClary] cleaned PEP and Scala style failures.  Moved tests to JsonSuite
47ceff6 [Dan McClary] cleaned up scala style issues
2ee1e70 [Dan McClary] moved rowToJSON to JsonRDD
4387dd5 [Dan McClary] Added UserDefinedType, cleaned up case formatting
8f7bfb6 [Dan McClary] Map type added to SchemaRDD.toJSON
1b11980 [Dan McClary] Map and UserDefinedTypes partially done
11d2016 [Dan McClary] formatting and unicode deserialization default fixed
6af72d1 [Dan McClary] deleted extaneous comment
4d11c0c [Dan McClary] JsonFactory rewrite of toJSON for SchemaRDD
149dafd [Dan McClary] wrapped scala toJSON in sql.py
5e5eb1b [Dan McClary] switched to Jackson for JSON processing
6c94a54 [Dan McClary] added toJSON to pyspark SchemaRDD
aaeba58 [Dan McClary] added toJSON to pyspark SchemaRDD
1d171aa [Dan McClary] upated missing brace on if statement
319e3ba [Dan McClary] updated to upstream master with merged SPARK-4228
424f130 [Dan McClary] tests pass, ready for pull and PR
626a5b1 [Dan McClary] added toJSON to SchemaRDD
f7d166a [Dan McClary] added toJSON method
5d34e37 [Dan McClary] merge resolved
d6d19e9 [Dan McClary] pr example
2014-11-20 13:44:19 -08:00
Cheng Lian abf29187f0 [SPARK-3938][SQL] Names in-memory columnar RDD with corresponding table name
This PR enables the Web UI storage tab to show the in-memory table name instead of the mysterious query plan string as the name of the in-memory columnar RDD.

Note that after #2501, a single columnar RDD can be shared by multiple in-memory tables, as long as their query results are the same. In this case, only the first cached table name is shown. For example:

```sql
CACHE TABLE first AS SELECT * FROM src;
CACHE TABLE second AS SELECT * FROM src;
```

The Web UI only shows "In-memory table first".

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3383)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #3383 from liancheng/columnar-rdd-name and squashes the following commits:

071907f [Cheng Lian] Fixes tests
12ddfa6 [Cheng Lian] Names in-memory columnar RDD with corresponding table name
2014-11-20 13:12:24 -08:00
Xiangrui Meng 15cacc8124 [SPARK-4486][MLLIB] Improve GradientBoosting APIs and doc
There are some inconsistencies in the gradient boosting APIs. The target is a general boosting meta-algorithm, but the implementation is attached to trees. This was partially due to the delay of SPARK-1856. But for the 1.2 release, we should make the APIs consistent.

1. WeightedEnsembleModel -> private[tree] TreeEnsembleModel and renamed members accordingly.
1. GradientBoosting -> GradientBoostedTrees
1. Add RandomForestModel and GradientBoostedTreesModel and hide CombiningStrategy
1. Slightly refactored TreeEnsembleModel (Vote takes weights into consideration.)
1. Remove `trainClassifier` and `trainRegressor` from `GradientBoostedTrees` because they are the same as `train`
1. Rename class `train` method to `run` because it hides the static methods with the same name in Java. Deprecated `DecisionTree.train` class method.
1. Simplify BoostingStrategy and make sure the input strategy is not modified. Users should put algo and numClasses in treeStrategy. We create ensembleStrategy inside boosting.
1. Fix a bug in GradientBoostedTreesSuite with AbsoluteError
1. doc updates

manishamde jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #3374 from mengxr/SPARK-4486 and squashes the following commits:

7097251 [Xiangrui Meng] address joseph's comments
98dea09 [Xiangrui Meng] address manish's comments
4aae3b7 [Xiangrui Meng] add RandomForestModel and GradientBoostedTreesModel, hide CombiningStrategy
ea4c467 [Xiangrui Meng] fix unit tests
751da4e [Xiangrui Meng] rename class method train -> run
19030a5 [Xiangrui Meng] update boosting public APIs
2014-11-20 00:48:59 -08:00
Leolh e216ffaead [SPARK-4446] [SPARK CORE]
MetadataCleaner schedule task with a wrong param for delay time .

Author: Leolh <leosandylh@gmail.com>

Closes #3306 from Leolh/master and squashes the following commits:

4a21f4e [Leolh] Update MetadataCleaner.scala
2014-11-19 18:18:55 -08:00
Andrew Or 0eb4a7fb0f [SPARK-4480] Avoid many small spills in external data structures
**Summary.** Currently, we may spill many small files in `ExternalAppendOnlyMap` and `ExternalSorter`. The underlying root cause of this is summarized in [SPARK-4452](https://issues.apache.org/jira/browse/SPARK-4452). This PR does not address this root cause, but simply provides the guarantee that we never spill the in-memory data structure if its size is less than a configurable threshold of 5MB. This config is not documented because we don't want users to set it themselves, and it is not hard-coded because we need to change it in tests.

**Symptom.** Each spill is orders of magnitude smaller than 1MB, and there are many spills. In environments where the ulimit is set, this frequently causes "too many open file" exceptions observed in [SPARK-3633](https://issues.apache.org/jira/browse/SPARK-3633).
```
14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4792 B to disk (292769 spills so far)
14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4760 B to disk (292770 spills so far)
14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4520 B to disk (292771 spills so far)
14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4560 B to disk (292772 spills so far)
14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4792 B to disk (292773 spills so far)
14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4784 B to disk (292774 spills so far)
```

**Reproduction.** I ran the following on a small 4-node cluster with 512MB executors. Note that the back-to-back shuffle here is necessary for reasons described in [SPARK-4522](https://issues.apache.org/jira/browse/SPARK-4452). The second shuffle is a `reduceByKey` because it performs a map-side combine.
```
sc.parallelize(1 to 100000000, 100)
  .map { i => (i, i) }
  .groupByKey()
  .reduceByKey(_ ++ _)
  .count()
```
Before the change, I notice that each thread may spill up to 1000 times, and the size of each spill is on the order of 10KB. After the change, each thread spills only up to 20 times in the worst case, and the size of each spill is on the order of 1MB.

Author: Andrew Or <andrew@databricks.com>

Closes #3353 from andrewor14/avoid-small-spills and squashes the following commits:

49f380f [Andrew Or] Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/spark into avoid-small-spills
27d6966 [Andrew Or] Merge branch 'master' of github.com:apache/spark into avoid-small-spills
f4736e3 [Andrew Or] Fix tests
a919776 [Andrew Or] Avoid many small spills
2014-11-19 18:07:27 -08:00
Nishkam Ravi 73fedf5a6e [Spark-4484] Treat maxResultSize as unlimited when set to 0; improve error message
The check for maxResultSize > 0 is missing, results in failures. Also, error message needs to be improved so the developers know that there is a new parameter to be configured

Author: Nishkam Ravi <nravi@cloudera.com>
Author: nravi <nravi@c1704.halxg.cloudera.com>
Author: nishkamravi2 <nishkamravi@gmail.com>

Closes #3360 from nishkamravi2/master_nravi and squashes the following commits:

5c9a4cb [nishkamravi2] Update TaskSetManagerSuite.scala
535295a [nishkamravi2] Update TaskSetManager.scala
3e1b616 [Nishkam Ravi] Modify test for maxResultSize
9f6583e [Nishkam Ravi] Changes to maxResultSize code (improve error message and add condition to check if maxResultSize > 0)
5f8f9ed [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
636a9ff [nishkamravi2] Update YarnAllocator.scala
8f76c8b [Nishkam Ravi] Doc change for yarn memory overhead
35daa64 [Nishkam Ravi] Slight change in the doc for yarn memory overhead
5ac2ec1 [Nishkam Ravi] Remove out
dac1047 [Nishkam Ravi] Additional documentation for yarn memory overhead issue
42c2c3d [Nishkam Ravi] Additional changes for yarn memory overhead issue
362da5e [Nishkam Ravi] Additional changes for yarn memory overhead
c726bd9 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
f00fa31 [Nishkam Ravi] Improving logging for AM memoryOverhead
1cf2d1e [nishkamravi2] Update YarnAllocator.scala
ebcde10 [Nishkam Ravi] Modify default YARN memory_overhead-- from an additive constant to a multiplier (redone to resolve merge conflicts)
2e69f11 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
efd688a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark
2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
2014-11-19 17:23:42 -08:00
Akshat Aranya 9ccc53c72c [SPARK-4478] Keep totalRegisteredExecutors up-to-date
This rebases PR 3368.

This commit fixes totalRegisteredExecutors update [SPARK-4478], so that we can correctly keep track of number of registered executors.

Author: Akshat Aranya <aaranya@quantcast.com>

Closes #3373 from coolfrood/topic/SPARK-4478 and squashes the following commits:

8a4d1e4 [Akshat Aranya] Added comment
150ae93 [Akshat Aranya] [SPARK-4478] Keep totalRegisteredExecutors up-to-date
2014-11-19 17:20:20 -08:00
Joseph E. Gonzalez 377b068209 Updating GraphX programming guide and documentation
This pull request revises the programming guide to reflect changes in the GraphX API as well as the deprecated mapReduceTriplets operator.

Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>

Closes #3359 from jegonzal/GraphXProgrammingGuide and squashes the following commits:

4421964 [Joseph E. Gonzalez] updating documentation for graphx
2014-11-19 16:53:33 -08:00
Josh Rosen 04d462f648 [SPARK-4495] Fix memory leak in JobProgressListener
This commit fixes a memory leak in JobProgressListener that I introduced in SPARK-2321 and adds a testing framework to ensure that it’s very difficult to inadvertently introduce new memory leaks.

This solution might be overkill, but the main idea is to partition JobProgressListener's state into three buckets: collections that should be empty once Spark is idle, collections that must obey some hard size limit, and collections that have a soft size limit (they can grow arbitrarily large when Spark is active but must shrink to fit within some bound after Spark becomes idle).

Based on this, we can write fairly generic tests that run workloads that submit more than `spark.ui.retainedStages` stages and `spark.ui.retainedJobs` jobs then check that these various collections' sizes obey their contracts.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3372 from JoshRosen/SPARK-4495 and squashes the following commits:

c73fab5 [Josh Rosen] "data structures" -> collections
be72e81 [Josh Rosen] [SPARK-4495] Fix memory leaks in JobProgressListener
2014-11-19 16:50:21 -08:00
Yadong Qi c3002c4a61 [SPARK-4294][Streaming] UnionDStream stream should express the requirements in the same way as TransformedDStream
In class TransformedDStream:
```scala
require(parents.length > 0, "List of DStreams to transform is empty")
require(parents.map(.ssc).distinct.size == 1, "Some of the DStreams have different contexts")
require(parents.map(.slideDuration).distinct.size == 1,
"Some of the DStreams have different slide durations")
```

In class UnionDStream:
```scala
if (parents.length == 0)
{ throw new IllegalArgumentException("Empty array of parents") }
if (parents.map(.ssc).distinct.size > 1)
{ throw new IllegalArgumentException("Array of parents have different StreamingContexts") }
if (parents.map(.slideDuration).distinct.size > 1)
{ throw new IllegalArgumentException("Array of parents have different slide times") }
```

The function is the same, but the realization is not. I think they shoule be the same.

Author: Yadong Qi <qiyadong2010@gmail.com>

Closes #3152 from watermen/bug-fix1 and squashes the following commits:

ed66db6 [Yadong Qi] Change transform to union
b6b3b8b [Yadong Qi] The same function should have the same realization.
2014-11-19 15:53:06 -08:00
Davies Liu 73c8ea84a6 [SPARK-4384] [PySpark] improve sort spilling
If there some big broadcasts (or other object) in Python worker, the free memory could be used for sorting will be too small, then it will keep spilling small files into disks, finally failed with too many open files.

This PR try to delay the spilling until the used memory goes over limit and start to increase since last spilling, it will increase the size of spilling files, improve the stability and performance in this cases. (We also do this in ExternalAggregator).

Author: Davies Liu <davies@databricks.com>

Closes #3252 from davies/sort and squashes the following commits:

711fb6c [Davies Liu] improve sort spilling
2014-11-19 15:45:37 -08:00
Takuya UESHIN f9adda9afb [SPARK-4429][BUILD] Build for Scala 2.11 using sbt fails.
I tried to build for Scala 2.11 using sbt with the following command:

```
$ sbt/sbt -Dscala-2.11 assembly
```

but it ends with the following error messages:

```
[error] (streaming-kafka/*:update) sbt.ResolveException: unresolved dependency: org.apache.kafka#kafka_2.11;0.8.0: not found
[error] (catalyst/*:update) sbt.ResolveException: unresolved dependency: org.scalamacros#quasiquotes_2.11;2.0.1: not found
```

The reason is:
If system property `-Dscala-2.11` (without value) was set, `SparkBuild.scala` adds `scala-2.11` profile, but also `sbt-pom-reader` activates `scala-2.10` profile instead of `scala-2.11` profile because the activator `PropertyProfileActivator` used by `sbt-pom-reader` internally checks if the property value is empty or not.

The value is set to non-empty value, then no need to add profiles in `SparkBuild.scala` because `sbt-pom-reader` can handle as expected.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #3342 from ueshin/issues/SPARK-4429 and squashes the following commits:

14d86e8 [Takuya UESHIN] Add a comment.
4eef52b [Takuya UESHIN] Remove unneeded condition.
ce98d0f [Takuya UESHIN] Set non-empty value to system property "scala-2.11" if the property exists instead of adding profile.
2014-11-19 14:40:21 -08:00
Ken Takagiwa 9b7bbcef88 [DOC][PySpark][Streaming] Fix docstring for sphinx
This commit should be merged for 1.2 release.
cc tdas

Author: Ken Takagiwa <ugw.gi.world@gmail.com>

Closes #3311 from giwa/patch-3 and squashes the following commits:

ab474a8 [Ken Takagiwa] [DOC][PySpark][Streaming] Fix docstring for sphinx
2014-11-19 14:23:18 -08:00
Prashant Sharma 1c938413ba SPARK-3962 Marked scope as provided for external projects.
Somehow maven shade plugin is set in infinite loop of creating effective pom.

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Prashant Sharma <scrapcodes@gmail.com>

Closes #2959 from ScrapCodes/SPARK-3962/scope-provided and squashes the following commits:

994d1d3 [Prashant Sharma] Fixed failing flume tests
270b4fb [Prashant Sharma] Removed most of the unused code.
bb3bbfd [Prashant Sharma] SPARK-3962 Marked scope as provided for external.
2014-11-19 14:18:10 -08:00
Andrew Or 0df02ca463 [HOT FIX] MiMa tests are broken
This is blocking #3353 and other patches.

Author: Andrew Or <andrew@databricks.com>

Closes #3371 from andrewor14/mima-hot-fix and squashes the following commits:

842d059 [Andrew Or] Move excludes to the right section
c4d4f4e [Andrew Or] MIMA hot fix
2014-11-19 14:03:44 -08:00
zsxwing 3bf7ceebb1 [SPARK-4481][Streaming][Doc] Fix the wrong description of updateFunc
Removed `If `this` function returns None, then corresponding state key-value pair will be eliminated.` for the description of `updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)]`

Author: zsxwing <zsxwing@gmail.com>

Closes #3356 from zsxwing/SPARK-4481 and squashes the following commits:

76a9891 [zsxwing] Add a note that keys may be added or removed
0ebc42a [zsxwing] Fix the wrong description of updateFunc
2014-11-19 13:17:15 -08:00
Tathagata Das 22fc4e751c [SPARK-4482][Streaming] Disable ReceivedBlockTracker's write ahead log by default
The write ahead log of ReceivedBlockTracker gets enabled as soon as checkpoint directory is set. This should not happen, as the WAL should be enabled only if the WAL is enabled in the Spark configuration.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #3358 from tdas/SPARK-4482 and squashes the following commits:

b740136 [Tathagata Das] Fixed bug in ReceivedBlockTracker
2014-11-19 13:06:48 -08:00
Kenichi Maehashi eacc788346 [SPARK-4470] Validate number of threads in local mode
When running Spark locally, if number of threads is specified as 0 (e.g., `spark-submit --master local[0] ...`), the job got stuck and does not run at all.
I think it's better to validate the parameter.

Fix for [SPARK-4470](https://issues.apache.org/jira/browse/SPARK-4470).

Author: Kenichi Maehashi <webmaster@kenichimaehashi.com>

Closes #3337 from kmaehashi/spark-4470 and squashes the following commits:

3ad76f3 [Kenichi Maehashi] fix code style
7716734 [Kenichi Maehashi] SPARK-4470: Validate number of threads in local mode
2014-11-19 12:11:09 -08:00