Commit graph

9475 commits

Author SHA1 Message Date
Kousuke Saruta c80194b334 [SPARK-5155] Build fails with spark-ganglia-lgpl profile
Build fails with spark-ganglia-lgpl profile at the moment. This is because pom.xml for spark-ganglia-lgpl is not updated.

This PR is related to #4218, #4209 and #3812.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #4303 from sarutak/fix-ganglia-pom-for-metric and squashes the following commits:

5cf455f [Kousuke Saruta] Fixed pom.xml for ganglia in order to use io.dropwizard.metrics
2015-02-01 17:53:56 -08:00
Liang-Chi Hsieh ef89b82d83 [Minor][SQL] Little refactor DataFrame related codes
Simplify some codes related to DataFrame.

*  Calling `toAttributes` instead of a `map`.
*  Original `createDataFrame` creates the `StructType` and its attributes in a redundant way. Refactored it to create `StructType` and call `toAttributes` on it directly.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #4298 from viirya/refactor_df and squashes the following commits:

1d61c64 [Liang-Chi Hsieh] Revert it.
f36efb5 [Liang-Chi Hsieh] Relax the constraint of toDataFrame.
2c9f370 [Liang-Chi Hsieh] Just refactor DataFrame codes.
2015-02-01 17:52:18 -08:00
zsxwing 883bc88d52 [SPARK-4859][Core][Streaming] Refactor LiveListenerBus and StreamingListenerBus
This PR refactors LiveListenerBus and StreamingListenerBus and extracts the common codes to a parent class `ListenerBus`.

It also includes bug fixes in #3710:
1. Fix the race condition of queueFullErrorMessageLogged in LiveListenerBus and StreamingListenerBus to avoid outputing `queue-full-error` logs multiple times.
2. Make sure the SHUTDOWN message will be delivered to listenerThread, so that we can make sure listenerThread will always be able to exit.
3. Log the error from listener rather than crashing listenerThread in StreamingListenerBus.

During fixing the above bugs, we find it's better to make LiveListenerBus and StreamingListenerBus have the same bahaviors. Then there will be many duplicated codes in LiveListenerBus and StreamingListenerBus.

Therefore, I extracted their common codes to `ListenerBus` as a parent class: LiveListenerBus and StreamingListenerBus only need to extend `ListenerBus` and implement `onPostEvent` (how to process an event) and `onDropEvent` (do something when droppping an event).

Author: zsxwing <zsxwing@gmail.com>

Closes #4006 from zsxwing/SPARK-4859-refactor and squashes the following commits:

c8dade2 [zsxwing] Fix the code style after renaming
5715061 [zsxwing] Rename ListenerHelper to ListenerBus and the original ListenerBus to AsynchronousListenerBus
f0ef647 [zsxwing] Fix the code style
4e85ffc [zsxwing] Merge branch 'master' into SPARK-4859-refactor
d2ef990 [zsxwing] Add private[spark]
4539f91 [zsxwing] Remove final to pass MiMa tests
a9dccd3 [zsxwing] Remove SparkListenerShutdown
7cc04c3 [zsxwing] Refactor LiveListenerBus and StreamingListenerBus and make them share same code base
2015-02-01 17:48:41 -08:00
Xiangrui Meng 4a171225ba [SPARK-5424][MLLIB] make the new ALS impl take generic ID types
This PR makes the ALS implementation take generic ID types, e.g., Long and String, and expose it as a developer API.

TODO:
- [x] make sure that specialization works (validated in profiler)

srowen You may like this change:) I hit a Scala compiler bug with specialization. It compiles now but users and items must have the same type. I'm going to check whether specialization really works.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4281 from mengxr/generic-als and squashes the following commits:

96072c3 [Xiangrui Meng] merge master
135f741 [Xiangrui Meng] minor update
c2db5e5 [Xiangrui Meng] make test pass
86588e1 [Xiangrui Meng] use a single ID type for both users and items
74f1f73 [Xiangrui Meng] compile but runtime error at test
e36469a [Xiangrui Meng] add classtags and make it compile
7a5aeb3 [Xiangrui Meng] UserType -> User, ItemType -> Item
c8ee0bc [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into generic-als
72b5006 [Xiangrui Meng] remove generic from pipeline interface
8bbaea0 [Xiangrui Meng] make ALS take generic IDs
2015-02-01 14:13:31 -08:00
Octavian Geagla bdb0680d37 [SPARK-5207] [MLLIB] StandardScalerModel mean and variance re-use
This seems complete, the duplication of tests for provided means/variances might be overkill, would appreciate some feedback.

Author: Octavian Geagla <ogeagla@gmail.com>

Closes #4140 from ogeagla/SPARK-5207 and squashes the following commits:

fa64dfa [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] change StandardScalerModel to take stddev instead of variance
9078fe0 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] Incorporate code review feedback: change arg ordering, add dev api annotations, do better null checking, add another test and some doc for this.
997d2e0 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] make withMean and withStd public, add constructor which uses defaults, un-refactor test class
64408a4 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] change StandardScalerModel contructor to not be private to mllib, added tests for newly-exposed functionality
2015-02-01 09:21:14 -08:00
Ryan Williams 80bd715a3e [SPARK-5422] Add support for sending Graphite metrics via UDP
Depends on [SPARK-5413](https://issues.apache.org/jira/browse/SPARK-5413) / #4209, included here, will rebase once the latter's merged.

Author: Ryan Williams <ryan.blake.williams@gmail.com>

Closes #4218 from ryan-williams/udp and squashes the following commits:

ebae393 [Ryan Williams] Add support for sending Graphite metrics via UDP
cb58262 [Ryan Williams] bump metrics dependency to v3.1.0
2015-01-31 23:41:05 -08:00
Sean Owen c84d5a10e8 SPARK-3359 [CORE] [DOCS] sbt/sbt unidoc doesn't work with Java 8
These are more `javadoc` 8-related changes I spotted while investigating. These should be helpful in any event, but this does not nearly resolve SPARK-3359, which may never be feasible while using `unidoc` and `javadoc` 8.

Author: Sean Owen <sowen@cloudera.com>

Closes #4193 from srowen/SPARK-3359 and squashes the following commits:

5b33f66 [Sean Owen] Additional scaladoc fixes for javadoc 8; still not going to be javadoc 8 compatible
2015-01-31 10:40:42 -08:00
Burak Yavuz ef8974b1b7 [SPARK-3975] Added support for BlockMatrix addition and multiplication
Support for multiplying and adding large distributed matrices!

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <brkyvz@dn51t42l.sunet>
Author: Burak Yavuz <brkyvz@dn51t4rd.sunet>
Author: Burak Yavuz <brkyvz@dn0a221430.sunet>
Author: Burak Yavuz <brkyvz@dn0a22b17d.sunet>

Closes #4274 from brkyvz/SPARK-3975PR2 and squashes the following commits:

17abd59 [Burak Yavuz] added indices to error message
ac25783 [Burak Yavuz] merged masyer
b66fd8b [Burak Yavuz] merged masyer
e39baff [Burak Yavuz] addressed code review v1
2dba642 [Burak Yavuz] [SPARK-3975] Added support for BlockMatrix addition and multiplication
fb7624b [Burak Yavuz] merged master
98c58ea [Burak Yavuz] added tests
cdeb5df [Burak Yavuz] before adding tests
c9bf247 [Burak Yavuz] fixed merge conflicts
1cb0d06 [Burak Yavuz] [SPARK-3976] Added doc
f92a916 [Burak Yavuz] merge upstream
1a63b20 [Burak Yavuz] [SPARK-3974] Remove setPartition method. Isn't required
1e8bb2a [Burak Yavuz] [SPARK-3974] Change return type of cache and persist
e3d24c3 [Burak Yavuz] [SPARK-3976] Pulled upstream changes
fa3774f [Burak Yavuz] [SPARK-3976] updated matrix multiplication and addition implementation
239ab4b [Burak Yavuz] [SPARK-3974] Addressed @jkbradley's comments
add7b05 [Burak Yavuz] [SPARK-3976] Updated code according to upstream changes
e29acfd [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into SPARK-3976
3127233 [Burak Yavuz] fixed merge conflicts with upstream
ba414d2 [Burak Yavuz] [SPARK-3974] fixed frobenius norm
ab6cde0 [Burak Yavuz] [SPARK-3974] Modifications cleaning code up, making size calculation more robust
9ae85aa [Burak Yavuz] [SPARK-3974] Made partitioner a variable inside BlockMatrix instead of a constructor variable
d033861 [Burak Yavuz] [SPARK-3974] Removed SubMatrixInfo and added constructor without partitioner
8e954ab [Burak Yavuz] save changes
bbeae8c [Burak Yavuz] merged master
987ea53 [Burak Yavuz] merged master
49b9586 [Burak Yavuz] [SPARK-3974] Updated testing utils from master
645afbe [Burak Yavuz] [SPARK-3974] Pull latest master
beb1edd [Burak Yavuz] merge conflicts fixed
f41d8db [Burak Yavuz] update tests
b05aabb [Burak Yavuz] [SPARK-3974] Updated tests to reflect changes
56b0546 [Burak Yavuz] updates from 3974 PR
b7b8a8f [Burak Yavuz] pull updates from master
b2dec63 [Burak Yavuz] Pull changes from 3974
19c17e8 [Burak Yavuz] [SPARK-3974] Changed blockIdRow and blockIdCol
5f062e6 [Burak Yavuz] updates with 3974
6729fbd [Burak Yavuz] Updated with respect to SPARK-3974 PR
589fbb6 [Burak Yavuz] [SPARK-3974] Code review feedback addressed
63a4858 [Burak Yavuz] added grid multiplication
aa8f086 [Burak Yavuz] [SPARK-3974] Additional comments added
7381b99 [Burak Yavuz] merge with PR1
f378e16 [Burak Yavuz] [SPARK-3974] Block Matrix Abstractions ready
b693209 [Burak Yavuz] Ready for Pull request
2015-01-31 00:47:30 -08:00
martinzapletal 34250a613c [MLLIB][SPARK-3278] Monotone (Isotonic) regression using parallel pool adjacent violators algorithm
This PR introduces an API for Isotonic regression and one algorithm implementing it, Pool adjacent violators.

The Isotonic regression problem is sufficiently described in [Floudas, Pardalos, Encyclopedia of Optimization](http://books.google.co.uk/books?id=gtoTkL7heS0C&pg=RA2-PA87&lpg=RA2-PA87&dq=pooled+adjacent+violators+code&source=bl&ots=ZzQbZXVJnn&sig=reH_hBV6yIb9BeZNTF9092vD8PY&hl=en&sa=X&ei=WmF2VLiOIZLO7Qa-t4Bo&ved=0CD8Q6AEwBA#v=onepage&q&f=false), [Wikipedia](http://en.wikipedia.org/wiki/Isotonic_regression) or [Stat Wiki](http://stat.wikia.com/wiki/Isotonic_regression).

Pool adjacent violators was introduced by  M. Ayer et al. in 1955.  A history and development of isotonic regression algorithms is in [Leeuw, Hornik, Mair, Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods](http://www.jstatsoft.org/v32/i05/paper) and list of available algorithms including their complexity is listed in [Stout, Fastest Isotonic Regression Algorithms](http://web.eecs.umich.edu/~qstout/IsoRegAlg_140812.pdf).

An approach to parallelize the computation of PAV was presented in [Kearsley, Tapia, Trosset, An Approach to Parallelizing Isotonic Regression](http://softlib.rice.edu/pub/CRPC-TRs/reports/CRPC-TR96640.pdf).

The implemented Pool adjacent violators algorithm is based on  [Floudas, Pardalos, Encyclopedia of Optimization](http://books.google.co.uk/books?id=gtoTkL7heS0C&pg=RA2-PA87&lpg=RA2-PA87&dq=pooled+adjacent+violators+code&source=bl&ots=ZzQbZXVJnn&sig=reH_hBV6yIb9BeZNTF9092vD8PY&hl=en&sa=X&ei=WmF2VLiOIZLO7Qa-t4Bo&ved=0CD8Q6AEwBA#v=onepage&q&f=false) (Chapter Isotonic regression problems, p. 86) and  [Leeuw, Hornik, Mair, Isotone Optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods](http://www.jstatsoft.org/v32/i05/paper), also nicely formulated in [Tibshirani,  Hoefling, Tibshirani, Nearly-Isotonic Regression](http://www.stat.cmu.edu/~ryantibs/papers/neariso.pdf). Implementation itself inspired by R implementations [Klaus, Strimmer, 2008, fdrtool: Estimation of (Local) False Discovery Rates and Higher Criticism](http://cran.r-project.org/web/packages/fdrtool/index.html) and [R Development Core Team, stats, 2009](https://github.com/lgautier/R-3-0-branch-alt/blob/master/src/library/stats/R/isoreg.R). I ran tests with both these libraries and confirmed they yield the same results. More R implementations referenced in aforementioned [Leeuw, Hornik, Mair, Isotone Optimization in R: Pool-Adjacent-Violators
Algorithm (PAVA) and Active Set Methods](http://www.jstatsoft.org/v32/i05/paper). The implementation is also inspired and cross checked with other implementations: [Ted Harding, 2007](https://stat.ethz.ch/pipermail/r-help/2007-March/127981.html), [scikit-learn](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/_isotonic.pyx), [Andrew Tulloch, 2014, Julia](https://github.com/ajtulloch/Isotonic.jl/blob/master/src/pooled_pava.jl), [Andrew Tulloch, 2014, c++](https://gist.github.com/ajtulloch/9499872), described in [Andrew Tulloch, Speeding up isotonic regression in scikit-learn by 5,000x](http://tullo.ch/articles/speeding-up-isotonic-regression/), [Fabian Pedregosa, 2012](https://gist.github.com/fabianp/3081831), [Sreangsu Acharyya. libpav](f744bc1b0f/src/pav.h?at=default) and [Gustav Larsson](https://gist.github.com/gustavla/9499068).

Author: martinzapletal <zapletal-martin@email.cz>
Author: Xiangrui Meng <meng@databricks.com>
Author: Martin Zapletal <zapletal-martin@email.cz>

Closes #3519 from zapletal-martin/SPARK-3278 and squashes the following commits:

5a54ea4 [Martin Zapletal] Merge pull request #2 from mengxr/isotonic-fix-java
37ba24e [Xiangrui Meng] fix java tests
e3c0e44 [martinzapletal] Merge remote-tracking branch 'origin/SPARK-3278' into SPARK-3278
d8feb82 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
ded071c [Martin Zapletal] Merge pull request #1 from mengxr/SPARK-3278
4dfe136 [Xiangrui Meng] add cache back
0b35c15 [Xiangrui Meng] compress pools and update tests
35d044e [Xiangrui Meng] update paraPAVA
077606b [Xiangrui Meng] minor
05422a8 [Xiangrui Meng] add unit test for model construction
5925113 [Xiangrui Meng] Merge remote-tracking branch 'zapletal-martin/SPARK-3278' into SPARK-3278
80c6681 [Xiangrui Meng] update IRModel
3da56e5 [martinzapletal] SPARK-3278 fixed indentation error
75eac55 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
88eb4e2 [martinzapletal] SPARK-3278 changes after PR comments https://github.com/apache/spark/pull/3519. Isotonic parameter removed from algorithm, defined behaviour for multiple data points with the same feature value, added tests to verify it
e60a34f [martinzapletal] SPARK-3278 changes after PR comments https://github.com/apache/spark/pull/3519. Styling and comment fixes.
d93c8f9 [martinzapletal] SPARK-3278 changes after PR comments https://github.com/apache/spark/pull/3519. Change to IsotonicRegression api. Isotonic parameter now follows api of other mllib algorithms
1fff77d [martinzapletal] SPARK-3278 changes after PR comments https://github.com/apache/spark/pull/3519. Java api changes, test refactoring, comments and citations, isotonic regression model validations, linear interpolation for predictions
12151e6 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
7aca4cc [martinzapletal] SPARK-3278 comment spelling
9ae9d53 [martinzapletal] SPARK-3278 changes after PR feedback https://github.com/apache/spark/pull/3519. Binary search used for isotonic regression model predictions
fad4bf9 [martinzapletal] SPARK-3278 changes after PR comments https://github.com/apache/spark/pull/3519
ce0e30c [martinzapletal] SPARK-3278 readability refactoring
f90c8c7 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
0d14bd3 [martinzapletal] SPARK-3278 changed Java api to match Scala api's (Double, Double, Double)
3c2954b [martinzapletal] SPARK-3278 Isotonic regression java api
45aa7e8 [martinzapletal] SPARK-3278 Isotonic regression java api
e9b3323 [martinzapletal] Merge branch 'SPARK-3278-weightedLabeledPoint' into SPARK-3278
823d803 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
941fd1f [martinzapletal] SPARK-3278 Isotonic regression java api
a24e29f [martinzapletal] SPARK-3278 refactored weightedlabeledpoint to (double, double, double) and updated api
deb0f17 [martinzapletal] SPARK-3278 refactored weightedlabeledpoint to (double, double, double) and updated api
8cefd18 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278-weightedLabeledPoint
cab5a46 [martinzapletal] SPARK-3278 PR 3519 refactoring WeightedLabeledPoint to tuple as per comments
b8b1620 [martinzapletal] Removed WeightedLabeledPoint. Replaced by tuple of doubles
34760d5 [martinzapletal] Removed WeightedLabeledPoint. Replaced by tuple of doubles
089bf86 [martinzapletal] Removed MonotonicityConstraint, Isotonic and Antitonic constraints. Replced by simple boolean
c06f88c [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
6046550 [martinzapletal] SPARK-3278 scalastyle errors resolved
8f5daf9 [martinzapletal] SPARK-3278 added comments and cleaned up api to consistently handle weights
629a1ce [martinzapletal] SPARK-3278 added isotonic regression for weighted data. Added tests for Java api
05d9048 [martinzapletal] SPARK-3278 isotonic regression refactoring and api changes
961aa05 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-3278
3de71d0 [martinzapletal] SPARK-3278 added initial version of Isotonic regression algorithm including proposed API
2015-01-31 00:46:02 -08:00
Reynold Xin 636408311d [SPARK-5307] Add a config option for SerializationDebugger.
Just in case there is a bug in the SerializationDebugger that makes error reporting worse than it was.

Author: Reynold Xin <rxin@databricks.com>

Closes #4297 from rxin/ser-config and squashes the following commits:

f1d4629 [Reynold Xin] [SPARK-5307] Add a config option for SerializationDebugger.
2015-01-31 00:06:36 -08:00
kai f54c9f607b [SQL] remove redundant field "childOutput" from execution.Aggregate, use child.output instead
Author: kai <kaizeng@eecs.berkeley.edu>

Closes #4291 from kai-zeng/aggregate-fix and squashes the following commits:

78658ef [kai] remove redundant field "childOutput"
2015-01-30 23:19:10 -08:00
Reynold Xin 740a56862b [SPARK-5307] SerializationDebugger
This patch adds a SerializationDebugger that is used to add serialization path to a NotSerializableException. When a NotSerializableException is encountered, the debugger visits the object graph to find the path towards the object that cannot be serialized, and constructs information to help user to find the object.

The patch uses the internals of JVM serialization (in particular, heavy usage of ObjectStreamClass). Compared with an earlier attempt, this one provides extra information including field names, array offsets, writeExternal calls, etc.

An example serialization stack:
```
Serialization stack:
  - object not serializable (class: org.apache.spark.serializer.NotSerializable, value: org.apache.spark.serializer.NotSerializable2c43caa4)
  - element of array (index: 0)
  - array (class [Ljava.lang.Object;, size 1)
  - field (class: org.apache.spark.serializer.SerializableArray, name: arrayField, type: class [Ljava.lang.Object;)
  - object (class org.apache.spark.serializer.SerializableArray, org.apache.spark.serializer.SerializableArray193c5908)
  - writeExternal data
  - externalizable object (class org.apache.spark.serializer.ExternalizableClass, org.apache.spark.serializer.ExternalizableClass320bdadc)
```

Author: Reynold Xin <rxin@databricks.com>

Closes #4098 from rxin/SerializationDebugger and squashes the following commits:

553b3ff [Reynold Xin] Update SerializationDebuggerSuite.scala
572d0cb [Reynold Xin] Disable automatically when reflection fails.
b349b77 [Reynold Xin] [SPARK-5307] SerializationDebugger to help debug NotSerializableException - take 2
2015-01-30 22:34:10 -08:00
Joseph K. Bradley e643de42a7 [SPARK-5504] [sql] convertToCatalyst should support nested arrays
After the recent refactoring, convertToCatalyst in ScalaReflection does not recurse on Arrays. It should.

The test suite modification made the test fail before the fix in ScalaReflection.  The fix makes the test suite succeed.

CC: marmbrus

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4295 from jkbradley/SPARK-5504 and squashes the following commits:

6b7276d [Joseph K. Bradley] Fixed issue in ScalaReflection.convertToCatalyst with Arrays with non-primitive types. Modified test suite so it failed before the fix and works after the fix.
2015-01-30 15:40:14 -08:00
Travis Galoppo 986977340d SPARK-5400 [MLlib] Changed name of GaussianMixtureEM to GaussianMixture
Decoupling the model and the algorithm

Author: Travis Galoppo <tjg2107@columbia.edu>

Closes #4290 from tgaloppo/spark-5400 and squashes the following commits:

9c1534c [Travis Galoppo] Fixed invokation instructions in comments
d848076 [Travis Galoppo] SPARK-5400 Changed name of GaussianMixtureEM to GaussianMixture to separate model from algorithm
2015-01-30 15:32:25 -08:00
sboeschhuawei f377431a57 [SPARK-4259][MLlib]: Add Power Iteration Clustering Algorithm with Gaussian Similarity Function
Add single pseudo-eigenvector PIC
Including documentations and updated pom.xml with the following codes:
mllib/src/main/scala/org/apache/spark/mllib/clustering/PIClustering.scala
mllib/src/test/scala/org/apache/spark/mllib/clustering/PIClusteringSuite.scala

Author: sboeschhuawei <stephen.boesch@huawei.com>
Author: Fan Jiang <fanjiang.sc@huawei.com>
Author: Jiang Fan <fjiang6@gmail.com>
Author: Stephen Boesch <stephen.boesch@huawei.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #4254 from fjiang6/PIC and squashes the following commits:

4550850 [sboeschhuawei] Removed pic test data
f292f31 [Stephen Boesch] Merge pull request #44 from mengxr/SPARK-4259
4b78aaf [Xiangrui Meng] refactor PIC
24fbf52 [sboeschhuawei] Updated API to be similar to KMeans plus other changes requested by Xiangrui on the PR
c12dfc8 [sboeschhuawei] Removed examples files and added pic_data.txt. Revamped testcases yet to come
92d4752 [sboeschhuawei] Move the Guassian/ Affinity matrix calcs out of PIC. Presently in the test suite
7ebd149 [sboeschhuawei] Incorporate Xiangrui's first set of PR comments except restructure PIC.run to take Graph but do not remove Gaussian
121e4d5 [sboeschhuawei] Remove unused testing data files
1c3a62e [sboeschhuawei] removed matplot.py and reordered all private methods to bottom of PIC
218a49d [sboeschhuawei] Applied Xiangrui's comments - especially removing RDD/PICLinalg classes and making noncritical methods private
43ab10b [sboeschhuawei] Change last two println's to log4j logger
88aacc8 [sboeschhuawei] Add assert to testcase on cluster sizes
24f438e [sboeschhuawei] fixed incorrect markdown in clustering doc
060e6bf [sboeschhuawei] Added link to PIC doc from the main clustering md doc
be659e3 [sboeschhuawei] Added mllib specific log4j
90e7fa4 [sboeschhuawei] Converted from custom Linalg routines to Breeze: added JavaDoc comments; added Markdown documentation
bea48ea [sboeschhuawei] Converted custom Linear Algebra datatypes/routines to use Breeze.
b29c0db [Fan Jiang] Update PIClustering.scala
ace9749 [Fan Jiang] Update PIClustering.scala
a112f38 [sboeschhuawei] Added graphx main and test jars as dependencies to mllib/pom.xml
f656c34 [sboeschhuawei] Added iris dataset
b7dbcbe [sboeschhuawei] Added axes and combined into single plot for matplotlib
a2b1e57 [sboeschhuawei] Revert inadvertent update to KMeans
9294263 [sboeschhuawei] Added visualization/plotting of input/output data
e5df2b8 [sboeschhuawei] First end to end working PIC
0700335 [sboeschhuawei] First end to end working version: but has bad performance issue
32a90dc [sboeschhuawei] Update circles test data values
0ef163f [sboeschhuawei] Added ConcentricCircles data generation and KMeans clustering
3fd5bc8 [sboeschhuawei] PIClustering is running in new branch (up to the pseudo-eigenvector convergence step)
d5aae20 [Jiang Fan] Adding Power Iteration Clustering and Suite test
a3c5fbe [Jiang Fan] Adding Power Iteration Clustering
2015-01-30 14:09:49 -08:00
Burak Yavuz 6ee8338b37 [SPARK-5486] Added validate method to BlockMatrix
The `validate` method will allow users to debug their `BlockMatrix`, if operations like `add` or `multiply` return unexpected results. It checks the following properties in a `BlockMatrix`:
- Are the dimensions of the `BlockMatrix` consistent with what the user entered: (`nRows`, `nCols`)
- Are the dimensions of each `MatrixBlock` consistent with what the user entered: (`rowsPerBlock`, `colsPerBlock`)
- Are there blocks with duplicate indices

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #4279 from brkyvz/SPARK-5486 and squashes the following commits:

c152a73 [Burak Yavuz] addressed code review v2
598c583 [Burak Yavuz] merged master
b55ac5c [Burak Yavuz] addressed code review v1
25f083b [Burak Yavuz] simplify implementation
0aa519a [Burak Yavuz] [SPARK-5486] Added validate method to BlockMatrix
2015-01-30 13:59:10 -08:00
Xiangrui Meng 0a95085f09 [SPARK-5496][MLLIB] Allow both classification and Classification in Algo for trees.
to be backward compatible.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4287 from mengxr/SPARK-5496 and squashes the following commits:

a025c53 [Xiangrui Meng] Allow both classification and Classification in Algo for trees.
2015-01-30 10:08:07 -08:00
Joseph J.C. Tang 54d95758fc [MLLIB] SPARK-4846: throw a RuntimeException and give users hints to increase the minCount
When the vocabSize\*vectorSize is larger than Int.MaxValue/8, we try to throw a RuntimeException. Because under this circumstance it would definitely throw an OOM when allocating memory to serialize the arrays syn0Global&syn1Global.   syn0Global&syn1Global are float arrays. Serializing them should need a byte array of more than 8 times of syn0Global's size.
Also if we catch an OOM even if vocabSize\*vectorSize is less than Int.MaxValue/8, we should give users hints to increase the minCount or decrease the vectorSize.

Author: Joseph J.C. Tang <jinntrance@gmail.com>

Closes #4247 from jinntrance/w2v-fix and squashes the following commits:

b5eb71f [Joseph J.C. Tang] throw a RuntimeException and give users hints regarding the vectorSize&minCount
2015-01-30 10:07:26 -08:00
Sandy Ryza 254eaa4d35 SPARK-5393. Flood of util.RackResolver log messages after SPARK-1714
Previously I had tried to solve this with by adding a line in Spark's log4j-defaults.properties.

The issue with the message in log4j-defaults.properties was that the log4j.properties packaged inside Hadoop was getting picked up instead. While it would be ideal to fix that as well, we still want to quiet this in situations where a user supplies their own custom log4j properties.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #4192 from sryza/sandy-spark-5393 and squashes the following commits:

4d5dedc [Sandy Ryza] Only set log level if unset
46e07c5 [Sandy Ryza] SPARK-5393. Flood of util.RackResolver log messages after SPARK-1714
2015-01-30 11:31:54 -06:00
Takuya UESHIN 6f21dce5f4 [SPARK-5457][SQL] Add missing DSL for ApproxCountDistinct.
Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #4250 from ueshin/issues/SPARK-5457 and squashes the following commits:

3c05e59 [Takuya UESHIN] Remove parameter to use default value of ApproxCountDistinct.
faea19d [Takuya UESHIN] Use overload instead of default value for Java support.
d1cca38 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-5457
663d43d [Takuya UESHIN] Add missing DSL for ApproxCountDistinct.
2015-01-30 01:21:35 -08:00
Kazuki Taniguchi bc1fc9b60d [SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees
This PR is implementing the Gradient Boosted Trees for Python API.

Author: Kazuki Taniguchi <kazuki.t.1018@gmail.com>

Closes #3951 from kazk1018/gbt_for_py and squashes the following commits:

620d247 [Kazuki Taniguchi] [SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees
2015-01-30 00:39:44 -08:00
Burak Yavuz dd4d84cf80 [SPARK-5322] Added transpose functionality to BlockMatrix
BlockMatrices can now be transposed!

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #4275 from brkyvz/SPARK-5322 and squashes the following commits:

33806ed [Burak Yavuz] added lazy comment
33e9219 [Burak Yavuz] made transpose lazy
5a274cd [Burak Yavuz] added cached tests
5dcf85c [Burak Yavuz] [SPARK-5322] Added transpose functionality to BlockMatrix
2015-01-29 21:26:29 -08:00
Reynold Xin 80def9deb3 [SQL] Support df("*") to select all columns in a data frame.
This PR makes Star a trait, and provides two implementations: UnresolvedStar (used for *, tblName.*) and ResolvedStar (used for df("*")).

Author: Reynold Xin <rxin@databricks.com>

Closes #4283 from rxin/df-star and squashes the following commits:

c9cba3e [Reynold Xin] Removed mapFunction in UnresolvedStar.
1a3a1d7 [Reynold Xin] [SQL] Support df("*") to select all columns in a data frame.
2015-01-29 19:09:08 -08:00
Josh Rosen 22271f9693 [SPARK-5462] [SQL] Use analyzed query plan in DataFrame.apply()
This patch changes DataFrame's `apply()` method to use an analyzed query plan when resolving column names.  This fixes a bug where `apply` would throw "invalid call to qualifiers on unresolved object" errors when called on DataFrames constructed via `SQLContext.sql()`.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #4282 from JoshRosen/SPARK-5462 and squashes the following commits:

b9e6da2 [Josh Rosen] [SPARK-5462] Use analyzed query plan in DataFrame.apply().
2015-01-29 18:23:05 -08:00
Davies Liu 5c746eedda [SPARK-5395] [PySpark] fix python process leak while coalesce()
Currently, the Python process is released into pool only after the task had finished, it cause many process forked if coalesce() is called.

This PR will change it to release the process as soon as read all the data from it (finish the partition), then a process could be reused to process multiple partitions in a single task.

Author: Davies Liu <davies@databricks.com>

Closes #4238 from davies/py_leak and squashes the following commits:

ec80a43 [Davies Liu] add @volatile
6da437a [Davies Liu] address comments
24ed322 [Davies Liu] fix python process leak while coalesce()
2015-01-29 17:28:37 -08:00
Reynold Xin ce9c43ba8c [SQL] DataFrame API improvements
1. Added Dsl.column in case Dsl.col is shadowed.
2. Allow using String to specify the target data type in cast.
3. Support sorting on multiple columns using column names.
4. Added Java API test file.

Author: Reynold Xin <rxin@databricks.com>

Closes #4280 from rxin/dsl1 and squashes the following commits:

33ecb7a [Reynold Xin] Add the Java test.
d06540a [Reynold Xin] [SQL] DataFrame API improvements.
2015-01-29 17:24:00 -08:00
Patrick Wendell d2071e8f45 Revert "[WIP] [SPARK-3996]: Shade Jetty in Spark deliverables"
This reverts commit f240fe390b.
2015-01-29 17:14:27 -08:00
Yoshihiro Shimizu 5338772f3f remove 'return'
looks unnecessary 😀

Author: Yoshihiro Shimizu <shimizu@amoad.com>

Closes #4268 from y-shimizu/remove-return and squashes the following commits:

12be0e9 [Yoshihiro Shimizu] remove 'return'
2015-01-29 16:55:00 -08:00
Patrick Wendell f240fe390b [WIP] [SPARK-3996]: Shade Jetty in Spark deliverables
This patch piggy-back's on vanzin's work to simplify the Guava shading,
and adds Jetty as a shaded library in Spark. Other than adding Jetty,
it consilidates the \<artifactSet\>'s into the root pom. I found it was
a bit easier to follow that way, since you don't need to look into
child pom's to find out specific artifact sets included in shading.

Author: Patrick Wendell <patrick@databricks.com>

Closes #4252 from pwendell/jetty and squashes the following commits:

19f0710 [Patrick Wendell] More code review feedback
961452d [Patrick Wendell] Responding to feedback from Marcello
6df25ca [Patrick Wendell] [WIP] [SPARK-3996]: Shade Jetty in Spark deliverables
2015-01-29 16:31:19 -08:00
Josh Rosen 0bb15f22d1 [SPARK-5464] Fix help() for Python DataFrame instances
This fixes an exception that prevented users from calling `help()` on Python DataFrame instances.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #4278 from JoshRosen/SPARK-5464-python-dataframe-help-command and squashes the following commits:

08f95f7 [Josh Rosen] Fix exception when calling help() on Python DataFrame instances
2015-01-29 16:23:20 -08:00
Yin Huai c00d517d66 [SPARK-4296][SQL] Trims aliases when resolving and checking aggregate expressions
I believe that SPARK-4296 has been fixed by 3684fd21e1. I am adding tests based #3910 (change the udf to HiveUDF instead).

Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Lian <lian@databricks.com>

Closes #4010 from yhuai/SPARK-4296-yin and squashes the following commits:

6343800 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-4296-yin
6cfadd2 [Yin Huai] Actually, this issue has been fixed by 3684fd21e1.
d42b707 [Yin Huai] Update comment.
8b3a274 [Yin Huai] Since expressions in grouping expressions can have aliases, which can be used by the outer query block,     revert this change.
443538d [Cheng Lian] Trims aliases when resolving and checking aggregate expressions
2015-01-29 15:49:34 -08:00
wangfei c1b3eebf97 [SPARK-5373][SQL] Literal in agg grouping expressions leads to incorrect result
`select key, count( * ) from src group by key, 1`  will get the wrong answer.

e.g. for this table
```
  val testData2 =
    TestSQLContext.sparkContext.parallelize(
      TestData2(1, 1) ::
      TestData2(1, 2) ::
      TestData2(2, 1) ::
      TestData2(2, 2) ::
      TestData2(3, 1) ::
      TestData2(3, 2) :: Nil, 2).toSchemaRDD
  testData2.registerTempTable("testData2")
```
result of `SELECT a, count(1) FROM testData2 GROUP BY a, 1`  is

```
                     [1,1]
                     [2,2]
                     [3,1]
```

Author: wangfei <wangfei1@huawei.com>

Closes #4169 from scwf/agg-bug and squashes the following commits:

05751db [wangfei] fix bugs when literal in agg grouping expressioons
2015-01-29 15:47:18 -08:00
wangfei fbaf9e0896 [SPARK-5367][SQL] Support star expression in udf
now spark sql does not support star expression in udf, run the following sql by spark-sql will get error
```
select concat(*) from src
```

Author: wangfei <wangfei1@huawei.com>
Author: scwf <wangfei1@huawei.com>

Closes #4163 from scwf/udf-star and squashes the following commits:

9db7b39 [wangfei] addressed comments
da1da09 [scwf] minor fix
f87b5f9 [scwf] added test case
587bf7e [wangfei] compile fix
eb93c16 [wangfei] fix star resolve issue in udf
2015-01-29 15:44:53 -08:00
Yash Datta de221ea032 [SPARK-4786][SQL]: Parquet filter pushdown for castable types
Enable parquet filter pushdown of castable types like short, byte that can be cast to integer

Author: Yash Datta <Yash.Datta@guavus.com>

Closes #4156 from saucam/filter_short and squashes the following commits:

a403979 [Yash Datta] SPARK-4786: Fix styling issues
d029866 [Yash Datta] SPARK-4786: Add test case
cb2e0d9 [Yash Datta] SPARK-4786: Parquet filter pushdown for castable types
2015-01-29 15:42:23 -08:00
Michael Davies 940f375611 [SPARK-5309][SQL] Add support for dictionaries in PrimitiveConverter for Strin...
...gs.

Parquet Converters allow developers to take advantage of dictionary encoding of column data to reduce Column Binary decoding.

The Spark PrimitiveConverter was not using that API and consequently for String columns that used dictionary compression repeated Binary to String conversions for the same String.

In measurements this could account for over 25% of entire query time.
For example a 500M row table split across 16 blocks was aggregated and summed in a litte under 30s before this change and a little under 20s after the change.

Author: Michael Davies <Michael.BellDavies@gmail.com>

Closes #4187 from MickDavies/SPARK-5309-2 and squashes the following commits:

327287e [Michael Davies] SPARK-5309: Add support for dictionaries in PrimitiveConverter for Strings.
33c002c [Michael Davies] SPARK-5309: Add support for dictionaries in PrimitiveConverter for Strings.
2015-01-29 15:40:59 -08:00
Liang-Chi Hsieh bce0ba1fbd [SPARK-5429][SQL] Use javaXML plan serialization for Hive golden answers on Hive 0.13.1
I found that running `HiveComparisonTest.createQueryTest` to generate Hive golden answer files on Hive 0.13.1 would throw KryoException. I am not sure if this can be reproduced by others. Since Hive 0.13.0, Kryo plan serialization is introduced to replace javaXML as default plan serialization format. This is a quick fix to set hive configuration to use javaXML serialization.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #4223 from viirya/fix_hivetest and squashes the following commits:

97a8760 [Liang-Chi Hsieh] Use javaXML plan serialization.
2015-01-29 15:28:22 -08:00
Reynold Xin 715632232d [SPARK-5445][SQL] Consolidate Java and Scala DSL static methods.
Turns out Scala does generate static methods for ones defined in a companion object. Finally no need to separate api.java.dsl and api.scala.dsl.

Author: Reynold Xin <rxin@databricks.com>

Closes #4276 from rxin/dsl and squashes the following commits:

30aa611 [Reynold Xin] Add all files.
1a9d215 [Reynold Xin] [SPARK-5445][SQL] Consolidate Java and Scala DSL static methods.
2015-01-29 15:13:09 -08:00
Marcelo Vanzin f9e569452e [SPARK-5466] Add explicit guava dependencies where needed.
One side-effect of shading guava is that it disappears as a transitive
dependency. For Hadoop 2.x, this was masked by the fact that Hadoop
itself depends on guava. But certain versions of Hadoop 1.x also
shade guava, leaving either no guava or some random version pulled
by another dependency on the classpath.

So be explicit about the dependency in modules that use guava directly,
which is the right thing to do anyway.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #4272 from vanzin/SPARK-5466 and squashes the following commits:

e3f30e5 [Marcelo Vanzin] Dependency for catalyst is not needed.
d3b2c84 [Marcelo Vanzin] [SPARK-5466] Add explicit guava dependencies where needed.
2015-01-29 13:00:45 -08:00
Xiangrui Meng a3dc618486 [SPARK-5477] refactor stat.py
There is only a single `stat.py` file for the `mllib.stat` package. We recently added `MultivariateGaussian` under `mllib.stat.distribution` in Scala/Java. It would be nice to refactor `stat.py` and make it easy to expand. Note that `ChiSqTestResult` is moved from `mllib.stat` to `mllib.stat.test`. The latter is used in Scala/Java. It is only used in the return value of `Statistics.chiSqTest`, so this should be an okay change.

davies

Author: Xiangrui Meng <meng@databricks.com>

Closes #4266 from mengxr/py-stat-refactor and squashes the following commits:

1a5e1db [Xiangrui Meng] refactor stat.py
2015-01-29 10:11:44 -08:00
Reynold Xin 5ad78f6205 [SQL] Various DataFrame DSL update.
1. Added foreach, foreachPartition, flatMap to DataFrame.
2. Added col() in dsl.
3. Support renaming columns in toDataFrame.
4. Support type inference on arrays (in addition to Seq).
5. Updated mllib to use the new DSL.

Author: Reynold Xin <rxin@databricks.com>

Closes #4260 from rxin/sql-dsl-update and squashes the following commits:

73466c1 [Reynold Xin] Fixed LogisticRegression. Also added better error message for resolve.
fab3ccc [Reynold Xin] Bug fix.
d31fcd2 [Reynold Xin] Style fix.
62608c4 [Reynold Xin] [SQL] Various DataFrame DSL update.
2015-01-29 00:01:10 -08:00
Burak Yavuz a63be1a18f [SPARK-3977] Conversion methods for BlockMatrix to other Distributed Matrices
The conversion methods for `BlockMatrix`. Conversions go through `CoordinateMatrix` in order to cause a shuffle so that intermediate operations will be stored on disk and the expensive initial computation will be mitigated.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #4256 from brkyvz/SPARK-3977PR and squashes the following commits:

4df37fe [Burak Yavuz] moved TODO inside code block
b049c07 [Burak Yavuz] addressed code review feedback v1
66cb755 [Burak Yavuz] added default toBlockMatrix conversion
851f2a2 [Burak Yavuz] added better comments and checks
cdb9895 [Burak Yavuz] [SPARK-3977] Conversion methods for BlockMatrix to other Distributed Matrices
2015-01-28 23:42:07 -08:00
Reynold Xin 5b9760de8d [SPARK-5445][SQL] Made DataFrame dsl usable in Java
Also removed the literal implicit transformation since it is pretty scary for API design. Instead, created a new lit method for creating literals. This doesn't break anything from a compatibility perspective because Literal was added two days ago.

Author: Reynold Xin <rxin@databricks.com>

Closes #4241 from rxin/df-docupdate and squashes the following commits:

c0f4810 [Reynold Xin] Fix Python merge conflict.
094c7d7 [Reynold Xin] Minor style fix. Reset Python tests.
3c89f4a [Reynold Xin] Package.
dfe6962 [Reynold Xin] Updated Python aggregate.
5dd4265 [Reynold Xin] Made dsl Java callable.
14b3c27 [Reynold Xin] Fix literal expression for symbols.
68b31cb [Reynold Xin] Literal.
4cfeb78 [Reynold Xin] [SPARK-5097][SQL] Address DataFrame code review feedback.
2015-01-28 19:10:32 -08:00
Xiangrui Meng 4ee79c71af [SPARK-5430] move treeReduce and treeAggregate from mllib to core
We have seen many use cases of `treeAggregate`/`treeReduce` outside the ML domain. Maybe it is time to move them to Core. pwendell

Author: Xiangrui Meng <meng@databricks.com>

Closes #4228 from mengxr/SPARK-5430 and squashes the following commits:

20ad40d [Xiangrui Meng] exclude tree* from mima
e89a43e [Xiangrui Meng] fix compile and update java doc
3ae1a4b [Xiangrui Meng] add treeReduce/treeAggregate to Python
6f948c5 [Xiangrui Meng] add treeReduce/treeAggregate to JavaRDDLike
d600b6c [Xiangrui Meng] move treeReduce and treeAggregate to core
2015-01-28 17:26:03 -08:00
Xiangrui Meng e80dc1c5a8 [SPARK-4586][MLLIB] Python API for ML pipeline and parameters
This PR adds Python API for ML pipeline and parameters. The design doc can be found on the JIRA page. It includes transformers and an estimator to demo the simple text classification example code.

TODO:
- [x] handle parameters in LRModel
- [x] unit tests
- [x] missing some docs

CC: davies jkbradley

Author: Xiangrui Meng <meng@databricks.com>
Author: Davies Liu <davies@databricks.com>

Closes #4151 from mengxr/SPARK-4586 and squashes the following commits:

415268e [Xiangrui Meng] remove inherit_doc from __init__
edbd6fe [Xiangrui Meng] move Identifiable to ml.util
44c2405 [Xiangrui Meng] Merge pull request #2 from davies/ml
dd1256b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586
14ae7e2 [Davies Liu] fix docs
54ca7df [Davies Liu] fix tests
78638df [Davies Liu] Merge branch 'SPARK-4586' of github.com:mengxr/spark into ml
fc59a02 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586
1dca16a [Davies Liu] refactor
090b3a3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into ml
0882513 [Xiangrui Meng] update doc style
a4f4dbf [Xiangrui Meng] add unit test for LR
7521d1c [Xiangrui Meng] add unit tests to HashingTF and Tokenizer
ba0ba1e [Xiangrui Meng] add unit tests for pipeline
0586c7b [Xiangrui Meng] add more comments to the example
5153cff [Xiangrui Meng] simplify java models
036ca04 [Xiangrui Meng] gen numFeatures
46fa147 [Xiangrui Meng] update mllib/pom.xml to include python files in the assembly
1dcc17e [Xiangrui Meng] update code gen and make param appear in the doc
f66ba0c [Xiangrui Meng] make params a property
d5efd34 [Xiangrui Meng] update doc conf and move embedded param map to instance attribute
f4d0fe6 [Xiangrui Meng] use LabeledDocument and Document in example
05e3e40 [Xiangrui Meng] update example
d3e8dbe [Xiangrui Meng] more docs optimize pipeline.fit impl
56de571 [Xiangrui Meng] fix style
d0c5bb8 [Xiangrui Meng] a working copy
bce72f4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586
17ecfb9 [Xiangrui Meng] code gen for shared params
d9ea77c [Xiangrui Meng] update doc
c18dca1 [Xiangrui Meng] make the example working
dadd84e [Xiangrui Meng] add base classes and docs
a3015cf [Xiangrui Meng] add Estimator and Transformer
46eea43 [Xiangrui Meng] a pipeline in python
33b68e0 [Xiangrui Meng] a working LR
2015-01-28 17:14:23 -08:00
Michael Nazario e023112d33 [SPARK-5441][pyspark] Make SerDeUtil PairRDD to Python conversions more robust
SerDeUtil.pairRDDToPython and SerDeUtil.pythonToPairRDD now both support empty RDDs by checking the result of take(1) instead of calling first which throws an exception.

Author: Michael Nazario <mnazario@palantir.com>

Closes #4236 from mnazario/feature/empty-first and squashes the following commits:

a531c0c [Michael Nazario] Added regression tests for SPARK-5441
e3b2fb6 [Michael Nazario] Added acceptance of the empty case
2015-01-28 13:58:46 -08:00
Yandu Oppacher 3bead67d59 [SPARK-4387][PySpark] Refactoring python profiling code to make it extensible
This PR is based on #3255 , fix conflicts and code style.

Closes #3255.

Author: Yandu Oppacher <yandu.oppacher@jadedpixel.com>
Author: Davies Liu <davies@databricks.com>

Closes #3901 from davies/refactor-python-profile-code and squashes the following commits:

b4a9306 [Davies Liu] fix tests
4b79ce8 [Davies Liu] add docstring for profiler_cls
2700e47 [Davies Liu] use BasicProfiler as default
349e341 [Davies Liu] more refactor
6a5d4df [Davies Liu] refactor and fix tests
31bf6b6 [Davies Liu] fix code style
0864b5d [Yandu Oppacher] Remove unused method
76a6c37 [Yandu Oppacher] Added a profile collector to accumulate the profilers per stage
9eefc36 [Yandu Oppacher] Fix doc
9ace076 [Yandu Oppacher] Refactor of profiler, and moved tests around
8739aff [Yandu Oppacher] Code review fixes
9bda3ec [Yandu Oppacher] Refactor profiler code
2015-01-28 13:48:06 -08:00
Ryan Williams a731314c31 [SPARK-5417] Remove redundant executor-id set() call
This happens inside SparkEnv initialization as of #4194

Author: Ryan Williams <ryan.blake.williams@gmail.com>

Closes #4213 from ryan-williams/exec-id-set and squashes the following commits:

b3e4f7b [Ryan Williams] Remove redundant executor-id set() call
2015-01-28 13:04:52 -08:00
Nicholas Chammas d44ee43665 [SPARK-5434] [EC2] Preserve spaces in EC2 path
Fixes [SPARK-5434](https://issues.apache.org/jira/browse/SPARK-5434).

Simple demonstration of the problem and the fix:

```
$ spacey_path="/path/with some/spaces"
$ dirname $spacey_path
usage: dirname path
$ echo $?
1
$ dirname "$spacey_path"
/path/with some
$ echo $?
0
```

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #4224 from nchammas/patch-1 and squashes the following commits:

960711a [Nicholas Chammas] [EC2] Preserve spaces in EC2 path
2015-01-28 12:56:03 -08:00
Andrew Or 84b6ecdef6 [SPARK-5437] Fix DriverSuite and SparkSubmitSuite timeout issues
In DriverSuite, we currently set a timeout of 60 seconds. If after this time the process has not terminated, we leak the process because we never destroy it.

In SparkSubmitSuite, we currently do not have a timeout so the test can hang indefinitely.

Author: Andrew Or <andrew@databricks.com>

Closes #4230 from andrewor14/fix-driver-suite and squashes the following commits:

f5c80fd [Andrew Or] Fix timeout behaviors in both suites
8092c36 [Andrew Or] Stop SparkContext after every individual test
2015-01-28 12:53:22 -08:00
lianhuiwang 81f8f34062 [SPARK-4955]With executor dynamic scaling enabled,executor shoude be added or killed in yarn-cluster mode.
With executor dynamic scaling enabled, executor number shoude be added or killed in yarn-cluster mode.so in yarn-cluster mode, ApplicationMaster start a AMActor that add or kill a executor. then YarnSchedulerActor  in YarnSchedulerBackend send message to am's AMActor.
andrewor14 ChengXiangLi tdas

Author: lianhuiwang <lianhuiwang09@gmail.com>

Closes #3962 from lianhuiwang/SPARK-4955 and squashes the following commits:

48d9ebb [lianhuiwang] update with andrewor14's comments
12426af [lianhuiwang] refactor am's code
45da3b0 [lianhuiwang] remove unrelated code
9318fc1 [lianhuiwang] update with andrewor14's comments
08ba473 [lianhuiwang] address andrewor14's comments
265c36d [lianhuiwang] fix small change
f43bda8 [lianhuiwang] fix address andrewor14's comments
7a7767a [lianhuiwang] fix address andrewor14's comments
bbc4d5a [lianhuiwang] address andrewor14's comments
1b029a4 [lianhuiwang] fix bug
7d33791 [lianhuiwang] in AM create a new actorSystem
2164ea8 [lianhuiwang] fix a min bug
6dfeeec [lianhuiwang] in yarn-cluster mode,executor number can be added or killed.
2015-01-28 12:51:15 -08:00