Commit graph

601 commits

Author SHA1 Message Date
Xusen Yin 8509519d8b [SPARK-5894] [ML] Add polynomial mapper
See [SPARK-5894](https://issues.apache.org/jira/browse/SPARK-5894).

Author: Xusen Yin <yinxusen@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #5245 from yinxusen/SPARK-5894 and squashes the following commits:

dc461a6 [Xusen Yin] merge polynomial expansion v2
6d0c3cc [Xusen Yin] Merge branch 'SPARK-5894' of https://github.com/mengxr/spark into mengxr-SPARK-5894
57bfdd5 [Xusen Yin] Merge branch 'master' into SPARK-5894
3d02a7d [Xusen Yin] Merge branch 'master' into SPARK-5894
a067da2 [Xiangrui Meng] a new approach for poly expansion
0789d81 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5894
4e9aed0 [Xusen Yin] fix test suite
95d8fb9 [Xusen Yin] fix sparse vector indices
8d39674 [Xusen Yin] fix sparse vector expansion error
5998dd6 [Xusen Yin] fix dense vector fillin
fa3ade3 [Xusen Yin] change the functional code into imperative one to speedup
b70e7e1 [Xusen Yin] remove useless case class
6fa236f [Xusen Yin] fix vector slice error
daff601 [Xusen Yin] fix index error of sparse vector
6bd0a10 [Xusen Yin] merge repeated features
419f8a2 [Xusen Yin] need to merge same columns
4ebf34e [Xusen Yin] add test suite of polynomial expansion
372227c [Xusen Yin] add polynomial expansion
2015-04-24 00:39:29 -07:00
Xiangrui Meng 1ed46a60ad [SPARK-7070] [MLLIB] LDA.setBeta should call setTopicConcentration.
jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #5649 from mengxr/SPARK-7070 and squashes the following commits:

c66023c [Xiangrui Meng] setBeta should call setTopicConcentration
2015-04-23 14:46:54 -07:00
wizz 3e91cc273d [SPARK-7085][MLlib] Fix miniBatchFraction parameter in train method called with 4 arguments
Author: wizz <wizz@wizz-dev01.kawasaki.flab.fujitsu.com>

Closes #5658 from kuromatsu-nobuyuki/SPARK-7085 and squashes the following commits:

6ec2d21 [wizz] Fix miniBatchFraction parameter in train method called with 4 arguments
2015-04-23 14:00:07 -07:00
Reynold Xin 2d33323cad [MLlib] Add support for BooleanType to VectorAssembler.
Author: Reynold Xin <rxin@databricks.com>

Closes #5648 from rxin/vectorAssembler-boolean and squashes the following commits:

1bf3d40 [Reynold Xin] [MLlib] Add support for BooleanType to VectorAssembler.
2015-04-22 23:54:48 -07:00
Reynold Xin d20686066e [SPARK-7066][MLlib] VectorAssembler should use NumericType not NativeType.
Author: Reynold Xin <rxin@databricks.com>

Closes #5642 from rxin/mllib-native-type and squashes the following commits:

e23af5b [Reynold Xin] Remove StringType
7cbb205 [Reynold Xin] [SPARK-7066][MLlib] VectorAssembler should use NumericType and StringType, not NativeType.
2015-04-22 21:35:42 -07:00
Reynold Xin 1b85e08509 [MLlib] UnaryTransformer nullability should not depend on PrimitiveType.
Author: Reynold Xin <rxin@databricks.com>

Closes #5644 from rxin/mllib-nullable and squashes the following commits:

a727e5b [Reynold Xin] [MLlib] UnaryTransformer nullability should not depend on primitive types.
2015-04-22 21:35:12 -07:00
Joseph K. Bradley 607eff0edf [SPARK-6113] [ML] Small cleanups after original tree API PR
This does a few clean-ups.  With this PR, all spark.ml tree components have ```private[ml]``` constructors.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #5567 from jkbradley/dt-api-dt2 and squashes the following commits:

2263b5b [Joseph K. Bradley] Added note about tree example issue.
bb9f610 [Joseph K. Bradley] Small cleanups after original tree API PR
2015-04-21 21:44:44 -07:00
Alain ae036d0817 [Minor][MLLIB] Fix a minor formatting bug in toString method in Node.scala
add missing comma and space

Author: Alain <aihe@usc.edu>

Closes #5621 from AiHe/tree-node-issue and squashes the following commits:

159a7bb [Alain] [Minor][MLLIB] Fix a minor formatting bug in toString methods in Node.scala

(cherry picked from commit 4508f01890a723f80d631424ff8eda166a13a727)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-04-21 16:48:05 -07:00
MechCoder 7fe6142cd3 [SPARK-6065] [MLlib] Optimize word2vec.findSynonyms using blas calls
1. Use blas calls to find the dot product between two vectors.
2. Prevent re-computing the L2 norm of the given vector for each word in model.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #5467 from MechCoder/spark-6065 and squashes the following commits:

dd0b0b2 [MechCoder] Preallocate wordVectors
ffc9240 [MechCoder] Minor
6b74c81 [MechCoder] Switch back to native blas calls
da1642d [MechCoder] Explicit types and indexing
64575b0 [MechCoder] Save indexedmap and a wordvecmat instead of matrix
fbe0108 [MechCoder] Made the following changes 1. Calculate norms during initialization. 2. Use Blas calls from linalg.blas
1350cf3 [MechCoder] [SPARK-6065] Optimize word2vec.findSynonynms using blas calls
2015-04-21 16:42:45 -07:00
MechCoder 45c47fa417 [SPARK-6845] [MLlib] [PySpark] Add isTranposed flag to DenseMatrix
Since sparse matrices now support a isTransposed flag for row major data, DenseMatrices should do the same.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #5455 from MechCoder/spark-6845 and squashes the following commits:

525c370 [MechCoder] minor
004a37f [MechCoder] Cast boolean to int
151f3b6 [MechCoder] [WIP] Add isTransposed to pickle DenseMatrix
cc0b90a [MechCoder] [SPARK-6845] Add isTranposed flag to DenseMatrix
2015-04-21 14:36:50 -07:00
Yanbo Liang 1f2f723b0d [SPARK-5990] [MLLIB] Model import/export for IsotonicRegression
Model import/export for IsotonicRegression

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #5270 from yanboliang/spark-5990 and squashes the following commits:

872028d [Yanbo Liang] fix code style
f80ec1b [Yanbo Liang] address comments
49600cc [Yanbo Liang] address comments
429ff7d [Yanbo Liang] store each interval as a record
2b2f5a1 [Yanbo Liang] Model import/export for IsotonicRegression
2015-04-21 00:14:16 -07:00
jrabary 1be207078c [SPARK-5924] Add the ability to specify withMean or withStd parameters with StandarScaler
The current implementation call the default constructor of mllib.feature.StandarScaler without the possibility to specify withMean or withStd options.

Author: jrabary <Jaonary@gmail.com>

Closes #4704 from jrabary/master and squashes the following commits:

fae8568 [jrabary] style fix
8896b0e [jrabary] Comments fix
ef96d73 [jrabary] style fix
8e52607 [jrabary] style fix
edd9d48 [jrabary] Fix default param initialization
17e1a76 [jrabary] Fix default param initialization
298f405 [jrabary] Typo fix
45ed914 [jrabary] Add withMean and withStd params to StandarScaler
2015-04-20 09:47:56 -07:00
zsxwing c776ee8a6f [SPARK-6979][Streaming] Replace JobScheduler.eventActor and JobGenerator.eventActor with EventLoop
Title says it all.

cc rxin tdas

Author: zsxwing <zsxwing@gmail.com>

Closes #5554 from zsxwing/SPARK-6979 and squashes the following commits:

5304350 [zsxwing] Fix NotSerializableException
e9d3479 [zsxwing] Add blank lines
633e279 [zsxwing] Fix NotSerializableException
e496ace [zsxwing] Replace JobGenerator.eventActor with EventLoop
ec6ec58 [zsxwing] Fix the import order
ce0fa73 [zsxwing] Replace JobScheduler.eventActor with EventLoop
2015-04-19 20:48:36 -07:00
zsxwing fa73da0240 [SPARK-6998][MLlib] Make StreamingKMeans 'Serializable'
If `StreamingKMeans` is not `Serializable`, we cannot do checkpoint for applications that using `StreamingKMeans`. So we should make it `Serializable`.

Author: zsxwing <zsxwing@gmail.com>

Closes #5582 from zsxwing/SPARK-6998 and squashes the following commits:

67c2a14 [zsxwing] Make StreamingKMeans 'Serializable'
2015-04-19 20:33:51 -07:00
Joseph K. Bradley a83571acc9 [SPARK-6113] [ml] Stabilize DecisionTree API
This is a PR for cleaning up and finalizing the DecisionTree API.  PRs for ensembles will follow once this is merged.

### Goal

Here is the description copied from the JIRA (for both trees and ensembles):

> **Issue**: The APIs for DecisionTree and ensembles (RandomForests and GradientBoostedTrees) have been experimental for a long time. The API has become very convoluted because trees and ensembles have many, many variants, some of which we have added incrementally without a long-term design.
> **Proposal**: This JIRA is for discussing changes required to finalize the APIs. After we discuss, I will make a PR to update the APIs and make them non-Experimental. This will require making many breaking changes; see the design doc for details.
> **[Design doc](https://docs.google.com/document/d/1rJ_DZinyDG3PkYkAKSsQlY0QgCeefn4hUv7GsPkzBP4)** : This outlines current issues and the proposed API.

Overall code layout:
* The old API in mllib.tree.* will remain the same.
* The new API will reside in ml.classification.* and ml.regression.*

### Summary of changes

Old API
* Exactly the same, except I made 1 method in Loss private (but that is not a breaking change since that method was introduced after the Spark 1.3 release).

New APIs
* Under Pipeline API
* The new API preserves functionality, except:
  * New API does NOT store prob (probability of label in classification).  I want to have it store the full vector of probabilities but feel that should be in a later PR.
* Use abstractions for parameters, estimators, and models to avoid code duplication
* Limit parameters to relevant algorithms
* For enum-like types, only expose Strings
  * We can make these pluggable later on by adding new parameters.  That is a far-future item.

Test suites
* I organized DecisionTreeSuite, but I made absolutely no changes to the tests themselves.
* The test suites for the new API only test (a) similarity with the results of the old API and (b) elements of the new API.
  * After code is moved to this new API, we should move the tests from the old suites which test the internals.

### Details

#### Changed names

Parameters
* useNodeIdCache -> cacheNodeIds

#### Other changes

* Split: Changed categories to set instead of list

#### Non-decision tree changes
* AttributeGroup
  * Added parentheses to toMetadata, toStructField methods (These were removed in a previous PR, but I ran into 1 issue with the Scala compiler not being able to disambiguate between a toMetadata method with no parentheses and a toMetadata method which takes 1 argument.)
* Attributes
  * Renamed: toMetadata -> toMetadataImpl
  * Added toMetadata methods which return ML metadata (keyed with “ML_ATTR”)
  * NominalAttribute: Added getNumValues method which examines both numValues and values.
* Params.inheritValues: Checks whether the parent param really belongs to the child (to allow Estimator-Model pairs with different sets of parameters)

### Questions for reviewers

* Is "DecisionTreeClassificationModel" too long a name?
* Is this OK in the docs?
```
class DecisionTreeRegressor extends TreeRegressor[DecisionTreeRegressionModel] with DecisionTreeParams[DecisionTreeRegressor] with TreeRegressorParams[DecisionTreeRegressor]
```

### Future

We should open up the abstractions at some point.  E.g., it would be useful to be able to set tree-related parameters in 1 place and then pass those to multiple tree-based algorithms.

Follow-up JIRAs will be (in this order):
* Tree ensembles
* Deprecate old tree code
* Move DecisionTree implementation code to new API.
* Move tests from the old suites which test the internals.
* Update programming guide
* Python API
* Change RandomForest* to always use bootstrapping, even when numTrees = 1
* Provide the probability of the predicted label for classification.  After we move code to the new API and update it to maintain probabilities for all labels, then we can add the probabilities to the new API.

CC: mengxr  manishamde  codedeft  chouqin  MechCoder

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #5530 from jkbradley/dt-api-dt and squashes the following commits:

6aae255 [Joseph K. Bradley] Changed tree abstractions not to take type parameters, and for setters to return this.type instead
ec17947 [Joseph K. Bradley] Updates based on code review.  Main changes were: moving public types from ml.impl.tree to ml.tree, modifying CategoricalSplit to take an Array of categories but store a Set internally, making more types sealed or final
5626c81 [Joseph K. Bradley] style fixes
f8fbd24 [Joseph K. Bradley] imported reorg of DecisionTreeSuite from old PR.  small cleanups
7ef63ed [Joseph K. Bradley] Added DecisionTreeRegressor, test suites, and example (for real this time)
e11673f [Joseph K. Bradley] Added DecisionTreeRegressor, test suites, and example
119f407 [Joseph K. Bradley] added DecisionTreeClassifier example
0bdc486 [Joseph K. Bradley] fixed issues after param PR was merged
f9fbb60 [Joseph K. Bradley] Done with DecisionTreeClassifier, but no save/load yet.  Need to add example as well
2532c9a [Joseph K. Bradley] partial move to spark.ml API, not done yet
c72c1a0 [Joseph K. Bradley] Copied changes for common items, plus DecisionTreeClassifier from original PR
2015-04-17 13:15:36 -07:00
Davies Liu 04e44b37cc [SPARK-4897] [PySpark] Python 3 support
This PR update PySpark to support Python 3 (tested with 3.4).

Known issue: unpickle array from Pyrolite is broken in Python 3, those tests are skipped.

TODO: ec2/spark-ec2.py is not fully tested with python3.

Author: Davies Liu <davies@databricks.com>
Author: twneale <twneale@gmail.com>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #5173 from davies/python3 and squashes the following commits:

d7d6323 [Davies Liu] fix tests
6c52a98 [Davies Liu] fix mllib test
99e334f [Davies Liu] update timeout
b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
cafd5ec [Davies Liu] adddress comments from @mengxr
bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
179fc8d [Davies Liu] tuning flaky tests
8c8b957 [Davies Liu] fix ResourceWarning in Python 3
5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
4006829 [Davies Liu] fix test
2fc0066 [Davies Liu] add python3 path
71535e9 [Davies Liu] fix xrange and divide
5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
ed498c8 [Davies Liu] fix compatibility with python 3
820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
ad7c374 [Davies Liu] fix mllib test and warning
ef1fc2f [Davies Liu] fix tests
4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
59bb492 [Davies Liu] fix tests
1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
ca0fdd3 [Davies Liu] fix code style
9563a15 [Davies Liu] add imap back for python 2
0b1ec04 [Davies Liu] make python examples work with Python 3
d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
a716d34 [Davies Liu] test with python 3.4
f1700e8 [Davies Liu] fix test in python3
671b1db [Davies Liu] fix test in python3
692ff47 [Davies Liu] fix flaky test
7b9699f [Davies Liu] invalidate import cache for Python 3.3+
9c58497 [Davies Liu] fix kill worker
309bfbf [Davies Liu] keep compatibility
5707476 [Davies Liu] cleanup, fix hash of string in 3.3+
8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
f53e1f0 [Davies Liu] fix tests
70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3
a39167e [Davies Liu] support customize class in __main__
814c77b [Davies Liu] run unittests with python 3
7f4476e [Davies Liu] mllib tests passed
d737924 [Davies Liu] pass ml tests
375ea17 [Davies Liu] SQL tests pass
6cc42a9 [Davies Liu] rename
431a8de [Davies Liu] streaming tests pass
78901a7 [Davies Liu] fix hash of serializer in Python 3
24b2f2e [Davies Liu] pass all RDD tests
35f48fe [Davies Liu] run future again
1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py
6e3c21d [Davies Liu] make cloudpickle work with Python3
2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run
1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out
7354371 [twneale] buffer --> memoryview  I'm not super sure if this a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work.
b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?).
f40d925 [twneale] xrange --> range
e104215 [twneale] Replaces 2.7 types.InstsanceType with 3.4 `object`....could be horribly wrong depending on how types.InstanceType is used elsewhere in the package--see http://bugs.python.org/issue8206
79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper
2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3
854be27 [Josh Rosen] Run `futurize` on Python code:
7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.
2015-04-16 16:20:57 -07:00
Xiangrui Meng 57cd1e86d1 [SPARK-6893][ML] default pipeline parameter handling in python
Same as #5431 but for Python. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #5534 from mengxr/SPARK-6893 and squashes the following commits:

d3b519b [Xiangrui Meng] address comments
ebaccc6 [Xiangrui Meng] style update
fce244e [Xiangrui Meng] update explainParams with test
4d6b07a [Xiangrui Meng] add tests
5294500 [Xiangrui Meng] update default param handling in python
2015-04-15 23:49:42 -07:00
Juliet Hougland 52c3439a8a SPARK-6938: All require statements now have an informative error message.
This pr adds informative error messages to all require statements in the Vectors class that did not previously have them. This references [SPARK-6938](https://issues.apache.org/jira/browse/SPARK-6938).

Author: Juliet Hougland <juliet@cloudera.com>

Closes #5532 from jhlch/SPARK-6938 and squashes the following commits:

ab321bb [Juliet Hougland] Remove braces from string interpolation when not required.
1221f94 [Juliet Hougland] All require statements now have an informative error message.
2015-04-15 21:52:25 -07:00
Xiangrui Meng 971b95b0c9 [SPARK-5957][ML] better handling of parameters
The design doc was posted on the JIRA page. Python changes will be in a follow-up PR. jkbradley

1. Use codegen for shared params.
1. Move shared params to package `ml.param.shared`.
1. Set default values in `Params` instead of in `Param`.
1. Add a few methods to `Params` and `ParamMap`.
1. Move schema handling to `SchemaUtils` from `Params`.

- [x] check visibility of the methods added

Author: Xiangrui Meng <meng@databricks.com>

Closes #5431 from mengxr/SPARK-5957 and squashes the following commits:

d19236d [Xiangrui Meng] fix test
26ae2d7 [Xiangrui Meng] re-gen code and mark clear protected
38b78c7 [Xiangrui Meng] update Param.toString and remove Params.explain()
409e2d5 [Xiangrui Meng] address comments
2d637bd [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5957
eec2264 [Xiangrui Meng] make get* public in Params
4090d95 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5957
4fee9e7 [Xiangrui Meng] re-gen shared params
2737c2d [Xiangrui Meng] rename SharedParamCodeGen to SharedParamsCodeGen
e938f81 [Xiangrui Meng] update code to set default parameter values
28ed322 [Xiangrui Meng] merge master
55be1f3 [Xiangrui Meng] merge master
d63b5cc [Xiangrui Meng] fix examples
29b004c [Xiangrui Meng] update ParamsSuite
94fd98e [Xiangrui Meng] fix explain params
48d0e84 [Xiangrui Meng] add remove and update explainParams
4ac6348 [Xiangrui Meng] move schema utils to SchemaUtils add a few methods to Params
0d9594e [Xiangrui Meng] add getOrElse to ParamMap
eeeffe8 [Xiangrui Meng] map ++ paramMap => extractValues
0d3fc5b [Xiangrui Meng] setDefault after param
a9dbf59 [Xiangrui Meng] minor updates
d9302b8 [Xiangrui Meng] generate default values
1c72579 [Xiangrui Meng] pass test compile
abb7a3b [Xiangrui Meng] update default values handling
dcab97a [Xiangrui Meng] add codegen for shared params
2015-04-13 21:18:05 -07:00
MechCoder 2a55cb41bf [SPARK-5972] [MLlib] Cache residuals and gradient in GBT during training and validation
The previous PR https://github.com/apache/spark/pull/4906 helped to extract the learning curve giving the error for each iteration. This continues the work refactoring some code and extending the same logic during training and validation.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #5330 from MechCoder/spark-5972 and squashes the following commits:

0b5d659 [MechCoder] minor
32d409d [MechCoder] EvaluateeachIteration and training cache should follow different paths
d542bb0 [MechCoder] Remove unused imports and docs
58f4932 [MechCoder] Remove unpersist
70d3b4c [MechCoder] Broadcast for each tree
5869533 [MechCoder] Access broadcasted values locally and other minor changes
923dbf6 [MechCoder] [SPARK-5972] Cache residuals and gradient in GBT during training and validation
2015-04-13 15:36:33 -07:00
Xusen Yin 1e340c3ae4 [SPARK-5988][MLlib] add save/load for PowerIterationClusteringModel
See JIRA issue [SPARK-5988](https://issues.apache.org/jira/browse/SPARK-5988).

Author: Xusen Yin <yinxusen@gmail.com>

Closes #5450 from yinxusen/SPARK-5988 and squashes the following commits:

cb1ecfa [Xusen Yin] change Assignment into case class
b1dd24c [Xusen Yin] add test suite
63c3923 [Xusen Yin] add save load for power iteration clustering
2015-04-13 11:53:17 -07:00
Reynold Xin c5b0b296b8 [SPARK-6765] Enable scalastyle on test code.
Turn scalastyle on for all test code. Most of the violations have been resolved in my previous pull requests:

Core: https://github.com/apache/spark/pull/5484
SQL: https://github.com/apache/spark/pull/5412
MLlib: https://github.com/apache/spark/pull/5411
GraphX: https://github.com/apache/spark/pull/5410
Streaming: https://github.com/apache/spark/pull/5409

Author: Reynold Xin <rxin@databricks.com>

Closes #5486 from rxin/test-style-enable and squashes the following commits:

01683de [Reynold Xin] Fixed new code.
a4ab46e [Reynold Xin] Fixed tests.
20adbc8 [Reynold Xin] Missed one violation.
5e36521 [Reynold Xin] [SPARK-6765] Enable scalastyle on test code.
2015-04-13 09:29:04 -07:00
Xiangrui Meng 9294044985 [SPARK-5885][MLLIB] Add VectorAssembler as a feature transformer
VectorAssembler merges multiple columns into a vector column. This PR contains content from #5195.

~~carry ML attributes~~ (moved to a follow-up PR)

Author: Xiangrui Meng <meng@databricks.com>

Closes #5196 from mengxr/SPARK-5885 and squashes the following commits:

a52b101 [Xiangrui Meng] recognize more types
35daac2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5885
bb5e64b [Xiangrui Meng] add TODO for null
976a3d6 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5885
0859311 [Xiangrui Meng] Revert "add CreateStruct"
29fb6ac [Xiangrui Meng] use CreateStruct
adb71c4 [Xiangrui Meng] Merge branch 'SPARK-6542' into SPARK-5885
85f3106 [Xiangrui Meng] add CreateStruct
4ff16ce [Xiangrui Meng] add VectorAssembler
2015-04-12 22:42:01 -07:00
Xiangrui Meng 685ddcf525 [SPARK-5886][ML] Add StringIndexer as a feature transformer
This PR adds string indexer, which takes a column of string labels and outputs a double column with labels indexed by their frequency.

TODOs:
- [x] store feature to index map in output metadata

Author: Xiangrui Meng <meng@databricks.com>

Closes #4735 from mengxr/SPARK-5886 and squashes the following commits:

d82575f [Xiangrui Meng] fix test
700e70f [Xiangrui Meng] rename LabelIndexer to StringIndexer
16a6f8c [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5886
457166e [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5886
f8b30f4 [Xiangrui Meng] update label indexer to output metadata
e81ec28 [Xiangrui Meng] Merge branch 'openhashmap-contains' into SPARK-5886-2
d6e6f1f [Xiangrui Meng] add contains to primitivekeyopenhashmap
748a69b [Xiangrui Meng] add contains to OpenHashMap
def3c5c [Xiangrui Meng] add LabelIndexer
2015-04-12 22:41:05 -07:00
Joseph K. Bradley d3792f5497 [SPARK-4081] [mllib] VectorIndexer
**Ready for review!**

Since the original PR, I moved the code to the spark.ml API and renamed this to VectorIndexer.

This introduces a VectorIndexer class which does the following:
* VectorIndexer.fit(): collect statistics about how many values each feature in a dataset (RDD[Vector]) can take (limited by maxCategories)
  * Feature which exceed maxCategories are declared continuous, and the Model will treat them as such.
* VectorIndexerModel.transform(): Convert categorical feature values to corresponding 0-based indices

Design notes:
* This maintains sparsity in vectors by ensuring that categorical feature value 0.0 gets index 0.
* This does not yet support transforming data with new (unknown) categorical feature values.  That can be added later.
* This is necessary for DecisionTree and tree ensembles.

Reviewers: Please check my use of metadata and my unit tests for it; I'm not sure if I covered everything in the tests.

Other notes:
* This also adds a public toMetadata method to AttributeGroup (for simpler construction of metadata).

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3000 from jkbradley/indexer and squashes the following commits:

5956d91 [Joseph K. Bradley] minor cleanups
f5c57a8 [Joseph K. Bradley] added Java test suite
643b444 [Joseph K. Bradley] removed FeatureTests
02236c3 [Joseph K. Bradley] Updated VectorIndexer, ready for PR
286d221 [Joseph K. Bradley] Reworked DatasetIndexer for spark.ml API, and renamed it to VectorIndexer
12e6cf2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into indexer
6d8f3f1 [Joseph K. Bradley] Added partly done DatasetIndexer to spark.ml
6a2f553 [Joseph K. Bradley] Updated TODO for allowUnknownCategories
3f041f8 [Joseph K. Bradley] Final cleanups for DatasetIndexer
038b9e3 [Joseph K. Bradley] DatasetIndexer now maintains sparsity in SparseVector
3a4a0bd [Joseph K. Bradley] Added another test for DatasetIndexer
2006923 [Joseph K. Bradley] DatasetIndexer now passes tests
f409987 [Joseph K. Bradley] partly done with DatasetIndexerSuite
5e7c874 [Joseph K. Bradley] working on DatasetIndexer
2015-04-12 22:38:27 -07:00
lewuathe fc17661475 [SPARK-6643][MLLIB] Implement StandardScalerModel missing methods
This is the sub-task of SPARK-6254.
Wrap missing method for `StandardScalerModel`.

Author: lewuathe <lewuathe@me.com>

Closes #5310 from Lewuathe/SPARK-6643 and squashes the following commits:

fafd690 [lewuathe] Fix for lint-python
bd31a64 [lewuathe] Merge branch 'master' into SPARK-6643
578f5ee [lewuathe] Remove unnecessary class
a38f155 [lewuathe] Merge master
66bb2ab [lewuathe] Fix typos
82683a0 [lewuathe] [SPARK-6643] Implement StandardScalerModel missing methods
2015-04-12 22:17:16 -07:00
Volodymyr Lyubinets 67d06880e4 [SQL] [SPARK-6620] Speed up toDF() and rdd() functions by constructing converters in ScalaReflection
cc marmbrus

Author: Volodymyr Lyubinets <vlyubin@gmail.com>

Closes #5279 from vlyubin/speedup and squashes the following commits:

e75a387 [Volodymyr Lyubinets] Changes to ScalaUDF
11a20ec [Volodymyr Lyubinets] Avoid creating a tuple
c327bc9 [Volodymyr Lyubinets] Moved the only remaining function from DataTypeConversions to DateUtils
dec6802 [Volodymyr Lyubinets] Addresed review feedback
74301fa [Volodymyr Lyubinets] Addressed review comments
afa3aa5 [Volodymyr Lyubinets] Minor refactoring, added license, removed debug output
881dc60 [Volodymyr Lyubinets] Moved to a separate module; addressed review comments; one extra place of usage; changed behaviour for Java
8cad6e2 [Volodymyr Lyubinets] Addressed review commments
41b2aa9 [Volodymyr Lyubinets] Creating converters for ScalaReflection stuff, and more
2015-04-10 16:27:56 -07:00
Yuhao Yang 9c67049b4e [Spark-6693][MLlib]add tostring with max lines and width for matrix
jira: https://issues.apache.org/jira/browse/SPARK-6693

It's kind of annoying when debugging and found you cannot print out the matrix as you want.

original toString of Matrix only print like following,
0.17810102596909183    0.5616906241468385    ... (10 total)
0.9692861997823815     0.015558159784155756  ...
0.8513015122819192     0.031523763918528847  ...
0.5396875653953941     0.3267864552779176    ...

The   def toString(maxLines : Int, maxWidth : Int) is useful when debuging, logging and saving matrix to files.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #5344 from hhbyyh/addToString and squashes the following commits:

19a6836 [Yuhao Yang] remove extra line
6314b21 [Yuhao Yang] add exclude
736c324 [Yuhao Yang] add ut and exclude
420da39 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into addToString
c22f352 [Yuhao Yang] style change
64a9e0f [Yuhao Yang] add specific to string to matrix
2015-04-09 15:37:45 -07:00
Yanbo Liang a0411aebee [SPARK-6264] [MLLIB] Support FPGrowth algorithm in Python API
Support FPGrowth algorithm in Python API.
Should we remove "Experimental" which were marked for FPGrowth and FPGrowthModel in Scala? jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #5213 from yanboliang/spark-6264 and squashes the following commits:

ed62ead [Yanbo Liang] trigger jenkins
8ce0359 [Yanbo Liang] fix docstring style
544c725 [Yanbo Liang] address comments
a2d7cf7 [Yanbo Liang] add doc for FPGrowth.train()
dcf7d73 [Yanbo Liang] add python doc
b18fd07 [Yanbo Liang] trigger jenkins
2c951b8 [Yanbo Liang] fix typos
7f62c8f [Yanbo Liang] add fpm to __init__.py
b96206a [Yanbo Liang] Support FPGrowth algorithm in Python API
2015-04-09 15:10:10 -07:00
WangTaoTheTonic 7d92db342e [SPARK-6758]block the right jetty package in log
https://issues.apache.org/jira/browse/SPARK-6758

I am not sure if it is ok to block them in test resources too (as we shade jetty in assembly?).

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #5406 from WangTaoTheTonic/SPARK-6758 and squashes the following commits:

e09605b [WangTaoTheTonic] block the right jetty package
2015-04-09 17:44:08 -04:00
Reynold Xin 66159c3501 [SPARK-6765] Fix test code style for mllib.
So we can turn style checker on for test code.

Author: Reynold Xin <rxin@databricks.com>

Closes #5411 from rxin/test-style-mllib and squashes the following commits:

d8a2569 [Reynold Xin] [SPARK-6765] Fix test code style for mllib.
2015-04-08 11:32:44 -07:00
Omede Firouz d138aa8ee2 [SPARK-6705][MLLIB] Add fit intercept api to ml logisticregression
I have the fit intercept enabled by default for logistic regression, I
wonder what others think here. I understand that it enables allocation
by default which is undesirable, but one needs to have a very strong
reason for not having an intercept term enabled so it is the safer
default from a statistical sense.

Explicitly modeling the intercept by adding a column of all 1s does not
work. I believe the reason is that since the API for
LogisticRegressionWithLBFGS forces column normalization, and a column of all
1s has 0 variance so dividing by 0 kills it.

Author: Omede Firouz <ofirouz@palantir.com>

Closes #5301 from oefirouz/addIntercept and squashes the following commits:

9f1286b [Omede Firouz] [SPARK-6705][MLLIB] Add fitInterceptTerm to LogisticRegression
1d6bd6f [Omede Firouz] [SPARK-6705][MLLIB] Add a fit intercept term to ML LogisticRegression
9963509 [Omede Firouz] [MLLIB] Add fitIntercept to LogisticRegression
2257fca [Omede Firouz] [MLLIB] Add fitIntercept param to logistic regression
329c1e2 [Omede Firouz] [MLLIB] Add fit intercept term
bd9663c [Omede Firouz] [MLLIB] Add fit intercept api to ml logisticregression
2015-04-07 23:36:31 -04:00
Vinod K C 7162ecf886 [SPARK-6733][ Scheduler]Added scala.language.existentials
Author: Vinod K C <vinod.kc@huawei.com>

Closes #5384 from vinodkc/Suppression_Scala_existential_code and squashes the following commits:

82a3a1f [Vinod K C] Added scala.language.existentials
2015-04-07 10:42:08 -07:00
Reza Zadeh 30363ede86 [MLlib] [SPARK-6713] Iterators in columnSimilarities for mapPartitionsWithIndex
Use Iterators in columnSimilarities to allow mapPartitionsWithIndex to spill to disk. This could happen in a dense and large column - this way Spark can spill the pairs onto disk instead of building all the pairs before handing them to Spark.

Another PR coming to update documentation.

Author: Reza Zadeh <reza@databricks.com>

Closes #5364 from rezazadeh/optmemsim and squashes the following commits:

47c90ba [Reza Zadeh] Iterators in columnSimilarities for flatMap
2015-04-06 13:15:01 -07:00
lewuathe 512a2f191a [SPARK-6615][MLLIB] Python API for Word2Vec
This is the sub-task of SPARK-6254.
Wrap missing method for `Word2Vec` and `Word2VecModel`.

Author: lewuathe <lewuathe@me.com>

Closes #5296 from Lewuathe/SPARK-6615 and squashes the following commits:

f14c304 [lewuathe] Reorder tests
1d326b9 [lewuathe] Merge master
e2bedfb [lewuathe] Modify test cases
afb866d [lewuathe] [SPARK-6615] Python API for Word2Vec
2015-04-03 09:49:50 -07:00
Omede Firouz b52c7f9fc8 [MLLIB] Remove println in LogisticRegression.scala
There's no corresponding printing in linear regression. Here was my previous PR (something weird happened and I can't reopen it) https://github.com/apache/spark/pull/5272

Author: Omede Firouz <ofirouz@palantir.com>

Closes #5338 from oefirouz/println and squashes the following commits:

3f3dbf4 [Omede Firouz] [MLLIB] Remove println
2015-04-03 10:26:43 +01:00
Reynold Xin 82701ee25f [SPARK-6428] Turn on explicit type checking for public methods.
This builds on my earlier pull requests and turns on the explicit type checking in scalastyle.

Author: Reynold Xin <rxin@databricks.com>

Closes #5342 from rxin/SPARK-6428 and squashes the following commits:

7b531ab [Reynold Xin] import ordering
2d9a8a5 [Reynold Xin] jl
e668b1c [Reynold Xin] override
9b9e119 [Reynold Xin] Parenthesis.
82e0cf5 [Reynold Xin] [SPARK-6428] Turn on explicit type checking for public methods.
2015-04-03 01:25:02 -07:00
freeman 6e1c1ec67b [SPARK-6345][STREAMING][MLLIB] Fix for training with prediction
This patch fixes a reported bug causing model updates to not properly propagate to model predictions during streaming regression. These minor changes in model declaration fix the problem, and I expanded the tests to include the scenario in which the bug was arising. The two new tests failed prior to the patch and now pass.

cc mengxr

Author: freeman <the.freeman.lab@gmail.com>

Closes #5037 from freeman-lab/train-predict-fix and squashes the following commits:

3af953e [freeman] Expand test coverage to include combined training and prediction
8f84fc8 [freeman] Move model declaration
2015-04-02 21:38:19 -07:00
Xiangrui Meng 4815bc2128 [SPARK-6660][MLLIB] pythonToJava doesn't recognize object arrays
davies

Author: Xiangrui Meng <meng@databricks.com>

Closes #5318 from mengxr/SPARK-6660 and squashes the following commits:

0f66ec2 [Xiangrui Meng] recognize object arrays
ad8c42f [Xiangrui Meng] add a test for SPARK-6660
2015-04-01 18:17:07 -07:00
Yanbo Liang 86b4399351 [SPARK-6580] [MLLIB] Optimize LogisticRegressionModel.predictPoint
https://issues.apache.org/jira/browse/SPARK-6580

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #5249 from yanboliang/spark-6580 and squashes the following commits:

6f47f21 [Yanbo Liang] address comments
4e0bd0f [Yanbo Liang] fix typos
04e2e2a [Yanbo Liang] trigger jenkins
cad5bcd [Yanbo Liang] Optimize LogisticRegressionModel.predictPoint
2015-04-01 17:19:36 -07:00
Xiangrui Meng ccafd757ed [SPARK-6642][MLLIB] use 1.2 lambda scaling and remove addImplicit from NormalEquation
This PR changes lambda scaling from number of users/items to number of explicit ratings. The latter is the behavior in 1.2. Slight refactor of NormalEquation to make it independent of ALS models. srowen codexiang

Author: Xiangrui Meng <meng@databricks.com>

Closes #5314 from mengxr/SPARK-6642 and squashes the following commits:

dc655a1 [Xiangrui Meng] relax python tests
f410df2 [Xiangrui Meng] use 1.2 scaling and remove addImplicit from NormalEquation
2015-04-01 16:47:18 -07:00
MechCoder 0e00f12d33 [SPARK-5692] [MLlib] Word2Vec save/load
Word2Vec model now supports saving and loading.

a] The Metadata stored in JSON format consists of "version", "classname", "vectorSize" and "numWords"
b] The data stored in Parquet file format consists of an Array of rows with each row consisting of 2 columns, first being the word: String and the second, an Array of Floats.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #5291 from MechCoder/spark-5692 and squashes the following commits:

1142f3a [MechCoder] Add numWords to metaData
bfe4c39 [MechCoder] [SPARK-5692] Word2Vec save/load
2015-03-31 16:01:08 -07:00
Yanbo Liang b5bd75d90a [SPARK-6255] [MLLIB] Support multiclass classification in Python API
Python API parity check for classification and multiclass classification support, major disparities need to be added for Python:
```scala
LogisticRegressionWithLBFGS
    setNumClasses
    setValidateData
LogisticRegressionModel
    getThreshold
    numClasses
    numFeatures
SVMWithSGD
    setValidateData
SVMModel
    getThreshold
```
For users the greatest benefit in this PR is multiclass classification was supported by Python API.
Users can train multiclass classification model and use it to predict in pyspark.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #5137 from yanboliang/spark-6255 and squashes the following commits:

0bd531e [Yanbo Liang] address comments
444d5e2 [Yanbo Liang] LogisticRegressionModel.predict() optimization
fc7990b [Yanbo Liang] address comments
b0d9c63 [Yanbo Liang] Support Mulinomial LR model predict in Python API
ded847c [Yanbo Liang] Python API parity check for classification (support multiclass classification)
2015-03-31 11:32:14 -07:00
leahmcguire d01a6d8c33 [SPARK-4894][mllib] Added Bernoulli option to NaiveBayes model in mllib
Added optional model type parameter for  NaiveBayes training. Can be either Multinomial or Bernoulli.

When Bernoulli is given the Bernoulli smoothing is used for fitting and for prediction as per: http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html.

 Default for model is original Multinomial fit and predict.

Added additional testing for Bernoulli and Multinomial models.

Author: leahmcguire <lmcguire@salesforce.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Leah McGuire <lmcguire@salesforce.com>

Closes #4087 from leahmcguire/master and squashes the following commits:

f3c8994 [leahmcguire] changed checks on model type to requires
acb69af [leahmcguire] removed enum type and replaces all modelType parameters with strings
2224b15 [Leah McGuire] Merge pull request #2 from jkbradley/leahmcguire-master
9ad89ca [Joseph K. Bradley] removed old code
6a8f383 [Joseph K. Bradley] Added new model save/load format 2.0 for NaiveBayesModel after modelType parameter was added.  Updated tests.  Also updated ModelType enum-like type.
852a727 [leahmcguire] merged with upstream master
a22d670 [leahmcguire] changed NaiveBayesModel modelType parameter back to NaiveBayes.ModelType, made NaiveBayes.ModelType serializable, fixed getter method in NavieBayes
18f3219 [leahmcguire] removed private from naive bayes constructor for lambda only
bea62af [leahmcguire] put back in constructor for NaiveBayes
01baad7 [leahmcguire] made fixes from code review
fb0a5c7 [leahmcguire] removed typo
e2d925e [leahmcguire] fixed nonserializable error that was causing naivebayes test failures
2d0c1ba [leahmcguire] fixed typo in NaiveBayes
c298e78 [leahmcguire] fixed scala style errors
b85b0c9 [leahmcguire] Merge remote-tracking branch 'upstream/master'
900b586 [leahmcguire] fixed model call so that uses type argument
ea09b28 [leahmcguire] Merge remote-tracking branch 'upstream/master'
e016569 [leahmcguire] updated test suite with model type fix
85f298f [leahmcguire] Merge remote-tracking branch 'upstream/master'
dc65374 [leahmcguire] integrated model type fix
7622b0c [leahmcguire] added comments and fixed style as per rb
b93aaf6 [Leah McGuire] Merge pull request #1 from jkbradley/nb-model-type
3730572 [Joseph K. Bradley] modified NB model type to be more Java-friendly
b61b5e2 [leahmcguire] added back compatable constructor to NaiveBayesModel to fix MIMA test failure
5a4a534 [leahmcguire] fixed scala style error in NaiveBayes
3891bf2 [leahmcguire] synced with apache spark and resolved merge conflict
d9477ed [leahmcguire] removed old inaccurate comment from test suite for mllib naive bayes
76e5b0f [leahmcguire] removed unnecessary sort from test
0313c0c [leahmcguire] fixed style error in NaiveBayes.scala
4a3676d [leahmcguire] Updated changes re-comments. Got rid of verbose populateMatrix method. Public api now has string instead of enumeration. Docs are updated."
ce73c63 [leahmcguire] added Bernoulli option to niave bayes model in mllib, added optional model type parameter for training. When Bernoulli is given the Bernoulli smoothing is used for fitting and for prediction http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html
2015-03-31 11:16:55 -07:00
Xiangrui Meng f75f633b21 [SPARK-6571][MLLIB] use wrapper in MatrixFactorizationModel.load
This fixes `predictAll` after load. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #5243 from mengxr/SPARK-6571 and squashes the following commits:

82dcaa7 [Xiangrui Meng] use wrapper in MatrixFactorizationModel.load
2015-03-28 15:08:05 -07:00
Xusen Yin d5497ab134 [SPARK-6526][ML] Add Normalizer transformer in ML package
See [SPARK-6526](https://issues.apache.org/jira/browse/SPARK-6526).

mengxr Should we add test suite for this transformer? There is no test suite for all feature transformers in ML package now.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #5181 from yinxusen/SPARK-6526 and squashes the following commits:

6faa7bf [Xusen Yin] fix style
8a462da [Xusen Yin] remove duplications
ab35ab0 [Xusen Yin] add test suite
bc8cd0f [Xusen Yin] fix comment
79774c9 [Xusen Yin] add Normalizer transformer in ML package
2015-03-27 13:29:10 -07:00
Yuhao Yang 3ddb975fae [MLlib]remove unused import
minor thing. Let me know if jira is required.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #5207 from hhbyyh/adjustImport and squashes the following commits:

2240121 [Yuhao Yang] remove unused import
2015-03-26 13:27:05 +00:00
MechCoder 4fc4d0369e [SPARK-5987] [MLlib] Save/load for GaussianMixtureModels
Should be self explanatory.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #4986 from MechCoder/spark-5987 and squashes the following commits:

7d2cd56 [MechCoder] Iterate over dataframe in a better way
e7a14cb [MechCoder] Minor
33c84f9 [MechCoder] Store as Array[Data] instead of Data[Array]
505bd57 [MechCoder] Rebased over master and used MatrixUDT
7422bb4 [MechCoder] Store sigmas as Array[Double] instead of Array[Array[Double]]
b9794e4 [MechCoder] Minor
cb77095 [MechCoder] [SPARK-5987] Save/load for GaussianMixtureModels
2015-03-25 14:45:23 -07:00
Yanbo Liang 435337381f [SPARK-6256] [MLlib] MLlib Python API parity check for regression
MLlib Python API parity check for Regression, major disparities need to be added for Python list following:
```scala
LinearRegressionWithSGD
    setValidateData
LassoWithSGD
    setIntercept
    setValidateData
RidgeRegressionWithSGD
    setIntercept
    setValidateData
```
setFeatureScaling is mllib private function which is not needed to expose in pyspark.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #4997 from yanboliang/spark-6256 and squashes the following commits:

102f498 [Yanbo Liang] fix intercept issue & add doc test
1fb7b4f [Yanbo Liang] change 'intercept' to 'addIntercept'
de5ecbc [Yanbo Liang] MLlib Python API parity check for regression
2015-03-25 13:38:33 -07:00
Augustin Borsu 982952f4ae [ML][FEATURE] SPARK-5566: RegEx Tokenizer
Added a Regex based tokenizer for ml.
Currently the regex is fixed but if I could add a regex type paramater to the paramMap,
changing the tokenizer regex could be a parameter used in the crossValidation.
Also I wonder what would be the best way to add a stop word list.

Author: Augustin Borsu <augustin@sagacify.com>
Author: Augustin Borsu <a.borsu@gmail.com>
Author: Augustin Borsu <aborsu@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #4504 from aborsu985/master and squashes the following commits:

716d257 [Augustin Borsu] Merge branch 'mengxr-SPARK-5566'
cb07021 [Augustin Borsu] Merge branch 'SPARK-5566' of git://github.com/mengxr/spark into mengxr-SPARK-5566
5f09434 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
a164800 [Xiangrui Meng] remove tabs
556aa27 [Xiangrui Meng] Merge branch 'aborsu985-master' into SPARK-5566
9651aec [Xiangrui Meng] update test
f96526d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5566
2338da5 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
e88d7b8 [Xiangrui Meng] change pattern to a StringParameter; update tests
148126f [Augustin Borsu] Added return type to public functions
12dddb4 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
daf685e [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
6a85982 [Augustin Borsu] Style corrections
38b95a1 [Augustin Borsu] Added Java unit test for RegexTokenizer
b66313f [Augustin Borsu] Modified the pattern Param so it is compiled when given to the Tokenizer
e262bac [Augustin Borsu] Added unit tests in scala
cd6642e [Augustin Borsu] Changed regex to pattern
132b00b [Augustin Borsu] Changed matching to gaps and removed case folding
201a107 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
cb9c9a7 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
d3ef6d3 [Augustin Borsu] Added doc to RegexTokenizer
9082fc3 [Augustin Borsu] Removed stopwords parameters and updated doc
19f9e53 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
f6a5002 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
7f930bb [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
77ff9ca [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
2e89719 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
196cd7a [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
11ca50f [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
9f8685a [Augustin Borsu] RegexTokenizer
9e07a78 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
9547e9d [Augustin Borsu] RegEx Tokenizer
01cd26f [Augustin Borsu] RegExTokenizer
2015-03-25 10:16:39 -07:00