1. Implement `LossFunction` trait and implement squared error and cross entropy loss with it
2. Implement unit tests for gradient and loss
3. Implement `InPlace` trait and in-place layer evaluation
4. Refactor the interface for `ActivationFunction`
5. Update the `Layer` and `LayerModel` interfaces
6. Fix random weights assignment
7. Implement memory allocation by the MLP model instead of individual layers
These changes decreased memory usage and increased the flexibility of the internal API.
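For context, a minimal sketch of what such a `LossFunction` trait and a squared-error implementation might look like (names and signatures here are assumptions, not the exact spark.ml ANN API):
```scala
import breeze.linalg.{sum, DenseMatrix => BDM}

// Hypothetical sketch: a loss writes dLoss/dOutput into `delta` in place and
// returns the scalar loss for the current batch.
trait LossFunction {
  def loss(output: BDM[Double], target: BDM[Double], delta: BDM[Double]): Double
}

object SquaredError extends LossFunction {
  override def loss(output: BDM[Double], target: BDM[Double], delta: BDM[Double]): Double = {
    delta := output - target           // gradient of 0.5 * ||output - target||^2
    sum(delta.map(d => d * d)) / 2.0   // scalar loss
  }
}
```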
Author: Alexander Ulanov <nashb@yandex.ru>
Author: avulanov <avulanov@gmail.com>
Closes#9229 from avulanov/mlp-refactoring.
## What changes were proposed in this pull request?
This PR implements `FileFormat.buildReader()` for the LibSVM data source. Besides that, a new interface method `prepareRead()` is added to `FileFormat`:
```scala
def prepareRead(
    sqlContext: SQLContext,
    options: Map[String, String],
    files: Seq[FileStatus]): Map[String, String] = options
```
After migrating from `buildInternalScan()` to `buildReader()`, we lost the opportunity to collect necessary global information, since `buildReader()` works in a per-partition manner. For example, LibSVM needs to infer the total number of features if the `numFeatures` data source option is not set. Any necessary collected global information should be returned using the data source options map. By default, this method just returns the original options untouched.
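For illustration, a hedged sketch of how the LibSVM data source could use `prepareRead()` to fill in `numFeatures` when the user did not set it (`computeNumFeatures` is a hypothetical helper, not an existing API):
```scala
def prepareRead(
    sqlContext: SQLContext,
    options: Map[String, String],
    files: Seq[FileStatus]): Map[String, String] = {
  if (options.contains("numFeatures")) {
    options
  } else {
    // Hypothetical helper that scans the input once to find the largest feature index.
    val numFeatures = computeNumFeatures(sqlContext, files)
    options + ("numFeatures" -> numFeatures.toString)
  }
}
```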
An alternative approach is to absorb `inferSchema()` into `prepareRead()`, since schema inference is also some kind of global information gathering. However, this approach wasn't chosen because schema inference is optional, while `prepareRead()` must be called whenever a `HadoopFsRelation` based data source relation is instantiated.
One unaddressed problem is that, when `numFeatures` is absent, now the input data will be scanned twice. The `buildInternalScan()` code path doesn't need to do this because it caches the raw parsed RDD in memory before computing the total number of features. However, with `FileScanRDD`, the raw parsed RDD is created in a different way (e.g. partitioning) from the final RDD.
## How was this patch tested?
Tested using existing test suites.
Author: Cheng Lian <lian@databricks.com>
Closes#12088 from liancheng/spark-14295-libsvm-build-reader.
## What changes were proposed in this pull request?
In this patch, we set the initial `maxNumComponents` to `Integer.MAX_VALUE` instead of the default size (which is 16) when allocating `compositeBuffer` in `TransportFrameDecoder`, because a `compositeBuffer` with the default `maxNumComponents` incurs too many underlying memory copies when the frame size is large (i.e. the frame consists of many transport messages). For details, please refer to [SPARK-14242](https://issues.apache.org/jira/browse/SPARK-14242).
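As a rough illustration of the underlying Netty behavior (a hedged sketch, not the actual `TransportFrameDecoder` code):
```scala
import io.netty.buffer.{ByteBuf, CompositeByteBuf, Unpooled}

// Once a CompositeByteBuf exceeds maxNumComponents, adding another component
// consolidates (memory-copies) the existing components into a single buffer.
val defaultLimit: CompositeByteBuf = Unpooled.compositeBuffer()                  // maxNumComponents = 16
val largeLimit: CompositeByteBuf = Unpooled.compositeBuffer(Integer.MAX_VALUE)   // effectively never consolidates

def append(frame: CompositeByteBuf, chunk: ByteBuf): Unit = {
  // With the default limit, large frames built from many chunks trigger repeated
  // consolidation copies; with Integer.MAX_VALUE the chunks are only referenced here.
  frame.addComponent(chunk).writerIndex(frame.writerIndex() + chunk.readableBytes())
}
```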
## How was this patch tested?
Spark unit tests and manual tests.
For manual tests, we can reproduce the performance issue with the following code:
`sc.parallelize(Array(1, 2, 3), 3).mapPartitions(a => Array(new Array[Double](1024 * 1024 * 50)).iterator).reduce((a, b) => a).length`
It's easy to see the performance gain, both from the running time and CPU usage.
Author: Zhang, Liye <liye.zhang@intel.com>
Closes#12038 from liyezhang556520/spark-14242.
## What changes were proposed in this pull request?
This PR supports multiple Python UDFs within a single batch and also improves their performance.
```python
>>> from pyspark.sql.types import IntegerType
>>> sqlContext.registerFunction("double", lambda x: x * 2, IntegerType())
>>> sqlContext.registerFunction("add", lambda x, y: x + y, IntegerType())
>>> sqlContext.sql("SELECT double(add(1, 2)), add(double(2), 1)").explain(True)
== Parsed Logical Plan ==
'Project [unresolvedalias('double('add(1, 2)), None),unresolvedalias('add('double(2), 1), None)]
+- OneRowRelation$
== Analyzed Logical Plan ==
double(add(1, 2)): int, add(double(2), 1): int
Project [double(add(1, 2))#14,add(double(2), 1)#15]
+- Project [double(add(1, 2))#14,add(double(2), 1)#15]
+- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
+- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
+- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
+- OneRowRelation$
== Optimized Logical Plan ==
Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
+- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
+- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
+- OneRowRelation$
== Physical Plan ==
WholeStageCodegen
: +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
: +- INPUT
+- !BatchPythonEvaluation [add(pythonUDF1#17, 1)], [pythonUDF0#16,pythonUDF1#17,pythonUDF0#18]
+- !BatchPythonEvaluation [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
+- Scan OneRowRelation[]
```
## How was this patch tested?
Added new tests.
Using the following script to benchmark 1, 2 and 3 udfs,
```python
from pyspark.sql import functions as F
from pyspark.sql.types import LongType

df = sqlContext.range(1, 1 << 23, 1, 4)
double = F.udf(lambda x: x * 2, LongType())
print df.select(double(df.id)).count()
print df.select(double(df.id), double(df.id + 1)).count()
print df.select(double(df.id), double(df.id + 1), double(df.id + 2)).count()
```
Here are the results:
N | Before | After | speed up
---- |------------ | -------------|------
1 | 22 s | 7 s | 3.1X
2 | 38 s | 13 s | 2.9X
3 | 58 s | 16 s | 3.6X
This benchmark ran locally with 4 CPUs. For 3 UDFs, 12 Python processes were launched before this patch versus 4 processes after it. After this patch, multiple UDFs also use less memory than before (less buffering).
Author: Davies Liu <davies@databricks.com>
Closes#12057 from davies/multi_udfs.
## What changes were proposed in this pull request?
Upgrade snappy to 1.1.2.4 to improve snappy read/write performance.
## How was this patch tested?
Tested by running a job on the cluster and saw 7.5% cpu savings after this change.
Author: Sital Kedia <skedia@fb.com>
Closes#12096 from sitalkedia/snappyRelease.
This patch fixes a compilation / build break in Spark's `java8-tests` and refactors their POM to simplify the build. See individual commit messages for more details.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#12073 from JoshRosen/fix-java8-tests.
## What changes were proposed in this pull request?
Feature importances are exposed in the python API for GBTs.
Other changes:
* Update the random forest feature importance documentation so that it does not repeat the decision tree docstring and instead references it.
## How was this patch tested?
Python doc tests were updated to validate GBT feature importance.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#12056 from sethah/Pyspark_GBT_feature_importance.
## What changes were proposed in this pull request?
If I press `CTRL-C` when running these tests, the temp files will be left in the `sql/core` folder and I need to delete them manually, which is annoying. This PR just moves the temp files to the `java.io.tmpdir` folder and adds a name prefix for them.
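A hedged sketch of the idea using plain JDK APIs (not necessarily the exact helper used in the patch):
```scala
import java.nio.file.Files

// Temp files created this way land under java.io.tmpdir and carry a recognizable
// prefix, so leftovers from an interrupted run are easy to spot and delete.
val dir = Files.createTempDirectory("spark-test-")
val file = Files.createTempFile(dir, "spark-test-", ".tmp")
```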
## How was this patch tested?
Existing Jenkins tests
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#12093 from zsxwing/temp-file.
## What changes were proposed in this pull request?
Exclude jline from curator-recipes since it conflicts with Scala 2.11 when running spark-shell. This should not affect Scala 2.10, where jline is built in.
## How was this patch tested?
Ran spark-shell manually.
Author: Michel Lemay <mlemay@gmail.com>
Closes#12043 from michellemay/spark-13710-fix-jline-on-windows.
## What changes were proposed in this pull request?
Track executor information like host and port, cache size, running tasks.
TODO: tests
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11888 from cloud-fan/status-tracker.
It looks like the docs were recently updated to reflect the History Server's support for incomplete applications, but they still had wording that suggested only completed applications were viewable. This fixes that.
My editor also introduced several whitespace removal changes, that I hope are OK, as text files shouldn't have trailing whitespace. To verify they're purely whitespace changes, add `&w=1` to your browser address. If this isn't acceptable, let me know and I'll update the PR.
I also didn't think this required a JIRA. Let me know if I should create one.
Not tested
Author: Michael Gummelt <mgummelt@mesosphere.io>
Closes#12045 from mgummelt/update-history-docs.
## What changes were proposed in this pull request?
This PR uses `incUpdatedBlockStatuses` to update `updatedBlockStatuses` when removing blocks, making sure `BlockManager` correctly updates `updatedBlockStatuses`.
## How was this patch tested?
test("updated block statuses") in BlockManagerSuite.scala
Author: jeanlyn <jeanlyn92@gmail.com>
Closes#12091 from jeanlyn/updateBlock.
This PR is to provide native parsing support for the DDL command `ALTER VIEW`. Since its AST trees are highly similar to those of `ALTER TABLE`, both implementations are integrated into the same one.
Based on the Hive DDL document:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL and https://cwiki.apache.org/confluence/display/Hive/PartitionedViews
**Syntax:**
```SQL
ALTER VIEW view_name RENAME TO new_view_name
```
- to change the name of a view to a different name
**Syntax:**
```SQL
ALTER VIEW view_name SET TBLPROPERTIES ('comment' = new_comment);
```
- to add metadata to a view
**Syntax:**
```SQL
ALTER VIEW view_name UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key')
```
- to remove metadata from a view
**Syntax:**
```SQL
ALTER VIEW view_name ADD [IF NOT EXISTS] PARTITION spec1[, PARTITION spec2, ...]
```
- to add the partitioning metadata for a view.
- the syntax of partition spec in `ALTER VIEW` is identical to `ALTER TABLE`, **EXCEPT** that it is **ILLEGAL** to specify a `LOCATION` clause.
**Syntax:**
```SQL
ALTER VIEW view_name DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...]
```
- to drop the related partition metadata for a view.
Added the related test cases to `DDLCommandSuite`
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#11987 from gatorsmile/parseAlterView.
## What changes were proposed in this pull request?
Redirect error message to logWarning
## How was this patch tested?
Unit tests, manual tests
JoshRosen
Author: Nishkam Ravi <nishkamravi@gmail.com>
Closes#12052 from nishkamravi2/master_warning.
## What changes were proposed in this pull request?
Fixes a minor bug in the record reader constructor that was possibly introduced during refactoring.
## How was this patch tested?
N/A
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12070 from sameeragarwal/vectorized-rr.
## What changes were proposed in this pull request?
This PR proposes a new data-structure based on a vectorized hashmap that can be potentially _codegened_ in `TungstenAggregate` to speed up aggregates with group by. Micro-benchmarks show a 10x improvement over the current `BytesToBytes` aggregation map.
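To give a feel for the data structure, here is a hedged sketch of the idea (not the actual or generated code): keys and aggregation buffers live in flat arrays addressed by open-addressing probing, which keeps lookups cache-friendly and easy to codegen.
```scala
// Hypothetical sketch: a dense, array-backed map for a `GROUP BY key` sum aggregation.
class VectorizedLongMap(capacity: Int) {
  private val keys = new Array[Long](capacity)
  private val values = new Array[Long](capacity)
  private val used = new Array[Boolean](capacity)

  /** Adds `delta` to the running sum for `key` using linear probing (no resizing in this sketch). */
  def add(key: Long, delta: Long): Unit = {
    var idx = (java.lang.Long.hashCode(key) & Int.MaxValue) % capacity
    while (used(idx) && keys(idx) != key) {
      idx = (idx + 1) % capacity
    }
    if (!used(idx)) {
      used(idx) = true
      keys(idx) = key
    }
    values(idx) += delta
  }
}
```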
## How was this patch tested?
Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz
BytesToBytesMap | Best/Avg Time (ms) | Rate (M/s) | Per Row (ns) | Relative
---- | ---- | ---- | ---- | ----
hash | 108 / 119 | 96.9 | 10.3 | 1.0X
fast hash | 63 / 70 | 166.2 | 6.0 | 1.7X
arrayEqual | 70 / 73 | 150.8 | 6.6 | 1.6X
Java HashMap (Long) | 141 / 200 | 74.3 | 13.5 | 0.8X
Java HashMap (two ints) | 145 / 185 | 72.3 | 13.8 | 0.7X
Java HashMap (UnsafeRow) | 499 / 524 | 21.0 | 47.6 | 0.2X
BytesToBytesMap (off Heap) | 483 / 548 | 21.7 | 46.0 | 0.2X
BytesToBytesMap (on Heap) | 485 / 562 | 21.6 | 46.2 | 0.2X
Vectorized Hashmap | 54 / 60 | 193.7 | 5.2 | 2.0X
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12055 from sameeragarwal/vectorized-hashmap.
# What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-11892
Add save/load for spark ml.OneVsRest and its model. Also add OneVsRest and OneVsRestModel in MetaAlgorithmReadWrite.
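A hedged usage sketch of the new persistence API (paths and the `training` DataFrame are placeholders):
```scala
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest, OneVsRestModel}

val ovr = new OneVsRest().setClassifier(new LogisticRegression())
ovr.write.overwrite().save("/tmp/ovr")            // save the estimator
val sameOvr = OneVsRest.load("/tmp/ovr")          // ...and load it back

val model = ovr.fit(training)                     // `training` is an assumed labeled DataFrame
model.write.overwrite().save("/tmp/ovrModel")     // save the fitted model
val sameModel = OneVsRestModel.load("/tmp/ovrModel")
```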
# How was this patch tested?
Test with Scala unit test.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#9934 from yinxusen/SPARK-11892.
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-13782
Model export/import for BisectingKMeans in spark.ml and mllib
## How was this patch tested?
unit tests
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#11933 from hhbyyh/bisectingsave.
## What changes were proposed in this pull request?
1. Currently the log4j file distributed via the distributed cache is only added to the AM's classpath, not the executors'. This was introduced in #9118 and breaks the original intent of that PR, so here the log4j file is added to the classpath of both the AM and the executors.
2. Automatically upload metrics.properties to distributed cache, so that it could be used by remote driver and executors implicitly.
## How was this patch tested?
Unit test and integration test is done.
Author: jerryshao <sshao@hortonworks.com>
Closes#11885 from jerryshao/SPARK-14062.
## What changes were proposed in this pull request?
This issue improves the input layer validation of MultilayerPerceptronClassifier and adds related test cases.
```scala
- // TODO: how to check ALSO that all elements are greater than 0?
- ParamValidators.arrayLengthGt(1)
+ (t: Array[Int]) => t.forall(ParamValidators.gt(0)) && t.length > 1
```
## How was this patch tested?
Pass the Jenkins tests including the new testcases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11964 from dongjoon-hyun/SPARK-14164.
### What changes were proposed in this pull request?
This PR removes the ANTLR3 based parser, and moves the new ANTLR4 based parser into the `org.apache.spark.sql.catalyst.parser package`.
### How was this patch tested?
Existing unit tests.
cc rxin andrewor14 yhuai
Author: Herman van Hovell <hvanhovell@questtec.nl>
Closes#12071 from hvanhovell/SPARK-14211.
## What changes were proposed in this pull request?
Major changes:
1. Implement `FileFormat.buildReader()` for the CSV data source.
1. Add an extra argument to `FileFormat.buildReader()`, `physicalSchema`, which is basically the result of `FileFormat.inferSchema` or user specified schema.
This argument is necessary because the CSV data source needs to know all the columns of the underlying files in order to read a file; a sketch of the extended signature is shown below.
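A hedged sketch of what the extended signature might look like; aside from the added `physicalSchema` argument, the parameter names and order are assumptions:
```scala
def buildReader(
    sqlContext: SQLContext,
    physicalSchema: StructType,    // full schema of the underlying files (inferred or user-specified)
    partitionSchema: StructType,
    requiredSchema: StructType,
    filters: Seq[Filter],
    options: Map[String, String]): PartitionedFile => Iterator[InternalRow]
```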
## How was this patch tested?
Existing tests should do the work.
Author: Cheng Lian <lian@databricks.com>
Closes#12002 from liancheng/spark-14206-csv-build-reader.
## What changes were proposed in this pull request?
This change resolves an issue where `DataFrameNaFunctions.fill` changes a `FloatType` column to a `DoubleType`. We also clarify the contract that replacement values will be cast to the column data type, which may change the replacement value when casting to a lower precision type.
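A hedged usage sketch, assuming `df` is a DataFrame with a `FloatType` column named `ratio`:
```scala
import org.apache.spark.sql.types.FloatType

// After this change the column keeps its FloatType; the Double replacement value
// is cast to the column's type (here Float) before filling.
val filled = df.na.fill(0.0, Seq("ratio"))
assert(filled.schema("ratio").dataType == FloatType)
```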
## How was this patch tested?
This patch has associated unit tests.
Author: Travis Crawford <travis@medium.com>
Closes#11967 from traviscrawford/SPARK-14081-dataframena.
## What changes were proposed in this pull request?
This PR adds a config to control the maximum number of files per partition, as even small files have a non-trivial fixed cost. The current packing can put a lot of small files together, which causes straggler tasks.
## How was this patch tested?
I added tests to check if many files get split into partitions in FileSourceStrategySuite.
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
Closes#12068 from maropu/SPARK-14259.
jira: https://issues.apache.org/jira/browse/SPARK-11507
"In certain situations when adding two block matrices, I get an error regarding colPtr and the operation fails. External issue URL includes full error and code for reproducing the problem."
Root cause: `colPtr.last` does NOT always equal `values.length` in a breeze CSCMatrix, which fails the `require` in SparseMatrix.
Easy steps to reproduce:
```scala
val m1: BM[Double] = new CSCMatrix[Double](Array(1.0, 1, 1), 3, 3, Array(0, 1, 2, 3), Array(0, 1, 2))
val m2: BM[Double] = new CSCMatrix[Double](Array(1.0, 2, 2, 4), 3, 3, Array(0, 0, 2, 4), Array(1, 2, 1, 2))
val sum = m1 + m2
Matrices.fromBreeze(sum)
```
Solution: Checking the code in [CSCMatrix](28000a7b90/math/src/main/scala/breeze/linalg/CSCMatrix.scala) shows that a breeze CSCMatrix can keep extra zeros at the end of its data array. Invoking `compact` makes sure the matrix satisfies the `require` in SparseMatrix. This should add limited overhead, as the actual compact operation is only performed when necessary.
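A hedged sketch of the conversion with the fix applied (not the exact `Matrices.fromBreeze` code):
```scala
import breeze.linalg.CSCMatrix
import org.apache.spark.mllib.linalg.SparseMatrix

def fromBreezeSparse(sm: CSCMatrix[Double]): SparseMatrix = {
  if (sm.colPtrs.last != sm.data.length) {
    // compact() drops breeze's trailing zero padding so that colPtrs.last == data.length,
    // which is what the SparseMatrix constructor requires.
    sm.compact()
  }
  new SparseMatrix(sm.rows, sm.cols, sm.colPtrs, sm.rowIndices, sm.data)
}
```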
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#9520 from hhbyyh/matricesFromBreeze.
## What changes were proposed in this pull request?
```MultilayerPerceptronClassifier``` supports save/load for Python API.
## How was this patch tested?
doctest.
cc mengxr jkbradley yinxusen
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#11952 from yanboliang/spark-14152.
## What changes were proposed in this pull request?
Fix the wrong param name of LDA ```topicDistributionCol```.
## How was this patch tested?
No tests.
cc jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#12065 from yanboliang/lda-topicDistributionCol.
https://issues.apache.org/jira/browse/SPARK-14181
TrainValidationSplit should have HasSeed for the random split of RDD. I also changed the random split from the RDD function to the DataFrame function.
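A hedged sketch of the DataFrame-based split described above (`dataset`, `trainRatio`, and `seed` are assumed to be in scope):
```scala
// Split once with an explicit seed so the train/validation split is reproducible.
val Array(training, validation) =
  dataset.randomSplit(Array(trainRatio, 1.0 - trainRatio), seed)
```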
Author: Xusen Yin <yinxusen@gmail.com>
Closes#11985 from yinxusen/SPARK-14181.
Move the logic to find Spark jars to CommandBuilderUtils and make it
available for YARN code, so that it's possible to easily launch Spark
on YARN from a build directory.
Tested by running SparkPi from the build directory on YARN.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#11970 from vanzin/SPARK-13955.
## What changes were proposed in this pull request?
In `ExpressionEncoder`, we use `constructorFor` to build `fromRowExpression` as the `deserializer` in `ObjectOperator`, which is kind of confusing; we should make the names consistent.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#12058 from cloud-fan/rename.
## What changes were proposed in this pull request?
This PR implements buildReader for the text data source and enables it in the new data source code path.
## How was this patch tested?
Existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11934 from cloud-fan/text.
## What changes were proposed in this pull request?
It would be very helpful for network performance investigation if we log the time spent on connecting and resolving host.
## How was this patch tested?
Jenkins unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#12046 from zsxwing/connection-time.
#### What changes were proposed in this pull request?
This PR is to implement the following four Database-related DDL commands:
- `CREATE DATABASE|SCHEMA [IF NOT EXISTS] database_name`
- `DROP DATABASE [IF EXISTS] database_name [RESTRICT|CASCADE]`
- `DESCRIBE DATABASE [EXTENDED] db_name`
- `ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...)`
Another PR will be submitted to handle the unsupported commands. In the Database-related DDL commands, we will issue an error exception for `ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role`.
cc yhuai andrewor14 rxin Could you review the changes? Is it in the right direction? Thanks!
#### How was this patch tested?
Added a few test cases in `command/DDLSuite.scala` for testing DDL command execution in `SQLContext`. Since `HiveContext` also shares the same implementation, the existing test cases in `\hive` also verifies the correctness of these commands.
Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
Closes#12009 from gatorsmile/dbDDL.
## What changes were proposed in this pull request?
For `MemoryMode.OFF_HEAP`, `Unsafe.getInt` etc. are used with no restriction.
However, the Oracle implementation uses these methods only if the class variable `unaligned` (commented as "Cached unaligned-access capability") is true, which seems to be computed based on whether the architecture is i386, x86, amd64, or x86_64.
I think we should perform a similar check for the use of Unsafe.
Reference: https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/internal/PlatformDependent0.java#L112
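A hedged sketch of such a check, following the Netty approach linked above (not necessarily the exact code added by this patch):
```scala
// Probe java.nio.Bits#unaligned() via reflection; if that fails, fall back to an
// architecture allow-list similar to what the JDK itself uses.
val unaligned: Boolean =
  try {
    val bitsClass = Class.forName("java.nio.Bits", false, ClassLoader.getSystemClassLoader)
    val unalignedMethod = bitsClass.getDeclaredMethod("unaligned")
    unalignedMethod.setAccessible(true)
    unalignedMethod.invoke(null).asInstanceOf[Boolean]
  } catch {
    case _: Throwable =>
      val arch = System.getProperty("os.arch", "")
      arch.matches("^(i[3-6]86|x86(_64)?|x64|amd64)$")
  }
```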
## How was this patch tested?
Unit test suite
Author: tedyu <yuzhihong@gmail.com>
Closes#11943 from tedyu/master.
## What changes were proposed in this pull request?
Builds on https://github.com/apache/spark/pull/12022 and (a) appends "..." to truncated comment strings and (b) fixes indentation in lines after the commented strings if they happen to have a `(`, `{`, `)` or `}`
## How was this patch tested?
Manually examined the generated code.
Author: Sameer Agarwal <sameer@databricks.com>
Closes#12044 from sameeragarwal/comment.
## What changes were proposed in this pull request?
This PR brings the support for chained Python UDFs, for example
```sql
select udf1(udf2(a))
select udf1(udf2(a) + 3)
select udf1(udf2(a) + udf3(b))
```
Directly chained unary Python UDFs are also put into a single batch of Python UDFs; others may require multiple batches.
For example,
```python
>>> sqlContext.sql("select double(double(1))").explain()
== Physical Plan ==
WholeStageCodegen
: +- Project [pythonUDF#10 AS double(double(1))#9]
: +- INPUT
+- !BatchPythonEvaluation double(double(1)), [pythonUDF#10]
+- Scan OneRowRelation[]
>>> sqlContext.sql("select double(double(1) + double(2))").explain()
== Physical Plan ==
WholeStageCodegen
: +- Project [pythonUDF#19 AS double((double(1) + double(2)))#16]
: +- INPUT
+- !BatchPythonEvaluation double((pythonUDF#17 + pythonUDF#18)), [pythonUDF#17,pythonUDF#18,pythonUDF#19]
+- !BatchPythonEvaluation double(2), [pythonUDF#17,pythonUDF#18]
+- !BatchPythonEvaluation double(1), [pythonUDF#17]
+- Scan OneRowRelation[]
```
TODO: will support multiple unrelated Python UDFs in one batch (another PR).
## How was this patch tested?
Added new unit tests for chained UDFs.
Author: Davies Liu <davies@databricks.com>
Closes#12014 from davies/py_udfs.
## What changes were proposed in this pull request?
This PR is a simple fix for an exception message to print `string[]` content correctly.
```java
String[] colPath = requestedSchema.getPaths().get(i);
...
- throw new IOException("Required column is missing in data file. Col: " + colPath);
+ throw new IOException("Required column is missing in data file. Col: " + Arrays.toString(colPath));
```
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12041 from dongjoon-hyun/fix_exception_message_with_string_array.
## What changes were proposed in this pull request?
This PR fixes two trivial typos: 'does not **much**' --> 'does not **match**'.
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12042 from dongjoon-hyun/fix_typo_by_replacing_much_with_match.
Add a new api endpoint `/api/v1/version` to retrieve various version info. This PR only adds support for finding the current spark version, however other version info such as jvm or scala versions can easily be added.
Author: Jakob Odersky <jodersky@gmail.com>
Closes#10760 from jodersky/SPARK-10570.
## What changes were proposed in this pull request?
The event timeline doesn't show on job page if an executor is removed with a multiple line reason. This PR replaces all new line characters in the reason string with spaces.
![timelineerror](https://cloud.githubusercontent.com/assets/9278199/14100211/5fd4cd30-f5be-11e5-9cea-f32651a4cd62.jpg)
## How was this patch tested?
Verified on the Web UI.
Author: Carson Wang <carson.wang@intel.com>
Closes#12029 from carsonwang/eventTimeline.
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-14154
I just read the code for KolmogorovSmirnovTest and found it could be much simplified by following the original definition.
Sending a PR for discussion.
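For reference, a hedged sketch of the one-sample KS statistic computed directly from its definition, D = sup_x |F_n(x) - F(x)| (an illustration, not the patched code):
```scala
// Evaluate the empirical CDF just before and at each sorted sample point and take
// the largest deviation from the theoretical CDF.
def ksStatistic(sample: Array[Double], cdf: Double => Double): Double = {
  val sorted = sample.sorted
  val n = sorted.length.toDouble
  sorted.zipWithIndex.map { case (x, i) =>
    val f = cdf(x)
    math.max(math.abs((i + 1) / n - f), math.abs(f - i / n))
  }.max
}
```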
## How was this patch tested?
unit test
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes#11954 from hhbyyh/ksoptimize.
## What changes were proposed in this pull request?
Renames SQL option `spark.sql.parquet.fileScan` since now all `HadoopFsRelation` based data sources are being migrated to `FileScanRDD` code path.
## How was this patch tested?
None.
Author: Cheng Lian <lian@databricks.com>
Closes#12003 from liancheng/spark-14208-option-renaming.
## What changes were proposed in this pull request?
Adding binary toggle parameter to ml.feature.HashingTF, as well as mllib.feature.HashingTF since the former wraps this functionality. This parameter, if true, will set non-zero valued term counts to 1 to transform term count features to binary values that are well suited for discrete probability models.
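A hedged usage sketch of the new toggle (`wordsDF` is an assumed DataFrame with an array-of-strings column named `words`):
```scala
import org.apache.spark.ml.feature.HashingTF

val tf = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setBinary(true)   // new parameter: clamp non-zero term counts to 1.0
val features = tf.transform(wordsDF)
```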
## How was this patch tested?
Added unit tests for ML and MLlib
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#11832 from BryanCutler/binary-param-HashingTF-SPARK-13963.
## What changes were proposed in this pull request?
This PR implements buildReader for json data source and enable it in the new data source code path.
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11960 from cloud-fan/json.
Add property to MLWritable.write method, so we can use .write instead of .write()
Add a new test to ml/test.py to check whether the write is a property.
./python/run-tests --python-executables=python2.7 --modules=pyspark-ml
Will test against the following Python executables: ['python2.7']
Will test the following Python modules: ['pyspark-ml']
Finished test(python2.7): pyspark.ml.evaluation (11s)
Finished test(python2.7): pyspark.ml.clustering (16s)
Finished test(python2.7): pyspark.ml.classification (24s)
Finished test(python2.7): pyspark.ml.recommendation (24s)
Finished test(python2.7): pyspark.ml.feature (39s)
Finished test(python2.7): pyspark.ml.regression (26s)
Finished test(python2.7): pyspark.ml.tuning (15s)
Finished test(python2.7): pyspark.ml.tests (30s)
Tests passed in 55 seconds
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes#11945 from wangmiao1981/fix_property.
## What changes were proposed in this pull request?
Now that GBTs have been moved to ML, they can use the implementation of feature importance for random forests. This patch simply adds a `featureImportances` attribute to `GBTClassifier` and `GBTRegressor` and adds tests for each.
GBT feature importances here simply average the feature importances for each tree in its ensemble. This follows the implementation from scikit-learn. This method is also suggested by J Friedman in [this paper](https://statweb.stanford.edu/~jhf/ftp/trebst.pdf).
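A hedged sketch of that averaging over plain arrays (not the actual spark.ml code):
```scala
// Sum the per-tree (already normalized) importance vectors and renormalize,
// which equals averaging them when all trees are weighted equally.
def averageImportances(perTree: Seq[Array[Double]]): Array[Double] = {
  val summed = perTree.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
  val total = summed.sum
  summed.map(_ / total)
}
```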
## How was this patch tested?
Unit tests were added to `GBTClassifierSuite` and `GBTRegressorSuite` to validate feature importances.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#11961 from sethah/SPARK-11730.
## What changes were proposed in this pull request?
Refactor RRDD by separating the common logic for interacting with the R worker into a new class, RRunner, which can be used to evaluate R UDFs.
Now RRDD relies on RRunner for RDD computation, and RRDD could be removed if we want to remove the RDD API in SparkR later.
## How was this patch tested?
dev/lint-r
SparkR unit tests
Author: Sun Rui <rui.sun@intel.com>
Closes#12024 from sun-rui/SPARK-12792_new.