This is a partial solution to SPARK-1701, addressing only the
documentation confusion.
Additional work could actually rename the numSlices parameter
across languages, with care required in Scala and Python to maintain
backward compatibility for named parameters.
Author: Matthew Farrellee <matt@redhat.com>
Closes#2305 from mattf/SPARK-1701 and squashes the following commits:
c0af05d [Matthew Farrellee] Further tweak
06f80fc [Matthew Farrellee] Wording tweak from Josh Rosen's review
7b045e0 [Matthew Farrellee] [SPARK-1701] Clarify slice vs partition in the programming guide
This commit exists to close the following pull requests on Github:
Closes#726 (close requested by 'pwendell')
Closes#151 (close requested by 'pwendell')
VertexRDD.apply had a bug where it ignored the merge function for
duplicate vertices and instead used whichever vertex attribute occurred
first. This commit fixes the bug by passing the merge function through
to ShippableVertexPartition.apply, which merges any duplicates using the
merge function and then fills in missing vertices using the specified
default vertex attribute. This commit also adds a unit test for
VertexRDD.apply.
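For illustration, here is a minimal sketch of the behavior the fix guarantees, written against the public GraphX API (the vertex and edge data and the summing merge function are hypothetical, and exact signatures may vary across Spark versions):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, EdgeRDD, VertexRDD}

def example(sc: SparkContext): Unit = {
  // Vertex 1L appears twice; before the fix, one of its attributes won arbitrarily.
  val vertices = sc.parallelize(Seq((1L, 10), (1L, 20), (2L, 30)))
  val edges = EdgeRDD.fromEdges[Int, Int](sc.parallelize(Seq(Edge(1L, 2L, 0))))
  // With the fix, duplicates are combined by the merge function (1L -> 10 + 20 = 30),
  // and vertices present in `edges` but missing from `vertices` get the default (0).
  val vertexRDD = VertexRDD(vertices, edges, 0, (a: Int, b: Int) => a + b)
  vertexRDD.collect().foreach(println)
}
```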
Author: Larry Xiao <xiaodi@sjtu.edu.cn>
Author: Blie Arkansol <xiaodi@sjtu.edu.cn>
Author: Ankur Dave <ankurdave@gmail.com>
Closes#1903 from larryxiao/2062 and squashes the following commits:
625aa9d [Blie Arkansol] Merge pull request #1 from ankurdave/SPARK-2062
476770b [Ankur Dave] ShippableVertexPartition.initFrom: Don't run mergeFunc on default values
614059f [Larry Xiao] doc update: note about the default null value vertices construction
dfdb3c9 [Larry Xiao] minor fix
1c70366 [Larry Xiao] scalastyle check: wrap line, parameter list indent 4 spaces
e4ca697 [Larry Xiao] [TEST] VertexRDD.apply mergeFunc
6a35ea8 [Larry Xiao] [TEST] VertexRDD.apply mergeFunc
4fbc29c [Blie Arkansol] undo unnecessary change
efae765 [Larry Xiao] fix mistakes: should be able to call with or without mergeFunc
b2422f9 [Larry Xiao] Merge branch '2062' of github.com:larryxiao/spark into 2062
52dc7f7 [Larry Xiao] pass mergeFunc to VertexPartitionBase, where merge is handled
581e9ee [Larry Xiao] TODO: VertexRDDSuite
20d80a3 [Larry Xiao] [SPARK-2062][GraphX] VertexRDD.apply does not use the mergeFunc
Local `SparseMatrix` support has been added in Compressed Column Storage (CCS) format, along with Level-2 and Level-3 BLAS operations such as dgemv and dgemm, respectively.
BLAS doesn't support sparse matrix operations, so implementations of `SparseMatrix`-`DenseMatrix` multiplication and `SparseMatrix`-`DenseVector` multiplication have been added. I will post performance comparisons in the comments momentarily.
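As a rough illustration of the CCS layout and the new multiply paths (a sketch against MLlib's local linear algebra API; the matrix values are made up):

```scala
import org.apache.spark.mllib.linalg.{DenseVector, Matrices}

// A 3x2 matrix in compressed column storage (CCS):
//   1.0  0.0
//   0.0  3.0
//   2.0  0.0
// colPtrs(j) marks where column j's entries start in values/rowIndices.
val sm = Matrices.sparse(3, 2,
  Array(0, 2, 3),        // colPtrs
  Array(0, 2, 1),        // rowIndices
  Array(1.0, 2.0, 3.0))  // values

// SparseMatrix-DenseVector multiplication:
val result = sm.multiply(new DenseVector(Array(1.0, 1.0)))  // [1.0, 3.0, 2.0]
```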
Author: Burak <brkyvz@gmail.com>
Closes#2294 from brkyvz/SPARK-3418 and squashes the following commits:
88814ed [Burak] Hopefully fixed MiMa this time
47e49d5 [Burak] really fixed MiMa issue
f0bae57 [Burak] [SPARK-3418] Fixed MiMa compatibility issues (excluded from check)
4b7dbec [Burak] 9/17 comments addressed
7af2f83 [Burak] sealed traits Vector and Matrix
d3a8a16 [Burak] [SPARK-3418] Squashed missing alpha bug.
421045f [Burak] [SPARK-3418] New code review comments addressed
f35a161 [Burak] [SPARK-3418] Code review comments addressed and multiplication further optimized
2508577 [Burak] [SPARK-3418] Fixed one more style issue
d16e8a0 [Burak] [SPARK-3418] Fixed style issues and added documentation for methods
204a3f7 [Burak] [SPARK-3418] Fixed failing Matrix unit test
6025297 [Burak] [SPARK-3418] Fixed Scala-style errors
dc7be71 [Burak] [SPARK-3418][MLlib] Matrix unit tests expanded with indexing and updating
d2d5851 [Burak] [SPARK-3418][MLlib] Sparse Matrix support and additional native BLAS operations added
Py4J cannot handle large strings efficiently, so we should automatically use a broadcast for large closures. (Broadcast uses the local filesystem to pass the data through.)
Author: Davies Liu <davies.liu@gmail.com>
Closes#2417 from davies/command and squashes the following commits:
fbf4e97 [Davies Liu] bugfix
aefd508 [Davies Liu] use broadcast automatically for large closure
This was introduced in #2449
Author: Andrew Or <andrewor14@gmail.com>
Closes#2452 from andrewor14/standalone-hot-fix and squashes the following commits:
d5190ca [Andrew Or] Put that line in the right place
Author: Victsm <victor.nju@gmail.com>
Author: Min Shen <mshen@linkedin.com>
Closes#2449 from Victsm/SPARK-3560 and squashes the following commits:
918405a [Victsm] Removed the additional space
4502a2a [Min Shen] [SPARK-3560] Fixed setting spark.jars system property in yarn-cluster mode.
(cherry picked from commit 832dff64dd)
Signed-off-by: Andrew Or <andrewor14@gmail.com>
https://issues.apache.org/jira/browse/SPARK-3589
"export CLASSPATH" in spark-class is redundant since same variable is exported before.
We could reuse defined value "isYarnCluster" in SparkSubmit.scala.
Author: WangTaoTheTonic <barneystinson@aliyun.com>
Closes#2445 from WangTaoTheTonic/removeRedundant and squashes the following commits:
6fb6872 [WangTaoTheTonic] remove redundant code
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#2426 from sarutak/emacs-metafiles-ignore and squashes the following commits:
a306020 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into emacs-metafiles-ignore
6a0a5eb [Kousuke Saruta] Added cmd file entry to .rat-excludes and .gitignore
897da63 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into emacs-metafiles-ignore
8cade06 [Kousuke Saruta] Modified .gitignore to ignore emacs lock file and backup file
This patch makes some small changes to fix this problem:
1. We document specific versions of Jekyll/Kramdown to use that match
those used when building the upstream docs.
2. We add a configuration for a property that for some reason varies across
packages of Jekyll/Kramdown even with the same version.
Author: Patrick Wendell <pwendell@gmail.com>
Closes#2443 from pwendell/jekyll and squashes the following commits:
54ee2ab [Patrick Wendell] SPARK-3579 Jekyll doc generation is different across environments.
As an improvement on https://github.com/apache/spark/pull/1944, we should use a more distinctive exit code to represent ClassNotFoundException.
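A sketch of the resulting behavior (the exit code 101 follows the commit list below; the constant name and surrounding code are illustrative):

```scala
// Exit with a distinctive status when the user's main class is missing,
// so callers can tell this apart from a generic failure (exit code 1).
val CLASS_NOT_FOUND_EXIT_STATUS = 101

def loadMainClass(name: String): Class[_] =
  try {
    Class.forName(name)
  } catch {
    case _: ClassNotFoundException =>
      System.err.println(s"Cannot load main class: $name")
      sys.exit(CLASS_NOT_FOUND_EXIT_STATUS)
  }
```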
Author: WangTaoTheTonic <barneystinson@aliyun.com>
Closes#2421 from WangTaoTheTonic/classnotfoundExitCode and squashes the following commits:
645a22a [WangTaoTheTonic] Several typos to trigger Jenkins
d6ae559 [WangTaoTheTonic] use 101 instead
a2d6465 [WangTaoTheTonic] use 127 instead
fbb232f [WangTaoTheTonic] Using a special exit code instead of 1 to represent ClassNotFoundException
Author: GuoQiang Li <witgo@qq.com>
Closes#2326 from witgo/rat-excludes and squashes the following commits:
860904e [GuoQiang Li] rat exclude dependency-reduced-pom.xml
Addresses the problem pointed out in [this comment](https://github.com/apache/spark/pull/2441#issuecomment-55990116).
Author: Nicholas Chammas <nicholas.chammas@gmail.com>
Closes#2442 from nchammas/patch-1 and squashes the following commits:
7e68b60 [Nicholas Chammas] [SPARK-3534] Add hive-thriftserver to SQL tests
https://issues.apache.org/jira/browse/SPARK-3565
"spark.ports.maxRetries" should be "spark.port.maxRetries". Make the configuration keys in document and code consistent.
Author: WangTaoTheTonic <barneystinson@aliyun.com>
Closes#2427 from WangTaoTheTonic/fixPortRetries and squashes the following commits:
c178813 [WangTaoTheTonic] Use blank lines trigger Jenkins
646f3fe [WangTaoTheTonic] also in SparkBuild.scala
3700dba [WangTaoTheTonic] Fix configuration item not consistent with document
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#2428 from sarutak/appid-volatile-modification and squashes the following commits:
c7d890d [Kousuke Saruta] Added volatile modifier to appId field in SparkDeploySchedulerBackend
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#2424 from sarutak/display-appid-on-webui and squashes the following commits:
417fe90 [Kousuke Saruta] Added "App ID column" to HistoryPage
I think this issue was caused by #1106.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#2436 from sarutak/SPARK-3571 and squashes the following commits:
7a4deea [Kousuke Saruta] Modified Master.scala to use numWorkersVisited and numWorkersAlive instead of stopPos
4e51e35 [Kousuke Saruta] Modified Master to prevent from 0 divide
4817ecd [Kousuke Saruta] Brushed up previous change
71e84b6 [Kousuke Saruta] Modified Master to enable schedule normally
Testing arguments to `sbt` need to be passed as an array, not as a single long string.
Fixes a bug introduced in #2420.
Author: Nicholas Chammas <nicholas.chammas@gmail.com>
Closes#2437 from nchammas/selective-testing and squashes the following commits:
a9f9c1c [Nicholas Chammas] fix printing of sbt test arguments
cf57cbf [Nicholas Chammas] fix sbt test arguments
e33b978 [Nicholas Chammas] Merge pull request #2 from apache/master
0b47ca4 [Nicholas Chammas] Merge branch 'master' of github.com:nchammas/spark
8051486 [Nicholas Chammas] Merge pull request #1 from apache/master
03180a4 [Nicholas Chammas] Merge branch 'master' of github.com:nchammas/spark
d4c5f43 [Nicholas Chammas] Merge pull request #6 from apache/master
Makes the table of contents read better
Author: Andrew Ash <andrew@andrewash.com>
Closes#2402 from ash211/docs/better-indentation and squashes the following commits:
ea0e130 [Andrew Ash] Move HA subsections to a deeper indentation level
If the only files changed are related to SQL, then only run the SQL tests.
This patch includes some cosmetic/maintainability refactoring. I would be more than happy to undo some of these changes if they are inappropriate.
We can accept this patch mostly as-is and address the immediate need documented in [SPARK-3534](https://issues.apache.org/jira/browse/SPARK-3534), or we can keep it open until a satisfactory solution along the lines [discussed here](https://issues.apache.org/jira/browse/SPARK-1455?focusedCommentId=14136424&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14136424) is reached.
Note: I had to hack this patch up to test it locally, so what I'm submitting here and what I tested are technically different.
Author: Nicholas Chammas <nicholas.chammas@gmail.com>
Closes#2420 from nchammas/selective-testing and squashes the following commits:
db3fa2d [Nicholas Chammas] diff against master!
f9e23f6 [Nicholas Chammas] when possible, run SQL tests only
Author: Michael Armbrust <michael@databricks.com>
Closes#2434 from marmbrus/patch-1 and squashes the following commits:
67215be [Michael Armbrust] [SQL][DOCS] Improve table caching section
Author: Nicholas Chammas <nicholas.chammas@gmail.com>
Closes#2430 from nchammas/patch-2 and squashes the following commits:
d476bfb [Nicholas Chammas] [Docs] minor grammar fix
The JIRA and PR were originally created for branch-1.1, and have now been moved to the master branch.
Chester
The issue is that yarn-alpha and yarn have different APIs for certain class fields. In this particular case, ClientBase uses reflection to address the difference, so we need a different way to test ClientBase's methods. The original ClientBaseSuite used the getFieldValue() method for this, but it doesn't work for yarn-alpha, where the API returns an array of Strings instead of a single String (as the yarn-stable API does).
To fix the test, I added a new method:

```scala
import scala.reflect.ClassTag
import scala.util.Try

// Reflectively reads `field` from `clazz`. The ClassTag context bounds allow
// the runtime pattern match to distinguish a field of type A from one of type
// A1, so the corresponding mapTo function can be applied; if the field is
// absent or of an unexpected type, `defaults` is returned.
def getFieldValue2[A: ClassTag, A1: ClassTag, B](
    clazz: Class[_], field: String, defaults: => B)
    (mapTo: A => B)(mapTo1: A1 => B): B =
  Try(clazz.getField(field)).map(_.get(null)).map {
    case v: A => mapTo(v)
    case v1: A1 => mapTo1(v1)
    case _ => defaults
  }.toOption.getOrElse(defaults)
```

to handle the cases where the field type can be either A or A1. The type is pattern matched, and the corresponding mapTo function (mapTo or mapTo1) is applied.
Author: chesterxgchen <chester@alpinenow.com>
Closes#2204 from chesterxgchen/SPARK-3177-master and squashes the following commits:
e72a6ea [chesterxgchen] The issue is that yarn-alpha and yarn have different APIs for certain class fields. In this particular case, ClientBase uses reflection to address the difference, so we need a different way to test ClientBase's methods. The original ClientBaseSuite used the getFieldValue() method for this, but it doesn't work for yarn-alpha, where the API returns an array of Strings instead of a single String (as the yarn-stable API does).
change the value of spark.files.fetchTimeout
Author: viper-kun <xukun.xu@huawei.com>
Closes#2406 from viper-kun/master and squashes the following commits:
ecb0d46 [viper-kun] [Docs] Correct spark.files.fetchTimeout default value
7cf4c7a [viper-kun] Update configuration.md
Some config files in ```conf``` should be ignored, such as
conf/fairscheduler.xml
conf/hive-log4j.properties
conf/metrics.properties
...
So ignore all ```sh```/```properties```/```conf```/```xml``` files
Author: wangfei <wangfei1@huawei.com>
Closes#2395 from scwf/patch-2 and squashes the following commits:
3dc53f2 [wangfei] duplicate ```conf/*.conf```
3c2986f [wangfei] ignore all config files
The test "jetty selects different port under contention" is flaky.
If another process binds to 4040 before the test starts, then the first server we start there will fail, and subsequent servers may successfully bind to 4040 if it is released in the meantime. Instead, we should just let Java find a random free port for us and hold onto it for the duration of the test.
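A minimal sketch of the technique using plain java.net (independent of the actual test code):

```scala
import java.net.ServerSocket

// Bind to port 0 so the OS assigns a free ephemeral port, then hold the
// socket for the duration of the test so no other process can take the port.
val socket = new ServerSocket(0)
val reservedPort = socket.getLocalPort
try {
  // ... exercise the port-contention logic against `reservedPort` ...
} finally {
  socket.close()
}
```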
Author: Andrew Or <andrewor14@gmail.com>
Closes#2418 from andrewor14/fix-port-contention and squashes the following commits:
0cd4974 [Andrew Or] Stop them servers
a7071fe [Andrew Or] Pick random port instead of 4040
This adds a new page to the docs listing community projects -- those created outside of Apache Spark that are of interest to the community of Spark users. Anybody can add to it just by submitting a PR.
There was a discussion thread about alternatives:
* Creating a Github organization for Spark projects - we could not find any sponsors for this, and it would be difficult to organize since many folks just create repos in their company organization or personal accounts
* Apache has some place for storing community projects, but it was deemed difficult to work with, and again there would be permissions issues -- not everyone could update it.
Author: Evan Chan <velvia@gmail.com>
Closes#2219 from velvia/community-projects-page and squashes the following commits:
7316822 [Evan Chan] Point to Spark wiki: supplemental projects page
613b021 [Evan Chan] Add a few more projects
a85eaaf [Evan Chan] Add a Community Projects page
When deploying to AWS, additional configuration is required to read S3 files. EMR creates it automatically; there is no reason the Spark EC2 script shouldn't do the same.
This PR requires a corresponding PR to the mesos/spark-ec2 to be merged, as it gets cloned in the process of setting up machines: https://github.com/mesos/spark-ec2/pull/58
Author: Dan Osipov <daniil.osipov@shazam.com>
Closes#1120 from danosipov/s3_credentials and squashes the following commits:
758da8b [Dan Osipov] Modify documentation to include the new parameter
71fab14 [Dan Osipov] Use a parameter --copy-aws-credentials to enable S3 credential deployment
7e0da26 [Dan Osipov] Get AWS credentials out of boto connection instance
39bdf30 [Dan Osipov] Add S3 configuration parameters to the EC2 deploy scripts
Using Sphinx to generate API docs for PySpark.
Requirement: Sphinx
```
$ cd python/docs/
$ make html
```
The generated API docs will be located at python/docs/_build/html/index.html
They can co-exist with the docs generated by Epydoc.
This is a first working version; after it is merged in, we can continue to improve it and eventually replace Epydoc.
Author: Davies Liu <davies.liu@gmail.com>
Closes#2292 from davies/sphinx and squashes the following commits:
425a3b1 [Davies Liu] cleanup
1573298 [Davies Liu] move docs to python/docs/
5fe3903 [Davies Liu] Merge branch 'master' into sphinx
9468ab0 [Davies Liu] fix makefile
b408f38 [Davies Liu] address all comments
e2ccb1b [Davies Liu] update name and version
9081ead [Davies Liu] generate PySpark API docs using Sphinx
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#2408 from sarutak/resolve-resource-leak-issue and squashes the following commits:
074781d [Kousuke Saruta] Modified ShuffleBlockFetcherIterator
5f63f67 [Kousuke Saruta] Move metrics increment logic and debug logging outside try block
b37231a [Kousuke Saruta] Modified FileSegmentManagedBuffer#nioByteBuffer to check null or not before invoking channel.close
bf29d4a [Kousuke Saruta] Modified FileSegment to close channel
Taken from liancheng's updates. Resolved merge conflicts with #2316.
Author: Michael Armbrust <michael@databricks.com>
Closes#2384 from marmbrus/sqlDocUpdate and squashes the following commits:
2db6319 [Michael Armbrust] @liancheng's updates
Author: Nicholas Chammas <nicholas.chammas@gmail.com>
Closes#2414 from nchammas/patch-1 and squashes the following commits:
14664bf [Nicholas Chammas] [Docs] minor punctuation fix
SchemaRDD overrides RDD functions, including collect, count, and take, with optimized versions making use of the query optimizer. The java and python interface classes wrapping SchemaRDD need to ensure the optimized versions are called as well. This patch overrides relevant calls in the python and java interfaces with optimized versions.
Adds a new Row serialization pathway between python and java, based on JList[Array[Byte]] versus the existing RDD[Array[Byte]]. I wasn’t overjoyed about doing this, but I noticed that some QueryPlans implement optimizations in executeCollect(), which outputs an Array[Row] rather than the typical RDD[Row] that can be shipped to python using the existing serialization code. To me it made sense to ship the Array[Row] over to python directly instead of converting it back to an RDD[Row] just for the purpose of sending the Rows to python using the existing serialization code.
Author: Aaron Staple <aaron.staple@gmail.com>
Closes#1592 from staple/SPARK-2314 and squashes the following commits:
89ff550 [Aaron Staple] Merge with master.
6bb7b6c [Aaron Staple] Fix typo.
b56d0ac [Aaron Staple] [SPARK-2314][SQL] Override count in JavaSchemaRDD, forwarding to SchemaRDD's count.
0fc9d40 [Aaron Staple] Fix comment typos.
f03cdfa [Aaron Staple] [SPARK-2314][SQL] Override collect and take in sql.py, forwarding to SchemaRDD's collect.
Throwing an error in the constructor makes it impossible to run queries, even when there is no actual ambiguity. Remove this check in favor of throwing an error in analysis when the query actually is ambiguous.
Also took the opportunity to add test cases that would have caught a subtle bug in my first attempt at fixing this and refactor some other test code.
Author: Michael Armbrust <michael@databricks.com>
Closes#2209 from marmbrus/sameNameStruct and squashes the following commits:
729cca4 [Michael Armbrust] Better tests.
a003aeb [Michael Armbrust] Remove error (it'll be caught in analysis).
This PR aims to support reading top-level JSON arrays, taking every element in such an array as a row (an empty array will not generate a row).
JIRA: https://issues.apache.org/jira/browse/SPARK-3308
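For example (a sketch using the SQLContext API of this era; the file name and contents are hypothetical):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

def example(sc: SparkContext): Unit = {
  val sqlContext = new SQLContext(sc)
  // people.json contains a single top-level array:
  //   [{"name": "a"}, {"name": "b"}]
  // Each element becomes a row; an empty array ([]) yields no rows.
  val people = sqlContext.jsonFile("people.json")
  people.printSchema()
}
```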
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes#2400 from yhuai/SPARK-3308 and squashes the following commits:
990077a [Yin Huai] Handle top level JSON arrays.
Added missing rdd.distinct(numPartitions) and associated tests
Author: Matthew Farrellee <matt@redhat.com>
Closes#2383 from mattf/SPARK-3519 and squashes the following commits:
30b837a [Matthew Farrellee] Combine test cases to save on JVM startups
6bc4a2c [Matthew Farrellee] [SPARK-3519] add distinct(n) to SchemaRDD in PySpark
7a17f2b [Matthew Farrellee] [SPARK-3519] add distinct(n) to PySpark
Author: Cheng Hao <hao.cheng@intel.com>
Closes#2392 from chenghao-intel/trim and squashes the following commits:
e52024f [Cheng Hao] trim the string message
Here's my crack at Bertrand's suggestion. The Github `README.md` contains build info that's outdated. It should just point to the current online docs, and reflect that Maven is the primary build now.
(Incidentally, the stanza at the end about contributions of original work should go in https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark too. It won't hurt to be crystal clear about the agreement to license, given that ICLAs are not required of anyone here.)
Author: Sean Owen <sowen@cloudera.com>
Closes#2014 from srowen/SPARK-3069 and squashes the following commits:
501507e [Sean Owen] Note that Zinc is for Maven builds too
db2bd97 [Sean Owen] sbt -> sbt/sbt and add note about zinc
be82027 [Sean Owen] Fix additional occurrences of building-with-maven -> building-spark
91c921f [Sean Owen] Move building-with-maven to building-spark and create a redirect. Update doc links to building-spark.html Add jekyll-redirect-from plugin and make associated config changes (including fixing pygments deprecation). Add example of SBT to README.md
999544e [Sean Owen] Change "Building Spark with Maven" title to "Building Spark"; reinstate tl;dr info about dev/run-tests in README.md; add brief note about building with SBT
c18d140 [Sean Owen] Optionally, remove the copy of contributing text from main README.md
8e83934 [Sean Owen] Add CONTRIBUTING.md to trigger notice on new pull request page
b1c04a1 [Sean Owen] Refer to current online documentation for building, and remove slightly outdated copy in README.md
Short version: NetworkInterface.getNetworkInterfaces returns interfaces in reverse order compared to ifconfig output. It may pick up an IP address associated with tun0 or another virtual network interface.
See [SPARK-3040](https://issues.apache.org/jira/browse/SPARK-3040) for more detail
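A minimal sketch of the workaround using the standard java.net API (the filtering conditions are illustrative):

```scala
import java.net.NetworkInterface
import scala.collection.JavaConverters._

// Reverse the interface order so physical interfaces are preferred over
// virtual ones such as tun0, which getNetworkInterfaces may list first.
val candidates =
  for {
    ni <- NetworkInterface.getNetworkInterfaces.asScala.toSeq.reverse
    addr <- ni.getInetAddresses.asScala
    if !addr.isLoopbackAddress && !addr.isLinkLocalAddress
  } yield addr

candidates.headOption.foreach(addr => println(s"local address: $addr"))
```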
Author: Ye Xianjin <advancedxy@gmail.com>
Closes#1946 from advancedxy/SPARK-3040 and squashes the following commits:
f33f6b2 [Ye Xianjin] add windows support
087a785 [Ye Xianjin] reverse the Networkinterface.getNetworkInterfaces output order to get a more proper local ip address.
The reported false positives were actually due to the MiMa generator not picking up the new jars in the presence of old jars (theoretically this should not have happened). As a workaround, we run them both separately and just append the results together.
Author: Prashant Sharma <prashant@apache.org>
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes#2285 from ScrapCodes/mima-fix and squashes the following commits:
093c76f [Prashant Sharma] Update mima
59012a8 [Prashant Sharma] Update mima
35b6c71 [Prashant Sharma] SPARK-3433 Fix for Mima false-positives with @DeveloperAPI and @Experimental annotations.
Tested on a real cluster.
Author: Reynold Xin <rxin@apache.org>
Closes#2404 from rxin/ec2-reboot-slaves and squashes the following commits:
00a2dbd [Reynold Xin] Allow rebooting slaves.
Also made some cosmetic cleanups.
Author: Aaron Staple <aaron.staple@gmail.com>
Closes#2385 from staple/SPARK-1087 and squashes the following commits:
7b3bb13 [Aaron Staple] Address review comments, cosmetic cleanups.
10ba6e1 [Aaron Staple] [SPARK-1087] Move python traceback utilities into new traceback_utils.py file.
Pyrolite cannot unpickle an array.array that was pickled by Python 2.6; this patch fixes that by extending Pyrolite.
There is also a bug in Pyrolite when unpickling arrays of float/double; this patch works around it by reversing the endianness for float/double. The workaround should be removed once Pyrolite publishes a new release fixing the issue.
I have sent a PR to Pyrolite to fix it: https://github.com/irmen/Pyrolite/pull/11
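To make the endianness workaround concrete, here is a hypothetical sketch of reinterpreting a double's bytes with the byte order reversed (not the actual patch code):

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Write the double big-endian, then read the same 8 bytes little-endian:
// equivalent to reversing the value's byte order.
def reverseEndianness(d: Double): Double = {
  val buf = ByteBuffer.allocate(8).order(ByteOrder.BIG_ENDIAN)
  buf.putDouble(d)
  buf.order(ByteOrder.LITTLE_ENDIAN).getDouble(0)
}
```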
Author: Davies Liu <davies.liu@gmail.com>
Closes#2365 from davies/pickle and squashes the following commits:
f44f771 [Davies Liu] enable tests about array
3908f5c [Davies Liu] Merge branch 'master' into pickle
c77c87b [Davies Liu] cleanup debugging code
60e4e2f [Davies Liu] support unpickle array.array for Python 2.6
Added minInstancesPerNode, minInfoGain params to:
* DecisionTreeRunner.scala example
* Python API (tree.py)
Also:
* Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements"
* small style fixes
CC: mengxr
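A sketch of the new parameters from the Scala side (the parameter names follow this commit; the other values are illustrative, and the Python API in tree.py mirrors them):

```scala
import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
import org.apache.spark.mllib.tree.impurity.Gini

val strategy = new Strategy(
  algo = Algo.Classification,
  impurity = Gini,
  maxDepth = 5,
  numClassesForClassification = 2,
  minInstancesPerNode = 2,  // don't create splits that leave fewer instances in a child
  minInfoGain = 0.01)       // don't split unless the information gain exceeds this
```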
Author: qiping.lqp <qiping.lqp@alibaba-inc.com>
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
Author: chouqin <liqiping1991@gmail.com>
Closes#2349 from jkbradley/chouqin-dt-preprune and squashes the following commits:
61b2e72 [Joseph K. Bradley] Added max of 10GB for maxMemoryInMB in Strategy.
a95e7c8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
95c479d [Joseph K. Bradley] * Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements" * small style fixes
e2628b6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
19b01af [Joseph K. Bradley] Merge remote-tracking branch 'chouqin/dt-preprune' into chouqin-dt-preprune
f1d11d1 [chouqin] fix typo
c7ebaf1 [chouqin] fix typo
39f9b60 [chouqin] change edge `minInstancesPerNode` to 2 and add one more test
c6e2dfc [Joseph K. Bradley] Added minInstancesPerNode and minInfoGain parameters to DecisionTreeRunner.scala and to Python API in tree.py
0278a11 [chouqin] remove `noSplit` and set `Predict` private to tree
d593ec7 [chouqin] fix docs and change minInstancesPerNode to 1
efcc736 [qiping.lqp] fix bug
10b8012 [qiping.lqp] fix style
6728fad [qiping.lqp] minor fix: remove empty lines
bb465ca [qiping.lqp] Merge branch 'master' of https://github.com/apache/spark into dt-preprune
cadd569 [qiping.lqp] add api docs
46b891f [qiping.lqp] fix bug
e72c7e4 [qiping.lqp] add comments
845c6fa [qiping.lqp] fix style
f195e83 [qiping.lqp] fix style
987cbf4 [qiping.lqp] fix bug
ff34845 [qiping.lqp] separate calculation of predict of node from calculation of info gain
ac42378 [qiping.lqp] add min info gain and min instances per node parameters in decision tree
Updating this to reflect the newest SVD via ARPACK
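For context, the documented entry point looks like this (a sketch; the data and k are illustrative):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

def example(sc: SparkContext): Unit = {
  val rows = sc.parallelize(Seq(
    IndexedRow(0L, Vectors.dense(1.0, 0.0)),
    IndexedRow(1L, Vectors.dense(0.0, 1.0))))
  val mat = new IndexedRowMatrix(rows)
  // Truncated SVD keeping k = 2 singular values; per the updated docs,
  // ARPACK is used under the hood for large problems.
  val svd = mat.computeSVD(2, computeU = true)
  println(svd.s)  // the singular values
}
```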
Author: Reza Zadeh <rizlar@gmail.com>
Closes#2389 from rezazadeh/irmdocs and squashes the following commits:
7fa1313 [Reza Zadeh] Update svd docs
715da25 [Reza Zadeh] Updated computeSVD documentation IndexedRowMatrix
SimpleUpdater ignores the regularizer, which leads to unregularized
logistic regression. To enable the common L2 regularizer (and the corresponding
regularization parameter) for logistic regression, the SquaredL2Updater
has to be used in SGD (see, e.g., [SVMWithSGD]).
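The effect of the change is equivalent to configuring the optimizer with the L2 updater explicitly, as in this minimal sketch (configuration values are illustrative):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.optimization.SquaredL2Updater

val lr = new LogisticRegressionWithSGD()
lr.optimizer
  .setNumIterations(100)
  .setRegParam(0.1)                 // the regularization parameter now takes effect
  .setUpdater(new SquaredL2Updater) // L2 regularization, as in SVMWithSGD
```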
Author: Christoph Sawade <christoph@sawade.me>
Closes#2398 from BigCrunsh/fix-regparam-logreg and squashes the following commits:
0820c04 [Christoph Sawade] Use SquaredL2Updater in LogisticRegressionWithSGD
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#2380 from sarutak/SPARK-3518 and squashes the following commits:
8a1464e [Kousuke Saruta] Replaced a variable with simple field reference
c660fbc [Kousuke Saruta] Removed useless statement in JsonProtocol.scala
Closes#2387
Author: Matthew Farrellee <matt@redhat.com>
Closes#2301 from mattf/SPARK-3425 and squashes the following commits:
20f3c09 [Matthew Farrellee] [SPARK-3425] do not set MaxPermSize for OpenJDK 1.8