spark-instrumented-optimizer/python/pyspark
zhengruifeng 052ff49acd [SPARK-30659][ML][PYSPARK] LogisticRegression blockify input vectors
### What changes were proposed in this pull request?
1, reorg the `fit` method in LR to several blocks (`createModel`, `createBounds`, `createOptimizer`, `createInitCoefWithInterceptMatrix`);
2, add new param blockSize;
3, if blockSize==1, keep original behavior, code path `trainOnRows`;
4, if blockSize>1, standardize and stack input vectors to blocks (like ALS/MLP), code path `trainOnBlocks`

### Why are the changes needed?
On dense dataset `epsilon_normalized.t`:
1, reduce RAM to persist traing dataset; (save about 40% RAM)
2, use Level-2 BLAS routines; (4x ~ 5x faster)

### Does this PR introduce _any_ user-facing change?
Yes, a new param is added

### How was this patch tested?
existing and added testsuites

Closes #28458 from zhengruifeng/blockify_lor_II.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-05-07 10:07:24 +08:00
..
ml [SPARK-30659][ML][PYSPARK] LogisticRegression blockify input vectors 2020-05-07 10:07:24 +08:00
mllib [SPARK-30930][ML] Remove ML/MLLIB DeveloperApi annotations 2020-03-16 12:41:22 -05:00
resource [SPARK-29641][PYTHON][CORE] Stage Level Sched: Add python api's and tests 2020-04-23 10:20:39 +09:00
sql [SPARK-29664][PYTHON][SQL][FOLLOW-UP] Add deprecation warnings for getItem instead 2020-04-27 14:49:22 +09:00
streaming [SPARK-28980][CORE][SQL][STREAMING][MLLIB] Remove most items deprecated in Spark 2.2.0 or earlier, for Spark 3 2019-09-09 10:19:40 -05:00
testing [SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package 2020-01-09 10:22:50 +09:00
tests [SPARK-31549][PYSPARK] Add a develop API invoking collect on Python RDD with user-specified job group 2020-05-01 10:08:16 +09:00
__init__.py [SPARK-31088][SQL] Add back HiveContext and createExternalTable 2020-03-26 23:51:15 -07:00
_globals.py [SPARK-23328][PYTHON] Disallow default value None in na.replace/replace when 'to_replace' is not a dictionary 2018-02-09 14:21:10 +08:00
accumulators.py [SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation 2019-07-05 10:08:22 -07:00
broadcast.py [SPARK-29341][PYTHON] Upgrade cloudpickle to 1.0.0 2019-10-03 19:20:51 +09:00
cloudpickle.py [SPARK-29536][PYTHON] Upgrade cloudpickle to 1.1.1 to support Python 3.8 2019-10-22 16:18:34 +09:00
conf.py [SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation 2019-07-05 10:08:22 -07:00
context.py [SPARK-22340][PYTHON][FOLLOW-UP] Add a better message and improve documentation for pinned thread mode 2019-11-21 10:54:01 +09:00
daemon.py [SPARK-26175][PYTHON] Redirect the standard input of the forked child to devnull in daemon 2019-07-31 09:10:24 +09:00
files.py [SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation 2019-07-05 10:08:22 -07:00
find_spark_home.py [SPARK-31382][BUILD] Show a better error message for different python and pip installation mistake 2020-04-09 11:04:35 +09:00
heapq3.py Fix typos detected by github.com/client9/misspell 2018-08-11 21:23:36 -05:00
java_gateway.py [SPARK-29641][PYTHON][CORE] Stage Level Sched: Add python api's and tests 2020-04-23 10:20:39 +09:00
join.py [SPARK-14202] [PYTHON] Use generator expression instead of list comp in python_full_outer_jo… 2016-03-28 14:51:36 -07:00
profiler.py [SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis 2019-01-17 19:40:39 -06:00
rdd.py [SPARK-31549][PYSPARK] Add a develop API invoking collect on Python RDD with user-specified job group 2020-05-01 10:08:16 +09:00
rddsampler.py [SPARK-4897] [PySpark] Python 3 support 2015-04-16 16:20:57 -07:00
resourceinformation.py [SPARK-28234][CORE][PYTHON] Add python and JavaSparkContext support to get resources 2019-07-11 09:32:58 +09:00
resultiterable.py [SPARK-30205][PYSPARK] Import ABCs from collections.abc to remove deprecation warnings 2019-12-10 11:08:13 -08:00
serializers.py [SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package 2020-01-09 10:22:50 +09:00
shell.py [SPARK-25238][PYTHON] lint-python: Fix W605 warnings for pycodestyle 2.4 2018-09-13 11:19:43 +08:00
shuffle.py [SPARK-25696] The storage memory displayed on spark Application UI is… 2018-12-10 18:27:01 -06:00
statcounter.py [SPARK-6919] [PYSPARK] Add asDict method to StatCounter 2015-09-29 13:38:15 -07:00
status.py [SPARK-4172] [PySpark] Progress API in Python 2015-02-17 13:36:43 -08:00
storagelevel.py [SPARK-25908][CORE][SQL] Remove old deprecated items in Spark 3 2018-11-07 22:48:50 -06:00
taskcontext.py [SPARK-31344][CORE] Polish implementation of barrier() and allGather() 2020-04-16 21:23:32 -07:00
traceback_utils.py [SPARK-1087] Move python traceback utilities into new traceback_utils.py file. 2014-09-15 19:28:17 -07:00
util.py [SPARK-29641][PYTHON][CORE] Stage Level Sched: Add python api's and tests 2020-04-23 10:20:39 +09:00
version.py [SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT 2020-02-25 19:44:31 -08:00
worker.py [SPARK-29641][PYTHON][CORE] Stage Level Sched: Add python api's and tests 2020-04-23 10:20:39 +09:00