8ec25cd67e
## What changes were proposed in this pull request?

Fixing typos is sometimes very hard. It's not easy to review them visually. Recently, I discovered a very useful tool for this, [misspell](https://github.com/client9/misspell). This pull request fixes the minor typos detected by [misspell](https://github.com/client9/misspell), except for the false positives. If you would like me to work on other files as well, let me know.

## How was this patch tested?

### before

```
$ misspell . | grep -v '.js'
R/pkg/R/SQLContext.R:354:43: "definiton" is a misspelling of "definition"
R/pkg/R/SQLContext.R:424:43: "definiton" is a misspelling of "definition"
R/pkg/R/SQLContext.R:445:43: "definiton" is a misspelling of "definition"
R/pkg/R/SQLContext.R:495:43: "definiton" is a misspelling of "definition"
NOTICE-binary:454:16: "containd" is a misspelling of "contained"
R/pkg/R/context.R:46:43: "definiton" is a misspelling of "definition"
R/pkg/R/context.R:74:43: "definiton" is a misspelling of "definition"
R/pkg/R/DataFrame.R:591:48: "persistance" is a misspelling of "persistence"
R/pkg/R/streaming.R:166:44: "occured" is a misspelling of "occurred"
R/pkg/inst/worker/worker.R:65:22: "ouput" is a misspelling of "output"
R/pkg/tests/fulltests/test_utils.R:106:25: "environemnt" is a misspelling of "environment"
common/kvstore/src/test/java/org/apache/spark/util/kvstore/InMemoryStoreSuite.java:38:39: "existant" is a misspelling of "existent"
common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBSuite.java:83:39: "existant" is a misspelling of "existent"
common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java:243:46: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:234:19: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:238:63: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:244:46: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:276:39: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/util/AbstractFileRegion.java:27:20: "transfered" is a misspelling of "transferred"
common/unsafe/src/test/scala/org/apache/spark/unsafe/types/UTF8StringPropertyCheckSuite.scala:195:15: "orgin" is a misspelling of "origin"
core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala:621:39: "gauranteed" is a misspelling of "guaranteed"
core/src/main/scala/org/apache/spark/status/storeTypes.scala:113:29: "ect" is a misspelling of "etc"
core/src/main/scala/org/apache/spark/storage/DiskStore.scala:282:18: "transfered" is a misspelling of "transferred"
core/src/main/scala/org/apache/spark/util/ListenerBus.scala:64:17: "overriden" is a misspelling of "overridden"
core/src/test/scala/org/apache/spark/ShuffleSuite.scala:211:7: "substracted" is a misspelling of "subtracted"
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:1922:49: "agriculteur" is a misspelling of "agriculture"
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:2468:84: "truely" is a misspelling of "truly"
core/src/test/scala/org/apache/spark/storage/FlatmapIteratorSuite.scala:25:18: "persistance" is a misspelling of "persistence"
core/src/test/scala/org/apache/spark/storage/FlatmapIteratorSuite.scala:26:69: "persistance" is a misspelling of "persistence"
data/streaming/AFINN-111.txt:1219:0: "humerous" is a misspelling of "humorous"
dev/run-pip-tests:55:28: "enviroments" is a misspelling of "environments"
dev/run-pip-tests:91:37: "virutal" is a misspelling of "virtual"
dev/merge_spark_pr.py:377:72: "accross" is a misspelling of "across"
dev/merge_spark_pr.py:378:66: "accross" is a misspelling of "across"
dev/run-pip-tests:126:25: "enviroments" is a misspelling of "environments"
docs/configuration.md:1830:82: "overriden" is a misspelling of "overridden"
docs/structured-streaming-programming-guide.md:525:45: "processs" is a misspelling of "processes"
docs/structured-streaming-programming-guide.md:1165:61: "BETWEN" is a misspelling of "BETWEEN"
docs/sql-programming-guide.md:1891:810: "behaivor" is a misspelling of "behavior"
examples/src/main/python/sql/arrow.py:98:8: "substract" is a misspelling of "subtract"
examples/src/main/python/sql/arrow.py:103:27: "substract" is a misspelling of "subtract"
licenses/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING"
licenses/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING"
licenses/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses-binary/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING"
licenses-binary/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses-binary/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING"
licenses-binary/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/hungarian.txt:170:0: "teh" is a misspelling of "the"
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/portuguese.txt:53:0: "eles" is a misspelling of "eels"
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:99:20: "Euclidian" is a misspelling of "Euclidean"
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:539:11: "Euclidian" is a misspelling of "Euclidean"
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala:77:36: "Teh" is a misspelling of "The"
mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala:230:24: "inital" is a misspelling of "initial"
mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala:276:9: "Euclidian" is a misspelling of "Euclidean"
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala:237:26: "descripiton" is a misspelling of "descriptions"
python/pyspark/find_spark_home.py:30:13: "enviroment" is a misspelling of "environment"
python/pyspark/context.py:937:12: "supress" is a misspelling of "suppress"
python/pyspark/context.py:938:12: "supress" is a misspelling of "suppress"
python/pyspark/context.py:939:12: "supress" is a misspelling of "suppress"
python/pyspark/context.py:940:12: "supress" is a misspelling of "suppress"
python/pyspark/heapq3.py:6:63: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:7:2: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:263:29: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:263:39: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:270:49: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:270:59: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:275:2: "STICHTING" is a misspelling of "STITCHING"
python/pyspark/heapq3.py:275:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
python/pyspark/heapq3.py:277:29: "STICHTING" is a misspelling of "STITCHING"
python/pyspark/heapq3.py:277:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
python/pyspark/heapq3.py:713:8: "probabilty" is a misspelling of "probability"
python/pyspark/ml/clustering.py:1038:8: "Currenlty" is a misspelling of "Currently"
python/pyspark/ml/stat.py:339:23: "Euclidian" is a misspelling of "Euclidean"
python/pyspark/ml/regression.py:1378:20: "paramter" is a misspelling of "parameter"
python/pyspark/mllib/stat/_statistics.py:262:8: "probabilty" is a misspelling of "probability"
python/pyspark/rdd.py:1363:32: "paramter" is a misspelling of "parameter"
python/pyspark/streaming/tests.py:825:42: "retuns" is a misspelling of "returns"
python/pyspark/sql/tests.py:768:29: "initalization" is a misspelling of "initialization"
python/pyspark/sql/tests.py:3616:31: "initalize" is a misspelling of "initialize"
resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackendUtil.scala:120:39: "arbitary" is a misspelling of "arbitrary"
resource-managers/mesos/src/test/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcherArgumentsSuite.scala:26:45: "sucessfully" is a misspelling of "successfully"
resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala:358:27: "constaints" is a misspelling of "constraints"
resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala:111:24: "senstive" is a misspelling of "sensitive"
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala:1063:5: "overwirte" is a misspelling of "overwrite"
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala:1348:17: "compatability" is a misspelling of "compatibility"
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:77:36: "paramter" is a misspelling of "parameter"
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:1374:22: "precendence" is a misspelling of "precedence"
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala:238:27: "unnecassary" is a misspelling of "unnecessary"
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ConditionalExpressionSuite.scala:212:17: "whn" is a misspelling of "when"
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala:147:60: "timestmap" is a misspelling of "timestamp"
sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala:150:45: "precentage" is a misspelling of "percentage"
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchemaSuite.scala:135:29: "infered" is a misspelling of "inferred"
sql/hive/src/test/resources/golden/udf_instr-1-2e76f819563dbaba4beb51e3a130b922:1:52: "occurance" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_instr-2-32da357fc754badd6e3898dcc8989182:1:52: "occurance" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_locate-1-6e41693c9c6dceea4d7fab4c02884e4e:1:63: "occurance" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_locate-2-d9b5934457931447874d6bb7c13de478:1:63: "occurance" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_translate-2-f7aa38a33ca0df73b7a1e6b6da4b7fe8:9:79: "occurence" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_translate-2-f7aa38a33ca0df73b7a1e6b6da4b7fe8:13:110: "occurence" is a misspelling of "occurrence"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/annotate_stats_join.q:46:105: "distint" is a misspelling of "distinct"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/auto_sortmerge_join_11.q:29:3: "Currenly" is a misspelling of "Currently"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/avro_partitioned.q:72:15: "existant" is a misspelling of "existent"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/decimal_udf.q:25:3: "substraction" is a misspelling of "subtraction"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby2_map_multi_distinct.q:16:51: "funtion" is a misspelling of "function"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby_sort_8.q:15:30: "issueing" is a misspelling of "issuing"
sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala:669:52: "wiht" is a misspelling of "with"
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java:474:9: "Refering" is a misspelling of "Referring"
```

### after

```
$ misspell . | grep -v '.js'
common/network-common/src/main/java/org/apache/spark/network/util/AbstractFileRegion.java:27:20: "transfered" is a misspelling of "transferred"
core/src/main/scala/org/apache/spark/status/storeTypes.scala:113:29: "ect" is a misspelling of "etc"
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:1922:49: "agriculteur" is a misspelling of "agriculture"
data/streaming/AFINN-111.txt:1219:0: "humerous" is a misspelling of "humorous"
licenses/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING"
licenses/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING"
licenses/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses-binary/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING"
licenses-binary/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses-binary/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING"
licenses-binary/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/hungarian.txt:170:0: "teh" is a misspelling of "the"
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/portuguese.txt:53:0: "eles" is a misspelling of "eels"
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:99:20: "Euclidian" is a misspelling of "Euclidean"
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:539:11: "Euclidian" is a misspelling of "Euclidean"
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala:77:36: "Teh" is a misspelling of "The"
mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala:276:9: "Euclidian" is a misspelling of "Euclidean"
python/pyspark/heapq3.py:6:63: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:7:2: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:263:29: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:263:39: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:270:49: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:270:59: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:275:2: "STICHTING" is a misspelling of "STITCHING"
python/pyspark/heapq3.py:275:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
python/pyspark/heapq3.py:277:29: "STICHTING" is a misspelling of "STITCHING"
python/pyspark/heapq3.py:277:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
python/pyspark/ml/stat.py:339:23: "Euclidian" is a misspelling of "Euclidean"
```

Closes #22070 from seratch/fix-typo.

Authored-by: Kazuhiro Sera <seratch@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
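The real `misspell` is a Go tool with a large built-in correction dictionary, but its core idea is simple: scan each line for known-bad words and report `file:line:col` plus the suggested fix. The sketch below is a hypothetical, minimal Python analogue (the `CORRECTIONS` map is a hand-picked subset, not misspell's dictionary), just to illustrate the report format shown above:

```python
# Illustrative only: a tiny Python analogue of what `misspell` reports.
# The CORRECTIONS map below is a hypothetical subset of a real dictionary.
import re

CORRECTIONS = {
    "definiton": "definition",
    "transfered": "transferred",
    "occured": "occurred",
    "Euclidian": "Euclidean",
}


def find_misspellings(text, filename="<stdin>"):
    """Return misspell-style reports: file:line:col: "bad" is a misspelling of "good"."""
    reports = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for word, fix in CORRECTIONS.items():
            # Whole-word matches only, so e.g. "ect" inside "project" is not flagged.
            for m in re.finditer(r"\b%s\b" % re.escape(word), line):
                reports.append('%s:%d:%d: "%s" is a misspelling of "%s"'
                               % (filename, lineno, m.start(), word, fix))
    return reports


for report in find_misspellings('the definiton was transfered twice', 'demo.txt'):
    print(report)
```

Filtering false positives (as the PR description mentions) then amounts to dropping matches by path or word before printing, e.g. skipping `LICENSE-heapq.txt`, where "Stichting Mathematisch Centrum" is a proper name, not a typo.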
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import sys

if sys.version >= '3':
    basestring = str

from pyspark.rdd import RDD, ignore_unicode_prefix
from pyspark.mllib.common import callMLlibFunc, JavaModelWrapper
from pyspark.mllib.linalg import Matrix, _convert_to_vector
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.stat.test import ChiSqTestResult, KolmogorovSmirnovTestResult


__all__ = ['MultivariateStatisticalSummary', 'Statistics']


class MultivariateStatisticalSummary(JavaModelWrapper):

    """
    Trait for multivariate statistical summary of a data matrix.
    """

    def mean(self):
        return self.call("mean").toArray()

    def variance(self):
        return self.call("variance").toArray()

    def count(self):
        return int(self.call("count"))

    def numNonzeros(self):
        return self.call("numNonzeros").toArray()

    def max(self):
        return self.call("max").toArray()

    def min(self):
        return self.call("min").toArray()

    def normL1(self):
        return self.call("normL1").toArray()

    def normL2(self):
        return self.call("normL2").toArray()


class Statistics(object):

    @staticmethod
    def colStats(rdd):
        """
        Computes column-wise summary statistics for the input RDD[Vector].

        :param rdd: an RDD[Vector] for which column-wise summary statistics
                    are to be computed.
        :return: :class:`MultivariateStatisticalSummary` object containing
                 column-wise summary statistics.

        >>> from pyspark.mllib.linalg import Vectors
        >>> rdd = sc.parallelize([Vectors.dense([2, 0, 0, -2]),
        ...                       Vectors.dense([4, 5, 0, 3]),
        ...                       Vectors.dense([6, 7, 0, 8])])
        >>> cStats = Statistics.colStats(rdd)
        >>> cStats.mean()
        array([ 4.,  4.,  0.,  3.])
        >>> cStats.variance()
        array([  4.,  13.,   0.,  25.])
        >>> cStats.count()
        3
        >>> cStats.numNonzeros()
        array([ 3.,  2.,  0.,  3.])
        >>> cStats.max()
        array([ 6.,  7.,  0.,  8.])
        >>> cStats.min()
        array([ 2.,  0.,  0., -2.])
        """
        cStats = callMLlibFunc("colStats", rdd.map(_convert_to_vector))
        return MultivariateStatisticalSummary(cStats)

    @staticmethod
    def corr(x, y=None, method=None):
        """
        Compute the correlation (matrix) for the input RDD(s) using the
        specified method.
        Methods currently supported: I{pearson (default), spearman}.

        If a single RDD of Vectors is passed in, a correlation matrix
        comparing the columns in the input RDD is returned. Use C{method=}
        to specify the method to be used for single RDD input.
        If two RDDs of floats are passed in, a single float is returned.

        :param x: an RDD of vector for which the correlation matrix is to be computed,
                  or an RDD of float of the same cardinality as y when y is specified.
        :param y: an RDD of float of the same cardinality as x.
        :param method: String specifying the method to use for computing correlation.
                       Supported: `pearson` (default), `spearman`
        :return: Correlation matrix comparing columns in x.

        >>> x = sc.parallelize([1.0, 0.0, -2.0], 2)
        >>> y = sc.parallelize([4.0, 5.0, 3.0], 2)
        >>> zeros = sc.parallelize([0.0, 0.0, 0.0], 2)
        >>> abs(Statistics.corr(x, y) - 0.6546537) < 1e-7
        True
        >>> Statistics.corr(x, y) == Statistics.corr(x, y, "pearson")
        True
        >>> Statistics.corr(x, y, "spearman")
        0.5
        >>> from math import isnan
        >>> isnan(Statistics.corr(x, zeros))
        True
        >>> from pyspark.mllib.linalg import Vectors
        >>> rdd = sc.parallelize([Vectors.dense([1, 0, 0, -2]), Vectors.dense([4, 5, 0, 3]),
        ...                       Vectors.dense([6, 7, 0, 8]), Vectors.dense([9, 0, 0, 1])])
        >>> pearsonCorr = Statistics.corr(rdd)
        >>> print(str(pearsonCorr).replace('nan', 'NaN'))
        [[ 1.          0.05564149         NaN  0.40047142]
         [ 0.05564149  1.                 NaN  0.91359586]
         [        NaN         NaN  1.                 NaN]
         [ 0.40047142  0.91359586         NaN  1.        ]]
        >>> spearmanCorr = Statistics.corr(rdd, method="spearman")
        >>> print(str(spearmanCorr).replace('nan', 'NaN'))
        [[ 1.          0.10540926         NaN  0.4       ]
         [ 0.10540926  1.                 NaN  0.9486833 ]
         [        NaN         NaN  1.                 NaN]
         [ 0.4         0.9486833          NaN  1.        ]]
        >>> try:
        ...     Statistics.corr(rdd, "spearman")
        ...     print("Method name as second argument without 'method=' shouldn't be allowed.")
        ... except TypeError:
        ...     pass
        """
        # Check inputs to determine whether a single value or a matrix is needed for output.
        # Since it's legal for users to use the method name as the second argument, we need to
        # check if y is used to specify the method name instead.
        if type(y) == str:
            raise TypeError("Use 'method=' to specify method name.")

        if not y:
            return callMLlibFunc("corr", x.map(_convert_to_vector), method).toArray()
        else:
            return callMLlibFunc("corr", x.map(float), y.map(float), method)

    @staticmethod
    @ignore_unicode_prefix
    def chiSqTest(observed, expected=None):
        """
        If `observed` is Vector, conduct Pearson's chi-squared goodness
        of fit test of the observed data against the expected distribution,
        or against the uniform distribution (by default), with each category
        having an expected frequency of `1 / len(observed)`.

        If `observed` is matrix, conduct Pearson's independence test on the
        input contingency matrix, which cannot contain negative entries or
        columns or rows that sum up to 0.

        If `observed` is an RDD of LabeledPoint, conduct Pearson's independence
        test for every feature against the label across the input RDD.
        For each feature, the (feature, label) pairs are converted into a
        contingency matrix for which the chi-squared statistic is computed.
        All label and feature values must be categorical.

        .. note:: `observed` cannot contain negative values

        :param observed: it could be a vector containing the observed categorical
                         counts/relative frequencies, or the contingency matrix
                         (containing either counts or relative frequencies),
                         or an RDD of LabeledPoint containing the labeled dataset
                         with categorical features. Real-valued features will be
                         treated as categorical for each distinct value.
        :param expected: Vector containing the expected categorical counts/relative
                         frequencies. `expected` is rescaled if the `expected` sum
                         differs from the `observed` sum.
        :return: ChiSquaredTest object containing the test statistic, degrees
                 of freedom, p-value, the method used, and the null hypothesis.

        >>> from pyspark.mllib.linalg import Vectors, Matrices
        >>> observed = Vectors.dense([4, 6, 5])
        >>> pearson = Statistics.chiSqTest(observed)
        >>> print(pearson.statistic)
        0.4
        >>> pearson.degreesOfFreedom
        2
        >>> print(round(pearson.pValue, 4))
        0.8187
        >>> pearson.method
        u'pearson'
        >>> pearson.nullHypothesis
        u'observed follows the same distribution as expected.'

        >>> observed = Vectors.dense([21, 38, 43, 80])
        >>> expected = Vectors.dense([3, 5, 7, 20])
        >>> pearson = Statistics.chiSqTest(observed, expected)
        >>> print(round(pearson.pValue, 4))
        0.0027

        >>> data = [40.0, 24.0, 29.0, 56.0, 32.0, 42.0, 31.0, 10.0, 0.0, 30.0, 15.0, 12.0]
        >>> chi = Statistics.chiSqTest(Matrices.dense(3, 4, data))
        >>> print(round(chi.statistic, 4))
        21.9958

        >>> data = [LabeledPoint(0.0, Vectors.dense([0.5, 10.0])),
        ...         LabeledPoint(0.0, Vectors.dense([1.5, 20.0])),
        ...         LabeledPoint(1.0, Vectors.dense([1.5, 30.0])),
        ...         LabeledPoint(0.0, Vectors.dense([3.5, 30.0])),
        ...         LabeledPoint(0.0, Vectors.dense([3.5, 40.0])),
        ...         LabeledPoint(1.0, Vectors.dense([3.5, 40.0])),]
        >>> rdd = sc.parallelize(data, 4)
        >>> chi = Statistics.chiSqTest(rdd)
        >>> print(chi[0].statistic)
        0.75
        >>> print(chi[1].statistic)
        1.5
        """
        if isinstance(observed, RDD):
            if not isinstance(observed.first(), LabeledPoint):
                raise ValueError("observed should be an RDD of LabeledPoint")
            jmodels = callMLlibFunc("chiSqTest", observed)
            return [ChiSqTestResult(m) for m in jmodels]

        if isinstance(observed, Matrix):
            jmodel = callMLlibFunc("chiSqTest", observed)
        else:
            if expected and len(expected) != len(observed):
                raise ValueError("`expected` should have same length with `observed`")
            jmodel = callMLlibFunc("chiSqTest", _convert_to_vector(observed), expected)
        return ChiSqTestResult(jmodel)

    @staticmethod
    @ignore_unicode_prefix
    def kolmogorovSmirnovTest(data, distName="norm", *params):
        """
        Performs the Kolmogorov-Smirnov (KS) test for data sampled from
        a continuous distribution. It tests the null hypothesis that
        the data is generated from a particular distribution.

        The given data is sorted and the Empirical Cumulative
        Distribution Function (ECDF) is calculated, which for a given
        point is the number of points having a CDF value less than it
        divided by the total number of points.

        Since the data is sorted, this is a step function
        that rises by (1 / length of data) for every ordered point.

        The KS statistic gives us the maximum distance between the
        ECDF and the CDF. Intuitively, if this statistic is large, the
        probability that the null hypothesis is true becomes small.
        For specific details of the implementation, please have a look
        at the Scala documentation.

        :param data: RDD, samples from the data
        :param distName: string, currently only "norm" is supported.
                         (Normal distribution) to calculate the
                         theoretical distribution of the data.
        :param params: additional values which need to be provided for
                       a certain distribution.
                       If not provided, the default values are used.
        :return: KolmogorovSmirnovTestResult object containing the test
                 statistic, degrees of freedom, p-value,
                 the method used, and the null hypothesis.

        >>> kstest = Statistics.kolmogorovSmirnovTest
        >>> data = sc.parallelize([-1.0, 0.0, 1.0])
        >>> ksmodel = kstest(data, "norm")
        >>> print(round(ksmodel.pValue, 3))
        1.0
        >>> print(round(ksmodel.statistic, 3))
        0.175
        >>> ksmodel.nullHypothesis
        u'Sample follows theoretical distribution'

        >>> data = sc.parallelize([2.0, 3.0, 4.0])
        >>> ksmodel = kstest(data, "norm", 3.0, 1.0)
        >>> print(round(ksmodel.pValue, 3))
        1.0
        >>> print(round(ksmodel.statistic, 3))
        0.175
        """
        if not isinstance(data, RDD):
            raise TypeError("data should be an RDD, got %s." % type(data))
        if not isinstance(distName, basestring):
            raise TypeError("distName should be a string, got %s." % type(distName))

        params = [float(param) for param in params]
        return KolmogorovSmirnovTestResult(
            callMLlibFunc("kolmogorovSmirnovTest", data, distName, params))


def _test():
    import doctest
    import numpy
    from pyspark.sql import SparkSession
    try:
        # Numpy 1.14+ changed its string format.
        numpy.set_printoptions(legacy='1.13')
    except TypeError:
        pass
    globs = globals().copy()
    spark = SparkSession.builder\
        .master("local[4]")\
        .appName("mllib.stat.statistics tests")\
        .getOrCreate()
    globs['sc'] = spark.sparkContext
    (failure_count, test_count) = doctest.testmod(globs=globs, optionflags=doctest.ELLIPSIS)
    spark.stop()
    if failure_count:
        sys.exit(-1)


if __name__ == "__main__":
    _test()
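The `kolmogorovSmirnovTest` docstring describes the statistic as the maximum distance between the sorted sample's ECDF (a step function rising by 1/n at each ordered point) and the theoretical CDF. As a sanity check on that description, here is a standalone pure-Python sketch (no Spark required; `norm_cdf` is built from `math.erf`, and this is not Spark's actual Scala implementation) that reproduces the doctest's statistic of 0.175 for the sample `[-1.0, 0.0, 1.0]`:

```python
# Standalone sketch of the one-sample KS statistic described in the
# kolmogorovSmirnovTest docstring. Not Spark's implementation, which
# lives in Scala; this only illustrates the ECDF-vs-CDF definition.
import math


def norm_cdf(x, mean=0.0, std=1.0):
    """Normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def ks_statistic(sample, cdf=norm_cdf):
    """Maximum distance between the sample's ECDF and the theoretical CDF."""
    data = sorted(sample)
    n = len(data)
    d = 0.0
    for i, x in enumerate(data, start=1):
        theoretical = cdf(x)
        # The ECDF steps from (i - 1)/n to i/n at each ordered point, so the
        # largest gap at x is on one side of that step or the other.
        d = max(d, abs(i / n - theoretical), abs((i - 1) / n - theoretical))
    return d


# Same sample as the kolmogorovSmirnovTest doctest; statistic rounds to 0.175.
print(round(ks_statistic([-1.0, 0.0, 1.0]), 3))
```

Shifting the sample to `[2.0, 3.0, 4.0]` and testing against `norm` with mean 3.0 and std 1.0 (the doctest's second example) yields the same statistic, since the gaps between the ECDF and the shifted CDF are identical.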