[SPARK-18365][DOCS] Improve Sample Method Documentation

## What changes were proposed in this pull request?

I found the documentation for the sample method confusing; this change clarifies it across all languages.

- [x] Scala
- [x] Python
- [x] R
- [x] RDD Scala
- [ ] RDD Python with SEED
- [x] RDD Java
- [x] RDD Java with SEED
- [x] RDD Python

## How was this patch tested?

NA


Author: anabranch <wac.chambers@gmail.com>
Author: Bill Chambers <bill@databricks.com>

Closes #15815 from anabranch/SPARK-18365.
anabranch 2016-11-17 11:34:55 +00:00 committed by Sean Owen
parent a3cac7bd86
commit 49b6f456ac
6 changed files with 30 additions and 5 deletions


@@ -936,7 +936,9 @@ setMethod("unique",
#' Sample
#'
#' Return a sampled subset of this SparkDataFrame using a random seed.
#' Note: this is not guaranteed to provide exactly the fraction specified
#' of the total count of the given SparkDataFrame.
#'
#' @param x A SparkDataFrame
#' @param withReplacement Sampling with replacement or not


@@ -98,7 +98,9 @@ class JavaRDD[T](val rdd: RDD[T])(implicit val classTag: ClassTag[T])
def repartition(numPartitions: Int): JavaRDD[T] = rdd.repartition(numPartitions)
/**
* Return a sampled subset of this RDD.
* Return a sampled subset of this RDD with a random seed.
* Note: this is NOT guaranteed to provide exactly the fraction of the count
* of the given [[RDD]].
*
* @param withReplacement can elements be sampled multiple times (replaced when sampled out)
* @param fraction expected size of the sample as a fraction of this RDD's size
@@ -109,7 +111,9 @@ class JavaRDD[T](val rdd: RDD[T])(implicit val classTag: ClassTag[T])
sample(withReplacement, fraction, Utils.random.nextLong)
/**
* Return a sampled subset of this RDD.
* Return a sampled subset of this RDD, with a user-supplied seed.
* Note: this is NOT guaranteed to provide exactly the fraction of the count
* of the given [[RDD]].
*
* @param withReplacement can elements be sampled multiple times (replaced when sampled out)
* @param fraction expected size of the sample as a fraction of this RDD's size


@@ -466,6 +466,9 @@ abstract class RDD[T: ClassTag](
/**
* Return a sampled subset of this RDD.
*
* Note: this is NOT guaranteed to provide exactly the fraction of the count
* of the given [[RDD]].
*
* @param withReplacement can elements be sampled multiple times (replaced when sampled out)
* @param fraction expected size of the sample as a fraction of this RDD's size
* without replacement: probability that each element is chosen; fraction must be [0, 1]
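
The without-replacement rule in the doc comment above (each element is chosen with probability `fraction`) is what makes the sample size vary from run to run. The following is a minimal plain-Python sketch of that idea, not Spark code: `bernoulli_sample` is a hypothetical helper, and it assumes each element is kept by an independent coin flip. Under that assumption the sample size for 100 elements and a fraction of 0.1 has mean 10 and standard deviation 3.

```python
import random


def bernoulli_sample(items, fraction, seed):
    """Hypothetical helper: keep each item independently with probability `fraction`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1] when sampling without replacement")
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return [item for item in items if rng.random() < fraction]


data = list(range(100))

# Sample the same data with 20 different seeds and record each sample size.
sizes = [len(bernoulli_sample(data, 0.1, seed)) for seed in range(20)]
print(sizes)  # sizes scatter around 10; they are not all exactly 10
```

The sketch only illustrates why `fraction` is an expected proportion rather than an exact one; it makes no claim about how Spark's sampler is implemented internally.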


@@ -386,6 +386,11 @@ class RDD(object):
with replacement: expected number of times each element is chosen; fraction must be >= 0
:param seed: seed for the random number generator
.. note::
This is not guaranteed to provide exactly the fraction specified of the total count
of the given :class:`RDD`.
>>> rdd = sc.parallelize(range(100), 4)
>>> 6 <= rdd.sample(False, 0.1, 81).count() <= 14
True
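
The with-replacement case in the docstring above (`fraction` is the expected number of times each element is chosen, and must be >= 0) can be sketched in plain Python as well. This is an illustration only: `poisson_draw` and `sample_with_replacement` are hypothetical helpers, and drawing a Poisson-distributed repeat count per element is an assumption made for the sketch, not something the docstring states.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def sample_with_replacement(items, fraction, seed):
    """Hypothetical helper: repeat each item a Poisson(fraction) number of times."""
    if fraction < 0:
        raise ValueError("fraction must be >= 0 when sampling with replacement")
    rng = random.Random(seed)
    sampled = []
    for item in items:
        sampled.extend([item] * poisson_draw(rng, fraction))
    return sampled


data = list(range(1000))
sample = sample_with_replacement(data, 2.0, seed=81)
print(len(sample))                     # near 2000, but not exactly 2000 in general
print(len(sample) > len(set(sample)))  # duplicates are expected with replacement
```

Note that a fraction above 1 is meaningful here, since an element may appear several times in the result.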


@@ -549,6 +549,11 @@ class DataFrame(object):
def sample(self, withReplacement, fraction, seed=None):
"""Returns a sampled subset of this :class:`DataFrame`.
.. note::
This is not guaranteed to provide exactly the fraction specified of the total count
of the given :class:`DataFrame`.
>>> df.sample(False, 0.5, 42).count()
2
"""


@@ -1646,7 +1646,10 @@ class Dataset[T] private[sql](
}
/**
* Returns a new Dataset by sampling a fraction of rows.
* Returns a new [[Dataset]] by sampling a fraction of rows, using a user-supplied seed.
*
* Note: this is NOT guaranteed to provide exactly the fraction of the count
* of the given [[Dataset]].
*
* @param withReplacement Sample with replacement or not.
* @param fraction Fraction of rows to generate.
@@ -1665,7 +1668,10 @@ class Dataset[T] private[sql](
}
/**
* Returns a new Dataset by sampling a fraction of rows, using a random seed.
* Returns a new [[Dataset]] by sampling a fraction of rows, using a random seed.
*
* Note: this is NOT guaranteed to provide exactly the fraction of the total count
* of the given [[Dataset]].
*
* @param withReplacement Sample with replacement or not.
* @param fraction Fraction of rows to generate.
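
The two overloads documented above differ only in where the seed comes from: one takes a user-supplied seed, the other uses a random one. A rough plain-Python sketch of that difference follows. `sample_rows` is a hypothetical stand-in rather than the Dataset API, and it assumes the seedless form simply draws a fresh seed and delegates to the seeded form.

```python
import random


def sample_rows(rows, fraction, seed=None):
    """Hypothetical stand-in: Bernoulli-sample `rows`, optionally with a fixed seed."""
    if seed is None:
        # No seed supplied: pick a fresh random one, so repeated calls may differ.
        seed = random.randrange(2**63)
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(50))

# Same seed, same input: the two samples are identical.
first = sample_rows(rows, 0.5, seed=42)
second = sample_rows(rows, 0.5, seed=42)
print(first == second)

# No seed: each call draws its own seed, so the result is not reproducible.
unseeded = sample_rows(rows, 0.5)
print(len(unseeded))  # varies between runs
```

Passing an explicit seed is the usual choice when a sample has to be reproduced later, for example in a test.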