spark-instrumented-optimizer/project
Reynold Xin 8b8e70ebde Merge pull request #73 from falaki/ApproximateDistinctCount
Approximate distinct count

Added countApproxDistinct() to RDD and countApproxDistinctByKey() to PairRDDFunctions to approximately count the number of distinct elements and the number of distinct values per key, respectively. Both functions use HyperLogLog from stream-lib for counting, and both take a parameter that controls the trade-off between accuracy and memory consumption. Also added Scala docs and test suites for both methods.
2013-12-31 17:48:24 -08:00
project Adding Apache license to two files 2013-09-07 20:46:58 -07:00
build.properties Change build and run instructions to use assemblies 2013-08-29 21:19:04 -07:00
plugins.sbt Upgrade to sbt-assembly 0.9.2 2013-11-12 13:29:25 -08:00
SparkBuild.scala Merge pull request #73 from falaki/ApproximateDistinctCount 2013-12-31 17:48:24 -08:00