353 commits

Author SHA1 Message Date
Wenchen Fan 109061f7ad [SPARK-12936][SQL] Initial bloom filter implementation
This PR adds an initial implementation of a Bloom filter to the newly added sketch module. The implementation is based on the [`BloomFilter` class in guava](https://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/hash/BloomFilter.java).

Some differences from the design doc:

* Expose `bitSize` instead of `sizeInBytes` to the user.
* Always require the `expectedInsertions` parameter when creating a Bloom filter.
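
To illustrate how `expectedInsertions` and `bitSize` relate, here is a minimal sketch of a Bloom filter sized with the standard textbook formulas. It is not the code added by this PR: the class name `SimpleBloomFilter`, the `fpp` parameter, and the 64-bit mixing function (a stand-in for a real hash such as Murmur3) are assumptions made for illustration.

```java
import java.util.BitSet;

// Hypothetical illustration only; not the class added by this PR.
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int bitSize;
    private final int numHashFunctions;

    SimpleBloomFilter(long expectedInsertions, double fpp) {
        if (expectedInsertions <= 0) {
            throw new IllegalArgumentException("expectedInsertions must be positive");
        }
        if (!(fpp > 0.0 && fpp < 1.0)) {
            throw new IllegalArgumentException("fpp must be in (0, 1)");
        }
        // Standard sizing: m = -n * ln(p) / (ln 2)^2 bits.
        double ln2 = Math.log(2);
        long m = (long) Math.ceil(-expectedInsertions * Math.log(fpp) / (ln2 * ln2));
        this.bitSize = (int) Math.min(Integer.MAX_VALUE, Math.max(1L, m));
        // Standard choice of hash count: k = (m / n) * ln 2, at least 1.
        this.numHashFunctions =
            Math.max(1, (int) Math.round((double) bitSize / expectedInsertions * ln2));
        this.bits = new BitSet(bitSize);
    }

    int bitSize() {
        return bitSize;
    }

    void put(long item) {
        long h = mix(item);
        int h1 = (int) h;
        int h2 = (int) (h >>> 32);
        // Derive k bit positions from two 32-bit halves (double hashing).
        for (int i = 1; i <= numHashFunctions; i++) {
            bits.set(index(h1 + i * h2));
        }
    }

    boolean mightContain(long item) {
        long h = mix(item);
        int h1 = (int) h;
        int h2 = (int) (h >>> 32);
        for (int i = 1; i <= numHashFunctions; i++) {
            if (!bits.get(index(h1 + i * h2))) {
                return false; // definitely never inserted
            }
        }
        return true; // possibly inserted (false positives are allowed)
    }

    private int index(int combined) {
        // Flip negative values so the modulo result is a valid bit position.
        if (combined < 0) {
            combined = ~combined;
        }
        return combined % bitSize;
    }

    // Assumed stand-in hash: a 64-bit finalizer-style mixer, not Murmur3.
    private static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }
}
```

In this sketch the caller supplies only the expected number of insertions and a target false-positive rate, and reads the resulting size back in bits through `bitSize()`.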

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10883 from cloud-fan/bloom-filter.
2016-01-25 17:58:11 -08:00
Cheng Lian 6f0f1d9e04 [SPARK-12934][SQL] Count-min sketch serialization
This PR adds serialization support for `CountMinSketch`.

A version number is included in the serialized binary format to identify its layout.
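
As a rough illustration of versioning a binary format, the sketch below writes a version number ahead of the payload and rejects unknown versions on read. The class name `VersionedCounts`, the constant `VERSION_1`, and the field layout (version, depth, width, then counters) are assumptions for this example and are not taken from the PR.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration only; the real serialized layout may differ.
final class VersionedCounts {
    static final int VERSION_1 = 1;

    // Serializes a rectangular counter table, prefixed by a format version.
    static byte[] write(long[][] table) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(VERSION_1);                                 // format version
            out.writeInt(table.length);                              // depth (rows)
            out.writeInt(table.length == 0 ? 0 : table[0].length);   // width (columns)
            for (long[] row : table) {
                for (long count : row) {
                    out.writeLong(count);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the table back, failing fast on a version this reader does not know.
    static long[][] read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int version = in.readInt();
            if (version != VERSION_1) {
                throw new IOException("Unsupported format version: " + version);
            }
            int depth = in.readInt();
            int width = in.readInt();
            long[][] table = new long[depth][width];
            for (int i = 0; i < depth; i++) {
                for (int j = 0; j < width; j++) {
                    table[i][j] = in.readLong();
                }
            }
            return table;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Checking the version first means a later layout change can be introduced under a new version number while older readers fail with a clear error instead of misreading the bytes.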

Author: Cheng Lian <lian@databricks.com>

Closes #10893 from liancheng/cms-serialization.
2016-01-25 15:05:05 -08:00
Cheng Lian 1c690ddafa [SPARK-12933][SQL] Initial implementation of Count-Min sketch
This PR adds an initial implementation of Count-Min sketch in a new module, spark-sketch, under `common/sketch`. The implementation is based on the [`CountMinSketch` class in stream-lib][1].

As required by the [design doc][2], spark-sketch should have no external dependency.
Two classes, `Murmur3_x86_32` and `Platform`, are copied to spark-sketch from spark-unsafe for hashing facilities. They'll also be used in the upcoming Bloom filter implementation.
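
For readers unfamiliar with the data structure, the following is a minimal sketch of a Count-Min sketch with no external dependencies. It is not the code from this PR: the class name `SimpleCountMinSketch`, its method names, and the seeded 64-bit mixer (used here in place of `Murmur3_x86_32`) are assumptions for illustration.

```java
// Hypothetical illustration only; not the class added by this PR.
final class SimpleCountMinSketch {
    private final int depth;      // number of hash rows
    private final int width;      // counters per row
    private final long[][] table;

    SimpleCountMinSketch(int depth, int width) {
        if (depth <= 0 || width <= 0) {
            throw new IllegalArgumentException("depth and width must be positive");
        }
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
    }

    // Adds `count` occurrences of `item` to one counter in every row.
    void add(long item, long count) {
        if (count < 0) {
            throw new IllegalArgumentException("count must be non-negative");
        }
        for (int row = 0; row < depth; row++) {
            table[row][bucket(item, row)] += count;
        }
    }

    // Returns the minimum counter across rows; this never underestimates
    // the true count, but collisions can make it an overestimate.
    long estimateCount(long item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, table[row][bucket(item, row)]);
        }
        return min;
    }

    // Picks a column for `item` in the given row, using a per-row seed.
    private int bucket(long item, int row) {
        long h = mix(item + (row + 1) * 0x9e3779b97f4a7c15L);
        return (int) Math.floorMod(h, (long) width);
    }

    // Assumed stand-in hash: a 64-bit finalizer-style mixer, not Murmur3.
    private static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }
}
```

Taking the minimum over rows is what bounds the error: a collision can only inflate a counter, so the smallest counter is the closest to the true count.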

The following features will be added in future follow-up PRs:

- Serialization support
- DataFrame API integration

[1]: aac6b4d23a/src/main/java/com/clearspring/analytics/stream/frequency/CountMinSketch.java
[2]: https://issues.apache.org/jira/secure/attachment/12782378/BloomFilterandCount-MinSketchinSpark2.0.pdf

Author: Cheng Lian <lian@databricks.com>

Closes #10851 from liancheng/count-min-sketch.
2016-01-23 00:34:55 -08:00