Commit graph

159 commits

Author SHA1 Message Date
Alex Bozarth 8d3cc3de7d [SPARK-13050][BUILD] Scalatest tags fail build with the addition of the sketch module
A dependency on the spark test tags was left out of the sketch module pom file causing builds to fail when test tags were used. This dependency is found in the pom file for every other module in spark.

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #10954 from ajbozarth/spark13050.
2016-01-28 23:34:50 -08:00
Cheng Lian 415d0a859b [SPARK-12818][SQL] Specialized integral and string types for Count-min Sketch
This PR is a follow-up of #10911. It adds specialized update methods for `CountMinSketch` so that we can avoid doing internal/external row format conversion in `DataFrame.countMinSketch()`.

Author: Cheng Lian <lian@databricks.com>

Closes #10968 from liancheng/cms-specialized.
2016-01-28 12:26:03 -08:00
Wenchen Fan 680afabe78 [SPARK-12938][SQL] DataFrame API for Bloom filter
This PR integrates Bloom filter from spark-sketch into DataFrame. This version resorts to RDD.aggregate for building the filter. A more performant UDAF version can be built in future follow-up PRs.

This PR also add 2 specify `put` version(`putBinary` and `putLong`) into `BloomFilter`, which makes it easier to build a Bloom filter over a `DataFrame`.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10937 from cloud-fan/bloom-filter.
2016-01-27 13:29:09 -08:00
Cheng Lian ce38a35b76 [SPARK-12935][SQL] DataFrame API for Count-Min Sketch
This PR integrates Count-Min Sketch from spark-sketch into DataFrame. This version resorts to `RDD.aggregate` for building the sketch. A more performant UDAF version can be built in future follow-up PRs.

Author: Cheng Lian <lian@databricks.com>

Closes #10911 from liancheng/cms-df-api.
2016-01-26 20:12:34 -08:00
Wenchen Fan 6743de3a98 [SPARK-12937][SQL] bloom filter serialization
This PR adds serialization support for BloomFilter.

A version number is added to version the serialized binary format.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10920 from cloud-fan/bloom-filter.
2016-01-26 00:53:05 -08:00
tedyu fdcc3512f7 [SPARK-12934] use try-with-resources for streams
liancheng please take a look

Author: tedyu <yuzhihong@gmail.com>

Closes #10906 from tedyu/master.
2016-01-25 18:23:47 -08:00
Wenchen Fan 109061f7ad [SPARK-12936][SQL] Initial bloom filter implementation
This PR adds an initial implementation of bloom filter in the newly added sketch module.  The implementation is based on the [`BloomFilter` class in guava](https://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/hash/BloomFilter.java).

Some difference from the design doc:

* expose `bitSize` instead of `sizeInBytes` to user.
* always need the `expectedInsertions` parameter when create bloom filter.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10883 from cloud-fan/bloom-filter.
2016-01-25 17:58:11 -08:00
Cheng Lian 6f0f1d9e04 [SPARK-12934][SQL] Count-min sketch serialization
This PR adds serialization support for `CountMinSketch`.

A version number is added to version the serialized binary format.

Author: Cheng Lian <lian@databricks.com>

Closes #10893 from liancheng/cms-serialization.
2016-01-25 15:05:05 -08:00
Cheng Lian 1c690ddafa [SPARK-12933][SQL] Initial implementation of Count-Min sketch
This PR adds an initial implementation of count min sketch, contained in a new module spark-sketch under `common/sketch`. The implementation is based on the [`CountMinSketch` class in stream-lib][1].

As required by the [design doc][2], spark-sketch should have no external dependency.
Two classes, `Murmur3_x86_32` and `Platform` are copied to spark-sketch from spark-unsafe for hashing facilities. They'll also be used in the upcoming bloom filter implementation.

The following features will be added in future follow-up PRs:

- Serialization support
- DataFrame API integration

[1]: aac6b4d23a/src/main/java/com/clearspring/analytics/stream/frequency/CountMinSketch.java
[2]: https://issues.apache.org/jira/secure/attachment/12782378/BloomFilterandCount-MinSketchinSpark2.0.pdf

Author: Cheng Lian <lian@databricks.com>

Closes #10851 from liancheng/count-min-sketch.
2016-01-23 00:34:55 -08:00