java.lang.Object
org.apache.lucene.codecs.bloom.FuzzySet
- All Implemented Interfaces:
Accountable
A class used to represent a set of many, potentially large, values (e.g. many long strings such
as URLs), using a significantly smaller amount of memory.
The set is "lossy" in that it cannot definitively state that is does contain a value but it can definitively say if a value is not in the set. It can therefore be used as a Bloom Filter. Another application of the set is that it can be used to perform fuzzy counting because it can estimate reasonably accurately how many unique values are contained in the set.
This class is NOT threadsafe.
Internally a Bitset is used to record values and once a client has finished recording a stream
of values the downsize(float) method can be used to create a suitably smaller set that
is sized appropriately for the number of values recorded and desired saturation levels.
-
Nested Class Summary
Nested ClassesModifier and TypeClassDescriptionstatic enumResult fromcontains(BytesRef): can never return definitively YES (always MAYBE), but can sometimes definitely return NO. -
Field Summary
FieldsModifier and TypeFieldDescriptionprivate intprivate FixedBitSetprivate HashFunction(package private) static final int[]static final intstatic final intstatic final intFields inherited from interface org.apache.lucene.util.Accountable
NULL_ACCOUNTABLE -
Constructor Summary
ConstructorsModifierConstructorDescriptionprivateFuzzySet(FixedBitSet filter, int bloomSize, HashFunction hashFunction) -
Method Summary
Modifier and TypeMethodDescriptionvoidRecords a value in the set.The main method required for a Bloom filter which, given a value determines set membership.static FuzzySetcreateSetBasedOnMaxMemory(int maxNumBytes) static FuzzySetcreateSetBasedOnQuality(int maxNumUniqueValues, float desiredMaxSaturation) static FuzzySetdeserialize(DataInput in) downsize(float targetMaxSaturation) static intgetEstimatedNumberUniqueValuesAllowingForCollisions(int setSize, int numRecordedBits) intstatic intgetNearestSetSize(int maxNumberOfBits) Rounds down required maxNumberOfBits to the nearest number that is made up of all ones as a binary number.static intgetNearestSetSize(int maxNumberOfValuesExpected, float desiredSaturation) Use this method to choose a set size where accuracy (low content saturation) is more important than deciding how much memory to throw at the problem.floatstatic HashFunctionhashFunctionForVersion(int version) private FuzzySet.ContainsResultmayContainValue(int positiveHash) longReturn the memory usage of this object in bytes.voidserialize(DataOutput out) Serializes the data set to file using the following format: FuzzySet -->FuzzySetVersion,HashFunctionName,BloomSize, NumBitSetWords,BitSetWordNumBitSetWords HashFunctionName -->StringThe name of a ServiceProvider registeredHashFunctionFuzzySetVersion -->Uint32The version number of theFuzzySetclass BloomSize -->Uint32The modulo value used to project hashes into the field's Bitset NumBitSetWords -->Uint32The number of longs (as returned fromFixedBitSet.getBits()) BitSetWord -->LongA long from the array returned byFixedBitSet.getBits()toString()Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, waitMethods inherited from interface org.apache.lucene.util.Accountable
getChildResources
-
Field Details
-
VERSION_SPI
public static final int VERSION_SPI- See Also:
-
VERSION_START
public static final int VERSION_START- See Also:
-
VERSION_CURRENT
public static final int VERSION_CURRENT- See Also:
-
hashFunction
-
filter
-
bloomSize
private int bloomSize -
usableBitSetSizes
static final int[] usableBitSetSizes
-
-
Constructor Details
-
FuzzySet
-
-
Method Details
-
hashFunctionForVersion
-
getNearestSetSize
public static int getNearestSetSize(int maxNumberOfBits) Rounds down required maxNumberOfBits to the nearest number that is made up of all ones as a binary number. Use this method where controlling memory use is paramount. -
getNearestSetSize
public static int getNearestSetSize(int maxNumberOfValuesExpected, float desiredSaturation) Use this method to choose a set size where accuracy (low content saturation) is more important than deciding how much memory to throw at the problem.- Parameters:
desiredSaturation- A number between 0 and 1 expressing the % of bits set once all values have been recorded- Returns:
- The size of the set nearest to the required size
-
createSetBasedOnMaxMemory
-
createSetBasedOnQuality
-
contains
The main method required for a Bloom filter which, given a value determines set membership. Unlike a conventional set, the fuzzy set returns NO or MAYBE rather than true or false.- Returns:
- NO or MAYBE
-
serialize
Serializes the data set to file using the following format:- FuzzySet -->FuzzySetVersion,HashFunctionName,BloomSize, NumBitSetWords,BitSetWordNumBitSetWords
- HashFunctionName -->
StringThe name of a ServiceProvider registeredHashFunction - FuzzySetVersion -->
Uint32The version number of theFuzzySetclass - BloomSize -->
Uint32The modulo value used to project hashes into the field's Bitset - NumBitSetWords -->
Uint32The number of longs (as returned fromFixedBitSet.getBits()) - BitSetWord -->
LongA long from the array returned byFixedBitSet.getBits()
- Parameters:
out- Data output stream- Throws:
IOException- If there is a low-level I/O error
-
deserialize
- Throws:
IOException
-
mayContainValue
-
addValue
Records a value in the set. The referenced bytes are hashed and then modulo n'd where n is the chosen size of the internal bitset.- Parameters:
value- the key value to be hashed- Throws:
IOException- If there is a low-level I/O error
-
downsize
- Parameters:
targetMaxSaturation- A number between 0 and 1 describing the % of bits that would ideally be set in the result. Lower values have better accuracy but require more space.- Returns:
- a smaller FuzzySet or null if the current set is already over-saturated
-
getEstimatedUniqueValues
public int getEstimatedUniqueValues() -
getEstimatedNumberUniqueValuesAllowingForCollisions
public static int getEstimatedNumberUniqueValuesAllowingForCollisions(int setSize, int numRecordedBits) -
getSaturation
public float getSaturation() -
ramBytesUsed
public long ramBytesUsed()Description copied from interface:AccountableReturn the memory usage of this object in bytes. Negative values are illegal.- Specified by:
ramBytesUsedin interfaceAccountable
-
toString
-