java.lang.Object
org.apache.lucene.util.fst.FSTCompiler<T>
Builds a minimal FST (maps an IntsRef term to an arbitrary output) from pre-sorted terms with
outputs. The FST becomes an FSA if you use NoOutputs. The FST is written on-the-fly into a
compact serialized format byte array, which can be saved to / loaded from a Directory or used
directly for traversal. The FST is always finite (no cycles).
NOTE: The algorithm is described at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698
The parameterized type T is the output type. See the subclasses of Outputs.
FSTs larger than 2.1GB are now possible (as of Lucene 4.2). FSTs containing more than 2.1B nodes are also now possible, however they cannot be packed.
-
Nested Class Summary
Nested ClassesModifier and TypeClassDescription(package private) static classExpert: holds a pending (seen but not yet serialized) arc.static classFluent-style constructor for FSTFSTCompiler.(package private) static final class(package private) static classReusable buffer for building nodes with fixed length arcs (binary search or direct addressing).(package private) static interface(package private) static final classExpert: holds a pending (seen but not yet serialized) Node. -
Field Summary
FieldsModifier and TypeFieldDescription(package private) final boolean(package private) long(package private) long(package private) final BytesStore(package private) longprivate static final floatMaximum oversizing factor allowed for direct addressing compared to binary search when expansion credits allow the oversizing.(package private) static final float(package private) long(package private) final float(package private) long(package private) static final int(package private) static final int(package private) static final int(package private) final FSTCompiler.FixedLengthArcsBufferprivate FSTCompiler.UnCompiledNode<T>[](package private) longprivate final IntsRefBuilderprivate final T(package private) long(package private) int[](package private) int[](package private) final int -
Constructor Summary
ConstructorsModifierConstructorDescriptionprivateFSTCompiler(FST.INPUT_TYPE inputType, double suffixRAMLimitMB, Outputs<T> outputs, boolean allowFixedLengthArcs, int bytesPageBits, float directAddressingMaxOversizingFactor, int version) -
Method Summary
Modifier and TypeMethodDescriptionvoidAdd the next input/output pair.(package private) longaddNode(FSTCompiler.UnCompiledNode<T> nodeIn) compile()Returns final FST.private FSTCompiler.CompiledNodecompileNode(FSTCompiler.UnCompiledNode<T> nodeIn, int tailLength) (package private) voidfinish(long newStartNode) private voidfreezeTail(int prefixLenPlus1) longlonglongfloatlonglong(package private) voidsetEmptyOutput(T v) private booleanshouldExpandNodeWithDirectAddressing(FSTCompiler.UnCompiledNode<T> nodeIn, int numBytesPerArc, int maxBytesPerArcWithoutLabel, int labelRange) Returns whether the given node should be expanded with direct addressing instead of binary search.private booleanReturns whether the given node should be expanded with fixed length arcs.private booleanvalidOutput(T output) private voidwriteLabel(DataOutput out, int v) private voidwriteNodeForBinarySearch(FSTCompiler.UnCompiledNode<T> nodeIn, long startAddress, int maxBytesPerArc) private voidwriteNodeForDirectAddressingOrContinuous(FSTCompiler.UnCompiledNode<T> nodeIn, long startAddress, int maxBytesPerArcWithoutLabel, int labelRange, boolean continuous) private voidwritePresenceBits(FSTCompiler.UnCompiledNode<T> nodeIn)
-
Field Details
-
DIRECT_ADDRESSING_MAX_OVERSIZING_FACTOR
static final float DIRECT_ADDRESSING_MAX_OVERSIZING_FACTOR- See Also:
-
FIXED_LENGTH_ARC_SHALLOW_DEPTH
static final int FIXED_LENGTH_ARC_SHALLOW_DEPTH- See Also:
-
FIXED_LENGTH_ARC_SHALLOW_NUM_ARCS
static final int FIXED_LENGTH_ARC_SHALLOW_NUM_ARCS- See Also:
-
FIXED_LENGTH_ARC_DEEP_NUM_ARCS
static final int FIXED_LENGTH_ARC_DEEP_NUM_ARCS- See Also:
-
DIRECT_ADDRESSING_MAX_OVERSIZE_WITH_CREDIT_FACTOR
private static final float DIRECT_ADDRESSING_MAX_OVERSIZE_WITH_CREDIT_FACTORMaximum oversizing factor allowed for direct addressing compared to binary search when expansion credits allow the oversizing. This factor prevents expansions that are obviously too costly even if there are sufficient credits.- See Also:
-
dedupHash
-
fst
-
NO_OUTPUT
-
lastInput
-
frontier
-
lastFrozenNode
long lastFrozenNode -
numBytesPerArc
int[] numBytesPerArc -
numLabelBytesPerArc
int[] numLabelBytesPerArc -
fixedLengthArcsBuffer
-
arcCount
long arcCount -
nodeCount
long nodeCount -
binarySearchNodeCount
long binarySearchNodeCount -
directAddressingNodeCount
long directAddressingNodeCount -
continuousNodeCount
long continuousNodeCount -
allowFixedLengthArcs
final boolean allowFixedLengthArcs -
directAddressingMaxOversizingFactor
final float directAddressingMaxOversizingFactor -
version
final int version -
directAddressingExpansionCredit
long directAddressingExpansionCredit -
bytes
-
-
Constructor Details
-
FSTCompiler
private FSTCompiler(FST.INPUT_TYPE inputType, double suffixRAMLimitMB, Outputs<T> outputs, boolean allowFixedLengthArcs, int bytesPageBits, float directAddressingMaxOversizingFactor, int version)
-
-
Method Details
-
getDirectAddressingMaxOversizingFactor
public float getDirectAddressingMaxOversizingFactor() -
getNodeCount
public long getNodeCount() -
getArcCount
public long getArcCount() -
getMappedStateCount
public long getMappedStateCount() -
compileNode
private FSTCompiler.CompiledNode compileNode(FSTCompiler.UnCompiledNode<T> nodeIn, int tailLength) throws IOException - Throws:
IOException
-
addNode
- Throws:
IOException
-
writeLabel
- Throws:
IOException
-
shouldExpandNodeWithFixedLengthArcs
Returns whether the given node should be expanded with fixed length arcs. Nodes will be expanded depending on their depth (distance from the root node) and their number of arcs.Nodes with fixed length arcs use more space, because they encode all arcs with a fixed number of bytes, but they allow either binary search or direct addressing on the arcs (instead of linear scan) on lookup by arc label.
-
shouldExpandNodeWithDirectAddressing
private boolean shouldExpandNodeWithDirectAddressing(FSTCompiler.UnCompiledNode<T> nodeIn, int numBytesPerArc, int maxBytesPerArcWithoutLabel, int labelRange) Returns whether the given node should be expanded with direct addressing instead of binary search.Prefer direct addressing for performance if it does not oversize binary search byte size too much, so that the arcs can be directly addressed by label.
- See Also:
-
writeNodeForBinarySearch
private void writeNodeForBinarySearch(FSTCompiler.UnCompiledNode<T> nodeIn, long startAddress, int maxBytesPerArc) -
writeNodeForDirectAddressingOrContinuous
private void writeNodeForDirectAddressingOrContinuous(FSTCompiler.UnCompiledNode<T> nodeIn, long startAddress, int maxBytesPerArcWithoutLabel, int labelRange, boolean continuous) -
writePresenceBits
-
freezeTail
- Throws:
IOException
-
add
Add the next input/output pair. The provided input must be sorted after the previous one according toIntsRef.compareTo(org.apache.lucene.util.IntsRef). It's also OK to add the same input twice in a row with different outputs, as long asOutputsimplements theOutputs.merge(T, T)method. Note that input is fully consumed after this method is returned (so caller is free to reuse), but output is not. So if your outputs are changeable (egByteSequenceOutputsorIntSequenceOutputs) then you cannot reuse across calls.- Throws:
IOException
-
setEmptyOutput
-
finish
void finish(long newStartNode) -
validOutput
-
compile
Returns final FST. NOTE: this will return null if nothing is accepted by the FST.- Throws:
IOException
-
fstRamBytesUsed
public long fstRamBytesUsed() -
fstSizeInBytes
public long fstSizeInBytes()
-