Erlang nif for xor_filter. 'Faster and Smaller Than Bloom and Cuckoo Filters'.
Nif wrapper for the xor_filter: https://github.com/FastFilter/xor_singleheader
They're 'Faster and Smaller Than Bloom and Cuckoo Filters'.
This library uses dirty nifs for initializing filters over 10K elements! Make sure your environment is setup correctly. Filters of 10M elements can be initialized within 4 seconds. Within 2.5 seconds if the library is used unsafely.
The exor_benchmark repo was used to compare access times to popular bloom filter libraries.
For rebar3:
%% rebar.config
{deps, [
{exor_filter, "0.8.2"}
]}.
For Mix:
## mix.exs
defp deps do
[
{:exor_filter, "~> 0.8.2"}
]
end
Note, if you're using Erlang below version 23, then use this version of this library: v0.5.2
. Otherwise, use the latest version.
Basic usage with default hashing is as follows:
Filter = xor8:new(["cat", "dog", "mouse"]),
true = xor8:contain(Filter, "cat"),
false = xor8:contain(Filter, "goose").
Filters are initialized independently:
Filter1 = xor8:new([1, 2, 3]),
Filter2 = xor8:new([4, 5, 6]),
false = xor8:contain(Filter1, 6),
true = xor8:contain(Filter1, 2),
false = xor8:contain(Filter2, 2),
true = xor8:contain(Filter2, 5).
This is now the preferred method of usage. To create a filter incrementally, the following API should be used. It is more memory efficient than providing the entire list at initialization time. Only the default hashing method is supported. See the hashing section for more details. This method will automatically deduplicate the input safely. WARNING: Currently, the incremental API does not use dirty nifs for large input sizes. Be cautious of this, initialization can block.
Filter0 = xor8:new_empty(), %% new_empty/0 defaults to 64 elements. Either function
%% will dynamically allocate more space as
%% needed while elements are added.
Filter1 = xor8:add(Filter0, [1, 2]),
Filter2 = xor8:add(Filter1, [3, 4]), %% More space allocated here.
Filter3 = xor8:finalize(Filter3), %% finalize/1 MUST be called to actually intialize the filter.
true = xor8:contain(Filter3, 1),
false = xor8:contain(Filter3, 6).
Do not modify the return value of any of the functions. The other APIs will not function correctly.
xor8:new/1
uses the default hash algorithm.
erlang:phash2/1
.xor8:new/2
function.xor8:contain/2
function.
xor8:contain/2
or /3
. Pass the raw value!
Filter = xor8:new([1, 2, 3], none),
true = xor8:contain(Filter, 1),
false = xor8:contain(Filter, 6).
erlang:phash2/1
default_hash
as the second argument to xor8:new/2
./1
.Fun = fun(X) -> X + 1 end,
Filter = xor8:new([1, 2, 3], Fun),
true = xor8:contain(Filter, 4),
false = xor8:contain(Filter, 1).
none
. The xor8:contain/2
and /3
functions must be passed pre-hashed data in this case.
# ...
alias :xor8, as: Xor8
# ...
true =
[1, 2, 3, 4]
|> Xor8.new()
|> Xor8.contain(1)
contain/3
can return a custom value instead of false
if the value isn't present in the filter:
Filter1 = xor8:new(["Ricky Bobby", "Cal Naughton Jr."]),
true = xor8:contain(Filter1, "Ricky Bobby", {error, not_found}),
{error, not_found} = xor8:contain(Filter1, "Reese Bobby", {error, not_found}).
Functions are provided to the filter in binary form, instead of a nif reference. This can be useful to interop with other platforms / systems. The bin returned can be used with contain
for ease of use. Example usage:
Filter = xor8:new(["test1", "test2", "test3"]),
BinFilter = xor8:to_bin(Filter),
{XorFilterBin, _HashFunction} = BinFilter,
true = xor8:contain(BinFilter, "test1").
The usage of the xor16 is the same. That structure is larger, but has a smaller false positive rate. Just sub xor8
for xor16
in all of the examples.
The buffered versions of initialize are provided for larger data sets. This can be faster. See xor8:new_buffered/2
for more information.
You didn't hear it from me, though ;)
$ rebar3 compile
$ rebar3 eunit
$ rebar3 cover
$ rebar3 edoc