String Distances in Julia
The package is registered in the General
registry and so can be installed at the REPL with ] add StringDistances
.
String distances act over any pair of iterators that define length
(e.g. AbstractStrings
, GraphemeIterators
, or AbstractVectors
)
The available distances are:
Hamming() <: SemiMetric
Jaro()
JaroWinkler() <: SemiMetric
Levenshtein() <: Metric
OptimalStringAlignment() <: SemiMetric
DamerauLevenshtein() <: Metric
RatcliffObershelp() <: SemiMetric
q
in each string)
QGram(q::Int) <: SemiMetric
Cosine(q::Int) <: SemiMetric
Jaccard(q::Int) <: SemiMetric
Overlap(q::Int) <: SemiMetric
SorensenDice(q::Int) <: SemiMetric
MorisitaOverlap(q::Int) <: SemiMetric
NMD(q::Int) <: SemiMetric
Following the Distances.jl
package, string distances can inherit from two abstract types: StringSemiMetric <: SemiMetric
or StringMetric <: Metric
.
You can always compute a certain distance between two strings using the following syntax
r = evaluate(dist, x, y)
r = dist(x, y)
Here, dist
is an instance of a distance type: for example, the type for the Levenshtein distance is Levenshtein
. You can compute the Levenshtein distance between x
and y
as
r = evaluate(Levenshtein(), x, y)
r = Levenshtein()(x, y)
The function compare
returns the similarity score, defined as 1 minus the normalized distance between two strings. It always returns an element of type Float64
. A value of 0.0 means completely different and a value of 1.0 means completely similar.
Levenshtein()("martha", "martha")
#> 0
compare("martha", "martha", Levenshtein())
#> 1.0
Consider X
and Y
two AbstractVectors
of iterators. You can compute the matrix of distances across elements, dist(X[i], Y[j])
, as follows:
pairwise(dist, X, Y)
For instance,
pairwise(Jaccard(3), ["martha", "kitten"], ["marhta", "sitting"])
pairwise
is optimized in various ways (e.g., for the case of QGram-distances, dictionary of qgrams are pre-computed)
The package also adds convenience functions to find elements in a iterator of strings closest to a given string
findnearest
returns the value and index of the element in itr
with the highest similarity score with s
. Its syntax is:
indnearest(s, itr, dist)
findall
returns the indices of all elements in itr
with a similarity score with s
higher than a minimum score. Its syntax is:
indall(s, itr, dist; min_score = 0.8)
The functions findnearest
and findall
are particularly optimized for the Levenshtein
and OptimalStringAlignment
distances, as these algorithm can stop early if the distance becomes higher than a certain threshold.
The package also defines Distance "modifiers" that are inspired by the Python package - fuzzywuzzy. These modifiers are particularly helpful to match strings composed of multiple words (e.g. addresses, company names).
Partial
, TokenSort
and TokenSet
modifiers, with penalty terms depending on string. TokenMax(Levenshtein())
corresponds to the distance defined in fuzzywuzzy
Levenshtein()("this string", "this string is longer") = 10
Partial(Levenshtein())("this string", "this string is longer") = 0