This notebook shows how commands from Vcsn can be used to build a spell-checker (actually a spello-fixer).
We first build a minimal automaton (on $\mathbb{N}_\text{min}$) to represent a (finite) language. To make it minimal, we built a codeterministic automaton, which we determinize.
import vcsn
c = vcsn.context('lan_char, nmin')
lang = c.cotrie(filename='/usr/share/dict/words').determinize().strip()
Cotrie returns an automaton on letterset (not nullableset), which is a good idea for determinize
. But later we will want the labelset to be nullable, so require c
as final context. The resulting automaton is quite compact: compare the number of states with the number of words and the total number of characters.
lang = lang.automaton(c)
lang.info()
# The number of words, and then the total number of letters.
!wc -lc /usr/share/dict/words
Then we turn this automaton into a partial-identity transducer, i.e., each single-tape transition ($s \xrightarrow{\langle n \rangle a} s'$) is turned into a double-tape transition ($s \xrightarrow{\langle n \rangle a|a} s'$). This is because we want to compose these automata.
lang2 = lang.partial_identity()
lang2.info()
Now we build the Levenhstein automaton for the same context.
lev = lang2.context().levenshtein()
lev
Compose the Levenshtein automaton to the language transducer. This results into a transducer which on the left-tape is accepting the language fuzzily, and on the second tape, features the exact words.
fixer = lev.compose(lang2).strip()
fixer.info()
For any candidate word, build the linear transducer that represents it.
w = c.expression('automathon').automaton().partial_identity()
w
Now compose the word and the fixer
to get a computation of the Levenstheim distance of w
to any word on the language.
fix = w.compose(fixer)
fix.info()
Display the proposed correction.
fix.lightest(1)
Find the lightest path in this transducer: it maps our word to its closest word in the language.
fix.lightest_automaton()
We can ask for several proposals of close words.
fix.lightest(10).project(1)