This page is a complement to the paper Derived-Term Automata of Multitape Expressions with Composition. This page exists in several forms:
More information is available here:
You may change the cells, and run then. To run a cell in a notebook, hit "Control-Enter" (in which case the focus stays on the same cell) or "Shift-Enter" (focus goes to the next cell). Beware that depending on the requested operations, Vcsn may generate and compile code, which may be a really slow process on small machines (about a minute): be patient! However, the code is compiled only once: successive uses will be way faster.
To run all the cells anew, select "Restart & Run All" in the "Kernel" menu above. $\newcommand{\Ed}{\mathsf{E}} \newcommand{\Fd}{\mathsf{F}} \newcommand{\coloneqq}{\mathrel{\vcenter{:}}=}$
import vcsn
vcsn.setenv(SEED=1)
vcsn.table = vcsn.ipython.table
# There will be examples with benchmarks, tell the test suite to ignore the durations.
# global ignore: \d+ms
First we introduce the context we are interested in: labels are letter-or-empty-word (lan
), two tapes (lat
), weights are in the tropical semiring $\mathbb{Z}_{\min} := \langle \mathbb{Z}, \min, +, \infty, 0 \rangle$.
zmin2 = vcsn.context('lat<lan, lan>, zmin')
zmin2
The following examples show how the multiplication and addition of $\mathbb{Z}_{\min}$ behave:
zmin2.weight('1') * zmin2.weight('3')
zmin2.weight('1') + zmin2.weight('3')
The first automaton, $\mathcal{A}$, is obtained from the following rational expression (the empty word is noted \e
in Vcsn, and rendered $\varepsilon$ even as an expression, whereas in the paper it is written $\mathsf{1}$).
zmin2.expression(r'(a|a+b|b+⟨1⟩(\e|(a+b)+(a+b)|\e))∗')
Or, using syntactic sugar ($a \coloneqq a|a$, $[ab] \coloneqq a+b$):
e1 = zmin2.expression(r'([ab] + ⟨1⟩(\e|[ab] + [ab]|\e))∗')
e1
a1 = e1.derived_term()
a1
Admittedly, the display of multitape labels on transition could be improved. The rather cryptic [ε|a-a|εb|ε]
denotes ε|a+ε|b+a|ε+b|ε
. This will be addressed in future versions of Vcsn.
Ng's automaton is:
e2 = zmin2.expression('[ab]∗ (⟨1⟩(\e|[ab] + [ab]|\e) + ⟨2⟩(a|b + b|a))∗')
e2
e2.derived_term()
Or, cleaned from state decorations:
a2 = e2.automaton('expansion')
a2
Mohri factors his automaton $\mathcal{A}$ in the composition of $\mathcal{A}_1$ and $\mathcal{A}_2$:
f1 = zmin2.expression('([ab] + ⟨1⟩(\e|i + [ab]|s))∗')
f2 = zmin2.expression('([ab] + i|[ab] + s|\e)∗')
f = f1.compose(f2)
vcsn.table([["Name", "Expression", "Automaton"],
["$\mathcal{A}_1$", f1, f1.derived_term()],
["$\mathcal{A}_2$", f2, f2.derived_term()],
["$\mathcal{A}$", f, f.derived_term()]])
The resulting automaton is exactly the automaton $\mathcal{A}$.
The set of operators supported by Vcsn is quite large, see the documentation for details. We will show a few simple examples.
We start with a simple example of a single tape expression whose labels are letters-or-empty-word (lan
), and weights are rational numbers. We first introduce what in Vcsn is called a context which is simply an object to type automata, expressions, expansions, etc.
q = vcsn.context('lan, q')
q
From this context, we can build (typed) expressions:
e = q.expression('(<1/6>a*+<1/3>b*)*')
e
Let's introduce qexp
as a shorthand to build an expression from a string.
qexp = q.expression
qexp('(<1/6>a*+<1/3>b*)*')
The usual ${}^+$ quantifier ($\Ed^+ \coloneqq \Ed\Ed^*$) is written {+}
, to disambiguate from the binary operator ($\Ed + \Fd$):
qexp('a{+}+b{+}')
The following examples show the trivial identities in action.
def example(*es):
import pandas as pd
pd.options.display.max_colwidth = 0
ids = ['none', 'trivial']
res = [[e] + ['${:x}$'.format(qexp(e, id)) for id in ids]
for e in es]
return pd.DataFrame(res, columns=['.' + ' ' * 20 + id.title() for id in ['input'] + ids])
example('\z+a', 'a+a', '\za', '\ea', '(\e\za+\z)*')
Vcsn also provides Python operators to assemble expressions.
('1/6' * qexp('a').star() + '1/3' * qexp('b').star()).star()
The Python operator |
builds a multitape expression from simple tape ones, and the Python operator @
denotes composition. As a matter of fact, the reason why composition is denoted $@$ in Vcsn is because that is the only remaining operator available in Python. Besides, more beautiful symbols, such as $\circ$ for instance, are technical to type.
qexp('a*') | qexp('b*')
(qexp('a*') | qexp('b')) @ (qexp('b') | qexp('c*'))
We start with a simple example of a single tape expression whose labels are letters-or-empty-word (lan
), and weights are rational numbers.
e = qexp('a+<2>bc*')
e
Its expansion is:
e.expansion()
And its derived-term automaton (contrary to the paper, in Vcsn the empty expression is denoted $\varepsilon$ instead of $\mathsf{1}$):
e.derived_term()
which is smaller than its standard automaton:
e.standard()
The following example demonstrates how successors of a state are computed. It uses the shuffle operator, denoted :
, for illustration.
qexp('ab:xy').derived_term()
Labels are pairs of letter-or-empty-word (lat<lan, lan>
), and weights are rational numbers (q
).
q2 = vcsn.context('lat<lan(abcde), lan(xy)>, q')
q2
The expression $\mathsf{E}_1$ is:
e1 = q2.expression('⟨4⟩ade*|x + ⟨3⟩bde*|x + ⟨2⟩ace*|xy + ⟨6⟩bce*|xy')
e1
Its expansion is:
e1.expansion()
The derived-term automaton of $\mathsf{E}_1$, $\mathcal{A}_{\mathsf{E}_1}$, is:
a1 = e1.derived_term()
a1
The ten shortest accepted multitape words are:
a1.shortest(10)
We introduce a three-tape context. Unfortunately the automatic graphical rendering is poor.
q3 = vcsn.context('lat<lan, lan, lan>, q')
a3 = q3.expression('a*|b*|c*').derived_term()
a3
From multitape automata currently Vcsn extracts only simple-tape expressions over multitape generators (e.g. $(\varepsilon | a)^*$) instead of general mutitape expressions (e.g. $\varepsilon | a^*$):
a3.expression()
In the case of the corresponding five-tape expression, the generated graphical rendering is so messy that it is totally useless (try it if you want!). So instead of displaying the automaton, we list its states.
q5 = vcsn.context('lat<lan(a), lan(b), lan(c), lan(d), lan(e)>, q')
e5 = q5.expression('a*|b*|c*|d*|e*')
e5
e5.expansion()
a5 = e5.derived_term()
a5.states()
e2 = q2.expression('(a{+}|x + b{+}|y)*')
e2
e2.expansion()
a2 = e2.derived_term()
a2
Again, the extracted expression is less readable than the one we started from.
a2.expression()
The previous examples often looked like sed-like substitutions, in the sense that the first tape was often a composite expression, but the second tape a simple label. There is no such limitation.
e = q2.expression('(<2>[ab])* | (<3>[xy])*')
e
a = e.derived_term()
a
a.shortest(20)
f1 = zmin2.expression('([ab] + ⟨1⟩(\e|I + [ab]|S))∗')
f2 = zmin2.expression('([ab] + I|[ab] + S|\e)∗')
f = f1 @ f2
f.derived_term()
Here we show, as will be discussed in Section 6.2, the role played by identities. Here, they are fully disabled (via the 'none'
argument).
f1 = zmin2.expression('([ab] + ⟨1⟩(\e|I + [ab]|S))∗', 'none')
f2 = zmin2.expression('([ab] + I|[ab] + S|\e)∗', 'none')
f = f1 @ f2
f.derived_term()
Then we show an example with the endmarker (as discussed in Section 6.4).
f1 = zmin2.expression('([ab] + ⟨1⟩(\e|I + [ab]|S))∗ ($|$)')
f2 = zmin2.expression('([ab] + S|[ab] + S|\e)∗ ($|$)')
f = f1 @ f2
f.derived_term()
In this example we used symbolic weights, $k, h$. To simulate this we introduce expressions whose weights are expressions (whose weights are in $\mathbb{Q}$):
eset2 = vcsn.context('lat<lan(ab), lan(ab)>, expressionset<lan(kh), q>')
eset2
Then we build and display the expression, its expansion, and its derived-term automaton.
e = eset2.expression('(⟨k⟩\e|a)∗ @ (⟨h⟩aa|\e)∗', 'trivial')
vcsn.display(e, e.expansion(), e.derived_term())
vcsn.setenv(OLDWAY=0, DENORM=1)
e = eset2.expression('(⟨k⟩\e|a)∗ @ (⟨h⟩aa|\e)∗', 'trivial')
vcsn.display(e, e.expansion(), e.derived_term())
The following function takes a rational expression, and generates derived-term automata from denormalized expansions (simpler equation, but generates many spontaneous transitions), and from "normal" expansions. The former are often simpler, but may generate invalid automata.
vcsn.setenv(DENORM=0, OLDWAY=0)
import vcsn
def compare(e, name=None, ctx='lan, q'):
if not isinstance(e, vcsn.expression):
e = vcsn.context(ctx).expression(e, identities='trivial')
if name:
e_named = e.name(name)
else:
e_named = e
a_norm = e_named.derived_term()
vcsn.setenv(NAIVE_MUL=1, NAIVE_STAR=1)
a_denorm = e_named.derived_term()
vcsn.setenv(NAIVE_MUL=0, NAIVE_STAR=0)
return vcsn.table([['Expression', 'Denormalized', 'Normal'], [e, a_denorm, a_norm]])
compare('a*b*c*')
compare('(<1/2>\e)*')
compare('(a*b* + <-1>\e)*', name='E')
compare('((<1/2>\e)* + <-2>\e)*', name='E')
compare('((\e|ab @ ab|\e) + <-1>\e)*', ctx='lat<lan, lan>, q', name='E')
vcsn.setenv(DENORM=0, OLDWAY=0)
def dterm(e):
'''The derived-term of `e`, or the first line of the error if
it fails.'''
try:
return e.derived_term()
except RuntimeError as e:
return 'Error: ' + str(e).split('\n')[0]
def compare(e, ctx='lan, q'):
exp = vcsn.context(ctx).expression
return vcsn.table([['No identities: Expression', 'No identities: Automaton'],
[exp(e, 'none'), dterm(exp(e, 'none'))],
['Trivial identities: Expression', 'Trivial identities: Automaton'],
[exp(e, 'trivial'), dterm(exp(e, 'trivial'))]])
compare('a*')
compare('a + ⟨0⟩\e* + \z\e*')
compare('abc\z')
We compare the quality of the generated automata (smaller is better) and the duration of the generation procedure (again, smaller is better). Vcsn feature two generation procedures for multitape expressions: derived-term, and "inductive".
"Inductive" is a simple recursive translation of the rational expression that applies the corresponding operation on the automata. For single-tape expressions, it uses the "standard" operations on automata, which are the ones used to implement the expression.standard algorithm. The comparison will be performed on twenty randomly generated expressions.
zmin2 = vcsn.context('lat<lan, lan>, zmin')
from vcsn.tools import _timeit
def timeit(fun):
'Run `fun` a number of times, and return its fastest run in milliseconds.'
return '{}ms'.format(round(_timeit(lambda: fun())[1]))
title = ['Rational Expression',
'DT: transitions', 'Ind: transitions',
'DT: states', 'Ind: states',
'DT: duration', 'Ind: duration']
def compare(e):
# Make sure this is an object expression, not just a string.
if not isinstance(e, vcsn.expression):
e = zmin2.expression(e)
# Compute its derived-term automaton.
a1 = e.automaton('expansion')
t1 = timeit(lambda: e.automaton('expansion'))
# Compute another automaton by mapping operators to one of their natural implementation.
# In the case of single-tape automata, this yield the standard (aka Glushkov) automaton.
a2 = e.automaton('inductive')
t2 = timeit(lambda: e.automaton('inductive'))
# Return their numbers of states.
return [e,
a1.info('number of transitions'), a2.info('number of transitions'),
a1.state_number(), a2.state_number(),
t1, t2]
# ignore cell
zmin2 = vcsn.context('lat<lan(ab), lan(ab)>, zmin')
res = [title]
for i in range(20):
e = zmin2.random_expression('+, ., *=.2, w., .w, w="min=0, max=3", |=.5, @', length=20)
c = compare(e)
res.append(c)
vcsn.table(res)
Unfortunately, random expressions are rarely interesting (as can be seen by the number of transitions).
Now we compare the behavior of both algorithms on a familly of rational expressions: $n \mapsto [ab]^* (a|b)^n [ab]^*$.
res = [title]
for i in range(20):
e = zmin2.expression('[ab]* (a|b){{{}}} [ab]*'.format(i))
res.append(compare(e))
vcsn.table(res)
In this section, we compare the size of automata generated either using the "standard automaton" construction (aka, Glushkov automata), or the expansion-based derived-term automaton as exposed in the paper.
def exp(e, ctx=None):
'''Make an expression from `e`, using the context `ctx`
if `e` is not already an expression.'''
if isinstance(e, vcsn.expression):
return e
else:
if not isinstance(ctx, vcsn.context):
ctx = vcsn.context(ctx)
return ctx.expression(e)
def compare_one(e):
'''Compare the sizes of the derived-term, denormalized derived-term, and inductive
automata from exprssion `e`.'''
e = exp(e, ctx=zmin2)
ind = e.inductive()
dte = e.derived_term()
vcsn.setenv(OLDWAY=1, DENORM=1)
den = e.derived_term()
vcsn.setenv(OLDWAY=0, DENORM=0)
return [e,
dte.info('number of states'), dte.info('number of transitions'),
den.info('number of states'), den.info('number of transitions'),
ind.info('number of states'), ind.info('number of transitions')]
def compare(*es):
'''Compare the sizes of the derived-term, denormalized derived-term, and inductive
automata for each expression in `es`.'''
res = [['Expression',
'DTerm states', 'Dterm transitions',
'Denorm DTerm states', 'Denorm Dterm transitions',
'Inductive states', 'Inductive transitions']]
for e in es:
res.append(compare_one(e))
return res
zmin = vcsn.context('lan, zmin')
zmin2 = zmin|zmin
eset2 = vcsn.context('lat<lan, lan>, expressionset<lan, q>')
q = vcsn.context('lan, q')
q3 = vcsn.context('lat<lan, lan, lan>, q')
res = compare(# Introduction
exp('([ab] + ⟨1⟩(\e|[ab] + [ab]|\e))∗', zmin2),
exp('[ab]∗ (⟨1⟩(\e|[ab] + [ab]|\e) + ⟨2⟩(a|b + b|a))∗', zmin2),
exp('([ab] + ⟨1⟩(\e|I + [ab]|S))∗', zmin2),
exp('([ab] + I|[ab] + S|\e)∗', zmin2),
exp('(([ab] + ⟨1⟩(\e|I + [ab]|S))∗) @ (([ab] + I|[ab] + S|\e)∗)', zmin2),
exp('⟨4⟩ade*|x + ⟨3⟩bde*|x + ⟨2⟩ace*|xy + ⟨6⟩bce*|xy', zmin2),
exp('a + ⟨2⟩bc∗', ctx=zmin), # Section 3.
exp('a*|b*|c*', ctx=q3), # Example 4.
exp('a*|b*|c*|d*|e*', ctx='lat<lan(a), lan(b), lan(c), lan(d), lan(e)>, q'), # Example 4
exp('(a{+}|x + b{+}|y)*', zmin2), # Example 5
exp('(⟨k⟩\e|a)∗ @ (⟨h⟩aa|\e)∗', ctx=eset2), # Example 8, Section 5.3,
exp('a*b*c*', ctx=q), # Section 6.1
exp('(<1/2>\\e)*', ctx=q), # Section 6.1
exp('(a*b* + <-1>\\e)*', ctx=q), # Section 6.1
exp('((<1/2>\\e)* + <-2>\\e)*', ctx=q), # Section 6.1
# exp('((\\e|ab @ ab|\\e) + <-1>\\e)*', ctx='lat<lan, lan>, q'), # Section 6.1, invalid automaton.
)
t = vcsn.table(res)
t