Translate text message abbreviations into proper French.
This program is based on "Rewriting the orthography of SMS messages" by François Yvon, in Natural Language Engineering, volume 16, 2010. It relies on two dedicated automata built using machine learning.
import vcsn
ctx = vcsn.context('lal_char, nmin')
Suppose we enter 'bo' into the sms2fr algorithm. The following automaton is created:
ctx.expression('bo').automaton()
It is then composed with the graphemic model (in the real model, the weights depend on how frequently each spelling is used in the language):
ctx.expression('b(<1>o+<2>au+<2>eau)').standard()
Finally, it is composed with the syntactic model (in the real model, the weights depend on the probability that this grapheme follows the letter b):
res = ctx.expression('b(<2>o+<3>au+<1>eau)').standard()
res
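The algorithm then chooses the path with the lightest weight, here 'beau'. Weights are tropical: they add up along a path and the minimum is kept across paths (this toy example uses Nmin, while the sms2fr function works in Rmin).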
res.lightest()
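To make the selection rule concrete, here is a minimal pure-Python sketch of what "lightest" means in a tropical weightset. It does not use Vcsn: the dictionary `candidates` is a hypothetical stand-in for the automaton `res`, and `lightest_word` is a name invented for this illustration.

```python
# Hypothetical stand-in for `res`: each candidate spelling with the weights
# met along its path (the letter 'b' carries weight 0, the tropical "one").
candidates = {
    "bo": [0, 2],
    "bau": [0, 3],
    "beau": [0, 1],
}

def lightest_word(paths):
    """Return (weight, word) for the path whose summed weight is smallest."""
    return min((sum(weights), word) for word, weights in paths.items())

print(lightest_word(candidates))  # (1, 'beau')
```

Summing along a path and taking the minimum across paths is the whole rule, so 'beau' (weight 1) beats 'bo' (weight 2) and 'bau' (weight 3).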
Unfortunately, for lack of a binary format for automata, loading these automata takes about ten seconds. First we read the graphemic automaton.
if vcsn.config('configuration.lzma') != 'true':
    print("SKIP: this page requires LZMA support")
dir = vcsn.config('configuration.datadir') + '/sms2fr/'
grap = vcsn.automaton(filename=dir + 'graphemic.efsm.xz')
Then we read the syntactic automaton.
synt = vcsn.automaton(filename=dir + 'syntactic.efsm.xz').partial_identity()
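The call to `partial_identity()` turns an acceptor into a transducer that maps each accepted word to itself, which makes it usable in a composition. The sketch below mimics this on plain dictionaries rather than automata, as a rough illustration only: the helper names and all weights are hypothetical (the weights are chosen for the example, not taken from the real models), and Vcsn's own algorithms work on automata, not on enumerated word pairs.

```python
def partial_identity(words):
    """Map each weighted word to itself: {word: w} -> {(word, word): w}."""
    return {(word, word): weight for word, weight in words.items()}

def compose(r1, r2):
    """Tropical composition: add weights of matching pairs, keep the minimum."""
    result = {}
    for (x, y1), w1 in r1.items():
        for (y2, z), w2 in r2.items():
            if y1 == y2:
                key = (x, z)
                result[key] = min(result.get(key, float("inf")), w1 + w2)
    return result

# Hypothetical toy models for the input 'bo'.
sms = partial_identity({"bo": 0})
graphemic = {("bo", "bo"): 1, ("bo", "bau"): 2, ("bo", "beau"): 2}
syntactic = partial_identity({"bo": 3, "bau": 3, "beau": 1})

scored = compose(compose(sms, graphemic), syntactic)

# Project onto the output side, keeping the lightest weight per word.
output = {}
for (_, z), w in scored.items():
    output[z] = min(output.get(z, float("inf")), w)

best = min(output, key=output.get)
print(output, best)
```

With these made-up weights, 'beau' gets the smallest total and is selected.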
import re
def sms2fr(sms, k=1):
    # Graphemic automaton expects input of the format '[#this#is#my#text#]'.
    sms = re.escape('[#' + sms.replace(' ', '#') + '#]')
    ctx = vcsn.context('lan_char, rmin')
    # Translate the sms into an automaton.
    sms_aut = ctx.expression(sms).automaton().partial_identity().proper()
    # First composition with graphemic automaton.
    aut_g = sms_aut.compose(grap).coaccessible().strip()
    # Second composition with syntactic automaton.
    aut_s = aut_g.compose(synt).coaccessible().strip().project(1).proper()
    # Retrieve the k most likely paths, i.e. the candidate French translations.
    return aut_s.lightest(k)
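The input framing done at the top of `sms2fr` can be checked on its own with the standard library only. In the sketch below, `format_sms` is a hypothetical helper name that isolates that step. The escaping is presumably there so that characters such as '[' or '?' are read as plain letters rather than as expression operators, but that reading is an assumption.

```python
import re

def format_sms(sms):
    """Frame the message as '[#word#word#]', the format stated in sms2fr."""
    return '[#' + sms.replace(' ', '#') + '#]'

framed = format_sms('on svoit 2m1 ?')
print(framed)                # [#on#svoit#2m1#?#]

# re.escape puts a backslash in front of the special characters.
print(re.escape('[#slt#]'))  # \[\#slt\#\]
```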
sms2fr('slt')
sms2fr('bjr')
sms2fr('on svoit 2m1 ?')
sms2fr('t pa mal1 100 ton mento')
It is also possible to get several candidate translations by using Vcsn's implementation of K-shortest-paths algorithms.
sms2fr('bjr mond', 3)
sms2fr('tante', 3)
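For intuition about what asking for several lightest paths involves, here is a small best-first enumeration in pure Python. It is a sketch under stated assumptions, not Vcsn's implementation: it assumes non-negative weights and an acyclic automaton, and the function name `k_lightest`, the transition table `toy`, and the way weights are spread over letters are all hypothetical.

```python
import heapq

def k_lightest(transitions, initial, finals, k):
    """Return up to k (weight, word) pairs in order of increasing weight.

    transitions maps a state to a list of (label, weight, target) triples.
    Assumes non-negative weights and no cycles.
    """
    heap = [(0, "", initial)]  # partial paths: (weight, word, state)
    results = []
    while heap and len(results) < k:
        weight, word, state = heapq.heappop(heap)
        if state in finals:
            results.append((weight, word))
        for label, cost, target in transitions.get(state, []):
            heapq.heappush(heap, (weight + cost, word + label, target))
    return results

# Hypothetical automaton accepting 'bo' (2), 'bau' (3) and 'beau' (1).
toy = {
    0: [("b", 0, 1)],
    1: [("o", 2, 2), ("a", 3, 3), ("e", 1, 4)],
    3: [("u", 0, 2)],
    4: [("a", 0, 5)],
    5: [("u", 0, 2)],
}

print(k_lightest(toy, 0, {2}, 3))  # [(1, 'beau'), (2, 'bo'), (3, 'bau')]
```

Because partial paths leave the heap in order of increasing weight, complete paths are reported from lightest to heaviest, and the search stops as soon as k of them have been found.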