Random Notes for Vcsn Developper

Coding Style

  • Do by imitation: see how things are written elsewhere, and do the same.
  • When the order does not matter, use the alphabetical order.
  • Headers are issues by groups separated by an empty line, sorted alphabetically in the group.
    • C++ headers
    • Boost headers
    • Vcsn headers
  • If a function builds its return value in a variable, name it res.
  • Naming convensions

    The short names below applies to local variables. Use meaningful (not hyper-short) names for long-lived entities (e.g., function names, members, etc.).

    automaton: aut, aut_      Aut, Automaton
    LabelSet: ls, labelset,   LabelSet
    GenSet: gs                GenSet
    context: ctx              Context, Ctx
    WeightSet: ws, weightset  WeightSet
    Weight/word: w
    label: l
    state: s
    transition: t
    
    ostream: os, o
    istream: is
  • The printing functions take the printee first, then the stream.

Tools you should know

  • c++filt demangles C++ symbol names.
  • valgrind verifies memory issues.
  • nm lists the symbols in object files. Use nm -C to demangle C++ names.
  • ldd lists the libraries an object file depends upon.

Address Sanitizer

It is possible to use ASAN, but there are quite a few tricky issues.

  • be sure to pass -fsanitize=address in CXX, not CXXFLAGS, because otherwise libtool tends to move it.
  • ASAN must be loaded from the top-level executable. This is fine for TAF-Kit (so vcsn standard -Ee a for instance properly loads asan), but it is not for Python, since, it is unlikely that your Python was linked against ASAN. So you need to preload it, e.g., LD_PRELOAD=/usr/lib/clang/3.8.0/lib/linux/libclang_rt.asan-x86_64.so.
  • Python and Boost.Python seem to leak. I have tried to use the LSAN (leak sanitizer) suppression files, but with limited success. For a start, I used ASAN_OPTIONS=detect_leaks=0 to disable LSAN.

Environment variables

VCSN_DEBUG

Don't remove temporary files (which is especially useful to keep debug symbols in plugins).

Augment output with debugging information.

  • dot: display in parens the real state numbers.
  • is_ambigious: display the couple of states which is outside the diagonal.
  • proper: Read VCSN_DEBUG as an integer specifying the level of details to dump.

VCSN_DYN

Display information about registration and query about dyn algorithms.

VCSN_HOME [~/.vcsn]

Where data is stored at runtime. See VCSN_PLUGINDIR

VCSN_ITERATIVE

Specify that "power" should perform the naive iterative multiplicative approach, instead of the squaring one.

VCSN_PARENS

Force the display of useless parentheses.

VCSN_PLUGINDIR [VCSN_HOME/plugins]

Where the runtime context instantiation are generated and compiled.

VCSN_PRINT

Force a more verbose display of expression (XML like), where one can actually see the nesting of the structures.

VCSN_PYTHONDIR

The python directory.

VCSN_SEED

Disable the generation of a random seed, stick to the compile-time default seed. This ensure that the two successive runs are identical.

VCSN_TMPDIR [/tmp]

Path to the folder in which compiled contexts should be stored.

VCSN_VERBOSE

When reporting compilation error, include the full compiler error message. If defined as an integer greater than or equal to 2, then dyn compilations are reported.

VERBOSE

Make the test suite more verbose.

YYDEBUG, YYSCAN

Set to enable Bison parser/ Flex scanner tracing. Can be an integer to denote nesting (which is useful for instance for dot parsing which can fire expression parsing: specify how many layers you want to make verbose).

Tests

To run the test suite, 4 processes in parallel (e.g., if you have 4 threads like with an i7), run

$ make -j4 check

To run some of the tests (e.g., tafkit/determinize.chk), run

$ make -j4 tests/python/determinize.log

or

$ make -j4 check TESTS=tests/python/determinize.py

The latter will create tests/test-suite.log and tests/python/determinize.log, the former only the latter.

Both will run make all, which can take a while. If you know what you are doing (for instance you want to check something which is compiled at runtime), you can avoid this make all by running the test by hand:

$ ./_build/tests/bin/vcsn -e tests/python/determinize.py

You may also run only the unit/rat/demo/tafkit tests:

$ make -j4 check-tafkit

To create a new test case

Creating a new test case is quite easy: select the proper file (usually the name of the tests/python/*.py is clear enough, but make sure with git grep ALGO tests/python if the ALGO is not actually tested in a less obvious place), then do as you see with the other tests.

To create a new test file

Avoid creating too many files. For instance prefix and suffix do not deserve two different files, they can both live in prefix.py for instance.

  1. create the file tests/python/foo.py by copying some existing file
  2. make sure it is executable

    $ chmod +x tests/python/foo.py

  3. add this file to the repository

    $ git add tests/python/foo.py

  4. add this file to the list of tests in tests/python/local.mk, respect the alphabetical order

  5. run just this test to make sure that everything is ok

    $ ./_build/35s/tests/bin/vcsn -e tests/python/foo.py

To debug

The script _build/35s/tests/bin/vcsn defines the environment needed to run a non-installed version of Vcsn. To debug a Python program, use this:

./_build/35s/tests/bin/vcsn run gdb --args python -c "import vcsn;
vcsn.context('lal_char(ab), b').expression('[^]*a').derived_term()

for instance. To run a TAF-Kit command under Valgrind:

./_build/35s/tests/bin/vcsn run valgrind ./_build/35s/bin/vcsn-tafkit cat -Ee a

Note that the debug symbols (at least on OS X) are in the *.o file, not the *.so file. And the *.o is removed when the compilation succeeded. To keep it, be sure to enable VCSN_DEBUG:

VCSN_DEBUG=1 ./_build/35s/tests/bin/vcsn run gdb --args python -c "import vcsn;
vcsn.context('lal_char(ab), b').expression('[^]*a').derived_term()

Also, if your build is optimized, some variables might be optimized away. So use VCSN_CXXFLAGS to change that:

VCSN_CXXFLAGS='-std=c++14 -O0 -ggdb' \
VCSN_DEBUG=1 \
./_build/35s/tests/bin/vcsn run gdb --args python -c "import vcsn;
vcsn.context('lal_char(ab), b').expression('[^]*a').derived_term()

Benchmarks

Be sure to compile in -O3 -DNDEBUG for benchmarks. Be sure that the tools with compared against (typically OpenFST) are also compiled this way!

OpenFST

When benching against OpenFST, do not use the convenient automaton.fstFOO functions. For instance:

vcsn = [time(lambda : a.synchronize())    for a in auts]
ofst = [time(lambda : a.fstsynchronize()) for a in auts]

is completely unfair to OpenFST. Do this instead:

ofst =
  [time(lambda : subprocess.call(['fstsynchronize', fst.name, '/dev/null']))
   for fst in [a.as_fst() for a in auts]]

as it converts to FST format only once (and outside the benchmark command), and does not try to convert the result back to Vcsn. Actually, it does not even try to save the result at all (/dev/null), since Vcsn does not try to save its result either.

Commits

There is a number of rules to follow:

  • one purpose per commit
  • space changes are one purpose, therefore a separate commit
  • commit should have a title which looks like "topic: what" where
    • topic is something like doc, dyn, determinize, expression, style (for space changes, etc.)...
    • what is a sentence, starting with a lower case letter, telling what this change is about
  • commits then follow with text explaining the motivations for the commit
    • describe the bug it fixes
    • explain the alternatives and why this one was preferred
    • show results of vcsn score that demonstrate that there is no regression/an improvement
  • commits then follow with the list of files and a concise description of the changes (with an emphasize on the why rather than paraphrasing the git log).

Be sure to read other commit logs to understand the style and to copy it. Here are a few examples:

commit ab30db4e253cf88e167baf7209036e62df2820ce
Author: Akim Demaille <akim@lrde.epita.fr>
Date:   Tue Jul 22 13:05:38 2014 +0200

    translate: issue a parse error when reading an invalid context

    Currently, when parsing a context, we actually get ready to accept any
    type (including automata, ints, or pure junk).  As a result, we also
    try to compile the corresponding make_context, even for obvious parse
    errors.  This is slow, and provides a bad error message.

    Speed is especially an issue with the current %automaton IPython magic
    line, as it tries to compile the result at each key stroke, which
    results in incredibly slow interaction, as we wait for the compiler to
    fail...

    * vcsn/dyn/context-parser.hh, lib/vcsn/dyn/context-parser.cc
    (parse_context): New.
    (context_): New overload to factor some calls.
    * lib/vcsn/dyn/translate.cc: Use parse_context.
    Remove duplicate handling of EOF.

commit e105139747ae4efc83e6e4b61545feaa25d2efec
Author: Akim Demaille <akim@lrde.epita.fr>
Date:   Mon Jul 21 18:57:47 2014 +0200

    doc: determinize: be clearer

    * share/vcsn/notebooks/algos/determinize.ipynb: Show examples in Q.

commit 5308a3c6679efea3076b5de645b85f85bed1cb9a
Author: Akim Demaille <akim@lrde.epita.fr>
Date:   Wed Jul 16 10:28:59 2014 +0200

    polynomials: get back the print_weight_ function

    This basically reverts

        commit 61802295079835b2f71bb4b35545f5c8dcc2a7b7
        Author: Akim Demaille <akim@lrde.epita.fr>
        Date:   Fri Jul 11 18:32:34 2014 +0200

        determinize: use our angle-bracket notation for "weighted" states

        I see polynomials of states in our future...

        * vcsn/weightset/polynomialset.hh (print_): Move to...
        * vcsn/weightset/weightset.hh (print_weight): here.

    since we no longer need it outside polynomials, as we use
    polynomialset's print function.

    * vcsn/weightset/weightset.hh (print_weight): Move back to...
    * vcsn/weightset/polynomialset.hh (print_weight_): here.

commit 8e183c358dd10246ef57dc25fabe122f4ddb0bb0
Author: Canh Luu <ldcanh@gmail.com>
Date:   Wed Jul 2 10:13:07 2014 +0200

    determinize: remove complete

    Benchs show that even when _not_ asking for a complete result, the
    mere fact of checking whether not to complete does incur a small
    performance hit.

    Worse yet: the next commit completely changes to core loop of the
    algorithm, and making the automaton complete really adds cruft to the
    algorithm.  So instead of trying to merge two algorithms into one,
    stop supporting the 'complete' feature in determinization, just call
    complete() afterwards if needed.

    On erebus, previous commit and then this one:

       n-1   n
      7.93s 7.80s: a.determinize()      # a = ladybird(21)
      6.64s 6.45s: a.determinize()      # a = lal(a-zA-Z0-9).ladybird(18)

    * vcsn/algos/determinize.hh, lib/vcsn/algos/determinize.cc,
    * python/vcsn_cxx.cc, tests/python/determinize.py
    * vcsn/algos/are-equivalent.hh, bin/vcsn-tafkit.cc,
    * vcsn/ctx/lal_char, b.hh, vcsn/dyn/algos.hh: Here.

commit 044bf26c52b825ff155df4e103586cef68092102
Author: Luca Saiu <positron@gnu.org>
Date:   Fri Jun 27 12:08:47 2014 +0200

    subset automaton bugfix: copy shared_ptr instead of using reference to it

    (Reported by Alexandre Duret-Lutz)

    I was mistakenly storing a reference to a shared_ptr object, instead
    of copying the shared_ptr into another shared_ptr, thus updating the
    reference counter.  Of course as soon as the referred object was
    destroyed, hell broke loose.

    * vcsn/core/partition-automaton.hh: One-character fix.

    * TODO.txt: Write down that using a *const* shared_ptr for the referred
    * automaton, as I did here, is often a good idea in decorators.

commit a17e4e65d6a67dda647776164da0a5e2c6a5799b
Author: Luca Saiu <positron@gnu.org>
Date:   Thu Jun 5 16:49:14 2014 +0200

    add new partition-automaton decorator, and use it for minimization

    Build partition-automaton, a decorator related to the *shape* of an
    automaton rather than to the algorithm which produced it, which
    reflects a slight change of directions in decorators.  Use it in
    quotient to build a decorated automaton from classes -- without making
    everything inefficient: the algorithm remains identical; it could have
    been slightly simplified, but at the cost of introducing dynamic
    lookups for state sets.

    The decorator is intended to be useful for determinization as well.

    Don't accept "brzozowski" as an algorithm variant at the static level;
    rather, let the user call minimize_brzozowski directly (Rationale: the
    decorator would be very different, and typically uninteresting).
    Accept "brzozowski" as a variant at the Dyn and Python levels.

    No change in score from the previous commit.  On minsky:
    1.64s  1.64s  a.minimize("moore")  # a = std([a-k]{2000})
    2.34s  2.32s  a.minimize("signature") # a = std([a-g]{800})

    * vcsn/core/partition-automaton.hh: New.

    * vcsn/algos/minimize.hh: Disable brzozowski; don't strip input.
    * vcsn/algos/quotient.hh: Remove the VCSN_ORIGINS kludge, now obsolete here.

    * vcsn/algos/minimize-moore.hh, vcsn/algos/minimize-signature.hh,
    * vcsn/algos/minimize-weighted.hh: Change result types.

    * lib/vcsn/dyn/context-parser.cc: Here.
    * lib/vcsn/dyn/context-printer.cc: Here.

    * vcsn/local.mk: Here.

    * tests/python/minimize.py, tests/python/minimize.dir/intricate.exp.gv,
    * tests/python/minimize.dir/nonlal.exp.gv,
    * tests/python/minimize.dir/redundant.exp.gv,
    * tests/python/minimize.dir/small-nfa.exp.gv,
    * tests/python/minimize.dir/small-weighted.exp.gv,
    * tests/python/local.mk: Update testsuite.

    * vcsn/algos/minimize-brzozowski.hh: Write "brzozowski" with a 'w'.

    * NEWS.txt: Brag.
In [ ]: