./textproc/libstemmer, Snowball compiler and the stemming algorithms

Branch: CURRENT, Version: 2.1.0nb1, Package name: libstemmer-2.1.0nb1, Maintainer: ryoon

The snowball compiler and the stemming algorithms

Snowball 2.1.0 (2021-01-21)


* Fix decoding of 4-byte UTF-8 sequences in `grouping` checks.  This bug
  affected Unicode codepoints U+40000 to U+7FFFF and U+C0000 to U+FFFFF and
  doesn't affect any of the stemming algorithms we currently ship (#138,
  reported by Stephane Carrez).
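
The two affected ranges are exactly the 4-byte codepoints with bit 18 set. In a 4-byte UTF-8 sequence that bit comes from the lowest payload bit of the lead byte. The Python sketch below illustrates the decoding arithmetic only. It is not the C runtime's code, and the names `decode_utf8_4byte` and `uses_lead_low_bit` are hypothetical.

```python
def decode_utf8_4byte(seq: bytes) -> int:
    """Decode a single 4-byte UTF-8 sequence into its codepoint."""
    if len(seq) != 4 or (seq[0] & 0xF8) != 0xF0:
        raise ValueError("not a 4-byte UTF-8 sequence")
    b0, b1, b2, b3 = seq
    # The lead byte 11110xxx contributes 3 payload bits (bits 18-20),
    # each continuation byte 10xxxxxx contributes 6 bits.
    return ((b0 & 0x07) << 18
            | (b1 & 0x3F) << 12
            | (b2 & 0x3F) << 6
            | (b3 & 0x3F))


def uses_lead_low_bit(codepoint: int) -> bool:
    """True if the codepoint needs the lowest payload bit of the lead byte."""
    return bool(codepoint & (1 << 18))
```

Under this model, `uses_lead_low_bit` is true for U+40000 to U+7FFFF and U+C0000 to U+FFFFF, and false for every other codepoint that needs four bytes.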


* Fix snowballstemmer.algorithms() method (#132, reported by kkaiser).

* Update code to generate trove language classifiers for PyPI.  All the
  natural languages we previously had stemmers for have now been added to
  PyPI's list, but Armenian and Yiddish aren't on it.  Patch from Dmitry


Code Quality Improvements

* Suppress GCC warning in compiler code.

* Use `const` pointers more in C runtime.

* Only use spaces for indentation in JavaScript code.  Change proposed by Emily
  Marigold Klassen in #123, and seems to be the modern JavaScript norm.

New Code Generators

* Add Ada generator from Stephane Carrez (#135).

New Snowball Language Features

* `lenof` and `sizeof` can now be applied to a literal string, which can be
  useful if you want to do calculations on cursor values.

  This change actually simplifies the language a little, since you can now use
  a literal string in any read-only context which accepts a string variable.
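
As a rough illustration of the two measures, assuming `lenof` counts characters and `sizeof` counts storage slots (bytes when the string is held as UTF-8), the Python sketch below computes both for a literal. The helper names `lenof` and `sizeof_utf8` are hypothetical stand-ins, not Snowball or snowballstemmer APIs.

```python
def lenof(s: str) -> int:
    """Length in characters (assumed meaning of Snowball's `lenof`)."""
    return len(s)


def sizeof_utf8(s: str) -> int:
    """Length in bytes when the string is stored as UTF-8
    (assumed meaning of Snowball's `sizeof` for a UTF-8 build)."""
    return len(s.encode("utf-8"))


# A literal with one non-ASCII character: the two measures differ.
literal = "na\u00efve"
chars, slots = lenof(literal), sizeof_utf8(literal)
```

The two values differ whenever the literal contains non-ASCII characters, which is why cursor calculations need to pick the right one.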

Code generation improvements

* General:

  + Fix bugs in the code generated to handle failure of `goto`, `gopast` or
    `try` inside `setlimit` or string-`$`.  This affected all languages (though
    the issue with `try` wasn't present for C).  These bugs don't affect any of
    the stemming algorithms we currently ship.  Reported by Stefan Petkovic on

  + Change `hop` with a negative argument to work as documented.  The manual
    says a negative argument to hop will raise signal f, but the implementation
    for all languages was actually to move the cursor in the opposite direction
    to `hop` with a positive argument.  The implemented behaviour is
    problematic as it allows invalidating implicitly saved cursor values by
    modifying the string outside the current region, so we've decided it's best
    to fix the implementation to match the documentation.

    The only Snowball code we're aware of which relies on this was the original
    version of the new Yiddish stemming algorithm, which has been updated not
    to rely on this.

    The compiler now issues a warning for `hop` with a constant negative
    argument (internally now converted to `false`), and for `hop` with a
    constant zero argument (internally now converted to `true`).

  + Canonicalise `among` actions equivalent to `()` such as `(true)` which
    previously resulted in an extra case in the among, and for Python
    we'd generate invalid Python code (`if` or `elif` with an empty body).
    Bug revealed by Assaf Urieli's Yiddish stemmer in #137.

  + Eliminate variables whose values are never used - they no longer have
    corresponding member variables, etc, and no code is generated for any
    assignments to them.

  + Don't generate anything for an unused `grouping`.

  + Stop warning "grouping X defined but not used" for a `grouping` which is
    only used to define another `grouping`.
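
The `hop` change described above can be modelled with a small Python sketch. This is a simplified model of forward-mode behaviour, assuming one cursor position per character. It is not code produced by any generator, and the function name `hop` and the use of `None` to stand for signal f are illustrative choices.

```python
from typing import Optional


def hop(cursor: int, limit: int, n: int) -> Optional[int]:
    """Model `hop n` in forward mode.

    Returns the new cursor position, or None to represent signal f.
    """
    if n < 0:
        # Documented behaviour: a negative argument raises signal f
        # rather than moving the cursor backwards.
        return None
    new_cursor = cursor + n
    if new_cursor > limit:
        # Hopping past the limit also fails.
        return None
    return new_cursor
```

In this model `hop` with a constant zero argument always succeeds and a constant negative argument always fails, which matches the compiler converting them to `true` and `false` respectively.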

* C/C++:

  + Store booleans in the same array as integers.  This means each boolean is
    stored as an int instead of an unsigned char (4 bytes instead of 1), but
    we save a pointer (4 or 8 bytes) in struct SN_env, which is a win for all
    the current stemmers.  For an algorithm which uses both integers and
    booleans, we also save the overhead of allocating a block on the heap, and
    potentially improve data locality.

  + Eliminate duplicate generated C comment for sliceto.
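
The space trade-off for storing booleans as ints can be worked through with simple arithmetic. The Python sketch below assumes a 4-byte `int` and ignores allocator overhead and struct padding. The function names are hypothetical.

```python
INT_SIZE = 4  # assumed sizeof(int) in bytes


def bytes_old_layout(num_booleans: int, pointer_size: int) -> int:
    """One unsigned char per boolean, plus the pointer to the boolean block."""
    return num_booleans * 1 + pointer_size


def bytes_new_layout(num_booleans: int) -> int:
    """One int per boolean, stored alongside the integers (no extra pointer)."""
    return num_booleans * INT_SIZE


def new_layout_is_smaller(num_booleans: int, pointer_size: int) -> bool:
    return bytes_new_layout(num_booleans) < bytes_old_layout(num_booleans, pointer_size)
```

Under these assumptions the new layout is smaller whenever three times the number of booleans is less than the pointer size. The saved heap allocation is an additional benefit that this count does not capture.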

* Pascal:

  + Avoid generating unused variables.  The Pascal code generated for the
    stemmers we ship is now warning free (tested with fpc 3.2.0).

* Python:

  + End `if`-chain with `else` where possible, avoiding a redundant test
    of the variable being switched on.  This optimisation kicks in for an
    `among` where all cases have commands.  This change seems to speed up `make
    check_python_arabic` by a few percent.
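
The `if`-chain optimisation for Python can be pictured with the following sketch. It shows only the shape of the two dispatch styles, not actual generator output. The names `among_var`, `dispatch_with_elif` and `dispatch_with_else` are illustrative, and the sketch assumes an `among` with three cases that all have commands, so the variable is always 1, 2 or 3.

```python
def dispatch_with_elif(among_var: int, log: list) -> None:
    """Every case is tested explicitly, including the last one."""
    if among_var == 1:
        log.append("action 1")
    elif among_var == 2:
        log.append("action 2")
    elif among_var == 3:  # redundant test: no other value is possible here
        log.append("action 3")


def dispatch_with_else(among_var: int, log: list) -> None:
    """The final case becomes `else`, saving one comparison."""
    if among_var == 1:
        log.append("action 1")
    elif among_var == 2:
        log.append("action 2")
    else:
        log.append("action 3")
```

The two forms behave identically only because every possible value has a command. If some case had no command, the final `else` would wrongly run for it, which is why the optimisation is limited to an `among` where all cases have commands.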

New stemming algorithms

* Add Serbian stemmer from stef4np (#113).

* Add Yiddish stemmer from Assaf Urieli (#137).

* Add Armenian stemmer from Astghik Mkrtchyan.  It's been on the website for
  over a decade, and included in Xapian for over 9 years without any negative

Behavioural changes to existing algorithms

Optimisations to existing algorithms

* kraaij_pohlmann: Use `$v = limit` instead of `do (tolimit setmark v)` since
  this generates simpler code, and also matches the code other algorithm
  implementations use.

  Probably for languages like C with optimising compilers the compiler
  will generate equivalent code anyway, but e.g. for Python this should be
  an improvement.
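
To see why the two forms are equivalent, the Python sketch below models both on a toy environment. It is a hand-written model and not generated code. The `Env` class and both function names are hypothetical, and it assumes `do` restores the cursor after running its argument.

```python
class Env:
    """Toy stemmer state: a cursor, a limit and one integer variable v."""

    def __init__(self, cursor: int, limit: int) -> None:
        self.cursor = cursor
        self.limit = limit
        self.v = 0


def mark_via_tolimit(env: Env) -> None:
    """Model of `do (tolimit setmark v)`."""
    saved = env.cursor       # `do` remembers the cursor...
    env.cursor = env.limit   # `tolimit` moves the cursor to the limit
    env.v = env.cursor       # `setmark v` stores the cursor in v
    env.cursor = saved       # ...and `do` restores it afterwards


def mark_via_assignment(env: Env) -> None:
    """Model of `$v = limit`."""
    env.v = env.limit
```

Both leave `v` equal to the limit and the cursor unchanged, but the assignment form does so without the save, move and restore steps.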

Code clarity improvements to existing algorithms

* hindi.sbl: Fix comment typo.


* Don't count `$x = x + 1` as initialising or using `x`, so it's now handled
  like `$x += 1` already is.

* Comments are now only included in the generated code if command line option
  -comments is specified.

  The comments in the generated code are useful if you're trying to debug the
  compiler, and perhaps also if you are trying to debug your Snowball code, but
  for everyone else they just bloat the code, which becomes more of an issue
  as the number of languages we support grows.

* `-parentclassname` is not only for java and csharp so don't disable it if
  those backends are disabled.

* `-syntax` now reports the value for each numeric literal.

* Report location for excessive get nesting error.

* Internally the compiler now represents negated literal numbers as a simple
  `c_number` rather than `c_neg` applied to a `c_number` with a positive value.
  This simplifies optimisations that want to check for a constant numeric

Build system

* Link binaries with LDFLAGS if it's set, which is needed for some platforms
  (e.g. OpenEmbedded).  Patch from Andreas Müller (#120).

* Add missing dependencies of algorithms.go rule.


* C: Add stemtest for low-level regression tests.


* Document a C99 compiler as a requirement for building the snowball compiler
  (but the C code it generates should still work with any ISO C compiler).

  A few declarations mixed with code crept in some time ago (which nobody's
  complained about), so this is really just formally documenting a requirement
  which already existed.

* README: Explain what Snowball is and what Stemming is (#131, reported by Sean

* CONTRIBUTING.rst: Expand section on adding a new generator.

* For Python snowballstemmer module include global NEWS instead of
  Python-specific CHANGES.rst and use README.rst as the long description.
  Patch from Dmitry Shachnev (#119).

* COPYING: Update and incorporate Python backend licensing information which
  was previously in a separate file.