Subject: CVS commit: pkgsrc/textproc/libstemmer
From: Thomas Klausner
Date: 2021-02-18 11:26:56
Message id: 20210218102657.00347FA95@cvs.NetBSD.org
Log Message:
libstemmer: update to 2.1.0.
Snowball 2.1.0 (2021-01-21)
===========================
C/C++
-----
* Fix decoding of 4-byte UTF-8 sequences in `grouping` checks. This bug
affected Unicode codepoints U+40000 to U+7FFFF and U+C0000 to U+FFFFF and
doesn't affect any of the stemming algorithms we currently ship (#138,
reported by Stephane Carrez).
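The two affected ranges are exactly the codepoints with bit 18 (0x40000) set, which is the lowest payload bit of the lead byte in a 4-byte UTF-8 sequence. As an illustration only, here is a hand-written Python sketch of correct 4-byte decoding. It is not the C runtime code, the function name `decode_utf8_4byte` is hypothetical, and it does not check for overlong or out-of-range sequences.

```python
def decode_utf8_4byte(data: bytes) -> int:
    """Decode one 4-byte UTF-8 sequence into a Unicode codepoint (sketch)."""
    if len(data) != 4:
        raise ValueError("expected exactly 4 bytes")
    b0, b1, b2, b3 = data
    # Lead byte must look like 11110xxx, continuation bytes like 10xxxxxx.
    if (b0 & 0xF8) != 0xF0 or any((b & 0xC0) != 0x80 for b in (b1, b2, b3)):
        raise ValueError("not a 4-byte UTF-8 sequence")
    # The lead byte carries 3 payload bits (bits 18-20 of the codepoint);
    # each continuation byte carries 6 payload bits.
    return (
        ((b0 & 0x07) << 18)
        | ((b1 & 0x3F) << 12)
        | ((b2 & 0x3F) << 6)
        | (b3 & 0x3F)
    )


# Boundary codepoints of the ranges named above round-trip correctly.
for cp in (0x40000, 0x7FFFF, 0xC0000, 0xFFFFF):
    assert decode_utf8_4byte(chr(cp).encode("utf-8")) == cp
```

Dropping or mis-masking the low bit of `b0` would produce wrong codepoints for exactly these ranges, while leaving the rest of the 4-byte space untouched.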
Python
------
* Fix snowballstemmer.algorithms() method (#132, reported by kkaiser).
* Update code to generate trove language classifiers for PyPI. All the
natural languages we previously had stemmers for have now been added to
PyPI's list, but Armenian and Yiddish aren't on it. Patch from Dmitry
Shachnev.
Java
----
Code Quality Improvements
-------------------------
* Suppress GCC warning in compiler code.
* Use `const` pointers more in C runtime.
* Only use spaces for indentation in JavaScript code. Change proposed by Emily
Marigold Klassen in #123; this seems to be the modern JavaScript norm.
New Code Generators
-------------------
* Add Ada generator from Stephane Carrez (#135).
New Snowball Language Features
------------------------------
* `lenof` and `sizeof` can now be applied to a literal string, which can be
useful if you want to do calculations on cursor values.
This change actually simplifies the language a little, since you can now use
a literal string in any read-only context which accepts a string variable.
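As a rough illustration of what the two operators measure on a literal, here is a small Python model. It assumes that `lenof` counts characters and `sizeof` counts slots, and that a slot is one byte when strings are held as UTF-8 (as in the C backend). The helper names `lenof` and `sizeof_utf8` are hypothetical and this is not compiler output.

```python
def lenof(s: str) -> int:
    """Number of Unicode characters in s (assumed meaning of `lenof`)."""
    return len(s)


def sizeof_utf8(s: str) -> int:
    """Number of slots in s, assuming one slot per UTF-8 byte."""
    return len(s.encode("utf-8"))


literal = "été"
# Three characters, but the two accented letters take two bytes each.
assert lenof(literal) == 3
assert sizeof_utf8(literal) == 5
```

If cursor values are measured in slots, the slot count is the one that matters when adding a literal's size to a cursor position.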
Code generation improvements
----------------------------
* General:
+ Fix bugs in the code generated to handle failure of `goto`, `gopast` or
`try` inside `setlimit` or string-`$`. This affected all languages (though
the issue with `try` wasn't present for C). These bugs don't affect any of
the stemming algorithms we currently ship. Reported by Stefan Petkovic on
snowball-discuss.
+ Change `hop` with a negative argument to work as documented. The manual
says a negative argument to hop will raise signal f, but the implementation
for all languages was actually to move the cursor in the opposite direction
to `hop` with a positive argument. The implemented behaviour is
problematic as it allows invalidating implicitly saved cursor values by
modifying the string outside the current region, so we've decided it's best
to fix the implementation to match the documentation.
The only Snowball code we're aware of which relied on this was the original
version of the new Yiddish stemming algorithm, which has since been updated
not to rely on it.
The compiler now issues a warning for `hop` with a constant negative
argument (internally now converted to `false`), and for `hop` with a
constant zero argument (internally now converted to `true`).
+ Canonicalise `among` actions equivalent to `()` such as `(true)` which
previously resulted in an extra case in the among, and for Python
we'd generate invalid Python code (`if` or `elif` with an empty body).
Bug revealed by Assaf Urieli's Yiddish stemmer in #137.
+ Eliminate variables whose values are never used - they no longer have
corresponding member variables, etc, and no code is generated for any
assignments to them.
+ Don't generate anything for an unused `grouping`.
+ Stop warning "grouping X defined but not used" for a `grouping` which is
only used to define another `grouping`.
* C/C++:
+ Store booleans in the same array as integers. Each boolean is now stored as
an int rather than an unsigned char (4 bytes instead of 1), but we save a
pointer (4 or 8 bytes) in struct SN_env, which is a win for all the current
stemmers. For an algorithm which uses both integers and booleans, we also
save the overhead of allocating a block on the heap, and potentially improve
data locality.
+ Eliminate duplicate generated C comment for sliceto.
* Pascal:
+ Avoid generating unused variables. The Pascal code generated for the
stemmers we ship is now warning free (tested with fpc 3.2.0).
* Python:
+ End `if`-chain with `else` where possible, avoiding a redundant test
of the variable being switched on. This optimisation kicks in for an
`among` where all cases have commands. This change seems to speed up `make
check_python_arabic` by a few percent.
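The Python `if`-chain change in the last item can be pictured with a hand-written sketch. The names `dispatch_before`, `dispatch_after` and `among_var`, and the returned strings, are hypothetical stand-ins rather than actual generated code. The sketch assumes `among_var` can only take the values 1 to 3 because every case of the `among` has a command.

```python
def dispatch_before(among_var: int) -> str:
    # Old shape: every case tests among_var, including the last one,
    # even though no other value is possible at that point.
    if among_var == 1:
        return "case 1"
    elif among_var == 2:
        return "case 2"
    elif among_var == 3:
        return "case 3"
    return "unreachable"


def dispatch_after(among_var: int) -> str:
    # New shape: the final case is reached via a plain `else`,
    # which skips one redundant comparison.
    if among_var == 1:
        return "case 1"
    elif among_var == 2:
        return "case 2"
    else:
        return "case 3"


# Both shapes agree on every value the among can produce.
for value in (1, 2, 3):
    assert dispatch_before(value) == dispatch_after(value)
```

The saving per call is a single comparison, so any speed-up only shows where the dispatch runs in a hot path.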
New stemming algorithms
-----------------------
* Add Serbian stemmer from stef4np (#113).
* Add Yiddish stemmer from Assaf Urieli (#137).
* Add Armenian stemmer from Astghik Mkrtchyan. It's been on the website for
over a decade, and included in Xapian for over 9 years without any negative
feedback.
Behavioural changes to existing algorithms
------------------------------------------
Optimisations to existing algorithms
------------------------------------
* kraaij_pohlmann: Use `$v = limit` instead of `do (tolimit setmark v)` since
this generates simpler code, and also matches the code other algorithm
implementations use.
For languages like C, an optimising compiler will probably generate
equivalent code anyway, but e.g. for Python this should be an improvement.
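Why the two forms are equivalent can be seen in a small Python model of the cursor state. This is an illustrative sketch under assumptions: the `Env` class and both function names are hypothetical, and `do`, `tolimit` and `setmark` are modelled as "save and restore the cursor", "move the cursor to the limit" and "store the cursor position" respectively.

```python
from dataclasses import dataclass


@dataclass
class Env:
    cursor: int
    limit: int


def mark_via_tolimit(env: Env) -> int:
    """Model of `do (tolimit setmark v)`; returns the value stored in v."""
    saved = env.cursor        # `do` remembers the cursor...
    env.cursor = env.limit    # `tolimit` moves the cursor to the limit
    v = env.cursor            # `setmark v` records the cursor position
    env.cursor = saved        # ...and `do` restores it afterwards
    return v


def mark_direct(env: Env) -> int:
    """Model of `$v = limit`; the cursor is never touched."""
    return env.limit


env = Env(cursor=2, limit=9)
assert mark_via_tolimit(env) == mark_direct(env) == 9
assert env.cursor == 2  # both forms leave the cursor where it was
```

The direct form needs no save, move and restore of the cursor, which is the work a backend without an optimising compiler would otherwise carry out literally.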
Code clarity improvements to existing algorithms
------------------------------------------------
* hindi.sbl: Fix comment typo.
Compiler
--------
* Don't count `$x = x + 1` as initialising or using `x`, so it's now handled
like `$x += 1` already is.
* Comments are now only included in the generated code if the command line
option -comments is specified.
The comments in the generated code are useful if you're trying to debug the
compiler, and perhaps also if you are trying to debug your Snowball code, but
for everyone else they just bloat the code, which becomes more of an issue as
the number of languages we support grows.
* `-parentclassname` is not only for java and csharp so don't disable it if
those backends are disabled.
* `-syntax` now reports the value for each numeric literal.
* Report location for excessive get nesting error.
* Internally the compiler now represents negated literal numbers as a simple
`c_number` rather than `c_neg` applied to a `c_number` with a positive value.
This simplifies optimisations that want to check for a constant numeric
expression.
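The benefit of the last item can be sketched with a toy expression tree in Python. Only the node names `c_number` and `c_neg` come from the text above. The tuple representation and the helper functions are hypothetical and do not mirror the compiler's actual data structures.

```python
def is_constant_old(node) -> bool:
    # Old representation: a constant is either a c_number, or a c_neg
    # wrapping a c_number, so every check has to handle both shapes.
    kind = node[0]
    if kind == "c_number":
        return True
    return kind == "c_neg" and node[1][0] == "c_number"


def normalise(node):
    # New representation: fold a negated literal into one c_number node.
    if node[0] == "c_neg" and node[1][0] == "c_number":
        return ("c_number", -node[1][1])
    return node


def is_constant_new(node) -> bool:
    # After normalisation a single tag comparison is enough.
    return node[0] == "c_number"


minus_three = ("c_neg", ("c_number", 3))
assert is_constant_old(minus_three)
assert normalise(minus_three) == ("c_number", -3)
assert is_constant_new(normalise(minus_three))
```

With the negation folded in, a check such as "is this `hop` argument a negative constant" only has to inspect one node.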
Build system
------------
* Link binaries with LDFLAGS if it's set, which is needed for some platforms
(e.g. OpenEmbedded). Patch from Andreas Müller (#120).
* Add missing dependencies of algorithms.go rule.
Testsuite
---------
* C: Add stemtest for low-level regression tests.
Documentation
-------------
* Document a C99 compiler as a requirement for building the snowball compiler
(but the C code it generates should still work with any ISO C compiler).
A few declarations mixed with code crept in some time ago (which nobody's
complained about), so this is really just formally documenting a requirement
which already existed.
* README: Explain what Snowball is and what Stemming is (#131, reported by Sean
Kelly).
* CONTRIBUTING.rst: Expand section on adding a new generator.
* For the Python snowballstemmer module, include the global NEWS instead of
the Python-specific CHANGES.rst and use README.rst as the long description.
Patch from Dmitry Shachnev (#119).
* COPYING: Update and incorporate Python backend licensing information which
was previously in a separate file.
Files: