pkgsrc.se | The NetBSD package collection

Subject: CVS commit: pkgsrc/textproc/py-ftfy
From: Blue Rats
Date: 2017-01-12 01:45:43
Message id: 20170112004543.BFFC4FBA6@cvs.NetBSD.org
Log Message:
Version 4.3.0 (December 29, 2016)

ftfy has gotten by for four years without dependencies on other Python \ 
libraries, but now we can spare ourselves some code and some maintenance burden \ 
by delegating certain tasks to other libraries that already solve them well. \ 
This version now depends on the html5lib and wcwidth libraries.

Feature changes:

    The remove_control_chars fixer will now remove some non-ASCII control \ 
characters as well, such as deprecated Arabic control characters and byte-order \ 
marks. Bidirectional controls are still left as is.

    This should have no impact on well-formed text, while cleaning up many \ 
characters that the Unicode Consortium deems "not suitable for markup" \ 
(see Unicode Technical Report #20).

    The unescape_html fixer uses a more thorough list of HTML entities, which it \ 
imports from html5lib.

    ftfy.formatting now uses wcwidth to compute the width that a string will \ 
occupy in a text console.
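
As a rough illustration of the remove_control_chars change at the top of this list, the sketch below strips a byte-order mark and the deprecated format characters U+206A to U+206F while leaving bidirectional controls such as U+200F alone. It uses only the standard library, not ftfy itself. The name strip_control_chars and the exact character ranges are assumptions made for this example; the set that ftfy really removes may differ.

```python
import re

# Illustrative ranges only (an assumption, not ftfy's actual table):
#   C0 controls other than tab, newline, form feed and carriage return; DEL;
#   the deprecated format characters U+206A..U+206F;
#   the byte-order mark U+FEFF.
# Bidirectional controls such as U+200E/U+200F are deliberately not listed.
_REMOVABLE = re.compile(r"[\x00-\x08\x0b\x0e-\x1f\x7f\u206a-\u206f\ufeff]")


def strip_control_chars(text):
    """Hypothetical stand-in for a control-character fixer."""
    return _REMOVABLE.sub("", text)


sample = "\ufeffhello\u206f \u200fworld\x00"
cleaned = strip_control_chars(sample)
print(repr(cleaned))
```

Well-formed text passes through unchanged, which matches the stated goal of having no impact on it.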

Heuristic changes:

    Updated the data file of Unicode character categories to Unicode 9, as used \ 
in Python 3.6.0. (No matter what version of Python you're on, ftfy uses the same \ 
data.)

Pending deprecations:

    The remove_bom option will become deprecated in 5.0, because it has been \ 
superseded by remove_control_chars.

    ftfy 5.0 will remove the previously deprecated name fix_text_encoding. It \ 
was renamed to fix_encoding in 4.0.

    ftfy 5.0 will require Python 3.2 or later, as planned. Python 2 users, \ 
please specify ftfy < 5 in your dependencies if you haven't already.

Version 4.2.0 (September 28, 2016)

Heuristic changes:

    Math symbols next to currency symbols are no longer considered 'weird' by \ 
the heuristic. This fixes a false positive where text that involved the \ 
multiplication sign and British pounds or euros (as in '5×£35') could turn \ 
into Hebrew letters.

    A heuristic that used to be a bonus for certain punctuation now also gives a \ 
bonus to successfully decoding other common codepoints, such as the non-breaking \ 
space, the degree sign, and the byte order mark.

    In version 4.0, we tried to "future-proof" the categorization of \ 
emoji (as a kind of symbol) to include codepoints that would likely be assigned \ 
to emoji later. The future happened, and there are even more emoji than we \ 
expected. We have expanded the range to include those emoji, too.

    ftfy is still mostly based on information from Unicode 8 (as Python 3.5 is), \ 
but this expanded range should include the emoji from Unicode 9 and 10.

    Emoji are increasingly being modified by variation selectors and skin-tone \ 
modifiers. Those codepoints are now grouped with 'symbols' in ftfy, so they fit \ 
right in with emoji, instead of being considered 'marks' as their Unicode \ 
category would suggest.

    This enables fixing mojibake that involves iOS's new diverse emoji.

    An old heuristic that wasn't necessary anymore considered Latin text with \ 
high-numbered codepoints to be 'weird', but this is normal in languages such as \ 
Vietnamese and Azerbaijani. This does not seem to have caused any false \ 
positives, but it caused ftfy to be too reluctant to fix some cases of broken \ 
text in those languages.

    The heuristic has been changed, and all languages that use Latin letters \ 
should be on even footing now.
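
The currency false positive described in the first item of this list can be reproduced with the standard library alone. The sketch below is not ftfy code. It only shows why a heuristic could be tempted: the two characters of a multiplication sign followed by a pound sign, read as Latin-1 bytes and decoded as UTF-8, happen to form a single Hebrew letter.

```python
import unicodedata

# Perfectly legitimate text: "5", MULTIPLICATION SIGN, POUND SIGN, "35".
legit = "5\u00d7\u00a335"

# Pretend the text is mojibake: take its Latin-1 bytes and decode them as UTF-8.
# The bytes D7 A3 form a valid two-byte UTF-8 sequence.
misread = legit.encode("latin-1").decode("utf-8")

print(misread)
print(unicodedata.name(misread[1]))
```

Because the "fixed" string is valid and shorter, only the heuristic's scoring of which characters look weird can tell the two readings apart.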

Version 4.1.1 (April 13, 2016)

    Bug fix: in the command-line interface, the -e option had no effect on \ 
Python 3 when using standard input. Now, it correctly lets you specify a \ 
different encoding for standard input.

Version 4.1.0 (February 25, 2016)

Heuristic changes:

    ftfy can now deal with "lossy" mojibake. If your text has been run \ 
through a strict Windows-1252 decoder, such as the one in Python, it may contain \ 
the replacement character ï¿½ (U+FFFD) where there were bytes that are \ 
unassigned in Windows-1252.

    Although ftfy won't recover the lost information, it can now detect this \ 
situation, replace the entire lossy character with ï¿½, and decode the rest \ 
of the characters. Previous versions would be unable to fix any string that \ 
contained U+FFFD.

    As an example, text in curly quotes that gets corrupted Ã¢â¬Å like \ 
this Ã¢â¬ï¿½ now gets fixed to be â like this ï¿½.

    Updated the data file of Unicode character categories to Unicode 8.0, as \ 
used in Python 3.5.0. (No matter what version of Python you're on, ftfy uses the \ 
same data.)

    Heuristics now count characters such as ~ and ^ as punctuation instead of \ 
wacky math symbols, improving the detection of mojibake in some edge cases.
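
The "lossy" mojibake case from the first items of this list can be sketched with the standard library. The code below is a simplified illustration, not ftfy's implementation: the function names and the use of the SUB byte (0x1A) as a placeholder for a lost byte are choices made for this example. It handles only the straightforward case where every character of the input is either representable in Windows-1252 or is U+FFFD inside a multi-byte sequence.

```python
REPLACEMENT = "\ufffd"
SUB = 0x1A  # placeholder byte standing in for a byte the decoder threw away


def simulate_lossy_mojibake(text):
    """Decode UTF-8 bytes with a strict Windows-1252 table, replacing
    unassigned bytes with U+FFFD (this is where information is lost)."""
    return text.encode("utf-8").decode("cp1252", errors="replace")


def repair_lossy_mojibake(text):
    """Undo the damage as far as possible; lost characters become U+FFFD."""
    # Step 1: turn the text back into bytes, marking each lost byte with SUB.
    raw = bytearray()
    for ch in text:
        if ch == REPLACEMENT:
            raw.append(SUB)
        else:
            raw += ch.encode("cp1252")  # raises if the text is not cp1252 mojibake

    # Step 2: walk the UTF-8 sequences; any multi-byte sequence that contains
    # a SUB marker is replaced as a whole by the UTF-8 encoding of U+FFFD.
    out = bytearray()
    i = 0
    while i < len(raw):
        lead = raw[i]
        if lead >= 0xF0:
            length = 4
        elif lead >= 0xE0:
            length = 3
        elif lead >= 0xC0:
            length = 2
        else:
            length = 1
        chunk = raw[i:i + length]
        if length > 1 and SUB in chunk:
            out += REPLACEMENT.encode("utf-8")
        else:
            out += chunk
        i += length
    return out.decode("utf-8")


broken = simulate_lossy_mojibake("\u201c like this \u201d")
repaired = repair_lossy_mojibake(broken)
print(broken)
print(repaired)
```

The opening quote survives the round trip because all three of its bytes are assigned in Windows-1252. The closing quote ends in byte 0x9D, which is unassigned, so only a replacement character can be recovered in its place.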

New features:

    A new module, ftfy.formatting, can be used to justify Unicode text in a \ 
monospaced terminal. It takes into account that each character can take up \ 
anywhere from 0 to 2 character cells.

    Internally, the utf-8-variants codec was simplified and optimized.
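
To illustrate the idea behind ftfy.formatting described in the first item of this list, the sketch below pads text by display width rather than by character count. It approximates cell widths with the standard unicodedata module. The function names are hypothetical, and the width rules are a simplification that will not match a full width table in every case.

```python
import unicodedata


def cell_width(ch):
    """Approximate number of terminal cells (0, 1 or 2) taken by one character."""
    # Combining marks and format characters are treated as zero-width.
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # East Asian Wide and Fullwidth characters take two cells.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def display_width(text):
    """Total number of cells the string is expected to occupy."""
    return sum(cell_width(ch) for ch in text)


def display_ljust(text, width, fill=" "):
    """Left-justify by display width instead of by len()."""
    return text + fill * max(0, width - display_width(text))


for word in ("abc", "\u65e5\u672c\u8a9e", "e\u0301"):
    print(display_ljust(word, 8) + "|", display_width(word))
```

With plain str.ljust the second row would be padded as if it were three cells wide, so the column separators would not line up.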

Version 4.0.0 (April 10, 2015)

Breaking changes:

    The default normalization form is now NFC, not NFKC. NFKC replaces a large \ 
number of characters with 'equivalent' characters, and some of these \ 
replacements are useful, but some are not desirable to do by default.

    The fix_text function has some new options that perform more targeted \ 
operations that are part of NFKC normalization, such as fix_character_width, \ 
without hitting all your text with the huge mallet that is NFKC.

    If you were already using NFC normalization, or in general if you want \ 
to preserve the spacing of CJK text, you should be sure to set \ 
fix_character_width=False.

    The remove_unsafe_private_use parameter has been removed entirely, after two \ 
versions of deprecation. The function name fix_bad_encoding is also gone.
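
The difference between NFC and NFKC behind the first breaking change in this list can be seen with the standard unicodedata module. This is only a demonstration of the two normalization forms, not of ftfy's own options. The sample strings are chosen for this example.

```python
import unicodedata

samples = {
    "combining accent": "e\u0301",               # e + COMBINING ACUTE ACCENT
    "fullwidth letters": "\uff21\uff22\uff23",   # fullwidth A, B, C
    "ligature": "\ufb01",                        # LATIN SMALL LIGATURE FI
    "trademark sign": "\u2122",                  # TRADE MARK SIGN
}

for label, text in samples.items():
    nfc = unicodedata.normalize("NFC", text)
    nfkc = unicodedata.normalize("NFKC", text)
    print(f"{label}: NFC={nfc!r} NFKC={nfkc!r}")
```

Both forms compose the accent into a single character, but only NFKC rewrites the fullwidth letters, the ligature, and the trademark sign. That is the kind of replacement that may or may not be wanted by default.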

New features:

    Fixers for strange new forms of mojibake, including particularly clear cases \ 
of mixed UTF-8 and Windows-1252.

    New heuristics, so that ftfy can fix more stuff, while maintaining \ 
approximately zero false positives.

    The command-line tool trusts you to know what encoding your input is in, and \ 
assumes UTF-8 by default. You can still tell it to guess with the -g option.

    The command-line tool can be configured with options, and can be used as a pipe.

    Recognizes characters that are new in Unicode 7.0, as well as emoji from \ 
Unicode 8.0+ that may already be in use on iOS.

Deprecations:

    fix_text_encoding is being renamed again, for conciseness and consistency. \ 
It's now simply called fix_encoding. The name fix_text_encoding is available but \ 
emits a warning.
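
A renamed function that stays available but emits a warning is commonly written as a thin alias. The sketch below shows that general pattern with the standard warnings module. The two function names mirror the ones in this log, but the bodies are placeholders and the warning text is invented for this example; ftfy's actual code is not shown here.

```python
import warnings


def fix_encoding(text):
    """Placeholder body: the real function repairs mojibake."""
    return text


def fix_text_encoding(text):
    """Deprecated alias kept so that existing callers keep working."""
    warnings.warn(
        "fix_text_encoding is now known as fix_encoding",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not to this alias
    )
    return fix_encoding(text)


# Capture the warning to show that the old name still works but complains.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = fix_text_encoding("example")

print(result, [w.category.__name__ for w in caught])
```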

Pending deprecations:

    Python 2.6 support is largely coincidental.

    Python 2.7 support is on notice. If you use Python 2, be sure to pin a \ 
version of ftfy less than 5.0 in your requirements.

Files:
Revision	Action	File
1.5	modify	pkgsrc/textproc/py-ftfy/Makefile
1.2	modify	pkgsrc/textproc/py-ftfy/PLIST
1.3	modify	pkgsrc/textproc/py-ftfy/distinfo