./textproc/py-ftfy, Fixes some problems with Unicode text after the fact

Branch: CURRENT, Version: 4.2.0nb2, Package name: py39-ftfy-4.2.0nb2, Maintainer: pkgsrc-users

Given Unicode text, make its representation consistent and possibly less broken.


Required to run:
[devel/py-setuptools] [lang/python27] [devel/py-wcwidth]

Filesize: 34.315 KB

CVS history:


   2022-01-05 16:41:32 by Thomas Klausner | Files touched by this commit (289)
Log message:
python: egg.mk: add USE_PKG_RESOURCES flag

This flag should be set for packages that import pkg_resources
and thus need setuptools after the build step.

Set this flag for packages that need it and bump PKGREVISION.
   2022-01-04 21:55:40 by Thomas Klausner | Files touched by this commit (1595)
Log message:
*: bump PKGREVISION for egg.mk users

They now have a tool dependency on py-setuptools instead of a DEPENDS
   2021-10-26 13:23:42 by Nia Alarie | Files touched by this commit (1161)
Log message:
textproc: Replace RMD160 checksums with BLAKE2s checksums

All checksums have been double-checked against existing RMD160 and
SHA512 hashes

Unfetchable distfiles (fetched conditionally?):
./textproc/convertlit/distinfo clit18src.zip
   2021-10-07 17:02:49 by Nia Alarie | Files touched by this commit (1162)
Log message:
textproc: Remove SHA1 hashes for distfiles
   2017-09-16 21:27:31 by Thomas Klausner | Files touched by this commit (372)
Log message:
Reset maintainer
   2017-07-31 00:32:28 by Thomas Klausner | Files touched by this commit (229)
Log message:
Switch github HOMEPAGEs to https.
   2017-01-12 01:48:25 by Blue Rats | Files touched by this commit (1)
Log message:
DEPENDS on devel/py-wcwidth and textproc/py-html5lib.
   2017-01-12 01:45:43 by Blue Rats | Files touched by this commit (3)
Log message:
Version 4.3.0 (December 29, 2016)

ftfy has gotten by for four years without dependencies on other Python
libraries, but now we can spare ourselves some code and some maintenance
burden by delegating certain tasks to other libraries that already solve
them well. This version now depends on the html5lib and wcwidth libraries.

Feature changes:

    The remove_control_chars fixer will now remove some non-ASCII control
    characters as well, such as deprecated Arabic control characters and
    byte-order marks. Bidirectional controls are still left as is.

    This should have no impact on well-formed text, while cleaning up many
    characters that the Unicode Consortium deems "not suitable for markup"
    (see Unicode Technical Report #20).

    The unescape_html fixer uses a more thorough list of HTML entities, which
    it imports from html5lib.

    ftfy.formatting now uses wcwidth to compute the width that a string will
    occupy in a text console.
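
As a rough illustration of the kind of cleanup remove_control_chars performs, the sketch below strips C0 control characters (other than tab, line feed, form feed and carriage return) and the byte-order mark, while leaving bidirectional controls alone. The function name strip_controls and the exact character set are assumptions made for this example; ftfy's real list is longer and lives in its own source.

```python
import re

# Hypothetical, simplified character set: C0 controls except \t \n \f \r,
# DEL, and U+FEFF (byte-order mark). This is not ftfy's actual list.
_CONTROLS = re.compile(r"[\x00-\x08\x0b\x0e-\x1f\x7f\ufeff]")


def strip_controls(text: str) -> str:
    """Remove the characters matched by _CONTROLS and keep everything else."""
    return _CONTROLS.sub("", text)


# U+200F (RIGHT-TO-LEFT MARK) is a bidirectional control and is kept.
sample = "\ufeffhello\x00 \u200fworld\n"
cleaned = strip_controls(sample)
print(repr(cleaned))
```

Well-formed text passes through unchanged, which matches the intent stated in the changelog entry.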

Heuristic changes:

    Updated the data file of Unicode character categories to Unicode 9, as
    used in Python 3.6.0. (No matter what version of Python you're on, ftfy
    uses the same data.)

Pending deprecations:

    The remove_bom option will become deprecated in 5.0, because it has been
    superseded by remove_control_chars.

    ftfy 5.0 will remove the previously deprecated name fix_text_encoding. It
    was renamed to fix_encoding in 4.0.

    ftfy 5.0 will require Python 3.2 or later, as planned. Python 2 users,
    please specify ftfy < 5 in your dependencies if you haven't already.

Version 4.2.0 (September 28, 2016)

Heuristic changes:

    Math symbols next to currency symbols are no longer considered 'weird' by
    the heuristic. This fixes a false positive where text that involved the
    multiplication sign and British pounds or euros (as in '5×£35') could
    turn into Hebrew letters.

    A heuristic that used to be a bonus for certain punctuation now also
    gives a bonus to successfully decoding other common codepoints, such as
    the non-breaking space, the degree sign, and the byte order mark.

    In version 4.0, we tried to "future-proof" the categorization of emoji
    (as a kind of symbol) to include codepoints that would likely be assigned
    to emoji later. The future happened, and there are even more emoji than
    we expected. We have expanded the range to include those emoji, too.

    ftfy is still mostly based on information from Unicode 8 (as Python 3.5
    is), but this expanded range should include the emoji from Unicode 9 and
    10.

    Emoji are increasingly being modified by variation selectors and
    skin-tone modifiers. Those codepoints are now grouped with 'symbols' in
    ftfy, so they fit right in with emoji, instead of being considered
    'marks' as their Unicode category would suggest.

    This enables fixing mojibake that involves iOS's new diverse emoji.

    An old heuristic that wasn't necessary anymore considered Latin text with
    high-numbered codepoints to be 'weird', but this is normal in languages
    such as Vietnamese and Azerbaijani. This does not seem to have caused any
    false positives, but it caused ftfy to be too reluctant to fix some cases
    of broken text in those languages.

    The heuristic has been changed, and all languages that use Latin letters
    should be on even footing now.
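
The '5×£35' false positive from the first item in this list can be reproduced at the byte level with the standard library alone. In the sketch below the string is encoded as Latin-1 and the bytes are read back as UTF-8: the byte pair for × and £ happens to be a valid UTF-8 sequence for a Hebrew letter. This only shows the coincidence that made the text look repairable; it is not ftfy's heuristic.

```python
import unicodedata

text = "5\u00d7\u00a335"          # '5×£35'
raw = text.encode("latin-1")      # the bytes 35 D7 A3 33 35
misread = raw.decode("utf-8")     # D7 A3 decodes as a single character

hebrew_char = misread[1]
name = unicodedata.name(hebrew_char)
print(repr(misread), name)
```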

Version 4.1.1 (April 13, 2016)

    Bug fix: in the command-line interface, the -e option had no effect on
    Python 3 when using standard input. Now, it correctly lets you specify a
    different encoding for standard input.
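
One common way to honor an encoding option for standard input on Python 3 is to wrap the underlying byte stream instead of relying on the already-decoded sys.stdin. The sketch below shows that approach with an in-memory stream standing in for sys.stdin.buffer. The helper name read_text and its parameters are hypothetical, and nothing here is taken from ftfy's command-line code.

```python
import io


def read_text(byte_stream, encoding="utf-8"):
    """Decode a binary stream using the encoding the caller asked for."""
    return io.TextIOWrapper(byte_stream, encoding=encoding).read()


# Stand-in for sys.stdin.buffer: Latin-1 bytes that are not valid UTF-8.
fake_stdin = io.BytesIO("caf\u00e9".encode("latin-1"))
decoded = read_text(fake_stdin, encoding="latin-1")
print(decoded)
```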

Version 4.1.0 (February 25, 2016)

Heuristic changes:

    ftfy can now deal with "lossy" mojibake. If your text has been run
    through a strict Windows-1252 decoder, such as the one in Python, it may
    contain the replacement character � (U+FFFD) where there were bytes that
    are unassigned in Windows-1252.

    Although ftfy won't recover the lost information, it can now detect this
    situation, replace the entire lossy character with �, and decode the rest
    of the characters. Previous versions would be unable to fix any string
    that contained U+FFFD.

    As an example, text in curly quotes that gets corrupted “ like this â€�
    now gets fixed to be “ like this �.

    Updated the data file of Unicode character categories to Unicode 8.0, as
    used in Python 3.5.0. (No matter what version of Python you're on, ftfy
    uses the same data.)

    Heuristics now count characters such as ~ and ^ as punctuation instead of
    wacky math symbols, improving the detection of mojibake in some edge
    cases.
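
The "lossy" mojibake described at the top of this list can be produced with nothing but Python's built-in codecs. The sketch below encodes curly-quoted text as UTF-8 and decodes it as Windows-1252: the last byte of the closing quote (0x9D) is unassigned in that code page, so a strict decode fails and a replacing decode leaves U+FFFD behind. It shows how the damage arises and why a plain round trip cannot undo it; it does not reproduce ftfy's repair logic.

```python
curly = "\u201clike this\u201d"     # curly-quoted text
raw = curly.encode("utf-8")

# A strict Windows-1252 decode rejects the unassigned byte 0x9D.
try:
    raw.decode("windows-1252")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A replacing decode keeps going and substitutes U+FFFD for that byte.
lossy = raw.decode("windows-1252", errors="replace")

# Non-lossy mojibake reverses cleanly; the opening quote is such a case.
opening = "\u201c".encode("utf-8").decode("windows-1252")
restored = opening.encode("windows-1252").decode("utf-8")

# The lossy text cannot be re-encoded, because U+FFFD is not in the code page.
try:
    lossy.encode("windows-1252")
    reversible = True
except UnicodeEncodeError:
    reversible = False

print(repr(lossy), repr(restored), strict_ok, reversible)
```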

New features:

    A new module, ftfy.formatting, can be used to justify Unicode text in a
    monospaced terminal. It takes into account that each character can take
    up anywhere from 0 to 2 character cells.

    Internally, the utf-8-variants codec was simplified and optimized.
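
To make the "0 to 2 character cells" idea concrete, here is a minimal sketch that approximates display width using only the unicodedata module. The names approx_width and ljust_cells are hypothetical, and the rules are a deliberate simplification; this is neither ftfy.formatting's code nor the wcwidth algorithm.

```python
import unicodedata


def approx_width(text: str) -> int:
    """Approximate terminal cell count, using 0, 1, or 2 cells per character."""
    total = 0
    for ch in text:
        if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue          # zero cells: combining marks and format characters
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            total += 2        # two cells: wide and fullwidth characters
        else:
            total += 1        # one cell: everything else
    return total


def ljust_cells(text: str, width: int) -> str:
    """Left-justify by padding with spaces up to `width` terminal cells."""
    return text + " " * max(0, width - approx_width(text))


for word in ("abc", "\u65e5\u672c\u8a9e", "e\u0301"):
    print(ljust_cells(word, 8) + "|", approx_width(word))
```

Padding by cell count rather than by len() is what keeps the right-hand column aligned when CJK text or combining marks are present.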

Version 4.0.0 (April 10, 2015)

Breaking changes:

    The default normalization form is now NFC, not NFKC. NFKC replaces a
    large number of characters with 'equivalent' characters, and some of
    these replacements are useful, but some are not desirable to do by
    default.

    The fix_text function has some new options that perform more targeted
    operations that are part of NFKC normalization, such as
    fix_character_width, without requiring hitting all your text with the
    huge mallet that is NFKC.

        If you were already using NFC normalization, or in general if you
        want to preserve the spacing of CJK text, you should be sure to set
        fix_character_width=False.

    The remove_unsafe_private_use parameter has been removed entirely, after
    two versions of deprecation. The function name fix_bad_encoding is also
    gone.
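
The difference between NFC and NFKC mentioned in the first item above can be seen directly with the standard library's unicodedata.normalize. The sketch below is independent of ftfy, and the sample strings are chosen for illustration: NFC only composes canonically equivalent sequences, while NFKC also rewrites compatibility characters such as ligatures, fullwidth letters and superscripts.

```python
import unicodedata

samples = {
    "decomposed e-acute": "e\u0301",
    "fi ligature": "\ufb01",
    "fullwidth ABC": "\uff21\uff22\uff23",
    "superscript two": "x\u00b2",
}

for label, s in samples.items():
    nfc = unicodedata.normalize("NFC", s)
    nfkc = unicodedata.normalize("NFKC", s)
    print(f"{label}: NFC={nfc!r} NFKC={nfkc!r}")
```

Only the first sample changes under NFC. The other three change only under NFKC, which is the kind of rewriting the changelog calls not desirable by default.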

New features:

    Fixers for strange new forms of mojibake, including particularly clear
    cases of mixed UTF-8 and Windows-1252.

    New heuristics, so that ftfy can fix more stuff, while maintaining
    approximately zero false positives.

    The command-line tool trusts you to know what encoding your input is in,
    and assumes UTF-8 by default. You can still tell it to guess with the -g
    option.

    The command-line tool can be configured with options, and can be used as
    a pipe.

    Recognizes characters that are new in Unicode 7.0, as well as emoji from
    Unicode 8.0+ that may already be in use on iOS.

Deprecations:

    fix_text_encoding is being renamed again, for conciseness and
    consistency. It's now simply called fix_encoding. The name
    fix_text_encoding is available but emits a warning.
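
A renamed function that stays available but warns is usually implemented as a thin wrapper. The sketch below shows that general pattern with Python's warnings module. The function bodies are stand-ins written for this example, and the warning text and category are assumptions; this is not ftfy's actual code.

```python
import warnings


def fix_encoding(text):
    """Stand-in for the renamed function; the real one repairs mojibake."""
    return text


def fix_text_encoding(text):
    """Deprecated alias: warn, then delegate to the new name."""
    warnings.warn(
        "fix_text_encoding is now known as fix_encoding",
        DeprecationWarning,
        stacklevel=2,   # attribute the warning to the caller's line
    )
    return fix_encoding(text)


# Record warnings so the effect of calling the old name is visible.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = fix_text_encoding("example")

print(result, [str(w.message) for w in caught])
```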

Pending deprecations:

    Python 2.6 support is largely coincidental.

    Python 2.7 support is on notice. If you use Python 2, be sure to pin a
    version of ftfy less than 5.0 in your requirements.