Subject: CVS commit: pkgsrc/textproc/py-ftfy
From: Blue Rats
Date: 2017-01-12 01:45:43
Message id: 20170112004543.BFFC4FBA6@cvs.NetBSD.org
Log Message:
Version 4.3.0 (December 29, 2016)
ftfy has gotten by for four years without dependencies on other Python \
libraries, but now we can spare ourselves some code and some maintenance burden \
by delegating certain tasks to other libraries that already solve them well. \
This version now depends on the html5lib and wcwidth libraries.
Feature changes:
The remove_control_chars fixer will now remove some non-ASCII control \
characters as well, such as deprecated Arabic control characters and byte-order \
marks. Bidirectional controls are still left as is.
This should have no impact on well-formed text, while cleaning up many \
characters that the Unicode Consortium deems "not suitable for markup" \
(see Unicode Technical Report #20).
The unescape_html fixer uses a more thorough list of HTML entities, which it \
imports from html5lib.
ftfy.formatting now uses wcwidth to compute the width that a string will \
occupy in a text console.
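As a rough illustration of the remove_control_chars behaviour described above, here is a minimal Python sketch using only the standard library. It is not ftfy's implementation. The function name sketch_remove_control_chars and the exact character ranges are assumptions chosen to match the description: ASCII control characters other than common whitespace, the deprecated controls U+206A to U+206F, and the byte-order mark U+FEFF are removed, while bidirectional controls are left alone.

```python
import re

# Assumed character set, for illustration only; ftfy's real list may differ.
_CONTROL_CHARS = re.compile(
    r"["
    r"\x00-\x08\x0b\x0e-\x1f\x7f"  # ASCII controls, keeping \t \n \f \r
    r"\u206a-\u206f"               # deprecated Arabic-shaping/format controls
    r"\ufeff"                      # byte-order mark
    r"]"
)


def sketch_remove_control_chars(text):
    """Drop the control characters listed above; bidi controls are kept."""
    return _CONTROL_CHARS.sub("", text)


# U+200F (RIGHT-TO-LEFT MARK) is a bidirectional control, so it survives.
sample = "\ufeffab\u206ac\u200fd\x00\n"
cleaned = sketch_remove_control_chars(sample)
```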
Heuristic changes:
Updated the data file of Unicode character categories to Unicode 9, as used \
in Python 3.6.0. (No matter what version of Python you're on, ftfy uses the same \
data.)
Pending deprecations:
The remove_bom option will become deprecated in 5.0, because it has been \
superseded by remove_control_chars.
ftfy 5.0 will remove the previously deprecated name fix_text_encoding. It \
was renamed to fix_encoding in 4.0.
ftfy 5.0 will require Python 3.2 or later, as planned. Python 2 users, \
please specify ftfy < 5 in your dependencies if you haven't already.
Version 4.2.0 (September 28, 2016)
Heuristic changes:
Math symbols next to currency symbols are no longer considered 'weird' by \
the heuristic. This fixes a false positive where text that involved the \
multiplication sign and British pounds or euros (as in '5×£35') could turn \
into Hebrew letters.
A heuristic that used to be a bonus for certain punctuation now also gives a \
bonus to successfully decoding other common codepoints, such as the non-breaking \
space, the degree sign, and the byte order mark.
In version 4.0, we tried to "future-proof" the categorization of \
emoji (as a kind of symbol) to include codepoints that would likely be assigned \
to emoji later. The future happened, and there are even more emoji than we \
expected. We have expanded the range to include those emoji, too.
ftfy is still mostly based on information from Unicode 8 (as Python 3.5 is), \
but this expanded range should include the emoji from Unicode 9 and 10.
Emoji are increasingly being modified by variation selectors and skin-tone \
modifiers. Those codepoints are now grouped with 'symbols' in ftfy, so they fit \
right in with emoji, instead of being considered 'marks' as their Unicode \
category would suggest.
This enables fixing mojibake that involves iOS's new diverse emoji.
An old heuristic that wasn't necessary anymore considered Latin text with \
high-numbered codepoints to be 'weird', but this is normal in languages such as \
Vietnamese and Azerbaijani. This does not seem to have caused any false \
positives, but it caused ftfy to be too reluctant to fix some cases of broken \
text in those languages.
The heuristic has been changed, and all languages that use Latin letters \
should be on even footing now.
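The multiplication-sign false positive mentioned above can be reproduced with the standard library alone. The sketch below does not use ftfy. It only shows why a multiplication sign followed by a pound sign is ambiguous: the two characters encode to the Windows-1252 bytes D7 A3, which also happen to be the valid UTF-8 encoding of a single Hebrew letter. The variable names are illustrative.

```python
# Perfectly good text: multiplication sign (U+00D7) then pound sign (U+00A3).
good_text = "5\u00d7\u00a335"

# Encode as Windows-1252, then (wrongly) reinterpret the bytes as UTF-8.
# This is the "repair" that an over-eager mojibake heuristic would attempt.
raw = good_text.encode("windows-1252")
misread = raw.decode("utf-8")

# The bytes D7 A3 decode to U+05E3, HEBREW LETTER FINAL PE.
hebrew_letter = misread[1]
```

Because the reinterpretation succeeds without error, only a heuristic about which result looks more plausible can decide that the original text should be left alone.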
Version 4.1.1 (April 13, 2016)
Bug fix: in the command-line interface, the -e option had no effect on \
Python 3 when using standard input. Now, it correctly lets you specify a \
different encoding for standard input.
Version 4.1.0 (February 25, 2016)
Heuristic changes:
ftfy can now deal with "lossy" mojibake. If your text has been run \
through a strict Windows-1252 decoder, such as the one in Python, it may contain \
the replacement character � (U+FFFD) where there were bytes that are \
unassigned in Windows-1252.
Although ftfy won't recover the lost information, it can now detect this \
situation, replace the entire lossy character with �, and decode the rest \
of the characters. Previous versions would be unable to fix any string that \
contained U+FFFD.
As an example, text in curly quotes that gets corrupted â€œ like \
this â€� now gets fixed to be “ like this �.
Updated the data file of Unicode character categories to Unicode 8.0, as \
used in Python 3.5.0. (No matter what version of Python you're on, ftfy uses the \
same data.)
Heuristics now count characters such as ~ and ^ as punctuation instead of \
wacky math symbols, improving the detection of mojibake in some edge cases.
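To make the "lossy" mojibake case above concrete, the following Python sketch first reproduces the damage with the standard library and then applies a deliberately simple repair. It is not ftfy's algorithm. The helper names are hypothetical, and the repair assumes every other character in the input is encodable in Windows-1252 (otherwise it raises UnicodeEncodeError).

```python
import re

REPLACEMENT = "\ufffd"


def make_lossy_mojibake(text):
    """Misread UTF-8 bytes as Windows-1252; unassigned bytes become U+FFFD."""
    return text.encode("utf-8").decode("windows-1252", errors="replace")


def sketch_fix_lossy(text):
    """Illustrative repair: undo the misreading, keeping one U+FFFD per loss."""
    raw = bytearray()
    for ch in text:
        if ch == REPLACEMENT:
            raw.append(0xFF)  # stand-in byte; 0xFF never occurs in valid UTF-8
        else:
            raw.extend(ch.encode("windows-1252"))
    decoded = raw.decode("utf-8", errors="replace")
    # A damaged sequence may yield several U+FFFD in a row; collapse them.
    return re.sub(REPLACEMENT + "+", REPLACEMENT, decoded)


# The closing curly quote ends in byte 9D, which Python's Windows-1252
# decoder treats as unassigned, so it is lost in the first step.
broken = make_lossy_mojibake("\u201c like this \u201d")
fixed = sketch_fix_lossy(broken)
```

One limitation of this sketch is that collapsing runs of U+FFFD also merges two adjacent lost characters into one.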
New features:
A new module, ftfy.formatting, can be used to justify Unicode text in a \
monospaced terminal. It takes into account that each character can take up \
anywhere from 0 to 2 character cells.
Internally, the utf-8-variants codec was simplified and optimized.
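The idea that each character occupies between 0 and 2 terminal cells can be approximated with Python's unicodedata module. The sketch below is an assumption-laden illustration, not the code in ftfy.formatting: the function names are hypothetical, and the width rule (0 for combining marks and format characters, 2 for East Asian wide or fullwidth characters, 1 otherwise) is a simplification.

```python
import unicodedata


def sketch_char_width(ch):
    """Approximate number of terminal cells for one character (0, 1, or 2)."""
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0  # combining marks and format characters take no cell
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth characters take two cells
    return 1


def sketch_display_width(text):
    """Sum the approximate widths of all characters in the string."""
    return sum(sketch_char_width(ch) for ch in text)


def sketch_ljust(text, width, fill=" "):
    """Left-justify by display width rather than by len()."""
    return text + fill * max(0, width - sketch_display_width(text))


padded = sketch_ljust("\u65e5\u672c", 6)  # two wide CJK characters
```

Padding by display width instead of len() is what keeps columns aligned when wide and narrow characters are mixed.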
Version 4.0.0 (April 10, 2015)
Breaking changes:
The default normalization form is now NFC, not NFKC. NFKC replaces a large \
number of characters with 'equivalent' characters, and some of these \
replacements are useful, but some are not desirable to do by default.
The fix_text function has some new options that perform more targeted \
operations that are part of NFKC normalization, such as fix_character_width, \
without requiring hitting all your text with the huge mallet that is NFKC.
If you were already using NFC normalization, or in general if you want \
to preserve the spacing of CJK text, you should be sure to set \
fix_character_width=False.
The remove_unsafe_private_use parameter has been removed entirely, after two \
versions of deprecation. The function name fix_bad_encoding is also gone.
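The difference between NFC and NFKC described above can be seen directly with the standard library's unicodedata.normalize. This sketch does not call ftfy. It only shows which kinds of characters NFKC rewrites and NFC leaves alone. The sample string is an arbitrary choice.

```python
import unicodedata

# Fullwidth "ABC", the "fi" ligature, and the trade mark sign.
sample = "\uff21\uff22\uff23 \ufb01 \u2122"

nfc = unicodedata.normalize("NFC", sample)
nfkc = unicodedata.normalize("NFKC", sample)

# NFC still composes canonically equivalent sequences, such as
# "e" followed by a combining acute accent.
composed = unicodedata.normalize("NFC", "e\u0301")
```

Here nfc is identical to the input, while nfkc flattens the fullwidth letters, the ligature, and the trade mark sign into plain ASCII letters. The first of those is a character-width change, which is why it matters for the spacing of CJK text.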
New features:
Fixers for strange new forms of mojibake, including particularly clear cases \
of mixed UTF-8 and Windows-1252.
New heuristics, so that ftfy can fix more stuff, while maintaining \
approximately zero false positives.
The command-line tool trusts you to know what encoding your input is in, and \
assumes UTF-8 by default. You can still tell it to guess with the -g option.
The command-line tool can be configured with options, and can be used as a pipe.
Recognizes characters that are new in Unicode 7.0, as well as emoji from \
Unicode 8.0+ that may already be in use on iOS.
Deprecations:
fix_text_encoding is being renamed again, for conciseness and consistency. \
It's now simply called fix_encoding. The name fix_text_encoding is available but \
emits a warning.
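A renamed function that stays available but warns, as described above, is commonly written as a thin wrapper. The following Python sketch shows that general pattern with the standard warnings module. Both function names are hypothetical stand-ins, the warning text is invented for illustration, and nothing here is taken from ftfy's source.

```python
import warnings


def fix_encoding_sketch(text):
    """Stand-in for the new name; a real fixer would repair the text."""
    return text


def fix_text_encoding_sketch(text):
    """Old name kept for compatibility: warn, then delegate to the new name."""
    warnings.warn(
        "fix_text_encoding_sketch is deprecated; use fix_encoding_sketch",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return fix_encoding_sketch(text)


# Capture the warning so its effect can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = fix_text_encoding_sketch("abc")
```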
Pending deprecations:
Python 2.6 support is largely coincidental.
Python 2.7 support is on notice. If you use Python 2, be sure to pin a \
version of ftfy less than 5.0 in your requirements.
Files: