Log message:
py-ftfy: updated to 6.1.3
Version 6.1.3 (November 21, 2023)
- Updated wcwidth.
- Switched to the Apache 2.0 license.
- Dropped support for Python 3.7.
Version 6.1.2 (February 17, 2022)
- Added type information for `guess_bytes`.
Version 6.1.1 (February 9, 2022)
- Updated the heuristic to fix the letter ß in UTF-8/MacRoman mojibake,
which had regressed since version 5.6.
- Packaging fixes to pyproject.toml.
Version 6.1 (February 9, 2022)
- Updated the heuristic to fix the letter Ñ with more confidence.
- Fixed type annotations and added py.typed.
- ftfy is packaged using Poetry now, and wheels are created and uploaded to
PyPI.
Version 6.0.3 (May 14, 2021)
- Allow the keyword argument `fix_entities` as a deprecated alias for
`unescape_html`, raising a warning.
- `ftfy.formatting` functions now disregard ANSI terminal escapes when
calculating text width.
Version 6.0.2 (May 4, 2021)
This version is purely a cosmetic change, updating the maintainer's e-mail
address and the project's canonical location on GitHub.
Version 6.0.1 (April 12, 2021)
- The `remove_terminal_escapes` step was accidentally not being used. This
version restores it.
- Specified in setup.py that ftfy 6 requires Python 3.6 or later.
- Use a lighter link color when the docs are viewed in dark mode.
Version 6.0 (April 2, 2021)
- New function: `ftfy.fix_and_explain()` can describe all the transformations
that happen when fixing a string. This is similar to what
`ftfy.fixes.fix_encoding_and_explain()` did in previous versions, but it
can fix more than the encoding.
- `fix_and_explain()` and `fix_encoding_and_explain()` are now in the top-level
ftfy module.
- Changed the heuristic entirely. ftfy no longer needs to categorize every
Unicode character, but only characters that are expected to appear in
mojibake.
- Because of the new heuristic, ftfy will no longer have to release a new
version for every new version of Unicode. It should also run faster and
use less RAM when imported.
- The heuristic `ftfy.badness.is_bad(text)` can be used to determine whether
there appears to be mojibake in a string. Some users were already using
the old function `sequence_weirdness()` for that, but this one is actually
designed for that purpose.
- Instead of a pile of named keyword arguments, ftfy functions now take in
a TextFixerConfig object. The keyword arguments still work, and become
settings that override the defaults in TextFixerConfig.
- Added support for UTF-8 mixups with Windows-1253 and Windows-1254.
- Overhauled the documentation: https://ftfy.readthedocs.org
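
The 6.0 entries above can be tried out with a short sketch. It assumes ftfy 6.x is installed; the sample string and the `demo` helper are illustrative only, and the import is guarded so the snippet still loads where ftfy is missing.

```python
# Sketch only: assumes ftfy 6.x is installed (for example via pip).
try:
    import ftfy
    from ftfy import TextFixerConfig
    from ftfy.badness import is_bad
except ImportError:  # lets the sketch load on systems without ftfy
    ftfy = None

# Build a mojibake sample: UTF-8 bytes of a check mark misread as Windows-1252.
MOJIBAKE = "\u2714 No problems".encode("utf-8").decode("windows-1252")


def demo(text: str = MOJIBAKE) -> str:
    """Detect, fix, and explain mojibake with the ftfy 6 top-level API."""
    if ftfy is None:
        raise RuntimeError("ftfy is not installed")
    print("looks like mojibake:", is_bad(text))
    fixed, explanation = ftfy.fix_and_explain(text)
    print("fixed:", fixed)
    print("steps:", explanation)
    # A config object and a keyword override are two routes to the same setting.
    config = TextFixerConfig(uncurl_quotes=False)
    same = ftfy.fix_text(text, config=config) == ftfy.fix_text(text, uncurl_quotes=False)
    print("config and keyword agree:", same)
    return fixed
```

Calling `demo()` prints whether the heuristic flags the input, the repaired text, and the list of steps that were applied.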
Log message:
Version 4.3.0 (December 29, 2016)
ftfy has gotten by for four years without dependencies on other Python
libraries, but now we can spare ourselves some code and some maintenance burden
by delegating certain tasks to other libraries that already solve them well.
This version now depends on the html5lib and wcwidth libraries.
Feature changes:
The remove_control_chars fixer will now remove some non-ASCII control
characters as well, such as deprecated Arabic control characters and byte-order
marks. Bidirectional controls are still left as is.
This should have no impact on well-formed text, while cleaning up many
characters that the Unicode Consortium deems "not suitable for markup"
(see Unicode Technical Report #20).
The unescape_html fixer uses a more thorough list of HTML entities, which it
imports from html5lib.
ftfy.formatting now uses wcwidth to compute the width that a string will
occupy in a text console.
Heuristic changes:
Updated the data file of Unicode character categories to Unicode 9, as used
in Python 3.6.0. (No matter what version of Python you're on, ftfy uses the same
data.)
Pending deprecations:
The remove_bom option will become deprecated in 5.0, because it has been
superseded by remove_control_chars.
ftfy 5.0 will remove the previously deprecated name fix_text_encoding. It
was renamed to fix_encoding in 4.0.
ftfy 5.0 will require Python 3.2 or later, as planned. Python 2 users,
please specify ftfy < 5 in your dependencies if you haven't already.
Version 4.2.0 (September 28, 2016)
Heuristic changes:
Math symbols next to currency symbols are no longer considered 'weird' by
the heuristic. This fixes a false positive where text that involved the
multiplication sign and British pounds or euros (as in '5×£35') could turn
into Hebrew letters.
A heuristic that used to be a bonus for certain punctuation now also gives a
bonus to successfully decoding other common codepoints, such as the non-breaking
space, the degree sign, and the byte order mark.
In version 4.0, we tried to "future-proof" the categorization of
emoji (as a kind of symbol) to include codepoints that would likely be assigned
to emoji later. The future happened, and there are even more emoji than we
expected. We have expanded the range to include those emoji, too.
ftfy is still mostly based on information from Unicode 8 (as Python 3.5 is),
but this expanded range should include the emoji from Unicode 9 and 10.
Emoji are increasingly being modified by variation selectors and skin-tone
modifiers. Those codepoints are now grouped with 'symbols' in ftfy, so they fit
right in with emoji, instead of being considered 'marks' as their Unicode
category would suggest.
This enables fixing mojibake that involves iOS's new diverse emoji.
An old heuristic that wasn't necessary anymore considered Latin text with
high-numbered codepoints to be 'weird', but this is normal in languages such as
Vietnamese and Azerbaijani. This does not seem to have caused any false
positives, but it caused ftfy to be too reluctant to fix some cases of broken
text in those languages.
The heuristic has been changed, and all languages that use Latin letters
should be on even footing now.
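
The false positive from the first 4.2.0 heuristic change (a multiplication sign next to a pound sign turning into Hebrew) can be reproduced with the standard library alone. This is only an illustration of why the byte sequence is ambiguous, not ftfy's own decision logic.

```python
import unicodedata

text = "5\u00d7\u00a335"  # "5×£35": multiplication sign followed by a pound sign

# Treat the text as if it were mojibake: re-encode as Latin-1, decode as UTF-8.
# Bytes 0xD7 0xA3 happen to form a valid two-byte UTF-8 sequence.
misfixed = text.encode("latin-1").decode("utf-8")

print(repr(misfixed))
print(unicodedata.name(misfixed[1]))
```

The two symbols collapse into a single Hebrew letter, which is why the heuristic has to judge whether the input looked 'weird' enough to justify the change.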
Version 4.1.1 (April 13, 2016)
Bug fix: in the command-line interface, the -e option had no effect on
Python 3 when using standard input. Now, it correctly lets you specify a
different encoding for standard input.
Version 4.1.0 (February 25, 2016)
Heuristic changes:
ftfy can now deal with "lossy" mojibake. If your text has been run
through a strict Windows-1252 decoder, such as the one in Python, it may contain
the replacement character � (U+FFFD) where there were bytes that are
unassigned in Windows-1252.
Although ftfy won't recover the lost information, it can now detect this
situation, replace the entire lossy character with �, and decode the rest
of the characters. Previous versions would be unable to fix any string that
contained U+FFFD.
As an example, text in curly quotes that gets corrupted â€œ like
this â€� now gets fixed to be “ like this �.
Updated the data file of Unicode character categories to Unicode 8.0, as
used in Python 3.5.0. (No matter what version of Python you're on, ftfy uses the
same data.)
Heuristics now count characters such as ~ and ^ as punctuation instead of
wacky math symbols, improving the detection of mojibake in some edge cases.
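
How the "lossy" mojibake described in the 4.1.0 heuristic changes arises can be shown with Python's own codecs. The sketch below only reproduces the damage; it makes no claim about how ftfy repairs it internally.

```python
original = "\u201c like this \u201d"  # curly quotes around " like this "

# A strict Windows-1252 decoder has no character for byte 0x9D, so the last
# byte of the closing quote is replaced by U+FFFD and lost for good.
lossy = original.encode("utf-8").decode("windows-1252", errors="replace")
print(repr(lossy))

# The opening quote survived intact and can be round-tripped back.
opening = lossy[:3].encode("windows-1252").decode("utf-8")

# The string as a whole cannot: U+FFFD has no Windows-1252 byte.
try:
    lossy.encode("windows-1252")
    whole_string_reversible = True
except UnicodeEncodeError:
    whole_string_reversible = False
```

Because the plain encode/decode round trip fails on the whole string, a fixer has to treat the damaged sequence separately from the rest of the text.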
New features:
A new module, ftfy.formatting, can be used to justify Unicode text in a
monospaced terminal. It takes into account that each character can take up
anywhere from 0 to 2 character cells.
Internally, the utf-8-variants codec was simplified and optimized.
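
The idea behind width-aware justification (characters occupying 0, 1, or 2 cells) can be approximated with the standard library. The helpers `approx_width` and `approx_ljust` are hypothetical names for this sketch; they are a rough stand-in and not the ftfy.formatting or wcwidth implementation.

```python
import unicodedata


def approx_width(text: str) -> int:
    """Rough cell count: 0 for combining marks, 2 for wide characters, else 1."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks stack on the previous character
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def approx_ljust(text: str, cells: int) -> str:
    """Pad on the right so the text fills `cells` terminal cells."""
    return text + " " * max(0, cells - approx_width(text))
```

Padding by cell count rather than by `len()` keeps columns aligned when a line mixes Latin and CJK text.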
Version 4.0.0 (April 10, 2015)
Breaking changes:
The default normalization form is now NFC, not NFKC. NFKC replaces a large
number of characters with 'equivalent' characters, and some of these
replacements are useful, but some are not desirable to do by default.
The fix_text function has some new options that perform more targeted
operations that are part of NFKC normalization, such as fix_character_width,
without requiring hitting all your text with the huge mallet that is NFKC.
If you were already using NFC normalization, or in general if you want
to preserve the spacing of CJK text, you should be sure to set
fix_character_width=False.
The remove_unsafe_private_use parameter has been removed entirely, after two
versions of deprecation. The function name fix_bad_encoding is also gone.
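
The difference between NFC and NFKC mentioned in the 4.0.0 breaking changes can be seen with the standard unicodedata module. The sample strings are chosen for illustration and are not taken from the ftfy documentation.

```python
import unicodedata

samples = {
    "decomposed e-acute": "e\u0301",
    "fullwidth A": "\uff21",
    "trade mark sign": "\u2122",
}

# NFC only composes canonical equivalents; NFKC also folds compatibility
# characters such as fullwidth letters and the trade mark sign.
for label, text in samples.items():
    nfc = unicodedata.normalize("NFC", text)
    nfkc = unicodedata.normalize("NFKC", text)
    print(f"{label}: NFC={nfc!r} NFKC={nfkc!r}")
```

Both forms compose the accented letter, but only NFKC rewrites the fullwidth letter and the trade mark sign, which is the kind of change that is not always wanted by default.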
New features:
Fixers for strange new forms of mojibake, including particularly clear cases
of mixed UTF-8 and Windows-1252.
New heuristics, so that ftfy can fix more stuff, while maintaining
approximately zero false positives.
The command-line tool trusts you to know what encoding your input is in, and
assumes UTF-8 by default. You can still tell it to guess with the -g option.
The command-line tool can be configured with options, and can be used as a pipe.
Recognizes characters that are new in Unicode 7.0, as well as emoji from
Unicode 8.0+ that may already be in use on iOS.
Deprecations:
fix_text_encoding is being renamed again, for conciseness and consistency.
It's now simply called fix_encoding. The name fix_text_encoding is available but
emits a warning.
Pending deprecations:
Python 2.6 support is largely coincidental.
Python 2.7 support is on notice. If you use Python 2, be sure to pin a
version of ftfy less than 5.0 in your requirements.