./textproc/py-ftfy, Fixes some problems with Unicode text after the fact



Branch: CURRENT, Version: 4.2.0, Package name: py27-ftfy-4.2.0, Maintainer: rodent

Given Unicode text, make its representation consistent and possibly less broken.


Required to run:
[devel/py-setuptools] [textproc/py-html5lib] [lang/python27] [devel/py-wcwidth]

Required to build:
[pkgtools/cwrappers]

Master sites:

SHA1: 31b504c7abb80286210c4d484fd92e2717226232
RMD160: 9e0de31674bd19eb8f29fc1895c5db65c72628e4
Filesize: 34.315 KB



CVS history:


   2017-01-12 01:48:25 by Blue Rats | Files touched by this commit (1)
Log message:
DEPENDS on devel/py-wcwidth and textproc/py-html5lib.
   2017-01-12 01:45:43 by Blue Rats | Files touched by this commit (3) | Package updated
Log message:
Version 4.3.0 (December 29, 2016)

ftfy has gotten by for four years without dependencies on other Python
libraries, but now we can spare ourselves some code and some maintenance burden
by delegating certain tasks to other libraries that already solve them well.
This version now depends on the html5lib and wcwidth libraries.

Feature changes:

    The remove_control_chars fixer will now remove some non-ASCII control
characters as well, such as deprecated Arabic control characters and byte-order
marks. Bidirectional controls are still left as is.

    This should have no impact on well-formed text, while cleaning up many
characters that the Unicode Consortium deems "not suitable for markup"
(see Unicode Technical Report #20).

    The unescape_html fixer uses a more thorough list of HTML entities, which it
imports from html5lib.

    ftfy.formatting now uses wcwidth to compute the width that a string will
occupy in a text console.
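
As a rough illustration of the control-character and HTML-entity fixers
described above, the sketch below uses only the Python standard library. It is
not ftfy's implementation: the set of removed codepoints is an assumed,
illustrative subset (the byte-order mark U+FEFF and the deprecated format
characters U+206A to U+206F), strip_assumed_controls is a hypothetical helper
name, and html.unescape stands in for the entity list that ftfy takes from
html5lib.

```python
import html

# Assumption: an illustrative subset of the characters discussed in UTR #20,
# not the exact set that ftfy's remove_control_chars fixer removes.
ASSUMED_REMOVABLE = {0xFEFF: None}  # byte-order mark
ASSUMED_REMOVABLE.update({cp: None for cp in range(0x206A, 0x2070)})


def strip_assumed_controls(text):
    """Drop the assumed control characters; leave everything else alone."""
    return text.translate(ASSUMED_REMOVABLE)


sample = "\ufeffprice &lt; 5 &hellip;\u206c"
cleaned = strip_assumed_controls(sample)  # BOM and U+206C are removed
unescaped = html.unescape(cleaned)        # entities become characters
print(repr(unescaped))
```

Bidirectional controls such as U+200F are deliberately absent from the
assumed table, so they pass through unchanged, matching the behaviour the
changelog describes.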

Heuristic changes:

    Updated the data file of Unicode character categories to Unicode 9, as used
in Python 3.6.0. (No matter what version of Python you're on, ftfy uses the same
data.)

Pending deprecations:

    The remove_bom option will become deprecated in 5.0, because it has been
superseded by remove_control_chars.

    ftfy 5.0 will remove the previously deprecated name fix_text_encoding. It
was renamed to fix_encoding in 4.0.

    ftfy 5.0 will require Python 3.2 or later, as planned. Python 2 users,
please specify ftfy < 5 in your dependencies if you haven't already.

Version 4.2.0 (September 28, 2016)

Heuristic changes:

    Math symbols next to currency symbols are no longer considered 'weird' by
the heuristic. This fixes a false positive where text that involved the
multiplication sign and British pounds or euros (as in '5×£35') could turn
into Hebrew letters.

    A heuristic that used to be a bonus for certain punctuation now also gives a
bonus to successfully decoding other common codepoints, such as the non-breaking
space, the degree sign, and the byte order mark.

    In version 4.0, we tried to "future-proof" the categorization of
emoji (as a kind of symbol) to include codepoints that would likely be assigned
to emoji later. The future happened, and there are even more emoji than we
expected. We have expanded the range to include those emoji, too.

    ftfy is still mostly based on information from Unicode 8 (as Python 3.5 is),
but this expanded range should include the emoji from Unicode 9 and 10.

    Emoji are increasingly being modified by variation selectors and skin-tone
modifiers. Those codepoints are now grouped with 'symbols' in ftfy, so they fit
right in with emoji, instead of being considered 'marks' as their Unicode
category would suggest.

    This enables fixing mojibake that involves iOS's new diverse emoji.

    An old heuristic that wasn't necessary anymore considered Latin text with
high-numbered codepoints to be 'weird', but this is normal in languages such as
Vietnamese and Azerbaijani. This does not seem to have caused any false
positives, but it caused ftfy to be too reluctant to fix some cases of broken
text in those languages.

    The heuristic has been changed, and all languages that use Latin letters
should be on even footing now.
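
The '5×£35' false positive mentioned above comes down to a byte coincidence,
which the following standard-library sketch reproduces. It shows only why such
text can look like mojibake (its Latin-1 bytes also form valid UTF-8); it does
not reproduce ftfy's heuristic, and the variable names are illustrative.

```python
import unicodedata

# '5×£35' written with escapes: U+00D7 MULTIPLICATION SIGN, U+00A3 POUND SIGN
original = "5\u00d7\u00a3" + "35"

# In Latin-1 the two symbols are the bytes D7 A3, which also happen to be
# a valid two-byte UTF-8 sequence for a single Hebrew letter.
raw = original.encode("latin-1")
misread = raw.decode("utf-8")

print(repr(misread))
print(unicodedata.name(misread[1]))
```

Because the reinterpretation succeeds without errors, a heuristic that also
distrusts a math symbol next to a currency symbol has nothing stopping it from
preferring the Hebrew reading, which is the case the 4.2.0 change rules out.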

Version 4.1.1 (April 13, 2016)

    Bug fix: in the command-line interface, the -e option had no effect on
Python 3 when using standard input. Now, it correctly lets you specify a
different encoding for standard input.

Version 4.1.0 (February 25, 2016)

Heuristic changes:

    ftfy can now deal with "lossy" mojibake. If your text has been run
through a strict Windows-1252 decoder, such as the one in Python, it may contain
the replacement character � (U+FFFD) where there were bytes that are
unassigned in Windows-1252.

    Although ftfy won't recover the lost information, it can now detect this
situation, replace the entire lossy character with �, and decode the rest of
the characters. Previous versions would be unable to fix any string that
contained U+FFFD.

    As an example, text in curly quotes that gets corrupted â€œ like this
â€� now gets fixed to be “ like this �.

    Updated the data file of Unicode character categories to Unicode 8.0, as
used in Python 3.5.0. (No matter what version of Python you're on, ftfy uses the
same data.)

    Heuristics now count characters such as ~ and ^ as punctuation instead of
wacky math symbols, improving the detection of mojibake in some edge cases.
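
To make the "lossy" mojibake case above concrete, this sketch shows how such
text arises in the first place, using only Python's built-in codecs. It
reproduces the corruption, not ftfy's repair logic; the variable names are
illustrative.

```python
text = "\u201c like this \u201d"  # curly quotes around " like this "
utf8_bytes = text.encode("utf-8")

# The closing quote ends in byte 0x9D, which Python's cp1252 codec leaves
# unassigned, so a strict decode fails outright.
try:
    utf8_bytes.decode("cp1252")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With errors="replace" the unassigned byte becomes U+FFFD and the original
# byte value is lost, which is what makes the mojibake "lossy".
garbled = utf8_bytes.decode("cp1252", errors="replace")
print(strict_ok)
print(repr(garbled))
```

The opening quote survives as three recoverable characters, while the closing
quote keeps only two of its three bytes, so the best any repair can do is put
a single U+FFFD in its place.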

New features:

    A new module, ftfy.formatting, can be used to justify Unicode text in a
monospaced terminal. It takes into account that each character can take up
anywhere from 0 to 2 character cells.

    Internally, the utf-8-variants codec was simplified and optimized.
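
The idea that a character occupies 0, 1, or 2 terminal cells, as described for
ftfy.formatting above, can be approximated with the standard unicodedata
module. This is a simplified sketch, not the ftfy.formatting or wcwidth
algorithm; approx_cell_width, approx_display_width, and ljust_cells are
hypothetical helper names, and the category rules are assumptions that will
disagree with real terminals in some cases.

```python
import unicodedata


def approx_cell_width(ch):
    """Approximate the number of terminal cells one character occupies."""
    # Assumption: combining marks and format characters take no cell.
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # Wide and fullwidth East Asian characters take two cells.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def approx_display_width(text):
    """Sum the approximate cell widths of all characters in the text."""
    return sum(approx_cell_width(ch) for ch in text)


def ljust_cells(text, width, fill=" "):
    """Left-justify by display cells rather than by number of codepoints."""
    return text + fill * max(0, width - approx_display_width(text))


print(approx_display_width("abc"))
print(approx_display_width("\u65e5\u672c\u8a9e"))  # three wide CJK characters
print(approx_display_width("e\u0301"))             # 'e' plus combining acute
```

Padding with str.ljust would count codepoints instead, so two wide characters
padded to six columns would overshoot by two cells; ljust_cells adds only the
two fill characters that are actually needed.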

Version 4.0.0 (April 10, 2015)

Breaking changes:

    The default normalization form is now NFC, not NFKC. NFKC replaces a large
number of characters with 'equivalent' characters, and some of these
replacements are useful, but some are not desirable to do by default.

    The fix_text function has some new options that perform more targeted
operations that are part of NFKC normalization, such as fix_character_width,
without requiring hitting all your text with the huge mallet that is NFKC.

        If you were already using NFC normalization, or in general if you want
to preserve the spacing of CJK text, you should be sure to set
fix_character_width=False.

    The remove_unsafe_private_use parameter has been removed entirely, after two
versions of deprecation. The function name fix_bad_encoding is also gone.
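
The practical difference between NFC and NFKC described above can be seen
directly with the standard unicodedata module. This sketch only demonstrates
the two normalization forms themselves; it does not call ftfy, and the sample
strings are illustrative.

```python
import unicodedata

fullwidth = "\uff21\uff22\uff23"  # fullwidth Latin letters A, B, C
ligature = "\ufb01"               # LATIN SMALL LIGATURE FI
decomposed = "e\u0301"            # 'e' followed by COMBINING ACUTE ACCENT

# NFC only composes canonically equivalent sequences.
nfc_fullwidth = unicodedata.normalize("NFC", fullwidth)
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfc_decomposed = unicodedata.normalize("NFC", decomposed)

# NFKC additionally replaces compatibility characters.
nfkc_fullwidth = unicodedata.normalize("NFKC", fullwidth)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)

print(repr(nfc_fullwidth), repr(nfkc_fullwidth))
print(repr(nfc_ligature), repr(nfkc_ligature))
print(repr(nfc_decomposed))
```

NFC leaves the fullwidth letters and the ligature untouched and only composes
the accented letter, whereas NFKC also narrows the fullwidth letters, which is
the kind of width change that affects the spacing of CJK text.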

New features:

    Fixers for strange new forms of mojibake, including particularly clear cases
of mixed UTF-8 and Windows-1252.

    New heuristics, so that ftfy can fix more stuff, while maintaining
approximately zero false positives.

    The command-line tool trusts you to know what encoding your input is in, and
assumes UTF-8 by default. You can still tell it to guess with the -g option.

    The command-line tool can be configured with options, and can be used as a pipe.

    Recognizes characters that are new in Unicode 7.0, as well as emoji from
Unicode 8.0+ that may already be in use on iOS.

Deprecations:

    fix_text_encoding is being renamed again, for conciseness and consistency.
It's now simply called fix_encoding. The name fix_text_encoding is available but
emits a warning.

Pending deprecations:

    Python 2.6 support is largely coincidental.

    Python 2.7 support is on notice. If you use Python 2, be sure to pin a
version of ftfy less than 5.0 in your requirements.
   2017-01-03 14:23:05 by Jonathan Perkin | Files touched by this commit (52)
Log message:
Use "${MV} || ${TRUE}" and "${RM} -f" consistently in
post-install targets.
   2016-08-28 17:48:37 by Thomas Klausner | Files touched by this commit (112)
Log message:
Remove unnecessary PLIST_SUBST and FILES_SUBST that are now provided
by the infrastructure.

Mark a couple more packages as not ready for python-3.x.
   2016-06-08 19:43:49 by Thomas Klausner | Files touched by this commit (356)
Log message:
Switch to MASTER_SITES_PYPI.
   2015-11-04 03:00:17 by Alistair G. Crooks | Files touched by this commit (797)
Log message:
Add SHA512 digests for distfiles for textproc category

Problems found locating distfiles:
	Package cabocha: missing distfile cabocha-0.68.tar.bz2
	Package convertlit: missing distfile clit18src.zip
	Package php-enchant: missing distfile php-enchant/enchant-1.1.0.tgz

Otherwise, existing SHA1 digests verified and found to be the same on
the machine holding the existing distfiles (morden).  All existing
SHA1 digests retained for now as an audit trail.
   2015-04-03 00:36:59 by Blue Rats | Files touched by this commit (5)
Log message:
Hmm, I thought I imported this already, but apparently not...

Import py27-ftfy-3.4.0 as textproc/py-ftfy.

Given Unicode text, make its representation consistent and possibly less broken.