./textproc/py-html5lib, HTML5 parser and tokenizer

Branch: CURRENT, Version: 1.1nb1, Package name: py39-html5lib-1.1nb1, Maintainer: joerg

html5lib is a pure-python library for parsing HTML. The parser is
designed to handle all flavours of HTML and parses invalid documents
using well-defined error handling rules compatible with the behaviour of
major desktop web browsers.

Output is to a tree structure; the current release supports output to
DOM, ElementTree, lxml and BeautifulSoup tree formats as well as a
simple custom format.
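
As a rough illustration of the description above, the following sketch parses a deliberately malformed fragment into an ElementTree tree. The helper name `parse_tag_soup` is made up for this example, and the import guard only keeps the snippet from failing where html5lib is not installed.

```python
try:
    import html5lib  # third-party; this is the module the package installs
except ImportError:
    html5lib = None


def parse_tag_soup(markup):
    """Parse possibly invalid HTML and return the ElementTree <html> element."""
    # namespaceHTMLElements=False keeps tag names short ("p" rather than "{ns}p").
    return html5lib.parse(markup, treebuilder="etree", namespaceHTMLElements=False)


if html5lib is not None:
    # Unclosed tags and no <html>/<body>: the parser supplies the missing structure.
    root = parse_tag_soup("<p>Hello<b>world")
    paragraph = root.find("body/p")
    print(paragraph.text, paragraph.find("b").text)
```

The `treebuilder` argument selects the output tree format; "etree" is used here because it needs nothing beyond the standard library.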

Required to run:
[devel/py-setuptools] [lang/python27] [lang/py-six] [textproc/py-webencodings]

Required to build:

Master sites:

Filesize: 265.835 KB

CVS history:

   2022-01-04 21:55:40 by Thomas Klausner | Files touched by this commit (1595)
Log message:
*: bump PKGREVISION for egg.mk users

They now have a tool dependency on py-setuptools instead of a DEPENDS
   2021-11-09 21:10:28 by Thomas Klausner | Files touched by this commit (3) | Package updated
Log message:
py-html5lib: update to 1.1.

Add some missing dependencies and test dependencies.


Breaking changes:

* Drop support for Python 3.3. (#358)
* Drop support for Python 3.4. (#421)


Deprecations:

* Deprecate the ``html5lib`` sanitizer (``html5lib.serialize(sanitize=True)`` and
  ``html5lib.filters.sanitizer``). We recommend users migrate to `Bleach
  <https://github.com/mozilla/bleach>`_. Please let us know if Bleach doesn't
  suffice for your use. (#443)

Other changes:

* Try to import from ``collections.abc`` to remove DeprecationWarning and ensure
  ``html5lib`` keeps working in future Python versions. (#403)
* Drop optional ``datrie`` dependency. (#442)
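
The ``collections.abc`` change above follows a common compatibility pattern. A minimal sketch of that pattern, not the library's actual code, might look like this; `is_mapping` is a hypothetical helper added only to show the imported name in use.

```python
try:
    # Python 3.3 and later: abstract base classes live in collections.abc.
    from collections.abc import Mapping
except ImportError:
    # Python 2 fallback: the same class is exposed from collections.
    from collections import Mapping


def is_mapping(obj):
    """Return True if obj implements the Mapping interface."""
    return isinstance(obj, Mapping)


print(is_mapping({"a": 1}), is_mapping([1, 2]))
```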
   2021-10-26 13:23:42 by Nia Alarie | Files touched by this commit (1161)
Log message:
textproc: Replace RMD160 checksums with BLAKE2s checksums

All checksums have been double-checked against existing RMD160 and
SHA512 hashes

Unfetchable distfiles (fetched conditionally?):
./textproc/convertlit/distinfo clit18src.zip
   2021-10-07 17:02:49 by Nia Alarie | Files touched by this commit (1162)
Log message:
textproc: Remove SHA1 hashes for distfiles
   2018-02-26 09:24:42 by Adam Ciarcinski | Files touched by this commit (3) | Package updated
Log message:
py-html5lib: updated to 1.0.1


Breaking changes:
* Drop support for Python 2.6.
* Remove utils/spider.py

Features:
* Improve documentation.
* Add iframe seamless boolean attribute.
* Add itemscope as a boolean attribute.
* Support Python 3.6.
* Add CI support for Windows using AppVeyor.
* Improve testing and CI and add code coverage.
* Semver-compliant version number.

Bug fixes:
* Add support for setuptools < 18.5 to support environment markers.
* Add explicit dependency for six >= 1.9.
* Fix regexes to work with Python 3.7 regex adjustments.
* Fix alphabeticalattributes filter namespace bug.
* Include license file in generated wheel package.
* Fix annotation-xml typo.
* Allow uppercase hex characters in CSS colour check.
   2017-01-15 00:04:16 by Klaus Klein | Files touched by this commit (1)
Log message:
Add dependency on py-webencodings (added the package in preparation,
but still managed not to add the dependency here).

   2017-01-11 18:42:24 by Klaus Klein | Files touched by this commit (3) | Package updated
Log message:
Update py-html5lib to 0.999999999.

This is the actual update to 0.999999999; the previous one was to
0.9999999 only.  Changes for the missed two versions can be looked up
   2016-12-30 11:09:36 by Ryo ONODERA | Files touched by this commit (3)
Log message:
Update to 0.999999999

* Use upstream filename as DISTNAME
* The latest version for Chromium build


Released on July 15, 2016

    Fix attribute order going to the tree builder to be document order instead of reverse document order(!).


Released on July 14, 2016

    Added ordereddict as a mandatory dependency on Python 2.6.
    Added lxml, genshi, datrie, charade, and all extras that will do the right thing based on the specific interpreter implementation.
    Now requires the mock package for the testsuite.
    Cease supporting DATrie under PyPy.
    Remove ``PullDOM`` support, as this hasn't ever been properly tested, doesn't entirely work, and as far as I can tell is completely unused by anyone.
    Move testsuite to py.test.
    Fix #124: move to webencodings for decoding the input byte stream; this makes html5lib compliant with the Encoding Standard, and introduces a required dependency on webencodings.
    Cease supporting Python 3.2 (in both CPython and PyPy forms).
    Fix comments containing double-dash with lxml 3.5 and above.
    Use scripting disabled by default (as we don't implement scripting).
    Fix #11, avoiding the XSS bug potentially caused by serializer allowing attribute values to be escaped out of in old browser versions, changing the quote_attr_values option on serializer to take one of three values, "always" (the old True value), "legacy" (the new option, and the new default), and "spec" (the old False value, and the old
    Fix #72 by rewriting the sanitizer to apply only to treewalkers (instead of the tokenizer); as such, this will require amending all callers of it to use it via the treewalker API.
    Drop support of charade, now that chardet is supported once more.
    Replace the charset keyword argument on parse and related methods with a set of keyword arguments: override_encoding, transport_encoding, same_origin_parent_encoding, likely_encoding, and default_encoding.
    Move filters._base, treebuilder._base, and treewalkers._base to .base to clarify their status as public.
    Get rid of the sanitizer package. Merge sanitizer.sanitize into the sanitizer.htmlsanitizer module and move that to sanitizer. This means anyone who used sanitizer.sanitize or sanitizer.HTMLSanitizer needs no code changes.
    Rename treewalkers.lxmletree to .etree_lxml and treewalkers.genshistream to .genshi to have a consistent API.
    Move a whole load of stuff (inputstream, ihatexml, trie, tokenizer, utils) to be underscore prefixed to clarify their status as private.
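
Two of the API changes in the list above, the encoding keyword arguments and the three-valued quote_attr_values option, can be sketched together as follows. The helper `reserialize` and the sample input are assumptions made for illustration, and the import guard only keeps the snippet from failing where html5lib is not installed.

```python
try:
    import html5lib  # third-party
except ImportError:
    html5lib = None


def reserialize(raw_bytes, quoting="legacy"):
    """Parse bytes with an explicit transport encoding and serialize them back."""
    # transport_encoding is one of the keyword arguments that replaced `charset`;
    # it names the encoding declared by the transport layer (e.g. an HTTP header).
    tree = html5lib.parse(raw_bytes, transport_encoding="utf-8")
    # quote_attr_values accepts "always", "legacy", or "spec".
    return html5lib.serialize(tree, quote_attr_values=quoting)


if html5lib is not None:
    sample = b"<p class=a>x</p>"
    for mode in ("always", "legacy", "spec"):
        print(mode, reserialize(sample, mode))
```

Comparing the three outputs shows when the serializer decides an attribute value needs quotes.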


Released on September 10, 2015

    Fix #195: fix the sanitizer to drop broken URLs (it threw an exception between 0.9999 and 0.999999).


Released on July 7, 2015

    Fix #189: fix the sanitizer to allow relative URLs again (as it did prior to \ 


Released on April 30, 2015

    Fix #188: fix the sanitizer to not throw an exception when sanitizing bogus data URLs.


Released on April 29, 2015

    Fix #153: Sanitizer fails to treat some attributes as URLs. Despite how this sounds, this has no known security implications. No known version of IE (5.5 to current), Firefox (3 to current), Safari (6 to current), Chrome (1 to current), or Opera (12 to current) will run any script provided in these attributes.
    Pass error message to the ParseError exception in strict parsing mode.
    Allow data URIs in the sanitizer, with a whitelist of content-types.
    Add support for Python implementations that don't support lone surrogates (read: Jython). Fixes #2.
    Remove localization of error messages. This functionality was totally unused (and untested that everything was localizable), so we may as well follow numerous browsers in not supporting translating technical strings.
    Expose treewalkers.pprint as a public API.
    Add a documentEncoding property to HTML5Parser, fix #121.
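
The strict-mode change in the list above (the error message is now passed to ParseError) can be observed with a sketch like the one below. `first_parse_error` is a hypothetical helper, and the import guard only keeps the snippet from failing where html5lib is not installed.

```python
try:
    import html5lib  # third-party
    from html5lib.html5parser import ParseError
except ImportError:
    html5lib = None


def first_parse_error(markup):
    """Return the message of the first parse error, or None if there is none."""
    # strict=True makes the parser raise on the first error instead of recovering.
    parser = html5lib.HTMLParser(strict=True)
    try:
        parser.parse(markup)
    except ParseError as exc:
        return str(exc)
    return None


if html5lib is not None:
    # A missing DOCTYPE is a parse error, so strict mode raises here.
    print(first_parse_error("<p>missing doctype"))
```

Without strict=True the same input parses successfully, with the parser applying its error-recovery rules.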