./textproc/py-html-sanitizer, White-list based HTML sanitizer



Branch: CURRENT, Version: 2.4.4, Package name: py311-html-sanitizer-2.4.4, Maintainer: pkgsrc-users

html-sanitizer is a whitelist-based and very opinionated HTML sanitizer
that can be used both for untrusted and trusted sources. It attempts to
clean up the mess made by various rich text editors and/or copy-pasting
to make styling of webpages simpler and more consistent. It builds on the
excellent HTML cleaner in lxml to make the result both valid and safe.

It goes further than pure tag filtering by transforming the HTML
fragments to normalize formatting and drop redundant or pointless tags.
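
To make the term "whitelist-based" concrete, here is a minimal sketch of allow-list tag filtering built only on Python's standard `html.parser`. It is not how html-sanitizer is implemented (the package builds on lxml's cleaner and does much more normalization); the names `ALLOWED_TAGS`, `ALLOWED_ATTRS`, `AllowListFilter` and `filter_tags` are hypothetical, and the toy is not meant for production use.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-lists, chosen only for this illustration.
ALLOWED_TAGS = {"p", "strong", "em", "a", "br"}
ALLOWED_ATTRS = {"a": {"href"}}


class AllowListFilter(HTMLParser):
    """Re-emit only allowed tags and attributes; keep all text content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself, its text is still emitted
        allowed = ALLOWED_ATTRS.get(tag, set())
        kept = [(k, v) for k, v in attrs if k in allowed and v is not None]
        attr_text = "".join(f' {k}="{escape(v, quote=True)}"' for k, v in kept)
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        # <br> is a void element and never gets a closing tag.
        if tag in ALLOWED_TAGS and tag != "br":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def filter_tags(html):
    parser = AllowListFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


dirty = '<p onclick="x()">Hi <span>there</span> <a href="/x" id="y">link</a></p>'
print(filter_tags(dirty))
```

Anything not explicitly listed is dropped, which is the opposite of a block-list that tries to enumerate dangerous constructs. The package described here additionally rewrites and merges the surviving markup.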


Required to run:
[textproc/py-lxml] [www/py-beautifulsoup4] [lang/python310]

Filesize: 16.853 KB


CVS history:


   2024-05-27 16:41:38 by Adam Ciarcinski | Files touched by this commit (2) | Package updated
Log message:
py-html-sanitizer: updated to 2.4.4

Next version

- **Vulnerability:** Fixed an issue where normalizing unicode too late in the
  process would keep disallowed tags when using specially crafted HTML. Fixed
  in 2.4.2.
- Fixed missing whitespace while merging adjacent tags.
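
The vulnerability entry above turns on the order of two steps. The following sketch shows why normalizing Unicode after filtering can let a disallowed tag through. It assumes NFKC normalization and uses a deliberately naive regex filter; `SCRIPT_TAG`, `strip_then_normalize` and `normalize_then_strip` are hypothetical names, not the package's code.

```python
import re
import unicodedata

# Naive stand-in for a tag filter: removes ASCII <script> open/close tags.
SCRIPT_TAG = re.compile(r"</?script[^>]*>", re.IGNORECASE)


def strip_then_normalize(html):
    """Filter first, normalize last (the problematic order)."""
    return unicodedata.normalize("NFKC", SCRIPT_TAG.sub("", html))


def normalize_then_strip(html):
    """Normalize first so the filter sees the final characters."""
    return SCRIPT_TAG.sub("", unicodedata.normalize("NFKC", html))


# Fullwidth angle brackets (U+FF1C, U+FF1E) are not matched by the filter,
# but NFKC folds them to ASCII '<' and '>'.
crafted = "\uff1cscript\uff1ealert(1)\uff1c/script\uff1e"

print(strip_then_normalize(crafted))  # tag reappears after the filter ran
print(normalize_then_strip(crafted))  # tag is removed
```

The general rule is that any transformation which can change what characters mean to the parser should run before the filtering step, not after it.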

2.4 (2024-04-01)

- Fixed an edge case where ``br`` tag attributes weren't removed if the br tag
  appears first.
- Updated the ``lxml`` dependency to 5.2 and added the now-required
  ``lxml[html_clean]`` extra.
   2024-03-11 07:55:41 by Adam Ciarcinski | Files touched by this commit (2) | Package updated
Log message:
py-html-sanitizer: updated to 2.3.1

2.3.1

- Fixed an edge case where ``br`` tag attributes weren't removed if the br tag
  appears first.
   2024-02-07 21:12:24 by Adam Ciarcinski | Files touched by this commit (2) | Package updated
Log message:
py-html-sanitizer: updated to 2.3.0

2.3 (2024-02-07)

- Avoided adding whitespace when merging tags of the same type.
- Updated the tests.
- Switched from black to the ruff formatter.
   2023-11-27 21:21:00 by Adam Ciarcinski | Files touched by this commit (3) | Package updated
Log message:
py-html-sanitizer: updated to 2.2.0

2.2 (2023-07-03)

- Changed ``keep_normalized_whitespace`` to preserve whitespace at the tail of
  tags, not just between tags.
- Changed the parameters of ``normalize_whitespace_in_text_or_tail`` to be
  keyword-only.
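
For readers unfamiliar with the term, a keyword-only parameter is declared after a bare `*` in the signature and can no longer be passed positionally. The sketch below uses a hypothetical `collapse_whitespace` helper with a hypothetical `keep_newlines` flag; it does not reproduce the signature of the package's own functions.

```python
import re


def collapse_whitespace(text, *, keep_newlines=False):
    """Collapse runs of whitespace; keep_newlines must be passed by name."""
    if keep_newlines:
        return re.sub(r"[ \t]+", " ", text)
    return re.sub(r"\s+", " ", text)


def positional_call_rejected():
    """Return True if passing the flag positionally raises TypeError."""
    try:
        collapse_whitespace("a  b", True)
    except TypeError:
        return True
    return False


print(collapse_whitespace("a  \n  b"))
print(repr(collapse_whitespace("a  \n  b", keep_newlines=True)))
print(positional_call_rejected())
```

Callers that previously passed such flags positionally need to switch to the named form when upgrading.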

2.1 (2023-06-29)

- Added a test for a type of misconfiguration.
- Changed the sanitizer configuration validation to not allow unexpected data
  types in ``tags``, ``empty``, ``separate``, ``whitespace`` and
  ``attributes``.
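
A rough sketch of what this kind of configuration type check could look like follows. The key names come from the changelog entry above; the accepted container types, the error messages, and the function names `validate_config` and `is_valid` are assumptions made for illustration and may differ from the package's actual validation.

```python
# Keys named in the changelog entry that hold collections of tag names.
COLLECTION_KEYS = ("tags", "empty", "separate", "whitespace")

# Assumption: these container types are accepted; a bare string is not,
# because iterating "br" would silently yield the tags "b" and "r".
ACCEPTED_COLLECTIONS = (set, frozenset, list, tuple)


def validate_config(config):
    """Raise TypeError if a known key holds an unexpected data type."""
    for key in COLLECTION_KEYS:
        value = config.get(key, set())
        if not isinstance(value, ACCEPTED_COLLECTIONS):
            raise TypeError(
                f"{key!r} must be a set, frozenset, list or tuple, "
                f"got {type(value).__name__}"
            )
    attributes = config.get("attributes", {})
    if not isinstance(attributes, dict):
        raise TypeError(
            f"'attributes' must be a dict, got {type(attributes).__name__}"
        )
    return config


def is_valid(config):
    """Convenience wrapper returning a bool instead of raising."""
    try:
        validate_config(config)
    except TypeError:
        return False
    return True


print(is_valid({"tags": {"p", "a"}, "attributes": {"a": ("href",)}}))
print(is_valid({"tags": "p"}))
```

Rejecting a plain string early is the main point: it is iterable, so without a type check a misconfiguration would go unnoticed.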

2.0 (2023-06-28)

- Raised the minimum Python version to 3.7. Added Python 3.10, 3.11.
- Raised the minimum lxml version to the current 4.9.1.
- Switched from Travis CI to GitHub actions. Added Python 3.9 to the CI
  matrix.
- Renamed the main branch to main.
- Switched to a declarative setup.
- Fixed a whitespace dependency in the testsuite.
- Switched to hatchling and ruff.
- Made behavior-altering arguments to ``normalize_overall_whitespace``
  keyword-only.
   2022-11-30 17:43:32 by Adam Ciarcinski | Files touched by this commit (2) | Package updated
Log message:
py-html-sanitizer: updated to 1.9.3

1.9 (2020-01-20)

- Added Python 3.8 to the CI matrix.
- Be able to keep the <style> tag by adding it to tags.
- Added a style check to the CI matrix.

1.8 (2019-11-21)

- Actually added support for customizing lxml's autolinking behavior using a
  dictionary argument.
- Stopped removing explicitly allowed attributes.
- Removed id from allowed attributes of <a> tags to provide an additional
  layer of defense against DOM clobbering attacks.
- Added an element preprocessor which assigns the id value to the name
  attribute of anchors if name isn't set or empty. This should provide
  additional backwards compatibility making the id removal less of a problem
  when using named anchors.
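
The id-to-name preprocessing described in the 1.8 notes can be sketched with the standard library's ElementTree. This is only an illustration under stated assumptions: the package works on lxml trees, the function name `move_anchor_id_to_name` is hypothetical, and here the `id` removal is folded into the same step for brevity.

```python
import xml.etree.ElementTree as ET


def move_anchor_id_to_name(root):
    """Copy each anchor's id into name when name is missing or empty,
    then drop the id attribute."""
    for anchor in root.iter("a"):
        anchor_id = anchor.attrib.pop("id", None)
        if anchor_id and not anchor.get("name"):
            anchor.set("name", anchor_id)
    return root


fragment = ET.fromstring(
    '<div><a id="top">Top</a><a id="x" name="kept">K</a></div>'
)
move_anchor_id_to_name(fragment)
result = ET.tostring(fragment, encoding="unicode")
print(result)
```

Links that target `#top` keep working through the `name` attribute, while the `id` attribute, which is the usual vector for DOM clobbering, no longer reaches the output.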

1.7 (2019-02-19)

- Added a system check which validates sanitizer configurations early when
  using Django.
- Fixed an edge case where passing in an empty allowed tags list would
  unexpectedly and silently not remove any tags at all (because that's the way
  lxml's cleaner works).
- Changed the sanitizer tags, empty and separate options to also accept any
  iterable, not just sets.
- Changed the lru_cache import in the Django module to try functools first.
- Fixed the tag merging to also check tags in empty. This means that e.g.
  consecutive <hr> tags are also merged now when using the default settings.
- Made it possible to override the set of tags processed as whitespace. The
  default set is {"br"} which preserves the current behavior of stripping
  breaks from the beginning or end of tags' content.
   2022-11-09 14:14:32 by Joerg Sonnenberger | Files touched by this commit (223)
Log message:
Reset MAINTAINER
   2022-01-04 21:55:40 by Thomas Klausner | Files touched by this commit (1595)
Log message:
*: bump PKGREVISION for egg.mk users

They now have a tool dependency on py-setuptools instead of a DEPENDS
   2021-10-26 13:23:42 by Nia Alarie | Files touched by this commit (1161)
Log message:
textproc: Replace RMD160 checksums with BLAKE2s checksums

All checksums have been double-checked against existing RMD160 and
SHA512 hashes

Unfetchable distfiles (fetched conditionally?):
./textproc/convertlit/distinfo clit18src.zip