./www/py-w3lib, Python library of web-related functions

[ CVSweb ] [ Homepage ] [ RSS ] [ Required by ] [ Add to tracker ]


Branch: CURRENT, Version: 2.2.1, Package name: py311-w3lib-2.2.1, Maintainer: pkgsrc-users

This is a Python library of web-related functions, such as:
* remove comments, or tags from HTML snippets
* extract base url from HTML snippets
* translate entites on HTML strings
* convert raw HTTP headers to dicts and vice-versa
* construct HTTP auth header
* converting HTML pages to unicode
* sanitize urls (like browsers do)
* extract arguments from urls


Required to run:
[devel/py-setuptools] [lang/py-six] [lang/python37]

Required to build:
[pkgtools/cwrappers]

Master sites:

Filesize: 48.44 KB

Version history: (Expand)


CVS history: (Expand)


   2024-01-14 21:49:07 by Adam Ciarcinski | Files touched by this commit (3) | Package updated
Log message:
py-w3lib: updated to 2.1.2

2.1.2 (2023-08-03)
------------------

- Fix test failures on Python 3.11.4+
- Fix an incorrect type hint
- Add project URLs to setup.py

2.1.1 (2022-12-09)
------------------

- :func:`~w3lib.url.safe_url_string`, :func:`~w3lib.url.safe_download_url`
  and :func:`~w3lib.url.canonicalize_url` now strip whitespace and control
  characters urls according to the URL living standard.

2.1.0 (2022-11-28)
------------------

-   Dropped Python 3.6 support, and made Python 3.11 support official.

-   :func:`~w3lib.url.safe_url_string` now generates safer URLs.

    To make URLs safer for the `URL living standard`_:

    .. _URL living standard: https://url.spec.whatwg.org/

    -   ``;=`` are percent-encoded in the URL username.

    -   ``;:=`` are percent-encoded in the URL password.

    -   ``'`` is percent-encoded in the URL query if the URL scheme is `special
        <https://url.spec.whatwg.org/#special-scheme>`__.

    To make URLs safer for `RFC 2396`_ and `RFC 3986`_, ``|[]`` are
    percent-encoded in URL paths, queries, and fragments.

    .. _RFC 2396: https://www.ietf.org/rfc/rfc2396.txt
    .. _RFC 3986: https://www.ietf.org/rfc/rfc3986.txt

-   :func:`~w3lib.encoding.html_to_unicode` now checks for the `byte order
    mark`_ before inspecting the ``Content-Type`` header when determining the
    content encoding, in line with the `URL living standard`_.

    .. _byte order mark: https://en.wikipedia.org/wiki/Byte_order_mark

-   :func:`~w3lib.url.canonicalize_url` now strips spaces from the input URL,
    to be more in line with the `URL living standard`_.

-   :func:`~w3lib.html.get_base_url` now ignores HTML comments.

-   Fixed :func:`~w3lib.url.safe_url_string` re-encoding percent signs on
    the URL username and password even when they were being used as part of an
    escape sequence.

-   Fixed :func:`~w3lib.http.basic_auth_header` using the wrong flavor of
    base64 encoding, which could prevent authentication in rare cases.

-   Fixed :func:`~w3lib.html.replace_entities` raising :exc:`OverflowError` in
    some cases due to `a bug in CPython
    <https://github.com/python/cpython/issues/76763>`__.

-   Improved typing and fixed typing issues.

-   Made CI and test improvements.

-   Adopted a Code of Conduct.

2.0.1 (2022-08-11)
------------------
Minor documentation fix (release date is set in the changelog).

2.0.0 (2022-08-11)
------------------

Backwards incompatible changes:

- Python 2 is no longer supported; Python 3.6+ is required now
- :func:`w3lib.url.safe_url_string` and :func:`w3lib.url.canonicalize_url`
  no longer convert "%23" to "#" when it appears in the URL \ 
path. This is a bug
  fix. It's listed as a backward-incomatible change because in some cases the
  output of :func:`w3lib.url.canonicalize_url` is going to change, and so, if
  this output is used to generate URL fingerprints, new fingerprints might be
  incompatible with those created with the previous w3lib versions

Deprecation removals

- The ``w3lib.form`` module is removed.
- The ``w3lib.html.remove_entities`` function is removed.
- The ``w3lib.url.urljoin_rfc`` function is removed.

The following functions are deprecated, and will be removed in future releases:

- ``w3lib.util.str_to_unicode``
- ``w3lib.util.unicode_to_str``
- ``w3lib.util.to_native_str``

Other improvements and bug fixes:

- Type annotations are added
- Added support for Python 3.9 and 3.10
- Fixed :func:`w3lib.html.get_meta_refresh` for ``<meta>`` tags where
  ``http-equiv`` is written after ``content``
- Fixed :func:`w3lib.url.safe_url_string` for IDNA domains with ports
- :func:`w3lib.url.url_query_cleaner` no longer adds an unneeded ``#`` when
  ``keep_fragments=True`` is passed, and the URL doesn't have a fragment

- Removed a workaround for an ancient pathname2url bug
- CI is migrated to GitHub Actions
- The code is formatted using black

1.22.0 (2020-05-13)
-------------------

- Python 3.4 is no longer supported
- :func:`w3lib.url.safe_url_string` now supports an optional ``quote_path``
  parameter to disable the percent-encoding of the URL path
- :func:`w3lib.url.add_or_replace_parameter` and
  :func:`w3lib.url.add_or_replace_parameters` no longer remove duplicate
  parameters from the original query string that are not being added or
  replaced
- :func:`w3lib.html.remove_tags` now raises a :exc:`ValueError` exception
  instead of :exc:`AssertionError` when using both the ``which_ones`` and the
  ``keep`` parameters
- Test improvements
- Documentation improvements
- Code cleanup
   2022-01-04 21:55:40 by Thomas Klausner | Files touched by this commit (1595)
Log message:
*: bump PKGREVISION for egg.mk users

They now have a tool dependency on py-setuptools instead of a DEPENDS
   2021-10-26 13:31:15 by Nia Alarie | Files touched by this commit (1030)
Log message:
www: Replace RMD160 checksums with BLAKE2s checksums

All checksums have been double-checked against existing RMD160 and
SHA512 hashes

Not committed (merge conflicts):
www/nghttp2/distinfo

Unfetchable distfiles (almost certainly fetched conditionally...):
./www/nginx-devel/distinfo array-var-nginx-module-0.05.tar.gz
./www/nginx-devel/distinfo echo-nginx-module-0.62.tar.gz
./www/nginx-devel/distinfo encrypted-session-nginx-module-0.08.tar.gz
./www/nginx-devel/distinfo form-input-nginx-module-0.12.tar.gz
./www/nginx-devel/distinfo headers-more-nginx-module-0.33.tar.gz
./www/nginx-devel/distinfo lua-nginx-module-0.10.19.tar.gz
./www/nginx-devel/distinfo naxsi-1.3.tar.gz
./www/nginx-devel/distinfo nginx-dav-ext-module-3.0.0.tar.gz
./www/nginx-devel/distinfo nginx-rtmp-module-1.2.2.tar.gz
./www/nginx-devel/distinfo nginx_http_push_module-1.2.10.tar.gz
./www/nginx-devel/distinfo ngx_cache_purge-2.5.1.tar.gz
./www/nginx-devel/distinfo ngx_devel_kit-0.3.1.tar.gz
./www/nginx-devel/distinfo ngx_http_geoip2_module-3.3.tar.gz
./www/nginx-devel/distinfo njs-0.5.0.tar.gz
./www/nginx-devel/distinfo set-misc-nginx-module-0.32.tar.gz
./www/nginx/distinfo array-var-nginx-module-0.05.tar.gz
./www/nginx/distinfo echo-nginx-module-0.62.tar.gz
./www/nginx/distinfo encrypted-session-nginx-module-0.08.tar.gz
./www/nginx/distinfo form-input-nginx-module-0.12.tar.gz
./www/nginx/distinfo headers-more-nginx-module-0.33.tar.gz
./www/nginx/distinfo lua-nginx-module-0.10.19.tar.gz
./www/nginx/distinfo naxsi-1.3.tar.gz
./www/nginx/distinfo nginx-dav-ext-module-3.0.0.tar.gz
./www/nginx/distinfo nginx-rtmp-module-1.2.2.tar.gz
./www/nginx/distinfo nginx_http_push_module-1.2.10.tar.gz
./www/nginx/distinfo ngx_cache_purge-2.5.1.tar.gz
./www/nginx/distinfo ngx_devel_kit-0.3.1.tar.gz
./www/nginx/distinfo ngx_http_geoip2_module-3.3.tar.gz
./www/nginx/distinfo njs-0.5.0.tar.gz
./www/nginx/distinfo set-misc-nginx-module-0.32.tar.gz
   2021-10-07 17:09:00 by Nia Alarie | Files touched by this commit (1033)
Log message:
www: Remove SHA1 hashes for distfiles
   2020-05-14 08:06:42 by Adam Ciarcinski | Files touched by this commit (2) | Package updated
Log message:
py-w3lib: updated to 1.22.0

1.22.0:
- Python 3.4 is no longer supported
- :func:`w3lib.url.safe_url_string` now supports an optional ``quote_path``
  parameter to disable the percent-encoding of the URL path
- :func:`w3lib.url.add_or_replace_parameter` and
  :func:`w3lib.url.add_or_replace_parameters` no longer remove duplicate
  parameters from the original query string that are not being added or
  replaced
- :func:`w3lib.html.remove_tags` now raises a :exc:`ValueError` exception
  instead of :exc:`AssertionError` when using both the ``which_ones`` and the
  ``keep`` parameters
- Test improvements
- Documentation improvements
- Code cleanup
   2019-08-12 22:03:01 by Adam Ciarcinski | Files touched by this commit (2) | Package updated
Log message:
py-w3lib: updated to 1.21.0

1.21.0:
- Add the encoding and path_encoding parameters to
  :func:w3lib.url.safe_download_url
- :func:w3lib.url.safe_url_string now also removes tabs and new lines
- :func:w3lib.html.remove_comments now also removes truncated comments
- :func:w3lib.html.remove_tags_with_content no longer removes tags which
  start with the same text as one of the specified tags
- Recommend pytest instead of nose to run tests
   2019-01-16 00:05:37 by Adam Ciarcinski | Files touched by this commit (2) | Package updated
Log message:
py-w3lib: updated to 1.20.0

1.20.0:
- Fix url_query_cleaner to do not append "?" to urls without a query string
- Add support for Python 3.7 and drop Python 3.3
- Add w3lib.url.add_or_replace_parameters helper
- Documentation fixes
   2018-01-26 09:06:07 by Adam Ciarcinski | Files touched by this commit (2) | Package updated
Log message:
py-w3lib: updated to 1.19.0

1.19.0:
- Add a workaround for CPython segfault (https://bugs.python.org/issue32583)
  which affect w3lib.encoding functions. This is technically **backwards
  incompatible** because it changes the way non-decodable bytes are replaced
  (in some cases instead of two ``\ufffd`` chars you can get one).
  As a side effect, the fix speeds up decoding in Python 3.4+.
- Add 'encoding' parameter for w3lib.http.basic_auth_header.
- Fix pypy testing setup, add pypy3 to CI.