Log message:
py-scrapy: updated to 2.9.0
Scrapy 2.9.0 (2023-05-08)
-------------------------
Highlights:
- Per-domain download settings.
- Compatibility with new cryptography_ and new parsel_.
- JMESPath selectors from the new parsel_.
- Bug fixes.
Deprecations
~~~~~~~~~~~~
- :class:`scrapy.extensions.feedexport._FeedSlot` is renamed to
:class:`scrapy.extensions.feedexport.FeedSlot` and the old name is
deprecated. (:issue:`5876`)
New features
~~~~~~~~~~~~
- Settings corresponding to :setting:`DOWNLOAD_DELAY`,
:setting:`CONCURRENT_REQUESTS_PER_DOMAIN` and
:setting:`RANDOMIZE_DOWNLOAD_DELAY` can now be set on a per-domain basis
via the new :setting:`DOWNLOAD_SLOTS` setting; see the sketch after this
list. (:issue:`5328`)
- Added :meth:`TextResponse.jmespath`, a shortcut for JMESPath selectors
available since parsel_ 1.8.1; a usage sketch follows this list.
(:issue:`5894`, :issue:`5915`)
- Added :signal:`feed_slot_closed` and :signal:`feed_exporter_closed`
signals; an extension sketch follows this list. (:issue:`5876`)
- Added :func:`scrapy.utils.request.request_to_curl`, a function to produce a
curl command from a :class:`~scrapy.Request` object; an example follows this
list. (:issue:`5892`)
- Values of :setting:`FILES_STORE` and :setting:`IMAGES_STORE` can now be
:class:`pathlib.Path` instances. (:issue:`5801`)
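A minimal ``settings.py`` sketch of the per-domain download slots described
above; the per-slot key names (``delay``, ``concurrency``,
``randomize_delay``) are assumptions mirroring the settings they override,
so check the :setting:`DOWNLOAD_SLOTS` documentation for the exact schema::

    # settings.py (illustrative only; the per-slot key names are assumed)
    DOWNLOAD_SLOTS = {
        "books.toscrape.com": {
            "delay": 2,                # per-domain DOWNLOAD_DELAY
            "concurrency": 1,          # per-domain CONCURRENT_REQUESTS_PER_DOMAIN
            "randomize_delay": False,  # per-domain RANDOMIZE_DOWNLOAD_DELAY
        },
    }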
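A short usage sketch for :meth:`TextResponse.jmespath`; the spider name, URL
and JMESPath expression are hypothetical, and the method mirrors
``response.css()``/``response.xpath()`` but queries JSON responses::

    import scrapy

    class QuotesApiSpider(scrapy.Spider):
        name = "quotes_api"  # hypothetical spider, for illustration only
        start_urls = ["https://quotes.toscrape.com/api/quotes?page=1"]

        def parse(self, response):
            # Query the JSON body directly, analogous to .css()/.xpath()
            for name in response.jmespath("quotes[*].author.name").getall():
                yield {"author": name}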
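A sketch of an extension reacting to the new feed signals; it assumes they
are exposed on ``scrapy.signals`` like the other built-in signals and that
``feed_slot_closed`` handlers receive the slot, so verify the handler
arguments against the signals reference::

    from scrapy import signals

    class FeedLoggingExtension:
        """Hypothetical extension, for illustration only."""

        @classmethod
        def from_crawler(cls, crawler):
            ext = cls()
            # Connect handlers to the new feed export signals.
            crawler.signals.connect(ext.slot_closed, signal=signals.feed_slot_closed)
            crawler.signals.connect(ext.exporter_closed, signal=signals.feed_exporter_closed)
            return ext

        def slot_closed(self, slot):
            print(f"feed slot closed: {slot!r}")

        def exporter_closed(self):
            print("feed exporter closed")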
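A quick sketch of :func:`scrapy.utils.request.request_to_curl`; it assumes
the function takes the :class:`~scrapy.Request` and returns the curl command
as a string, which matches the description above but should be confirmed
against the API reference::

    from scrapy import Request
    from scrapy.utils.request import request_to_curl

    request = Request(
        "https://example.com/api",  # illustrative URL and payload
        method="POST",
        headers={"Content-Type": "application/json"},
        body=b'{"query": "books"}',
    )
    # Print a curl command that reproduces the request, e.g. for replaying
    # it from a terminal while debugging.
    print(request_to_curl(request))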
Bug fixes
~~~~~~~~~
- Fixed a warning with Parsel 1.8.1+. (:issue:`5903`, :issue:`5918`)
- Fixed an error when using feed postprocessing with S3 storage.
(:issue:`5500`, :issue:`5581`)
- Added the missing :meth:`scrapy.settings.BaseSettings.setdefault` method.
(:issue:`5811`, :issue:`5821`)
- Fixed an error when using cryptography_ 40.0.0+ and
:setting:`DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING` is enabled.
(:issue:`5857`, :issue:`5858`)
- The checksums returned by :class:`~scrapy.pipelines.files.FilesPipeline`
for files on Google Cloud Storage are no longer Base64-encoded.
(:issue:`5874`, :issue:`5891`)
- :func:`scrapy.utils.request.request_from_curl` now supports $-prefixed
string values for the curl ``--data-raw`` argument, which are produced by
browsers for data that includes certain symbols. (:issue:`5899`,
:issue:`5901`)
- The :command:`parse` command now also works with async generator callbacks.
(:issue:`5819`, :issue:`5824`)
- The :command:`genspider` command now properly works with HTTPS URLs.
(:issue:`3553`, :issue:`5808`)
- Improved handling of asyncio loops. (:issue:`5831`, :issue:`5832`)
- :class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>`
now skips certain malformed URLs instead of raising an exception.
(:issue:`5881`)
- :func:`scrapy.utils.python.get_func_args` now supports more types of
callables. (:issue:`5872`, :issue:`5885`)
- Fixed an error when processing non-UTF8 values of ``Content-Type`` headers.
(:issue:`5914`, :issue:`5917`)
- Fixed an error breaking user handling of send failures in
:meth:`scrapy.mail.MailSender.send()`. (:issue:`1611`, :issue:`5880`)
Documentation
~~~~~~~~~~~~~
- Expanded contributing docs. (:issue:`5109`, :issue:`5851`)
- Added blacken-docs_ to pre-commit and reformatted the docs with it.
(:issue:`5813`, :issue:`5816`)
- Fixed a JS issue. (:issue:`5875`, :issue:`5877`)
- Fixed ``make htmlview``. (:issue:`5878`, :issue:`5879`)
- Fixed typos and other small errors. (:issue:`5827`, :issue:`5839`,
:issue:`5883`, :issue:`5890`, :issue:`5895`, :issue:`5904`)
Quality assurance
~~~~~~~~~~~~~~~~~
- Extended typing hints. (:issue:`5805`, :issue:`5889`, :issue:`5896`)
- Tests for most of the examples in the docs are now run as part of CI, and
the problems they uncovered were fixed. (:issue:`5816`, :issue:`5826`, :issue:`5919`)
- Removed usage of deprecated Python classes. (:issue:`5849`)
- Silenced ``include-ignored`` warnings from coverage. (:issue:`5820`)
- Fixed a random failure of the ``test_feedexport.test_batch_path_differ``
test. (:issue:`5855`, :issue:`5898`)
- Updated docstrings to match output produced by parsel_ 1.8.1 so that they
don't cause test failures. (:issue:`5902`, :issue:`5919`)
- Other CI and pre-commit improvements. (:issue:`5802`, :issue:`5823`,
:issue:`5908`)
|
Log message:
py-scrapy: updated to 2.8.0
Scrapy 2.8.0 (2023-02-02)
-------------------------
This is a maintenance release, with minor features, bug fixes, and cleanups.
Deprecation removals
~~~~~~~~~~~~~~~~~~~~
- The ``scrapy.utils.gz.read1`` function, deprecated in Scrapy 2.0, has now
been removed. Use the :meth:`~io.BufferedIOBase.read1` method of
:class:`~gzip.GzipFile` instead; a migration sketch follows this list.
- The ``scrapy.utils.python.to_native_str`` function, deprecated in Scrapy
2.0, has now been removed. Use :func:`scrapy.utils.python.to_unicode`
instead.
- The ``scrapy.utils.python.MutableChain.next`` method, deprecated in Scrapy
2.0, has now been removed. Use
:meth:`~scrapy.utils.python.MutableChain.__next__` instead.
- The ``scrapy.linkextractors.FilteringLinkExtractor`` class, deprecated
in Scrapy 2.0, has now been removed. Use
:class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>`
instead.
- Support for using environment variables prefixed with ``SCRAPY_`` to
override settings, deprecated in Scrapy 2.0, has now been removed.
- Support for the ``noconnect`` query string argument in proxy URLs,
deprecated in Scrapy 2.0, has now been removed. We expect proxies that used
to need it to work fine without it.
- The ``scrapy.utils.python.retry_on_eintr`` function, deprecated in Scrapy
2.3, has now been removed.
- The ``scrapy.utils.python.WeakKeyCache`` class, deprecated in Scrapy 2.4,
has now been removed.
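A small migration sketch for two of the removals above; the file name and
byte string are illustrative::

    import gzip
    from scrapy.utils.python import to_unicode

    # Was: scrapy.utils.gz.read1(gzfile, size)
    with gzip.GzipFile("page.html.gz") as gzfile:
        chunk = gzfile.read1(8192)

    # Was: scrapy.utils.python.to_native_str(value)
    text = to_unicode(b"caf\xc3\xa9", encoding="utf-8")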
Deprecations
~~~~~~~~~~~~
- :exc:`scrapy.pipelines.images.NoimagesDrop` is now deprecated.
- :meth:`ImagesPipeline.convert_image
<scrapy.pipelines.images.ImagesPipeline.convert_image>` must now accept a
``response_body`` parameter; see the sketch after this list.
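A sketch of a custom images pipeline updated for the new parameter; the
``(image, buffer)`` return value and the ``response_body`` keyword mirror
the base ``ImagesPipeline.convert_image()`` behaviour, but treat this as an
outline and verify it against the version of Scrapy in use::

    from scrapy.pipelines.images import ImagesPipeline

    class MyImagesPipeline(ImagesPipeline):
        # Overrides must now accept response_body (the originally downloaded
        # bytes) in addition to image and size.
        def convert_image(self, image, size=None, response_body=None):
            image, buf = super().convert_image(
                image, size=size, response_body=response_body
            )
            # ... apply any extra processing to `image` here ...
            return image, buf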
New features
~~~~~~~~~~~~
- Applied black_ coding style to files generated with the
:command:`genspider` and :command:`startproject` commands.
.. _black: https://black.readthedocs.io/en/stable/
- :setting:`FEED_EXPORT_ENCODING` is now set to ``"utf-8"`` in the
``settings.py`` file that the :command:`startproject` command generates.
With this value, JSON exports won’t force the use of escape sequences for
non-ASCII characters.
- The :class:`~scrapy.extensions.memusage.MemoryUsage` extension now logs the
peak memory usage during checks, and the binary unit MiB is now used to
avoid confusion.
- The ``callback`` parameter of :class:`~scrapy.http.Request` can now be set
to :func:`scrapy.http.request.NO_CALLBACK`, to distinguish it from
``None``, as the latter indicates that the default spider callback
(:meth:`~scrapy.Spider.parse`) is to be used; see the sketch after this list.
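A brief sketch of the new sentinel; the URLs are illustrative. Leaving
``callback`` unset (or ``None``) still means "use :meth:`Spider.parse`",
while ``NO_CALLBACK`` marks a response that is handled elsewhere, for
example by a middleware or pipeline::

    from scrapy import Request
    from scrapy.http.request import NO_CALLBACK

    # Handled by the default spider callback (Spider.parse):
    page = Request("https://example.com/")

    # Deliberately no spider callback; the response is consumed elsewhere:
    helper = Request("https://example.com/robots.txt", callback=NO_CALLBACK)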
Bug fixes
~~~~~~~~~
- Enabled unsafe legacy SSL renegotiation to fix access to some outdated
websites.
- Fixed STARTTLS-based email delivery not working with Twisted 21.2.0 and
later.
- Fixed the :meth:`finish_exporting` method of :ref:`item exporters
<topics-exporters>` not being called for empty files.
- Fixed HTTP/2 responses getting only the last value for a header when
multiple headers with the same name are received.
- Fixed an exception raised by the :command:`shell` command in some cases
when :ref:`using asyncio <using-asyncio>`.
- When using :class:`~scrapy.spiders.CrawlSpider`, callback keyword arguments
(``cb_kwargs``) added to a request in the ``process_request`` callback of a
:class:`~scrapy.spiders.Rule` will no longer be ignored.
- The :ref:`images pipeline <images-pipeline>` no longer re-encodes JPEG
files.
- Fixed the handling of transparent WebP images by the :ref:`images pipeline
<images-pipeline>`.
- :func:`scrapy.shell.inspect_response` no longer inhibits ``SIGINT``
(Ctrl+C).
- :class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>`
with ``unique=False`` no longer filters out links that have identical URL
*and* text.
- :class:`~scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware` now
ignores URL protocols that do not support ``robots.txt`` (``data://``,
``file://``).
- Silenced the ``filelock`` debug log messages introduced in Scrapy 2.6.
- Fixed the output of ``scrapy -h`` showing an unintended ``**commands**``
line.
- Made the active project indication in the output of :ref:`commands
<topics-commands>` clearer.
Documentation
~~~~~~~~~~~~~
- Documented how to :ref:`debug spiders from Visual Studio Code
<debug-vscode>`.
- Documented how :setting:`DOWNLOAD_DELAY` affects per-domain concurrency.
- Improved consistency.
- Fixed typos.
Quality assurance
~~~~~~~~~~~~~~~~~
- Applied :ref:`black coding style <coding-style>`, sorted import statements,
and introduced :ref:`pre-commit <scrapy-pre-commit>`.
- Switched from :mod:`os.path` to :mod:`pathlib`.
- Addressed many issues reported by Pylint.
- Improved code readability.
- Improved package metadata.
- Removed direct invocations of ``setup.py``.
- Removed unnecessary :class:`~collections.OrderedDict` usages.
- Removed unnecessary ``__str__`` definitions.
- Removed obsolete code and comments.
- Fixed test and CI issues.
|