crawl websites and extract structured data from their pages. It can be used for
a wide range of purposes, from data mining to monitoring and automated testing.
2025-03-08 14:06:33 by Thomas Klausner | Files touched by this commit (1) |
Log message:
py-scrapy: fix wheel name for latest setuptools and depend on it
Bump PKGREVISION.
|
2024-11-30 07:56:49 by Adam Ciarcinski | Files touched by this commit (3) |  |
Log message:
py-scrapy: updated to 2.12.0
Scrapy 2.12.0 (2024-11-18)
Highlights:
- Dropped support for Python 3.8, added support for Python 3.13
- :meth:`~scrapy.Spider.start_requests` can now yield items
- Added :class:`~scrapy.http.JsonResponse`
- Added :setting:`CLOSESPIDER_PAGECOUNT_NO_ITEM`
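As an illustration of the new setting listed above, a project ``settings.py`` could contain a fragment like the one below. This is a sketch under assumptions: the value ``50`` is arbitrary, and the comment reflects one reading of the setting (close the spider after that many consecutive responses yield no items), so check the Scrapy settings reference before relying on it.

```python
# settings.py (fragment): hypothetical value chosen for illustration only.
# Assumed meaning: close the spider once this many consecutive responses
# have been crawled without any item being scraped. 0 disables the check.
CLOSESPIDER_PAGECOUNT_NO_ITEM = 50
```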
|
2024-11-11 08:29:31 by Thomas Klausner | Files touched by this commit (862) |
Log message:
py-*: remove unused tool dependency
py-setuptools includes the py-wheel functionality nowadays
|
2024-05-14 21:15:59 by Adam Ciarcinski | Files touched by this commit (3) |  |
Log message:
py-scrapy: updated to 2.11.2
Scrapy 2.11.2 (2024-05-14)
--------------------------
Security bug fixes
~~~~~~~~~~~~~~~~~~
- Redirects to non-HTTP protocols are no longer followed. Please, see the
`23j4-mw76-5v7h security advisory`_ for more information. (:issue:`457`)
.. _23j4-mw76-5v7h security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-23j4-mw76-5v7h
- The ``Authorization`` header is now dropped on redirects to a different
scheme (``http://`` or ``https://``) or port, even if the domain is the
same. Please, see the `4qqq-9vqf-3h3f security advisory`_ for more
information.
.. _4qqq-9vqf-3h3f security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-4qqq-9vqf-3h3f
- When using system proxy settings that are different for ``http://`` and
``https://``, redirects to a different URL scheme will now also trigger the
corresponding change in proxy settings for the redirected request. Please,
see the `jm3v-qxmh-hxwv security advisory`_ for more information.
(:issue:`767`)
.. _jm3v-qxmh-hxwv security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-jm3v-qxmh-hxwv
- :attr:`Spider.allowed_domains <scrapy.Spider.allowed_domains>` is now
enforced for all requests, and not only requests from spider callbacks.
(:issue:`1042`, :issue:`2241`, :issue:`6358`)
- :func:`~scrapy.utils.iterators.xmliter_lxml` no longer resolves XML
entities. (:issue:`6265`)
- defusedxml_ is now used to make
:class:`scrapy.http.request.rpc.XmlRpcRequest` more secure.
(:issue:`6250`, :issue:`6251`)
.. _defusedxml: https://github.com/tiran/defusedxml
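The ``Authorization`` rule described in the list above (drop the header when the redirect changes scheme, domain, or port) can be sketched with the standard library. This is not Scrapy's implementation: the helper names ``origin`` and ``keep_authorization`` and the ``DEFAULT_PORTS`` table are hypothetical, and the example URLs are placeholders.

```python
from urllib.parse import urlsplit

# Assumed default ports, used when a URL does not spell out its port.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple that identifies an origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return scheme, (parts.hostname or "").lower(), port


def keep_authorization(source_url, redirect_url):
    """Keep the header only if scheme, host, and port are all unchanged."""
    return origin(source_url) == origin(redirect_url)


# Same origin: the header may be kept.
print(keep_authorization("https://example.com/a", "https://example.com/b"))
# Scheme change on the same domain: the header is dropped.
print(keep_authorization("http://example.com/a", "https://example.com/a"))
```

Comparing the full origin triple covers both this release's rule and the earlier different-domain rule from 2.11.1 in a single check.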
Bug fixes
~~~~~~~~~
- Restored support for brotlipy_, which had been dropped in Scrapy 2.11.1 in
favor of brotli_. (:issue:`6261`)
.. _brotli: https://github.com/google/brotli
.. note:: brotlipy is deprecated, both in Scrapy and upstream. Use brotli
instead if you can.
- Make :setting:`METAREFRESH_IGNORE_TAGS` ``["noscript"]`` by default. This
prevents
:class:`~scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware` from
following redirects that would not be followed by web browsers with
JavaScript enabled. (:issue:`6342`, :issue:`6347`)
- During :ref:`feed export <topics-feed-exports>`, do not close the
underlying file from :ref:`built-in post-processing plugins
<builtin-plugins>`.
(:issue:`5932`, :issue:`6178`, :issue:`6239`)
- :class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>`
now properly applies the ``unique`` and ``canonicalize`` parameters.
(:issue:`3273`, :issue:`6221`)
- Do not initialize the scheduler disk queue if :setting:`JOBDIR` is an empty
string. (:issue:`6121`, :issue:`6124`)
- Fix :attr:`Spider.logger <scrapy.Spider.logger>` not logging custom extra
information. (:issue:`6323`, :issue:`6324`)
- ``robots.txt`` files with a non-UTF-8 encoding no longer prevent parsing
the UTF-8-compatible (e.g. ASCII) parts of the document.
(:issue:`6292`, :issue:`6298`)
- :meth:`scrapy.http.cookies.WrappedRequest.get_header` no longer raises an
exception if ``default`` is ``None``.
(:issue:`6308`, :issue:`6310`)
- :class:`~scrapy.selector.Selector` now uses
:func:`scrapy.utils.response.get_base_url` to determine the base URL of a
given :class:`~scrapy.http.Response`. (:issue:`6265`)
- The :meth:`media_to_download` method of :ref:`media pipelines
<topics-media-pipeline>` now logs exceptions before stripping them.
(:issue:`5067`, :issue:`5068`)
- When passing a callback to the :command:`parse` command, build the callback
callable with the right signature.
(:issue:`6182`)
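The ``robots.txt`` fix in the list above can be illustrated with the standard library alone. The following is a minimal sketch, not Scrapy's actual code: the helper name ``parse_robots_leniently`` is hypothetical, and dropping undecodable bytes with ``errors="ignore"`` is an assumption about one reasonable way to keep the UTF-8-compatible parts of the file.

```python
from urllib.robotparser import RobotFileParser


def parse_robots_leniently(body: bytes) -> RobotFileParser:
    """Parse robots.txt bytes, skipping bytes that are not valid UTF-8."""
    # A strict decode would raise UnicodeDecodeError on the first bad byte
    # and lose the whole file; "ignore" keeps everything that does decode.
    text = body.decode("utf-8", errors="ignore")
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# The comment line contains a Latin-1 byte (0xE9) that is invalid UTF-8.
body = b"# caf\xe9 rules\nUser-agent: *\nDisallow: /private/\n"
parser = parse_robots_leniently(body)
print(parser.can_fetch("*", "/private/page"))
print(parser.can_fetch("*", "/public"))
```

The rules after the undecodable comment are still honoured, which is the behaviour the release note describes.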
Documentation
~~~~~~~~~~~~~
- Add a FAQ entry about :ref:`creating blank requests <faq-blank-request>`.
(:issue:`6203`, :issue:`6208`)
- Document that :attr:`scrapy.selector.Selector.type` can be ``"json"``.
(:issue:`6328`, :issue:`6334`)
Quality assurance
~~~~~~~~~~~~~~~~~
- Make builds reproducible. (:issue:`5019`, :issue:`6322`)
- Packaging and test fixes.
|
2024-02-16 20:02:45 by Adam Ciarcinski | Files touched by this commit (3) |  |
Log message:
py-scrapy: updated to 2.11.1
Scrapy 2.11.1 (2024-02-14)
--------------------------
Highlights:
- Security bug fixes.
- Support for Twisted >= 23.8.0.
- Documentation improvements.
Security bug fixes
~~~~~~~~~~~~~~~~~~
- Addressed `ReDoS vulnerabilities`_:
- ``scrapy.utils.iterators.xmliter`` is now deprecated in favor of
:func:`~scrapy.utils.iterators.xmliter_lxml`, which
:class:`~scrapy.spiders.XMLFeedSpider` now uses.
To minimize the impact of this change on existing code,
:func:`~scrapy.utils.iterators.xmliter_lxml` now supports indicating
the node namespace with a prefix in the node name, and big files with
highly nested trees when using libxml2 2.7+.
- Fixed regular expressions in the implementation of the
:func:`~scrapy.utils.response.open_in_browser` function.
Please, see the `cc65-xxvf-f7r9 security advisory`_ for more information.
.. _ReDoS vulnerabilities: https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS
.. _cc65-xxvf-f7r9 security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-cc65-xxvf-f7r9
- :setting:`DOWNLOAD_MAXSIZE` and :setting:`DOWNLOAD_WARNSIZE` now also apply
to the decompressed response body. Please, see the `7j7m-v7m3-jqm7 security
advisory`_ for more information.
.. _7j7m-v7m3-jqm7 security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-7j7m-v7m3-jqm7
- Also in relation to the `7j7m-v7m3-jqm7 security advisory`_, the
deprecated ``scrapy.downloadermiddlewares.decompression`` module has been
removed.
- The ``Authorization`` header is now dropped on redirects to a different
domain. Please, see the `cw9j-q3vf-hrrv security advisory`_ for more
information.
.. _cw9j-q3vf-hrrv security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-cw9j-q3vf-hrrv
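To show why a size limit has to cover the decompressed body as well, here is a minimal standard-library sketch of bounded decompression. It is not how Scrapy implements :setting:`DOWNLOAD_MAXSIZE`: the names ``bounded_gunzip`` and ``BodyTooLarge`` are hypothetical, and gzip is assumed as the encoding.

```python
import gzip
import zlib


class BodyTooLarge(Exception):
    """Raised when the decompressed body exceeds the configured limit."""


def bounded_gunzip(data: bytes, max_size: int, chunk_size: int = 8192) -> bytes:
    """Decompress gzip data, aborting once more than max_size bytes appear."""
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    output = bytearray()
    pending = data
    while pending:
        # Produce at most chunk_size bytes per call so that a tiny input
        # cannot expand into an arbitrarily large buffer in one step.
        output += decompressor.decompress(pending, chunk_size)
        if len(output) > max_size:
            raise BodyTooLarge(f"body exceeds {max_size} bytes")
        pending = decompressor.unconsumed_tail
    output += decompressor.flush()
    if len(output) > max_size:
        raise BodyTooLarge(f"body exceeds {max_size} bytes")
    return bytes(output)


# One million zero bytes compress to a very small gzip payload.
payload = gzip.compress(b"\0" * 1_000_000)
print(len(payload) < 10_000)
```

Checking only ``len(payload)`` would let this response through a small limit, while the bounded decompressor stops shortly after the limit is crossed.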
Modified requirements
~~~~~~~~~~~~~~~~~~~~~
- The Twisted dependency is no longer restricted to < 23.8.0. (:issue:`6024`,
:issue:`6064`, :issue:`6142`)
Bug fixes
~~~~~~~~~
- The OS signal handling code was refactored to no longer use private Twisted
functions. (:issue:`6024`, :issue:`6064`, :issue:`6112`)
Documentation
~~~~~~~~~~~~~
- Improved documentation for :class:`~scrapy.crawler.Crawler` initialization
changes made in the 2.11.0 release. (:issue:`6057`, :issue:`6147`)
- Extended documentation for :attr:`Request.meta <scrapy.http.Request.meta>`.
(:issue:`5565`)
- Fixed the :reqmeta:`dont_merge_cookies` documentation. (:issue:`5936`,
:issue:`6077`)
- Added a link to Zyte's export guides to the :ref:`feed exports
<topics-feed-exports>` documentation. (:issue:`6183`)
- Added a missing note about backward-incompatible changes in
:class:`~scrapy.exporters.PythonItemExporter` to the 2.11.0 release notes.
(:issue:`6060`, :issue:`6081`)
- Added a missing note about removing the deprecated
``scrapy.utils.boto.is_botocore()`` function to the 2.8.0 release notes.
(:issue:`6056`, :issue:`6061`)
- Other documentation improvements. (:issue:`6128`, :issue:`6144`,
:issue:`6163`, :issue:`6190`, :issue:`6192`)
Quality assurance
~~~~~~~~~~~~~~~~~
- Added Python 3.12 to the CI configuration and re-enabled tests that were
disabled when the pre-release support was added. (:issue:`5985`,
:issue:`6083`, :issue:`6098`)
- Fixed a test issue on PyPy 7.3.14. (:issue:`6204`, :issue:`6205`)
|
2023-10-10 19:18:24 by Frédéric Fauberteau | Files touched by this commit (3) |
Log message:
py-scrapy: Update to 2.11.0
upstream changes:
-----------------
* 2.11.0: https://docs.scrapy.org/en/latest/news.html#scrapy-2-11-0-2023-09-18
* 2.10.0: https://docs.scrapy.org/en/2.10/news.html#scrapy-2-10-0-2023-08-04
|
2023-06-18 07:39:38 by Adam Ciarcinski | Files touched by this commit (20) |
Log message:
py-ZopeInterface: moved to py-zope.interface
|
2023-05-10 14:40:45 by Adam Ciarcinski | Files touched by this commit (2) |  |
Log message:
py-scrapy: updated to 2.9.0
Scrapy 2.9.0 (2023-05-08)
-------------------------
Highlights:
- Per-domain download settings.
- Compatibility with new cryptography_ and new parsel_.
- JMESPath selectors from the new parsel_.
- Bug fixes.
Deprecations
~~~~~~~~~~~~
- :class:`scrapy.extensions.feedexport._FeedSlot` is renamed to
:class:`scrapy.extensions.feedexport.FeedSlot` and the old name is
deprecated. (:issue:`5876`)
New features
~~~~~~~~~~~~
- Settings corresponding to :setting:`DOWNLOAD_DELAY`,
:setting:`CONCURRENT_REQUESTS_PER_DOMAIN` and
:setting:`RANDOMIZE_DOWNLOAD_DELAY` can now be set on a per-domain basis
via the new :setting:`DOWNLOAD_SLOTS` setting. (:issue:`5328`)
- Added :meth:`TextResponse.jmespath`, a shortcut for JMESPath selectors
available since parsel_ 1.8.1. (:issue:`5894`, :issue:`5915`)
- Added :signal:`feed_slot_closed` and :signal:`feed_exporter_closed`
signals. (:issue:`5876`)
- Added :func:`scrapy.utils.request.request_to_curl`, a function to produce a
curl command from a :class:`~scrapy.Request` object. (:issue:`5892`)
- Values of :setting:`FILES_STORE` and :setting:`IMAGES_STORE` can now be
:class:`pathlib.Path` instances. (:issue:`5801`)
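The per-domain settings from the list above could be expressed with a ``settings.py`` fragment along these lines. This is a sketch under assumptions: the domains are placeholders, and the keys ``concurrency``, ``delay`` and ``randomize_delay`` are taken to be the per-slot counterparts of the three settings named above, so verify them against the :setting:`DOWNLOAD_SLOTS` documentation.

```python
# settings.py (fragment): placeholder domains, for illustration only.
DOWNLOAD_SLOTS = {
    # One request at a time, with a fixed 2-second delay between requests.
    "example.com": {"concurrency": 1, "delay": 2, "randomize_delay": False},
    # Only the delay is overridden here; other values are assumed to fall
    # back to the project-wide settings.
    "example.org": {"delay": 5},
}
```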
Bug fixes
~~~~~~~~~
- Fixed a warning with Parsel 1.8.1+. (:issue:`5903`, :issue:`5918`)
- Fixed an error when using feed postprocessing with S3 storage.
(:issue:`5500`, :issue:`5581`)
- Added the missing :meth:`scrapy.settings.BaseSettings.setdefault` method.
(:issue:`5811`, :issue:`5821`)
- Fixed an error when using cryptography_ 40.0.0+ and
:setting:`DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING` is enabled.
(:issue:`5857`, :issue:`5858`)
- The checksums returned by :class:`~scrapy.pipelines.files.FilesPipeline`
for files on Google Cloud Storage are no longer Base64-encoded.
(:issue:`5874`, :issue:`5891`)
- :func:`scrapy.utils.request.request_from_curl` now supports $-prefixed
string values for the curl ``--data-raw`` argument, which are produced by
browsers for data that includes certain symbols. (:issue:`5899`,
:issue:`5901`)
- The :command:`parse` command now also works with async generator callbacks.
(:issue:`5819`, :issue:`5824`)
- The :command:`genspider` command now properly works with HTTPS URLs.
(:issue:`3553`, :issue:`5808`)
- Improved handling of asyncio loops. (:issue:`5831`, :issue:`5832`)
- :class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>`
now skips certain malformed URLs instead of raising an exception.
(:issue:`5881`)
- :func:`scrapy.utils.python.get_func_args` now supports more types of
callables. (:issue:`5872`, :issue:`5885`)
- Fixed an error when processing non-UTF8 values of ``Content-Type`` headers.
(:issue:`5914`, :issue:`5917`)
- Fixed an error breaking user handling of send failures in
:meth:`scrapy.mail.MailSender.send()`. (:issue:`1611`, :issue:`5880`)
Documentation
~~~~~~~~~~~~~
- Expanded contributing docs. (:issue:`5109`, :issue:`5851`)
- Added blacken-docs_ to pre-commit and reformatted the docs with it.
(:issue:`5813`, :issue:`5816`)
- Fixed a JS issue. (:issue:`5875`, :issue:`5877`)
- Fixed ``make htmlview``. (:issue:`5878`, :issue:`5879`)
- Fixed typos and other small errors. (:issue:`5827`, :issue:`5839`,
:issue:`5883`, :issue:`5890`, :issue:`5895`, :issue:`5904`)
Quality assurance
~~~~~~~~~~~~~~~~~
- Extended typing hints. (:issue:`5805`, :issue:`5889`, :issue:`5896`)
- Tests for most of the examples in the docs are now run as a part of CI,
and the problems found were fixed. (:issue:`5816`, :issue:`5826`, :issue:`5919`)
- Removed usage of deprecated Python classes. (:issue:`5849`)
- Silenced ``include-ignored`` warnings from coverage. (:issue:`5820`)
- Fixed a random failure of the ``test_feedexport.test_batch_path_differ``
test. (:issue:`5855`, :issue:`5898`)
- Updated docstrings to match output produced by parsel_ 1.8.1 so that they
don't cause test failures. (:issue:`5902`, :issue:`5919`)
- Other CI and pre-commit improvements. (:issue:`5802`, :issue:`5823`,
:issue:`5908`)
|