Subject: CVS commit: pkgsrc/www/py-scrapy
From: Adam Ciarcinski
Date: 2019-08-22 10:21:11
Message id: 20190822082112.0CE58FBF4@cvs.NetBSD.org
Log Message:
py-scrapy: updated to 1.7.3
Scrapy 1.7.3:
Enforce lxml 4.3.5 or lower for Python 3.4 (issue 3912, issue 3918).
Scrapy 1.7.2:
Fix Python 2 support (issue 3889, issue 3893, issue 3896).
Scrapy 1.7.1:
Re-packaging of Scrapy 1.7.0, which was missing some changes in PyPI.
Scrapy 1.7.0:
Highlights:
Improvements for crawls targeting multiple domains
A cleaner way to pass arguments to callbacks
A new class for JSON requests
Improvements for rule-based spiders
New features for feed exports
Backward-incompatible changes
429 is now part of the RETRY_HTTP_CODES setting by default
This change is backward incompatible. If you don’t want to retry 429, you must \
override RETRY_HTTP_CODES accordingly.
Crawler, CrawlerRunner.crawl and CrawlerRunner.create_crawler no longer accept a \
Spider subclass instance; they only accept a Spider subclass now.
Spider subclass instances were never meant to work, and they were not working as \
one would expect: instead of using the passed Spider subclass instance, their \
from_crawler method was called to generate a new instance.
Non-default values for the SCHEDULER_PRIORITY_QUEUE setting may stop working. \
Scheduler priority queue classes now need to handle Request objects instead of \
arbitrary Python data structures.
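
As an illustration of the RETRY_HTTP_CODES note above, a project that does not want 429 responses retried could override the setting along these lines. This is a minimal sketch: the list of default codes is an assumption about the 1.7 defaults and should be checked against the installed Scrapy version, and `_assumed_defaults` is a hypothetical helper name.

```python
# settings.py (sketch)

# Assumption: this mirrors the Scrapy 1.7 default for RETRY_HTTP_CODES.
# Verify it against your installed version before relying on it.
_assumed_defaults = [500, 502, 503, 504, 522, 524, 408, 429]

# Retry everything the default list retries, except 429 (Too Many Requests).
RETRY_HTTP_CODES = [code for code in _assumed_defaults if code != 429]
```

The leading underscore keeps the helper lowercase, so it is not picked up as a setting itself.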
New features
A new scheduler priority queue, scrapy.pqueues.DownloaderAwarePriorityQueue, may \
be enabled for a significant scheduling improvement on crawls targeting \
multiple web domains, at the cost of no CONCURRENT_REQUESTS_PER_IP support \
(issue 3520)
A new Request.cb_kwargs attribute provides a cleaner way to pass keyword \
arguments to callback methods (issue 1138, issue 3563)
A new JSONRequest class offers a more convenient way to build JSON requests \
(issue 3504, issue 3505)
A process_request callback passed to the Rule constructor now receives the \
Response object that originated the request as its second argument (issue 3682)
A new restrict_text parameter for the LinkExtractor constructor allows filtering \
links by linking text (issue 3622, issue 3635)
A new FEED_STORAGE_S3_ACL setting allows defining a custom ACL for feeds \
exported to Amazon S3 (issue 3607)
A new FEED_STORAGE_FTP_ACTIVE setting allows using FTP’s active connection \
mode for feeds exported to FTP servers (issue 3829)
A new METAREFRESH_IGNORE_TAGS setting allows overriding which HTML tags are \
ignored when searching a response for HTML meta tags that trigger a redirect \
(issue 1422, issue 3768)
A new redirect_reasons request meta key exposes the reason (status code, meta \
refresh) behind every followed redirect (issue 3581, issue 3687)
The SCRAPY_CHECK variable is now set to the string "true" during runs of the check \
command, which allows detecting contract check runs from code (issue 3704, issue \
3739)
A new Item.deepcopy() method makes it easier to deep-copy items (issue 1493, \
issue 3671)
CoreStats also logs elapsed_time_seconds now (issue 3638)
Exceptions from ItemLoader input and output processors are now more verbose \
(issue 3836, issue 3840)
Crawler, CrawlerRunner.crawl and CrawlerRunner.create_crawler now fail \
gracefully if they receive a Spider subclass instance instead of the subclass \
itself (issue 2283, issue 3610, issue 3872)
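
Several of the new features above are plain settings, and the SCRAPY_CHECK variable can be read from the environment. The following is a rough sketch rather than a recommended configuration: the setting values are illustrative choices, and `in_contract_check` is a hypothetical helper name. The first part would live in a project's settings.py; the helper is meant to be called from spider code at run time, since the variable is only set while the check command is running.

```python
import os

# --- settings.py (sketch; values are illustrative assumptions) ---

# Opt in to the new priority queue for crawls spanning many domains.
SCHEDULER_PRIORITY_QUEUE = "scrapy.pqueues.DownloaderAwarePriorityQueue"
# The new queue does not support per-IP limits, so leave this disabled (0).
CONCURRENT_REQUESTS_PER_IP = 0

# Custom ACL for feeds exported to Amazon S3 (a standard S3 canned ACL name).
FEED_STORAGE_S3_ACL = "bucket-owner-full-control"
# Use FTP active connection mode for feeds exported to FTP servers.
FEED_STORAGE_FTP_ACTIVE = True
# HTML tags whose contents are skipped when looking for meta refresh tags;
# an empty list here means no tags are skipped.
METAREFRESH_IGNORE_TAGS = []


# --- project code (sketch) ---

def in_contract_check():
    """Return True when running under the Scrapy check command."""
    # SCRAPY_CHECK is set to the string "true" during contract check runs.
    return os.environ.get("SCRAPY_CHECK") == "true"
```

A spider could call in_contract_check() to skip side effects, such as writing to a database, while contracts are being checked.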
Bug fixes
process_spider_exception() is now also invoked for generators (issue 220, issue 2061)
System exceptions like KeyboardInterrupt are no longer caught (issue 3726)
ItemLoader.load_item() no longer makes later calls to \
ItemLoader.get_output_value() or ItemLoader.load_item() return empty data (issue \
3804, issue 3819)
The images pipeline (ImagesPipeline) no longer ignores these Amazon S3 settings: \
AWS_ENDPOINT_URL, AWS_REGION_NAME, AWS_USE_SSL, AWS_VERIFY (issue 3625)
Fixed a memory leak in MediaPipeline affecting, for example, non-200 responses \
and exceptions from custom middlewares (issue 3813)
Requests with private callbacks are now correctly unserialized from disk (issue 3790)
FormRequest.from_response() now handles invalid methods the way major web browsers do
Files: