./www/py-scrapy, High-level Web Crawling and Web Scraping framework



Branch: CURRENT, Version: 2.4.1nb2, Package name: py39-scrapy-2.4.1nb2, Maintainer: pkgsrc-users

Scrapy is a fast high-level web crawling and web scraping framework, used to
crawl websites and extract structured data from their pages. It can be used for
a wide range of purposes, from data mining to monitoring and automated testing.


Required to run:
[net/py-twisted] [security/py-OpenSSL] [devel/py-setuptools] [devel/py-ZopeInterface] [textproc/py-lxml] [textproc/py-cssselect] [lang/py-six] [security/py-cryptography] [security/py-service_identity] [www/py-parsel] [www/py-w3lib] [devel/py-pydispatcher] [devel/py-queuelib] [lang/python37] [www/py-protego]

Required to build:
[pkgtools/cwrappers]

Master sites:

Filesize: 1019.771 KB



CVS history:


   2022-01-05 16:41:32 by Thomas Klausner | Files touched by this commit (289)
Log message:
python: egg.mk: add USE_PKG_RESOURCES flag

This flag should be set for packages that import pkg_resources
and thus need setuptools after the build step.

Set this flag for packages that need it and bump PKGREVISION.
   2022-01-04 21:55:40 by Thomas Klausner | Files touched by this commit (1595)
Log message:
*: bump PKGREVISION for egg.mk users

They now have a tool dependency on py-setuptools instead of a DEPENDS
   2021-10-26 13:31:15 by Nia Alarie | Files touched by this commit (1030)
Log message:
www: Replace RMD160 checksums with BLAKE2s checksums

All checksums have been double-checked against existing RMD160 and
SHA512 hashes

Not committed (merge conflicts):
www/nghttp2/distinfo

Unfetchable distfiles (almost certainly fetched conditionally...):
./www/nginx-devel/distinfo array-var-nginx-module-0.05.tar.gz
./www/nginx-devel/distinfo echo-nginx-module-0.62.tar.gz
./www/nginx-devel/distinfo encrypted-session-nginx-module-0.08.tar.gz
./www/nginx-devel/distinfo form-input-nginx-module-0.12.tar.gz
./www/nginx-devel/distinfo headers-more-nginx-module-0.33.tar.gz
./www/nginx-devel/distinfo lua-nginx-module-0.10.19.tar.gz
./www/nginx-devel/distinfo naxsi-1.3.tar.gz
./www/nginx-devel/distinfo nginx-dav-ext-module-3.0.0.tar.gz
./www/nginx-devel/distinfo nginx-rtmp-module-1.2.2.tar.gz
./www/nginx-devel/distinfo nginx_http_push_module-1.2.10.tar.gz
./www/nginx-devel/distinfo ngx_cache_purge-2.5.1.tar.gz
./www/nginx-devel/distinfo ngx_devel_kit-0.3.1.tar.gz
./www/nginx-devel/distinfo ngx_http_geoip2_module-3.3.tar.gz
./www/nginx-devel/distinfo njs-0.5.0.tar.gz
./www/nginx-devel/distinfo set-misc-nginx-module-0.32.tar.gz
./www/nginx/distinfo array-var-nginx-module-0.05.tar.gz
./www/nginx/distinfo echo-nginx-module-0.62.tar.gz
./www/nginx/distinfo encrypted-session-nginx-module-0.08.tar.gz
./www/nginx/distinfo form-input-nginx-module-0.12.tar.gz
./www/nginx/distinfo headers-more-nginx-module-0.33.tar.gz
./www/nginx/distinfo lua-nginx-module-0.10.19.tar.gz
./www/nginx/distinfo naxsi-1.3.tar.gz
./www/nginx/distinfo nginx-dav-ext-module-3.0.0.tar.gz
./www/nginx/distinfo nginx-rtmp-module-1.2.2.tar.gz
./www/nginx/distinfo nginx_http_push_module-1.2.10.tar.gz
./www/nginx/distinfo ngx_cache_purge-2.5.1.tar.gz
./www/nginx/distinfo ngx_devel_kit-0.3.1.tar.gz
./www/nginx/distinfo ngx_http_geoip2_module-3.3.tar.gz
./www/nginx/distinfo njs-0.5.0.tar.gz
./www/nginx/distinfo set-misc-nginx-module-0.32.tar.gz
   2021-10-07 17:09:00 by Nia Alarie | Files touched by this commit (1033)
Log message:
www: Remove SHA1 hashes for distfiles
   2021-10-06 11:07:00 by Jonathan Perkin | Files touched by this commit (1)
Log message:
py-scrapy: Switch to PYTHON_VERSIONS_INCOMPATIBLE.
   2021-03-22 09:56:56 by Frédéric Fauberteau | Files touched by this commit (3)
Log message:
py-scrapy: Update to 2.4.1

Upstream changes:
------------------
A lot of changes listed at https://github.com/scrapy/scrapy/blob/master/docs/news.rst
   2020-01-29 23:06:30 by Adam Ciarcinski | Files touched by this commit (3) | Package updated
Log message:
py-scrapy: updated to 1.8.0

Scrapy 1.8.0:

Highlights:
* Dropped Python 3.4 support and updated minimum requirements; made Python 3.8
  support official
* New Request.from_curl class method (scrapy.http.Request.from_curl)
* New ROBOTSTXT_PARSER and ROBOTSTXT_USER_AGENT settings
* New DOWNLOADER_CLIENT_TLS_CIPHERS and
  DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING settings
   2019-08-22 10:21:11 by Adam Ciarcinski | Files touched by this commit (3) | Package updated
Log message:
py-scrapy: updated to 1.7.3

Scrapy 1.7.3:
Enforce lxml 4.3.5 or lower for Python 3.4 (issue 3912, issue 3918).

Scrapy 1.7.2:
Fix Python 2 support (issue 3889, issue 3893, issue 3896).

Scrapy 1.7.1:
Re-packaging of Scrapy 1.7.0, which was missing some changes in PyPI.

Scrapy 1.7.0:
Highlights:
* Improvements for crawls targeting multiple domains
* A cleaner way to pass arguments to callbacks
* A new class for JSON requests
* Improvements for rule-based spiders
* New features for feed exports

Backward-incompatible changes

429 is now part of the RETRY_HTTP_CODES setting by default.
This change is backward incompatible. If you don’t want to retry 429, you must
override RETRY_HTTP_CODES accordingly.
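
A minimal sketch of such an override, assuming a project settings module
(conventionally settings.py) and assuming the default list in this release is
500, 502, 503, 504, 522, 524, 408 and 429; check the defaults of the installed
version before copying the list:

```python
# settings.py of a Scrapy project (file name assumed from the usual layout).
# Assumed default list with 429 removed, so "429 Too Many Requests"
# responses are passed to the spider instead of being retried.
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408]
```

Any status code left out of the list is simply not retried by the retry
middleware.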

Crawler, CrawlerRunner.crawl and CrawlerRunner.create_crawler no longer accept a
Spider subclass instance; they only accept a Spider subclass now.
Spider subclass instances were never meant to work, and they were not working as
one would expect: instead of using the passed Spider subclass instance, their
from_crawler method was called to generate a new instance.

Non-default values for the SCHEDULER_PRIORITY_QUEUE setting may stop working.
Scheduler priority queue classes now need to handle Request objects instead of
arbitrary Python data structures.

New features

* A new scheduler priority queue, scrapy.pqueues.DownloaderAwarePriorityQueue,
  may be enabled for a significant scheduling improvement on crawls targeting
  multiple web domains, at the cost of no CONCURRENT_REQUESTS_PER_IP support
  (issue 3520)
* A new Request.cb_kwargs attribute provides a cleaner way to pass keyword
  arguments to callback methods (issue 1138, issue 3563)
* A new JSONRequest class offers a more convenient way to build JSON requests
  (issue 3504, issue 3505)
* A process_request callback passed to the Rule constructor now receives the
  Response object that originated the request as its second argument
  (issue 3682)
* A new restrict_text parameter for the LinkExtractor constructor allows
  filtering links by linking text (issue 3622, issue 3635)
* A new FEED_STORAGE_S3_ACL setting allows defining a custom ACL for feeds
  exported to Amazon S3 (issue 3607)
* A new FEED_STORAGE_FTP_ACTIVE setting allows using FTP’s active connection
  mode for feeds exported to FTP servers (issue 3829)
* A new METAREFRESH_IGNORE_TAGS setting allows overriding which HTML tags are
  ignored when searching a response for HTML meta tags that trigger a redirect
  (issue 1422, issue 3768)
* A new redirect_reasons request meta key exposes the reason (status code, meta
  refresh) behind every followed redirect (issue 3581, issue 3687)
* The SCRAPY_CHECK variable is now set to the true string during runs of the
  check command, which allows detecting contract check runs from code
  (issue 3704, issue 3739)
* A new Item.deepcopy() method makes it easier to deep-copy items (issue 1493,
  issue 3671)
* CoreStats also logs elapsed_time_seconds now (issue 3638)
* Exceptions from ItemLoader input and output processors are now more verbose
  (issue 3836, issue 3840)
* Crawler, CrawlerRunner.crawl and CrawlerRunner.create_crawler now fail
  gracefully if they receive a Spider subclass instance instead of the subclass
  itself (issue 2283, issue 3610, issue 3872)
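
As an illustration of the SCRAPY_CHECK item above, the sketch below shows one
way code could detect a contract check run. It assumes SCRAPY_CHECK is exposed
as an environment variable holding the string "true", as the note suggests;
the helper name running_contract_check is hypothetical and not part of Scrapy:

```python
import os


def running_contract_check(environ=None):
    """Return True if the environment suggests `scrapy check` is running.

    Hypothetical helper: it only inspects SCRAPY_CHECK, which the release
    notes describe as being set to the string "true" during check runs.
    """
    env = os.environ if environ is None else environ
    return env.get("SCRAPY_CHECK") == "true"


if running_contract_check():
    # Example adjustment: skip expensive setup during contract checks.
    print("contract check run detected")
```

Passing a plain dict as environ makes the helper easy to exercise without
touching the real process environment.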

Bug fixes

* process_spider_exception() is now also invoked for generators (issue 220,
  issue 2061)
* System exceptions like KeyboardInterrupt are no longer caught (issue 3726)
* ItemLoader.load_item() no longer makes later calls to
  ItemLoader.get_output_value() or ItemLoader.load_item() return empty data
  (issue 3804, issue 3819)
* The images pipeline (ImagesPipeline) no longer ignores these Amazon S3
  settings: AWS_ENDPOINT_URL, AWS_REGION_NAME, AWS_USE_SSL, AWS_VERIFY
  (issue 3625)
* Fixed a memory leak in MediaPipeline affecting, for example, non-200 responses
  and exceptions from custom middlewares (issue 3813)
* Requests with private callbacks are now correctly unserialized from disk
  (issue 3790)
* FormRequest.from_response() now handles invalid methods like major web browsers