./www/crawl, Small and efficient HTTP crawler

[ CVSweb ] [ Homepage ] [ RSS ] [ Required by ] [ Add to tracker ]


Branch: CURRENT, Version: 0.4nb13, Package name: crawl-0.4nb13, Maintainer: pkgsrc-users

The crawl utility starts a depth-first traversal of the web at the specified
URLs. It stores all JPEG images that match the configured constraints.
Crawl is fairly fast and allows for graceful termination. After terminating
crawl, it is possible to restart it at exactly the same spot where it was
terminated. Crawl keeps a persistent database that allows multiple crawls
without revisiting sites.

The main features of crawl are:

* Saves encountered images or other media types
* Media selection based on regular expressions and size contraints
* Resume previous crawl after graceful termination
* Persistent database of visited URLs
* Very small and efficient code
* Asynchronous DNS lookups
* Supports robots.txt


Required to run:
[devel/libevent]

Required to build:
[pkgtools/cwrappers]

Master sites:

Filesize: 108.48 KB

Version history: (Expand)


CVS history: (Expand)


   2021-10-26 13:31:15 by Nia Alarie | Files touched by this commit (1030)
Log message:
www: Replace RMD160 checksums with BLAKE2s checksums

All checksums have been double-checked against existing RMD160 and
SHA512 hashes

Not committed (merge conflicts):
www/nghttp2/distinfo

Unfetchable distfiles (almost certainly fetched conditionally...):
./www/nginx-devel/distinfo array-var-nginx-module-0.05.tar.gz
./www/nginx-devel/distinfo echo-nginx-module-0.62.tar.gz
./www/nginx-devel/distinfo encrypted-session-nginx-module-0.08.tar.gz
./www/nginx-devel/distinfo form-input-nginx-module-0.12.tar.gz
./www/nginx-devel/distinfo headers-more-nginx-module-0.33.tar.gz
./www/nginx-devel/distinfo lua-nginx-module-0.10.19.tar.gz
./www/nginx-devel/distinfo naxsi-1.3.tar.gz
./www/nginx-devel/distinfo nginx-dav-ext-module-3.0.0.tar.gz
./www/nginx-devel/distinfo nginx-rtmp-module-1.2.2.tar.gz
./www/nginx-devel/distinfo nginx_http_push_module-1.2.10.tar.gz
./www/nginx-devel/distinfo ngx_cache_purge-2.5.1.tar.gz
./www/nginx-devel/distinfo ngx_devel_kit-0.3.1.tar.gz
./www/nginx-devel/distinfo ngx_http_geoip2_module-3.3.tar.gz
./www/nginx-devel/distinfo njs-0.5.0.tar.gz
./www/nginx-devel/distinfo set-misc-nginx-module-0.32.tar.gz
./www/nginx/distinfo array-var-nginx-module-0.05.tar.gz
./www/nginx/distinfo echo-nginx-module-0.62.tar.gz
./www/nginx/distinfo encrypted-session-nginx-module-0.08.tar.gz
./www/nginx/distinfo form-input-nginx-module-0.12.tar.gz
./www/nginx/distinfo headers-more-nginx-module-0.33.tar.gz
./www/nginx/distinfo lua-nginx-module-0.10.19.tar.gz
./www/nginx/distinfo naxsi-1.3.tar.gz
./www/nginx/distinfo nginx-dav-ext-module-3.0.0.tar.gz
./www/nginx/distinfo nginx-rtmp-module-1.2.2.tar.gz
./www/nginx/distinfo nginx_http_push_module-1.2.10.tar.gz
./www/nginx/distinfo ngx_cache_purge-2.5.1.tar.gz
./www/nginx/distinfo ngx_devel_kit-0.3.1.tar.gz
./www/nginx/distinfo ngx_http_geoip2_module-3.3.tar.gz
./www/nginx/distinfo njs-0.5.0.tar.gz
./www/nginx/distinfo set-misc-nginx-module-0.32.tar.gz
   2021-10-07 17:09:00 by Nia Alarie | Files touched by this commit (1033)
Log message:
www: Remove SHA1 hashes for distfiles
   2020-01-18 22:51:16 by Jonathan Perkin | Files touched by this commit (1836)
Log message:
*: Recursive revision bump for openssl 1.1.1.
   2018-07-04 15:40:45 by Jonathan Perkin | Files touched by this commit (423)
Log message:
*: Move SUBST_STAGE from post-patch to pre-configure

Performing substitutions during post-patch breaks tools such as mkpatches,
making it very difficult to regenerate correct patches after making changes,
and often leading to substituted string replacements being committed.
   2017-08-01 16:59:08 by Thomas Klausner | Files touched by this commit (211)
Log message:
Follow some http -> https redirects.
   2016-03-05 12:29:49 by Jonathan Perkin | Files touched by this commit (1813)
Log message:
Bump PKGREVISION for security/openssl ABI bump.
   2015-11-04 03:47:43 by Alistair G. Crooks | Files touched by this commit (758)
Log message:
Add SHA512 digests for distfiles for www category

Problems found locating distfiles:
	Package haskell-cgi: missing distfile haskell-cgi-20001206.tar.gz
	Package nginx: missing distfile array-var-nginx-module-0.04.tar.gz
	Package nginx: missing distfile encrypted-session-nginx-module-0.04.tar.gz
	Package nginx: missing distfile headers-more-nginx-module-0.261.tar.gz
	Package nginx: missing distfile nginx_http_push_module-0.692.tar.gz
	Package nginx: missing distfile set-misc-nginx-module-0.29.tar.gz
	Package nginx-devel: missing distfile echo-nginx-module-0.58.tar.gz
	Package nginx-devel: missing distfile form-input-nginx-module-0.11.tar.gz
	Package nginx-devel: missing distfile lua-nginx-module-0.9.16.tar.gz
	Package nginx-devel: missing distfile nginx_http_push_module-0.692.tar.gz
	Package nginx-devel: missing distfile set-misc-nginx-module-0.29.tar.gz
	Package php-owncloud: missing distfile owncloud-8.2.0.tar.bz2

Otherwise, existing SHA1 digests verified and found to be the same on
the machine holding the existing distfiles (morden).  All existing
SHA1 digests retained for now as an audit trail.
   2014-02-13 00:18:57 by Matthias Scheler | Files touched by this commit (1568)
Log message:
Recursive PKGREVISION bump for OpenSSL API version bump.