./www/crawl, Small and efficient HTTP crawler


Branch: CURRENT, Version: 0.4nb12, Package name: crawl-0.4nb12, Maintainer: pkgsrc-users

The crawl utility starts a depth-first traversal of the web at the specified
URLs. It stores all JPEG images that match the configured constraints.
Crawl is fairly fast and allows for graceful termination. After terminating
crawl, it is possible to restart it at exactly the same spot where it was
terminated. Crawl keeps a persistent database that allows multiple crawls
without revisiting sites.

The main features of crawl are:

* Saves encountered images or other media types
* Media selection based on regular expressions and size constraints
* Resume previous crawl after graceful termination
* Persistent database of visited URLs
* Very small and efficient code
* Asynchronous DNS lookups
* Supports robots.txt
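
The traversal and filtering described above can be illustrated with a small sketch. This is not crawl's actual code (crawl itself is not written in Python); it is a minimal model under stated assumptions: the "web" is a hypothetical in-memory dictionary, the `visited` set stands in for the persistent database, and the names `crawl`, `EXAMPLE_WEB`, and `fetch_example` are invented for illustration.

```python
import re
from typing import Callable, Dict, List, Pattern, Set, Tuple

# Hypothetical page model: (links found on the page, size in bytes).
Page = Tuple[List[str], int]

# Hypothetical in-memory stand-in for the web, so the sketch needs no network.
EXAMPLE_WEB: Dict[str, Page] = {
    "http://example.test/": (
        ["http://example.test/a", "http://example.test/b.jpg"], 500),
    "http://example.test/a": (
        ["http://example.test/big.jpg", "http://example.test/"], 300),
    "http://example.test/b.jpg": ([], 20_000),
    "http://example.test/big.jpg": ([], 900_000),
}


def fetch_example(url: str) -> Page:
    """Look up a URL in the example web; unknown URLs are empty pages."""
    return EXAMPLE_WEB.get(url, ([], 0))


def crawl(start_urls: List[str],
          fetch: Callable[[str], Page],
          visited: Set[str],
          media_re: Pattern[str],
          min_size: int,
          max_size: int) -> List[str]:
    """Depth-first traversal that returns URLs matching the media constraints.

    `visited` is supplied by the caller and updated in place, so passing the
    same set to a later call skips everything already seen (a stand-in for
    a persistent database of visited URLs).
    """
    saved: List[str] = []
    # Explicit stack gives depth-first order; reversed so the first
    # start URL is popped first.
    stack = list(reversed(start_urls))
    while stack:
        url = stack.pop()
        if url in visited:
            continue
        visited.add(url)
        links, size = fetch(url)
        # Media selection: regular expression on the URL plus size bounds.
        if media_re.search(url) and min_size <= size <= max_size:
            saved.append(url)
        # Push links in reverse so the first link on the page is explored next.
        stack.extend(reversed(links))
    return saved
```

Calling `crawl(["http://example.test/"], fetch_example, visited, re.compile(r"\.jpe?g$"), 1_000, 100_000)` with an empty `visited` set selects only the JPEG within the size bounds; a second call with the same set returns nothing, because every URL has already been recorded.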


Required to build:
[pkgtools/cwrappers]

Master sites:

SHA1: b53be27b572ba6a88ab80243b177873aed0b314b
RMD160: c86898b66c661e6b841170114deba4d8f076651d
Filesize: 108.48 KB


CVS history:


   2017-08-01 16:59:08 by Thomas Klausner | Files touched by this commit (211)
Log message:
Follow some http -> https redirects.
   2016-03-05 12:29:49 by Jonathan Perkin | Files touched by this commit (1813) | Package updated
Log message:
Bump PKGREVISION for security/openssl ABI bump.
   2015-11-04 03:47:43 by Alistair G. Crooks | Files touched by this commit (758)
Log message:
Add SHA512 digests for distfiles for www category

Problems found locating distfiles:
	Package haskell-cgi: missing distfile haskell-cgi-20001206.tar.gz
	Package nginx: missing distfile array-var-nginx-module-0.04.tar.gz
	Package nginx: missing distfile encrypted-session-nginx-module-0.04.tar.gz
	Package nginx: missing distfile headers-more-nginx-module-0.261.tar.gz
	Package nginx: missing distfile nginx_http_push_module-0.692.tar.gz
	Package nginx: missing distfile set-misc-nginx-module-0.29.tar.gz
	Package nginx-devel: missing distfile echo-nginx-module-0.58.tar.gz
	Package nginx-devel: missing distfile form-input-nginx-module-0.11.tar.gz
	Package nginx-devel: missing distfile lua-nginx-module-0.9.16.tar.gz
	Package nginx-devel: missing distfile nginx_http_push_module-0.692.tar.gz
	Package nginx-devel: missing distfile set-misc-nginx-module-0.29.tar.gz
	Package php-owncloud: missing distfile owncloud-8.2.0.tar.bz2

Otherwise, existing SHA1 digests verified and found to be the same on
the machine holding the existing distfiles (morden).  All existing
SHA1 digests retained for now as an audit trail.
   2014-02-13 00:18:57 by Matthias Scheler | Files touched by this commit (1568)
Log message:
Recursive PKGREVISION bump for OpenSSL API version bump.
   2013-02-07 00:24:19 by Jonathan Perkin | Files touched by this commit (1351) | Package updated
Log message:
PKGREVISION bumps for the security/openssl 1.0.1d update.
   2012-10-28 07:31:10 by Aleksej Saushev | Files touched by this commit (600)
Log message:
Drop superfluous PKG_DESTDIR_SUPPORT, "user-destdir" is default these days.
   2011-02-11 22:22:05 by Tobias Nygren | Files touched by this commit (18) | Package updated
Log message:
revbump(1) for devel/libevent update.
   2010-12-21 09:17:28 by OBATA Akio | Files touched by this commit (1)
Log message:
Not to use BDB check in configure.
Fixes PR#44244.