./www/py-scrapy, High-level Web Crawling and Web Scraping framework



Branch: CURRENT, Version: 1.5.0, Package name: py27-scrapy-1.5.0, Maintainer: pkgsrc-users

Scrapy is a fast high-level web crawling and web scraping framework, used to
crawl websites and extract structured data from their pages. It can be used for
a wide range of purposes, from data mining to monitoring and automated testing.


Required to run:
[net/py-twisted] [security/py-OpenSSL] [devel/py-setuptools] [textproc/py-lxml] [lang/python27] [textproc/py-cssselect] [lang/py-six] [security/py-service_identity] [www/py-parsel] [www/py-w3lib] [devel/py-pydispatcher] [devel/py-queuelib]

Required to build:
[pkgtools/cwrappers]

Master sites:

SHA1: 466a6e502585507f0bdd711043a5474ba0f3899d
RMD160: 083f584cbe11a9382eef6314829f891f3b2a3b9d
Filesize: 884.218 KB

CVS history:


   2018-01-04 22:31:41 by Adam Ciarcinski | Files touched by this commit (3) | Package updated
Log message:
py-scrapy: updated to 1.5.0

Scrapy 1.5.0:
This release brings small new features and improvements across the codebase.
Some highlights:

* Google Cloud Storage is supported in FilesPipeline and ImagesPipeline.
* Crawling with proxy servers becomes more efficient, as connections
  to proxies can be reused now.
* Warning, exception and logging messages are improved to make debugging
  easier.
* The scrapy parse command now allows setting custom request meta via
  the --meta argument.
* Compatibility with Python 3.6, PyPy and PyPy3 is improved;
  PyPy and PyPy3 are now supported officially, by running tests on CI.
* Better default handling of HTTP 308, 522 and 524 status codes.
* Documentation is improved, as usual.

Backwards Incompatible Changes
* Scrapy 1.5 drops support for Python 3.3.
* The default Scrapy User-Agent now uses an https link to scrapy.org.
  **This is technically backwards-incompatible**; override
  :setting:USER_AGENT if you relied on the old value.
* Logging of settings overridden by custom_settings is fixed;
  **this is technically backwards-incompatible** because the logger
  changes from [scrapy.utils.log] to [scrapy.crawler]. If you're
  parsing Scrapy logs, please update your log parsers.
* LinkExtractor now ignores the m4v extension by default; this is a change
  in behavior.
* 522 and 524 status codes are added to RETRY_HTTP_CODES.
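
The two settings named in this list can be pinned in a project's settings
module if the new defaults are not wanted. The fragment below is only a sketch:
the bot name, the contact URL, and the chosen list of retry codes are
illustrative assumptions, not values taken from the release notes.

```python
# settings.py (fragment) -- a hypothetical project configuration.

# Pin the User-Agent explicitly so the project does not depend on the
# default string that changed in Scrapy 1.5. The value is an assumption.
USER_AGENT = "examplebot/1.0 (+https://example.com/bot)"

# Retry only these status codes. Leaving out 522 and 524 opts out of the
# two codes that Scrapy 1.5 adds; the remaining codes are an assumption
# about what this hypothetical project wants retried.
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]
```

Projects that are happy with the new defaults need neither setting.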

New features
- Support <link> tags in Response.follow
- Support for ptpython REPL
- Google Cloud Storage support for FilesPipeline and ImagesPipeline
- New --meta option of the "scrapy parse" command allows passing additional
  request.meta
- Populate spider variable when using shell.inspect_response
- Handle HTTP 308 Permanent Redirect
- Add 522 and 524 to RETRY_HTTP_CODES
- Log versions information at startup
- scrapy.mail.MailSender now works in Python 3 (it requires Twisted 17.9.0)
- Connections to proxy servers are reused
- Add template for a downloader middleware
- Explicit message for NotImplementedError when parse callback not defined
- CrawlerProcess got an option to disable installation of root log handler
- LinkExtractor now ignores m4v extension by default
- Better log messages for responses over :setting:DOWNLOAD_WARNSIZE and
  :setting:DOWNLOAD_MAXSIZE limits
- Show a warning when a URL is put in Spider.allowed_domains instead of
  a domain.
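
The allowed_domains warning in the last item is triggered by entries such as
"https://example.com/catalog/" where only "example.com" is expected. A minimal
sketch of cleaning such a list with the standard library follows; the helper
name to_domain and the sample entries are hypothetical and not part of Scrapy.

```python
from urllib.parse import urlparse


def to_domain(entry):
    """Return the bare host name for an allowed_domains entry."""
    # urlparse only fills in the hostname when a scheme separator is
    # present, so plain domains are passed through unchanged.
    if "://" in entry:
        return urlparse(entry).hostname
    return entry


# Hypothetical input mixing a full URL with a plain domain.
raw_entries = ["https://example.com/catalog/", "example.org"]
allowed_domains = [to_domain(entry) for entry in raw_entries]
```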

Bug fixes
- Fix logging of settings overridden by custom_settings;
  **this is technically backwards-incompatible** because the logger
  changes from [scrapy.utils.log] to [scrapy.crawler], so please
  update your log parsers if needed
- The default Scrapy User-Agent now uses an https link to scrapy.org.
  **This is technically backwards-incompatible**; override
  :setting:USER_AGENT if you relied on the old value.
- Fix PyPy and PyPy3 test failures, support them officially
- Fix DNS resolver when DNSCACHE_ENABLED=False
- Add cryptography for Debian Jessie tox test env
- Add verification to check if Request callback is callable
- Port extras/qpsclient.py to Python 3
- Use getfullargspec under the hood for Python 3 to stop DeprecationWarning
- Update deprecated test aliases
- Fix SitemapSpider support for alternate links
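
For the logger rename described in the notes above, a log parser can accept
both logger names during the transition. The sketch below assumes log lines
of the form "<date> <time> [<logger>] <LEVEL>: <message>" and a message that
begins with "Overridden settings:"; the sample lines are made up for
illustration, and the function name is hypothetical.

```python
import re

# Assumed line layout: "<date> <time> [<logger>] <LEVEL>: <message>".
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+ \S+) \[(?P<logger>[^\]]+)\] "
    r"(?P<level>[A-Z]+): (?P<message>.*)$"
)

# Accept the logger name used before 1.5 and the one used from 1.5 on.
OVERRIDE_LOGGERS = {"scrapy.utils.log", "scrapy.crawler"}
PREFIX = "Overridden settings:"


def overridden_settings_message(line):
    """Return the text after 'Overridden settings:', or None if no match."""
    match = LOG_LINE.match(line)
    if match is None or match.group("logger") not in OVERRIDE_LOGGERS:
        return None
    message = match.group("message")
    if not message.startswith(PREFIX):
        return None
    return message[len(PREFIX):].strip()


# Made-up sample lines, one per logger name.
old_line = "2017-05-20 08:25:36 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'demo'}"
new_line = "2018-01-04 22:31:41 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'demo'}"
```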
   2017-09-04 20:08:31 by Thomas Klausner | Files touched by this commit (163)
Log message:
Follow some redirects.
   2017-05-20 08:25:36 by Adam Ciarcinski | Files touched by this commit (4)
Log message:
Scrapy 1.4 does not bring that many breathtaking new features
but quite a few handy improvements nonetheless.

Scrapy now supports anonymous FTP sessions with customizable user and
password via the new :setting:`FTP_USER` and :setting:`FTP_PASSWORD` settings.
And if you're using Twisted version 17.1.0 or above, FTP is now available
with Python 3.

There's a new :meth:`response.follow <scrapy.http.TextResponse.follow>` method
for creating requests; **it is now a recommended way to create Requests
in Scrapy spiders**. This method makes it easier to write correct
spiders; ``response.follow`` has several advantages over creating
``scrapy.Request`` objects directly:

* it handles relative URLs;
* it works properly with non-ascii URLs on non-UTF8 pages;
* in addition to absolute and relative URLs it supports Selectors;
  for ``<a>`` elements it can also extract their href values.
   2017-03-19 23:59:11 by Adam Ciarcinski | Files touched by this commit (2)
Log message:
Changes 1.3.3:
Bug fixes
- Make ``SpiderLoader`` raise ``ImportError`` again by default for missing
  dependencies and wrong :setting:`SPIDER_MODULES`.
  These exceptions were silenced as warnings since 1.3.0.
  A new setting is introduced to toggle between a warning and an exception if
  needed; see :setting:`SPIDER_LOADER_WARN_ONLY` for details.
   2017-02-13 22:25:33 by Adam Ciarcinski | Files touched by this commit (4)
Log message:
Added www/py-scrapy version 1.3.2

Scrapy is a fast high-level web crawling and web scraping framework, used to
crawl websites and extract structured data from their pages. It can be used for
a wide range of purposes, from data mining to monitoring and automated testing.