./wip/py-goose3, HTML Content / Article Extractor, web scraping for Python 3

Branch: CURRENT, Version: 3.1.6, Package name: py312-goose3-3.1.6, Maintainer: kamelderouiche

Goose was originally an article extractor written in Java that was
converted to a Scala project in August 2011.

This is a complete rewrite in Python. The aim of the software is to take
any news article or article-type web page and extract not only the main
body of the article but also all metadata and the most probable image
candidate.

Goose will try to extract the following information:

- Main text of an article
- Main image of the article
- Any YouTube/Vimeo videos embedded in the article
- Meta Description
- Meta tags
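
To make the metadata items in the list above concrete, here is a minimal sketch that uses only the Python standard library. It is not Goose's implementation and does not use the goose3 API. The names MetaSketch, extract_metadata and SAMPLE are hypothetical, and treating the Open Graph og:image tag as the main image is a simplifying assumption. Main text extraction is left out because it needs content scoring that a short sketch cannot show.

```python
from html.parser import HTMLParser


class MetaSketch(HTMLParser):
    """Hypothetical helper: collects <meta> pairs and video iframe URLs."""

    # Assumption: a video embed is an <iframe> whose src names one of these hosts.
    VIDEO_HOSTS = ("youtube.com", "youtu.be", "vimeo.com")

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.videos = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Standard tags use name=..., Open Graph tags use property=...
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key.lower()] = attrs["content"]
        elif tag == "iframe":
            src = attrs.get("src") or ""
            if any(host in src for host in self.VIDEO_HOSTS):
                self.videos.append(src)


def extract_metadata(html):
    """Return description, image candidate, all meta tags, and video URLs."""
    parser = MetaSketch()
    parser.feed(html)
    parser.close()
    return {
        "description": parser.meta.get("description"),
        "image": parser.meta.get("og:image"),  # stand-in for image selection
        "meta": parser.meta,
        "videos": parser.videos,
    }


# Hypothetical sample page used only for illustration.
SAMPLE = """
<html><head>
<meta name="description" content="A short summary.">
<meta name="keywords" content="news, example">
<meta property="og:image" content="https://example.com/lead.jpg">
</head><body>
<p>Article body.</p>
<iframe src="https://www.youtube.com/embed/abc123"></iframe>
<iframe src="https://example.com/ad"></iframe>
</body></html>
"""

if __name__ == "__main__":
    for field, value in extract_metadata(SAMPLE).items():
        print(field, "=", value)
```

The package itself does considerably more than this, for example scoring text blocks to find the article body and ranking several image candidates.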


Master sites:

RMD160: fd1f29c623d95610f98737abe473a31b9ac48da2
Filesize: 36.84 KB
