./www/py-beautifulsoup4, HTML/XML Parser for Python, version 4



Branch: CURRENT, Version: 4.9.3, Package name: py37-beautifulsoup4-4.9.3, Maintainer: pkgsrc-users

Beautiful Soup is a Python library designed for quick turnaround projects like
screen-scraping. Three features make it powerful:

* Beautiful Soup provides a few simple methods and Pythonic idioms for
navigating, searching, and modifying a parse tree: a toolkit for dissecting a
document and extracting what you need. It doesn't take much code to write an
application.
* Beautiful Soup automatically converts incoming documents to Unicode and
outgoing documents to UTF-8. You don't have to think about encodings, unless
the document doesn't specify an encoding and Beautiful Soup can't autodetect
one. Then you just have to specify the original encoding.
* Beautiful Soup sits on top of popular Python parsers like lxml and html5lib,
allowing you to try out different parsing strategies or trade speed for
flexibility.
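
As a minimal sketch of the last two points, the snippet below names the parser
explicitly and supplies the original encoding by hand. The sample bytes are
made up for illustration; "html.parser" is used because it ships with Python,
whereas "lxml" and "html5lib" would need to be installed.

```python
from bs4 import BeautifulSoup

# Made-up input: Latin-1 bytes with no declared encoding.
raw = "<p>caf\xe9</p>".encode("latin-1")

# The second argument selects the parser; from_encoding states the
# original encoding instead of leaving it to autodetection.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.get_text()          # decoded to a Unicode str
encoded = soup.p.encode("utf-8")  # written back out as UTF-8 bytes
```

Swapping "html.parser" for another installed parser name is the only change
needed to try a different parsing strategy.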

Beautiful Soup parses anything you give it, and does the tree traversal stuff
for you. You can tell it "Find all the links", or "Find all the links of class
externalLink", or "Find all the links whose URLs match 'foo.com'", or "Find the
table heading that's got bold text, then give me that text."
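
The four requests above might be written roughly as follows. This is only a
sketch: the sample markup is invented, and there are other ways to express
each query.

```python
import re
from bs4 import BeautifulSoup

# Invented sample document.
html = """
<a href="http://foo.com/a" class="externalLink">A</a>
<a href="/local">B</a>
<table><tr><th><b>Name</b></th><th>Plain</th></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# "Find all the links"
all_links = soup.find_all("a")

# "Find all the links of class externalLink"
external = soup.find_all("a", class_="externalLink")

# "Find all the links whose URLs match 'foo.com'"
foo_links = soup.find_all("a", href=re.compile(r"foo\.com"))

# "Find the table heading that's got bold text, then give me that text"
bold_heading = next(
    th for th in soup.find_all("th") if th.find("b") is not None
)
heading_text = bold_heading.get_text()
```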

Valuable data that was once locked up in poorly-designed websites is now within
your reach. Projects that would have taken hours take only minutes with
Beautiful Soup.


Required to run:
[devel/py-setuptools] [textproc/py-lxml] [lang/python37] [www/py-soupsieve]

Required to build:
[pkgtools/cwrappers]

Master sites:

SHA1: 2a2beb32b1457245fff614adc90fa24fdfb37c2d
RMD160: 825e3830c785519220eab2998eb83f396cae13fd
Filesize: 367.218 KB


CVS history:


   2020-10-03 20:11:59 by Adam Ciarcinski | Files touched by this commit (2) | Package updated
Log message:
py-beautifulsoup4: updated to 4.9.3

4.9.3:
* Implemented a significant performance optimization to the process of
  searching the parse tree.

   2020-09-29 20:47:30 by Adam Ciarcinski | Files touched by this commit (2) | Package updated
Log message:
py-beautifulsoup4: updated to 4.9.2

4.9.2

* Fixed a bug that caused too many tags to be popped from the tag
  stack during tree building, when encountering a closing tag that had
  no matching opening tag.

* Fixed a bug that inconsistently moved elements over when passing
  a Tag, rather than a list, into Tag.extend().

* Specify the soupsieve dependency in a way that complies with
  PEP 508. Patch by Mike Nerone.

* Change the signatures for BeautifulSoup.insert_before and insert_after
  (which are not implemented) to match PageElement.insert_before and
  insert_after, quieting warnings in some IDEs.
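
To illustrate the Tag.extend() item, here is a small sketch of passing a Tag
rather than a list. It assumes Beautiful Soup 4.9.2 or later, where every
child of the source tag is moved over; the markup and variable names are
invented.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<ul id="a"><li>1</li></ul><ul id="b"><li>2</li><li>3</li></ul>',
    "html.parser",
)
first = soup.find(id="a")
second = soup.find(id="b")

# Passing the Tag itself moves all of its children into `first`.
first.extend(second)

items = [li.get_text() for li in first.find_all("li")]
remaining = len(second.contents)  # the source tag is left empty
```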

   2020-05-27 15:00:40 by Adam Ciarcinski | Files touched by this commit (2) | Package updated
Log message:
py-beautifulsoup4: updated to 4.9.1

4.9.1:

* Added a keyword argument 'on_duplicate_attribute' to the
  BeautifulSoupHTMLParser constructor (used by the html.parser tree
  builder) which lets you customize the handling of markup that
  contains the same attribute more than once, as in:
  <a href="url1" href="url2">

* Added a distinct subclass, GuessedAtParserWarning, for the warning
  issued when BeautifulSoup is instantiated without a parser being
  specified.

* Added a distinct subclass, MarkupResemblesLocatorWarning, for the
  warning issued when BeautifulSoup is instantiated with 'markup' that
  actually seems to be a URL or the path to a file on
  disk.

* The new NavigableString subclasses (Stylesheet, Script, and
  TemplateString) can now be imported directly from the bs4 package.

* If you encode a document with a Python-specific encoding like
  'unicode_escape', that encoding is no longer mentioned in the final
  XML or HTML document. Instead, encoding information is omitted or
  left blank.

* Fixed test failures when run against soupsieve 2.0.
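
A short sketch of two of the 4.9.1 additions: the 'on_duplicate_attribute'
argument, passed here through the BeautifulSoup constructor to the html.parser
tree builder, and the GuessedAtParserWarning subclass. The helper name
`collect` and the sample markup are made up for illustration.

```python
import warnings
from bs4 import BeautifulSoup, GuessedAtParserWarning

markup = '<a href="url1" href="url2">link</a>'

# Default behaviour: the last value replaces earlier ones.
last_wins = BeautifulSoup(markup, "html.parser").a["href"]

# "ignore" keeps the first value and drops later duplicates.
first_wins = BeautifulSoup(
    markup, "html.parser", on_duplicate_attribute="ignore"
).a["href"]

def collect(attributes_so_far, key, value):
    """Hypothetical callback: gather every duplicate value into a list."""
    if not isinstance(attributes_so_far[key], list):
        attributes_so_far[key] = [attributes_so_far[key]]
    attributes_so_far[key].append(value)

all_values = BeautifulSoup(
    markup, "html.parser", on_duplicate_attribute=collect
).a["href"]

# Omitting the parser triggers the dedicated warning subclass, which
# can now be caught or filtered separately from other warnings.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    BeautifulSoup("<p>hi</p>")  # no parser named, so one is guessed
guessed = [
    w for w in caught if issubclass(w.category, GuessedAtParserWarning)
]
```

The same pattern should apply to MarkupResemblesLocatorWarning, for example
with warnings.filterwarnings("ignore", category=...).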

   2020-04-28 23:16:14 by David H. Gutteridge | Files touched by this commit (2) | Package updated
Log message:
py-beautifulsoup4: update to 4.9.0

4.9.0 (20200405)

* Added PageElement.decomposed, a new property which lets you
  check whether you've already called decompose() on a Tag or
  NavigableString.

* Embedded CSS and JavaScript are now stored in distinct Stylesheet and
  Script objects (NavigableString subclasses), which are ignored by
  methods like get_text(). This feature is not supported by the html5lib
  treebuilder. [bug=1868861]

* Added a Russian translation by 'authoress' to the repository.

* Fixed an unhandled exception when formatting a Tag that had been
  decomposed. [bug=1857767]

* Fixed a bug that happened when passing a Unicode filename containing
  non-ASCII characters as markup into Beautiful Soup, on a system that
  allows Unicode filenames. [bug=1866717]

* Added a performance optimization to PageElement.extract(). Patch by
  Arthur Darcet.
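
The first two 4.9.0 items can be seen in a few lines. This is a sketch with
invented markup, using html.parser since the changelog notes that html5lib
does not support the Script/Stylesheet distinction.

```python
from bs4 import BeautifulSoup
from bs4.element import Script, Stylesheet

soup = BeautifulSoup(
    "<style>p {color: red}</style><script>var x = 1;</script><p>Visible</p>",
    "html.parser",
)

# Embedded CSS and JavaScript are left out of get_text().
visible = soup.get_text()
script_type = type(soup.script.string)
style_type = type(soup.style.string)

# PageElement.decomposed reports whether decompose() was already called.
paragraph = soup.p
before = paragraph.decomposed
paragraph.decompose()
after = paragraph.decomposed
```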

   2020-01-08 22:08:26 by Adam Ciarcinski | Files touched by this commit (3) | Package updated
Log message:
py-beautifulsoup4: updated to 4.8.2

4.8.2:

* Added Python docstrings to all public methods of the most commonly
  used classes.

* Added a Chinese translation by Deron Wang and a Brazilian Portuguese
  translation by Cezar Peixeiro to the repository.

* Fixed two deprecation warnings.

* The html.parser tree builder now correctly handles DOCTYPEs that are
  not uppercase.

* PageElement.select() now returns a ResultSet rather than a regular
  list, making it consistent with methods like find_all().
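
A quick way to observe the select() change, assuming 4.8.2 or later with
soupsieve installed; the markup is invented.

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet

soup = BeautifulSoup('<p class="x">a</p><p>b</p>', "html.parser")

selected = soup.select("p.x")            # CSS selector
found = soup.find_all("p", class_="x")   # keyword search

selected_text = [tag.get_text() for tag in selected]
```

Both results still behave like lists, so existing code that indexes or
iterates over them should be unaffected.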

   2019-10-15 19:21:35 by Adam Ciarcinski | Files touched by this commit (3) | Package updated
Log message:
py-beautifulsoup4: updated to 4.8.1

4.8.1:

* When the html.parser or html5lib parsers are in use, Beautiful Soup
  will, by default, record the position in the original document where
  each tag was encountered. This includes line number (Tag.sourceline)
  and position within a line (Tag.sourcepos).  Based on code by Chris
  Mayo.

* When instantiating a BeautifulSoup object, it's now possible to
  provide a dictionary ('element_classes') of the classes you'd like to be
  instantiated instead of Tag, NavigableString, etc.

* Fixed the definition of the default XML namespace when using
  lxml 4.4. Patch by Isaac Muse.

* Fixed a crash when pretty-printing tags that were not created
  during initial parsing.

* Copying a Tag preserves information that was originally obtained from
  the TreeBuilder used to build the original Tag.

* Raise an explanatory exception when the underlying parser
  completely rejects the incoming markup.

* Avoid a crash when trying to detect the declared encoding of a
  Unicode document.

* Avoid a crash when unpickling certain parse trees generated
  using html5lib on Python 3.
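
The first two 4.8.1 items might be used like this. The sketch relies on
html.parser, and the subclass names MyTag and MyString are placeholders.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

markup = "<p>first</p>\n<p>second</p>"

# Tag.sourceline and Tag.sourcepos record where each tag started.
soup = BeautifulSoup(markup, "html.parser")
positions = [(p.sourceline, p.sourcepos) for p in soup.find_all("p")]

# Placeholder subclasses to be instantiated instead of the defaults.
class MyTag(Tag):
    pass

class MyString(NavigableString):
    pass

custom = BeautifulSoup(
    markup,
    "html.parser",
    element_classes={Tag: MyTag, NavigableString: MyString},
)
tag_is_custom = isinstance(custom.p, MyTag)
string_is_custom = isinstance(custom.p.string, MyString)
```

With html.parser, line numbers start at 1 and positions within a line start
at 0.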

   2019-07-21 10:05:32 by Adam Ciarcinski | Files touched by this commit (3) | Package updated
Log message:
py-beautifulsoup4: updated to 4.8.0

4.8.0:

This release focuses on making it easier to customize Beautiful Soup's
input mechanism (the TreeBuilder) and output mechanism (the Formatter).

* You can customize the TreeBuilder object by passing keyword
  arguments into the BeautifulSoup constructor. Those keyword
  arguments will be passed along into the TreeBuilder constructor.

  The main reason to do this right now is to change which attributes
  are treated as multi-valued attributes (the way 'class' is treated
  by default). You can do this with the 'multi_valued_attributes'
  argument.

* The role of Formatter objects has been greatly expanded. The Formatter
  class now controls the following:

  - The function to call to perform entity substitution. (This was
    previously Formatter's only job.)
  - Which tags should be treated as containing CDATA and have their
    contents exempt from entity substitution.
  - The order in which a tag's attributes are output.
  - Whether or not to put a '/' inside a void element, e.g. '<br/>' vs
    '<br>'.

  All preexisting code should work as before.

* Added a new method to the API, Tag.smooth(), which consolidates
  multiple adjacent NavigableString elements.

* &apos; (which is valid in XML, XHTML, and HTML 5, but not HTML 4) is always
  recognized as a named entity and converted to a single quote.
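
The 4.8.0 changes can be tried out with a few short calls. This is a sketch
under the assumption that html.parser is the tree builder; the markup and
variable names are invented.

```python
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter

markup = '<p class="body strikeout"></p>'

# TreeBuilder customisation: keyword arguments are passed along, so
# multi_valued_attributes=None turns off list-valued attributes.
class_default = BeautifulSoup(markup, "html.parser").p["class"]
class_plain = BeautifulSoup(
    markup, "html.parser", multi_valued_attributes=None
).p["class"]

# Formatter customisation: drop the '/' from void elements on output.
no_slash = HTMLFormatter(void_element_close_prefix="")
br = BeautifulSoup("<br>", "html.parser").br
with_slash_out = br.decode()
no_slash_out = br.decode(formatter=no_slash)

# Tag.smooth() merges adjacent NavigableString elements.
soup = BeautifulSoup("<p>A one</p>", "html.parser")
soup.p.append(", a two")
before = len(soup.p.contents)
soup.smooth()
after = len(soup.p.contents)

# &apos; is converted to a single quote.
apos = BeautifulSoup("<p>it&apos;s</p>", "html.parser").p.get_text()
```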

   2019-01-08 10:30:44 by Adam Ciarcinski | Files touched by this commit (2) | Package updated
Log message:
py-beautifulsoup4: updated to 4.7.1

4.7.1:

* Fixed a significant performance problem introduced in 4.7.0.

* Fixed an incorrectly raised exception when inserting a tag before or
  after an identical tag.

* Beautiful Soup will no longer try to keep track of namespaces that
  are not defined with a prefix; this can confuse soupsieve.

* Tried even harder to avoid the deprecation warning originally fixed in
  4.6.1.