./textproc/xapian-omega, Search engine application for websites using Xapian

Branch: CURRENT, Version: 1.4.14, Package name: xapian-omega-1.4.14, Maintainer: schmonz

Omega operates on a set of databases. Each database is created and
updated separately using either omindex or scriptindex. You can
search these databases (or any other Xapian database with suitable
contents) via a web front-end provided by omega, a CGI application.
A search can also be done over more than one database at once.

Required to run:
[lang/perl5] [devel/pcre] [textproc/xapian]

Required to build:

Master sites:

SHA1: 2e55e0b9f61862329fc936329731e2590326fa74
RMD160: b4ab9c0d8e78c921e0dc2f7c5147ff7be71145ed
Filesize: 527.863 KB

   2019-12-17 04:54:18 by Amitai Schleier | Files touched by this commit (2) | Package updated
Log message:
Update to 1.4.14. From the changelog:


* Improve omindex --help docs for --duplicates.
* Document that $log will start to return an error message in 1.5.0, and that
  one can wrap it using a $if with no action now to be future-proof.


* Add built-in support for iso-8859-15 so we can handle it without iconv.
  This charset is a variant of iso-8859-1 with 8 characters changed, most
  notably including the euro currency symbol.  It's the most commonly seen
  charset we didn't have built-in support for.
* Optimise converting us-ascii to UTF-8 to do nothing, like we already do when
  converting UTF-8 to UTF-8.
* scriptindex:
  + Add new 'gap' action which provides a way to leave a gap in the term
    positions between fields to prevent phrases and positional operators from
    matching across fields.
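
To illustrate the iso-8859-15 item above, the following Python sketch lists the byte values that decode differently under iso-8859-1 and iso-8859-15. It relies on Python's own codec tables, not on Omega's built-in converter, and the helper name `differing_bytes` is made up for this example.

```python
# Compare Python's iso-8859-1 and iso-8859-15 codecs byte by byte.
# This uses Python's codec tables, not Omega's built-in converter.

def differing_bytes():
    """Return {byte: (latin1_char, latin9_char)} for bytes that decode differently."""
    diffs = {}
    for b in range(256):
        raw = bytes([b])
        c1 = raw.decode("iso-8859-1")
        c15 = raw.decode("iso-8859-15")
        if c1 != c15:
            diffs[b] = (c1, c15)
    return diffs

diffs = differing_bytes()
for b, (c1, c15) in sorted(diffs.items()):
    # !a prints ASCII-safe escapes so this works on any terminal encoding.
    print(f"0x{b:02X}: {c1!a} -> {c15!a}")
```

Byte 0xA4 is the one that becomes the euro sign (U+20AC) in iso-8859-15.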


* Fix error handling in $lookup.  We now check for errors from cdb_init()
  and cdb_get().  We've never checked for errors from cdb_init(), while
  for cdb_get() this bug was introduced by a warning fix in 1.2.20.


* Future-proof use of $log against changes in 1.5.0.
   2019-08-11 15:25:21 by Thomas Klausner | Files touched by this commit (3557) | Package updated
Log message:
Bump PKGREVISIONs for perl 5.30.0
   2019-08-02 23:29:11 by Amitai Schleier | Files touched by this commit (2) | Package updated
Log message:
Update to 1.4.12. From the changelog:


* Improve docs for OmegaScript $hitlist{}.

* Fix RST formatting errors in omega docs.

* Clarify use of Q prefix for unique ID terms - it was described as \ 
  but the use of "Q" is really just a convention (and in fact omindex uses "U"
  not "Q").

* Clarify scriptindex's weight action takes parameter >= 0.

* Correct typo in OmegaScript $add parameter documentation.


* omindex:

  + Fix typo in mimetypes used for Apple iWork documents ("apply" instead of
    "apple") which meant that these documents weren't actually being indexed.
    Patch from Bruno Baruffaldi.

  + Pipe input to ps2pdf as this accepts input on stdin.  Possibility pointed
    out by Gaurav Arora.

* scriptindex:

  + If parsedate action's format includes %z adjust for the timezone if
    possible (this requires the non-POSIX tm_gmtoff member of struct tm)
    and flag an error for other platforms.

  + If parsedate action's format includes %Z flag an error as that doesn't
    seem to be usefully supported by strptime() anywhere.

  + Fix parsedate action to treat formats without a timezone as being UTC
    instead of localtime.

  + Add date=unixutc.  The existing date=unix works in localtime which is
    unhelpful if you want to use it on the output of parsedate since that's in
    UTC; date=unixutc is just like date=unix except it always works in UTC.

  + The date action now emits a warning for invalid values.  The documentation
    used to say "invalid values are ignored at present", but it's more helpful
    to flag bad data than quietly ignore it.

  + We now check the date action's parameter at script parse time and unknown
    values result in an error and nothing being indexed.  Previously an unknown
    format uselessly resulted in the terms D, M and Y literally being added to
    every document.

  + The split action now supports a new "prefixes" split style.  This gives all
    the prefixes from the split, so split=/,prefixes on a file path gives all
    parent directories.
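
The reason date=unixutc matters can be shown outside scriptindex: the same Unix timestamp falls on different calendar dates depending on the timezone used to interpret it. The sketch below uses Python's datetime module. The helper name `yyyymmdd`, the YYYYMMDD output format, and the UTC-5 zone are assumptions chosen for illustration, not Omega's code.

```python
from datetime import datetime, timedelta, timezone

def yyyymmdd(unix_ts, tz):
    """Format a Unix timestamp as YYYYMMDD in the given timezone."""
    return datetime.fromtimestamp(unix_ts, tz=tz).strftime("%Y%m%d")

# 01:00 UTC on 2019-08-10, expressed as seconds since the epoch.
ts = int(datetime(2019, 8, 10, 1, 0, tzinfo=timezone.utc).timestamp())

utc = timezone.utc
behind_utc = timezone(timedelta(hours=-5))  # hypothetical local timezone

date_in_utc = yyyymmdd(ts, utc)           # analogous to date=unixutc
date_in_local = yyyymmdd(ts, behind_utc)  # analogous to date=unix on such a host
print(date_in_utc, date_in_local)
```

Because the timestamp is only one hour past midnight UTC, a host five hours behind UTC still sees the previous day.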


* Remove documented limitation of $subdb and $subid - the implementation
  assumed that each omega database name corresponded to a single Xapian
  database, and if a database name referred to a stub database file expanding
  to multiple Xapian databases then they would misbehave.  Such cases are now
  handled properly as well.

* Extend $addfilter to support adding negated filters via a new optional second
  argument which specifies the type of filter to add.

* Stop $sort from needlessly ensuring the match has run.

* Handle corner case of nested $hitlist gracefully instead of potentially
  entering an infinite loop.


* omegatest: Avoid setting TZ globally during tests as that hides bugs where
  behaviour depends on the local timezone when it shouldn't.

* omegatest: Support testing when built using LeakSanitizer by suppressing
  leak reports for cached compiled pcre regular expressions.  These aren't
  released when the program exits but aren't memory leaks.

build system:

* Remove outdated deprecation warning suppression which was there to support
  building from git in the run up to 1.3.2 - a development version which is
  nearly 5 years old now.


* Fix problems with fallback strptime() implementation which was being included
  in the wrong binary, and was lacking a required const_cast on the return
  value.

* Rework setenv() compatibility handling.  Now that Solaris 9 is dead we can
  assume setenv() is provided by Unix-like platforms (POSIX requires it).  For
  other platforms, provide a compatibility implementation of setenv() so the
  compatibility code is encapsulated in one place rather than replicated at
  every use.
   2019-03-10 14:21:05 by Amitai Schleier | Files touched by this commit (3)
Log message:
Avoid conflicting with system bswap32(). Use SUBST_VARS to mollify pkglint.
   2019-03-04 02:38:10 by Amitai Schleier | Files touched by this commit (1) | Package updated
Log message:
Update to 1.4.11. From the changelog:


* omindex:

  + outlookmsg2html: Handle Subject, Date, and From headers.


* In $div and $mod we were converting a non-zero denominator from string to int
  twice for no good reason.


* omegatest: Fix testcase which was failing if the local timezone was behind
  UTC.  This testcase was added in 1.4.10.

* omegatest: Tweak to not fail when $time not supported - it seems that the
  OS time functions we use report an error on GNU Hurd for unknown reasons.

build system:

* Sync up probes for OS time functions in omega's configure with those in
  xapian-core which may solve $time not being supported on GNU Hurd.


* Add missing includes of <cerrno>.  Fixes #776, reported by Matthieu Gautier.

* Stop using htonl()/ntohl() in a non-network context which should improve
  portability to platforms without a POSIX-like socket API.
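
As a rough illustration of the last item, a fixed byte order can be produced with plain shifts and masks, with no socket API involved. The changelog does not say what Omega uses in place of htonl()/ntohl(), so the Python sketch below only shows the general technique, and the function names are hypothetical.

```python
import struct

def to_big_endian(value):
    """Serialise a 32-bit unsigned integer as 4 big-endian bytes using shifts."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value must fit in 32 bits")
    return bytes([
        (value >> 24) & 0xFF,
        (value >> 16) & 0xFF,
        (value >> 8) & 0xFF,
        value & 0xFF,
    ])

def from_big_endian(data):
    """Inverse of to_big_endian(): rebuild the integer from 4 bytes."""
    if len(data) != 4:
        raise ValueError("expected exactly 4 bytes")
    return (data[0] << 24) | (data[1] << 16) | (data[2] << 8) | data[3]

encoded = to_big_endian(0x01020304)
# struct's ">I" format is the standard-library equivalent, shown as a cross-check.
print(encoded == struct.pack(">I", 0x01020304))
```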
   2019-02-12 20:23:37 by Amitai Schleier | Files touched by this commit (1) | Package updated
Log message:
Omega 1.4.10 (2019-02-12):


* Use https for URLs where supported.


* omindex:

  + Index .apxl and .kth files as Apple Keynote.  The .apxl extension is used
    for the XML files inside .key bundles/directories which hold the text
    content of the presentation, and by handling them we can index .key
    directories more usefully.  It seems they are also sometimes found by
    themselves.  Keynote themes have a .kth extension, and key2text can also
    handle these.

  + Pipe input to pdftotext, pdfinfo and dpkg.  These tools all support piping
    an input file on stdin, which can be a little more efficient when we
    already have the file open (e.g. to determine its type using libmagic, or
    to calculate its checksum).

  + An empty string for the start directory is now flagged as an error.
    Previously `/` was used instead, which is unlikely to be what is wanted
    (and `/` can be explicitly specified if that really is what is wanted).

  + Fix emulation of stderr redirection when the indexer's stderr has been
    closed.  We try to avoid using the shell when running external filters, and
    emulate 2>/dev/null in commands, but if the indexer's stderr was closed
    this emulation was buggy and would give the filter a closed stderr
    instead of one redirected to /dev/null.

  + When emulating redirection to /dev/null, we now open /dev/null once and
    dup that fd each time which is a little more efficient and simplifies the
    code.

* scriptindex:

  + date=unix is now a no-op for empty input - previously it would unhelpfully
    add boolean date terms for 1970-01-01.

  + Warn for empty filename in LOAD action.  Previously this gave a slightly
    confusing error: "Couldn't load file '': No such file or directory"

  + Unknown command-line options now cause scriptindex to give a non-zero exit
    status.
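
The idea behind piping input to external filters can be sketched in Python: when the file is already open, the open file is handed to the child process as its stdin instead of passing a filename for the child to reopen. In this sketch a small Python child process stands in for a real filter such as pdftotext, and `run_filter_on_open_file` is a hypothetical helper, not part of omindex.

```python
import subprocess
import sys
import tempfile

def run_filter_on_open_file(fileobj, argv):
    """Run argv with fileobj as its stdin and return the child's stdout bytes."""
    # The file may already have been read (e.g. for a checksum), so rewind it.
    fileobj.seek(0)
    result = subprocess.run(argv, stdin=fileobj, stdout=subprocess.PIPE, check=True)
    return result.stdout

# Stand-in filter: upper-cases whatever arrives on stdin.
STAND_IN_FILTER = [
    sys.executable, "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

with tempfile.TemporaryFile() as f:
    f.write(b"hello omindex\n")
    f.flush()
    output = run_filter_on_open_file(f, STAND_IN_FILTER)
print(output)
```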


* omegatest: Add testcase for SPAN.n on different slots.

* omegatest: Update expected QueryParser output for the xapian-core change to
  produce flatter Query trees.

build system:

* Use AM_ICONV to detect iconv() which should handle non-system install of GNU
  libiconv properly.  Fixes #775, reported by Ryan Schmidt.


* Provide fall-back strptime() implementation for platforms which don't provide
  it, using the C++11 std::get_time() function.  We use strptime() directly
  where it's available as some older C++11 compilers seem to lack
  std::get_time() (GCC 4.8 for example).  This is used by the parsedate action,
  which was added in 1.4.6.
   2018-11-05 06:42:59 by Amitai Schleier | Files touched by this commit (1) | Package updated
Log message:
Update to 1.4.9. From the changelog:


* omindex:

  + Try harder to avoid opening a file being indexed more than once by
    reusing the file descriptor in more cases.

  + Hint to the OS not to cache output from external filters which require
    using a temporary file.

* scriptindex:

  + If the LOAD action successfully opens a file but hits a read error the
    error message now reports the file name correctly.  Previously it would
    report the partial file contents read so far instead of the file name.


* We no longer call posix_fadvise() with POSIX_FADV_NOREUSE under Linux,
  since it's still not implemented there.  We also now only call
  posix_fadvise() with POSIX_FADV_DONTNEED right before we close the file
  descriptor under Linux.
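
The pattern described here, advising the OS with POSIX_FADV_DONTNEED just before closing the descriptor, can be sketched with Python's os.posix_fadvise(). This is an illustration rather than Omega's code: the function name is hypothetical, and the call is guarded because os.posix_fadvise() is only available on some Unix platforms.

```python
import os
import tempfile

def read_then_drop_cache(path):
    """Read a whole file, then hint that its cached pages are no longer needed."""
    fd = os.open(path, os.O_RDONLY)
    try:
        chunks = []
        while True:
            chunk = os.read(fd, 65536)
            if not chunk:
                break
            chunks.append(chunk)
        if hasattr(os, "posix_fadvise"):
            try:
                # Offset 0 with length 0 covers the whole file.
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
            except OSError:
                pass  # the hint is advisory, so a failure is not fatal
        return b"".join(chunks)
    finally:
        os.close(fd)

# Small demonstration using a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example document text\n")
    tmp_path = tmp.name
try:
    data = read_then_drop_cache(tmp_path)
finally:
    os.unlink(tmp_path)
print(len(data))
```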
   2018-10-28 04:44:06 by Amitai Schleier | Files touched by this commit (1) | Package updated
Log message:
Update to 1.4.8. From the changelog:


* omindex:

  + Improve date handling in .eml files.  We now handle a "Date:" header
    without the day of the week, which is allowed by RFC822 and RFC2822
    (though seems rare in practice).  If the date can't be parsed, we now
    just omit the date information rather than failing to process the file.

  + Add support for indexing Apple iWork documents (Keynote (.key), Numbers
    (.numbers) and Pages (.pages)) using libetonyek.  Currently only the file
    variants are handled since omindex doesn't currently support indexing a
    directory as a document.

  + Index Visio files using vsd2xhtml.

  + Extend --filter to support filters which produce SVG as output.

  + Handle SVG embedded in XML with svg: namespace prefix.

  + Add --read-filters option to read a list of filters from a file, each line
    of which is a rule as passed to --filter.  Based on a patch from Gaurav
    Arora.

  + Add new --mime-type-match option which allows specifying a MIME
    Content-Type for a given shell filename pattern (with the special
    Content-Type values "ignore" and "skip" supported, as for --mime-type).

  + Adjust --mime-type to allow ':' in the extension.  A valid MIME
    Content-Type can't contain a colon, so if the argument to --mime-type
    contains more than one colon it makes more sense to split at the *last*
    colon (we used to split at the first), as an extension could conceivably
    contain a colon.  Mostly this change is for consistency with the new
    --mime-type-match option, where the leafname pattern could reasonably
    contain a colon.

  + Remove failed entries for ignored files.  If a file is mapped to
    pseudo-mimetype "ignore" then remove any existing failure record for it so
    that we don't potentially end up with a lot of cruft failure records for
    files we are no longer trying to index.

  + If a file fails to index due to failing to allocate enough memory we now
    try to flag it as failed to index so it will be skipped by default on
    future runs.  This should help to avoid indexing getting stuck on
    problematic files.

  + Add a "pages" field with the number of pages in the document where we
    know how to determine this (currently only for PDF files for which pdfinfo
    reports this information).

  + Handle initially empty database exactly the same way as when --overwrite
    is specified.  This probably has no user-visible consequences, but it's
    cleaner for the handling to be exactly the same.
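
The --mime-type change to split at the last colon can be illustrated with a short Python sketch. This is not omindex's actual parser: the function name `split_mime_type_arg` and the sample arguments are assumptions made for the example.

```python
def split_mime_type_arg(arg):
    """Split an 'EXT:TYPE' argument at the last colon, returning (ext, type)."""
    ext, sep, mime = arg.rpartition(":")
    if not sep:
        raise ValueError("expected EXT:TYPE")
    return ext, mime

# A plain extension behaves the same whichever colon is used.
simple = split_mime_type_arg("txt:text/plain")
# With a colon inside the extension, only splitting at the last colon
# keeps the Content-Type intact.
with_colon = split_mime_type_arg("a:b:text/plain")
print(simple, with_colon)
```

Splitting at the first colon would instead yield the extension "a" and the invalid type "b:text/plain".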

* scriptindex:

  + Improve scriptindex diagnostic messages.  All diagnostics are now labelled
    as "error", "warning" or "note" as appropriate, and we now consistently
    report "FILE:LINE:" (and also "COLUMN:" in most cases) to make it clearer
    where the problem lies.

  + Add new "split" action which splits the text on a specified delimiter and
    executes the following actions for each piece.  Based on a patch by Gaurav
    Arora.

  + Missing whitespace after the closing " on an action argument is now
    flagged as an error.  Previously scriptindex would attempt to parse
    the following characters as the next action.

  + Support C-like escapes for quoted parameter values.  Notably this means it
    is now possible to include `"` in quoted parameter values.


  + Value-based date range filters can now be specified via CGI parameters
    START.N, END.N and/or SPAN.N where N is a value slot number, allowing
    multiple concurrent filters on different slots to be specified.

  + Support YYYY and YYYYMM limits in term-based date ranges.  Previously
    value-based date ranges supported these as limits, but term-based date
    ranges gave an error.

  + Add stem_strategy option and deprecate existing stem_all option in favour
    of this new more versatile option.

  + Support "natural" $sort option via new flag "#" which sorts embedded
    natural numbers in numerical order.

  + Support numeric $sort option via new flag "n", similar to GNU sort -n.

  + Rewrite field parsing to be more efficient, and store fields in an
    unordered_map for faster lookup.
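
The "natural" ordering mentioned for $sort compares embedded runs of digits as numbers rather than character by character. The Python sketch below shows the general technique only. It is not Omega's implementation, the helper name `natural_key` is hypothetical, and details such as the handling of leading zeros may differ.

```python
import re

def natural_key(s):
    """Build a sort key in which embedded digit runs compare numerically."""
    # re.split with a capturing group alternates text and digit runs:
    # even indices hold text (possibly empty), odd indices hold digits.
    parts = re.split(r"([0-9]+)", s)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

names = ["file10", "file2", "file1"]
plain_order = sorted(names)                     # character-by-character order
natural_order = sorted(names, key=natural_key)  # digit runs compared as numbers
print(plain_order)
print(natural_order)
```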