./textproc/xapian-omega, Search engine application for websites using Xapian


Branch: CURRENT, Version: 1.4.2, Package name: xapian-omega-1.4.2, Maintainer: schmonz

Omega operates on a set of databases. Each database is created and
updated separately using either omindex or scriptindex. You can
search these databases (or any other Xapian database with suitable
contents) via a web front-end provided by omega, a CGI application.
A search can also be done over more than one database at once.


Required to run:
[lang/perl5] [devel/pcre] [textproc/xapian]

Required to build:
[pkgtools/cwrappers]

Master sites:

SHA1: 146f56c8688d2dbc9c4e8294906e1e67195f89f0
RMD160: f8c5a145479884cd28b987f8ec01481afd521068
Filesize: 485.227 KB

CVS history:


   2017-01-01 11:41:03 by Amitai Schleier | Files touched by this commit (2) | Package updated
Log message:
Update to 1.4.2. From the changelog:

documentation:

* Replace auto-generated list of the supported MIME types with an
  auto-generated table showing the extensions that are mapped to each MIME type
  by default.  Partly addresses #569, reported by catkin.

indexers:

* omindex: Add support for indexing markdown files (extension .md or .markdown,
  mime-type text/markdown, using "markdown" to convert to HTML).

testsuite:

* Add support for "make installcheck" to run tests against installed version.

build system:

* configure: Fail with clear error with xapian-core < 1.4.0.

portability:

* Fix GCC -Wimplicit-fallthrough warning.

* Add missing <ctime> for time_t.

* Avoid snprintf for formatting fixed-width integers - it results in warnings
  about possible output truncation with GCC7 (which aren't actually possible
  due to limited input range) and it's a bit heavyweight for this job anyway.
   2016-11-07 14:46:46 by Thomas Klausner | Files touched by this commit (11)
Log message:
Recursive bump for xapian shlib major bump.
   2016-11-07 14:02:45 by Amitai Schleier | Files touched by this commit (5) | Package updated
Log message:
Update to 1.4.1. From the changelog:

omindex:

  + Also index leafname with _ and & replaced by spaces.  Literal spaces are
    often avoided in filenames, and "hello_world.txt" ought to be searchable
    via "hello" and "world".  Partly addresses #618, reported by Julien
    Pfefferkorn.

  + Make named entity look-up (e.g. &eacute; -> 233) use the same
    keyword-lookup table approach we already use for HTML tags and built-in
    MIME content-types, rather than a std::map, which makes it faster while
    using less memory.

  + Avoid using the shell to run most external commands as it's unnecessary
    overhead.  For the built-in filters, the only cases which now use a shell
    are where we run two unzip commands.  For user-specified commands, a simple
    and slightly conservative test is used, which should avoid a shell in most
    common cases where it isn't needed.  Notably, environment variables set
    before the command are handled.

  + Track files which couldn't be indexed in the user metadata and skip them by
    default on subsequent runs to avoid the costs of repeatedly running a
    filter on a file it can't handle.  Run omindex with --retry-failed to retry
    such files.

  + Overhaul the "per-site" terms:
    - 'H' prefix is hostname as before, except that if the term would be > 240
      bytes (unlikely but possible) the end is hashed in the same way 'U'
      prefix terms are.
    - 'P' terms are now added for every directory level, not just the start
      URL's path.
    - A new 'J' prefix term is added with the start URL (less any trailing
      '/'), which means all files indexed from a particular "site" are now
      indexed by one term.  See #376.

  + Add 'skip' pseudo-mimetype which extensions can be mapped to, and they will
    then be reported and skipped (to complement the existing 'ignore'
    pseudo-mimetype which causes files with the specified extension to be
    quietly ignored).

  + Treat a command of 'true' specially as meaning make the text extraction a
    no-op (as actually running /bin/true effectively would).  This provides a
    way to index some file types by only meta-data.  Fixes #519, reported by
    Brian Burton.

  + Add support for wildcard mimetypes */* and *.  Combined with filter command
    ``true`` for indexing by meta-data only, you can specify a fall back case
    of indexing by meta-data only using ``--filter '*:true'``.  From a
    suggestion by Brian Burton on xapian-discuss.

  + Index message/rfc822 and message/news.  These are individually saved email
    messages and news articles.

  + Index archived web page formats MAFF and MHTML.

  + Handle .xla, yet another XL extension.

  + Handle metadata in LibreOffice HTML export (dcterms.subject,
    dcterms.description, dcterms.creator and dcterms.contributor).

  + Use zlib's gzopen() instead of invoking "gzip -dc" for compressed Abiword
    documents.

  + Add support for %f in command passed to --filter to allow specifying
    commands where the input file is not the final argument.  Fixed #570,
    reported by Charles Atkinson.

  + Allow --filter to handle commands which produce output in a temporary file
    rather than on stdout.

  + Allow --filter to specify the character set of the output the filter
    produces.

  + Handle application/vnd.ms-excel, text/x-perl and application/x-dvi via
    default --filter settings instead of hardcoded cases (now possible thanks
    to the new abilities that --filter has).

  + Add support for specifying a MIME subtype of '*' in --filter arguments.

  + Add --track-ctime option to allow omindex to pick up changes to file
    ownership and permissions.

  + Index terms from the leafname with an 'F' prefix, rather than treating them
    as more body text.  (Fixes #633, reported by Emmanuel Garette)

  + The starting URL wasn't previously URL encoded.  In 1.2.18, a minimally
    intrusive fix was implemented.  In 1.3.2, we now encode the starting URL
    as we do for the rest of the filename.

  + Don't assume .doc is application/msword but let libmagic decide, since .doc
    files may actually be RTF, and sometimes people use .doc for plain-text
    documentation.

  + Add support for indexing 'topic' and 'created date' meta-data for
    OpenDocument format and HTML.

  + Index "topic" for PDF documents.

  + Commit changes and exit, rather than skipping the current file on most
    unexpected errors reading directories or initialising libmagic - otherwise
    we can end up deleting a lot of database entries on errors like EHOSTDOWN
    when indexing network mounts.

  + Add --opendir-sleep=SECS option to allow working around problems with
    indexing files on Microsoft DFS shares.

  + If we get ENOTDIR trying to index a file, skip it quietly (unless in
    verbose mode) as we already do if we get ENOENT, since ENOTDIR is what we
    get if the file and the directory it was in got removed between us getting
    the filename and trying to open it.

  + Handle ENOENT, ENOTDIR and EACCES from readdir().

  + If we've already opened the file (as we often will have if using a modern
    libmagic with magic_descriptor() available), then use fstat() on that fd
    rather than stat()/lstat() on the pathname.

  + Pass error message string and errno value in ReadError exceptions.

  + Report strerror(errno) if we can't read a file.

  + Filtering via text/html now handles HTML documents which specify a charset.

  + Add support for indexing Microsoft Publisher files using pub2xhtml.

  + Restrict the length of what we consider to be an extension, currently to 7
    characters or whatever the longest extension in the mime_map is if it is
    longer.

  + Avoid '//' in temporary filenames (cosmetic only).

  + Extend --filter to handle commands which produce HTML on stdout.

  + Don't report an error if a file is deleted (or renamed) between us reading
    the directory entry for it and trying to read the file itself by default.
    In --verbose mode, the situation is still reported, but now with a
    specific message.

  + If omindex receives any of the signals SIGHUP, SIGINT, SIGQUIT or SIGTERM,
    then kill any active external filter child process, then handle the signal
    as we did before.  If setpgid() is available, put each external filter in
    its own process group and kill the whole process group when we get a
    signal.

  + Use magic_descriptor() if the version of libmagic we're building against
    is new enough to have it.  This eliminates an extra opening of a file
    being indexed in certain cases.

  + Use rst2html to handle .rst and .rest files.

  + Index title with an 'S' prefix rather than no prefix.

  + If the document with the highest existing docid before the run was updated,
    we were reporting it as "added", but now we correctly report it as
    "updated".

  + Catch and report std::exception explicitly, so failing to allocate memory
    is no longer reported as "Unknown exception".

omindex-list: New tool to list URLs of all the documents in a database
(or list of databases) indexed by omindex.
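
To illustrate the leafname handling noted in the omindex changes above, here
is a minimal Python sketch.  It is not omindex's implementation: the function
name leafname_words is hypothetical, and dropping the extension before
splitting is an assumption made for the example.

```python
import re

def leafname_words(path):
    """Split a file's leafname into words, treating '_' and '&' as spaces.

    Hypothetical helper for illustration only; omindex's real logic may
    differ (for example in how it treats the extension).
    """
    leaf = path.rsplit("/", 1)[-1]                    # final path component
    stem = leaf.rsplit(".", 1)[0] if "." in leaf else leaf  # assumed: drop extension
    return re.sub(r"[_&]", " ", stem).split()

words = leafname_words("docs/hello_world.txt")
```

With this sketch, "hello_world.txt" yields the separate words "hello" and
"world", which is the behaviour the changelog entry asks for.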

* The HTML parser now explicitly handles <APPLET>, <OBJECT> and <TR>.

* Use a generated compact and efficient table to convert HTML tag names
  to enum codes - this is both faster and smaller than the approach we were
  using, with the benefit that the table is auto-generated.

* Always use our built-in conversion code for the character sets it can handle
  (previously we'd use iconv if available; now we only use iconv for other
  character sets).  This gives us more consistent results, and in particular
  means we now handle BOMs better (at least when using GNU iconv).

* A lot of data labelled as "iso-8859-1" is actually "windows-1252".  The two
  only differ in characters which are control characters in iso-8859-1, so
  assume the latter when we see the former.
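
The charset substitution in the last item can be sketched in Python as below.
This is an illustration rather than Omega's code: decode_labelled is a
hypothetical name, and the fallback for the few bytes that Python's cp1252
codec leaves undefined is an assumption of this sketch.

```python
def decode_labelled(data, charset):
    """Decode bytes, treating an "iso-8859-1" label as windows-1252."""
    name = charset.strip().lower()
    if name == "iso-8859-1":
        try:
            return data.decode("cp1252")
        except UnicodeDecodeError:
            # Python's cp1252 codec rejects a handful of undefined bytes;
            # fall back to the labelled charset for those (assumption).
            return data.decode("latin-1")
    return data.decode(name)

quoted = decode_labelled(b"\x93hi\x94", "iso-8859-1")
```

Bytes 0x93 and 0x94 are control characters in iso-8859-1 but curly quotes in
windows-1252, so text that was really windows-1252 comes out as intended.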

scriptindex:

  + Remove special error handling case noting that index=nopos was replaced
    with indexnopos - this was removed in 1.1.0 so there's been enough time to
    upgrade.

omega:

* Add support for sorting by more than one value - e.g. SORT=+1,-2

* Add $msizelower and $msizeupper which provide access to the lower and upper
  bounds on the number of matches.

* Add support for $set{weighting,coord}.

* Add weightingpurefilter option.  Normally a query consisting only of filter
  terms won't have relevance weights calculated.  This new option allows you to
  specify a weighting scheme to use for such queries, with the same values
  supported as for the existing weighting option.  For example,
  $set{weightingpurefilter,coord} will weight such queries by how many filter
  terms match each document.

* $filters now includes DATEVALUE, which means we'll force the first page when
  reloading or changing page starting from existing URLs upon upgrade to 1.4.1,
  but the exact same existing URL could be for a search without the date filter
  where we want to force the first page, so there's an inherent ambiguity
  there.  Forcing first page in this case seems the least problematic
  side-effect.

* Implement $match command for omegascript.  Patch from Richhiey Thomas.

* Add optional prefix argument to $terms.

* $snippet now uses MSet::snippet() instead of the Snipper class.

* Add $contains{STRING1,STRING2}.  Contributed by Ayush Gupta.

* Add support for negated boolean filter terms, specified by CGI parameter
  "N".

* Support a direction prefix on SORT: '+' for ascending, '-' for descending.
  SORTREVERSE set to non-0 now flips the direction.  Fixes #697, reported by
  Andy Chilton.

* Add options argument to $transform.

* Cache compiled regexps used in $transform.

* Add $ord OmegaScript command which returns the Unicode codepoint for the
  first character of a UTF-8 string.

* Add $chr OmegaScript command which returns the UTF-8 string for given Unicode
  codepoint.

* Add $csv OmegaScript command which escapes a string for use as a field in a
  CSV file ("always quote" mode inspired by patch from Gaurav Arora.)

* New $filters encoding which avoids collisions.  We also compare CGI parameter
  xFILTERS to what $filters would have returned in previous releases, so that
  on upgrades old format serialised filters are handled correctly.

* Fix $jsonarray not to prepend ']' to the first array element.

* Skip weighting scheme setup for a pure date range query - it won't be
  weighted anyway, so we can avoid having to parse weighting scheme parameters,
  etc.

* Use value ranges when date range filtering by value.  Should be more
  efficient than a MatchDecider, and will automatically take advantage of any
  future value range optimisations in xapian-core.

* Add default_db and default_template config options.  These allow the default
  database name and default template to be set via the config file, rather than
  being stuck with the respective defaults of "default" and "query".  Fixes
  #310, reported by Marco Hennigs.

* Add support for non-exclusive filters.  Fixes #234, reported by Thomas
  Viehmann.

* Fix handling of multiple P.<prefix> fields - previously only the first seen
  was used.  These fields are also now taken into account when deciding if the
  query has changed.  $query now returns an OmegaScript list with one entry for
  each CGI parameter passed.

* Allow setting query expansion scheme to "bo1".

* Make the $json and $jsonarray force the text to be valid UTF-8, since
  otherwise the output isn't valid JSON.

* Check parameters to $set{weighting,bm25 ...} and $set{weighting,trad ...}
  converted OK.  Based on patch from Aarsh Shah.

* Add support to $set{weighting,...} for bb2, dlh, dph, ifb2, ineb2, inl2, lm,
  pl2 when we're built against a xapian-core which is new enough to have these
  schemes.

* Add $snippet to generate a snippet of text tailored to the search.

* Add new $json and $jsonarray OmegaScript commands to support producing JSON
  output.

* Add $truncate command which truncates a string after a word.

* Add support for $set{weighting,tfidf} to allow the new TfIdfWeight weighting
  scheme to be used.

* DEFAULTOP now defaults to AND rather than OR, since that matches what pretty
  much every search engine does these days.  Closes ticket#512.

* Allow mapping a query string prefix to more than one term prefix (which
  xapian-core has supported since 1.0.4).

* Add support for search inputs for multiple probabilistic prefixes, with
  support for per-prefix stemmers.

* Drop legacy support for handling '.' separated terms in xP - that changed in
  Omega 0.9.7, more than 5 years ago now.

* Remove support for OLDP CGI parameter which was superseded by xP
  approximately a decade ago, and isn't even documented!

* Drop special handling for R-prefixed terms in $prettyterm - we stopped
  generating these in Xapian 1.0.
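
For the $csv command mentioned in the list above, the following Python sketch
shows conventional CSV field quoting (in the style of RFC 4180), including an
"always quote" mode.  The exact rules Omega applies are not stated here, so
the function csv_field and its always_quote parameter are hypothetical.

```python
def csv_field(value, always_quote=False):
    """Escape a string for use as a single CSV field.

    Quotes the field when it contains a comma, double quote, CR or LF
    (or when always_quote is set), doubling any embedded double quotes.
    """
    needs_quotes = always_quote or any(c in value for c in ',"\r\n')
    if not needs_quotes:
        return value
    return '"' + value.replace('"', '""') + '"'

row = ",".join(csv_field(v) for v in ["title", "a,b", 'say "hi"'])
```

Joining escaped fields with commas, as in the last line, gives a row that a
standard CSV reader can split back into the original three values.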

templates:

* Lower case all HTML tags, attributes and values; explicitly close <option>
  tags.  Patches from Vivek Pal and Nirmal Singhania.

* Migrate Omega Templates to HTML5.  Patch from Nirmal Singhania.

* templates/query: Remove stray double quote from generated URL for spelling
  suggestion when THRESHOLD is set.  Patch from Nirmal Singhania.

* templates/opensearch: Change response feeds to support OpenSearch 1.1.
  Patch from Nirmal Singhania.

* templates/query: Fix setting of prefix map for P - in 1.3.2, this would fail
  to also search in the subject.  Now it also searches in the subject and
  topic.

* templates/query:

  + We now map unprefixed queries to include S-prefixed terms to match the
    change in omindex to prefixing terms from the title with S.  You may want
    to make the same update to your own templates.

  + Set up prefixes for 'author:' and 'title:'.
   2016-07-09 08:39:18 by Thomas Klausner | Files touched by this commit (1068) | Package updated
Log message:
Bump PKGREVISION for perl-5.24.0 for everything mentioning perl.
   2016-04-30 16:14:17 by Amitai Schlair | Files touched by this commit (2) | Package updated
Log message:
Update to 1.2.23. From the changelog:

documentation:

* Update links to Xapian website and trac to use https, which is now supported,
  thanks to James Aylett.

indexers:

* Fix HTML/XML entity decoding to be O(n) not O(n²) - processing HTML/XML with
  a lot of entities is now much faster.

templates:

* Remove unused country code to name maps.  These were intended as examples,
  but they aren't very useful as such, and really just bloat the templates
  needlessly.
   2016-01-13 22:03:49 by Amitai Schlair | Files touched by this commit (2) | Package updated
Log message:
Update to 1.2.22. From the changelog:

documentation:

* Stop maintaining ChangeLog files.  They make merging patches harder, and stop
  'git cherry-pick' from working as it should.  The git repo history should be
  sufficient for complying with GPLv2 2(a).

* Clarify help text for omindex --mime-type option.

* docs/omegascript.rst:

  + Fix documentation of $last to say it's the MSet index *one beyond* the end
    of the current page.  Reported by Andrew Chilton.

  + Clarify that $split and $substr work in bytes.  Previously we said
    "characters" which could be taken as meaning they work with UTF-8
    characters.

  + Update documentation for $filters - it was missing these CGI parameters
    from the list of those serialised: COLLAPSE, DOCIDORDER, SORT, SORTREVERSE,
    SORTAFTER

  + Explicitly note user can use $setmap to create their own maps.

* docs/overview.rst:

  + SVG extraction is built-in too.

  + Expand paragraph about command `false`.  Note the versions where explicit
    support was added, and that this will also work with any version on Unix,
    where `false` is a command.

  + Document `cdb_dir`.

* docs/cgiparams.rst: Document behaviour if xDB is not set.

* Change "characters" to "bytes" in a few places to clarify that we don't mean
  Unicode code points.
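
The difference between bytes and Unicode code points that these documentation
changes stress can be seen in a short Python sketch.  The helper byte_substr
is hypothetical and only mimics byte-offset slicing; it does not reproduce the
argument conventions of $substr or $split.

```python
def byte_substr(s, start, length):
    """Slice the UTF-8 encoding of s by byte offsets, not code points."""
    return s.encode("utf-8")[start:start + length]

text = "na\u00efve"                      # "naive" with i-diaeresis: 5 code points
code_points = len(text)                  # counts Unicode code points
byte_count = len(text.encode("utf-8"))   # counts bytes; the accented letter takes two
head = byte_substr(text, 0, 3)           # stops in the middle of the accented letter
```

Here the three-byte slice ends halfway through a two-byte character, which is
why it matters whether a command counts bytes or characters.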

indexers:

* omindex:

  + Add '--title-size' option.

  + Handle .oft the same way as .msg - it's some sort of template email, and
    has essentially the same format.

omega:

* Make $querydescription ensure the match has been run, so that it includes
  filters.

* Avoid $allterms, $cgilist, $filterterms and $terms being O(n²) in the number
  of items in the returned list.

* If xFILTERS is not set, don't force the first page as that's unhelpful if
  someone fails to set it in their template.

* When environment variable SERVER_PROTOCOL is set to INCLUDED (as it is when
  we're being included in a page), we already suppress the HTTP headers, but
  now we suppress the blank line after the header too.

* Support option flag_cjk_ngram if built against xapian-core >= 1.2.22.

testsuite:

* Add test coverage for parsing of HTML entities.

build system:

* Fix error reporting if PCRE isn't installed. Fixes #693, reported by lhz7370.

portability:

* Avoid warning when building with glibc >= 2.21.

* Don't provide our own implementation of sleep() under __WIN32__ if there
  already is one - mingw provides one, and in some situations it seems to clash
  with ours.  Reported to xapian-discuss by John Alveris.

* Stop trying to use O_STREAMING - the patch to implement it was never merged
  into the Linux kernel, and I can't find any evidence that other platforms
  implement it.  The constant value O_STREAMING used now seems to be used for
  the part of O_SYNC which isn't covered by O_DSYNC, which seems likely to hurt
  performance if anything.
   2015-11-04 03:00:17 by Alistair G. Crooks | Files touched by this commit (797)
Log message:
Add SHA512 digests for distfiles for textproc category

Problems found locating distfiles:
	Package cabocha: missing distfile cabocha-0.68.tar.bz2
	Package convertlit: missing distfile clit18src.zip
	Package php-enchant: missing distfile php-enchant/enchant-1.1.0.tgz

Otherwise, existing SHA1 digests verified and found to be the same on
the machine holding the existing distfiles (morden).  All existing
SHA1 digests retained for now as an audit trail.
   2015-06-12 12:52:19 by Thomas Klausner | Files touched by this commit (3152)
Log message:
Recursive PKGREVISION bump for all packages mentioning 'perl',
having a PKGNAME of p5-*, or depending on such a package,
for perl-5.22.0.