./textproc/xapian, Probabilistic Information Retrieval search engine

Branch: CURRENT, Version: 1.4.4, Package name: xapian-1.4.4, Maintainer: schmonz

Xapian is an Open Source Search Engine Library, released under the
GPL. It's written in C++, with bindings to allow use from Perl,
Python, PHP, Java, Tcl, C# and Ruby (so far!)

Xapian is a highly adaptable toolkit which allows developers to
easily add advanced indexing and search facilities to their own
applications. It supports the Probabilistic Information Retrieval
model and also supports a rich set of boolean query operators.

If you're after a packaged search engine for your website, you
should take a look at Omega: an application we supply built upon
Xapian. Unlike most other website search solutions, Xapian's
versatility allows you to extend Omega to meet your needs as they
grow.


Required to run:
[devel/libuuid]

Required to build:
[pkgtools/cwrappers]

Master sites:

SHA1: 6b8bf7eea3059dab8d5dd254c3ae0cf895bc4910
RMD160: 19535bc7ca5c175b7ee1c4898e9e9e796e45dcb0
Filesize: 2742.141 KB

CVS history:

   2017-07-14 14:55:45 by Joerg Sonnenberger | Files touched by this commit (5)
Log message:
Collect patches for xapian in a common subdirectory. Put distinfo for
modules into a separate file as well. Don't hard-code -lstdc++ for
broken ancient OpenBSD versions of GCC. Sync p5-Xapian PLIST with
reality.
   2017-07-11 16:56:37 by Amitai Schleier | Files touched by this commit (2)
Log message:
Include missing AF_INET definition to fix NetBSD build.
   2017-07-10 19:43:44 by Amitai Schleier | Files touched by this commit (1) | Package updated
Log message:
Update references to this Makefile.common.
   2017-07-10 19:29:58 by Amitai Schleier | Files touched by this commit (4)
Log message:
Extract settings to be shared by various language bindings.
   2017-07-10 00:27:44 by Amitai Schleier | Files touched by this commit (2) | Package updated
Log message:
Update to 1.4.4. From the changelog:

API:

* Database::check():

  + Fix checking a single table - changes in 1.4.2 broke such checks unless you
    specified the table without any extension.

  + Errors from failing to find the file specified are now thrown as
    DatabaseOpeningError (was DatabaseError, of which DatabaseOpeningError is
    a subclass so existing code should continue to work).  Also improved the
    error message when the file doesn't exist.

* Drop OP_SCALE_WEIGHT over OP_VALUE_RANGE, OP_VALUE_GE and OP_VALUE_LE in the
  Query constructor.  These operators always return weight 0 so OP_SCALE_WEIGHT
  over them has no effect.  Eliminating it at query construction time is cheap
  (we only need to check the type of the subquery), eliminates the confusing
  "0 * " from the query description, and means the OP_SCALE_WEIGHT Query object
  can be released sooner.  Inspired by Shivanshu Chauhan asking about the query
  description on IRC.

* Drop OP_SCALE_WEIGHT on the right side of OP_AND_NOT in the Query
  constructor.  OP_AND_NOT takes no weight from the right so OP_SCALE_WEIGHT
  has no effect there.  Eliminating it at query construction time is cheap
  (just need to check the subquery's type), eliminates the confusing "0 * "
  from the query description, and means the OP_SCALE_WEIGHT object can be
  released sooner.

* MSet::snippet(): Favour candidate snippets which contain more of a diversity
  of matching terms by discounting the relevance of repeated terms using an
  exponential decay.  A snippet which contains more terms from the query is
  likely to be better than one which contains the same term or terms multiple
  times, but a repeated term is still interesting, just less with each
  additional appearance.  Diversity issue highlighted by Robert Stepanek's
  patch in https://github.com/xapian/xapian/pull/117 - testcases taken from his
  patch.

* MSet::snippet(): New flag SNIPPET_EMPTY_WITHOUT_MATCH to get an empty snippet
  if there are no matches in the text passed in.  Implemented by Robert
  Stepanek.

* Round MSet::get_matches_estimated() to an appropriate number of significant
  figures.  The algorithm used looks at the lower and upper bound and where the
  estimate sits between them, and then picks an appropriate number of
  significant figures.  Thanks to Sébastien Le Callonnec for help sorting out a
  portability issue on OS X.

* Add Database::locked() method - where possible this non-invasively checks if
  the database is currently open for writing, which can be useful for
  dashboards and other status reporting tools.
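
As a rough illustration of the MSet::snippet() diversity change listed above,
the sketch below scores a candidate snippet by summing term relevances, with
each repeat of a term discounted by an exponential decay.  The function name
score_snippet, the per-term relevance of 1.0 and the decay factor of 0.5 are
assumptions made for this example; they are not Xapian's actual code or
constants.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical scoring helper (not Xapian's code): every matching term
// contributes `relevance`, multiplied by decay^k, where k is the number of
// times that term has already appeared in the candidate snippet.
double score_snippet(const std::vector<std::string>& matched_terms,
                     double relevance = 1.0, double decay = 0.5) {
    assert(decay > 0.0 && decay <= 1.0);
    std::unordered_map<std::string, int> seen;
    double score = 0.0;
    for (const std::string& term : matched_terms) {
        int repeats = seen[term]++;  // earlier occurrences of this term
        score += relevance * std::pow(decay, repeats);
    }
    return score;
}
```

With these assumed constants, a snippet matching two distinct terms scores
2.0, while one matching the same term three times scores 1.75, so the more
diverse snippet wins even though it has fewer matches.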

testsuite:

* Add more tests of Database::check().  Fixes #238, reported by Richard
  Boulton.

* Make apitest testcase nosuchdb1 fail if we manage to open the DB.

* Skip testcases which throw NetworkError with errno value ECHILD - this
  indicates system resource starvation rather than a Xapian bug.  Such failures
  are seen on Debian buildds from time to time, see:
  https://bugs.debian.org/681941

* Use terms that exist in the database for most snippet tests.  It's good to
  test that snippet highlighting works for terms that aren't in the database,
  but it's not good for all our snippet tests to feature such terms - it's
  not the common usage.

matcher:

* Fix incorrect results due to uninitialised memory.  The array holding max
  weight values in MultiAndPostList is never initialised if the operator is
  unweighted, but the values are still used to calculate the max weight to pass
  to subqueries, leading to incorrect results.  This can be observed with an OR
  under an unweighted AND (e.g. OR under AND on the right side of AND_NOT).
  The fix applied is to simply default initialise this array, which should lead
  to a max weight of 0.0 being passed on to subqueries.  Bug reported in
  notmuch by Kirill A. Shutemov, and forwarded by David Bremner.

* Improve value range upper bound and estimated matches.  The value slot
  frequency provides a tighter upper bound than Database::get_doccount().
  The estimate is now calculated by working out the proportion of possible
  values between the slot lower and upper bounds which the range covers
  (assuming a uniform distribution).  This seems to work fairly well in
  practice, and is certainly better than the crude estimate we were using:
  Database::get_doccount() / 2

* Handle arbitrary combinations of OP_OR under OP_NEAR/OP_PHRASE, partly
  addressing #508.  Thanks to Jean-Francois Dockes for motivation and testing.

* Only convert OP_PHRASE to OP_AND if full DB has no positions.  Until now the
  conversion was done independently for each sub-database, but being consistent
  with the results from a database containing all the same documents seems more
  useful.

* Avoid double get_wdf() call for first subquery of OP_NEAR and OP_PHRASE,
  which will speed them up by a small amount.
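
The value range estimate described above (proportion of the slot's bounds
covered by the range, scaled by the value slot frequency) can be sketched as
follows.  This is only an illustration under stated assumptions: Xapian
stores values as byte strings, whereas this example models them as doubles,
and the name estimate_range_matches is hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: estimate how many documents match a value range,
// assuming values are spread uniformly between the slot's bounds.
// Values are modelled as doubles purely for illustration.
double estimate_range_matches(double slot_lower, double slot_upper,
                              double range_lower, double range_upper,
                              double slot_freq) {
    assert(slot_lower <= slot_upper);
    // Clip the requested range to the values which actually occur.
    double lo = std::max(range_lower, slot_lower);
    double hi = std::min(range_upper, slot_upper);
    if (lo > hi) return 0.0;                         // no overlap at all
    if (slot_lower == slot_upper) return slot_freq;  // one distinct value
    double proportion = (hi - lo) / (slot_upper - slot_lower);
    return proportion * slot_freq;                   // never above slot_freq
}
```

For a slot with bounds 0 and 100 and a frequency of 1000, a range of 25 to 75
covers half of the possible values, giving an estimate of 500.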

documentation:

* Correct "Query::feature_flag" -> "QueryParser::feature_flag".  Fixes #747,
  reported by James Aylett.

* Rename set_metadata() `value` parameter to `metadata`.  This change is
  particularly motivated by making it easier to map this case specially in SWIG
  bindings, but the new name is also clearer and better documents its purpose.

* Rename value range parameters.  The new names (`range_limit` instead of
  `limit`, `range_lower` instead of `begin` and `range_upper` instead of `end`)
  are particularly motivated by making it easier to map them specially in SWIG
  bindings, but they're also clearer names which better document their
  purposes.

* Change "(key, tag)" to "(key, value)" in user metadata docs.  The user
  metadata is essentially what's often called a "key-value store" so users
  are likely to be familiar with that terminology.

* Consistently name parameter of Weight::unserialise() overridden forms.
  In xapian/weight.h it was almost always named `serialised`, but LMWeight
  named it `s` and CoordWeight omitted the name.

* Fix various minor documentation comment typos.

* INSTALL: Update section about -Bsymbolic-functions which is not a new
  GNU ld feature at this point.

tools:

* xapian-delve: Uses new Database::locked() method to report if the database
  is currently locked.

portability:

* Fix configure probe for __builtin_exp10() to work around bug on mingw - there
  GCC generates a call to exp10() for __builtin_exp10() but there is no exp10()
  function in the C library, so we get a link failure.  Use a full link test
  instead to avoid this issue.  Reported by Mario Emmenlauer on xapian-devel.

* Fix configure probe for log2() which was failing on at least some platforms
  due to ambiguity between overloaded forms of log2().  Make the probe
  explicitly check for log2(double) to avoid this problem.

* Work around the unhelpful semantics of AI_ADDRCONFIG on platforms which follow
  the old RFC instead of POSIX (such as Linux) - if only loopback networking is
  configured, localhost won't resolve by name or IP address, which causes
  testsuites using the remote backend over localhost to fail in auto-build
  environments which deliberately disable networking during builds.  The
  workaround implemented is to check if the hostname is "::1", "127.0.0.1" or
  "localhost" and disable AI_ADDRCONFIG for these.  This doesn't catch all
  possible ways to specify localhost, but should catch all the ways these might
  be specified in a testsuite.  Fixes https://bugs.debian.org/853107, reported
  by Daniel Schepler and the root cause uncovered by James Clarke.

* Fix build failure cross-compiling for android due to not pulling in header
  for errno.

* Fix compiler warnings.
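
The AI_ADDRCONFIG workaround described above amounts to a hostname check
before filling in the getaddrinfo() hints.  A minimal sketch, assuming a
hypothetical helper hint_flags_for_host which takes the flag bit as a
parameter so that the example needs only the standard library (on a POSIX
system the caller would pass AI_ADDRCONFIG from <netdb.h>):

```cpp
#include <string>

// Hypothetical helper (not Xapian's code): return the getaddrinfo() hint
// flags to use for `host`, clearing `addrconfig_flag` for the three
// spellings of localhost named in the changelog entry.
int hint_flags_for_host(const std::string& host, int flags,
                        int addrconfig_flag) {
    bool is_localhost =
        host == "::1" || host == "127.0.0.1" || host == "localhost";
    if (is_localhost) flags &= ~addrconfig_flag;  // drop only that one bit
    return flags;
}
```

The result would then be stored in the ai_flags member of the addrinfo
hints structure passed to getaddrinfo().  Other flag bits are left alone,
and other hostnames keep AI_ADDRCONFIG set.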

debug code:

* Adjust assertion in InMemoryPostList.  Calling skip_to() is fine when the
  postlist hasn't been started yet (but the assertion was failing for a term
  not in the database).  Latent bug, triggered by testcases complexphrase1 and
  complexnear1 as updated for addition of support for OP_OR subqueries of
  OP_PHRASE/OP_NEAR.
   2017-05-08 14:02:16 by Amitai Schleier | Files touched by this commit (1)
Log message:
Needs C++11.
   2017-01-01 11:40:49 by Amitai Schleier | Files touched by this commit (2) | Package updated
Log message:
Update to 1.4.2. From the changelog:

API:

* Add XAPIAN_AT_LEAST(A,B,C) macro.

* MSet::snippet(): Optimise snippet generation - it's now ~46% faster in a
  simple test.

* Add Xapian::DOC_ASSUME_VALID flag which tells Database::get_document() that
  it doesn't need to check that the passed docid is valid.  Fixes #739,
  reported by Germán M. Bravo.

* TfIdfWeight: Add support for the L wdf normalisation.  Patch from Vivek Pal.

* BB2Weight: Fix weights when database has just one document.  Our existing
  attempt to clamp N to be at least 2 was ineffective due to computing
  N - 2 < 0 in an unsigned type.

* DPHWeight: Fix reversed sign in quadratic formula, making the upper bound a
  tiny amount higher.

* DLHWeight: Correct upper bound which was a bit too low, due to flawed logic
  in its derivation.  The new bound is slightly less tight (by a few percent).

* DLHWeight,DPHWeight: Avoid calculating log(0) when wdf is equal to the
  document length.

* TermGenerator: Handle stemmer returning empty string - the Arabic stemmer
  can currently do this (e.g. for a single tatweel) and user stemmers can too.
  Fixes #741, reported by Emmanuel Engelhart.

* Database::check(): Fix check that the first docid in each doclength chunk is
  more than the last docid in the previous chunk - this code was in the wrong
  place so didn't actually work.

* Database::get_unique_terms(): Clamp returned value to be <= document length.
  Ideally get_unique_terms() ought to only count terms with wdf > 0, but that's
  expensive to calculate on demand.
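
The BB2Weight entry above is an instance of a common pitfall: subtracting
from an unsigned value wraps around rather than going negative, so a test
such as N - 2 < 0 can never be true.  The sketch below shows the wraparound
and one way to clamp safely.  The function names are hypothetical, and the
changelog does not say how Xapian itself rewrote the clamp.

```cpp
#include <algorithm>
#include <climits>

// For an unsigned N, N - 2 wraps around instead of going negative.
unsigned minus_two(unsigned N) {
    return N - 2u;  // N == 1 gives UINT_MAX, not -1
}

// Clamp by comparing first, without doing any subtraction.
unsigned clamp_at_least_two(unsigned N) {
    return std::max(N, 2u);
}
```

Here minus_two(1) returns UINT_MAX (from <climits>), which is why the
original check never fired, while clamp_at_least_two(1) returns 2.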

glass backend:

* When compacting we now only write the iamglass file out once, and we write it
  before we sync the tables but sync it after, which is more I/O friendly.

* Database::check(): Fix SEGV when out == NULL and opts != 0.

* Fix potential SEGV with corrupt value stats.

chert backend:

* Fix potential SEGV with corrupt value stats.

build system:

* Add XO_REQUIRE autoconf macro to provide an easy way to handle version checks
  in user configure scripts.

tools:

* quest: Support BM25+, LM and PL2+ weighting schemes.

* xapian-check: Fix when ellipses are shown in 't' mode.  They were being shown
  when there were exactly 6 entries, but we only start omitting entries when
  there are *more* than 6.  Fix applies to both glass and chert.

portability:

* Avoid using opendir()/readdir() in our closefrom() implementation as these
  functions can call malloc(), which isn't safe to do between fork() and exec()
  in a multi-threaded program, but after fork() is exactly where we want to
  use closefrom().  Instead we now use getdirentries() on Linux and
  getdirentriesattr() on OS X (OS X support bugs shaken out with help from
  Germán M. Bravo).

* Support reading UUIDs from /proc/sys/kernel/random/uuid which is especially
  useful when building for Android, as it avoids having to cross-build a UUID
  library.

* Disable volatile workaround for excess precision SEGV for SSE - previously it
  was only being disabled for SSE2.

* When building for x86 using a compiler where we don't know how to disable
  use of 387 FP instructions, we now run remote servers for the testsuite under
  valgrind --tool=none, like we do when --disable-sse is explicitly specified.

* Add alignment_cast<T> which has the same effect as reinterpret_cast<T> but
  avoids warnings about alignment issues.

* Suppress warnings about unused private members.  DLHWeight and DPHWeight
  have an unused lower_bound member, which clang warns about, but we need to
  keep them there in 1.4.x to preserve ABI compatibility.

* Remove workaround for g++ 2.95 bug as we require at least 4.7 now.

* configure: Probe for <cxxabi.h>.  GCC added this header in GCC 3.1, which
  is much older than we support, so we've just assumed it was available if
  __GNUC__ was defined.  However, clang lies and defines __GNUC__ yet doesn't
  seem to reliably provide <cxxabi.h>, so we need to probe for it.

* Fix "unused assignment" warning.

* configure: Probe for __builtin_* functions.  Previously we just checked for
  __GNUC__ being defined, but it's cleaner to probe for them properly -
  compilers other than GCC and those that pretend to be GCC might provide these
  too.

* Use __builtin_clz() with compilers which support it to speed up encoding
  and especially decoding of positional data.  This speeds up phrase searching
  by ~0.5% in a simple test.

* Check signed right shift behaviour at compile time - we can use a test on a
  constant expression which should optimise away to just the required version
  of the code, which means that on platforms which perform sign-extension
  (pretty much everything current it seems) we don't have to rely on the
  compiler optimising a portable idiom down to the appropriate right shift
  instruction.

* Improve configure check for log2().  We include <cmath> so the check really
  should succeed if only std::log2() is declared.

* Enable win32-dll option to LT_INIT.
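
The compile-time check of signed right shift behaviour mentioned above can be
sketched like this.  The names kShiftSignExtends and arithmetic_shift_right
are assumptions for illustration, not Xapian's own identifiers.  The
condition is a constant expression, so the compiler can discard the branch
which does not apply.

```cpp
#include <cstdint>

// True if this compiler sign-extends when right-shifting a negative value.
// (Implementation-defined before C++20; guaranteed from C++20 onwards.)
constexpr bool kShiftSignExtends = ((-1) >> 1) == -1;

// Hypothetical helper: arithmetic (sign-propagating) shift right by n bits,
// for 0 <= n < 32.
inline std::int32_t arithmetic_shift_right(std::int32_t v, unsigned n) {
    if (kShiftSignExtends) {
        // Constant condition: only this branch needs to be compiled in.
        return v >> n;
    }
    // Portable fallback: shift the complement of negative values, which is
    // non-negative, then complement the result again.
    return v < 0 ? ~(~v >> n) : v >> n;
}
```

Both branches round towards negative infinity, so shifting -7 right by one
bit gives -4 on either path.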

debug code:

* xapian-inspect:

  + Support glass instead of chert.

  + Allow control of showing keys/tags.

  + Use more mnemonic letters than X for command arguments in help.
   2016-11-23 17:11:52 by Sebastian Wiedenroth | Files touched by this commit (1)
Log message:
link network libs on SunOS