./graphics/tesseract, Open Source OCR Engine

Branch: CURRENT, Version: 4.0.0nb4, Package name: tesseract-4.0.0nb4, Maintainer: pkgsrc-users

Tesseract provides an OCR engine and a command line program. It
includes a new neural net (LSTM) based OCR engine which is focused on
line recognition, but also still provides a legacy OCR engine which
works by recognizing character patterns. Tesseract has Unicode (UTF-8)
support, and can recognize more than 100 languages "out of the box".
Tesseract can be trained to recognize other languages. It supports
various output formats: plain text, hOCR (HTML), PDF,
invisible-text-only PDF, and TSV.
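
As a rough illustration of how the engines and output formats above map onto the command line program, the following Python sketch assembles a tesseract invocation. The function name and its parameters are hypothetical. The sketch assumes the stock config files named hocr, pdf and tsv are installed, and that the textonly_pdf parameter selects the invisible-text-only PDF.

```python
def build_tesseract_command(image, output_base, fmt="text", lang="eng", legacy=False):
    """Return an argv list for the tesseract command line program.

    Hypothetical helper: it only assembles arguments, it does not run OCR.
    """
    cmd = ["tesseract", image, output_base, "-l", lang]
    # --oem 0 selects the legacy engine, --oem 1 the LSTM engine.
    # The legacy engine needs traineddata that still contains legacy models.
    cmd += ["--oem", "0" if legacy else "1"]
    if fmt == "textonly_pdf":
        # Assumption: textonly_pdf=1 plus the pdf config yields the
        # invisible-text-only PDF.
        cmd += ["-c", "textonly_pdf=1", "pdf"]
    elif fmt in ("hocr", "pdf", "tsv"):
        # Config file names that switch on the matching renderer.
        cmd.append(fmt)
    elif fmt != "text":
        raise ValueError(f"unsupported output format: {fmt}")
    return cmd


if __name__ == "__main__":
    print(" ".join(build_tesseract_command("scan.png", "scan", fmt="hocr")))
```

The resulting list can be handed to subprocess.run(cmd, check=True) on a system where the package is installed.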

Required to run:
[textproc/icu] [graphics/cairo] [devel/pango] [graphics/leptonica]

Required to build:
[textproc/asciidoc] [pkgtools/x11-links] [x11/xcb-proto] [pkgtools/cwrappers] [x11/xorgproto]

SHA1: 243a4919d44bc64d1e7e4cac660c716c845a8d03
RMD160: 0e95d343639ab98c6d3fbc528053b627b6e12282
Filesize: 1915.402 KB

CVS history:

   2019-01-16 01:07:49 by David H. Gutteridge | Files touched by this commit (1) | Package updated
Log message:
graphics/tesseract: update DESCR

The DESCR was about a decade out of date, revise to reflect 4.0.
   2018-12-09 19:52:52 by Adam Ciarcinski | Files touched by this commit (724)
Log message:
revbump after updating textproc/icu
   2018-11-29 10:15:23 by Adam Ciarcinski | Files touched by this commit (2)
Log message:
tesseract: fix manpage formatting
   2018-11-28 13:04:20 by Adam Ciarcinski | Files touched by this commit (1)
Log message:
tesseract: build depends on asciidoc
   2018-11-18 19:07:20 by Adam Ciarcinski | Files touched by this commit (2)
Log message:
tesseract: use REPLACE_BASH; fix building man-pages; courtesy of Mustafa D. :)
   2018-11-14 23:22:54 by Klaus Klein | Files touched by this commit (1332) | Package updated
Log message:
Revbump after cairo 1.16.0 update.
   2018-11-12 04:53:16 by Ryo ONODERA | Files touched by this commit (1532)
Log message:
Recursive revbump from harfbuzz-2.1.1
   2018-11-03 10:13:07 by Adam Ciarcinski | Files touched by this commit (5) | Package updated
Log message:
tesseract: updated to 4.0.0

New OCR engine
- Added a new OCR engine that uses a neural network system based on LSTMs, with major accuracy gains.
- This includes new training tools for the LSTM OCR engine. A new model can be trained from scratch or by fine tuning an existing model.
- Added trained data that includes LSTM models for 123 languages.
- Added optional accelerated code paths for the LSTM recognizer:
  * Using OpenMP
  * Using SIMD: AVX2 / AVX / SSE4.1
- Added a new parameter lstm_choice_mode that allows including alternative symbol choices in the hOCR output.
- The new LSTM engine still does not support all features of the old legacy engine (see missing features).
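
A minimal sketch of how lstm_choice_mode might be passed on the command line, written in Python. The helper name is hypothetical, and the accepted values 0, 1 and 2 are an assumption taken from the upstream parameter description rather than from this changelog.

```python
def hocr_with_choices_command(image, output_base, choice_mode=2):
    """Return an argv list for an hOCR run that includes alternative symbol choices.

    Hypothetical helper: it only assembles arguments, it does not run OCR.
    """
    # Assumption: 0 disables the feature, 1 and 2 select different ways
    # of listing the alternative choices.
    if choice_mode not in (0, 1, 2):
        raise ValueError("choice_mode must be 0, 1 or 2")
    return [
        "tesseract", image, output_base,
        "--oem", "1",                           # the parameter applies to the LSTM engine
        "-c", f"lstm_choice_mode={choice_mode}",
        "hocr",                                 # alternatives appear in the hOCR output only
    ]


if __name__ == "__main__":
    print(" ".join(hocr_with_choices_command("page.png", "page")))
```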

Other OCR engines
- The pattern matching OCR engine that was the primary OCR engine in previous versions is still available in this version.
- Removed the 'Cube' OCR engine from the codebase. It was used for Hindi and for Arabic. The new LSTM engine performs much better, thus the Cube engine was no longer needed.

Updated build system
- Tesseract now uses semantic versioning.
- Tesseract now requires Leptonica 1.74.0 or a higher version.
- For building Tesseract from source code, a compiler with good C++11 support is required. Upstream maintains a list of officially supported compilers.
- Added unit tests to the main repo. The unit tests require Git submodules and the code for training.
- Added an option to compile Tesseract without the code of the legacy OCR engine.
- Update minimum required autoconf version to 2.63.
- Training tools dependencies: updated minimum required versions to ICU 52.1 and Pango 1.22.0.
- Reorganized Tesseract's source tree. Most sources are now below the src directory.

Bug fixes and enhancements
- Fixed many issues that triggered compiler warnings.
- Fixed many issues reported by Coverity Scan or LGTM.
- Fixes to trainingdata rendering.
- Fixed damage to binary images when processing PDFs.
- Don't trigger a deliberate segmentation fault for fatal errors in release code.
- Fixed some issues in OpenCL code. OpenCL now works for the legacy Tesseract OCR engine, but does not improve the performance. It is not implemented for the LSTM OCR engine.
- Improved multi-page TIFF handling.
- Improvements to PDF rendering.
- Added version information and improved help texts to the training tools.
- Added faster version of log2().
- Documented in the tesseract man page the option to use an input text file that contains a list of images.
- Made 'osd' the default traineddata when --psm 0 is requested (currently this feature is only implemented in the command line interface, not in the API).
- Removed tessedit_pageseg_mode 1 from the hocr, pdf, and tsv config files. The user should explicitly use --psm 1 if that is desired.
- The list of available languages and scripts is now sorted alphabetically.
- Default of parameter unlv_tilde_crunching changed to false, because the previous default caused issues with UNLV output in Tesseract 4.
- Removed obsolete code.
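
Two of the command line changes above, the image list input and the explicit page segmentation mode, can be sketched in Python as follows. All helper names are hypothetical, and the sketch assumes the list file holds one image path per line.

```python
import os
import tempfile


def write_image_list(image_paths, list_path):
    """Write one image path per line, for use as tesseract's input file."""
    with open(list_path, "w", encoding="utf-8") as handle:
        for path in image_paths:
            handle.write(path + "\n")
    return list_path


def osd_command(input_path, output_base):
    """argv for orientation and script detection only (--psm 0).

    Per the notes above, the 4.0 command line interface then defaults
    to the 'osd' traineddata, so no -l option is passed here.
    """
    return ["tesseract", input_path, output_base, "--psm", "0"]


def hocr_auto_osd_command(input_path, output_base):
    """argv for hOCR output with --psm 1 requested explicitly.

    The hocr config no longer sets tessedit_pageseg_mode 1 itself.
    """
    return ["tesseract", input_path, output_base, "--psm", "1", "hocr"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        listing = write_image_list(
            ["page1.tif", "page2.tif"], os.path.join(workdir, "pages.txt")
        )
        print(" ".join(hocr_auto_osd_command(listing, "book")))
```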