apertium-reader

Apertium-reader reads text from several markup formats: HTML documents, RTF files, OpenOffice Writer ODT files, and the Microsoft Office 2007 formats DOCX, XLSX and PPTX. The default format for apertium-reader is html. To read text from DOC files, use doc-reader.

Apertium-reader uses format handling rules that are specified in XML and based on regular expressions. The current rule files come from the Apertium platform. It is possible to write new handling rules for any XML format; see http://wiki.apertium.org/wiki/Format_handling.
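
To illustrate the general idea of regular-expression-based format handling, here is a minimal Python sketch. It does not use the Apertium rule files or their XML syntax; the two patterns and the function name `extract_fragments` are hypothetical, chosen only to show how regex rules can separate formatting tags from text fragments.

```python
import re

# Hypothetical rules for illustration only; these are NOT the Apertium rule files.
# Tags that end a block of text (paragraph, heading, div, line break).
BLOCK_END_RULE = re.compile(r"</(?:p|h[1-6]|div)>|<br\s*/?>", re.IGNORECASE)
# Any remaining markup tag, treated as inline formatting.
TAG_RULE = re.compile(r"<[^>]+>")


def extract_fragments(markup: str) -> list[str]:
    """Return the text fragments of a simple HTML-like string, without tags."""
    # Turn block boundaries into newlines so each block becomes one fragment.
    with_breaks = BLOCK_END_RULE.sub("\n", markup)
    # Drop every remaining tag, keeping only the text between them.
    text_only = TAG_RULE.sub("", with_breaks)
    # Collapse runs of whitespace and skip empty lines.
    return [" ".join(line.split()) for line in text_only.splitlines() if line.strip()]


if __name__ == "__main__":
    sample = "<h1>Header</h1><p>Text in <b>first</b> paragraph.</p><p>Second paragraph.</p>"
    for fragment in extract_fragments(sample):
        print(fragment)
```

Real format handling has to deal with entities, nested structures and compressed containers, which this sketch deliberately ignores.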

Note about the license: the XML files with rules come from the Apertium platform and are licensed under the GNU General Public License.

Examples

apertium-reader --format docx ! simple-writer --tags frag

Reads a DOCX file with the --format docx option and writes only text fragments.

in:
/storage/18cb02b23c80441c21e9163057d78a80.UNKNOWN
out:
Przykładowy nagłówek.
Przykładowy tekst pierwszego akapitu.
Tekst w drugim akapicie.

apertium-reader --format rtf ! simple-writer --tags frag

Reads an RTF file with the --format rtf option and writes only text fragments using simple-writer.

in:
/storage/ed33e779887ea8587d9de91e44a34e9e.rtf
out:
Title
Text in first paragraph.
Second paragraph.

apertium-reader ! simple-writer --tags frag

Reads an HTML file and outputs only the text content. No --format option is needed because html is the default format.

in:
/storage/778ca8d7d38111cb4efe0cbc2268f3ff.html
out:
Header
Text in first paragraph.
Second paragraph.

Options

Allowed options:
  --format arg (=html)     type of file for deformatting
  --specification-file arg specification file path
  --unzip-data arg (=1)    unzip compressed file formats like .pptx or .xlsx
  --keep-tags              keep formatting tags
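
As a rough sketch of how the options above combine into a pipeline string like the ones in the Examples section, the following Python helper assembles the command text. The function name `build_pipeline` and its parameters are hypothetical, and passing `0` to `--unzip-data` to disable unzipping is an assumption based on the `(=1)` default shown above. The helper only builds the string; it does not run anything.

```python
from typing import Optional


def build_pipeline(
    fmt: str = "html",
    specification_file: Optional[str] = None,
    unzip_data: bool = True,
    keep_tags: bool = False,
    writer: str = "simple-writer --tags frag",
) -> str:
    """Assemble an apertium-reader pipeline string (hypothetical helper)."""
    parts = ["apertium-reader"]
    # html is the default format, so the option is emitted only for other formats.
    if fmt != "html":
        parts += ["--format", fmt]
    if specification_file is not None:
        parts += ["--specification-file", specification_file]
    # Assumption: 0 disables unzipping, since the documented default is 1.
    if not unzip_data:
        parts += ["--unzip-data", "0"]
    if keep_tags:
        parts.append("--keep-tags")
    # Processors in the pipeline are separated by "!" as in the examples.
    return " ".join(parts) + " ! " + writer


if __name__ == "__main__":
    print(build_pipeline(fmt="docx"))
    print(build_pipeline())
```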