srx-segmenter

Splits text into segments (i.e. sentences) according to rules defined in an SRX (Segmentation Rules Exchange) file. In terms of psi-toolkit lattices, segment edges are extracted from frag edges.

By default, slightly modified SRX files from the Translatica machine translation system are used (for Polish, English, Russian, German, French and Italian). A different SRX file can be specified with the --rules option.
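
The rule-based splitting described above can be illustrated with a minimal sketch. It is not the psi-toolkit implementation: the names Rule, RULES and split_segments are hypothetical, the two toy rules are not taken from the bundled SRX files, and Python's re module stands in for PCRE (the two differ in details). It assumes the usual SRX reading in which, at each position, the first rule whose before-break and after-break patterns both match decides whether a break is made; cascade mode and length limits are ignored.

```python
import re
from typing import List, NamedTuple


class Rule(NamedTuple):
    is_break: bool  # True models break="yes", False models break="no"
    before: str     # pattern that must match text ending at the position
    after: str      # pattern that must match text starting at the position


# Hypothetical toy rules for illustration only.
RULES = [
    Rule(False, r"\b(e\.g|i\.e|m\.in)\.", r"\s"),  # exception: abbreviations
    Rule(True, r"[.!?]", r"\s"),                   # break after final punctuation
]


def split_segments(text: str, rules: List[Rule]) -> List[str]:
    """Split text at every position where the first matching rule is a break rule."""
    segments = []
    start = 0
    for pos in range(1, len(text)):
        for rule in rules:
            before_ok = re.search(r"(?:" + rule.before + r")\Z", text[:pos])
            after_ok = re.match(rule.after, text[pos:])
            if before_ok and after_ok:
                if rule.is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # the first matching rule decides this position
    segments.append(text[start:])
    return segments


if __name__ == "__main__":
    for segment in split_segments(
        "I've been to many countries, e.g. Germany, France, Canada. I enjoy travelling.",
        RULES,
    ):
        print(segment)
```

In this sketch the whitespace after a sentence-final mark stays with the following segment, which is consistent with the leading space visible in the example outputs under Examples.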

The maximum sentence length (in bytes) can be set with --sentence-length-hard-limit and --sentence-length-soft-limit; setting either option to zero turns that limit off.
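
One possible reading of the two limits is sketched below. The function enforce_limits is hypothetical and only models the behaviour stated in the option descriptions: the hard limit forces a break wherever the segment would grow past it, the soft limit forces a break only at a space, and zero disables a limit. That lengths are counted in UTF-8 bytes, that a forced break never splits a character, and that the soft break is taken at the first space after the limit is reached are all assumptions of this sketch, not documented behaviour.

```python
from typing import List


def enforce_limits(segment: str, hard_limit: int = 1000, soft_limit: int = 600) -> List[str]:
    """Split an over-long segment; lengths are counted in UTF-8 bytes (assumed)."""
    pieces = []
    current = ""
    current_bytes = 0
    for ch in segment:
        ch_bytes = len(ch.encode("utf-8"))
        if hard_limit and current and current_bytes + ch_bytes > hard_limit:
            # Hard limit: this character would not fit, so force a break before it.
            pieces.append(current)
            current, current_bytes = "", 0
        elif soft_limit and ch == " " and current_bytes >= soft_limit:
            # Soft limit: the piece is long enough, so break at this space.
            pieces.append(current)
            current, current_bytes = "", 0
        current += ch
        current_bytes += ch_bytes
    if current:
        pieces.append(current)
    return pieces


if __name__ == "__main__":
    print(enforce_limits("aaaa bbbb cccc", hard_limit=0, soft_limit=6))
    print(enforce_limits("abcdefgh", hard_limit=3, soft_limit=0))
```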

Known deviations from the SRX standard (cf. http://www.gala-global.org/oscarStandards/srx/srx20.html):

  • regexps in SRX files are interpreted as PCRE regexps,
  • \G metacharacter is not handled,
  • segmentsubflows attribute is not handled in any way.

Aliases

segment, segment-generator, segmenter

Languages

de, en, es, fi, fr, it, pl, ru, tr, xx

Examples

segment --lang pl ! write-simple --tags segment

Splits a Polish text into sentences.

in:
Zwiedziłem wiele krajów, m.in. Niemcy, Francję, Kanadę. Uwielbiam podróżować!
out:
Zwiedziłem wiele krajów, m.in. Niemcy, Francję, Kanadę.
 Uwielbiam podróżować!

segment --lang en ! write-simple --tags segment

Splits an English text into sentences.

in:
I've been to many countries, e.g. Germany, France, Canada. I enjoy travelling.
out:
I've been to many countries, e.g. Germany, France, Canada.
 I enjoy travelling.

Options

Allowed options:
  --lang arg (=guess)                   language
  --force-language                      force using specified language even if 
                                        a text was recognised otherwise
  --rules arg (=%ITSDATA%/%LANG%/segmentation.srx)
                                        rule file
  --cascade                             force cascade mode
  --sentence-length-hard-limit arg (=1000)
                                        maximum length (in bytes, not in 
                                        characters) of a sentence (if, 
                                        according to rules, a sentence of a 
                                        greater length would be generated, a 
                                        sentence break is forced), zero turns 
                                        the limit off
  --sentence-length-soft-limit arg (=600)
                                        soft limit on the length (in bytes) of 
                                        a sentence (sentence break is forced 
                                        only on spaces), zero turns the limit 
                                        off
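
Because both limits are expressed in bytes rather than characters, text with non-ASCII letters reaches them sooner than its character count suggests. The short check below illustrates the difference for the second sentence of the Polish example; it assumes UTF-8 encoding and NFC-normalised input, neither of which is stated in the option descriptions.

```python
import unicodedata

# Normalise to NFC so each accented letter is a single code point.
sentence = unicodedata.normalize("NFC", "Uwielbiam podróżować!")

char_count = len(sentence)                  # number of characters
byte_count = len(sentence.encode("utf-8"))  # number of UTF-8 bytes

print(char_count, byte_count)  # 21 24
```

The three letters ó, ż and ć each take two bytes in UTF-8, which accounts for the difference of three.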