
Specification for lemmatise

lamerlemma

A simple lemmatizer. It can be used with predefined lemmatizers in binary format or with text files that contain a full-form lexicon with optional grammatical information. It is also possible to create binary files from the specified text files for more efficient, repeated use.

By default, the dictionary text file format consists of 2 to 4 tab-separated columns with the following meaning:

  1. word form - (required) the inflected word form, may be repeated,
  2. lemma - (required) the base form, may be repeated,
  3. part-of-speech tag - (optional) a single part-of-speech tag followed by optional tag-related features,
  4. morphological features - (optional) zero or more morphological features.

The simplest format consists of two columns containing only word forms and lemmas (tab as primary separator):

Ala\tAl
Ala\tAla
Alego\tAl
Alę\tAla
Aly\tAla
ma\tmieć
ma\tmój

The same with part-of-speech tags and some morphological features (single space as secondary separator):

Ala\tAl\tsubst\tcase=acc gender=m1 number=sg
Ala\tAla\tsubst\tcase=nom gender=f number=sg
ma\tmieć\tverb\tnumber=sg aspect=imperf person=ter tense=fin
ma\tmój\tadj\tcase=nom gender=f number=sg degree=pos

If the text file contains part-of-speech and/or morphological information, this has to be stated explicitly with --pos and --morpho respectively; otherwise the data is not included in the analysis or in the construction of a binary version. When included, this information is saved in the binary version. The --morpho option implies --pos. The default separators (tab for columns, space for inner-column features) can be changed with --primary-separator and --secondary-separator respectively.
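The text format and options described above can be illustrated with a short Python sketch. This is not lamerlemma's own code: the names Entry, parse_line and build_index are hypothetical, and the keyword arguments only mirror the documented --pos, --morpho, --primary-separator and --secondary-separator options under the assumption that they behave as stated here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    """One lexicon line (hypothetical structure, for illustration only)."""
    form: str                      # column 1: inflected word form
    lemma: str                     # column 2: base form
    pos: Optional[str]             # column 3: part-of-speech tag, if requested
    pos_features: Tuple[str, ...]  # column 3: optional tag-related features
    morpho: Tuple[str, ...]        # column 4: morphological features, if requested


def parse_line(line: str, with_pos: bool = False, with_morpho: bool = False,
               primary: str = "\t", secondary: str = " ") -> Entry:
    """Parse one lexicon line with 2 to 4 columns."""
    # Mirrors the documented rule that --morpho implies --pos.
    with_pos = with_pos or with_morpho
    columns = line.rstrip("\r\n").split(primary)
    if not 2 <= len(columns) <= 4:
        raise ValueError(f"expected 2 to 4 columns, got {len(columns)}: {line!r}")
    form, lemma = columns[0], columns[1]
    pos: Optional[str] = None
    pos_features: Tuple[str, ...] = ()
    morpho: Tuple[str, ...] = ()
    # Grammatical columns are ignored unless explicitly requested.
    if with_pos and len(columns) >= 3:
        parts = [p for p in columns[2].split(secondary) if p]
        if parts:
            pos, pos_features = parts[0], tuple(parts[1:])
    if with_morpho and len(columns) == 4:
        morpho = tuple(p for p in columns[3].split(secondary) if p)
    return Entry(form, lemma, pos, pos_features, morpho)


def build_index(lines: Iterable[str], **options) -> Dict[str, List[Entry]]:
    """Group entries by word form; a form may be repeated with different lemmas."""
    index: Dict[str, List[Entry]] = defaultdict(list)
    for line in lines:
        if line.strip():  # skip blank lines
            entry = parse_line(line, **options)
            index[entry.form].append(entry)
    return dict(index)
```

For example, build_index(["Ala\tAl", "Ala\tAla"])["Ala"] holds two entries, one with the lemma Al and one with the lemma Ala, which reflects the rule that a word form may be repeated.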

The default morphological dictionary of Polish for the lamerlemma lemmatizer was created using linguistic data from the SGJP Grammatical Dictionary of Polish.

Aliases

lemma-generator, lemmatise, lemmatiser, lemmatize, lemmatizer

Languages

de, en, es, fr, it, pl

Options

  --lang arg (=guess)                   language
  --force-language                      force using specified language even if 
                                        a text was recognised otherwise
  --binary-lexicon arg (=%ITSDATA%/%LANG%.bin)
                                        path to the lexicon in the binary 
                                        format
  --level arg (=3)                      set word processing level 0-3 (0 - do 
                                        nothing, 1 - return only base forms, 2 
                                        - add grammatical class and main 
                                        attributes, 3 - add detailed 
                                        attributes)
  --plain-text-lexicon arg              path to the lexicon in the plain text 
                                        format
  --save-binary-lexicon arg             as a side effect the lexicon in the 
                                        binary format is generated

morfologik

Morfologik is a Polish morphological analyzer and lemmatizer. It returns morphosyntactic information for each token: base forms, grammatical class and attributes.

Values returned by Morfologik are described on the page Znaczniki Morfologika (in Polish). In general, Morfologik's tagset is similar to the tagset of the National Corpus of Polish, so you can also see http://nkjp.pl/poliqarp/help/ense2.html for more details.

Aliases

lemma-generator, lemmatise, lemmatiser, lemmatize, lemmatizer

Languages

pl

Examples

morfologik ! simple-writer --tags lemma

Returns all base forms for each word.

in:
Ala ma kota i psa.
out:
Al|Ala
mieć|mój
kot|kota
i
pies

morfologik ! simple-writer --tags lexeme

Returns all base forms and grammatical classes for each word.

in:
Wszędzie dobrze, ale w domu najlepiej.
out:
wszędzie+adv
dobro+subst|dobry+adv|dobrze+adv
ala+qub|ale+conj
w+prep|wiek+brev
dom+subst
dobrze+adv
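
The output lines in the two examples above can be post-processed with a few lines of Python. This is a minimal sketch, assuming that alternatives are always separated by "|", that a base form and its grammatical class are joined by "+", and that neither character occurs inside a base form; the function name parse_output_line is hypothetical.

```python
from typing import List, Optional, Tuple


def parse_output_line(line: str) -> List[Tuple[str, Optional[str]]]:
    """Split one output line into (base form, grammatical class) pairs.

    The grammatical class is None when the line carries base forms only,
    as in the "--tags lemma" example.
    """
    analyses: List[Tuple[str, Optional[str]]] = []
    for alternative in line.strip().split("|"):
        # partition() splits on the first "+" only; sep is "" if none is found.
        lemma, sep, tag = alternative.partition("+")
        analyses.append((lemma, tag if sep else None))
    return analyses
```

For the line dobro+subst|dobry+adv|dobrze+adv this gives three pairs, (dobro, subst), (dobry, adv) and (dobrze, adv); for Al|Ala it gives two base forms with no grammatical class.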

Options

Allowed options:
  --level arg (=3)         set word processing level 0-3 (0 - do nothing, 1 - 
                           return only base forms, 2 - add grammatical class 
                           and main attributes, 3 - add detailed attributes)
  --dict arg (=morfologik) set dictionary, one of morfologik, morfeusz, 
                           combined
  --keep-original          keep original Morfologik's settings i.e. do not 
                           break brief forms
