Frequently Asked Questions

Is the graphical user interface available offline?

Yes, but the GUI is currently available only to Linux users. After installing one of the PSI-Toolkit Linux packages, you get the same web service as the one on this site, and you can run it offline with the command:

$ psi-server

Which types of input files does PSI-Toolkit support?

PSI-Toolkit handles all kinds of text files, including the internal formats of some NLP tools, such as PSI and UTT. Other supported file formats are (X)HTML, PDF, DOC, DjVu, DOCX, XLSX and PPTX.

Which languages does PSI-Toolkit support?

It depends on the processor. Some processors, such as tokenizer and segmentizer, support a wide range of languages, whereas others, such as morfologik or link-grammar, are designed for a specific language. The documentation lists the supported languages for each processor.

How can I display all text fragments with a particular tag?

You can filter the displayed PSI-lattice by using the tag option of simple-writer. See the page Working with PSI-lattice for details.

How can I extract the grammatical classes for each word in a sentence?

One possible solution is to run PSI-Toolkit with the following pipeline:

lemmatize ! write --tags lexeme --fallback-tags token

It returns a list of all known lemmas and their grammatical classes for each token, or simply the token itself if no lemma is found. The lexemes are returned in exactly the same order as the tokens produced by the tokenize command.
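If you want to call this pipeline from a script, something like the following Python sketch may help. It assumes that psi-pipe is on your PATH, that it reads the input text from standard input, and that it accepts the pipeline as separate command-line arguments. The helper names build_command and run_pipeline are made up for this example and are not part of PSI-Toolkit.

```python
import shlex
import shutil
import subprocess

# Pipeline copied from the answer above; "!" separates the processors.
PIPELINE = "lemmatize ! write --tags lexeme --fallback-tags token"


def build_command(pipeline, executable="psi-pipe"):
    """Turn a pipeline string into an argument list (hypothetical helper)."""
    return [executable] + shlex.split(pipeline)


def run_pipeline(text, pipeline=PIPELINE):
    """Send text to psi-pipe on stdin and return its output.

    Returns None when psi-pipe is not installed.
    Assumption: psi-pipe reads its input from standard input.
    """
    if shutil.which("psi-pipe") is None:
        return None
    result = subprocess.run(
        build_command(pipeline),
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    output = run_pipeline("Ala ma kota.")
    print(output if output is not None else "psi-pipe was not found on PATH")
```

Passing the pipeline as an argument list rather than as one shell string means that no shell gets a chance to interpret the "!" separators.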

Which character sets are supported?

PSI-Toolkit natively supports text in the UTF-8 character encoding, but the psi-pipe command also integrates libraries for automatic detection and conversion of character sets. See its --list-encoding and --encoding-conversion options.

Additionally, conversion to UTF-8 is enabled for all files sent to PSI-Server.
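If you prefer to convert the text to UTF-8 yourself before passing it to PSI-Toolkit, a minimal Python sketch could look like the one below. It uses only the Python standard library and a simple heuristic: keep the bytes if they are already valid UTF-8, otherwise decode them from a fallback encoding that you choose. It is not the detection mechanism used by psi-pipe, and the function name to_utf8 and the default ISO-8859-2 fallback are assumptions made for this example.

```python
def to_utf8(raw, fallback_encoding="iso-8859-2"):
    """Return raw bytes re-encoded as UTF-8 (hypothetical helper).

    Bytes that already form valid UTF-8 are returned unchanged;
    anything else is decoded with fallback_encoding first.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback_encoding).encode("utf-8")


if __name__ == "__main__":
    legacy = "zażółć gęślą jaźń".encode("iso-8859-2")
    print(to_utf8(legacy).decode("utf-8"))
```

The heuristic cannot distinguish between different legacy encodings, so the fallback has to match the encoding your files really use.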

