Creating text corpora for special purposes on the basis of extended TXM platform


Smirnoff I., Suvorova (Ananieva) M.


The TXM platform offers a wide range of corpus analysis capabilities, including correspondence analysis, clustering, lexical table construction, and parametrized subcorpus selection. The default structural unit of analysis in TXM is the token. However, each token can be supplied with a number of features, which enables more sophisticated yet flexible corpus analysis. The only extension available by default is TreeTagger, which adds automated morphological analysis of tokens to the TXM platform. In this work we present a number of tools for more extensive and complex corpus analysis, relying both on our previously developed tools and on publicly available ones.
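The idea of tokens carrying features that then drive subcorpus selection can be illustrated with a minimal sketch. It is plain Python and does not use the TXM API; the `Token` class, the `select` helper, and the feature names `pos` and `lemma` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """One token plus arbitrary named features (hypothetical, not TXM's API)."""
    word: str
    features: dict = field(default_factory=dict)


def select(tokens, **criteria):
    """Return the tokens whose features match every key=value pair given."""
    return [
        t for t in tokens
        if all(t.features.get(key) == value for key, value in criteria.items())
    ]


# Toy corpus; the feature names and values are invented for illustration.
corpus = [
    Token("corpora", {"pos": "NOUN", "lemma": "corpus"}),
    Token("are", {"pos": "VERB", "lemma": "be"}),
    Token("analysed", {"pos": "VERB", "lemma": "analyse"}),
]

verbs = select(corpus, pos="VERB")
print([t.word for t in verbs])  # ['are', 'analysed']
```

Adding a further feature, for example one produced by an external tagger, would only require another key in `features`; the selection logic stays unchanged.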

External links


Article (PDF) in the Highly Available Systems journal (in Russian):


Semantic Scholar:

Reference link

Lavrentev A. M., Smirnov I. V., Suvorova M. I., Solovyov F. N., Fokina A. I., Chepovsky A. M. Creating text corpora for special purposes on the basis of extended TXM platform // Highly Available Systems. 2018. Vol. 14. No. 3. P. 76-81.