Creating text corpora for special purposes on the basis of extended TXM platform

Authors

Smirnoff I., Suvorova (Ananieva) M.

Abstract

The TXM platform offers a wide range of corpus analysis capabilities, including correspondence analysis, clustering, lexical table construction, and parametrized subcorpus selection. The default structural unit of analysis in TXM is the token. However, each token can be annotated with a number of features, which enables more sophisticated yet flexible corpus analysis. The only extension available by default is TreeTagger, which adds automated morphological analysis of tokens to the platform. In this work, we present a set of tools for more extensive and complex corpus analysis that build both on tools we developed previously and on publicly available ones.

External links

DOI: https://doi.org/10.18127/j20729472-201803-13

Article (PDF) in the Highly Available Systems journal (in Russian): https://npo-echelon.ru/doc/Aktualnie_voprosi_2019.pdf

ResearchGate: https://www.researchgate.net/publication/327903105_Sozdanie_specialnyh_korpusov_tekstov_na_osnove_rassirennoj_platformy_TXM

Semantic Scholar: https://api.semanticscholar.org/CorpusID:187957131

Citation

Lavrentev A. M., Smirnov I. V., Suvorova M. I., Solovyov F. N., Fokina A. I., Chepovsky A. M. Creating text corpora for special purposes on the basis of extended TXM platform // Highly Available Systems. 2018. Vol. 14. No. 3. P. 76-81.