CLUSTERIZATION OF REGULATORY-REFERENCE INFORMATION USING THE LDA ALGORITHM | Информационные технологии и математическое моделирование в управлении сложными системами

Authors:

Alimaev Andrey Alekseevich

Plotnikova Natalya Pavlovna

Receipt date:

29.04.2021

Section:

2. Information technology in the management of technical and socio-economic objects

Year:

2021

Journal number:

2(10) 2021

UDC:

004.032.26

DOI:

10.26731/2658-3704.2021.2(10).46-51

Article File:

clustering_normative-reference_information.pdf

Abstract:

The article discusses a method for clustering normative and reference information using Latent Dirichlet Allocation (LDA), a generative probabilistic model. The ever-increasing volume of information makes manual processing difficult. Automatic processing of normative and reference information is therefore a relevant task: it frees people from monotonous work and reduces the number of errors. A distinctive feature of this task is that normative and reference information consists mainly of text written by people, so it may contain typographical errors or other mistakes. Similar names may also appear in different categories of normative and reference information; clustering helps to avoid this problem. The article considers the preparation of data for clustering: tokenization and removal of stop words. The resulting set of tokens is filtered by the frequency with which tokens occur in documents. Documents can be represented for clustering in two ways, using bag-of-words or TF-IDF. The clustering results are assessed, conclusions are drawn on the applicability of the method, and the possibility of further improving clustering with a hierarchical approach is considered.
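
The preparation steps named in the abstract (tokenization, stop-word removal, frequency-based filtering, and the bag-of-words and TF-IDF representations) can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: the stop-word list, the thresholds `min_docs` and `max_share`, the function names, and the sample records are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be language- and domain-specific.
STOP_WORDS = {"the", "a", "of", "for", "with", "and"}


def tokenize(text):
    """Lowercase the text, split it into alphanumeric tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def filter_by_document_frequency(docs, min_docs=2, max_share=0.9):
    """Keep tokens found in at least min_docs documents and in at most
    max_share of all documents (both thresholds are assumed values)."""
    df = Counter(tok for doc in docs for tok in set(doc))
    limit = max_share * len(docs)
    vocab = {tok for tok, n in df.items() if min_docs <= n <= limit}
    return [[tok for tok in doc if tok in vocab] for doc in docs], df


def bag_of_words(doc):
    """Represent a document as raw token counts."""
    return dict(Counter(doc))


def tf_idf(doc, df, n_docs):
    """Weight each token by its term frequency times log inverse document frequency."""
    counts = Counter(doc)
    return {tok: (c / len(doc)) * math.log(n_docs / df[tok]) for tok, c in counts.items()}


# Made-up sample records standing in for normative and reference entries.
records = [
    "Bolt M8 steel with nut",
    "Bolt M10 steel",
    "Nut M8 steel",
    "Copper cable for the panel",
    "Copper cable 3x2.5",
]
tokenized = [tokenize(r) for r in records]
filtered, df = filter_by_document_frequency(tokenized)
bow_vectors = [bag_of_words(d) for d in filtered]
tfidf_vectors = [tf_idf(d, df, len(filtered)) for d in filtered]
```

Either set of vectors could then be passed to an LDA implementation, which assigns each document a distribution over topics that can be read as soft cluster membership. In the TF-IDF variant, a token shared by many records (here `steel`) receives a lower weight than a rarer one (here `bolt`).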

Keywords:

LDA

Latent Dirichlet Allocation

clustering

probabilistic model

topic modeling

List of references:

1. Hofmann, Thomas. "Probabilistic latent semantic analysis." arXiv preprint arXiv:1301.6705 (2013).

2. Marjanen J. et al. Topic modelling discourse dynamics in historical newspapers //arXiv preprint arXiv:2011.10428. – 2020.

3. Zhao F. et al. Latent Dirichlet Allocation Model Training With Differential Privacy //IEEE Transactions on Information Forensics and Security. – 2020. – Vol. 16. – P. 1290-1305.

4. Radim Řehůřek. Optimized Latent Dirichlet Allocation (LDA) [Electronic resource]. – Available at: https://radimrehurek.com/gensim/models/ldamodel.html, free access. – (accessed: 02.02.2021).

5. Rieger J. et al. Improving Reliability of Latent Dirichlet Allocation by Assessing Its Stability Using Clustering Techniques on Replicated Runs //arXiv preprint arXiv:2003.04980. – 2020.