This article describes a method for clustering normative and reference information using Latent Dirichlet Allocation (LDA), a generative probabilistic model. The ever-increasing volume of information makes manual processing impractical. Automating the processing of normative and reference information is therefore a relevant task: it frees people from monotonous work and reduces the number of errors. A distinctive feature of this task is that normative and reference information consists mainly of text written by people, so it may contain typographical and other errors. Similar names may also end up in different categories of normative and reference information; clustering helps to avoid this problem. The article considers data preparation for clustering, namely tokenization and stop-word removal, followed by filtering of the resulting tokens by how frequently they occur across documents. Clustering can be performed on either of two document representations: bag-of-words or TF-IDF. The clustering results are assessed, conclusions are drawn on the applicability of the method, and the possibility of further improving clustering with a hierarchical approach is considered.
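The preparation steps named above (tokenization, stop-word removal, frequency-based filtering, and the bag-of-words and TF-IDF representations) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the article's implementation: the stop-word list, the sample records, the function names, the thresholds `min_docs` and `max_share`, and the unsmoothed `log(N / df)` weighting are all hypothetical choices made for the example. The LDA fitting step itself is not shown.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be chosen for the data's language.
STOP_WORDS = {"the", "a", "of", "for", "and", "with"}


def tokenize(text):
    """Lowercase the text, split it into alphanumeric tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def filter_by_document_frequency(docs, min_docs=2, max_share=0.8):
    """Keep tokens found in at least min_docs documents and in at most
    max_share of all documents (both thresholds are assumptions)."""
    df = Counter(token for doc in docs for token in set(doc))
    limit = max_share * len(docs)
    keep = {t for t, n in df.items() if min_docs <= n <= limit}
    return [[t for t in doc if t in keep] for doc in docs]


def bag_of_words(docs):
    """Represent each document as a mapping from token to raw count."""
    return [Counter(doc) for doc in docs]


def tf_idf(bows):
    """Weight each raw count by log(N / document frequency)."""
    n_docs = len(bows)
    df = Counter(token for bow in bows for token in bow)
    return [
        {t: count * math.log(n_docs / df[t]) for t, count in bow.items()}
        for bow in bows
    ]


# Hypothetical records standing in for normative and reference entries.
records = [
    "Steel bolt M8 with hex head",
    "Hex bolt M10 steel",
    "Copper wire for the cable",
    "Cable wire copper 2 mm",
    "Steel hex nut M8",
]

tokens = filter_by_document_frequency([tokenize(r) for r in records])
bows = bag_of_words(tokens)
weights = tf_idf(bows)
```

Either `bows` or `weights` could then be converted to the input format of an LDA implementation, such as gensim's `LdaModel`. Tokens that appear in only one record (for example `head` or `m10` here) are dropped by the frequency filter, which is one way misspelled one-off tokens can be kept out of the vocabulary.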