CAO Zhonghua, HUANG Xin, PENG Wenzhong, et al. The Topic Mining Based on Word Embedding Characteristics Clustering [J]. Journal of Jiangxi Normal University (Natural Science Edition), 2022(05): 468-474. [doi:10.16357/j.cnki.issn1000-5862.2022.05.05]

The Topic Mining Based on Word Embedding Characteristics Clustering

Journal of Jiangxi Normal University (Natural Science Edition) [ISSN:1006-6977/CN:61-1281/TN]

Issue:
2022, No. 05
Pages:
468-474
Section:
Information Science and Technology
Publication Date:
2022-09-25

Article Info

Title:
The Topic Mining Based on Word Embedding Characteristics Clustering
Article ID:
1000-5862(2022)05-0468-07
Author(s):
CAO Zhonghua¹, HUANG Xin¹, PENG Wenzhong², LIU Yuanchun¹
(1. School of Software, Jiangxi Normal University, Nanchang, Jiangxi 330022, China; 2. School of Information Management, Jiangxi University of Finance and Economics, Nanchang, Jiangxi 330032, China)
Keywords:
word embedding; clustering; language model; text topics
CLC Number:
TP391
DOI:
10.16357/j.cnki.issn1000-5862.2022.05.05
Document Code:
A
Abstract:
Data clustering is a common unsupervised learning method, and text topics can be mined by clustering word embeddings. However, most existing studies apply conventional clustering algorithms to word embeddings, and topic-mining clustering algorithms designed around the characteristics of word embeddings are still lacking. This paper designs a word embedding clustering algorithm based on the way language models, by modeling the relatedness between words, cause the embeddings of related and semantically similar words to gather together. The algorithm first computes the cluster number of the central word, then strengthens the similarity between that cluster's center embedding and the embeddings of neighboring words while pushing the center away from negative-sample word embeddings. It thereby learns the cluster structure of the text collection's word embeddings, which is then applied to text topic mining. Experiments on three public datasets show that, on the word embeddings produced by some models, the algorithm mines topics with better coherence and diversity.
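The three steps the abstract describes (assign the central word to a cluster, pull that cluster's center toward neighboring-word embeddings, push it away from negative samples) resemble a skip-gram-with-negative-sampling update applied to cluster centers. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's actual objective or update rules; all function names, hyperparameters, and the uniform negative-sampling scheme are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cluster_word_embeddings(embeddings, contexts, n_clusters=5,
                            lr=0.05, neg_k=3, epochs=5):
    """Cluster word embeddings with a skip-gram-style objective.

    embeddings: (vocab_size, dim) array of pretrained word embeddings.
    contexts:   dict mapping a central word id to a list of
                neighboring (context) word ids.
    """
    vocab_size, dim = embeddings.shape
    centers = rng.normal(scale=0.1, size=(n_clusters, dim))
    for _ in range(epochs):
        for w, ctx in contexts.items():
            # Step 1: cluster number of the central word = most similar center.
            c = int(np.argmax(centers @ embeddings[w]))
            for u in ctx:
                # Step 2: increase similarity between the cluster center
                # and the neighboring word's embedding.
                g = 1.0 - sigmoid(centers[c] @ embeddings[u])
                centers[c] += lr * g * embeddings[u]
            for n in rng.integers(0, vocab_size, size=neg_k):
                # Step 3: decrease similarity with negative-sample embeddings.
                g = sigmoid(centers[c] @ embeddings[n])
                centers[c] -= lr * g * embeddings[n]
    return centers
```

Each learned center can then be read as a topic, with its nearest word embeddings serving as the topic's top words.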

References:

[1] BLEI D M,NG A Y,JORDAN M I.Latent Dirichlet allocation[J].Journal of Machine Learning Research,2003,3(4/5):993-1022.
[2] MIKOLOV T,SUTSKEVER I,CHEN Kai,et al.Distributed representations of words and phrases and their compositionality[EB/OL].[2013-10-16].https://arxiv.org/abs/1310.4546.
[3] PENNINGTON J,SOCHER R,MANNING C D,et al.Glove:global vectors for word representation[EB/OL].[2014-10-01].https://aclanthology.org/D14-1162/.
[4] DEVLIN J,CHANG Mingwei,LEE K,et al.BERT:pre-training of deep bidirectional transformers for language understanding[EB/OL].[2018-10-11].https://arxiv.org/abs/1810.04805.
[5] HUANG Jiajia,LI Pengwei,PENG Min,et al.Research on topic models based on deep learning[J].Chinese Journal of Computers,2020,43(5):827-855.(in Chinese)
[6] DAS R,ZAHEER M,DYER C,et al.Gaussian LDA for topic models with word embeddings[EB/OL].[2015-07-19].https://aclanthology.org/P15-1077.
[7] DIENG A B,RUIZ F J R,BLEI D M.Topic modeling in embedding spaces[EB/OL].[2019-07-08].https://arxiv.org/abs/1907.04907.
[8] MIAO Yishu,YU Lei,BLUNSOM P.Neural variational inference for text processing[EB/OL].[2015-11-19].https://arxiv.org/abs/1511.06038.
[9] NAN Feng,DING Ran,NALLAPATI R,et al.Topic modeling with Wasserstein autoencoders[EB/OL].[2019-07-25].https://aclanthology.org/P19-1640.
[10] XIA Jiali,CAO Zhonghua,PENG Wenzhong,et al.Text topic modeling with the Skip-Gram structure and word embedding characteristics[J].Journal of Chinese Computer Systems,2020,41(7):1400-1405.(in Chinese)
[11] ANGELOV D.Top2vec:distributed representations of topics[EB/OL].[2020-08-20].https://arxiv.org/abs/2008.09470.
[12] LE Q V,MIKOLOV T.Distributed representations of sentences and documents[EB/OL].[2014-05-17].https://arxiv.org/abs/1405.4053.
[13] GROOTENDORST M.BERTopic:neural topic modeling with a class-based TF-IDF procedure[EB/OL].[2022-03-11].https://arxiv.org/abs/2203.05794.
[14] REIMERS N,GUREVYCH I.Sentence-BERT:sentence embeddings using siamese BERT-networks[EB/OL].[2019-08-27].https://aclanthology.org/D19-1410.
[15] SIA S,DALMIA A,MIELKE S J.Tired of topic models?Clusters of pretrained word embeddings make for fast and good topics too![EB/OL].[2020-04-30].https://aclanthology.org/2020.emnlp-main.135.
[16] GUILHERME R M,RODRIGO P,LEANDRO N C.Detecting topics in documents by clustering word vectors[EB/OL].[2019-06-22].https://link.springer.com/chapter/10.1007/978-3-030-23887-2_27.
[17] THOMPSON L,MIMNO D.Topic modeling with contextualized word representation clusters[EB/OL].[2020-10-24].https://arxiv.org/abs/2010.12626.
[18] MENG Yu,ZHANG Yunyi,HUANG Jiaxin,et al.Topic discovery via latent space clustering of pretrained language model representations[EB/OL].[2022-02-09].https://arxiv.org/abs/2202.04582.
[19] ZHANG Zihan,FANG Meng,CHEN Ling,et al.Is neural topic modelling better than clustering?An empirical study on clustering with contextual embeddings for topics[EB/OL].[2022-04-21].https://aclanthology.org/2022.naacl-main.285.
[20] LI Bohan,ZHOU Hao,HE Junxian,et al.On the sentence embeddings from pre-trained language models[EB/OL].[2020-11-02].https://aclanthology.org/2020.emnlp-main.733.
[21] SU Jianlin.Speeding up without losing accuracy:word-granularity Chinese WoBERT[EB/OL].[2020-09-18].https://www.spaces.ac.cn/archives/7758.(in Chinese)

Memo:
Received: 2022-05-12
Foundation: Supported by the Natural Science Foundation of Jiangxi Province (20212BAB202016) and the Jiangxi Province Education Science Foundation (GJJ10091).
Biography: CAO Zhonghua (born 1976), male, from Poyang, Jiangxi; lecturer, Ph.D.; his research focuses on text mining and fiscal big data processing. E-mail: rjxy_czh@jxnu.edu.cn
Last Update: 2022-09-25