GU Shuhao, SHAN Yong, XIE Wanying, et al. The Neural Machine Translation Based on Data Augmentation and Domain Adaptation Technology[J]. Journal of Jiangxi Normal University: Natural Science Edition, 2019, (06): 643-648. [doi:10.16357/j.cnki.issn1000-5862.2019.06.14]

The Neural Machine Translation Based on Data Augmentation and Domain Adaptation Technology

Journal of Jiangxi Normal University (Natural Science Edition) [ISSN 1000-5862]

Volume:
Issue:
2019, No. 06
Pages:
643-648
Section:
Machine Translation
Publication Date:
2019-12-10

Article Info

Title:
The Neural Machine Translation Based on Data Augmentation and Domain Adaptation Technology
Article ID:
1000-5862(2019)06-0643-06
Author(s):
GU Shuhao 1,2; SHAN Yong 1,2; XIE Wanying 3; GUO Dengji 1,2; WANG Shugen 1,2; SHAO Chenze 1,2; XUE Haiyang 1,2; ZHANG Liang 4; FENG Yang 1,2
1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; 2. University of Chinese Academy of Sciences, Beijing 100049, China; 3. School of Information Science, Beijing Language and Culture University, Beijing 100083, China; 4. School of Mechanical Electronic and Information Engineering, China University of Mining and Technology (Beijing), Beijing 100080, China
Keywords:
neural machine translation; Tibetan-Chinese translation; speech translation
CLC Number:
TP 302.1
DOI:
10.16357/j.cnki.issn1000-5862.2019.06.14
Document Code:
A
Abstract:
In recent years, neural machine translation based on deep learning has become the mainstream approach to machine translation. Neural machine translation models depend more heavily on large-scale annotated data than statistical machine translation models do, so translation quality drops significantly when the training corpus is scarce or its domain does not match that of the test data. In Tibetan-Chinese translation, the training corpus mostly consists of government documents and is scarce; in Chinese-English speech translation, the training corpus is mostly written language and noisy corpora are scarce. To improve the performance of neural machine translation models on these two tasks, this paper proposes a noisy-data augmentation method and two general domain adaptation methods, and verifies their effectiveness.
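The full paper is not included in this record, so the details of its noisy-data augmentation method are not available here. As a purely illustrative sketch (not the authors' method), the hypothetical `add_noise` function below injects word-level noise — random token dropout and adjacent-token swaps, with made-up rate parameters — into a clean source sentence, the general flavor of augmentation used to make written-language training data resemble noisy speech transcripts:

```python
import random

def add_noise(tokens, drop_prob=0.1, swap_prob=0.1, seed=None):
    """Illustrative word-level noise injection (hypothetical, not the
    paper's method): randomly drop tokens, then randomly swap adjacent
    tokens, each with the given probability."""
    rng = random.Random(seed)
    # Random token dropout; guard against emptying the sentence.
    out = [t for t in tokens if rng.random() >= drop_prob]
    if not out:
        out = [rng.choice(tokens)]
    # Random adjacent swaps; skip past a swapped pair so tokens
    # move at most one position.
    i = 0
    while i < len(out) - 1:
        if rng.random() < swap_prob:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return out
```

For example, `add_noise("今天 天气 很 好".split(), seed=1)` yields a perturbed copy of the sentence; pairing such perturbed sources with the original clean targets produces synthetic noisy training pairs.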

References:

[1] Sutskever I,Vinyals O,Le Q V.Sequence to sequence learning with neural networks[EB/OL].[2019-03-16].https://arxiv.org/abs/1409.3215.
[2] Bahdanau D,Cho K,Bengio Y.Neural machine translation by jointly learning to align and translate[EB/OL].[2019-03-16].https://arxiv.org/abs/1409.0473v2.
[3] Shao Chenze,Feng Yang,Zhang Jinchao,et al.Retrieving sequential information for non-autoregressive neural machine translation[EB/OL].[2019-03-16].https://www.aclweb.org/anthology/P19-1288/.
[4] Bengio Y,Louradour J,Collobert R,et al.Curriculum learning[EB/OL].[2019-03-16].http://dx.doi.org/10.1145/1553374.1553380.
[5] Luong M,Manning C D,et al.Stanford neural machine translation systems for spoken language domains[EB/OL].[2019-03-16].https://nlp.stanford.edu/pubs/luong-manning-iwslt15.pdf.
[6] Gu Shuhao,Feng Yang,Liu Qun.Improving domain adaptation translation with domain invariant and specific information[EB/OL].[2019-03-16].https://arxiv.org/pdf/1904.03879.pdf.
[7] Zhang Jiacheng,Luan Huanbo,Sun Maosong,et al.Improving the transformer translation model with document-level context[EB/OL].[2019-03-16].https://arxiv.org/abs/1810.03581.
[8] Yang Zhengxin,Zhang Jinchao,Meng Fandong,et al.Enhancing context modeling with a query-guided capsule network for document-level NMT[EB/OL].[2019-03-16].https://arxiv.org/abs/1909.00564.
[9] Sennrich R,Haddow B,Birch A.Neural machine translation of rare words with subword units[EB/OL].[2019-03-16].https://arxiv.org/abs/1508.07909.
[10] Axelrod A,He Xiaodong,Gao Jianfeng.Domain adaptation via pseudo in-domain data selection[EB/OL].[2019-03-16].https://core.ac.uk/display/21859466.
[11] Sennrich R,Haddow B,Birch A.Improving neural machine translation models with monolingual data[EB/OL].[2019-03-16].https://arxiv.org/abs/1511.06709.
[12] Zhang Wen,Feng Yang,Meng Fandong,et al.Bridging the gap between training and inference for neural machine translation[EB/OL].[2019-03-16].https://arxiv.org/abs/1906.02448.

Similar Articles:

[1] ZHAO Yang,ZHOU Long,WANG Qian,et al.The Study on Ethnic-to-Chinese Scare-Resource Neural Machine Translation[J].Journal of Jiangxi Normal University:Natural Science Edition,2019,(06):630.[doi:10.16357/j.cnki.issn1000-5862.2019.06.12]
[2] WANG Kun,YIN Mingming,YU Hongfei,et al.The Study on Low-Resource Uygur-Chinese Neural Machine Translation[J].Journal of Jiangxi Normal University:Natural Science Edition,2019,(06):638.[doi:10.16357/j.cnki.issn1000-5862.2019.06.13]
[3] LIU Junpeng,SONG Dingxin,ZHANG Yiming,et al.The Neural Machine Translation System of Multiple Data Generalization Fusion[J].Journal of Jiangxi Normal University:Natural Science Edition,2020,(01):39.[doi:10.16357/j.cnki.issn1000-5862.2020.01.07]

Memo:
Received: 2019-08-31
Foundation items: Supported by the National Natural Science Foundation of China (61876174, 61662077) and the National Key R&D Program of China (2017YFE9132900).
Biography: GU Shuhao (1994- ), male, born in Baoding, Hebei, Ph.D. candidate; his research focuses on machine translation and natural language processing. E-mail: gushuhao17g@ict.ac.cn
Last Update: 2019-12-10