Multi-teacher distillation BERT model in NLU tasks

Jialai SHI; Weibin GUO

doi:10.11959/j.issn.2096-0271.2023039

Research | Updated: 2024-06-03

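- Big Data Research, 2024, Vol. 10, No. 3, pp. 119-132
- Author affiliation: School of Information Science and Engineering, East China University of Science and Technology
- Author biographies:
  
  Jialai SHI (1998- ), male, is a master's student at the School of Information Science and Engineering, East China University of Science and Technology. His main research interests include natural language understanding and knowledge distillation.
  Weibin GUO (1968- ), male, Ph.D., is a professor at the School of Information Science and Engineering, East China University of Science and Technology, and a senior member of the China Computer Federation. His main research interests include high-performance computing, big data and cloud computing, and computer applications.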
- Funding:
  
  The National Natural Science Foundation of China (62076094)
- DOI: 10.11959/j.issn.2096-0271.2023039
  CLC number: TP391.1
- Published online: 2024-05
  
  Print publication: 2024-05-15
Jialai SHI, Weibin GUO. Multi-teacher distillation BERT model in NLU tasks[J]. Big Data Research, 2024, 10(3): 119-132. DOI: 10.11959/j.issn.2096-0271.2023039.

Abstract

Knowledge distillation is a model compression technique commonly used to address the large size and slow inference of deep pre-trained models such as BERT. Multi-teacher distillation can further improve the performance of the student model, but the conventional strategy of forcing a one-to-one assignment between student and teacher intermediate layers discards most of the teachers' intermediate features. A one-to-many ("single layer to multiple layers") mapping is proposed to solve the problem that intermediate layers cannot be aligned during knowledge distillation, helping the student model acquire the syntactic, coreference, and other knowledge held in the teachers' intermediate layers. Experiments on several GLUE datasets show that the student model retains 93.9% of the teacher models' average inference accuracy while using only 41.5% of their average parameter count.
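
The abstract does not say how the teacher layers mapped to one student layer are combined, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. It assumes the mapped teacher layers are merged by a softmax-weighted average and compared to the student layer with mean squared error, that student and teacher hidden sizes match (otherwise a projection would be needed), and that the teachers' soft labels are averaged uniformly. The function names `one_to_many_layer_loss` and `multi_teacher_soft_loss` are hypothetical and do not come from the paper.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def one_to_many_layer_loss(student_hidden, teacher_hiddens, scores=None):
    """MSE between one student layer and a weighted mix of several teacher layers.

    student_hidden:  (seq_len, dim) hidden states of a single student layer.
    teacher_hiddens: (num_layers, seq_len, dim) hidden states of the teacher
                     layers mapped to that student layer (assumed same dim).
    scores:          optional (num_layers,) unnormalised layer weights;
                     uniform weighting is used when omitted (an assumption).
    """
    teacher_hiddens = np.asarray(teacher_hiddens, dtype=float)
    if scores is None:
        scores = np.zeros(teacher_hiddens.shape[0])
    weights = softmax(scores)
    # Weighted sum over the layer axis gives a (seq_len, dim) target.
    target = np.tensordot(weights, teacher_hiddens, axes=1)
    return float(np.mean((np.asarray(student_hidden) - target) ** 2))


def multi_teacher_soft_loss(student_logits, teacher_logits_list, temperature=2.0):
    """Cross-entropy between the student and the averaged teacher soft labels."""
    teacher_probs = np.mean(
        [softmax(t, temperature) for t in teacher_logits_list], axis=0
    )
    student_log_probs = np.log(softmax(student_logits, temperature))
    return float(-(teacher_probs * student_log_probs).sum(axis=-1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher_layers = rng.normal(size=(3, 4, 8))   # 3 teacher layers, 4 tokens, dim 8
    student_layer = rng.normal(size=(4, 8))       # one student layer
    print("layer loss:", one_to_many_layer_loss(student_layer, teacher_layers))

    student_logits = rng.normal(size=(2, 3))      # 2 examples, 3 classes
    teachers = [rng.normal(size=(2, 3)) for _ in range(2)]
    print("soft-label loss:", multi_teacher_soft_loss(student_logits, teachers))
```

In a training loop the two terms would typically be summed with task-specific weights. Compared with a one-to-one assignment, every mapped teacher layer contributes to the target, so no intermediate feature is dropped outright.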
