Research on information extraction methods for historical classics under the threshold of digital humanities

Lifan HAN; Zijing JI; Zirui CHEN; Xin WANG

doi:10.11959/j.issn.2096-0271.2022058


Special topic: Big data technologies and methods for the humanities | Updated: 2024-06-03

- Research on information extraction methods for historical classics under the threshold of digital humanities
- Big Data Research, 2022, Vol. 8, No. 6, pp. 26-39
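- Author affiliations:
  
  1. College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
  2. Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin 300350, China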
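- Author biographies:
  
  Lifan HAN (born 1999), male, master's student at the College of Intelligence and Computing, Tianjin University. Research interests: natural language processing and knowledge graph construction.
  Zijing JI (born 1997), female, master's student at the College of Intelligence and Computing, Tianjin University. Research interests: natural language processing and knowledge graph construction.
  Zirui CHEN (born 1998), male, master's student at the College of Intelligence and Computing, Tianjin University. Research interests: knowledge representation learning, knowledge graph question answering, and knowledge graph construction.
  Xin WANG (born 1981), male, PhD, professor and doctoral supervisor at the College of Intelligence and Computing, Tianjin University. Research interests: knowledge graph data management, graph databases, and large-scale knowledge processing.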
- Funding:
  
  Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2020AAA0108504); National Natural Science Foundation of China (61972275)
- DOI: 10.11959/j.issn.2096-0271.2022058
  CLC number: TP391.1
- Published online: 2022-11
  
  Print publication: 2022-11-15

Lifan HAN, Zijing JI, Zirui CHEN, et al. Research on information extraction methods for historical classics under the threshold of digital humanities[J]. Big Data Research, 2022, 8(6): 26-39. DOI: 10.11959/j.issn.2096-0271.2022058.

Abstract

Digital humanities aims to use modern computing and network technology to support traditional humanities research. Historical classics written in classical Chinese are an important basis for historical research and study, but classical Chinese differs considerably from modern vernacular Chinese in grammar and word meaning, which makes these texts hard to read and understand. To address this problem, a method based on pre-trained models is proposed for extracting knowledge such as entities and relations from historical classics, so that the rich information contained in these texts can be obtained effectively. The model first replaces BERT's original pre-training tasks with multi-level pre-training tasks to capture semantic information fully, and adds structures such as convolutional layers and sentence-level aggregation on top of the BERT model to further optimize the generated word representations. Then, to address the scarcity of annotated classical Chinese data, a crowdsourcing system for annotating historical classics is built to obtain high-quality, large-scale entity and relation data. This data is used to construct a classical Chinese knowledge extraction dataset, evaluate the model's performance, and fine-tune the model. Experiments on the constructed dataset and on the GulianNER dataset demonstrate the effectiveness of the proposed model.
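The abstract mentions adding convolutional layers and sentence-level aggregation on top of BERT's token representations, without giving details. The following is a minimal sketch of those two operations in plain Python, not the authors' implementation. The function names (`conv1d_same`, `mean_pool`, `enrich`), the fixed kernel weights, the choice of mean pooling, and the concatenation step are all assumptions made for illustration; a real model would use learned multi-channel kernels in a deep learning framework.

```python
from typing import List

Vector = List[float]


def conv1d_same(tokens: List[Vector], weights: List[float]) -> List[Vector]:
    """Depthwise 1D convolution along the token axis with zero padding.

    `weights` must have odd length so the output keeps one vector per token.
    (Hypothetical helper; fixed weights stand in for learned kernels.)
    """
    if len(weights) % 2 == 0:
        raise ValueError("kernel width must be odd for 'same' padding")
    half = len(weights) // 2
    dim = len(tokens[0]) if tokens else 0
    out: List[Vector] = []
    for i in range(len(tokens)):
        acc = [0.0] * dim
        for offset, w in enumerate(weights):
            j = i + offset - half
            if 0 <= j < len(tokens):  # positions outside the sentence count as zeros
                for d in range(dim):
                    acc[d] += w * tokens[j][d]
        out.append(acc)
    return out


def mean_pool(tokens: List[Vector]) -> Vector:
    """Sentence-level aggregation, assumed here to be a plain average."""
    if not tokens:
        return []
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def enrich(tokens: List[Vector], weights: List[float]) -> List[Vector]:
    """Concatenate each convolved token vector with the sentence vector."""
    local = conv1d_same(tokens, weights)
    sentence = mean_pool(tokens)
    return [vec + sentence for vec in local]


# Toy 2-dimensional "token representations" for a three-character sentence.
toy_tokens = [[1.0, 0.0], [2.0, 2.0], [3.0, 4.0]]
enriched = enrich(toy_tokens, [0.5, 1.0, 0.5])
```

Each output vector here carries both local context (from the neighbouring tokens inside the kernel window) and a summary of the whole sentence, which is the general intuition behind combining convolution with sentence-level aggregation.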


