Research on Modeling of Traditional Chinese Medicine Word Segmentation Model Based on SentencePiece

Fund Project:

National Key Research and Development Program of China (2017YFC1700303, 2017YFC1700300)

    Abstract:

    Objective: To explore the construction of a word segmentation model suited to the field of traditional Chinese medicine (TCM). Methods: An unsupervised word segmentation method based on SentencePiece was used. Three types of literature (published textbooks, works by renowned physicians, and TCM clinical medical records) were used to build the TCM word segmentation model; TCM clinical medical records and case records of renowned physicians were used as the test set. Results: On the test set, the TCM word segmentation model achieved a Kappa coefficient of 0.79 (substantial agreement), an accuracy of 0.84, a macro precision of 0.84, a macro recall of 0.83, and a macro F1 score of 0.83. Conclusion: The model segments TCM terminology well, which indicates that this method can be applied to building word segmentation models for the TCM field and can serve as a methodological reference for further research on TCM word segmentation.
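
    The abstract reports a Kappa coefficient, accuracy, and macro-averaged precision, recall, and F1, but does not say how segmentations were turned into comparable labels. The sketch below is one plausible reading, not the authors' procedure: it assumes each character is labeled "B" (begins a token) or "I" (continues a token) and that the model output is compared with a reference segmentation character by character. The function names and the two-token example are hypothetical.

```python
from collections import Counter


def boundary_labels(tokens):
    """One label per character: 'B' starts a token, 'I' continues it (assumed scheme)."""
    labels = []
    for tok in tokens:
        labels.extend(["B"] + ["I"] * (len(tok) - 1))
    return labels


def accuracy(gold, pred):
    """Fraction of positions where the two label sequences agree."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def cohen_kappa(gold, pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(gold)
    observed = accuracy(gold, pred)
    gold_freq, pred_freq = Counter(gold), Counter(pred)
    classes = set(gold) | set(pred)
    expected = sum(gold_freq[c] * pred_freq[c] for c in classes) / (n * n)
    if expected == 1.0:
        # Both sequences use a single identical class; kappa is undefined,
        # so perfect agreement is reported by convention in this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def macro_scores(gold, pred):
    """Unweighted mean of per-class precision, recall, and F1."""
    classes = sorted(set(gold) | set(pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    k = len(classes)
    return sum(precisions) / k, sum(recalls) / k, sum(f1s) / k


# Hypothetical example: reference segmentation vs. an over-segmented output.
gold = boundary_labels(["脾胃", "虚弱"])     # B I B I
pred = boundary_labels(["脾", "胃", "虚弱"])  # B B B I

print("accuracy:", accuracy(gold, pred))
print("kappa:", cohen_kappa(gold, pred))
print("macro P/R/F1:", macro_scores(gold, pred))
```

    Macro averaging weights the "B" and "I" classes equally regardless of how often each occurs, which is why macro precision and recall can differ from plain accuracy. If the study instead scored whole terms rather than characters, the same functions would apply to term-level labels.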

Cite this article

刘双巧, 周璐, 李彩艳, 袁慧敏, 张异卓, 李昱达, 刘锦钢, 郑丰杰, 孙燕, 李宇航. Research on Modeling of Traditional Chinese Medicine Word Segmentation Model Based on SentencePiece [J]. 世界中医药, 2021, (06).

History
  • Received: 2020-07-07
  • Published online: 2021-05-11