
Exploiting Word Semantics to Enrich Character Representations of Chinese Pre-trained Models

  • Conference paper

Natural Language Processing and Chinese Computing (NLPCC 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13551)

Abstract

Most Chinese pre-trained models adopt characters as the basic units for downstream tasks. However, these models ignore the information carried by words and thus lose important semantics. In this paper, we propose a new method to exploit word structure and integrate lexical semantics into the character representations of pre-trained models. Specifically, we project a word's embedding onto its internal characters' embeddings according to similarity weights. To strengthen the word boundary information, we mix the representations of the internal characters within each word. After that, we apply a word-to-character alignment attention mechanism that emphasizes important characters by masking unimportant ones. Moreover, to reduce the error propagation caused by word segmentation, we present an ensemble approach that combines the segmentation results given by different tokenizers. Experimental results show that our approach achieves superior performance over the basic pre-trained models BERT, BERT-wwm and ERNIE on different Chinese NLP tasks: sentiment classification, sentence pair matching, natural language inference and machine reading comprehension. We further analyze the results to demonstrate the effectiveness of each component of our model.
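
The abstract only sketches the core operations at a high level. The PyTorch snippet below is a minimal sketch of one plausible reading of the word-to-character projection and mixing step: it assumes dot-product similarity with a softmax over each word's internal characters and mean pooling as the mixing function. The class name, dimensions, and the 0.5/0.5 mixing ratio are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of word-to-character projection and mixing.
# Assumptions (not from the paper's code): dot-product similarity + softmax,
# mean pooling for mixing, and an equal-weight residual combination.
import torch
import torch.nn as nn
import torch.nn.functional as F


class WordToCharFusion(nn.Module):
    def __init__(self, char_dim: int, word_dim: int):
        super().__init__()
        # Project word embeddings into the character representation space.
        self.word_proj = nn.Linear(word_dim, char_dim)

    def forward(self, char_hidden, word_emb, word_spans):
        """
        char_hidden: (seq_len, char_dim) character states from the encoder (e.g. BERT).
        word_emb:    (num_words, word_dim) embeddings of the segmented words.
        word_spans:  list of (start, end) character indices per word, end exclusive.
        """
        fused = char_hidden.clone()
        word_emb = self.word_proj(word_emb)            # (num_words, char_dim)

        for w_idx, (start, end) in enumerate(word_spans):
            chars = char_hidden[start:end]             # (n_chars, char_dim)
            w = word_emb[w_idx]                        # (char_dim,)

            # Similarity weights between the word and its internal characters.
            sim = F.softmax(chars @ w, dim=0)          # (n_chars,)

            # Project the word embedding onto each internal character.
            injected = chars + sim.unsqueeze(-1) * w   # (n_chars, char_dim)

            # Mix the internal characters' representations to strengthen
            # word boundary information (mean pooling here, an assumption).
            mixed = injected.mean(dim=0, keepdim=True)
            fused[start:end] = 0.5 * injected + 0.5 * mixed

        return fused


if __name__ == "__main__":
    torch.manual_seed(0)
    fusion = WordToCharFusion(char_dim=768, word_dim=200)
    char_hidden = torch.randn(6, 768)        # 6 characters in the sentence
    word_emb = torch.randn(3, 200)           # 3 words from a tokenizer
    spans = [(0, 2), (2, 5), (5, 6)]         # character spans of those words
    print(fusion(char_hidden, word_emb, spans).shape)   # torch.Size([6, 768])
```

In the paper's full pipeline, this fusion would be followed by the word-to-character alignment attention and repeated for each tokenizer in the segmentation ensemble; those steps are omitted here.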


Notes

  1. https://github.com/isnowfy/snownlp.

  2. https://github.com/pengming617/bert_classification.

References

  1. Conneau, A., et al.: XNLI: evaluating cross-lingual sentence representations. In: EMNLP, pp. 2475–2485 (2018)

  2. Cui, Y., et al.: Pre-training with whole word masking for Chinese BERT. arXiv preprint arXiv:1906.08101 (2019)

  3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL, pp. 4171–4186 (2019)

  4. Diao, S., Bai, J., Song, Y., Zhang, T., Wang, Y.: ZEN: pre-training Chinese text encoder enhanced by n-gram representations. In: EMNLP, pp. 4729–4740 (2019)

  5. Jiao, Z., Sun, S., Sun, K.: Chinese lexical analysis with deep Bi-GRU-CRF network. arXiv preprint arXiv:1807.01882 (2018)

  6. Lai, Y., Liu, Y., Feng, Y., Huang, S., Zhao, D.: Lattice-BERT: leveraging multi-granularity representations in Chinese pre-trained language models. In: NAACL, pp. 1716–1731 (2021)

  7. Li, X., Yan, H., Qiu, X., Huang, X.: FLAT: Chinese NER using flat-lattice transformer. In: ACL (2020)

  8. Liu, W., Fu, X., Zhang, Y., Xiao, W.: Lexicon enhanced Chinese sequence labelling using BERT adapter. In: ACL, pp. 5847–5858 (2021)

  9. Liu, X., Chen, Q., Deng, C., Zeng, H., Chen, J., Li, D., Tang, B.: LCQMC: a large-scale Chinese question matching corpus. In: COLING, pp. 1952–1962 (2018)

  10. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)

  11. Loshchilov, I., Hutter, F.: Fixing weight decay regularization in Adam (2018)

  12. Luo, R., Xu, J., Zhang, Y., Ren, X., Sun, X.: PKUSEG: a toolkit for multi-domain Chinese word segmentation. CoRR abs/1906.11455 (2019)

  13. Ma, R., Peng, M., Zhang, Q., Huang, X.: Simplify the usage of lexicon in Chinese NER. In: ACL, pp. 5951–5960 (2019)

  14. Mengge, X., Yu, B., Liu, T., Zhang, Y., Meng, E., Wang, B.: Porous lattice transformer encoder for Chinese NER. In: COLING, pp. 3831–3841 (2020)

  15. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: NeurIPS, pp. 8026–8037 (2019)

  16. Shao, C.C., Liu, T., Lai, Y., Tseng, Y., Tsai, S.: DRCD: a Chinese machine reading comprehension dataset. arXiv preprint arXiv:1806.00920 (2018)

  17. Song, Y., Shi, S., Li, J., Zhang, H.: Directional skip-gram: explicitly distinguishing left and right context for word embeddings. In: NAACL, pp. 175–180 (2018)

  18. Su, J., Tan, Z., Xiong, D., Ji, R., Shi, X., Liu, Y.: Lattice-based recurrent neural network encoders for neural machine translation. In: AAAI, pp. 3302–3308 (2017)

  19. Sun, Y., et al.: ERNIE: enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223 (2019)

  20. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, Q.V.: XLNet: generalized autoregressive pretraining for language understanding. In: NeurIPS, pp. 5754–5764 (2019)

  21. Zhang, Y., Yang, J.: Chinese NER using lattice LSTM. In: ACL, pp. 1554–1564 (2018)

  22. Zhu, D.: Lexical Notes on Chinese Grammar (in Chinese). The Commercial Press (1982)


Acknowledgement

This work is supported by the National Hi-Tech R&D Program of China (2020AAA0106600), the National Natural Science Foundation of China (62076008) and the Key Project of the Natural Science Foundation of China (61936012).

Author information

Corresponding author

Correspondence to Yunfang Wu.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Li, W., Sun, R., Wu, Y. (2022). Exploiting Word Semantics to Enrich Character Representations of Chinese Pre-trained Models. In: Lu, W., Huang, S., Hong, Y., Zhou, X. (eds) Natural Language Processing and Chinese Computing. NLPCC 2022. Lecture Notes in Computer Science, vol. 13551. Springer, Cham. https://doi.org/10.1007/978-3-031-17120-8_1

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-17120-8_1


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-17119-2

  • Online ISBN: 978-3-031-17120-8

  • eBook Packages: Computer Science, Computer Science (R0)
