
Exploiting Word Semantics to Enrich Character Representations of Chinese Pre-trained Models

  • Conference paper

Natural Language Processing and Chinese Computing (NLPCC 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13551)

Abstract

Most Chinese pre-trained models adopt characters as the basic units for downstream tasks. However, these models ignore the information carried by words and thus lose important semantics. In this paper, we propose a new method to exploit word structure and integrate lexical semantics into the character representations of pre-trained models. Specifically, we project a word's embedding onto its internal characters' embeddings according to similarity weights. To strengthen the word boundary information, we mix the representations of the internal characters within each word. After that, we apply a word-to-character alignment attention mechanism that emphasizes important characters by masking unimportant ones. Moreover, to reduce the error propagation caused by word segmentation, we present an ensemble approach that combines the segmentation results given by different tokenizers. Experimental results show that our approach achieves superior performance over the basic pre-trained models BERT, BERT-wwm and ERNIE on different Chinese NLP tasks: sentiment classification, sentence pair matching, natural language inference and machine reading comprehension. We further analyze the results to demonstrate the effectiveness of each component of our model.
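
The abstract only sketches the core operations at a high level. The PyTorch snippet below is a minimal sketch of one plausible reading of the word-to-character projection and mixing step: it assumes dot-product similarity with a softmax over each word's internal characters and mean pooling as the mixing function. The class name, dimensions, and the 0.5/0.5 mixing ratio are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of word-to-character projection and mixing.
# Assumptions (not from the paper's code): dot-product similarity + softmax,
# mean pooling for mixing, and an equal-weight residual combination.
import torch
import torch.nn as nn
import torch.nn.functional as F


class WordToCharFusion(nn.Module):
    def __init__(self, char_dim: int, word_dim: int):
        super().__init__()
        # Project word embeddings into the character representation space.
        self.word_proj = nn.Linear(word_dim, char_dim)

    def forward(self, char_hidden, word_emb, word_spans):
        """
        char_hidden: (seq_len, char_dim) character states from the encoder (e.g. BERT).
        word_emb:    (num_words, word_dim) embeddings of the segmented words.
        word_spans:  list of (start, end) character indices per word, end exclusive.
        """
        fused = char_hidden.clone()
        word_emb = self.word_proj(word_emb)            # (num_words, char_dim)

        for w_idx, (start, end) in enumerate(word_spans):
            chars = char_hidden[start:end]             # (n_chars, char_dim)
            w = word_emb[w_idx]                        # (char_dim,)

            # Similarity weights between the word and its internal characters.
            sim = F.softmax(chars @ w, dim=0)          # (n_chars,)

            # Project the word embedding onto each internal character.
            injected = chars + sim.unsqueeze(-1) * w   # (n_chars, char_dim)

            # Mix the internal characters' representations to strengthen
            # word boundary information (mean pooling here, an assumption).
            mixed = injected.mean(dim=0, keepdim=True)
            fused[start:end] = 0.5 * injected + 0.5 * mixed

        return fused


if __name__ == "__main__":
    torch.manual_seed(0)
    fusion = WordToCharFusion(char_dim=768, word_dim=200)
    char_hidden = torch.randn(6, 768)        # 6 characters in the sentence
    word_emb = torch.randn(3, 200)           # 3 words from a tokenizer
    spans = [(0, 2), (2, 5), (5, 6)]         # character spans of those words
    print(fusion(char_hidden, word_emb, spans).shape)   # torch.Size([6, 768])
```

In the paper's full pipeline, this fusion would be followed by the word-to-character alignment attention and repeated for each tokenizer in the segmentation ensemble; those steps are omitted here.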


Notes

  1. https://github.com/isnowfy/snownlp.

  2. https://github.com/pengming617/bert_classification.

References

  1. Conneau, A., et al.: XNLI: evaluating cross-lingual sentence representations. In: EMNLP, pp. 2475–2485 (2018)

  2. Cui, Y., et al.: Pre-training with whole word masking for Chinese BERT. arXiv preprint arXiv:1906.08101 (2019)

  3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL, pp. 4171–4186 (2019)

  4. Diao, S., Bai, J., Song, Y., Zhang, T., Wang, Y.: ZEN: pre-training Chinese text encoder enhanced by n-gram representations. In: EMNLP, pp. 4729–4740 (2019)

  5. Jiao, Z., Sun, S., Sun, K.: Chinese lexical analysis with deep Bi-GRU-CRF network. arXiv preprint arXiv:1807.01882 (2018)

  6. Lai, Y., Liu, Y., Feng, Y., Huang, S., Zhao, D.: Lattice-BERT: leveraging multi-granularity representations in Chinese pre-trained language models. In: NAACL, pp. 1716–1731 (2021)

  7. Li, X., Yan, H., Qiu, X., Huang, X.: FLAT: Chinese NER using flat-lattice transformer. In: ACL (2020)

  8. Liu, W., Fu, X., Zhang, Y., Xiao, W.: Lexicon enhanced Chinese sequence labelling using BERT adapter. In: ACL, pp. 5847–5858 (2021)

  9. Liu, X., Chen, Q., Deng, C., Zeng, H., Chen, J., Li, D., Tang, B.: LCQMC: a large-scale Chinese question matching corpus. In: COLING, pp. 1952–1962 (2018)

  10. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)

  11. Loshchilov, I., Hutter, F.: Fixing weight decay regularization in Adam (2018)

  12. Luo, R., Xu, J., Zhang, Y., Ren, X., Sun, X.: PKUSEG: a toolkit for multi-domain Chinese word segmentation. CoRR abs/1906.11455 (2019)

  13. Ma, R., Peng, M., Zhang, Q., Huang, X.: Simplify the usage of lexicon in Chinese NER. In: ACL, pp. 5951–5960 (2019)

  14. Mengge, X., Yu, B., Liu, T., Zhang, Y., Meng, E., Wang, B.: Porous lattice transformer encoder for Chinese NER. In: COLING, pp. 3831–3841 (2020)

  15. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: NeurIPS, pp. 8026–8037 (2019)

  16. Shao, C.C., Liu, T., Lai, Y., Tseng, Y., Tsai, S.: DRCD: a Chinese machine reading comprehension dataset. arXiv preprint arXiv:1806.00920 (2018)

  17. Song, Y., Shi, S., Li, J., Zhang, H.: Directional skip-gram: explicitly distinguishing left and right context for word embeddings. In: NAACL, pp. 175–180 (2018)

  18. Su, J., Tan, Z., Xiong, D., Ji, R., Shi, X., Liu, Y.: Lattice-based recurrent neural network encoders for neural machine translation. In: AAAI, pp. 3302–3308 (2017)

  19. Sun, Y., et al.: ERNIE: enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223 (2019)

  20. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, Q.V.: XLNet: generalized autoregressive pretraining for language understanding. In: NeurIPS, pp. 5754–5764 (2019)

  21. Zhang, Y., Yang, J.: Chinese NER using lattice LSTM. In: ACL, pp. 1554–1564 (2018)

  22. Zhu, D.: Lexical Notes on Chinese Grammar (in Chinese). The Commercial Press (1982)


Acknowledgement

This work is supported by the National Hi-Tech R&D Program of China (2020AAA0106600), the National Natural Science Foundation of China (62076008) and the Key Project of the Natural Science Foundation of China (61936012).

Author information

Corresponding author

Correspondence to Yunfang Wu.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Li, W., Sun, R., Wu, Y. (2022). Exploiting Word Semantics to Enrich Character Representations of Chinese Pre-trained Models. In: Lu, W., Huang, S., Hong, Y., Zhou, X. (eds) Natural Language Processing and Chinese Computing. NLPCC 2022. Lecture Notes in Computer Science, vol. 13551. Springer, Cham. https://doi.org/10.1007/978-3-031-17120-8_1

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-17120-8_1


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-17119-2

  • Online ISBN: 978-3-031-17120-8

  • eBook Packages: Computer Science, Computer Science (R0)
