Korean automatic spacing using pretrained transformer encoder and analysis

Authors: Taewook Hwang, Sangkeun Jung, Yoon-Hyung Roh

Affiliations: 1. Computer Science & Engineering, Chungnam National University, Daejeon, Republic of Korea; 2. Language Intelligence Research Section, Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea

Abstract: Automatic spacing in Korean corrects the spacing units of a given input sentence. Demand for automatic spacing has been increasing owing to frequent spacing errors in recent media, such as the Internet and mobile networks. We therefore propose a transformer encoder that reads a sentence bidirectionally and can be pretrained on an out-of-task corpus. Notably, our model achieved the highest character accuracy (98.42%) among existing automatic spacing models for Korean. We experimentally validated the effectiveness of bidirectional encoding and pretraining for Korean automatic spacing, and we conclude that pretraining is more important than fine-tuning and data size.

Keywords: attention; BERT; Korean automatic spacing; natural language processing; pretrained transformer encoder
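
As a concrete illustration of the approach described in the abstract, the sketch below frames Korean automatic spacing as per-character binary tagging (does a space follow this character?) on top of a pretrained bidirectional transformer encoder with a token-classification head. This is a minimal sketch under assumed conventions: the checkpoint name, character-level tokenization, and 0/1 label scheme are illustrative, not the authors' exact configuration.

```python
# Minimal sketch: Korean automatic spacing as per-character binary tagging
# with a pretrained BERT-style encoder. The checkpoint, tokenization, and
# label scheme are illustrative assumptions, not the paper's exact setup.
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

# Hypothetical checkpoint; the paper pretrains its own encoder.
MODEL_NAME = "bert-base-multilingual-cased"

tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
# Two labels per character: 0 = no space after, 1 = space after.
model = BertForTokenClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def predict_spacing(sentence: str) -> str:
    """Re-space a sentence by tagging each character with 0/1 (space follows)."""
    chars = list(sentence.replace(" ", ""))  # strip any existing spaces
    enc = tokenizer(chars, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits         # shape: (1, seq_len, 2)
    labels = logits.argmax(-1)[0].tolist()
    word_ids = enc.word_ids(0)               # map subtokens back to characters
    out, seen = [], set()
    for pos, wid in enumerate(word_ids):
        if wid is None or wid in seen:       # skip special tokens and
            continue                         # non-first subtokens
        seen.add(wid)
        out.append(chars[wid])
        if labels[pos] == 1:                 # model predicts a space here
            out.append(" ")
    return "".join(out).strip()

# With an untrained classification head the output is arbitrary;
# fine-tuning on a spacing-labeled corpus is required first.
print(predict_spacing("아버지가방에들어가신다"))
```

In this framing, training pairs each character with a gold 0/1 label derived from correctly spaced text and minimizes cross-entropy over the sequence. That is the standard token-classification recipe; per the abstract, the paper's contribution lies in the bidirectional, pretrained encoder beneath it and in the analysis of how much pretraining matters.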