Low-resource languages are languages that lack the basic data needed for natural language processing tasks and quantitative linguistic analysis. The scarcity of data for such languages is a problem shared by language science and natural language processing. The most basic language data resources are speech and text data for monolingual or bilingual words and sentences. In China, Mandarin, Cantonese, Tibetan, Uyghur, Mongolian, and Zhuang are on the whole high-resource languages; all other languages are low-resource, and among these the languages and dialects spoken at the county and township level are zero-resource. Building large-scale data for China's low-resource languages will help strengthen the country's control over its own language resources, allow China's natural language processing community to play a distinctive role in language-model innovation, drive a data-oriented shift in linguistic fieldwork, renew the theory and practice of field linguistics, and promote wide-area linguistic research based on quantitative data. Building this data involves four main tasks: first, constructing large-scale word datasets; second, constructing a knowledge-based semantic word network; third, constructing large-scale sentence datasets; and fourth, converting existing language materials into data.