Tokenization is the process of segmenting a piece of text into smaller units called tokens. Since Arabic is an agglutinative language by nature, tokenization is a crucial preprocessing step for many Natural Language Processing (NLP) applications such as morphological analysis, parsing, machine translation, and information extraction. In this paper, we investigate the word tokenization task together with a rewriting process that restores the orthography of the stem. For this task, we use Tunisian Arabic (TA) text. To the best of our knowledge, this is the first study to address word tokenization for Tunisian Arabic. We therefore start by collecting and preparing various TA corpora from different sources. We then present a comparison of three character-based tokenizers based on Conditional Random Fields (CRF), Support Vector Machines (SVM), and Deep Neural Networks (DNN). The best proposed model, using CRF, achieved an F-measure of 88.9%.
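The character-based formulation mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: each character of the input is assigned a B (begin-token) or I (inside-token) label, and a trained sequence labeler (CRF, SVM, or DNN) would predict these labels; decoding the label sequence then recovers the token boundaries. The function names and the B/I scheme here are illustrative assumptions.

```python
# Illustrative sketch (not the authors' implementation): word tokenization
# as character-level sequence labeling with B/I labels.

def encode_labels(tokens):
    """Derive gold B/I labels for the characters of a tokenized word:
    B marks the first character of each token, I the rest."""
    labels = []
    for tok in tokens:
        labels.extend(["B"] + ["I"] * (len(tok) - 1))
    return labels

def decode_tokens(chars, labels):
    """Recover tokens from characters and predicted B/I labels."""
    tokens = []
    for ch, lab in zip(chars, labels):
        if lab == "B" or not tokens:
            tokens.append(ch)       # start a new token
        else:
            tokens[-1] += ch        # extend the current token
    return tokens

# Example: the Arabic form "والكتاب" ("and the book") segments into
# the clitic "و" plus the token "الكتاب".
gold = ["و", "الكتاب"]
labels = encode_labels(gold)                    # ['B', 'B', 'I', 'I', 'I', 'I', 'I']
assert decode_tokens(list("والكتاب"), labels) == gold
```

In practice the labeler predicts each character's label from contextual features (surrounding characters, character n-grams), which is where the CRF, SVM, and DNN models compared in the paper differ.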