Unveiling Bahasa Rojak's Linguistic Complexity: Out-of-Vocabulary Detection and Tokenization Strategies for Language

Resource Type: Conference
Authors: Leong, Fang En; Tan, Chi Wee; Chan, Yean Ling; Lim, Tong Ming
Source: 2024 3rd International Conference on Digital Transformation and Applications (ICDXA), pp. 1-5, Jan. 2024
Subject: Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Bahasa Rojak
OOV Detection
Tokenization Method
Language

Online Access

Full Text (IEEE)

Abstract

This study examines Bahasa Rojak, focusing on the role of Out-of-Vocabulary (OOV) detection in analyzing its linguistic complexity. After gathering data from various sources, the study applies preprocessing to clean the text, then uses Brown Clustering to find linguistic trends and groupings within Bahasa Rojak. It then segments and encodes complex lexical elements using several tokenization methods: Byte Pair Encoding (BPE), Unigram, WordPiece, and SentencePiece. Each method yields a different segmentation view. The goal is to create a Union Vocabulary that combines the segments from these methods while minimizing duplication, thereby lowering OOV occurrences. Combining these strategies not only improves language representation but also demonstrates the adaptability and usefulness of the different tokenization techniques in dealing with OOV terms. The investigation provides insights into effective ways of dealing with OOV terms in linguistic datasets.
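
The Union Vocabulary idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the two subword vocabularies below are hypothetical toy sets standing in for the output of trained BPE and Unigram tokenizers, the helper names (`union_vocabulary`, `segment`, `oov_words`) are invented for illustration, and segmentation is simplified to greedy longest-match rather than the real algorithm of any of the four tokenizers.

```python
from typing import Iterable, List, Set


def union_vocabulary(vocabularies: Iterable[Iterable[str]]) -> Set[str]:
    """Merge several tokenizer vocabularies; using a set drops duplicates."""
    merged: Set[str] = set()
    for vocab in vocabularies:
        merged.update(vocab)
    return merged


def segment(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match segmentation (a simplification, assumed here).

    Characters that match no vocabulary entry are emitted as single pieces.
    """
    pieces: List[str] = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts at position i.
            pieces.append(word[i])
            i += 1
    return pieces


def oov_words(words: Iterable[str], vocab: Set[str]) -> List[str]:
    """Return the words with at least one piece missing from the vocabulary."""
    return [w for w in words if any(p not in vocab for p in segment(w, vocab))]


# Hypothetical toy vocabularies, not taken from the paper.
bpe_vocab = {"mak", "an", "lunch", "tak", "jadi"}
unigram_vocab = {"makan", "lah", "jom", "lunch"}
union = union_vocabulary([bpe_vocab, unigram_vocab])

words = ["makan", "lunch", "jomlah", "takjadi"]
print(oov_words(words, bpe_vocab))
print(oov_words(words, unigram_vocab))
print(oov_words(words, union))
```

In this toy setup each vocabulary alone leaves one word uncovered, while the union covers all four; the shared entry `lunch` is stored only once, which is the de-duplication the abstract refers to.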
