This study examines Bahasa Rojak, focusing on the role of out-of-vocabulary (OOV) detection in handling its dense linguistic mixture. Data are gathered from various sources and then cleaned through careful preprocessing. Brown Clustering is applied to find linguistic trends and groupings within Bahasa Rojak. The work then segments and encodes complex lexical items using several tokenization methods: Byte Pair Encoding (BPE), Unigram, WordPiece, and SentencePiece. Each method yields a different segmentation view. The goal is to create a Union Vocabulary that combines the segments produced by these methods while minimizing duplication, thereby lowering OOV occurrences. Combining these strategies not only improves language representation but also demonstrates the adaptability and usefulness of different tokenization techniques in dealing with OOV terms. The investigation provides insights into effective ways of handling OOV terms in linguistic datasets.
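As a rough illustration of the Union Vocabulary idea, the following minimal sketch merges the vocabularies of several tokenizers, removes duplicates, and compares OOV rates. It is not the study's implementation: the toy vocabularies, the word list, and the helper names `normalize`, `build_union_vocabulary`, and `oov_rate` are all hypothetical, and stripping the `##` and `▁` markers to detect duplicates is an assumed simplification that discards word-boundary information.

```python
# Minimal sketch of a union vocabulary; all data and helper names are hypothetical.

def normalize(piece):
    """Strip tokenizer-specific markers so equivalent pieces compare equal.

    Assumption: "##" marks a WordPiece continuation and "\u2581" marks a
    SentencePiece word start. Dropping them loses boundary information.
    """
    if piece.startswith("##"):
        return piece[2:]
    return piece.lstrip("\u2581")


def build_union_vocabulary(vocabularies):
    """Merge per-tokenizer vocabularies into one duplicate-free set."""
    union = set()
    for vocab in vocabularies.values():
        union.update(normalize(piece) for piece in vocab)
    union.discard("")  # a bare marker normalizes to the empty string
    return union


def oov_rate(words, vocabulary):
    """Fraction of words that do not appear in the vocabulary."""
    if not words:
        return 0.0
    missing = [word for word in words if word not in vocabulary]
    return len(missing) / len(words)


# Toy vocabularies standing in for trained tokenizer outputs (illustrative only).
vocabs = {
    "bpe": {"makan", "lah", "best"},
    "wordpiece": {"makan", "##lah", "jom"},
    "sentencepiece": {"\u2581makan", "\u2581best", "\u2581weh"},
}
words = ["jom", "makan", "best", "lah", "weh", "confirm"]

union = build_union_vocabulary(vocabs)
bpe_only = {normalize(piece) for piece in vocabs["bpe"]}

print("OOV rate, BPE only:", oov_rate(words, bpe_only))
print("OOV rate, union:   ", oov_rate(words, union))
```

In this toy setting the union covers more of the word list than any single vocabulary does, which is the effect the abstract describes. A real pipeline would take the vocabularies from trained tokenizers and would likely measure OOV at the subword level rather than by whole-word lookup.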