Interpolating the Text-to-Image Correspondence Based on Phonetic and Phonological Similarities for Nonword-to-Image Generation
- Resource Type
- Periodical
- Authors
- Matsuhira, C.; Kastner, M.A.; Komamizu, T.; Hirayama, T.; Doman, K.; Kawanishi, Y.; Ide, I.
- Source
- IEEE Access, 12:41299-41316, 2024
- Subject
- Aerospace
- Bioengineering
- Communication, Networking and Broadcast Technologies
- Components, Circuits, Devices and Systems
- Computing and Processing
- Engineered Materials, Dielectrics and Plasmas
- Engineering Profession
- Fields, Waves and Electromagnetics
- General Topics for Engineers
- Geoscience
- Nuclear Engineering
- Photonics and Electrooptics
- Power, Energy and Industry Applications
- Robotics and Control Systems
- Signal Processing and Analysis
- Transportation
- Computational modeling
- Phonetics
- Tokenization
- Solid modeling
- Image synthesis
- Visualization
- Task analysis
- Linguistics
- Psychology
- Natural language processing
- Text-to-image
- Nonwords
- phonetics
- pronunciation
- psycholinguistics
- text-to-image generation
- vision and language
- Language
- ISSN
- 2169-3536
Text-to-Image (T2I) generation is the task of synthesizing images corresponding to a given text input. Recent innovations in artificial intelligence have enhanced the capacity of conventional T2I generation, yielding increasingly powerful models. However, their behavior is known to become unstable when text inputs contain nonwords, i.e., words that have no definition within a language. This behavior not only produces image generation results that do not match human expectations but also hinders these models from being utilized in psycholinguistic applications and simulations. This paper exploits the human tendency to associate nonwords with phonetically and phonologically similar words, and uses it to propose a T2I generation framework robust against nonword inputs. The framework comprises a phonetics-aware language model and an adjusted T2I generation model. Our evaluations confirm that the proposed nonword-to-image generation synthesizes images depicting visual concepts of phonetically similar words more stably than comparative methods. We also assess how well the image generation results match human expectations, showing better agreement than the phonetics-blind baseline.
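The core idea of associating a nonword with its most similar known word can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's actual method: it compares character sequences by Levenshtein distance, whereas a phonetics-aware system would compare phoneme sequences (e.g., IPA obtained from a grapheme-to-phoneme model) and could weight substitutions by phonological similarity.

```python
# Illustrative sketch (not the paper's method): map a nonword to the
# closest known word by edit distance over symbol sequences. Characters
# stand in for phonemes here for simplicity.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def closest_word(nonword: str, vocabulary: list[str]) -> str:
    """Return the vocabulary word most similar to the nonword."""
    return min(vocabulary, key=lambda w: edit_distance(nonword, w))

vocab = ["cat", "dog", "bird", "kitten", "puppy"]
print(closest_word("catt", vocab))  # -> "cat" (one deletion away)
```

A T2I model conditioned on `closest_word(nonword, vocab)` instead of the raw nonword would then depict the visual concept of the nearest real word, which is the intuition the framework builds on.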