Joint image-text embeddings extracted from medical images and their associated reports are the bedrock of most biomedical vision-and-language $(\mathrm{V}+\mathrm{L})$ tasks, including medical visual question answering, clinical image-text retrieval, and automatic clinical report generation. In this study, we adopt four pre-trained $\mathrm{V}+\mathrm{L}$ models, LXMERT, VisualBERT, UNITER, and PixelBERT, to learn multimodal representations from MIMIC-CXR images and their associated reports. External evaluation on the OpenI dataset shows that the joint embeddings learned by the pre-trained $\mathrm{V}+\mathrm{L}$ models yield a 1.4% improvement on the thoracic finding classification task over a pioneering CNN + RNN model. Ablation studies are conducted to further analyze the contribution of individual model components and to validate the advantage of joint embeddings over text-only embeddings. Attention maps are also visualized to illustrate the attention mechanisms of the $\mathrm{V}+\mathrm{L}$ models.
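To make the embedding-extraction step concrete, the following is a minimal, illustrative sketch of obtaining a joint image-text representation from one of the pre-trained $\mathrm{V}+\mathrm{L}$ models (VisualBERT) via the HuggingFace Transformers library. It is not the authors' pipeline: the checkpoint name, the report sentence, and the random placeholder region features are assumptions for illustration; in practice region features would come from an object detector applied to the chest X-ray.

```python
# Illustrative sketch only: joint image-text embedding with a pre-trained VisualBERT.
# Checkpoint name and random "region features" are assumptions, not the paper's setup.
import torch
from transformers import BertTokenizer, VisualBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertModel.from_pretrained("uw-madison/visualbert-vqa-coco-pre")

# Example report text (hypothetical sentence, not from MIMIC-CXR).
report = "No acute cardiopulmonary abnormality. Heart size is normal."
text_inputs = tokenizer(report, return_tensors="pt")

# Placeholder visual inputs standing in for detector-extracted region features
# (this checkpoint expects 2048-dimensional region embeddings).
visual_embeds = torch.randn(1, 36, 2048)
visual_attention_mask = torch.ones(1, 36, dtype=torch.long)
visual_token_type_ids = torch.ones(1, 36, dtype=torch.long)

outputs = model(
    **text_inputs,
    visual_embeds=visual_embeds,
    visual_attention_mask=visual_attention_mask,
    visual_token_type_ids=visual_token_type_ids,
)

# Pooled [CLS] output serves as a joint image-text embedding that a downstream
# classifier (e.g., for thoracic findings) could consume.
joint_embedding = outputs.pooler_output
print(joint_embedding.shape)  # torch.Size([1, 768])
```

Such a pooled representation could then be fed to a lightweight classification head for the thoracic finding labels, which is conceptually how the evaluation described above compares joint embeddings against text-only embeddings.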