Joint image-text embeddings extracted from medical images and their associated reports are the bedrock of most biomedical vision-and-language $(\mathrm{V}+\mathrm{L})$ tasks, including medical visual question answering, clinical image-text retrieval, and automatic clinical report generation. In this study, we adopt four pre-trained $\mathrm{V}+\mathrm{L}$ models, LXMERT, VisualBERT, UNITER, and PixelBERT, to learn multimodal representations from MIMIC-CXR images and their associated reports. External evaluation on the OpenI dataset shows that the joint embeddings learned by the pre-trained $\mathrm{V}+\mathrm{L}$ models yield a 1.4% improvement on the thoracic finding classification task over a pioneering CNN + RNN model. Ablation studies are conducted to further analyze the contribution of individual model components and to validate the advantage of joint embeddings over text-only embeddings. Attention maps are also visualized to illustrate the attention mechanisms of the $\mathrm{V}+\mathrm{L}$ models.
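To make the embedding-extraction step concrete, the following is a minimal, illustrative sketch of obtaining a joint image-text representation from one of the pre-trained $\mathrm{V}+\mathrm{L}$ models (VisualBERT) via the HuggingFace Transformers library. It is not the authors' pipeline: the checkpoint name, the report sentence, and the random placeholder region features are assumptions for illustration; in practice region features would come from an object detector applied to the chest X-ray.

```python
# Illustrative sketch only: joint image-text embedding with a pre-trained VisualBERT.
# Checkpoint name and random "region features" are assumptions, not the paper's setup.
import torch
from transformers import BertTokenizer, VisualBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertModel.from_pretrained("uw-madison/visualbert-vqa-coco-pre")

# Example report text (hypothetical sentence, not from MIMIC-CXR).
report = "No acute cardiopulmonary abnormality. Heart size is normal."
text_inputs = tokenizer(report, return_tensors="pt")

# Placeholder visual inputs standing in for detector-extracted region features
# (this checkpoint expects 2048-dimensional region embeddings).
visual_embeds = torch.randn(1, 36, 2048)
visual_attention_mask = torch.ones(1, 36, dtype=torch.long)
visual_token_type_ids = torch.ones(1, 36, dtype=torch.long)

outputs = model(
    **text_inputs,
    visual_embeds=visual_embeds,
    visual_attention_mask=visual_attention_mask,
    visual_token_type_ids=visual_token_type_ids,
)

# Pooled [CLS] output serves as a joint image-text embedding that a downstream
# classifier (e.g., for thoracic findings) could consume.
joint_embedding = outputs.pooler_output
print(joint_embedding.shape)  # torch.Size([1, 768])
```

Such a pooled representation could then be fed to a lightweight classification head for the thoracic finding labels, which is conceptually how the evaluation described above compares joint embeddings against text-only embeddings.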