3D reconstruction from medical images is required in many clinical scenarios, such as computer-aided diagnosis. In this study, we propose a novel 3D reconstruction framework for medical images built on a transformer-based deep learning model, in which non-rigid Structure from Motion (NRSfM) is used to estimate non-rigid deformation and a coplanar constraint is imposed to refine the reconstruction. We design a 3-stage feature-extraction transformer as the backbone and adopt a multitask output structure to predict the photometric parameters of depth, pose, and camera structure. In addition, to obtain robust features, we pre-train the feature extractor on texture-rich natural images and transfer the learned knowledge to medical images. Experiments on both computed tomography (CT) and ultrasound images from open clinical libraries show that our method can efficiently estimate camera structure and motion and can achieve more precise 3D reconstruction.
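
The abstract does not state how the coplanar constraint is formulated. As an illustration only, one common way to measure how far a set of reconstructed 3D points departs from a single plane is to fit a least-squares plane and average the squared point-to-plane distances. The sketch below assumes this formulation; the function name `coplanarity_residual` and the use of NumPy are hypothetical choices for this example, not the method used in this study.

```python
import numpy as np


def coplanarity_residual(points):
    """Mean squared distance of 3D points to their least-squares plane.

    Hypothetical helper for illustration; assumes `points` is an (N, 3)
    array-like with N >= 3. Returns 0.0 (up to rounding) when the points
    are exactly coplanar.
    """
    pts = np.asarray(points, dtype=float)
    # Center the points so the fitted plane passes through the centroid.
    centered = pts - pts.mean(axis=0)
    # The right singular vector with the smallest singular value is the
    # normal of the best-fitting plane.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    normal = vt[-1]
    # Signed distance of each point to the plane along the normal.
    distances = centered @ normal
    return float(np.mean(distances ** 2))


if __name__ == "__main__":
    flat = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]]
    bent = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 1]]
    print(coplanarity_residual(flat))
    print(coplanarity_residual(bent))
```

Under this assumption, a residual of this kind could be added as a penalty term during training so that points expected to lie on one plane are pulled toward it; how such a term would be weighted against the depth and pose terms is not specified here.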