Recognizing the positions of human hands is crucial in human-robot collaboration because the robot may need to deliver objects to a human or avoid collisions. The aim of this study is to position human hands accurately in collected video in real time while a human operator performs manufacturing tasks. Although pose estimation methods such as MediaPipe and Detectron2 can recognize the skeleton of the human body, they cannot detect the body contour, particularly for palms and fingers. In addition, existing human hand positioning models such as BodyPix require substantial computational resources to achieve high accuracy. To address this issue, a hybrid framework is proposed that combines semantic segmentation with skeletal detection, obtaining pixel-level information within regions of interest (ROIs) to achieve high accuracy in a cost-efficient way. Experimental results show that the proposed hybrid model outperforms an existing lightweight model on simulated data while maintaining stable resource usage. Future research should expand the application of various hybrid models and conduct a more in-depth analysis of different human body parts; through these extended investigations, the accuracy and efficiency of human body part positioning can be further improved.
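The core idea of the hybrid framework — use skeletal keypoints to define a region of interest, then run pixel-level segmentation only inside that region — can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the keypoints are assumed to come from an upstream detector such as MediaPipe, and the brightness-threshold "segmenter" is a placeholder for a real semantic segmentation step.

```python
import numpy as np

def keypoints_to_roi(keypoints, margin=20, frame_shape=(480, 640)):
    """Compute a padded bounding-box ROI around hand keypoints.

    keypoints: list of (x, y) pixel coordinates, e.g. from a skeletal
    detector such as MediaPipe (assumed upstream source).
    Returns (x0, y0, x1, y1) clipped to the frame bounds.
    """
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    h, w = frame_shape
    x0 = max(int(min(xs)) - margin, 0)
    y0 = max(int(min(ys)) - margin, 0)
    x1 = min(int(max(xs)) + margin, w)
    y1 = min(int(max(ys)) + margin, h)
    return x0, y0, x1, y1

def segment_hand_in_roi(frame, roi, threshold=128):
    """Run segmentation only inside the ROI (placeholder segmenter).

    A real system would apply a semantic segmentation model to the
    crop; here a simple brightness threshold stands in for it.
    Returns a full-frame binary mask that is zero outside the ROI,
    which is what keeps the per-frame cost low.
    """
    x0, y0, x1, y1 = roi
    mask = np.zeros(frame.shape[:2], dtype=np.uint8)
    crop = frame[y0:y1, x0:x1]
    mask[y0:y1, x0:x1] = (crop.mean(axis=-1) > threshold).astype(np.uint8)
    return mask
```

Restricting the pixel-level step to the keypoint-derived ROI is what makes the hybrid approach cheaper than running full-frame segmentation models such as BodyPix: the expensive computation touches only a small crop per frame.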