The widespread use of technology in hospitals and the difficulty of sterilizing computer controls have increased opportunities for the spread of pathogens. This has led to interest in touchless user interfaces for computer systems, especially the application of gesture recognition to human-machine interfaces. Human gestures are a universal language that requires minimal time to learn. This research work proposes an end-to-end data-driven deep-learning model called SRDD-Net to address the problem of gesture recognition, comprising an encoder network that encodes germane hand features and a classification model that decodes dynamic hand gestures from the sequence of encodings. Because classifying raw images is more computationally expensive than classifying compact encodings, our approach applies a hand pose estimation model to estimate the coordinates of 21 hand keypoints in each image, which are then fed into a spatio-temporal classification model, yielding a more economical and accurate method for gesture classification.
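The two-stage pipeline described above (per-frame keypoint encoding followed by spatio-temporal classification) can be sketched as follows. This is a minimal NumPy illustration only: the flattening encoder, the mean-over-time pooling, and the linear classifier are hypothetical stand-ins for the paper's actual networks, and the class count is an assumption.

```python
import numpy as np

NUM_KEYPOINTS = 21  # hand keypoints per frame, as in the paper
COORDS = 2          # (x, y) coordinates per keypoint
NUM_CLASSES = 5     # hypothetical number of gesture classes

def encode_frame(keypoints):
    """Flatten one frame's 21 (x, y) keypoints into a 42-dim feature vector
    (stand-in for the paper's encoder network)."""
    return np.asarray(keypoints, dtype=np.float32).reshape(-1)

def classify_sequence(frames, weights, bias):
    """Toy spatio-temporal classifier: stack per-frame encodings, pool over
    time, then apply a linear layer (stand-in for the real model)."""
    encodings = np.stack([encode_frame(f) for f in frames])  # (T, 42)
    temporal_feature = encodings.mean(axis=0)                # pool over time
    logits = temporal_feature @ weights + bias               # (NUM_CLASSES,)
    return int(np.argmax(logits))

# Random stand-ins for a learned model and a 10-frame gesture clip.
rng = np.random.default_rng(0)
weights = rng.normal(size=(NUM_KEYPOINTS * COORDS, NUM_CLASSES))
bias = np.zeros(NUM_CLASSES)
clip = rng.uniform(size=(10, NUM_KEYPOINTS, COORDS))
predicted = classify_sequence(clip, weights, bias)
```

In the full system, a trained pose estimator would supply the keypoints from video frames and a learned spatio-temporal network would replace the pooling-plus-linear classifier; the sketch only shows why classifying 42-dimensional keypoint encodings is cheaper than classifying raw image frames.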