Grapes have been cultivated for wine production since antiquity and are among the highest-valued crops worldwide, with the global market size expected to reach 456.1 billion USD by 2027. Because grapes are non-climacteric fruits, selecting the optimal harvest time with respect to their degree of maturity is instrumental in determining the chemical composition of the resulting wines as well as several of their sensory traits. Automatic processes that identify the grapes' maturity degree in situ using non-destructive approaches would enable producers to monitor their produce more accurately and would also serve as an elegant enabler for robotic harvesting. Hyperspectral imaging in the VNIR range (400 to 1000 nm) has demonstrated this capacity under laboratory conditions in the literature; however, its application in situ is contingent on ambient parameters such as varying illumination conditions, which hamper the accurate estimation of the maturity degree, and only a limited number of studies have focused on real-life field conditions. In this paper we present a methodology that partially overcomes these limitations under real-life in situ conditions, using a deep learning approach that utilizes the attention mechanism together with appropriate pre-processing techniques. The methodology was tested on hyperspectral images collected with the Cubert FireflEYE sensor during the 2021 harvesting period at Ktima Gerovassilliou, Northern Greece, spanning four grape varieties (Chardonnay, Malagouzia, Sauvignon Blanc and Syrah) sampled at different points across the maturity cycle. The sugar content (°Brix) of the grapes was determined using a portable refractometer. Overall, the dataset comprises 233 point spectra extracted from the hyperspectral cubes.
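As an illustration of the last step above, a point spectrum can be extracted from a (height, width, bands) hyperspectral cube by averaging a small region of interest around a chosen pixel. This is a minimal sketch under assumed conventions: the function name, ROI size and cube dimensions below are illustrative, not the paper's actual extraction procedure.

```python
import numpy as np

def extract_point_spectrum(cube, row, col, half_window=1):
    """Mean spectrum over a small square ROI centred on (row, col).

    `cube` is a (height, width, bands) hyperspectral cube; averaging a
    small neighbourhood (here 3x3 by default) is a common way to reduce
    pixel-level noise before further analysis.
    """
    roi = cube[row - half_window:row + half_window + 1,
               col - half_window:col + half_window + 1, :]
    return roi.mean(axis=(0, 1))
```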
The spectra are pre-processed with scatter correction techniques and spectral derivatives, which can partially overcome the varying illumination conditions that hinder the extraction of proper reflectance spectra, albeit at the cost of a lower signal-to-noise ratio. Subsequently, an attention-based convolutional neural network (CNN) is developed for each grape variety independently, using 70% of the samples for calibration while the remaining 30% are held out as the testing dataset. The proposed approach outperforms other standard machine learning algorithms, while the attained accuracy is comparable with results from the literature obtained with hyperspectral imaging in the laboratory, under more controlled conditions. More concretely, the proposed approach attained a mean RMSE of 2.65 °Brix, R² of 0.58 and RPIQ of 2.12 across the four varieties on the independent test set. In the future, automated image segmentation of the hyperspectral cubes should be integrated as the first step of the pipeline to extract the pixels corresponding to the grapes. More elaborate techniques could also be implemented to address the inherent effect of ambient conditions on the hyperspectral data, while a direct comparison between in situ spectra and spectra recorded under laboratory conditions would provide more conclusive results.
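The pre-processing and evaluation steps described above can be sketched as follows. This is a hedged, minimal sketch: standard normal variate (SNV) is assumed as the scatter correction technique, a simple numerical gradient stands in for the spectral derivative, and the RPIQ definition (inter-quartile range of the reference values divided by RMSE) follows common chemometric usage; none of these implementation details are confirmed by the paper.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate scatter correction: centre and scale
    each spectrum (row) to zero mean and unit standard deviation."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

def preprocess(spectra):
    """SNV scatter correction followed by a first spectral derivative
    (approximated here with a simple gradient along the band axis)."""
    return np.gradient(snv(spectra), axis=1)

def rpiq(y_true, y_pred):
    """Ratio of performance to inter-quartile range: IQR(y_true) / RMSE."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    q1, q3 = np.percentile(y_true, [25, 75])
    return (q3 - q1) / rmse
```

After this pre-processing, each row would be fed to the per-variety regression model; RPIQ is reported alongside RMSE and R² because it accounts for the spread of the reference °Brix values.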