Deep neural network (DNN) models are widespread across applications. However, deploying a computationally intensive DNN model on a resource-constrained platform is challenging and limits its use in hardware-based applications. Developing optimized architectural designs of DNNs is one way to address hardware constraints and expand the application space. This study proposes an architectural design optimization for a gated recurrent unit (GRU) based regression model on microcontroller units (MCUs); we refer to the optimized network as embedded GRU (eGRU). We also present hardware architectures with different precision formats for application-specific energy and speed benefits. The proposed methodology is investigated and tested on two MCUs: the ARM Cortex LPC1768 and the AURIX TC277. We validate the proposed design methodology on vehicle speed prediction (VSP) for intelligent vehicle braking applications, using time-series data from the four wheel speeds alone. Hardware experimental results for regression-based eGRU inference per test sample on the VSP task show mean reductions of 3.5X in energy usage and 3X in latency, and a 33% reduction in model parameter memory, for the single-precision architecture. The half-precision architecture yields further reductions of 50% in model parameter memory, 9% in latency, and 9.1% in energy usage relative to the single-precision architecture.
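The GRU update at the core of such a regression model, and the memory effect of storing its parameters in half precision, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the weight shapes, hidden size, and float16 storage scheme are assumptions chosen only to mirror the abstract's four-wheel-speed input and 50% parameter-memory claim.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh):
    # Standard GRU cell: update gate z, reset gate r, candidate state h_tilde.
    z = sigmoid(Wz @ x + Uz @ h + bz)
    r = sigmoid(Wr @ x + Ur @ h + br)
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h) + bh)
    return (1.0 - z) * h + z * h_tilde

# Toy dimensions: 4 wheel-speed inputs, hidden size 8 (illustrative only).
rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
shapes = [(n_hid, n_in), (n_hid, n_hid), (n_hid,)] * 3  # 3 gates x (W, U, b)
params32 = [rng.standard_normal(s).astype(np.float32) for s in shapes]

h = np.zeros(n_hid, dtype=np.float32)
x = rng.standard_normal(n_in).astype(np.float32)
h = gru_step(x, h, *params32)  # one recurrent step per time sample

# Storing parameters in half precision halves model-parameter memory,
# consistent with the ~50% reduction reported for the half-precision design.
bytes32 = sum(p.nbytes for p in params32)
bytes16 = sum(p.astype(np.float16).nbytes for p in params32)
print(bytes32, bytes16)  # half precision uses exactly half the bytes
```

In an actual MCU deployment the same recurrence would be implemented in fixed loops over statically allocated weight arrays; the NumPy form here only makes the gate equations and the precision trade-off explicit.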