Large deep neural network (DNN) models pose significant computational and memory challenges, particularly when deployed on edge devices. To address this, techniques such as pruning, quantization, data sparsity, and data reuse have been applied to DNNs, reducing memory and computational complexity at the cost of some accuracy loss. This paper introduces an efficient hardware accelerator tailored for Convolutional Neural Networks (CNNs). The proposed architecture results from a co-optimized approach spanning both algorithms and hardware. It leverages a linear approximation of pre-trained network weights with minimal accuracy loss. A novel computational reuse method is presented that reduces the number of multiplication and addition operations and memory accesses, and is integrated seamlessly into the dedicated elements of the CNN design. To validate the effectiveness of this architecture, we conducted experiments on a gem5-based RISC-V simulator, using the VGG16 model on the CIFAR-100 dataset and the AlexNet model on the Tiny ImageNet dataset. The results show a speedup of approximately $2\times$ on AlexNet over the reference model. Additionally, the proposed CNN design was implemented on a Xilinx Kintex-7 Field Programmable Gate Array (FPGA), achieving a notable reduction in hardware resource utilization compared with prior work. This work serves as a versatile framework for evaluating trade-offs among accuracy, latency, power consumption, and cost across different CNN architectures.
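To make the core idea concrete, the following is a minimal sketch (not the paper's actual hardware datapath) of how a linear approximation of pre-trained weights can enable computational reuse in a dot product. The assumptions here are ours: weights are mapped to a small set of linearly spaced levels $w_k \approx s\,k + z$, inputs sharing the same level are accumulated first (additions only), and one multiplication per distinct level replaces one multiplication per weight. Function names (`linear_quantize`, `dot_with_reuse`) are illustrative, not from the paper.

```python
import numpy as np

def linear_quantize(weights, n_levels=16):
    """Approximate pre-trained weights as w ~ scale*k + offset, k an integer level.

    This is a hypothetical stand-in for the paper's linear weight approximation.
    """
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / (n_levels - 1)
    levels = np.round((weights - w_min) / scale).astype(int)
    return levels, scale, w_min  # offset = w_min

def dot_with_reuse(x, levels, scale, offset, n_levels=16):
    """Compute sum_i (scale*k_i + offset) * x_i with computational reuse.

    Inputs are first accumulated per weight level (additions only); then a
    single multiplication per distinct level is performed, instead of one
    multiplication per weight.
    """
    partial = np.zeros(n_levels)
    for xi, k in zip(x, levels):
        partial[k] += xi                      # reuse: group inputs by level
    ks = np.arange(n_levels)
    return scale * np.dot(ks, partial) + offset * x.sum()
```

For a layer with $N$ weights drawn from $L$ levels ($L \ll N$), the per-output multiplication count drops from $N$ to roughly $L$, which is the kind of saving the dedicated reuse elements in the accelerator are designed to exploit in hardware.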