In-memory computing (IMC) has been proposed to address compute-intensive, data-driven AI workloads, using either SRAM or emerging memory technologies such as PCM, RRAM, and MRAM, each offering different trade-offs when integrated as a computing device at the system level. A notable distinction is between digital and analog IMC. The latter uses resistive or capacitive charge-sharing techniques to maximize row parallelism, but at the expense of computational inaccuracy and accumulation-resolution loss caused by device variations across PVT corners and by the limited SNR and dynamic range of the ADC/readout circuits. Most analog SRAM IMC solutions rely on large logic bit cells and aggressive ADC/readout bitwidth reduction, leading to low memory density and computing inaccuracy. These drawbacks severely limit deployment where functional safety, low-cost testing, and system scalability to general-purpose workloads are required. In contrast, the deterministic behavior of digital IMC and its compatibility with pushed technology-scaling rules offer a fast path toward the next generation of neural processing systems. However, integrating IMC into a Neural Processing Unit (NPU) must preserve a mix of computing capabilities while delivering a substantial improvement in power and cost efficiency.

In this work, we present the architecture of a scalable, design-time-parametric NPU for edge AI based on digital SRAM IMC (DIMC). It uses 8T standard bitcells integrated into IMC tiles supporting 1, 2, and 4b operation (a version with 8b support is under development), instantiated in multiple clusters alongside digital logic, and driven by a custom graph compiler that optimizes tensor slicing. The system achieves an end-to-end, system-level energy efficiency ranging from 40 to 31 TOPS/W in 18nm FD-SOI.
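The accumulation-resolution loss attributed to analog IMC above can be illustrated numerically. The following is a minimal sketch, not a circuit model: the row count, activation/weight widths, and ADC bitwidth are illustrative assumptions, and device variation is ignored so that only the readout quantization effect remains.

```python
# Hypothetical sketch: an analog bitline accumulates many 1b products at
# once, but the readout ADC collapses the ideal sum onto far fewer levels,
# whereas digital IMC computes the same sum bit-exactly.
import numpy as np

rng = np.random.default_rng(0)

rows = 256                       # rows accumulated in parallel (assumed)
x = rng.integers(0, 2, rows)     # 1b activations
w = rng.integers(0, 2, rows)     # 1b weights

# Digital IMC: bit-exact popcount; representing 0..256 needs 9 bits.
ideal = int(np.dot(x, w))

# Analog IMC with aggressive readout bitwidth reduction (assumed 5b ADC):
adc_bits = 5
levels = 2 ** adc_bits
code = round(ideal / rows * (levels - 1))   # quantized readout code
readout = code / (levels - 1) * rows        # value recovered from the code

print("ideal:", ideal, "readout:", round(readout, 2))
```

The quantization error is bounded by half an ADC step, here rows / (2 * (levels - 1)), i.e. about 4 counts out of 256; real analog arrays add PVT-dependent variation on top of this floor.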
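The role of the tensor-slicing graph compiler can be sketched in software terms. This is a toy functional model under assumed tile dimensions, not the actual compiler or hardware geometry: a weight matrix is partitioned to fit the design-time-parametric IMC tile shape, and per-tile partial sums are reduced digitally, which stays bit-exact as digital IMC guarantees.

```python
# Hypothetical sketch of compiler-side tensor slicing for a DIMC cluster:
# partition a matrix-vector product into tile-sized chunks and accumulate
# the partial sums digitally.
import numpy as np

TILE_ROWS, TILE_COLS = 64, 32    # assumed design-time tile parameters

def sliced_matvec(W, x):
    """Sum tile-level partial products, as clusters of IMC tiles would."""
    out = np.zeros(W.shape[0], dtype=np.int64)
    for r in range(0, W.shape[0], TILE_ROWS):
        for c in range(0, W.shape[1], TILE_COLS):
            tile = W[r:r + TILE_ROWS, c:c + TILE_COLS]
            out[r:r + TILE_ROWS] += tile @ x[c:c + TILE_COLS]
    return out

rng = np.random.default_rng(1)
W = rng.integers(-8, 8, (128, 96))   # 4b signed weights
x = rng.integers(0, 16, 96)          # 4b activations

# Deterministic digital accumulation: sliced result matches the full product.
assert np.array_equal(sliced_matvec(W, x), W @ x)
```

Because every partial sum is exact, slicing order does not affect the result; this determinism is what the abstract contrasts with analog accumulation.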