Nonvolatile-memory-based computing in memory (nvCIM) [1–6] is ideal for low-power edge-AI devices requiring neural network (NN) parameter storage in the power-off mode, a rapid response to device wake-up, and high energy efficiency for MAC operations $(\text{EF}_{\text{MAC}})$. Current analog nvCIMs impose a tradeoff between the signal margin (SM) and the number of accumulations per cycle $(\mathrm{N}_{\text{ACU}})$ on one side and $\text{EF}_{\text{MAC}}$ and computing latency $(\mathrm{T}_{\text{CD-MAC}})$ on the other. Near-memory computing (NMC) with high precision for inputs (IN), weights (W), and outputs (OUT) and a high $\mathrm{N}_{\text{ACU}}$ is a trend for improving $\text{EF}_{\text{MAC}}$, $\mathrm{T}_{\text{CD-MAC}}$, and accuracy. A prior STT-MRAM NMC [1] uses vertical-weight mapping (VWM) to improve the $\text{EF}_{\text{MAC}}$; however, further improvement is challenging due to (1) the large energy consumed in repeatedly reading the same weight data across multiple inputs for a single NN layer; (2) a high bitstream toggling rate (BTR) in the digital MAC circuits $(\text{DC}_{\text{MAC}})$, which reduces $\text{EF}_{\text{MAC}}$; and (3) a limited SM and a long memory readout latency $(\mathrm{T}_{\text{CD-M}})$ for memories with a small R-ratio (e.g., STT-MRAM; see Fig. 33.2.1). In developing an STT-MRAM nvCIM macro, this work moves beyond circuit-level novelty by using system-software-circuit co-design.
This work achieves a high $\text{EF}_{\text{MAC}}$, a short $\mathrm{T}_{\text{CD-M}}$, a high read bandwidth (R-BW), a high IN-W-OUT precision, and a high $\mathrm{N}_{\text{ACU}}$ by using three novel schemes: (1) a hardware-based weight-feature-aware read (WFAR) to reduce weight accesses and improve $\text{EF}_{\text{MAC}}$ with a minimal area overhead; (2) toggling-aware weight-tuning (TAWT), based on VWM, to obtain fine-tuned weights $(\mathrm{W}_{\text{FT}})$ with a low BTR and thereby enhance the $\text{EF}_{\text{MAC}}$ of the $\text{DC}_{\text{MAC}}$; and (3) a differential charge-accumulating margin-enhanced voltage-sensing amplifier (DCME-VSA) to enhance the SM while reducing the $\mathrm{T}_{\text{CD-M}}$. The proposed 22-nm 8-Mb STT-MRAM NMC nvCIM macro achieves the highest R-BW $(436\,\text{GB}/\mathrm{s})$ and $\text{EF}_{\text{MAC}}$ $(46.4\text{-}160.1\,\text{TOPS}/\mathrm{W})$ for $\mathrm{N}_{\text{ACU}}=576$ with 8bIN-8bW-26bOUT precision.
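
To make the BTR notion behind TAWT concrete, the following is a minimal sketch rather than the metric used in this work: it assumes, purely for illustration, that the toggling rate is the fraction of bit positions that flip between consecutive 8-bit weight words in a stream. The function name `bitstream_toggling_rate` and the example weight values are hypothetical.

```python
def bitstream_toggling_rate(words, bits=8):
    """Fraction of bit positions that flip between consecutive words.

    Assumed definition for illustration only; the text does not
    specify how BTR is measured.
    """
    if len(words) < 2:
        return 0.0
    mask = (1 << bits) - 1
    # XOR marks the bit positions that differ between neighbouring words.
    toggles = sum(bin((a ^ b) & mask).count("1")
                  for a, b in zip(words, words[1:]))
    return toggles / (bits * (len(words) - 1))


# Hypothetical weight stream that alternates across a bit boundary.
original = [0x0F, 0x10, 0x0F, 0x10]
# Hypothetical nearby values that differ only in the least significant bit.
tuned = [0x0F, 0x0E, 0x0F, 0x0E]

btr_original = bitstream_toggling_rate(original)
btr_tuned = bitstream_toggling_rate(tuned)
```

Under this assumed definition, nudging each weight by a small amount can cut the number of bit flips seen by the digital MAC circuits considerably, which is the intuition behind tuning weights for a low BTR.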