Sparse Matrix-Vector Multiplication (SpMV) is one of the vital computational primitives used in modern workloads. SpMV performs memory access, leading to unnecessary data transmission, massive data access, and redundant multiplicative accumulators. Therefore, we propose the near spin-transfer torque magnetic random access memory (STT-MRAM) processing architecture from three optimization perspectives. These optimizations include (1) the NMP controller receives the instruction through the AXI4 bus to implement the SpMV operation in the following steps, identifies valid data, and encodes the index depending on the kernel size, (2) the NMP controller uses high-level synthesis dataflow in the shared buffer for achieving better performance throughput while do not consume bus bandwidth, and (3) the configurable MACs are implemented in the NMP core without matching step entirely during the multiplication. Using these optimizations, the NMP architecture can access the pipelined STT-MRAM (read bandwidth is 26.7GB/s). The experimental simulation results show that this design achieves up to 66x and 28x speedup compared with state-of-the-art ones and 69x speedup without sparse optimization.