General-Purpose Graphics Processing Units (GPGPUs) employ frequent context switching to mask the long latency of memory operations. However, GPGPUs still suffer from stalls because memory operations are not fully overlapped. To alleviate these stalls and enhance Memory-Level Parallelism (MLP), it is crucial both to overlap memory operations and to reduce their number. This paper conducts a comprehensive analysis of data locality in GPGPUs and proposes Locality-Aware Warp Scheduling and Dynamic Data Prefetching (LWSDP), a co-design in the per-SM private cache of GPGPUs that exploits data locality to improve MLP. In addition to a coordinated scheduler and dynamic data prefetching, we incorporate Prefetching Requests Admitted Cache Access Re-execution (PRA-CAR) to mitigate the memory saturation caused by excessive prefetch requests. Experimental results demonstrate that, on data-locality-sensitive kernels, LWSDP improves performance by 33.02% on average and reduces the miss rate by 28.16% on average compared to previous schedulers.