This paper presents high-performance batched lower-upper (LU) factorization routines for small matrices on graphics processing units (GPUs). LU factorization is an effective method for solving systems of linear equations, and pivoting, at least partial pivoting, is essential to ensure its numerical stability. However, the row interchanges that pivoting requires are costly on GPUs because they incur non-coalesced memory accesses. We analyze the global memory access of different variants of the LU factorization algorithm and design our batched LU factorization routines based on the blocked left-looking algorithm, which requires the minimum amount of global memory access. To reduce the penalty of non-coalesced memory access, we propose an approach that swaps rows group by group and a method that delays row interchanges. In addition, kernel fusion and data prefetching techniques are used to improve overall performance. Numerical results on an NVIDIA Tesla V100 GPU show that our batched LU routine outperforms the highly optimized vendor library cuBLAS by up to $6.5\times$ and achieves up to a $2.2\times$ speedup over the well-known open-source package MAGMA.
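To make the cost concrete, the following minimal NumPy sketch (not the paper's CUDA implementation) shows a standard unblocked LU factorization with partial pivoting; the explicit row swap at each step is exactly the operation that becomes a non-coalesced, strided memory access when the matrix is stored in GPU global memory:

```python
import numpy as np

def lu_partial_pivot(A):
    """Unblocked LU factorization with partial pivoting: P @ A = L @ U.

    Illustrative only: each pivot step performs a full row interchange,
    which on a GPU touches non-contiguous (non-coalesced) memory.
    """
    A = np.asarray(A, dtype=float).copy()
    n = A.shape[0]
    piv = np.arange(n)
    for k in range(n - 1):
        # Pivot: largest magnitude in column k on or below the diagonal.
        p = k + np.argmax(np.abs(A[k:, k]))
        if p != k:
            A[[k, p]] = A[[p, k]]        # row interchange (the costly swap)
            piv[[k, p]] = piv[[p, k]]
        # Compute multipliers and update the trailing submatrix.
        A[k+1:, k] /= A[k, k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    P = np.eye(n)[piv]                   # permutation matrix from pivot record
    return P, L, U
```

The paper's "group by group" swapping and delayed row interchanges can be read as reorganizing when and how these `A[[k, p]] = A[[p, k]]` exchanges are applied, so that memory traffic is batched into coalesced accesses.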