Bit-slice architectures have been developed to support the varied bit precision and data sparsity of deep neural networks (DNNs). However, because bit-slice computations combine low bit precision with a wide range of sparsity, designing bit-slice architectures is challenging at the multiplier, processing element (PE), core, and software (SW) levels. First, data sparsity causes a power trade-off between the radices of a multiplier, and previous multipliers cannot take advantage of all sparsity ranges. Second, a bit-slice PE that integrates massive low-bit multiply-and-accumulate (MAC) units incurs more data transactions than a fixed bit-width PE. Third, bit-slice core architectures focus on either dense or sparse data computations, limiting the overall performance of bit-slice computations across a wide sparsity range. Lastly, low-bit bit-slice computations cause massive repetitive instruction fetches across hardware units. To address these challenges, LUTein is proposed. It exploits a new lookup table (LUT)-based computing method to support the Radix-4 Modified Booth algorithm, achieving low power consumption in all sparsity ranges. Moreover, the slice-tensor PE efficiently processes slice-tensor data by sharing hardware units across the Radix-4 LUT-based MAC units. In addition, the LUTein architecture adopts a systolic datapath with a multi-port buffer to exploit both inter-PE data reuse and slice-level sparsity. Lastly, LUTein's instruction set architecture (ISA) and hierarchical instruction decoder are introduced to alleviate repetitive instruction fetches. As a result, LUTein outperforms the state-of-the-art bit-slice architecture, Sibia, achieving 1.34× higher energy efficiency and 1.78× higher area efficiency.
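For readers unfamiliar with the arithmetic named above, the following is a minimal software sketch of textbook Radix-4 Modified Booth recoding combined with a table of precomputed multiples. It is not the LUTein circuit: the function names `booth_radix4_digits` and `lut_multiply`, the 8-bit default width, and the dictionary used as the lookup table are assumptions made for illustration only. The sketch shows how a signed multiplier is recoded into digits in {-2, -1, 0, 1, 2}, so that each partial product becomes a table lookup and zero digits can be skipped.

```python
def booth_radix4_digits(y: int, bits: int) -> list:
    """Recode a signed two's-complement integer of width `bits` into
    radix-4 Booth digits in {-2, -1, 0, 1, 2}, least-significant first."""
    if bits <= 0 or bits % 2 != 0:
        raise ValueError("bits must be a positive even number")
    if not -(1 << (bits - 1)) <= y < (1 << (bits - 1)):
        raise ValueError("y does not fit in the given width")
    u = y & ((1 << bits) - 1)  # two's-complement bit pattern of y
    digits = []
    prev = 0  # implicit bit y[-1] = 0
    for i in range(0, bits, 2):
        b0 = (u >> i) & 1
        b1 = (u >> (i + 1)) & 1
        # Booth digit for the overlapping triple (y[i+1], y[i], y[i-1])
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def lut_multiply(x: int, y: int, bits: int = 8) -> int:
    """Multiply x by y using a table of the five multiples of x
    (illustrative stand-in for a hardware LUT, assumed for this sketch)."""
    lut = {d: d * x for d in (-2, -1, 0, 1, 2)}
    acc = 0
    for k, d in enumerate(booth_radix4_digits(y, bits)):
        if d == 0:
            continue  # a zero digit contributes no partial product
        acc += lut[d] << (2 * k)  # weight of digit k is 4**k
    return acc
```

As a usage example, `booth_radix4_digits(7, 4)` returns `[-1, 2]`, since 7 = -1 + 2·4, and `lut_multiply(-3, 7, bits=4)` returns `-21`. Skipping zero digits in the loop is only a software analogue of exploiting sparsity; how the hardware realizes this is not described in the abstract.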