The expectation propagation (EP) algorithm is near-optimal in massive multiple-input multiple-output (MIMO) systems but suffers from high computational complexity. Most previous works exploit the channel hardening property and introduce iterative matrix inversion algorithms to simplify the EP algorithm, but their performance degrades dramatically in non-ideal channels. In this paper, we propose a more universal EP-based detector that performs well in both ideal and non-ideal channels. First, two general methods are proposed to improve the detection performance and convergence speed of iterative matrix inversion algorithms under correlated channels. The proposed diagonal preprocessing (DP) method improves the detection performance by more than 2 dB relative to the same algorithm without it, and the novel eigenvalue parameter estimation method guarantees convergence for all frames. These two methods are applied to the second-order Richardson iteration (SORI) algorithm to derive the DP-SORI algorithm, which converges more than twice as fast as the state-of-the-art design. Second, for the other important part of EP-based algorithms, namely the calculation of expectation and variance, complicated operations such as exponentiations, divisions, and inversions are all removed by algorithmic optimization. Moreover, based on the proposed approximate EP with DP-SORI (EPA-DP-SORI) algorithm, an efficient hardware design is developed that combines multiple optimization methods, such as an efficient matrix multiplication architecture and low-complexity LDL decomposition. In addition to achieving better detection performance than the state-of-the-art design, the presented EPA-DP-SORI detector delivers $1.27\times$ higher area efficiency and $1.57\times$ higher energy efficiency.
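To make the iterative matrix inversion idea more concrete, the following Python sketch shows a textbook stationary second-order Richardson iteration applied to a diagonally scaled linear system. It is only an illustration under stated assumptions and is not the DP-SORI algorithm of this paper: the function name `scaled_second_order_richardson` is hypothetical, the diagonal scaling is assumed to be a symmetric Jacobi-style scaling, the eigenvalue bounds come from a simple Gershgorin estimate rather than the proposed eigenvalue parameter estimation method, and the matrix is taken to be real, symmetric, and positive definite, whereas MIMO detection generally involves complex-valued matrices.

```python
import math


def scaled_second_order_richardson(A, b, iters=20):
    """Approximately solve A x = b for a real symmetric positive-definite A.

    Hypothetical helper for illustration only; A is a list of rows, b a list.
    """
    n = len(b)

    # Assumed form of diagonal preprocessing: symmetric scaling D A D with
    # D = diag(A)^(-1/2), so the scaled matrix has a unit diagonal.
    d = [1.0 / math.sqrt(A[i][i]) for i in range(n)]
    As = [[d[i] * A[i][j] * d[j] for j in range(n)] for i in range(n)]
    bs = [d[i] * b[i] for i in range(n)]

    # Assumed eigenvalue bounds from Gershgorin discs (not the paper's
    # estimator): all eigenvalues of As lie in [1 - r, 1 + r].
    r = max(sum(abs(As[i][j]) for j in range(n) if j != i) for i in range(n))
    if r >= 1.0:
        raise ValueError("Gershgorin bound is not positive; use another estimate")
    s_min, s_max = math.sqrt(1.0 - r), math.sqrt(1.0 + r)

    # Standard stationary second-order Richardson parameters.
    alpha = (2.0 / (s_max + s_min)) ** 2
    beta = ((s_max - s_min) / (s_max + s_min)) ** 2

    y_prev = [0.0] * n
    y = [0.0] * n
    for _ in range(iters):
        # Residual of the scaled system at the current iterate.
        resid = [bs[i] - sum(As[i][j] * y[j] for j in range(n)) for i in range(n)]
        # Gradient-like step plus a momentum term using the previous iterate.
        y_next = [y[i] + alpha * resid[i] + beta * (y[i] - y_prev[i])
                  for i in range(n)]
        y_prev, y = y, y_next

    # Undo the scaling: x = D y.
    return [d[i] * y[i] for i in range(n)]
```

The momentum term is what distinguishes the second-order iteration from the first-order Richardson method, and the diagonal scaling narrows the eigenvalue spread that determines `alpha` and `beta`. A hardware-oriented design would replace the exact square roots and divisions used here with cheaper approximations.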