Brain-machine interface systems will require recording from thousands of neural channels in parallel to capture large-scale neuronal activity. The resulting high-bandwidth action potential signals would overload the data communication link. On-site spike sorting can extract the essential information, but it requires extensive computational resources to achieve high classification accuracy, and this cost is especially severe in large-scale real-time sorting systems. In this work, a customized unsupervised training engine combined with distributed, optimized sorting channels is presented to reduce hardware complexity without compromising spike sorting accuracy. A mixed-domain feature set is extracted in each channel, followed by feature-based sorting. Each channel continuously monitors its sorting accuracy and requests training engine intervention when needed. The proposed system is implemented in a 180 nm CMOS process and consumes only 0.33 μW/channel with a 25 kHz clock and a 1.8 V power supply. In-channel sorting occupies 0.0023 mm², and the training engines occupy 1.956 mm², which can be shared by all the channels.
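The effect of sharing the training engines can be illustrated with simple arithmetic. The sketch below uses only the figures quoted above and rests on two assumptions not stated in the text: that the 0.0023 mm² in-channel sorting area is a per-channel figure, and that total power scales linearly with channel count at 0.33 μW/channel. The function names and the channel counts are hypothetical and chosen only for illustration.

```python
# Figures taken from the abstract.
IN_CHANNEL_AREA_MM2 = 0.0023   # in-channel sorting area (assumed to be per channel)
TRAINING_AREA_MM2 = 1.956      # training engines, shared by all channels
POWER_PER_CHANNEL_UW = 0.33    # at a 25 kHz clock and a 1.8 V supply


def total_area_mm2(n_channels: int) -> float:
    """Total area: n in-channel sorters plus one shared set of training engines."""
    if n_channels < 1:
        raise ValueError("n_channels must be at least 1")
    return n_channels * IN_CHANNEL_AREA_MM2 + TRAINING_AREA_MM2


def area_per_channel_mm2(n_channels: int) -> float:
    """Amortized area per channel; the shared term shrinks as 1/n."""
    return total_area_mm2(n_channels) / n_channels


def total_power_uw(n_channels: int) -> float:
    """Total sorting power, assuming linear scaling with channel count."""
    return n_channels * POWER_PER_CHANNEL_UW


if __name__ == "__main__":
    # Hypothetical channel counts, for illustration only.
    for n in (1, 100, 1000, 10000):
        print(n, total_area_mm2(n), area_per_channel_mm2(n), total_power_uw(n))
```

Under these assumptions the amortized area per channel approaches the in-channel figure as the channel count grows, because the shared training area is divided among more channels.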