In this paper, a novel unsupervised approach to mining action categories from video sequences is presented. The approach consists of two modules: an action representation and a learning model. Videos are regarded as spatially distributed dynamic pixel time series, which are quantized into pixel prototypes. After the pixel time series are replaced with their corresponding prototype labels, the video sequences are compressed into 2D action matrices. We stack these matrices into a multi-action tensor and propose a joint matrix factorization method to simultaneously cluster the pixel prototypes into pixel signatures and the matrices into action classes. The approach is tested on the popular, publicly available Weizmann data set, and promising results are achieved.
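The co-clustering idea behind the joint factorization can be illustrated with a minimal sketch. This is not the paper's actual method: as a stand-in, each video is summarized by a toy histogram over pixel prototypes, and a rank-k non-negative matrix factorization (multiplicative updates) recovers video-to-class and prototype-to-signature assignments jointly. All data and dimensions below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in data (assumption): 6 "videos", each summarized as a
# histogram over 8 pixel prototypes; two underlying action classes
# with disjoint dominant prototypes.
X = np.vstack([
    rng.poisson([9, 8, 7, 6, 1, 1, 1, 1], size=(3, 8)),
    rng.poisson([1, 1, 1, 1, 9, 8, 7, 6], size=(3, 8)),
]).astype(float)

k = 2  # number of action classes / pixel signatures (assumed known)
W = rng.random((X.shape[0], k)) + 0.1   # video-to-class weights
H = rng.random((k, X.shape[1])) + 0.1   # signature-to-prototype weights

# Multiplicative-update NMF: X ~= W @ H with non-negative factors.
# The small epsilon guards against division by zero.
for _ in range(200):
    H *= (W.T @ X) / (W.T @ W @ H + 1e-9)
    W *= (X @ H.T) / (W @ H @ H.T + 1e-9)

video_labels = W.argmax(axis=1)       # cluster videos into action classes
prototype_labels = H.argmax(axis=0)   # group prototypes into signatures
print(video_labels, prototype_labels)
```

Because both factors are estimated together, grouping the prototypes (rows of H) and grouping the videos (rows of W) fall out of the same decomposition, which is the appeal of a joint, rather than two-stage, clustering formulation.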