Human action recognition has been widely studied in the field of computer vision. However, challenging problems remain for both local and global methods of classifying human actions: local methods usually ignore the structural information among local descriptors, while global methods generally have difficulty with occlusion and background clutter. To address these problems, a novel combined representation of a global Gist feature and local patch coding is proposed. Firstly, the Gist feature captures the spectral information of actions from a global view, preserving the spatial relationships among body parts. Secondly, the Gist feature located in the different grids of the action-centric region is divided into four patches according to the frequency of action variance. Then, building on the traditional bag-of-words (BoW) model, a novel form of local patch coding is adopted: each patch is encoded independently, and all the visual words are finally concatenated to represent the high variability of human actions. By incorporating local patch coding, the proposed method not only addresses the problem that global descriptors cannot reliably identify actions against complex backgrounds, but also reduces redundant features in a video. Experimental results on the KTH and UCF Sports datasets demonstrate that the proposed representation is effective for human action recognition.
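
The patch-wise BoW coding step described above, encoding each patch against its own codebook and concatenating the resulting histograms, can be sketched as follows. This is a minimal illustration only: the patch count of four follows the text, but the codebook size, descriptor dimension, and per-patch codebooks are hypothetical assumptions, not the paper's actual parameters.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Quantize descriptors to their nearest codeword and build a
    normalized visual-word histogram (standard hard-assignment BoW)."""
    # pairwise squared distances between descriptors (n, d) and codewords (k, d)
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    words = d2.argmin(axis=1)                       # nearest codeword index per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)              # L1-normalize the histogram

def encode_patches(patch_descriptors, patch_codebooks):
    """Encode each patch independently, then concatenate all histograms
    into a single video-level representation."""
    return np.concatenate([bow_histogram(d, c)
                           for d, c in zip(patch_descriptors, patch_codebooks)])

# Toy example: 4 patches, 20 Gist descriptors of dimension 16 each,
# and an (assumed) 8-word codebook per patch.
rng = np.random.default_rng(0)
patch_descriptors = [rng.normal(size=(20, 16)) for _ in range(4)]
patch_codebooks = [rng.normal(size=(8, 16)) for _ in range(4)]
video_repr = encode_patches(patch_descriptors, patch_codebooks)
print(video_repr.shape)  # (32,) = 4 patches x 8 visual words
```

Encoding each patch against a separate codebook keeps visual words specific to each frequency band of action variance; the concatenation then preserves which patch each word came from, which a single global histogram would discard.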