In natural language processing, utterance-level emotion recognition (ULER) is a fundamental task that requires comprehensive analysis of both words and their contexts. Because a single word can carry opposite emotion polarities in different contexts, even large-scale deep learning models often fail to predict the emotions of utterances accurately, especially in daily dialogues. This paper therefore proposes a novel Semantic and Sentiment Hierarchical Transformer (SS-HiTransformer), which predicts the emotion label of a response utterance from its preceding contextual utterances. At the word level, each token of an utterance is represented by both a semantic and a sentiment word embedding, and the two are fused into a single token feature. The token features of the context are then fed into transformer encoders to capture long-range, utterance-level dependencies. The final prediction is made by a transformer decoder, which combines the features of the contextual and response utterances through a multi-head attention mechanism. Experimental results show that the proposed SS-HiTransformer outperforms state-of-the-art models on the Friends, EmotionPush, EmoryNLP, IEMOCAP and MELD datasets.
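
The data flow described above (fuse two embeddings per token, summarize each utterance, then let the response attend over its context) can be illustrated with a minimal sketch. This is not the SS-HiTransformer implementation. It assumes fusion by concatenation, replaces the transformer encoders with simple mean pooling, and uses a single attention head without learned projections. The function names and the toy embedding values are hypothetical and chosen only for illustration.

```python
import math

def fuse(semantic, sentiment):
    """Concatenate per-token semantic and sentiment vectors.

    Concatenation is an assumption; the fusion operator is not specified here.
    """
    if len(semantic) != len(sentiment):
        raise ValueError("both embeddings must cover the same tokens")
    return [sem + sen for sem, sen in zip(semantic, sentiment)]

def mean_pool(token_features):
    """Average token features into one utterance vector.

    A crude stand-in for a transformer encoder, used to keep the sketch short.
    """
    n = len(token_features)
    dim = len(token_features[0])
    return [sum(tok[i] for tok in token_features) / n for i in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention of one query over the context."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, mixed

# Toy data (made-up values): 2-d semantic and 1-d sentiment embedding per token.
context_utterances = [
    fuse([[0.1, 0.3], [0.2, 0.1]], [[0.9], [0.7]]),    # positively toned turn
    fuse([[0.4, 0.0], [0.3, 0.2]], [[-0.8], [-0.6]]),  # negatively toned turn
]
response = fuse([[0.2, 0.2], [0.1, 0.3]], [[0.8], [0.6]])

context_vectors = [mean_pool(u) for u in context_utterances]
response_vector = mean_pool(response)

# The response is the query; the context utterances are keys and values.
weights, attended = attend(response_vector, context_vectors, context_vectors)
```

In this toy setting the positively toned response places more attention weight on the positively toned context utterance, which shows how a sentiment channel can influence which context the prediction draws on. A full model would add learned projections, multiple heads, positional information, and a classification layer over `attended`.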