We address the effective use of speaker information for automatic speech recognition (ASR) on speaker-imbalanced datasets. Joint speaker and speech recognition has recently been investigated in end-to-end (E2E) systems; however, in these systems the speaker information produced by speaker recognition (SRE) is not explicitly exploited for ASR. Inspired by speaker embeddings for ASR, we propose a direct connection from the SRE output to the ASR decoder. The E2E architecture allows the ASR loss to be backpropagated to the SRE decoder, resulting in joint optimisation. This architecture is beneficial for speaker-sparse data such as meetings and low-resource language settings, where speaker clustering is applied to compensate for underrepresented speakers. We also make a systematic comparison of the proposed method with other methods, including multi-task learning (MTL), adversarial learning (AL), and speaker attribute augmentation (SAug). We show that the use of speaker-cluster information improves both ASR and SRE, and that the proposed method outperforms the other methods, reducing the errors of the baseline model by 3.35% for ASR and 8.23% for SRE.
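To make the described connection concrete, the following is a minimal PyTorch sketch of one plausible realisation: a shared encoder, an SRE decoder whose speaker-cluster posteriors are concatenated into the ASR decoder input, and a joint loss so the ASR loss backpropagates into the SRE branch. All module choices, dimensions, and the interpolation weight `alpha` are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch only: assumed shapes/modules, not the authors' implementation.
import torch
import torch.nn as nn


class JointSreAsr(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, n_clusters=32, vocab=500):
        super().__init__()
        # Shared acoustic encoder (stand-in for the paper's E2E encoder).
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        # SRE decoder: predicts speaker-cluster posteriors per utterance.
        self.sre_head = nn.Linear(hidden, n_clusters)
        # ASR decoder: consumes encoder states concatenated with the
        # SRE output (the "direct connection" described above).
        self.asr_head = nn.Linear(hidden + n_clusters, vocab)

    def forward(self, feats):
        enc, _ = self.encoder(feats)                 # (B, T, H)
        spk_logits = self.sre_head(enc.mean(dim=1))  # utterance-level pooling
        spk_post = spk_logits.softmax(dim=-1)        # speaker-cluster posterior
        # Broadcast speaker posteriors over time and feed them to ASR;
        # gradients from the ASR loss flow back through sre_head.
        spk_tiled = spk_post.unsqueeze(1).expand(-1, enc.size(1), -1)
        asr_logits = self.asr_head(torch.cat([enc, spk_tiled], dim=-1))
        return asr_logits, spk_logits


model = JointSreAsr()
feats = torch.randn(4, 120, 80)                  # (batch, frames, features)
asr_logits, spk_logits = model(feats)
asr_tgt = torch.randint(0, 500, (4, 120))        # dummy per-frame targets
spk_tgt = torch.randint(0, 32, (4,))             # dummy speaker-cluster labels
ce = nn.CrossEntropyLoss()
alpha = 0.3                                      # assumed SRE-loss weight
loss = ce(asr_logits.transpose(1, 2), asr_tgt) + alpha * ce(spk_logits, spk_tgt)
loss.backward()                                  # joint optimisation of both branches
```

Because the speaker posteriors sit on the ASR decoder's input path, both loss terms update the SRE head, which is the joint-optimisation property the abstract attributes to the E2E architecture.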