Prior work on chart interpretation has focused more on improving performance than on practical usefulness to humans, inadvertently drifting away from the fundamental objective. Because the subjective knowledge needed to interpret a chart varies with the application, interpretation should remain autonomous and rest on foundational information. This calls for intuitive information, grounded in human perception, to be provided hierarchically. We propose an extension of caption usage, termed “Caption Hierarchical Segmentation”, which progressively augments caption information according to the spatial characteristics of tokens and yields multi-layered captions. Models trained with this approach can be applied flexibly while remaining grounded in what humans can perceive. When integrated with existing chart explanation models, our method helps prevent model misunderstanding and overfitting: for samples that are otherwise uninterpretable, it offers simple explanations, providing only intuitive information and avoiding incorrect responses.