As large language models (LLMs) rapidly advance, rigorous testing and evaluation of these models become increasingly crucial. To address this need, we developed three types of questions: Chinese contextual, English contextual, and language-context-independent. Testing in both Chinese and English probes the LLMs' hallucination tendencies. We investigate the impact of language on hallucination from two perspectives: the language of the input prompt and the cultural context underlying the prompt's content. Additionally, 52 multi-domain single-choice questions from C-EVAL are presented in both original and randomized order to assess robustness to perturbations. Among the five LLMs tested, GPT-4 demonstrates the strongest anti-hallucination and robustness capabilities, answering with greater accuracy, consistency, and reliability. ChatGLM ranks second overall and outperforms GPT-4 on Chinese context-dependent questions. Emergent testing phenomena are analyzed from the user's perspective, hallucinated responses are categorized, and potential causal factors behind hallucination and fragility are examined. Based on these findings, viable avenues for improvement are proposed.
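The order-perturbation protocol above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual evaluation code: the function names and the simple answer-consistency metric are our assumptions.

```python
import random


def shuffled_order(n, seed=0):
    """Return a random permutation of question indices 0..n-1.

    Re-presenting the same single-choice questions in randomized order is a
    simple perturbation for probing robustness: a robust model should answer
    consistently regardless of presentation order.
    """
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order


def consistency_rate(answers_original, answers_shuffled, order):
    """Fraction of questions answered identically under both presentations.

    answers_shuffled[i] is the model's answer to the question whose original
    index is order[i].
    """
    matches = sum(
        answers_original[order[i]] == answers_shuffled[i]
        for i in range(len(order))
    )
    return matches / len(order)
```

A perfectly robust model would score 1.0; any drop indicates sensitivity to presentation order rather than to question content.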