Medical document classification is a prominent research problem in the document classification domain. Because medical discharge notes are collected from real patients, the resulting datasets are often imbalanced. Moreover, these datasets are usually too small for data-hungry models, especially for rare diseases. Both issues can lead to poor classification performance. This work proposes a new probabilistic, dictionary-based data augmentation approach that addresses these issues by oversampling the minority class. The method creates new, highly varied documents by replacing words with synonyms extracted from WordNet, taking into account each synonym's similarity to the original word. To verify the effectiveness of the proposed oversampling approach, three different machine learning methods are used to train classifiers on the augmented clinical text datasets it generates. The experimental results show that the proposed method not only yields better classification accuracy than training on the original imbalanced dataset, but can also outperform some existing augmentation methods on the dataset of the 2008 Informatics for Integrating Biology and the Bedside (i2b2) obesity challenge.
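The augmentation idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact sampling rule, so the sketch assumes that a synonym is drawn with probability proportional to its similarity score. The `SYNONYMS` table, its scores, the `replace_prob` parameter, and the function names `augment` and `oversample` are all hypothetical; the table stands in for real WordNet lookups so that the example runs without external data.

```python
import random

# Hypothetical stand-in for WordNet lookups: each word maps to candidate
# synonyms paired with an assumed similarity score in (0, 1].
SYNONYMS = {
    "pain": [("ache", 0.9), ("soreness", 0.7), ("discomfort", 0.5)],
    "severe": [("serious", 0.8), ("grave", 0.4)],
    "patient": [("case", 0.3)],
}


def augment(tokens, synonyms, replace_prob=0.5, rng=None):
    """Return a new token list with some words swapped for synonyms."""
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        candidates = synonyms.get(tok.lower())
        if candidates and rng.random() < replace_prob:
            words, weights = zip(*candidates)
            # Assumed rule: sample in proportion to similarity, so closer
            # synonyms are chosen more often than distant ones.
            out.append(rng.choices(words, weights=weights, k=1)[0])
        else:
            out.append(tok)
    return out


def oversample(minority_docs, synonyms, target_size, seed=0):
    """Add synthetic documents until the minority class has target_size items."""
    rng = random.Random(seed)
    augmented = list(minority_docs)
    while len(augmented) < target_size:
        source = rng.choice(minority_docs)
        augmented.append(augment(source, synonyms, rng=rng))
    return augmented


if __name__ == "__main__":
    docs = [["patient", "reports", "severe", "pain"]]
    for doc in oversample(docs, SYNONYMS, target_size=4, seed=1):
        print(" ".join(doc))
```

Weighting the draw by similarity is one plausible way to trade variety against fidelity: low-similarity synonyms still appear, which increases the diversity of the synthetic documents, but less often than close synonyms, which limits drift in meaning.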