We analyzed de-identified patient data from the New York State Statewide Planning and Research Cooperative System (SPARCS), comprising 9 million patient records from 2016 through 2019. Each record contains 35 features, including patient demographics, clinical diagnoses, length of stay, and total cost. We used big data and machine learning techniques implemented with the Python pandas library and the scikit-learn toolkit. We examined trends in the cost distributions and identified the diagnosis codes that correspond to the largest changes. The distributions are long-tailed with a peak near USD 9000. We compared cost samples from 2016 through 2019 and applied the Kolmogorov-Smirnov test, which showed that the samples arise from different statistical distributions (p-value < 0.0001). The dataset contained 305 unique clinical diagnoses, of which 275 (90% of the categories) showed an increase in cost. The largest cost increases were for "EXTENSIVE 3RD DEGREE OR FULL THICKNESS BURNS" (96.5%) and "CARDIAC CATHETERIZATION FOR OTHER NON-CORONARY CONDITIONS" (96%). We also developed machine learning models to predict costs. The model input consisted of patient demographics and diagnosis codes, and the model output was the predicted cost of treatment. We investigated the CatBoost regression model and used the R² score to evaluate performance, obtaining R² values in the range of 0.59 to 0.88. The higher R² values are obtained when the length of stay is used as an input feature. Although the cost distributions differed across the years 2016 through 2019, the R² scores of the proposed models were consistent across these years. The methodology in this study can help providers and policymakers better predict healthcare costs for planning purposes. The cost trends and the diagnosis codes associated with large cost increases can help direct expenditure to the areas where it is most needed. The results suggest that the age group "70 and older" could benefit from targeted interventions.
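The two quantitative tools named above, the two-sample Kolmogorov-Smirnov test and the R² score, can be illustrated with a minimal sketch. This is not the study's code. It uses only the Python standard library, an asymptotic approximation for the p-value, and synthetic log-normal "costs" whose parameters are invented for illustration. The function names `ks_two_sample` and `r2_score` are our own choices. In practice, `scipy.stats.ks_2samp` and `sklearn.metrics.r2_score` provide tested implementations.

```python
import math
import random
from bisect import bisect_right


def ks_two_sample(a, b):
    """Return the two-sample KS statistic D and an asymptotic p-value."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    # D is the largest gap between the two empirical CDFs. The gap can only
    # change at observed values, so checking those points is sufficient.
    d = max(abs(bisect_right(a, x) / n - bisect_right(b, x) / m) for x in a + b)
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        # The series below converges slowly here and the p-value is ~1.
        return d, 1.0
    # Asymptotic Kolmogorov tail: p = 2 * sum_{j>=1} (-1)^(j-1) exp(-2 j^2 lam^2)
    p = 2.0 * sum((-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
                  for j in range(1, 101))
    return d, min(1.0, max(0.0, p))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Synthetic, long-tailed "cost" samples for two hypothetical years.
# The log-normal parameters are assumptions, not values from the study.
rng = random.Random(0)
costs_year_a = [rng.lognormvariate(9.1, 0.9) for _ in range(2000)]
costs_year_b = [rng.lognormvariate(9.3, 0.9) for _ in range(2000)]

d_stat, p_value = ks_two_sample(costs_year_a, costs_year_b)
print(f"KS statistic D = {d_stat:.3f}, asymptotic p-value = {p_value:.2e}")
```

A small p-value indicates that the two samples are unlikely to come from the same distribution, which is the sense in which the yearly cost samples are said to differ. An R² close to 1 means the model explains most of the variance in cost.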