The increasing spread of COVID-19, caused by the virus SARS-CoV-2, raises concerns about the extent to which mutations have occurred across the viral genome. We present a partial replication of a 2021 study by Wang, R. et al. that identified four substrains and eleven top mutations in the United States. We analyze a portion of the authors' data set to recreate Figure S1 from the paper, and our version recapitulates the features observed in the original figure. We further generate a summary of mutation characteristics for each of the 26 named proteins and confirm the significance of the spike protein, which accounts for roughly 24% of all recorded mutations. Our analysis suggests that factors besides protein length may affect the per-protein mutation rate.
Comment: 11 pages, 4 figures, added references to GitHub source code