This study uses both traditional and newer verification methods to evaluate two 4-km grid-spacing Weather Research and Forecasting Model (WRF) forecasts: a 'cold start' forecast that derives its initial and boundary conditions from the 12-km North American Mesoscale Model (NAM) analysis and forecast cycle (C0), and a 'hot start' forecast that also assimilates radar data into the initial conditions using a three-dimensional variational data assimilation (3DVAR)/cloud analysis technique (CN). These forecasts were evaluated as part of the 2009 and 2010 NOAA Hazardous Weather Testbed (HWT) Spring Forecasting Experiments. The Spring Forecasting Experiment participants noted that the skill of CN's explicit forecasts of convection, as estimated by some traditional objective metrics, often seemed large compared to the subjectively determined skill. The Gilbert skill score (GSS) shows CN scoring higher than C0 at lower reflectivity thresholds, likely because CN has a higher frequency bias than C0; the difference is negligible at higher thresholds, where the frequency biases of CN and C0 are similar. This suggests that if traditional skill scores are used to quantify convective forecasts, higher (>35 dBZ) reflectivity thresholds should be used to remain consistent with experts' subjective assessments of the lack of forecast skill for individual convective cells. The spatial verification methods show that both CN and C0 generally have little to no skill at scales <8-12Δx beginning at forecast hour 1, but that CN has more skill than C0 at larger spatial scales (40-320 km) for the majority of the forecast period. This indicates that the hot start provides little to no benefit for forecasts of individual convective cells but some benefit for larger mesoscale precipitation systems. [ABSTRACT FROM AUTHOR]
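For readers unfamiliar with the traditional metrics referenced above, the GSS (also known as the equitable threat score) and the frequency bias are both computed from a 2x2 contingency table of forecast versus observed events at a given reflectivity threshold. A minimal sketch of the standard definitions follows; the function name and example counts are illustrative and not taken from the study:

```python
def gss_and_bias(hits, misses, false_alarms, correct_negatives):
    """Standard contingency-table verification metrics.

    hits: event forecast and observed
    misses: event observed but not forecast
    false_alarms: event forecast but not observed
    correct_negatives: event neither forecast nor observed
    """
    n = hits + misses + false_alarms + correct_negatives
    # Hits expected by random chance, given the forecast and observed frequencies
    hits_random = (hits + misses) * (hits + false_alarms) / n
    # Gilbert skill score: threat score adjusted for chance hits
    gss = (hits - hits_random) / (hits + misses + false_alarms - hits_random)
    # Frequency bias: ratio of forecast event count to observed event count
    bias = (hits + false_alarms) / (hits + misses)
    return gss, bias


# Hypothetical counts: the forecast over-predicts events (bias > 1),
# which tends to inflate GSS at low thresholds, as the abstract notes for CN.
gss, bias = gss_and_bias(hits=50, misses=30, false_alarms=70, correct_negatives=850)
print(f"GSS = {gss:.3f}, bias = {bias:.2f}")  # GSS = 0.288, bias = 1.50
```

A frequency bias above 1 means the event is forecast more often than it is observed; because the GSS is not bias-independent, comparing two forecasts with different biases (as with CN and C0 at low thresholds) can be misleading.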