Introduction Despite the recognized importance of atrophy in multiple sclerosis (MS), methods for its quantification have been mostly restricted to the research domain. Recently, a CE-labelled and FDA-approved MS-specific atrophy quantification method, MSmetrix, has become commercially available. Here we validate MSmetrix against established methods in simulated and in vivo MRI data.
Methods Whole-brain and gray matter (GM) volumes were measured with the cross-sectional pipeline of MSmetrix and compared to the outcomes of FreeSurfer (cross-sectional pipeline), SIENAX and SPM. For this comparison we investigated 20 simulated brain images, as well as in vivo data from 100 MS patients and 20 matched healthy controls. In 50 of the MS patients a second time point was available. In this subgroup, we additionally analyzed the whole-brain and GM volume change using the longitudinal pipeline of MSmetrix and compared the results with those of FreeSurfer (longitudinal pipeline) and SIENA.
Results In the simulated data, SIENAX displayed the smallest average deviation from the reference whole-brain volume (+19.56 ± 10.34 mL), followed by MSmetrix (−38.15 ± 17.77 mL), SPM (−42.99 ± 17.12 mL) and FreeSurfer (−78.51 ± 12.68 mL). A similar pattern was seen in vivo. Among the cross-sectional methods, Deming regression analyses revealed proportional errors, particularly in MSmetrix and SPM. The mean difference in percentage brain volume change (PBVC) was lowest between longitudinal MSmetrix and SIENA (+0.16 ± 0.91%). A strong proportional error was present between the longitudinal percentage gray matter volume change (PGVC) measures of MSmetrix and FreeSurfer (slope = 2.48). All longitudinal methods were sensitive to the MRI hardware upgrade that occurred during the study.
Conclusion MSmetrix, FreeSurfer, FSL and SPM show differences in atrophy measurements, even at the whole-brain level, that are large compared to typical atrophy rates observed in MS. Especially striking are the proportional errors between methods. Cross-sectional MSmetrix behaved similarly to SPM in terms of both mean volume difference and proportional error. Longitudinal MSmetrix behaved most similarly to SIENA. Our results indicate that brain volume measurement and normalization from T1-weighted images remain an unsolved problem that requires much more attention.
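As an illustration of how a proportional error between two methods can be detected, the following is a minimal sketch of a Deming regression in Python. It is not the implementation used in the study: the function name `deming_fit`, the assumed error-variance ratio `delta = 1` (equal measurement error in both methods) and the input values are illustrative assumptions. A fitted slope that differs from 1 indicates a proportional error, and an intercept that differs from 0 indicates a constant offset.

```python
import numpy as np


def deming_fit(x, y, delta=1.0):
    """Fit y = intercept + slope * x by Deming regression.

    delta is the assumed ratio of the error variance in y to the error
    variance in x (delta = 1.0 gives orthogonal regression). This value
    is an assumption of the sketch, not a setting taken from the study.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()

    # Sums of squares and cross-products around the means.
    s_xx = np.sum((x - x_mean) ** 2)
    s_yy = np.sum((y - y_mean) ** 2)
    s_xy = np.sum((x - x_mean) * (y - y_mean))
    if s_xy == 0:
        raise ValueError("x and y are uncorrelated; the slope is undefined")

    # Closed-form Deming slope; unlike ordinary least squares it allows
    # for measurement error in both x and y.
    diff = s_yy - delta * s_xx
    slope = (diff + np.sqrt(diff ** 2 + 4.0 * delta * s_xy ** 2)) / (2.0 * s_xy)
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Synthetic, noise-free volume changes (in %) for two hypothetical methods;
# method_b is constructed to report twice the change of method_a plus 0.1.
method_a = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
method_b = 2.0 * method_a + 0.1

slope, intercept = deming_fit(method_a, method_b)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

With real data the two inputs would be paired measurements of the same subjects from two methods, and `delta` would ideally be estimated from repeated measurements rather than fixed at 1.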