Background: The Dutch-Flemish PROMIS® Upper Extremity (DF-PROMIS-UE) v2.0 item bank was recently developed using Item Response Theory (IRT). Three questions remain open for this bank: (1) whether it is legitimate to calculate IRT-based scores for short forms and Computerized Adaptive Tests (CATs), which requires that the items meet the assumptions of, and fit, the IRT model (Graded Response Model [GRM]); (2) whether it is legitimate to compare (sub)groups of patients using this measure, which requires measurement invariance; and (3) how precise the estimated scores are for patients with different levels of functioning, and how that precision compares with legacy measures. The aims were to evaluate (1) the assumptions of and fit to the GRM, (2) measurement invariance, and (3) the (comparative) precision of the DF-PROMIS-UE v2.0. Methods: Cross-sectional data were collected from Dutch patients with upper extremity disorders. The assessment covered IRT assumptions (unidimensionality [bi-factor analysis], local independence [residual correlations], monotonicity [coefficient H]), GRM item fit, measurement invariance (absence of Differential Item Functioning [DIF] by age, gender, center, and duration and location of complaints), and precision (standard error of IRT-based scores across levels of functioning). To study measurement invariance across language (Dutch vs. English), additional US data were used. Legacy instruments were the Disabilities of the Arm, Shoulder and Hand (DASH), the QuickDASH, and the Michigan Hand Questionnaire (MHQ). Results: In total, 521 Dutch patients (mean age ± SD = 51 ± 17 years, 49% female) and 246 US patients (mean age ± SD = 48 ± 14 years, 69% female) participated. The DF-PROMIS-UE v2.0 item bank was sufficiently unidimensional (Omega-H = 0.80, Explained Common Variance = 0.68), had negligible local dependence (four out of 1035 correlations > 0.20), good monotonicity (H = 0.63), and good GRM fit (no misfitting items), and demonstrated sufficient measurement invariance.
Precise estimates (Standard Error < 3.2) were obtained for most patients (7-item short form, 88.5%; standard CAT, 91.3%; fixed 7-item CAT, 87.6%). The DASH displayed better reliability than the DF-PROMIS-UE short form and standard CAT, whereas the QuickDASH displayed comparable reliability. The MHQ-ADL displayed better reliability than the DF-PROMIS-UE short form and standard CAT for T-scores between 28 and 50. For patients with low function, the DF-PROMIS-UE measures performed better. Conclusions: The DF-PROMIS-UE v2.0 item bank showed sufficient psychometric properties in Dutch patients with upper extremity disorders. Keywords: Dutch-Flemish PROMIS; Item response theory; Measurement invariance; Reliability; Upper extremity.
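Note on the precision threshold: the abstract reports precision as a standard error on the T-score metric and compares instruments in terms of reliability. The sketch below shows one common way to relate the two. It assumes the usual T-score scaling, in which the reference population has a standard deviation of 10; the abstract does not state this scaling. The function names are illustrative and are not taken from the study.

```python
import math

# Assumption: scores are on a T-score metric whose reference
# population has SD = 10 (not stated in the abstract itself).
T_SCORE_SD = 10.0


def reliability_from_se(se: float, sd: float = T_SCORE_SD) -> float:
    """Reliability implied by a standard error on a scale with the given SD."""
    return 1.0 - (se / sd) ** 2


def se_from_reliability(reliability: float, sd: float = T_SCORE_SD) -> float:
    """Standard error that corresponds to a target reliability."""
    return sd * math.sqrt(1.0 - reliability)


if __name__ == "__main__":
    # The abstract's cut-off for a "precise" estimate.
    print(round(reliability_from_se(3.2), 4))
    # The standard error implied by a reliability of 0.90.
    print(round(se_from_reliability(0.90), 2))
```

Under that assumption, a standard error below 3.2 corresponds to a reliability of roughly 0.90 or higher, which is why the two quantities can be used interchangeably when comparing the short form, the CATs, and the legacy instruments.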