Background: Physical capacity tasks (ie, observer-administered outcome measures that comprise a standardized activity) are useful for assessing functioning in patients with low back pain.
Purpose: The purpose of this study was to systematically review the level of evidence for the reliability, validity, and responsiveness of physical capacity tasks.
Data Sources: MEDLINE, CINAHL, PsycINFO, Scopus, the Cochrane Library, and relevant reference lists were used as data sources.
Study Selection: Two authors independently selected articles addressing the reliability, validity, and responsiveness of physical capacity tasks, and a third author resolved discrepancies.
Data Extraction and Quality Assessment: One author performed data extraction, and a second author independently checked the data extraction for accuracy. Two authors independently assessed the methodological quality with the Consensus-Based Standards for the Selection of Health Measurement Instruments (COSMIN) 4-point checklist, and a third author resolved discrepancies.
Data Synthesis and Analysis: Data synthesis was performed by all authors to determine the level of evidence per measurement property per physical capacity task. The 5-repetition sit-to-stand, 5-minute walk, 50-ft (≈15.3-m) walk, Progressive Isoinertial Lifting Evaluation, and Timed "Up & Go" tasks displayed moderate to strong evidence for positive ratings of both reliability and construct validity. The 1-minute stair-climbing, 5-repetition sit-to-stand, shuttle walking, and Timed "Up & Go" tasks showed limited evidence for positive ratings of responsiveness.
Limitations: The COSMIN 4-point checklist was originally developed for patient-reported outcome measures and not physical capacity tasks.
Conclusions: The 5-repetition sit-to-stand, 50-ft walk, 5-minute walk, Progressive Isoinertial Lifting Evaluation, Timed "Up & Go," and 1-minute stair-climbing tasks are promising tests for the measurement of functioning in patients with chronic low back pain. However, more research on the measurement error and responsiveness of these tasks is needed to be able to fully recommend them as outcome measures in research and clinical practice.