
Rehab Measures Database

NIH Toolbox for Assessment of Neurological and Behavioral Function--Cognition Battery, V3


Purpose

The NIHTB-CB was designed as a brief assessment tool that a variety of researchers could use to measure neurological function across the lifespan, with particular emphasis on longitudinal epidemiologic studies and intervention trials.


Acronym NIHTB-CB

Area of Assessment

Attention & Working Memory
Cognition
Executive Functioning
Language
Processing Speed

Assessment Type

Performance Measure

Administration Mode

Computer

Cost

Not Free

Actual Cost

$599.99

Cost Description

The annual subscription cost for the NIHTB (V3) is $599.99, plus the cost of equipment (iPad, Bluetooth keyboard, batteries).

The yearly subscription price covers up to 2 iPads. Additional subscription options cover up to 6 devices for $1,499.99 and up to 10 devices for $2,499.99. The subscription also includes the NIH Toolbox batteries for the emotion, motor, and sensory domains, plus access to PROMIS, Neuro-QoL, TBI-QOL, SCI-QOL and SCI-FI, and ECOG. The subscription renews automatically. A free 14-day trial is available, but data cannot be saved and reports cannot be exported during the trial.

The price of the public V2 apps will increase to $749.99, with anticipated annual increases of $250 each June 1 from 2025 through 2027.

Key Descriptions

  • Brief measures designed to provide a common currency among researchers for assessing cognitive domains that are important for school and work.
  • The NIH Toolbox Version 3 (V3) update was released in 2023 with an enhanced user interface, advanced data management capabilities, and expanded features, including the addition of new and improved tests with updated normed scores. The NIH Toolbox V2 will be retired: annual subscriptions for both the English and Spanish versions will no longer be sold after June 30, 2027, and the last subscription will expire on June 30, 2028. Official support for the V2 app will cease after August 2028.
  • Seven main subtests cover attention, executive functioning, language, memory, and processing speed.
  • A new normative sample, with demographics representative of the 2020 U.S. Census, was collected for the NIH Toolbox Cognition Tests and the Standing Balance Test (Motor Domain) as part of the 2023 Version 3 update.
  • For those aged 4-6, an Early Childhood Composite score is derived from the scores for the Dimensional Change Card Sort, Flanker, Picture Sequence Memory, Picture Vocabulary, and Speeded Matching tests.
  • For those aged 7+, a Total Cognition Composite score is derived from the Crystallized Composite (Picture Vocabulary and Oral Reading Recognition tests) and Fluid Composite (Dimensional Change Card Sort, Flanker, Picture Sequence Memory, List Sorting, and Pattern Comparison tests) scores.
  • Supplemental tests not used in calculating the composite scores can help clarify the participant’s cognitive functioning.
  • Results can be accessed through a Score Report or .CSV file once the assessment has been completed.
  • The NIHTB-CB was designed to be quick, reliable, and easy to administer.
  • The instrument uses item response theory (IRT) and computerized adaptive testing (CAT) to assess constructs quickly.

Number of Items

  • NIH Toolbox Picture Vocabulary Test: 25 items
  • NIH Toolbox Flanker Inhibitory Control and Attention Test: 20 items
  • NIH Toolbox List Sorting Working Memory Test: 12 items
  • NIH Toolbox Dimensional Change Card Sort Test: 30 mixed items
  • NIH Toolbox Pattern Comparison Processing Speed Test: 130 items or 85 seconds
  • NIH Toolbox Picture Sequence Memory Test: 2 test sequences
  • NIH Toolbox Oral Reading Recognition Test: 25 items
  • *NIH Toolbox Oral Symbol Digit Test: 144 items
  • *NIH Auditory Verbal Learning Test: 15 items, 3 trials

*Supplemental tests that can be used to better understand the participant’s cognitive functioning.

Equipment Required

  • NIH Toolbox App from iTunes
  • 11-inch iPad Air or iPad Pro
  • Bluetooth keyboard (including batteries)
  • Laminated sheet for the Oral Symbol Digit Test, with the key and nine practice items on one side and the key and test items on the other
  • Home base (downloaded from the NIH Toolbox website)
  • Pronunciation guide
  • NIH Toolbox Oral Reading Recognition Test Training and Certification Materials (contact: cognition@nihtoolbox.org)

Time to Administer

Up to 32 minutes

NIHTB (V3)-CB

Administering the complete NIH Toolbox (V3) Cognition Battery takes up to 32 minutes; however, actual testing time may vary with individual differences among test takers and test administrators, among other factors. The times for the individual tests are:
  • NIH Toolbox Picture Vocabulary Test: 3 minutes
  • NIH Toolbox Flanker Inhibitory Control and Attention Test: 3 minutes
  • NIH Toolbox List Sorting Working Memory Test: 7 minutes
  • NIH Toolbox Dimensional Change Card Sort Test: 4 minutes
  • NIH Toolbox Pattern Comparison Processing Speed Test: 4 minutes
  • NIH Toolbox Picture Sequence Memory Test: 7 minutes
  • NIH Toolbox Oral Reading Recognition Test: 4 minutes
  • *NIH Oral Symbol Digit Test: 3 minutes
  • *NIH Rey Auditory Verbal Learning Test: 4 minutes
  • *NIH Visual Reasoning Test: 7 minutes
  • *NIH Face Name Associative Memory Exam Test: 7 minutes
  • *NIH Speeded Matching Test: 3 minutes

*Supplemental tests can be used to better understand the participant’s cognitive functioning and are not included in the overall time estimate.

NIHTB (V2)-CB

Administering the complete NIH Toolbox (V2) Cognition Battery takes up to 31 minutes; however, actual testing time may vary with individual differences among test takers and test administrators, among other factors. The times for the individual tests are:
  • NIH Toolbox Picture Vocabulary Test: 4 minutes
  • NIH Toolbox Flanker Inhibitory Control and Attention Test: 3 minutes
  • NIH Toolbox List Sorting Working Memory Test: 7 minutes
  • NIH Toolbox Dimensional Change Card Sort Test: 4 minutes
  • NIH Toolbox Pattern Comparison Processing Speed Test: 3 minutes
  • NIH Toolbox Picture Sequence Memory Test: 7 minutes
  • NIH Toolbox Oral Reading Recognition Test: 3 minutes
  • *NIH Oral Symbol Digit Test: 3 minutes
  • *NIH Auditory Verbal Learning Test: 3 minutes

*Supplemental tests can be used to better understand the participant’s cognitive functioning and are not included in the overall time estimate.


Required Training

Training Course

Required Training Description

Two avenues for NIH Toolbox training are available: eLearning and workshops.

Links to online videos for the administration of the NIH Cognition Battery, “How To” videos, and virtual conference and workshop recordings may be found here: https://www.healthmeasures.net/NIH_Toolbox_iPad_e-learning/story_html5.html.

In-person workshops last a day and a half and are scheduled throughout the year at various locations. Upcoming scheduled workshops are listed here: https://www.healthmeasures.net/index.php?option=com_content&view=category&layout=blog&id=128&Itemid=934

For the cognition battery, the test administrator is required to have a Level C classification.

Age Ranges

Preschool Children: 3-5 years

Child: 6-12 years

Adolescent: 13-17 years

Adult: 18-64 years

Elderly Adult: 65+ years

Instrument Reviewers

Constance Richard, MS, CRC, University of Wisconsin-Madison doctoral student under the direction of Timothy Tansey, PhD, Rehabilitation Psychology & Special Education Department, School of Education, University of Wisconsin-Madison

Kevin Fearn, MS, Shirley Ryan AbilityLab

ICF Domain

Body Function
Body Structure
Activity
Participation

Measurement Domain

Cognition

Professional Association Recommendation

None found – last searched 8/28/2024

Considerations

An NIH Infant and Toddler “Baby” Toolbox (ages 0-42 months) containing more than 30 assessments of Cognition, Motor, and Social-Emotional domains in one iPad app is expected to be released in the second half of 2024.

Stroke


Normative Data

Mild & Moderate/Severe Stroke: (Carlozzi et al., 2017a; n = 131 (mild stroke, n = 71; moderate/severe stroke, n = 60, of whom 54 moderate and 6 severe); mean age = 57.5 (12.6) years; age range = 22-83 years; male = 51%; median time post CVA = 29.0 months (range = 12.5-87.3 months); mean time post CVA = 31.5 (11.8) months)

 

National Institutes of Health (NIH) Toolbox (NIHTB) Cognition Battery Scores for Individuals with Mild vs. Moderate/Severe Stroke

| NIHTB scores | n | Mild stroke, Mean (SD) | n | Moderate/severe stroke, Mean (SD) |
|---|---|---|---|---|
| Composite scores* | | | | |
| Fluid | 71 | 42.71 (12.64) | 42 | 34.00 (9.57) |
| Crystallized | 75 | 50.54 (11.73) | 50 | 45.72 (10.85) |
| Subtest scores | | | | |
| Picture vocabulary | 77 | 50.65 (12.77) | 51 | 47.08 (11.38) |
| Oral reading recognition | 75 | 50.10 (10.18) | 50 | 45.22 (10.76) |
| Picture sequence memory test | 73 | 45.86 (12.96) | 45 | 35.98 (11.34) |
| Pattern comparison | 74 | 45.10 (11.25) | 49 | 38.05 (9.13) |
| List sorting | 74 | 45.70 (10.26) | 45 | 42.21 (10.86) |
| Flanker | 77 | 44.74 (10.89) | 50 | 37.76 (9.53) |
| DCCS | 77 | 44.31 (10.11) | 49 | 38.94 (8.83) |

*Fluid cognitive composite score combines Dimensional Change Card Sort (DCCS) Test, Flanker Test of Executive Function Inhibitory Control and Attention, Picture Sequence Memory Test of episodic memory, List Sorting Working Memory Test, and Pattern Comparison Processing Speed Test. Crystallized cognitive composite score includes Picture Vocabulary and Oral Reading scores.

 

Stroke: (Carlozzi et al., 2017b; n = 211; mean age = 56.13 (12.97) years; female = 50.2%; mean time post CVA = 2.74 (2.46) years)

 

NIHTB Cognition Battery Scores for Stroke Participantsᵃ

| NIHTB scores | n | Mean (SD) | % Impairedᵇ |
|---|---|---|---|
| Composite scores | | | |
| Fluid | 176 | 40.51 (11.59) | 49.2 |
| Crystallized | 176 | 49.54 (10.85) | 19.2 |
| Subtest scores | | | |
| Picture vocabulary | 201 | 49.08 (12.06) | 22.3 |
| Oral reading recognition | 201 | 48.34 (10.58) | 23.3 |
| Picture sequence memory test | 175 | 42.64 (12.38) | 43.8 |
| Pattern comparison | 175 | 43.13 (10.74) | 39.2 |
| List sorting | 175 | 43.98 (9.96) | 40.3 |
| Flanker | 175 | 42.08 (10.58) | 40.3 |
| DCCS | 175 | 42.46 (9.47) | 40.3 |

ᵃT scores are demographically adjusted for age, sex, education, and race/ethnicity.

ᵇ% impaired is the percentage of individuals scoring more than 1 SD below the mean on a given test.
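The % impaired column is a simple count on the T-score metric (mean 50, SD 10), where 1 SD below the mean is a T score of 40. The Python sketch below illustrates that count; it is not the authors' code. It assumes a score of exactly 40 is not counted as impaired (the definition says more than 1 SD below), and the example scores are hypothetical.

```python
from typing import Sequence

# T-score metric used by the demographically adjusted NIHTB scores.
T_MEAN = 50.0
T_SD = 10.0


def percent_impaired(t_scores: Sequence[float], n_sd: float = 1.0) -> float:
    """Return the percentage of T scores more than n_sd SDs below the mean.

    Assumption: "more than 1 SD below" is read as strictly below 40,
    so a score of exactly 40 is not counted as impaired.
    """
    if len(t_scores) == 0:
        raise ValueError("t_scores must not be empty")
    cutoff = T_MEAN - n_sd * T_SD  # 40.0 when n_sd == 1
    impaired = sum(1 for t in t_scores if t < cutoff)
    return 100.0 * impaired / len(t_scores)


# Hypothetical scores for five participants (not study data).
example = [35.0, 42.5, 39.9, 50.0, 40.0]
print(percent_impaired(example))  # 40.0
```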

Construct Validity

Convergent validity:

Stroke: (Carlozzi et al., 2017a; all correlations significant at p < 0.01)

  • Adequate convergent validity between Flanker and Delis-Kaplan Executive Function System (DKEFS) Interference (r = 0.46; r = 0.50 with motor function included as a covariate)
  • Adequate convergent validity between Dimensional Change Card Sort (DCCS) and DKEFS Interference (r = 0.54; r = 0.40 with motor function included as a covariate)
  • Excellent convergent validity between List Sorting Working Memory and Wechsler Adult Intelligence Scale, Fourth Edition (WAIS-IV) Letter-Number Sequencing (LN) (r = 0.66)
  • Adequate convergent validity between Pattern Comparison and WAIS-IV Coding (CD) (r = 0.59; r = 0.58 with motor function included as a covariate)
  • Excellent convergent validity between Pattern Comparison and WAIS-IV Symbol Search (SS) (r = 0.67; r = 0.66 with motor function included as a covariate)
  • Adequate convergent validity between Picture Sequence Memory and Rey Auditory Verbal Learning Test (AVLT) (r = 0.52)
  • Excellent convergent validity between Picture Sequence Memory and Brief Visuospatial Memory Test-Revised (BVMT-R) (r = 0.65)
  • Excellent convergent validity between Picture Vocabulary and Peabody Picture Vocabulary Test, Fourth Edition (PPVT-IV) (r = 0.87)
  • Excellent convergent validity between Oral Reading Recognition and Wide Range Achievement Test, Fourth Edition (WRAT-IV) Word Reading (r = 0.88)

 

Discriminant validity

Stroke: (Carlozzi et al., 2017a)

Convergent and Discriminant Validity for NIHTB Scores for Combined Stroke Sample (n = 131)*

| NIHTB Scores | AVLT | BVMT | WRAT | PPVT | WAIS CD | WAIS SS | WAIS LN | DKEFS |
|---|---|---|---|---|---|---|---|---|
| PS | 0.45 | 0.62 | 0.29 | 0.36 | 0.50 | 0.54 | 0.38 | 0.30 |
| OR | 0.32 | 0.54 | 0.88 | 0.77 | 0.47 | 0.48 | 0.66 | 0.48 |
| PV | 0.40 | 0.61 | 0.74 | 0.87 | 0.47 | 0.52 | 0.57 | 0.48 |
| PC | 0.28 (0.23) | 0.38 (0.36) | 0.25 (0.24) | 0.38 (0.38) | 0.59 (0.57) | 0.67 (0.66) | 0.35 (0.32) | 0.47 (0.44) |
| LS | 0.40 | 0.60 | 0.61 | 0.58 | 0.50 (0.49) | 0.54 (0.54) | 0.64 | 0.47 |
| DCCS | 0.45 (0.41) | 0.57 (0.55) | 0.51 (0.50) | 0.55 (0.55) | 0.60 (0.57) | 0.65 (0.64) | 0.49 (0.47) | 0.54 (0.51) |
| Flanker | 0.33 (0.28) | 0.50 (0.48) | 0.37 (0.36) | 0.50 (0.50) | 0.61 (0.58) | 0.61 (0.59) | 0.41 (0.38) | 0.46 (0.43) |

*All correlations significant at p < .01.

Note: NIHTB scores: PS = Picture sequence memory; OR = Oral reading recognition; PV = Picture vocabulary; PC = Pattern comparison; LS = List sorting; DCCS = Dimensional Change Card Sort; Flanker = Flanker inhibitory control and attention; AVLT = Rey Auditory Verbal Learning Test; BVMT = Brief Visuospatial Memory Test-Revised; WRAT = Wide Range Achievement Test, Fourth Edition, Word Reading; PPVT = Peabody Picture Vocabulary Test, Fourth Edition; WAIS CD = Wechsler Adult Intelligence Scale – Digit Symbol Coding; WAIS SS = Wechsler Adult Intelligence Scale – Symbol Search; WAIS LN = Wechsler Adult Intelligence Scale – Letter Number Sequencing, Fourth Edition; DKEFS = Delis-Kaplan Executive Function System

  • NIHTB scores discriminated between mild and moderate/severe stroke (see Normative Data for mean T scores):
    • Individuals with moderate/severe stroke performed significantly worse on the Fluid composite score and on the Picture sequence memory test, Pattern comparison, Flanker, and DCCS subtests (p < 0.01)
    • Individuals with moderate/severe stroke also performed significantly worse on the Crystallized composite score and on the Oral reading recognition subtest (p < 0.05)

 

 

Content Validity

The Cognition Battery (CB) team selected the subdomains of attention, executive function, episodic memory, language, working memory, and processing speed based on input from 293 experts, gathered through two online Requests for Information (RFIs). The RFIs were followed up by telephone interviews with a subset of 44 experts. After the specific tests for each subdomain were selected, the NIH Toolbox was presented to an expert advisory panel, whose written critiques of the subdomains and instruments were reviewed and addressed by the CB team. Finally, several conference calls were held with 16 expert consultants to present the version of the CB created for validation testing and to invite feedback before the validation study began (Weintraub et al., 2013a).

Responsiveness

Stroke: (Carlozzi et al., 2017a)

Effect sizes (Cohen’s d) by stroke severity for NIHTB – Cognition Battery composite and subtest scores

| NIHTB scores | Mild stroke | Moderate/severe stroke |
|---|---|---|
| Composite scores | | |
| Fluid | -0.64 | -1.64 |
| Crystallized | 0.05 | -0.41 |
| Subtest scores | | |
| Picture vocabulary | 0.06 | -0.27 |
| Oral reading recognition | 0.01 | -0.46 |
| Picture sequence memory test | -0.36 | -0.98 |
| Pattern comparison | -0.46 | -1.25 |
| List sorting | -0.42 | -0.75 |
| Flanker | -0.50 | -1.25 |
| DCCS | -0.57 | -1.18 |
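The entry does not say how these effect sizes were derived. As a hedged reconstruction only, comparing each group mean from the stroke Normative Data table with the T-score norms (mean 50, SD 10), standardized by the root-mean-square of the group SD and the normative SD, reproduces several tabled values, including both mild-stroke composites. It does not reproduce every row (the moderate/severe Picture sequence memory test value, for example), so treat the formula in this Python sketch as an assumption, not the authors' method.

```python
import math

# Normative T-score distribution (mean 50, SD 10).
T_MEAN = 50.0
T_SD = 10.0


def cohens_d_vs_norms(group_mean: float, group_sd: float) -> float:
    """Cohen's d for a group mean against the T-score norms.

    Assumption: the standardizer is the root-mean-square of the group SD
    and the normative SD, sqrt((SD_group**2 + SD_norm**2) / 2).
    The source does not state which formula it used.
    """
    pooled_sd = math.sqrt((group_sd ** 2 + T_SD ** 2) / 2.0)
    return (group_mean - T_MEAN) / pooled_sd


# Mild stroke, Fluid composite: 42.71 (12.64) in the Normative Data table.
print(round(cohens_d_vs_norms(42.71, 12.64), 2))  # -0.64
# Mild stroke, Crystallized composite: 50.54 (11.73).
print(round(cohens_d_vs_norms(50.54, 11.73), 2))  # 0.05
```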

Brain Injury


Cut-Off Scores

Brain Injury: (Tulsky et al., 2017; n = 182 (complicated mild/moderate TBI, n = 83; severe TBI, n = 99); mean age = 38.6 (17.4) years; mean time since TBI = 5.8 (5.6) years)

  • A score ≤ 40 (16th percentile) was used as the cutoff for a low score. In the complicated mild/moderate and severe TBI groups, 59% and 75% of individuals, respectively, had one or more low scores. Because 46% of the normative sample also had one or more low scores, specificity is poor (0.54).
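The arithmetic behind this cut-off can be sketched with the Python standard library. The sketch assumes normally distributed T scores (mean 50, SD 10), under which a score of 40 falls at about the 15.9th percentile (the "16th percentile" above). It also treats the normative sample as the reference group for specificity, so specificity is simply 1 minus the 46% base rate of one or more low scores. The profile passed to has_low_score is hypothetical.

```python
from statistics import NormalDist
from typing import Sequence

# Assumed normative T-score distribution (mean 50, SD 10).
T_DIST = NormalDist(mu=50.0, sigma=10.0)
LOW_SCORE_CUTOFF = 40.0  # "score <= 40" per the cut-off above


def percentile_of_t(t: float) -> float:
    """Percentile rank of a T score under the normal distribution."""
    return 100.0 * T_DIST.cdf(t)


def has_low_score(t_scores: Sequence[float], cutoff: float = LOW_SCORE_CUTOFF) -> bool:
    """True if any score in a battery profile is at or below the cutoff."""
    return any(t <= cutoff for t in t_scores)


def specificity_from_base_rate(normative_base_rate: float) -> float:
    """Share of the normative sample NOT flagged by the one-or-more-low-scores rule."""
    return 1.0 - normative_base_rate


print(round(percentile_of_t(40.0), 1))            # 15.9
print(has_low_score([45.0, 38.0, 52.0]))           # True
print(round(specificity_from_base_rate(0.46), 2))  # 0.54
```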

Normative Data

Brain Injury: (Tulsky et al., 2017)

National Institutes of Health (NIH) Toolbox (NIHTB) Demographically Adjusted T Scores for Individuals with Complicated Mild/Moderate TBI, Severe TBI, and Matched Controls

| NIHTB scores | Complicated mild/moderate TBI, Mean (SD) (n = 74) | Severe TBI, Mean (SD) (n = 84) | Control, Mean (SD) (n = 158) |
|---|---|---|---|
| Composite scores* | | | |
| Fluid | 92.2 (18.7) | 82.4 (18.4) | 101.6 (16.0) |
| Crystallized | 103.2 (15.9) | 96.2 (14.7) | 101.7 (15.2) |
| Subtest scores | | | |
| Picture vocabulary | 102.9 (15.4) | 97.1 (14.2) | 101.5 (15.4) |
| Oral reading recognition | 103.2 (15.8) | 96.4 (15.3) | 102.0 (15.5) |
| Picture sequence memory test | 92.5 (16.2) | 84.2 (15.5) | 100.2 (14.3) |
| Pattern comparison | 96.7 (15.9) | 89.8 (17.2) | 101.8 (15.9) |
| List sorting | 96.3 (15.5) | 91.2 (15.0) | 102.1 (15.3) |
| Flanker | 94.0 (17.2) | 86.2 (16.4) | 100.8 (14.6) |
| DCCS | 94.9 (16.7) | 90.1 (16.6) | 101.1 (15.5) |

*Fluid cognitive composite score combines Dimensional Change Card Sort (DCCS) Test, Flanker Test of Executive Function Inhibitory Control and Attention, Picture Sequence Memory Test of episodic memory, List Sorting Working Memory Test, and Pattern Comparison Processing Speed Test. Crystallized cognitive composite score includes Picture Vocabulary and Oral Reading scores.
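Unlike the Carlozzi et al. tables, whose T scores sit near 50, the values in this table cluster near 100 with control SDs near 15, the scale NIH Toolbox uses for its standard scores. If these are standard scores (an assumption; the table title says T scores), a linear z-score conversion puts them on the T metric for comparison with the other tables. The Python sketch below rescales only and does not reproduce any demographic adjustment.

```python
def standard_to_t(standard_score: float) -> float:
    """Convert a mean-100/SD-15 standard score to the mean-50/SD-10 T metric.

    This is a linear z-score rescaling only; it does not apply any
    demographic adjustment.
    """
    z = (standard_score - 100.0) / 15.0
    return 50.0 + 10.0 * z


# Severe TBI, Fluid composite mean from the table above.
print(round(standard_to_t(82.4), 1))  # 38.3
print(standard_to_t(85.0))            # 40.0
```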

 

 

Brain Injury: (Carlozzi et al., 2017; n = 184; mean age = 39.83 (17.4) years; male = 64.1%; mean time since TBI = 5.95 (5.54) years)

NIHTB Cognition Scores for TBI Participantsᵃ

| NIHTB scores | n | Mean (SD) | % Impairedᵇ |
|---|---|---|---|
| Composite scores | | | |
| Fluid | 163 | 41.75 (13.00) | 41.1 |
| Crystallized | 163 | 49.79 (9.54) | 16.6 |
| Subtest scores | | | |
| Picture vocabulary | 177 | 49.41 (9.60) | 15.8 |
| Oral reading recognition | 177 | 49.86 (10.82) | 16.9 |
| Picture sequence memory test | 161 | 42.35 (10.88) | 43.5 |
| Pattern comparison | 161 | 45.44 (10.86) | 28.0 |
| List sorting | 161 | 44.75 (10.55) | 32.3 |
| Flanker | 161 | 43.21 (11.10) | 38.5 |
| DCCS | 161 | 44.42 (10.49) | 33.5 |

ᵃT scores are demographically adjusted for age, sex, education, and race/ethnicity.

ᵇ% impaired is the percentage of individuals scoring more than 1 SD below the mean on a given test.

Construct Validity

Convergent validity: 

Brain Injury: (Tulsky et al., 2017)

  • Excellent correlation between Oral Reading Recognition and Wide Range Achievement Test, 4th Edition (r = 0.83)
  • Excellent correlation between Picture Vocabulary and Peabody Picture Vocabulary Test, 4th Edition (r = 0.80)
  • Adequate correlation between List Sorting and WAIS-IV Letter Number Sequencing (r = 0.56)
  • Excellent correlation between Picture Sequence Memory and Brief Visuospatial Memory Test-Revised (r = 0.68)
  • Excellent correlation between Pattern Comparison and WAIS-IV Coding (r = 0.69)
  • Adequate correlation between Flanker Inhibitory Control and DKEFS Color-Word Interference-Inhibition (r = -0.46)
  • Adequate correlation between Dimensional Change Card Sort and Wisconsin Card Sorting Test (r = -0.42)

 

Discriminant validity:

Repetitive Head Impact (RHI): (Amadon et al., 2023; n = 176; mean age = 21.19 (1.63) years; female = 60 (34%); n = 115 in contact sports, mean age = 21.46 (1.65) years, female = 20 (17%); n = 61 in non-contact sports, mean age = 20.67 (1.45) years, female = 40 (66%))

  • Across the whole sample, the Picture Sequence Memory Test (episodic memory) showed significant discriminant ability for contact vs. non-contact sport athletes (F(1,171) = 7.16, p = 0.008), years of exposure (F(1,171) = 4.19, p = 0.042), and extensive vs. non-extensive RHI exposure based on traumatic encephalopathy syndrome (TES) criteria (F(1,171) = 5.12, p = 0.025)
    • A similar, although nonsignificant, effect was observed for age of first exposure (AFE), with athletes with AFE < 12 having better episodic memory than those with AFE 12+/none (F(1,171) = 3.88, p = 0.05)
  • Among females, the Picture Sequence Memory Test (episodic memory) showed significant discriminant ability for contact vs. non-contact sport athletes (F(1,56) = 5.05, p = 0.03), years of exposure (F(1,56) = 4.54, p = 0.04), AFE < 12 vs. AFE 12+/none (F(1,56) = 4.76, p = 0.03), and extensive vs. non-extensive RHI exposure based on TES criteria (F(1,56) = 4.99, p = 0.03)
  • Among males, no significant relationships were found between RHI measures and episodic memory performance, although the direction of the relationships matched that seen in females.


Intellectual Disability


Test/Retest Reliability

Intellectual Disability: (Shields et al., 2020; n = 242, mean age = 15.71 (5.15) years (Down syndrome = 91, mean age = 15.88 (5.17) years; Fragile X syndrome = 75, mean age = 16.16 (4.92) years; Other ID = 76, mean age = 15.05 (5.35) years); chronological age from 6 through 25 years; full-scale IQ <80 on Stanford-Binet, 5th edition (SB-5); mental age of at least 3.0 years on the SB-5)  

  • Acceptable test-retest reliability for Flanker Inhibitory Control and Attention (ICC = 0.74)
  • Acceptable test-retest reliability for Dimensional Change Card Sort (ICC = 0.71)
  • Acceptable test-retest reliability for List Sorting Working Memory (ICC = 0.74)
  • Acceptable test-retest reliability for Pattern Comparison (ICC = 0.77)
  • Poor test-retest reliability for Picture Sequence Memory (ICC = 0.47 with form A-A and ICC = 0.55 with form A-B)
  • Acceptable test-retest reliability for Picture Vocabulary (ICC = 0.85)
  • Excellent test-retest reliability for Oral Reading Recognition (ICC = 0.96)
  • Acceptable test-retest reliability for Fluid Composite (ICC = 0.83)
  • Excellent test-retest reliability for Crystallized Composite (ICC = 0.93)
  • Excellent test-retest reliability for Cognitive Function Composite (ICC = 0.92)

 

Intellectual Disability: (Hessl et al., 2016; Fragile X syndrome: n = 63; mean age = 19.3 (8.3) years; mean mental age = 5.3 (1.6) years)

  • Acceptable test-retest reliability for Flanker Inhibitory Control and Attention (ICC = 0.75)
  • Acceptable test-retest reliability for Dimensional Change Card Sort (ICC = 0.88)
  • Acceptable test-retest reliability for List Sorting Working Memory (ICC = 0.84)
  • Excellent test-retest reliability for Pattern Comparison (ICC = 0.90)
  • Acceptable test-retest reliability for Picture Sequence Memory (ICC = 0.76)
  • Acceptable test-retest reliability for Picture Vocabulary (ICC = 0.77)
  • Excellent test-retest reliability for Oral Reading Recognition (ICC = 0.99)
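The Excellent/Acceptable/Poor labels for test-retest ICCs in this entry are consistent with fixed cut points: every ICC of 0.90 or above is labeled Excellent, values from 0.71 to 0.89 are Acceptable, and the values of 0.55 or below are Poor. The Python sketch below encodes those inferred cut points; the sources do not state them, and the 0.70 lower boundary for Acceptable is an assumption (any value between 0.56 and 0.71 fits the labels).

```python
# Cut points inferred from the labels used in this entry (not a published rule).
EXCELLENT_MIN = 0.90
ACCEPTABLE_MIN = 0.70  # assumption: the true boundary lies between 0.56 and 0.71


def label_icc(icc: float) -> str:
    """Map a test-retest ICC to the qualitative label used in this entry."""
    if not -1.0 <= icc <= 1.0:
        raise ValueError(f"ICC out of range: {icc}")
    if icc >= EXCELLENT_MIN:
        return "Excellent"
    if icc >= ACCEPTABLE_MIN:
        return "Acceptable"
    return "Poor"


# Hessl et al. (2016) values from the list above.
for name, icc in [("Pattern Comparison", 0.90), ("Dimensional Change Card Sort", 0.88)]:
    print(name, label_icc(icc))
```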

Construct Validity

Convergent validity:

Intellectual Disability: (Shields et al., 2020)

  • Adequate convergent validity of Flanker Inhibitory Control and Attention with Conners Kiddie Continuous Performance Test, 2nd Edition (r = -0.52*)
  • Adequate convergent validity of Dimensional Change Card Sort with NEPSY Inhibition (NEPSY-IN) subtest (r = 0.48*)
  • Excellent convergent validity of List Sorting Working Memory with Stanford-Binet, 5th Edition (SB-5) Verbal Working Memory (r = 0.65*)
  • Excellent convergent validity of Pattern Comparison with Wechsler Preschool and Primary Scale of Intelligence, 4th Edition Bug Search (r = 0.66*)
  • Adequate convergent validity of Picture Sequence Memory with Leiter International Performance Scale, 3rd Edition Forward Memory (Leiter-FM) (r = 0.47*)
  • Excellent convergent validity of Picture Vocabulary Test with Peabody Picture Vocabulary Test, 4th Edition (r = 0.83*)
  • Excellent convergent validity of Oral Reading Recognition with Woodcock-Johnson, 4th Edition Letter-Word Identification (WJ-LW) (r = 0.92*)
  • Excellent convergent validity of Fluid Composite with SB-5 Fluid Reasoning IQ (r = 0.60*)
  • Excellent convergent validity of Crystallized Composite with SB-5 Verbal IQ (r = 0.75*)

*Significant at p < 0.001

 

Discriminant validity:

Intellectual Disability: (Shields et al., 2020)

  • Adequate discriminant validity of Flanker Inhibitory Control and Attention with WJ-LW (r = -0.53*)
  • Adequate discriminant validity of Dimensional Change Card Sort with WJ-LW (r = 0.36*)
  • Adequate discriminant validity of List Sorting Working Memory with WJ-LW (r = 0.49*)
  • Adequate discriminant validity of Pattern Comparison with WJ-LW (r = 0.45*)
  • Adequate discriminant validity of Picture Sequence Memory with WJ-LW (r = 0.50*)
  • Adequate discriminant validity of Picture Vocabulary Test with Leiter-FM (r = 0.47*)
  • Adequate discriminant validity of Oral Reading Recognition with Leiter-FM (r = 0.58*)
  • Poor discriminant validity of Fluid Composite with SB-5 Verbal IQ (r = 0.61*)
  • Poor discriminant validity of Crystallized Composite with SB-5 Fluid Reasoning IQ (r = 0.68*)

*Significant at p < 0.001
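The correlation labels in this entry likewise fit fixed cut points on |r|, applied in opposite directions for convergent validity (higher is better) and discriminant validity (lower is better): |r| of 0.60 or more is Excellent convergent but Poor discriminant validity, |r| of 0.30 or less is Excellent discriminant validity, and values in between are Adequate. These cut points are inferred from the labels, not stated by the sources; the Python sketch below simply encodes them.

```python
# Cut points on |r| inferred from the labels in this entry (an assumption).
HIGH = 0.60
LOW = 0.30


def label_convergent(r: float) -> str:
    """Convergent validity: a stronger correlation with a similar measure is better."""
    magnitude = abs(r)
    if magnitude >= HIGH:
        return "Excellent"
    if magnitude > LOW:
        return "Adequate"
    return "Poor"


def label_discriminant(r: float) -> str:
    """Discriminant validity: a weaker correlation with a dissimilar measure is better."""
    magnitude = abs(r)
    if magnitude <= LOW:
        return "Excellent"
    if magnitude < HIGH:
        return "Adequate"
    return "Poor"


# Shields et al. (2020) values from the lists above.
print(label_convergent(-0.52))   # Adequate
print(label_discriminant(0.61))  # Poor
```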


Mixed Populations


Test/Retest Reliability

Mixed populations: (Weintraub et al., 2013b; n = 476; age range = 3-85 years; female = 53.1%; stratified sample of community-dwelling individuals)

  • Excellent test-retest reliability for Flanker Inhibitory Control and Attention (ICC = 0.96)
  • Excellent test-retest reliability for Dimensional Change Card Sort (ICC = 0.94)
  • Acceptable test-retest reliability for List Sorting Working Memory (ICC = 0.89)
  • Acceptable test-retest reliability for Pattern Comparison (ICC = 0.82)
  • Acceptable test-retest reliability for Picture Sequence Memory (ICC = 0.78)
  • Excellent test-retest reliability for Picture Vocabulary (ICC = 0.94)
  • Excellent test-retest reliability for Oral Reading Recognition (ICC = 0.99)

 

Construct Validity

Convergent validity:

Mixed populations: (Weintraub et al., 2013b, age range = 8-85)

  • Adequate convergent validity of Flanker Inhibitory Control and Attention with the WISC-IV/WAIS-IV Letter-Number Sequencing, Coding, and Symbol Search averageᵃ (r = -0.48ᵇ)
  • Adequate convergent validity of Dimensional Change Card Sort with D-KEFS Inhibition (r = -0.51ᵇ)
  • Adequate convergent validity of List Sorting Working Memory with the WISC-IV/WAIS-IV Letter-Number Sequencingᵃ/Paced Auditory Serial Addition Test average (r = 0.58ᵇ)
  • Adequate convergent validity of Pattern Comparison with the WISC-IV/WAIS-IV Coding/Symbol Search average (r = 0.49ᵇ)
  • Excellent convergent validity of Picture Sequence Memory with the BVMT-R/Rey Auditory Verbal Learning Test (RAVLT) averageᵃ (r = 0.69ᵇ)
  • Excellent convergent validity of Picture Vocabulary with PPVT-4 (r = 0.78ᵇ)
  • Excellent convergent validity of Oral Reading Recognition with Wide Range Achievement Test, 4th Edition (WRAT-4) (r = 0.93ᵇ)

Discriminant validity:

Mixed populations: (Weintraub et al., 2013b, age range = 8-85 years)

  • Excellent discriminant validity of Flanker Inhibitory Control and Attention with PPVT-4 (r = 0.15ᶜ)
  • Excellent discriminant validity of Dimensional Change Card Sort with PPVT-4 (r = 0.14ᵈ)
  • Excellent discriminant validity of List Sorting Working Memory with PPVT-4 (r = 0.30ᵇ)
  • Excellent discriminant validity of Pattern Comparison with PPVT-4 (r = 0.12ᵈ)
  • Excellent discriminant validity of Picture Sequence Memory with PPVT-4 (r = -0.08)
  • Excellent discriminant validity of Picture Vocabulary with BVMT-R/RAVLT averageᵃ (r = 0.08)
  • Excellent discriminant validity of Oral Reading Recognition with BVMT-R/RAVLT averageᵃ (r = 0.19ᵇ)

ᵃThe measure used depended on participant age.

ᵇp < 0.001

ᶜp < 0.01

ᵈp < 0.05


Multiple Sclerosis


Normative Data

Multiple Sclerosis: (Manglani et al., 2022; n = 87; age range = 30-59 years; participants diagnosed with relapsing-remitting MS who were enrolled in a randomized controlled trial comparing the effects of a physical activity intervention with a water consumption intervention on cognition)

Unadjusted standard scores on the NIH Toolbox Cognition Battery

| Cognitive Test | Mean | Std. Dev. |
|---|---|---|
| Pattern Comparison | 98.5 | 16.8 |
| List Sorting | 104 | 9.58 |
| Picture Sequence Memory | 105 | 15.0 |
| DCCSᵃ | 104 | 7.79 |
| Flanker | 97.5 | 6.52 |
| Oral Reading | 108 | 5.76 |
| Picture vocabularyᵇ | 111 | 7.86 |
| NIH Fluid Cognition | 102 | 10.5 |

ᵃDimensional Change Card Sort

ᵇDiscriminant validity measure

Construct Validity

Convergent validity:

Multiple Sclerosis: (Manglani et al., 2022)

  • Adequate convergent validity of the pattern comparison (Processing Speed) subtest with the WAIS-IV Cancellation Test (mean concordance correlation coefficient (Mccc) = 0.46)
  • Adequate convergent validity of the list sorting (Working Memory) subtest with the WAIS-IV Digit Span Test (Mccc = 0.34)
  • Adequate convergent validity of the picture sequence (Episodic Memory) subtest with the California Verbal Learning Test, Second Edition (Mccc = 0.33)
  • Poor convergent validity of the DCCS and the Flanker Test (Mccc = 0.20) (both Executive Function) with the Delis-Kaplan Executive Function System (D-KEFS) Sorting Test

 

Discriminant validity:

Multiple Sclerosis: (Manglani et al., 2022)

  • Excellent discriminant validity of the pattern comparison subtest (Processing Speed) with the Picture Vocabulary Test (concordance correlation coefficient (CCC) = 0.071)
  • Excellent discriminant validity of the list sorting subtest (Working Memory) with the Picture Vocabulary Test (CCC = 0.29)
  • Excellent discriminant validity of the picture sequence memory subtest with the Picture Vocabulary Test (CCC = 0.19)
  • Excellent discriminant validity of the DCCS (CCC = 0.16) and the Flanker Test (CCC = 0.11) (both Executive Function) with the Picture Vocabulary Test
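The multiple sclerosis results are reported as concordance correlation coefficients rather than Pearson correlations. Assuming the source used Lin's concordance correlation coefficient (the entry does not say), the coefficient penalizes differences in mean and scale as well as scatter, so it can be lower than Pearson's r for the same pairs. The Python sketch below computes it for two equal-length lists of scores; the example data are hypothetical, and how the source averaged coefficients into the reported mean CCC is not described here.

```python
from statistics import fmean
from typing import Sequence


def lins_ccc(x: Sequence[float], y: Sequence[float]) -> float:
    """Lin's concordance correlation coefficient for paired scores.

    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2),
    using population (divide-by-n) moments.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 pairs")
    n = len(x)
    mx, my = fmean(x), fmean(y)
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    cov_xy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    denominator = var_x + var_y + (mx - my) ** 2
    if denominator == 0:
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2.0 * cov_xy / denominator


# Hypothetical paired scores: y is x shifted up by 1 point.
# Pearson's r would be 1.0, but the shift lowers the CCC to 4/7.
print(round(lins_ccc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]), 3))  # 0.571
```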


Neurological Disorders


Normative Data

Pediatric Epilepsy: (Matuska et al., 2024; n = 47; male = 55%; mean age = 10.0 (3.7) years; age range = 4.2-18.3 years; mean age of seizure onset = 4.9 (3.2) years; number of antiseizure medications at testing: 0 = 8.5%, 1 = 36.2%, 2 = 31.9%, 3 = 14.9%, 4 = 8.5%; history of Electrical Status Epilepticus in Sleep (ESES) = 46.8%)

Average performance of pediatric epilepsy patients on two NIH Toolbox Cognition Battery tests (n = 46)

Cognitive Test | Mean (SD) | Range
Flanker Inhibitory | 87.2 (15.1) | 54-124
Pattern Comparison | 81.0 (24.6) | 33-163

Criterion Validity (Predictive/Concurrent)

Concurrent validity:

Pediatric Epilepsy: (Matuska et al., 2024)

  • Adequate concurrent validity between Flanker Inhibitory Test and Pattern Comparison Test (r = 0.514, p < 0.001)
  • Adequate concurrent validity between Pattern Comparison Test and Test of Everyday Attention for Children (TEA-Ch) Sky Search – Accuracy (r = 0.459, p < 0.05)
  • Adequate concurrent validity between Flanker Inhibitory Test and Wechsler Intelligence Scale for Children (WISC) Working Memory Index (WMI) (r = 0.455, p = 0.009)
  • Adequate concurrent validity between Pattern Comparison Test and WISC WMI (r = 0.381, p = 0.035)
  • Adequate concurrent validity between Flanker Inhibitory Test and Semantic Fluency (age-appropriate version from A Developmental NEuroPSYchological Assessment, Second Edition (NEPSY-II) or the Delis-Kaplan Executive Function System (D-KEFS)) (r = 0.373, p = 0.023)
  • Adequate concurrent validity between Pattern Comparison Test and Semantic Fluency (age-appropriate version from NEPSY-II or D-KEFS) (r = 0.365, p = 0.028)
  • Adequate concurrent validity between Flanker Inhibitory Test and Comprehensive Test of Phonological Processing, Second Edition (CTOPP-2) (r = 0.577, p = 0.001)
  • Adequate concurrent validity between Flanker Inhibitory Test and WISC Processing Speed Index (PSI) (r = 0.378, p = 0.030)
  • Adequate concurrent validity between Flanker Inhibitory Test and Grooved Pegboard with the dominant hand (r = 0.554, p < 0.001) and the non-dominant hand (r = 0.420, p = 0.008)
  • Adequate concurrent validity between Flanker Inhibitory Test and Verbal IQ (r = 0.458, p = 0.002)
  • Adequate concurrent validity between Flanker Inhibitory Test and Nonverbal IQ (r = 0.495, p = 0.002)
  • Adequate concurrent validity between Pattern Comparison Test and Nonverbal IQ (r = 0.483, p = 0.002)
  • Agreement statistics between the two NIHTB-CB tasks and the clinical measures above showed overall agreement of 57-74% for Flanker and 52-72% for Pattern Comparison.
    • Sensitivity was lower for Flanker (42-67%) than for Pattern Comparison (45-83%).
    • Specificity was higher for Flanker (70-83%) than for Pattern Comparison (48-83%).
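Sensitivity, specificity, and overall agreement come from cross-tabulating impairment classifications on an NIHTB-CB task against classifications on a clinical reference measure. The Python sketch below illustrates that arithmetic under stated assumptions: the function names `flag_impaired` and `agreement_stats` are hypothetical, and the cutoff of 85 (1 SD below the standard-score mean of 100) is an illustrative choice, not the threshold used by Matuska et al. (2024). The scores in the example are made up.

```python
from dataclasses import dataclass


@dataclass
class Agreement:
    sensitivity: float  # true positives / all reference-impaired cases
    specificity: float  # true negatives / all reference-unimpaired cases
    overall: float      # (true positives + true negatives) / all cases


def flag_impaired(scores, cutoff=85.0):
    """Hypothetical rule: a standard score below `cutoff` counts as impaired."""
    return [score < cutoff for score in scores]


def agreement_stats(screen_flags, reference_flags):
    """Cross-tabulate screening flags against reference flags (both lists of bools)."""
    if len(screen_flags) != len(reference_flags):
        raise ValueError("flag lists must be paired")
    pairs = list(zip(screen_flags, reference_flags))
    tp = sum(1 for s, r in pairs if s and r)
    tn = sum(1 for s, r in pairs if not s and not r)
    fp = sum(1 for s, r in pairs if s and not r)
    fn = sum(1 for s, r in pairs if not s and r)
    return Agreement(
        sensitivity=tp / (tp + fn),
        specificity=tn / (tn + fp),
        overall=(tp + tn) / len(pairs),
    )


# Illustrative (made-up) standard scores for five children
screen = flag_impaired([80, 84, 90, 100, 70])
reference = [True, False, False, True, True]
print(agreement_stats(screen, reference))
```

With these illustrative values there are 2 true positives, 1 true negative, 1 false positive, and 1 false negative, so sensitivity is 2/3, specificity is 1/2, and overall agreement is 3/5. Because sensitivity and specificity trade off as the cutoff moves, ranges like those reported above depend on which reference measure and threshold are used.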


Bibliography

Amadon, G. K., Goeckner, B. D., Brett, B. L., & Meier, T. B. (2023). Comparison of various metrics of repetitive head impact exposure and their associations with neurocognition in college-aged athletes. Archives of Clinical Neuropsychology, 38(5), 714-723.

Carlozzi, N. E., Tulsky, D. S., Wolf, T. J., et al. (2017a). Construct validity of the NIH Toolbox Cognition Battery in individuals with stroke. Rehabilitation Psychology, 62(4), 443-454.  

Carlozzi, N. E., Goodnight, S., Casaletto, K. B., et al. (2017b). Validation of the NIH Toolbox in individuals with neurologic disorders. Archives of Clinical Neuropsychology, 32(5), 555-573.

Hessl, D., Sansone, S. M., Berry-Kravis, E., et al. (2016). The NIH Toolbox Cognitive Battery for intellectual disabilities: three preliminary studies and future directions. Journal of Neurodevelopmental Disorders, 8, 35. 

Holdnack, J. A., Iverson, G. L., Silverberg, N. D., et al. (2017). NIH Toolbox Cognition tests following traumatic brain injury: Frequency of low scores. Rehabilitation Psychology, 62(4), 474-484.

Manglani, H. R., Fisher, M. E., Duraney, E. J., et al. (2022). A promising cognitive screener in multiple sclerosis: The NIH Toolbox Cognition Battery concords with gold standard neuropsychological measures. Multiple Sclerosis Journal, 28(11), 1762-1772.

Matuska, E., Carney, A., Sepeta, L. N., et al. (2024). Clinical validation of selected NIH Cognitive Toolbox tasks in pediatric epilepsy. Epilepsy & Behavior, 153, 1-9.

Shields, R. H., Kaat, A. J., McKenzie, F. J., et al. (2020). Validation of the NIH Toolbox Cognition Battery in intellectual disability. Neurology, 94(12), 1229-1240.

Tulsky, D. S., Carlozzi, N. E., Holdnack, J., et al. (2017). Using the NIH Toolbox Cognition Battery in individuals with traumatic brain injury. Rehabilitation Psychology, 62(4), 413-424.

Weintraub, S., Bauer, P. J., Zelazo, P. D., et al. (2013a). NIH Toolbox Cognition Battery (CB): Introduction and pediatric data. Monographs of the Society for Research in Child Development, 78(4), 1-15.

Weintraub, S., Dikmen, S. S., Heaton, R. K., et al. (2013b). Cognition assessment using the NIH Toolbox. Neurology, 80(11, Suppl. 3), S54-S64.