Imagine Learning
Imagine Language & Literacy Benchmark

Summary

The Imagine Language & Literacy Placement and Benchmarking system includes subtests addressing print concepts, phonological awareness, phonics and word recognition, reading comprehension, oral vocabulary, and grammar. Student performance data are provided by subtest. Data collected through the Imagine Language & Literacy Placement and Benchmarking system serve multiple purposes. First, Placement Test results are used to determine developmentally appropriate starting points for each student across several literacy and language development curricula. Second, Placement and Benchmark Test results provide point-in-time skill-level estimates that are independently useful for universal screening. Third, when data are collected for both the Placement and Benchmark Tests (or multiple administrations of the Benchmark Test), differences in performance can be used to estimate scaled student growth. Those data can then be compared across administrations to illuminate changes in students’ reading proficiency levels.

Where to Obtain:
Imagine Learning
info@imaginelearning.com
382 W Park Circle, Provo, UT 84604
(866) 377-5071
www.imaginelearning.com
Initial Cost:
Contact vendor for pricing details.
Replacement Cost:
Contact vendor for pricing details.
Included in Cost:
Imagine Language & Literacy is sold for a flat annual fee per student or per school site, depending on customer preference. The purchase includes customer support; additional support (Customer Success) packages can also be purchased. Pricing covers the full instructional and assessment product offering; the assessment cannot be purchased separately.
Training Requirements:
Less than 1 hr of training
Qualified Administrators:
No minimum qualifications specified.
Access to Technical Support:
Educators can access Imagine Learning’s customer support services to address any technical issues via telephone or email from 6am-6pm MT, Monday-Friday. Local representatives are also available to provide assistance. Varying support packages are listed under the pricing section.
Assessment Format:
  • Direct: Computerized
Scoring Time:
  • Scoring is automatic
Scores Generated:
  • Raw score
  • IRT-based score
  • Developmental benchmarks
  • Subscale/subtest scores
Administration Time:
  • 45 minutes per student
Scoring Method:
  • Automatically (computer-scored)
Technology Requirements:
  • Computer or tablet
  • Internet connection
Accommodations:
The screening tool is a computer adaptive test that is administered online only. Test administrators require less than 1 hour of training to administer the test. Accommodations included with the screening tool include sync highlighting, repeated directions, visual cues, and directions with pictures. The tool does not have built-in assistive technology specifically designed as accommodations for students with disabilities; however, the software can be used with operating system accessibility features enabled. Please visit Imagine Learning's support site (support.imaginelearning.com) for a list of supported operating systems. Acceptable testing accommodations include breaking testing into multiple sessions, providing short breaks within testing sessions, scheduling testing to optimize student performance, and administering the test in environments that minimize distractions. Imagine Support answers questions and requests for specific accommodations for special needs. For questions regarding product services and features, contact Imagine Learning at 382 W. Park Circle, Provo, UT 84604; phone (866) 377-5071; email info@imaginelearning.com; website http://www.imaginelearning.com/. For customer support, call or text (866) 457-8776 or email support@imaginelearning.com.

Descriptive Information

Please provide a description of your tool:
The Imagine Language & Literacy Placement and Benchmarking system includes subtests addressing print concepts, phonological awareness, phonics and word recognition, reading comprehension, oral vocabulary, and grammar. Student performance data are provided by subtest. Data collected through the Imagine Language & Literacy Placement and Benchmarking system serve multiple purposes. First, Placement Test results are used to determine developmentally appropriate starting points for each student across several literacy and language development curricula. Second, Placement and Benchmark Test results provide point-in-time skill-level estimates that are independently useful for universal screening. Third, when data are collected for both the Placement and Benchmark Tests (or multiple administrations of the Benchmark Test), differences in performance can be used to estimate scaled student growth. Those data can then be compared across administrations to illuminate changes in students’ reading proficiency levels.
The tool is intended for use with the following grade(s).
not selected Preschool / Pre - kindergarten
selected Kindergarten
selected First grade
selected Second grade
selected Third grade
selected Fourth grade
selected Fifth grade
selected Sixth grade
not selected Seventh grade
not selected Eighth grade
not selected Ninth grade
not selected Tenth grade
not selected Eleventh grade
not selected Twelfth grade

The tool is intended for use with the following age(s).
not selected 0-4 years old
selected 5 years old
selected 6 years old
selected 7 years old
selected 8 years old
selected 9 years old
selected 10 years old
selected 11 years old
not selected 12 years old
not selected 13 years old
not selected 14 years old
not selected 15 years old
not selected 16 years old
not selected 17 years old
not selected 18 years old

The tool is intended for use with the following student populations.
not selected Students in general education
not selected Students with disabilities
not selected English language learners

ACADEMIC ONLY: What skills does the tool screen?

Reading
Phonological processing:
not selected RAN
not selected Memory
selected Awareness
not selected Letter sound correspondence
selected Phonics
not selected Structural analysis

Word ID
selected Accuracy
not selected Speed

Nonword
not selected Accuracy
not selected Speed

Spelling
not selected Accuracy
not selected Speed

Passage
not selected Accuracy
not selected Speed

Reading comprehension:
selected Multiple choice questions
selected Cloze
not selected Constructed Response
not selected Retell
selected Maze
not selected Sentence verification
not selected Other (please describe):


Listening comprehension:
not selected Multiple choice questions
not selected Cloze
not selected Constructed Response
not selected Retell
not selected Maze
not selected Sentence verification
not selected Vocabulary
not selected Expressive
not selected Receptive

Mathematics
Global Indicator of Math Competence
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

Early Numeracy
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

Mathematics Concepts
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

Mathematics Computation
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

Mathematic Application
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

Fractions/Decimals
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

Algebra
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

Geometry
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

not selected Other (please describe):

Please describe specific domain, skills or subtests:
BEHAVIOR ONLY: Which category of behaviors does your tool target?


BEHAVIOR ONLY: Please identify which broad domain(s)/construct(s) are measured by your tool and define each sub-domain or sub-construct.

Acquisition and Cost Information

Where to obtain:
Email Address
info@imaginelearning.com
Address
382 W Park Circle, Provo, UT 84604
Phone Number
(866) 377-5071
Website
www.imaginelearning.com
Initial cost for implementing program:
Cost
Unit of cost
Replacement cost per unit for subsequent use:
Cost
Unit of cost
Duration of license
Additional cost information:
Describe basic pricing plan and structure of the tool. Provide information on what is included in the published tool, as well as what is not included but required for implementation.
Imagine Language & Literacy is sold for a flat annual fee per student or per school site, depending on customer preference. The purchase includes customer support; additional support (Customer Success) packages can also be purchased. Pricing covers the full instructional and assessment product offering; the assessment cannot be purchased separately.
Provide information about special accommodations for students with disabilities.
The screening tool is a computer adaptive test that is administered online only. Test administrators require less than 1 hour of training to administer the test. Accommodations included with the screening tool include sync highlighting, repeated directions, visual cues, and directions with pictures. The tool does not have built-in assistive technology specifically designed as accommodations for students with disabilities; however, the software can be used with operating system accessibility features enabled. Please visit Imagine Learning's support site (support.imaginelearning.com) for a list of supported operating systems. Acceptable testing accommodations include breaking testing into multiple sessions, providing short breaks within testing sessions, scheduling testing to optimize student performance, and administering the test in environments that minimize distractions. Imagine Support answers questions and requests for specific accommodations for special needs. For questions regarding product services and features, contact Imagine Learning at 382 W. Park Circle, Provo, UT 84604; phone (866) 377-5071; email info@imaginelearning.com; website http://www.imaginelearning.com/. For customer support, call or text (866) 457-8776 or email support@imaginelearning.com.

Administration

BEHAVIOR ONLY: What type of administrator is your tool designed for?
not selected General education teacher
not selected Special education teacher
not selected Parent
not selected Child
not selected External observer
not selected Other
If other, please specify:

What is the administration setting?
not selected Direct observation
not selected Rating scale
not selected Checklist
not selected Performance measure
not selected Questionnaire
selected Direct: Computerized
not selected One-to-one
not selected Other
If other, please specify:

Does the tool require technology?
Yes

If yes, what technology is required to implement your tool? (Select all that apply)
selected Computer or tablet
selected Internet connection
not selected Other technology (please specify)

If your program requires additional technology not listed above, please describe the required technology and the extent to which it is combined with teacher small-group instruction/intervention:

What is the administration context?
not selected Individual
selected Small group   If small group, n=
selected Large group   If large group, n=
not selected Computer-administered
not selected Other
If other, please specify:

What is the administration time?
Time in minutes
45
per (student/group/other unit)
student

Additional scoring time:
Time in minutes
0
per (student/group/other unit)
student

ACADEMIC ONLY: What are the discontinue rules?
not selected No discontinue rules provided
not selected Basals
not selected Ceilings
selected Other
If other, please specify:
The test uses subtest-level discontinuation criteria, which consist of varying percent-correct thresholds to indicate mastery.


Are norms available?
No
Are benchmarks available?
Yes
If yes, how many benchmarks per year?
3
If yes, for which months are benchmarks available?
Administration 1: August-October, Administration 2: December-February, Administration 3: April-June
BEHAVIOR ONLY: Can students be rated concurrently by one administrator?
If yes, how many students can be rated concurrently?

Training & Scoring

Training

Is training for the administrator required?
Yes
Describe the time required for administrator training, if applicable:
Less than 1 hr of training
Please describe the minimum qualifications an administrator must possess.
selected No minimum qualifications
Are training manuals and materials available?
Yes
Are training manuals/materials field-tested?
Yes
Are training manuals/materials included in cost of tools?
Yes
If No, please describe training costs:
Can users obtain ongoing professional and technical support?
Yes
If Yes, please describe how users can obtain support:
Educators can access Imagine Learning’s customer support services to address any technical issues via telephone or email from 6am-6pm MT, Monday-Friday. Local representatives are also available to provide assistance. Varying support packages are listed under the pricing section.

Scoring

How are scores calculated?
not selected Manually (by hand)
selected Automatically (computer-scored)
not selected Other
If other, please specify:

Do you provide basis for calculating performance level scores?
No
What is the basis for calculating performance level and percentile scores?
not selected Age norms
not selected Grade norms
not selected Classwide norms
not selected Schoolwide norms
not selected Stanines
not selected Normal curve equivalents

What types of performance level scores are available?
selected Raw score
not selected Standard score
not selected Percentile score
not selected Grade equivalents
selected IRT-based score
not selected Age equivalents
not selected Stanines
not selected Normal curve equivalents
selected Developmental benchmarks
not selected Developmental cut points
not selected Equated
not selected Probability
not selected Lexile score
not selected Error analysis
not selected Composite scores
selected Subscale/subtest scores
not selected Other
If other, please specify:

Does your tool include decision rules?
Yes
If yes, please describe.
Decision rules for the Imagine Language & Literacy Placement Test, which is used as Benchmark 1, and for all subsequent Benchmark Tests are listed below (see the document attachment in the Reliability section).
Can you provide evidence in support of multiple decision rules?
Yes
If yes, please describe.
Using field test data, student subtest score patterns were observed along the difficulty continuum. The continuum of difficulty determines the subtest administration flow and was established by subject-matter-expert design. The score threshold for each subtest was determined by its correlated success rate, beyond guessing, on the following subtest. In other words, when a student reaches a subtest score sufficient to indicate they will be able to access the upcoming subtest, they are allowed to proceed. If a student does not reach that criterion, they are not permitted to proceed; the test terminates and issues a score. This same logic was used to determine instructional placement in the Imagine Language & Literacy instructional application. Cut points vary by subtest as documented above.
Please describe the scoring structure. Provide relevant details such as the scoring format, the number of items overall, the number of items per subscale, what the cluster/composite score comprises, and how raw scores are calculated.
Automatic scoring. Literacy scores: scaled literacy scores are reported; they are calculated from combined subtest performance, and the subtests administered vary by student based on adaptive logic. Language scores: scaled language scores are reported; they are likewise calculated from combined subtest performance, with subtests varying by student based on adaptive logic. Subtest scores: raw scores are reported as the percentage correct of the presented items within each subtest.
Describe the tool’s approach to screening, samples (if applicable), and/or test format, including steps taken to ensure that it is appropriate for use with culturally and linguistically diverse populations and students with disabilities.
Imagine Language & Literacy Benchmark is a computerized adaptive multistage assessment that combines features of traditional assessments and computer adaptive tests. Students are assessed via short subtests that focus on skills representative of literacy skills across the elementary reading curriculum. The initial subtest delivered to students is selected according to grade level. Successful students then attempt subtests that assess more complex, higher-order literacy skills; unsuccessful students attempt subtests that assess prerequisite skills. The adaptive nature of the test ensures that the content presented is at an appropriate level for most students. Screening is accomplished by comparing student performance with grade-level standards. If test performance suggests a student cannot successfully engage with content appropriate for their grade level, the student is deemed in need of additional intervention/instruction.
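
The adaptive routing and discontinue logic described above and under the decision rules can be summarized with a short, purely illustrative R sketch. The subtest names, thresholds, entry point, and function names below are hypothetical placeholders, not the operational values or code used by the product (for example, the operational test may also route unsuccessful students to prerequisite subtests rather than always discontinuing).

  # Purely illustrative sketch of multistage routing with subtest-level
  # discontinue thresholds; all names and values are hypothetical.
  run_benchmark <- function(subtests, thresholds, entry_index, administer) {
    results <- list()
    i <- entry_index                            # initial subtest chosen by grade level
    while (i <= length(subtests)) {
      pct_correct <- administer(subtests[i])    # raw score: percent correct on this subtest
      results[[subtests[i]]] <- pct_correct
      if (pct_correct < thresholds[i]) break    # below the mastery threshold: discontinue and score
      i <- i + 1                                # threshold met: proceed to the next, harder subtest
    }
    results                                     # subtest raw scores feed the scaled score
  }

  # Hypothetical usage: a made-up student who scores 75% on every presented subtest
  run_benchmark(subtests    = c("phonological_awareness", "word_recognition", "sentence_cloze"),
                thresholds  = c(0.60, 0.70, 0.70),
                entry_index = 2,
                administer  = function(subtest) 0.75)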

Technical Standards

Classification Accuracy & Cross-Validation Summary

Grade Kindergarten
Grade 1
Grade 2
Grade 3
Grade 4
Grade 5
Grade 6
Classification Accuracy Fall Data unavailable Data unavailable Data unavailable Data unavailable Data unavailable Data unavailable Data unavailable
Classification Accuracy Winter Unconvincing evidence Unconvincing evidence Unconvincing evidence Unconvincing evidence Unconvincing evidence Unconvincing evidence Unconvincing evidence
Classification Accuracy Spring Data unavailable Data unavailable Data unavailable Data unavailable Data unavailable Data unavailable Data unavailable
Legend
Full Bubble: Convincing evidence
Half Bubble: Partially convincing evidence
Empty Bubble: Unconvincing evidence
Null Bubble: Data unavailable
d: Disaggregated data available

Measures of Academic Progress (MAP) Reading Assessment

Classification Accuracy

Select time of year
Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
Rasch Unit (RIT) scale scores from the Measures of Academic Progress (MAP) Reading assessment (Northwest Evaluation Association, 2014) were used as the criterion measure. Performance on this measure was expected to be related to performance on the Imagine Learning Beginning-of-Year Benchmark because it includes similar measures of language and literacy skills for grades K-6. Specifically, the Imagine Learning Beginning-of-Year Benchmark includes measures of letter recognition, phonemic awareness, word recognition, basic reading vocabulary, sentence cloze, beginning book comprehension, leveled book comprehension, and cloze passage completion. MAP Reading includes measures of word recognition, structure and vocabulary, and reading informational texts. MAP Reading is widely used and, like the Imagine Learning Beginning-of-Year Benchmark, is a computer adaptive assessment. MAP Reading also has good evidence of reliability and validity for use as an interim progress monitoring assessment. This formative assessment is often used to identify students who are at risk or in need of intensive intervention and, although it measures similar constructs, is independent of the Imagine Learning Beginning-of-Year Benchmark.
Do the classification accuracy analyses examine concurrent and/or predictive classification?

Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
Describe how the classification analyses were performed and cut-points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
Classification accuracy analyses were used to calculate the Area Under the Receiver Operating Characteristic Curve (AUC) using the pROC package (Robin et al., 2011) in RStudio Version 1.0.153. The Imagine Learning Winter 2016 Screening logit scores were used as the classifier and were tested against the criterion of Spring 2016 MAP Reading scores. The logit scores used in the Imagine Learning Beginning-of-Year Benchmark were a continuous indicator of the students’ performance on the screener. Students were identified on the criterion as needing intensive intervention if they scored at or below the 20th percentile on the MAP Reading assessment. Cut points on the logit scores for the Imagine Learning Beginning-of-Year Benchmark were determined using Youden’s (1950) method, which optimizes the threshold for sensitivity and specificity using the logit score calculated by Winsteps through Rasch measurement. The cut points determined using Youden’s (1950) method performed as expected: they increased for each grade level, suggesting the test was designed so that students performed better on the test as they advanced through grade levels.
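
As an illustrative sketch only, the AUC and Youden cut-point analysis described above could be reproduced in R with the pROC package roughly as follows; the simulated data frame and variable names are placeholders, not the study data.

  # Simulated illustration of the AUC / Youden cut-point analysis (not study data).
  library(pROC)
  set.seed(42)
  dat <- data.frame(screener_logit = rnorm(500))                           # hypothetical screener logit scores
  dat$at_risk <- rbinom(500, 1, plogis(-1.5 - 1.2 * dat$screener_logit))   # 1 = at/below 20th percentile on criterion

  roc_obj <- roc(response = dat$at_risk, predictor = dat$screener_logit, quiet = TRUE)
  auc(roc_obj)       # Area Under the Curve
  ci.auc(roc_obj)    # 95% confidence interval for the AUC

  # Youden-optimal cut point (maximizes sensitivity + specificity - 1)
  coords(roc_obj, x = "best", best.method = "youden",
         ret = c("threshold", "sensitivity", "specificity"))
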
Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
No
If yes, please describe the intervention, what children received the intervention, and how they were chosen.

Cross-Validation

Has a cross-validation study been conducted?
No
If yes,
Select time of year.
Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
Do the cross-validation analyses examine concurrent and/or predictive classification?

Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
Describe how the cross-validation analyses were performed and cut-points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
If yes, please describe the intervention, what children received the intervention, and how they were chosen.

Classification Accuracy - Winter

Evidence Kindergarten Grade 1 Grade 2 Grade 3 Grade 4 Grade 5 Grade 6
Criterion measure Measures of Academic Progress (MAP) Reading Assessment Measures of Academic Progress (MAP) Reading Assessment Measures of Academic Progress (MAP) Reading Assessment Measures of Academic Progress (MAP) Reading Assessment Measures of Academic Progress (MAP) Reading Assessment Measures of Academic Progress (MAP) Reading Assessment Measures of Academic Progress (MAP) Reading Assessment
Cut Points - Percentile rank on criterion measure 20 20 20 20 20 20 20
Cut Points - Performance score on criterion measure
Cut Points - Corresponding performance score (numeric) on screener measure -3.305 -1.87 -0.22 0.505 0.97 1.265 2.555
Classification Data - True Positive (a) 126 148 191 136 134 118 27
Classification Data - False Positive (b) 211 104 136 119 94 85 23
Classification Data - False Negative (c) 53 44 25 34 48 51 7
Classification Data - True Negative (d) 470 617 560 488 528 520 102
Area Under the Curve (AUC) 0.75 0.87 0.91 0.87 0.86 0.84 0.88
AUC Estimate’s 95% Confidence Interval: Lower Bound 0.71 0.83 0.89 0.84 0.82 0.81 0.83
AUC Estimate’s 95% Confidence Interval: Upper Bound 0.79 0.90 0.93 0.90 0.89 0.88 0.94
Statistics Kindergarten Grade 1 Grade 2 Grade 3 Grade 4 Grade 5 Grade 6
Base Rate 0.21 0.21 0.24 0.22 0.23 0.22 0.21
Overall Classification Rate 0.69 0.84 0.82 0.80 0.82 0.82 0.81
Sensitivity 0.70 0.77 0.88 0.80 0.74 0.70 0.79
Specificity 0.69 0.86 0.80 0.80 0.85 0.86 0.82
False Positive Rate 0.31 0.14 0.20 0.20 0.15 0.14 0.18
False Negative Rate 0.30 0.23 0.12 0.20 0.26 0.30 0.21
Positive Predictive Power 0.37 0.59 0.58 0.53 0.59 0.58 0.54
Negative Predictive Power 0.90 0.93 0.96 0.93 0.92 0.91 0.94
Sample Kindergarten Grade 1 Grade 2 Grade 3 Grade 4 Grade 5 Grade 6
Date Spring 2017 Spring 2017 Spring 2017 Spring 2017 Spring 2017 Spring 2017 Spring 2017
Sample Size 860 913 912 777 804 774 159
Geographic Representation (all grades) Mountain (AZ, WY); Pacific (CA); West South Central (LA)
Male 48.3% 49.8% 50.4% 51.1% 52.6% 53.7% 52.2%
Female 51.7% 50.2% 49.6% 48.9% 47.4% 46.3% 47.8%
Other              
Gender Unknown              
White, Non-Hispanic 12.9% 11.0% 12.3% 5.5% 6.6% 5.7% 4.4%
Black, Non-Hispanic 0.3% 9.5% 8.1% 4.1% 4.4% 5.2% 6.9%
Hispanic 79.1% 78.4% 78.5% 88.9% 87.6% 87.7% 84.9%
Asian/Pacific Islander 0.2% 0.1% 0.1% 0.1% 0.4% 0.4%  
American Indian/Alaska Native 0.1% 0.7% 0.7% 1.0% 0.6% 0.6% 1.9%
Other 0.1% 0.2% 0.2% 0.3% 0.4% 0.4% 1.9%
Race / Ethnicity Unknown              
Low SES              
IEP or diagnosed disability              
English Language Learner 31.7% 31.4% 27.6% 35.9% 32.3% 25.3% 12.6%
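
The summary statistics reported above follow directly from the true/false positive/negative counts. As a hedged illustration (the helper function below is ours, not part of the reported analysis), the kindergarten column can be reproduced from its 2x2 counts in R:

  # Derive the reported classification statistics from the 2x2 counts:
  # tp = true positives (a), fp = false positives (b),
  # fn = false negatives (c), tn = true negatives (d).
  classification_stats <- function(tp, fp, fn, tn) {
    n <- tp + fp + fn + tn
    list(
      base_rate                   = (tp + fn) / n,
      overall_classification_rate = (tp + tn) / n,
      sensitivity                 = tp / (tp + fn),
      specificity                 = tn / (fp + tn),
      false_positive_rate         = fp / (fp + tn),
      false_negative_rate         = fn / (tp + fn),
      positive_predictive_power   = tp / (tp + fp),
      negative_predictive_power   = tn / (fn + tn)
    )
  }

  # Kindergarten column above: a = 126, b = 211, c = 53, d = 470
  classification_stats(tp = 126, fp = 211, fn = 53, tn = 470)
  # Matches the reported values to rounding: sensitivity 0.70, specificity 0.69,
  # base rate 0.21, overall classification rate 0.69, PPP 0.37, NPP 0.90.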

Reliability

Grade Kindergarten
Grade 1
Grade 2
Grade 3
Grade 4
Grade 5
Grade 6
Rating Convincing evidence Convincing evidence Convincing evidence Convincing evidence Convincing evidence Convincing evidence Convincing evidence
Legend
Full Bubble: Convincing evidence
Half Bubble: Partially convincing evidence
Empty Bubble: Unconvincing evidence
Null Bubble: Data unavailable
d: Disaggregated data available
*Offer a justification for each type of reliability reported, given the type and purpose of the tool.
Because the Imagine Learning Beginning-of-Year Benchmark assessment uses a total score as a basis for screening decisions, we followed guidance from the Technical Review Committee (National Center for Intensive Intervention, 2018) and conducted model-based analyses of item quality. A model-based approach also accommodates the computerized adaptive nature of the assessment. Sijtsma (2009) and Bentler (2009) describe statistical model-based internal consistency reliability coefficients based in covariance and correlational approaches to identifying the proportion of variance that can be reliably interpreted as true and not as due to error. Bentler (2003, 2007) proposed using any statistically acceptable structural model that contains additive random errors to estimate model-based internal consistency reliability. Rasch models (Rasch, 1960; Andrich, 1988; Bond & Fox, 2015; Linacre, 1993, 1994, 1996; Wilson, 2005) are widely used, statistically acceptable structural models applying additive random errors to estimate internal consistency reliability.

Use of a Rasch model-based approach to reliability is justified substantively by the fact that education is founded on the idea that future student success can be predicted from performance on tasks similar enough to those encountered in real life to be useful proxies. These tasks are assembled into curricular materials and approached from pedagogical perspectives that typically are not tailored to each individual student but are assumed to apply in a developmentally appropriate way across students as they age and progress in knowledge. Much recent work in education has focused on developing and testing measurement models and calibrating tools as coherent guides to future practice, across the full range of assessment applications, from formative classroom instruction to accountability (Wilson, 2004; Gorin & Mislevy, 2013; NRC, 2006).

A simple model of this kind takes the form

ln(p_ni / (1 - p_ni)) = b_n - d_i

which states that the natural logarithm of the odds of a correct response, p_ni / (1 - p_ni), is hypothesized to be equal to the difference between the ability b of student n and the difficulty d of item i. Estimates of the ability measures b and the item difficulty calibrations d are expressed in a common unit, which makes it possible to interpret the measures relative to the calibrations in the substantive terms of the items, and vice versa. All measures and calibrations are associated with individually estimated error and model fit statistics. A Rasch measurement model-based approach to reliability (Andrich, 1982; Andrich & Douglas, 1977; Fisher, 1992, 2008; Wright & Masters, 1982) differs, then, from the statistical model-based approaches described by Sijtsma (2009) and Bentler (2009) in the way error is estimated and treated en route to arriving at an estimate of the ratio of true variation to error variation.
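
For concreteness, a minimal R sketch of the dichotomous Rasch model stated above; the numeric values are chosen purely for illustration.

  # Dichotomous Rasch model: probability that student n answers item i correctly,
  # given ability b_n and item difficulty d_i (both in logits).
  rasch_prob <- function(b_n, d_i) {
    plogis(b_n - d_i)   # equivalent to exp(b_n - d_i) / (1 + exp(b_n - d_i))
  }

  # Illustrative values only: a student whose ability is 1 logit above the item's difficulty
  rasch_prob(b_n = 0.5, d_i = -0.5)   # ~0.73 probability of a correct response

  # The log-odds of that probability recovers the ability-difficulty difference:
  log(0.73 / (1 - 0.73))              # ~1 logit
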
*Describe the sample(s), including size and characteristics, for each reliability analysis conducted.
Exhibit 3 demonstrates that the sample for the reliability analysis came from multiple regions, divisions, and states for grades K-5. For students for whom gender was known, approximately half of each grade level was male and half was female. Fewer than ten percent of each grade level was identified as English Language Learners, though ELL status data were missing for the majority of students. Inconsistent information was available for FRL, other SES indicators, disability classification, and first language.
*Describe the analysis procedures for each reported type of reliability.
A Rasch measurement model-based approach was used to assess reliability. This approach estimates error (better termed uncertainty; Fisher & Stenner, 2017; JCGM, 2008) for each individual person and item. The mean square error variance is then subtracted from the total variance to arrive directly at an estimate of the true variance. The ratio of the true standard deviation to the root mean square error then provides a separation ratio, G, that increases in linear error-unit amounts (Wright & Masters, 1982), relative to a separation reliability coefficient formulated in the traditional 0.00-1.00 form (Andrich, 1982). An intuitive expression of measurement model-based reliability is obtained in terms of strata: ranges whose centers are separated by three errors. Strata are estimated as (4G + 1) / 3: multiplying the separation ratio G by 4 captures 95% of the variation in the measures, adding 1 allows for error, and dividing the total by 3 reflects the number of errors taken as separating the centers of the resulting ranges. Winsteps version 4.00.1 was used for all Rasch analyses (Linacre, 2017). Model fit statistics and Principal Components Analyses of the standardized residuals explicitly focus on the internal consistency of the observed responses, evaluating them in terms of expectations formed on the basis of a simple conception of what is supposed to happen when students respond to test questions. Simulations show that model-based assessments of internal consistency using fit statistics are better able to detect violations of interval measurement’s unidimensionality and invariance requirements than Cronbach’s alpha (Fisher, Elbaum, & Coulter, 2010).
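
A brief numerical sketch of these quantities; the input values are invented for illustration and are not figures from the analysis.

  # Separation ratio, separation reliability, and strata from observed variance and
  # mean square error, following Wright & Masters (1982) and Andrich (1982).
  separation_summary <- function(observed_var, mean_square_error) {
    true_var    <- observed_var - mean_square_error           # subtract error variance to estimate true variance
    G           <- sqrt(true_var) / sqrt(mean_square_error)   # separation ratio: true SD / root mean square error
    reliability <- G^2 / (1 + G^2)                            # traditional 0.00-1.00 separation reliability
    strata      <- (4 * G + 1) / 3                            # statistically distinct performance ranges
    list(G = G, reliability = reliability, strata = strata)
  }

  # Invented example: observed person variance 4.0, mean square error 0.5
  separation_summary(observed_var = 4.0, mean_square_error = 0.5)
  # G ~ 2.65, reliability ~ 0.88, strata ~ 3.9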

*In the table(s) below, report the results of the reliability analyses described above (e.g., internal consistency or inter-rater reliability coefficients).

Type of Subgroup Informant Age / Grade Test or Criterion n Median Coefficient 95% Confidence Interval
Lower Bound
95% Confidence Interval
Upper Bound
Results from other forms of reliability analysis not compatible with above table format:
Manual cites other published reliability studies:
No
Provide citations for additional published studies.
Do you have reliability data that are disaggregated by gender, race/ethnicity, or other subgroups (e.g., English language learners, students with disabilities)?

If yes, fill in data for each subgroup with disaggregated reliability data.

Type of Subgroup Informant Age / Grade Test or Criterion n Median Coefficient 95% Confidence Interval
Lower Bound
95% Confidence Interval
Upper Bound
Results from other forms of reliability analysis not compatible with above table format:
Manual cites other published reliability studies:
No
Provide citations for additional published studies.

Validity

Grade Kindergarten
Grade 1
Grade 2
Grade 3
Grade 4
Grade 5
Grade 6
Rating Unconvincing evidence Convincing evidence Convincing evidence Convincing evidence Partially convincing evidence Partially convincing evidence Convincing evidence
Legend
Full Bubble: Convincing evidence
Half Bubble: Partially convincing evidence
Empty Bubble: Unconvincing evidence
Null Bubble: Data unavailable
d: Disaggregated data available
*Describe each criterion measure used and explain why each measure is appropriate, given the type and purpose of the tool.
Rasch Unit (RIT) scale scores from the Measures of Academic Progress (MAP) Reading assessment (Northwest Evaluation Association, 2014) were used as a criterion measure to demonstrate concurrent and predictive validity of the Imagine Learning Beginning-of-Year Benchmark assessment. Performance on this external measure was expected to be related to performance on the Imagine Learning Beginning-of-Year Benchmark because it includes similar measures of language and literacy skills for grades K-6. Specifically, the Imagine Learning Beginning-of-Year Benchmark includes measures of letter recognition, phonemic awareness, word recognition, basic reading vocabulary, sentence cloze, beginning book comprehension, leveled book comprehension, and cloze passage completion. MAP Reading includes measures of word recognition, structure and vocabulary, and reading informational texts. MAP Reading is widely used to identify students who are at risk or in need of intensive intervention and, like the Imagine Learning Beginning-of-Year Benchmark, is a computer adaptive assessment. MAP Reading also has good evidence of reliability and validity for use as an interim progress monitoring assessment.
*Describe the sample(s), including size and characteristics, for each validity analysis conducted.
See Exhibits 5-7 for sample characteristics for each validity analysis conducted. Across all grade K-5 samples used to examine predictive and concurrent validity, sample sizes ranged from 641 to 914; sample sizes for grade 6 ranged from 115 to 159. Multiple regions, divisions, and states were represented for each grade level K-5; the 6th grade sample represented one region, division, and state. The proportion of male and female students across all samples was roughly equal. English Language Learners (ELLs) represented between a quarter and a third of all samples, except for the Grade 6 samples, in which ELLs represented between 4 and 12 percent of all students. In all samples, most students were Hispanic, with the next most prevalent race/ethnicity categories being White/non-Hispanic and Black/African American. Data indicating eligibility for free/reduced price lunch, student disability status, and services received were not consistently available.
*Describe the analysis procedures for each reported type of validity.
Three bivariate correlation analyses were conducted for each grade level. The first provides evidence of predictive validity, using winter 2016 Imagine Learning Beginning-of-Year Benchmark assessment data to predict scores on the spring 2017 MAP Reading assessment. Two further analyses were used to demonstrate concurrent validity by relating Imagine Learning Beginning-of-Year Benchmark assessment data to MAP Reading assessment data (separately, using winter 2016 and spring 2017 data). The Pearson correlation coefficient was computed for each correlation, and a 95% confidence interval around each coefficient was calculated.
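
As an illustrative sketch (the simulated scores below are placeholders, not study data), each Pearson correlation and its 95% confidence interval could be obtained in R as follows:

  # Pearson correlation with a 95% confidence interval, as in the predictive and
  # concurrent validity analyses described above. Simulated data for illustration only.
  set.seed(42)
  benchmark_winter <- rnorm(300)                                       # hypothetical winter screener scores
  map_spring       <- 0.7 * benchmark_winter + rnorm(300, sd = 0.7)    # hypothetical spring MAP RIT scores

  result <- cor.test(benchmark_winter, map_spring, method = "pearson", conf.level = 0.95)
  result$estimate   # Pearson r
  result$conf.int   # 95% confidence interval (Fisher z-based)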

*In the table below, report the results of the validity analyses described above (e.g., concurrent or predictive validity, evidence based on response processes, evidence based on internal structure, evidence based on relations to other variables, and/or evidence based on consequences of testing), and the criterion measures.

Type of Subgroup Informant Age / Grade Test or Criterion n Median Coefficient 95% Confidence Interval
Lower Bound
95% Confidence Interval
Upper Bound
Results from other forms of validity analysis not compatible with above table format:
Manual cites other published validity studies:
No
Provide citations for additional published studies.
Describe the degree to which the provided data support the validity of the tool.
Using the NCII lower bound threshold of .6 for the confidence intervals around the standardized estimate, analyses showed evidence of concurrent validity for all grade levels. Based on this threshold, evidence of predictive validity was available for first, second, third, and sixth grades. The lower bounds of the confidence intervals for predictive validity for fourth and fifth grades were very close to the threshold (.597 for fourth grade and .594 for fifth grade).
Do you have validity data that are disaggregated by gender, race/ethnicity, or other subgroups (e.g., English language learners, students with disabilities)?
No

If yes, fill in data for each subgroup with disaggregated validity data.

Type of Subgroup Informant Age / Grade Test or Criterion n Median Coefficient 95% Confidence Interval
Lower Bound
95% Confidence Interval
Upper Bound
Results from other forms of validity analysis not compatible with above table format:
Disaggregated validity analyses are not included because bias analysis (differential item functioning) was used to assess subgroup bias (see Technical Standard 5).
Manual cites other published validity studies:
No
Provide citations for additional published studies.

Bias Analysis

Grade Kindergarten
Grade 1
Grade 2
Grade 3
Grade 4
Grade 5
Grade 6
Rating Yes Yes Yes Yes Yes Yes Yes
Have you conducted additional analyses related to the extent to which your tool is or is not biased against subgroups (e.g., race/ethnicity, gender, socioeconomic status, students with disabilities, English language learners)? Examples might include Differential Item Functioning (DIF) or invariance testing in multiple-group confirmatory factor models.
Yes
If yes,
a. Describe the method used to determine the presence or absence of bias:
Differential Item Functioning (DIF) was evaluated for all items on the Imagine Learning Beginning-of-Year Benchmark using focal versus contrast group differences by means of t-tests. Winsteps version 4.00.1 was used for all analyses (Linacre, 2017). Evaluating DIF requires close attention to the degree of item estimate stability that is desired. Lee (1992) finds that student performance across elementary grade levels changes by about a logit per year, with the rate of growth slowing as the upper grades are reached. The present data roughly corresponded with that pattern, as the changes from K to Grade 1, Grade 1 to Grade 2, and Grade 2 to Grade 3 were all a logit or more, and the remaining three were all half a logit or more. Accordingly, as Linacre (1994) points out, items stable to within a logit will be targeted at the correct grade level. Linacre (1994) shows that to have 99% confidence that an item difficulty estimate is stable to within a logit, its error of measurement must be 0.385 or less (since that is the value obtained by dividing the desired one-logit range by the 2.6 standard deviations needed for 99% confidence). The sample size needed for item estimate errors of 0.385 or less was in the range of 27 to 61, depending on how well targeted the item is to the sample. An item that is too easy or hard, and so is off target, will have a higher error and will need a correspondingly larger sample size to drive down the error to the range needed for stability within a logit. The criterion of interest in the Winsteps table is then the DIF size, which should be less than 1.00 logit for samples of sufficient size.
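
A brief numerical sketch of the stability criterion and the DIF contrast test described above; the item calibrations and standard errors at the end are invented for illustration, not values from the reported analysis.

  # Stability criterion: the standard error needed for 99% confidence that an
  # item difficulty estimate is stable to within one logit.
  desired_range <- 1.0    # logits
  z_99          <- 2.6    # standard deviations needed for 99% confidence
  desired_range / z_99    # 0.385 logits, the criterion cited above

  # DIF contrast between two groups: difference in item difficulty for the
  # focal versus reference group, evaluated against the joint standard error.
  dif_contrast <- function(d_focal, se_focal, d_reference, se_reference) {
    contrast <- d_focal - d_reference
    t_stat   <- contrast / sqrt(se_focal^2 + se_reference^2)
    list(dif_size = contrast, t = t_stat)
  }

  # Invented example values
  dif_contrast(d_focal = 0.80, se_focal = 0.20, d_reference = 0.10, se_reference = 0.25)
  # DIF size 0.70 logits, t ~ 2.19
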
b. Describe the subgroups for which bias analyses were conducted:
Groups tested included ELL versus non-ELL, and unknown; female versus male, and unknown; and White versus non-White. Exhibit 10 displays the number of students and percentages within each of these categories by grade level. Race/ethnicity categories were collapsed into binary categories (i.e., White and Nonwhite) because of small cell sizes for certain groups.
c. Describe the results of the bias analyses conducted, including data and interpretative statements. Include magnitude of effect (if available) if bias has been identified.
For ELL, with three groups, the hypothesis tested was that the item has the same difficulty as its average difficulty for all groups. In 98.5% of comparisons, DIF sizes of |0.77| or less were obtained. Of the six instances of DIF greater than |0.77| logits, four involved Phoneme Segmentation items with only four non-ELL students responding. The other two were Cloze Passage items (items 107.1 and 107.18); both had 77 ELL students answering, and both had DIF sizes of about -1.00 logit, meaning these ELL students correctly answered these two items disproportionately more frequently. However, neither item had a t-statistic greater than 1.96, so neither difference was statistically significant at a p-value less than .05. For gender, also with three groups, the hypothesis tested was again that the item has the same difficulty as its average difficulty for all groups. No DIF sizes over |0.77| logits were found, and seven items had DIF sizes greater than or equal to |0.50|; all were Cloze Passage items. Males found four items more difficult than the other two groups did, and females found three items easier. Two items with DIF sizes of 0.66 disadvantaging males were statistically significant at p less than .05. For ethnicity, the hypothesis tested was whether the item has the same difficulty for the two groups. There were 3 items with DIF sizes greater than |0.76|. Two of the three had DIF measure standard errors greater than 0.385 for one of the contrasted groups, and so were not expected to be stable to within a logit with 99% confidence. The other item (Cloze Passage 104.1), however, was 1.57 logits more difficult for White students (CLASS=1) than it was for others. The DIF analysis provides a statistical approach to assessing whether the items retain their relative difficulties across different groups. This kind of invariance can also be investigated more directly, by scaling the items on subsamples of students. A series of analyses of each grade level resulted in seven separate sets of item calibrations. Of the 21 pairwise correlations among these calibrations, 15 were between 0.94 and 0.99, four were between 0.83 and 0.89, and the remaining two (Kindergarten vs. 5th and 6th grades) were 0.78 and 0.66. Disattenuated for error (Muchinsky, 1996; Schumacker, 1996), the smallest correlation was 0.75. These results supported the overall stability of the scale across grade levels.
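
For reference, disattenuating a correlation for measurement error uses the classical correction for attenuation. The sketch below uses invented reliability values chosen only to show the direction of the correction; the reliabilities underlying the reported disattenuated value of 0.75 are not listed here.

  # Classical correction for attenuation: observed correlation divided by the
  # square root of the product of the two measures' reliabilities.
  disattenuate <- function(r_observed, rel_x, rel_y) {
    r_observed / sqrt(rel_x * rel_y)
  }

  # Invented reliabilities (0.88, 0.88), applied to the smallest observed correlation
  disattenuate(r_observed = 0.66, rel_x = 0.88, rel_y = 0.88)   # ~0.75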

Data Collection Practices

Most tools and programs evaluated by the NCII are branded products which have been submitted by the companies, organizations, or individuals that disseminate these products. These entities supply the textual information shown above, but not the ratings accompanying the text. NCII administrators and members of our Technical Review Committees have reviewed the content on this page, but NCII cannot guarantee that this information is free from error or reflective of recent changes to the product. Tools and programs have the opportunity to be updated annually or upon request.