Imagine Learning
Imagine Language & Literacy Benchmark

Summary

Descriptive Information

The Imagine Language & Literacy Placement and Benchmarking system includes subtests addressing print concepts, phonological awareness, phonics and word recognition, reading comprehension, oral vocabulary, and grammar. Student performance data are provided by subtest. Data collected through the Imagine Language & Literacy Placement and Benchmarking system serve multiple purposes. First, Placement Test results are used to determine developmentally appropriate starting points for each student across several literacy and language development curricula. Second, Placement and Benchmark Test results provide point-in-time skill-level estimates that are independently useful for Universal Screening. Third, when data are collected for both the Placement and Benchmark Tests (or multiple administrations of the Benchmark Test), differences in performance can be used to estimate scaled student growth. Those data can then be compared across administrations to illuminate changes in students’ reading proficiency levels.

Acquisition & Cost

Where to Obtain:: Imagine Learning; info@imaginelearning.com; 382 W Park Circle, Provo, UT 84604; (866) 377-5071; www.imaginelearning.com

Initial Cost:: Contact vendor for pricing details.

Replacement Cost:: Contact vendor for pricing details.

Included in Cost:: Imagine Language & Literacy is sold as a flat annual fee per student or per school site, depending on customer preference. The purchase comes with customer support included. Additional support, or Customer Success, packages can be purchased as well. Those prices include the full instructional and assessment product offering. The assessment cannot be purchased separately. The screening tool is a computer adaptive test that is administered online only. Test administrators require less than 1 hour of training to administer the test. Accommodations: The screening tool's accommodations include sync highlighting, repeated directions, visual cues, and directions with pictures. The tool does not have built-in assistive technology specifically designed as accommodations for students with disabilities; however, the software can be used with operating system accessibility features enabled. Please visit Imagine Learning's support site (support.imaginelearning.com) for a list of supported operating systems. Acceptable testing accommodations include breaking testing into multiple sessions, providing short breaks within testing sessions, scheduling testing to optimize student performance, and administering the test in environments that minimize distractions. Imagine Support answers questions and requests for specific accommodations for special needs. Services and Support: For questions regarding product services and features, contact Imagine Learning at 382 W. Park Circle, Provo, UT 84604; phone: 1.866.377.5071; email: info@imaginelearning.com; website: http://www.imaginelearning.com/. Contact customer support by call or text at (866) 457-8776 or by email at support@imaginelearning.com.

Training & Technical Support

Training Requirements:: Less than 1 hr of training

Qualified Administrators:: No minimum qualifications specified.

Access to Technical Support:: Educators can access Imagine Learning’s customer support services to address any technical issues via telephone or email from 6am-6pm MT, Monday-Friday. Local representatives are also available to provide assistance. Varying support packages are listed under the pricing section.

Administration

Assessment Format:

Direct: Computerized

Scoring Time:

Scoring is automatic

Scores Generated:

Raw score
IRT-based score
Developmental benchmarks
Subscale/subtest scores

Administration Time:

45 minutes per student

Scoring Method:

Automatically (computer-scored)

Technology Requirements:

Computer or tablet
Internet connection

Accommodations:: The screening tool is a computer adaptive test that is administered online only. Test administrators require less than 1 hour of training to administer the test. Accommodations: The screening tool's accommodations include sync highlighting, repeated directions, visual cues, and directions with pictures. The tool does not have built-in assistive technology specifically designed as accommodations for students with disabilities; however, the software can be used with operating system accessibility features enabled. Please visit Imagine Learning's support site (support.imaginelearning.com) for a list of supported operating systems. Acceptable testing accommodations include breaking testing into multiple sessions, providing short breaks within testing sessions, scheduling testing to optimize student performance, and administering the test in environments that minimize distractions. Imagine Support answers questions and requests for specific accommodations for special needs. Services and Support: For questions regarding product services and features, contact Imagine Learning at 382 W. Park Circle, Provo, UT 84604; phone: 1.866.377.5071; email: info@imaginelearning.com; website: http://www.imaginelearning.com/. Contact customer support by call or text at (866) 457-8776 or by email at support@imaginelearning.com.

Descriptive Information

Please provide a description of your tool:: The Imagine Language & Literacy Placement and Benchmarking system includes subtests addressing print concepts, phonological awareness, phonics and word recognition, reading comprehension, oral vocabulary, and grammar. Student performance data are provided by subtest. Data collected through the Imagine Language & Literacy Placement and Benchmarking system serve multiple purposes. First, Placement Test results are used to determine developmentally appropriate starting points for each student across several literacy and language development curricula. Second, Placement and Benchmark Test results provide point-in-time skill-level estimates that are independently useful for Universal Screening. Third, when data are collected for both the Placement and Benchmark Tests (or multiple administrations of the Benchmark Test), differences in performance can be used to estimate scaled student growth. Those data can then be compared across administrations to illuminate changes in students’ reading proficiency levels.

The tool is intended for use with the following grade(s).

Preschool / Pre - kindergarten
selected

Kindergarten
selected

First grade
selected

Second grade
selected

Third grade
selected

Fourth grade
selected

Fifth grade
selected

Sixth grade
not selected

Seventh grade
not selected

Eighth grade
not selected

Ninth grade
not selected

Tenth grade
not selected

Eleventh grade
not selected

Twelfth grade

The tool is intended for use with the following age(s).

0-4 years old
selected

5 years old
selected

6 years old
selected

7 years old
selected

8 years old
selected

9 years old
selected

10 years old
selected

11 years old
not selected

12 years old
not selected

13 years old
not selected

14 years old
not selected

15 years old
not selected

16 years old
not selected

17 years old
not selected

18 years old

The tool is intended for use with the following student populations.

Students in general education
not selected

Students with disabilities
not selected

English language learners

ACADEMIC ONLY: What skills does the tool screen?

Reading

Phonological processing:

RAN

Memory

Awareness

Letter sound correspondence
selected

Phonics

Structural analysis

Word ID

Accuracy

Speed

Nonword

Accuracy

Speed

Spelling

Accuracy

Speed

Passage

Accuracy

Speed

Reading comprehension:

Multiple choice questions
selected

Cloze

Constructed Response
not selected

Retell

Maze

Sentence verification
not selected

Other (please describe):

Listening comprehension:

Multiple choice questions
not selected

Cloze

Constructed Response
not selected

Retell

Maze

Sentence verification
not selected

Vocabulary
not selected

Expressive
not selected

Receptive

Mathematics

Global Indicator of Math Competence

Accuracy

Speed

Multiple Choice
not selected

Constructed Response

Early Numeracy

Accuracy

Speed

Multiple Choice
not selected

Constructed Response

Mathematics Concepts

Accuracy

Speed

Multiple Choice
not selected

Constructed Response

Mathematics Computation

Accuracy

Speed

Multiple Choice
not selected

Constructed Response

Mathematic Application

Accuracy

Speed

Multiple Choice
not selected

Constructed Response

Fractions/Decimals

Accuracy

Speed

Multiple Choice
not selected

Constructed Response

Algebra

Accuracy

Speed

Multiple Choice
not selected

Constructed Response

Geometry

Accuracy

Speed

Multiple Choice
not selected

Constructed Response

Other (please describe):

Please describe specific domain, skills or subtests:

BEHAVIOR ONLY: Which category of behaviors does your tool target?: Internalizing
Externalizing
Internalizing and Externalizing

BEHAVIOR ONLY: Please identify which broad domain(s)/construct(s) are measured by your tool and define each sub-domain or sub-construct.

Acquisition and Cost Information

Where to obtain:

Email Address: info@imaginelearning.com
Address: 382 W Park Circle, Provo, UT 84604
Phone Number: (866) 377-5071
Website: www.imaginelearning.com

Initial cost for implementing program:

Cost
Unit of cost

Replacement cost per unit for subsequent use:

Cost
Unit of cost
Duration of license

Additional cost information:

Describe basic pricing plan and structure of the tool. Provide information on what is included in the published tool, as well as what is not included but required for implementation.: Imagine Language & Literacy is sold as a flat annual fee per student or per school site, depending on customer preference. The purchase comes with customer support included. Additional support, or Customer Success, packages can be purchased as well. Those prices include the full instructional and assessment product offering. The assessment cannot be purchased separately.

Provide information about special accommodations for students with disabilities.: The screening tool is a computer adaptive test that is administered online only. Test administrators require less than 1 hour of training to administer the test. Accommodations: The screening tool's accommodations include sync highlighting, repeated directions, visual cues, and directions with pictures. The tool does not have built-in assistive technology specifically designed as accommodations for students with disabilities; however, the software can be used with operating system accessibility features enabled. Please visit Imagine Learning's support site (support.imaginelearning.com) for a list of supported operating systems. Acceptable testing accommodations include breaking testing into multiple sessions, providing short breaks within testing sessions, scheduling testing to optimize student performance, and administering the test in environments that minimize distractions. Imagine Support answers questions and requests for specific accommodations for special needs. Services and Support: For questions regarding product services and features, contact Imagine Learning at 382 W. Park Circle, Provo, UT 84604; phone: 1.866.377.5071; email: info@imaginelearning.com; website: http://www.imaginelearning.com/. Contact customer support by call or text at (866) 457-8776 or by email at support@imaginelearning.com.

Administration

BEHAVIOR ONLY: What type of administrator is your tool designed for?

General education teacher
not selected

Special education teacher
not selected

Parent

Child

External observer
not selected

Other

If other, please specify:

What is the administration setting?

Direct observation
not selected

Rating scale
not selected

Checklist

Performance measure
not selected

Questionnaire
selected

Direct: Computerized
not selected

One-to-one
not selected

Other

If other, please specify:

Does the tool require technology?

Yes

If yes, what technology is required to implement your tool? (Select all that apply)

Computer or tablet
selected

Internet connection
not selected

Other technology (please specify)

If your program requires additional technology not listed above, please describe the required technology and the extent to which it is combined with teacher small-group instruction/intervention:

What is the administration context?

Individual
selected

Small group If small group, n=
selected

Large group If large group, n=
not selected

Computer-administered
not selected

Other

If other, please specify:

What is the administration time?

Time in minutes

per (student/group/other unit)

student

Additional scoring time:

Time in minutes

per (student/group/other unit)

student

ACADEMIC ONLY: What are the discontinue rules?

No discontinue rules provided
not selected

Basals

Ceilings

Other

If other, please specify:

The test uses subtest-level discontinuation criteria, which consist of varying percent-correct thresholds to indicate mastery.

Are norms available?: No

Are benchmarks available?: Yes
If yes, how many benchmarks per year?: 3
If yes, for which months are benchmarks available?: Administration 1: August-October, Administration 2: December-February, Administration 3: April - June

BEHAVIOR ONLY: Can students be rated concurrently by one administrator?
If yes, how many students can be rated concurrently?

Training & Scoring

Training

Is training for the administrator required?: Yes

Describe the time required for administrator training, if applicable:: Less than 1 hr of training

Please describe the minimum qualifications an administrator must possess.: No minimum qualifications

Are training manuals and materials available?: Yes

Are training manuals/materials field-tested?: Yes

Are training manuals/materials included in cost of tools?: Yes
If No, please describe training costs:

Can users obtain ongoing professional and technical support?: Yes
If Yes, please describe how users can obtain support:: Educators can access Imagine Learning’s customer support services to address any technical issues via telephone or email from 6am-6pm MT, Monday-Friday. Local representatives are also available to provide assistance. Varying support packages are listed under the pricing section.

Scoring

How are scores calculated?

Manually (by hand)
selected

Automatically (computer-scored)
not selected

Other

If other, please specify:

Do you provide basis for calculating performance level scores?: No

What is the basis for calculating performance level and percentile scores?

Age norms

Grade norms
not selected

Classwide norms
not selected

Schoolwide norms
not selected

Stanines

Normal curve equivalents

What types of performance level scores are available?

Raw score

Standard score
not selected

Percentile score
not selected

Grade equivalents
selected

IRT-based score
not selected

Age equivalents
not selected

Stanines

Normal curve equivalents
selected

Developmental benchmarks
not selected

Developmental cut points
not selected

Equated

Probability
not selected

Lexile score
not selected

Error analysis
not selected

Composite scores
selected

Subscale/subtest scores
not selected

Other

If other, please specify:

Does your tool include decision rules?: Yes
If yes, please describe.: Decision rules for the Imagine literacy Placement test, which is used as Benchmark 1, and all subsequent Benchmark tests are listed below. (See document attachment in reliability section.)

Can you provide evidence in support of multiple decision rules?: Yes
If yes, please describe.: Using field test data, student subtest score patterns were observed along the difficulty continuum. The continuum of difficulty determines the subtest administration flow and was established by subject matter expert design. The score threshold for each subtest was determined by its correlated success rate, beyond guessing, on the following subtest. In other words, when a student reaches a subtest score sufficient to indicate that they will be able to access the upcoming subtest, they are allowed to proceed. If a student does not reach that criterion, they are not permitted to proceed, and the test terminates and issues a score. This same logic was used to determine instructional placement in the Imagine Language & Literacy instructional application. Cut points vary by subtest as documented above.
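
The progression rule described above can be sketched in a few lines. This is only an illustration, not the vendor's implementation: the subtest names, their order, and the percent-correct thresholds below are hypothetical placeholders, since the actual cut points are documented elsewhere.

```python
from typing import Dict, List, Tuple

# Hypothetical subtests ordered from easiest to hardest, each with an
# illustrative percent-correct threshold (NOT the actual cut points).
SUBTEST_THRESHOLDS: List[Tuple[str, float]] = [
    ("letter_recognition", 0.80),
    ("phonemic_awareness", 0.75),
    ("word_recognition", 0.70),
]


def administer(percent_correct: Dict[str, float]) -> List[str]:
    """Return the subtests a student would attempt before the test terminates."""
    attempted = []
    for name, threshold in SUBTEST_THRESHOLDS:
        attempted.append(name)
        # Stop as soon as a score falls below the threshold for proceeding.
        if percent_correct.get(name, 0.0) < threshold:
            break
    return attempted
```

In this sketch, a student who clears the first threshold but not the second attempts two subtests and then receives a score.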

Please describe the scoring structure. Provide relevant details such as the scoring format, the number of items overall, the number of items per subscale, what the cluster/composite score comprises, and how raw scores are calculated.: Automatic scoring. Literacy scores: Scaled literacy scores are reported. They are calculated from combined subtest performance; subtests vary by student based on adaptive logic. Language scores: Scaled language scores are reported. They are calculated from combined subtest performance; subtests vary by student based on adaptive logic. Subtest scores: Raw scores are reported as the percentage correct from the presented items within each subtest.

Describe the tool’s approach to screening, samples (if applicable), and/or test format, including steps taken to ensure that it is appropriate for use with culturally and linguistically diverse populations and students with disabilities.: Imagine Language & Literacy Benchmark is a computerized adaptive multistage assessment that combines features of traditional assessments and computer adaptive tests. Students are assessed via short subtests that focus on skills representative of literacy skills across the elementary reading curriculum. The initial subtest delivered to students is selected according to grade level. Successful students then attempt subtests that assess more complex, higher-order literacy skills. Unsuccessful students attempt subtests that assess prerequisite skills. The adaptive nature of the test ensures that the content presented is at an appropriate level for most students. Screening is accomplished by comparing student performance with grade-level standards. If test performance suggests a student cannot successfully engage with appropriate content for their grade level, the student is deemed to be in need of additional intervention/instruction.

Technical Standards

Classification Accuracy & Cross-Validation Summary

Grade	Kindergarten	Grade 1	Grade 2	Grade 3	Grade 4	Grade 5	Grade 6
Classification Accuracy Fall
Classification Accuracy Winter
Classification Accuracy Spring

Legend

Convincing evidence

Partially convincing evidence

Unconvincing evidence

Data unavailable

^dDisaggregated data available

Measures of Academic Progress (MAP) Reading Assessment

Classification Accuracy

Select time of year

Fall

Winter

Spring

Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.: Rasch Unit (RIT) scale scores from the Measures of Academic Progress (MAP) Reading assessment (Northwest Evaluation Association, 2014) were used as the criterion measure. Performance on this measure was expected to be related to performance on the Imagine Learning Beginning-of-Year Benchmark because it includes similar measures of language and literacy skills for grades K-6. Specifically, the Imagine Learning Beginning-of-Year Benchmark includes measures of letter recognition, phonemic awareness, word recognition, basic reading vocabulary, sentence cloze, beginning book comprehension, leveled book comprehension, and cloze passage completion. MAP Reading includes measures of word recognition, structure and vocabulary, and reading informational texts. MAP Reading is widely used and, like the Imagine Learning Beginning-of-Year Benchmark, is a computer adaptive assessment. MAP Reading also has good evidence of reliability and validity for use as an interim progress monitoring assessment. This formative assessment is often used to identify students who are at risk or in need of intensive intervention and, although it measures similar constructs, is independent of the Imagine Learning Beginning-of-Year Benchmark.

Do the classification accuracy analyses examine concurrent and/or predictive classification?

Concurrent
Predictive

Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.

Describe how the classification analyses were performed and cut-points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).: Classification accuracy was evaluated by calculating the area under the receiver operating characteristic (ROC) curve (AUC) with the pROC package (Robin et al., 2011) in RStudio Version 1.0.153. The Imagine Learning Winter 2016 Screening Logit Scores were used as the classifier and were tested against the criterion of Spring 2016 MAP Reading Scores. The logit scores used in the Imagine Learning Beginning-of-Year Benchmark were a continuous indicator of the students’ performance on the screener. Students were identified on the criterion as needing intensive intervention if they scored at or below the 20th percentile on the MAP Reading assessment. Youden’s (1950) method was applied to the ROC curve to determine cut points on the logit scores for the Imagine Learning Beginning-of-Year Benchmark. This method selects the threshold that jointly optimizes sensitivity and specificity, using the logit score calculated by Winsteps through Rasch measurement. The cut points determined using Youden’s (1950) method performed as expected. The cut points increased with each grade level, suggesting the test was designed so that students performed better on the test as they advanced through grade levels.
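
As an illustration of the cut-point selection described above, the following sketch picks the threshold that maximizes Youden's J (sensitivity + specificity - 1). The reported analysis used the pROC package in R; this plain-Python version is a stand-in for exposition only, and the function name and flagging direction (a student is flagged when the score is at or below the cut) are assumptions.

```python
from typing import List, Tuple


def youden_cut_point(scores: List[float], at_risk: List[bool]) -> Tuple[float, float]:
    """Return (cut, J) maximizing J = sensitivity + specificity - 1.

    Assumes a student is flagged when their score is at or below the cut.
    """
    positives = sum(at_risk)
    negatives = len(at_risk) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both at-risk and not-at-risk students")
    best_cut, best_j = float("nan"), float("-inf")
    # Try every observed score as a candidate cut point.
    for cut in sorted(set(scores)):
        tp = sum(1 for s, r in zip(scores, at_risk) if r and s <= cut)
        tn = sum(1 for s, r in zip(scores, at_risk) if not r and s > cut)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j
```

Statistical packages may place candidate thresholds between observed scores rather than on them, so the exact cut values from this sketch need not match those from other software.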

Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?: No
If yes, please describe the intervention, what children received the intervention, and how they were chosen.

Cross-Validation

Has a cross-validation study been conducted?: No
If yes,

Select time of year.

Fall

Winter

Spring

Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.

Do the cross-validation analyses examine concurrent and/or predictive classification?

Concurrent
Predictive

Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.

Describe how the cross-validation analyses were performed and cut-points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).

Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
If yes, please describe the intervention, what children received the intervention, and how they were chosen.

Classification Accuracy - Winter

Evidence	Kindergarten	Grade 1	Grade 2	Grade 3	Grade 4	Grade 5	Grade 6
Criterion measure	Measures of Academic Progress (MAP) Reading Assessment	Measures of Academic Progress (MAP) Reading Assessment	Measures of Academic Progress (MAP) Reading Assessment	Measures of Academic Progress (MAP) Reading Assessment	Measures of Academic Progress (MAP) Reading Assessment	Measures of Academic Progress (MAP) Reading Assessment	Measures of Academic Progress (MAP) Reading Assessment
Cut Points - Percentile rank on criterion measure	20	20	20	20	20	20	20
Cut Points - Performance score on criterion measure
Cut Points - Corresponding performance score (numeric) on screener measure	-3.305	-1.87	-0.22	0.505	0.97	1.265	2.555
Classification Data - True Positive (a)	126	148	191	136	134	118	27
Classification Data - False Positive (b)	211	104	136	119	94	85	23
Classification Data - False Negative (c)	53	44	25	34	48	51	7
Classification Data - True Negative (d)	470	617	560	488	528	520	102
Area Under the Curve (AUC)	0.75	0.87	0.91	0.87	0.86	0.84	0.88
AUC Estimate’s 95% Confidence Interval: Lower Bound	0.71	0.83	0.89	0.84	0.82	0.81	0.83
AUC Estimate’s 95% Confidence Interval: Upper Bound	0.79	0.90	0.93	0.90	0.89	0.88	0.94

Statistics	Kindergarten	Grade 1	Grade 2	Grade 3	Grade 4	Grade 5	Grade 6
Base Rate	0.21	0.21	0.24	0.22	0.23	0.22	0.21
Overall Classification Rate	0.69	0.84	0.82	0.80	0.82	0.82	0.81
Sensitivity	0.70	0.77	0.88	0.80	0.74	0.70	0.79
Specificity	0.69	0.86	0.80	0.80	0.85	0.86	0.82
False Positive Rate	0.31	0.14	0.20	0.20	0.15	0.14	0.18
False Negative Rate	0.30	0.23	0.12	0.20	0.26	0.30	0.21
Positive Predictive Power	0.37	0.59	0.58	0.53	0.59	0.58	0.54
Negative Predictive Power	0.90	0.93	0.96	0.93	0.92	0.91	0.94
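
The statistics in the table above follow from the four classification counts (a, b, c, d). The sketch below shows the standard definitions and reproduces the Kindergarten column; the function and key names are illustrative choices, not part of the original report.

```python
def classification_stats(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute screening accuracy statistics from a 2x2 classification table."""
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "base_rate": (tp + fn) / total,
        "overall_classification_rate": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_positive_rate": 1 - specificity,
        "false_negative_rate": 1 - sensitivity,
        "positive_predictive_power": tp / (tp + fp),
        "negative_predictive_power": tn / (tn + fn),
    }


# Kindergarten counts from the classification data reported above.
kindergarten = classification_stats(tp=126, fp=211, fn=53, tn=470)
```

Rounded to two decimals, these values agree with the Kindergarten column (for example, sensitivity 0.70 and specificity 0.69).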

Sample	Kindergarten	Grade 1	Grade 2	Grade 3	Grade 4	Grade 5	Grade 6
Date	Spring 2017	Spring 2017	Spring 2017	Spring 2017	Spring 2017	Spring 2017	Spring 2017
Sample Size	860	913	912	777	804	774	159
Geographic Representation	Mountain (AZ, WY) Pacific (CA) West South Central (LA)	Mountain (AZ, WY) Pacific (CA) West South Central (LA)	Mountain (AZ, WY) Pacific (CA) West South Central (LA)	Mountain (AZ, WY) Pacific (CA) West South Central (LA)	Mountain (AZ, WY) Pacific (CA) West South Central (LA)	Mountain (AZ, WY) Pacific (CA) West South Central (LA)	Mountain (AZ, WY) Pacific (CA) West South Central (LA)
Male	48.3%	49.8%	50.4%	51.1%	52.6%	53.7%	52.2%
Female	51.7%	50.2%	49.6%	48.9%	47.4%	46.3%	47.8%
Other
Gender Unknown
White, Non-Hispanic	12.9%	11.0%	12.3%	5.5%	6.6%	5.7%	4.4%
Black, Non-Hispanic	0.3%	9.5%	8.1%	4.1%	4.4%	5.2%	6.9%
Hispanic	79.1%	78.4%	78.5%	88.9%	87.6%	87.7%	84.9%
Asian/Pacific Islander	0.2%	0.1%	0.1%	0.1%	0.4%	0.4%
American Indian/Alaska Native	0.1%	0.7%	0.7%	1.0%	0.6%	0.6%	1.9%
Other	0.1%	0.2%	0.2%	0.3%	0.4%	0.4%	1.9%
Race / Ethnicity Unknown
Low SES
IEP or diagnosed disability
English Language Learner	31.7%	31.4%	27.6%	35.9%	32.3%	25.3%	12.6%

Reliability

Grade	Kindergarten	Grade 1	Grade 2	Grade 3	Grade 4	Grade 5	Grade 6
Rating

Legend

Convincing evidence

Partially convincing evidence

Unconvincing evidence

Data unavailable

^dDisaggregated data available

*Offer a justification for each type of reliability reported, given the type and purpose of the tool.: Because the Imagine Learning Beginning-of-Year Benchmark assessment uses a total score as a basis for screening decisions, we followed guidance from the Technical Review Committee (National Center for Intensive Intervention, 2018) and conducted model-based analyses of item quality. A model-based approach also accommodates the computerized adaptive nature of the assessment. Sijtsma (2009) and Bentler (2009) describe statistical model-based internal consistency reliability coefficients based on covariance and correlational approaches to identifying the proportion of variance that can be reliably interpreted as true and not as due to error. Bentler (2003, 2007) proposed using any statistically acceptable structural model that contains additive random errors to estimate model-based internal consistency reliability. Rasch models (Rasch, 1960; Andrich, 1988; Bond & Fox, 2015; Linacre, 1993, 1994, 1996; Wilson, 2005) are widely used, statistically acceptable structural models that apply additive random errors to estimate internal consistency reliability. A Rasch model-based approach to reliability is justified substantively: education is founded on the idea that future student success can be predicted from performance on tasks similar enough to those encountered in real life to be useful proxies. These tasks are assembled into curricular materials and approached from pedagogical perspectives that typically are not tailored to each individual student but are assumed to apply in a developmentally appropriate way across students as they age and progress in knowledge.
Much recent work in education has focused on developing and testing measurement models and calibrating tools as coherent guides to future practice, across the full range of assessment applications, from formative classroom instruction to accountability (Wilson, 2004; Gorin & Mislevy, 2013; NRC, 2006). A simple model of this kind takes the form $\ln\left(\frac{p_{ni}}{1 - p_{ni}}\right) = b_n - d_i$, which states that the natural logarithm $\ln$ of the odds $p_{ni} / (1 - p_{ni})$ is hypothesized to be equal to the difference between the ability $b$ of student $n$ and the difficulty $d$ of item $i$. Estimates of the ability measures $b$ and the item difficulty calibrations $d$ are expressed in a common unit, which makes it possible to interpret the measures relative to the calibrations in the substantive terms of the items, and vice versa. All measures and calibrations are associated with individually estimated error and model fit statistics. A Rasch measurement model-based approach to reliability (Andrich, 1982; Andrich & Douglas, 1977; Fisher, 1992, 2008; Wright & Masters, 1982) differs, then, from the statistical model-based approaches described by Sijtsma (2009) and Bentler (2009) in the way error is estimated and treated en route to arriving at an estimate of the ratio of true variation to error variation.
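
To make the model concrete, the sketch below solves the log-odds equation above for the probability of a correct response, given a student ability and an item difficulty expressed in logits. The function names are illustrative and the numeric values in any usage are arbitrary examples.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """P(correct) under the Rasch model: ln(p / (1 - p)) = ability - difficulty."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def log_odds(p: float) -> float:
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))
```

When ability equals difficulty, the probability is 0.5, and applying `log_odds` to the returned probability recovers the difference between ability and difficulty.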

*Describe the sample(s), including size and characteristics, for each reliability analysis conducted.: Exhibit 3 demonstrates that the sample for the reliability analysis came from multiple regions, divisions, and states for grades K-5. For students for whom gender was known, approximately half of each grade level was male and half was female. Fewer than ten percent of each grade level was identified as English Language Learners, though ELL status data were missing for the majority of students. Inconsistent information was available for FRL, other SES indicators, disability classification, and first language.

*Describe the analysis procedures for each reported type of reliability.: A Rasch measurement model-based approach was used to assess reliability. This approach estimates error (better termed uncertainty; Fisher & Stenner, 2017; JCGM, 2008) for each individual person and item. The mean square error variance is then subtracted from the total variance to arrive directly at an estimate of the true variance. The ratio of the true standard deviation to the root mean square error then provides a separation ratio, G, that increases in linear error-unit amounts (Wright & Masters, 1982) relative to a separation reliability coefficient formulated in the traditional 0.00-1.00 form (Andrich, 1982). An intuitive expression of measurement model-based reliability is obtained in terms of strata, ranges with centers separated by three errors. Strata are estimated by multiplying the separation ratio, G, by 4, which captures 95% of the variation in the measures; adding 1, allowing for error; and dividing that total by 3, for the number of errors taken as separating the centers of the resulting ranges. Winsteps version 4.00.1 was used for all Rasch analyses (Linacre, 2017). Model fit statistics and Principal Components Analyses of the standardized residuals explicitly focus on the internal consistency of the observed responses, evaluating them in terms of expectations formed on the basis of a simple conception of what is supposed to happen when students respond to test questions. Simulations show that model-based assessments of internal consistency using fit statistics are better able to detect violations of interval measurement’s unidimensionality and invariance requirements than Cronbach’s alpha (Fisher, Elbaum, & Coulter, 2010).
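
The arithmetic behind the separation ratio, separation reliability, and strata described above can be sketched as follows. The reported analyses used Winsteps; this standalone function is an assumption-laden illustration (it uses the population variance of the measures, and the function and key names are hypothetical).

```python
import math
from typing import List


def separation_summary(measures: List[float], standard_errors: List[float]) -> dict:
    """Separation ratio G, separation reliability, and strata from Rasch measures."""
    n = len(measures)
    mean = sum(measures) / n
    # Total (observed) variance of the measures; population form assumed.
    observed_var = sum((m - mean) ** 2 for m in measures) / n
    # Mean square error: average of the squared individual standard errors.
    mse = sum(se ** 2 for se in standard_errors) / n
    # True variance is what remains after removing error variance.
    true_var = max(observed_var - mse, 0.0)
    g = math.sqrt(true_var) / math.sqrt(mse)
    return {
        "separation": g,
        "reliability": true_var / observed_var,
        "strata": (4 * g + 1) / 3,
    }
```

For example, measures of -3, -1, 1, and 3 logits with standard errors of 1 give an observed variance of 5, a true variance of 4, a separation ratio of 2, a reliability of 0.8, and 3 strata.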

*In the table(s) below, report the results of the reliability analyses described above (e.g., internal consistency or inter-rater reliability coefficients).

Type of	Subgroup	Informant	Age / Grade	Test or Criterion	n	Median Coefficient	95% Confidence Interval Lower Bound	95% Confidence Interval Upper Bound

Results from other forms of reliability analysis not compatible with above table format:

Manual cites other published reliability studies:: No

Provide citations for additional published studies.

Do you have reliability data that are disaggregated by gender, race/ethnicity, or other subgroups (e.g., English language learners, students with disabilities)?

If yes, fill in data for each subgroup with disaggregated reliability data.

Type of	Subgroup	Informant	Age / Grade	Test or Criterion	n	Median Coefficient	95% Confidence Interval Lower Bound	95% Confidence Interval Upper Bound

Results from other forms of reliability analysis not compatible with above table format:

Manual cites other published reliability studies:: No

Provide citations for additional published studies.

Validity

Grade	Kindergarten	Grade 1	Grade 2	Grade 3	Grade 4	Grade 5	Grade 6
Rating

Legend

Convincing evidence

Partially convincing evidence

Unconvincing evidence

Data unavailable

d: Disaggregated data available

*Describe each criterion measure used and explain why each measure is appropriate, given the type and purpose of the tool.: Rasch Unit (RIT) scale scores from the Measures of Academic Progress (MAP) Reading assessment (Northwest Evaluation Association, 2014) were used as a criterion measure to demonstrate concurrent and predictive validity of the Imagine Learning Beginning-of-Year Benchmark assessment. Performance on this external measure was expected to be related to performance on the Imagine Learning Beginning-of-Year Benchmark because it includes similar measures of language and literacy skills for grades K-6. Specifically, the Imagine Learning Beginning-of-Year Benchmark includes measures of letter recognition, phonemic awareness, word recognition, basic reading vocabulary, sentence cloze, beginning book comprehension, leveled book comprehension, and cloze passage completion. MAP Reading includes measures of word recognition, structure and vocabulary, and reading informational texts. MAP Reading is widely used to identify students who are at risk or in need of intensive intervention and, like the Imagine Learning Beginning-of-Year Benchmark, is a computer adaptive assessment. MAP Reading also has good evidence of reliability and validity for use as an interim progress monitoring assessment.

*Describe the sample(s), including size and characteristics, for each validity analysis conducted.: See Exhibits 5-7 for sample characteristics for each validity analysis conducted. Across all grade K-5 samples used to examine predictive and concurrent validity, sample sizes ranged from 641 to 914; sample sizes for grade 6 ranged from 115 to 159. Multiple regions, divisions, and states were represented for each grade level K-5; the grade 6 sample represented one region, division, and state. The proportion of male and female students across all samples was roughly equal. English Language Learners (ELLs) represented between a quarter and a third of all samples, except for grade 6 samples, in which ELLs represented between 4 and 12 percent of all students. In all samples, most students were Hispanic, with the next most prevalent race/ethnicity categories being White/non-Hispanic and Black/African American. Data indicating eligibility for free/reduced price lunch, student disability status, and services received were not consistently available.

*Describe the analysis procedures for each reported type of validity.: Three bivariate correlation analyses were conducted for each grade level. The first provides evidence of predictive validity, using winter 2016 Imagine Learning Beginning-of-Year Benchmark assessment data to predict spring 2017 MAP Reading assessment scores. The other two analyses demonstrate concurrent validity by relating Imagine Learning Beginning-of-Year Benchmark assessment data to MAP Reading assessment data (separately, using winter 2016 and spring 2017 data). The Pearson correlation coefficient was computed for each correlation, and a 95% confidence interval around each coefficient was calculated.
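
As an illustration of the correlation analysis described above, the following Python sketch computes a Pearson coefficient and a 95% confidence interval. The source does not say how the intervals were calculated; the Fisher z transformation used here is a common choice and is an assumption, as are the function names. The .6 default in `meets_threshold` is the NCII lower-bound threshold cited in this section.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for r via the Fisher z transformation (assumed method)."""
    z = math.atanh(r)            # transform r to the z scale
    se = 1 / math.sqrt(n - 3)    # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

def meets_threshold(r, n, threshold=0.6):
    """True when the lower CI bound reaches the threshold."""
    lower, _ = fisher_ci(r, n)
    return lower >= threshold
```

With a hypothetical coefficient of 0.70, `meets_threshold(0.70, 641)` holds at the smallest K-5 sample size reported above, while a hypothetical 0.62 at n = 115 does not. The comparison shows why the smaller grade 6 samples need a higher observed correlation to clear the same lower bound.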

*In the table below, report the results of the validity analyses described above (e.g., concurrent or predictive validity, evidence based on response processes, evidence based on internal structure, evidence based on relations to other variables, and/or evidence based on consequences of testing), and the criterion measures.

Type of	Subgroup	Informant	Age / Grade	Test or Criterion	n	Median Coefficient	95% Confidence Interval Lower Bound	95% Confidence Interval Upper Bound

Results from other forms of validity analysis not compatible with above table format:

Manual cites other published reliability studies:: No

Provide citations for additional published studies.

Describe the degree to which the provided data support the validity of the tool.: Using the NCII lower bound threshold of .6 for the confidence intervals around the standardized estimate, analyses showed evidence of concurrent validity for all grade levels. Based on this threshold, evidence of predictive validity was available for first, second, third, and sixth grades. The lower bounds of the confidence intervals for predictive validity for fourth and fifth grades were very close to the threshold (.597 for fourth grade and .594 for fifth grade).

Do you have validity data that are disaggregated by gender, race/ethnicity, or other subgroups (e.g., English language learners, students with disabilities)?: No

If yes, fill in data for each subgroup with disaggregated validity data.

Type of	Subgroup	Informant	Age / Grade	Test or Criterion	n	Median Coefficient	95% Confidence Interval Lower Bound	95% Confidence Interval Upper Bound

Results from other forms of validity analysis not compatible with above table format:: Disaggregated validity analyses are not included because bias analysis (differential item functioning) was used to assess subgroup bias (see Technical Standard 5).

Manual cites other published reliability studies:: No

Provide citations for additional published studies.

Bias Analysis

Grade	Kindergarten	Grade 1	Grade 2	Grade 3	Grade 4	Grade 5	Grade 6
Rating	Provided	Provided	Provided	Provided	Provided	Provided	Provided

Have you conducted additional analyses related to the extent to which your tool is or is not biased against subgroups (e.g., race/ethnicity, gender, socioeconomic status, students with disabilities, English language learners)? Examples might include Differential Item Functioning (DIF) or invariance testing in multiple-group confirmatory factor models.: Yes

If yes,
a. Describe the method used to determine the presence or absence of bias:: Differential Item Functioning (DIF) was evaluated for all items on the Imagine Learning Beginning-of-Year Benchmark by testing differences between focal and contrast groups with t-tests. Winsteps version 4.00.1 was used for all analyses (Linacre, 2017). Evaluating DIF requires close attention to the degree of item estimate stability desired. Lee (1992) finds that student performance across elementary grade levels changes by about a logit per year, with the rate of growth slowing as the upper grades are reached. The present data roughly corresponded with that finding: the changes from K to Grade 1, Grade 1 to Grade 2, and Grade 2 to Grade 3 were all a logit or more, and the remaining three were all half a logit or more. Accordingly, as Linacre (1994) points out, items stable to within a logit will be targeted at the correct grade level. Linacre (1994) shows that to have 99% confidence that an item difficulty estimate is stable to within a logit, its error of measurement must be 0.385 or less (the value obtained by dividing the desired one-logit range by the 2.6 standard deviations needed for 99% confidence). The sample size needed for item estimate errors of 0.385 or less was in the range of 27 to 61, depending on how well the item was targeted at the sample. An item that is too easy or too hard, and so is off target, will have a higher error and will need a correspondingly larger sample size to drive the error down to the range needed for stability within a logit. The criterion of interest in the Winsteps table is then the DIF size, which should be less than 1.00 logit for samples of sufficient size.

b. Describe the subgroups for which bias analyses were conducted:: Groups tested included ELL versus non-ELL versus unknown; female versus male versus unknown; and White versus non-White. Exhibit 10 displays the number of students and percentages within each of these categories by grade level. Race/ethnicity categories were collapsed into binary categories (i.e., White and non-White) because of small cell sizes for certain groups.

c. Describe the results of the bias analyses conducted, including data and interpretative statements. Include magnitude of effect (if available) if bias has been identified.: For ELL, with three groups, the hypothesis tested was that the item has the same difficulty as its average difficulty for all groups. In 98.5% of comparisons, DIF sizes of |0.77| or less were obtained. Of the six instances of DIF greater than |0.77| logits, four involved Phoneme Segmentation items with only four non-ELL students responding. The other two were Cloze Passage items (items 107.1 and 107.18); both had 77 ELL students answering, and both had DIF sizes of about -1.00 logit, meaning these ELL students correctly answered these two items disproportionately more frequently. However, neither item had a t-statistic greater than 1.96, so neither difference was statistically significant at a p-value less than .05.
For gender, also with three groups, the hypothesis tested was again that the item has the same difficulty as its average difficulty for all groups. No DIF sizes over |0.77| logits were found, and seven items had DIF sizes greater than or equal to |0.50|. All were Cloze Passage items. Males found four items more difficult than the other two groups did, and females found three items easier. Two items with DIF sizes of 0.66 disadvantaging males were statistically significant at p less than .05.
For ethnicity, the hypothesis tested was whether the item has the same difficulty for the two groups. There were three items with DIF sizes greater than |0.76|. Two of the three had DIF measure standard errors greater than 0.385 for one of the contrasted groups, and so were not expected to be stable to within a logit with 99% confidence. The other item (Cloze Passage 104.1), however, was 1.57 logits more difficult for White students (CLASS=1) than it was for others.
The DIF analysis provides a statistical approach to assessing whether the items retain their relative difficulties across different groups. This kind of invariance can also be investigated more directly, by scaling the items on subsamples of students. A series of analyses of each grade level resulted in seven separate sets of item calibrations. Of the 21 pairwise correlations among these sets, 15 were between 0.94 and 0.99, four were between 0.83 and 0.89, and the remaining two (Kindergarten vs 5th and 6th grades) were 0.78 and 0.66. Disattenuated for error (Muchinsky, 1996; Schumacker, 1996), the smallest correlation was 0.75. These results supported the overall stability of the scale across grade levels.
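
The criteria used in items a and c, and the disattenuation step in the preceding paragraph, can be sketched in Python as follows. This is a minimal illustration and not the Winsteps procedure: all function names are hypothetical, the contrast covers only the two-group case (not the three-group comparison against the all-group average), the 1.96 cutoff is a normal approximation, and the standard correction-for-attenuation formula is assumed to be what the cited authors describe.

```python
import math

def max_stable_se(range_logits=1.0, z=2.6):
    """Largest standard error for 99% confidence of stability within the range."""
    return range_logits / z  # 1 / 2.6, about 0.385

def dif_contrast(d_focal, se_focal, d_ref, se_ref):
    """DIF size (logits) and approximate t for a two-group contrast."""
    size = d_focal - d_ref
    joint_se = math.sqrt(se_focal ** 2 + se_ref ** 2)
    return size, size / joint_se

def review_item(d_focal, se_focal, d_ref, se_ref):
    """Summarize one item against the criteria described in items a and c."""
    size, t = dif_contrast(d_focal, se_focal, d_ref, se_ref)
    limit = max_stable_se()
    return {
        "dif_size": size,
        "t": t,
        "stable": se_focal <= limit and se_ref <= limit,  # both errors small enough
        "large": abs(size) >= 1.0,                        # a logit or more of DIF
        "significant": abs(t) > 1.96,                     # roughly p < .05
    }

def disattenuate(r_xy, rel_x, rel_y):
    """Correct a correlation for measurement error in both sets of estimates."""
    return r_xy / math.sqrt(rel_x * rel_y)
```

For instance, `review_item(1.0, 0.3, 0.0, 0.4)` uses made-up values and reports a one-logit contrast with t near 2.0, but marks the item as not stable because one standard error exceeds `max_stable_se()`. A disattenuated correlation can exceed 1.0 when reliabilities are low, so the sketch leaves the result unclipped.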

Data Collection Practices

Most tools and programs evaluated by the NCII are branded products which have been submitted by the companies, organizations, or individuals that disseminate these products. These entities supply the textual information shown above, but not the ratings accompanying the text. NCII administrators and members of our Technical Review Committees have reviewed the content on this page, but NCII cannot guarantee that this information is free from error or reflective of recent changes to the product. Tools and programs have the opportunity to be updated annually or upon request.
