Amira ISIP
Reading

Summary

Amira has two essential jobs: (1) provide teachers with supplemental instructional materials and student data that bridge science of reading professional development to classroom execution, and (2) engage students in the "time on the tongue" needed to close identified reading skill gaps. Amira’s personalized learning software listens to students read aloud, identifies students at risk for dyslexia, continuously assesses reading mastery, and delivers individualized tutoring. Our solution empowers the learning community, from students to paraprofessionals, teachers, parents, and administrators, with AI-powered tools for real-time results. Using Amira, teachers can continuously assess and monitor student progress in oral language, phonological awareness, phonics, fluency, vocabulary, and comprehension. Amira optimizes instructional strategies and resource allocation to provide a comprehensive solution. Any time a student reads with Amira, all stakeholders receive up-to-the-moment feedback that is critical for teachers to make the right intervention-support decisions.

Where to Obtain:
Amira Learning, Inc.
orders@amiralearning.com
5214F Diamond Heights Blvd #3255, San Francisco, CA 94131
650-455-4380
www.amiralearning.com
Initial Cost:
$7.50 per student
Replacement Cost:
$7.50 per student per year
Included in Cost:
Amira is a comprehensive and holistic assessment, instruction, and tutoring suite. Assess ($7.50 per student per year): benchmark, dyslexia screening, and ongoing progress monitoring to diagnose skill gaps. Instruct ($12.50 per student per year): lesson plans prescribed and delivered based on targeted needs that can be linked to each district’s high-quality instructional materials. Tutor ($15 per student per year): combined literacy instruction and practice during school or through extended learning opportunities. Amira Full Suite (includes Assess, Instruct, and Tutor): $35 per student per year. These student license costs include the screening tool software, access for students and district/school personnel, virtual (live and asynchronous) professional development, and foundational data for implementation and monitoring. Bulk discounts are available depending on the number of student licenses needed by states/districts.
Amira software is accessibility ready, adhering to the Web Content Accessibility Guidelines (WCAG) 2.1 Level AA and following best practices for UX development, which ensures that content is accessible and enhances usability for all users. Amira is also SOC 2 Type 2 certified. All tasks in Amira’s English suite of assessments and practice are also available in Spanish. Details on Amira’s accommodations are available in the Teacher Manual (2024), pages 16-31, at https://go.amiralearning.com/hubfs/Assessment/Amira%20Teacher%20Manual%20(2024).pdf.
Training Requirements:
Administrators are encouraged to attend or complete the 45-minute online asynchronous training, “Getting Started with Amira,” prior to initial administration. Following administration, administrators may access self-paced asynchronous training on Amira Academy, including training on understanding data. Amira handles most of the tasks that are typically difficult for teachers: the software acts as a proctor, guiding the student through each task; it serves as a highly adept support technician, identifying hardware and software issues that may impact the assessment process; and it produces consistent, comprehensive scoring of the items, providing the teacher with a framework for evaluating outputs.
Qualified Administrators:
No minimum qualifications specified.
Access to Technical Support:
Assessment Format:
Scoring Time:
  • Scoring is automatic OR
  • 0 minutes per student
Scores Generated:
  • Raw score
  • Standard score
  • Percentile score
  • Grade equivalents
  • IRT-based score
  • Developmental benchmarks
  • Developmental cut points
  • Equated
  • Lexile score
  • Error analysis
  • Composite scores
  • Subscale/subtest scores
Administration Time:
  • 20 minutes per student
Scoring Method:
  • Manually (by hand)
  • Automatically (computer-scored)
Technology Requirements:
  • Computer or tablet
  • Internet connection
Accommodations:
Amira software is accessibility ready, adhering to the Web Content Accessibility Guidelines (WCAG) 2.1 Level AA and following best practices for UX development, which ensures that content is accessible and enhances usability for all users. Amira is also SOC 2 Type 2 certified. All tasks in Amira’s English suite of assessments and practice are also available in Spanish. Details on Amira’s accommodations are available in the Teacher Manual (2024), pages 16-31, at https://go.amiralearning.com/hubfs/Assessment/Amira%20Teacher%20Manual%20(2024).pdf.

Descriptive Information

Please provide a description of your tool:
Amira has two essential jobs: (1) provide teachers with supplemental instructional materials and student data that bridge science of reading professional development to classroom execution, and (2) engage students in the "time on the tongue" needed to close identified reading skill gaps. Amira’s personalized learning software listens to students read aloud, identifies students at risk for dyslexia, continuously assesses reading mastery, and delivers individualized tutoring. Our solution empowers the learning community, from students to paraprofessionals, teachers, parents, and administrators, with AI-powered tools for real-time results. Using Amira, teachers can continuously assess and monitor student progress in oral language, phonological awareness, phonics, fluency, vocabulary, and comprehension. Amira optimizes instructional strategies and resource allocation to provide a comprehensive solution. Any time a student reads with Amira, all stakeholders receive up-to-the-moment feedback that is critical for teachers to make the right intervention-support decisions.
The tool is intended for use with the following grade(s).
not selected Preschool / Pre - kindergarten
selected Kindergarten
selected First grade
selected Second grade
selected Third grade
selected Fourth grade
selected Fifth grade
selected Sixth grade
not selected Seventh grade
not selected Eighth grade
not selected Ninth grade
not selected Tenth grade
not selected Eleventh grade
not selected Twelfth grade

The tool is intended for use with the following age(s).
not selected 0-4 years old
selected 5 years old
selected 6 years old
selected 7 years old
selected 8 years old
selected 9 years old
selected 10 years old
selected 11 years old
not selected 12 years old
not selected 13 years old
not selected 14 years old
not selected 15 years old
not selected 16 years old
not selected 17 years old
not selected 18 years old

The tool is intended for use with the following student populations.
selected Students in general education
selected Students with disabilities
selected English language learners

ACADEMIC ONLY: What skills does the tool screen?

Reading
Phonological processing:
selected RAN
selected Memory
selected Awareness
selected Letter sound correspondence
selected Phonics
not selected Structural analysis

Word ID
selected Accuracy
selected Speed

Nonword
selected Accuracy
selected Speed

Spelling
selected Accuracy
selected Speed

Passage
selected Accuracy
selected Speed

Reading comprehension:
selected Multiple choice questions
selected Cloze
not selected Constructed Response
selected Retell
not selected Maze
not selected Sentence verification
not selected Other (please describe):


Listening comprehension:
selected Multiple choice questions
not selected Cloze
not selected Constructed Response
selected Retell
not selected Maze
not selected Sentence verification
selected Vocabulary
selected Expressive
selected Receptive

Mathematics
Global Indicator of Math Competence
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

Early Numeracy
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

Mathematics Concepts
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

Mathematics Computation
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

Mathematic Application
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

Fractions/Decimals
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

Algebra
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

Geometry
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

not selected Other (please describe):

Please describe specific domain, skills or subtests:
BEHAVIOR ONLY: Which category of behaviors does your tool target?


BEHAVIOR ONLY: Please identify which broad domain(s)/construct(s) are measured by your tool and define each sub-domain or sub-construct.

Acquisition and Cost Information

Where to obtain:
Email Address
orders@amiralearning.com
Address
5214F Diamond Heights Blvd #3255, San Francisco, CA 94131
Phone Number
650-455-4380
Website
www.amiralearning.com
Initial cost for implementing program:
Cost
$7.50
Unit of cost
student
Replacement cost per unit for subsequent use:
Cost
$7.50
Unit of cost
student
Duration of license
year
Additional cost information:
Describe basic pricing plan and structure of the tool. Provide information on what is included in the published tool, as well as what is not included but required for implementation.
Amira is a comprehensive and holistic assessment, instruction, and tutoring suite. Assess ($7.50 per student per year): benchmark, dyslexia screening, and ongoing progress monitoring to diagnose skill gaps. Instruct ($12.50 per student per year): lesson plans prescribed and delivered based on targeted needs that can be linked to each district’s high-quality instructional materials. Tutor ($15 per student per year): combined literacy instruction and practice during school or through extended learning opportunities. Amira Full Suite (includes Assess, Instruct, and Tutor): $35 per student per year. These student license costs include the screening tool software, access for students and district/school personnel, virtual (live and asynchronous) professional development, and foundational data for implementation and monitoring. Bulk discounts are available depending on the number of student licenses needed by states/districts.
Provide information about special accommodations for students with disabilities.
Amira software is accessibility ready, adhering to the Web Content Accessibility Guidelines (WCAG) 2.1 Level AA and following best practices for UX development, which ensures that content is accessible and enhances usability for all users. Amira is also SOC 2 Type 2 certified. All tasks in Amira’s English suite of assessments and practice are also available in Spanish. Details on Amira’s accommodations are available in the Teacher Manual (2024), pages 16-31, at https://go.amiralearning.com/hubfs/Assessment/Amira%20Teacher%20Manual%20(2024).pdf.

Administration

BEHAVIOR ONLY: What type of administrator is your tool designed for?
not selected General education teacher
not selected Special education teacher
not selected Parent
not selected Child
not selected External observer
not selected Other
If other, please specify:

What is the administration setting?
not selected Direct observation
not selected Rating scale
not selected Checklist
not selected Performance measure
not selected Questionnaire
not selected Direct: Computerized
not selected One-to-one
not selected Other
If other, please specify:

Does the tool require technology?
Yes

If yes, what technology is required to implement your tool? (Select all that apply)
selected Computer or tablet
selected Internet connection
not selected Other technology (please specify)

If your program requires additional technology not listed above, please describe the required technology and the extent to which it is combined with teacher small-group instruction/intervention:

What is the administration context?
selected Individual
selected Small group   If small group, n=4
selected Large group   If large group, n=20
selected Computer-administered
selected Other
If other, please specify:
any sized group

What is the administration time?
Time in minutes
20
per (student/group/other unit)
student

Additional scoring time:
Time in minutes
0
per (student/group/other unit)
student

ACADEMIC ONLY: What are the discontinue rules?
not selected No discontinue rules provided
selected Basals
not selected Ceilings
not selected Other
If other, please specify:


Are norms available?
Yes
Are benchmarks available?
Yes
If yes, how many benchmarks per year?
3
If yes, for which months are benchmarks available?
Aug-Nov, Dec-March, April-July; Amira ISIP offers unlimited additional administrations of progress monitoring throughout the school year in addition to benchmark assessment.
BEHAVIOR ONLY: Can students be rated concurrently by one administrator?
If yes, how many students can be rated concurrently?

Training & Scoring

Training

Is training for the administrator required?
Yes
Describe the time required for administrator training, if applicable:
Administrators are encouraged to attend or complete the 45-minute online asynchronous training, “Getting Started with Amira,” prior to initial administration. Following administration, administrators may access self-paced asynchronous training on Amira Academy, including training on understanding data. Amira handles most of the tasks that are typically difficult for teachers: the software acts as a proctor, guiding the student through each task; it serves as a highly adept support technician, identifying hardware and software issues that may impact the assessment process; and it produces consistent, comprehensive scoring of the items, providing the teacher with a framework for evaluating outputs.
Please describe the minimum qualifications an administrator must possess.
selected No minimum qualifications
Are training manuals and materials available?
Yes
Are training manuals/materials field-tested?
Yes
Are training manuals/materials included in cost of tools?
Yes
If No, please describe training costs:
Can users obtain ongoing professional and technical support?
Yes
If Yes, please describe how users can obtain support:

Scoring

How are scores calculated?
selected Manually (by hand)
selected Automatically (computer-scored)
not selected Other
If other, please specify:

Do you provide basis for calculating performance level scores?
Yes
What is the basis for calculating performance level and percentile scores?
not selected Age norms
selected Grade norms
not selected Classwide norms
not selected Schoolwide norms
not selected Stanines
not selected Normal curve equivalents

What types of performance level scores are available?
selected Raw score
selected Standard score
selected Percentile score
selected Grade equivalents
selected IRT-based score
not selected Age equivalents
not selected Stanines
not selected Normal curve equivalents
selected Developmental benchmarks
selected Developmental cut points
selected Equated
not selected Probability
selected Lexile score
selected Error analysis
selected Composite scores
selected Subscale/subtest scores
not selected Other
If other, please specify:

Does your tool include decision rules?
No
If yes, please describe.
Can you provide evidence in support of multiple decision rules?
No
If yes, please describe.
Please describe the scoring structure. Provide relevant details such as the scoring format, the number of items overall, the number of items per subscale, what the cluster/composite score comprises, and how raw scores are calculated.
Amira leverages machine learning and AI to automatically score every student interaction. Interactions that Amira automatically scores include rapid automatized naming (RAN), letter sound fluency, blending, word reading, word part manipulation, spelling, listening comprehension, oral reading, multiple-choice questions, open-ended oral responses, and open-ended written responses, among others. The cornerstone of Amira’s scoring system is its ability to automatically and accurately score these varied interactions. Where necessary, Amira incorporates rubric-based scoring to further enhance precision, particularly for open-ended responses such as Amira’s dialogue-based comprehension questions. Because each activity is scored instantly and automatically upon completion, Amira can provide immediate feedback to students while maintaining a continuously updated, real-time profile of each student's progress, achievements, and instructional needs; the scoring process also immediately updates all teacher-facing reports, so that up-to-date profile is always available to educators. An integral feature of Amira’s scoring system is its mechanisms for quality and equity assurance. The first is a “meta-analysis” conducted by our machine learning models, which identifies discrepancies that may indicate a misrepresentation of a student’s true abilities and flags activities that may not represent a student’s best effort. The second is a set of automated data integrity tests that ensure data quality and consistency across the locations where data is stored. The third, and critical, mechanism is that recordings of activities are always available to educators should they want to listen and adjust any of Amira’s scoring: educators may click on an activity in any of Amira’s reports to bring up a recording of the activity and adjust scores as needed. Amira’s scoring system rigorously and universally adheres to the following principles: (a) the scoring of ALL items is visible and transparent to teachers/proctors; there is no black box, and educators can see every item taken and every Amira score; (b) educators can listen to student responses, enabling 100% auditability of Amira’s scoring; and (c) educators/teachers/proctors can correct Amira’s scoring manually. If desired, a state/district can allow educators to have the final word on the scoring of any student’s assessment, with the ability to override Amira’s scoring item by item.
Describe the tool’s approach to screening, samples (if applicable), and/or test format, including steps taken to ensure that it is appropriate for use with culturally and linguistically diverse populations and students with disabilities.
Amira’s voice-administered system is fully validated and adaptable for students in kindergarten through 6th grade, ensuring it meets the requirements for effective early literacy assessment across various grade levels and student needs. The system’s validation process includes rigorous testing to ensure that it provides accurate and reliable results, even when administered to young learners, including those with special needs or limited prior education. Amira’s mode of administration is designed to be flexible and inclusive, allowing for both digital, AI-driven assessments and partially teacher-facilitated modes. This adaptability ensures that the Screener is appropriate for students in early elementary grades, including those with specific educational needs. The following evidence demonstrates how Amira ISIP has been validated across diverse student populations: Voice Administration Validation: Amira’s voice-enabled assessment system has been validated specifically for students in K-6. The model training and evaluation process leverages an extensive dataset of over 10,000 hours of children's speech, carefully balanced to ensure representation across diverse populations representative of national demographics along dimensions of gender, race/ethnicity, socioeconomic status, ELL/MLL students, and accent/dialect group. These studies confirmed that the system could accurately capture and assess reading performance, with a classification accuracy of over 90% across these grade levels. This validation is crucial for ensuring that the assessments provide reliable data, even in a fully digital, AI-driven mode where the system autonomously guides students through tasks. Consistency and Standardization: The consistency and standardization provided by Amira’s AI reduce the risk of bias and human error, aligning with best practices for large-scale assessments (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014). Amira’s machine-proctored model is supported by extensive research demonstrating that AI-driven assessments can offer greater consistency and reliability than human-proctored ones. Human proctoring introduces variability due to differences in proctor training and execution, which can lead to measurement error and bias (Kane, 2013). In contrast, Amira’s AI provides a uniform testing experience, ensuring that every student is assessed under the same conditions, thereby upholding the integrity of the results (Wilson & Draney, 2002). At the individual item scoring level, currently the model being used in production has a 96% agreement with human expert judgment. To benchmark this number, human inter-rater reliability metrics for the same assessment range between 96-97%. At the aggregate assessment score level, there is an extremely high correlation (0.95 - 0.98, depending on the grade level) between the model scores and the same assessments scored by human experts. Proctoring and Administration for Kindergarteners: In both English and Spanish, Amira has proctored and administered millions of screenings. The system has been effectively used in thousands of classrooms, including those in more than 1,200 districts across all 50 states. The proctoring model for kindergarteners emphasizes small group administration, with a teacher serving as the meta-proctor. In this model, the teacher helps students log in, creating a staggered start that aids in managing the process. 
Once logged in, Amira leads students through a simple, voice-based dialog to ensure the environment is test-ready. This model allows Amira to function as an effective teacher’s assistant, handling most situations independently and alerting the teacher only when necessary. Adaptability to Student Needs: The Screener can be administered in various modes to accommodate different student needs. For younger students or those requiring additional support, Amira can operate in a small group, partially teacher-facilitated mode. In this setup, the teacher facilitates the assessment, while Amira provides automated guidance and feedback. Research supports this approach, indicating that younger children and those with specific needs benefit from the combination of human interaction and AI-driven assessment, which offers reassurance and allows for immediate intervention. Developmental Alignment and Task Appropriateness: Amira’s tasks are carefully aligned with the developmental milestones of K-6 students, ensuring the assessments are both appropriate and effective for identifying reading difficulties, including dyslexia risk at each stage. For example, tasks in kindergarten focus on phonological awareness, while those in 2nd grade emphasize reading fluency. This developmental alignment ensures that the Screener is sensitive to the literacy skills typical of each age group, providing a fair and relevant assessment regardless of the student’s prior education. Evidence of Effectiveness in Diverse Settings: Amira has been implemented and validated in various educational settings, including urban schools in large states such as California and Oklahoma where it was used in whole-classroom settings. Across the country, Amira has been administered in over 500,000 sessions, spanning diverse environments such as urban, suburban, and rural schools. These include Title I schools, charter schools, and districts serving large populations of English language learners and students with disabilities. The system’s predictive validity has been confirmed in these environments, demonstrating that it can provide accurate assessments even when administered to entire classrooms of young students. This large-scale validation ensures that Amira’s fully digital mode is not only effective but also scalable for broad application across different types of schools and student populations. Accommodations for Special Needs: Amira employs a Universal Design for Learning (UDL) approach to ensure that the Screener is accessible to all students, including those with disabilities. The system provides a range of accommodations, including the option for a fully proctored, one-on-one assessment or the use of paper-based alternatives like the TPRI or Tejas Lee for students who may not be well-suited to digital assessments. To date, Amira has been validated with over 7,000 students, including those requiring specific accommodations, with approximately 20% utilizing options such as one-on-one proctoring or bilingual support. This extensive validation confirms that the system reliably accommodates and accurately assesses students with various needs, including those with visual or auditory impairments, English language learners, and students with individualized education programs (IEPs). Amira’s UDL approach includes features that are WCAG 2.1 Level AA compliant, ensuring accessibility for students with disabilities. 
These accommodations have been tested and validated across multiple studies to ensure that they do not compromise the accuracy or reliability of the assessments. This commitment to accessibility has allowed Amira to meet the needs of a diverse student population while maintaining high standards of assessment validity. Use with English Learners (ELs): Amira’s Screener is configurable for English language learners, offering the option to screen in English, Spanish, or both. This flexibility is supported by research indicating that assessing literacy skills in both the native language and English can provide a more accurate picture of a student’s abilities. Amira’s ability to offer proctoring in Spanish or administer the assessment in a hybrid mode ensures that ELs receive an equitable assessment experience. In summary, Amira’s voice administration is fully validated and adaptable for all K-6 students, including those with special needs or limited prior education. The system’s design and extensive validation across diverse populations ensure that it is both effective and appropriate for early literacy assessment in various educational settings. The flexibility in administration modes, combined with robust evidence of accuracy and reliability, makes Amira an ideal tool for assessing young learners.

Technical Standards

Classification Accuracy & Cross-Validation Summary

Grade Kindergarten
Grade 1
Grade 2
Grade 3
Grade 4
Grade 5
Grade 6
Classification Accuracy Fall Partially convincing evidence Convincing evidence Convincing evidence Convincing evidence Partially convincing evidence Partially convincing evidence Partially convincing evidence
Classification Accuracy Winter Convincing evidence Convincing evidence Convincing evidence Convincing evidence Partially convincing evidence Partially convincing evidence Data unavailable
Classification Accuracy Spring Convincing evidence Convincing evidence Convincing evidence Convincing evidence Partially convincing evidence Partially convincing evidence Partially convincing evidence
Legend
Full BubbleConvincing evidence
Half BubblePartially convincing evidence
Empty BubbleUnconvincing evidence
Null BubbleData unavailable
dDisaggregated data available

NWEA MAP

Classification Accuracy

Select time of year
Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
The criterion outcome measure used in this study was the NWEA MAP Reading assessment, specifically the RIT scores from the end-of-year administration. This measure was selected for its well-established validity and reliability in assessing students' reading ability across a broad range of skills and grade levels. The NWEA MAP Reading assessment is entirely independent from the screening measure utilized in this study. The two assessments are developed and administered separately, and there is no overlap in test items, scoring rubrics, or administration processes. The only shared characteristic is their common purpose of measuring the construct of reading ability. This shared construct ensures alignment between the screening and outcome measures without introducing direct dependency, thereby supporting the validity of the classification accuracy study.
Do the classification accuracy analyses examine concurrent and/or predictive classification?

Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
Describe how the classification analyses were performed and cut-points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
The degree to which the Amira Screener can accurately identify students who need intensive intervention was evaluated using classification accuracy statistics based on the Amira cut scores, which determine whether students are classified by their ARM scores as at-risk or not-at-risk, and the criterion measure cut scores, which indicate whether students actually need intensive intervention. The classification accuracy analysis was conducted as follows. Each student's (a) ARM score was compared with the candidate ARM cut score, and (b) their score on the criterion measure was compared with the criterion measure cut score; the student was then assigned to one of four designations:
  • TP: Students classified by the screener as "At-Risk" who are actually "At-Risk"
  • FP: Students classified by the screener as "At-Risk" who are actually "Not At-Risk"
  • FN: Students classified by the screener as "Not At-Risk" who are actually "At-Risk"
  • TN: Students classified by the screener as "Not At-Risk" who are actually "Not At-Risk"
The designations were aggregated to obtain total counts in each cell for students in the sample, and the following statistics were computed:
  • Classification Accuracy Rate = (TP + TN) / (total sample size): proportion of the study sample whose classification by the ARM cut scores was consistent with classification by the criterion measure.
  • False Negative (FN) Rate = FN / (FN + TP): proportion of "at-risk" students identified as "not at-risk" by the ARM score.
  • False Positive (FP) Rate = FP / (FP + TN): proportion of "not at-risk" students identified as "at-risk" by the ARM score.
  • Positive Predictive Value (PPV) = TP / (TP + FP): proportion of students identified as "at-risk" by ARM scores who are actually at risk.
  • Negative Predictive Value (NPV) = TN / (TN + FN): proportion of students identified as "not at-risk" by ARM scores who are actually not at risk.
  • Sensitivity = TP / (TP + FN): proportion of "at-risk" students identified as "at-risk" by the ARM score.
  • Specificity = TN / (TN + FP): proportion of "not at-risk" students identified as "not at-risk" by ARM scores.
  • Area Under the Curve (AUC): the area under the receiver operating characteristic (ROC) curve, reported with the lower (AUC-LB) and upper (AUC-UB) bounds of the 95% confidence interval. Confidence intervals were calculated using a 1,000-sample bootstrap method to construct a two-sided interval. AUC measures how well ARM scores separate the study sample into "at-risk" and "not at-risk" categories that match those from the criterion measure cut scores.
The cut points for the Amira Screener were determined to align closely with the identification of students at risk of requiring intensive intervention, as defined by their performance on the criterion measure (NWEA MAP Reading assessment). Specifically, candidate cut scores were tested at the 20th, 25th, and 30th percentiles of the ARM scores for each grade and testing window. These cut points were chosen because they correspond to widely accepted thresholds for identifying students performing below grade-level expectations and, consequently, at risk of academic difficulty. The final cut points were selected based on their ability to maximize classification accuracy while minimizing the rates of false negatives and false positives, as evaluated using standard classification metrics.
Specifically, the cut points that resulted in the highest lower bound on the AUC were selected. By achieving an optimal balance of sensitivity and specificity, the selected cut scores ensure that students most in need of support are accurately identified without unnecessarily flagging those who are not at risk. The alignment between the cut points and students at risk was validated through the classification accuracy analysis, which compared ARM classifications with those based on the criterion measure. High sensitivity values demonstrated the cut points' ability to identify the majority of at-risk students, while high specificity values confirmed that students not at risk were correctly excluded. This approach ensures the cut points serve as a reliable tool for early identification and targeted intervention. The groups contrasted were students who are at risk versus students who are not at risk.
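To make the computation concrete, the following is a minimal Python sketch of how these classification statistics and the bootstrapped AUC confidence interval can be computed from paired screener and criterion scores. It is illustrative only (not Amira's production code); the scores, cut points, and variable names are hypothetical placeholders.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical paired scores: ARM screener scores and NWEA MAP RIT scores.
arm_scores = rng.normal(2.5, 1.0, size=500)                       # screener measure
map_scores = 175 + 8 * arm_scores + rng.normal(0, 6, size=500)    # criterion measure

arm_cut = 2.5   # candidate screener cut score (hypothetical)
map_cut = 183   # criterion cut score at the 20th percentile (hypothetical)

# "At-risk" means scoring below the cut on each measure.
screener_at_risk = arm_scores < arm_cut
actual_at_risk = map_scores < map_cut

tp = np.sum(screener_at_risk & actual_at_risk)
fp = np.sum(screener_at_risk & ~actual_at_risk)
fn = np.sum(~screener_at_risk & actual_at_risk)
tn = np.sum(~screener_at_risk & ~actual_at_risk)

stats = {
    "classification_accuracy": (tp + tn) / (tp + fp + fn + tn),
    "sensitivity": tp / (tp + fn),
    "specificity": tn / (tn + fp),
    "false_positive_rate": fp / (fp + tn),
    "false_negative_rate": fn / (fn + tp),
    "ppv": tp / (tp + fp),
    "npv": tn / (tn + fn),
}

# AUC: a higher "risk score" should indicate higher probability of being at risk,
# so the negated ARM score is used as the ranking score.
auc = roc_auc_score(actual_at_risk, -arm_scores)

# 95% confidence interval via a 1,000-sample bootstrap.
boot_aucs = []
n = len(arm_scores)
for _ in range(1000):
    idx = rng.integers(0, n, size=n)
    if actual_at_risk[idx].min() == actual_at_risk[idx].max():
        continue  # skip degenerate resamples containing only one class
    boot_aucs.append(roc_auc_score(actual_at_risk[idx], -arm_scores[idx]))
auc_lb, auc_ub = np.percentile(boot_aucs, [2.5, 97.5])

print(stats)
print(f"AUC = {auc:.2f} (95% CI {auc_lb:.2f}-{auc_ub:.2f})")
```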
Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
No
If yes, please describe the intervention, what children received the intervention, and how they were chosen.

Cross-Validation

Has a cross-validation study been conducted?
No
If yes,
Select time of year.
Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
Do the cross-validation analyses examine concurrent and/or predictive classification?

Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
Describe how the cross-validation analyses were performed and cut-points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
If yes, please describe the intervention, what children received the intervention, and how they were chosen.

Classification Accuracy - Fall

Evidence Kindergarten Grade 1 Grade 2 Grade 3 Grade 4 Grade 5 Grade 6
Criterion measure NWEA MAP NWEA MAP NWEA MAP NWEA MAP NWEA MAP NWEA MAP NWEA MAP
Cut Points - Percentile rank on criterion measure 20 20 20 20 20 20 20
Cut Points - Performance score on criterion measure 140 158 171 183 192 200 203
Cut Points - Corresponding performance score (numeric) on screener measure -0.10 0.70 1.71 2.50 3.28 4.41 5.41
Classification Data - True Positive (a) 1074 1528 1879 1718 26 322 53
Classification Data - False Positive (b) 1193 869 723 702 475 896 6
Classification Data - False Negative (c) 326 345 459 337 10 114 22
Classification Data - True Negative (d) 4872 5652 5803 4921 1435 2314 20
Area Under the Curve (AUC) 0.79 0.84 0.85 0.86 0.73 0.73 0.74
AUC Estimate’s 95% Confidence Interval: Lower Bound 0.77 0.83 0.84 0.85 0.70 0.71 0.61
AUC Estimate’s 95% Confidence Interval: Upper Bound 0.80 0.85 0.86 0.87 0.78 0.75 0.87
Statistics Kindergarten Grade 1 Grade 2 Grade 3 Grade 4 Grade 5 Grade 6
Base Rate 0.19 0.22 0.26 0.27 0.02 0.12 0.74
Overall Classification Rate 0.80 0.86 0.87 0.86 0.75 0.72 0.72
Sensitivity 0.77 0.82 0.80 0.84 0.72 0.74 0.71
Specificity 0.80 0.87 0.89 0.88 0.75 0.72 0.77
False Positive Rate 0.20 0.13 0.11 0.12 0.25 0.28 0.23
False Negative Rate 0.23 0.18 0.20 0.16 0.28 0.26 0.29
Positive Predictive Power 0.47 0.64 0.72 0.71 0.05 0.26 0.90
Negative Predictive Power 0.94 0.94 0.93 0.94 0.99 0.95 0.48
Sample Kindergarten Grade 1 Grade 2 Grade 3 Grade 4 Grade 5 Grade 6
Date 2023-2024 2023-2024 2023-2024 2023-2024 2023-2024 2023-2024 2023-2024
Sample Size 7465 8394 8864 7678 1946 3646 101
Geographic Representation East North Central (IL, IN)
Mountain (AZ)
South Atlantic (MD)
East North Central (IL)
Mountain (AZ)
Pacific (CA)
South Atlantic (MD)
East North Central (IL)
Mountain (AZ)
Pacific (CA)
South Atlantic (MD)
Mountain (AZ)
Pacific (CA)
South Atlantic (MD)
Mountain (NV)
South Atlantic (SC)
West South Central (OK)
Mountain (NV)
South Atlantic (SC)
West South Central (OK)
West South Central (LA)
Male 41.8% 47.7% 48.4%        
Female 41.1% 45.9% 46.4%        
Other              
Gender Unknown              
White, Non-Hispanic 49.9% 25.9% 15.3%        
Black, Non-Hispanic 22.1% 19.8% 19.9%        
Hispanic 11.5% 10.4% 9.7%        
Asian/Pacific Islander 4.2% 3.5% 3.5%        
American Indian/Alaska Native              
Other 6.8% 10.1% 2.9%        
Race / Ethnicity Unknown              
Low SES 6.7% 6.3% 6.5%        
IEP or diagnosed disability 9.4% 11.6% 12.2%        
English Language Learner 13.7% 13.3% 13.0%        

Classification Accuracy - Winter

Evidence Kindergarten Grade 1 Grade 2 Grade 3 Grade 4 Grade 5
Criterion measure NWEA MAP NWEA MAP NWEA MAP NWEA MAP NWEA MAP NWEA MAP
Cut Points - Percentile rank on criterion measure 20 20 20 20 20 20
Cut Points - Performance score on criterion measure 140 158 171 183 192 200
Cut Points - Corresponding performance score (numeric) on screener measure 0.21 0.94 1.97 2.92 3.58 4.71
Classification Data - True Positive (a) 1111 926 1400 1563 64 173
Classification Data - False Positive (b) 1317 850 875 574 545 1045
Classification Data - False Negative (c) 218 172 313 319 16 42
Classification Data - True Negative (d) 5497 6392 6138 7655 1320 2385
Area Under the Curve (AUC) 0.82 0.86 0.85 0.87 0.75 0.75
AUC Estimate’s 95% Confidence Interval: Lower Bound 0.81 0.85 0.84 0.85 0.72 0.72
AUC Estimate’s 95% Confidence Interval: Upper Bound 0.83 0.88 0.86 0.88 0.78 0.77
Statistics Kindergarten Grade 1 Grade 2 Grade 3 Grade 4 Grade 5
Base Rate 0.16 0.13 0.20 0.19 0.04 0.06
Overall Classification Rate 0.81 0.88 0.86 0.91 0.71 0.70
Sensitivity 0.84 0.84 0.82 0.83 0.80 0.80
Specificity 0.81 0.88 0.88 0.93 0.71 0.70
False Positive Rate 0.19 0.12 0.12 0.07 0.29 0.30
False Negative Rate 0.16 0.16 0.18 0.17 0.20 0.20
Positive Predictive Power 0.46 0.52 0.62 0.73 0.11 0.14
Negative Predictive Power 0.96 0.97 0.95 0.96 0.99 0.98
Sample Kindergarten Grade 1 Grade 2 Grade 3 Grade 4 Grade 5
Date 2023-2024 2023-2024 2023-2024 2023-2024 2023-2024 2023-2024
Sample Size 8143 8340 8726 10111 1945 3645
Geographic Representation East North Central (IL, IN)
Mountain (AZ)
South Atlantic (MD)
East North Central (IL)
Mountain (AZ)
Pacific (CA)
South Atlantic (MD)
East North Central (IL)
Mountain (AZ)
Pacific (CA)
South Atlantic (MD)
Mountain (AZ)
Pacific (CA)
South Atlantic (MD)
Mountain (NV)
South Atlantic (SC)
West South Central (OK)
Mountain (NV)
South Atlantic (SC)
West South Central (OK)
Male 38.3% 48.0% 49.2%      
Female 37.7% 46.2% 47.1%      
Other            
Gender Unknown            
White, Non-Hispanic 45.7% 26.0% 15.5%      
Black, Non-Hispanic 20.3% 19.9% 20.2%      
Hispanic 10.6% 10.4% 9.8%      
Asian/Pacific Islander 3.9% 3.5% 3.6%      
American Indian/Alaska Native            
Other 6.2% 10.2% 2.9%      
Race / Ethnicity Unknown            
Low SES 6.1% 6.4% 6.6%      
IEP or diagnosed disability 8.7% 11.6% 12.4%      
English Language Learner 12.6% 13.4% 13.2%      

Classification Accuracy - Spring

Evidence Kindergarten Grade 1 Grade 2 Grade 3 Grade 4 Grade 5 Grade 6
Criterion measure NWEA MAP NWEA MAP NWEA MAP NWEA MAP NWEA MAP NWEA MAP NWEA MAP
Cut Points - Percentile rank on criterion measure 20 20 20 20 20 20 20
Cut Points - Performance score on criterion measure 140 158 171 183 192 200 203
Cut Points - Corresponding performance score (numeric) on screener measure 0.47 1.43 2.19 3.19 3.82 4.97 5.97
Classification Data - True Positive (a) 1096 1210 1497 1290 88 225 63
Classification Data - False Positive (b) 1209 746 794 545 512 666 7
Classification Data - False Negative (c) 208 289 328 238 25 83 13
Classification Data - True Negative (d) 5572 6145 6190 5579 1291 1672 18
Area Under the Curve (AUC) 0.83 0.85 0.85 0.88 0.75 0.72 0.77
AUC Estimate’s 95% Confidence Interval: Lower Bound 0.82 0.84 0.84 0.87 0.72 0.70 0.63
AUC Estimate’s 95% Confidence Interval: Upper Bound 0.84 0.86 0.87 0.89 0.77 0.75 0.89
Statistics Kindergarten Grade 1 Grade 2 Grade 3 Grade 4 Grade 5 Grade 6
Base Rate 0.16 0.18 0.21 0.20 0.06 0.12 0.75
Overall Classification Rate 0.82 0.88 0.87 0.90 0.72 0.72 0.80
Sensitivity 0.84 0.81 0.82 0.84 0.78 0.73 0.83
Specificity 0.82 0.89 0.89 0.91 0.72 0.72 0.72
False Positive Rate 0.18 0.11 0.11 0.09 0.28 0.28 0.28
False Negative Rate 0.16 0.19 0.18 0.16 0.22 0.27 0.17
Positive Predictive Power 0.48 0.62 0.65 0.70 0.15 0.25 0.90
Negative Predictive Power 0.96 0.96 0.95 0.96 0.98 0.95 0.58
Sample Kindergarten Grade 1 Grade 2 Grade 3 Grade 4 Grade 5 Grade 6
Date 2023-2024 2023-2024 2023-2024 2023-2024 2023-2024 2023-2024 2023-2024
Sample Size 8085 8390 8809 7652 1916 2646 101
Geographic Representation East North Central (IL, IN)
Mountain (AZ)
South Atlantic (MD)
East North Central (IL)
Mountain (AZ)
Pacific (CA)
South Atlantic (MD)
East North Central (IL)
Mountain (AZ)
Pacific (CA)
South Atlantic (MD)
Mountain (AZ)
Pacific (CA)
South Atlantic (MD)
Mountain (NV)
South Atlantic (SC)
West South Central (OK)
Mountain (NV)
South Atlantic (SC)
West South Central (OK)
West South Central (LA)
Male 38.6% 47.7% 48.7%        
Female 37.9% 45.9% 46.7%        
Other              
Gender Unknown              
White, Non-Hispanic 46.1% 25.9% 15.4%        
Black, Non-Hispanic 20.4% 19.8% 20.0%        
Hispanic 10.7% 10.4% 9.7%        
Asian/Pacific Islander 3.9% 3.5% 3.5%        
American Indian/Alaska Native              
Other 6.3% 10.1% 2.9%        
Race / Ethnicity Unknown              
Low SES 6.2% 6.3% 6.5%        
IEP or diagnosed disability 8.7% 11.6% 12.3%        
English Language Learner 12.7% 13.3% 13.1%        

Reliability

Grade Kindergarten
Grade 1
Grade 2
Grade 3
Grade 4
Grade 5
Grade 6
Rating Convincing evidence Convincing evidence Convincing evidence Convincing evidence Convincing evidence Convincing evidence Convincing evidence
Legend
Full BubbleConvincing evidence
Half BubblePartially convincing evidence
Empty BubbleUnconvincing evidence
Null BubbleData unavailable
dDisaggregated data available
*Offer a justification for each type of reliability reported, given the type and purpose of the tool.
Reliability refers to the relative stability with which a test measures the same skills across minor differences in conditions. Two types of reliability are reported in the table below: parallel forms reliability and Cronbach’s coefficient alpha. Parallel forms reliability is crucial for ensuring the consistency of the Amira Progress Monitoring assessment. This analysis measures the consistency of results across different assessment forms, which is essential for accurately tracking student growth, since students receive a different form each time they take a progress monitoring assessment. By confirming that the forms are equivalent, we can ensure that any observed improvements in student scores are due to actual learning, not to differences in the complexity or difficulty of the test forms. The coefficient reported is the average correlation among alternate forms of the measure; high alternate-form reliability coefficients suggest that the multiple forms are measuring the same construct. Coefficient alpha, commonly known as Cronbach's alpha, is a measure of internal consistency reliability used widely in education research and other fields. It estimates the proportion of total variance in a set of scores that is attributable to true score variance, reflecting the reliability of the measurement.
*Describe the sample(s), including size and characteristics, for each reliability analysis conducted.
The samples used to establish reliability include students who tested in the 2023-2024 school year. Both samples encompassed dozens of districts across the country for each grade. These districts were selected to emulate the diversity and variation of the national population of students and are representative across a variety of dimensions, including school type, socioeconomic status, geographic region, gender, race, and ethnicity. Students in the parallel forms reliability sample each took two different Progress Monitoring forms within the same one-week window. Students in the internal consistency analyses were those who had taken at least 5 forms (instances) of Progress Monitoring across the 2023-2024 school year.
*Describe the analysis procedures for each reported type of reliability.
To assess parallel forms reliability, two forms of the assessment were administered to the same group of students within one week of each other. The scores obtained on the two forms were then correlated to assess the degree of consistency between them. We measure these correlations using Pearson’s correlation coefficient, a measure of the strength of the linear relationship between two variables. The practical significance of the reliability coefficients was evaluated as follows: poor (0−0.39), adequate (0.40−0.59), good (0.60−0.79), and excellent (0.80−1.0). These thresholds are arbitrary but conventionally used, and they provide a useful heuristic for interpreting the reliability data. Confidence intervals were then calculated for the correlation coefficients computed across distinct pairs of forms. To obtain an estimate of internal consistency reliability, Cronbach's alphas were calculated for students who had taken at least 5 forms of the Progress Monitoring assessment over the year. The 95% confidence interval of each reliability metric is computed using the bootstrap method, in which 1,000 samples with replacement are drawn from the data and the 2.5% and 97.5% quantiles are calculated and reported.
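As an illustration of the procedures described above, the sketch below computes a parallel forms Pearson correlation and Cronbach's alpha over a students-by-forms score matrix, with a bootstrap confidence interval. It is a simplified, hypothetical Python example; the data and variable names are placeholders rather than Amira's actual analysis pipeline.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)

# --- Parallel forms reliability (hypothetical data) ---
# Scores from two Progress Monitoring forms taken by the same students
# within a one-week window.
true_ability = rng.normal(0, 1, size=400)
form_a = true_ability + rng.normal(0, 0.4, size=400)
form_b = true_ability + rng.normal(0, 0.4, size=400)
r, _ = pearsonr(form_a, form_b)
print(f"Parallel forms r = {r:.2f}")

# --- Cronbach's alpha (internal consistency) ---
def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: students x forms matrix of scale scores."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Students with at least 5 Progress Monitoring forms across the year (hypothetical).
forms = np.column_stack([true_ability + rng.normal(0, 0.4, size=400) for _ in range(5)])
alpha = cronbach_alpha(forms)

# 95% CI via bootstrap: resample students 1,000 times, take the 2.5% / 97.5% quantiles.
n = len(forms)
boot = [cronbach_alpha(forms[rng.integers(0, n, size=n)]) for _ in range(1000)]
lb, ub = np.percentile(boot, [2.5, 97.5])
print(f"Cronbach's alpha = {alpha:.2f} (95% CI {lb:.2f}-{ub:.2f})")
```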

*In the table(s) below, report the results of the reliability analyses described above (e.g., internal consistency or inter-rater reliability coefficients).

Type of Subgroup Informant Age / Grade Test or Criterion n Median Coefficient 95% Confidence Interval
Lower Bound
95% Confidence Interval
Upper Bound
Results from other forms of reliability analysis not compatible with above table format:
Manual cites other published reliability studies:
No
Provide citations for additional published studies.
Do you have reliability data that are disaggregated by gender, race/ethnicity, or other subgroups (e.g., English language learners, students with disabilities)?
No

If yes, fill in data for each subgroup with disaggregated reliability data.

Type of Subgroup Informant Age / Grade Test or Criterion n Median Coefficient 95% Confidence Interval
Lower Bound
95% Confidence Interval
Upper Bound
Results from other forms of reliability analysis not compatible with above table format:
Manual cites other published reliability studies:
No
Provide citations for additional published studies.

Validity

Grade Kindergarten
Grade 1
Grade 2
Grade 3
Grade 4
Grade 5
Grade 6
Rating Convincing evidence Convincing evidence Convincing evidence Convincing evidence Convincing evidence Convincing evidence Convincing evidence
Legend
Full BubbleConvincing evidence
Half BubblePartially convincing evidence
Empty BubbleUnconvincing evidence
Null BubbleData unavailable
dDisaggregated data available
*Describe each criterion measure used and explain why each measure is appropriate, given the type and purpose of the tool.
Concurrent validity measures how well Amira scores correlate with the scores of another test that is administered at the same time and is already established as valid for measuring the same construct. Predictive validity refers to the extent to which scores on the Amira assessment can accurately predict future performance on a related outcome or criterion. The external assessments used in these studies are the i-Ready Reading Diagnostic and the NWEA MAP Reading assessment. Both are nationally normed, computer-adaptive measures of reading ability that are widely used in many states and have established validity studies of their own.
*Describe the sample(s), including size and characteristics, for each validity analysis conducted.
The samples include students who tested in the 2022-2023 school year, drawn from hundreds of districts across the country. These districts were selected to emulate the diversity and variation of the national population of students and are representative across a variety of dimensions, including school type, socioeconomic status, geographic region, gender, race, and ethnicity. Sample sizes for each validity study vary by testing window, grade, and criterion measure, ranging from 988 to 5,643.
*Describe the analysis procedures for each reported type of validity.
Concurrent validity was established by correlating Amira’s Reading Mastery (ARM) scores from students in grades K through 6 who took both an Amira assessment and the external measure within the same two-week window of one another. The predictive validity of Amira was examined by correlating Amira’s assessment scores taken during the beginning of the year (Fall) window to scores from external measures taken at the end of the school year (Spring). In both forms of validity, the relationship between Amira’s scores and the external criterion measure was evaluated using Pearson’s correlation coefficient. Coefficients were calculated using bootstrap sampling across 100 random samples, and median correlation coefficients as well as 95% confidence intervals on the correlation coefficients are reported. All median and lower-bound correlation coefficients are 0.70 or higher, indicating a strong positive linear relationship between Amira and the external measure.
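The short Python sketch below illustrates, with hypothetical data, the bootstrap procedure described above: Pearson correlations between Amira (ARM) scores and an external criterion are computed across 100 random bootstrap samples, and the median coefficient and a 95% confidence interval are reported. All data and variable names are placeholders, not the actual study data.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)

# Hypothetical paired scores: Fall ARM scores and Spring criterion scores for the
# same students (predictive validity); for concurrent validity the two measures
# would instead be taken within the same two-week window.
arm_fall = rng.normal(2.0, 1.0, size=1200)
criterion_spring = 180 + 9 * arm_fall + rng.normal(0, 7, size=1200)

n = len(arm_fall)
coeffs = []
for _ in range(100):                        # 100 bootstrap samples
    idx = rng.integers(0, n, size=n)
    r, _ = pearsonr(arm_fall[idx], criterion_spring[idx])
    coeffs.append(r)

median_r = np.median(coeffs)
lb, ub = np.percentile(coeffs, [2.5, 97.5])
print(f"median r = {median_r:.2f}, 95% CI [{lb:.2f}, {ub:.2f}]")
```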

*In the table below, report the results of the validity analyses described above (e.g., concurrent or predictive validity, evidence based on response processes, evidence based on internal structure, evidence based on relations to other variables, and/or evidence based on consequences of testing), and the criterion measures.

Type of Subgroup Informant Age / Grade Test or Criterion n Median Coefficient 95% Confidence Interval
Lower Bound
95% Confidence Interval
Upper Bound
Results from other forms of validity analysis not compatible with above table format:
Manual cites other published validity studies:
Yes
Provide citations for additional published studies.
https://www.amiralearning.com/amira-technical-guide.html
Rice, M. L., & Hoffman, L. (2015). Predicting vocabulary growth in children with and without specific language impairment: A longitudinal study from 2;6 to 21 years of age. Journal of Speech, Language, and Hearing Research, 58(2), 345–359.
Boscardin, C. K., Muthén, B., Francis, D. J., & Baker, E. L. (2008). Early identification of reading difficulties using heterogeneous developmental trajectories. Journal of Educational Psychology, 100(1), 192.
Describe the degree to which the provided data support the validity of the tool.
All results show a correlation of 0.7 or higher (strong correlation) between Amira’s Progress Monitoring scores and external criterion scores, so the provided data support the validity of the tool to a high degree.
Do you have validity data that are disaggregated by gender, race/ethnicity, or other subgroups (e.g., English language learners, students with disabilities)?
No

If yes, fill in data for each subgroup with disaggregated validity data.

Type of Subgroup Informant Age / Grade Test or Criterion n Median Coefficient 95% Confidence Interval
Lower Bound
95% Confidence Interval
Upper Bound
Results from other forms of validity analysis not compatible with above table format:
Manual cites other published validity studies:
No
Provide citations for additional published studies.

Bias Analysis

Grade Kindergarten
Grade 1
Grade 2
Grade 3
Grade 4
Grade 5
Grade 6
Rating Provided Provided Provided Provided Provided Provided Provided
Have you conducted additional analyses related to the extent to which your tool is or is not biased against subgroups (e.g., race/ethnicity, gender, socioeconomic status, students with disabilities, English language learners)? Examples might include Differential Item Functioning (DIF) or invariance testing in multiple-group confirmatory factor models.
Yes
If yes,
a. Describe the method used to determine the presence or absence of bias:
We conducted a Differential Item Functioning (DIF) analysis using the Zumbo & Thomas (ZT) classification system with logistic regression implemented in the difR package in R software. The analysis examined 2,598 items across 10 subtests spanning grades Pre-K through 8, using a sample of 65,000 students from three large districts across three states. Items were classified using Nagelkerke's R² effect size thresholds: A items (negligible DIF) ≤ 0.13, B items (slight to moderate DIF) > 0.13 but ≤ 0.26, and C items (moderate to large DIF) > 0.26. To ensure robust DIF detection, items with fewer than 100 responses were excluded from analysis. DIF was examined using both overall reading ability scores and subscale scores as matching criteria to validate findings across different ability matching approaches. Following statistical analysis, curriculum experts reviewed all B and C flagged items to determine whether observed DIF represented construct-irrelevant variance (bias) or legitimate construct-related performance differences.
b. Describe the subgroups for which bias analyses were conducted:
Bias analyses were conducted across the following subgroups: (1) gender (male and female students) and (2) race/ethnicity (Hispanic/Latino, African American/Black, and White students). The sample was strategically selected to provide sufficient demographic diversity and adequate sample sizes for reliable DIF detection across these key demographic groups, which are central to educational equity considerations in academic screening.
c. Describe the results of the bias analyses conducted, including data and interpretative statements. Include magnitude of effect (if available) if bias has been identified.
The DIF analysis revealed exceptionally low levels of differential item functioning across all demographic subgroups.
Results summary:
  • 99% of items were classified as A (negligible DIF)
  • <1% were classified as B (slight to moderate DIF)
  • <1% were classified as C (moderate to large DIF)
Expert review results: All flagged items underwent curriculum expert review for construct-irrelevant variance. Items were evaluated for evidence of bias versus legitimate construct-related performance variations. A small number of items (approximately 0.20%) were identified as exhibiting bias and were removed from the pool.
Interpretation: The removal of biased items ensures that the final assessment maintains technical rigor and equity across diverse student populations. The extremely low percentage of items requiring removal (0.20%) demonstrates that the vast majority of items function equivalently across demographic subgroups, providing strong evidence for test fairness.
Construct validity support: Pearson correlations between overall and subscale scores ranged from 0.72-0.90 (R² = 0.52-0.81), indicating strong construct coherence and supporting the validity of our matching criteria approach.
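For context, the following is a simplified Python sketch of logistic regression DIF with a Zumbo & Thomas style effect-size classification for a single item. The analysis described above was conducted with the difR package in R; this sketch is not that code. It fits a compact model (matching ability only) and an augmented model (ability, group, and their interaction), computes Nagelkerke's R² for each, and classifies the item by the R² difference against the thresholds listed above. The item responses and group data are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)

def nagelkerke_r2(result, n):
    """Nagelkerke's R^2 from a fitted statsmodels Logit result."""
    cox_snell = 1 - np.exp(2 * (result.llnull - result.llf) / n)
    max_cox_snell = 1 - np.exp(2 * result.llnull / n)
    return cox_snell / max_cox_snell

# Hypothetical data for one item: 1/0 item responses, a matching ability score
# (e.g., overall reading ability), and a focal/reference group indicator.
n = 2000
ability = rng.normal(0, 1, size=n)
group = rng.integers(0, 2, size=n)          # 0 = reference group, 1 = focal group
logit = 1.2 * ability - 0.2                 # this simulated item has no built-in DIF
response = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Compact model: ability only. Augmented model: ability + group + interaction.
X_compact = sm.add_constant(np.column_stack([ability]))
X_augmented = sm.add_constant(np.column_stack([ability, group, ability * group]))

r2_compact = nagelkerke_r2(sm.Logit(response, X_compact).fit(disp=0), n)
r2_augmented = nagelkerke_r2(sm.Logit(response, X_augmented).fit(disp=0), n)
delta_r2 = r2_augmented - r2_compact

# Zumbo & Thomas effect-size classification of the R^2 difference.
if delta_r2 <= 0.13:
    label = "A (negligible DIF)"
elif delta_r2 <= 0.26:
    label = "B (slight to moderate DIF)"
else:
    label = "C (moderate to large DIF)"
print(f"delta R^2 = {delta_r2:.4f} -> {label}")
```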

Data Collection Practices

Most tools and programs evaluated by the NCII are branded products which have been submitted by the companies, organizations, or individuals that disseminate these products. These entities supply the textual information shown above, but not the ratings accompanying the text. NCII administrators and members of our Technical Review Committees have reviewed the content on this page, but NCII cannot guarantee that this information is free from error or reflective of recent changes to the product. Tools and programs have the opportunity to be updated annually or upon request.