Amira ISIP
Reading
Summary
Amira has two essential jobs: (1) provide supplemental instructional materials and student data to teachers, bridging science of reading professional development to classroom execution, and (2) engage students in the time on the tongue needed to close identified reading skill gaps. Amira’s personalized learning software listens to students read aloud, identifies dyslexic students, continuously assesses reading mastery, and delivers individualized tutoring. Our solution empowers the learning community, from students to paraprofessionals, teachers, parents, and administrators, with AI-powered tools for real-time results. Using Amira, teachers can continuously assess and monitor student progress in oral language, phonological awareness, phonics, fluency, vocabulary, and comprehension. Amira optimizes instructional strategies and resource allocation to provide a comprehensive solution. Any time a student reads with Amira, all stakeholders receive up-to-the-moment feedback that is critical for teachers to make the right intervention decisions.
- Where to Obtain:
- Amira Learning, Inc.
- orders@amiralearning.com
- 5214F Diamond Heights Blvd #3255, San Francisco, CA 94131
- 650-455-4380
- www.amiralearning.com
- Initial Cost:
- $7.50 per student
- Replacement Cost:
- $7.50 per student per year
- Included in Cost:
- Amira is a comprehensive and holistic assessment, instruction, and tutoring suite. Assess ($7.50 per student per year): benchmark, dyslexia screening, and ongoing progress monitoring to diagnose skill gaps. Instruct ($12.50 per student per year): lesson plans prescribed and delivered based on targeted needs that can be linked to each district’s high-quality instructional materials. Tutor ($15 per student per year): combined literacy instruction and practice during school or through extended learning opportunities. Amira Full Suite (includes Assess, Instruct, and Tutor): $35 per student per year. These student license costs include the screening tool software, access for students and district/school personnel, virtual (live and asynchronous) professional development, and foundational data for implementation and monitoring. Bulk discounts are available depending on the number of student licenses needed by states/districts.
- Amira software is accessibility ready, adhering to the Web Content Accessibility Guidelines (WCAG 2.1) Level AA and following best practices for UX development, ensuring that content is accessible and usable for all users. Amira is also SOC 2 Type 2 certified. All tasks in Amira’s English suite of assessments and practice are also available in Spanish. Details on Amira’s accommodations are available on pages 16-31 of the Teacher Manual (2024) at https://go.amiralearning.com/hubfs/Assessment/Amira%20Teacher%20Manual%20(2024).pdf.
- Training Requirements:
- Administrators are encouraged to attend or complete the online asynchronous 45-minute training, “Getting Started with Amira,” prior to initial administration. Following administration, administrators may access self-paced asynchronous training, including understanding data, on Amira Academy. Amira handles most of the tasks that are typically difficult for teachers. The software acts as a proctor, guiding the student through each task; it serves as a highly adept support technician, identifying hardware and software issues that may impact the assessment process; and it produces consistent, comprehensive scoring of the items, providing the teacher with a framework for evaluating outputs.
- Qualified Administrators:
- No minimum qualifications specified.
- Access to Technical Support:
- Assessment Format:
-
- Scoring Time:
- Scoring is automatic OR
- 0 minutes per student
- Scores Generated:
- Raw score
- Standard score
- Percentile score
- Grade equivalents
- IRT-based score
- Developmental benchmarks
- Developmental cut points
- Equated
- Lexile score
- Error analysis
- Composite scores
- Subscale/subtest scores
- Administration Time:
- 20 minutes per student
- Scoring Method:
- Manually (by hand)
- Automatically (computer-scored)
- Technology Requirements:
- Computer or tablet
- Internet connection
- Accommodations:
- Amira software is accessibility ready, adhering to the Web Content Accessibility Guidelines (WCAG 2.1) Level AA and following best practices for UX development, ensuring that content is accessible and usable for all users. Amira is also SOC 2 Type 2 certified. All tasks in Amira’s English suite of assessments and practice are also available in Spanish. Details on Amira’s accommodations are available on pages 16-31 of the Teacher Manual (2024) at https://go.amiralearning.com/hubfs/Assessment/Amira%20Teacher%20Manual%20(2024).pdf.
Descriptive Information
- Please provide a description of your tool:
- Amira has two essential jobs: (1) provide supplemental instructional materials and student data to teachers, bridging science of reading professional development to classroom execution, and (2) engage students in the time on the tongue needed to close identified reading skill gaps. Amira’s personalized learning software listens to students read aloud, identifies dyslexic students, continuously assesses reading mastery, and delivers individualized tutoring. Our solution empowers the learning community, from students to paraprofessionals, teachers, parents, and administrators, with AI-powered tools for real-time results. Using Amira, teachers can continuously assess and monitor student progress in oral language, phonological awareness, phonics, fluency, vocabulary, and comprehension. Amira optimizes instructional strategies and resource allocation to provide a comprehensive solution. Any time a student reads with Amira, all stakeholders receive up-to-the-moment feedback that is critical for teachers to make the right intervention decisions.
ACADEMIC ONLY: What skills does the tool screen?
- Please describe specific domain, skills or subtests:
- BEHAVIOR ONLY: Which category of behaviors does your tool target?
-
- BEHAVIOR ONLY: Please identify which broad domain(s)/construct(s) are measured by your tool and define each sub-domain or sub-construct.
Acquisition and Cost Information
Administration
- Are norms available?
- Yes
- Are benchmarks available?
- Yes
- If yes, how many benchmarks per year?
- 3
- If yes, for which months are benchmarks available?
- Aug-Nov, Dec-March, April-July; Amira ISIP offers unlimited additional administrations of progress monitoring throughout the school year in addition to benchmark assessment.
- BEHAVIOR ONLY: Can students be rated concurrently by one administrator?
- If yes, how many students can be rated concurrently?
Training & Scoring
Training
- Is training for the administrator required?
- Yes
- Describe the time required for administrator training, if applicable:
- Administrators are encouraged to attend or complete the online asynchronous 45-minute training, “Getting Started with Amira,” prior to initial administration. Following administration, administrators may access self-paced asynchronous training, including understanding data, on Amira Academy. Amira handles most of the tasks that are typically difficult for teachers. The software acts as a proctor, guiding the student through each task; it serves as a highly adept support technician, identifying hardware and software issues that may impact the assessment process; and it produces consistent, comprehensive scoring of the items, providing the teacher with a framework for evaluating outputs.
- Please describe the minimum qualifications an administrator must possess.
- No minimum qualifications
- Are training manuals and materials available?
- Yes
- Are training manuals/materials field-tested?
- Yes
- Are training manuals/materials included in cost of tools?
- Yes
- If No, please describe training costs:
- Can users obtain ongoing professional and technical support?
- Yes
- If Yes, please describe how users can obtain support:
Scoring
- Do you provide basis for calculating performance level scores?
- Yes
- Does your tool include decision rules?
- No
- If yes, please describe.
- Can you provide evidence in support of multiple decision rules?
- No
- If yes, please describe.
- Please describe the scoring structure. Provide relevant details such as the scoring format, the number of items overall, the number of items per subscale, what the cluster/composite score comprises, and how raw scores are calculated.
- Amira leverages machine learning and AI to automatically score every student interaction, including rapid automatized naming (RAN), letter sound fluency, blending, word reading, word part manipulation, spelling, listening comprehension, oral reading, multiple-choice questions, open-ended oral responses, and open-ended written responses, among others. The cornerstone of Amira’s scoring system is its ability to score these varied interactions automatically and accurately. Where necessary, Amira incorporates rubric-based scoring to further enhance precision, particularly for open-ended responses such as Amira’s dialogue-based comprehension questions. Because each activity is scored instantly and automatically upon completion, Amira can provide immediate feedback to students and maintain a continuously updated profile of each student's progress, achievements, and instructional needs; all teacher-facing reports are updated immediately, so educators always have access to an up-to-date profile. An integral feature of Amira’s scoring system is its mechanisms for quality and equity assurance. One mechanism is a “meta-analysis” conducted by the machine learning models, which identifies discrepancies that may indicate a misrepresentation of a student’s true abilities and flags activities that may not represent a student’s best effort. A second mechanism is a set of automated data integrity tests that ensure data quality and consistency across the locations where data is stored. The third, and critical, mechanism is that recordings of activities are always available to educators, who can click on an activity in any of Amira’s reports to bring up the recording and adjust scores as needed. Amira’s scoring system rigorously and universally adheres to the following principles: (a) the scoring of all items is visible and transparent to teachers/proctors; there is no black box, and educators can see every item taken and every Amira score; (b) educators can listen to student responses, enabling 100% auditability of Amira’s scoring; and (c) educators/teachers/proctors can correct Amira’s scoring manually. If desired, a state/district can give educators the final word on the scoring of any student’s assessment, with the ability to override Amira’s scoring item by item.
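To make this workflow concrete, the following is a minimal illustrative sketch, not Amira’s actual data model or API; all names, fields, and values are hypothetical. It shows a scored-activity record that keeps the automated score, a flag for activities that may not reflect best effort, a link to the recording, and an educator override that takes precedence in reports.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScoredActivity:
    """Hypothetical record for one automatically scored activity."""
    student_id: str
    activity_type: str                       # e.g., "oral_reading", "letter_sound_fluency"
    auto_score: float                        # score assigned by the automated models
    recording_url: str                       # recording available for educator review
    flagged: bool = False                    # set when the attempt may not reflect best effort
    educator_score: Optional[float] = None   # manual override, if any

    @property
    def final_score(self) -> float:
        # An educator's manual score, when present, takes precedence in reports.
        return self.educator_score if self.educator_score is not None else self.auto_score

    def override(self, new_score: float) -> None:
        # Educator listens to the recording and corrects the score.
        self.educator_score = new_score

# Usage: a flagged, automatically scored activity is reviewed and corrected.
activity = ScoredActivity("student-123", "oral_reading", auto_score=42.0,
                          recording_url="https://example.invalid/recording.wav", flagged=True)
activity.override(45.0)
print(activity.final_score)  # 45.0
```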
- Describe the tool’s approach to screening, samples (if applicable), and/or test format, including steps taken to ensure that it is appropriate for use with culturally and linguistically diverse populations and students with disabilities.
- Amira’s voice-administered system is fully validated and adaptable for students in kindergarten through 6th grade, ensuring it meets the requirements for effective early literacy assessment across various grade levels and student needs. The system’s validation process includes rigorous testing to ensure that it provides accurate and reliable results, even when administered to young learners, including those with special needs or limited prior education. Amira’s mode of administration is designed to be flexible and inclusive, allowing for both digital, AI-driven assessment and partially teacher-facilitated modes. This adaptability ensures that the Screener is appropriate for students in early elementary grades, including those with specific educational needs. The following evidence demonstrates how Amira ISIP has been validated across diverse student populations:
Voice Administration Validation: Amira’s voice-enabled assessment system has been validated specifically for students in K-6. The model training and evaluation process leverages an extensive dataset of over 10,000 hours of children’s speech, carefully balanced to ensure representation across diverse populations reflective of national demographics along dimensions of gender, race/ethnicity, socioeconomic status, ELL/MLL status, and accent/dialect group. Validation studies confirmed that the system accurately captures and assesses reading performance, with a classification accuracy of over 90% across these grade levels. This validation is crucial for ensuring that the assessments provide reliable data, even in a fully digital, AI-driven mode where the system autonomously guides students through tasks.
Consistency and Standardization: The consistency and standardization provided by Amira’s AI reduce the risk of bias and human error, aligning with best practices for large-scale assessments (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014). Amira’s machine-proctored model is supported by extensive research demonstrating that AI-driven assessments can offer greater consistency and reliability than human-proctored ones. Human proctoring introduces variability due to differences in proctor training and execution, which can lead to measurement error and bias (Kane, 2013). In contrast, Amira’s AI provides a uniform testing experience, ensuring that every student is assessed under the same conditions, thereby upholding the integrity of the results (Wilson & Draney, 2002). At the individual item scoring level, the model currently in production has 96% agreement with human expert judgment. To benchmark this number, human inter-rater reliability metrics for the same assessment range between 96-97%. At the aggregate assessment score level, there is an extremely high correlation (0.95-0.98, depending on the grade level) between the model scores and the same assessments scored by human experts.
Proctoring and Administration for Kindergarteners: In both English and Spanish, Amira has proctored and administered millions of screenings. The system has been used effectively in thousands of classrooms, including those in more than 1,200 districts across all 50 states. The proctoring model for kindergarteners emphasizes small group administration, with a teacher serving as the meta-proctor. In this model, the teacher helps students log in, creating a staggered start that aids in managing the process. Once logged in, Amira leads students through a simple, voice-based dialog to ensure the environment is test-ready. This model allows Amira to function as an effective teacher’s assistant, handling most situations independently and alerting the teacher only when necessary.
Adaptability to Student Needs: The Screener can be administered in various modes to accommodate different student needs. For younger students or those requiring additional support, Amira can operate in a small group, partially teacher-facilitated mode. In this setup, the teacher facilitates the assessment while Amira provides automated guidance and feedback. Research supports this approach, indicating that younger children and those with specific needs benefit from the combination of human interaction and AI-driven assessment, which offers reassurance and allows for immediate intervention.
Developmental Alignment and Task Appropriateness: Amira’s tasks are carefully aligned with the developmental milestones of K-6 students, ensuring the assessments are both appropriate and effective for identifying reading difficulties, including dyslexia risk, at each stage. For example, tasks in kindergarten focus on phonological awareness, while those in 2nd grade emphasize reading fluency. This developmental alignment ensures that the Screener is sensitive to the literacy skills typical of each age group, providing a fair and relevant assessment regardless of the student’s prior education.
Evidence of Effectiveness in Diverse Settings: Amira has been implemented and validated in various educational settings, including urban schools in large states such as California and Oklahoma, where it was used in whole-classroom settings. Across the country, Amira has been administered in over 500,000 sessions, spanning diverse environments such as urban, suburban, and rural schools. These include Title I schools, charter schools, and districts serving large populations of English language learners and students with disabilities. The system’s predictive validity has been confirmed in these environments, demonstrating that it can provide accurate assessments even when administered to entire classrooms of young students. This large-scale validation ensures that Amira’s fully digital mode is not only effective but also scalable for broad application across different types of schools and student populations.
Accommodations for Special Needs: Amira employs a Universal Design for Learning (UDL) approach to ensure that the Screener is accessible to all students, including those with disabilities. The system provides a range of accommodations, including the option for a fully proctored, one-on-one assessment or the use of paper-based alternatives such as the TPRI or Tejas Lee for students who may not be well suited to digital assessments. To date, Amira has been validated with over 7,000 students, including those requiring specific accommodations, with approximately 20% utilizing options such as one-on-one proctoring or bilingual support. This extensive validation confirms that the system reliably accommodates and accurately assesses students with various needs, including those with visual or auditory impairments, English language learners, and students with individualized education programs (IEPs). Amira’s UDL approach includes features that are WCAG 2.1 Level AA compliant, ensuring accessibility for students with disabilities. These accommodations have been tested and validated across multiple studies to ensure that they do not compromise the accuracy or reliability of the assessments. This commitment to accessibility has allowed Amira to meet the needs of a diverse student population while maintaining high standards of assessment validity.
Use with English Learners (ELs): Amira’s Screener is configurable for English language learners, offering the option to screen in English, Spanish, or both. This flexibility is supported by research indicating that assessing literacy skills in both the native language and English can provide a more accurate picture of a student’s abilities. Amira’s ability to offer proctoring in Spanish or administer the assessment in a hybrid mode ensures that ELs receive an equitable assessment experience.
In summary, Amira’s voice administration is fully validated and adaptable for all K-6 students, including those with special needs or limited prior education. The system’s design and extensive validation across diverse populations ensure that it is both effective and appropriate for early literacy assessment in various educational settings. The flexibility in administration modes, combined with robust evidence of accuracy and reliability, makes Amira an ideal tool for assessing young learners.
Technical Standards
Classification Accuracy & Cross-Validation Summary
Classification accuracy ratings for Fall, Winter, and Spring are reported by grade (Kindergarten through Grade 6); in the original source these ratings are displayed as graphical indicators.
NWEA MAP
Classification Accuracy
- Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
- The criterion outcome measure used in this study was the NWEA MAP Reading assessment, specifically the RIT scores from the end-of-year administration. This measure was selected for its well-established validity and reliability in assessing students' reading ability across a broad range of skills and grade levels. The NWEA MAP Reading assessment is entirely independent from the screening measure utilized in this study. The two assessments are developed and administered separately, and there is no overlap in test items, scoring rubrics, or administration processes. The only shared characteristic is their common purpose of measuring the construct of reading ability. This shared construct ensures alignment between the screening and outcome measures without introducing direct dependency, thereby supporting the validity of the classification accuracy study.
- Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
- Describe how the classification analyses were performed and cut-points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
- The degree to which the Amira Screener can accurately identify students who need intensive intervention was evaluated using classification accuracy statistics based on (a) the Amira cut scores, which show the proportion of students correctly classified by their ARM scores as at-risk or not at-risk, and (b) the criterion measure cut scores, which show whether students actually need intensive intervention. The classification accuracy analysis was conducted as follows. For each student, compare (a) the ARM score against the candidate ARM cut score and (b) the score on the criterion measure against the criterion measure cut score, and assign the student to one of four designations:
TP: students classified by the screener as "At-Risk" who are actually "At-Risk"; FP: students classified by the screener as "At-Risk" who are actually "Not At-Risk"; FN: students classified by the screener as "Not At-Risk" who are actually "At-Risk"; TN: students classified by the screener as "Not At-Risk" who are actually "Not At-Risk". The designations are then aggregated to obtain the total counts in each cell for students in the sample, and the following statistics are computed:
Classification Accuracy Rate = (TP + TN) / (total sample size): proportion of the study sample whose classification by the ARM cut scores was consistent with classification by the criterion measure.
False Negative (FN) Rate = FN / (FN + TP): proportion of "at-risk" students identified as "not at-risk" by the ARM score.
False Positive (FP) Rate = FP / (FP + TN): proportion of "not at-risk" students identified as "at-risk" by the ARM score.
Positive Predictive Value (PPV) = TP / (TP + FP): proportion of students identified as "at-risk" by ARM scores who are actually at risk.
Negative Predictive Value (NPV) = TN / (TN + FN): proportion of students identified as "not at-risk" by ARM scores who are actually not at risk.
Sensitivity = TP / (TP + FN): proportion of "at-risk" students who were identified as "at-risk" by the ARM score.
Specificity = TN / (TN + FP): proportion of "not at-risk" students who were identified as "not at-risk" by ARM scores.
Area Under the Curve (AUC): the area under the receiver operating characteristic (ROC) curve, reported with the lower (AUC-LB) and upper (AUC-UB) bounds of the 95% confidence interval; confidence intervals were calculated using a 1,000-sample bootstrap to construct a two-sided interval. AUC measures how well ARM scores separate the study sample into "at-risk" and "not at-risk" categories that match those based on the criterion measure cut scores.
The cut points for the Amira Screener were determined to align closely with the identification of students at risk of requiring intensive intervention, as defined by their performance on the criterion measure (NWEA MAP Reading assessment). Specifically, candidate cut scores were tested at the 20th, 25th, and 30th percentiles of the ARM scores for each grade and testing window. These cut points were chosen because they correspond to widely accepted thresholds for identifying students performing below grade-level expectations and, consequently, at risk of academic difficulty. The final cut points were selected based on their ability to maximize classification accuracy while minimizing false negative and false positive rates, as evaluated using standard classification metrics; specifically, the cut points that resulted in the highest lower bound on the AUC were selected. By achieving an optimal balance of sensitivity and specificity, the selected cut scores ensure that students most in need of support are accurately identified without unnecessarily flagging those who are not at risk. The alignment between the cut points and students at risk was validated through the classification accuracy analysis, which compared ARM classifications with those based on the criterion measure. High sensitivity values demonstrated the cut points' ability to identify the majority of at-risk students, while high specificity values confirmed that students not at risk were correctly excluded. This approach ensures the cut points serve as a reliable tool for early identification and targeted intervention. The groups contrasted were students who are at risk versus students who are not at risk.
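As a minimal illustration of the statistics defined above (a sketch in Python, not the actual analysis code; function and variable names are hypothetical), the snippet below computes the classification metrics from the four cell counts and a bootstrapped 95% confidence interval for AUC from per-student screener scores and at-risk labels. The example call uses the Kindergarten Fall counts reported in the table further below.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def classification_metrics(tp, fp, fn, tn):
    """Compute the classification accuracy statistics defined above."""
    total = tp + fp + fn + tn
    return {
        "classification_accuracy": (tp + tn) / total,
        "false_negative_rate": fn / (fn + tp),
        "false_positive_rate": fp / (fp + tn),
        "positive_predictive_value": tp / (tp + fp),
        "negative_predictive_value": tn / (tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def bootstrap_auc_ci(screener_scores, at_risk, n_boot=1000, seed=0):
    """Two-sided 95% CI for AUC from a 1,000-sample bootstrap.

    `at_risk` holds binary criterion labels (1 = at risk). Screener scores are
    negated so that higher values indicate greater risk before computing AUC."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(screener_scores, float)
    labels = np.asarray(at_risk, int)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(scores), len(scores))
        if labels[idx].min() == labels[idx].max():
            continue  # a resample must contain both classes
        aucs.append(roc_auc_score(labels[idx], -scores[idx]))
    return np.percentile(aucs, [2.5, 97.5])

# Example with the Kindergarten Fall counts from the table below (TP, FP, FN, TN).
print(classification_metrics(tp=1074, fp=1193, fn=326, tn=4872))
```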
- Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
- No
- If yes, please describe the intervention, what children received the intervention, and how they were chosen.
Cross-Validation
- Has a cross-validation study been conducted?
- No
- If yes,
- Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
- Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
- Describe how the cross-validation analyses were performed and cut-points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
- Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
- If yes, please describe the intervention, what children received the intervention, and how they were chosen.
Classification Accuracy - Fall
Evidence | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 |
---|---|---|---|---|---|---|---|
Criterion measure | NWEA MAP | NWEA MAP | NWEA MAP | NWEA MAP | NWEA MAP | NWEA MAP | NWEA MAP |
Cut Points - Percentile rank on criterion measure | 20 | 20 | 20 | 20 | 20 | 20 | 20 |
Cut Points - Performance score on criterion measure | 140 | 158 | 171 | 183 | 192 | 200 | 203 |
Cut Points - Corresponding performance score (numeric) on screener measure | -0.10 | 0.70 | 1.71 | 2.50 | 3.28 | 4.41 | 5.41 |
Classification Data - True Positive (a) | 1074 | 1528 | 1879 | 1718 | 26 | 322 | 53 |
Classification Data - False Positive (b) | 1193 | 869 | 723 | 702 | 475 | 896 | 6 |
Classification Data - False Negative (c) | 326 | 345 | 459 | 337 | 10 | 114 | 22 |
Classification Data - True Negative (d) | 4872 | 5652 | 5803 | 4921 | 1435 | 2314 | 20 |
Area Under the Curve (AUC) | 0.79 | 0.84 | 0.85 | 0.86 | 0.73 | 0.73 | 0.74 |
AUC Estimate’s 95% Confidence Interval: Lower Bound | 0.77 | 0.83 | 0.84 | 0.85 | 0.70 | 0.71 | 0.61 |
AUC Estimate’s 95% Confidence Interval: Upper Bound | 0.80 | 0.85 | 0.86 | 0.87 | 0.78 | 0.75 | 0.87 |
Statistics | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 |
---|---|---|---|---|---|---|---|
Base Rate | 0.19 | 0.22 | 0.26 | 0.27 | 0.02 | 0.12 | 0.74 |
Overall Classification Rate | 0.80 | 0.86 | 0.87 | 0.86 | 0.75 | 0.72 | 0.72 |
Sensitivity | 0.77 | 0.82 | 0.80 | 0.84 | 0.72 | 0.74 | 0.71 |
Specificity | 0.80 | 0.87 | 0.89 | 0.88 | 0.75 | 0.72 | 0.77 |
False Positive Rate | 0.20 | 0.13 | 0.11 | 0.12 | 0.25 | 0.28 | 0.23 |
False Negative Rate | 0.23 | 0.18 | 0.20 | 0.16 | 0.28 | 0.26 | 0.29 |
Positive Predictive Power | 0.47 | 0.64 | 0.72 | 0.71 | 0.05 | 0.26 | 0.90 |
Negative Predictive Power | 0.94 | 0.94 | 0.93 | 0.94 | 0.99 | 0.95 | 0.48 |
Sample | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 |
---|---|---|---|---|---|---|---|
Date | 2023-2024 | 2023-2024 | 2023-2024 | 2023-2024 | 2023-2024 | 2023-2024 | 2023-2024 |
Sample Size | 7465 | 8394 | 8864 | 7678 | 1946 | 3646 | 101 |
Geographic Representation | East North Central (IL, IN); Mountain (AZ); South Atlantic (MD) | East North Central (IL); Mountain (AZ); Pacific (CA); South Atlantic (MD) | East North Central (IL); Mountain (AZ); Pacific (CA); South Atlantic (MD) | Mountain (AZ); Pacific (CA); South Atlantic (MD) | Mountain (NV); South Atlantic (SC); West South Central (OK) | Mountain (NV); South Atlantic (SC); West South Central (OK) | West South Central (LA) |
Male | 41.8% | 47.7% | 48.4% | ||||
Female | 41.1% | 45.9% | 46.4% | ||||
Other | |||||||
Gender Unknown | |||||||
White, Non-Hispanic | 49.9% | 25.9% | 15.3% | ||||
Black, Non-Hispanic | 22.1% | 19.8% | 19.9% | ||||
Hispanic | 11.5% | 10.4% | 9.7% | ||||
Asian/Pacific Islander | 4.2% | 3.5% | 3.5% | ||||
American Indian/Alaska Native | |||||||
Other | 6.8% | 10.1% | 2.9% | ||||
Race / Ethnicity Unknown | |||||||
Low SES | 6.7% | 6.3% | 6.5% | ||||
IEP or diagnosed disability | 9.4% | 11.6% | 12.2% | ||||
English Language Learner | 13.7% | 13.3% | 13.0% |
Classification Accuracy - Winter
Evidence | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 |
---|---|---|---|---|---|---|
Criterion measure | NWEA MAP | NWEA MAP | NWEA MAP | NWEA MAP | NWEA MAP | NWEA MAP |
Cut Points - Percentile rank on criterion measure | 20 | 20 | 20 | 20 | 20 | 20 |
Cut Points - Performance score on criterion measure | 140 | 158 | 171 | 183 | 192 | 200 |
Cut Points - Corresponding performance score (numeric) on screener measure | 0.21 | 0.94 | 1.97 | 2.92 | 3.58 | 4.71 |
Classification Data - True Positive (a) | 1111 | 926 | 1400 | 1563 | 64 | 173 |
Classification Data - False Positive (b) | 1317 | 850 | 875 | 574 | 545 | 1045 |
Classification Data - False Negative (c) | 218 | 172 | 313 | 319 | 16 | 42 |
Classification Data - True Negative (d) | 5497 | 6392 | 6138 | 7655 | 1320 | 2385 |
Area Under the Curve (AUC) | 0.82 | 0.86 | 0.85 | 0.87 | 0.75 | 0.75 |
AUC Estimate’s 95% Confidence Interval: Lower Bound | 0.81 | 0.85 | 0.84 | 0.85 | 0.72 | 0.72 |
AUC Estimate’s 95% Confidence Interval: Upper Bound | 0.83 | 0.88 | 0.86 | 0.88 | 0.78 | 0.77 |
Statistics | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 |
---|---|---|---|---|---|---|
Base Rate | 0.16 | 0.13 | 0.20 | 0.19 | 0.04 | 0.06 |
Overall Classification Rate | 0.81 | 0.88 | 0.86 | 0.91 | 0.71 | 0.70 |
Sensitivity | 0.84 | 0.84 | 0.82 | 0.83 | 0.80 | 0.80 |
Specificity | 0.81 | 0.88 | 0.88 | 0.93 | 0.71 | 0.70 |
False Positive Rate | 0.19 | 0.12 | 0.12 | 0.07 | 0.29 | 0.30 |
False Negative Rate | 0.16 | 0.16 | 0.18 | 0.17 | 0.20 | 0.20 |
Positive Predictive Power | 0.46 | 0.52 | 0.62 | 0.73 | 0.11 | 0.14 |
Negative Predictive Power | 0.96 | 0.97 | 0.95 | 0.96 | 0.99 | 0.98 |
Sample | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 |
---|---|---|---|---|---|---|
Date | 2023-2024 | 2023-2024 | 2023-2024 | 2023-2024 | 2023-2024 | 2023-2024 |
Sample Size | 8143 | 8340 | 8726 | 10111 | 1945 | 3645 |
Geographic Representation | East North Central (IL, IN) Mountain (AZ) South Atlantic (MD) |
East North Central (IL) Mountain (AZ) Pacific (CA) South Atlantic (MD) |
East North Central (IL) Mountain (AZ) Pacific (CA) South Atlantic (MD) |
Mountain (AZ) Pacific (CA) South Atlantic (MD) |
Mountain (NV) South Atlantic (SC) West South Central (OK) |
Mountain (NV) South Atlantic (SC) West South Central (OK) |
Male | 38.3% | 48.0% | 49.2% | |||
Female | 37.7% | 46.2% | 47.1% | |||
Other | ||||||
Gender Unknown | ||||||
White, Non-Hispanic | 45.7% | 26.0% | 15.5% | |||
Black, Non-Hispanic | 20.3% | 19.9% | 20.2% | |||
Hispanic | 10.6% | 10.4% | 9.8% | |||
Asian/Pacific Islander | 3.9% | 3.5% | 3.6% | |||
American Indian/Alaska Native | ||||||
Other | 6.2% | 10.2% | 2.9% | |||
Race / Ethnicity Unknown | ||||||
Low SES | 6.1% | 6.4% | 6.6% | |||
IEP or diagnosed disability | 8.7% | 11.6% | 12.4% | |||
English Language Learner | 12.6% | 13.4% | 13.2% |
Classification Accuracy - Spring
Evidence | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 |
---|---|---|---|---|---|---|---|
Criterion measure | NWEA MAP | NWEA MAP | NWEA MAP | NWEA MAP | NWEA MAP | NWEA MAP | NWEA MAP |
Cut Points - Percentile rank on criterion measure | 20 | 20 | 20 | 20 | 20 | 20 | 20 |
Cut Points - Performance score on criterion measure | 140 | 158 | 171 | 183 | 192 | 200 | 203 |
Cut Points - Corresponding performance score (numeric) on screener measure | 0.47 | 1.43 | 2.19 | 3.19 | 3.82 | 4.97 | 5.97 |
Classification Data - True Positive (a) | 1096 | 1210 | 1497 | 1290 | 88 | 225 | 63 |
Classification Data - False Positive (b) | 1209 | 746 | 794 | 545 | 512 | 666 | 7 |
Classification Data - False Negative (c) | 208 | 289 | 328 | 238 | 25 | 83 | 13 |
Classification Data - True Negative (d) | 5572 | 6145 | 6190 | 5579 | 1291 | 1672 | 18 |
Area Under the Curve (AUC) | 0.83 | 0.85 | 0.85 | 0.88 | 0.75 | 0.72 | 0.77 |
AUC Estimate’s 95% Confidence Interval: Lower Bound | 0.82 | 0.84 | 0.84 | 0.87 | 0.72 | 0.70 | 0.63 |
AUC Estimate’s 95% Confidence Interval: Upper Bound | 0.84 | 0.86 | 0.87 | 0.89 | 0.77 | 0.75 | 0.89 |
Statistics | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 |
---|---|---|---|---|---|---|---|
Base Rate | 0.16 | 0.18 | 0.21 | 0.20 | 0.06 | 0.12 | 0.75 |
Overall Classification Rate | 0.82 | 0.88 | 0.87 | 0.90 | 0.72 | 0.72 | 0.80 |
Sensitivity | 0.84 | 0.81 | 0.82 | 0.84 | 0.78 | 0.73 | 0.83 |
Specificity | 0.82 | 0.89 | 0.89 | 0.91 | 0.72 | 0.72 | 0.72 |
False Positive Rate | 0.18 | 0.11 | 0.11 | 0.09 | 0.28 | 0.28 | 0.28 |
False Negative Rate | 0.16 | 0.19 | 0.18 | 0.16 | 0.22 | 0.27 | 0.17 |
Positive Predictive Power | 0.48 | 0.62 | 0.65 | 0.70 | 0.15 | 0.25 | 0.90 |
Negative Predictive Power | 0.96 | 0.96 | 0.95 | 0.96 | 0.98 | 0.95 | 0.58 |
Sample | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 |
---|---|---|---|---|---|---|---|
Date | 2023-2024 | 2023-2024 | 2023-2024 | 2023-2024 | 2023-2024 | 2023-2024 | 2023-2024 |
Sample Size | 8085 | 8390 | 8809 | 7652 | 1916 | 2646 | 101 |
Geographic Representation | East North Central (IL, IN); Mountain (AZ); South Atlantic (MD) | East North Central (IL); Mountain (AZ); Pacific (CA); South Atlantic (MD) | East North Central (IL); Mountain (AZ); Pacific (CA); South Atlantic (MD) | Mountain (AZ); Pacific (CA); South Atlantic (MD) | Mountain (NV); South Atlantic (SC); West South Central (OK) | Mountain (NV); South Atlantic (SC); West South Central (OK) | West South Central (LA) |
Male | 38.6% | 47.7% | 48.7% | ||||
Female | 37.9% | 45.9% | 46.7% | ||||
Other | |||||||
Gender Unknown | |||||||
White, Non-Hispanic | 46.1% | 25.9% | 15.4% | ||||
Black, Non-Hispanic | 20.4% | 19.8% | 20.0% | ||||
Hispanic | 10.7% | 10.4% | 9.7% | ||||
Asian/Pacific Islander | 3.9% | 3.5% | 3.5% | ||||
American Indian/Alaska Native | |||||||
Other | 6.3% | 10.1% | 2.9% | ||||
Race / Ethnicity Unknown | |||||||
Low SES | 6.2% | 6.3% | 6.5% | ||||
IEP or diagnosed disability | 8.7% | 11.6% | 12.3% | ||||
English Language Learner | 12.7% | 13.3% | 13.1% |
Reliability
Reliability ratings are reported by grade (Kindergarten through Grade 6); in the original source these ratings are displayed as graphical indicators.
- *Offer a justification for each type of reliability reported, given the type and purpose of the tool.
- Reliability refers to the relative stability with which a test measures the same skills across minor differences in conditions. Two types of reliability are reported below: parallel forms reliability and Cronbach’s coefficient alpha. Parallel forms reliability is crucial for ensuring the consistency of the Amira Progress Monitoring assessment. This analysis measures the consistency of results across different assessment forms, which is essential for accurately tracking student growth, since students receive a different form each time they take a progress monitoring assessment. By confirming that each form is equivalent, we can ensure that any observed improvements in student scores are due to actual learning, not to differences in the complexity or difficulty of the test forms. The coefficient reported is the average correlation among alternate forms of the measure; high alternate-form reliability coefficients suggest that the multiple forms are measuring the same construct. Coefficient alpha, commonly known as Cronbach's alpha, is a measure of internal consistency reliability used widely in education research and other fields. It estimates the proportion of total variance in a set of scores that is attributable to true score variance, reflecting the reliability of the measurement.
- *Describe the sample(s), including size and characteristics, for each reliability analysis conducted.
- The samples used to establish reliability include students who tested in the 2023-2024 school year. Both samples encompassed at least dozens of districts across the country for each grade. These districts were selected to emulate the diversity and variation of the national population of students and are representative in a variety of dimensions including school type, socioeconomic status, geographic region, gender, race, and ethnicity. Students in the parallel forms reliability sample each took two different Progress Monitoring forms within the same time window (1 week). Students in the internal consistency analyses were those who had taken at least 5 forms (instances) of Progress Monitoring across the 2023-2024 school year.
- *Describe the analysis procedures for each reported type of reliability.
- To assess parallel forms reliability, two forms of the assessment were administered to the same group of students within the range of one week. The scores obtained on each assessment version were then correlated to assess the degree of consistency between them. We measure these correlations using Pearson’s correlation coefficient, which is a measure of the strength of the linear relationship between two variables. The practical significance of the reliability coefficients was evaluated as follows: poor (0−0.39), adequate (0.40−0.59), good (0.60−0.79), and excellent (0.80−1.0). These estimates of practical significance are arbitrary, but conventionally used, and provide a useful heuristic for interpreting the reliability data. Confidence intervals were then calculated for the correlation coefficients computed across distinct pairs of forms. To obtain an estimate of internal consistency reliability, Cronbach's alphas were calculated for students who had taken at least 5 forms of Progress Monitoring assessment over the year. The 95% confidence interval of each reliability metric is computed using the bootstrap method, where 1000 samples with replacement are drawn from the data, and the 2.5% and 97.5% quantiles are calculated and reported.
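For illustration only, the following Python sketch mirrors the two reliability computations described above under simplifying assumptions; the function names and simulated data are hypothetical and this is not the actual analysis code. It computes Pearson's correlation between two parallel forms, Cronbach's alpha across repeated forms, and 1,000-sample bootstrap 95% confidence intervals from the 2.5% and 97.5% quantiles.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between scores on two parallel forms."""
    return np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1]

def cronbach_alpha(score_matrix):
    """Cronbach's alpha for an (n_students x n_forms) matrix of scores."""
    scores = np.asarray(score_matrix, float)
    k = scores.shape[1]
    form_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - form_variances / total_variance)

def bootstrap_ci(data, stat, n_boot=1000, seed=0):
    """95% CI: 2.5% and 97.5% quantiles of `stat` over bootstrap resamples of rows."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, float)
    boots = [stat(data[rng.integers(0, len(data), len(data))]) for _ in range(n_boot)]
    return np.percentile(boots, [2.5, 97.5])

# Hypothetical simulated data: 200 students, each taking several parallel forms.
rng = np.random.default_rng(1)
ability = rng.normal(0, 1, 200)
form_a = ability + rng.normal(0, 0.5, 200)
form_b = ability + rng.normal(0, 0.5, 200)
pairs = np.column_stack([form_a, form_b])
print(pearson_r(form_a, form_b),
      bootstrap_ci(pairs, lambda m: pearson_r(m[:, 0], m[:, 1])))

five_forms = np.column_stack([ability + rng.normal(0, 0.5, 200) for _ in range(5)])
print(cronbach_alpha(five_forms), bootstrap_ci(five_forms, cronbach_alpha))
```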
*In the table(s) below, report the results of the reliability analyses described above (e.g., internal consistency or inter-rater reliability coefficients).
Type of Reliability | Subgroup | Informant | Age / Grade | Test or Criterion | n | Median Coefficient | 95% Confidence Interval Lower Bound | 95% Confidence Interval Upper Bound |
---|---|---|---|---|---|---|---|---|
- Results from other forms of reliability analysis not compatible with above table format:
- Manual cites other published reliability studies:
- No
- Provide citations for additional published studies.
- Do you have reliability data that are disaggregated by gender, race/ethnicity, or other subgroups (e.g., English language learners, students with disabilities)?
- No
If yes, fill in data for each subgroup with disaggregated reliability data.
Type of Reliability | Subgroup | Informant | Age / Grade | Test or Criterion | n | Median Coefficient | 95% Confidence Interval Lower Bound | 95% Confidence Interval Upper Bound |
---|---|---|---|---|---|---|---|---|
- Results from other forms of reliability analysis not compatible with above table format:
- Manual cites other published reliability studies:
- No
- Provide citations for additional published studies.
Validity
Validity ratings are reported by grade (Kindergarten through Grade 6); in the original source these ratings are displayed as graphical indicators.
- *Describe each criterion measure used and explain why each measure is appropriate, given the type and purpose of the tool.
- Concurrent validity measures how well Amira scores correlate with the scores of another test that is administered at the same time and is already established as valid for measuring the same construct. Predictive validity refers to the extent to which scores on the Amira assessment can accurately predict future performance on a related outcome or criterion. The external assessments used in these studies include the i-Ready Reading Diagnostic and NWEA MAP Reading assessment. Both assessments are nationally-normed, computer adaptive measures of reading ability that are widely used in many states with established validity studies of their own.
- *Describe the sample(s), including size and characteristics, for each validity analysis conducted.
- The samples include students who tested in the 2022-2023 school year. This includes a sample of students from hundreds of districts across the country. These districts were selected to emulate the diversity and variation of the national population of students and are representative in a variety of dimensions including school type, socioeconomic status, geographic region, gender, race, and ethnicity. Sample sizes for each validity study vary across testing window, grade and criterion measure ranging from 988 to 5,643.
- *Describe the analysis procedures for each reported type of validity.
- Concurrent validity was established by correlating Amira’s Reading Mastery (ARM) scores from students in grades K through 6 who took both an Amira assessment and the external measure within the same two-week window of one another. The predictive validity of Amira was examined by correlating Amira’s assessment scores taken during the beginning of the year (Fall) window to scores from external measures taken at the end of the school year (Spring). In both forms of validity, the relationship between Amira’s scores and the external criterion measure was evaluated using Pearson’s correlation coefficient. Coefficients were calculated using bootstrap sampling across 100 random samples, and median correlation coefficients as well as 95% confidence intervals on the correlation coefficients are reported. All median and lower-bound correlation coefficients are 0.70 or higher, indicating a strong positive linear relationship between Amira and the external measure.
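A brief sketch of the correlation procedure just described (Python, with hypothetical names; per the description above, 100 bootstrap samples were used and medians with 95% confidence intervals were reported). This is an illustration under stated assumptions, not the actual analysis code.

```python
import numpy as np

def bootstrap_validity_correlation(amira_scores, criterion_scores, n_boot=100, seed=0):
    """Median Pearson r and 95% CI between matched Amira and criterion scores."""
    rng = np.random.default_rng(seed)
    x = np.asarray(amira_scores, float)
    y = np.asarray(criterion_scores, float)
    rs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(x), len(x))  # resample matched score pairs
        rs.append(np.corrcoef(x[idx], y[idx])[0, 1])
    return float(np.median(rs)), np.percentile(rs, [2.5, 97.5])
```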
*In the table below, report the results of the validity analyses described above (e.g., concurrent or predictive validity, evidence based on response processes, evidence based on internal structure, evidence based on relations to other variables, and/or evidence based on consequences of testing), and the criterion measures.
Type of Validity | Subgroup | Informant | Age / Grade | Test or Criterion | n | Median Coefficient | 95% Confidence Interval Lower Bound | 95% Confidence Interval Upper Bound |
---|---|---|---|---|---|---|---|---|
- Results from other forms of validity analysis not compatible with above table format:
- Manual cites other published validity studies:
- Yes
- Provide citations for additional published studies.
- https://www.amiralearning.com/amira-technical-guide.html
Rice, M. L., & Hoffman, L. (2015). Predicting vocabulary growth in children with and without specific language impairment: A longitudinal study from 2;6 to 21 years of age. Journal of Speech, Language, and Hearing Research, 58(2), 345–359.
Boscardin, C. K., Muthén, B., Francis, D. J., & Baker, E. L. (2008). Early identification of reading difficulties using heterogeneous developmental trajectories. Journal of Educational Psychology, 100(1), 192.
- Describe the degree to which the provided data support the validity of the tool.
- All results show a correlation of 0.7 or higher (strong correlation) between Amira’s Progress Monitoring scores and external criterion scores, so the provided data support the validity of the tool to a high degree.
- Do you have validity data that are disaggregated by gender, race/ethnicity, or other subgroups (e.g., English language learners, students with disabilities)?
- No
If yes, fill in data for each subgroup with disaggregated validity data.
Type of Validity | Subgroup | Informant | Age / Grade | Test or Criterion | n | Median Coefficient | 95% Confidence Interval Lower Bound | 95% Confidence Interval Upper Bound |
---|---|---|---|---|---|---|---|---|
- Results from other forms of validity analysis not compatible with above table format:
- Manual cites other published validity studies:
- No
- Provide citations for additional published studies.
Bias Analysis
Grade | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 |
---|---|---|---|---|---|---|---|
Rating | Provided | Provided | Provided | Provided | Provided | Provided | Provided |
- Have you conducted additional analyses related to the extent to which your tool is or is not biased against subgroups (e.g., race/ethnicity, gender, socioeconomic status, students with disabilities, English language learners)? Examples might include Differential Item Functioning (DIF) or invariance testing in multiple-group confirmatory factor models.
- Yes
- If yes,
- a. Describe the method used to determine the presence or absence of bias:
- We conducted a Differential Item Functioning (DIF) analysis using the Zumbo & Thomas (ZT) classification system with logistic regression implemented in the difR package in R software. The analysis examined 2,598 items across 10 subtests spanning grades Pre-K through 8, using a sample of 65,000 students from three large districts across three states. Items were classified using Nagelkerke's R² effect size thresholds: A items (negligible DIF) ≤ 0.13, B items (slight to moderate DIF) > 0.13 but ≤ 0.26, and C items (moderate to large DIF) > 0.26. To ensure robust DIF detection, items with fewer than 100 responses were excluded from analysis. DIF was examined using both overall reading ability scores and subscale scores as matching criteria to validate findings across different ability matching approaches. Following statistical analysis, curriculum experts reviewed all B and C flagged items to determine whether observed DIF represented construct-irrelevant variance (bias) or legitimate construct-related performance differences.
- b. Describe the subgroups for which bias analyses were conducted:
- Bias analyses were conducted across the following subgroups: (1) gender: male and female students; (2) race/ethnicity: Hispanic/Latino, African American/Black, and White students. The sample was strategically selected to provide sufficient demographic diversity and adequate sample sizes for reliable DIF detection across these key demographic groups, which are central to educational equity considerations in academic screening.
- c. Describe the results of the bias analyses conducted, including data and interpretative statements. Include magnitude of effect (if available) if bias has been identified.
- The DIF analysis revealed exceptionally low levels of differential item functioning across all demographic subgroups.
Results summary: 99% of items were classified as A (negligible DIF); <1% were classified as B (slight to moderate DIF); <1% were classified as C (moderate to large DIF).
Expert review results: All flagged items underwent curriculum expert review for construct-irrelevant variance. Items were evaluated for evidence of bias versus legitimate construct-related performance variations. A small number of items (approximately 0.20%) were identified as exhibiting bias and were removed from the pool.
Interpretation: The removal of biased items ensures that the final assessment maintains technical rigor and equity across diverse student populations. The extremely low percentage of items requiring removal (0.20%) demonstrates that the vast majority of items function equivalently across demographic subgroups, providing strong evidence for test fairness.
Construct validity support: Pearson correlations between overall and subscale scores ranged from 0.72-0.90 (R² = 0.52-0.81), indicating strong construct coherence and supporting the validity of our matching criteria approach.
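The DIF procedure above was run with logistic regression via the difR package in R. As a rough illustration only (Python with statsmodels instead of difR, hypothetical names, simulated data, and one common way of operationalizing the Zumbo-Thomas bands as a Nagelkerke R² difference between nested models), a single-item check might look like the sketch below.

```python
import numpy as np
import statsmodels.api as sm

def nagelkerke_r2(result, n):
    """Nagelkerke pseudo-R^2 for a fitted statsmodels Logit result."""
    cox_snell = 1 - np.exp((2 / n) * (result.llnull - result.llf))
    return cox_snell / (1 - np.exp((2 / n) * result.llnull))

def zt_dif_class(item_correct, ability, group):
    """Classify one item's DIF using the Zumbo-Thomas effect-size bands.

    item_correct: 0/1 responses; ability: matching criterion (e.g., overall
    reading score); group: 0/1 subgroup indicator. The effect size is the
    Nagelkerke R^2 difference between the ability-only model and the model
    that adds group and ability-by-group terms (an assumed operationalization)."""
    y = np.asarray(item_correct, float)
    ability = np.asarray(ability, float)
    group = np.asarray(group, float)
    n = len(y)
    base = sm.Logit(y, sm.add_constant(ability)).fit(disp=0)
    full = sm.Logit(y, sm.add_constant(
        np.column_stack([ability, group, ability * group]))).fit(disp=0)
    delta_r2 = nagelkerke_r2(full, n) - nagelkerke_r2(base, n)
    if delta_r2 <= 0.13:
        return "A (negligible DIF)", delta_r2
    if delta_r2 <= 0.26:
        return "B (slight to moderate DIF)", delta_r2
    return "C (moderate to large DIF)", delta_r2

# Hypothetical usage: 500 simulated students and one item that depends only on ability.
rng = np.random.default_rng(0)
ability = rng.normal(0, 1, 500)
group = rng.integers(0, 2, 500)
item = rng.binomial(1, 1 / (1 + np.exp(-0.8 * ability)))
print(zt_dif_class(item, ability, group))
```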
Data Collection Practices
Most tools and programs evaluated by the NCII are branded products which have been submitted by the companies, organizations, or individuals that disseminate these products. These entities supply the textual information shown above, but not the ratings accompanying the text. NCII administrators and members of our Technical Review Committees have reviewed the content on this page, but NCII cannot guarantee that this information is free from error or reflective of recent changes to the product. Tools and programs have the opportunity to be updated annually or upon request.