MAP® Growth™
Reading
Summary
MAP Growth assessments measure what students know and inform educators and parents about what they are ready to learn next. By dynamically adjusting to each student’s answers, the computer adaptive tests — calibrated to an equal-interval scale for each subject — create a personalized experience that accurately measures performance, regardless of demographics or test-taking ability. Educators around the globe use MAP Growth for multiple purposes, including as a universal screener in response to intervention (RTI) programs. The assessments identify students at risk of poor academic outcomes in reading and give educators insight into the instructional needs of all students, whether they are performing at, above, or below what is typical of a student’s grade-level peers. MAP Growth assessments are aligned to state standards and draw from an item bank containing more than 50,000 items, including technology-enhanced items, across all grades and content areas. MAP Growth reports transform data into timely insights. Teachers use the information to differentiate instruction and pinpoint needs of individual students or sub-groups of students. Higher-level reports give administrators the context to drive improvement across schools and districts. Tests can be administered three times per school year — once each in fall, winter, and spring — with an optional summer administration. The Rasch model, an item response theory (IRT) model commonly employed in K–12 assessment programs, was used to create the vertical scales for MAP Growth assessments. Each item is assigned a score on our RIT (for Rasch Unit) scale. Because the scales are equal-interval across grades K–12, educators can compare academic growth across students and time — within an academic year and over multiple years.
- Where to Obtain:
- NWEA®
- proposals@nwea.org
- 121 NW Everett Street, Portland, OR 97209
- (503) 624-1951
- www.nwea.org
- Initial Cost:
- Contact vendor for pricing details.
- Replacement Cost:
- Contact vendor for pricing details.
- Included in Cost:
- MAP Growth Reading assessments require an annual per-student subscription fee. Please visit https://www.nwea.org/contact-us/sales-information/ or call (866) 654-3246 to request a quote. Annual subscription fees include a suite of assessments, scoring and reporting, all assessment software including maintenance and upgrades, support services, and unlimited staff access to NWEA Professional Learning Online. MAP Growth assessments can be administered up to four times per calendar year, plus the abbreviated Screening assessment once a year for placement purposes. All results from MAP Growth assessments, including RIT scores, proficiency projections, and status and growth norms, are available in a variety of views and formats through our comprehensive suite of reports. A full system of support is provided to enable the success of our partners, including technical support; implementation support through the first test administration; and ongoing, dedicated account management for the duration of the partnership. Unlimited staff access to the NWEA Professional Learning Online learning portal provides on-demand tutorials, webinars, courses, and videos to supplement professional learning plans and help educators use MAP Growth to improve teaching and learning. NWEA offers a portfolio of flexible, customizable professional learning and training options for an additional cost to meet the needs of our partners.
- NWEA strives to make our products accessible for all students. In line with the Every Student Succeeds Act, MAP Growth for grades 2 and above provides accommodations, such as interoperability with, and the ability to use, assistive technology for students with disabilities. In 2018, NWEA adopted and implemented the language and terminology of the Council of Chief State School Officers (CCSSO) Accessibility Manual. MAP Growth tests for students in grades K–2 do not include some of the accessibility features included in our tests for grades 2 and above because adding a new assistive technology at this level calls into question the validity of what is being tested: the use of new technology or the assessment content. Our goal is to provide a universal approach and make the use of features and accommodations as easy as possible — for both the student and educator. NWEA uses the CCSSO Accessibility Manual to help us define our current features and accommodations, to create consistency in the market, and to make sure we keep delivering what students need. As we continue to add more features and accommodations to our assessment platform, NWEA intends to align with current CCSSO guidelines and updates. Our policy includes universal features, designated features, and accommodations, each with embedded and non-embedded options. Universal features are accessibility supports that are available to all students as they access instructional or assessment content. They are either embedded and provided digitally through instructional or assessment technology (such as keyboard navigation) or non-embedded and provided non-digitally at the local level (such as scratch paper). Designated features are available when an educator (or team of educators including the parents/guardians and the student, if appropriate) indicates that there is a need for them. Designated features must be assigned to a student by trained educators or teams using a consistent process. Embedded designated features (such as text-to-speech) are provided digitally through instructional or assessment technology. Non-embedded designated features (such as a magnification device) are provided locally. Accommodations are changes in procedures or materials that provide equitable access to instructional and assessment content — and generate valid assessment results for students who need them. Embedded accommodations are provided digitally through instructional or assessment technology. Non-embedded accommodations (such as a scribe) are provided locally. Accommodations are generally available to students for whom there is a documented need on an IEP or 504 accommodation plan; however, some states also offer accommodations for English language learners. Our Voluntary Product Accessibility Template (VPAT) for MAP Growth is available online at www.nwea.org/accommodations-accessibility. This document indicates the areas we support, do not support, and areas that may not be applicable to our assessment. Because our accessibility offerings are online, NWEA has created an accessibility checklist that follows accessibility standards and protocols from the Americans with Disabilities Act (ADA), Section 508 of the Rehabilitation Act, and the Web Content Accessibility Guidelines (WCAG) 2.1, as well as various other sources, such as standards from CAST, a nonprofit focused on expanding learning opportunities through Universal Design for Learning (UDL), and the National Center on Educational Outcomes (NCEO).
- Training Requirements:
- 1-4 hours of training
- Qualified Administrators:
- Proctors should meet the same qualifications as a teaching paraprofessional and should complete all necessary training related to administering an assessment.
- Access to Technical Support:
- Users can obtain support through our Partner Support team via toll-free telephone number, email, and chat; our online Help Center; and a dedicated Account Manager.
- Assessment Format:
- Direct: Computerized
- Scoring Time:
- Scoring is automatic
- Scores Generated:
- Percentile score
- IRT-based score
- Developmental benchmarks
- Developmental cut points
- Lexile score
- Composite scores
- Subscale/subtest scores
- Administration Time:
- 45 minutes per student/subject
- Scoring Method:
- Automatically (computer-scored)
- Technology Requirements:
- Computer or tablet
- Internet connection
- Other technology: System requirements are regularly updated at https://teach.mapnwea.org/impl/QRM2_System_Requirements_QuickRef.pdf.
- Accommodations:
- NWEA strives to make our products accessible for all students. In line with the Every Student Succeeds Act, MAP Growth for grades 2 and above provides accommodations, such as interoperability with, and the ability to use, assistive technology for students with disabilities. In 2018, NWEA adopted and implemented the language and terminology of the Council of Chief State School Officers (CCSSO) Accessibility Manual. MAP Growth tests for students in grades K–2 do not include some of the accessibility features included in our tests for grades 2 and above because adding a new assistive technology at this level calls into question the validity of what is being tested: the use of new technology or the assessment content. Our goal is to provide a universal approach and make the use of features and accommodations as easy as possible — for both the student and educator. NWEA uses the CCSSO Accessibility Manual to help us define our current features and accommodations, to create consistency in the market, and to make sure we keep delivering what students need. As we continue to add more features and accommodations to our assessment platform, NWEA intends to align with current CCSSO guidelines and updates. Our policy includes universal features, designated features, and accommodations, each with embedded and non-embedded options. Universal features are accessibility supports that are available to all students as they access instructional or assessment content. They are either embedded and provided digitally through instructional or assessment technology (such as keyboard navigation) or non-embedded and provided non-digitally at the local level (such as scratch paper). Designated features are available when an educator (or team of educators including the parents/guardians and the student, if appropriate) indicates that there is a need for them. Designated features must be assigned to a student by trained educators or teams using a consistent process. Embedded designated features (such as text-to-speech) are provided digitally through instructional or assessment technology. Non-embedded designated features (such as a magnification device) are provided locally. Accommodations are changes in procedures or materials that provide equitable access to instructional and assessment content — and generate valid assessment results for students who need them. Embedded accommodations are provided digitally through instructional or assessment technology. Non-embedded accommodations (such as a scribe) are provided locally. Accommodations are generally available to students for whom there is a documented need on an IEP or 504 accommodation plan; however, some states also offer accommodations for English language learners. Our Voluntary Product Accessibility Template (VPAT) for MAP Growth is available online at www.nwea.org/accommodations-accessibility. This document indicates the areas we support, do not support, and areas that may not be applicable to our assessment. Because our accessibility offerings are online, NWEA has created an accessibility checklist that follows accessibility standards and protocols from the Americans with Disabilities Act (ADA), Section 508 of the Rehabilitation Act, and the Web Content Accessibility Guidelines (WCAG) 2.1, as well as various other sources, such as standards from CAST, a nonprofit focused on expanding learning opportunities through Universal Design for Learning (UDL), and the National Center on Educational Outcomes (NCEO).
Descriptive Information
- Please provide a description of your tool:
- MAP Growth assessments measure what students know and inform educators and parents about what they are ready to learn next. By dynamically adjusting to each student’s answers, the computer adaptive tests — calibrated to an equal-interval scale for each subject — create a personalized experience that accurately measures performance, regardless of demographics or test-taking ability. Educators around the globe use MAP Growth for multiple purposes, including as a universal screener in response to intervention (RTI) programs. The assessments identify students at risk of poor academic outcomes in reading and give educators insight into the instructional needs of all students, whether they are performing at, above, or below what is typical of a student’s grade-level peers. MAP Growth assessments are aligned to state standards and draw from an item bank containing more than 50,000 items, including technology-enhanced items, across all grades and content areas. MAP Growth reports transform data into timely insights. Teachers use the information to differentiate instruction and pinpoint needs of individual students or sub-groups of students. Higher-level reports give administrators the context to drive improvement across schools and districts. Tests can be administered three times per school year — once each in fall, winter, and spring — with an optional summer administration. The Rasch model, an item response theory (IRT) model commonly employed in K–12 assessment programs, was used to create the vertical scales for MAP Growth assessments. Each item is assigned a score on our RIT (for Rasch Unit) scale. Because the scales are equal-interval across grades K–12, educators can compare academic growth across students and time — within an academic year and over multiple years.
ACADEMIC ONLY: What skills does the tool screen?
- Please describe specific domain, skills or subtests:
- MAP Growth Reading assesses vocabulary both in standalone formats and embedded in passages where context clues are important.
- BEHAVIOR ONLY: Which category of behaviors does your tool target?
-
- BEHAVIOR ONLY: Please identify which broad domain(s)/construct(s) are measured by your tool and define each sub-domain or sub-construct.
Acquisition and Cost Information
Administration
- Are norms available?
- Yes
- Are benchmarks available?
- Yes
- If yes, how many benchmarks per year?
- 3
- If yes, for which months are benchmarks available?
- Fall, Winter, Spring
- BEHAVIOR ONLY: Can students be rated concurrently by one administrator?
- If yes, how many students can be rated concurrently?
Training & Scoring
Training
- Is training for the administrator required?
- Yes
- Describe the time required for administrator training, if applicable:
- 1-4 hours of training
- Please describe the minimum qualifications an administrator must possess.
- Proctors should meet the same qualifications as a teaching paraprofessional and should complete all necessary training related to administering an assessment.
- Are training manuals and materials available?
- Yes
- Are training manuals/materials field-tested?
- Yes
- Are training manuals/materials included in cost of tools?
- Yes
- If No, please describe training costs:
- Can users obtain ongoing professional and technical support?
- Yes
- If Yes, please describe how users can obtain support:
- Users can obtain support through our Partner Support team via toll-free telephone number, email, and chat; our online Help Center; and a dedicated Account Manager.
Scoring
- Do you provide basis for calculating performance level scores?
- Yes
- Does your tool include decision rules?
- No
- If yes, please describe.
- Can you provide evidence in support of multiple decision rules?
- No
- If yes, please describe.
- Please describe the scoring structure. Provide relevant details such as the scoring format, the number of items overall, the number of items per subscale, what the cluster/composite score comprises, and how raw scores are calculated.
- MAP Growth results, reported as RIT scores with a range from 100 to 350, relate directly to the RIT scale, a vertical, equal-interval scale that is continuous across grades for each subject. MAP Growth provides an overall RIT score for each subject and a score for each instructional area within that subject. The test blueprint specifies the number of items each student will be administered overall and in each instructional area, based on state curriculum standards and psychometric considerations, to provide a reliable and accurate estimate of student abilities in each subject and its instructional areas. MAP Growth assessments draw from an item bank containing more than 50,000 items across all grades and subjects. The item pools cover all instructional areas and difficulty levels across the full range of the RIT scale. MAP Growth Reading tests have 39–49 items. The number of items in an instructional area varies across tests of different subjects; in general, it is determined by dividing the test length by the number of instructional areas, and we try to keep the number of items in an instructional area to no fewer than 10. MAP Growth employs a common item selection and test scoring algorithm. Each student begins the test with a preliminary score based on past test performance. If a student has no prior test score, a default starting value is assigned according to test content and the student's grade. As each test proceeds, each item is selected from the pool of Rasch-calibrated items based on the student's interim ability estimate, content requirements, and longitudinal item exposure controls. Interim ability estimates are updated after each response using Bayesian methods (Owen, 1975) that consider all of the student's responses up to that point in the test. The updated interim ability estimate is factored into selection of the next item. As this cycle repeats, each successive interim ability estimate is slightly more precise than the previous one. The test continues until the standard error associated with the estimate falls below a pre-specified value, typically set to the precision a test of that length would achieve if item difficulties matched the examinee's ability as closely as possible under the Rasch model. The final ability estimate (i.e., RIT score) is computed via maximum-likelihood estimation (MLE), a method commonly used in large-scale educational assessment to estimate student performance or achievement, and indicates the student's location on the RIT scale, both overall and for each instructional area.
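To make the item-selection and scoring cycle described above concrete, the following is a minimal simulation sketch. It is not NWEA's operational algorithm: it omits content balancing and item-exposure controls, substitutes a simple Newton-Raphson maximum-likelihood update for Owen's Bayesian interim estimator, and uses an assumed item bank, stopping threshold, and RIT conversion purely for illustration.

```python
import math
import random

def rasch_p(theta, b):
    """P(correct) under the Rasch model for ability theta and item difficulty b (in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def ability_estimate(responses, difficulties, start=0.0, iters=25):
    """Newton-Raphson MLE of ability with its standard error (illustrative only).
    The MLE is undefined for all-correct/all-incorrect patterns, so those cases
    fall back to a fixed nudge of the provisional estimate."""
    if all(responses) or not any(responses):
        return start + (0.7 if all(responses) else -0.7), float("inf")
    theta, info = start, 0.0
    for _ in range(iters):
        p = [rasch_p(theta, b) for b in difficulties]
        grad = sum(x - pi for x, pi in zip(responses, p))   # d logL / d theta
        info = sum(pi * (1.0 - pi) for pi in p)             # Fisher (test) information
        if info <= 0:
            break
        theta += max(-1.0, min(1.0, grad / info))           # damped Newton step
    se = 1.0 / math.sqrt(info) if info > 0 else float("inf")
    return theta, se

def adaptive_test(item_bank, true_theta, start_theta=0.0, se_target=0.3, max_items=50):
    """Pick the unused item nearest the interim estimate (~50% success probability),
    score it, update the estimate, and stop once the standard error is small enough."""
    unused = list(item_bank)
    responses, difficulties = [], []
    theta, se = start_theta, float("inf")
    while unused and len(responses) < max_items:
        b = min(unused, key=lambda d: abs(d - theta))
        unused.remove(b)
        responses.append(1 if random.random() < rasch_p(true_theta, b) else 0)
        difficulties.append(b)
        theta, se = ability_estimate(responses, difficulties, start=theta)
        if se < se_target:
            break
    return theta, se, len(responses)

bank = [d / 10.0 for d in range(-40, 41)]                   # difficulties from -4 to 4 logits
theta_hat, se, n = adaptive_test(bank, true_theta=0.8)
print(f"ability = {theta_hat:.2f} logits, SE = {se:.2f}, items used = {n}")
print(f"approximate RIT = {200 + 10 * theta_hat:.0f}")      # RIT is a linear rescaling of logits
```

In the operational assessment, the starting values, stopping rule, and item-selection constraints follow NWEA's published specifications rather than the placeholder values shown here.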
- Describe the tool’s approach to screening, samples (if applicable), and/or test format, including steps taken to ensure that it is appropriate for use with culturally and linguistically diverse populations and students with disabilities.
- With computer adaptive tests such as MAP Growth, each student experiences a unique test based on his or her responses to each question. This adaptivity supports students with diverse needs, including students with disabilities, English language learners, and those performing outside of grade-level expectations. The design of each MAP Growth test starts with an analysis of the content standards to be assessed. Items that align to standards are included in a pool and grouped into instructional areas and sub-areas. Although each item pool is tailored to specific standards, all MAP Growth assessments follow the same design principles and content rationale. The assessment begins by delivering a question based on known information about that student — grade level the first time tested, and previous score after that. If the student answers the question correctly, he or she receives a more difficult question. An incorrect response prompts an easier question. MAP Growth requires students to answer every question presented instead of giving them the option to skip, so the difficulty level of the assessment is accurate. The algorithms used to deliver a unique test to each student are based upon Rasch item difficulty calibrations, where items are delivered so that students will likely respond correctly 50 percent of the time. As a result, students at all levels of learning stay engaged with our assessment. Struggling students who might otherwise get frustrated and stop trying and high-achieving students who might get bored by strictly grade-level assessments will remain interested, as subsequent questions adapt to their abilities. Our industry-leading MAP Growth norms, updated in July 2020, provide support of normative interpretations of a student’s performance for a specific grade or academic term as well as a student’s growth over time. The fundamental assumption underlying item response theory (IRT) is that the probability of a correct response to a test item is a function of the item’s difficulty and the person’s ability. This function is expected to remain invariant to other person characteristics that are unrelated to ability such as gender, ethnic group membership, family wealth, etc. Therefore, if two test takers with the same ability respond to the same test item, they are assumed to have an equal probability of answering the item correctly. We are committed to developing engaging, authentic, rigorous, and culturally diverse assessments that effectively measure the full range of the standards. Therefore, it is vital that we address a wide variety of texts in a balanced, respectful way that does not upset, distract, or exclude any student populations. Item writers employ careful consideration and sound judgment while crafting items, considering each item from a variety of angles regarding bias and sensitivity, in accordance with the NWEA Sensitivity, Fairness, and Accessibility Guidelines. A well-constructed item serves to activate and focus a student’s thought process on the task presented in the item. To meet our high expectation of fairness to all students, every item is thoroughly examined at multiple points in the item development process, undergoing specific bias and sensitivity reviews. Sensitivity in this context means an awareness of the different things that can distract a student during assessment. Fairness in this context relates to giving each student equal opportunity to answer the item correctly based solely on their knowledge of the item content. 
Any sensitivity and fairness issues found in items are eliminated through revision or rejection of the item during development. Each item is evaluated against a set of criteria and is flagged if it requires prior knowledge other than the skill/concept being assessed; requires construct-irrelevant or specialized knowledge; has cultural bias; has linguistic bias; has socioeconomic bias; has religious bias; has geographic bias; has color-blind bias; has gender bias; favors students who have no visual impairments; favors students who have no disabilities; inappropriately employs idiomatic English; offensively stereotypes a group of people; mentions body/weight issues; contains inappropriate or sensitive topics; distracts, upsets, or confuses in any way; or has other bias issues. Our Psychometric Solutions team performs differential item functioning (DIF) analyses to examine the percentages of items that exhibit DIF in the item pools. All items revealed as exhibiting moderate DIF are subjected to an extra review by NWEA Content Specialists to identify the source(s) of the differential functioning. For each item, these specialists make a judgment to remove the item from the item bank, revise the item and resubmit it for field-testing, or retain the item as is. Items exhibiting severe DIF are removed from the item bank. These procedures are consistent with periodic item quality reviews that remove items or flag them for revision and renewed field-testing.
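One common way to operationalize a DIF screen of this kind is the Mantel-Haenszel procedure, which compares the odds of a correct response for matched reference and focal groups. The sketch below is a simplified illustration under assumed inputs (score-band strata and a reference/focal grouping), not the specific flagging criteria NWEA uses; it applies only the ETS delta effect-size thresholds and omits the accompanying significance test.

```python
import math
import random
from collections import defaultdict

def mantel_haenszel_dif(records):
    """Compute the Mantel-Haenszel delta-DIF effect size for a single item.

    `records` is a list of (stratum, group, correct) tuples, where stratum is a
    matching variable (e.g., total-score band), group is "reference" or "focal",
    and correct is 0/1. Returns the delta effect size and an ETS-style category.
    """
    cells = defaultdict(lambda: [[0, 0], [0, 0]])   # stratum -> [[ref wrong, ref right], [focal wrong, focal right]]
    for stratum, group, correct in records:
        g = 0 if group == "reference" else 1
        cells[stratum][g][1 if correct else 0] += 1

    num = den = 0.0
    for (r_wrong, r_right), (f_wrong, f_right) in cells.values():
        n = r_wrong + r_right + f_wrong + f_right
        if n == 0:
            continue
        num += r_right * f_wrong / n                # reference-correct x focal-incorrect
        den += f_right * r_wrong / n                # focal-correct x reference-incorrect
    if num == 0 or den == 0:
        return None, "insufficient data"

    alpha = num / den                               # Mantel-Haenszel common odds ratio
    delta = -2.35 * math.log(alpha)                 # ETS delta metric
    if abs(delta) < 1.0:
        category = "A (negligible)"
    elif abs(delta) < 1.5:
        category = "B (moderate)"
    else:
        category = "C (large: remove or revise)"
    return delta, category

# Illustration with synthetic responses to one hypothetical item across five score bands.
random.seed(0)
records = [(s, g, int(random.random() < (0.70 if g == "reference" else 0.55)))
           for s in range(5) for g in ("reference", "focal") for _ in range(200)]
print(mantel_haenszel_dif(records))
```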
Technical Standards
Classification Accuracy & Cross-Validation Summary
Grade | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 |
---|---|---|---|---|---|---|---|---|---|
Classification Accuracy Fall | | | | | | | | | |
Classification Accuracy Winter | | | | | | | | | |
Classification Accuracy Spring | | | | | | | | | |
Linked State Summative Assessments
Classification Accuracy
- Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
- For this series of analyses for grades K–8, the criterion measure was scaled scores from multiple state summative assessments linked to a common scale via the MAP Growth scale. This approach of linking test scores from different state standardized tests to a common scale is documented and validated in a study by Reardon, Kalogrides, and Ho (2017), who transformed state-level test scores to the common National Assessment of Educational Progress (NAEP) scale, yielding score distributions that correspond well to the relative performance of students in different districts on the NAEP and MAP Growth assessments. The summative assessments administered in spring are used to provide evidence of student achievement for various intended test score uses, such as meeting school accountability requirements. MAP Growth tests are adaptive interim assessments aligned to state-specific content standards and administered in the fall, winter, and spring. Scores are reported on the RIT vertical scale with a range of 100–350. State assessment scaled scores for English language arts were from the Spring 2018 test administrations in five states: Arkansas (Region 3: South; Division 7: West South Central), Colorado (Region 4: West; Division 8: Mountain), New York (Region 1: Northeast; Division 2: Middle Atlantic), Florida (Region 3: South; Division 5: South Atlantic), and Missouri (Region 2: Midwest; Division 4: West North Central). These five states cover all four U.S. Census Regions and five Divisions. The five testing programs were ACT Aspire (used in Arkansas), the Colorado Measures of Academic Success, the New York State Testing Program, Florida's Statewide Assessment Program, and the Missouri Assessment Program. Each student had taken one of these state summative tests and MAP Growth, which allowed for a common-person linking design. To make test scores from the state testing programs comparable with one another, we linked them via the MAP Growth scale using the equipercentile method (Kolen & Brennan, 2004). As a result, each student in the sample had a MAP Growth score and a state summative test score linked to a common metric. Linking state summative test scores to a common metric made state-level test scores comparable across states; in other words, it created a common measure across states, which was used as the outcome measure (i.e., the criterion) in the classification analysis. For the grades K–2 analyses, the grade 3 linked state summative scores were used as the outcome measure because these states did not administer a summative test before grade 3. Thus, the classification analysis for grades K–2 involved a criterion measure given one to three years after the students' MAP Growth scores.
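The equipercentile linking step can be illustrated with a short sketch. The data, variable names, and the simple interpolation used below are assumptions for illustration only; operational equipercentile linking (Kolen & Brennan, 2004) adds presmoothing and careful handling of discrete score points that this sketch omits.

```python
import numpy as np

def equipercentile_link(from_scores, to_scores):
    """Return a function mapping scores on the 'from' scale to the 'to' scale by
    matching percentile ranks (simplified equipercentile linking)."""
    from_sorted = np.sort(np.asarray(from_scores, dtype=float))
    to_sorted = np.sort(np.asarray(to_scores, dtype=float))
    from_probs = (np.arange(1, len(from_sorted) + 1) - 0.5) / len(from_sorted)
    to_probs = (np.arange(1, len(to_sorted) + 1) - 0.5) / len(to_sorted)

    def link(x):
        # Percentile rank of x in the 'from' distribution ...
        p = np.interp(x, from_sorted, from_probs)
        # ... mapped to the score with the same percentile rank in the 'to' distribution.
        return np.interp(p, to_probs, to_sorted)

    return link

# Illustration with synthetic data: express hypothetical state scale scores on the RIT
# metric using students who took both tests (a common-person design).
rng = np.random.default_rng(0)
rit = rng.normal(200, 15, 5000)                              # synthetic MAP Growth RIT scores
state = 400 + 2.0 * (rit - 200) + rng.normal(0, 10, 5000)    # synthetic state scale scores
state_to_rit = equipercentile_link(state, rit)
print(round(float(state_to_rit(420.0)), 1))                  # state score 420 on the RIT metric
```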
- Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
- MAP Growth Reading test scores for grades 3–8 came from test administrations in Arkansas, Florida, New York, Missouri, and Colorado during the Fall 2017, Winter 2018, and Spring 2018 terms, which spanned August 2017 to June 2018. For grades K–2, the MAP Growth scores of the grade 3 students from the previous three academic years (i.e., Fall 2014, Winter 2015, Spring 2015, Fall 2015, Winter 2016, Spring 2016, Fall 2016, Winter 2017, and Spring 2017) were used. The criterion data came from the Spring 2018 administration of the five state assessment programs, spanning approximately March 2018 to June 2018. Both concurrent and predictive classification analyses were conducted. The concurrent analyses were conducted for grades 3–8 using scores from students who took both MAP Growth and the state summative tests within a short window during the Spring 2018 term. The predictive analyses were conducted for grades 3–8 using scores from students who took the state assessments during the Spring 2018 administration and MAP Growth during the Fall 2017 or Winter 2018 term; that is, MAP Growth tests were taken by the same students approximately 3–6 months earlier than the state tests. The grades K–2 analyses used state assessment scores obtained for grade 3 students in the Spring 2018 administration and MAP Growth scores for the same students when they were in grades K–2 (i.e., Fall 2014, Winter 2015, Spring 2015, Fall 2015, Winter 2016, Spring 2016, Fall 2016, Winter 2017, and Spring 2017); that is, MAP Growth tests were taken approximately 12–36 months earlier than the state tests.
- Describe how the classification analyses were performed and cut-points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
- Students were designated as actually “at risk” or “not at risk” on the outcome measure (i.e., the criterion) by rank-ordering their linked state scaled scores and using the 10th percentile within the study sample as the cut score; the percentiles were determined separately for each grade. Students whose linked state scale scores fell below the 10th percentile rank point were designated as actually “at risk,” and students at or above that point were designated as “not at risk.” On the screening measure, students were classified as “at risk” and requiring intensive intervention if their MAP Growth scores fell below the 30th percentile based on the MAP Growth norming study conducted by NWEA in 2020, with percentiles computed for each term, grade, and subject area; students with MAP Growth scores at or above the 30th percentile point were classified as “not at risk.” The analyses therefore contrasted students below the 30th MAP Growth percentile (at risk) with students at or above it (not at risk).
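The resulting 2×2 classification and the statistics reported in the tables below (sensitivity, specificity, predictive power, and so on) can be computed with a short sketch. The synthetic data and the choice to derive the screener cut from the sample itself are simplifying assumptions; the analyses described above used the 30th percentile point from the 2020 MAP Growth norms for the screener cut.

```python
import numpy as np

def classification_stats(screener, criterion, screener_cut, criterion_cut):
    """Cross-tabulate screener and criterion risk designations and compute the
    classification accuracy statistics reported in the tables below."""
    screener = np.asarray(screener, dtype=float)
    criterion = np.asarray(criterion, dtype=float)
    pred_risk = screener < screener_cut        # flagged "at risk" by the screener
    true_risk = criterion < criterion_cut      # actually "at risk" on the criterion
    a = int(np.sum(pred_risk & true_risk))     # true positives
    b = int(np.sum(pred_risk & ~true_risk))    # false positives
    c = int(np.sum(~pred_risk & true_risk))    # false negatives
    d = int(np.sum(~pred_risk & ~true_risk))   # true negatives
    total = a + b + c + d
    return {
        "base_rate": (a + c) / total,
        "overall_classification_rate": (a + d) / total,
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "false_positive_rate": b / (b + d),
        "false_negative_rate": c / (a + c),
        "positive_predictive_power": a / (a + b),
        "negative_predictive_power": d / (c + d),
    }

# Illustration with synthetic scores; both cut scores are taken from the sample here.
rng = np.random.default_rng(1)
map_scores = rng.normal(200, 15, 10000)
linked_state = 0.8 * (map_scores - 200) + rng.normal(0, 9, 10000)
stats = classification_stats(
    map_scores, linked_state,
    screener_cut=np.percentile(map_scores, 30),
    criterion_cut=np.percentile(linked_state, 10),
)
print({k: round(v, 2) for k, v in stats.items()})
```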
- Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
- No
- If yes, please describe the intervention, what children received the intervention, and how they were chosen.
- Some students may have been involved in various interventions in their particular schools, but we do not know which interventions or which students.
Cross-Validation
- Has a cross-validation study been conducted?
- Yes
- If yes,
- Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
- For this series of cross-validation analyses for grades 3–8, the outcome measure was the English language arts state assessment scaled scores from the Indiana Learning Evaluation Readiness Network (ILEARN) state testing program. Students in the cross-validation sample took the ILEARN English Language Arts tests during the Spring 2019 school term. For the grades K–2 analyses, the scaled scores for grade 3 students in the sample were used as the outcome measure. The ILEARN tests are Indiana’s state summative assessments aligned to the Indiana Academic Standards. Based on their test scores, students are placed into one of four performance levels: Below Proficiency, Approaching Proficiency, At Proficiency, and Above Proficiency. These tests are used to provide evidence of student achievement for various test score uses such as meeting state and federal accountability requirements. MAP Growth tests are adaptive interim assessments aligned to state-specific content standards and administered in the fall, winter, and spring. Scores are reported on the RIT vertical scale with a range of 100–350.
- Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
- MAP Growth Reading test scores for grades 3–8 came from test administrations in Indiana during the Fall 2018, Winter 2019, and Spring 2019 terms, which spanned August 2018 to June 2019. For grades K–2, the MAP Growth scores of the grade 3 students from the previous three academic years (i.e., Fall 2015, Winter 2016, Spring 2016, Fall 2016, Winter 2017, Spring 2017, Fall 2017, Winter 2018, and Spring 2018) were used as the grades K–2 MAP Growth scores. The criterion data came from the Spring 2019 administration of the Indiana state assessment, spanning approximately March 2019 to June 2019. Both concurrent and predictive classification analyses were conducted. The concurrent analyses were conducted for grades 3–8 using MAP Growth scores and ILEARN assessment scores from the Spring 2019 test administration; both tests assess similar constructs and were taken by the same students within a short window. The predictive analyses were conducted for grades 3–8 using scores from students who took the ILEARN assessments during the Spring 2019 administration and MAP Growth during the Fall 2018 or Winter 2019 term; that is, MAP Growth tests were taken by the same students approximately 3–6 months earlier than the ILEARN tests. The grades K–2 analyses used ILEARN assessment scores obtained for grade 3 students in the Spring 2019 administration and MAP Growth scores for the same students when they were in grades K–2 (i.e., Fall 2015, Winter 2016, Spring 2016, Fall 2016, Winter 2017, Spring 2017, Fall 2017, Winter 2018, and Spring 2018); that is, MAP Growth tests were taken approximately 12–36 months earlier than the ILEARN state tests, but both tests were developed from the same content standards and assess similar constructs.
- Describe how the cross-validation analyses were performed and cut-points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
- Students were designated as actually “at risk” or “not at risk” by rank-ordering their ILEARN state scale scores and using the 10th percentile rank point within the study sample as the cut score, determined separately by grade and subject area. Students whose state scale scores fell below the 10th percentile rank point were designated as actually “at risk.” On the screening measure, students were classified as “at risk” and requiring intervention if their MAP Growth scores fell below the 30th percentile rank point based on the MAP Growth norming study conducted by NWEA in 2020, disaggregated by term, grade, and subject area. Students with MAP Growth scores at or above the 30th percentile point based on the 2020 MAP Growth norming study were classified as “not at risk.”
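For context, the AUC values reported in the classification and cross-validation tables that follow can be reproduced from raw scores without explicitly tracing a ROC curve: the AUC equals the Mann-Whitney probability that the screener rank-orders a randomly chosen at-risk student below a randomly chosen not-at-risk student. The sketch below uses synthetic data and assumed variable names purely for illustration.

```python
import numpy as np

def auc_from_scores(scores, at_risk):
    """AUC computed as the Mann-Whitney probability that a randomly chosen at-risk
    student scores lower on the screener than a randomly chosen not-at-risk student
    (lower score = higher risk); ties count as half."""
    scores = np.asarray(scores, dtype=float)
    at_risk = np.asarray(at_risk, dtype=bool)
    risk, safe = scores[at_risk], scores[~at_risk]
    wins = sum(np.sum(r < safe) + 0.5 * np.sum(r == safe) for r in risk)
    return wins / (len(risk) * len(safe))

# Illustration with synthetic screener scores and a 10th-percentile criterion cut.
rng = np.random.default_rng(2)
map_scores = rng.normal(200, 15, 5000)
criterion = 0.8 * (map_scores - 200) + rng.normal(0, 9, 5000)
print(round(auc_from_scores(map_scores, criterion < np.percentile(criterion, 10)), 2))
```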
- Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
- No
- If yes, please describe the intervention, what children received the intervention, and how they were chosen.
- Some students may have been involved in various interventions in their particular schools, but we do not know which interventions or which students.
Classification Accuracy - Fall
Evidence | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 |
---|---|---|---|---|---|---|---|---|---|
Criterion measure | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments |
Cut Points - Percentile rank on criterion measure | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
Cut Points - Performance score on criterion measure | 178 | 178 | 178 | 178 | 188 | 195 | 197 | 201 | 203 |
Cut Points - Corresponding performance score (numeric) on screener measure | 130 | 149 | 164 | 178 | 188 | 196 | 202 | 206 | 209 |
Classification Data - True Positive (a) | 217 | 845 | 881 | 2105 | 2066 | 2010 | 1636 | 1435 | 1233 |
Classification Data - False Positive (b) | 688 | 2943 | 2816 | 4514 | 3352 | 3030 | 2957 | 2537 | 2093 |
Classification Data - False Negative (c) | 445 | 254 | 92 | 342 | 562 | 473 | 363 | 346 | 266 |
Classification Data - True Negative (d) | 6401 | 8746 | 8073 | 21823 | 21883 | 21366 | 17440 | 15713 | 12825 |
Area Under the Curve (AUC) | 0.77 | 0.84 | 0.89 | 0.92 | 0.91 | 0.93 | 0.92 | 0.92 | 0.91 |
AUC Estimate’s 95% Confidence Interval: Lower Bound | 0.75 | 0.82 | 0.88 | 0.91 | 0.91 | 0.92 | 0.91 | 0.91 | 0.91 |
AUC Estimate’s 95% Confidence Interval: Upper Bound | 0.79 | 0.85 | 0.90 | 0.92 | 0.92 | 0.93 | 0.92 | 0.92 | 0.92 |
Statistics | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 |
---|---|---|---|---|---|---|---|---|---|
Base Rate | 0.09 | 0.09 | 0.08 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 |
Overall Classification Rate | 0.85 | 0.75 | 0.75 | 0.83 | 0.86 | 0.87 | 0.85 | 0.86 | 0.86 |
Sensitivity | 0.33 | 0.77 | 0.91 | 0.86 | 0.79 | 0.81 | 0.82 | 0.81 | 0.82 |
Specificity | 0.90 | 0.75 | 0.74 | 0.83 | 0.87 | 0.88 | 0.86 | 0.86 | 0.86 |
False Positive Rate | 0.10 | 0.25 | 0.26 | 0.17 | 0.13 | 0.12 | 0.14 | 0.14 | 0.14 |
False Negative Rate | 0.67 | 0.23 | 0.09 | 0.14 | 0.21 | 0.19 | 0.18 | 0.19 | 0.18 |
Positive Predictive Power | 0.24 | 0.22 | 0.24 | 0.32 | 0.38 | 0.40 | 0.36 | 0.36 | 0.37 |
Negative Predictive Power | 0.93 | 0.97 | 0.99 | 0.98 | 0.97 | 0.98 | 0.98 | 0.98 | 0.98 |
Sample | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 |
---|---|---|---|---|---|---|---|---|---|
Date | 2014-2018 | 2015-2018 | 2016-2018 | 2017-2018 | 2017-2018 | 2017-2018 | 2017-2018 | 2017-2018 | 2017-2018 |
Sample Size | 7751 | 12788 | 11862 | 28784 | 27863 | 26879 | 22396 | 20031 | 16417 |
Geographic Representation | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) |
Male | 49.2% | 50.2% | 50.4% | 50.8% | 50.2% | 50.3% | 50.3% | 50.0% | 50.5% |
Female | 50.8% | 49.8% | 49.6% | 49.2% | 49.8% | 49.7% | 49.7% | 50.0% | 49.5% |
Other | |||||||||
Gender Unknown | |||||||||
White, Non-Hispanic | 38.5% | 46.7% | 43.4% | 47.3% | 47.7% | 47.9% | 47.1% | 46.7% | 45.8% |
Black, Non-Hispanic | 12.4% | 12.3% | 15.6% | 14.9% | 13.6% | 13.2% | 12.9% | 11.9% | 13.9% |
Hispanic | 31.2% | 27.5% | 29.8% | 24.1% | 24.5% | 25.4% | 26.6% | 27.4% | 26.5% |
Asian/Pacific Islander | 6.6% | 5.5% | 5.0% | 5.6% | 5.9% | 5.9% | 5.8% | 6.4% | 6.3% |
American Indian/Alaska Native | 1.1% | 1.0% | 1.3% | 0.8% | 0.9% | 0.7% | 1.1% | 1.1% | 1.3% |
Other | 1.5% | 2.6% | 3.0% | 3.6% | 3.4% | 3.0% | 2.5% | 2.3% | 2.3% |
Race / Ethnicity Unknown | 8.7% | 4.3% | 1.8% | 3.6% | 3.9% | 4.0% | 4.0% | 4.2% | 4.0% |
Low SES | |||||||||
IEP or diagnosed disability | |||||||||
English Language Learner |
Classification Accuracy - Winter
Evidence | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 |
---|---|---|---|---|---|---|---|---|---|
Criterion measure | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments |
Cut Points - Percentile rank on criterion measure | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
Cut Points - Performance score on criterion measure | 178 | 178 | 178 | 178 | 188 | 195 | 197 | 201 | 203 |
Cut Points - Corresponding performance score (numeric) on screener measure | 140 | 159 | 173 | 185 | 194 | 201 | 205 | 209 | 212 |
Classification Data - True Positive (a) | 317 | 939 | 899 | 2209 | 2201 | 2126 | 1746 | 1478 | 1175 |
Classification Data - False Positive (b) | 900 | 3725 | 2402 | 4897 | 3467 | 3632 | 3231 | 2831 | 1850 |
Classification Data - False Negative (c) | 304 | 122 | 133 | 294 | 453 | 377 | 268 | 292 | 253 |
Classification Data - True Negative (d) | 5437 | 7507 | 8874 | 21054 | 21128 | 19804 | 16556 | 14605 | 12070 |
Area Under the Curve (AUC) | 0.80 | 0.86 | 0.91 | 0.92 | 0.92 | 0.92 | 0.93 | 0.92 | 0.92 |
AUC Estimate’s 95% Confidence Interval: Lower Bound | 0.79 | 0.85 | 0.90 | 0.92 | 0.92 | 0.92 | 0.92 | 0.91 | 0.92 |
AUC Estimate’s 95% Confidence Interval: Upper Bound | 0.82 | 0.87 | 0.91 | 0.93 | 0.92 | 0.93 | 0.93 | 0.92 | 0.93 |
Statistics | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 |
---|---|---|---|---|---|---|---|---|---|
Base Rate | 0.09 | 0.09 | 0.08 | 0.09 | 0.10 | 0.10 | 0.09 | 0.09 | 0.09 |
Overall Classification Rate | 0.83 | 0.69 | 0.79 | 0.82 | 0.86 | 0.85 | 0.84 | 0.84 | 0.86 |
Sensitivity | 0.51 | 0.89 | 0.87 | 0.88 | 0.83 | 0.85 | 0.87 | 0.84 | 0.82 |
Specificity | 0.86 | 0.67 | 0.79 | 0.81 | 0.86 | 0.85 | 0.84 | 0.84 | 0.87 |
False Positive Rate | 0.14 | 0.33 | 0.21 | 0.19 | 0.14 | 0.15 | 0.16 | 0.16 | 0.13 |
False Negative Rate | 0.49 | 0.11 | 0.13 | 0.12 | 0.17 | 0.15 | 0.13 | 0.16 | 0.18 |
Positive Predictive Power | 0.26 | 0.20 | 0.27 | 0.31 | 0.39 | 0.37 | 0.35 | 0.34 | 0.39 |
Negative Predictive Power | 0.95 | 0.98 | 0.99 | 0.99 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 |
Sample | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 |
---|---|---|---|---|---|---|---|---|---|
Date | 2014-2018 | 2015-2018 | 2016-2018 | 2017-2018 | 2017-2018 | 2017-2018 | 2017-2018 | 2017-2018 | 2017-2018 |
Sample Size | 6958 | 12293 | 12308 | 28454 | 27249 | 25939 | 21801 | 19206 | 15348 |
Geographic Representation | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) | Middle Atlantic (NY), South Atlantic (FL), West North Central (MO), West South Central (AR) |
Male | 49.5% | 50.3% | 50.5% | 50.9% | 50.2% | 50.4% | 50.3% | 50.1% | 50.2% |
Female | 50.5% | 49.7% | 49.5% | 49.1% | 49.8% | 49.6% | 49.7% | 49.9% | 49.8% |
Other | |||||||||
Gender Unknown | |||||||||
White, Non-Hispanic | 37.2% | 46.0% | 43.7% | 47.7% | 48.0% | 48.3% | 47.0% | 46.6% | 46.2% |
Black, Non-Hispanic | 14.7% | 12.5% | 15.9% | 15.5% | 14.1% | 13.5% | 13.7% | 12.8% | 14.1% |
Hispanic | 33.1% | 28.1% | 29.9% | 24.3% | 24.9% | 25.7% | 26.9% | 27.3% | 27.0% |
Asian/Pacific Islander | 6.4% | 5.7% | 4.8% | 4.6% | 4.8% | 4.7% | 5.3% | 6.0% | 5.8% |
American Indian/Alaska Native | 1.3% | 1.1% | 1.3% | 0.8% | 0.8% | 0.7% | 1.1% | 1.1% | 1.3% |
Other | 1.6% | 2.7% | 3.0% | 3.6% | 3.4% | 2.9% | 2.4% | 2.4% | 2.3% |
Race / Ethnicity Unknown | 5.7% | 3.9% | 1.4% | 3.5% | 4.0% | 4.1% | 3.7% | 3.9% | 3.4% |
Low SES | |||||||||
IEP or diagnosed disability | |||||||||
English Language Learner |
Classification Accuracy - Spring
Evidence | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 |
---|---|---|---|---|---|---|---|---|---|
Criterion measure | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments |
Cut Points - Percentile rank on criterion measure | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
Cut Points - Performance score on criterion measure | 178 | 178 | 178 | 178 | 188 | 195 | 197 | 201 | 203 |
Cut Points - Corresponding performance score (numeric) on screener measure | 147 | 164 | 177 | 189 | 196 | 203 | 207 | 210 | 213 |
Classification Data - True Positive (a) | 463 | 981 | 869 | 2401 | 2406 | 2378 | 1967 | 1699 | 1484 |
Classification Data - False Positive (b) | 1463 | 3162 | 1807 | 4579 | 3623 | 3613 | 3567 | 3048 | 2487 |
Classification Data - False Negative (c) | 257 | 151 | 190 | 314 | 505 | 389 | 293 | 365 | 274 |
Classification Data - True Negative (d) | 6138 | 8770 | 9980 | 23404 | 23237 | 22008 | 18321 | 16665 | 13575 |
Area Under the Curve (AUC) | 0.82 | 0.88 | 0.91 | 0.93 | 0.93 | 0.93 | 0.93 | 0.92 | 0.92 |
AUC Estimate’s 95% Confidence Interval: Lower Bound | 0.80 | 0.87 | 0.91 | 0.93 | 0.92 | 0.93 | 0.92 | 0.91 | 0.91 |
AUC Estimate’s 95% Confidence Interval: Upper Bound | 0.83 | 0.89 | 0.92 | 0.94 | 0.93 | 0.94 | 0.93 | 0.92 | 0.93 |
Statistics | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 |
---|---|---|---|---|---|---|---|---|---|
Base Rate | 0.09 | 0.09 | 0.08 | 0.09 | 0.10 | 0.10 | 0.09 | 0.09 | 0.10 |
Overall Classification Rate | 0.79 | 0.75 | 0.84 | 0.84 | 0.86 | 0.86 | 0.84 | 0.84 | 0.85 |
Sensitivity | 0.64 | 0.87 | 0.82 | 0.88 | 0.83 | 0.86 | 0.87 | 0.82 | 0.84 |
Specificity | 0.81 | 0.73 | 0.85 | 0.84 | 0.87 | 0.86 | 0.84 | 0.85 | 0.85 |
False Positive Rate | 0.19 | 0.27 | 0.15 | 0.16 | 0.13 | 0.14 | 0.16 | 0.15 | 0.15 |
False Negative Rate | 0.36 | 0.13 | 0.18 | 0.12 | 0.17 | 0.14 | 0.13 | 0.18 | 0.16 |
Positive Predictive Power | 0.24 | 0.24 | 0.32 | 0.34 | 0.40 | 0.40 | 0.36 | 0.36 | 0.37 |
Negative Predictive Power | 0.96 | 0.98 | 0.98 | 0.99 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 |
Sample | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 |
---|---|---|---|---|---|---|---|---|---|
Date | 2014-2018 | 2015-2018 | 2016-2018 | 2017-2018 | 2017-2018 | 2017-2018 | 2017-2018 | 2017-2018 | 2017-2018 |
Sample Size | 8321 | 13064 | 12846 | 30698 | 29771 | 28388 | 24148 | 21777 | 17820 |
Geographic Representation | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) | Middle Atlantic (NY), Mountain (CO), South Atlantic (FL), West North Central (MO), West South Central (AR) |
Male | 49.1% | 50.2% | 50.3% | 50.9% | 50.2% | 50.5% | 50.4% | 50.2% | 50.7% |
Female | 50.9% | 49.8% | 49.7% | 49.1% | 49.8% | 49.5% | 49.6% | 49.8% | 49.3% |
Other | |||||||||
Gender Unknown | |||||||||
White, Non-Hispanic | 39.3% | 46.6% | 43.7% | 47.8% | 48.1% | 47.9% | 46.9% | 47.1% | 46.3% |
Black, Non-Hispanic | 12.8% | 12.3% | 15.4% | 14.8% | 13.7% | 13.3% | 13.3% | 12.2% | 14.0% |
Hispanic | 32.0% | 27.5% | 29.8% | 24.0% | 24.3% | 25.5% | 26.6% | 27.0% | 26.2% |
Asian/Pacific Islander | 6.8% | 5.5% | 4.9% | 5.5% | 5.8% | 5.8% | 5.7% | 6.3% | 6.1% |
American Indian/Alaska Native | 1.1% | 1.0% | 1.3% | 0.7% | 0.8% | 0.7% | 1.1% | 1.0% | 1.3% |
Other | 1.6% | 2.6% | 3.1% | 3.6% | 3.4% | 3.0% | 2.5% | 2.4% | 2.4% |
Race / Ethnicity Unknown | 6.4% | 4.4% | 1.8% | 3.6% | 3.9% | 3.9% | 3.8% | 4.0% | 3.8% |
Low SES | |||||||||
IEP or diagnosed disability | |||||||||
English Language Learner |
Cross-Validation - Fall
Evidence | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 |
---|---|---|---|---|---|---|---|---|---|
Criterion measure | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments |
Cut Points - Percentile rank on criterion measure | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
Cut Points - Performance score on criterion measure | 178 | 178 | 178 | 179 | 186 | 191 | 196 | 199 | 203 |
Cut Points - Corresponding performance score (numeric) on screener measure | 130 | 149 | 164 | 178 | 188 | 196 | 202 | 206 | 209 |
Classification Data - True Positive (a) | 294 | 1326 | 1142 | 2949 | 2815 | 2924 | 2880 | 2753 | 2778 |
Classification Data - False Positive (b) | 1343 | 5572 | 4352 | 6610 | 4843 | 5687 | 5668 | 5069 | 5023 |
Classification Data - False Negative (c) | 484 | 476 | 240 | 599 | 740 | 589 | 634 | 534 | 538 |
Classification Data - True Negative (d) | 10739 | 17182 | 15349 | 28699 | 31024 | 31196 | 30165 | 29639 | 28467 |
Area Under the Curve (AUC) | 0.75 | 0.82 | 0.88 | 0.90 | 0.91 | 0.92 | 0.91 | 0.92 | 0.92 |
AUC Estimate’s 95% Confidence Interval: Lower Bound | 0.73 | 0.82 | 0.87 | 0.89 | 0.91 | 0.91 | 0.91 | 0.92 | 0.92 |
AUC Estimate’s 95% Confidence Interval: Upper Bound | 0.77 | 0.83 | 0.89 | 0.90 | 0.92 | 0.92 | 0.92 | 0.93 | 0.93 |
Statistics | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 |
---|---|---|---|---|---|---|---|---|---|
Base Rate | 0.06 | 0.07 | 0.07 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 |
Overall Classification Rate | 0.86 | 0.75 | 0.78 | 0.81 | 0.86 | 0.84 | 0.84 | 0.85 | 0.85 |
Sensitivity | 0.38 | 0.74 | 0.83 | 0.83 | 0.79 | 0.83 | 0.82 | 0.84 | 0.84 |
Specificity | 0.89 | 0.76 | 0.78 | 0.81 | 0.86 | 0.85 | 0.84 | 0.85 | 0.85 |
False Positive Rate | 0.11 | 0.24 | 0.22 | 0.19 | 0.14 | 0.15 | 0.16 | 0.15 | 0.15 |
False Negative Rate | 0.62 | 0.26 | 0.17 | 0.17 | 0.21 | 0.17 | 0.18 | 0.16 | 0.16 |
Positive Predictive Power | 0.18 | 0.19 | 0.21 | 0.31 | 0.37 | 0.34 | 0.34 | 0.35 | 0.36 |
Negative Predictive Power | 0.96 | 0.97 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 |
Sample | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 |
---|---|---|---|---|---|---|---|---|---|
Date | 2015-2019 | 2016-2019 | 2017-2019 | 2018-2019 | 2018-2019 | 2018-2019 | 2018-2019 | 2018-2019 | 2018-2019 |
Sample Size | 12860 | 24556 | 21083 | 38857 | 39422 | 40396 | 39347 | 37995 | 36806 |
Geographic Representation | East North Central (IN) | East North Central (IN) | East North Central (IN) | East North Central (IN) | East North Central (IN) | East North Central (IN) | East North Central (IN) | East North Central (IN) | East North Central (IN) |
Male | 51.2% | 51.9% | 51.5% | 51.8% | 50.5% | 50.9% | 50.9% | 50.7% | 50.9% |
Female | 48.8% | 48.1% | 48.5% | 48.2% | 49.5% | 49.1% | 49.1% | 49.3% | 49.1% |
Other | |||||||||
Gender Unknown | |||||||||
White, Non-Hispanic | 69.1% | 69.7% | 72.9% | 66.6% | 66.8% | 66.5% | 67.5% | 67.5% | 68.5% |
Black, Non-Hispanic | 12.0% | 10.4% | 9.0% | 12.4% | 12.1% | 12.1% | 11.7% | 11.8% | 11.5% |
Hispanic | 12.0% | 12.9% | 11.1% | 13.8% | 14.1% | 14.5% | 14.4% | 13.9% | 13.7% |
Asian/Pacific Islander | 1.7% | 1.7% | 1.8% | 1.9% | 1.7% | 1.7% | 1.6% | 1.8% | 1.7% |
American Indian/Alaska Native | 0.1% | 0.1% | 0.1% | 0.1% | 0.2% | 0.1% | 0.2% | 0.2% | 0.2% |
Other | 5.0% | 5.2% | 4.9% | 5.2% | 5.0% | 5.0% | 4.7% | 4.7% | 4.4% |
Race / Ethnicity Unknown | |||||||||
Low SES | |||||||||
IEP or diagnosed disability | |||||||||
English Language Learner |
Cross-Validation - Winter
Evidence | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 |
---|---|---|---|---|---|---|---|---|---|
Criterion measure | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments |
Cut Points - Percentile rank on criterion measure | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
Cut Points - Performance score on criterion measure | 178 | 178 | 178 | 179 | 186 | 191 | 196 | 199 | 203 |
Cut Points - Corresponding performance score (numeric) on screener measure | 140 | 159 | 173 | 185 | 194 | 201 | 205 | 209 | 212 |
Classification Data - True Positive (a) | 557 | 1533 | 1069 | 3060 | 3018 | 3172 | 2988 | 2876 | 2814 |
Classification Data - False Positive (b) | 2512 | 6820 | 3301 | 6046 | 4902 | 6804 | 5819 | 5762 | 4658 |
Classification Data - False Negative (c) | 349 | 224 | 274 | 563 | 609 | 446 | 591 | 412 | 535 |
Classification Data - True Negative (d) | 9960 | 14833 | 16777 | 28606 | 30046 | 29211 | 28567 | 26497 | 26298 |
Area Under the Curve (AUC) | 0.80 | 0.86 | 0.90 | 0.91 | 0.92 | 0.92 | 0.91 | 0.92 | 0.92 |
AUC Estimate’s 95% Confidence Interval: Lower Bound | 0.78 | 0.85 | 0.89 | 0.91 | 0.92 | 0.92 | 0.91 | 0.92 | 0.92 |
AUC Estimate’s 95% Confidence Interval: Upper Bound | 0.81 | 0.86 | 0.91 | 0.91 | 0.93 | 0.93 | 0.92 | 0.93 | 0.93 |
Statistics | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 |
---|---|---|---|---|---|---|---|---|---|
Base Rate | 0.07 | 0.08 | 0.06 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 | 0.10 |
Overall Classification Rate | 0.79 | 0.70 | 0.83 | 0.83 | 0.86 | 0.82 | 0.83 | 0.83 | 0.85 |
Sensitivity | 0.61 | 0.87 | 0.80 | 0.84 | 0.83 | 0.88 | 0.83 | 0.87 | 0.84 |
Specificity | 0.80 | 0.69 | 0.84 | 0.83 | 0.86 | 0.81 | 0.83 | 0.82 | 0.85 |
False Positive Rate | 0.20 | 0.31 | 0.16 | 0.17 | 0.14 | 0.19 | 0.17 | 0.18 | 0.15 |
False Negative Rate | 0.39 | 0.13 | 0.20 | 0.16 | 0.17 | 0.12 | 0.17 | 0.13 | 0.16 |
Positive Predictive Power | 0.18 | 0.18 | 0.24 | 0.34 | 0.38 | 0.32 | 0.34 | 0.33 | 0.38 |
Negative Predictive Power | 0.97 | 0.99 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 |
Sample | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 |
---|---|---|---|---|---|---|---|---|---|
Date | 2015-2019 | 2016-2019 | 2017-2019 | 2018-2019 | 2018-2019 | 2018-2019 | 2018-2019 | 2018-2019 | 2018-2019 |
Sample Size | 13378 | 23410 | 21421 | 38275 | 38575 | 39633 | 37965 | 35547 | 34305 |
Geographic Representation | East North Central (IN) | East North Central (IN) | East North Central (IN) | East North Central (IN) | East North Central (IN) | East North Central (IN) | East North Central (IN) | East North Central (IN) | East North Central (IN) |
Male | 51.4% | 51.9% | 51.5% | 51.7% | 50.6% | 50.8% | 51.2% | 50.8% | 51.0% |
Female | 48.6% | 48.1% | 48.5% | 48.3% | 49.4% | 49.2% | 48.8% | 49.2% | 49.0% |
Other | |||||||||
Gender Unknown | |||||||||
White, Non-Hispanic | 68.8% | 69.2% | 73.2% | 66.2% | 66.4% | 66.0% | 66.8% | 66.4% | 67.4% |
Black, Non-Hispanic | 11.1% | 10.8% | 9.0% | 12.6% | 12.4% | 12.5% | 12.0% | 12.7% | 12.3% |
Hispanic | 13.1% | 13.2% | 10.9% | 13.9% | 14.3% | 14.8% | 14.8% | 14.3% | 14.1% |
Asian/Pacific Islander | 1.8% | 1.6% | 1.8% | 1.9% | 1.7% | 1.6% | 1.6% | 1.7% | 1.6% |
American Indian/Alaska Native | 0.1% | 0.1% | 0.1% | 0.1% | 0.2% | 0.1% | 0.2% | 0.2% | 0.2% |
Other | 5.1% | 5.1% | 5.0% | 5.2% | 5.0% | 5.0% | 4.7% | 4.7% | 4.4% |
Race / Ethnicity Unknown | |||||||||
Low SES | |||||||||
IEP or diagnosed disability | |||||||||
English Language Learner |
Cross-Validation - Spring
Evidence | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 |
---|---|---|---|---|---|---|---|---|---|
Criterion measure | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments | Linked State Summative Assessments |
Cut Points - Percentile rank on criterion measure | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
Cut Points - Performance score on criterion measure | 178 | 178 | 178 | 179 | 186 | 191 | 196 | 199 | 203 |
Cut Points - Corresponding performance score (numeric) on screener measure | 147 | 164 | 177 | 189 | 196 | 203 | 207 | 210 | 213 |
Classification Data - True Positive (a) | 736 | 1514 | 1047 | 3259 | 3309 | 3431 | 3334 | 3292 | 3275 |
Classification Data - False Positive (b) | 2816 | 5485 | 2644 | 5611 | 5430 | 6929 | 6339 | 5738 | 5382 |
Classification Data - False Negative (c) | 331 | 340 | 353 | 635 | 582 | 441 | 589 | 473 | 561 |
Classification Data - True Negative (d) | 11903 | 17313 | 18349 | 31194 | 31788 | 31127 | 30962 | 30706 | 29650 |
Area Under the Curve (AUC) | 0.83 | 0.87 | 0.90 | 0.92 | 0.93 | 0.93 | 0.92 | 0.93 | 0.93 |
AUC Estimate’s 95% Confidence Interval: Lower Bound | 0.82 | 0.86 | 0.90 | 0.91 | 0.92 | 0.93 | 0.92 | 0.93 | 0.92 |
AUC Estimate’s 95% Confidence Interval: Upper Bound | 0.84 | 0.88 | 0.91 | 0.92 | 0.93 | 0.93 | 0.92 | 0.94 | 0.93 |
Statistics | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 |
---|---|---|---|---|---|---|---|---|---|
Base Rate | 0.07 | 0.08 | 0.06 | 0.10 | 0.09 | 0.09 | 0.10 | 0.09 | 0.10 |
Overall Classification Rate | 0.80 | 0.76 | 0.87 | 0.85 | 0.85 | 0.82 | 0.83 | 0.85 | 0.85 |
Sensitivity | 0.69 | 0.82 | 0.75 | 0.84 | 0.85 | 0.89 | 0.85 | 0.87 | 0.85 |
Specificity | 0.81 | 0.76 | 0.87 | 0.85 | 0.85 | 0.82 | 0.83 | 0.84 | 0.85 |
False Positive Rate | 0.19 | 0.24 | 0.13 | 0.15 | 0.15 | 0.18 | 0.17 | 0.16 | 0.15 |
False Negative Rate | 0.31 | 0.18 | 0.25 | 0.16 | 0.15 | 0.11 | 0.15 | 0.13 | 0.15 |
Positive Predictive Power | 0.21 | 0.22 | 0.28 | 0.37 | 0.38 | 0.33 | 0.34 | 0.36 | 0.38 |
Negative Predictive Power | 0.97 | 0.98 | 0.98 | 0.98 | 0.98 | 0.99 | 0.98 | 0.98 | 0.98 |
Sample | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 |
---|---|---|---|---|---|---|---|---|---|
Date | 2015-2019 | 2016-2019 | 2017-2019 | 2018-2019 | 2018-2019 | 2018-2019 | 2018-2019 | 2018-2019 | 2018-2019 |
Sample Size | 15786 | 24652 | 22393 | 40699 | 41109 | 41928 | 41224 | 40209 | 38868 |
Geographic Representation | East North Central (IN) | East North Central (IN) | East North Central (IN) | East North Central (IN) | East North Central (IN) | East North Central (IN) | East North Central (IN) | East North Central (IN) | East North Central (IN) |
Male | 51.3% | 52.0% | 51.5% | 51.8% | 50.7% | 50.9% | 51.1% | 51.0% | 51.1% |
Female | 48.7% | 48.0% | 48.5% | 48.2% | 49.3% | 49.1% | 48.9% | 49.0% | 48.9% |
Other | |||||||||
Gender Unknown | |||||||||
White, Non-Hispanic | 68.4% | 69.2% | 73.4% | 65.9% | 66.3% | 65.9% | 66.8% | 66.8% | 67.8% |
Black, Non-Hispanic | 11.8% | 10.6% | 8.8% | 12.7% | 12.5% | 12.5% | 12.2% | 12.4% | 12.0% |
Hispanic | 13.0% | 13.3% | 10.9% | 14.0% | 14.2% | 14.6% | 14.5% | 14.0% | 13.8% |
Asian/Pacific Islander | 1.7% | 1.7% | 1.9% | 2.0% | 1.8% | 1.8% | 1.6% | 1.8% | 1.7% |
American Indian/Alaska Native | 0.1% | 0.1% | 0.1% | 0.1% | 0.2% | 0.1% | 0.2% | 0.2% | 0.2% |
Other | 4.9% | 5.1% | 5.0% | 5.2% | 5.0% | 5.1% | 4.8% | 4.8% | 4.5% |
Race / Ethnicity Unknown | |||||||||
Low SES | |||||||||
IEP or diagnosed disability | |||||||||
English Language Learner |
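The summary statistics reported in the tables above are simple functions of the classification counts (true and false positives and negatives). As an illustrative check, the brief sketch below recomputes the kindergarten column of the spring cross-validation table; the function name and layout are illustrative only, not NWEA code.

```python
# Recompute the reported statistics from the 2x2 classification counts.
# Values are the kindergarten column of the spring cross-validation table
# (TP = 736, FP = 2816, FN = 331, TN = 11903).

def classification_stats(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    return {
        "Base Rate": (tp + fn) / total,
        "Overall Classification Rate": (tp + tn) / total,
        "Sensitivity": tp / (tp + fn),
        "Specificity": tn / (tn + fp),
        "False Positive Rate": fp / (fp + tn),
        "False Negative Rate": fn / (fn + tp),
        "Positive Predictive Power": tp / (tp + fp),
        "Negative Predictive Power": tn / (tn + fn),
    }

for name, value in classification_stats(tp=736, fp=2816, fn=331, tn=11903).items():
    print(f"{name}: {value:.2f}")
# Sensitivity ≈ 0.69, specificity ≈ 0.81, base rate ≈ 0.07, matching the table.
```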
Reliability
Grade | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 |
---|---|---|---|---|---|---|---|---|---|
Rating | d | d | d | d | d | d | d | d | d |
- *Offer a justification for each type of reliability reported, given the type and purpose of the tool.
- We provide evidence for marginal reliability and test-retest with alternate forms reliability to support the use of MAP Growth as a universal screener that can be administered three times per school year. Marginal reliability (internal consistency) measures how well the items on a test that reflect the same construct yield similar results. Determining the internal consistency of MAP Growth tests is challenging because traditional methods depend on all test takers taking a common test consisting of the same items. Application of these methods to adaptive tests is statistically cumbersome and inaccurate. Fortunately, an equally valid alternative is available in the marginal reliability coefficient (Samejima, 1977, 1994) that incorporates measurement error as a function of the test score. In effect, it is the result of combining measurement error estimated at different points on the achievement scale into a single index. This method of calculating internal consistency, 𝜌𝜃, yields results that are nearly identical to coefficient alpha when both methods are applied to the same fixed-form tests. MAP Growth affords the means to assess students on multiple occasions (e.g., fall, winter, and spring) during the school year. Thus, test-retest reliability is key as it provides insight into the consistency of MAP Growth assessments across time. The adaptive nature of MAP Growth assessments requires reliability to be examined using nontraditional methods because dynamic item selection is an integral part of MAP Growth. Parallel forms are restricted to identical item content from a common goal structure, but the item difficulties depend on the student’s responses to previous items on the test. Therefore, test-retest reliability of MAP Growth is more accurately described as a mix between test-retest reliability and a type of alternate forms reliability, both of which are spread across several months versus the typical two or three weeks. The second test (or retest) is not the same test. Rather, it is one that is comparable to the first by its content and structure, differing only in the difficulty level of its items. In other words, test-retest with alternate forms (Crocker & Algina, 1986) describes the influence of two sources of measurement error: time and item selection. Specifically, test-retest with alternate forms reliability for MAP Growth was estimated via the Pearson correlation between MAP Growth RIT scores of students taking MAP Growth in two terms within the 2018-19 school year (e.g., Fall 2018 and Winter 2019, Winter 2019 and Spring 2019).
- *Describe the sample(s), including size and characteristics, for each reliability analysis conducted.
- The sample for the study contained student records from a total of five states (Arkansas, Colorado, New York, Florida, and Missouri), and thus had representation from all four U.S. Census regions and five divisions. MAP Growth data were from test administrations occurring during the Fall 2018, Winter 2019, and Spring 2019 school terms, which spanned from August 2018 to June 2019. Gender (Percent): Male: 50.8%, Female: 49.2%. Race/Ethnicity (Percent): White, Non-Hispanic: 55.33%, Black, Non-Hispanic: 15.74%, Hispanic: 20.09%, American Indian/Alaska Native: 0.80%, Asian/Pacific Islander: 4.03%, Multi-Ethnic: 3.51%, Native Hawaiian/Other Pacific Islander: 0.49%.
- *Describe the analysis procedures for each reported type of reliability.
- Marginal Reliability: The approach taken for estimating marginal reliability on MAP Growth was suggested by Wright (1999). Marginal reliability is computed as the ratio of the difference between the observed variance of the achievement estimates (i.e., MAP Growth RIT scores) and the observed mean of the scores' conditional error variances at each achievement level to the observed variance of the achievement estimates. A bootstrapping approach is used to calculate a 95% confidence interval for marginal reliability. For an initial dataset of the achievement levels and CSEMs for N students, a bootstrap 95% confidence interval for marginal reliability is obtained as follows: 1. Draw a random sample of size N with replacement from the initial dataset. 2. Calculate marginal reliability based on the random sample drawn in Step 1. 3. Repeat steps 1 and 2, 1,000 times. 4. Determine the 2.5th and 97.5th percentile points from the resulting 1,000 estimates of marginal reliability. The values at these two percentiles are the bounds of the bootstrap 95% confidence interval. Test-Retest Reliability: Test-retest reliability of MAP Growth was estimated as the Pearson correlation of RIT scores for a set of students who took MAP Growth twice within the 2018-19 school year: in Fall 2018 and Winter 2019, in Fall 2018 and Spring 2019, or in Winter 2019 and Spring 2019. Fundamentally, the test-retest reliability coefficient is a Pearson correlation. As such, the confidence interval (CI) for the test-retest reliability coefficient was obtained using the standard CI for a Pearson correlation (i.e., via the Fisher z-transformation; Fisher, 1921).
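A minimal sketch of these procedures, assuming hypothetical arrays of RIT scores and their conditional standard errors of measurement (CSEMs); the function names, the 1,000-replicate default, and the 1.96 critical value are illustrative rather than NWEA's operational code.

```python
import numpy as np

rng = np.random.default_rng(0)

def marginal_reliability(rit, csem):
    """(Observed variance of RIT scores - mean conditional error variance)
    divided by the observed variance of RIT scores."""
    rit = np.asarray(rit, dtype=float)
    csem = np.asarray(csem, dtype=float)
    var_rit = rit.var(ddof=1)
    return (var_rit - np.mean(csem ** 2)) / var_rit

def bootstrap_ci(rit, csem, n_boot=1000):
    """Steps 1-4 above: resample students with replacement, recompute the
    coefficient, and take the 2.5th and 97.5th percentiles."""
    rit = np.asarray(rit, dtype=float)
    csem = np.asarray(csem, dtype=float)
    n = len(rit)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        estimates.append(marginal_reliability(rit[idx], csem[idx]))
    return np.percentile(estimates, [2.5, 97.5])

def fisher_ci(r, n):
    """95% CI for a Pearson correlation via the Fisher z-transformation."""
    z, half_width = np.arctanh(r), 1.96 / np.sqrt(n - 3)
    return np.tanh(z - half_width), np.tanh(z + half_width)

# Test-retest with alternate forms: correlate paired term scores for the same
# students (hypothetical arrays), then apply fisher_ci to the resulting r.
# r = np.corrcoef(fall_rit, winter_rit)[0, 1]
# lo, hi = fisher_ci(r, n=len(fall_rit))
```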
*In the table(s) below, report the results of the reliability analyses described above (e.g., internal consistency or inter-rater reliability coefficients).
Type of Reliability | Subgroup | Informant | Age / Grade | Test or Criterion | n | Median Coefficient | 95% Confidence Interval Lower Bound | 95% Confidence Interval Upper Bound |
---|---|---|---|---|---|---|---|---|
- Results from other forms of reliability analysis not compatible with above table format:
- Manual cites other published reliability studies:
- No
- Provide citations for additional published studies.
- Do you have reliability data that are disaggregated by gender, race/ethnicity, or other subgroups (e.g., English language learners, students with disabilities)?
- Yes
If yes, fill in data for each subgroup with disaggregated reliability data.
Type of Reliability | Subgroup | Informant | Age / Grade | Test or Criterion | n | Median Coefficient | 95% Confidence Interval Lower Bound | 95% Confidence Interval Upper Bound |
---|---|---|---|---|---|---|---|---|
- Results from other forms of reliability analysis not compatible with above table format:
- Tables with extensive disaggregated data are available from the Center upon request.
- Manual cites other published reliability studies:
- No
- Provide citations for additional published studies.
Validity
Grade | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 |
---|---|---|---|---|---|---|---|---|---|
Rating | d | d | d | d | d | d | d | d | d |
- *Describe each criterion measure used and explain why each measure is appropriate, given the type and purpose of the tool.
- In general terms, the better a test measures what it purports to measure and can support its intended uses and decision making, the stronger its validity is said to be. Within this broad statement resides a wide range of information that can be used as validity evidence. This information ranges, for example, from the adequacy and coverage of a test's content, to its ability to yield scores that are predictive of a status in some area, to its ability to draw accurate inferences about a test taker's status with respect to a construct, to its ability to allow generalizations from test performance within a domain to like performance in the same domain. Much of the validity evidence for MAP Growth comes from the relationships of MAP Growth test scores to state content-aligned accountability test scores. These relationships include a) the concurrent relationship between students' performance on MAP Growth tests and their performance on state tests given for accountability purposes and b) the predictive relationship between students' performance on MAP Growth tests and their performance, two testing terms later, on state accountability tests. Several important points should be noted regarding concurrent performance on MAP Growth tests and state accountability tests. First, these two forms of tests (i.e., interim vs. summative) are designed to serve two related but different purposes. MAP Growth tests are designed to provide estimates of achievement status for students of various achievement levels. They are also designed to provide reasonable estimates of students' strengths and weaknesses within the identified instructional areas. State accountability tests are commonly designed to determine student proficiency within the state performance standard structure, with the most important decision being the classification of the student as proficient or not proficient. This primary purpose of most state tests, in conjunction with adopted content and curriculum standards and structures, can influence the relationship between student performance on the two tests. For example, one of the most common factors influencing these relationships is the use of constructed-response items in state tests. In general, the greater the number of constructed-response items, the weaker the relationship will appear. Another difference is in test design. Since most state accountability tests are fixed form, it is reasonable for the test to be constructed so that maximum test information is established around the proficiency cut point. This is where a state wants to be the most confident about the classification decision that the test will inform. To the extent that this strategy is reflected in the state's operational test, the relationship in performance between MAP Growth tests and state tests will be attenuated due to a more truncated range of scores on the state test. The requirement that state test content be connected to single grade-level content standards also differs from the MAP Growth test content structure, which spans multiple grade levels. This difference is another factor that weakens the observed score relationships between tests. Finally, when focus is placed on the relationship between performance on MAP Growth tests and the assigned proficiency category from the state test, information from the state test will have been collapsed into a few performance categories, which typically range from three to five.
The correlations between MAP Growth RIT scores and these category assignments will always be substantially lower than if the correlations were based on RIT scores and scale scores. Concurrent validity evidence is expressed as the degree of relationship to performance on another test measuring achievement in the same domain (e.g., mathematics, reading) administered close in time. This form of validity evidence can be expressed as a Pearson correlation coefficient between the total domain area RIT score and the total scale score of another established test. It answers the question, "How well do the scores from this test that reference this (RIT) scale in this subject area (e.g., mathematics, reading) correspond to the scores obtained from an established test that references some other scale in the same subject area?" Both tests are administered to the same students roughly two to three weeks apart. Correlations with non-NWEA tests that include more performance items requiring subjective scoring tend to be lower than correlations with non-NWEA tests consisting exclusively of multiple-choice items. Predictive validity evidence is expressed as the degree of relationship to performance on another test measuring achievement in the same domain (e.g., mathematics, reading) at some later point in time. This form of validity evidence can likewise be expressed as a Pearson correlation coefficient between the total domain area RIT score and the total scale score of another established test. It answers the question, "How well do the scores from this test that reference this (RIT) scale in this subject area (e.g., reading) predict the scores obtained from an established test that references some other scale in the same subject area at a later point in time?" Both tests are administered to the same students several weeks or months apart, depending on grade. For grades 3–8, the evidence reported here is from tests typically administered 12-36 weeks apart; for grades K–2, the evidence reported here is from tests administered 12-36 months apart. Strong predictive validity is indicated when the correlations are above 0.80. As with concurrent validity, correlations with non-NWEA tests that include more performance items requiring subjective scoring tend to be lower than correlations with non-NWEA tests consisting exclusively of multiple-choice items. The criterion measure used for concurrent and predictive validity analyses was the scaled score on the five state English Language Arts summative assessments (ACT Aspire, used in Arkansas; Colorado Measures of Academic Success; the New York State Testing Program; Florida's Statewide Assessment Program; and the Missouri Assessment Program) taken by students in the sample during the Spring 2018 school term. These state-level scaled scores were linked to the MAP Growth Reading scales using the equipercentile method and a common-person design so that they were comparable with each other. In addition to concurrent and predictive validity, validity evidence for MAP Growth also comes from the degree and stability of the relationship of RIT scores across multiple and extended periods of time. This type of evidence supports the construct validity of MAP Growth and of the ability underlying the RIT scale. This type of construct validity evidence is provided for grades K–2, since concurrent validity coefficients were not available for grades K–2 (i.e., grades K–2 RIT scores were from administrations during school years prior to the administration of the state assessments).
- *Describe the sample(s), including size and characteristics, for each validity analysis conducted.
- The sample for the validity study contained student records from a total of five states (Colorado, Missouri, Arkansas, Florida, and New York) and thus had representation from all four U.S. Census regions. MAP Growth data used in the concurrent and predictive validity analyses for grades 3–8 came from test administrations occurring during the Fall 2017, Winter 2018, and Spring 2018 school terms, which spanned from August 2017 to June 2018. MAP Growth data used in the construct and predictive validity analyses for grades K–2 were MAP Growth scores of grade 3 students from the previous three academic years (i.e., Fall 2014, Winter 2015, Spring 2015, Fall 2015, Winter 2016, Spring 2016, Fall 2016, Winter 2017, and Spring 2017). The state-level data were from the Spring 2018 administration of the following five state assessments, administered approximately from March 2018 to June 2018. These five state assessments included: ACT Aspire, used in Arkansas; Colorado Measures of Academic Success; the New York State Testing Program; Florida’s Statewide Assessment Program; and the Missouri Assessment Program. Sample characteristics for grades 3–8 used in concurrent and predictive analyses: Gender (Percent): Male: 50.60%, Female: 49.40%. Race/Ethnicity (Percent): White, Non-Hispanic: 48.09%, Black, Non-Hispanic: 13.36%, Hispanic: 25.09%, American Indian/Alaska Native: 0.89%, Asian/Pacific Islander: 4.94%, Multi-Ethnic: 2.99%, Not Specified or Other: 3.87%, Native Hawaiian or Other Pacific Islander: 0.77%. Sample characteristics for grades K–2 used in predictive and construct validity analyses: Gender (Percent): Male: 51.03%, Female: 48.97%. Race/Ethnicity (Percent): White, Non-Hispanic: 48.40%, Black, Non-Hispanic: 14.37%, Hispanic: 23.99%, American Indian/Alaska Native: 0.72%, Asian/Pacific Islander: 4.55%, Multi-Ethnic: 3.54%, Not Specified or Other: 3.63%, Native Hawaiian or Other Pacific Islander: 0.79%.
- *Describe the analysis procedures for each reported type of validity.
- Concurrent validity was estimated as the Pearson correlation coefficient between student RIT scores from Spring 2018 and the same students' scale scores on the state tests, which were also administered in Spring 2018 for grades 3–8 but linked to the MAP Growth scales using the equipercentile method and common-person design. Predictive validity was estimated as the Pearson correlation coefficient between student RIT scores from a given term (Fall 2017 or Winter 2018) and the same students' total scale score on the state tests, which were administered in Spring 2018 for grades 3–8 but linked to the MAP Growth scales using the equipercentile method and common-person design. For grades K–2, predictive validity was estimated as the Pearson correlation coefficient between student RIT scores from each of the following terms (Fall 2014, Winter 2015, Spring 2015, Fall 2015, Winter 2016, Spring 2016, Fall 2016, Winter 2017, and Spring 2017) and the students' grade 3 state-level scaled scores linked to the MAP Growth scales. For grades K–2, construct validity was estimated as the Pearson correlation coefficient between student RIT scores from each term in consecutive pairs of school years from 2014-15 through 2017-18, i.e., 1) each term in the 2014-15 school year (Fall 2014, Winter 2015, and Spring 2015) versus RIT scores from each term of the 2015-16 school year (Fall 2015, Winter 2016, and Spring 2016); 2) each term in the 2015-16 school year (Fall 2015, Winter 2016, and Spring 2016) versus RIT scores from each term of the 2016-17 school year (Fall 2016, Winter 2017, and Spring 2017); and 3) each term in the 2016-17 school year (Fall 2016, Winter 2017, and Spring 2017) versus RIT scores from each term of the 2017-18 school year (Fall 2017, Winter 2018, and Spring 2018). For the concurrent, predictive, and construct validity coefficients, the 95% confidence interval values were based on the standard 95% confidence interval for a Pearson correlation, using the Fisher z-transformation.
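A simplified illustration of the common-person, equipercentile linking and correlation steps described above; it matches percentile ranks directly, omits the smoothing and other operational details of the actual linking study, and uses illustrative variable names throughout.

```python
import numpy as np

def equipercentile_link(state_scores, rit_scores):
    """Map state-test scaled scores onto the RIT scale by matching percentile
    ranks within a common-person sample (simplified; no smoothing, ties ignored)."""
    state_scores = np.asarray(state_scores, dtype=float)
    rit_scores = np.asarray(rit_scores, dtype=float)
    n = len(state_scores)
    # Mid-percentile rank of each state score within its own distribution.
    order = state_scores.argsort()
    pct_ranks = np.empty(n)
    pct_ranks[order] = (np.arange(n) + 0.5) / n
    # Invert the RIT distribution at those percentile ranks.
    return np.quantile(rit_scores, pct_ranks)

# Concurrent validity sketch: correlate Spring 2018 RIT scores with linked
# Spring 2018 state scores for the same students (hypothetical arrays).
# linked_state = equipercentile_link(state_spring_2018, rit_spring_2018)
# r = np.corrcoef(rit_spring_2018, linked_state)[0, 1]
# The 95% CI for r follows the Fisher z-transformation, as in the reliability sketch.
```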
*In the table below, report the results of the validity analyses described above (e.g., concurrent or predictive validity, evidence based on response processes, evidence based on internal structure, evidence based on relations to other variables, and/or evidence based on consequences of testing), and the criterion measures.
Type of Validity | Subgroup | Informant | Age / Grade | Test or Criterion | n | Median Coefficient | 95% Confidence Interval Lower Bound | 95% Confidence Interval Upper Bound |
---|---|---|---|---|---|---|---|---|
- Results from other forms of validity analysis not compatible with above table format:
- Manual cites other published validity studies:
- Yes
- Provide citations for additional published studies.
- Wang, S., Jiao, H., & Zhang, L. (2013). Validation of Longitudinal Achievement Constructs of Vertically Scaled Computerized Adaptive Tests: A Multiple-Indicator, Latent-Growth Modelling Approach. International Journal of Quantitative Research in Education, 1, 383-407. Wang, S., McCall, M., Hong, J., & Harris, G. (2013). Construct Validity and Measurement Invariance of Computerized Adaptive Testing: Application to Measures of Academic Progress (MAP) Using Confirmatory Factor Analysis. Journal of Educational and Developmental Psychology, 3, 88-100.
- Describe the degree to which the provided data support the validity of the tool.
- Concurrent validity coefficients for grades 3–8 and each time of year were consistently in the 0.70s and 0.80s, suggesting a strong relationship between MAP Growth and the state assessments for the grades of interest. Construct validity coefficients for grades K–2 and each time of year were consistently in the 0.60s and 0.80s, suggesting a moderately strong to strong relationship among MAP Growth RIT scores across school years. Predictive validity coefficients for grades 2–8 were consistently in the 0.70s and 0.80s. For grade 1, they were in the 0.60s and 0.70s. For kindergarten, they were in the 0.50s and 0.60s. Note that for grades K–1, the criterion measures were the grade 3 state-level scaled scores linked to the MAP Growth scales. That is, MAP Growth tests were taken 24-36 months earlier than the state tests.
- Do you have validity data that are disaggregated by gender, race/ethnicity, or other subgroups (e.g., English language learners, students with disabilities)?
- Yes
If yes, fill in data for each subgroup with disaggregated validity data.
Type of Validity | Subgroup | Informant | Age / Grade | Test or Criterion | n | Median Coefficient | 95% Confidence Interval Lower Bound | 95% Confidence Interval Upper Bound |
---|---|---|---|---|---|---|---|---|
- Results from other forms of validity analysis not compatible with above table format:
- Tables with extensive disaggregated data are available from the Center upon request.
- Manual cites other published validity studies:
- No
- Provide citations for additional published studies.
Bias Analysis
Grade | Kindergarten | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 |
---|---|---|---|---|---|---|---|---|---|
Rating | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
- Have you conducted additional analyses related to the extent to which your tool is or is not biased against subgroups (e.g., race/ethnicity, gender, socioeconomic status, students with disabilities, English language learners)? Examples might include Differential Item Functioning (DIF) or invariance testing in multiple-group confirmatory factor models.
- Yes
- If yes,
- a. Describe the method used to determine the presence or absence of bias:
- The Mantel-Haenszel (MH) procedure (Mantel & Haenszel, 1959) is the most cited and studied method for detecting differential item functioning (DIF). It stratifies examinees by a composite test score, compares the item performance of reference and focal group members in each stratum, and then pools this comparison over all strata. The MH procedure is easy to implement and is featured in most statistical software. NWEA applied the MH method to assess DIF of MAP Growth items. The DIF analysis results are categorized based on the Educational Testing Service (ETS) method of classifying DIF (Zwick, 2012). This method allows items exhibiting negligible DIF (Category A) to be differentiated from those exhibiting moderate DIF (Category B) and severe DIF (Category C). Categories B and C have a further breakdown as "+" (DIF is in favor of the focal group) or "-" (DIF is in favor of the reference group). Four thousand reading and mathematics items were included in the DIF analyses, with 2,000 items from each content area. These items were administered on operational MAP Growth tests in Fall 2018, Winter 2019, and Spring 2019. Data from each season were analyzed separately, but the same items were included in the analysis for each season. Each item had more than 5,000 test records, ensuring an adequate sample size of students for each group involved in the comparison. This, in turn, ensured that each comparison had adequate power to detect DIF. Analyses were conducted at each grade level.
- b. Describe the subgroups for which bias analyses were conducted:
- DIF analyses were conducted by ethnic group (White, Asian, Black, and Hispanic) and gender (male, female). White serves as reference group in the DIF analysis based on ethnic group, and male serves as reference group in the DIF analysis based on gender.
- c. Describe the results of the bias analyses conducted, including data and interpretative statements. Include magnitude of effect (if available) if bias has been identified.
- Our analysis shows that DIF related to gender is rare. The percentage of Category C DIF ranged from 0.5% to 0.7% across subject areas. The prevalence of B and C classifications is lower than expected by chance. In terms of DIF related to ethnicity, 0.2% to 5.5% of items are classified as Category C. For Reading items, the prevalence of B and C classifications is lower than expected by chance. All items exhibiting moderate (Category B) DIF are subjected to an extra review by content specialists to identify the source of DIF. For each item, these specialists decide whether to remove the item from the item bank, revise the item and re-submit it for field testing, or retain the item without modification. Items exhibiting severe DIF (Category C) are removed from the item bank. These procedures are consistent with periodic item quality reviews that remove problem items or flag them for revision and re-field testing.
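For illustration, the sketch below shows the core MH computation for a single item together with a magnitude-only rendering of the ETS A/B/C categories described above. The operational procedure also applies statistical significance criteria and its own matching strata, so the names, bin choices, and thresholds here should be read as assumptions of the sketch, not NWEA's implementation.

```python
import numpy as np

def mantel_haenszel_d_dif(correct, is_focal, total_score, n_strata=10):
    """Simplified Mantel-Haenszel DIF index for one item.

    correct: 0/1 item responses; is_focal: True for focal-group examinees,
    False for the reference group; total_score: composite score used to
    stratify examinees. Returns MH D-DIF on the ETS delta metric.
    """
    correct = np.asarray(correct).astype(bool)
    is_focal = np.asarray(is_focal).astype(bool)
    total_score = np.asarray(total_score, dtype=float)

    # Stratify examinees on the matching score (equal-width bins for simplicity).
    edges = np.linspace(total_score.min(), total_score.max(), n_strata + 1)[1:-1]
    strata = np.digitize(total_score, edges)

    num = den = 0.0
    for k in np.unique(strata):
        in_k = strata == k
        ref_right = np.sum(correct[in_k] & ~is_focal[in_k])
        ref_wrong = np.sum(~correct[in_k] & ~is_focal[in_k])
        foc_right = np.sum(correct[in_k] & is_focal[in_k])
        foc_wrong = np.sum(~correct[in_k] & is_focal[in_k])
        n_k = ref_right + ref_wrong + foc_right + foc_wrong
        if n_k == 0:
            continue
        num += ref_right * foc_wrong / n_k
        den += ref_wrong * foc_right / n_k
    alpha_mh = num / den                 # pooled (common) odds ratio across strata
    return -2.35 * np.log(alpha_mh)      # positive values favor the focal group

def ets_dif_category(d_dif):
    """Magnitude-only version of the ETS A/B/C rules (the operational rules
    also require statistical significance tests, omitted here)."""
    size = abs(d_dif)
    if size < 1.0:
        return "A"
    category = "B" if size < 1.5 else "C"
    return category + ("+" if d_dif > 0 else "-")
```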
Data Collection Practices
Most tools and programs evaluated by the NCII are branded products which have been submitted by the companies, organizations, or individuals that disseminate these products. These entities supply the textual information shown above, but not the ratings accompanying the text. NCII administrators and members of our Technical Review Committees have reviewed the content on this page, but NCII cannot guarantee that this information is free from error or reflective of recent changes to the product. Tools and programs have the opportunity to be updated annually or upon request.