MAP® Growth
Mathematics

Summary

MAP Growth assessments are used across the country for multiple purposes, including as universal screening tools in response to intervention (RTI) programs. MAP Growth can serve as a universal screener for identifying students at risk of poor academic outcomes in mathematics. MAP Growth assessments give educators insight into the instructional needs of all students, whether they are performing at, above, or below grade level, and include increasingly complex items that correspond with the rigor of the standards at those grade levels. MAP Growth Survey with Goals tests are computer-adaptive tests built on a cross-grade vertical scale that assess achievement in standards-aligned content. Scores from repeated administrations measure achievement over time, from which users interpret academic growth. Survey with Goals tests can be administered three times per school year — once each in Fall, Winter, and Spring — with an optional summer administration. MAP Growth assessments are scaled across grades using the Rasch model, an item response theory (IRT) model commonly employed in K–12 assessment programs. The resulting scales are called RIT scales (for Rasch Unit).

Where to Obtain:
NWEA
proposals@nwea.org
121 NW Everett Street, Portland, OR 97209
(503) 624-1951
www.nwea.org
Initial Cost:
$8.25 per student
Replacement Cost:
$8.25 per student per year
Included in Cost:
MAP Growth Mathematics annual per-student subscription fees range from $7.00 to $9.50. A bundled assessment suite of Mathematics, Reading, and Language Usage tests starts at $13.50 per student. Discounts are available based on volume and other factors. Annual subscription fees include the following:
  • Full Assessment Suite: MAP Growth assessments can be administered up to four times per calendar year, and the abbreviated Screening Assessment once a year for placement purposes.
  • Robust Reporting: All results from MAP Growth assessments, including RIT scale scores, proficiency projections, and status and growth norms, are available in a variety of views and formats through our comprehensive suite of reports.
  • Learning Continuum: Dynamic reporting of learning statements, specifically aligned to the applicable state standards, provides information on what each student is ready to learn.
  • System of Support: A full system of support is provided to enable the success of our partners, including technical support; implementation support through the first test administration; and ongoing, dedicated account management for the duration of the partnership.
  • NWEA Professional Learning Online: Access to this online learning portal offers on-demand tutorials, webinars, courses, and videos to supplement professional learning plans and help educators use MAP Growth to improve teaching and learning.
NWEA offers a portfolio of flexible, customizable professional learning and training options to meet the needs of our partners. Please contact us via https://www.nwea.org/sales-information/ for specific details on pricing. The annual per-student MAP Growth price includes a suite of assessments, scoring and reporting, all assessment software including maintenance and upgrades, support services, and unlimited staff access to NWEA Professional Learning Online.
Special Accommodations for Students with Disabilities:
Accessibility, grounded in universal design, is the foundation of the systems that create our assessments. If you start with accessibility, test and item aids follow, and the more you attend to universal design and accessibility, the fewer accommodations students need. This means that all content areas are created with universal design and accessibility standards in mind from the start. For example, alternative text descriptions (alt-tags) for images are an important feature on a website to provide access for those using screen readers. Alt-tags provide descriptions of pictures, charts, graphs, etc., to those who may not be able to see the information. Laying this foundation ensures our product is accessible for students using various accommodations. Following national standards, such as the Web Content Accessibility Guidelines (WCAG) 2.0 and Accessible Rich Internet Applications (ARIA), helps guide the creation of our accessible foundation.

With support from the WGBH National Center for Accessible Media (NCAM), we have created detailed and thorough guidelines for describing many variations of images, charts, and graphics targeted specifically to mathematics, reading, and language usage. The guidelines review concepts such as item integrity, fairness, and the unique challenges image description writers face in the context of assessment. These guidelines result in consistent, user-friendly, and valid image descriptions that support the use of screen readers. This approach places NWEA at the forefront of assessment accessibility. We are firmly committed to continuous improvement of our accessibility and accommodations, and we expect to build on our offerings in the years to come.

Tools are made available for all students on the assessment. These tools are embedded into the user interface for each item and are appropriate to the test level. Tools are not specific to a certain population but are available to all users whenever necessary, so that students can use them during their testing experience. When tools are considered during the test design phase, accommodations become more precise to individual needs.

Accommodations have an intended audience, typically students who have an individualized education plan (IEP) or students receiving accommodations under Section 504 of the Rehabilitation Act of 1973. It is vital for validity and reliability that these accommodations are tracked from the beginning of a student’s testing experience. More detailed information about MAP's accommodation compatibility is available from the Center upon request.
Training Requirements:
1-4 hours of training
Qualified Administrators:
Examiners should meet the same qualifications as a teaching paraprofessional and should complete all necessary training related to administering the assessment.
Access to Technical Support:
Toll-free telephone support, online support, website knowledge base, live chat support
Assessment Format:
  • Direct: Computerized
Scoring Time:
  • Scoring is automatic
Scores Generated:
  • Percentile score
  • IRT-based score
  • Developmental benchmarks
  • Developmental cut points
  • Composite scores
  • Subscale/subtest scores
Administration Time:
  • 45 minutes per student/subject
Scoring Method:
  • Automatically (computer-scored)
Technology Requirements:
  • Computer or tablet
  • Internet connection
Accommodations:
Accessibility, grounded in universal design, is the foundation of the systems that create our assessments. If you start with accessibility, test and item aids follow, and the more you attend to universal design and accessibility, the fewer accommodations students need. This means that all content areas are created with universal design and accessibility standards in mind from the start. For example, alternative text descriptions (alt-tags) for images are an important feature on a website to provide access for those using screen readers. Alt-tags provide descriptions of pictures, charts, graphs, etc., to those who may not be able to see the information. Laying this foundation ensures our product is accessible for students using various accommodations. Following national standards, such as the Web Content Accessibility Guidelines (WCAG) 2.0 and Accessible Rich Internet Applications (ARIA), helps guide the creation of our accessible foundation.

With support from the WGBH National Center for Accessible Media (NCAM), we have created detailed and thorough guidelines for describing many variations of images, charts, and graphics targeted specifically to mathematics, reading, and language usage. The guidelines review concepts such as item integrity, fairness, and the unique challenges image description writers face in the context of assessment. These guidelines result in consistent, user-friendly, and valid image descriptions that support the use of screen readers. This approach places NWEA at the forefront of assessment accessibility. We are firmly committed to continuous improvement of our accessibility and accommodations, and we expect to build on our offerings in the years to come.

Tools are made available for all students on the assessment. These tools are embedded into the user interface for each item and are appropriate to the test level. Tools are not specific to a certain population but are available to all users whenever necessary, so that students can use them during their testing experience. When tools are considered during the test design phase, accommodations become more precise to individual needs.

Accommodations have an intended audience, typically students who have an individualized education plan (IEP) or students receiving accommodations under Section 504 of the Rehabilitation Act of 1973. It is vital for validity and reliability that these accommodations are tracked from the beginning of a student’s testing experience. More detailed information about MAP's accommodation compatibility is available from the Center upon request.

Descriptive Information

Please provide a description of your tool:
MAP Growth assessments are used across the country for multiple purposes, including as universal screening tools in response to intervention (RTI) programs. MAP Growth can serve as a universal screener for identifying students at risk of poor academic outcomes in mathematics. MAP Growth assessments give educators insight into the instructional needs of all students, whether they are performing at, above, or below grade level, and include increasingly complex items that correspond with the rigor of the standards at those grade levels. MAP Growth Survey with Goals tests are computer-adaptive tests built on a cross-grade vertical scale that assess achievement in standards-aligned content. Scores from repeated administrations measure achievement over time, from which users interpret academic growth. Survey with Goals tests can be administered three times per school year — once each in Fall, Winter, and Spring — with an optional summer administration. MAP Growth assessments are scaled across grades using the Rasch model, an item response theory (IRT) model commonly employed in K–12 assessment programs. The resulting scales are called RIT scales (for Rasch Unit).
The tool is intended for use with the following grade(s).
not selected Preschool / Pre - kindergarten
not selected Kindergarten
not selected First grade
selected Second grade
selected Third grade
selected Fourth grade
selected Fifth grade
selected Sixth grade
selected Seventh grade
selected Eighth grade
selected Ninth grade
selected Tenth grade
selected Eleventh grade
selected Twelfth grade

The tool is intended for use with the following age(s).
not selected 0-4 years old
not selected 5 years old
not selected 6 years old
selected 7 years old
selected 8 years old
selected 9 years old
selected 10 years old
selected 11 years old
selected 12 years old
selected 13 years old
selected 14 years old
selected 15 years old
selected 16 years old
selected 17 years old
selected 18 years old

The tool is intended for use with the following student populations.
not selected Students in general education
not selected Students with disabilities
not selected English language learners

ACADEMIC ONLY: What skills does the tool screen?

Reading
Phonological processing:
not selected RAN
not selected Memory
not selected Awareness
not selected Letter sound correspondence
not selected Phonics
not selected Structural analysis

Word ID
not selected Accuracy
not selected Speed

Nonword
not selected Accuracy
not selected Speed

Spelling
not selected Accuracy
not selected Speed

Passage
not selected Accuracy
not selected Speed

Reading comprehension:
not selected Multiple choice questions
not selected Cloze
not selected Constructed Response
not selected Retell
not selected Maze
not selected Sentence verification
not selected Other (please describe):


Listening comprehension:
not selected Multiple choice questions
not selected Cloze
not selected Constructed Response
not selected Retell
not selected Maze
not selected Sentence verification
not selected Vocabulary
not selected Expressive
not selected Receptive

Mathematics
Global Indicator of Math Competence
selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

Early Numeracy
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

Mathematics Concepts
selected Accuracy
not selected Speed
selected Multiple Choice
not selected Constructed Response

Mathematics Computation
selected Accuracy
not selected Speed
selected Multiple Choice
not selected Constructed Response

Mathematic Application
selected Accuracy
not selected Speed
selected Multiple Choice
not selected Constructed Response

Fractions/Decimals
selected Accuracy
not selected Speed
selected Multiple Choice
not selected Constructed Response

Algebra
selected Accuracy
not selected Speed
selected Multiple Choice
not selected Constructed Response

Geometry
selected Accuracy
not selected Speed
selected Multiple Choice
not selected Constructed Response

selected Other (please describe):
All assessable content from each state’s standards (e.g., Measurement and Data)

Please describe specific domain, skills or subtests:
BEHAVIOR ONLY: Which category of behaviors does your tool target?


BEHAVIOR ONLY: Please identify which broad domain(s)/construct(s) are measured by your tool and define each sub-domain or sub-construct.

Acquisition and Cost Information

Where to obtain:
Email Address
proposals@nwea.org
Address
121 NW Everett Street, Portland, OR 97209
Phone Number
(503) 624-1951
Website
www.nwea.org
Initial cost for implementing program:
Cost
$8.25
Unit of cost
student
Replacement cost per unit for subsequent use:
Cost
$8.25
Unit of cost
student
Duration of license
annual
Additional cost information:
Describe basic pricing plan and structure of the tool. Provide information on what is included in the published tool, as well as what is not included but required for implementation.
MAP Growth Mathematics annual per-student subscription fees range from $7.00 to $9.50. A bundled assessment suite of Mathematics, Reading, and Language Usage tests starts at $13.50 per student. Discounts are available based on volume and other factors. Annual subscription fees include the following:
  • Full Assessment Suite: MAP Growth assessments can be administered up to four times per calendar year, and the abbreviated Screening Assessment once a year for placement purposes.
  • Robust Reporting: All results from MAP Growth assessments, including RIT scale scores, proficiency projections, and status and growth norms, are available in a variety of views and formats through our comprehensive suite of reports.
  • Learning Continuum: Dynamic reporting of learning statements, specifically aligned to the applicable state standards, provides information on what each student is ready to learn.
  • System of Support: A full system of support is provided to enable the success of our partners, including technical support; implementation support through the first test administration; and ongoing, dedicated account management for the duration of the partnership.
  • NWEA Professional Learning Online: Access to this online learning portal offers on-demand tutorials, webinars, courses, and videos to supplement professional learning plans and help educators use MAP Growth to improve teaching and learning.
NWEA offers a portfolio of flexible, customizable professional learning and training options to meet the needs of our partners. Please contact us via https://www.nwea.org/sales-information/ for specific details on pricing. The annual per-student MAP Growth price includes a suite of assessments, scoring and reporting, all assessment software including maintenance and upgrades, support services, and unlimited staff access to NWEA Professional Learning Online.
Provide information about special accommodations for students with disabilities.
Accessibility, grounded in universal design, is the foundation of the systems that create our assessments. If you start with accessibility, test and item aids follow, and the more you attend to universal design and accessibility, the fewer accommodations students need. This means that all content areas are created with universal design and accessibility standards in mind from the start. For example, alternative text descriptions (alt-tags) for images are an important feature on a website to provide access for those using screen readers. Alt-tags provide descriptions of pictures, charts, graphs, etc., to those who may not be able to see the information. Laying this foundation ensures our product is accessible for students using various accommodations. Following national standards, such as the Web Content Accessibility Guidelines (WCAG) 2.0 and Accessible Rich Internet Applications (ARIA), helps guide the creation of our accessible foundation.

With support from the WGBH National Center for Accessible Media (NCAM), we have created detailed and thorough guidelines for describing many variations of images, charts, and graphics targeted specifically to mathematics, reading, and language usage. The guidelines review concepts such as item integrity, fairness, and the unique challenges image description writers face in the context of assessment. These guidelines result in consistent, user-friendly, and valid image descriptions that support the use of screen readers. This approach places NWEA at the forefront of assessment accessibility. We are firmly committed to continuous improvement of our accessibility and accommodations, and we expect to build on our offerings in the years to come.

Tools are made available for all students on the assessment. These tools are embedded into the user interface for each item and are appropriate to the test level. Tools are not specific to a certain population but are available to all users whenever necessary, so that students can use them during their testing experience. When tools are considered during the test design phase, accommodations become more precise to individual needs.

Accommodations have an intended audience, typically students who have an individualized education plan (IEP) or students receiving accommodations under Section 504 of the Rehabilitation Act of 1973. It is vital for validity and reliability that these accommodations are tracked from the beginning of a student’s testing experience. More detailed information about MAP's accommodation compatibility is available from the Center upon request.

Administration

BEHAVIOR ONLY: What type of administrator is your tool designed for?
not selected General education teacher
not selected Special education teacher
not selected Parent
not selected Child
not selected External observer
not selected Other
If other, please specify:

What is the administration setting?
not selected Direct observation
not selected Rating scale
not selected Checklist
not selected Performance measure
not selected Questionnaire
selected Direct: Computerized
not selected One-to-one
not selected Other
If other, please specify:

Does the tool require technology?
Yes

If yes, what technology is required to implement your tool? (Select all that apply)
selected Computer or tablet
selected Internet connection
not selected Other technology (please specify)

If your program requires additional technology not listed above, please describe the required technology and the extent to which it is combined with teacher small-group instruction/intervention:

What is the administration context?
selected Individual
not selected Small group   If small group, n=
not selected Large group   If large group, n=
not selected Computer-administered
not selected Other
If other, please specify:

What is the administration time?
Time in minutes
45
per (student/group/other unit)
student/subject

Additional scoring time:
Time in minutes
per (student/group/other unit)

ACADEMIC ONLY: What are the discontinue rules?
selected No discontinue rules provided
not selected Basals
not selected Ceilings
not selected Other
If other, please specify:


Are norms available?
Yes
Are benchmarks available?
Yes
If yes, how many benchmarks per year?
3
If yes, for which months are benchmarks available?
Fall, Winter, Spring
BEHAVIOR ONLY: Can students be rated concurrently by one administrator?
If yes, how many students can be rated concurrently?

Training & Scoring

Training

Is training for the administrator required?
Yes
Describe the time required for administrator training, if applicable:
1-4 hours of training
Please describe the minimum qualifications an administrator must possess.
Examiners should meet the same qualifications as a teaching paraprofessional and should complete all necessary training related to administering the assessment.
not selected No minimum qualifications
Are training manuals and materials available?
Yes
Are training manuals/materials field-tested?
Yes
Are training manuals/materials included in cost of tools?
Yes
If No, please describe training costs:
Can users obtain ongoing professional and technical support?
Yes
If Yes, please describe how users can obtain support:
Toll-free telephone support, online support, website knowledge base, live chat support

Scoring

How are scores calculated?
not selected Manually (by hand)
selected Automatically (computer-scored)
not selected Other
If other, please specify:

Do you provide basis for calculating performance level scores?
Yes
What is the basis for calculating performance level and percentile scores?
not selected Age norms
selected Grade norms
not selected Classwide norms
not selected Schoolwide norms
not selected Stanines
not selected Normal curve equivalents

What types of performance level scores are available?
not selected Raw score
not selected Standard score
selected Percentile score
not selected Grade equivalents
selected IRT-based score
not selected Age equivalents
not selected Stanines
not selected Normal curve equivalents
selected Developmental benchmarks
selected Developmental cut points
not selected Equated
not selected Probability
not selected Lexile score
not selected Error analysis
selected Composite scores
selected Subscale/subtest scores
not selected Other
If other, please specify:

Does your tool include decision rules?
If yes, please describe.
Can you provide evidence in support of multiple decision rules?
No
If yes, please describe.
Please describe the scoring structure. Provide relevant details such as the scoring format, the number of items overall, the number of items per subscale, what the cluster/composite score comprises, and how raw scores are calculated.
Because MAP Growth is adaptive, scores are not based on raw scores; the difficulty of the items answered is used to derive the student’s scale score. During the assessment, a Bayesian scoring algorithm informs item selection. Bayesian scoring for item selection prevents the artificially dramatic fluctuations in estimated student achievement at the beginning of the test that can occur with other scoring algorithms. Although Bayesian scoring works well as a procedure for selecting items during test administration, Bayesian scores are not appropriate for calculating final student achievement scores, because Bayesian scoring uses information other than the student’s responses to questions (such as past performance) to calculate the achievement estimate. Since only the student’s performance on the current test should determine the current score, a maximum-likelihood algorithm is used to calculate the student’s actual score at the completion of the test.
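To make the maximum-likelihood step concrete, the sketch below estimates a student’s ability under the Rasch model from the difficulties of the items administered and the student’s right/wrong responses. The item difficulties, the Newton–Raphson settings, and the RIT conversion (assumed here as roughly 10 × logit + 200) are illustrative assumptions, not NWEA’s production scoring algorithm.

```python
import math

def rasch_mle(difficulties, responses, max_iter=25, tol=1e-6):
    """Sketch of maximum-likelihood ability estimation under the Rasch model.

    difficulties: item difficulties (in logits) of the items the student saw
    responses:    1 = correct, 0 = incorrect, one entry per item
    Returns (theta, sem): the ability estimate and its conditional SEM (logits).
    Note: the MLE is undefined for all-correct or all-incorrect patterns.
    """
    theta = 0.0
    for _ in range(max_iter):
        # Rasch probability of a correct response to each administered item
        probs = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
        score = sum(x - p for x, p in zip(responses, probs))   # d(logL)/d(theta)
        info = sum(p * (1.0 - p) for p in probs)               # test information
        step = score / info                                    # Newton-Raphson update
        theta += step
        if abs(step) < tol:
            break
    return theta, 1.0 / math.sqrt(info)

# Illustrative use: four items near the student's level, two answered correctly
theta, sem = rasch_mle([-0.5, 0.0, 0.5, 1.0], [1, 1, 0, 0])
rit_estimate = 200 + 10 * theta   # assumed RIT transformation, for illustration only
```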
Describe the tool’s approach to screening, samples (if applicable), and/or test format, including steps taken to ensure that it is appropriate for use with culturally and linguistically diverse populations and students with disabilities.
MAP Growth enables educational agencies to measure the achievement of virtually all of their students, beginning at second grade, with a great deal of accuracy in a short period of time. MAP Growth assessments are computer adaptive, adjusting in difficulty based on each student’s responses to test questions. Assessments are provided in a survey structure, with instructional area scores supplementing an overall score. Instructional area frameworks are aligned to the Common Core State Standards or to specific state standards. The fundamental assumption underlying IRT is that the probability of a correct response to a test item is a function of the item’s difficulty and the person’s ability. This function is expected to remain invariant to other person characteristics that are unrelated to ability, such as gender, ethnic group membership, and family wealth. Therefore, if two test takers with the same ability respond to the same test item, they are assumed to have an equal probability of answering the item correctly.

Renewing and Developing Items
The NWEA Publishing team has a rigorous item development and review process managed and maintained by a team of content specialists. The multi-stage development and review process includes built-in quality assurance checks at each stage. It concludes with a formal editorial stage in which copy editors review items for grammar, usage, mechanics, and style inconsistencies and conduct a word-for-word proofread of passages against their original sources. The Publishing team reviews all items developed for our assessments for content validity, bias and sensitivity, and alignment to standards, and to ensure they meet standards of universal test design. Test items also go through a second editorial and visual review with a separate group of Publishing experts. These individuals ensure the item displays correctly in the assessment system before it is seen by a student in a test. Psychometricians from our Research team validate the sufficiency of the test design, item pool, and item statistics. Built into the MAP Growth assessment system is a partner self-serve feature to report problem items. If a partner encounters an item with display, content, or other issues, the partner can submit a Problem Item Report (PIR) directly from the web-based testing platform. These PIRs are reviewed by our Publishing and Research teams using the processes described below.

Item Development Process
Our item development process yields items of excellent quality with strong alignment to the breadth and depth of the standards. In order to achieve this, we have developed a deep understanding of current state standards and use a variety of approaches and item types to assess them. Items are developed and reviewed through a variety of lenses, including how they align to their targeted standard and grade level, how they adhere to the principles of universal design, and whether they are free from potential bias and sensitivity issues. The NWEA content management system, the Standards and Content Integration Platform (SCIP), is a flexible system that enables item writers and content vendors to submit items directly into our content development work queues. SCIP enables our Publishing team to efficiently track, review, and refine item content throughout the item development process. A minimum of five separate professionals thoroughly review each item during the item review process.
When an item or passage is initially received, it is first reviewed by an acquisition specialist, who ensures that the content is valid and meets our quality standards. The second reviewer is a copyright and permissions specialist, who ensures that public domain passages are selected from authoritative, authentic sources, that copyrighted texts are approved by the copyright holders, and that content is free of plagiarism. Two separate content specialists provide the third and fourth reviews. They review items and passages for the quality of the content; validate factual material, currency, bias and sensitivity, and instructional relevance; and make alignments to the appropriate content standards. They also validate the grade appropriateness of the item and assign a Webb’s Depth of Knowledge (DOK) level and a Bloom’s classification. Finally, the content specialist assigns the item a preliminary difficulty level (called a provisional calibration or provisional RIT) that is needed for field-test purposes. The provisional estimate of difficulty is based on the observed difficulty of similar items and the content specialist’s expertise. During the third and fourth reviews, each item is evaluated against a set of criteria. An item is flagged if it:
  • Requires prior knowledge other than the skill/concept being assessed
  • Has cultural bias
  • Has linguistic bias
  • Has socio-economic bias
  • Has religious bias
  • Has geographic bias
  • Has color-blind bias
  • Has gender bias
  • Inappropriately employs idiomatic English
  • Offensively stereotypes a group of people
  • Mentions body/weight issues
  • Has inappropriate or sensitive topics (smoking, death, crime, violence, profanity, sex, etc.)
  • Has other bias issues
In the final stage of the item development process, a member of the Publishing team provides the fifth review of items and passages to ensure that they are free of grammar, usage, and mechanics errors and that they adhere to style guidelines. An accessibility review is performed to create image descriptions when necessary to ensure that our items are accessible to all students. Image descriptions may allow students who use refreshable braille and/or screen readers to answer questions that otherwise would be inaccessible. Reviewers also ensure that items will display correctly in all supported browsers. At any point during the multiple reviews, an item can be sent back to a previous stage or rejected if it does not meet our strict standards.

Embedded Field-Testing
Item development and field-testing for MAP Growth assessments occur continuously to enhance and deepen the item pool. Field-testing is required to maintain the item bank, as existing items are retired or removed due to changes in standards or item parameter drift. An embedded field-test design is used so that field-test items can be integrated seamlessly into the operational test. Often, in an embedded field-test design, field-test items are stored in a field-test item pool, and a predetermined number are administered to each examinee. When the assessment is scored, only the responses to the active items are used, so a student’s score is not influenced by the presence of field-test items. This approach provides advantages, such as:
  • Preserving testing mode
  • Obtaining response data in an efficient manner
  • Reducing the impact of motivation and representativeness concerns of the field-test
The assessment algorithm holds a fixed number of slots for field-test items for all students.
Therefore, the sample of students taking the field-test items is representative of the student population. NWEA uses a procedure called “adaptive item calibration,” which divides item calibration into two stages:
  • The first stage administers field-test items based on pre-specified criteria.
  • The second stage estimates item parameters.
Specifically, to start the field-testing process, each field-test item needs an initial item difficulty estimate, established by content experts based on item characteristics such as item length and complexity or the difficulties of existing items with similar content. This initial item difficulty estimate is treated as if it represents the true item difficulty and is used to collect responses for calibration. Once a pre-specified number of responses is collected, a field-test item goes through a calibration process and its calibration is updated. The updated item difficulty estimate is then used to collect the next specified number of responses, and the item difficulty estimate is updated again using all accumulated responses. This process is repeated iteratively until either a pre-specified level of item parameter estimate precision is obtained or the item parameter estimate stabilizes. This adaptive item calibration approach allows items to be presented to students that closely match their estimated achievement level, which helps optimize the use of testing time by presenting items that are neither too difficult nor too easy for a particular student. To ensure that the quality of the data is high, field-test items are administered only in the grade range suggested by the item author. This ensures that the sample of students taking any field-test item reflects the sample of students who will be taking the item after it becomes active. On average, a field-test item takes about 2,000 responses to become operational, enough to ensure the stability and precision of item parameter estimates. The analysis of the field-test items includes distractor analysis and review against a series of statistical indices, such as:
  • Percent correct
  • Point-biserial
  • Mean squared error
  • Z chi-square
  • Rasch INFIT and OUTFIT statistics
  • Correlation between expected and actual percent correct score
The use of multiple statistical indices ensures field-test items are calibrated with rigor (two of these indices, percent correct and point-biserial, are illustrated in the sketch at the end of this answer).

Item Bank Maintenance
Through the MAP Growth assessment platform, we are able to continuously embed items for field-testing and refresh the item bank. Our item quality review process provides a consistent system for continuously monitoring the validity, reliability, and relevance of items in the bank.

Reviewing Test Items
NWEA periodically reviews large portions of, or the entire, item bank. During this process, items are reviewed to ensure they are free of bias, current, valid, and instructionally relevant. Items tagged as not meeting one or more of the criteria are reviewed on a case-by-case basis and either retired from active status or revised as needed. Additionally, NWEA conducts differential item functioning (DIF) studies to help identify and remove items that demonstrate bias with respect to gender or ethnicity. NWEA also conducts regular studies to check for potential item parameter drift. Items flagged for parameter shift are reviewed by content experts, who then either retire or revise these items as needed.
Item development and field-testing for MAP Growth assessments are triggered by a needs analysis performed on a particular test pool or set of academic content standards. Once areas of need are articulated in an item acquisition plan, item specifications are written to address these areas, ensuring that adequate content coverage is maintained. New items are introduced through a field-test process that embeds new items in operational test sessions. Because the field-test items are embedded in a scored and reported operational test, they appear identical to active items. As a result, students are equally motivated to answer field-test and active items.

Retiring or Revising Items
As part of our continuous monitoring of the item bank, we have processes in place for handling problem items and retiring items.
  • External Problem Item Reports: These problems are reported to NWEA through the proctor interface. Our Publishing team receives notification and determines whether the concern is display-, editorial-, or content-related. The appropriate team member(s) review the item to decide whether the item is correct, or whether it should be revised or retired.
  • Internal Problem Item Reports: If a problem with an operational item is found during regular reviews and maintenance of the item bank, the item is deactivated and logged in a database. A separate reviewer is assigned to confirm the problem. If confirmed, the item is retired or regenerated after necessary revisions and field-testing. If both reviewers agree that the item is not problematic, the item is reactivated. Disagreements are resolved during regular cross-team item review sessions.
  • Research-Initiated Reviews (e.g., parameter drift review, differential item functioning): Research provides Publishing with data on a set of items, and content specialists review the items (two reviewers per item). Sets of items that are recommended for retirement or for revision and regeneration are checked by Research against the test pools to determine the impact of removing the items. Plans are then made either to temporarily remove, revise, and replace the items, or to permanently remove them.
  • Publishing-Initiated Review (e.g., reviewing all items or a subset of items for sustaining quality maintenance or to add a new piece of metadata): Content specialists review the items (two reviewers per item). Sets of items that are recommended for retirement or for revision and regeneration are checked by Research against the test pools to determine the impact of removing the items. Plans are then made either to temporarily remove, revise, and replace the items, or to permanently remove them.
  • Post-Calibration Content Review: Weekly review of items that have been rejected by Research for statistical reasons. Content specialists use the statistical data to inform a content review of the item and determine whether a revision would address the statistical issues (e.g., if a distractor is attracting too many responses, can it be made less distracting). Items that can be revised are regenerated and field-tested; those that cannot are retired.
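As a rough illustration of two of the field-test screening statistics named above (percent correct and point-biserial), the sketch below computes them from scored response data. The sample data and the flagging thresholds are hypothetical examples, not NWEA’s operational criteria.

```python
import math
import statistics

def field_test_item_stats(item_scores, total_scores):
    """Percent correct and point-biserial for one embedded field-test item.

    item_scores:  0/1 scores on the field-test item, one per student
    total_scores: each student's total (operational) score, same order
    """
    p = sum(item_scores) / len(item_scores)                  # proportion correct
    mean_correct = statistics.mean(
        t for x, t in zip(item_scores, total_scores) if x == 1)
    mean_all = statistics.mean(total_scores)
    sd_all = statistics.pstdev(total_scores)
    # Point-biserial: correlation between the 0/1 item score and the total score
    r_pb = (mean_correct - mean_all) / sd_all * math.sqrt(p / (1.0 - p))
    return p, r_pb

p, r_pb = field_test_item_stats([1, 0, 1, 1, 0, 0], [230, 205, 228, 241, 199, 210])
flagged = (p < 0.10 or p > 0.90 or r_pb < 0.20)   # hypothetical flagging rule
```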

Technical Standards

Classification Accuracy & Cross-Validation Summary

Grade Grade 2
Grade 3
Grade 4
Grade 5
Grade 6
Grade 7
Grade 8
Classification Accuracy Fall Convincing evidence Convincing evidence Convincing evidence Convincing evidence Convincing evidence Convincing evidence Convincing evidence
Classification Accuracy Winter Convincing evidence Convincing evidence Convincing evidence Convincing evidence Convincing evidence Convincing evidence Partially convincing evidence
Classification Accuracy Spring Convincing evidence Convincing evidence Convincing evidence Convincing evidence Convincing evidence Convincing evidence Convincing evidence
Legend
Full Bubble: Convincing evidence
Half Bubble: Partially convincing evidence
Empty Bubble: Unconvincing evidence
Null Bubble: Data unavailable
d: Disaggregated data available

(PARCC) Math

Classification Accuracy

Select time of year
Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
For this series of analyses, the outcome measure was the scaled score on the PARCC mathematics assessment, taken by students in the sample during the Spring 2016 school term. Note that the Grade 3 PARCC scaled scores were used as the outcome measure for both the Grade 2 and Grade 3 results.
Do the classification accuracy analyses examine concurrent and/or predictive classification?

Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
Describe how the classification analyses were performed and cut-points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
Students actually “at-risk” were so designated when their PARCC scale scores fell within PARCC Performance Level 1 (“Did not Meet”). The PARCC scaled score range for Performance Level 1 is 650-699. Students actually “not at-risk” were those whose PARCC scale scores fell within PARCC Performance Level 2 or higher (i.e., 700-850). Students were classified as “at-risk” or “not at-risk” by rank-ordering their scores on the predictor (the MAP Growth RIT score) and using the 20th percentile point within the sample as the cut point, disaggregated by grade and subject area. Students with MAP Growth scores below the 20th percentile point were classified as “at-risk”.
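A minimal sketch of the contrast described above, assuming paired arrays of MAP Growth RIT scores and PARCC scale scores for one grade; the variable names and the use of NumPy’s percentile function are illustrative, not the study’s actual code.

```python
import numpy as np

def classify_risk(map_rit, parcc_scale, percentile_cut=20, level1_max=699):
    """Cross-tabulate screener risk calls (MAP Growth) against outcome status (PARCC).

    map_rit:      MAP Growth RIT scores (predictor), one per student
    parcc_scale:  PARCC mathematics scale scores (criterion), same order
    """
    map_rit = np.asarray(map_rit, dtype=float)
    parcc_scale = np.asarray(parcc_scale, dtype=float)

    # Screener call: RIT below the within-sample 20th percentile point
    rit_cut = np.percentile(map_rit, percentile_cut)
    screened_at_risk = map_rit < rit_cut

    # Outcome status: PARCC Performance Level 1 (scale scores 650-699)
    actually_at_risk = parcc_scale <= level1_max

    true_pos = int(np.sum(screened_at_risk & actually_at_risk))
    false_pos = int(np.sum(screened_at_risk & ~actually_at_risk))
    false_neg = int(np.sum(~screened_at_risk & actually_at_risk))
    true_neg = int(np.sum(~screened_at_risk & ~actually_at_risk))
    return true_pos, false_pos, false_neg, true_neg
```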
Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
No
If yes, please describe the intervention, what children received the intervention, and how they were chosen.

Cross-Validation

Has a cross-validation study been conducted?
No
If yes,
Select time of year.
Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
Do the cross-validation analyses examine concurrent and/or predictive classification?

Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
Describe how the cross-validation analyses were performed and cut-points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
If yes, please describe the intervention, what children received the intervention, and how they were chosen.

Classification Accuracy - Fall

Evidence Grade 2 Grade 3 Grade 4 Grade 5 Grade 6 Grade 7 Grade 8
Criterion measure (PARCC) Math (PARCC) Math (PARCC) Math (PARCC) Math (PARCC) Math (PARCC) Math (PARCC) Math
Cut Points - Percentile rank on criterion measure 20 20 20 20 20 20 20
Cut Points - Performance score on criterion measure <700 <700 <700 <700 <700 <700 <700
Cut Points - Corresponding performance score (numeric) on screener measure 167.00 181.00 192.00 201.00 207.00 213.00 218.00
Classification Data - True Positive (a) 783 1149 1331 1006 1277 1050 1484
Classification Data - False Positive (b) 1562 1714 1500 1758 2019 2038 1083
Classification Data - False Negative (c) 227 195 321 261 209 164 665
Classification Data - True Negative (d) 10083 12274 12013 11998 13146 12704 10345
Area Under the Curve (AUC) 0.91 0.94 0.93 0.92 0.94 0.94 0.91
AUC Estimate’s 95% Confidence Interval: Lower Bound 0.90 0.93 0.93 0.92 0.93 0.93 0.91
AUC Estimate’s 95% Confidence Interval: Upper Bound 0.92 0.94 0.94 0.93 0.94 0.94 0.92
Statistics Grade 2 Grade 3 Grade 4 Grade 5 Grade 6 Grade 7 Grade 8
Base Rate 0.08 0.09 0.11 0.08 0.09 0.08 0.16
Overall Classification Rate 0.86 0.88 0.88 0.87 0.87 0.86 0.87
Sensitivity 0.78 0.85 0.81 0.79 0.86 0.86 0.69
Specificity 0.87 0.88 0.89 0.87 0.87 0.86 0.91
False Positive Rate 0.13 0.12 0.11 0.13 0.13 0.14 0.09
False Negative Rate 0.22 0.15 0.19 0.21 0.14 0.14 0.31
Positive Predictive Power 0.33 0.40 0.47 0.36 0.39 0.34 0.58
Negative Predictive Power 0.98 0.98 0.97 0.98 0.98 0.99 0.94
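The derived statistics above follow directly from the four classification cells. The sketch below reproduces them, using the Grade 2 Fall cells from the table as a worked example.

```python
def screening_stats(tp, fp, fn, tn):
    """Screening statistics derived from a 2x2 classification table."""
    n = tp + fp + fn + tn
    return {
        "base_rate": (tp + fn) / n,
        "overall_classification_rate": (tp + tn) / n,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
        "positive_predictive_power": tp / (tp + fp),
        "negative_predictive_power": tn / (tn + fn),
    }

# Grade 2, Fall (from the table above): a = 783, b = 1562, c = 227, d = 10083
stats = screening_stats(tp=783, fp=1562, fn=227, tn=10083)
# stats["sensitivity"] ~= 0.78 and stats["specificity"] ~= 0.87, matching the table
```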
Sample Grade 2 Grade 3 Grade 4 Grade 5 Grade 6 Grade 7 Grade 8
Date 2015-2016 2015-2016 2015-2016 2015-2016 2015-2016 2015-2016 2015-2016
Sample Size 12655 15332 15165 15023 16651 15956 13577
Geographic Representation (all grades): East North Central (IL), Middle Atlantic (NJ), Mountain (CO, NM), New England (RI), South Atlantic (DC)
Male              
Female              
Other              
Gender Unknown              
White, Non-Hispanic              
Black, Non-Hispanic              
Hispanic              
Asian/Pacific Islander              
American Indian/Alaska Native              
Other              
Race / Ethnicity Unknown              
Low SES              
IEP or diagnosed disability              
English Language Learner              

Classification Accuracy - Winter

Evidence Grade 2 Grade 3 Grade 4 Grade 5 Grade 6 Grade 7 Grade 8
Criterion measure (PARCC) Math (PARCC) Math (PARCC) Math (PARCC) Math (PARCC) Math (PARCC) Math (PARCC) Math
Cut Points - Percentile rank on criterion measure 20 20 20 20 20 20 20
Cut Points - Performance score on criterion measure <700 <700 <700 <700 <700 <700 <700
Cut Points - Corresponding performance score (numeric) on screener measure 177.00 187.00 197.00 205.00 208.00 213.00 219.00
Classification Data - True Positive (a) 712 1031 1162 1022 1162 912 1319
Classification Data - False Positive (b) 1304 1276 1230 1457 1381 1363 774
Classification Data - False Negative (c) 176 181 253 227 229 192 685
Classification Data - True Negative (d) 8313 9341 9604 10245 10336 9369 8262
Area Under the Curve (AUC) 0.92 0.94 0.94 0.93 0.94 0.93 0.91
AUC Estimate’s 95% Confidence Interval: Lower Bound 0.91 0.94 0.93 0.92 0.93 0.93 0.90
AUC Estimate’s 95% Confidence Interval: Upper Bound 0.93 0.95 0.94 0.93 0.94 0.94 0.91
Statistics Grade 2 Grade 3 Grade 4 Grade 5 Grade 6 Grade 7 Grade 8
Base Rate 0.08 0.10 0.12 0.10 0.11 0.09 0.18
Overall Classification Rate 0.86 0.88 0.88 0.87 0.88 0.87 0.87
Sensitivity 0.80 0.85 0.82 0.82 0.84 0.83 0.66
Specificity 0.86 0.88 0.89 0.88 0.88 0.87 0.91
False Positive Rate 0.14 0.12 0.11 0.12 0.12 0.13 0.09
False Negative Rate 0.20 0.15 0.18 0.18 0.16 0.17 0.34
Positive Predictive Power 0.35 0.45 0.49 0.41 0.46 0.40 0.63
Negative Predictive Power 0.98 0.98 0.97 0.98 0.98 0.98 0.92
Sample Grade 2 Grade 3 Grade 4 Grade 5 Grade 6 Grade 7 Grade 8
Date 2015-2016 2015-2016 2015-2016 2015-2016 2015-2016 2015-2016 2015-2016
Sample Size 10505 11829 12249 12951 13108 11836 11040
Geographic Representation (all grades): East North Central (IL), Middle Atlantic (NJ), Mountain (CO, NM), New England (RI), South Atlantic (DC)
Male              
Female              
Other              
Gender Unknown              
White, Non-Hispanic              
Black, Non-Hispanic              
Hispanic              
Asian/Pacific Islander              
American Indian/Alaska Native              
Other              
Race / Ethnicity Unknown              
Low SES              
IEP or diagnosed disability              
English Language Learner              

Classification Accuracy - Spring

Evidence Grade 2 Grade 3 Grade 4 Grade 5 Grade 6 Grade 7 Grade 8
Criterion measure (PARCC) Math (PARCC) Math (PARCC) Math (PARCC) Math (PARCC) Math (PARCC) Math (PARCC) Math
Cut Points - Percentile rank on criterion measure 20 20 20 20 20 20 20
Cut Points - Performance score on criterion measure <700 <700 <700 <700 <700 <700 <700
Cut Points - Corresponding performance score (numeric) on screener measure 183.00 194.00 203.00 210.00 214.00 219.00 219.00
Classification Data - True Positive (a) 907 1372 1584 1218 1466 1175 1644
Classification Data - False Positive (b) 1593 1881 1670 1939 2056 2126 1156
Classification Data - False Negative (c) 207 136 229 196 172 173 649
Classification Data - True Negative (d) 11043 13236 13086 13083 14289 13937 11447
Area Under the Curve (AUC) 0.92 0.96 0.96 0.94 0.95 0.94 0.92
AUC Estimate’s 95% Confidence Interval: Lower Bound 0.92 0.95 0.95 0.93 0.95 0.94 0.92
AUC Estimate’s 95% Confidence Interval: Upper Bound 0.93 0.96 0.96 0.94 0.95 0.95 0.93
Statistics Grade 2 Grade 3 Grade 4 Grade 5 Grade 6 Grade 7 Grade 8
Base Rate 0.08 0.09 0.11 0.09 0.09 0.08 0.15
Overall Classification Rate 0.87 0.88 0.89 0.87 0.88 0.87 0.88
Sensitivity 0.81 0.91 0.87 0.86 0.89 0.87 0.72
Specificity 0.87 0.88 0.89 0.87 0.87 0.87 0.91
False Positive Rate 0.13 0.12 0.11 0.13 0.13 0.13 0.09
False Negative Rate 0.19 0.09 0.13 0.14 0.11 0.13 0.28
Positive Predictive Power 0.36 0.42 0.49 0.39 0.42 0.36 0.59
Negative Predictive Power 0.98 0.99 0.98 0.99 0.99 0.99 0.95
Sample Grade 2 Grade 3 Grade 4 Grade 5 Grade 6 Grade 7 Grade 8
Date 2015-2016 2015-2016 2015-2016 2015-2016 2015-2016 2015-2016 2015-2016
Sample Size 13750 16625 16569 16436 17983 17411 14896
Geographic Representation (all grades): East North Central (IL), Middle Atlantic (NJ), Mountain (CO, NM), New England (RI), South Atlantic (DC)
Male              
Female              
Other              
Gender Unknown              
White, Non-Hispanic              
Black, Non-Hispanic              
Hispanic              
Asian/Pacific Islander              
American Indian/Alaska Native              
Other              
Race / Ethnicity Unknown              
Low SES              
IEP or diagnosed disability              
English Language Learner              

Reliability

Grade Grade 2
Grade 3
Grade 4
Grade 5
Grade 6
Grade 7
Grade 8
Rating Convincing evidence d Convincing evidence d Convincing evidence d Convincing evidence d Convincing evidence d Convincing evidence d Convincing evidence d
Legend
Full Bubble: Convincing evidence
Half Bubble: Partially convincing evidence
Empty Bubble: Unconvincing evidence
Null Bubble: Data unavailable
d: Disaggregated data available
*Offer a justification for each type of reliability reported, given the type and purpose of the tool.
When MAP Growth is used as an academic screener, the internal consistency reliability of student test scores (i.e., student RIT scores on MAP Growth) is key. However, estimating the internal consistency of an adaptive test such as MAP Growth is challenging, because traditional methods depend on all test takers taking a common test consisting of the same items. Applying these methods to adaptive tests is statistically cumbersome and inaccurate. Fortunately, an equally valid alternative is available in the marginal reliability coefficient, which incorporates measurement error as a function of the test score. In effect, it combines measurement error estimated at different points on the achievement scale into a single index. Note that this method of calculating reliability yields results that are nearly identical to coefficient alpha when both methods are applied to the same fixed-form test. MAP Growth affords the means to screen students on multiple occasions (e.g., Fall, Winter, Spring) during the school year. Thus, test-retest reliability is also key, and we estimate test-retest reliability via the Pearson correlation between MAP Growth RIT scores of students taking MAP Growth in two terms within the school year (Fall and Winter, Fall and Spring, and Winter and Spring). Given that MAP Growth is an adaptive test without any fixed forms, this approach to test-retest reliability may be more accurately described as a mix between test-retest reliability and a type of parallel-forms reliability. That is, MAP Growth RIT scores are obtained for students taking MAP twice, spread across several months. The second test (or retest) is not the same test. Rather, the second test is comparable to the first in its content and structure, differing only in the difficulty level of its items. Thus, both temporally related and parallel-forms reliability are defined as the consistency of equivalent measures taken across time. Green, Bock, Humphreys, Linn, and Reckase suggested the term “stratified, randomly parallel form reliability” to characterize this form of reliability.
*Describe the sample(s), including size and characteristics, for each reliability analysis conducted.
Local representation (please describe, including number of states): The sample for the study contained student records from a total of five states (Colorado, Illinois, New Jersey, New Mexico, and Rhode Island) and one federal district (District of Columbia), and thus had representation from all four U.S. Census regions.
Date: MAP Growth data were from test administrations occurring during the Fall 2015, Winter 2016, and Spring 2016 school terms, which spanned August 2015 through June 2016. The MAP Growth scores of the Grade 3 students from the previous academic year (i.e., Fall 2014, Winter 2015, and Spring 2015) were used as the Grade 2 MAP Growth scores.
*Describe the analysis procedures for each reported type of reliability.
Marginal Reliability. The approach taken for estimating marginal reliability on MAP Growth was suggested by Wright (1999):

ρ_marginal = (S_θ̂² − M_CSEM²) / S_θ̂²

where θ is the IRT achievement level (on a standardized or scaled score metric), θ̂ is an estimate of θ, S_θ̂² is the observed variance of θ̂ across the sample of N students, CSEM² is the squared conditional (on θ) standard error of measurement (CSEM), and M_CSEM² is the average squared CSEM across the sample of N students. A bootstrapping approach is used to calculate a 95% confidence interval for marginal reliability. For an initial dataset of the achievement levels and CSEMs for N students, a bootstrap 95% confidence interval for marginal reliability is obtained as follows:
1. Draw a random sample of size N with replacement from the initial dataset.
2. Calculate marginal reliability based on the random sample drawn in Step 1.
3. Repeat Steps 1 and 2 a total of 1,000 times.
4. Determine the 2.5th and 97.5th percentile points from the resulting 1,000 estimates of marginal reliability. The values of these two percentiles are the bootstrap 95% confidence interval.
Test-Retest Reliability. Test-retest reliability of MAP Growth was estimated as the Pearson correlation of student RIT scores on MAP Growth for a set of students who took MAP Growth twice. For Grades 3–8, this was either in Fall 2015 and Winter 2016, in Fall 2015 and Spring 2016, or in Winter 2016 and Spring 2016. For Grade 2, this was either in Fall 2014 and Winter 2015, in Fall 2014 and Spring 2015, or in Winter 2015 and Spring 2015. Fundamentally, the test-retest reliability coefficient is a Pearson correlation. As such, the confidence interval (CI) for the test-retest reliability coefficient was obtained using the standard CI for a Pearson correlation (i.e., via Fisher’s z-transformation).
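A minimal sketch of the two procedures described above, assuming arrays of per-student ability estimates and CSEMs (for marginal reliability and its bootstrap interval) and an observed test-retest Pearson correlation with its sample size (for the Fisher z interval); the seed, function names, and data layout are illustrative.

```python
import numpy as np

def marginal_reliability(theta_hat, csem):
    """Marginal reliability: (variance of theta-hat minus mean squared CSEM) / variance."""
    var_theta = np.var(theta_hat, ddof=1)          # observed variance of ability estimates
    mean_csem_sq = np.mean(np.square(csem))        # average squared CSEM
    return (var_theta - mean_csem_sq) / var_theta

def bootstrap_ci(theta_hat, csem, n_boot=1000, seed=0):
    """Bootstrap 95% CI: resample students with replacement, recompute, take percentiles."""
    rng = np.random.default_rng(seed)
    theta_hat, csem = np.asarray(theta_hat), np.asarray(csem)
    n = len(theta_hat)
    estimates = [
        marginal_reliability(theta_hat[idx], csem[idx])
        for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))
    ]
    return np.percentile(estimates, [2.5, 97.5])

def fisher_z_ci(r, n, z_crit=1.96):
    """95% CI for a Pearson correlation (e.g., test-retest reliability) via Fisher's z."""
    z = np.arctanh(r)
    se = 1.0 / np.sqrt(n - 3)
    return np.tanh([z - z_crit * se, z + z_crit * se])
```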

*In the table(s) below, report the results of the reliability analyses described above (e.g., internal consistency or inter-rater reliability coefficients).

Type of Subgroup Informant Age / Grade Test or Criterion n Median Coefficient 95% Confidence Interval
Lower Bound
95% Confidence Interval
Upper Bound
Results from other forms of reliability analysis not compatible with above table format:
Manual cites other published reliability studies:
No
Provide citations for additional published studies.
Do you have reliability data that are disaggregated by gender, race/ethnicity, or other subgroups (e.g., English language learners, students with disabilities)?
Yes

If yes, fill in data for each subgroup with disaggregated reliability data.

Type of Subgroup Informant Age / Grade Test or Criterion n Median Coefficient 95% Confidence Interval
Lower Bound
95% Confidence Interval
Upper Bound
Results from other forms of reliability analysis not compatible with above table format:
Tables with extensive disaggregated data are available from the Center upon request.
Manual cites other published reliability studies:
No
Provide citations for additional published studies.

Validity

Grade | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8
Rating | Convincing evidence d | Convincing evidence d | Convincing evidence d | Convincing evidence d | Convincing evidence d | Convincing evidence d | Convincing evidence d
Legend
Full Bubble: Convincing evidence
Half Bubble: Partially convincing evidence
Empty Bubble: Unconvincing evidence
Null Bubble: Data unavailable
d: Disaggregated data available
*Describe each criterion measure used and explain why each measure is appropriate, given the type and purpose of the tool.
In general terms, the better a test measures what it purports to measure and can support its intended uses and decision making, the stronger its validity is said to be. Within this broad statement resides a wide range of information that can be used as validity evidence. This information ranges, for example, from the adequacy and coverage of a test’s content, to its ability to yield scores that are predictive of status in some area, to its ability to support accurate inferences about a test taker’s status with respect to a construct, to its ability to allow generalizations from test performance within a domain to like performance in the same domain.
Much of the validity evidence for MAP Growth comes from the relationships of MAP Growth test scores to state content-aligned accountability test scores. These relationships include a) the concurrent performance of students on MAP Growth tests and their performance on state tests given for accountability purposes, and b) the predictive relationship between students’ performance on MAP Growth tests and their performance, two testing terms later, on state accountability tests.
Several important points should be noted regarding concurrent performance on MAP Growth tests and state accountability tests. First, these two forms of tests (i.e., interim vs. summative) are designed to serve two related but different purposes. MAP Growth tests are designed to provide estimates of achievement status with low measurement error. They are also designed to provide reasonable estimates of students’ strengths and weaknesses within the identified goal structure. State accountability tests are commonly designed to determine student proficiency within the state performance standard structure, with the most important decision being the classification of the student as proficient or not proficient. This primary purpose of most state tests, in conjunction with adopted content and curriculum standards and structures, can influence the relationship of student performance between the two tests. For example, one of the most common factors influencing these relationships is the use of constructed response items in state tests. In general, the greater the number of constructed response items, the weaker the relationship will appear.
Another difference is in test design. Since most state accountability tests are fixed form, it is reasonable for the test to be constructed so that maximum test information is established around the proficiency cut point. This is where a state wants to be the most confident about the classification decision that the test will inform. To the extent that this strategy is reflected in the state’s operational test, the relationship in performance between MAP Growth tests and state tests will be attenuated due to a more truncated range of scores on the state test. In addition, the requirement that state test content be connected to single grade level content standards differs from the MAP Growth content structure, which spans grade levels; this difference is another factor that weakens the observed score relationships between tests. Finally, when focus is placed on the relationship between performance on MAP Growth tests and the assigned proficiency category from the state test, information from the state test will have been collapsed into three to five categories. The correlations between RIT scores and these category assignments will always be substantially lower than if the correlations were based on RIT scores and scale scores.
Concurrent validity evidence is expressed as the degree of relationship to performance on another test measuring achievement in the same domain (e.g., mathematics, reading) administered close in time. This form of validity can be expressed as a Pearson correlation coefficient between the total domain area RIT score and the total scale score of another established test. It answers the question, “How well do the scores from this test that reference this (RIT) scale in this subject area (e.g., mathematics, reading) correspond to the scores obtained from an established test that references some other scale in the same subject area?” Both tests are administered to the same students in close temporal proximity, roughly two to three weeks apart. Correlations with non-NWEA tests that include more performance items requiring subjective scoring tend to be lower than when the non-NWEA tests consist exclusively of multiple-choice items.
Predictive validity evidence is expressed as the degree of relationship to performance on another test measuring achievement in the same domain (e.g., mathematics, reading) at some later point in time. This form of validity can also be expressed as a Pearson correlation coefficient between the total domain area RIT score and the total scale score of another established test. It answers the question, “How well do the scores from this test that reference this (RIT) scale in this subject area (e.g., reading) predict the scores obtained from an established test that references some other scale in the same subject area at a later point in time?” Both tests are administered to the same students several weeks apart, typically 12 to 36 weeks apart in the evidence reported here. Strong predictive validity is indicated when the correlations are in the low 0.80s. As with concurrent validity, correlations with non-NWEA tests that include more performance items requiring subjective scoring tend to be lower than when the non-NWEA tests consist exclusively of multiple-choice items.
The criterion measure used for this series of analyses was the scaled score on the PARCC mathematics assessment, taken by students in the sample during the Spring 2016 school term. In addition to concurrent and predictive validity, validity evidence for MAP Growth also comes from the degree and stability of the relationship of RIT scores across multiple and extended periods of time. This type of evidence supports the construct validity of MAP Growth and of the ability underlying the RIT scale. This type of construct validity evidence is provided for Grade 2, since concurrent validity coefficients were not available for Grade 2 (i.e., Grade 2 RIT scores were from administrations during the school year prior to the administration of the PARCC assessment).
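As a concrete illustration of how these two coefficients differ only in the timing of the predictor scores, the short Python sketch below computes a concurrent coefficient (same-term scores) and a predictive coefficient (earlier-term scores against the later criterion). The arrays and values are hypothetical, for illustration only; they are not actual MAP Growth or PARCC results.

```python
import numpy as np

# Hypothetical matched student records (illustrative values only)
rit_fall = np.array([195.0, 204.0, 188.0, 210.0, 199.0, 207.0])      # Fall RIT scores
rit_spring = np.array([205.0, 212.0, 196.0, 221.0, 208.0, 214.0])    # Spring RIT scores
parcc_spring = np.array([735.0, 748.0, 712.0, 769.0, 741.0, 752.0])  # Spring criterion scale scores

# Concurrent validity: correlation of scores collected close in time (Spring RIT vs. Spring criterion)
concurrent_r = np.corrcoef(rit_spring, parcc_spring)[0, 1]

# Predictive validity: correlation of earlier scores with the later criterion (Fall RIT vs. Spring criterion)
predictive_r = np.corrcoef(rit_fall, parcc_spring)[0, 1]

print(f"concurrent r = {concurrent_r:.3f}, predictive r = {predictive_r:.3f}")
```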
*Describe the sample(s), including size and characteristics, for each validity analysis conducted.
Local representation (please describe, including number of states): The sample for the study contained student records from a total of five states (Colorado, Illinois, New Jersey, New Mexico, and Rhode Island) and one federal district (District of Columbia), and thus had representation from all four U.S. Census regions.
Date: MAP Growth data for Grades 3–8 came from test administrations occurring during the Fall 2015, Winter 2016, and Spring 2016 school terms, which spanned from August 2015 through June 2016. The MAP Growth scores of the Grade 3 students from the previous academic year (i.e., Fall 2014, Winter 2015, and Spring 2015) were used as the Grade 2 MAP Growth scores. The Partnership for Assessment of Readiness for College and Careers (PARCC) data were from the Spring 2016 administration of the PARCC assessment, spanning approximately from March 2016 through June 2016.
Gender (Percent): Male: 51.11%; Female: 48.88%; Unknown: 0.02%
SES (Percent, measured as free or reduced-price lunch): Eligible for free or reduced-price lunch: Not available. Other SES Indicators: Not available.
Race/Ethnicity (Percent): White, Non-Hispanic: 44.99%; Black, Non-Hispanic: 6.42%; Hispanic: 23.84%; American Indian/Alaska Native: 1.83%; Asian/Pacific Islander: 8.59%; Multi-Ethnic: 2.88%; Not Specified or Other: 11.45%
*Describe the analysis procedures for each reported type of validity.
Concurrent validity was estimated as the Pearson correlation coefficient between student RIT scores from Spring 2016 and the same students’ total scale scores on the PARCC test, also administered in Spring 2016. Predictive validity was estimated as the Pearson correlation coefficient between student RIT scores from a given term (Fall 2015 or Winter 2016) and the same students’ total scale scores on the PARCC test administered in Spring 2016. For Grade 2, construct validity was estimated as the Pearson correlation coefficient between student RIT scores from each term of the 2014–2015 school year (Fall 2014, Winter 2015, and Spring 2015) and RIT scores from each term of the 2015–2016 school year (Fall 2015, Winter 2016, and Spring 2016). For Grade 2, predictive validity was estimated as the Pearson correlation coefficient between student RIT scores from each term of the 2014–2015 school year (Fall 2014, Winter 2015, and Spring 2015) and the students’ total scale scores on the Grade 3 PARCC test administered in Spring 2016. The 95% confidence interval for concurrent and predictive validity coefficients was based on the standard 95% confidence interval for a Pearson correlation, using the Fisher z-transformation.
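The confidence interval construction mentioned above can be sketched in Python as follows. This is a generic Fisher z-transformation interval for a Pearson correlation, assuming paired score arrays with no missing data; it is an illustration, not NWEA's actual analysis code, and the function name is hypothetical.

```python
import numpy as np
from scipy import stats

def pearson_with_fisher_ci(x, y, confidence=0.95):
    """Pearson correlation with a confidence interval via Fisher's z-transformation."""
    r, _ = stats.pearsonr(x, y)
    n = len(x)
    z = np.arctanh(r)                      # Fisher z-transform of r
    se = 1.0 / np.sqrt(n - 3)              # approximate standard error of z
    z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)
    lower, upper = np.tanh([z - z_crit * se, z + z_crit * se])  # back-transform to the r scale
    return r, lower, upper

# Example usage with paired RIT and criterion score arrays:
# r, lo, hi = pearson_with_fisher_ci(rit_scores, criterion_scores)
```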

*In the table below, report the results of the validity analyses described above (e.g., concurrent or predictive validity, evidence based on response processes, evidence based on internal structure, evidence based on relations to other variables, and/or evidence based on consequences of testing), and the criterion measures.

Type of Validity | Subgroup | Informant | Age / Grade | Test or Criterion | n | Median Coefficient | 95% Confidence Interval Lower Bound | 95% Confidence Interval Upper Bound
Results from other forms of validity analysis not compatible with above table format:
Manual cites other published validity studies:
Yes
Provide citations for additional published studies.
Wang, S., Jiao, H., & Zhang, L. (2013). Validation of longitudinal achievement constructs of vertically scaled computerized adaptive tests: A multiple-indicator, latent-growth modelling approach. International Journal of Quantitative Research in Education, 1, 383–407.
Wang, S., McCall, M., Hong, J., & Harris, G. (2013). Construct validity and measurement invariance of computerized adaptive testing: Application to Measures of Academic Progress (MAP) using confirmatory factor analysis. Journal of Educational and Developmental Psychology, 3, 88–100.
Describe the degree to which the provided data support the validity of the tool.
Concurrent, predictive, and construct validity coefficients, for each grade and each time of year, were consistently in the mid to high 0.80s. This validity evidence demonstrates a strong relationship between the MAP Growth and PARCC assessments, across the grades and times of year reported, as well as a strong relationship of MAP Growth RIT scores across school years.
Do you have validity data that are disaggregated by gender, race/ethnicity, or other subgroups (e.g., English language learners, students with disabilities)?
Yes

If yes, fill in data for each subgroup with disaggregated validity data.

Type of Validity | Subgroup | Informant | Age / Grade | Test or Criterion | n | Median Coefficient | 95% Confidence Interval Lower Bound | 95% Confidence Interval Upper Bound
Results from other forms of validity analysis not compatible with above table format:
Tables with extensive disaggregated data are available from the Center upon request.
Manual cites other published validity studies:
No
Provide citations for additional published studies.

Bias Analysis

Grade | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8
Rating | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Have you conducted additional analyses related to the extent to which your tool is or is not biased against subgroups (e.g., race/ethnicity, gender, socioeconomic status, students with disabilities, English language learners)? Examples might include Differential Item Functioning (DIF) or invariance testing in multiple-group confirmatory factor models.
Yes
If yes,
a. Describe the method used to determine the presence or absence of bias:
Once tests have been administered and results collected, analysis to detect differential item functioning (DIF) may be conducted. The method NWEA uses to detect DIF is based on the work of Linacre and Wright, as implemented by Linacre in Winsteps. When executed as part of a Winsteps analysis, this method entails:
a) carrying out a joint Rasch analysis of all person-group classifications that anchors all student abilities and item difficulties to a common (theta) scale;
b) carrying out a calibration analysis for the Reference group, keeping the student ability estimates and scale structure anchored, to produce Reference group item difficulty estimates;
c) carrying out a calibration analysis for the Focal group, keeping the student ability estimates and scale structure anchored, to produce Focal group item difficulty estimates; and
d) computing pair-wise item difficulty differences (Focal group difficulty minus Reference group difficulty), as sketched in the example below.
The calibration analyses in steps b and c are computed for each item as though all items, except the item currently targeted, are anchored at the values from the joint calibration run (step a). Ideally, analyzing items for DIF would be incorporated within the item calibration process, which can serve as a useful initial screen to identify items that should be subjected to heightened surveillance for DIF. However, the number of responses to an item by members of the demographic groups of interest may well be insufficient to yield stable calibration estimates at the group level. This can introduce statistical artifacts as well as Type I errors into DIF analyses. To avoid this, data for the analyses are taken from responses to operational tests.
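The following Python sketch illustrates the anchored-calibration idea behind steps b through d under simplifying assumptions: dichotomously scored Rasch responses, person abilities already anchored from the joint run, each group containing a mix of correct and incorrect responses, and a simple Newton-Raphson solve for each group's item difficulty. It illustrates the general technique only and is not the Winsteps implementation; all function and variable names are hypothetical.

```python
import numpy as np

def anchored_item_difficulty(theta, responses, b0=0.0, n_iter=25):
    """Estimate a Rasch item difficulty with person abilities (theta) held fixed (anchored).

    theta: anchored ability estimates for the students who saw the item
    responses: 0/1 scored responses from the same students (assumed not all 0 or all 1)
    """
    b = b0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta - b)))   # Rasch probability of a correct response
        residual = np.sum(responses - p)          # score residual; zero at the MLE of b
        info = np.sum(p * (1.0 - p))              # derivative of the residual with respect to b
        b -= residual / info                      # Newton-Raphson update
    return b

def dif_contrast(theta_ref, resp_ref, theta_foc, resp_foc):
    """DIF contrast for one item: Focal group difficulty minus Reference group difficulty (logits)."""
    b_ref = anchored_item_difficulty(theta_ref, resp_ref)
    b_foc = anchored_item_difficulty(theta_foc, resp_foc)
    return b_foc - b_ref
```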
b. Describe the subgroups for which bias analyses were conducted:
Bias (DIF) analyses were conducted by gender and by the student’s recorded ethnic group membership (Native American, Asian, African American, Hispanic, and European/Anglo American).
c. Describe the results of the bias analyses conducted, including data and interpretative statements. Include magnitude of effect (if available) if bias has been identified.
The DIF analysis examined the items initially proposed for the Common Core-aligned mathematics item pools, using test events administered during the Spring testing term of the 2011–2012 academic year and the Fall, Winter, and Spring testing terms of the 2012–2013 academic year in Arizona, California, Colorado, Georgia, Minnesota, Mississippi, New Mexico, South Carolina, and Texas. It is worth noting that many items in the item pools align to both Common Core State Standards and state content standards. Each test record included the student’s recorded ethnic group membership (Native American, Asian, African American, Hispanic, and European/Anglo American), the student’s gender, and the student’s item responses. Data from all states and all grades were combined for each content area. This aggregation was made because the DIF analysis focused narrowly on how students of the same ability but of different gender or from different ethnic groups respond to test items. The intent was to neutralize, as much as possible, the effects of differential curricular and instructional emphasis that could potentially influence the DIF analysis; retaining states and grades as analysis factors could have led to conclusions that were tangential to the primary focus. Winsteps (version 3.75.0) was used to carry out the analysis as outlined above, and calibrations were retained in their original logit metric. To help in summarizing results, the Educational Testing Service (ETS) delta method of categorizing DIF is incorporated into the tables as the ETS Class. The delta method differentiates items exhibiting negligible DIF (difference < 0.43 logits) from those exhibiting moderate DIF (difference ≥ 0.43 and < 0.64 logits) and those exhibiting severe DIF (difference ≥ 0.64 logits). All items revealed as exhibiting moderate DIF are subjected to an extra review by NWEA Content Specialists to identify the source(s) of differential functioning. For each item, these specialists make a judgment to: 1) remove the item from the item bank, 2) revise the item and re-submit it for field testing, or 3) retain the item as is. Items exhibiting severe DIF are removed from the bank. These procedures are consistent with, and act to extend, periodic Item Quality Reviews, which remove problem items or flag them for revision and re-field testing. Tables presenting detailed results from the bias analyses are available from the Center upon request.
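To make the classification rule concrete, here is a minimal Python sketch that applies the logit thresholds described above to a single DIF contrast (Focal minus Reference difficulty). The thresholds follow the text; the significance screening that typically accompanies the ETS scheme is omitted, and the function name is hypothetical.

```python
def ets_dif_class(dif_contrast):
    """Categorize a DIF contrast (Focal minus Reference difficulty, in logits)
    using the magnitude thresholds described above."""
    magnitude = abs(dif_contrast)
    if magnitude < 0.43:
        return "negligible"   # no action required
    if magnitude < 0.64:
        return "moderate"     # routed to Content Specialist review
    return "severe"           # removed from the item bank

# Example: a contrast of +0.50 logits would be flagged as moderate DIF.
print(ets_dif_class(0.50))
```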

Data Collection Practices

Most tools and programs evaluated by the NCII are branded products which have been submitted by the companies, organizations, or individuals that disseminate these products. These entities supply the textual information shown above, but not the ratings accompanying the text. NCII administrators and members of our Technical Review Committees have reviewed the content on this page, but NCII cannot guarantee that this information is free from error or reflective of recent changes to the product. Tools and programs have the opportunity to be updated annually or upon request.