Spring Math
Mathematics

Summary

Spring Math is a comprehensive RtI system that includes screening, progress monitoring, class-wide and individual math intervention, and implementation and decision-making support. Assessments are generated within the tool when needed, and Spring Math uses student data to customize class-wide and individual intervention plans for students. Clear, easy-to-understand graphs and reports are provided within the teacher and coach dashboards. Spring Math uses gated screening that involves CBMs administered to the whole class, followed by class-wide intervention, to identify students in need of intensive intervention. Spring Math assesses 130 skills in Grades K-8 and remedies gaps in learning for Grades K-12. The skills offer comprehensive but strategic coverage of the Common Core State Standards. Spring Math assesses mastery of number operations, pre-algebraic thinking, and mathematical logic. It also measures understanding of “tool skills,” which provide the foundation a child needs to question, speculate, reason, solve, and explain real-world problems. Spring Math emphasizes tool skills across grades with grade-appropriate techniques and materials.

Where to Obtain:
Developer: Amanda VanDerHeyden; Publisher: TIES/Sourcewell Tech
info@springmath.org
2340 Energy Park Dr Suite 200, St Paul, MN 55108
888-894-1930
www.springmath.com
Initial Cost:
$8.95 per student
Replacement Cost:
$8.95 per student per 1 year
Included in Cost:
Sites must have access to one computer per teacher, an internet connection, and the ability to print in black and white. Spring Math provides extensive implementation support at no additional cost through a support portal to which all users have access. Support materials include how-to videos, brief how-to documents, access to all assessments and acquisition lesson plans for 130 skills, and live and archived webinars. In addition to the support portal, new users are required to purchase an onboarding training package, at an approximate cost of $895, to help them get underway. Additional ongoing coaching support (remote and in-person) is available at a reasonable cost.
Assessments are standardized but very brief in duration. If a student requires intervention, the intervention allows for oral and written responding, the use of individual rewards for “beating the last best score,” a range of concrete, representational, and abstract understanding activities, and individualized modeling with immediate corrective feedback.
Training Requirements:
We offer a standard onboarding training and now require this training for new sites. The training is customized to the site and requires approximately 2 hours.
Qualified Administrators:
The examiners need to be educators trained in the administration of the measures. Training materials are provided.
Access to Technical Support:
Support materials, organized by user role (teacher, coach, data administrator), are provided via our support portal, which is accessible from the drop-down menu under the user's log-in icon. We provide free webinars throughout the year for users and host free training institutes at least annually. We provide a systematic onboarding process for new users to help them get underway. If users encounter technical difficulties, they can submit a request for help directly from their account, which generates a support ticket to our tech support team. Support tickets are monitored during business hours and are responded to the same day.
Assessment Format:
  • Performance measure
Scoring Time:
  • 1 minute per student
Scores Generated:
  • Raw score
Administration Time:
  • 8 minutes per student
Scoring Method:
  • Manually (by hand)
Technology Requirements:
  • Computer or tablet
  • Internet connection
Accommodations:
Assessments are standardized but very brief in duration. If a student requires intervention, the intervention allows for oral and written responding, the use of individual rewards for “beating the last best score,” a range of concrete, representational, and abstract understanding activities, and individualized modeling with immediate corrective feedback.

Descriptive Information

Please provide a description of your tool:
Spring Math is a comprehensive RtI system that includes screening, progress monitoring, class-wide and individual math intervention, and implementation and decision-making support. Assessments are generated within the tool when needed, and Spring Math uses student data to customize class-wide and individual intervention plans for students. Clear, easy-to-understand graphs and reports are provided within the teacher and coach dashboards. Spring Math uses gated screening that involves CBMs administered to the whole class, followed by class-wide intervention, to identify students in need of intensive intervention. Spring Math assesses 130 skills in Grades K-8 and remedies gaps in learning for Grades K-12. The skills offer comprehensive but strategic coverage of the Common Core State Standards. Spring Math assesses mastery of number operations, pre-algebraic thinking, and mathematical logic. It also measures understanding of “tool skills,” which provide the foundation a child needs to question, speculate, reason, solve, and explain real-world problems. Spring Math emphasizes tool skills across grades with grade-appropriate techniques and materials.
The tool is intended for use with the following grade(s).
not selected Preschool / Pre - kindergarten
selected Kindergarten
selected First grade
selected Second grade
selected Third grade
selected Fourth grade
selected Fifth grade
selected Sixth grade
selected Seventh grade
selected Eighth grade
selected Ninth grade
selected Tenth grade
selected Eleventh grade
selected Twelfth grade

The tool is intended for use with the following age(s).
not selected 0-4 years old
not selected 5 years old
not selected 6 years old
not selected 7 years old
not selected 8 years old
not selected 9 years old
not selected 10 years old
not selected 11 years old
not selected 12 years old
not selected 13 years old
not selected 14 years old
not selected 15 years old
not selected 16 years old
not selected 17 years old
not selected 18 years old

The tool is intended for use with the following student populations.
selected Students in general education
selected Students with disabilities
not selected English language learners

ACADEMIC ONLY: What skills does the tool screen?

Reading
Phonological processing:
not selected RAN
not selected Memory
not selected Awareness
not selected Letter sound correspondence
not selected Phonics
not selected Structural analysis

Word ID
not selected Accuracy
not selected Speed

Nonword
not selected Accuracy
not selected Speed

Spelling
not selected Accuracy
not selected Speed

Passage
not selected Accuracy
not selected Speed

Reading comprehension:
not selected Multiple choice questions
not selected Cloze
not selected Constructed Response
not selected Retell
not selected Maze
not selected Sentence verification
not selected Other (please describe):


Listening comprehension:
not selected Multiple choice questions
not selected Cloze
not selected Constructed Response
not selected Retell
not selected Maze
not selected Sentence verification
not selected Vocabulary
not selected Expressive
not selected Receptive

Mathematics
Global Indicator of Math Competence
selected Accuracy
selected Speed
not selected Multiple Choice
not selected Constructed Response

Early Numeracy
selected Accuracy
selected Speed
not selected Multiple Choice
not selected Constructed Response

Mathematics Concepts
selected Accuracy
selected Speed
not selected Multiple Choice
not selected Constructed Response

Mathematics Computation
selected Accuracy
selected Speed
not selected Multiple Choice
not selected Constructed Response

Mathematics Application
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

Fractions/Decimals
selected Accuracy
selected Speed
not selected Multiple Choice
not selected Constructed Response

Algebra
selected Accuracy
selected Speed
not selected Multiple Choice
not selected Constructed Response

Geometry
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

not selected Other (please describe):

Please describe specific domain, skills or subtests:
BEHAVIOR ONLY: Which category of behaviors does your tool target?


BEHAVIOR ONLY: Please identify which broad domain(s)/construct(s) are measured by your tool and define each sub-domain or sub-construct.

Acquisition and Cost Information

Where to obtain:
Email Address
info@springmath.org
Address
2340 Energy Park Dr Suite 200, St Paul, MN 55108
Phone Number
888-894-1930
Website
www.springmath.com
Initial cost for implementing program:
Cost
$8.95
Unit of cost
student
Replacement cost per unit for subsequent use:
Cost
$8.95
Unit of cost
student
Duration of license
1 year
Additional cost information:
Describe basic pricing plan and structure of the tool. Provide information on what is included in the published tool, as well as what is not included but required for implementation.
Sites must have access to one computer per teacher, an internet connection, and the ability to print in black and white. Spring Math provides extensive implementation support at no additional cost through a support portal to which all users have access. Support materials include how-to videos, brief how-to documents, access to all assessments and acquisition lesson plans for 130 skills, and live and archived webinars. In addition to the support portal, new users are required to purchase an onboarding training package, at an approximate cost of $895, to help them get underway. Additional ongoing coaching support (remote and in-person) is available at a reasonable cost.
Provide information about special accommodations for students with disabilities.
Assessments are standardized but very brief in duration. If a student requires intervention, the intervention allows for oral and written responding, the use of individual rewards for “beating the last best score,” a range of concrete, representational, and abstract understanding activities, and individualized modeling with immediate corrective feedback.

Administration

BEHAVIOR ONLY: What type of administrator is your tool designed for?
not selected General education teacher
not selected Special education teacher
not selected Parent
not selected Child
not selected External observer
not selected Other
If other, please specify:

What is the administration setting?
not selected Direct observation
not selected Rating scale
not selected Checklist
selected Performance measure
not selected Questionnaire
not selected Direct: Computerized
not selected One-to-one
not selected Other
If other, please specify:

Does the tool require technology?
Yes

If yes, what technology is required to implement your tool? (Select all that apply)
selected Computer or tablet
selected Internet connection
not selected Other technology (please specify)

If your program requires additional technology not listed above, please describe the required technology and the extent to which it is combined with teacher small-group instruction/intervention:

What is the administration context?
selected Individual
selected Small group   If small group, n=
selected Large group   If large group, n=
not selected Computer-administered
not selected Other
If other, please specify:

What is the administration time?
Time in minutes
8
per (student/group/other unit)
student

Additional scoring time:
Time in minutes
1
per (student/group/other unit)
student

ACADEMIC ONLY: What are the discontinue rules?
selected No discontinue rules provided
not selected Basals
not selected Ceilings
not selected Other
If other, please specify:


Are norms available?
No
Are benchmarks available?
Yes
If yes, how many benchmarks per year?
3
If yes, for which months are benchmarks available?
Fall, winter, spring
BEHAVIOR ONLY: Can students be rated concurrently by one administrator?
If yes, how many students can be rated concurrently?

Training & Scoring

Training

Is training for the administrator required?
Yes
Describe the time required for administrator training, if applicable:
We offer a standard onboarding training and now require this training for new sites. The training is customized to the site and requires approximately 2 hours.
Please describe the minimum qualifications an administrator must possess.
The examiners need to be educators trained in the administration of the measures. Training materials are provided.
not selected No minimum qualifications
Are training manuals and materials available?
Yes
Are training manuals/materials field-tested?
Yes
Are training manuals/materials included in cost of tools?
Yes
If No, please describe training costs:
Can users obtain ongoing professional and technical support?
Yes
If Yes, please describe how users can obtain support:
Support materials, organized by user role (teacher, coach, data administrator), are provided via our support portal, which is accessible from the drop-down menu under the user's log-in icon. We provide free webinars throughout the year for users and host free training institutes at least annually. We provide a systematic onboarding process for new users to help them get underway. If users encounter technical difficulties, they can submit a request for help directly from their account, which generates a support ticket to our tech support team. Support tickets are monitored during business hours and are responded to the same day.

Scoring

How are scores calculated?
selected Manually (by hand)
not selected Automatically (computer-scored)
not selected Other
If other, please specify:

Do you provide basis for calculating performance level scores?
Yes
What is the basis for calculating performance level and percentile scores?
not selected Age norms
not selected Grade norms
not selected Classwide norms
not selected Schoolwide norms
not selected Stanines
selected Normal curve equivalents

What types of performance level scores are available?
selected Raw score
not selected Standard score
not selected Percentile score
not selected Grade equivalents
not selected IRT-based score
not selected Age equivalents
not selected Stanines
not selected Normal curve equivalents
not selected Developmental benchmarks
not selected Developmental cut points
not selected Equated
not selected Probability
not selected Lexile score
not selected Error analysis
not selected Composite scores
not selected Subscale/subtest scores
not selected Other
If other, please specify:

Does your tool include decision rules?
Yes
If yes, please describe.
Spring Math graphs student performance at screening relative to two criteria: the answers-correct equivalents of the digits-correct criteria for instructional-level and mastery-level performance. Screening measures reflect a subskill mastery measurement approach to CBM and intentionally sample rigorous grade-level understanding (https://static1.squarespace.com/static/57ab866cf7e0ab5cbba29721/t/591b4b9a86e6c0d47e88a51d/1494961050245/SM_ScreeningByGrades_TimeOfYear.pdf). If 50% or more of the class scores below the instructional range on the screening measures, Spring Math recommends (and then provides) class-wide math intervention. Class-wide intervention data are recorded weekly; after 4 weeks of implementation, Spring Math recommends individualized intervention for students who score below the instructional range on a skill once the class median has reached mastery on that skill. Spring Math then directs the diagnostic (i.e., “drill-down”) assessment to provide the correctly aligned intervention for students needing individualized intervention. During intervention, the teacher is provided with all materials needed to conduct the intervention, including scripted activities to build related conceptual understanding. Student performance is graphed weekly. During class-wide intervention, the student's rate of improvement is shown relative to the class median rate of improvement; during intensive individualized intervention, the student's rate of improvement on the intervention skill and the generalization skill is shown. Spring Math adjusts interventions weekly based upon student data. The coach dashboard provides real-time summaries of intervention implementation (weeks with scores, most recent score entry) and intervention progress (rate of skill mastery) and directs the coach to provide support in cases where intervention growth is not sufficient.
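For illustration only, the minimal Python sketch below encodes the two decision rules described above: the 50% class-wide trigger and the class-median mastery check for flagging individualized intervention. All function names, cut scores, and data are hypothetical and do not represent Spring Math's internal implementation.

# Illustrative sketch of the decision rules described above; thresholds follow
# the narrative, but names, cut scores, and data are hypothetical.
from statistics import median

def needs_classwide_intervention(scores, instructional_cut):
    """Recommend class-wide intervention when >= 50% of the class scores
    below the instructional range on the screening measure."""
    below = sum(1 for s in scores if s < instructional_cut)
    return below / len(scores) >= 0.5

def flag_for_individual_intervention(student_score, class_scores,
                                     instructional_cut, mastery_cut):
    """After class-wide intervention, flag a student for individualized
    intervention when the class median has reached mastery on a skill
    but the student remains below the instructional range."""
    return median(class_scores) >= mastery_cut and student_score < instructional_cut

# Example with made-up scores and cut points
scores = [12, 25, 8, 30, 14, 9, 11, 27, 10, 7]
print(needs_classwide_intervention(scores, instructional_cut=15))            # True (6 of 10 below)
print(flag_for_individual_intervention(9, [32, 35, 31, 40, 28, 33],
                                       instructional_cut=15, mastery_cut=30))  # True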
Can you provide evidence in support of multiple decision rules?
Yes
If yes, please describe.
Our previous research (e.g., VanDerHeyden, Witt, & Naquin, 2001; VanDerHeyden & Witt, 2005; VanDerHeyden, Witt, & Gilbertson, 2007) examined the RtI decision rules used in Spring Math. Further, more recent mathematics-specific research (e.g., VanDerHeyden et al., 2012; Burns, VanDerHeyden, & Jiban, 2006) has also used and tested the decision rules in the context of intervention delivery and RtI decision making. A full list of references is here: http://springmath.s3.amazonaws.com/pdf/faq/SpringMath_References.pdf. In this application, we provide classification accuracy data for screening plus class-wide intervention to determine the need for individualized intensive intervention.
Please describe the scoring structure. Provide relevant details such as the scoring format, the number of items overall, the number of items per subscale, what the cluster/composite score comprises, and how raw scores are calculated.
Composite scores included in the analyses here are the raw score total of the screening measures administered on each screening occasion.
Describe the tool’s approach to screening, samples (if applicable), and/or test format, including steps taken to ensure that it is appropriate for use with culturally and linguistically diverse populations and students with disabilities.
Spring Math assessment procedures are similar across measures and include clear and consistent assessment and scoring practices. Simple, clear language is used along with sample items in order to make the assessments appropriate for students who are linguistically diverse or those with disabilities. Supplements to the assessment instructions that do not alter the procedures are allowed.

Technical Standards

Classification Accuracy & Cross-Validation Summary

Grades covered: Kindergarten through Grade 7.
Classification Accuracy Fall: Kindergarten: Convincing evidence; Grade 1: Convincing evidence; Grade 2: Data unavailable; Grade 3: Convincing evidence; Grade 4: Data unavailable; Grade 5: Convincing evidence; Grade 6: Convincing evidence; Grade 7: Convincing evidence
Classification Accuracy Winter: Kindergarten: Convincing evidence; Grade 1: Convincing evidence; Grade 2: Convincing evidence; Grade 3: Convincing evidence; Grade 4: Convincing evidence; Grade 5: Partially convincing evidence; Grade 6: Data unavailable; Grade 7: Convincing evidence
Classification Accuracy Spring: Data unavailable for all grades (Kindergarten through Grade 7)
Legend
Full Bubble: Convincing evidence
Half Bubble: Partially convincing evidence
Empty Bubble: Unconvincing evidence
Null Bubble: Data unavailable
d: Disaggregated data available

AzMERIT (2020) or AASA (2023)

Classification Accuracy

Select time of year
Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
The outcome measure in grades 3, 5 and 7 was the AzMERIT (https://cms.azed.gov/home/GetDocumentFile?id=5b6b29191dcb250edc160590), which is the statewide achievement test in Arizona. The mean scores on the AZ measure for participants in grades 3, 5, and 7 were in the proficient range. The base rate of non-proficiency in grade 3 was 22% versus 58% nonproficient for the state. The base rate of non-proficiency in grade 5 was 26% versus 60% nonproficient for the state. The base rate of non-proficiency in grade 7 was 32% versus 69% nonproficient for the state. We used a local 20th percentile total math standard score equivalent on the AZ test as the reference criterion to identify students in need of more intensive intervention. At each grade, the local 20th percentile standard score equivalent was in the non-proficient range according to the state.
Do the classification accuracy analyses examine concurrent and/or predictive classification?

Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
Describe how the classification analyses were performed and cut-points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
The reference criterion was performing below the 20th percentile (local norm) on the researcher-constructed composite measures in Grades K and 1 and on the Arizona year-end test of mathematics for Grades 3, 5, and 7. In our first submission to NCII (subsequently published in Assessment for Effective Intervention), we used nonparametric ROC analysis in STATA 12 with the fall and winter screening composite scores as the test variable and below-the-20th-percentile performance on the reference criterion as the reference variable. ROC-generated AUC values and classification agreement values were generated within the ROC analysis. Screening risk was determined in two ways: (1) on universal screening at fall and winter, and (2) response to classwide intervention, which was delivered universally and has been shown to produce learning gains and permit more accurate determination of risk when base rates of risk are high (VanDerHeyden, McLaughlin, Algina, & Snyder, 2012; VanDerHeyden, 2013). Classification agreement analyses were also conducted directly using a priori screening decision rules for universal screening at fall and winter and subsequent risk during classwide intervention. Results replicated the ROC-generated classification agreement indices reported in this application. Universal screening measures were highly sensitive but generated a high number of false-positive errors. Classwide intervention response data (also collected universally) were slightly less sensitive but much more specific than universal screening (e.g., Grade 3 sensitivity = .82 and specificity = .80). In Grades K and 1, the probability of year-end risk was zero for children who passed the screeners. K and first-grade children who met the risk criterion during classwide intervention had a 62% and 59% probability of year-end risk, which was 2.67 and 2.36 times the base rate of risk, respectively. At Grade 3, children who passed the screeners had zero probability of year-end risk. Third graders meeting the risk criterion during classwide intervention had a 33% chance of year-end risk on the AZ measure, which was 1.5 times the base rate of risk in the sample. Grade 5 students who passed the screener had zero probability of year-end risk, but a 39% probability of year-end risk if they met the risk criterion during classwide intervention, which was 2.17 times the base rate of risk for the sample. Grade 7 students who passed the screeners had a 1% chance of year-end risk on the AZ measure, but a 44% chance of year-end risk if they met the risk criterion during classwide intervention, which was 1.63 times the base rate of risk in the sample. Functionally, universal screening is used to determine the need for classwide intervention. Classwide intervention data can then be used as a second screening gate. In this NEW submission, we recruited new samples (collected in the 2018-2019 school year) and conducted new analyses using the previously ROC-identified cut scores. This time, however, for the fall screening at all grades, we used the following screening criterion: fall composite screening score below criterion and/or identified as being at risk during subsequent classwide intervention, which reflects the screening criterion we use in practice to advance students to more intensive instruction (diagnostic assessment and individualized intervention). We used a sample for which all students were exposed to both the screening measures and classwide math intervention.
We report the cell values obtained for the classification accuracy metrics and regression-generated ROC AUC values reflecting the combined screening criterion. In this submission, we also use the Spring reference criterion for fall and winter screening accuracy analyses, which we believe to be more rigorous than using the more-proximal winter criterion and more reflective of the type of decisions we are making with these data.
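As a worked illustration of how the classification accuracy statistics reported in the tables below are derived from the 2x2 decision table, the short Python sketch that follows uses the Kindergarten fall cell counts (a = 47, b = 45, c = 7, d = 232). It is not the original STATA analysis code; it only reproduces the arithmetic behind each reported metric.

# Worked check of the reported classification statistics from the 2x2 cells
a, b, c, d = 47, 45, 7, 232          # true pos, false pos, false neg, true neg
n = a + b + c + d

base_rate   = (a + c) / n            # proportion truly at risk          -> 0.16
overall     = (a + d) / n            # overall classification rate       -> 0.84
sensitivity = a / (a + c)            # true positives / all at risk      -> 0.87
specificity = d / (b + d)            # true negatives / all not at risk  -> 0.84
fpr         = b / (b + d)            # false positive rate               -> 0.16
fnr         = c / (a + c)            # false negative rate               -> 0.13
ppp         = a / (a + b)            # positive predictive power         -> 0.51
npp         = d / (c + d)            # negative predictive power         -> 0.97

for name, val in [("base rate", base_rate), ("overall", overall),
                  ("sensitivity", sensitivity), ("specificity", specificity),
                  ("FPR", fpr), ("FNR", fnr), ("PPP", ppp), ("NPP", npp)]:
    print(f"{name}: {val:.2f}")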
Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
Yes
If yes, please describe the intervention, what children received the intervention, and how they were chosen.
Classwide intervention is part of the screening decision. Intensive, individualized intervention was not provided to students between the screening decision and the outcome measurement.

Cross-Validation

Has a cross-validation study been conducted?
No
If yes,
Select time of year.
Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
Do the cross-validation analyses examine concurrent and/or predictive classification?

Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
Describe how the cross-validation analyses were performed and cut-points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
If yes, please describe the intervention, what children received the intervention, and how they were chosen.

Spring Composite Score

Classification Accuracy

Select time of year
Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
In Grades K and 1, a Winter Composite (for fall screening) and a Spring Composite (for winter screening) were the outcome criteria. For Grade K, the Winter Composite was a researcher-constructed measure that reflected the raw score total of 4 timed measures, timed at 1 minute each. These measures were Count Objects to 20 & Write Answer, Identify Number and Draw Circles to 20, Quantity Discrimination with Dot Sets to 20, and Missing Number to 20. Thus, the composite score reflected understanding of object-number correspondence, cardinality, and ordinality. To respond correctly, children also had to be facile with identifying and writing numbers. Curriculum-based measures of object-number correspondence, cardinality, number identification/naming, and ordinality have been studied extensively by multiple research teams (Floyd, Hojnoski, & Key, 2006). Among the first research teams to investigate curriculum-based measures of early numeracy were VanDerHeyden, Witt, Naquin, and Noell (2001), who studied counting objects and writing numbers and identifying numbers and drawing corresponding object sets, with alternate-form reliability correlations ranging from r = .70 to .84 and concurrent validity correlations of r = .44 to .61 with the Comprehensive Inventory of Basic Skills, Revised (Brigance, 1999). Clarke and Shinn (2004) examined, among other measures, a missing number measure to quantify ordinal understanding among first graders with excellent findings, reporting 26-week test-retest r = .81 and a concurrent validity correlation with the Number Knowledge Test (Okamoto & Case, 1996) of r = .74. In 2011, VanDerHeyden and colleagues developed and tested new measures of early mathematical understanding, but also included the missing number measure, counting objects and writing the number, and identifying the number and drawing circles. These authors reported test-retest correlation values ranging from r = .71 to .87, correlations with Test of Early Mathematical Ability (TEMA; Ginsburg & Baroody, 2003) scores of r = .55 to .71, and longitudinal correlation values with curriculum-based measures of addition and subtraction at the end of first grade of r = .51 to .55. The quantity comparison measure using dot sets was also examined, with test-retest r = .82, concurrent validity with the TEMA of r = .49, and predictive validity with year-end first-grade measures of addition and subtraction of r = .43. The Spring Composite was a researcher-constructed measure that reflected the raw score total of 4 timed measures, timed at 1 minute each. These measures were Change Quantity of Dots to Make 10, Missing Number to 20, Addition 0-5 for Kindergarten, and Subtraction 0-5 for Kindergarten. The first measure reflected understanding of object-number correspondence: given a starting dot set of 1-10, the child had to strike out or add dots to create a set of 1-10 (never matching the starting set) equal to a specified number, which required the child to identify the desired number quantity and then add or remove dots to create an equivalent set. The second measure, Missing Number, assesses ordinality and has excellent concurrent and predictive validity for children. The third and fourth measures assessed the child's ability to combine and take quantities to 5 using numbers.
Do the classification accuracy analyses examine concurrent and/or predictive classification?

Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
Describe how the classification analyses were performed and cut-points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
The reference criterion was performing below the 20th percentile (local norm) on the researcher-constructed composite measures in Grades K and 1 and on the Arizona year-end test of mathematics for Grades 3, 5, and 7. Nonparametric ROC analysis was conducted in STATA 12 with the fall and winter screening composite scores as the test variable and below-the-20th-percentile performance on the reference criterion as the reference variable. ROC-generated AUC values and classification agreement values were generated within the ROC analysis. Screening risk was determined in two ways: (1) on universal screening at fall and winter, and (2) response to classwide intervention, which was delivered universally and has been shown to produce learning gains and permit more accurate determination of risk when base rates of risk are high (VanDerHeyden, McLaughlin, Algina, & Snyder, 2012; VanDerHeyden, 2013). Classification agreement analyses were also conducted directly using a priori screening decision rules for universal screening at fall and winter and subsequent risk during classwide intervention. Results replicated the ROC-generated classification agreement indices reported in this application. Universal screening measures were highly sensitive but generated a high number of false-positive errors. Classwide intervention response data (also collected universally) were slightly less sensitive but much more specific than universal screening (e.g., Grade 3 sensitivity = .82 and specificity = .80). In Grades K and 1, the probability of year-end risk was zero for children who passed the screeners. K and first-grade children who met the risk criterion during classwide intervention had a 62% and 59% probability of year-end risk, which was 2.67 and 2.36 times the base rate of risk, respectively. At Grade 3, children who passed the screeners had zero probability of year-end risk. Third graders meeting the risk criterion during classwide intervention had a 33% chance of year-end risk on the AZ measure, which was 1.5 times the base rate of risk in the sample. Grade 5 students who passed the screener had zero probability of year-end risk, but a 39% probability of year-end risk if they met the risk criterion during classwide intervention, which was 2.17 times the base rate of risk for the sample. Grade 7 students who passed the screeners had a 1% chance of year-end risk on the AZ measure, but a 44% chance of year-end risk if they met the risk criterion during classwide intervention, which was 1.63 times the base rate of risk in the sample. Functionally, universal screening is used to determine the need for classwide intervention. Classwide intervention data can then be used as a second screening gate. In this NEW submission, we recruited new samples (collected in the 2018-2019 school year) and conducted new analyses using the previously ROC-identified cut scores. This time, however, for the fall screening at all grades, we used the following screening criterion: fall composite screening score below criterion and/or identified as being at risk during subsequent classwide intervention, which reflects the screening criterion we use in practice to advance students to more intensive instruction (diagnostic assessment and individualized intervention). We used a sample for which all students were exposed to both the screening measures and classwide math intervention. We report the cell values obtained for the classification accuracy metrics and regression-generated ROC AUC values reflecting the combined screening criterion.
In this submission, we also use the Spring reference criterion for fall and winter screening accuracy analyses, which we believe to be more rigorous than using the more-proximal winter criterion and more reflective of the type of decisions we are making with these data.
Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
Yes
If yes, please describe the intervention, what children received the intervention, and how they were chosen.
Classwide intervention is part of the screening decision. Intensive, individualized intervention was not provided to students between the screening decision and the outcome measurement.

Cross-Validation

Has a cross-validation study been conducted?
No
If yes,
Select time of year.
Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
Do the cross-validation analyses examine concurrent and/or predictive classification?

Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
Describe how the cross-validation analyses were performed and cut-points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
If yes, please describe the intervention, what children received the intervention, and how they were chosen.

PSSA

Classification Accuracy

Select time of year
Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
The PSSA is the year-end state accountability measure in Pennsylvania and has been used in the state since 2003. According to the most recent technical data available (2021), the reliability of mathematics scores at Grade 6 is 0.89 (and all remaining grades equal or exceed .90). PSSA scores have been examined in concert with National Assessment of Educational Progress (NAEP) scores, Classroom Diagnostic Tools (CDT), Terra Nova, and subtest correlations, demonstrating strong discriminant, convergent, and predictive validity.
Do the classification accuracy analyses examine concurrent and/or predictive classification?

Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
The fall screening measures were administered in August of 2021. The criterion was administered in the spring of 2022 as part of the state's annual accountability testing program.
Describe how the classification analyses were performed and cut-points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
For the fall screening, we used a raw score composite of the screening measures which include: Add and subtract fractions with unlike denominators, Order of operations, Multiply 2x2 digits with decimals, and Multiply and divide mixed numbers. The raw score composite is transformed into a percentile rank and children at or below the 20th percentile are coded as being at risk. Criterion scores (PSSA) are similarly transformed and children are coded as scoring at or below the 20th percentile.
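A minimal sketch of the coding step described above follows: sum the fall screening measures into a raw composite, convert composites to local percentile ranks, and code students at or below the 20th percentile as at risk. The function name, the particular percentile-rank formula, and the data are illustrative assumptions, not the code used for the reported analyses.

# Sketch of the composite-to-risk coding described above (hypothetical data)
import numpy as np
from scipy.stats import rankdata

def percentile_ranks(scores):
    """Percentile rank of each score within the local sample (0-100)."""
    scores = np.asarray(scores, dtype=float)
    return 100.0 * (rankdata(scores, method="average") - 0.5) / len(scores)

# Each row: one student's scores on the four fall measures listed above
fall_measures = np.array([
    [12, 8, 5, 7],
    [3, 2, 1, 0],
    [20, 15, 11, 9],
    [6, 4, 3, 2],
])
composite = fall_measures.sum(axis=1)            # raw score composite
at_risk = percentile_ranks(composite) <= 20      # True = at or below 20th percentile
print(composite, at_risk)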
Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
Yes
If yes, please describe the intervention, what children received the intervention, and how they were chosen.
We report predictive validity, and children in this sample did receive classwide math intervention between the fall screening and the year-end criterion test.

Cross-Validation

Has a cross-validation study been conducted?
No
If yes,
Select time of year.
Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
Do the cross-validation analyses examine concurrent and/or predictive classification?

Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
Describe how the cross-validation analyses were performed and cut-points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
If yes, please describe the intervention, what children received the intervention, and how they were chosen.

Classification Accuracy - Fall

Evidence Kindergarten Grade 1 Grade 3 Grade 5 Grade 6 Grade 7
Criterion measure Spring Composite Score Spring Composite Score AzMERIT (2020) or AASA (2023) AzMERIT (2020) or AASA (2023) PSSA AzMERIT (2020) or AASA (2023)
Cut Points - Percentile rank on criterion measure 20 20 20 20 20 20
Cut Points - Performance score on criterion measure 20 (winter composite score) 59 3515 3582 885 on PSSA 3633
Cut Points - Corresponding performance score (numeric) on screener measure At risk 7% or more of opportunities during classwide intervention 39 24 35 5 13
Classification Data - True Positive (a) 47 47 53 56 19 33
Classification Data - False Positive (b) 45 29 48 46 15 12
Classification Data - False Negative (c) 7 5 2 7 3 7
Classification Data - True Negative (d) 232 206 194 241 75 158
Area Under the Curve (AUC) 0.90 0.90 0.91 0.86 0.89 0.90
AUC Estimate’s 95% Confidence Interval: Lower Bound 0.85 0.85 0.87 0.81 0.81 0.80
AUC Estimate’s 95% Confidence Interval: Upper Bound 0.95 0.94 0.95 0.90 0.97 0.99
Statistics Kindergarten Grade 1 Grade 3 Grade 5 Grade 6 Grade 7
Base Rate 0.16 0.18 0.19 0.18 0.20 0.19
Overall Classification Rate 0.84 0.88 0.83 0.85 0.84 0.91
Sensitivity 0.87 0.90 0.96 0.89 0.86 0.83
Specificity 0.84 0.88 0.80 0.84 0.83 0.93
False Positive Rate 0.16 0.12 0.20 0.16 0.17 0.07
False Negative Rate 0.13 0.10 0.04 0.11 0.14 0.18
Positive Predictive Power 0.51 0.62 0.52 0.55 0.56 0.73
Negative Predictive Power 0.97 0.98 0.99 0.97 0.96 0.96
Sample Kindergarten Grade 1 Grade 3 Grade 5 Grade 6 Grade 7
Date Fall (screening via classwide intervention), Winter (criterion) 9/1/18 9/1/18 9/1/18 Fall 2022 (screening) and Spring 2023 (criterion) 9/1/17
Sample Size 331 287 297 350 112 210
Geographic Representation Mountain (AZ) Mountain (AZ) Mountain (AZ) Mountain (AZ)   Mountain (AZ)
Male     53.5% 52.0%   53.8%
Female     46.5% 48.0%   46.2%
Other            
Gender Unknown            
White, Non-Hispanic     58.6% 54.3%   71.0%
Black, Non-Hispanic     2.0% 3.1%    
Hispanic     35.0% 35.1%   26.2%
Asian/Pacific Islander     0.3% 2.0%   2.4%
American Indian/Alaska Native            
Other     4.0% 5.4%   0.5%
Race / Ethnicity Unknown            
Low SES            
IEP or diagnosed disability     12.8% 9.1%   9.0%
English Language Learner     1.0% 1.7%   0.5%

Classification Accuracy - Winter

Evidence Kindergarten Grade 1 Grade 2 Grade 3 Grade 4 Grade 5 Grade 7
Criterion measure Spring Composite Score Spring Composite Score Spring Composite Score AzMERIT (2020) or AASA (2023) AzMERIT (2020) or AASA (2023) AzMERIT (2020) or AASA (2023) AzMERIT (2020) or AASA (2023)
Cut Points - Percentile rank on criterion measure 20 20 20 20 20 20 20
Cut Points - Performance score on criterion measure 38 37 12 3515 3545 standard score on AASA 3581 AASA 3633
Cut Points - Corresponding performance score (numeric) on screener measure 27 78 Met risk criterion 10% of opportunities or more during classwide intervention 31 Met risk criterion on 7% of opportunities during classwide math intervention Met Risk criterion 5% of opportunities during classwide intervention 26
Classification Data - True Positive (a) 64 17 46 52 39 38 34
Classification Data - False Positive (b) 43 13 72 48 53 63 35
Classification Data - False Negative (c) 13 4 9 8 10 9 8
Classification Data - True Negative (d) 207 68 328 199 231 255 138
Area Under the Curve (AUC) 0.91 0.91 0.89 0.87 0.86 0.85 0.91
AUC Estimate’s 95% Confidence Interval: Lower Bound 0.87 0.86 0.84 0.82 0.80 0.78 0.87
AUC Estimate’s 95% Confidence Interval: Upper Bound 0.95 0.97 0.94 0.93 0.93 0.92 0.95
Statistics Kindergarten Grade 1 Grade 2 Grade 3 Grade 4 Grade 5 Grade 7
Base Rate 0.24 0.21 0.12 0.20 0.15 0.13 0.20
Overall Classification Rate 0.83 0.83 0.82 0.82 0.81 0.80 0.80
Sensitivity 0.83 0.81 0.84 0.87 0.80 0.81 0.81
Specificity 0.83 0.84 0.82 0.81 0.81 0.80 0.80
False Positive Rate 0.17 0.16 0.18 0.19 0.19 0.20 0.20
False Negative Rate 0.17 0.19 0.16 0.13 0.20 0.19 0.19
Positive Predictive Power 0.60 0.57 0.39 0.52 0.42 0.38 0.49
Negative Predictive Power 0.94 0.94 0.97 0.96 0.96 0.97 0.95
Sample Kindergarten Grade 1 Grade 2 Grade 3 Grade 4 Grade 5 Grade 7
Date Winter (Screening), Spring (Criterion) 1/5/18 1/5/19 Spring 2022 for AASA; Winter of 2021 for RTI screening Winter 2022 (screening), Spring 2022 (criterion) 1/5/18
Sample Size 327 102 455 307 333 365 215
Geographic Representation Mountain (AZ) Mountain (AZ) Mountain (AZ) Mountain (AZ) Mountain (AZ) Mountain (AZ) Mountain (AZ)
Male       54.1%   49.6% 53.0%
Female       45.9%   50.4% 47.0%
Other              
Gender Unknown              
White, Non-Hispanic       58.3%   93.2% 70.2%
Black, Non-Hispanic       2.3%   6.0%  
Hispanic       34.5%   34.5% 26.5%
Asian/Pacific Islander       0.3%   4.9% 2.3%
American Indian/Alaska Native           5.8% 0.5%
Other       4.6%     0.5%
Race / Ethnicity Unknown              
Low SES              
IEP or diagnosed disability       12.7%   7.1% 8.8%
English Language Learner       1.0%   0.3% 0.5%

Reliability

Grades covered: Kindergarten through Grade 7.
Rating: Kindergarten: Convincing evidence; Grade 1: Convincing evidence; Grade 2: Data unavailable; Grade 3: Convincing evidence; Grade 4: Data unavailable; Grade 5: Convincing evidence; Grade 6: Data unavailable; Grade 7: Convincing evidence
Legend
Full Bubble: Convincing evidence
Half Bubble: Partially convincing evidence
Empty Bubble: Unconvincing evidence
Null Bubble: Data unavailable
d: Disaggregated data available
*Offer a justification for each type of reliability reported, given the type and purpose of the tool.
Probes are generated following a set of programmed parameters that were built and tested in a development phase. To determine measure equivalence, problem sets were generated, and each problem within a problem set was scored for possible digits correct. The digits-correct metric comes from the curriculum-based measurement literature (Deno & Mirkin, 1977) and allows for sensitive measurement of child responding. Typically, each digit that appears in the correct place-value position to arrive at the correct final answer is counted as a digit correct. Generally, digits correct are counted for all the work that occurs below the problem (in the answer) but not for any work that may appear above the problem (e.g., composing or decomposing hundreds or tens when regrouping). During development of the measure generator, a standard response format was selected for all measures, reflecting the relevant response steps needed to arrive at a correct and complete answer. Potential digits correct was the unit of analysis used to test the equivalence of generated problem sets. For example, in scoring adding and subtracting fractions with unlike denominators, we counted the digits correct in generating fractions with equivalent denominators, the digits correct in combining or taking the fraction quantity, and the digits correct in simplifying the final fraction.
The number of problems generated depended upon the task difficulty of the measure: easier skills (defined as having fewer potential digits correct per problem) required more generated problems than harder skills, for which the possible digits correct scores were much higher. Problems generated for equivalence testing ranged from 80 to 480 per measure. A total of 46,022 problems were generated and scored for possible digits correct to test the equivalence of generated problem sets. Problem sets ranged from 8 to 48 problems; most problem sets contained 30 problems. For each round of testing, 10 problem sets were generated per measure. The mean possible digits correct per problem was computed for each problem set for each measure. The standard deviation of possible digits correct across the ten generated problem sets was computed and was required to be less than 10% of the mean possible digits correct to establish equivalence.
Spring Math has 130 measures. Thirty-eight measures were not tested for equivalence because there was no variation in possible digits correct per problem type; these measures all had single-digit answers and included measures like Sums to 6, Subtraction 0-5, and Number Names. Eighty-three measures met equivalence standards on the first round of testing, with a standard deviation of possible digits correct per problem per problem set that was, on average, 4% of the mean possible digits correct per problem. Seven measures required revision and a second round of testing: Mixed Fraction Operations, Multiply Fractions, Convert Improper to Mixed, Solve 2-Step Equations, Solve Equations with Percentages, Convert Fractions to Decimals, and Collect Like Terms. After revision and re-testing, the standard deviation was again, on average, 4% of the mean. One measure, Order of Operations, required a third round of revision and re-testing; on the third round, it met the equivalence criterion, with the standard deviation representing on average 10% of the mean possible digits correct per problem across generated problem sets.
In our previous submission, we reported the results from a year-long study in Louisiana during which screening measures were generated and administered to classes of children with a 1-week interval between assessment occasions. Measures were administered by researchers with rigorous integrity and inter-rater reliability controls in place. In this NEW submission, we are also reporting the results of a large multivariate analysis of reliability for the fall measures in Grades K, 1, 3, 5, and 7.
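For illustration, the form-equivalence rule described above (the standard deviation of the per-set mean possible digits correct across the ten generated problem sets must be less than 10% of the grand mean) can be expressed as a short Python sketch; the function name and data below are hypothetical.

# Sketch of the form-equivalence check described above (made-up data)
import statistics

def forms_equivalent(possible_digits_per_set, tolerance=0.10):
    """possible_digits_per_set: one list of per-problem possible-digits-correct
    values per generated problem set (e.g., 10 sets of about 30 problems)."""
    set_means = [statistics.mean(s) for s in possible_digits_per_set]
    grand_mean = statistics.mean(set_means)
    sd = statistics.stdev(set_means)
    return sd < tolerance * grand_mean, sd / grand_mean

# Ten hypothetical generated problem sets for one measure
sets = [[4, 5, 6, 5, 4, 6] for _ in range(5)] + [[5, 5, 5, 6, 5, 5] for _ in range(5)]
ok, ratio = forms_equivalent(sets)
print(ok, round(ratio, 3))   # True if the SD of set means is under 10% of the mean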
*Describe the sample(s), including size and characteristics, for each reliability analysis conducted.
Reliability data were collected in three schools in southeastern Louisiana with appropriate procedural controls. Researchers administered the screening measures for the reliability study following an administration script. On 25% of testing occasions balanced across times (time1 and time2), grades, and classrooms, a second trained observer documented the percentage of correctly completed steps during screening administration. Average integrity (percentage of steps correctly conducted) was 99.36% with a less than perfect integrity score on only 4 occasions (one missed the sentence in the protocol telling students not to skip around, one exceeded the 2-min timing interval for one measure by 5 seconds, and on two occasions, students turned their papers over before being told to do so). Demographic data are provided in the Reliability Table below for the reliability sample. NEW EVIDENCE using G-theory Analyses: All students enrolled in 10 selected classrooms were eligible for participation (N = 256) in a district in the southwestern U.S. Sample demographics were: 51% female; 55% European American, 34% Latinx, 5% African American, 5% Asian, and 1% Native American; 9% of students received special education services; 20% of students received free or reduced price lunch. Children who were present on the days that data were collected were included and a single make-up day was permitted at Kindergarten due to a scheduled fire drill on the second day of data collection. Students with scores present on all eight occasions on both days were included (88% to 95% of eligible students across grades). Most common reasons for incomplete data included being absent on one or both of the assessment days or leaving the classroom part of the way through the assessment session on either day. Analyses accounted for missing data via imputation (detailed below).
*Describe the analysis procedures for each reported type of reliability.
Spring Math uses 3-4 timed measures per screening occasion. The initial risk decision is based on the set of measures as a whole; a subsequent risk decision is based on performance during class-wide intervention. For these reliability analyses, we report the Pearson r correlation coefficient (with 95% CI) for two time occasions (fall and winter) for the generated (i.e., alternate-form) measures administered 1 week apart.
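For illustration, a short sketch of the reported statistic under stated assumptions: the Pearson r between two generated alternate forms administered one week apart, with a 95% confidence interval from the Fisher z transformation (the interval method is an assumption; the submission reports r with a 95% CI but does not specify how the interval was constructed). The data below are simulated.

# Sketch of alternate-form reliability with a Fisher-z confidence interval
import numpy as np
from scipy import stats

def pearson_with_ci(x, y, alpha=0.05):
    r, _ = stats.pearsonr(x, y)
    z = np.arctanh(r)                        # Fisher z transform
    se = 1.0 / np.sqrt(len(x) - 3)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    lo, hi = np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)
    return r, lo, hi

rng = np.random.default_rng(0)
time1 = rng.normal(30, 8, size=120)                   # week 1 scores (simulated)
time2 = 0.8 * time1 + rng.normal(0, 5, size=120)      # week 2 scores (simulated)
print(pearson_with_ci(time1, time2))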

*In the table(s) below, report the results of the reliability analyses described above (e.g., internal consistency or inter-rater reliability coefficients).

Type of Subgroup Informant Age / Grade Test or Criterion n Median Coefficient 95% Confidence Interval
Lower Bound
95% Confidence Interval
Upper Bound
Results from other forms of reliability analysis not compatible with above table format:
NEW RELIABILITY INFORMATION: Seventeen concurrent G-studies were conducted to allow for a multivariate examination of reliability of the 17 fall subskill mastery measures that are used in Spring Math. Measures are generated (not static) according to previously tested parameters. All students enrolled in 10 selected classrooms were eligible for participation (N = 256) in a district in the southwestern U.S. Sample demographics were: 51% female; 55% European American, 34% Latinx, 5% African American, 5% Asian, and 1% Native American; 9% of students received special education services; 20% of students received free or reduced price lunch. Children who were present on the days that data were collected were included and a single make-up day was permitted at Kindergarten due to a scheduled fire drill on the second day of data collection. Students with scores present on all eight occasions on both days were included (88% to 95% of eligible students across grades). Most common reasons for incomplete data included being absent on one or both of the assessment days or leaving the classroom part of the way through the assessment session on either day. Analyses accounted for missing data via imputation. On two consecutive days in early December, the research team administered measures in all classrooms using scripted instructions during a pre-arranged time in each classroom. Probes were administered classwide in all grades except for Kindergarten. In Kindergarten, probes were administered in small groups with 5 to 6 students per group. Group membership was alphabetical based on last name and was held constant across consecutive days. Kindergarten probes were administered by a researcher at a child-sized table in a shared space that connected to all kindergarten classrooms and was commonly used for pull-out activities with kindergarten students in the school. Children were called by group and were administered the measures by a researcher. Measures included in the fall screening battery for each grade level (K, 1, 3, 5, and 7) were the focus of this study. Each “batch” of screening measures was administered four times on two consecutive days using generated alternate forms for each measure. Measure order was counterbalanced across batches, days, and groups in Kindergarten and across batches, days, and classes in remaining grades. At Kindergarten and Grade 5, a total of 4 measures were administered across 8 alternate forms for a total of 32 administered probes per grade. At Grades 1, 3, and 7, a total of 3 measures were administered across 8 alternate forms for a total of 24 administered probes per grade. At all grades, half of the assessments were administered on each of two days to minimize assessment fatigue. Measures were timed and assessment times followed conventional CBM timings of 2 minutes per measure in Grades 1, 3, and 5 and 4 minutes in Grade 7. In Kindergarten, because of suspected ceiling effects, timing was reduced to 30 seconds for two of the four measures and remained at the standard 1-minute timing for kindergarten for the other two measures. Total assessment time per day per student in each grade was 12 minutes per day for Kindergarten, 24 minutes per day for Grades 1 and 3, 32 minutes per day for Grade 5, and 48 minutes for Grade 7. Measures were printed in black and white with standard instructions at the top of each measure. 
The measures for each day were organized in a student packet and each probe was separated with a brightly colored piece of paper so that instructions could be delivered by researchers for each probe. Researchers read the standardized directions for each probe, demonstrated correct completion of a sample problem, started the timed interval, and told students to stop working and turn to the next colored page when the timer rang. Researchers repeated this process for all of the probes that were administered each day. When the daily assessment was complete for each grade, a small reward was provided to each student for participation that had been selected by the teacher. Integrity and IOA data All measures were administered and scored by the researchers. Scripts were followed in administering measures in each assessment session. A second researcher used a procedural integrity checklist that listed each observable step of the assessment session to note the correct occurrence of each procedural step for approximately 25% of assessment sessions balanced across days, grades, and classes/groups. Thus far, integrity data have been computed for 266 probe administrations with 99.8% of probe administration steps correctly completed. Research assistants used answer keys to score all measures. Approximately 25% of assessments were scored by a second research assistant to allow for estimation of interobserver agreement (IOA). Analysis To analyze the data, we applied a univariate g-theory procedure well established in the extant literature (e.g., Christ & Vining, 2006; Hintze et al., 2000; 2002; Wilson, Chen, Sandbank, & Herbert, 2019). A univariate, as opposed to multivariate, procedure was selected because the former aligns with how Spring Math probes are used to guide instructional decision-making, where risk on any measure, independent of others administered, may trigger changes in instruction. We employed a fully crossed-design where variance components were estimated based on a sum-of-squares procedure (Brennan, 2001). The following facets were examined: person (p), probe (b), and day (d), which were random effects, as well as all possible interactions. Therefore, a single student’s score, generated from the universe of all possible scores across testing conditions (i.e., a given probe, day administered), was Xpbd = µ + np + nb + nd + npb + nbd + npd + nresid (eq. 1), where µ represents the grand mean of scores across students and test conditions. In other words, our model examined how variation in observed scores varied across individuals who participated in assessment (person; p), across generated probes one assumes are equivalent (probe; b), and across either the first or second day of the study (day; d). The latter facet, day, reflects our desire to consider the replicability of results within the model. If a large percentage of variance in scores was explained by day, this would suggest results from the first day – an intact study unto itself - did not replicate, the replication being the second day. Variance components were estimated using the g-theory package (Moore, 2016) in R (R Core Team, 2017). Five to 12% of data (median = 7%) was missing across measures, primarily as a result of students being absent one or the other day of testing. Missing data was imputed using the mice package in R. Specifically, we imputed problems correct scores using a 2 level linear effects model, where scores were nested within student. 
We created five imputed datasets, calculated the study results for each, and then averaged the resulting variance components and coefficients (a sketch of this pooling step appears after the Results below). We then calculated the standard deviation of the study results across the imputed datasets; high variability would suggest that the imputed results were not stable and therefore might not be valid. Results suggest the treatment of missing data was both reliable and valid. Because, as described below, only the person facet and the residual (error) explained meaningful amounts of variation in the data, we report changes across imputed datasets only for those facets. Standard deviations of the percentage of variance explained for these two facets, reflecting the range of findings across imputed datasets, were exceedingly small, ranging from 0.18 to 1.21 percentage points. As a test of limits, we also ran the analyses with and without the imputed data; variance components and G and D coefficients across the imputed and non-imputed datasets were generally within 5% of each other.
In the current study, we also examined potential bias in generalizability and dependability by splitting the sample into subsamples and comparing results across measures. We contrasted students who were and were not receiving free or reduced-price lunch, students identified as male versus female, and students identified as African American, Hispanic, or American Indian versus those identified as Caucasian or Asian. G- and D-study results were not moderated by these learner/student characteristics.
Results
For all measures at all grades, students accounted for the most variance in scores. For 16 of the 17 measures, probe forms accounted for less than 5% of the variance. Probe forms accounted for 0% to 4.42% of the variance in scores for the Kindergarten measures, 0.56% to 1.96% for Grade 1, 1.10% to 2.84% for Grade 3, 0.86% to 11.24% for Grade 5, and 0.34% to 2.28% for Grade 7. The measure for which probe forms accounted for 11.24% of the variance was Multiply 2-digit by 2-digit Numbers with and without Regrouping in Grade 5. Thus, the rank ordering of students did not vary based on the probe form. Generalizability (G) coefficients exceeded .70 on the first trial (range = .74 to .92) and .80 on the second trial (range = .83 to .95) for all but three measures; dependability (D) coefficients followed the same pattern (see attached Figure 1). The three measures with weaker G and D coefficients were Count Objects to 10, Circle Answer (Kindergarten); Identify Numbers to 10, Draw Circles (Kindergarten); and Calculate Missing Value in a Percentage Problem (Grade 7). For the two Kindergarten measures, G and D coefficients exceeded .70 by the third trial. For the Grade 7 measure, the G coefficient exceeded .70 on the second trial and the D coefficient exceeded .70 on the third trial.
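The pooling step referenced above can be summarized in a short sketch. The Python function below is a hypothetical stand-in (the original analysis used the gtheory and mice packages in R); it averages variance components across imputed datasets and reports their standard deviation as the stability check described in the text. The numbers in the example are arbitrary and for illustration only.

# Pool G-study variance components across multiply imputed datasets.
import statistics

def pool_variance_components(results):
    """results: one dict per imputed dataset, mapping facet name -> % variance explained."""
    pooled = {}
    for facet in results[0]:
        values = [r[facet] for r in results]
        pooled[facet] = {
            "mean_pct": statistics.mean(values),   # averaged estimate across imputations
            "sd_pct": statistics.stdev(values),    # spread across imputations (stability check)
        }
    return pooled

# Arbitrary illustrative values for five imputations of one measure (not study data):
example = [
    {"person": 71.0, "probe": 1.8, "day": 0.4, "residual": 26.8},
    {"person": 70.2, "probe": 2.0, "day": 0.5, "residual": 27.3},
    {"person": 71.5, "probe": 1.7, "day": 0.4, "residual": 26.4},
    {"person": 70.8, "probe": 1.9, "day": 0.5, "residual": 26.8},
    {"person": 71.1, "probe": 1.8, "day": 0.4, "residual": 26.7},
]
print(pool_variance_components(example))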
Manual cites other published reliability studies:
No
Provide citations for additional published studies.
Do you have reliability data that are disaggregated by gender, race/ethnicity, or other subgroups (e.g., English language learners, students with disabilities)?
No

If yes, fill in data for each subgroup with disaggregated reliability data.

Type of Subgroup | Informant | Age / Grade | Test or Criterion | n | Median Coefficient | 95% Confidence Interval: Lower Bound | 95% Confidence Interval: Upper Bound
Results from other forms of reliability analysis not compatible with above table format:
Manual cites other published reliability studies:
No
Provide citations for additional published studies.

Validity

Grade Kindergarten: Unconvincing evidence
Grade 1: Unconvincing evidence
Grade 2: Data unavailable
Grade 3: Unconvincing evidence
Grade 4: Data unavailable
Grade 5: Unconvincing evidence
Grade 6: Data unavailable
Grade 7: Unconvincing evidence
Legend
Full Bubble: Convincing evidence
Half Bubble: Partially convincing evidence
Empty Bubble: Unconvincing evidence
Null Bubble: Data unavailable
d: Disaggregated data available
*Describe each criterion measure used and explain why each measure is appropriate, given the type and purpose of the tool.
*Describe the sample(s), including size and characteristics, for each validity analysis conducted.
The sample size is included in the table. The demographics are similar to those described in the reliability section.
*Describe the analysis procedures for each reported type of validity.
We report Pearson r correlations for theoretically anticipated convergent measures and theoretically anticipated discriminant measures.
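As an illustration of how the coefficients and confidence bounds requested in the table below could be computed, the Python snippet below pairs a Pearson r with a Fisher-z 95% confidence interval. It is a generic sketch, not the study's code, and assumes SciPy is available.

# Pearson r with a Fisher-z 95% confidence interval.
import math
from scipy import stats

def pearson_with_ci(x, y, alpha=0.05):
    r, p = stats.pearsonr(x, y)
    n = len(x)
    z = math.atanh(r)                      # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)            # standard error of z
    z_crit = stats.norm.ppf(1 - alpha / 2)
    lo, hi = math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)
    return r, (lo, hi), p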

*In the table below, report the results of the validity analyses described above (e.g., concurrent or predictive validity, evidence based on response processes, evidence based on internal structure, evidence based on relations to other variables, and/or evidence based on consequences of testing), and the criterion measures.

Type of Subgroup | Informant | Age / Grade | Test or Criterion | n | Median Coefficient | 95% Confidence Interval: Lower Bound | 95% Confidence Interval: Upper Bound
Results from other forms of validity analysis not compatible with above table format:
Manual cites other published validity studies:
No
Provide citations for additional published studies.
Describe the degree to which the provided data support the validity of the tool.
We see a pattern of correlations that supports multi-trait, multi-method logic (Campbell & Fiske, 1959).
Do you have validity data that are disaggregated by gender, race/ethnicity, or other subgroups (e.g., English language learners, students with disabilities)?

If yes, fill in data for each subgroup with disaggregated validity data.

Type of Subgroup | Informant | Age / Grade | Test or Criterion | n | Median Coefficient | 95% Confidence Interval: Lower Bound | 95% Confidence Interval: Upper Bound
Results from other forms of validity analysis not compatible with above table format:
Manual cites other published validity studies:
Provide citations for additional published studies.

Bias Analysis

Grade Kindergarten: Yes
Grade 1: Yes
Grade 2: No
Grade 3: Yes
Grade 4: No
Grade 5: Yes
Grade 6: No
Grade 7: Yes
Have you conducted additional analyses related to the extent to which your tool is or is not biased against subgroups (e.g., race/ethnicity, gender, socioeconomic status, students with disabilities, English language learners)? Examples might include Differential Item Functioning (DIF) or invariance testing in multiple-group confirmatory factor models.
Yes
If yes,
a. Describe the method used to determine the presence or absence of bias:
We conducted a series of binary logistic regression analyses using Stata. Scoring below the 20th percentile on the Arizona year-end state test was the outcome criterion. Interaction terms between each subgroup and the fall composite screening score, the winter composite screening score, and class-wide intervention risk were tested; results are summarized below.
b. Describe the subgroups for which bias analyses were conducted:
Gender, Students with Disabilities, Ethnicity, and SES.
c. Describe the results of the bias analyses conducted, including data and interpretative statements. Include magnitude of effect (if available) if bias has been identified.
None of the interaction terms was statistically significant; thus, screening accuracy did not differ significantly across subgroups.
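As an illustration of this moderation test, the minimal Python sketch below fits one such model with statsmodels rather than Stata. The file and column names (screening_outcomes.csv, below_p20, fall_composite, subgroup) are hypothetical placeholders, not the study's actual variables.

# Logistic regression bias check: does the screener-outcome relation differ by subgroup?
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("screening_outcomes.csv")  # hypothetical file with one row per student

# below_p20: 1 if the student scored below the 20th percentile on the year-end test
# fall_composite: fall composite screening score; subgroup: 0/1 indicator
model = smf.logit("below_p20 ~ fall_composite * subgroup", data=df).fit()
print(model.summary())

# A non-significant 'fall_composite:subgroup' coefficient indicates the screener's
# relation to the outcome does not differ for the subgroup (no evidence of bias).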

Data Collection Practices

Most tools and programs evaluated by the NCII are branded products which have been submitted by the companies, organizations, or individuals that disseminate these products. These entities supply the textual information shown above, but not the ratings accompanying the text. NCII administrators and members of our Technical Review Committees have reviewed the content on this page, but NCII cannot guarantee that this information is free from error or reflective of recent changes to the product. Tools and programs have the opportunity to be updated annually or upon request.