EarlyBird Education
EarlyBird Dyslexia and Early Literacy Screener

Summary

The EarlyBird Dyslexia and Early Literacy Screener, formerly known as the Boston Children’s Hospital Early Literacy Screener (BELS), is a universal screener that brings together all the relevant predictors of dyslexia/reading in one easy-to-administer assessment. Appropriate for pre-readers and early readers, it provides the tools to easily and accurately identify students at risk of dyslexia and other reading difficulties early, when intervention is most effective. EarlyBird was developed at Boston Children’s Hospital by Dr. Nadine Gaab, in partnership with Dr. Yaacov Petscher at the Florida Center for Reading Research. It is self-administered in small groups with adult oversight and addresses the literacy milestones that have been found to be predictive of subsequent reading success. EarlyBird, administered three times per year, includes screening for:
1. severe reading risk (referred to as the dyslexia screener); and
2. moderate reading risk (referred to as the PWR (MTSS/RTI) screener).
From a student perspective, the engaging gamified assessment is accessible and child-centered. In the game, the child joins their new feathery friend Pip, who guides them along a path through the city, meeting animal friends who introduce the child to each subtest. EarlyBird has three main components:
1. A game-based app, played on an iPad, Chromebook, or any device with a Chrome browser, that is comprehensive and scientifically validated, yet easy to administer.
2. A web-based dashboard presenting assessment data at the student, classroom, school, and district levels; providing intuitive, easily accessed explanations of data; and offering customized, evidence-based “Next Steps” Resources including lesson plans, videos, activities, tools, and resource links.
3. Professional learning, delivered through virtual synchronous workshops prior to the beginning-of-year assessment and following each screening period (BOY, MOY, and EOY). The post-assessment workshops are designed to assist teachers with analyzing their data, grouping students for intervention, and providing prescriptive instruction using the “Next Steps” Resources within the platform.

Where to Obtain:
EarlyBird Education
connect@earlybirdeducation.com
1.833.759.2473
https://earlybirdeducation.com/
Initial Cost:
$10.00 per student
Replacement Cost:
Contact vendor for pricing details.
Included in Cost:
In addition to the student fee, there is a teacher license/platform fee and professional development workshop fees. Total cost will depend on the number of schools and students. Please contact connect@earlybirdeducation.com or 833-759-2473 for specific details on pricing for your district. Included with each contract is online student access to the assessment; staff access to the dashboard (with automatic scoring) and data reporting; an intuitive, easily accessed “Knowledge Nest” of tutorials for educators; student-specific reports for sharing results with families (in multiple languages); “Next Steps” Resources with printable lesson plans, activities, tools, and resource links, customized based on assessment results; At Home Activities to share with families; dedicated account management; live virtual professional development workshops led by literacy experts; easy-access service and support; secure hosting; and all maintenance, upgrades, and enhancements for the duration of the license.
Accommodations:
EarlyBird has been designed to support a wide range of learners, with accommodating features including the following. Students may use a touchscreen, mouse, or keyboard, accommodating students with limited mobility. EarlyBird, with the exception of the one subtest measuring RAN, is untimed. The game can be paused by the student or administrator at any time for breaks and resumed easily. Students can adjust the audio as needed and can use the assessment with noise-buffering headphones. EarlyBird can be administered individually or in small groups, as needed. EarlyBird was also designed to reduce assessment bias. As part of item development, all items were reviewed for bias and fairness. During administration, EarlyBird utilizes voice recognition AI, so each student is evaluated consistently, free of administrator scoring bias or fatigue. Finally, because the assessment is a fun, interactive game, students are relaxed, engaged, and participate fully, often not even realizing they are being assessed, which reduces stress and the performance effects of test anxiety.
Training Requirements:
Because EarlyBird is an intuitive and easy-to-understand game, a brief 30-45 minute kick-off meeting is all that is required for staff participating in screener administration.
Qualified Administrators:
No minimum qualifications specified.
Access to Technical Support:
EarlyBird provides full-service support via phone, email, or electronically via the EarlyBird dashboard. Additional support is available on demand via the self-serve “Knowledge Nest,” which is accessible through the user dashboard. Each EarlyBird customer has a Customer Success Manager who provides support and training throughout all stages of the customer journey. The EarlyBird training and professional development model includes:
● Implementation Planning
● Initial Kick-Off Training
● Access to Data Dashboard and “Next Steps” Resource Library
● “Next Steps” Workshops
Assessment Format:
Scoring Time:
  • Scoring is automatic OR
  • 1 minute per student (to confirm RAN score)
Scores Generated:
  • Percentile score
  • IRT-based score
  • Probability
  • Subscale/subtest scores
  • Other: Teachers are able to see the breakdown of individual subtest skills as they pertain to the foundational skills of reading. The subtest scores are presented in the form of percentiles (which indicate how the child’s performance on the subtest compares to a nationally representative sample of students). Additionally, there are two predictive profile scores, generated by weighted algorithms of a subset of subtests.
Administration Time:
  • 30 minutes on average per student (varies by grade)
Scoring Method:
  • Automatically (computer-scored)
  • Other: As soon as the child completes each subtest, scores are automatically calculated and displayed on the teacher dashboard. The only exception is the RAN subtest, which requires score confirmation (asynchronously, at the convenience of the teacher, by listening to the recording).
Technology Requirements:
  • Computer or tablet
  • Internet connection
Accommodations:
EarlyBird has been designed to support a wide range of learners, with accommodating features including the following. Students may use a touchscreen, mouse, or keyboard, accommodating students with limited mobility. EarlyBird, with the exception of the one subtest measuring RAN, is untimed. The game can be paused by the student or administrator at any time for breaks and resumed easily. Students can adjust the audio as needed and can use the assessment with noise-buffering headphones. EarlyBird can be administered individually or in small groups, as needed. EarlyBird was also designed to reduce assessment bias. As part of item development, all items were reviewed for bias and fairness. During administration, EarlyBird utilizes voice recognition AI, so each student is evaluated consistently, free of administrator scoring bias or fatigue. Finally, because the assessment is a fun, interactive game, students are relaxed, engaged, and participate fully, often not even realizing they are being assessed, which reduces stress and the performance effects of test anxiety.

Descriptive Information

Please provide a description of your tool:
The EarlyBird Dyslexia and Early Literacy Screener, formerly known as the Boston Children’s Hospital Early Literacy Screener (BELS), is a universal screener that brings together all the relevant predictors of dyslexia/reading in one easy-to-administer assessment. Appropriate for pre-readers and early readers, it provides the tools to easily and accurately identify students at risk of dyslexia and other reading difficulties early, when intervention is most effective. EarlyBird was developed at Boston Children’s Hospital by Dr. Nadine Gaab, in partnership with Dr. Yaacov Petscher at the Florida Center for Reading Research. It is self-administered in small groups with adult oversight and addresses the literacy milestones that have been found to be predictive of subsequent reading success. EarlyBird, administered three times per year, includes screening for:
1. severe reading risk (referred to as the dyslexia screener); and
2. moderate reading risk (referred to as the PWR (MTSS/RTI) screener).
From a student perspective, the engaging gamified assessment is accessible and child-centered. In the game, the child joins their new feathery friend Pip, who guides them along a path through the city, meeting animal friends who introduce the child to each subtest. EarlyBird has three main components:
1. A game-based app, played on an iPad, Chromebook, or any device with a Chrome browser, that is comprehensive and scientifically validated, yet easy to administer.
2. A web-based dashboard presenting assessment data at the student, classroom, school, and district levels; providing intuitive, easily accessed explanations of data; and offering customized, evidence-based “Next Steps” Resources including lesson plans, videos, activities, tools, and resource links.
3. Professional learning, delivered through virtual synchronous workshops prior to the beginning-of-year assessment and following each screening period (BOY, MOY, and EOY). The post-assessment workshops are designed to assist teachers with analyzing their data, grouping students for intervention, and providing prescriptive instruction using the “Next Steps” Resources within the platform.
The tool is intended for use with the following grade(s).
selected Preschool / Pre - kindergarten
selected Kindergarten
selected First grade
selected Second grade
not selected Third grade
not selected Fourth grade
not selected Fifth grade
not selected Sixth grade
not selected Seventh grade
not selected Eighth grade
not selected Ninth grade
not selected Tenth grade
not selected Eleventh grade
not selected Twelfth grade

The tool is intended for use with the following age(s).
selected 0-4 years old
selected 5 years old
selected 6 years old
selected 7 years old
selected 8 years old
not selected 9 years old
not selected 10 years old
not selected 11 years old
not selected 12 years old
not selected 13 years old
not selected 14 years old
not selected 15 years old
not selected 16 years old
not selected 17 years old
not selected 18 years old

The tool is intended for use with the following student populations.
selected Students in general education
selected Students with disabilities
selected English language learners

ACADEMIC ONLY: What skills does the tool screen?

Reading
Phonological processing:
selected RAN
selected Memory
selected Awareness
selected Letter sound correspondence
selected Phonics
not selected Structural analysis

Word ID
selected Accuracy
not selected Speed

Nonword
selected Accuracy
not selected Speed

Spelling
selected Accuracy
not selected Speed

Passage
selected Accuracy
selected Speed

Reading comprehension:
selected Multiple choice questions
not selected Cloze
not selected Constructed Response
not selected Retell
not selected Maze
not selected Sentence verification
not selected Other (please describe):


Listening comprehension:
selected Multiple choice questions
not selected Cloze
not selected Constructed Response
not selected Retell
not selected Maze
not selected Sentence verification
selected Vocabulary
not selected Expressive
selected Receptive

Mathematics
Global Indicator of Math Competence
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

Early Numeracy
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

Mathematics Concepts
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

Mathematics Computation
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

Mathematic Application
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

Fractions/Decimals
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

Algebra
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

Geometry
not selected Accuracy
not selected Speed
not selected Multiple Choice
not selected Constructed Response

not selected Other (please describe):

Please describe specific domain, skills or subtests:
EarlyBird is a comprehensive assessment, with subtests in the critical skill areas related to the science of reading: Naming Speed (Object RAN and Letter RAN); Phonemic and Phonological Awareness (First Sound Matching, Rhyming, Blending, Deletion, Nonword Repetition); Sound-Symbol Correspondence/Phonics (Letter Name, Letter Sound, Nonword Reading, Word Reading, Nonword Spelling); Passage Reading (Oral Reading Fluency and Reading Comprehension); and Oral Language (Receptive and Expressive Vocabulary, Word Matching, Oral Sentence Comprehension, Follow Directions).
BEHAVIOR ONLY: Which category of behaviors does your tool target?


BEHAVIOR ONLY: Please identify which broad domain(s)/construct(s) are measured by your tool and define each sub-domain or sub-construct.

Acquisition and Cost Information

Where to obtain:
Email Address
connect@earlybirdeducation.com
Address
Phone Number
1.833.759.2473
Website
https://earlybirdeducation.com/
Initial cost for implementing program:
Cost
$10.00
Unit of cost
student
Replacement cost per unit for subsequent use:
Cost
Unit of cost
Duration of license
Additional cost information:
Describe basic pricing plan and structure of the tool. Provide information on what is included in the published tool, as well as what is not included but required for implementation.
In addition to the student fee, there is a teacher license/platform fee and professional development workshop fees. Total cost will depend on the number of schools and students. Please contact connect@earlybirdeducation.com or 833-759-2473 for specific details on pricing for your district. Included with each contract is online student access to the assessment; staff access to the dashboard (with automatic scoring) and data reporting; an intuitive, easily accessed “Knowledge Nest” of tutorials for educators; student-specific reports for sharing results with families (in multiple languages); “Next Steps” Resources with printable lesson plans, activities, tools, and resource links, customized based on assessment results; At Home Activities to share with families; dedicated account management; live virtual professional development workshops led by literacy experts; easy-access service and support; secure hosting; and all maintenance, upgrades, and enhancements for the duration of the license.
Provide information about special accommodations for students with disabilities.
EarlyBird has been designed to support a wide range of learners, with accommodating features including the following. Students may use a touchscreen, mouse, or keyboard, accommodating students with limited mobility. EarlyBird, with the exception of the one subtest measuring RAN, is untimed. The game can be paused by the student or administrator at any time for breaks and resumed easily. Students can adjust the audio as needed and can use the assessment with noise-buffering headphones. EarlyBird can be administered individually or in small groups, as needed. EarlyBird was also designed to reduce assessment bias. As part of item development, all items were reviewed for bias and fairness. During administration, EarlyBird utilizes voice recognition AI, so each student is evaluated consistently, free of administrator scoring bias or fatigue. Finally, because the assessment is a fun, interactive game, students are relaxed, engaged, and participate fully, often not even realizing they are being assessed, which reduces stress and the performance effects of test anxiety.

Administration

BEHAVIOR ONLY: What type of administrator is your tool designed for?
not selected General education teacher
not selected Special education teacher
not selected Parent
not selected Child
not selected External observer
not selected Other
If other, please specify:

What is the administration setting?
not selected Direct observation
not selected Rating scale
not selected Checklist
not selected Performance measure
not selected Questionnaire
not selected Direct: Computerized
not selected One-to-one
not selected Other
If other, please specify:

Does the tool require technology?
Yes

If yes, what technology is required to implement your tool? (Select all that apply)
selected Computer or tablet
selected Internet connection
not selected Other technology (please specify)

If your program requires additional technology not listed above, please describe the required technology and the extent to which it is combined with teacher small-group instruction/intervention:

What is the administration context?
selected Individual
selected Small group   If small group, n=
not selected Large group   If large group, n=
selected Computer-administered
not selected Other
If other, please specify:

What is the administration time?
Time in minutes
30
per (student/group/other unit)
on average, per student (varies by grade)

Additional scoring time:
Time in minutes
1
per (student/group/other unit)
per student (to confirm RAN score)

ACADEMIC ONLY: What are the discontinue rules?
not selected No discontinue rules provided
not selected Basals
not selected Ceilings
selected Other
If other, please specify:
The majority of EarlyBird tasks (with the exception of RAN) are based on computer adaptive algorithms that leverage an Item Response Theory (IRT) framework to optimally match students to items. Because IRT item difficulties and person ability estimates are co-located on the same scale, the algorithms can move students through individual assessments according to their responses to individual items within a task. Correct responses typically result in students being administered relatively more difficult items given the student’s ability, whereas incorrect responses typically result in relatively easier items. The advantage of Computer Adaptive Testing (CAT) is that the student receives items that are neither too difficult nor too easy for their ability, so tasks can be administered quickly to obtain reliable information. It allows for the most precise assessment within the shortest amount of time.
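To make the adaptive logic concrete, here is a minimal sketch of a CAT loop under a Rasch (one-parameter IRT) model. The item bank, starting ability, fixed-step update, and stopping rule are hypothetical simplifications for illustration; a production engine would re-estimate ability by maximum likelihood or EAP after each response.

```python
import math

def p_correct(theta, b):
    """Rasch probability of a correct response at ability theta, item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def information(theta, b):
    """Fisher information; highest when item difficulty matches ability."""
    p = p_correct(theta, b)
    return p * (1.0 - p)

def run_cat(bank, get_response, n_items=10):
    """Administer n_items from bank (a list of difficulties), adapting to responses."""
    theta, administered = 0.0, set()
    for _ in range(n_items):
        # Choose the unadministered item most informative at the current
        # ability estimate (i.e., neither too hard nor too easy).
        item = max((i for i in range(len(bank)) if i not in administered),
                   key=lambda i: information(theta, bank[i]))
        administered.add(item)
        # Crude fixed-step update (a stand-in for MLE/EAP re-estimation):
        # harder items follow correct answers, easier items follow errors.
        theta += 0.5 if get_response(item) else -0.5
    return theta
```

Here get_response stands in for the child's actual answer to the delivered item; the function returns the final ability estimate that scoring would convert to a percentile.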


Are norms available?
Yes
Are benchmarks available?
Yes
If yes, how many benchmarks per year?
3
If yes, for which months are benchmarks available?
Beginning of Year (BOY) is available August to November, Middle of Year (MOY) is December to February, and End of Year (EOY) is March to June
BEHAVIOR ONLY: Can students be rated concurrently by one administrator?
If yes, how many students can be rated concurrently?

Training & Scoring

Training

Is training for the administrator required?
Yes
Describe the time required for administrator training, if applicable:
Because EarlyBird is an intuitive and easy-to-understand game, a brief 30-45 minute kick-off meeting is all that is required for staff participating in screener administration.
Please describe the minimum qualifications an administrator must possess.
EarlyBird is designed to be self-explanatory and easy for children to understand, so it can be administered by any adult, with no minimum qualifications or special training required.
selected No minimum qualifications
Are training manuals and materials available?
Yes
Are training manuals/materials field-tested?
Yes
Are training manuals/materials included in cost of tools?
Yes
If No, please describe training costs:
Can users obtain ongoing professional and technical support?
Yes
If Yes, please describe how users can obtain support:
EarlyBird provides full-service support via phone, email, or electronically via the EarlyBird dashboard. Additional support is available on demand via the self-serve “Knowledge Nest,” which is accessible through the user dashboard. Each EarlyBird customer has a Customer Success Manager who provides support and training throughout all stages of the customer journey. The EarlyBird training and professional development model includes:
● Implementation Planning
● Initial Kick-Off Training
● Access to Data Dashboard and “Next Steps” Resource Library
● “Next Steps” Workshops

Scoring

How are scores calculated?
not selected Manually (by hand)
selected Automatically (computer-scored)
selected Other
If other, please specify:
As soon as the child completes each subtest, scores are automatically calculated and displayed on the teacher dashboard. The only exception is the RAN subtest, which requires score confirmation (asynchronously, at the convenience of the teacher, by listening to the recording).

Do you provide basis for calculating performance level scores?
Yes
What is the basis for calculating performance level and percentile scores?
not selected Age norms
selected Grade norms
not selected Classwide norms
not selected Schoolwide norms
not selected Stanines
not selected Normal curve equivalents

What types of performance level scores are available?
not selected Raw score
not selected Standard score
selected Percentile score
not selected Grade equivalents
selected IRT-based score
not selected Age equivalents
not selected Stanines
not selected Normal curve equivalents
not selected Developmental benchmarks
not selected Developmental cut points
not selected Equated
selected Probability
not selected Lexile score
not selected Error analysis
not selected Composite scores
selected Subscale/subtest scores
selected Other
If other, please specify:
Teachers are able to see the breakdown of individual subtest skills as they pertain to the foundational skills of reading. The subtest scores are presented in the form of percentiles (which indicate how the child’s performance on the subtest compares to a nationally representative sample of students). Additionally, there are two predictive profile scores, generated by weighted algorithms of a subset of subtests.

Does your tool include decision rules?
Yes
If yes, please describe.
The Dyslexia Risk score uses predictive algorithms and cut points to classify students into the categories of ‘at risk’ or ‘not at risk,’ based on performance on an outcome measure. Because it uses a cut point based on the 16th percentile on the outcome measure, the EarlyBird Dyslexia Screener can be used to help teachers identify which students are in need of further assessment/intervention. The concept of risk or success can be viewed in many ways, including as a “percent chance”: a number between 1 and 99, with 1 meaning a very low chance that the student will develop a problem and 99 meaning a very high chance. When attempting to identify children who are “at risk” for poor performance on some future measure of reading achievement, this is typically a yes/no decision based upon a “cut-point” along a continuum of risk. Decisions concerning appropriate cut-points are made based on the level of correct classification that is desired from the screening assessments. A variety of statistics may be used to guide such choices (e.g., sensitivity, specificity, positive and negative predictive power; see Schatschneider, Petscher, & Williams, 2008), and each was considered in light of the others in choosing appropriate cut-points.
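The flag computation can be pictured as a logistic model over the most predictive subtests (named later in this document: Rhyming, Nonword Repetition, and Follow Directions) followed by a threshold. The coefficients and cut-point below are hypothetical placeholders, not EarlyBird's calibrated values.

```python
import math

# Hypothetical logistic model: higher subtest ability -> lower predicted risk.
INTERCEPT = -1.0
WEIGHTS = {"rhyming": -0.9, "nonword_repetition": -0.7, "follow_directions": -0.5}
CUT_POINT = 0.30  # placeholder probability threshold along the risk continuum

def risk_probability(thetas):
    """Predicted probability of being at risk, from subtest ability scores."""
    z = INTERCEPT + sum(WEIGHTS[k] * thetas[k] for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def risk_flag(thetas):
    """Yes/no decision: flag the student if predicted risk clears the cut-point."""
    return "at risk" if risk_probability(thetas) >= CUT_POINT else "not at risk"

# A student with low ability estimates on all three subtests gets flagged.
print(risk_flag({"rhyming": -1.2, "nonword_repetition": -0.8, "follow_directions": -0.5}))
```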
Can you provide evidence in support of multiple decision rules?
Yes
If yes, please describe.
The EarlyBird Dyslexia and Early Literacy assessment yields two predictive profile scores. The Dyslexia Screener Risk Score can be used to identify students needing intensive intervention, as it uses a cut point based on the 16th percentile on the outcome measure. The second EarlyBird predictive profile, called Potential for Word Reading (PWR), uses a cut point based on the 40th percentile on the outcome measure. For the Kindergarten validation study, students’ performance was coded as ‘1’ for scores at or above the 40th percentile on the SAT10 for PWR, and ‘0’ for scores that did not meet this criterion. The PWR predictive profile represents a prediction of success, indicating the probability that the student will reach grade level expectations by end of year without appropriate intervention. If a student is “flagged” for the PWR predictive profile, the student would likely benefit from intervention in certain foundational skills related to literacy, as indicated by low percentiles on one or more subtests. Additional information can be found in the EarlyBird Dyslexia and Early Literacy Technical Manual.
Please describe the scoring structure. Provide relevant details such as the scoring format, the number of items overall, the number of items per subscale, what the cluster/composite score comprises, and how raw scores are calculated.
EarlyBird subtests (with the exception of RAN) are computer adaptive, based on item response theory (IRT). After practice questions, each subtest presents a set of initial items, which span a developmentally appropriate range of difficulty and are presented in a random order. These fixed items calibrate the child’s initial ability level and allow the computer adaptive algorithm to present additional items to further pinpoint the child’s ability score. This ensures that fewer questions are asked overall, and all are within an appropriate level of difficulty for the individual child. Once a subtest has been completed, it is scored automatically using an IRT-based formula, which generates a raw score called a final theta. The final theta is converted to a normed percentile, which is displayed on EarlyBird’s Data Dashboard and indicates how the child’s performance on the subtest compares to a nationally representative sample of students. The final-theta-to-percentile conversion tables are updated periodically to reflect the most recent representative sample of student scores available. Percentiles in the lowest two quintiles are highlighted blue (light blue for < 40th percentile; dark blue for < 20th percentile), giving teachers a quick visual summary of each child’s areas of strength and need, as well as those of the classroom as a whole. In addition to the individual subtest percentiles, weighted algorithms of a subset of subtests yield two predictive profile scores:
1. Dyslexia Screener Risk Score - indicates the likelihood that a student will be at risk for severe weaknesses in phonological skills at the end of the year without intervention, which increases the likelihood of developing severe word reading difficulties, including dyslexia.
2. Potential for Word Reading - indicates the probability that the student will reach grade level expectations by end of year without appropriate intervention.
These two risk scores identify which students are in need of further diagnostic testing and/or intervention. They appear on the classroom and student dashboard views, clearly labeled and defined for the teacher. All scores (subtest scores and predictive profiles) can be viewed by the teacher on EarlyBird’s Data Dashboard. The dashboard is designed to be intuitive, quick, and visually informative. Subtests are grouped on the dashboard according to broader categories corresponding to the science of reading: sound/symbol correspondence, phonics, phonemic and phonological awareness, oral language, etc.
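The scoring flow described above can be sketched in a few lines: an IRT final theta is looked up in a norm table to get a percentile, and the percentile determines the dashboard highlighting. The norm table below is hypothetical, not EarlyBird's actual conversion table.

```python
import bisect

# Hypothetical (theta, percentile) norm table, sorted by theta.
NORM_TABLE = [(-2.0, 2), (-1.5, 7), (-1.0, 16), (-0.5, 31),
              (0.0, 50), (0.5, 69), (1.0, 84), (1.5, 93), (2.0, 98)]

def theta_to_percentile(theta):
    """Convert a final theta to a normed percentile via the nearest-below table entry."""
    thetas = [t for t, _ in NORM_TABLE]
    idx = max(bisect.bisect_right(thetas, theta) - 1, 0)
    return NORM_TABLE[idx][1]

def highlight(percentile):
    """Dashboard highlighting rule from the description above."""
    if percentile < 20:
        return "dark blue"   # lowest quintile: clear area of need
    if percentile < 40:
        return "light blue"  # second quintile
    return "none"

print(highlight(theta_to_percentile(-1.2)))  # dark blue (maps to percentile 7 here)
```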
Describe the tool’s approach to screening, samples (if applicable), and/or test format, including steps taken to ensure that it is appropriate for use with culturally and linguistically diverse populations and students with disabilities.
The EarlyBird assessment addresses literacy milestones in pre-readers that have been found to be predictive of subsequent reading success. EarlyBird includes screening for dyslexia as well as other reading challenges, with analyses providing “predictive profiles” as well as subtest-specific diagnostic norms for each child. EarlyBird is designed for three-times-per-year benchmarking: Beginning of Year (BOY) July to November, Middle of Year (MOY) December to February, and End of Year (EOY) March to June. The subtest battery and risk algorithms are adapted to each time of year to accommodate child development and expectations. All results (both predictive profiles as well as the normed percentiles for each subtest) are further explained and explored in synchronous, virtual workshops and professional learning opportunities for teachers. Multiple validation studies, involving nationally representative samples of students (from all major geographic regions of the United States, attending a mix of public, private, and charter schools), have been designed and carried out to establish construct and predictive validity, as well as to create normed percentiles for each subtest. The samples included students with and without a familial history of diagnosed or suspected dyslexia and a range of socioeconomic backgrounds (as determined by the percentage of students receiving free or reduced-price lunch at the participating schools). In terms of race and ethnicity, every attempt has been made to ensure that samples closely reflect U.S. census data. The gamified aspects of EarlyBird were designed by a group of experts at the Massachusetts Institute of Technology (MIT) to be developmentally appropriate for pre-readers and early readers alike. Teachers report that children are engaged when using EarlyBird and that the directions are simple, clear, and age-appropriate. This is by design. Additionally, the game is set in an urban setting, with only animal characters, so that it is broadly appealing and widely understood across diverse populations. Finally, the game was designed with instructional best practices: provide instruction, model the activity, and allow for hands-on practice before assessment begins. At the beginning of the EarlyBird assessment, the child is asked, “Are you ready to go on an adventure?” The child is then shown a map of a cartoon city and introduced to a new feathery friend, Pip, who will join them on their journey. The narrator explains that the child will meet more animal friends (located at fixed points along the path shown on the map), play games with them, and collect prizes on the way to the final destination. Each “game” is a subtest, associated with a different animal friend. When the child selects the icon, the animal friend greets the child and explains how to do the task with the help of animated visuals. Next, Pip demonstrates the task. For most subtests, the child then attempts 1-2 practice questions, for which corrective feedback is provided. Accommodations, in many cases, are not needed with EarlyBird. Because reading is not required to play the EarlyBird game, English Language Learners (Level 1 or above) who can understand the spoken directions in English generally do well with EarlyBird. They can also be assisted by translators who explain the directions for each subtest. Districts that have used EarlyBird with their Dual Language Learners report that they get highly valued information early in the year that they could not get any other way.
Students with behavioral challenges also respond well to EarlyBird. Since it looks and feels like a gentle game, it captures the attention of some children who can be harder to assess, including students on the autism spectrum. And because the game is adaptive, students are assessed at an appropriate level for their ability. EarlyBird uses SoapBox Labs’ voice recognition engine, which is built specifically for young children’s unique speech patterns and processes differences in accents and dialects, thus reducing implicit bias in assessment. By powering its reading assessments with the SoapBox voice recognition engine, EarlyBird provides teachers with a much more comprehensive picture of a student’s reading proficiency at a much earlier age. The AI technology automatically scores the child’s response and places the score into the teacher dashboard. SoapBox is the first and only automated speech recognition solution to demonstrate that it can deliver accurate and equitable assessments, recently receiving the Prioritizing Racial Equity in AI Design Product Certification from global education nonprofit Digital Promise.

Technical Standards

Classification Accuracy & Cross-Validation Summary

Grade Kindergarten
Classification Accuracy Fall Partially convincing evidence
Classification Accuracy Winter Partially convincing evidence
Classification Accuracy Spring Data unavailable
Legend
Full BubbleConvincing evidence
Half BubblePartially convincing evidence
Empty BubbleUnconvincing evidence
Null BubbleData unavailable
dDisaggregated data available

Kaufman Test of Educational Achievement (KTEA) - Phonological Processing subtest

Classification Accuracy

Select time of year
Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
The Kaufman Test of Educational Achievement, Third Edition (KTEA-3 Comprehensive Form; Kaufman & Kaufman, 2014) is an individually administered measure of academic achievement for grades pre-kindergarten through 12, or ages 4 through 25 years. The KTEA-3 has 19 subtests, one of which is named “Phonological Processing.” It can be administered to students in Pre-K through grade 12+ (ages 4-25). The student responds orally to items that require manipulation of sounds. The following skills are included in this subtest: Rhyming, Sound Matching, Blending and Segmenting Phonemes, and Deleting Sounds. While the items involve phonological processing, phonological awareness, and phonemic awareness (often referred to collectively as phonological skills), the publisher (Pearson) named the subtest “Phonological Processing.” In general, phonological skills can be defined as “understanding that words consist of smaller sound units (syllables, phonemes) and being able to manipulate these smaller units.” This ability to manipulate the sounds of one’s oral language enables emergent readers to analyze the phonological structure of a word and subsequently link it to the corresponding orthographic and lexical-semantic features, which establishes and facilitates word recognition. Phonological skills can be measured using phonological processing and phonological and phonemic awareness tasks. Additionally, the KTEA-3 manual emphasizes a strong relationship between the Phonological Processing subtest and word reading and decoding skills. More specifically, the KTEA-3 manual reports correlations of around 0.6 between the Phonological Processing subtest and the Reading Comprehension and Letter Word Recognition subtests in pre-K (page 149, Table B.1) as well as kindergarten (page 150, Table B.2). Furthermore, in grade 1, the Phonological Processing subtest correlates with the Nonword Decoding, Letter/Word Reading, and Reading Comprehension subtests (all above 0.6). Correlations with measures of general ability are reported to be much lower (around 0.2-0.3; page 74, Table 2.16), emphasizing the relationship between Phonological Processing and measures of reading and decoding. When it comes to children with reading disabilities, the manual reports a significant difference on the Phonological Processing subtest between children with and without a reading disorder, with children with reading disabilities exhibiting significantly lower scores (page 80, Table 2.18). Many of the skill areas covered by the Phonological Processing subtest of the KTEA-3 are also addressed through subtests in the EarlyBird assessment (e.g., rhyming, first sound matching, blending, deletion), although the EarlyBird subtests and items within those subtests were developed independently of the KTEA-3. Kaufman, A. S., & Kaufman, N. L. (2014). Kaufman Test of Educational Achievement, Third Edition. Bloomington, MN: NCS Pearson.
Do the classification accuracy analyses examine concurrent and/or predictive classification?

Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
Our screeners for risk are administered in fall and winter of the school year, and the criterion is administered in the spring, so that our risk calibration is consistent with a predictive classification model. The EarlyBird Dyslexia Risk Flag indicates the likelihood that a student will be at risk for reading struggles, as determined by poor phonological processing skills at the end of kindergarten, presuming the student doesn’t receive appropriate remediation. Dyslexia risk is based on a study conducted over the 2019-2020 school year, in which students were administered the EarlyBird assessments in the fall/winter and the outcome measure (KTEA-3 Phonological Processing subtest) in the spring of kindergarten. For the purposes of the analysis, dyslexia risk is defined as performing at or below the 16th percentile on the KTEA-3 Phonological Processing subtest (inclusive of blending, rhyming, sound matching, deletion, and segmenting items). The calculation involves logistic regression and receiver operating characteristic curve analyses (see next section) with a selection of our most predictive subtests (Rhyming, Nonword Repetition, and Follow Directions), and a weighted averaging of those data according to degree of predictiveness to generate a single output score, which is conveyed as a “flag.” That flag indicates the likelihood that a student would score poorly on the KTEA-3 Phonological Processing task. Any child flagged for dyslexia risk is at high risk for low phonological processing skills at the end of kindergarten, and therefore subsequent low reading proficiency, and needs intensive instruction targeted to the student’s skill weaknesses.
Describe how the classification analyses were performed and cut-points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
Logistic regressions were used, in part, to calibrate classification accuracy. Students’ performance on the selected criterion was coded as ‘1’ for performance below the 16th percentile on the KTEA-3 Phonological Processing for the Dyslexia Risk flag, and ‘0’ for scores that did not meet this criterion. In this way, the Dyslexia flag is a prediction of risk. Each dichotomous variable was then regressed on a combination of EarlyBird assessments. As such, students could be identified as not at-risk on the multifactorial combination of screening tasks via the joint probability while demonstrating adequate performance on the criterion (i.e., specificity or true negatives); at-risk on the combination of screening task scores via the joint probability while not demonstrating adequate performance on the criterion (i.e., sensitivity or true positives); not at-risk based on the combination of screening task scores but at-risk on the criterion (i.e., false negative error); or at-risk on the combination of screening task scores but not at-risk on the criterion (i.e., false positive error). Classifying students in these categories allows for the evaluation of cut-points on the combination of screening tasks to determine which cut-point maximized the selected indicators. The concept of risk or success can be viewed in many ways, including as a “percent chance”: a number between 1 and 99, with 1 meaning there is a low chance that a student may develop a problem, and 99 meaning there is a high chance. When attempting to identify children who are “at risk” for poor performance on some future measure of reading achievement, EarlyBird uses a yes/no decision based upon a “cut-point” along a continuum of risk. Decisions concerning appropriate cut-points are made based on the level of correct classification that is desired from the screening assessments. A variety of statistics may be used to guide such choices (e.g., sensitivity, specificity, positive and negative predictive power; see Schatschneider, Petscher, & Williams, 2008), and each was considered in light of the others in choosing appropriate cut-points. Area under the curve, sensitivity, and specificity estimates from the final logistic regression model were bootstrapped 1,000 times in order to obtain a 95% confidence interval of scores, using the cutpointr package in R statistical software.
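The cut-point search described above can be illustrated with a small sketch: sweep candidate probability thresholds and keep the one that maximizes sensitivity + specificity (Youden's J, one of several defensible criteria). The actual analysis used the cutpointr package in R with bootstrapped confidence intervals; the data pairs below are invented for illustration.

```python
def best_cut_point(scores):
    """scores: list of (predicted_probability, truly_at_risk) pairs.
    Returns the candidate threshold maximizing Youden's J = sens + spec - 1."""
    best, best_j = None, -1.0
    for cut, _ in scores:  # each observed probability is a candidate cut-point
        tp = sum(1 for p, y in scores if p >= cut and y)
        fn = sum(1 for p, y in scores if p < cut and y)
        tn = sum(1 for p, y in scores if p < cut and not y)
        fp = sum(1 for p, y in scores if p >= cut and not y)
        sens = tp / (tp + fn) if (tp + fn) else 0.0
        spec = tn / (tn + fp) if (tn + fp) else 0.0
        if sens + spec - 1.0 > best_j:
            best, best_j = cut, sens + spec - 1.0
    return best, best_j

pairs = [(0.9, True), (0.7, True), (0.6, False), (0.4, True), (0.2, False)]
print(best_cut_point(pairs))  # (0.7, 0.667): flag students with p >= 0.7
```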
Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
Yes
If yes, please describe the intervention, what children received the intervention, and how they were chosen.
EarlyBird’s studies did not include an intervention component, and EarlyBird did not collect data related to intervention services conducted at the participating schools. That said, because the samples comprised children demonstrating a wide range of performance in literacy-related skills, it is likely that some children in the samples received intervention in addition to classroom instruction between administration of the screening measure and the outcome assessment.

Cross-Validation

Has a cross-validation study been conducted?
No
If yes,
Select time of year.
Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
Do the cross-validation analyses examine concurrent and/or predictive classification?

Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
Describe how the cross-validation analyses were performed and cut-points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
If yes, please describe the intervention, what children received the intervention, and how they were chosen.

Classification Accuracy - Fall

Evidence Kindergarten
Criterion measure Kaufman Test of Educational Achievement (KTEA) - Phonological Processing subtest
Cut Points - Percentile rank on criterion measure 16
Cut Points - Performance score on criterion measure
Cut Points - Corresponding performance score (numeric) on screener measure
Classification Data - True Positive (a) 24
Classification Data - False Positive (b) 29
Classification Data - False Negative (c) 7
Classification Data - True Negative (d) 124
Area Under the Curve (AUC) 0.85
AUC Estimate’s 95% Confidence Interval: Lower Bound 0.80
AUC Estimate’s 95% Confidence Interval: Upper Bound 0.90
Statistics Kindergarten
Base Rate 0.17
Overall Classification Rate 0.80
Sensitivity 0.77
Specificity 0.81
False Positive Rate 0.19
False Negative Rate 0.23
Positive Predictive Power 0.45
Negative Predictive Power 0.95
Sample Kindergarten
Date August - November 2019
Sample Size 184
Geographic Representation Middle Atlantic (NY, PA)
Mountain (MT)
New England (MA, RI)
West North Central (MO)
West South Central (LA, TX)
Male 48.4%
Female 50.0%
Other  
Gender Unknown 1.6%
White, Non-Hispanic 73.4%
Black, Non-Hispanic 6.5%
Hispanic 9.2%
Asian/Pacific Islander 0.5%
American Indian/Alaska Native  
Other 8.7%
Race / Ethnicity Unknown 1.6%
Low SES  
IEP or diagnosed disability 6.5%
English Language Learner  
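As a cross-check, the classification statistics reported above (and repeated in the winter table below) follow directly from the reported 2x2 counts. A few lines of arithmetic reproduce each value:

```python
tp, fp, fn, tn = 24, 29, 7, 124   # reported 2x2 counts
n = tp + fp + fn + tn             # 184, matching the reported sample size
print(f"base rate: {(tp + fn) / n:.2f}")                   # 0.17
print(f"overall classification: {(tp + tn) / n:.2f}")      # 0.80
print(f"sensitivity: {tp / (tp + fn):.2f}")                # 0.77
print(f"specificity: {tn / (tn + fp):.2f}")                # 0.81
print(f"false positive rate: {fp / (fp + tn):.2f}")        # 0.19
print(f"false negative rate: {fn / (fn + tp):.2f}")        # 0.23
print(f"positive predictive power: {tp / (tp + fp):.2f}")  # 0.45
print(f"negative predictive power: {tn / (tn + fn):.2f}")  # 0.95
```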

Classification Accuracy - Winter

Evidence Kindergarten
Criterion measure Kaufman Test of Educational Achievement (KTEA) - Phonological Processing subtest
Cut Points - Percentile rank on criterion measure 16
Cut Points - Performance score on criterion measure
Cut Points - Corresponding performance score (numeric) on screener measure
Classification Data - True Positive (a) 24
Classification Data - False Positive (b) 29
Classification Data - False Negative (c) 7
Classification Data - True Negative (d) 124
Area Under the Curve (AUC) 0.85
AUC Estimate’s 95% Confidence Interval: Lower Bound 0.80
AUC Estimate’s 95% Confidence Interval: Upper Bound 0.90
Statistics Kindergarten
Base Rate 0.17
Overall Classification Rate 0.80
Sensitivity 0.77
Specificity 0.81
False Positive Rate 0.19
False Negative Rate 0.23
Positive Predictive Power 0.45
Negative Predictive Power 0.95
Sample Kindergarten
Date August - November 2019
Sample Size 184
Geographic Representation Middle Atlantic (NY, PA)
Mountain (MT)
New England (MA, RI)
West North Central (MO)
West South Central (LA, TX)
Male 48.4%
Female 50.0%
Other  
Gender Unknown 1.6%
White, Non-Hispanic 73.4%
Black, Non-Hispanic 6.5%
Hispanic 9.2%
Asian/Pacific Islander 0.5%
American Indian/Alaska Native  
Other 8.7%
Race / Ethnicity Unknown 1.6%
Low SES  
IEP or diagnosed disability 6.5%
English Language Learner  

Reliability

Grade Kindergarten
Rating Convincing evidence
Legend
Full BubbleConvincing evidence
Half BubblePartially convincing evidence
Empty BubbleUnconvincing evidence
Null BubbleData unavailable
dDisaggregated data available
*Offer a justification for each type of reliability reported, given the type and purpose of the tool.
Marginal reliability is an appropriate model-based measure of reliability to use, given that most EarlyBird subtests are computer adaptive and use Item Response Theory (IRT). Reliability describes how consistent test scores will be across multiple administrations over time, as well as how well one form of the test relates to another. Because the EarlyBird screener uses IRT as its method of validation, reliability takes on a different meaning than it does from a Classical Test Theory (CTT) perspective. The biggest difference between the two approaches is the assumption made about the measurement error related to the test scores. CTT treats the error variance as being the same for all scores, whereas the IRT view is that the level of error depends on the ability of the individual. As such, reliability in IRT becomes more about the precision of measurement across the ability range. Although it is often more useful to graphically represent the standard error across ability levels to gauge for what range of abilities the test is more or less informative, it is possible to estimate marginal reliability through a calculation.
*Describe the sample(s), including size and characteristics, for each reliability analysis conducted.
Marginal reliability for the Rhyming, First Sound Matching, Nonword Repetition, and Vocabulary kindergarten subtests was estimated using a representative sample of 419 kindergarten students in 19 schools across eight states (MT, MO, MA, NY, LA, PA, RI, and TX), covering every region of the country, who took the EarlyBird assessment between August and November 2019. The sample was 75.5% White, 12.92% Black or African American, 4.9% Asian, 2.45% American Indian, and 0.89% Native Hawaiian/Pacific Islander; 12.22% identified as Hispanic, and 3.34% did not respond. For the rest of the kindergarten subtests (Letter Name, Letter Sound, Blending, Deletion, Word Reading, Word Matching, Follow Directions, and Oral Sentence Comprehension), a statewide representative sample of kindergarten students that roughly reflected Florida’s demographic diversity and academic ability (N ~ 2,400) was collected as part of a larger K-2 validation and linking study. Because the samples used for data collection did not strictly adhere to the state distribution of demographics (i.e., percent limited English proficiency, Black, White, Latino, and eligible for free/reduced lunch), sample weights according to student demographics were used.
*Describe the analysis procedures for each reported type of reliability.
An estimate of reliability known as marginal reliability (Sireci, Thissen, & Wainer, 1991) was calculated from the variance of the ability estimates and the mean squared error. The formula and additional information about the procedure are available in the EarlyBird Dyslexia and Early Literacy Screener Technical Manual.
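The Technical Manual gives the exact formula; one common form consistent with the description above treats marginal reliability as the share of ability variance not attributable to measurement error, i.e. (var(theta) - mean(SE^2)) / var(theta). The theta estimates and standard errors below are hypothetical.

```python
import statistics

# Hypothetical IRT ability estimates and their conditional standard errors.
thetas = [-1.4, -0.6, -0.1, 0.3, 0.8, 1.5]
std_errors = [0.44, 0.37, 0.34, 0.35, 0.39, 0.46]

var_theta = statistics.pvariance(thetas)                   # variance of ability
mse = sum(se ** 2 for se in std_errors) / len(std_errors)  # mean squared error
marginal_reliability = (var_theta - mse) / var_theta
print(round(marginal_reliability, 2))  # ~0.82 for these made-up values
```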

*In the table(s) below, report the results of the reliability analyses described above (e.g., internal consistency or inter-rater reliability coefficients).

Type of Subgroup Informant Age / Grade Test or Criterion n Median Coefficient 95% Confidence Interval
Lower Bound
95% Confidence Interval
Upper Bound
Results from other forms of reliability analysis not compatible with above table format:
Manual cites other published reliability studies:
Yes
Provide citations for additional published studies.
Foorman, B. R., Petscher, Y., Stanley, C., & Truckenmiller, A. (2017). Latent profiles of reading and language and their association with standardized reading outcomes in kindergarten through tenth grade. Journal of Research on Educational Effectiveness, 10(3), 619-645. DOI: 10.1080/19345747.2016.1237597
Foorman, B. R., Petscher, Y., & Herrera, S. (2018). Unique and common effects of decoding and language factors in predicting reading comprehension in grades 1-10. Learning and Individual Differences. https://www.sciencedirect.com/science/article/abs/pii/S1041608018300414
Do you have reliability data that are disaggregated by gender, race/ethnicity, or other subgroups (e.g., English language learners, students with disabilities)?

If yes, fill in data for each subgroup with disaggregated reliability data.

Type of Subgroup Informant Age / Grade Test or Criterion n Median Coefficient 95% Confidence Interval
Lower Bound
95% Confidence Interval
Upper Bound
Results from other forms of reliability analysis not compatible with above table format:
Manual cites other published reliability studies:
Yes
Provide citations for additional published studies.
Foorman, B. R., Petscher, Y., Stanley, C., & Truckenmiller, A. (2017). Latent profiles of reading and language and their association with standardized reading outcomes in kindergarten through tenth grade. Journal of Research on Educational Effectiveness, 10(3), 619-645. DOI: 10.1080/19345747.2016.1237597
Foorman, B. R., Petscher, Y., & Herrera, S. (2018). Unique and common effects of decoding and language factors in predicting reading comprehension in grades 1-10. Learning and Individual Differences. https://www.sciencedirect.com/science/article/abs/pii/S1041608018300414

Validity

Grade Kindergarten
Rating Partially convincing evidence
Legend
Full BubbleConvincing evidence
Half BubblePartially convincing evidence
Empty BubbleUnconvincing evidence
Null BubbleData unavailable
dDisaggregated data available
*Describe each criterion measure used and explain why each measure is appropriate, given the type and purpose of the tool.
The Phonological Processing subtest on the KTEA-3 was used to determine predictive validity and construct (convergent) validity for the Dyslexia Risk part of the EarlyBird assessment. The Phonological Processing subtest measures the ability to perform a variety of tasks related to phonological awareness, such as rhyming, blending, and segmenting words, and yields a composite score. The SAT-10 Word Reading was used to determine additional predictive and concurrent validity. The Word Reading subtest of the SAT-10/SESAT measures the ability to read words. Both tests assess some of the constructs that EarlyBird also measures, but each is a standardized, paper-and-pencil psychometric test that was developed and published separately by Pearson and so is external to the EarlyBird assessment.
*Describe the sample(s), including size and characteristics, for each validity analysis conducted.
A validity study was conducted for the Dyslexia Risk (KTEA-3 outcome measure) aspect of the EarlyBird assessment for Kindergarten during the 2019-2020 school year. Students were administered the full EarlyBird assessment (all Kindergarten-appropriate subtests) during fall/winter and the KTEA-3 during spring/summer 2020. Having two data points from approximately 200 participants (located in 8 states across every region of the country), one from the app the previous fall and one from the psychometric assessments in late spring/summer 2020, allowed for the evaluation of the screener’s predictive validity. Separately, data collection for the PWR Risk (SAT-10/SESAT outcome measure) aspect of the EarlyBird screener began by testing item pools for the Screen tasks (i.e., Letter Sounds, Phonological Awareness, Word Reading, Vocabulary Pairs, and Following Directions). A statewide representative sample of students that roughly reflected Florida’s demographic diversity and academic ability (N ~ 2,400) was collected in Kindergarten as part of a larger K-2 validation and linking study. Because the samples used for data collection did not strictly adhere to the state distribution of demographics (i.e., percent limited English proficiency, Black, White, Latino, and eligible for free/reduced lunch), sample weights according to student demographics were used to inform the item and student parameter scores.
*Describe the analysis procedures for each reported type of validity.
Predictive Validity
The predictive validity of the Dyslexia Risk screening tasks against the KTEA-3 Phonological Processing subtest was assessed through a series of multiple regression analyses that tested the additive and interactive relations between EarlyBird assessments and the K-PA outcome (KTEA-3 Phonological Processing) to find the fewest number of tasks that maximized the percentage of explained variance in K-PA. The final model included the Following Directions, Nonword Repetition, and Rhyming subtests, with an R² of .37 (multiple r = .61, 95% CI = .50, .69, n = 184). The predictive and concurrent validity of the PWR screening tasks against the SAT-10 Word Reading (SESAT in K) was addressed through a series of linear and logistic regressions. The linear regressions were run two ways. First, a correlation analysis was used to evaluate the strength of relations between each of the screening task ability scores and SESAT. Pearson correlations between PWR tasks and the SESAT Word Reading task ranged from .38 to .59. Second, a multiple regression was run to estimate the total amount of variance that the linear combination of the predictors explained in SESAT (46%).
Construct Validity
Construct validity describes how well scores from an assessment measure the construct the assessment is intended to measure. A component of construct validity is convergent validity, which can be evaluated by testing relations between a developed assessment (like the EarlyBird Rhyming subtest) and another related assessment (like the Phonological Processing subtest of the KTEA-3). The goal of convergent validity is to yield a high association, which indicates that the developed measure converges on, or is empirically linked to, the intended construct. Concurrent validity (correlation) analyses were also conducted. Phonological awareness skills (like First Sound Matching) and sound/symbol correspondence tasks (like Letter Sounds) would be expected to have moderate associations between them; thus, the expectation is that moderate correlations would be observed. Predictive, convergent, and concurrent validity results are reported below.
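A quick arithmetic check on the figures above: the multiple correlation is the square root of the explained variance, so the reported R² of .37 and multiple r of .61 are mutually consistent.

```python
import math

r_squared = 0.37
print(round(math.sqrt(r_squared), 2))  # 0.61, the reported multiple r
```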

*In the table below, report the results of the validity analyses described above (e.g., concurrent or predictive validity, evidence based on response processes, evidence based on internal structure, evidence based on relations to other variables, and/or evidence based on consequences of testing), and the criterion measures.

Type of Subgroup Informant Age / Grade Test or Criterion n Median Coefficient 95% Confidence Interval
Lower Bound
95% Confidence Interval
Upper Bound
Results from other forms of validity analysis not compatible with above table format:
Convergent and concurrent validity analyses were also conducted. The EarlyBird Rhyming subtest was compared to the KTEA-3 Phonological Processing subtest (n = 215); the convergent validity coefficient was .53. The concurrent correlation between the EarlyBird First Sound Matching (n = 191) and Letter Sounds (n = 213) subtests was .51.
Manual cites other published validity studies:
Yes
Provide citations for additional published studies.
Foorman, B. R., Petscher, Y., Stanley, C., & Truckenmiller, A. (2017). Latent profiles of reading and language and their association with standardized reading outcomes in kindergarten through tenth grade. Journal of Research on Educational Effectiveness, 10(3), 619-645. https://doi.org/10.1080/19345747.2016.1237597
Foorman, B. R., Petscher, Y., & Herrera, S. (2018). Unique and common effects of decoding and language factors in predicting reading comprehension in grades 1–10. Learning and Individual Differences. https://www.sciencedirect.com/science/article/abs/pii/S1041608018300414
Describe the degree to which the provided data support the validity of the tool.
Do you have validity data that are disaggregated by gender, race/ethnicity, or other subgroups (e.g., English language learners, students with disabilities)?
No

If yes, fill in data for each subgroup with disaggregated validity data.

Type of Validity | Subgroup | Informant | Age / Grade | Test or Criterion | n | Median Coefficient | 95% Confidence Interval: Lower Bound | 95% Confidence Interval: Upper Bound
Results from other forms of validity analysis not compatible with above table format:
Manual cites other published validity studies:
Provide citations for additional published studies.

Bias Analysis

Grade: Kindergarten
Rating: Yes
Have you conducted additional analyses related to the extent to which your tool is or is not biased against subgroups (e.g., race/ethnicity, gender, socioeconomic status, students with disabilities, English language learners)? Examples might include Differential Item Functioning (DIF) or invariance testing in multiple-group confirmatory factor models.
Yes
If yes,
a. Describe the method used to determine the presence or absence of bias:
DIF analysis on items / Guidelines for retaining items: Several criteria were used to evaluate item performance. The first step was to identify items demonstrating strong floor or ceiling effects (i.e., response rates >= 95%); such items are not useful in an item bank because there is little variability in whether students succeed on them. In addition to evaluating descriptive response rates, we estimated item-total correlations. Negative values indicate poor functioning, suggesting that individuals who answer the item correctly tend to have lower total scores; similarly, low item-total correlations indicate a lack of relation between item and total test performance. Items with correlations < .15 were flagged for removal.

Following the descriptive analysis of item performance, difficulty and discrimination values from the IRT analyses were used to further identify poorly functioning items. Items were flagged for revision if the item discrimination was negative or the item difficulty was greater than +4.0 or less than -4.0.

Secondary criteria, consisting of a differential item functioning (DIF) analysis, were used in evaluating the retained items. DIF refers to instances where individuals from different groups with the same level of underlying ability differ significantly in their probability of correctly endorsing an item. Left unchecked, items that demonstrate DIF will produce biased test results. For the PWR study, DIF testing compared: Black-White students, Latino-White students, Black-Latino students, students eligible for Free or Reduced Price Lunch (FRL) with students not receiving FRL, and English Language Learner with non-English Language Learner students.

DIF testing in the PWR study was conducted with a multiple indicator multiple cause (MIMIC) analysis in Mplus (Muthén & Muthén, 2008); in addition, a series of four standardized and expected score effect size measures was generated using VisualDF software (Meade, 2010) to quantify various technical aspects of score differentiation between the groups. First, the signed item difference in the sample (SIDS) index describes the average unstandardized difference in expected scores between the groups. The second effect size, the unsigned item difference in the sample (UIDS), supplements the SIDS: when the absolute values of the SIDS and UIDS are equivalent, the differential functioning between groups is equivalent; when the absolute value of the UIDS is larger than the SIDS, it provides evidence that the item characteristic curves for expected score differences cross, indicating that differences in expected scores between groups change across the level of the latent ability score. The D-max index is the maximum SIDS value in the sample and may be interpreted as the greatest difference for any individual in the sample in the expected response. Lastly, an expected score standardized difference (ESSD) was computed similarly to Cohen's (1988) d statistic; it is interpreted as the standard deviation difference between groups in the expected score response, with values of .2 regarded as small, .5 as medium, and .8 as large. Items demonstrating DIF were flagged for further study to ascertain why groups with the same latent ability performed differently on them.
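The descriptive and IRT-based retention criteria above lend themselves to a simple screening routine. The sketch below flags items with floor/ceiling response rates >= 95% (treating a <= 5% rate as the floor mirror of the ceiling criterion), corrected item-total correlations < .15, negative discrimination, or difficulty outside +/-4.0. The response matrix and IRT parameters are simulated; the actual study pipeline may differ.

```python
import numpy as np
import pandas as pd

def flag_items(responses, difficulty, discrimination):
    """Flag items against the retention criteria described above.

    responses:      students x items DataFrame of 0/1 scores.
    difficulty:     IRT b-parameters per item (assumed precomputed).
    discrimination: IRT a-parameters per item (assumed precomputed).
    """
    p_correct = responses.mean()
    total = responses.sum(axis=1)
    # Corrected item-total correlation: item vs. total of the other items.
    item_total = pd.Series({
        item: responses[item].corr(total - responses[item])
        for item in responses.columns
    })
    return pd.DataFrame({
        "floor_or_ceiling": (p_correct >= 0.95) | (p_correct <= 0.05),
        "low_item_total": item_total < 0.15,
        "negative_discrimination": discrimination < 0,
        "extreme_difficulty": difficulty.abs() > 4.0,
    })

# Simulated demonstration: item2 sits near ceiling, item3 has a negative
# a-parameter, and item4 has an extreme b-parameter.
rng = np.random.default_rng(1)
responses = pd.DataFrame({
    "item1": rng.binomial(1, 0.60, size=500),
    "item2": rng.binomial(1, 0.97, size=500),
    "item3": rng.binomial(1, 0.55, size=500),
    "item4": rng.binomial(1, 0.50, size=500),
})
difficulty = pd.Series([0.2, -1.1, 0.4, 4.6], index=responses.columns)
discrimination = pd.Series([1.2, 0.8, -0.3, 1.5], index=responses.columns)
print(flag_items(responses, difficulty, discrimination))
```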
DIF in the Dyslexia Risk study was tested using the difR package (Magis, Beland, & Raiche, 2020) with the Mantel-Haenszel (1959) method for detecting uniform DIF. For each of the six tasks, DIF was tested for three primary contrasts: 1) Male vs. Female, 2) White vs. Sample, and 3) Black vs. Sample. The Mantel-Haenszel chi-square statistic was reported by task and item, and the chi-square was used to derive an effect size estimate (i.e., the ETS delta scale; Holland & Thayer, 1988). Effect size values <= 1.0 are considered small, 1.0 – 1.5 moderate, and >= 1.5 large.

Differential Test Functioning: A component of checking the validity of cut-points and scores on the assessments involved also testing the differential accuracy of the regression equations across demographic groups. This procedure involved a series of logistic regressions predicting success on the SESAT outcome measure (i.e., scoring at or above the 40th percentile). The independent variables included a variable representing whether students were identified as not at-risk based on the identified cut-point on a combination score of the screening tasks, a variable representing a selected demographic group, and an interaction term between the two. A statistically significant interaction term would suggest that differential accuracy in predicting end-of-year risk status existed for different groups of individuals based on the risk status identified by the PWR screener.
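As a concrete illustration of the differential test functioning procedure, the sketch below fits a logistic regression of a simulated SESAT success indicator on screener risk status, group membership, and their interaction; a significant interaction coefficient would signal differential accuracy. All variable names and values are hypothetical, not the study data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: risk flag from the screener cut-point and a demographic
# group indicator (e.g., ELL); no true interaction is built in.
rng = np.random.default_rng(2)
n = 1000
df = pd.DataFrame({
    "not_at_risk": rng.binomial(1, 0.7, size=n),   # 1 = above the cut-point
    "group": rng.binomial(1, 0.3, size=n),         # 1 = focal group member
})
# 1 = at/above the 40th percentile on the SESAT outcome.
logits = -0.5 + 1.8 * df["not_at_risk"].to_numpy()
df["sesat_success"] = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

# Logistic regression with the risk-status x group interaction; a significant
# interaction term would indicate differential predictive accuracy.
fit = smf.logit("sesat_success ~ not_at_risk * group", data=df).fit(disp=False)
print(fit.summary().tables[1])
```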
b. Describe the subgroups for which bias analyses were conducted:
For the PWR study, DIF testing compared: Black-White students, Latino-White students, Black-Latino students, students eligible for Free or Reduced Price Lunch (FRL) with students not receiving FRL, and English Language Learner with non-English Language Learner students. For the Dyslexia Risk study, DIF was tested for 1) Male vs. Female, 2) White vs. Sample, and 3) Black vs. Sample. Differential accuracy was separately tested in the PWR study for Black and Latino students, as well as for students identified as English Language Learners (ELL) and students eligible for Free/Reduced Price Lunch (FRL).
c. Describe the results of the bias analyses conducted, including data and interpretative statements. Include magnitude of effect (if available) if bias has been identified.
Differential Item Functioning (DIF) - Across all Kindergarten tasks and comparisons, only 12 items demonstrated DIF with at least a moderate effect size (i.e., ETS delta >= 1.0): 2 Nonword Repetition items and 10 Word Matching items. These items were removed from the item bank for further study and testing. All remaining items presented ETS delta values < 1.00, indicating small DIF. Differential Test Functioning - No statistically significant differential accuracy was found for any demographic subgroup. For more information, see pages 16-17 and 29 of the EarlyBird Technical Manual.

Data Collection Practices

Most tools and programs evaluated by the NCII are branded products which have been submitted by the companies, organizations, or individuals that disseminate these products. These entities supply the textual information shown above, but not the ratings accompanying the text. NCII administrators and members of our Technical Review Committees have reviewed the content on this page, but NCII cannot guarantee that this information is free from error or reflective of recent changes to the product. Tools and programs have the opportunity to be updated annually or upon request.