EarlyBird Education
EarlyBird Dyslexia and Early Literacy Screener
Summary
The EarlyBird Dyslexia and Early Literacy Screener, formerly known as the Boston Children’s Hospital Early Literacy Screener (BELS), is a universal screener that brings together all the relevant predictors of dyslexia/reading in one easy-to-administer assessment. Appropriate for pre-readers and early readers, it provides the tools to easily and accurately identify students at risk of dyslexia and other reading difficulties earlier, when the window for intervention is most effective. EarlyBird was developed at Boston Children’s Hospital by Dr. Nadine Gaab, in partnership with Dr. Yaacov Petscher at the Florida Center for Reading Research. It is self-administered in small groups with adult oversight and addresses the literacy milestones that have been found to be predictive of subsequent reading success. EarlyBird, administered three times per year, includes screening for:
1. severe reading risk (referred to as the dyslexia screener)
2. moderate reading risk (referred to as the PWR (MTSS/RTI) screener)
From a student perspective, the engaging gamified assessment is accessible and child-centered. In the game, the child joins their new feathery friend Pip, who guides them along a path through the city, meeting animal friends who introduce the child to each subtest. EarlyBird has three main components:
1. Game-based app, played on an iPad, Chromebook, or any device with a Chrome browser, that is comprehensive and scientifically validated, yet easy to administer.
2. Web-based dashboard presenting assessment data at the student, classroom, school, and district levels; providing intuitive, easily accessed explanations of data; and customized evidence-based Next Step Resources including lesson plans, videos, activities, tools, and resource links.
3. Professional learning, delivered through virtual synchronous workshops prior to the beginning-of-year assessment and following each screening period (BOY, MOY, and EOY). The post-assessment workshops are designed to assist teachers with analyzing their data, grouping students for intervention, and providing prescriptive instruction using the “Next Steps” Resources within the platform.
- Where to Obtain:
- EarlyBird Education
- connect@earlybirdeducation.com
- 1.833.759.2473
- https://earlybirdeducation.com/
- Initial Cost:
- $10.00 per student
- Replacement Cost:
- Contact vendor for pricing details.
- Included in Cost:
- In addition to the student fee, there is a teacher license/platform fee and professional development workshop fees. Total cost will depend on the number of schools and students. Please contact connect@earlybirdeducation.com or 833-759-2473 for specific details on pricing for your district. Included with each contract:
  - online student access to the assessment
  - staff access to the dashboard (with automatic scoring) and data reporting
  - an intuitive, easily accessed “Knowledge Nest” of tutorials for educators
  - student-specific reports for sharing results with families (in multiple languages)
  - “Next Steps” Resources with printable lesson plans, activities, tools, and resource links, customized based on assessment results
  - At Home Activities to share with families
  - dedicated account management
  - live virtual professional development workshops led by literacy experts
  - easy-access service and support
  - secure hosting
  - all maintenance, upgrades, and enhancements for the duration of the license
- Training Requirements:
- Because EarlyBird is an intuitive and easy-to-understand game, a brief 30-45 minute kick-off meeting is all that is required for staff participating in screener administration.
- Qualified Administrators:
- No minimum qualifications specified.
- Access to Technical Support:
- EarlyBird provides full-service support via phone, email, or electronically via the EarlyBird dashboard. Additional support for users is available on demand via our self-serve “Knowledge Nest,” which is accessible through the user dashboard. Each EarlyBird customer has a Customer Success Manager who provides support and training throughout all stages of the customer journey. The EarlyBird training and professional development model includes:
  - Implementation Planning
  - Initial Kick-Off Training
  - Access to Data Dashboard and “Next Steps” Resource Library
  - “Next Steps” Workshops
- Assessment Format:
- Scoring Time:
- Scoring is automatic OR
- 1 minute per student (to confirm RAN score)
- Scores Generated:
- Percentile score
- IRT-based score
- Probability
- Subscale/subtest scores
- Other: Teachers are able to see the breakdown of individual subtest skills as they pertain to the foundational skills of reading. The subtest scores are presented in the form of percentiles (which indicate how the child’s performance on the subtest compares to a nationally representative sample of students). Additionally, there are two predictive profile scores, generated by weighted algorithms of a subset of subtests.
- Administration Time:
- 30 minutes on average per student (varies by grade)
- Scoring Method:
- Automatically (computer-scored)
- Other: As soon as the child completes each subtest, scores are automatically calculated and displayed on the teacher dashboard. The only exception is the RAN subtest, which requires score confirmation (asynchronously, at the convenience of the teacher, by listening to the recording).
- Technology Requirements:
- Computer or tablet
- Internet connection
- Accommodations:
- EarlyBird has been designed to support a wide range of learners, with accommodating features. Students may use the touchscreen, mouse, or keyboard, accommodating students with limited mobility skills. EarlyBird, with the exception of one subtest measuring RAN, is untimed. The game is designed to allow the student or administrator to pause at any time for breaks, with easy resuming of the game. Students can adjust the audio as needed and can use the assessment with noise-buffering headphones. EarlyBird can be administered individually or in small groups, as needed. EarlyBird was also designed to reduce assessment bias. As part of item development, all items were reviewed for bias and fairness. During administration, EarlyBird utilizes voice recognition AI, so each student is evaluated consistently, free of administrator scoring bias or fatigue. Finally, as a fun, interactive game, students are relaxed and engaged, and participate fully in the assessment, often not even realizing they are being assessed, thereby reducing stress and performance issues related to test anxiety.
Descriptive Information
- Please provide a description of your tool:
- The EarlyBird Dyslexia and Early Literacy Screener, formerly known as the Boston Children’s Hospital Early Literacy Screener (BELS), is a universal screener that brings together all the relevant predictors of dyslexia/reading in one easy-to-administer assessment. Appropriate for pre-readers and early readers, it provides the tools to easily and accurately identify students at risk of dyslexia and other reading difficulties earlier, when the window for intervention is most effective. EarlyBird was developed at Boston Children’s Hospital by Dr. Nadine Gaab, in partnership with Dr. Yaacov Petscher at the Florida Center for Reading Research. It is self-administered in small groups with adult oversight and addresses the literacy milestones that have been found to be predictive of subsequent reading success. EarlyBird, administered three times per year, includes screening for:
1. severe reading risk (referred to as the dyslexia screener)
2. moderate reading risk (referred to as the PWR (MTSS/RTI) screener)
From a student perspective, the engaging gamified assessment is accessible and child-centered. In the game, the child joins their new feathery friend Pip, who guides them along a path through the city, meeting animal friends who introduce the child to each subtest. EarlyBird has three main components:
1. Game-based app, played on an iPad, Chromebook, or any device with a Chrome browser, that is comprehensive and scientifically validated, yet easy to administer.
2. Web-based dashboard presenting assessment data at the student, classroom, school, and district levels; providing intuitive, easily accessed explanations of data; and customized evidence-based Next Step Resources including lesson plans, videos, activities, tools, and resource links.
3. Professional learning, delivered through virtual synchronous workshops prior to the beginning-of-year assessment and following each screening period (BOY, MOY, and EOY). The post-assessment workshops are designed to assist teachers with analyzing their data, grouping students for intervention, and providing prescriptive instruction using the “Next Steps” Resources within the platform.
ACADEMIC ONLY: What skills does the tool screen?
- Please describe specific domain, skills or subtests:
- EarlyBird is a comprehensive assessment, with subtests in the critical skill areas related to the science of reading: Naming Speed (Object RAN and Letter RAN); Phonemic and Phonological Awareness (First Sound Matching, Rhyming, Blending, Deletion, Nonword Repetition); Sound-Symbol Correspondence/Phonics (Letter Name, Letter Sound, Nonword Reading, Word Reading, Nonword Spelling); Passage Reading (Oral Reading Fluency and Reading Comprehension); and Oral Language (Receptive and Expressive Vocabulary, Word Matching, Oral Sentence Comprehension, Follow Directions).
- BEHAVIOR ONLY: Which category of behaviors does your tool target?
- BEHAVIOR ONLY: Please identify which broad domain(s)/construct(s) are measured by your tool and define each sub-domain or sub-construct.
Acquisition and Cost Information
Administration
- Are norms available?
- Yes
- Are benchmarks available?
- Yes
- If yes, how many benchmarks per year?
- 3
- If yes, for which months are benchmarks available?
- Beginning of Year (BOY) is available August to November, Middle of Year (MOY) is December to February, and End of Year (EOY) is March to June
- BEHAVIOR ONLY: Can students be rated concurrently by one administrator?
- If yes, how many students can be rated concurrently?
Training & Scoring
Training
- Is training for the administrator required?
- Yes
- Describe the time required for administrator training, if applicable:
- Because EarlyBird is an intuitive and easy-to-understand game, a brief 30-45 minute kick-off meeting is all that is required for staff participating in screener administration.
- Please describe the minimum qualifications an administrator must possess.
- EarlyBird is designed to be self-explanatory and easy for children to understand, so it can be administered by any adult, with no minimum qualifications or special training required.
- No minimum qualifications
- Are training manuals and materials available?
- Yes
- Are training manuals/materials field-tested?
- Yes
- Are training manuals/materials included in cost of tools?
- Yes
- If No, please describe training costs:
- Can users obtain ongoing professional and technical support?
- Yes
- If Yes, please describe how users can obtain support:
- EarlyBird provides full-service support via phone, email, or electronically via the EarlyBird dashboard. Additional support for users is available on demand via our self-serve “Knowledge Nest,” which is accessible through the user dashboard. Each EarlyBird customer has a Customer Success Manager who provides support and training throughout all stages of the customer journey. The EarlyBird training and professional development model includes:
  - Implementation Planning
  - Initial Kick-Off Training
  - Access to Data Dashboard and “Next Steps” Resource Library
  - “Next Steps” Workshops
Scoring
- Do you provide basis for calculating performance level scores?
- Yes
- Does your tool include decision rules?
- Yes
- If yes, please describe.
- The Dyslexia Risk score uses predictive algorithms and cut-points to classify students into the categories of ‘at risk’ or ‘not at risk,’ based on performance on an outcome measure. Because it uses a cut-point based on the 16th percentile on the outcome measure, the EarlyBird Dyslexia Screener can be used to help teachers identify which students are in need of further assessment/intervention. The concept of risk or success can be viewed in many ways, including as a “percent chance”: a number between 1 and 99, with 1 meaning the student is almost certain not to develop a problem and 99 meaning the student is almost certain to develop a problem. When attempting to identify children who are “at risk” for poor performance on some future measure of reading achievement, this is typically a yes/no decision based upon a “cut-point” along a continuum of risk. Decisions concerning appropriate cut-points are made based on the level of correct classification that is desired from the screening assessments. A variety of statistics may be used to guide such choices (e.g., sensitivity, specificity, positive and negative predictive power; see Schatschneider, Petscher & Williams, 2008), and each was considered in light of the others in choosing appropriate cut-points (see the illustrative sketch below).
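To make the cut-point trade-off concrete, here is a minimal sketch that scans candidate thresholds on a 1–99 risk score and reports the resulting sensitivity and specificity. The data, the threshold values, and the `screen_stats` helper are hypothetical illustrations of the general procedure, not EarlyBird’s production algorithm.

```python
import numpy as np

def screen_stats(risk_scores, at_risk, cut):
    """Sensitivity/specificity when flagging students with risk_scores >= cut.

    risk_scores: hypothetical screener risk scores (1-99)
    at_risk:     1 if the student later scored below the criterion
                 cut-point (e.g., 16th percentile), else 0
    """
    flagged = risk_scores >= cut
    tp = np.sum(flagged & (at_risk == 1))   # correctly flagged
    fn = np.sum(~flagged & (at_risk == 1))  # missed
    tn = np.sum(~flagged & (at_risk == 0))  # correctly passed
    fp = np.sum(flagged & (at_risk == 0))   # over-flagged
    return tp / (tp + fn), tn / (tn + fp)

# Toy data: higher risk scores loosely predict the later poor outcome.
rng = np.random.default_rng(0)
risk_scores = rng.integers(1, 100, size=500)
at_risk = (risk_scores + rng.normal(0, 25, size=500) > 80).astype(int)

# Raising the cut-point trades sensitivity for specificity.
for cut in (40, 50, 60, 70):
    se, sp = screen_stats(risk_scores, at_risk, cut)
    print(f"cut={cut}: sensitivity={se:.2f}, specificity={sp:.2f}")
```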
- Can you provide evidence in support of multiple decision rules?
- Yes
- If yes, please describe.
- The EarlyBird Dyslexia and Early Literacy assessment yields two predictive profile scores. The Dyslexia Screener Risk Score can be used to identify students needing intensive intervention, as it uses a cut-point based on the 16th percentile on the outcome measure. The second EarlyBird predictive profile, called Potential for Word Reading (PWR), uses a cut-point based on the 40th percentile on the outcome measure. For the kindergarten validation study, students’ performance was coded as ‘1’ for scores at or above the 40th percentile on the SAT-10 for PWR, and ‘0’ for scores that did not meet this criterion. The PWR predictive profile represents a prediction of success, indicating the probability that the student will reach grade-level expectations by end of year without appropriate intervention. If a student is “flagged” for the PWR predictive profile, the student would likely benefit from intervention in certain foundational skills related to literacy, as indicated by low percentiles on one or more subtests. Additional information can be found in the EarlyBird Dyslexia and Early Literacy Technical Manual.
- Please describe the scoring structure. Provide relevant details such as the scoring format, the number of items overall, the number of items per subscale, what the cluster/composite score comprises, and how raw scores are calculated.
- EarlyBird subtests (with the exception of RAN) are computer adaptive, based on item response theory (IRT). After practice questions, each subtest presents a set of initial items, which span a developmentally appropriate range of difficulty and are presented in a random order. These fixed items calibrate the child’s initial ability level and allow the computer adaptive algorithm to present additional items to further pinpoint the child’s ability score. This ensures that fewer questions are asked overall, and all are within an appropriate level of difficulty for the individual child.
Once a subtest has been completed by the student, it is scored automatically, using an IRT-based formula which generates a raw score called a final theta. The final theta is converted to a normed percentile, which is displayed on EarlyBird’s Data Dashboard and indicates how the child’s performance on the subtest compares to a nationally representative sample of students (see the illustrative sketch below). The final-theta-to-percentile conversion tables are updated periodically to reflect the most recent representative sample of student scores available. Percentiles in the lowest two quintiles are highlighted blue (light blue for < 40th percentile; dark blue for < 20th percentile), which gives teachers a quick visual summary of each child’s areas of strength and need, as well as those of the classroom as a whole.
In addition to the individual subtest percentiles, weighted algorithms of a subset of subtests yield two predictive profile scores: (1) the Dyslexia Screener Risk Score, which indicates the likelihood that a student will be at risk for severe weaknesses in phonological skills at the end of the year without intervention, which increases the likelihood of developing severe word reading difficulties including dyslexia; and (2) Potential for Word Reading, which indicates the probability that the student will reach grade-level expectations by end of year without appropriate intervention. These two risk scores identify which students are in need of further diagnostic testing and/or intervention. They appear on the classroom and student dashboard views, clearly labeled and defined for the teacher.
All scores (subtest scores and predictive profiles) can be viewed by the teacher on EarlyBird’s Data Dashboard. The dashboard is designed to be intuitive, quick, and visually informative. Subtests are grouped on the dashboard according to broader categories corresponding to the science of reading: sound/symbol correspondence, phonics, phonemic and phonological awareness, oral language, etc.
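As an illustration of the scoring flow described above (ability estimation from item responses, then conversion of the final theta to a normed percentile), the sketch below assumes a 2PL IRT model, hypothetical item parameters, and a made-up theta-to-percentile lookup. EarlyBird’s actual item parameters and conversion tables are documented in its Technical Manual.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL item response function: probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def eap_theta(responses, a, b, grid=np.linspace(-4, 4, 161)):
    """Expected-a-posteriori ability estimate under a standard-normal prior."""
    prior = np.exp(-0.5 * grid**2)
    like = np.ones_like(grid)
    for resp, ai, bi in zip(responses, a, b):
        p = p_correct(grid, ai, bi)
        like *= p if resp else (1.0 - p)
    post = prior * like
    post /= post.sum()
    return float((grid * post).sum())

# Hypothetical calibrated item parameters and one child's responses.
a = np.array([1.2, 0.9, 1.5, 1.1])   # discriminations
b = np.array([-1.0, 0.0, 0.5, 1.2])  # difficulties
responses = [1, 1, 0, 0]
theta = eap_theta(responses, a, b)

# Hypothetical final-theta -> percentile conversion (a stand-in for the
# normed tables that are updated from representative samples).
thetas = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
pctiles = np.array([5, 20, 50, 80, 95])
percentile = np.interp(theta, thetas, pctiles)
print(f"theta={theta:.2f} -> ~{percentile:.0f}th percentile")
```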
- Describe the tool’s approach to screening, samples (if applicable), and/or test format, including steps taken to ensure that it is appropriate for use with culturally and linguistically diverse populations and students with disabilities.
- The EarlyBird assessment addresses literacy milestones in pre-readers that have been found to be predictive of subsequent reading success. EarlyBird includes screening for dyslexia as well as other reading challenges, with analyses providing “predictive profiles” as well as subtest-specific diagnostic norms for each child. EarlyBird is designed for three-times-per-year benchmarking: Beginning of Year (BOY) July to November, Middle of Year (MOY) December to February, and End of Year (EOY) March to June. The subtest battery and risk algorithms are adapted to each time of year to accommodate child development and expectations. All results (both predictive profiles as well as the normed percentiles for each subtest) are further explained and explored in synchronous, virtual workshops and professional learning opportunities for teachers.
Multiple validation studies, involving nationally representative samples of students (from all major geographic regions of the United States, attending a mix of public, private, and charter schools), have been designed and carried out to establish construct and predictive validity, as well as to create normed percentiles for each subtest. The samples included students with and without a familial history of diagnosed or suspected dyslexia and a range of socioeconomic backgrounds (as determined by the percentage of students receiving free or reduced-price lunch at the participating schools). In terms of race and ethnicity, every attempt has been made to ensure that samples closely reflect U.S. census data.
The gamified aspects of EarlyBird were designed by a group of experts at the Massachusetts Institute of Technology (MIT) to be developmentally appropriate for pre-readers and early readers alike. Teachers report that children are engaged when using EarlyBird and that the directions are simple, clear, and age-appropriate. This is by design. Additionally, the game is set in an urban setting, with only animal characters, so that it is broadly appealing and widely understood across diverse populations. Finally, the game was designed with instructional best practices: provide instruction, model the activity, and allow for hands-on practice before assessment begins. At the beginning of the EarlyBird assessment, the child is asked, “Are you ready to go on an adventure?” The child is then shown a map of a cartoon city and introduced to a new feathery friend, Pip, who will join them on their journey. The narrator explains that the child will meet more animal friends (located at fixed points along the path shown on the map), play games with them, and collect prizes on the way to the final destination. Each “game” is a subtest, associated with a different animal friend. When the child selects the icon, the animal friend greets the child and explains how to do the task with the help of animated visuals. Next, Pip demonstrates the task. For most subtests, the child then attempts 1-2 practice questions, for which corrective feedback is provided.
Accommodations, in many cases, are not needed with EarlyBird. Because reading is not required to play the EarlyBird game, English Language Learners (Level 1 or above) who can understand the spoken directions in English generally do well with EarlyBird. They can also be assisted by translators who explain the directions for each subtest. Districts that have used EarlyBird with their Dual Language Learners report that they get highly valued information early in the year that they could not get any other way.
Students with behavioral challenges also respond well to EarlyBird. Since it looks and feels like a gentle game, it captures the attention of some children who can be harder to assess, including students on the autism spectrum. And because the game is adaptive, students are assessed at an appropriate level for their ability. EarlyBird uses SoapBox Labs’ voice recognition engine, which is built specifically for young children’s unique speech patterns and processes differences in accents and dialects, thus reducing implicit bias in assessment. By powering its reading assessments with the SoapBox voice recognition engine, EarlyBird provides teachers with a much more comprehensive picture of a student’s reading proficiency at a much earlier age. AI technology automatically scores the child’s response and places the score into the teacher dashboard. SoapBox is the first and only automated speech recognition solution to demonstrate that it can deliver accurate and equitable assessments, recently receiving the Prioritizing Racial Equity in AI Design Product Certification from global education nonprofit Digital Promise.
Technical Standards
Classification Accuracy & Cross-Validation Summary
Grade | Kindergarten |
---|---|
Classification Accuracy Fall | |
Classification Accuracy Winter | |
Classification Accuracy Spring | |
Kaufman Test of Educational Achievement (KTEA) - Phonological Processing subtest
Classification Accuracy
- Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
- The Kaufman Test of Educational Achievement, Third Edition (KTEA–3 Comprehensive Form; Kaufman & Kaufman, 2014) is an individually administered measure of academic achievement for students in pre-kindergarten through grade 12+ (ages 4–25). The KTEA-3 has 19 subtests, one of which is named “Phonological Processing.” The student responds orally to items that require manipulation of sounds. The following skills are included in this subtest: Rhyming, Sound Matching, Blending and Segmenting Phonemes, and Deleting Sounds. While the items involve phonological processing, phonological awareness, and phonemic awareness (often referred to collectively as phonological skills), the publisher (Pearson) named it “Phonological Processing.” In general, phonological skills can be defined as understanding that words consist of smaller sound units (syllables, phonemes) and being able to manipulate these smaller units. This ability to manipulate the sounds of one’s oral language enables emergent readers to analyze the phonological structure of a word and subsequently link it to the corresponding orthographic and lexical-semantic features, which establishes and facilitates word recognition. Phonological skills can be measured using phonological processing and phonological and phonemic awareness tasks. Additionally, the KTEA-3 manual emphasizes a strong relationship between the Phonological Processing subtest and word reading and decoding skills. More specifically, the KTEA-3 manual reports correlations of around 0.6 between the Phonological Processing subtest and the Reading Comprehension and Letter Word Recognition subtests in pre-k (page 149, Table B.1) as well as kindergarten (page 150, Table B.2). Furthermore, in grade 1, the Phonological Processing subtest correlates with the Nonword Decoding, Letter/Word Reading, and Reading Comprehension subtests (all above 0.6). Correlations with general ability measures are much lower (around 0.2-0.3; page 74, Table 2.16), emphasizing the relationship between Phonological Processing and measures of reading and decoding. The manual also reports that children with a reading disorder score significantly lower on the Phonological Processing subtest than children without one (page 80, Table 2.18). Many of the skill areas covered by the Phonological Processing subtest of the KTEA-3 are also addressed through subtests in the EarlyBird assessment (e.g., rhyming, first sound matching, blending, deletion), although the EarlyBird subtests and items within those subtests were developed independently of the KTEA-3. Kaufman, A. S., & Kaufman, N. L. (2014). Kaufman Test of Educational Achievement, Third Edition. Bloomington, MN: NCS Pearson.
- Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
- Our screeners for risk are administered in fall and winter of the school year, and the criterion is administered in the spring, so that our risk calibration is consistent with a predictive classification model. The EarlyBird Dyslexia Risk Flag indicates the likelihood that a student will be at risk for reading struggles, as determined by poor phonological processing skills at the end of kindergarten, presuming the student doesn’t receive appropriate remediation. Dyslexia risk is based on a study conducted over the 2019-2020 school year, in which students were administered the EarlyBird assessments in the fall/winter and the outcome measure (KTEA-3 Phonological Processing subtest) in the spring of kindergarten. For the purposes of the analysis, dyslexia risk is defined as performing at or below the 16th percentile on the KTEA-3 Phonological Processing subtest (inclusive of blending, rhyming, sound matching, deletion, and segmenting items). The calculation involves logistic regression and receiver operating characteristic curve analyses (see next section) with a selection of our most predictive subtests (rhyming, nonword repetition, and follow directions), whose scores are aggregated and weighted according to degree of predictability to generate a single output score, which is conveyed as a “flag.” That flag indicates the likelihood that a student would score poorly on the KTEA-3 Phonological Processing task. Any child flagged for dyslexia risk is at high risk for low phonological processing skills at the end of kindergarten, and therefore subsequent low reading proficiency, and needs intensive instruction targeted to the student’s skill weaknesses.
- Describe how the classification analyses were performed and cut-points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
- Logistic regressions were used, in part, to calibrate classification accuracy. Students’ performance on the selected criterion measures was coded as ‘1’ for performance below the 16th percentile on the KTEA-3 Phonological Processing for the Dyslexia Risk flag, and ‘0’ for scores that did not meet this criterion. In this way, the Dyslexia flag is a prediction of risk. Each dichotomous variable was then regressed on a combination of EarlyBird assessments. As such, students could be identified as not at-risk on the multifactorial combination of screening tasks via the joint probability and demonstrating adequate performance on the criterion (i.e., specificity or true negatives), at-risk on the combination of screening task scores via the joint probability and not demonstrating adequate performance on the criterion (i.e., sensitivity or true positives), not at-risk based on the combination of screening task scores but at-risk on the criterion (i.e., false-negative error), or at-risk on the combination of screening task scores but not at-risk on the criterion (i.e., false-positive error). Classification of students in these categories allows for the evaluation of cut-points on the combination of screening tasks to determine which cut-point maximized the selected indicators.
The concept of risk or success can be viewed in many ways, including as a “percent chance”: a number between 1 and 99, with 1 meaning there is a low chance that a student may develop a problem and 99 meaning there is a high chance that the student may develop a problem. When attempting to identify children who are “at risk” for poor performance on some future measure of reading achievement, EarlyBird uses a yes/no decision based upon a “cut-point” along a continuum of risk. Decisions concerning appropriate cut-points are made based on the level of correct classification that is desired from the screening assessments. A variety of statistics may be used to guide such choices (e.g., sensitivity, specificity, positive and negative predictive power; see Schatschneider, Petscher & Williams, 2008), and each was considered in light of the others in choosing appropriate cut-points. Area under the curve, sensitivity, and specificity estimates from the final logistic regression model were bootstrapped 1,000 times in order to obtain a 95% confidence interval of scores, using the cutpointr package in R statistical software (a simplified sketch of this procedure appears below).
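For illustration, here is a minimal Python sketch of this style of analysis: a logistic regression yields a joint risk probability from several subtests, and the AUC is bootstrapped for a 95% confidence interval. The data, coefficients, and sample size are hypothetical stand-ins (the published analysis used the cutpointr package in R); only the shape of the procedure is shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the three most predictive subtests.
n = 184
X = rng.normal(size=(n, 3))  # rhyming, nonword repetition, follow directions
latent = X @ np.array([0.9, 0.7, 0.5]) + rng.normal(scale=1.0, size=n)
y = (latent < np.quantile(latent, 0.16)).astype(int)  # 1 = below 16th pctile

# Joint probability of risk from the multifactorial combination of tasks.
model = LogisticRegression().fit(X, y)
risk = model.predict_proba(X)[:, 1]

# Bootstrap the AUC 1,000 times for a 95% confidence interval.
aucs = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)
    if len(np.unique(y[idx])) < 2:
        continue  # a resample must contain both classes
    aucs.append(roc_auc_score(y[idx], risk[idx]))
lo, hi = np.percentile(aucs, [2.5, 97.5])
print(f"AUC={roc_auc_score(y, risk):.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```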
- Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
- Yes
- If yes, please describe the intervention, what children received the intervention, and how they were chosen.
- EarlyBird’s studies did not include an intervention component, and EarlyBird did not collect data related to intervention services conducted at the participating schools. That said, because the samples comprised children demonstrating a wide range of performance in literacy-related skills, it is likely that some children in the samples received intervention in addition to classroom instruction between administration of the screening measure and the outcome assessment.
Cross-Validation
- Has a cross-validation study been conducted?
- No
- If yes,
- Describe the criterion (outcome) measure(s) including the degree to which it/they is/are independent from the screening measure.
- Describe when screening and criterion measures were administered and provide a justification for why the method(s) you chose (concurrent and/or predictive) is/are appropriate for your tool.
- Describe how the cross-validation analyses were performed and cut-points determined. Describe how the cut points align with students at-risk. Please indicate which groups were contrasted in your analyses (e.g., low risk students versus high risk students, low risk students versus moderate risk students).
- Were the children in the study/studies involved in an intervention in addition to typical classroom instruction between the screening measure and outcome assessment?
- If yes, please describe the intervention, what children received the intervention, and how they were chosen.
Classification Accuracy - Fall
Evidence | Kindergarten |
---|---|
Criterion measure | Kaufman Test of Educational Achievement (KTEA) - Phonological Processing subtest |
Cut Points - Percentile rank on criterion measure | 16 |
Cut Points - Performance score on criterion measure | |
Cut Points - Corresponding performance score (numeric) on screener measure | |
Classification Data - True Positive (a) | 24 |
Classification Data - False Positive (b) | 29 |
Classification Data - False Negative (c) | 7 |
Classification Data - True Negative (d) | 124 |
Area Under the Curve (AUC) | 0.85 |
AUC Estimate’s 95% Confidence Interval: Lower Bound | 0.80 |
AUC Estimate’s 95% Confidence Interval: Upper Bound | 0.90 |
Statistics | Kindergarten |
---|---|
Base Rate | 0.17 |
Overall Classification Rate | 0.80 |
Sensitivity | 0.77 |
Specificity | 0.81 |
False Positive Rate | 0.19 |
False Negative Rate | 0.23 |
Positive Predictive Power | 0.45 |
Negative Predictive Power | 0.95 |
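The summary statistics above follow directly from the 2x2 classification table. The short check below reproduces them from the reported cell counts; it is an arithmetic verification only.

```python
# Reproduce the summary statistics from the reported 2x2 cell counts.
tp, fp, fn, tn = 24, 29, 7, 124
n = tp + fp + fn + tn               # 184
base_rate = (tp + fn) / n           # 0.17
overall = (tp + tn) / n             # 0.80
sensitivity = tp / (tp + fn)        # 0.77
specificity = tn / (tn + fp)        # 0.81
fpr = fp / (fp + tn)                # 0.19
fnr = fn / (fn + tp)                # 0.23
ppv = tp / (tp + fp)                # 0.45
npv = tn / (tn + fn)                # 0.95
print(base_rate, overall, sensitivity, specificity, fpr, fnr, ppv, npv)
```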
Sample | Kindergarten |
---|---|
Date | August - November 2019 |
Sample Size | 184 |
Geographic Representation | Middle Atlantic (NY, PA); Mountain (MT); New England (MA, RI); West North Central (MO); West South Central (LA, TX) |
Male | 48.4% |
Female | 50.0% |
Other | |
Gender Unknown | 1.6% |
White, Non-Hispanic | 73.4% |
Black, Non-Hispanic | 6.5% |
Hispanic | 9.2% |
Asian/Pacific Islander | 0.5% |
American Indian/Alaska Native | |
Other | 8.7% |
Race / Ethnicity Unknown | 1.6% |
Low SES | |
IEP or diagnosed disability | 6.5% |
English Language Learner |
Classification Accuracy - Winter
Evidence | Kindergarten |
---|---|
Criterion measure | Kaufman Test of Educational Achievement (KTEA) - Phonological Processing subtest |
Cut Points - Percentile rank on criterion measure | 16 |
Cut Points - Performance score on criterion measure | |
Cut Points - Corresponding performance score (numeric) on screener measure | |
Classification Data - True Positive (a) | 24 |
Classification Data - False Positive (b) | 29 |
Classification Data - False Negative (c) | 7 |
Classification Data - True Negative (d) | 124 |
Area Under the Curve (AUC) | 0.85 |
AUC Estimate’s 95% Confidence Interval: Lower Bound | 0.80 |
AUC Estimate’s 95% Confidence Interval: Upper Bound | 0.90 |
Statistics | Kindergarten |
---|---|
Base Rate | 0.17 |
Overall Classification Rate | 0.80 |
Sensitivity | 0.77 |
Specificity | 0.81 |
False Positive Rate | 0.19 |
False Negative Rate | 0.23 |
Positive Predictive Power | 0.45 |
Negative Predictive Power | 0.95 |
Sample | Kindergarten |
---|---|
Date | August - November 2019 |
Sample Size | 184 |
Geographic Representation | Middle Atlantic (NY, PA); Mountain (MT); New England (MA, RI); West North Central (MO); West South Central (LA, TX) |
Male | 48.4% |
Female | 50.0% |
Other | |
Gender Unknown | 1.6% |
White, Non-Hispanic | 73.4% |
Black, Non-Hispanic | 6.5% |
Hispanic | 9.2% |
Asian/Pacific Islander | 0.5% |
American Indian/Alaska Native | |
Other | 8.7% |
Race / Ethnicity Unknown | 1.6% |
Low SES | |
IEP or diagnosed disability | 6.5% |
English Language Learner |
Reliability
Grade | Kindergarten |
---|---|
Rating | |
- *Offer a justification for each type of reliability reported, given the type and purpose of the tool.
- Marginal reliability is an appropriate model-based measure of reliability to use, given that most EarlyBird subtests are computer adaptive and use item response theory (IRT). Reliability describes how consistent test scores will be across multiple administrations over time, as well as how well one form of the test relates to another. Because the EarlyBird screener uses IRT as its method of validation, reliability takes on a different meaning than it does from a Classical Test Theory (CTT) perspective. The biggest difference between the two approaches is the assumption made about the measurement error related to the test scores. CTT treats the error variance as being the same for all scores, whereas the IRT view is that the level of error depends on the ability of the individual. As such, reliability in IRT becomes more about the level of precision of measurement across ability. Although it is often more useful to graphically represent the standard error across ability levels to gauge for what range of abilities the test is more or less informative, it is possible to estimate marginal reliability through a single calculation.
- *Describe the sample(s), including size and characteristics, for each reliability analysis conducted.
- Marginal reliability for the Rhyming, First Sound Matching, Nonword Repetition, and Vocabulary kindergarten subtests was estimated using a representative sample of 419 kindergarten students in 19 schools across eight states (MT, MO, MA, NY, LA, PA, RI, and TX) who took the EarlyBird assessment between August and November 2019. The sample was 75.5% White, 12.92% Black or African American, 4.9% Asian, 2.45% American Indian, and 0.89% Native Hawaiian/Pacific Islander; 12.22% identified as Hispanic, and 3.34% did not respond. For the rest of the kindergarten subtests (Letter Name, Letter Sound, Blending, Deletion, Word Reading, Word Matching, Follow Directions, and Oral Sentence Comprehension), a statewide representative sample of kindergarten students that roughly reflected Florida’s demographic diversity and academic ability (N ≈ 2,400) was collected as part of a larger K-2 validation and linking study. Because the samples used for data collection did not strictly adhere to the state distribution of demographics (i.e., percent limited English proficiency, Black, White, Latino, and eligible for free/reduced-price lunch), sample weights according to student demographics were used.
- *Describe the analysis procedures for each reported type of reliability.
- An estimate of reliability, known as marginal reliability (Sireci, Thissen, & Wainer, 1991), was calculated using the variance of ability with the mean squared error. The formula and additional information about the procedure are available in the EarlyBird Dyslexia and Early Literacy Screener Technical Manual.
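As a rough illustration of such a calculation (the exact formula used is in the Technical Manual), marginal reliability can be computed by contrasting the variance of the ability estimates with the average squared standard error. The sketch below assumes hypothetical arrays of per-student theta estimates and IRT standard errors, and one common formulation of the statistic.

```python
import numpy as np

def marginal_reliability(theta_hat, se):
    """Illustrative marginal reliability (cf. Sireci, Thissen, & Wainer, 1991).

    theta_hat: per-student IRT ability estimates
    se:        per-student standard errors of measurement
    Treats observed score variance as true variance plus mean error
    variance, so reliability = true variance / observed variance.
    """
    var_obs = np.var(theta_hat)
    mse = np.mean(np.square(se))
    return (var_obs - mse) / var_obs

rng = np.random.default_rng(0)
theta_hat = rng.normal(0, 1, size=419)    # hypothetical ability estimates
se = 0.3 + 0.1 * np.abs(theta_hat)        # error grows at extreme abilities
print(f"marginal reliability ~ {marginal_reliability(theta_hat, se):.2f}")
```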
*In the table(s) below, report the results of the reliability analyses described above (e.g., internal consistency or inter-rater reliability coefficients).
Type of Reliability | Subgroup | Informant | Age / Grade | Test or Criterion | n | Median Coefficient | 95% Confidence Interval Lower Bound | 95% Confidence Interval Upper Bound |
---|---|---|---|---|---|---|---|---|
- Results from other forms of reliability analysis not compatible with above table format:
- Manual cites other published reliability studies:
- Yes
- Provide citations for additional published studies.
- Foorman, B. R., Petscher, Y., Stanley, C., & Truckenmiller, A. (2017). Latent profiles of reading and language and their association with standardized reading outcomes in kindergarten through tenth grade. Journal of Research on Educational Effectiveness, 10(3), 619-645. https://doi.org/10.1080/19345747.2016.1237597
- Foorman, B. R., Petscher, Y., & Herrera, S. (2018). Unique and common effects of decoding and language factors in predicting reading comprehension in grades 1–10. Learning and Individual Differences. https://www.sciencedirect.com/science/article/abs/pii/S1041608018300414
- Do you have reliability data that are disaggregated by gender, race/ethnicity, or other subgroups (e.g., English language learners, students with disabilities)?
If yes, fill in data for each subgroup with disaggregated reliability data.
Type of Reliability | Subgroup | Informant | Age / Grade | Test or Criterion | n | Median Coefficient | 95% Confidence Interval Lower Bound | 95% Confidence Interval Upper Bound |
---|---|---|---|---|---|---|---|---|
- Results from other forms of reliability analysis not compatible with above table format:
- Manual cites other published reliability studies:
- Yes
- Provide citations for additional published studies.
- Foorman, B. R., Petscher, Y., Stanley, C., & Truckenmiller, A. (2017). Latent profiles of reading and language and their association with standardized reading outcomes in kindergarten through tenth grade. Journal of Research on Educational Effectiveness, 10(3), 619-645. https://doi.org/10.1080/19345747.2016.1237597
- Foorman, B. R., Petscher, Y., & Herrera, S. (2018). Unique and common effects of decoding and language factors in predicting reading comprehension in grades 1–10. Learning and Individual Differences. https://www.sciencedirect.com/science/article/abs/pii/S1041608018300414
Validity
Grade | Kindergarten |
---|---|
Rating | |
- *Describe each criterion measure used and explain why each measure is appropriate, given the type and purpose of the tool.
- The Phonological Processing subtest on the KTEA-3 was used to determine predictive validity and construct (convergent) validity for the Dyslexia Risk part of the EarlyBird assessment. The Phonological Processing subtest measures the ability to perform a variety of tasks related to phonological awareness, such as rhyming, blending, and segmenting words, and yields a composite score. The SAT-10 Word Reading was used to determine additional predictive and concurrent validity. The Word Reading subtest of the SAT-10/SESAT measures the ability to read words. Both tests assess some of the constructs that EarlyBird also measures, but each is a standardized, paper-and-pencil psychometric test that was developed and published separately by Pearson and so is external to the EarlyBird assessment.
- *Describe the sample(s), including size and characteristics, for each validity analysis conducted.
- A validity study was conducted for the Dyslexia Risk (KTEA-3 outcome measure) aspect of the EarlyBird assessment for Kindergarten during the 2019-2020 school year. Students were administered the full EarlyBird assessment (all Kindergarten-appropriate subtests) during fall/winter and the KTEA-3 during spring/summer 2020. Having two data points from approximately 200 participants (located in 8 states across every region of the country), from the app the previous fall and from the psychometric assessments in late spring/summer 2020, allowed for the evaluation of the screener’s predictive validity. Separately, data collection for the PWR Risk (SAT-10/SESAT outcome measure) aspect of the EarlyBird screener began by testing item pools for the Screen tasks (i.e., Letter Sounds, Phonological Awareness, Word Reading, Vocabulary Pairs, and Following Directions). A statewide representative sample of students that roughly reflected Florida’s demographic diversity and academic ability (N ~ 2,400) was collected on students in Kindergarten as part of a larger K-2 validation and linking study. Because the samples used for data collection did not strictly adhere to the state distribution of demographics (i.e., percent limited English proficiency, Black, White, Latino, and eligible for free/reduced lunch), sample weights according to student demographics were used to inform the item and student parameter scores.
- *Describe the analysis procedures for each reported type of validity.
- Predictive Validity: The predictive validity of the Dyslexia Risk screening tasks against the KTEA-3 Phonological Processing subtest was evaluated through a series of multiple regression analyses that tested the additive and interactive relations between EarlyBird assessments and the K-PA outcome (KTEA-3 Phonological Processing), to find the fewest number of tasks that maximized the percentage of explained variance in K-PA. The final model included the Following Directions, Nonword Repetition, and Rhyming subtests, with an R² of .37 (multiple r = .61, 95% CI = .50, .69, n = 184; see the illustrative sketch below). The predictive and concurrent validity of the PWR screening tasks against the SAT-10 Word Reading (SESAT in K) was addressed through a series of linear and logistic regressions. The linear regressions were run two ways. First, a correlation analysis was used to evaluate the strength of relations between each of the screening task ability scores and the SESAT. Pearson correlations between PWR tasks and the SESAT Word Reading task ranged from .38 to .59. Second, a multiple regression was run to estimate the total amount of variance that the linear combination of the predictors explained in the SESAT (46%).
Construct Validity: Construct validity describes how well scores from an assessment measure the construct it is intended to measure. A component of construct validity is convergent validity, which can be evaluated by testing relations between a developed assessment (like the EarlyBird Rhyming subtest) and another related assessment (like the Phonological Processing subtest of the KTEA-3). The goal of convergent validity is to yield a high association, which indicates that the developed measure converges on, or is empirically linked to, the intended construct. Concurrent validity (correlation) analyses were also conducted. Phonological awareness skills (like First Sound Matching) and sound/symbol correspondence tasks (like Letter Sounds) would be expected to have moderate associations between them; thus, the expectation is that moderate correlations would be observed. Predictive, convergent, and concurrent validity results are reported below.
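As a sanity check on the relationship between the reported R² and multiple r (r = √R² = √.37 ≈ .61), the sketch below fits an ordinary least-squares model on hypothetical data for three predictor subtests. All values are illustrative, not the study’s data.

```python
import numpy as np

# Hypothetical scores for three predictor subtests and a criterion.
rng = np.random.default_rng(0)
n = 184
X = rng.normal(size=(n, 3))  # following directions, nonword rep., rhyming
y = X @ np.array([0.4, 0.3, 0.3]) + rng.normal(scale=0.8, size=n)

# Ordinary least squares with an intercept column.
X1 = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
y_hat = X1 @ beta
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"R^2={r2:.2f}, multiple r={np.sqrt(r2):.2f}")  # r = sqrt(R^2)
```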
*In the table below, report the results of the validity analyses described above (e.g., concurrent or predictive validity, evidence based on response processes, evidence based on internal structure, evidence based on relations to other variables, and/or evidence based on consequences of testing), and the criterion measures.
Type of Validity | Subgroup | Informant | Age / Grade | Test or Criterion | n | Median Coefficient | 95% Confidence Interval Lower Bound | 95% Confidence Interval Upper Bound |
---|---|---|---|---|---|---|---|---|
- Results from other forms of validity analysis not compatible with above table format:
- Convergent and concurrent validity analyses were also conducted. We compared the EarlyBird Rhyming subtest to the KTEA-3 Phonological Processing (n = 215). Convergent validity was .53. The correlation between the EarlyBird First Sound Matching (n = 191) and Letter Sounds (n = 213) subtests was .51.
- Manual cites other published validity studies:
- Yes
- Provide citations for additional published studies.
- Foorman, B. R., Petscher, Y., Stanley, C., & Truckenmiller, A. (2017). Latent profiles of reading and language and their association with standardized reading outcomes in kindergarten through tenth grade. Journal of Research on Educational Effectiveness, 10(3), 619-645. https://doi.org/10.1080/19345747.2016.1237597
- Foorman, B. R., Petscher, Y., & Herrera, S. (2018). Unique and common effects of decoding and language factors in predicting reading comprehension in grades 1–10. Learning and Individual Differences. https://www.sciencedirect.com/science/article/abs/pii/S1041608018300414
- Describe the degree to which the provided data support the validity of the tool.
- Do you have validity data that are disaggregated by gender, race/ethnicity, or other subgroups (e.g., English language learners, students with disabilities)?
- No
If yes, fill in data for each subgroup with disaggregated validity data.
Type of Validity | Subgroup | Informant | Age / Grade | Test or Criterion | n | Median Coefficient | 95% Confidence Interval Lower Bound | 95% Confidence Interval Upper Bound |
---|---|---|---|---|---|---|---|---|
- Results from other forms of validity analysis not compatible with above table format:
- Manual cites other published validity studies:
- Provide citations for additional published studies.
Bias Analysis
Grade | Kindergarten |
---|---|
Rating | Yes |
- Have you conducted additional analyses related to the extent to which your tool is or is not biased against subgroups (e.g., race/ethnicity, gender, socioeconomic status, students with disabilities, English language learners)? Examples might include Differential Item Functioning (DIF) or invariance testing in multiple-group confirmatory factor models.
- Yes
- If yes,
- a. Describe the method used to determine the presence or absence of bias:
- DIF analysis on items / guidelines for retaining items: Several criteria were used to evaluate item performance. The first process was to identify items which demonstrated strong floor or ceiling effects (response rates >= 95%). Such items are not useful in creating an item bank, as there is little variability in whether students are successful on the item. In addition to evaluating the descriptive response rate, we estimated item-total correlations. Items with negative item-total correlations are indicative of poor functioning, suggesting that individuals who correctly answer the question tend to have lower total scores. Similarly, items with low item-total correlations indicate the lack of a relation between item and total test performance. Items with correlations < .15 were flagged for removal. Following the descriptive analysis of item performance, difficulty and discrimination values from the IRT analyses were used to further identify items which were poorly functioning. Items were flagged for revision if the item discrimination was negative or the item difficulty was greater than +4.0 or less than -4.0.
Secondary criteria were used in evaluating the retained items, comprising a differential item functioning (DIF) analysis. DIF refers to instances where individuals from different groups with the same level of underlying ability significantly differ in their probability of correctly endorsing an item. Unchecked, items included in a test which demonstrate DIF will produce biased test results. For the PWR study, DIF testing was conducted comparing: Black-White students, Latino-White students, Black-Latino students, students eligible for Free or Reduced-Price Lunch (FRL) with students not receiving FRL, and English Language Learner with non-English Language Learner students. DIF testing in the PWR study was conducted with a multiple indicator multiple cause (MIMIC) analysis in Mplus (Muthén & Muthén, 2008); moreover, a series of four standardized and expected score effect size measures were generated using VisualDF software (Meade, 2010) to quantify various technical aspects of score differentiation between the groups.
First, the signed item difference in the sample (SIDS) index was created, which describes the average unstandardized difference in expected scores between the groups. The second effect size calculated was the unsigned item difference in the sample (UIDS), which can be utilized as a supplement to the SIDS. When the absolute values of the SIDS and UIDS are equivalent, the differential functioning between groups is equivalent; however, when the absolute value of the UIDS is larger than the SIDS, it provides evidence that the item characteristic curves for expected score differences cross, indicating that differences in the expected scores between groups change across the level of the latent ability score. The D-max index is reported as the maximum SIDS value in the sample, and may be interpreted as the greatest difference for any individual in the sample in the expected response. Lastly, an expected score standardized difference (ESSD) was generated, computed similarly to Cohen’s (1988) d statistic. As such, it is interpreted as a measure of standard deviation difference between the groups for the expected score response, with values of .2 regarded as small, .5 as medium, and .8 as large. Items demonstrating DIF were flagged for further study in order to ascertain why groups with the same latent ability performed differently on the items.
DIF testing in the Dyslexia Risk study was estimated using the difR package (Magis, Beland, & Raiche, 2020) with the Mantel-Haenszel (1959) method for detecting uniform DIF. For each of the six tasks, DIF was tested for three primary contrasts: 1) Male vs. Female, 2) White vs. Sample, and 3) Black vs. Sample. The Mantel-Haenszel chi-square statistic was reported by item, and the chi-square was used to derive an effect size estimate (i.e., the ETS delta scale; Holland & Thayer, 1988). Effect size values <= 1.0 are considered small, 1.0-1.5 moderate, and >= 1.5 large (a simplified sketch of this method appears below).
Differential Test Functioning: A component of checking the validity of cut-points and scores on the assessments involved also testing differential accuracy of the regression equations across different demographic groups. This procedure involved a series of logistic regressions predicting success on the SESAT (i.e., at or above the 40th percentile) outcome measure. The independent variables included a variable that represented whether students were identified as not at-risk based on the identified cut-point on a combination score of the screening tasks, a variable that represented a selected demographic group, and an interaction term between the two variables. A statistically significant interaction term would suggest that differential accuracy in predicting end-of-year risk status existed for different groups of individuals based on the risk status identified by the PWR screener.
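For illustration only, here is a minimal sketch of uniform DIF detection in the Mantel-Haenszel style described above, with hypothetical item responses for two groups matched on banded total scores (the published analysis used the difR package in R). The ETS delta conversion, delta = -2.35·ln(alpha_MH), yields the small/moderate/large bands cited in the text.

```python
import numpy as np

def mantel_haenszel_delta(correct, group, strata):
    """Common odds ratio across ability strata, on the ETS delta scale.

    correct: 0/1 item responses; group: 0 = reference, 1 = focal;
    strata:  matching variable (e.g., banded total score).
    """
    num = den = 0.0
    for s in np.unique(strata):
        m = strata == s
        a = np.sum(m & (group == 0) & (correct == 1))  # reference correct
        b = np.sum(m & (group == 0) & (correct == 0))  # reference incorrect
        c = np.sum(m & (group == 1) & (correct == 1))  # focal correct
        d = np.sum(m & (group == 1) & (correct == 0))  # focal incorrect
        t = a + b + c + d
        if t == 0:
            continue
        num += a * d / t
        den += b * c / t
    alpha_mh = num / den
    # |delta| < 1.0 small, 1.0-1.5 moderate, >= 1.5 large
    return -2.35 * np.log(alpha_mh)

# Hypothetical data: 1,000 students, two groups, five score bands as strata.
rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=1000)
strata = rng.integers(0, 5, size=1000)           # banded total scores
p = 0.35 + 0.1 * strata                          # difficulty varies by band only
correct = (rng.random(1000) < p).astype(int)     # no group effect -> delta ~ 0
print(f"ETS delta = {mantel_haenszel_delta(correct, group, strata):.2f}")
```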
- b. Describe the subgroups for which bias analyses were conducted:
- For the PWR study, DIF testing was conducted comparing: Black-White students, Latino-White students, Black-Latino students, students eligible for Free or Reduced Priced Lunch (FRL) with students not receiving FRL, and English Language Learner to non-English Language Learner students. For the Dyslexia Risk study, DIF was tested for 1) Male vs. Female, 2) White vs. Sample, and 3) Black vs. Sample. Differential accuracy was separately tested for the PWR study for Black and Latino students as well as for students identified as English Language Learners (ELL) and students who were eligible for Free/Reduced Price Lunch (FRL).
- c. Describe the results of the bias analyses conducted, including data and interpretative statements. Include magnitude of effect (if available) if bias has been identified.
- Differential Item Functioning (DIF): Across all kindergarten tasks and comparisons, only 12 items demonstrated DIF with at least a moderate effect size (i.e., ETS delta >= 1.0): 2 Nonword Repetition items and 10 Word Matching items. These items were removed from the item bank for further study and testing. All remaining items presented with ETS delta values < 1.00, indicating small DIF. Differential Test Functioning: No statistically significant differential accuracy was found for any demographic subgroup. For more information, see pages 16-17 and 29 in the EarlyBird Technical Manual.
Data Collection Practices
Most tools and programs evaluated by the NCII are branded products which have been submitted by the companies, organizations, or individuals that disseminate these products. These entities supply the textual information shown above, but not the ratings accompanying the text. NCII administrators and members of our Technical Review Committees have reviewed the content on this page, but NCII cannot guarantee that this information is free from error or reflective of recent changes to the product. Tools and programs have the opportunity to be updated annually or upon request.