Imagine Learning

Imagine Language & Literacy Benchmark

  

Cost

Technology, Human Resources, and Accommodations for Special Needs

Service and Support

Purpose and Other Implementation Information

Usage and Reporting

Initial Cost:

No isolated assessment pricing is available, because the Benchmarking system is not sold as a standalone product. Pricing information can be obtained from the vendor and applies to the full Imagine Language & Literacy product, including the Placement and Benchmarking systems.

 

Replacement Cost:

Contact vendor for details.

 

Included in Cost:

Imagine Language & Literacy is sold for a flat annual fee per student or per school site, depending on customer preference. Customer support is included with the purchase, and additional support (Customer Success) packages can be purchased as well. Pricing covers the full instructional and assessment product offering.

 

Technology Requirements:

  • Computer or tablet
  • Internet connection

 

Training Requirements:

  • Less than 1 hour of training

 

Qualified Administrators:

  • No minimum qualifications specified

 

Accommodations:

Accommodations built into the screening tool include sync highlighting, repeated directions, visual cues, and directions with pictures. The tool does not have built-in assistive technology specifically designed as accommodations for students with disabilities; however, the software can be used with operating system accessibility features enabled. Please visit Imagine Learning's support site (support.imaginelearning.com) for a list of supported operating systems.

 

Acceptable testing accommodations include breaking testing into multiple sessions, providing short breaks within testing sessions, scheduling testing to optimize student performance, and administering the test in environments that minimize distractions.

Where to Obtain:

Website: www.imaginelearning.com

Address: 382 W Park Circle, Provo, UT 84604

Phone number: (866) 377-5071 [sales] / (866) 457-8776 [support]

Email: info@imaginelearning.com [sales] / support@imaginelearning.com [support]


Access to Technical Support:

Educators can access Imagine Learning’s customer support services to address any technical issues via telephone or email from 6am-6pm MT, Monday-Friday. Local representatives are also available to provide assistance.

 

The Imagine Language & Literacy Placement and Benchmarking system includes subtests addressing print concepts, phonological awareness, phonics and word recognition, reading comprehension, oral vocabulary, and grammar. Student performance data is provided by subtest.

Data collected through the Imagine Language & Literacy Placement and Benchmarking system serve multiple purposes. First, Placement Test results are used to determine developmentally appropriate starting points for each student across several literacy and language development curricula. Second, Placement and Benchmark Test results provide point-in-time skill-level estimates that are independently useful for Universal Screening. Third, when data are collected for both the Placement and Benchmark Tests (or multiple administrations of the Benchmark Test), differences in performance can be used to estimate scaled student growth. Those data can then be compared across administrations to illuminate changes in students' reading proficiency levels.

 

Assessment Format:

  • Direct: Computerized

 

Administration Time:

  • 10-45 minutes per student*
  • 45-60 minutes per group

*10-15 minutes for students in pre-K, K; up to 45 minutes for students in grades 1-6.

 

Scoring Time:

  • Scoring is automatic

 

Scoring Method:

  • Calculated automatically

 

Scores Generated:

  • Raw score
  • IRT-based score
  • Developmental benchmarks
  • Subscale/subtest scores

 

Classification Accuracy

Criterion 1, Fall: dash (Grades K-6)
Criterion 1, Winter: Empty bubble (Grades K-6)
Criterion 1, Spring: dash (Grades K-6)
Criterion 2, Fall: dash (Grades K-6)
Criterion 2, Winter: dash (Grades K-6)
Criterion 2, Spring: dash (Grades K-6)

Primary Sample

 

Criterion 1, Winter

Criterion: Spring MAP Reading (all grades)
Cut points: Percentile rank on criterion measure: 20th percentile (all grades)
Cut points: Corresponding performance score (numeric) on screener measure: Not Provided (all grades)
Base rate in the sample for children requiring intensive intervention: Not Provided (all grades)

Grade                                        K       1       2       3       4       5       6
Cut points: Performance score on criterion  -3.31   -1.87   -0.22    0.51    0.97    1.27    2.56
False Positive Rate                          0.31    0.14    0.20    0.20    0.15    0.14    0.18
False Negative Rate                          0.30    0.23    0.12    0.20    0.26    0.30    0.21
Sensitivity                                  0.70    0.77    0.88    0.80    0.74    0.70    0.79
Specificity                                  0.69    0.86    0.80    0.80    0.85    0.86    0.82
Positive Predictive Power                    0.37    0.59    0.58    0.53    0.59    0.58    0.54
Negative Predictive Power                    0.90    0.93    0.96    0.93    0.92    0.91    0.94
Overall Classification Rate                  0.69    0.84    0.82    0.80    0.82    0.82    0.81
Area Under the Curve (AUC)                   0.75    0.87    0.91    0.87    0.86    0.84    0.88
AUC 95% Confidence Interval Lower Bound      0.71    0.83    0.89    0.84    0.82    0.81    0.83
AUC 95% Confidence Interval Upper Bound      0.79    0.90    0.93    0.90    0.89    0.88    0.94

 

Reliability

Rating: Full bubble (Grades K-6)
  1. Justification for each type of reliability reported, given the type and purpose of the tool: Because the Imagine Learning Beginning-of-Year Benchmark assessment uses a total score as a basis for screening decisions, we followed guidance from the Technical Review Committee (National Center on Intensive Intervention, 2018) and conducted model-based analyses of item quality. A model-based approach also accommodates the computerized adaptive nature of the assessment.

Sijtsma (2009) and Bentler (2009) describe statistical model-based internal consistency reliability coefficients based on covariance and correlational approaches to identifying the proportion of variance that can be reliably interpreted as true rather than as due to error. Bentler (2003, 2007) proposed using any statistically acceptable structural model that contains additive random errors to estimate model-based internal consistency reliability. Rasch models (Rasch, 1960; Andrich, 1988; Bond & Fox, 2015; Linacre, 1993, 1994, 1996; Wilson, 2005) are widely used, statistically acceptable structural models that apply additive random errors to estimate internal consistency reliability.

Use of a Rasch model-based approach to reliability is substantively justified by the fact that education is founded on the idea that future student success can be predicted from performance on tasks similar enough to those encountered in real life to be useful proxies. These tasks are assembled into curricular materials and approached from pedagogical perspectives that typically are not tailored to each individual student but are assumed to apply in a developmentally appropriate way across students as they age and progress in knowledge.

Much recent work in education has focused on developing and testing measurement models and calibrating tools as coherent guides to future practice, across the full range of assessment applications, from formative classroom instruction to accountability (Wilson, 2004; Gorin & Mislevy, 2013; NRC, 2006).

All measures and calibrations are associated with individually estimated error and model fit statistics. A Rasch measurement model-based approach to reliability (Andrich, 1982; Andrich & Douglas, 1977; Fisher, 1992, 2008; Wright & Masters, 1982) differs, then, from the statistical model-based approaches described by Sijtsma (2009) and Bentler (2009) in the way error is estimated and treated en route to arriving at an estimate of the ratio of true variation to error variation.

 

  2. Description of the sample(s), including size and characteristics, for each reliability analysis conducted: The sample for the reliability analysis came from multiple regions, divisions, and states for grades K-5. For students for whom gender was known, approximately half of each grade level was male and half was female. Fewer than ten percent of students at each grade level were identified as English Language Learners, though ELL status data were missing for the majority of students. Inconsistent information was available for free or reduced-price lunch (FRL) eligibility, other SES indicators, disability classification, and first language.

 

  3. Description of the analysis procedures for each reported type of reliability: A Rasch measurement model-based approach was used to assess reliability. This approach estimates error (better termed uncertainty; Fisher & Stenner, 2017; JCGM, 2008) for each individual person and item. The mean square error variance is then subtracted from the total variance to arrive directly at an estimate of the true variance. The ratio of the true standard deviation to the root mean square error then provides a separation ratio, G, that increases in linear error-unit amounts (Wright & Masters, 1982), relative to a separation reliability coefficient formulated in the traditional 0.00-1.00 form (Andrich, 1982). An intuitive expression of measurement model-based reliability is obtained in terms of strata, ranges with centers separated by three errors. Strata are estimated by multiplying the separation ratio, G, by 4, which captures 95% of the variation in the measures; adding 1, allowing for error; and dividing that total by 3, for the number of errors taken as separating the centers of the resulting ranges (see the worked sketch following this item). Winsteps version 4.00.1 was used for all Rasch analyses (Linacre, 2017).

Model fit statistics and Principal Components Analyses of the standardized residuals explicitly focus on the internal consistency of the observed responses, evaluating them in terms of expectations formed on the basis of a simple conception of what is supposed to happen when students respond to test questions. Simulations show that model-based assessments of internal consistency using fit statistics are better able to detect violations of interval measurement’s unidimensionality and invariance requirements than Cronbach’s alpha (Fisher, Elbaum, & Coulter, 2010).
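
The separation and strata computations described above can be expressed in a few lines of code. The following is a minimal sketch, not Imagine Learning's implementation, and the person measures and standard errors shown are made-up values used only to demonstrate the arithmetic:

    import math

    def separation_statistics(measures, standard_errors):
        """Rasch-style separation statistics from person measures and their standard errors."""
        n = len(measures)
        mean = sum(measures) / n
        observed_var = sum((m - mean) ** 2 for m in measures) / n   # total variance
        mse = sum(se ** 2 for se in standard_errors) / n            # mean square error variance
        true_var = max(observed_var - mse, 0.0)                     # total variance minus error variance
        reliability = true_var / observed_var                       # traditional 0.00-1.00 coefficient
        g = math.sqrt(true_var) / math.sqrt(mse)                    # separation ratio G (true SD / RMSE)
        strata = (4 * g + 1) / 3                                    # statistically distinct ranges, (4G + 1) / 3
        return reliability, g, strata

    # Illustrative values only, not taken from the Benchmark data.
    measures = [-1.2, -0.4, 0.1, 0.6, 1.3, 2.0]
    standard_errors = [0.35, 0.30, 0.28, 0.30, 0.33, 0.40]
    print(separation_statistics(measures, standard_errors))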

 

  4. Reliability of performance level score (e.g., model-based, internal consistency, inter-rater reliability).

Type of Reliability    Age or Grade    n        Coefficient    95% CI Lower    95% CI Upper
Rasch Model Based      K               1,974    0.88           0.87            0.89
Rasch Model Based      1               2,202    0.91           0.90            0.92
Rasch Model Based      2               2,289    0.90           0.89            0.91
Rasch Model Based      3               2,313    0.87           0.86            0.88
Rasch Model Based      4               2,216    0.87           0.86            0.88
Rasch Model Based      5               1,981    0.88           0.87            0.89
Rasch Model Based      6               452      0.87           0.85            0.89

 

Disaggregated Reliability

The following disaggregated reliability data are provided for context and did not factor into the Reliability rating.

Type of Reliability: None (no disaggregated reliability data reported)

Validity

Rating: Grade K: Empty bubble; Grades 1-3: Full bubble; Grades 4-5: Half-filled bubble; Grade 6: Full bubble

1. Description of each criterion measure used and explanation as to why each measure is appropriate, given the type and purpose of the tool: Rasch Unit (RIT) scale scores from the Measures of Academic Progress (MAP) Reading assessment (Northwest Evaluation Association, 2014) were used as a criterion measure to demonstrate concurrent and predictive validity of the Imagine Learning Beginning-of-Year Benchmark assessment. Performance on this external measure was expected to be related to performance on the Imagine Learning Beginning-of-Year Benchmark because it includes similar measures of language and literacy skills for grades K-6. Specifically, the Imagine Learning Beginning-of-Year Benchmark includes measures of letter recognition, phonemic awareness, word recognition, basic reading vocabulary, sentence cloze, beginning book comprehension, leveled book comprehension, and cloze passage completion. MAP Reading includes measures of word recognition, structure and vocabulary, and reading informational texts. MAP Reading is widely used to identify students who are at risk or in need of intensive intervention and, like the Imagine Learning Beginning-of-Year Benchmark, is a computer-adaptive assessment. MAP Reading also has good evidence of reliability and validity for use as an interim progress monitoring assessment.

 

2. Description of the sample(s), including size and characteristics, for each validity analysis conducted: Across all grade K-5 samples used to examine predictive and concurrent validity, sample sizes ranged from 641 to 914; sample sizes for grade 6 ranged from 115 to 159. Multiple regions, divisions, and states were represented for each grade level K-5; the 6th grade sample represented one region, division, and state. The proportion of male and female students across all samples was roughly equal. English Language Learners (ELLs) represented between a quarter and a third of each sample, except for the Grade 6 samples, in which ELLs represented between 4 and 12 percent of students. In all samples, most students were Hispanic, with the next most prevalent race/ethnicity categories being White/non-Hispanic and Black/African American. Data indicating eligibility for free/reduced-price lunch, student disability status, and services received were not consistently available.

Statistics describing the distribution of students according to MAP Reading performance, based on winter 2016 data, showed that students at each grade level performed below national norms, on average. Standard deviations were similar to or slightly larger than national norms, indicating a diversity of students across performance levels.

 

3. Description of the analysis procedures for each reported type of validity: Three bivariate correlation analyses were conducted for each grade level. The first provides evidence of predictive validity, using winter 2016 Imagine Learning Beginning-of-Year Benchmark assessment data to predict scores on the spring 2017 MAP Reading assessment. Two analyses were used to demonstrate concurrent validity by relating Imagine Learning Beginning-of-Year Benchmark assessment data to MAP Reading assessment data (separately, using winter 2016 and spring 2017 data). The Pearson correlation coefficient was computed for each relation, and a 95% confidence interval around each coefficient was calculated.
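
The report does not state how the confidence intervals were computed; a common choice is the Fisher z-transformation, and the brief sketch below (an assumption, not a description of the vendor's code) reproduces the kind of interval reported in the table that follows:

    import math

    def pearson_ci_95(r, n):
        """Approximate 95% confidence interval for a Pearson r via Fisher's z-transformation."""
        z = 0.5 * math.log((1 + r) / (1 - r))      # transform r to z
        se = 1 / math.sqrt(n - 3)                  # standard error of z
        lo, hi = z - 1.96 * se, z + 1.96 * se
        back = lambda v: (math.exp(2 * v) - 1) / (math.exp(2 * v) + 1)  # back-transform to r
        return back(lo), back(hi)

    # Example: the grade K predictive coefficient reported below (r = 0.53, n = 860)
    # yields an interval of roughly (0.48, 0.58).
    print(pearson_ci_95(0.53, 860))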

 

4. Validity for the performance level score (e.g., concurrent, predictive, evidence based on response processes, evidence based on internal structure, evidence based on relations to other variables, and/or evidence based on consequences of testing), and the criterion measures.

Type of Validity    Age or Grade    Test or Criterion    n      Coefficient    95% CI Lower    95% CI Upper
Predictive          K               MAP Reading          860    0.53           0.48            0.58
Concurrent 1        K               MAP Reading          854    0.67           0.63            0.71
Concurrent 2        K               MAP Reading          760    0.51           0.46            0.56
Predictive          1               MAP Reading          913    0.67           0.64            0.71
Concurrent 1        1               MAP Reading          912    0.79           0.76            0.81
Concurrent 2        1               MAP Reading          749    0.67           0.63            0.71
Predictive          2               MAP Reading          912    0.75           0.73            0.78
Concurrent 1        2               MAP Reading          914    0.82           0.79            0.84
Concurrent 2        2               MAP Reading          756    0.75           0.72            0.78
Predictive          3               MAP Reading          777    0.69           0.65            0.73
Concurrent 1        3               MAP Reading          781    0.79           0.76            0.81
Concurrent 2        3               MAP Reading          644    0.67           0.63            0.71
Predictive          4               MAP Reading          804    0.64           0.60            0.68
Concurrent 1        4               MAP Reading          804    0.83           0.81            0.85
Concurrent 2        4               MAP Reading          655    0.66           0.62            0.70
Predictive          5               MAP Reading          774    0.64           0.59            0.68
Concurrent 1        5               MAP Reading          777    0.77           0.74            0.80
Concurrent 2        5               MAP Reading          641    0.65           0.61            0.69
Predictive          6               MAP Reading          159    0.71           0.63            0.78
Concurrent 1        6               MAP Reading          159    0.74           0.64            0.81
Concurrent 2        6               MAP Reading          115    0.75           0.60            0.81

 

5. Results for other forms of validity (e.g., factor analysis) not conducive to the table format: Not Provided

 

6. Describe the degree to which the provided data support the validity of the tool: Using the NCII lower-bound threshold of .6 for the confidence intervals around the standardized estimate, the analyses showed evidence of concurrent validity for all grade levels. Based on this threshold, evidence of predictive validity was available for first, second, third, and sixth grades. The lower bounds of the confidence intervals for predictive validity for fourth and fifth grades were very close to the threshold (.597 for fourth grade and .594 for fifth grade).

 

Disaggregated Validity

The following disaggregated validity data are provided for context and did not factor into the Validity rating.

Type of Validity: None (no disaggregated validity data reported)

Results for other forms of disaggregated validity (e.g. factor analysis) not conducive to the table format: Not Provided

Sample Representativeness

Data: Regional without Cross-Validation (Grades K-6)

Primary Classification Accuracy Sample

     

Grade                                  K      1      2      3      4      5      6
Criterion                              Spring MAP Reading (all grades)
National/Local Representation          Grades K-2: South (West South Central) and West (Mountain, Pacific); AZ, CA, LA, WY
                                       Grades 3-5: West (Mountain, Pacific); AZ, CA, WY
                                       Grade 6: West (Mountain); AZ
Date                                   Spring 2017 (all grades)
Sample Size                            680    913    912    777    804    774    159
Male                                   49%    50%    50%    51%    53%    54%    52%
Female                                 52%    50%    50%    49%    47%    46%    48%
Gender Unknown
Free or Reduced-price Lunch Eligible   Not Provided (all grades)
White, Non-Hispanic                    13%    11%    12%    6%     7%     6%     4%
Black, Non-Hispanic                    0%     2%     0%     0%     0%     0%     0%
Hispanic                               7%     10%    8%     4%     4%     5%     7%
American Indian/Alaska Native          79%    78%    79%    89%    88%    88%    85%
Other                                  0%     1%     1%     1%     1%     1%     2%
Race/Ethnicity Unknown                 0%     0%     0%     0%     0%     0%     0%
Disability Classification              Not Provided (all grades)
First Language                         Not Provided (all grades)
Language Proficiency Status (ELL)      32%    31%    28%    36%    32%    25%    13%

     

Bias Analysis Conducted

Rating: Yes (Grades K-6)
  1. Description of the method used to determine the presence or absence of bias: Differential Item Functioning (DIF) was evaluated for all items on the Imagine Learning Beginning-of-Year Benchmark by comparing focal and contrast group differences by means of t-tests. Winsteps version 4.00.1 was used for all analyses (Linacre, 2017).

Evaluating DIF requires close attention to the degree of item estimate stability that is desired. Lee (1992) finds that student performance across elementary grade levels changes by about a logit per year, with the rate of growth slowing as the upper grades are reached. The present data roughly corresponded with that pattern: the changes from K to Grade 1, Grade 1 to Grade 2, and Grade 2 to Grade 3 were all a logit or more, and the remaining three were all half a logit or more. Accordingly, as Linacre (1994) points out, items stable to within a logit will be targeted at the correct grade level.

Linacre (1994) shows that to have 99% confidence that an item difficulty estimate is stable to within a logit, its error of measurement must be 0.385 or less (the value obtained by dividing the desired one-logit range by the 2.6 standard deviations needed for 99% confidence). The sample size needed for item estimate errors of 0.385 or less was in the range of 27 to 61, depending on how well the item is targeted to the sample. An item that is too easy or too hard, and so is off target, will have a higher error and will need a correspondingly larger sample to drive the error down to the range needed for stability within a logit. The criterion of interest in the Winsteps table is then the DIF size, which should be less than 1.00 logit for samples of sufficient size.
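
A short worked check of the arithmetic in the preceding paragraph (a sketch only; the thresholds themselves come from Linacre, 1994, and the simple SE approximation below is an assumption used for illustration):

    import math

    # Stability to within one logit at 99% confidence requires SE <= 1 / 2.6.
    max_se = round(1.0 / 2.6, 3)
    print(max_se)                                   # 0.385 logits, the ceiling cited above

    # For a dichotomous Rasch item, SE is roughly 1 / sqrt(sum of p * (1 - p) over respondents),
    # so a well-targeted item (p near 0.5 for most respondents) reaches SE <= 0.385 with about
    # 1 / (0.385^2 * 0.25) responses; poorly targeted items need larger samples.
    print(math.ceil(1.0 / (max_se ** 2 * 0.25)))    # 27, the low end of the 27-61 range cited above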

     

  2. Description of the subgroups for which bias analyses were conducted: Groups tested included ELL versus non-ELL (and unknown); female versus male (and unknown); and White versus non-White. Race/ethnicity categories were collapsed into binary categories (i.e., White and non-White) because of small cell sizes for certain groups.

     

  3. Description of the results of the bias analyses conducted, including data and interpretative statements: For ELL, with three groups, the hypothesis tested was that each item has the same difficulty as its average difficulty for all groups. In 98.5% of comparisons, DIF sizes of |0.77| or less were obtained. Of the six instances of DIF greater than |0.77| logits, four involved Phoneme Segmentation items with only four non-ELL students responding. The other two were Cloze Passage items (items 107.1 and 107.18); both had 77 ELL students answering, and both had DIF sizes of about -1.00 logit, meaning these ELL students answered these two items correctly disproportionately more often. However, neither item had a t-statistic greater than 1.96, so neither difference was statistically significant at a p-value less than .05.

For gender, also with three groups, the hypothesis tested was again that each item has the same difficulty as its average difficulty for all groups. No DIF sizes over |0.77| logits were found, and seven items had DIF sizes greater than or equal to |0.50|. All were Cloze Passage items. Males found four items more difficult than the other two groups did, and females found three items easier. Two items with DIF sizes of 0.66 disadvantaging males were statistically significant at p less than .05.

For ethnicity, the hypothesis tested was whether each item has the same difficulty for the two groups. There were three items with DIF sizes greater than |0.76|. Two of the three had DIF measure standard errors greater than 0.385 for one of the contrasted groups, and so were not expected to be stable to within a logit with 99% confidence. The other item (Cloze Passage 104.1), however, was 1.57 logits more difficult for White students (CLASS=1) than it was for others.

The DIF analysis provides a statistical approach to assessing whether the items retain their relative difficulties across different groups. This kind of invariance can also be investigated more directly, by scaling the items on subsamples of students. A series of analyses of each grade level resulted in seven separate sets of item calibrations. Of the 21 pairwise correlations among these calibrations, 15 were between 0.94 and 0.99, four were between 0.83 and 0.89, and the remaining two (Kindergarten versus 5th and 6th grades) were 0.78 and 0.66. Disattenuated for error (Muchinsky, 1996; Schumacker, 1996), the smallest correlation was 0.75. These results support the overall stability of the scale across grade levels.
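
The disattenuation mentioned above is the classical correction for attenuation, r_true = r_observed / sqrt(reliability_x * reliability_y). A minimal sketch follows; the reliability values are assumed purely for illustration, since the per-grade calibration reliabilities used for the actual correction are not listed here:

    import math

    def disattenuate(r_observed, rel_x, rel_y):
        """Classical correction of a correlation for attenuation due to measurement error."""
        return r_observed / math.sqrt(rel_x * rel_y)

    # With an observed cross-grade calibration correlation of 0.66 and assumed
    # reliabilities of 0.88 for each calibration (illustrative values), the
    # corrected correlation is 0.75, matching the smallest disattenuated value
    # reported above.
    print(round(disattenuate(0.66, 0.88, 0.88), 2))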

     

Administration Format

Data: Individual and Group (Grades K-6)

Administration & Scoring Time

Data: 10-45 minutes (Grades K-6)

Scoring Format

Data: Automatic (Grades K-6)

Types of Decision Rules

Data: Benchmark Goals (Grades K-6)

Evidence Available for Multiple Decision Rules

Data: Grades K-1: Yes; Grades 2-6: No