The truth will set you free, but first it will piss you off.
—Gloria Steinem
When a measure becomes a target, it ceases to be a good measure.
—Goodhart's Law (via Marilyn Strathern)
most common method to evaluate teaching
used for hiring, firing, tenure, promotion
simple, cheap, fast to administer
Part I: Basic Statistics
SET surveys are an incomplete census, not a random sample.
Suppose 70% of students respond, with an average of 4 on a 7-point scale. Then the class average could be anywhere between 3.1 & 4.9
"Margin of error" meaningless: not a random sample
Does a 3 mean the same thing to every student—even approximately?
Does a 5 mean the same thing in an upper-division architecture studio, a required freshman Econ course with 500 students, or an online Statistics course with no face-to-face interaction?
Is the difference between 1 & 2 the same as the difference between 5 & 6?
Does a 1 balance a 7 to make two 4s?
what about variability?
(1+1+7+7)/4 = (2+3+5+6)/4 = (1+5+5+5)/4 = (4+4+4+4)/4 = 4
Polarizing teacher ≠ teacher w/ mediocre ratings
3 statisticians go deer hunting …
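To make the variability point concrete, a small R sketch (mine, not from the slides) showing that the four rating patterns above have the same average but very different spreads:

```r
ratings <- list(c(1, 1, 7, 7), c(2, 3, 5, 6), c(1, 5, 5, 5), c(4, 4, 4, 4))
sapply(ratings, mean)  # 4 4 4 4: identical averages
sapply(ratings, sd)    # 3.46 1.83 2.00 0.00: very different spreads
```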
Averages make sense for interval scales, not ordinal scales like SET.
Averaging SET doesn't make sense.
Doesn't make sense to compare average SET across:
Shouldn't ignore variability or nonresponse.
Quantifauxcation
Part II: Science, Sciencism
If you can't prove what you want to prove, demonstrate something else and pretend they are the same thing. In the daze that follows the collision of statistics with the human mind, hardly anyone will notice the difference.
Darrell Huff
Hard to measure teaching effectiveness, so instead we measure student opinion (poorly) and pretend they are the same thing.
Should facilitate learning
Grades usually not a good proxy for learning
Students generally can't judge how much they learned
Serious problems with confounding
Faculty & students don't mean the same thing by "fair," "professional," "organized," "challenging," & "respectful"
Faculty & students don't mean the same thing by "fair," "professional," "organized," "challenging," & "respectful"
not fair means … | student % | instructor % |
---|---|---|
plays favorites | 45.8 | 31.7 |
grading problematic | 2.3 | 49.2 |
work is too hard | 12.7 | 0 |
won't "work with you" on problems | 12.3 | 0 |
other | 6.9 | 19 |
grant applications (e.g., Kaatz et al., 2014, Witteman et al., 2018)
letters of recommendation (e.g., Schmader et al., 2007, Madera et al., 2009)
job applications (e.g., Moss-Racusin et al., 2012, Reuben et al., 2014)
interruptions of job talks (e.g., Blair-Loy et al., 2017)
credit for joint work (e.g., Sarsons, 2015)
teaching evaluations (e.g., MacNell et al. 2015, Boring et al. 2016)
But I know some women who get great scores & win teaching awards!
I know SET aren't perfect, but they must have some connection to effectiveness.
I get better SET when I feel the class went better.
Shouldn't students have a voice?
Wagner et al., 2016
64% of students gave answers contradicting what they had demonstrated they knew was true.
The number of points on the rating scale affects gender differences
Part III: Deeper Dive
MacNell, Driscoll, & Hunt, 2014
NC State online course.
Students randomized into 6 groups, 2 taught by primary prof, 4 by GSIs.
2 GSIs: 1 male, 1 female.
GSIs used actual names in 1 section, swapped names in 1 section.
5-point scale.
Characteristic | M - F | perm P | t-test P |
---|---|---|---|
Overall | 0.47 | 0.12 | 0.128 |
Professional | 0.61 | 0.07 | 0.124 |
Respectful | 0.61 | 0.06 | 0.124 |
Caring | 0.52 | 0.10 | 0.071 |
Enthusiastic | 0.57 | 0.06 | 0.112 |
Communicate | 0.57 | 0.07 | NA |
Helpful | 0.46 | 0.17 | 0.049 |
Feedback | 0.47 | 0.16 | 0.054 |
Prompt | 0.80 | 0.01 | 0.191 |
Consistent | 0.46 | 0.21 | 0.045 |
Fair | 0.76 | 0.01 | 0.188 |
Responsive | 0.22 | 0.48 | 0.013 |
Praise | 0.67 | 0.01 | 0.153 |
Knowledge | 0.35 | 0.29 | 0.038 |
Clear | 0.41 | 0.29 | NA |
Test statistic: 2-sample t-test, following MacNell et al. (stratifying would be better)
Neyman model: 4 potential responses per student assigned to GSI
Null: response (incl nonresponse) to given GSI does not depend on name the GSI used
Randomization: conditions on students assigned to each actual GSI, matches actual randomization
Nonresponders fixed
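A minimal R sketch of a stratified permutation test along these lines (my own illustration, not MacNell et al.'s or Boring et al.'s code; the data-frame columns are assumptions). For simplicity it uses the difference in mean ratings, male-identified minus female-identified, as the test statistic, and re-randomizes the perceived-gender labels within each actual GSI, holding nonresponders fixed:

```r
# d: one row per student, with columns
#   rating       numeric SET item (NA = nonresponse, held fixed)
#   gsi          actual instructor of the student's section
#   name_gender  "M" or "F": gender of the name the GSI used in that section
perm_p <- function(d, B = 10000) {
  stat <- function(lab) mean(d$rating[lab == "M"], na.rm = TRUE) -
                        mean(d$rating[lab == "F"], na.rm = TRUE)
  obs <- stat(d$name_gender)
  perms <- replicate(B, {
    lab <- d$name_gender
    for (i in split(seq_along(lab), d$gsi))   # re-randomize within each actual GSI
      lab[i] <- sample(lab[i])
    stat(lab)
  })
  mean(perms >= obs)   # one-sided P-value for "male-identified rated higher"
}
```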
Mean grade and instructor gender (male minus female)
instructor gender | difference in means | P-value |
---|---|---|
Perceived | 1.76 | 0.54 |
Actual | -6.81 | 0.02 |
5 years of data for 6 mandatory freshman classes at Sciences Po:
History, Political Institutions, Microeconomics, Macroeconomics, Political Science, Sociology
23,001 SET, 379 instructors, 4,423 students, 1,194 sections (950 without PI), 21 year-by-course strata
response rate ~100%
anonymous finals except PI
interim grades before final
Correlation between SET and gender within each stratum, averaged across strata
Correlation between SET and average final exam score within each stratum, averaged across strata
Could have used, e.g., correlation within strata, combined with Fisher's combining function
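A minimal R sketch of the statistic used here (my illustration under assumed column names, not the paper's code): compute the correlation between SET and the covariate (gender, final exam score, ...) separately within each year-by-course stratum, average those correlations, and compare with the distribution obtained by permuting the covariate within strata. The real analysis permutes at the level dictated by the design; this only shows the mechanics:

```r
# d: columns set (SET score), x (covariate), stratum (year-by-course stratum)
avg_cor <- function(set, x, stratum)
  mean(sapply(split(data.frame(set, x), stratum), function(s) cor(s$set, s$x)),
       na.rm = TRUE)

perm_p <- function(d, B = 10000) {
  obs <- avg_cor(d$set, d$x, d$stratum)
  perms <- replicate(B, {
    x <- d$x
    for (i in split(seq_along(x), d$stratum)) x[i] <- sample(x[i])  # permute within strata
    avg_cor(d$set, x, d$stratum)
  })
  mean(abs(perms) >= abs(obs))   # two-sided P-value
}
```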
Average correlation between SET and final exam score
 | strata | ρ̄ | P |
---|---|---|---|
Overall | 26 (21) | 0.04 | 0.09 |
History | 5 | 0.16 | 0.01 |
Political Institutions | 5 | N/A | N/A |
Macroeconomics | 5 | 0.06 | 0.19 |
Microeconomics | 5 | -0.01 | 0.55 |
Political science | 3 | -0.03 | 0.62 |
Sociology | 3 | -0.02 | 0.61 |
Average correlation between SET and instructor gender
 | ρ̄ | P |
---|---|---|
Overall | 0.09 | 0.00 |
History | 0.11 | 0.08 |
Political institutions | 0.11 | 0.10 |
Macroeconomics | 0.10 | 0.16 |
Microeconomics | 0.09 | 0.16 |
Political science | 0.04 | 0.63 |
Sociology | 0.08 | 0.34 |
Average correlation between final exam scores and instructor gender
 | ρ̄ | P |
---|---|---|
Overall | -0.06 | 0.07 |
History | -0.08 | 0.22 |
Macroeconomics | -0.06 | 0.37 |
Microeconomics | -0.06 | 0.37 |
Political science | -0.03 | 0.70 |
Sociology | -0.05 | 0.55 |
Average correlation between SET and interim grades
 | ρ̄ | P |
---|---|---|
Overall | 0.16 | 0.00 |
History | 0.32 | 0.00 |
Political institutions | -0.02 | 0.61 |
Macroeconomics | 0.15 | 0.01 |
Microeconomics | 0.13 | 0.03 |
Political science | 0.17 | 0.02 |
Sociology | 0.24 | 0.00 |
lack of association between SET and final exam scores (negative result, so multiplicity not an issue)
lack of association between instructor gender and final exam scores (negative result, so multiplicity not an issue)
association between SET and instructor gender
association between SET and interim grades
Bonferroni's adjustment for four tests leaves the associations highly significant: adjusted P<0.01.
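For concreteness: Bonferroni multiplies each raw P-value by the number of tests (four here), capping at 1, so a raw P below 0.0025 stays below 0.01. A one-line R illustration with placeholder P-values (not the paper's):

```r
p.adjust(c(0.001, 0.002, 0.03, 0.2), method = "bonferroni")
# 0.004 0.008 0.120 0.800
```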
US data: controls for everything but the name, since it compares each TA with him/herself.
French data: controls for subject, year, teaching effectiveness
Aside: Permutation Tests and PRNGs
232≈4e9<13!≈6e9
232×624≈9e6010<<2084!≈4e6013
In R, unif_rand() is a 32-bit Mersenne Twister output times 2.3283064365386963e-10 (i.e., 2^-32).
```c
static R_INLINE double ru()
{
    double U = 33554432.0;
    return (floor(U*unif_rand()) + unif_rand())/U;
}

double R_unif_index(double dn)
{
    ....
    cut = 33554431.0; /* 2^25 - 1 */
    ....
    double u = dn > cut ? ru() : unif_rand();
    return floor(dn * u);
}
```
```c
static void walker_ProbSampleReplace(int n, double *p, int *a, int nans, int *ans)
{
    ....
    /* generate sample */
    for (i = 0; i < nans; i++) {
        rU = unif_rand() * n;
        k = (int) rU;
        ans[i] = (rU < q[k]) ? k+1 : a[k]+1;
    }
    ....
}
```
Example by Duncan Murdoch:

```r
> m <- (2/5)*2^32
> x <- sample(m, 1000000, replace = TRUE)
> table(x %% 2)
     0      1
399850 600150
```

Roughly: when the index is computed as floor(m · unif_rand()) with unif_rand() a multiple of 2^-32 and m = (2/5)·2^32, even index values are hit by three of every five consecutive 32-bit values and odd values by two, so after the +1 shift odd outputs appear about 60% of the time.
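Returning to the counting claim at the top of this aside, a quick R check of those figures on the log scale (mine, not from the slides):

```r
32 * log(2)         # log(2^32)       ~ 22.2
lfactorial(13)      # log(13!)        ~ 22.6  -> 2^32 < 13!
32 * 624 * log(2)   # log(2^(32*624)) ~ 13841
lfactorial(2084)    # log(2084!)      ~ 13847 -> far more orderings than Mersenne Twister states
```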
Part IV: Is there real controversy?
It is difficult to get a man to understand something, when his salary depends upon his not understanding it! —Upton Sinclair
Widely cited, but it's a technical report from IDEA, which sells SET.
Claims SET are reliable and valid.
Does not cite Carrell & West (2008) or Braga et al. (2011), randomized experiments published before B&C (2012)
As far as I can tell, no study B&C cite in support of validity used randomization.
Theoretically, the best indicant of effective teaching is student learning. Other things being equal, the students of more effective teachers should learn more.
I agree. We just can't measure/compare that in most universities.
Are SET more sensitive to effectiveness or to something else?
Do comparably effective women and men get comparable SET?
But for their gender, would women get higher SET than they do? (And but for their gender, would men get lower SET than they do?)
Need to compare like teaching with like teaching, not an arbitrary collection of women with an arbitrary collection of men.
Boring (2014) finds costs of increasing SET very different for men and women.
Ethnicity and race
Age
Attractiveness
Accents / non-native English speakers
"Halo effect"
…
strongly correlated with students' grade expectations
Boring et al., 2016; Johnson, 2003; Marsh & Cooper, 1980; Short et al., 2008; Worthington, 2002
strongly correlated with enjoyment (Stark, unpublished, 2012; 1,486 students)
Correlation btw instructor effectiveness & enjoyment: 0.75.
Correlation btw course effectiveness & enjoyment: 0.8.
correlated with instructor gender, ethnicity, attractiveness, & age
Anderson & Miller, 1997; Ambady & Rosenthal, 1993; Arbuckle & Williams, 2003; Basow, 1995; Boring, 2017; Boring et al., 2016;
Chisadza et al. 2019; Cramer & Alexitch, 2000; Marsh & Dunkin, 1992; MacNell et al., 2014;
Wachtel, 1998; Wallish & Cachia, 2018; Weinberg et al., 2007; Worthington, 2002
omnibus, abstract questions about curriculum design, effectiveness, etc., most influenced by factors unrelated to learning
Worthington, 2002
SET are not very sensitive to effectiveness; weak and/or negative association
Calling something "teaching effectiveness" does not make it so
Computing averages to 2 decimals doesn't make them reliable
Part V: What to do, then?
students' subjective experience of the course (e.g., did the student find the course challenging?)
personal observations (e.g., was the instructor’s handwriting legible?)
avoid abstract questions, omnibus questions, and questions requiring judgment, because they are particularly subject to biases related to instructor gender, age, and other characteristics protected by employment law
avoid questions that require evaluative judgments (e.g., how effective was the instructor?)
focus on things that affect learning and the learning experience
focus on things the instructor has control over
focus on things that can inform better pedagogy
multiple-choice items generally should provide free-form space to explain choice
interpret responses cautiously
I attended scheduled classes
Lectures helped me to understand the substance
I read the assigned textbook, lecture notes, or other materials
When did you do the readings?
The textbook, lecture notes or other course materials helped me understand the material
I completed the assignments.
The assignments helped me understand the material
I could understand what was being asked of me in assessments and assignments.
I attended office hours.
I found feedback (in class, on assignments, exams, term papers, presentations, etc.) helpful to understand how to improve.
What materials or activities did you find most useful? [lectures, recorded lectures, readings, assignments, ...]
I felt there were ways to get help, even if I did not take advantage of them.
I felt adequately prepared for the course.
If you did not feel prepared, had you taken the prerequisites listed for the course?
I felt that active participation in class was welcomed or encouraged by the instructor.
I could hear and/or understand the instructor in class.
I could read the instructor’s handwriting and/or slides.
Did physical aspects of the classroom (boards, lighting, projectors, sound system, seating) impede your ability to learn or participate? (Yes, no)
Compared to other courses at this level, I found this course … (more difficult, about the same, easier)
Compared to other courses with the same number of units, I found this course ... (more work, about the same, less work)
I enjoyed this course.
I found this course valuable or worthwhile.
Are you satisfied with the effort you put into this course?
Was this course in your (intended) major?
If this course was an elective outside your (intended) major, do you plan to take a sequel course in the discipline?
What would you have liked to have more of in the course?
What would you have liked to have less of in the course?
The instructor created an inclusive environment consistent with the university's diversity goals.
The structure of the course helped me learn.
Union arbitration in Newfoundland (Memorial U.)
Union arbitration in Ontario (OCUFA, Ryerson U.)
Civil litigation in Ohio (Miami U.)
Civil litigation in Vermont
Union arbitration in Florida (UFF, U. Florida)
Union grievance in California (Berkeley)
Discussions with several attorneys who want to pursue class actions
USC, University of Oregon, Colorado State University, Ontario
UC Berkeley: Division of Mathematical and Physical Sciences
The truth will set you free, but first it will piss you off.
—Gloria Steinem