The truth will set you free, but first it will piss you off.
—Gloria Steinem
When a measure becomes a target, it ceases to be a good measure.
—Goodhart's Law (via Marilyn Strathern)
most common method to evaluate teaching
used for hiring, firing, tenure, promotion
simple, cheap, fast to administer
Part I: Basic Statistics
SET surveys are an incomplete census, not a random sample.
Suppose 70% of students respond, with an average of 4 on a 7-point scale. Then the class average could be anywhere between 3.1 & 4.9
"Margin of error" meaningless: not a random sample
Does a 3 mean the same thing to every student—even approximately?
Does a 5 mean the same thing in an upper-division architecture studio, a required freshman Econ course with 500 students, or an online Statistics course with no face-to-face interaction?
Is the difference between 1 & 2 the same as the difference between 5 & 6?
Does a 1 balance a 7 to make two 4s?
what about variability?
(1+1+7+7)/4 = (2+3+5+6)/4 = (1+5+5+5)/4 = (4+4+4+4)/4 = 4
Polarizing teacher ≠ teacher w/ mediocre ratings
3 statisticians go deer hunting …
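To make the variability point concrete, a small R sketch (mine, not from the slides) showing that the four rating patterns above have the same average but very different spreads:

```r
ratings <- list(c(1, 1, 7, 7), c(2, 3, 5, 6), c(1, 5, 5, 5), c(4, 4, 4, 4))
sapply(ratings, mean)  # 4 4 4 4: identical averages
sapply(ratings, sd)    # 3.46 1.83 2.00 0.00: very different spreads
```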
Averages make sense for interval scales, not ordinal scales like SET.
Averaging SET doesn't make sense.
Doesn't make sense to compare average SET across:
Shouldn't ignore variability or nonresponse.
Quantifauxcation
Part II: Science, Sciencism
If you can't prove what you want to prove, demonstrate something else and pretend they are the same thing. In the daze that follows the collision of statistics with the human mind, hardly anyone will notice the difference.
Darrell Huff
Hard to measure teaching effectiveness, so instead we measure student opinion (poorly) and pretend they are the same thing.
Should facilitate learning
Grades usually not a good proxy for learning
Students generally can't judge how much they learned
Serious problems with confounding
Faculty & students don't mean the same thing by "fair," "professional," "organized," "challenging," & "respectful"
Faculty & students don't mean the same thing by "fair," "professional," "organized," "challenging," & "respectful"
not fair means … | student % | instructor % |
---|---|---|
plays favorites | 45.8 | 31.7 |
grading problematic | 2.3 | 49.2 |
work is too hard | 12.7 | 0 |
won't "work with you" on problems | 12.3 | 0 |
other | 6.9 | 19 |
grant applications (e.g., Kaatz et al., 2014, Witteman et al., 2018)
letters of recommendation (e.g., Schmader et al., 2007, Madera et al., 2009)
job applications (e.g., Moss-Racusin et al., 2012, Reuben et al., 2014)
interruptions of job talks (e.g., Blair-Loy et al., 2017)
credit for joint work (e.g., Sarsons, 2015)
teaching evaluations (e.g., MacNell et al. 2015, Boring et al. 2016)
But I know some women who get great scores & win teaching awards!
I know SET aren't perfect, but they must have some connection to effectiveness.
I get better SET when I feel the class went better.
Shouldn't students have a voice?
Wagner et al., 2016
64% of students gave answers contradicting what they had demonstrated they knew was true.
The number of points on the rating scale affects gender differences
Part III: Deeper Dive
MacNell, Driscoll, & Hunt, 2014
NC State online course.
Students randomized into 6 groups, 2 taught by primary prof, 4 by GSIs.
2 GSIs: 1 male, 1 female.
GSIs used actual names in 1 section, swapped names in 1 section.
5-point scale.
Characteristic | M - F | perm P | t-test P |
---|---|---|---|
Overall | 0.47 | 0.12 | 0.128 |
Professional | 0.61 | 0.07 | 0.124 |
Respectful | 0.61 | 0.06 | 0.124 |
Caring | 0.52 | 0.10 | 0.071 |
Enthusiastic | 0.57 | 0.06 | 0.112 |
Communicate | 0.57 | 0.07 | NA |
Helpful | 0.46 | 0.17 | 0.049 |
Feedback | 0.47 | 0.16 | 0.054 |
Prompt | 0.80 | 0.01 | 0.191 |
Consistent | 0.46 | 0.21 | 0.045 |
Fair | 0.76 | 0.01 | 0.188 |
Responsive | 0.22 | 0.48 | 0.013 |
Praise | 0.67 | 0.01 | 0.153 |
Knowledge | 0.35 | 0.29 | 0.038 |
Clear | 0.41 | 0.29 | NA |
Test statistic: 2-sample t-test, following MacNell et al. (stratifying would be better)
Neyman model: 4 potential responses per student assigned to GSI
Null: response (incl nonresponse) to given GSI does not depend on name the GSI used
Randomization: conditions on students assigned to each actual GSI, matches actual randomization
Nonresponders fixed
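A minimal R sketch of a stratified permutation test along these lines (my own illustration, not MacNell et al.'s or Boring et al.'s code; the data-frame columns are assumptions). For simplicity it uses the difference in mean ratings, male-identified minus female-identified, as the test statistic, and re-randomizes the perceived-gender labels within each actual GSI, holding nonresponders fixed:

```r
# d: one row per student, with columns
#   rating       numeric SET item (NA = nonresponse, held fixed)
#   gsi          actual instructor of the student's section
#   name_gender  "M" or "F": gender of the name the GSI used in that section
perm_p <- function(d, B = 10000) {
  stat <- function(lab) mean(d$rating[lab == "M"], na.rm = TRUE) -
                        mean(d$rating[lab == "F"], na.rm = TRUE)
  obs <- stat(d$name_gender)
  perms <- replicate(B, {
    lab <- d$name_gender
    for (i in split(seq_along(lab), d$gsi))   # re-randomize within each actual GSI
      lab[i] <- sample(lab[i])
    stat(lab)
  })
  mean(perms >= obs)   # one-sided P-value for "male-identified rated higher"
}
```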
Mean grade and instructor gender (male minus female)
instructor gender | difference in means | P-value |
---|---|---|
Perceived | 1.76 | 0.54 |
Actual | -6.81 | 0.02 |
5 years of data for 6 mandatory freshman classes at Sciences Po:
History, Political Institutions, Microeconomics, Macroeconomics, Political Science, Sociology
23,001 SET, 379 instructors, 4,423 students, 1,194 sections (950 without PI), 21 year-by-course strata
response rate ~100%
anonymous finals except PI
interim grades before final
Correlation between SET and gender within each stratum, averaged across strata
Correlation between SET and average final exam score within each stratum, averaged across strata
Could have used, e.g., correlation within strata, combined with Fisher's combining function
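A minimal R sketch of the statistic used here (my illustration under assumed column names, not the paper's code): compute the correlation between SET and the covariate (gender, final exam score, ...) separately within each year-by-course stratum, average those correlations, and compare with the distribution obtained by permuting the covariate within strata. The real analysis permutes at the level dictated by the design; this only shows the mechanics:

```r
# d: columns set (SET score), x (covariate), stratum (year-by-course stratum)
avg_cor <- function(set, x, stratum)
  mean(sapply(split(data.frame(set, x), stratum), function(s) cor(s$set, s$x)),
       na.rm = TRUE)

perm_p <- function(d, B = 10000) {
  obs <- avg_cor(d$set, d$x, d$stratum)
  perms <- replicate(B, {
    x <- d$x
    for (i in split(seq_along(x), d$stratum)) x[i] <- sample(x[i])  # permute within strata
    avg_cor(d$set, x, d$stratum)
  })
  mean(abs(perms) >= abs(obs))   # two-sided P-value
}
```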
Average correlation between SET and final exam score
 | strata | ρ̄ | P |
---|---|---|---|
Overall | 26 (21) | 0.04 | 0.09 |
History | 5 | 0.16 | 0.01 |
Political Institutions | 5 | N/A | N/A |
Macroeconomics | 5 | 0.06 | 0.19 |
Microeconomics | 5 | -0.01 | 0.55 |
Political science | 3 | -0.03 | 0.62 |
Sociology | 3 | -0.02 | 0.61 |
Average correlation between SET and instructor gender
 | ρ̄ | P |
---|---|---|
Overall | 0.09 | 0.00 |
History | 0.11 | 0.08 |
Political institutions | 0.11 | 0.10 |
Macroeconomics | 0.10 | 0.16 |
Microeconomics | 0.09 | 0.16 |
Political science | 0.04 | 0.63 |
Sociology | 0.08 | 0.34 |
Average correlation between final exam scores and instructor gender
 | ρ̄ | P |
---|---|---|
Overall | -0.06 | 0.07 |
History | -0.08 | 0.22 |
Macroeconomics | -0.06 | 0.37 |
Microeconomics | -0.06 | 0.37 |
Political science | -0.03 | 0.70 |
Sociology | -0.05 | 0.55 |
Average correlation between SET and interim grades
 | ρ̄ | P |
---|---|---|
Overall | 0.16 | 0.00 |
History | 0.32 | 0.00 |
Political institutions | -0.02 | 0.61 |
Macroeconomics | 0.15 | 0.01 |
Microeconomics | 0.13 | 0.03 |
Political science | 0.17 | 0.02 |
Sociology | 0.24 | 0.00 |
lack of association between SET and final exam scores (negative result, so multiplicity not an issue)
lack of association between instructor gender and final exam scores (negative result, so multiplicity not an issue)
association between SET and instructor gender
association between SET and interim grades
Bonferroni's adjustment for four tests leaves the associations highly significant: adjusted P<0.01.
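For concreteness: Bonferroni multiplies each raw P-value by the number of tests (four here), capping at 1, so a raw P below 0.0025 stays below 0.01. A one-line R illustration with placeholder P-values (not the paper's):

```r
p.adjust(c(0.001, 0.002, 0.03, 0.2), method = "bonferroni")
# 0.004 0.008 0.120 0.800
```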
US data: controls for everything but the name, since it compares each TA with him/herself.
French data: controls for subject, year, teaching effectiveness
Aside: Permutation Tests and PRNGs
232≈4e9<13!≈6e9
232×624≈9e6010<<2084!≈4e6013
In R, unif_rand() is a 32-bit Mersenne Twister output times 2.3283064365386963e-10 (i.e., 2^-32).
```c
static R_INLINE double ru()
{
    double U = 33554432.0;
    return (floor(U*unif_rand()) + unif_rand())/U;
}

double R_unif_index(double dn)
{
    ....
    cut = 33554431.0; /* 2^25 - 1 */
    ....
    double u = dn > cut ? ru() : unif_rand();
    return floor(dn * u);
}
```
```c
static void walker_ProbSampleReplace(int n, double *p, int *a, int nans, int *ans)
{
    ....
    /* generate sample */
    for (i = 0; i < nans; i++) {
        rU = unif_rand() * n;
        k = (int) rU;
        ans[i] = (rU < q[k]) ? k+1 : a[k]+1;
    }
    ....
}
```
Example by Duncan Murdoch:

```r
> m <- (2/5)*2^32
> x <- sample(m, 1000000, replace = TRUE)
> table(x %% 2)
     0      1
399850 600150
```

Roughly: when the index is computed as floor(m · unif_rand()) with unif_rand() a multiple of 2^-32 and m = (2/5)·2^32, even index values are hit by three of every five consecutive 32-bit values and odd values by two, so after the +1 shift odd outputs appear about 60% of the time.
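Returning to the counting claim at the top of this aside, a quick R check of those figures on the log scale (mine, not from the slides):

```r
32 * log(2)         # log(2^32)       ~ 22.2
lfactorial(13)      # log(13!)        ~ 22.6  -> 2^32 < 13!
32 * 624 * log(2)   # log(2^(32*624)) ~ 13841
lfactorial(2084)    # log(2084!)      ~ 13847 -> far more orderings than Mersenne Twister states
```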
Part IV: Is there real controversy?
It is difficult to get a man to understand something, when his salary depends upon his not understanding it! —Upton Sinclair
Widely cited, but it's a technical report from IDEA, which sells SET.
Claims SET are reliable and valid.
Does not cite Carrell & West (2008) or Braga et al. (2011), randomized experiments published before B&C (2012)
As far as I can tell, no study B&C cite in support of validity used randomization.
Theoretically, the best indicant of effective teaching is student learning. Other things being equal, the students of more effective teachers should learn more.
I agree. We just can't measure/compare that in most universities.
Are SET more sensitive to effectiveness or to something else?
Do comparably effective women and men get comparable SET?
But for their gender, would women get higher SET than they do? (And but for their gender, would men get lower SET than they do?)
Need to compare like teaching with like teaching, not an arbitrary collection of women with an arbitrary collection of men.
Boring (2014) finds costs of increasing SET very different for men and women.
Ethnicity and race
Age
Attractiveness
Accents / non-native English speakers
"Halo effect"
…
strongly correlated with students' grade expectations
Boring et al., 2016; Johnson, 2003; Marsh & Cooper, 1980; Short et al., 2008; Worthington, 2002
strongly correlated with enjoyment (Stark, unpublished, 2012; 1,486 students)
Correlation btw instructor effectiveness & enjoyment: 0.75.
Correlation btw course effectiveness & enjoyment: 0.8.
correlated with instructor gender, ethnicity, attractiveness, & age
Anderson & Miller, 1997; Ambady & Rosenthal, 1993; Arbuckle & Williams, 2003; Basow, 1995; Boring, 2017; Boring et al., 2016;
Chisadza et al. 2019; Cramer & Alexitch, 2000; Marsh & Dunkin, 1992; MacNell et al., 2014;
Wachtel, 1998; Wallish & Cachia, 2018; Weinberg et al., 2007; Worthington, 2002
omnibus, abstract questions about curriculum design, effectiveness, etc., most influenced by factors unrelated to learning
Worthington, 2002
SET are not very sensitive to effectiveness; weak and/or negative association
Calling something "teaching effectiveness" does not make it so
Computing averages to 2 decimals doesn't make them reliable
Part V: What to do, then?
students' subjective experience of the course (e.g., did the student find the course challenging?)
personal observations (e.g., was the instructor’s handwriting legible?)
avoid abstract questions, omnibus questions, and questions requiring judgment, because they are particularly subject to biases related to instructor gender, age, and other characteristics protected by employment law
avoid questions that require evaluative judgments (e.g., how effective was the instructor?)
focus on things that affect learning and the learning experience
focus on things the instructor has control over
focus on things that can inform better pedagogy
multiple-choice items generally should provide free-form space to explain choice
interpret responses cautiously
I attended scheduled classes
Lectures helped me to understand the substance
I read the assigned textbook, lecture notes, or other materials
When did you do the readings?
The textbook, lecture notes or other course materials helped me understand the material
I completed the assignments.
The assignments helped me understand the material
I could understand what was being asked of me in assessments and assignments.
I attended office hours.
I found feedback (in class, on assignments, exams, term papers, presentations, etc.) helpful to understand how to improve.
What materials or activities did you find most useful? [lectures, recorded lectures, readings, assignments, ...]
I felt there were ways to get help, even if I did not take advantage of them.
I felt adequately prepared for the course.
If you did not feel prepared, had you taken the prerequisites listed for the course?
I felt that active participation in class was welcomed or encouraged by the instructor.
I could hear and/or understand the instructor in class.
I could read the instructor’s handwriting and/or slides.
Did physical aspects of the classroom (boards, lighting, projectors, sound system, seating) impede your ability to learn or participate? (Yes, no)
Compared to other courses at this level, I found this course … (more difficult, about the same, easier)
Compared to other courses with the same number of units, I found this course ... (more work, about the same, less work)
I enjoyed this course.
I found this course valuable or worthwhile.
Are you satisfied with the effort you put into this course?
Was this course in your (intended) major?
If this course was an elective outside your (intended) major, do you plan to take a sequel course in the discipline?
What would you have liked to have more of in the course?
What would you have liked to have less of in the course?
The instructor created an inclusive environment consistent with the university's diversity goals.
The structure of the course helped me learn.
Union arbitration in Newfoundland (Memorial U.)
Union arbitration in Ontario (OCUFA, Ryerson U.)
Civil litigation in Ohio (Miami U.)
Civil litigation in Vermont
Union arbitration in Florida (UFF, U. Florida)
Union grievance in California (Berkeley)
Discussions with several attorneys who want to pursue class actions
USC, University of Oregon, Colorado State University, Ontario
UC Berkeley: Division of Mathematical and Physical Sciences
The truth will set you free, but first it will piss you off.
—Gloria Steinem