Gender Bias In Graduate School Admissions - UC Berkeley Dataset

Study Of Simpson's Paradox

  • Dataset: 1973 UC-Berkeley Graduate School Admission Data
  • What is it? Simpson’s paradox, also called the Yule-Simpson effect, is a statistical effect that occurs when the marginal association between two categorical variables is qualitatively different from the partial association between the same two variables after controlling for one or more other variables. Simpson’s paradox is important for three reasons. First, people often expect statistical relationships to be immutable. They often are not. The relationship between two variables might increase, decrease, or even change direction depending on the set of variables being controlled. Second, Simpson’s paradox is not simply an obscure phenomenon of interest only to a small group of statisticians. It is one of a large class of association paradoxes. Third, Simpson’s paradox reminds researchers that causal inferences, particularly in nonexperimental studies, can be hazardous. Uncontrolled and even unobserved variables that would eliminate or reverse the association observed between two variables might exist.
  • Background: In 1973, the University of California-Berkeley (UC-Berkeley) was sued for sex discrimination. Its admission data showed that men applying to graduate school at UC-Berkeley were more likely to be admitted than women. The graduate schools had just accepted 44% of male applicants but only 35% of female applicants. The difference was so great that it was unlikely to be due to chance.
  • Wiki Information
  • Project Author: Amitrajit Bose
In [176]:
# Read the comma-separated file into a list of rows (the first row is the header)
dataset = []
with open('MLTutorial/Udacity/simpsons.txt') as file:
    for line in file:
        dataset.append(line.strip().split(','))
In [177]:
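import pprint
import pandas as pd

category = dataset[0]  # header row: Admit, Gender, Dept, Freq
data = dataset[1:]     # remaining rows; every field, including Freq, is still a string
#print(category)
#pprint.pprint(data)
df = pd.DataFrame(data=data, columns=category)
df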
Out[177]:
Admit Gender Dept Freq
0 Admitted Male A 512
1 Rejected Male A 313
2 Admitted Female A 89
3 Rejected Female A 19
4 Admitted Male B 353
5 Rejected Male B 207
6 Admitted Female B 17
7 Rejected Female B 8
8 Admitted Male C 120
9 Rejected Male C 205
10 Admitted Female C 202
11 Rejected Female C 391
12 Admitted Male D 138
13 Rejected Male D 279
14 Admitted Female D 131
15 Rejected Female D 244
16 Admitted Male E 53
17 Rejected Male E 138
18 Admitted Female E 94
19 Rejected Female E 299
20 Admitted Male F 22
21 Rejected Male F 351
22 Admitted Female F 24
23 Rejected Female F 317
In [178]:
# groupby sorts the keys, so index 0 is the Female group and index 1 is the Male group
maleFemale = list(df.groupby('Gender'))
maleFemale[1][1]
Out[178]:
Admit Gender Dept Freq
0 Admitted Male A 512
1 Rejected Male A 313
4 Admitted Male B 353
5 Rejected Male B 207
8 Admitted Male C 120
9 Rejected Male C 205
12 Admitted Male D 138
13 Rejected Male D 279
16 Admitted Male E 53
17 Rejected Male E 138
20 Admitted Male F 22
21 Rejected Male F 351
In [179]:
# Freq was read as text, so convert it to int before summing
males = maleFemale[1][1]['Freq'].astype(int).sum()
males
Out[179]:
2691
In [180]:
maleFemale=(list(df.groupby('Gender')))
maleFemale[0][1]
Out[180]:
Admit Gender Dept Freq
2 Admitted Female A 89
3 Rejected Female A 19
6 Admitted Female B 17
7 Rejected Female B 8
10 Admitted Female C 202
11 Rejected Female C 391
14 Admitted Female D 131
15 Rejected Female D 244
18 Admitted Female E 94
19 Rejected Female E 299
22 Admitted Female F 24
23 Rejected Female F 317
In [181]:
# Freq was read as text, so convert it to int before summing
females = maleFemale[0][1]['Freq'].astype(int).sum()
females
Out[181]:
1835
In [182]:
(males/(males+females), females/(males+females))  # share of male and female applicants
Out[182]:
(0.5945647370746796, 0.4054352629253204)
In [211]:
# Department-wise acceptance rates
deptStat = list(df.groupby('Dept'))

def acceptance(rows):
    # Percentage of the applicants in `rows` who were admitted
    freq = rows['Freq'].astype(int)
    return round((freq[rows['Admit'] == 'Admitted'].sum() / freq.sum()) * 100, 2)

stat = []
for dept, rows in deptStat:
    # Select by label rather than by group position
    maleRatio = acceptance(rows[rows['Gender'] == 'Male'])
    femRatio = acceptance(rows[rows['Gender'] == 'Female'])
    stat.append((dept, maleRatio, femRatio))

categ = ['Department', 'Male Acceptance (%)', 'Female Acceptance (%)']
df2 = pd.DataFrame(data=stat, columns=categ)
df2
Out[211]:
  Department  Male Acceptance (%)  Female Acceptance (%)
0          A                62.06                  82.41
1          B                63.04                  68.00
2          C                36.92                  34.06
3          D                33.09                  34.93
4          E                27.75                  23.92
5          F                 5.90                   7.04
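
The same department-wise table can also be produced without indexing into group positions. The following is a minimal sketch, assuming the counts are typed in directly from the table printed earlier rather than read from the file; the names `counts`, `long_df`, `wide` and `dept_rates` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Counts copied from the data table above: (admitted, rejected) per department and gender.
counts = {
    "A": {"Male": (512, 313), "Female": (89, 19)},
    "B": {"Male": (353, 207), "Female": (17, 8)},
    "C": {"Male": (120, 205), "Female": (202, 391)},
    "D": {"Male": (138, 279), "Female": (131, 244)},
    "E": {"Male": (53, 138), "Female": (94, 299)},
    "F": {"Male": (22, 351), "Female": (24, 317)},
}

# Rebuild the long format used in the notebook (hypothetical name: long_df).
rows = [
    (admit, gender, dept, freq)
    for dept, by_gender in counts.items()
    for gender, pair in by_gender.items()
    for admit, freq in zip(("Admitted", "Rejected"), pair)
]
long_df = pd.DataFrame(rows, columns=["Admit", "Gender", "Dept", "Freq"])

# One row per (Dept, Gender), one column per admission outcome.
wide = long_df.pivot_table(
    index=["Dept", "Gender"], columns="Admit", values="Freq", aggfunc="sum"
)

# Acceptance rate in percent, then one column per gender.
rate = (100 * wide["Admitted"] / (wide["Admitted"] + wide["Rejected"])).round(2)
dept_rates = rate.unstack("Gender")
print(dept_rates)
```

Selecting columns by label (`"Admitted"`, `"Male"`) keeps the result independent of the order in which pandas happens to sort the groups.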

Observations

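  • Male applicants (2691) far outnumbered female applicants (1835)
  • % of male applicants = 59.46
  • % of female applicants = 40.54
  • In departments A, B, D and F the female acceptance rate is higher than the male acceptance rate, yet aggregated over all six departments men were admitted at a higher rate (1198 of 2691, about 44.5%) than women (557 of 1835, about 30.4%). This reversal between the aggregate comparison and the department-level comparison is Simpson's Paradox.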
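
The reversal can be checked numerically by putting the aggregate rates next to the per-department comparison. This is a sketch under the assumption that the counts are entered by hand from the data table above; `table`, `t`, `male_rate`, `female_rate` and `women_ahead` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# (dept, male admitted, male rejected, female admitted, female rejected),
# copied from the data table above.
table = [
    ("A", 512, 313, 89, 19),
    ("B", 353, 207, 17, 8),
    ("C", 120, 205, 202, 391),
    ("D", 138, 279, 131, 244),
    ("E", 53, 138, 94, 299),
    ("F", 22, 351, 24, 317),
]
t = pd.DataFrame(table, columns=["Dept", "m_adm", "m_rej", "f_adm", "f_rej"])

# Aggregate acceptance rate per gender across all six departments.
sums = t.drop(columns="Dept").sum()
male_rate = round(100 * sums["m_adm"] / (sums["m_adm"] + sums["m_rej"]), 2)
female_rate = round(100 * sums["f_adm"] / (sums["f_adm"] + sums["f_rej"]), 2)

# Departments in which the female acceptance rate exceeds the male one.
f_dept = t["f_adm"] / (t["f_adm"] + t["f_rej"])
m_dept = t["m_adm"] / (t["m_adm"] + t["m_rej"])
women_ahead = t.loc[f_dept > m_dept, "Dept"].tolist()

print("Overall male acceptance (%):  ", male_rate)
print("Overall female acceptance (%):", female_rate)
print("Departments where women lead: ", women_ahead)
```

Men lead in the aggregate while women lead in four of the six departments, which is the pattern the observations describe.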

Conclusion

The research paper by Bickel et al. concluded that women tended to apply to competitive departments with low rates of admission even among qualified applicants (such as in the English Department), whereas men tended to apply to less-competitive departments with high rates of admission among the qualified applicants (such as in engineering and chemistry).
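
The six-department subset used in this notebook shows the same mechanism. A gender's overall acceptance rate is a weighted average of its department rates, weighted by the share of that gender's applications going to each department. The sketch below assumes hand-entered counts from the data table above, and all variable names are illustrative rather than taken from the original notebook.

```python
import pandas as pd

# (dept, male admitted, male rejected, female admitted, female rejected),
# copied from the data table above.
table = [
    ("A", 512, 313, 89, 19),
    ("B", 353, 207, 17, 8),
    ("C", 120, 205, 202, 391),
    ("D", 138, 279, 131, 244),
    ("E", 53, 138, 94, 299),
    ("F", 22, 351, 24, 317),
]
t = pd.DataFrame(
    table, columns=["Dept", "m_adm", "m_rej", "f_adm", "f_rej"]
).set_index("Dept")

m_apps = t["m_adm"] + t["m_rej"]  # male applications per department
f_apps = t["f_adm"] + t["f_rej"]  # female applications per department

summary = pd.DataFrame({
    "Dept admit rate (%)": 100 * (t["m_adm"] + t["f_adm"]) / (m_apps + f_apps),
    "Share of male applicants (%)": 100 * m_apps / m_apps.sum(),
    "Share of female applicants (%)": 100 * f_apps / f_apps.sum(),
}).round(1)
print(summary)

# Overall rate = sum over departments of (application share) x (department rate).
male_overall = ((m_apps / m_apps.sum()) * (t["m_adm"] / m_apps)).sum()
female_overall = ((f_apps / f_apps.sum()) * (t["f_adm"] / f_apps)).sum()
print(round(100 * male_overall, 2), round(100 * female_overall, 2))
```

In this subset about half of the male applications (1385 of 2691) went to departments A and B, the two with the highest admission rates, compared with roughly 7% of the female applications (133 of 1835). That difference in weights is enough to pull the male aggregate above the female one even though women have the higher rate in most departments.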