This report lists the candidate variable for DataScheme variables of the construct age.
This report is a record of interaction with a data transfer object (dto) produced by
./manipulation/0-ellis-island.R
.
The next section recaps this script, exposes the architecture of the DTO, and demonstrates the language of interacting with it.
All data land on Ellis Island.
The script 0-ellis-island.R
is the first script in the analytic workflow. It accomplished the following:
./data/shared/derived/meta-data-live.csv
, which is updated every time Ellis Island script is executed../data/shared/meta-data-map.csv
. They are used by automatic scripts in later harmonization and analysis.# load the product of 0-ellis-island.R, a list object containing data and metadata
dto <- readRDS("./data/unshared/derived/dto.rds")
# the list is composed of the following elements
names(dto)
[1] "studyName" "filePath" "unitData" "metaData"
# 1st element - names of the studies as character vector
dto[["studyName"]]
[1] "alsa" "lbsl" "satsa" "share" "tilda"
# 2nd element - file paths of the data files for each study as character vector
dto[["filePath"]]
[1] "./data/unshared/raw/ALSA-Wave1.Final.sav" "./data/unshared/raw/LBSL-Panel2-Wave1.Final.sav"
[3] "./data/unshared/raw/SATSA-Q3.Final.sav" "./data/unshared/raw/SHARE-Israel-Wave1.Final.sav"
[5] "./data/unshared/raw/TILDA-Wave1.Final.sav"
# 3rd element - is a list object containing the following elements
names(dto[["unitData"]])
[1] "alsa" "lbsl" "satsa" "share" "tilda"
# each of these elements is a raw data set of a corresponding study, for example
dplyr::tbl_df(dto[["unitData"]][["lbsl"]])
Source: local data frame [656 x 27]
id AGE94 SEX94 MSTAT94 EDUC94 NOWRK94 SMK94 SMOKE
(int) (int) (int) (fctr) (int) (fctr) (fctr) (fctr)
1 4001026 68 1 divorced 16 no, retired no never smoked
2 4012015 94 2 widowed 12 no, retired no never smoked
3 4012032 94 2 widowed 20 no, retired no don't smoke at present but smoked in the past
4 4022004 93 2 NA NA NA NA never smoked
5 4022026 93 2 widowed 12 no, retired no never smoked
6 4031031 92 1 married 8 no, retired no don't smoke at present but smoked in the past
7 4031035 92 1 widowed 13 no, retired no don't smoke at present but smoked in the past
8 4032201 92 2 NA NA NA NA don't smoke at present but smoked in the past
9 4041062 91 1 widowed 7 NA no don't smoke at present but smoked in the past
10 4042057 91 2 NA NA NA NA NA
.. ... ... ... ... ... ... ... ...
Variables not shown: ALCOHOL (fctr), WINE (int), BEER (int), HARDLIQ (int), SPORT94 (int), FIT94 (int), WALK94 (int),
SPEC94 (int), DANCE94 (int), CHORE94 (int), EXCERTOT (int), EXCERWK (int), HEIGHT94 (int), WEIGHT94 (int), HWEIGHT
(int), HHEIGHT (int), SRHEALTH (fctr), smoke_now (lgl), smoked_ever (lgl)
# 4th element - a dataset names and labels of raw variables + added metadata for all studies
dto[["metaData"]] %>% dplyr::select(study_name, name, item, construct, type, categories, label_short, label) %>%
DT::datatable(
class = 'cell-border stripe',
caption = "This is the primary metadata file. Edit at `./data/shared/meta-data-map.csv",
filter = "top",
options = list(pageLength = 6, autoWidth = TRUE)
)
Everybody wants to be somebody.
We query metadata set to retrieve all variables potentially tapping the construct age
. These are the candidates to enter the DataSchema and contribute to computing harmonized variables.
NOTE: what is being retrieved depends on the manually entered values in the column construct
of the metadata file ./data/shared/meta-data-map.csv
. To specify a different group of variables, edit the metadata, not the script.
dto[["metaData"]] %>%
dplyr::filter(construct %in% c('age')) %>%
dplyr::select(-url, -label, -notes) %>%
dplyr::arrange(study_name, item) %>%
base::print()
study_name name label_short item construct type categories
1 alsa AGE Age age age demo 38
2 lbsl AGE94 Age in 1994 age_1994 age demo 65
3 satsa QAGE3 Age at current wave age_q3 age demo 879
4 satsa YRBORN Year born year_born age demo 62
5 share DN0030 Year born born_year age demo 57
6 tilda AGE age_interview age demo 33
View descriptives : age for closer examination of each candidates. After reviewing these descriptives and relevant codebooks, the following harmonization target for construct age
have been adopted.
Each study is to provide:
year_of_wave
- the calendaric year in which the measurement wave occured. These data values are added manually, after consulting respective study’s documentation.year_born
- the calendaric year in which the respondent was bornage_in_years
- the age of respondent in yearsThese variables will be generated next, in the Development section.
We export DataSchema variables for age
into a single object for easier handling.
get_these_variables <- c("id", "AGE","QAGE3","AGE94","YRBORN","DN0030")
dmls <- list() # dummy list
for(s in dto[["studyName"]]){
ds <- dto[["unitData"]][[s]] # get study data from dto
(varnames <- names(ds)) # see what variables there are
(variables_present <- varnames %in% get_these_variables) # variables on the list
dmls[[s]] <- ds[,variables_present] # keep only them
}
lapply(dmls, names) # view the contents of the list object
$alsa
[1] "id" "AGE"
$lbsl
[1] "id" "AGE94"
$satsa
[1] "id" "YRBORN" "QAGE3"
$share
[1] "id" "DN0030"
$tilda
[1] "id" "AGE"
Next, variables will be renamed and computed to fill the mold of the harmonization target.
# review existing variables
ds <- dmls[['alsa']]; head(ds)
id AGE
1 41 86
2 42 78
3 61 89
4 71 78
5 91 85
6 121 92
# # https://www.maelstrom-research.org/mica/study/alsa
ds$year_of_wave <- 1992
ds$AGE <- as.numeric(ds$AGE)
ds <- ds %>% # transform into harmonization target
dplyr::mutate(age_in_years = AGE, # rename
year_born = year_of_wave - age_in_years # compute
) %>%
dplyr::select(-AGE)
head(ds)
id year_of_wave age_in_years year_born
1 41 1992 86 1906
2 42 1992 78 1914
3 61 1992 89 1903
4 71 1992 78 1914
5 91 1992 85 1907
6 121 1992 92 1900
str(ds)
'data.frame': 2087 obs. of 4 variables:
$ id : int 41 42 61 71 91 121 181 201 221 261 ...
$ year_of_wave: num 1992 1992 1992 1992 1992 ...
$ age_in_years: num 86 78 89 78 85 92 74 80 99 85 ...
$ year_born : num 1906 1914 1903 1914 1907 ...
sapply(ds,summary)
id year_of_wave age_in_years year_born
Min. 41 1992 65.00 1889
1st Qu. 8376 1992 73.00 1909
Median 16330 1992 78.00 1914
Mean 17340 1992 78.16 1914
3rd Qu. 25210 1992 83.00 1919
Max. 43200 1992 103.00 1927
dmls[['alsa']] <- ds
# review existing variables
ds <- dmls[['lbsl']]; head(ds)
id AGE94
1 4001026 68
2 4012015 94
3 4012032 94
4 4022004 93
5 4022026 93
6 4031031 92
ds$AGE94 <- as.numeric(ds$AGE94)
# https://www.maelstrom-research.org/mica/study/lbls
ds$year_of_wave <- 1994
ds <- ds %>% # transform into harmonization target
dplyr::mutate(age_in_years = AGE94, # rename
year_born = year_of_wave - age_in_years # compute
) %>%
dplyr::select(-AGE94)
head(ds)
id year_of_wave age_in_years year_born
1 4001026 1994 68 1926
2 4012015 1994 94 1900
3 4012032 1994 94 1900
4 4022004 1994 93 1901
5 4022026 1994 93 1901
6 4031031 1994 92 1902
str(ds)
'data.frame': 656 obs. of 4 variables:
$ id : int 4001026 4012015 4012032 4022004 4022026 4031031 4031035 4032201 4041062 4042057 ...
$ year_of_wave: num 1994 1994 1994 1994 1994 ...
$ age_in_years: num 68 94 94 93 93 92 92 92 91 91 ...
$ year_born : num 1926 1900 1900 1901 1901 ...
sapply(ds,summary)
id year_of_wave age_in_years year_born
Min. 4001000 1994 30.00 1896
1st Qu. 4142000 1994 62.00 1913
Median 4252000 1994 70.00 1924
Mean 4266000 1994 68.66 1925
3rd Qu. 4341000 1994 81.00 1932
Max. 4972000 1994 98.00 1964
dmls[['lbsl']] <- ds # replace with augmented
# review existing variables
ds <- dmls[['satsa']]; head(ds)
id YRBORN QAGE3
1 2321 26 64.81331
2 2322 26 64.81331
3 2501 26 64.80330
4 2502 26 64.80330
5 2621 26 64.75332
6 11301 0 90.20333
ds$YRBORN <- as.numeric(ds$YRBORN)
ds$QAGE3 <- as.numeric(ds$QAGE3)
# https://www.maelstrom-research.org/mica/study/satsa
ds$year_of_wave <- 1991
ds <- ds %>% # transform into harmonization target
dplyr::mutate(age_in_years = QAGE3, # direct transform; (??) should it be rounded (??)
year_born = YRBORN + 1900 # compute
) %>%
dplyr::select(-QAGE3, - YRBORN )
head(ds)
id year_of_wave age_in_years year_born
1 2321 1991 64.81331 1926
2 2322 1991 64.81331 1926
3 2501 1991 64.80330 1926
4 2502 1991 64.80330 1926
5 2621 1991 64.75332 1926
6 11301 1991 90.20333 1900
str(ds)
'data.frame': 1497 obs. of 4 variables:
$ id : int 2321 2322 2501 2502 2621 11301 11501 11502 12501 12901 ...
$ year_of_wave: num 1991 1991 1991 1991 1991 ...
$ age_in_years: num 64.8 64.8 64.8 64.8 64.8 ...
$ year_born : num 1926 1926 1926 1926 1926 ...
sapply(ds,summary)
$id
Min. 1st Qu. Median Mean 3rd Qu. Max.
2321 152800 185500 824000 2144000 2445000
$year_of_wave
Min. 1st Qu. Median Mean 3rd Qu. Max.
1991 1991 1991 1991 1991 1991
$age_in_years
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
32.07 53.78 65.77 63.75 73.95 95.13 1
$year_born
Min. 1st Qu. Median Mean 3rd Qu. Max.
1900 1916 1925 1927 1937 1998
dmls[['satsa']] <- ds
# review existing variables
ds <- dmls[['tilda']]; head(ds)
id AGE
1 1091 80
2 1111 51
3 1112 51
4 1151 60
5 1281 72
6 1411 66
ds$AGE <- as.numeric(ds$AGE)
# https://www.maelstrom-research.org/mica/study/tilda
ds$year_of_wave <- 2009
ds <- ds %>% # transform into harmonization target
dplyr::mutate(age_in_years = AGE, # rename
year_born = year_of_wave - age_in_years # compute
) %>%
dplyr::select(-AGE)
str(ds)
'data.frame': 8504 obs. of 4 variables:
$ id : Factor w/ 8504 levels "100051 ",..: 159 170 174 217 398 611 614 815 1243 1333 ...
$ year_of_wave: num 2009 2009 2009 2009 2009 ...
$ age_in_years: num 80 51 51 60 72 66 63 78 67 77 ...
$ year_born : num 1929 1958 1958 1949 1937 ...
head(ds)
id year_of_wave age_in_years year_born
1 1091 2009 80 1929
2 1111 2009 51 1958
3 1112 2009 51 1958
4 1151 2009 60 1949
5 1281 2009 72 1937
6 1411 2009 66 1943
sapply(ds, summary)
$id
100051 10011 10012 10031 100381 10051
1 1 1 1 1 1
100571 100572 100611 100821 100822 100841
1 1 1 1 1 1
100861 100891 100892 101051 101071 101091
1 1 1 1 1 1
101131 101132 101151 101251 101281 101301
1 1 1 1 1 1
101302 10131 101361 101362 101411 101412
1 1 1 1 1 1
101431 101432 10151 101551 101591 101592
1 1 1 1 1 1
101631 101681 101701 101702 101731 101732
1 1 1 1 1 1
101751 101771 101772 101791 101792 101861
1 1 1 1 1 1
101862 101911 102011 102012 102071 102072
1 1 1 1 1 1
102131 102151 102152 102181 102182 102381
1 1 1 1 1 1
10251 10252 102571 102631 102632 102681
1 1 1 1 1 1
102771 102772 102821 103031 103201 10321
1 1 1 1 1 1
10322 103321 103381 103451 103501 103521
1 1 1 1 1 1
103571 103591 103631 103751 103752 10381
1 1 1 1 1 1
103821 103822 103861 104031 104032 104071
1 1 1 1 1 1
104111 104131 104132 104201 104251 104252
1 1 1 1 1 1
10451 10452 104521 (Other)
1 1 1 8405
$year_of_wave
Min. 1st Qu. Median Mean 3rd Qu. Max.
2009 2009 2009 2009 2009 2009
$age_in_years
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
49.00 55.00 62.00 62.97 70.00 80.00 12
$year_born
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1929 1939 1947 1946 1954 1960 12
dmls[['tilda']] <- ds
At this point the dto[["unitData"]]
elements (raw data files for each study) have been augmented with the harmonized variables year_of_wave
,age_in_years
, amd year_born
. We retrieve harmonized variables to view frequency counts across studies:
# convert the dummy list into a dataframe with study names as factor
ds <- plyr::ldply(dmls,data.frame,.id = "study_name")
ds$id <- 1:nrow(ds) # some ids values might be identical, replace
head(ds); str(ds)
study_name id year_of_wave age_in_years year_born
1 alsa 1 1992 86 1906
2 alsa 2 1992 78 1914
3 alsa 3 1992 89 1903
4 alsa 4 1992 78 1914
5 alsa 5 1992 85 1907
6 alsa 6 1992 92 1900
'data.frame': 15342 obs. of 5 variables:
$ study_name : Factor w/ 5 levels "alsa","lbsl",..: 1 1 1 1 1 1 1 1 1 1 ...
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ year_of_wave: num 1992 1992 1992 1992 1992 ...
$ age_in_years: num 86 78 89 78 85 92 74 80 99 85 ...
$ year_born : num 1906 1914 1903 1914 1907 ...
# augment dto with the harmonized age variables
for(s in dto[["studyName"]]){
d <- dto[["unitData"]][[s]]
d <- d %>%
dplyr::left_join(dmls[[s]], by = "id")
dto[["unitData"]][[s]] <- d
}
Finally, we have added the newly created, harmonized variables to the raw source objects and save the data transfer object.
# Save as a compress, binary R dataset. It's no longer readable with a text editor, but it saves metadata (eg, factor information).
saveRDS(dto, file="./data/unshared/derived/dto.rds", compress="xz")