(I) Exposition
- (I.A) Ellis Island
  - Meta
- (I.B) Target-H
(II) Development
- (II.A)
  - (1) Schema sets
- (II.B) bmi
  - ALSA
  - LBSL
  - SATSA
  - SHARE
  - TILDA
(III) Recapitulation

This report lists the candidate variables for the DataSchema variables of the construct physique.

(I) Exposition

This report is a record of interaction with a data transfer object (dto) produced by ./manipulation/0-ellis-island.R.

The next section recaps this script, exposes the architecture of the DTO, and demonstrates the language of interacting with it.

(I.A) Ellis Island

All data land on Ellis Island.

The script 0-ellis-island.R is the first script in the analytic workflow. It accomplishes the following:

1. Reads in raw data files from the candidate studies.
2. Extracts, combines, and exports their metadata (specifically, variable names and labels, if provided) into ./data/shared/derived/meta-data-live.csv, which is updated every time the Ellis Island script is executed.
3. Augments the raw metadata with instructions for renaming and classifying variables. The instructions are provided as manually entered values in ./data/shared/meta-data-map.csv. They are used by automatic scripts in later harmonization and analysis.
4. Combines unit data and metadata into a single DTO that serves as the starting point for all subsequent analyses.
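
The shape of the object produced by the last step can be sketched with toy values. Everything below (study names, paths, variables) is hypothetical and only mirrors a list with the elements studyName, filePath, unitData, and metaData; it is not the code of 0-ellis-island.R.

```r
# Hypothetical study names and file paths (placeholders, not the real files)
study_names <- c("study_a", "study_b")
file_paths  <- c("./raw/study_a.sav", "./raw/study_b.sav")

# Unit-level data: one data frame per study, stored in a named list
unit_data <- list(
  study_a = data.frame(id = 1:3, WEIGHT = c(70, 82, NA)),
  study_b = data.frame(id = 1:2, HEIGHT94 = c(67, 71), WEIGHT94 = c(160, 170))
)

# Metadata: one row per variable per study, with a manually assigned construct
meta_data <- data.frame(
  study_name = c("study_a", "study_a", "study_b", "study_b", "study_b"),
  name       = c("id", "WEIGHT", "id", "HEIGHT94", "WEIGHT94"),
  construct  = c("id", "physique", "id", "physique", "physique"),
  stringsAsFactors = FALSE
)

# Combine everything into a single data transfer object
toy_dto <- list(
  studyName = study_names,
  filePath  = file_paths,
  unitData  = unit_data,
  metaData  = meta_data
)

names(toy_dto)
names(toy_dto[["unitData"]])
```

Keeping the unit data in a list named by study lets later scripts loop over studies by name.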

# load the product of 0-ellis-island.R,  a list object containing data and metadata
dto <- readRDS("./data/unshared/derived/dto.rds")

# the list is composed of the following elements
names(dto)

[1] "studyName" "filePath"  "unitData"  "metaData"

# 1st element - names of the studies as character vector
dto[["studyName"]]

[1] "alsa"  "lbsl"  "satsa" "share" "tilda"

# 2nd element - file paths of the data files for each study as character vector
dto[["filePath"]]

[1] "./data/unshared/raw/ALSA-Wave1.Final.sav"         "./data/unshared/raw/LBSL-Panel2-Wave1.Final.sav" 
[3] "./data/unshared/raw/SATSA-Q3.Final.sav"           "./data/unshared/raw/SHARE-Israel-Wave1.Final.sav"
[5] "./data/unshared/raw/TILDA-Wave1.Final.sav"

# 3rd element - is a list object containing the following elements
names(dto[["unitData"]])

[1] "alsa"  "lbsl"  "satsa" "share" "tilda"

# each of these elements is a raw data set of a corresponding study, for example
dplyr::tbl_df(dto[["unitData"]][["lbsl"]])

Source: local data frame [656 x 38]

        id AGE94 SEX94  MSTAT94 EDUC94     NOWRK94  SMK94                                         SMOKE
     (int) (int) (int)   (fctr)  (int)      (fctr) (fctr)                                        (fctr)
1  4001026    68     1 divorced     16 no, retired     no                                  never smoked
2  4012015    94     2  widowed     12 no, retired     no                                  never smoked
3  4012032    94     2  widowed     20 no, retired     no don't smoke at present but smoked in the past
4  4022004    93     2       NA     NA          NA     NA                                  never smoked
5  4022026    93     2  widowed     12 no, retired     no                                  never smoked
6  4031031    92     1  married      8 no, retired     no don't smoke at present but smoked in the past
7  4031035    92     1  widowed     13 no, retired     no don't smoke at present but smoked in the past
8  4032201    92     2       NA     NA          NA     NA don't smoke at present but smoked in the past
9  4041062    91     1  widowed      7          NA     no don't smoke at present but smoked in the past
10 4042057    91     2       NA     NA          NA     NA                                            NA
..     ...   ...   ...      ...    ...         ...    ...                                           ...
Variables not shown: ALCOHOL (fctr), WINE (int), BEER (int), HARDLIQ (int), SPORT94 (int), FIT94 (int), WALK94 (int),
  SPEC94 (int), DANCE94 (int), CHORE94 (int), EXCERTOT (int), EXCERWK (int), HEIGHT94 (int), WEIGHT94 (int), HWEIGHT
  (int), HHEIGHT (int), SRHEALTH (fctr), smoke_now (lgl), smoked_ever (lgl), year_of_wave (dbl), age_in_years (dbl),
  year_born (dbl), female (lgl), marital (chr), single (lgl), educ3 (chr), current_work_2 (lgl), current_drink (lgl),
  sedentary (lgl), poor_health (lgl)

(I.B) Target-H

Everybody wants to be somebody.

We query the metadata set to retrieve all variables potentially tapping the construct physique. These are the candidates to enter the DataSchema and contribute to computing harmonized variables.

NOTE: what is being retrieved depends on the manually entered values in the column construct of the metadata file ./data/shared/meta-data-map.csv. To specify a different group of variables, edit the metadata, not the script.

meta_data <- dto[["metaData"]] %>%
  dplyr::filter(construct %in% c('physique')) %>% 
  dplyr::select(study_name, name, construct, label_short, categories, url) %>%
  dplyr::arrange(construct, study_name)
knitr::kable(meta_data)

study_name	name	construct	label_short	categories
alsa	WEIGHT	physique	Weight in kilograms	NA
lbsl	HEIGHT94	physique	Height in Inches	NA
lbsl	WEIGHT94	physique	Weight in Pounds	NA
lbsl	HWEIGHT	physique	Self-reported weight in pounds	NA
lbsl	HHEIGHT	physique	Self-reported height in inches	NA
satsa	GHTCM	physique		NA
satsa	GWTKG	physique		NA
satsa	GPI	physique		NA
share	PH0130	physique	how tall are you?	NA
share	PH0120	physique	weight of respondent	NA
tilda	SR.HEIGHT.CENTIMETRES	physique	Height Centimetres	NA
tilda	HEIGHT	physique	Respondent height	NA
tilda	SR.WEIGHT.KILOGRAMMES	physique	Weight Kilogrammes	NA
tilda	WEIGHT	physique	Respondent weight	NA
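
The note above says that the set of retrieved variables is controlled by the metadata rather than by the script. A minimal base R sketch of that idea follows; the toy metadata and the helper candidates() are hypothetical and not part of the project code.

```r
# Toy metadata: HEIGHT94 has not yet been assigned to a construct
toy_meta <- data.frame(
  study_name = c("study_a", "study_a", "study_b"),
  name       = c("WEIGHT", "SMOKER", "HEIGHT94"),
  construct  = c("physique", "smoking", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: names of the variables assigned to a construct
candidates <- function(meta, target) {
  meta[meta$construct %in% target, "name"]
}

candidates(toy_meta, "physique")   # only WEIGHT is tagged so far

# "Editing the metadata": tag HEIGHT94 as physique, leave the query untouched
toy_meta$construct[toy_meta$name == "HEIGHT94"] <- "physique"
candidates(toy_meta, "physique")   # now WEIGHT and HEIGHT94
```

The query itself never changes; only the construct column does.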

View descriptives: physique for a closer examination of each candidate.

After reviewing descriptives and relevant codebooks, the following operationalization of the harmonized variable for physique has been adopted:

Target (1) : `bmi`

from metric:

bmi = weight_kg / height_m ^ 2

from imperial:

bmi = weight_lb * 703 / height_in ^ 2
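
As a worked example, the two formulas can be written as small R functions. The function names are hypothetical; the inputs are one weight and height pair in each unit system.

```r
# Hypothetical helpers implementing the two declared formulas
bmi_metric <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}
bmi_imperial <- function(weight_lb, height_in) {
  weight_lb * 703 / height_in^2
}

# Metric: 85 kg at 170 cm (height converted to metres first)
bmi_metric(85, 170 / 100)    # 29.41176

# Imperial: 160 lb at 67 in
bmi_imperial(160, 67)        # 25.05681

# The factor 703 is a rounded unit conversion: (kg per lb) / (m per in)^2
0.45359237 / 0.0254^2
```

The last expression evaluates to roughly 703.07, so the imperial formula agrees with the metric one up to the rounding of that factor.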

These variables will be generated next, in the Development section.

(II) Development

The particular goal of this section is to ensure that the schema for encoding the values of the physique variable is consistent across studies.

In this section we define the schema sets for harmonizing the physique construct, i.e., we specify which variables from which studies contribute to computing the harmonized variables. Each schema set has a particular pattern of possible response values on its variables, which we export for inspection as .csv tables. We then edit these .csv tables manually, populating new columns that map each response pattern of the schema set variables to a value of the harmonized variable. Finally, we import the harmonization rules encoded in the .csv tables and apply them to compute the harmonized variables, producing a dataset that combines raw and harmonized variables for the physique construct across studies.
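
The export, edit, and import cycle described above can be sketched in base R. The items, values, and mapping rule below are hypothetical, and the manual edit is simulated in code; this illustrates the workflow and is not the project's harmonization script.

```r
# Toy responses to two hypothetical categorical items from one study
toy <- data.frame(
  item_a = c("yes", "no", "no", "yes", "yes"),
  item_b = c("yes", "yes", "no", "yes", "no"),
  stringsAsFactors = FALSE
)

# 1. Collect the distinct response patterns and export them for inspection
patterns <- unique(toy)
path <- tempfile(fileext = ".csv")
write.csv(patterns, path, row.names = FALSE)

# 2. Stand-in for the manual edit: add a column with the harmonized value
rules <- read.csv(path, stringsAsFactors = FALSE)
rules$harmonized <- rules$item_a == "yes" | rules$item_b == "yes"
write.csv(rules, path, row.names = FALSE)

# 3. Import the edited table and apply it to the unit-level data
rules <- read.csv(path, stringsAsFactors = FALSE)
harmonized <- merge(toy, rules, by = c("item_a", "item_b"), all.x = TRUE)
harmonized
```

Keeping the rules in a table rather than in code lets them be reviewed and changed without touching the script.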

(II.A)

(1) Schema sets

Having all potential variables in categorical format, we have defined the sets of data schema variables thus:

(II.B) `bmi`

from metric:

bmi = weight_kg / height_m ^ 2

from imperial:

bmi = weight_lb * 703 / height_in ^ 2

ALSA

Items that can contribute to generating values for the harmonized variable bmi are:

dto[["metaData"]] %>%
  dplyr::filter(study_name=="alsa", construct %in% c("physique")) %>%
  dplyr::select(study_name, name, label,categories)

  study_name   name               label categories
1       alsa WEIGHT Weight in kilograms         NA

ALSA lacks a measure of height, so bmi cannot be calculated for this study. The variable is still created, filled with missing values, so that every study carries a bmi column.

dto[["unitData"]][["alsa"]] <- dto[["unitData"]][["alsa"]] %>% 
    dplyr::mutate(
      HEIGHT = NA_real_, # ALSA has no height measure; placeholder of missing values
      bmi = (WEIGHT)/ (HEIGHT^2))
# verify
dto[["unitData"]][["alsa"]] %>%
  dplyr::filter(id %in% sample(unique(id),10)) %>%
  dplyr::select_("id", "WEIGHT","HEIGHT", "bmi")

      id WEIGHT HEIGHT bmi
1  10761   73.2     NA  NA
2  18821   76.4     NA  NA
3  19401   89.1     NA  NA
4  19812   65.9     NA  NA
5  21501   66.4     NA  NA
6  25131   88.2     NA  NA
7  26302   65.5     NA  NA
8  29251   63.6     NA  NA
9  35831     NA     NA  NA
10 42861   66.8     NA  NA

LBSL

Items that can contribute to generating values for the harmonized variable bmi are:

dto[["metaData"]] %>%
  dplyr::filter(study_name == "lbsl", construct == "physique") %>%
  dplyr::select(study_name, name, label_short,categories)

  study_name     name                    label_short categories
1       lbsl HEIGHT94               Height in Inches         NA
2       lbsl WEIGHT94               Weight in Pounds         NA
3       lbsl  HWEIGHT Self-reported weight in pounds         NA
4       lbsl  HHEIGHT Self-reported height in inches         NA

We compute bmi according to the declared formula:

dto[["unitData"]][["lbsl"]] <- dto[["unitData"]][["lbsl"]] %>% 
  dplyr::mutate(bmi = (WEIGHT94 * 703)/ (HEIGHT94^2))
# verify
dto[["unitData"]][["lbsl"]] %>%
  dplyr::filter(id %in% sample(unique(id),10)) %>%
  dplyr::select_("id", "WEIGHT94","HEIGHT94", "bmi")

        id WEIGHT94 HEIGHT94      bmi
1  4051040       NA       NA       NA
2  4101168      160       67 25.05681
3  4111082      170       71 23.70760
4  4121036      150       67 23.49076
5  4141059      195       69 28.79332
6  4152078      160       61 30.22843
7  4162001      150       65 24.95858
8  4312027      140       64 24.02832
9  4462037      145       68 22.04477
10 4472001      175       62 32.00442

# graph
histogram_continuous(dto[["unitData"]][["lbsl"]],"bmi")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

SATSA

Items that can contribute to generating values for the harmonized variable bmi are:

dto[["metaData"]] %>%
  dplyr::filter(study_name == "satsa", construct == "physique") %>%
  dplyr::select(study_name, name, label_short,categories)

  study_name  name label_short categories
1      satsa GHTCM                     NA
2      satsa GWTKG                     NA
3      satsa   GPI                     NA

We compute bmi according to the declared formula:

dto[["unitData"]][["satsa"]] <- dto[["unitData"]][["satsa"]] %>% 
  dplyr::mutate(bmi = (GWTKG)/ ((GHTCM/100)^2))
# verify
dto[["unitData"]][["satsa"]] %>%
  dplyr::filter(id %in% sample(unique(id),10)) %>%
  dplyr::select_("id", "GWTKG","GHTCM","GPI", "bmi")

        id GWTKG GHTCM      GPI      bmi
1    16261    85   170 29.41016 29.41176
2   122131    61   151 26.75000 26.75321
3   134301    72   174 23.77734 23.78121
4   147872    75    NA       NA       NA
5   153711    58   160 22.65234 22.65625
6   175251    70   166 25.40234 25.40282
7   191271    73   168 25.86328 25.86451
8   225241    50   154 21.08203 21.08281
9  2176041    65   172 21.96875 21.97134
10 2301901    67   171 22.91016 22.91303
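
GPI carries no label in the metadata. In the sampled rows it tracks the computed bmi to within a few thousandths, which suggests that it is a body mass index supplied by the study; this is an assumption that the report does not confirm. The check below uses only the rows printed above.

```r
# Rows copied from the verification table above (id 147872 omitted: height missing)
check <- data.frame(
  GWTKG = c(85, 61, 72, 58, 70, 73, 50, 65, 67),
  GHTCM = c(170, 151, 174, 160, 166, 168, 154, 172, 171),
  GPI   = c(29.41016, 26.75000, 23.77734, 22.65234, 25.40234,
            25.86328, 21.08203, 21.96875, 22.91016)
)

# Recompute bmi from weight (kg) and height (cm converted to m)
check$bmi <- check$GWTKG / (check$GHTCM / 100)^2

# Largest absolute discrepancy between the study's GPI and the computed bmi
max(abs(check$bmi - check$GPI))
```

For these nine rows the largest discrepancy stays below 0.005.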

# graph
histogram_continuous(dto[["unitData"]][["satsa"]],"bmi")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

SHARE

Items that can contribute to generating values for the harmonized variable bmi are:

dto[["metaData"]] %>%
  dplyr::filter(study_name == "share", construct == "physique") %>%
  dplyr::select(study_name, name, label_short,categories)

  study_name   name          label_short categories
1      share PH0130    how tall are you?         NA
2      share PH0120 weight of respondent         NA

We compute bmi according to the declared formula:

# recode non-numeric flags
ds <- dto[["unitData"]][["share"]]
str(ds$PH0130); summary(ds$PH0130); table(ds$PH0130)

Classes 'labelled', 'numeric'  atomic [1:2598] 182 160 164 165 173 168 165 154 165 175 ...
  ..- attr(*, "value.labels")= Named num [1:2] 1e+07 1e+07
  .. ..- attr(*, "names")= chr [1:2] "don't know" "refusal"
  ..- attr(*, "label")= Named chr "how tall are you?"
  .. ..- attr(*, "names")= chr "PH0130"

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
     140      160      167   196800      172 10000000        4


    140     143     145     146     147     148     149     150     151     152     153     154     155     156     157 
      2       1       9       1       5       6       1      64       5      29      17      26      70      50      45 
    158     159     160     161     162     163     164     165     166     167     168     169     170     171     172 
     94      20     245      22     106      80     101     246      44     109     137      38     249      23     109 
    173     174     175     176     177     178     179     180     181     182     183     184     185     186     187 
     62      65     111      59      20      60      17      65      12      27      22       6      26      12      10 
    188     190     191     192     194     198     204 9999998 9999999 
      2       4       3       3       1       1       1       1      50

ds$PH0130 <- as.numeric(ds$PH0130)
# 9999998 and 9999999 are non-response codes, not heights; set them to missing
ds$PH0130 <- car::recode(ds$PH0130, "
            c(9999998,9999999) = NA
            ")
str(ds$PH0120); summary(ds$PH0120); table(ds$PH0120)

Classes 'labelled', 'numeric'  atomic [1:2598] 110 100 110 100 75 64 90 58 83 94 ...
  ..- attr(*, "value.labels")= Named num [1:2] 1e+07 1e+07
  .. ..- attr(*, "names")= chr [1:2] "don't know" "refusal"
  ..- attr(*, "label")= Named chr "weight of respondent"
  .. ..- attr(*, "names")= chr "PH0120"

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      0      65      74   18190      83 1000000       4


      0      10      32      41      42      43      44      45      46      47      48      49      50      51      52 
      2       1       1       1       3       3       2       1       1       1       4       5      32      10      23 
     53      54      55      56      57      58      59      60      61      62      63      64      65      66      67 
     32      29      43      21      31      48      26     106      28      53      61      60     131      31      55 
     68      69      70      71      72      73      74      75      76      77      78      79      80      81      82 
     68      19     185      23      86      63      70     156      37      36      82      24     149      12      60 
     83      84      85      86      87      88      89      90      91      92      93      94      95      96      97 
     41      42     102      21      28      30      13      95       2      23      17      11      42      16       9 
     98      99     100     101     102     103     104     105     106     107     108     109     110     112     114 
     17       2      47       1       5       2       1      14       1       1       4       1      14       2       2 
    115     117     118     120     125     128     130     135     140     146     150 1000000 
      5       1       2       8       2       1       3       1       1       1       1      47

# set the non-response code 1000000 and the implausible weights 0 and 10 to missing
ds$PH0120 <- car::recode(ds$PH0120, "
            c(0,10, 1000000) = NA
            ")
ds <- ds %>% 
  dplyr::mutate(bmi = (PH0120)/ ((PH0130/100)^2))
dto[["unitData"]][["share"]] <- ds
# verify
dto[["unitData"]][["share"]] %>%
  dplyr::filter(id %in% sample(unique(id),10)) %>%
  dplyr::select_("id", "PH0120" ,"PH0130", "bmi")

             id PH0120 PH0130      bmi
1  2.505202e+12     90    184 26.58318
2  2.505204e+12     32    140 16.32653
3  2.505211e+12     82    167 29.40227
4  2.505239e+12     95    167 34.06361
5  2.505248e+12     80    169 28.01022
6  2.505261e+12     70    170 24.22145
7  2.505281e+12     67    168 23.73866
8  2.505284e+12     72    165 26.44628
9  2.605276e+12     90    165 33.05785
10 2.605283e+12     90    170 31.14187
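
The recoding step matters because the non-response codes are stored as very large numbers, which is why the mean height in the summary above comes out as 196800. A minimal base R sketch of the same recoding, on a hypothetical toy vector and without car::recode, is:

```r
# Hypothetical height values in centimetres, with two sentinel codes mixed in
height <- c(182, 160, 9999998, 165, 9999999, NA)

# Replace the sentinel codes with NA; genuine values pass through unchanged
sentinels <- c(9999998, 9999999)
height[height %in% sentinels] <- NA

summary(height)
```

After the replacement, the summary reflects only the genuine heights.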

# graph
histogram_continuous(dto[["unitData"]][["share"]],"bmi")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

TILDA

Items that can contribute to generating values for the harmonized variable bmi are:

dto[["metaData"]] %>%
  dplyr::filter(study_name == "tilda", construct == "physique") %>%
  dplyr::select(study_name, name, label_short,categories)

  study_name                  name        label_short categories
1      tilda SR.HEIGHT.CENTIMETRES Height Centimetres         NA
2      tilda                HEIGHT  Respondent height         NA
3      tilda SR.WEIGHT.KILOGRAMMES Weight Kilogrammes         NA
4      tilda                WEIGHT  Respondent weight         NA

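We compute bmi according to the declared formula, taking WEIGHT and HEIGHT where available and falling back on the SR.-prefixed variables when they are missing:

ds <- dto[["unitData"]][["tilda"]]
# prefer WEIGHT / HEIGHT; fall back on the SR.-prefixed variable when missing
ds <- ds %>% 
  dplyr::mutate(
    weight = ifelse(
      !is.na(WEIGHT), WEIGHT, ifelse(
        !is.na(SR.WEIGHT.KILOGRAMMES), SR.WEIGHT.KILOGRAMMES, NA)),
    height = ifelse(
      !is.na(HEIGHT), HEIGHT, ifelse(
        !is.na(SR.HEIGHT.CENTIMETRES), SR.HEIGHT.CENTIMETRES, NA))
  ) 
# height is in centimetres; convert to metres before squaring
ds <- ds %>%
  dplyr::mutate(bmi = (weight)/ ((height/100)^2))
dto[["unitData"]][["tilda"]] <- ds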
# verify
dto[["unitData"]][["tilda"]] %>%
  dplyr::filter(id %in% sample(unique(id),10)) %>%
  dplyr::select_("id","weight", "height", "bmi")

                   id weight height      bmi
1  48281                  NA     NA       NA
2  61281               70.10  164.8 25.81093
3  102151              78.40  166.4 28.31454
4  283521                 NA     NA       NA
5  291281              80.25  165.3 29.36969
6  305281                 NA     NA       NA
7  405012              53.75  156.0 22.08662
8  433381              74.40  165.4 27.19581
9  448912              85.20  168.6 29.97260
10 457072              91.10  174.2 30.02079
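
The fallback logic used for weight and height can be checked on a toy example. The vectors primary and fallback are hypothetical stand-ins for a preferred variable and its substitute. The sketch also shows that the inner ifelse() is not strictly needed, since a missing fallback value is already NA.

```r
# Hypothetical preferred and substitute measurements for three respondents
primary  <- c(70.1, NA,   NA)
fallback <- c(71.0, 68.5, NA)

# Nested form, as in the code above
nested <- ifelse(!is.na(primary), primary,
                 ifelse(!is.na(fallback), fallback, NA))

# Equivalent single-level form
simple <- ifelse(!is.na(primary), primary, fallback)

nested
identical(nested, simple)
```

Both forms keep the primary value when it is present, use the fallback otherwise, and return NA only when both are missing.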

# graph
histogram_continuous(dto[["unitData"]][["tilda"]],"bmi")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

(III) Recapitulation

At this point the dto[["unitData"]] elements (raw data files for each study) have been augmented with the harmonized variable bmi.

dumlist <- list()
for(s in dto[["studyName"]]){
  ds <- dto[["unitData"]][[s]]
  dumlist[[s]] <- ds[,c("id","bmi")]
}
ds <- plyr::ldply(dumlist,data.frame,.id = "study_name")
head(ds)

  study_name  id bmi
1       alsa  41  NA
2       alsa  42  NA
3       alsa  61  NA
4       alsa  71  NA
5       alsa  91  NA
6       alsa 121  NA

ds$id <- 1:nrow(ds) # ids may repeat across studies; replace them with a running index
for(s in dto[["studyName"]]){
  print(s)
  print(summary(dto[["unitData"]][[s]]$bmi))
}

[1] "alsa"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
     NA      NA      NA     NaN      NA      NA    2087 
[1] "lbsl"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  9.683  23.050  25.740  26.540  29.210  48.820     105 
[1] "satsa"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  14.70   22.45   24.39   24.75   26.78   48.90      46 
[1] "share"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  14.69   23.88   26.09   26.67   29.00   53.78      82 
[1] "tilda"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  16.46   25.23   28.09   28.64   31.36   56.77    2372

Finally, having added the newly created harmonized variable to the raw source objects, we save the data transfer object.

# Save as a compressed, binary R dataset.  It's no longer readable with a text editor, but it saves metadata (e.g., factor information).
saveRDS(dto, file="./data/unshared/derived/dto.rds", compress="xz")
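
The claim that the binary format preserves metadata such as factor information can be checked with a round trip. The sketch below uses a hypothetical toy object and a temporary file, so the real dto.rds is not touched.

```r
# Hypothetical miniature object with a factor column
toy <- list(
  studyName = "demo",
  unitData  = list(demo = data.frame(id = 1:3, sex = factor(c("f", "m", "f"))))
)

# Write to a temporary file with the same compression setting, then read back
path <- tempfile(fileext = ".rds")
saveRDS(toy, file = path, compress = "xz")
restored <- readRDS(path)

# The restored object matches the original, factor levels included
identical(toy, restored)
levels(restored[["unitData"]][["demo"]]$sex)
```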
