Time and Effort Dataset Synthesis

OUHSC Statistical Computing User Group

Will Beasley, Dept of Pediatrics,

Biomedical and Behavioral Methodology Core (BBMC)

2015-12-01

Goal

Combine three datasets that differ structurally and cosmetically. The state's data comes from three sources, each managed by a different agency.

| File | Description |
| --- | --- |
| nurse-month-oklahoma.csv | one row per nurse per month for Oklahoma County |
| month-tulsa.csv | one row per month for Tulsa County (i.e., it's already aggregated) |
| nurse-month-rural.csv | one row per nurse per month for the other 75 counties |

Structural Differences

| | Oklahoma | Tulsa | Rural | Approach |
| --- | --- | --- | --- | --- |
| Structure | one row per month per nurse | one row per month (it's already aggregated) | one row per month per nurse | dplyr's group_by() and summarize() |
| Contains PHI | Yes | No | Yes | Hash |
| Rename Fields | Yes | Yes | Yes | dplyr::rename() |
| Missing Values | No | No | Yes | compare county holes |
| Legit Holes | No | No | Yes | enumerate all combos and fill with zeros |
| Right Censored | Maybe | Maybe | No | group, sort, and zoo::rollmedian() |
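
As a rough illustration of the Structure row above, the nurse-level files can be collapsed to the same grain as the Tulsa file with group_by() and summarize(). This is only a sketch: the data frame and its column names (county, month, nurse_id, fte) are hypothetical, since the real files' field names are not shown here.

```r
library(dplyr)

# Hypothetical nurse-level records; column names are illustrative only.
ds_nurse_month <- data.frame(
  county   = "Oklahoma",
  month    = as.Date(c("2015-01-15", "2015-01-15", "2015-02-15")),
  nurse_id = c("a", "b", "a"),
  fte      = c(1.0, 0.5, 1.0),
  stringsAsFactors = FALSE
)

# Collapse to one row per county per month, matching the Tulsa structure.
ds_month <- ds_nurse_month %>%
  group_by(county, month) %>%
  summarize(
    fte         = sum(fte),            # total FTE across nurses that month
    nurse_count = n_distinct(nurse_id) # how many nurses contributed
  ) %>%
  ungroup()
```

Aggregating before stacking also drops the nurse identifier, so the combined dataset no longer carries that nurse-level field.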

Cosmetic Differences

| | Oklahoma | Tulsa | Rural | Approach |
| --- | --- | --- | --- | --- |
| Date | Year & Month separate | 1/15/2009 | 06/2012 | as.Date() format parameter |
| FTE Type | Proportion | Sum | Percentage | regex gsub() |
| Requires Linking Counties | Sorta | Sorta | Yes | Lookup Table & left join |
| Misspelled Counties | | | Yes | car::recode() or plyr::revalue() |
| Counties to Drop | No | No | Yes | blacklist |
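
The Date, FTE Type, and Requires Linking Counties rows above might be handled along these lines. Everything here is a sketch under assumptions: the choice of the 15th as the day of month, the "%" formatting of the rural FTE values, and the county_id key are all hypothetical, not taken from the real files.

```r
library(dplyr)

# Oklahoma: year and month arrive in separate fields; assume mid-month.
date_oklahoma <- as.Date(paste(2009, 1, 15, sep = "-"), format = "%Y-%m-%d")

# Tulsa: month/day/year.
date_tulsa <- as.Date("1/15/2009", format = "%m/%d/%Y")

# Rural: month/year only, so a day is pasted on before parsing.
date_rural <- as.Date(paste0("15/", "06/2012"), format = "%d/%m/%Y")

# Rural FTE is a percentage; strip the sign with a regex, then rescale
# to a proportion so it is comparable to the Oklahoma values.
fte_rural_raw <- c("100%", "50 %", "75%")
fte_rural     <- as.numeric(gsub("\\s*%", "", fte_rural_raw)) / 100

# Link county names through a lookup table and a left join.
ds_rural <- data.frame(county_id = c(1L, 2L, 1L), fte = fte_rural)
ds_lookup <- data.frame(
  county_id = c(1L, 2L),
  county    = c("Adair", "Alfalfa"),
  stringsAsFactors = FALSE
)
ds_rural <- left_join(ds_rural, ds_lookup, by = "county_id")
```

A left join keeps every rural record even when the lookup has no match, so unmatched counties surface as NA instead of silently disappearing.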

Demo

Stage 1 - Initial Stack
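
Once each source has the same columns and grain, the initial stack could be as simple as the following sketch. The three one-row data frames and their values are hypothetical stand-ins for the harmonized files.

```r
library(dplyr)

# Hypothetical harmonized inputs: same columns, one row per county per month.
ds_oklahoma <- data.frame(
  county = "Oklahoma", month = as.Date("2015-01-15"), fte = 12.5,
  stringsAsFactors = FALSE
)
ds_tulsa <- data.frame(
  county = "Tulsa", month = as.Date("2015-01-15"), fte = 9.0,
  stringsAsFactors = FALSE
)
ds_rural <- data.frame(
  county = "Adair", month = as.Date("2015-01-15"), fte = 1.5,
  stringsAsFactors = FALSE
)

# Stack the three sources; bind_rows() matches columns by name.
ds_stack <- bind_rows(ds_oklahoma, ds_tulsa, ds_rural)
```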


Stage 2 - Filled in missing records
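
One way to fill in absent records is to enumerate every county-month combination and join the observed data onto it. The sketch below assumes that an absent county-month is a legitimate hole and so belongs at zero; the data and column names are hypothetical.

```r
library(dplyr)

# Hypothetical stacked data: Alfalfa has no record for February.
ds <- data.frame(
  county = c("Adair", "Adair", "Alfalfa"),
  month  = as.Date(c("2015-01-15", "2015-02-15", "2015-01-15")),
  fte    = c(2, 1.5, 1),
  stringsAsFactors = FALSE
)

# Enumerate all county-by-month combinations.
ds_combos <- expand.grid(
  county = unique(ds$county),
  month  = unique(ds$month),
  stringsAsFactors = FALSE
)

# Join the observed values onto the full grid and fill the holes with zeros.
ds_filled <- ds_combos %>%
  left_join(ds, by = c("county", "month")) %>%
  mutate(fte = ifelse(is.na(fte), 0, fte)) %>%
  arrange(county, month)
```

If some holes are truly missing values, those rows would need to stay NA at this step so they can be estimated later.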


Stage 3 - Interpolated missing months
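
The interpolation method is not named here, so the sketch below swaps in plain linear interpolation with base R's approx() purely as an illustration. It assumes a single county with evenly spaced months; the values are made up.

```r
# Hypothetical single-county series with two missing months.
ds <- data.frame(
  month = seq(as.Date("2015-01-15"), by = "month", length.out = 5),
  fte   = c(10, NA, 14, NA, 18)
)

# Interpolate on the month index so every month counts as one equal step.
month_index <- seq_len(nrow(ds))
observed    <- !is.na(ds$fte)

ds$fte_interpolated <- approx(
  x    = month_index[observed],
  y    = ds$fte[observed],
  xout = month_index
)$y
```

With several counties, the same steps would run within each county after sorting by month.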


Stage 4 - Extrapolated right censored
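
A minimal sketch of the group, sort, and zoo::rollmedian() approach, assuming the right-censored months have already been set to NA and occur only at the end of each county's series. The helper extrapolate_tail(), the window of three months, and the data are all hypothetical.

```r
library(dplyr)
library(zoo)

# Hypothetical helper: replace missing values with the rolling median of
# the most recent k observed months.
extrapolate_tail <- function(fte, k = 3L) {
  observed <- fte[!is.na(fte)]
  rolling  <- zoo::rollmedian(observed, k = k, align = "right")
  fte[is.na(fte)] <- rolling[length(rolling)] # last window = latest k months
  fte
}

# Made-up data: the final month is censored in both counties.
ds <- data.frame(
  county = rep(c("Oklahoma", "Tulsa"), each = 6),
  month  = rep(seq(as.Date("2015-01-15"), by = "month", length.out = 6), times = 2),
  fte    = c(10, 12, 11, 13, 12, NA, 20, 22, 21, 23, 22, NA),
  stringsAsFactors = FALSE
)

# Sort, group by county, then extrapolate within each county.
ds_extrapolated <- ds %>%
  arrange(county, month) %>%
  group_by(county) %>%
  mutate(fte = extrapolate_tail(fte)) %>%
  ungroup()
```

A median is less sensitive than a mean to a single unusual month, which is one reason it might be preferred for carrying a recent level forward.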
