rbindlist() for all of you list-of-dataframe monsters

ALAS! Why is gov’t data so terribly structured?!

I’ve been working with the CMS Open Payments data recently because the annual datasets offer enough granular info to allow for good modeling practice, and pharmaceutical marketing practices are pretty interesting from a policy standpoint.

My only gripe with the data is its poor structure and lack of standardization. Aside from running DQA and standardization scripts, I’ve been fairly involved in subdividing the original dataset into relational data tables that are much easier to work with. One of the most involved tasks was writing conditional reformatting scripts for the payment detail records, because CMS changed the release data structure between 2015 & 2016.

It’s not like the data structure was pretty to begin with, and although the new structure is a little easier to work with than the one used in PY13-15, the abrupt switch definitely adds a few extra headaches when preparing the data for analysis.

Here’s what the old structure looked like:

Here is the new one:

The drawback of the previous structure was that it lumped the Medicare/Medicaid coverage information for up to 10 reported products into a single field with only three factors (‘Covered’, ‘Non-covered’ and ‘Mixed’). The new 2016 structure fixed this issue by splitting coverage information into independent ‘Covered_or_Noncovered_Indicator’ fields, but the one-to-many relationship between a single transaction and the marketed products reported in a single row still made running queries inefficient.
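To make the one-row-to-many-products problem concrete, here’s a minimal sketch of reshaping that kind of wide layout into one row per product with data.table::melt(). The column names below are made up for illustration (loosely modeled on the CMS fields), not the actual CMS headers:

```r
library(data.table)

# Hypothetical single-row transaction with repeated product "slots",
# mimicking (not reproducing) the 2016 wide layout
txn <- data.table(
  Record_ID = 1001,
  Name_of_Associated_Product_1       = "DrugA",
  Covered_or_Noncovered_Indicator_1  = "Covered",
  Name_of_Associated_Product_2       = "DrugB",
  Covered_or_Noncovered_Indicator_2  = "Non-Covered"
)

# Melt the repeated slots into one row per product, pairing each
# name column with its matching coverage-indicator column
long <- melt(txn,
             id.vars       = "Record_ID",
             measure.vars  = patterns("^Name_of_Associated", "^Covered_or_Noncovered"),
             value.name    = c("Product_Name", "Coverage"),
             variable.name = "Product_Slot")
```

With the products stacked long like this, coverage queries become simple row filters instead of ten-column scans.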

To get over this hurdle, you have to write two separate conditional sorting protocols – one for PY2013-2015 and one for 2016 – due to the varying levels of data granularity.

Here’s a quick mock up of the sorting protocol for 2013-2015:


Yeah, man … not pretty.

The most efficient method I found to write & test the procedure was to individually process each transactional record into its own small dataframe and store it in a pre-allocated list. That way, I could easily spot-check original transaction records against their corresponding reformatted dataframes by referencing the same object number in the CMS dataset and the output list. Although this method is extremely fast if you wrap your sorting logic into a custom function and then process everything using an apply function, the downside is that you get a massive list object in the end.

Here's a sample of the code...
 # Create temporary dataframe to hold non-product transactions
 pydet_13.npp = as.data.frame(matrix(NA, nrow = nrow(pydat_13[pydat_13$Product_Indicator == "None",]), ncol = 6))
 names(pydet_13.npp) = c("Payment_ID","Payment_Type","Product_Class","Product_Coverage", "Product_Name", "Product_NDC")
 pydet_13.npp$Payment_ID = pydat_13$Payment_ID[pydat_13$Product_Indicator == "None"]
 pydet_13.npp$Payment_Type = "NONPRODUCT"

# Create temporary data-frame list to hold product transactions
 ppp = function(payment){
 if(payment$Product_Indicator == "Covered"){
 # Sort Logic for 'Covered' payment txns
 # ... 1) Determine the number of mentioned products
 # ... if >=1 then continue with normal processing
 # ... if 0 then assume that name contents are located within the NDC-fields

# Determine the Number of Mentioned Products
 ppos = which(is.blank(payment[1,grep("^Name_of_Associated", names(payment))]) == FALSE)
 #  :: Relative Table Position of Product Names
 # 1-5 = Drug/Biological
 # 6-10 = Medical Device/Supply

if(length(ppos) >= 1){
 # Get Product Names
 pnam = as.character(payment[1,grep("^Name_of_Associated", names(payment))][ppos])

# Get Corresponding NDC Data
 pndc = as.character(payment[1, grep("^NDC_of_Associated", names(payment))][ppos[ppos < 6]])
 pndc = c(pndc, rep(NA, (length(pnam) - length(pndc)))) # Match array length to 'pnam'

 # Validate NDC Values; should contain only numeric characters
 pndc[grep("[^[:digit:]]", pndc)] = NA
 pndc[pndc == ""] = NA

 # Set Corresponding Product Classes
 pcls = rep("Drug/Biological", length(ppos))
 pcls[which(ppos > 5)] = "Medical Device/Supply"

# Create Output Dataframe
 df.out = as.data.frame(matrix(NA, nrow = length(ppos), ncol = 6))
 names(df.out) = c("Payment_ID","Payment_Type","Product_Class","Product_Coverage", "Product_Name", "Product_NDC")
 df.out$Payment_ID = payment$Payment_ID[1]
 df.out$Payment_Type = "PRODUCT"
 df.out$Product_Class = pcls
 df.out$Product_Coverage = payment$Product_Indicator[1]
 df.out$Product_Name = pnam
 df.out$Product_NDC = pndc

}else {
 # Condition where names are inappropriately included in NDC-field
 # Quantified as less than .01% of records in PY13
 # Get names in NDC Field
 ppos = which(is.blank(payment[1,grep("NDC_of_Associated", names(payment))]) == FALSE)
 pnam = as.character(payment[1,grep("NDC_of_Associated", names(payment))][ppos])
 if(length(pnam)<1){pnam = "UNKNOWN"; ppos = 1} # When no name information is available

# Set Class as 'Unknown' because both drugs & devices have been observed in error fields
 pcls = rep("UNKNOWN", length(pnam))

# Create Output Dataframe
 df.out = as.data.frame(matrix(NA, nrow = 1, ncol = 6))
 names(df.out) = c("Payment_ID","Payment_Type","Product_Class","Product_Coverage", "Product_Name", "Product_NDC")
 df.out$Payment_ID = payment$Payment_ID[1]
 df.out$Payment_Type = "PRODUCT"
 df.out$Product_Class = "UNKNOWN"
 df.out$Product_Coverage = payment$Product_Indicator[1]
 df.out$Product_Name = pnam
 df.out$Product_NDC = rep(NA, length(ppos))

}

}else if(payment$Product_Indicator == "Non-Covered"){
 # Determine the Number of Disclosed Products
 # ...If products are disclosed, they are assumed to be 'Non-Covered' products
 # ...If no products are disclosed, a dummy entry is made for one (1) 'Undisclosed' product,
 # though the payment may be associated with more than one; there is no way to determine this from the data.

ppos = which(is.blank(payment[1,grep("^Name_of_Associated", names(payment))]) == FALSE)
 #  :: Relative Table Position of Product Names
 # 1-5 = Drug/Biological
 # 6-10 = Medical Device/Supply

if(length(ppos) >= 1){
 # Condition: Product names are disclosed
 # Use standard 'covered' product sorting method, with necessary adjustments

pnam = as.character(payment[1,grep("^Name_of_Associated", names(payment))][ppos])

# Get Corresponding NDC Data
 pndc = as.character(payment[1, grep("^NDC_of_Associated", names(payment))][ppos[ppos < 6]])
 pndc = c(pndc, rep(NA, (length(pnam) - length(pndc)))) # Match array length to 'pnam'

 # Validate NDC Values; should contain only numeric characters
 pndc[grep("[^[:digit:]]", pndc)] = NA
 pndc[pndc == ""] = NA

 # Set Corresponding Product Classes
 pcls = rep("Drug/Biological", length(ppos))
 pcls[which(ppos > 5)] = "Medical Device/Supply"

# Create Output Dataframe
 df.out = as.data.frame(matrix(NA, nrow = length(ppos), ncol = 6))
 names(df.out) = c("Payment_ID","Payment_Type","Product_Class","Product_Coverage", "Product_Name", "Product_NDC")
 df.out$Payment_ID = payment$Payment_ID[1]
 df.out$Payment_Type = "PRODUCT"
 df.out$Product_Class = pcls
 df.out$Product_Coverage = payment$Product_Indicator[1]
 df.out$Product_Name = pnam
 df.out$Product_NDC = pndc

}else{
 # Condition: Product names are not disclosed
 # Create dummy-record for 'Undisclosed' product
 df.out = as.data.frame(matrix(NA, nrow = 1, ncol = 6))
 names(df.out) = c("Payment_ID","Payment_Type","Product_Class","Product_Coverage", "Product_Name", "Product_NDC")
 df.out$Payment_ID = payment$Payment_ID[1]
 df.out$Payment_Type = "PRODUCT"
 df.out$Product_Class = "UNKNOWN"
 df.out$Product_Coverage = payment$Product_Indicator[1]
 df.out$Product_Name = "UNKNOWN"
 df.out$Product_NDC = NA

}

}else if(payment$Product_Indicator == "Combination"){
 # Assumes that all disclosed products are covered and that there is one (1) undisclosed non-covered product

ppos = which(is.blank(payment[1,grep("^Name_of_Associated", names(payment))]) == FALSE)
 #  :: Relative Table Position of Product Names
 # 1-5 = Drug/Biological
 # 6-10 = Medical Device/Supply

if(length(ppos) >= 1){
 # Condition: Product names are disclosed
 # Use standard 'covered' product sorting method, with necessary adjustments

pnam = as.character(payment[1,grep("^Name_of_Associated", names(payment))][ppos])

# Get Corresponding NDC Data
 pndc = as.character(payment[1, grep("^NDC_of_Associated", names(payment))][ppos[ppos < 6]])
 pndc = c(pndc, rep(NA, (length(pnam) - length(pndc)))) # Match array length to 'pnam'

 # Validate NDC Values; should contain only numeric characters
 pndc[grep("[^[:digit:]]", pndc)] = NA
 pndc[pndc == ""] = NA

 # Set Corresponding Product Classes
 pcls = rep("Drug/Biological", length(ppos))
 pcls[which(ppos > 5)] = "Medical Device/Supply"

# Create Output Dataframe
 df.out = as.data.frame(matrix(NA, nrow = (length(ppos) + 1), ncol = 6))
 df.len = nrow(df.out)
 names(df.out) = c("Payment_ID","Payment_Type","Product_Class","Product_Coverage", "Product_Name", "Product_NDC")
 df.out$Payment_ID = payment$Payment_ID[1]
 df.out$Payment_Type = "PRODUCT"
 df.out$Product_Class[1:(df.len - 1)] = pcls
 df.out$Product_Coverage = payment$Product_Indicator[1]
 df.out$Product_Name[1:(df.len - 1)] = pnam
 df.out$Product_NDC[1:(df.len - 1)] = pndc

df.out$Product_Class[df.len] = "UNKNOWN"
 df.out$Product_Name[df.len] = "UNKNOWN"
 df.out$Product_NDC[df.len] = NA

}else{
 # Condition: Product names are not disclosed
 # Create dummy-record for 'Undisclosed' product
 df.out = as.data.frame(matrix(NA, nrow = 1, ncol = 6))
 names(df.out) = c("Payment_ID","Payment_Type","Product_Class","Product_Coverage", "Product_Name", "Product_NDC")
 df.out$Payment_ID = payment$Payment_ID[1]
 df.out$Payment_Type = "PRODUCT"
 df.out$Product_Class = "UNKNOWN"
 df.out$Product_Coverage = payment$Product_Indicator[1]
 df.out$Product_Name = "UNKNOWN"
 df.out$Product_NDC = NA

}

}else{
 # Return Warning for Unknown payment-type
 warning(paste("Unknown Product_Indicator for Payment_ID:", as.character(payment$Payment_ID[1])))
 df.out = NULL # Avoid returning an undefined object for unrecognized types
 }

df.out
 }

# Payment Processing Test -> PYDAT_13 txns
 prodlist = which(pydat_13$Product_Indicator != "None")

paylist_13 = vector(mode = "list",length = length(prodlist))
 system.time({
 for(i in 1:length(prodlist)){
 paylist_13[[i]] = ppp(pydat_13[prodlist[i], ])

if((i%%10000) == 0){print(i)}
 }
 })
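For what it’s worth, the apply-function variant mentioned earlier removes the need for pre-allocation entirely, since lapply() builds the list for you. Here’s a self-contained sketch of the pattern – toy_ppp() and toy_dat below are hypothetical stand-ins for illustration, not the real CMS objects:

```r
# Toy stand-in for the real ppp() sorter: one input row -> one small dataframe
toy_ppp <- function(row) data.frame(Payment_ID = row$Payment_ID, Payment_Type = "PRODUCT")

toy_dat  <- data.frame(Payment_ID = 1:5,
                       Product_Indicator = c("Covered", "None", "Covered", "None", "Covered"))
prod_idx <- which(toy_dat$Product_Indicator != "None")

# lapply builds the list directly; no pre-allocation or loop counter needed
out_list <- lapply(prod_idx, function(i) toy_ppp(toy_dat[i, ]))

# Spot check: list element k corresponds to the k-th product transaction
out_list[[1]]$Payment_ID  # 1
```

The element-by-element correspondence between input rows and list entries is what makes the spot checks described above easy.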

The 2013 data contained approximately 3M transactions, resulting in a list of dataframes that was 8.4 GB. Besides the inherent query challenges associated with lists of dataframes, they are a pretty impractical way to store information.
The list had to be reprocessed into a single dataframe, but the traditional methods like do.call("rbind", listOfDataFrames) and plyr::ldply(listOfDataFrames, data.frame) are too slow when your list is larger than 500 MB.

I tried them initially and then gave up after sitting and staring at the processing screen for 20 minutes.

Luckily, there’s a much faster method: as.data.frame(data.table::rbindlist(listOfDataFrames))

The ‘new’ data.table method created the final dataframe in about 5 minutes, and the result was only 0.4 GB in size. That’s a 95% reduction in object size. I wasn’t expecting that, but it makes sense conceptually, since each individual dataframe has a fixed memory cost that adds up as you create hundreds of thousands of them.
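That fixed per-dataframe cost is easy to demonstrate in miniature (a toy illustration, not the CMS data): a list of many one-row dataframes occupies far more memory than the same rows combined into a single dataframe, because every dataframe repeats its own attribute baggage (names, row.names, class).

```r
library(data.table)

# Toy illustration: 1,000 one-row dataframes vs. the same rows combined
small_dfs <- lapply(1:1000, function(i) data.frame(id = i, val = i * 2))
combined  <- as.data.frame(rbindlist(small_dfs))

list_size     <- as.numeric(object.size(small_dfs))
combined_size <- as.numeric(object.size(combined))

# The list is many times larger than the combined dataframe
list_size / combined_size
```

The exact ratio depends on row width, but the per-object overhead dominates whenever the individual dataframes are tiny.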

I bet that if I had let the do.call method run its course, it would have taken at least 1.5 hrs.
If you’re interested in actual benchmarks, here’s a snippet from the stackoverflow page where I found this technique:

library(rbenchmark)
benchmark(
  do.call = do.call("rbind", listOfDataFrames),
  plyr_rbind.fill = plyr::rbind.fill(listOfDataFrames), 
  plyr_ldply = plyr::ldply(listOfDataFrames, data.frame),
  data.table_rbindlist = as.data.frame(data.table::rbindlist(listOfDataFrames)),
  replications = 100, order = "relative", 
  columns=c('test','replications', 'elapsed','relative')
  )
                  test replications elapsed relative
4 data.table_rbindlist          100    0.11    1.000
1              do.call          100    9.39   85.364
2      plyr_rbind.fill          100   12.08  109.818
3           plyr_ldply          100   15.14  137.636

Here’s a glimpse of what the final output datatable looks like:

All in all, I am happy with the results. It’s safe to say I never want to go through this ordeal ever again (I definitely will).
