ALAS! Why is gov’t data so terribly structured?!
I’ve been working with the CMS Open Payments data recently because the annual datasets offer enough granular info to allow for good modeling practices, and because pharmaceutical marketing practices are pretty interesting from a policy standpoint.
My only gripe with the data is its poor structure and lack of standardization. Aside from running DQA and standardization scripts, I’ve been fairly involved in subdividing the original dataset into relational data tables that are much easier to work with. One of the most involved tasks I had to do was write conditional reformatting scripts for the payment detail records, because CMS changed the release data structure between 2015 and 2016.
It’s not like the data structure was pretty to begin with, and although the new structure is a little easier to work with than what was used in PY13-15, the abrupt switch definitely adds a few extra headaches in preparing the data for analysis.
Here’s what the old structure looked like:
[Screenshot: legacy PY2013-2015 record layout]
Here is the new one:
[Screenshot: new 2016 record layout]
The drawback of the legacy structure was that it lumped the Medicare/Medicaid coverage information into a single “Product Indicator” field: up to 10 products could be reported for a single transaction, yet their joint coverage was reported only as ‘Covered’, ‘Non-Covered’, or ‘Combination’ (with ‘None’ marking non-product payments).
The new 2016 structure fixed this issue by splitting coverage information into independent ‘Covered_or_Noncovered_Indicator’ fields, but it still encapsulates the one-to-many relationship between transaction and product information in a single row, which keeps queries inefficient.
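To make the difference concrete, here’s a rough mock-up of both layouts in R. The column names are abbreviated stand-ins for the real CMS headers, and only a few of the ten product slots are shown:

# Legacy layout (PY2013-2015): one row per transaction, ONE shared coverage
# flag, and up to 10 name slots (1-5 drugs/biologicals, 6-10 devices/supplies)
legacy_row = data.frame(
  Payment_ID = "P0001",
  Product_Indicator = "Combination", # 'Covered', 'Non-Covered', 'Combination', or 'None'
  Name_of_Associated_Drug1 = "DRUG-A",
  NDC_of_Associated_Drug1 = "12345678901",
  Name_of_Associated_Device6 = "DEVICE-X",
  stringsAsFactors = FALSE
)

# 2016 layout: still one row per transaction, but each product slot
# carries its own Covered_or_Noncovered_Indicator
new_row = data.frame(
  Payment_ID = "P0002",
  Covered_or_Noncovered_Indicator_1 = "Covered",
  Product_Name_1 = "DRUG-A",
  Covered_or_Noncovered_Indicator_2 = "Non-Covered",
  Product_Name_2 = "DEVICE-X",
  stringsAsFactors = FALSE
)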
To get over this hurdle, I had to write separate conditional sorting protocols – one for PY2013-2015 and one for 2016 – to handle the differing levels of data granularity.
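The 2016 protocol is the simpler of the two: since every product slot carries its own coverage flag, sorting reduces to a loop over slots. Here’s a minimal sketch, with made-up slot-wise column names (Product_Name_k, Product_Class_k, Product_NDC_k) standing in for the real 2016 headers:

ppp_16 = function(payment){
  out = vector(mode = "list", length = 5)
  for(k in 1:5){
    # Hypothetical slot-wise field names; substitute the actual 2016 headers
    pnam = as.character(payment[1, paste0("Product_Name_", k)])
    if(is.na(pnam) || trimws(pnam) == ""){ next } # Skip empty product slots
    out[[k]] = data.frame(
      Payment_ID = payment$Payment_ID[1],
      Payment_Type = "PRODUCT",
      Product_Class = as.character(payment[1, paste0("Product_Class_", k)]),
      Product_Coverage = as.character(payment[1, paste0("Covered_or_Noncovered_Indicator_", k)]),
      Product_Name = pnam,
      Product_NDC = as.character(payment[1, paste0("Product_NDC_", k)]),
      stringsAsFactors = FALSE
    )
  }
  # Unused slots stay NULL and are skipped by rbindlist()
  as.data.frame(data.table::rbindlist(out))
}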
Here’s a quick mock-up of the sorting protocol for 2013-2015:
[Flowchart: PY2013-2015 conditional sorting protocol]
Yeah, man … not pretty.
The most efficient method I found to write & test the procedure was to individually process each transactional record into its own small dataframe and store it in a pre-allocated list.
That way, I could easily do spot checks between original transaction records and their corresponding reformatted dataframes by referencing the same object number in the CMS dataset and output list. Although this method is extremely fast if you wrap your sorting logic into a custom function and then process everything using an apply function, it has one serious drawback.
Specifically, you end up with a massive list object.
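One note before the code: it calls is.blank(), a small helper that isn’t part of base R. I’m assuming it simply flags NA and empty-string cells; a minimal version would be:

# Assumed helper (not base R): TRUE for NA or empty-string cells.
# unlist() lets it handle the one-row dataframe slices used below.
is.blank = function(x){
  x = as.character(unlist(x))
  is.na(x) | trimws(x) == ""
}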
Here's a sample of the code...
# Create temporary dataframe to hold non-product transactions
pydet_13.npp = as.data.frame(matrix(NA, nrow = nrow(pydat_13[pydat_13$Product_Indicator == "None",]), ncol = 6))
names(pydet_13.npp) = c("Payment_ID","Payment_Type","Product_Class","Product_Coverage", "Product_Name", "Product_NDC")
pydet_13.npp$Payment_ID = pydat_13$Payment_ID[pydat_13$Product_Indicator == "None"]
pydet_13.npp$Payment_Type = "NONPRODUCT"
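# Note: the remaining product fields (class, coverage, name, NDC) are
# intentionally left NA, since non-product payments have nothing to report there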
# Define the payment-processing function: converts one product transaction
# into a small dataframe (one row per associated product)
ppp = function(payment){
  if(payment$Product_Indicator == "Covered"){
    # Sort logic for 'Covered' payment txns:
    # ... 1) Determine the number of mentioned products
    # ... if >= 1 then continue with normal processing
    # ... if < 1 then assume that name contents are located within the NDC fields
    # Determine the number of mentioned products
    ppos = which(is.blank(payment[1, grep("^Name_of_Associated", names(payment))]) == FALSE)
    # :: Relative table position of product names
    #    1-5  = Drug/Biological
    #    6-10 = Medical Device/Supply
    if(length(ppos) >= 1){
      # Get product names
      pnam = as.character(payment[1, grep("^Name_of_Associated", names(payment))][ppos])
      # Get corresponding NDC data (only drug/biological slots 1-5 carry NDCs)
      pndc = as.character(payment[1, grep("^NDC_of_Associated", names(payment))][ppos[ppos < 6]])
      pndc = c(pndc, rep(NA, (length(pnam) - length(pndc)))) # Match array length to 'pnam'
      # Validate NDC values; should contain only numeric characters
      pndc[grep("[^[:digit:]]", pndc)] = NA
      pndc[pndc == ""] = NA
      # Set corresponding product classes
      pcls = rep("Drug/Biological", length(ppos))
      pcls[which(ppos > 5)] = "Medical Device/Supply"
      # Create output dataframe
      df.out = as.data.frame(matrix(NA, nrow = length(ppos), ncol = 6))
      names(df.out) = c("Payment_ID", "Payment_Type", "Product_Class", "Product_Coverage", "Product_Name", "Product_NDC")
      df.out$Payment_ID = payment$Payment_ID[1]
      df.out$Payment_Type = "PRODUCT"
      df.out$Product_Class = pcls
      df.out$Product_Coverage = payment$Product_Indicator[1]
      df.out$Product_Name = pnam
      df.out$Product_NDC = pndc
    }else{
      # Condition where names are inappropriately included in the NDC field
      # Quantified as less than .01% of records in PY13
      # Get names in NDC field
      ppos = which(is.blank(payment[1, grep("NDC_of_Associated", names(payment))]) == FALSE)
      pnam = as.character(payment[1, grep("NDC_of_Associated", names(payment))][ppos])
      if(length(pnam) < 1){pnam = "UNKNOWN"; ppos = 1} # When no name information is available
      # Set class as 'UNKNOWN' because both drugs & devices have been observed in error fields
      pcls = rep("UNKNOWN", length(pnam))
      # Create output dataframe (one row per recovered product name)
      df.out = as.data.frame(matrix(NA, nrow = length(pnam), ncol = 6))
      names(df.out) = c("Payment_ID", "Payment_Type", "Product_Class", "Product_Coverage", "Product_Name", "Product_NDC")
      df.out$Payment_ID = payment$Payment_ID[1]
      df.out$Payment_Type = "PRODUCT"
      df.out$Product_Class = pcls
      df.out$Product_Coverage = payment$Product_Indicator[1]
      df.out$Product_Name = pnam
      df.out$Product_NDC = NA
    }
  }else if(payment$Product_Indicator == "Non-Covered"){
    # Determine the number of disclosed products
    # ... If products are disclosed, they are assumed to be 'Non-Covered' products
    # ... If no products are disclosed, a dummy entry is made for one (1) 'Undisclosed' product,
    #     though the payment may be associated with more than one. Unable to determine this fact.
    ppos = which(is.blank(payment[1, grep("^Name_of_Associated", names(payment))]) == FALSE)
    # :: Relative table position of product names
    #    1-5  = Drug/Biological
    #    6-10 = Medical Device/Supply
    if(length(ppos) >= 1){
      # Condition: product names are disclosed
      # Use standard 'covered' product sorting method, with necessary adjustments
      pnam = as.character(payment[1, grep("^Name_of_Associated", names(payment))][ppos])
      # Get corresponding NDC data
      pndc = as.character(payment[1, grep("^NDC_of_Associated", names(payment))][ppos[ppos < 6]])
      pndc = c(pndc, rep(NA, (length(pnam) - length(pndc)))) # Match array length to 'pnam'
      # Validate NDC values; should contain only numeric characters
      pndc[grep("[^[:digit:]]", pndc)] = NA
      pndc[pndc == ""] = NA
      # Set corresponding product classes
      pcls = rep("Drug/Biological", length(ppos))
      pcls[which(ppos > 5)] = "Medical Device/Supply"
      # Create output dataframe
      df.out = as.data.frame(matrix(NA, nrow = length(ppos), ncol = 6))
      names(df.out) = c("Payment_ID", "Payment_Type", "Product_Class", "Product_Coverage", "Product_Name", "Product_NDC")
      df.out$Payment_ID = payment$Payment_ID[1]
      df.out$Payment_Type = "PRODUCT"
      df.out$Product_Class = pcls
      df.out$Product_Coverage = payment$Product_Indicator[1]
      df.out$Product_Name = pnam
      df.out$Product_NDC = pndc
    }else{
      # Condition: product names are not disclosed
      # Create dummy record for 'Undisclosed' product
      df.out = as.data.frame(matrix(NA, nrow = 1, ncol = 6))
      names(df.out) = c("Payment_ID", "Payment_Type", "Product_Class", "Product_Coverage", "Product_Name", "Product_NDC")
      df.out$Payment_ID = payment$Payment_ID[1]
      df.out$Payment_Type = "PRODUCT"
      df.out$Product_Class = "UNKNOWN"
      df.out$Product_Coverage = payment$Product_Indicator[1]
      df.out$Product_Name = "UNKNOWN"
      df.out$Product_NDC = NA
    }
  }else if(payment$Product_Indicator == "Combination"){
    # Assumes that all disclosed products are covered and that there is one (1)
    # undisclosed non-covered product
    ppos = which(is.blank(payment[1, grep("^Name_of_Associated", names(payment))]) == FALSE)
    # :: Relative table position of product names
    #    1-5  = Drug/Biological
    #    6-10 = Medical Device/Supply
    if(length(ppos) >= 1){
      # Condition: product names are disclosed
      # Use standard 'covered' product sorting method, with necessary adjustments
      pnam = as.character(payment[1, grep("^Name_of_Associated", names(payment))][ppos])
      # Get corresponding NDC data
      pndc = as.character(payment[1, grep("^NDC_of_Associated", names(payment))][ppos[ppos < 6]])
      pndc = c(pndc, rep(NA, (length(pnam) - length(pndc)))) # Match array length to 'pnam'
      # Validate NDC values; should contain only numeric characters
      pndc[grep("[^[:digit:]]", pndc)] = NA
      pndc[pndc == ""] = NA
      # Set corresponding product classes
      pcls = rep("Drug/Biological", length(ppos))
      pcls[which(ppos > 5)] = "Medical Device/Supply"
      # Create output dataframe, with one extra row for the undisclosed non-covered product
      df.out = as.data.frame(matrix(NA, nrow = (length(ppos) + 1), ncol = 6))
      df.len = nrow(df.out)
      names(df.out) = c("Payment_ID", "Payment_Type", "Product_Class", "Product_Coverage", "Product_Name", "Product_NDC")
      df.out$Payment_ID = payment$Payment_ID[1]
      df.out$Payment_Type = "PRODUCT"
      df.out$Product_Class[1:(df.len - 1)] = pcls
      df.out$Product_Coverage = payment$Product_Indicator[1]
      df.out$Product_Name[1:(df.len - 1)] = pnam
      df.out$Product_NDC[1:(df.len - 1)] = pndc
      df.out$Product_Class[df.len] = "UNKNOWN"
      df.out$Product_Name[df.len] = "UNKNOWN"
      df.out$Product_NDC[df.len] = NA
    }else{
      # Condition: product names are not disclosed
      # Create dummy record for 'Undisclosed' product
      df.out = as.data.frame(matrix(NA, nrow = 1, ncol = 6))
      names(df.out) = c("Payment_ID", "Payment_Type", "Product_Class", "Product_Coverage", "Product_Name", "Product_NDC")
      df.out$Payment_ID = payment$Payment_ID[1]
      df.out$Payment_Type = "PRODUCT"
      df.out$Product_Class = "UNKNOWN"
      df.out$Product_Coverage = payment$Product_Indicator[1]
      df.out$Product_Name = "UNKNOWN"
      df.out$Product_NDC = NA
    }
  }else{
    # Warn on unknown Product_Indicator values and return an empty dataframe
    # (assigning NULL into a list slot would delete the slot and shift indices)
    warning(paste("Unknown Product_Indicator for Payment_ID:", as.character(payment$Payment_ID[1])))
    df.out = as.data.frame(matrix(NA, nrow = 0, ncol = 6))
    names(df.out) = c("Payment_ID", "Payment_Type", "Product_Class", "Product_Coverage", "Product_Name", "Product_NDC")
  }
  df.out
}
# Payment Processing Test -> PYDAT_13 txns
prodlist = which(pydat_13$Product_Indicator != "None")
paylist_13 = vector(mode = "list", length = length(prodlist))
system.time({
  for(i in 1:length(prodlist)){
    paylist_13[[i]] = ppp(pydat_13[prodlist[i], ])
    if((i %% 10000) == 0){print(i)} # Progress counter
  }
})
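As an aside, the spot checks I mentioned work because list slot i maps straight back to row prodlist[i] of the source data, and the loop could also be collapsed into the apply-style call I alluded to earlier. A quick sketch of both (the index 12345 is an arbitrary example):

# Spot check: compare a source record against its reformatted output
i = 12345
pydat_13[prodlist[i], ] # Original transaction record
paylist_13[[i]]         # Corresponding reformatted dataframe

# Equivalent apply-style processing (same output list as the loop above)
paylist_13 = lapply(prodlist, function(r) ppp(pydat_13[r, ]))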
The 2013 data contained approximately 3M transactions, resulting in an 8.4 GB output list of dataframes.
Besides the inherent query challenges associated with lists of dataframes, they are a pretty impractical way to store information.
To resolve this, I had to reprocess the list into a single dataframe object, but the traditional methods of do.call("rbind", listofdataframes) and ldply(listofdataframes, data.frame) were too slow, sharply decreasing in efficiency for objects greater than 500 MB.
I tried them initially and then gave up after sitting and staring at the processing screen for 20 minutes.
Luckily, there’s a much faster method: as.data.frame(data.table::rbindlist(listOfDataFrames))
The ‘new’ data.table method created the final dataframe in about 5 minutes, and the result was only 0.4 GB in size.
That’s a 95% reduction in object size. I wasn’t expecting that, but it makes sense conceptually, since each individual dataframe carries a fixed memory overhead that adds up as you create hundreds of thousands of them.
I bet that if I had let the do.call method run its course, it would have taken at least 1.5 hours.
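If you want to see that per-object overhead for yourself, a quick object.size() check illustrates it (exact byte counts vary by machine and R version):

# A one-cell dataframe costs far more than its single 8-byte value...
object.size(data.frame(x = 1))            # on the order of 1 KB of overhead
# ...while the same overhead is paid only once for a consolidated dataframe
object.size(data.frame(x = rep(1, 1e5)))  # barely more than the raw 800 KB vector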
If you’re interested in actual comparative benchmarks between the various methods, here’s a snippet from the stackoverflow page where I found this technique:
library(rbenchmark)
benchmark(
do.call = do.call("rbind", listOfDataFrames),
plyr_rbind.fill = plyr::rbind.fill(listOfDataFrames),
plyr_ldply = plyr::ldply(listOfDataFrames, data.frame),
data.table_rbindlist = as.data.frame(data.table::rbindlist(listOfDataFrames)),
replications = 100, order = "relative",
columns=c('test','replications', 'elapsed','relative')
) ....
test replications elapsed relative
4 data.table_rbindlist 100 0.11 1.000
1 do.call 100 9.39 85.364
2 plyr_rbind.fill 100 12.08 109.818
3 plyr_ldply 100 15.14 137.636
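Applied to my own list, the consolidation step is a one-liner (pydet_13.ppp is just the name I’d give the resulting product-payment table):

# Bind ~3M small dataframes into one; the 0-row frames returned for
# unknown indicators simply contribute no rows
pydet_13.ppp = as.data.frame(data.table::rbindlist(paylist_13))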
Here’s a glimpse of what the final output data table looks like:
[Screenshot: final reformatted payment-detail table]
All in all, I am happy with the results. It’s safe to say I never want to go through this ordeal ever again (I definitely will).