Pair Programming Statistical Analyses

Abstract

Control calculation ping-pong is the process of iteratively improving a statistical analysis by comparing results from two independent analysis approaches until agreement. We use the daff package to simplify the comparison of the two results and illustrate its use by a case study with two statisticians ping-ponging an analysis using dplyr and SQL, respectively.

Clip art is based on work by Gan Khoon Lay available under a CC BY 3.0 US license.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from github.

Introduction

If you are a statistician working in climate science, data driven journalism, official statistics, public health, economics or any related field working with real data, chances are that you have to perform analyses, where you know there is zero tolerance for errors. The easiest way to ensure the correctness of such an analysis is to check your results over and over again (the iterated 2-eye principle). A better approach is to pair-program the analysis by either having a colleague read through your code (the 4-eye principle) or have a skilled colleague completely redo your analysis from scratch using her favorite toolchain (the 2-mindset principle). Structured software development in the form of, e.g. version control and unit tests, provides valuable inspiration on how to ensure the quality of your code. However, when it comes to pair-programming analyses, surprisingly many steps remain manual. The daff package provides the equivalent of a diff statement on data frames and we shall illustrate its use by automatizing the comparison step of the control calculation ping-pong process.

The Case Story

Ada and Bob have to calculate their country’s quadrennial official statistics on the total income and number of employed people in fitness centers. A sample of fitness centers is asked to fill out a questionnaire containing their yearly sales volume, staff costs and number of employees. The present exposition will for the sake of convenience ignore the complex survey part of the problem and just pretend that the sample corresponds to the population (complete inventory count).

The Data

After the questionnaire phase, the following data are available to Ada and Bob.

Id	Version	Region	Sales Volume	Staff Costs	People
01	1	A	23000	10003	5
02	1	B	12200	7200	1
03	1	NA	19500	7609	2
04	1	A	17500	13000	3
05	1	B	119500	90000	NA
05	2	B	119500	95691	19
06	1	B	NA	19990	6
07	1	A	19123	20100	8
08	1	D	25000	100	NA
09	1	D	45860	32555	9
10	1	E	33020	25010	7

Here Id denotes the unique identifier of the sampled fitness center, Version indicates the version of a center’s questionnaire and Region denotes the region in which the center is located. In case a center’s questionnaire lacks information or has inconsistent information, the protocol is to get back to the center and have it send a revised questionnaire. All Ada and Bob now need to do is aggregate the data per region in order to obtain region stratified estimates of:

the overall number of fitness centres
total sales volume
total staff cost
total number of people employed in fitness centres

However, the analysis protocol instructs that only fitness centers with an annual sales volume larger than or equal to €17,500 are to be included in the analysis. Furthermore, if missing values occur, they are to be ignored in the summations. Since a lot of muscle will be angered in case of errors, Ada and Bob agree on following the 2-mindset procedure and meet after an hour to discuss their results. Here is what each of them came up with.

Ada

Ada loves the tidyverse and in particular dplyr. This is her solution:

ada <- fitness %>% na.omit() %>% group_by(Region,Id) %>% top_n(1,Version) %>%
  group_by(Region) %>%
  filter(`Sales Volume` >= 17500) %>%
  summarise(`NoOfUnits`=n(),
            `Sales Volume`=sum(`Sales Volume`),
            `Staff Costs`=sum(`Staff Costs`),
            People=sum(People))
ada

## # A tibble: 4 x 5
##   Region NoOfUnits `Sales Volume` `Staff Costs` People
##   <chr>      <int>          <int>         <int>  <int>
## 1 A              3          59623         43103     16
## 2 B              1         119500         95691     19
## 3 D              1          45860         32555      9
## 4 E              1          33020         25010      7

Bob

Bob has a dark past as a relational database management system (RDBMS) developer and, hence, only recently experienced the joys of R. He therefore chooses a no-SQL-server-necessary SQLite within R approach. The hope is that in big data situation this might be a little more speedy than base R:

library(RSQLite)
## Create ad-hoc database
db <- dbConnect(SQLite(), dbname = file.path(filePath,"Test.sqlite"))
##Move fitness data into the ad-hoc DB
dbWriteTable(conn = db, name = "fitness", fitness, overwrite=TRUE, row.names=FALSE)
##Query using SQL
bob <- dbGetQuery(db, "
    SELECT Region,
           COUNT(*) As NoOfUnits,
           SUM([Sales Volume]) As [Sales Volume],
           SUM([Staff Costs]) AS [Staff Costs],
           SUM(People) AS People
    FROM fitness
    WHERE [Sales Volume] > 17500 GROUP BY Region
")
bob

##   Region NoOfUnits Sales Volume Staff Costs People
## 1   <NA>         1        19500        7609      2
## 2      A         2        42123       30103     13
## 3      B         2       239000      185691     19
## 4      D         2        70860       32655      9
## 5      E         1        33020       25010      7

Update: An alternative approach with less syntactic overhead would have been the sqldf package, which has a standard SQLite backend and automagically handles the import of the data.frame into the DB using the RSQLite pkg.

##Load package
suppressPackageStartupMessages(library(sqldf))
##Ensure SQLite backend.
options(sqldf.driver = "SQLite")
##Same query as before
bob <- sqldf("
    SELECT Region,
           COUNT(*) As NoOfUnits,
           SUM([Sales Volume]) As [Sales Volume],
           SUM([Staff Costs]) AS [Staff Costs],
           SUM(People) AS People
    FROM fitness
    WHERE [Sales Volume] > 17500 GROUP BY Region
")

Even shorter is the direct use of SQL chunks in knitr using the variable db as connection and using the chunk argument output.var=bob:

SELECT Region,
       COUNT(*) As NoOfUnits,
       SUM([Sales Volume]) As [Sales Volume],
       SUM([Staff Costs]) AS [Staff Costs],
       SUM(People) AS People
FROM fitness
WHERE [Sales Volume] > 17500 GROUP BY Region

The Ping-Pong Phase

After Ada and Bob each have a result, they compare their resulting data.frame using the daff package, which was recently presented by @edwindjonge at the useR! 2017 conference.

library(daff)
diff <- diff_data(ada, bob)
diff$get_data()

##    @@ Region NoOfUnits   Sales Volume   Staff Costs People
## 1 +++   <NA>         1          19500          7609      2
## 2  ->      A      3->2   59623->42123  43103->30103 16->13
## 3  ->      B      1->2 119500->239000 95691->185691     19
## 4  ->      D      1->2   45860->70860  32555->32655      9
## 5          E         1          33020         25010      7

After Ada’s and Bob’s serve, the two realize that their results just agree for the region E.

Note: Currently, daff has the semi-annoying feature of not being able to show all the diffs when printing, but just n lines of the head and tail. As a consequence, for the purpose of this post, we overwrite the printing function such that it always shows all rows with differences.

##Small helper function for better printing
print.data_diff <- function(x) x$get_data() %>% filter(`@@` != "")

diff %>% print()

##    @@ Region NoOfUnits   Sales Volume   Staff Costs People
## 1 +++   <NA>         1          19500          7609      2
## 2  ->      A      3->2   59623->42123  43103->30103 16->13
## 3  ->      B      1->2 119500->239000 95691->185691     19
## 4  ->      D      1->2   45860->70860  32555->32655      9

The two decide to first agree on the number of units per region.

diff$get_data() %>% filter(`@@` != "") %>% select(`@@`, Region, NoOfUnits)

##    @@ Region NoOfUnits
## 1 +++   <NA>         1
## 2  ->      A      3->2
## 3  ->      B      1->2
## 4  ->      D      1->2

One obvious reason for the discrepancies appears to be that Bob has results for an extra <NA> region. Therefore, Bob takes another look at his management of missing values and decides to improve his code by:

Pong Bob

SELECT Region,
       COUNT(*) As NoOfUnits,
       SUM([Sales Volume]) As [Sales Volume],
       SUM([Staff Costs]) AS [Staff Costs],
       SUM(People) AS People
FROM fitness
WHERE ([Sales Volume] > 17500 AND REGION IS NOT NULL)
GROUP BY Region

##   @@ Region NoOfUnits   Sales Volume   Staff Costs People
## 1 ->      A      3->2   59623->42123  43103->30103 16->13
## 2 ->      B      1->2 119500->239000 95691->185691     19
## 3 ->      D      1->2   45860->70860  32555->32655      9

Ping Bob

Better. Now the NA region is gone, but still quite some differences remain. Note: You may at this point want to stop reading and try yourself to fix the analysis - the data and code are available from the github repository.

Pong Bob

Now Bob notices that he forgot to handle the duplicate records and apparently misunderstood the exact definition of the €17,500 exclusion limit. His massaged SQL query looks as follows:

SELECT Region,
       COUNT(*) As NoOfUnits,
       SUM([Sales Volume]) As [Sales Volume],
       SUM([Staff Costs]) AS [Staff Costs],
       SUM(People) AS People
FROM (SELECT Id, MAX(Version), Region, [Sales Volume], [Staff Costs], People FROM fitness GROUP BY Id)
WHERE ([Sales Volume] >= 17500 AND REGION IS NOT NULL)
GROUP BY Region

##    @@ Region NoOfUnits Sales Volume  Staff Costs People
## 1 ...    ...       ...          ...          ...    ...
## 2  ->      D      1->2 45860->70860 32555->32655      9

Ping Ada

Comparing with Ada, Bob is sort of envious that she was able to just use dplyr’s group_by and top_n functions. However, daff shows that there still is one difference left. By looking more carefully at Ada’s code it becomes clear that she accidentally leaves out one unit in region D. The reason is the too liberate use of na.omit, which also removes the one entry with an NA in one of the not so important columns. However, they discuss the issue, if one really wants to include partial records or not, because summation in the different columns then is over a different number of units. After consulting with the standard operation procedure (SOP) for these kind of surveys they decide to include the observation where possible. Here is Ada’s modified code:

ada2 <- fitness %>% filter(!is.na(Region)) %>% group_by(Region,Id) %>% top_n(1,Version) %>%
  group_by(Region) %>%
  filter(`Sales Volume` >= 17500) %>%
  summarise(`NoOfUnits`=n(),
            `Sales Volume`=sum(`Sales Volume`),
            `Staff Costs`=sum(`Staff Costs`),
            People=sum(People))
(diff_final <- diff_data(ada2,bob3)) %>% print()

##    @@ Region NoOfUnits ... Staff Costs People
## 1 ...    ...       ... ...         ...    ...
## 2  ->      D         2 ...       32655  NA->9

Pong Ada

Oops, Ada forgot to take care of the NA in the summation:

ada3 <- fitness %>% filter(!is.na(Region)) %>% group_by(Region,Id) %>% top_n(1,Version) %>%
  group_by(Region) %>%
  filter(`Sales Volume` >= 17500) %>%
  summarise(`NoOfUnits`=n(),
            `Sales Volume`=sum(`Sales Volume`),
            `Staff Costs`=sum(`Staff Costs`),
            People=sum(People,na.rm=TRUE))
diff_final <- diff_data(ada3,bob3)

##Check if the results really are the same
length(diff_final$get_data()) == 0

## [1] TRUE

Finally, their results agree and they move on to production and their results are published in a nice report.

Conclusion

As shown, the ping-pong game is quite manual and particularly annoying, if at some point someone steps into the office with a statement like Btw, I found some extra questionnaires, which need to be added to the analysis asap. However, the two now aligned analysis scripts and the corresponding daff-overlay could be put into a script, which is triggered every time the data change. In case new discrepancies emerge as length(diff$get_data()) > 0, the two could then be automatically informed.

Question 1: Now the two get the same results, do you agree with them?

## # A tibble: 4 x 5
##   Region NoOfUnits `Sales Volume` `Staff Costs` People
##   <chr>      <int>          <int>         <int>  <int>
## 1 A              3          59623         43103     16
## 2 B              1         119500         95691     19
## 3 D              2          70860         32655      9
## 4 E              1          33020         25010      7

Question 2: Are you aware of any other good ways and tools to structure and automatize such a process? If so, please share your experiences as a Disqus comment below. Control calculations appear quite common, but little structured code support appears to be available for such processes.

Photo is copyright Johnathan J. Stegeman under the GNU Free Documentation License, version 1.2.