Friday, June 3, 2016

Essential elements of clear and reproducible statistical output

I read a lot of statistical output. Whether I created it or someone else did, one thing that's really frustrating and wastes a lot of time is not knowing what's contained in the output. Even if the programmer kept a good research log and annotated their code well, I usually don't have access to that when I'm reviewing output. So I've come up with a few rules-of-thumb for essential elements of any output that I produce (or that people produce for me).

1) These annotations need to appear somewhere on the output document that you're going to share with people, whether electronically or physically. They can't be hidden in another document.

2) Ideally, they should appear right next to the specific output (e.g., table, figure) that you want to discuss. The easiest way to do this is often in a table or figure title, or a figure legend. But honestly, I'd rather have output copied and pasted into Word with notes and annotations added there than have output straight from SAS, Stata, etc. with no annotation.

3) Any output should include the following elements somewhere in it:

a) Who ran the program

b) The date it was run

c) The data file(s) that were used and the names of the program file and the output file itself

d) What the level of the data is (e.g., people, businesses)

e) Total sample sizes used in the output and any selection/sub-selection done from a larger sample

f) What type of output is presented (e.g., direct summaries or modelled estimates like predicted means/probabilities)

g) What type of model was used if model output is presented (e.g., ANOVA, linear regression, logistic regression). It's also often helpful to say the procedure or command that was used (e.g., PROC GLM) if that's not already part of the output by default.

h) Whether results are "new" (vs. old, if they are second, third, etc. runs of a program/process)

i) Any asides or exceptions based on the specific study that would help the team understand the output (e.g., "Joe's idea" or "From meeting on 5/1/15")
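
To make the checklist concrete, here is a minimal sketch in Python of a helper that builds an annotation block to print above a table or figure. It is only an illustration, not a prescribed method: the function name `build_output_header`, its parameters, and every value in the example call (file names, sample sizes, the analyst ID) are hypothetical. The same idea applies to a title statement in SAS or a note in Stata.

```python
from datetime import date
import getpass


def build_output_header(program, data_files, output_file, unit, n_total,
                        n_used, selection, output_type, model, run_label,
                        notes="", analyst=None, run_date=None):
    """Return a plain-text annotation block covering items a) through i)."""
    # a) and b): fall back to the logged-in user and today's date
    analyst = analyst or getpass.getuser()
    run_date = run_date or date.today()
    lines = [
        f"Run by: {analyst}",                            # a)
        f"Run date: {run_date.isoformat()}",             # b)
        f"Data file(s): {', '.join(data_files)}",        # c)
        f"Program: {program}",                           # c)
        f"Output file: {output_file}",                   # c)
        f"Unit of analysis: {unit}",                     # d)
        f"Sample: {n_used} of {n_total} ({selection})",  # e)
        f"Output type: {output_type}",                   # f)
        f"Model: {model}",                               # g)
        f"Run status: {run_label}",                      # h)
    ]
    if notes:
        lines.append(f"Notes: {notes}")                  # i)
    return "\n".join(lines)


# All values below are made up for illustration.
header = build_output_header(
    program="wage_models.py",
    data_files=["survey_2015.csv"],
    output_file="wage_models_output.txt",
    unit="people",
    n_total=500,
    n_used=412,
    selection="employed respondents only",
    output_type="modelled estimates (predicted means)",
    model="linear regression",
    run_label="new (second run)",
    notes="From meeting on 5/1/15",
    analyst="jdoe",
    run_date=date(2016, 6, 3),
)
print(header)
```

Printing the header at the top of every output file (or pasting it into a table title) keeps the annotations on the shared document itself, rather than in a separate log.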

A lot of statistical work is translation between statisticians/programmers and "substantive" experts or non-statistical clients. So it's important for output to have annotations for the programmers AND annotations in more lay language for people who aren't programmers/statisticians and who aren't in the data every day. In the list above, a) through c) are probably not helpful to anyone beyond the programmers responsible for the work, but they are essential for making work reproducible. The others are for interpretation. A lot of time and opportunities can be wasted if output isn't clear when the right people are all in a room together or when a key reviewer has a chance to review it. Using these tips in your standard practice should help avoid those bottlenecks.