Friday, June 3, 2016

Essential elements of clear and reproducible statistical output

I read a lot of statistical output. Whether I created it or someone else did, one thing that's really frustrating and wastes a lot of time is not knowing what's contained in the output. Even if the programmer kept a good research log and annotated their code well, I usually don't have access to that when I'm reviewing output. So I've come up with a few rules-of-thumb for essential elements of any output that I produce (or that people produce for me).

1) These annotations need to appear somewhere on the output document that you're going to share with people, whether electronically or physically. They can't be hidden in another document.

2) Ideally, they should appear right next to the specific output (i.e., table, figure) that you want to discuss. So the easiest way to do this is often in a table or figure title, or a figure legend. But honestly, I'd rather have output copied and pasted into Word with notes and annotations added there than to just have output straight from SAS, Stata, etc. with no annotation.

3) Any output should include the following elements somewhere in it:

a) Who ran the program

b) The date it was run

c) The data file(s) that were used and the name of the program file and output file itself

d) What the level of the data is (e.g., people, businesses)

e) Total sample sizes used in the output and any selection/sub-selection done from a larger sample

f) What type of output is presented (e.g., direct summaries or modelled estimates like predicted means/probabilities)

g) What type of model was used if model output is presented (e.g., ANOVA, linear regression, logistic regression). It's also often helpful to say the procedure or command that was used (e.g., PROC GLM) if that's not already part of the output by default.

h) Whether results are "new" (vs. old, if they are the second, third, etc. runs of a program/process)

i) Any asides or exceptions based on the specific study that would help the team understand the output (e.g., "Joe's idea" or "From meeting on 5/1/15")
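As a rough illustration of rule 3, the elements above can be stamped onto output automatically rather than typed by hand. The sketch below uses Python rather than SAS or Stata, and everything in it (the output_header function, its parameter names, the file names, and the sample sizes) is hypothetical and made up for the example. Adapt the fields to your own project.

```python
import getpass
from datetime import datetime


def output_header(program, data_files, output_file, unit, n_total, n_used,
                  selection, output_type, model, run_label, notes="",
                  user=None, run_date=None):
    """Build a plain-text header covering elements a) through i)."""
    # Fall back to the logged-in user and the current time if not supplied.
    user = user or getpass.getuser()
    run_date = run_date or datetime.now().strftime("%Y-%m-%d %H:%M")
    fields = [
        ("Run by", user),                                        # a)
        ("Run date", run_date),                                  # b)
        ("Data file(s)", ", ".join(data_files)),                 # c)
        ("Program file", program),                               # c)
        ("Output file", output_file),                            # c)
        ("Unit of data", unit),                                  # d)
        ("Sample", f"N = {n_used} of {n_total} ({selection})"),  # e)
        ("Output type", output_type),                            # f)
        ("Model", model),                                        # g)
        ("Run status", run_label),                               # h)
        ("Notes", notes),                                        # i)
    ]
    # Pad the labels so the values line up in a fixed-width font.
    width = max(len(label) for label, _ in fields)
    return "\n".join(f"{label.ljust(width)} : {value}" for label, value in fields)


# Example call with made-up values; print the result above each table or figure.
header = output_header(
    program="model_income.py",
    data_files=["survey_2015.csv"],
    output_file="model_income_output.txt",
    unit="people",
    n_total=500,
    n_used=412,
    selection="adults with complete income data",
    output_type="modelled estimates (predicted means)",
    model="linear regression",
    run_label="new (second run)",
    notes="From meeting on 5/1/15",
    user="jdoe",
    run_date="2016-06-03 09:00",
)
print(header)
```

Printing the header at the top of the output file, or directly above each table, keeps the annotations on the document people will actually read.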

A lot of statistical work is translation between statisticians/programmers and "substantive" experts or non-statistical clients. So it's important for output to have annotations for the programmers AND annotations in more lay language for people who aren't programmers/statisticians and who aren't in the data every day. In the list above, a) through c) are probably not helpful to anyone beyond the programmers responsible for the work, but they are essential for making work reproducible. The others are for interpretation. A lot of time and opportunities can be wasted if output isn't clear when the right people are all in a room together or when a key reviewer has a chance to review. Using these tips in your standard practice should help avoid those bottlenecks.


Thursday, June 4, 2015

Some tips on using OneNote for research documentation

OneNote is a neat MS Office product that comes with many (maybe all now) Office packages, and you can even get it online for free now (I think, even if you don't have Office). Its big attraction for me is that it's designed to work like a notebook page and lets you just drop items (text, images, even attachments) onto a page and do a lot of formatting easily. It also stores all your notes in a "notebook" structure, which basically replicates a folder structure system, but is a little more intuitive and easier to search in some ways.

I've been using OneNote as my primary "lab notebook" and note-taking program for a few years now. In my opinion it beats Evernote in visual/formatting ability, but it's quite a bit slower and clunkier in some situations, and syncing isn't as reliable as Evernote in my experience. So if all your notes are just text, and you don't like to add a lot of pictures, highlighting, etc., it might be overkill for you (in which case Evernote would be a great choice).

This post is a running list of tips and things I've found counter-intuitive, and frequently-used keyboard shortcuts and tricks I've come to rely on in OneNote. It's a reference for myself as much as for you.

1) Tables: 

Tables in OneNote are clunky. They don't operate by the same rules as tables in Word or Excel. For example:
  • To add a new row, you hit enter from the start of the row within a cell (rather than going to the end of row outside the cell like you do in Word). 
    • Alternatively, you right click > Table > Insert to add a row (unless you have the table formatting ribbon open).
  • Row/column highlighting: Rather than highlighting just the text within the cells as you up-arrow while holding shift, when you get to the first row (the last of your selection) the selection jumps out of the table and highlights everything on the page. So you have to highlight to the second-to-last row of your selection, and then arrow left (or ctrl+home) to get the last row. Kind of clunky if you're moving a lot around and used to Word tables.
  • You can create a table by just hitting tab (instead of space) after a word or phrase, which will put that word or phrase in the first cell.

Overall, I've found that OneNote tables are nice for very small tables (a few rows or a few columns) where you don't care if the formatting is perfect (i.e., you just want to tabularize it to make it clearer for note-taking or checking things off). But I wouldn't try doing detailed formatting in these, or expect functionality like Excel tables (or even Word). The most recent version incorporates Excel tables, but you have to "convert" the table to Excel, and it didn't interface well with the rest of the page the one time I tried that. It's also not backward compatible if you have an older version on another computer. However, this does let you sort, which you couldn't do in the older versions.


2) Grouping content:

The structure: The first thing to know is that the structure goes like this:

Notebook > Section > Page > Subpage. You can also put section groups and subgroups in there, too. Pages are the only thing you can write on (but you can title the other things). For web-access, I find the groups and subgroups to be kind of clunky.

Syncing: If you work on one machine, it probably doesn't matter how you organize, but if you're on a lot of machines it can.

If you have multiple notebooks, you have to download each to every new computer separately. If you want the flexibility of choosing which notes go where, then many notebooks is for you. If you want everything to sync easily, and setup on a new computer or mobile to be easy, then fewer notebooks is better.

Naming notebooks and sections: Note that changing the name of a notebook is tricky. Once you make it (with the primary copy online, anyway) the name is fixed. If you change the name on your computer, it won't change on the other computers that sync. It will still sync, but it will have the original name. In this way, notebooks are NOT like folders. Naming and renaming sections and pages doesn't work this way.

Sorting: Sections and pages cannot be sorted alphabetically without a special Power Toy. However, section groups will be sorted alpha by default. Notebooks will not be sorted alpha either. So keep that in mind since it can be a lot of work to keep things in order if you use alpha sort.


3) Working with the page itself
  • Images and drawing: It's neat that you can copy a screen cap or image, paste it to your page and then draw on it (particularly if you use a tablet). However, the ink you add (highlights, pen, etc.) doesn't follow the image. You can't even group the objects you drew with the original image object so that they move together. This means that if you then add text or anything before it, the image you copied will move down (just like in a Word doc), but the drawn objects will stay where they are as if they are physically attached to the background of the page. This baffles me b/c I can't see any reason someone would want it and it makes it impractical for note taking.
  • OneNote can do math and recognize LaTeX equations. These aren't dynamic though. It will convert "1+1=" to 2. That's convenient, but also irritating if you just want to show "1+1". This can be turned off, or if you just "undo" (ctrl+z) you get 1+1 back. All the LaTeX equations I've tried work (e.g., \y_i becomes "y sub i" and \alpha becomes the alpha symbol), but I don't do too many equations and haven't tried anything too complex. You then have Office equation objects that you can edit, but you can't get the LaTeX code version (i.e., what you typed in) back if you like editing that way. Note, this function works in Word, too, and I think in Excel.
  • When you copy and paste text from a website it automatically copies the URL too, cutting out one step. Only irritating if you're copying a lot from one site in multiple steps and don't want the URL each time. 
  • Page hyperlinks are the best! Just highlight the page or section you want to link to and right click. Pick "link to page/section", then paste that in your new page. You can make yourself a mini "web page" or set of reference materials this way. I'm organizing my stat references and code this way, setting it up "like a webpage" but in OneNote, and it's pretty convenient to set up and use.

4) Keyboard shortcuts and other Office conventions:
  • Although some conventions aren't like Word, many are, so KB shortcuts for font, etc. work like Word.
  • You can change indent/outdent for bullets with tab and backspace, which is really convenient. Or you can still use Word's (ctrl+alt+ao and ctrl+alt+ai).
5) Tweaks


6) Things that can really mess you up

If you're syncing to OneDrive, then changing notebook names on your local machine only changes them on that machine. They won't be changed on other machines, and the actual notebook file name won't change (the name you see in OneNote is really just a label, not a file name). If I want to change the name (even the display name), my habit has become to make sure the notebook is synced, close the notebook, change the name in OneDrive, and then download it to OneNote on my computer again.

My most commonly-used Excel keyboard shortcuts and tips

There are a lot of sites where you can find Excel KB shortcuts (URL) and tips (URL), but these are the ones I seem to use most. They should make you much quicker in Excel.

Add a row

shift+space to highlight row, then ctrl+shift+'+'

Delete a row

shift+space to highlight row, then ctrl+'-'

Add a column

ctrl+space to highlight column, then ctrl+shift+'+'

Delete a column

ctrl+space to highlight column, then ctrl+'-'
If you're in Office X or forward you've probably already figured out that when you hit "alt" you're shown the next key to press to reach ribbons and menu items within ribbons. Here are the ones I use the most.

Add a comment to a cell

shift+F2
Note, if you want those comments to be visible by default (more like comments in Word), then go to 

File > Options > Advanced > 
In the "Display" group, under "For cells with comments, show:", click "Comments and indicators"

Here are some other esoteric Excel tips (including formulas I like a lot). 

=average(CELL:RANGE)
=sheet_name!... [If you keep your sheet names short, you could do these references by hand, but the quickest way I've found to do them is to do the first one with point and click and then copy and edit it in other cells]



Thursday, November 13, 2014

The long road that is short

It was in a job interview that I really first thought about the difference between writing code and programming. The conversation went something like this.

Potential Boss: Tell me about your statistical software experience. Do you program?

Me: I write my own code.

PB: Yes, but do you program?

Me: Ummm...

PB: Hmmm...Do you use macros?

Me: Oh! Yes.

Programming involves writing code, but writing code is not necessarily programming. I've become hypersensitive to this distinction since that experience. Although I had a tacit understanding of it, I never really separated the two things out.

Then this morning I came across a great quote in Michael N. Mitchell's Data Management Using Stata: A Practical Handbook that reads:

"The word programming can be a loaded word. I use it here to describe the creation of a series of commands that can be easily repeated to perform a given task. As such, this chapter is about how to create a series of Stata commands that [can] be easily repeated to perform data-management and data analysis tasks. But you might say that you already know how to use Stata for your data management and data analysis. Why spend time learning about programming? My colleague at UCLA, Phil Ender, had a wise saying that I loved: 'There is the short road that is long and the long road that is short.' Investing time in learning and applying programming strategies may seem like it will cost you extra time, but at the end of your research project, you will find that it is part of the 'long road that is short.'" (http://www.stata.com/bookstore/data-management-using-stata/, p. 278)

This will be my new mantra..."the long road that is short"

Thursday, October 16, 2014

Why keep a project log?

Of course you keep all your syntax, output, and logs from your analyses, but why should you keep a separate log of your work, and what should it look like?

Why? 

Because your analysis logs won't tell the whole story. If you're good about commenting your code, they can tell most of the story, but there will inevitably be some question or insight that comes up after you've run things. At this step, it's probably easier to update notes in a Word file, text file, Excel file, or some place external to your code.

What? 

The format isn't too important. Use whatever program you find easiest to use. I use OneNote b/c I like how it lets me drop in pictures, etc., and automatically inserts URLs from things I copy and paste from the web. But it can be slow at times, and has more features than one needs for a log. I've tried Excel before, but that's just too restrictive for me. It's good for other kinds of logs, but for a general project log, I like to have more free space to write.

What goes in this log? Anything you want to remember for later. I write a short summary of what I did during the session/day, and list/highlight open questions and next actions. If I made big decisions, I'll document that (though I like to have a decisions log, too). If I have key output or a finding, I'll add that.

Think of this as your "captain's log": record all the highlights and puzzling issues of the work, as well as major insights and new ideas.
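If you prefer a plain text file to OneNote or Word, the kind of entry described above can be sketched in a few lines of Python. This is only an illustration: the function names, the section labels, and the example entry are all hypothetical, and the layout is just one possibility.

```python
from datetime import date
from pathlib import Path


def format_log_entry(summary, open_questions=(), next_actions=(),
                     decisions=(), entry_date=None):
    """Return one dated log entry as plain text."""
    entry_date = entry_date or date.today()
    lines = [f"=== {entry_date.isoformat()} ===", f"Summary: {summary}"]
    sections = (
        ("Open questions", open_questions),
        ("Next actions", next_actions),
        ("Decisions", decisions),
    )
    for title, items in sections:
        if items:  # leave out sections with nothing to record
            lines.append(f"{title}:")
            lines.extend(f"  - {item}" for item in items)
    return "\n".join(lines) + "\n"


def append_log_entry(log_path, entry):
    """Append an entry to the log file, creating the file if needed."""
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(entry + "\n")


# Example entry with made-up content.
entry = format_log_entry(
    "Re-ran the income models after dropping incomplete cases.",
    open_questions=["Why did the sample size drop so much?"],
    next_actions=["Check the merge step", "Send tables to the team"],
    entry_date=date(2014, 10, 16),
)
print(entry)
```

Calling append_log_entry("project_log.txt", entry) at the end of each session would add the entry to a running file, so the log builds up in date order.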

Wednesday, August 20, 2014

The keyboard shortcuts I use the most

There's no doubt that kb shortcuts improve your efficiency (compared to mousing) once you've learned them. I learned a couple new ones from my boss yesterday, so I figured I should start listing them all here as a reference. I'll update this from time to time and include those I've listed elsewhere.

Navigating Text

Move to beginning of the line

home

Move to end of the line

end

Move one word to the right (or left) of the cursor

ctrl + right (or left) arrow


Editing/Formatting Text

Delete word to the right of cursor

ctrl + del

Select text while moving the cursor (e.g., select next word or text to the end of the line)

Hold down 'shift' with any of the movement shortcuts above


Select all text in the document, form field, etc.

ctrl + a


Excel Text Editing

For the most part, the editing shortcuts above work in Excel, too. You can apply them to the whole cell or enter the cell and apply them to selected text in the cell.

Enter Excel cell

F2


Excel Navigation

Go to a cell

F5


Go to the next worksheet to the right

ctrl + page down


Go to the next worksheet to the left

ctrl + page up