The actuarial workflow typically looks like this: you complete an analysis in a spreadsheet, then you paste the chart into a word-processing document. If you want to make a change to the math, or the data, that copy-and-paste step must be repeated. It’s easy for the two (or more!) documents to fall out of sync, prompting the question, “Am I looking at the most recent version?” which we’ve all heard in meetings ad nauseam. This is inefficient, risky and tedious, but it’s a fact of life. Isn’t it?
Well, not by design. Or, rather, this is a workflow that evolved from an absence of unified design. Office programs were meant to perform one task and perform it well. Bundling disparate programs – word processor, spreadsheet, lightweight database – into a single suite was a triumph of marketing, not design. Insofar as it appears, a holistic understanding of the needs of a numerical analyst was something of an afterthought. Sure, there have been efforts to bridge the divide between components of an office suite (chief among them OLE), but they still rely on the user working with multiple tools and multiple environments. The failure of a tool like Microsoft Binder suggests that the mainstream user community has little interest in uniting the needs of data analysis and documentation. Most users of office software are comfortable using one tool at a time.
Things look a little different in the programming community. Despite what you may have heard, programming is – first and foremost – a medium for communication, not computation. The programming literature is rife with attempts to enhance the ability of a human to communicate with a logical device. We’ve all seen comments in code, which aid our understanding of what the programmer is trying to instruct the computer to do. This approach reaches a kind of extreme in “literate programming,” first proposed by the legendary Donald Knuth. This is the idea that a program may be “read” by both a human and a computer. In practical terms, this means a synthesis of large blocks of plain text along with computer code.
In other words, communication and computation in the same document.
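As a minimal sketch of what that buys you, here is a few lines of Python in which the sentence that reports a result is generated from the same value the code computes. The premium and loss figures are hypothetical, made up purely for illustration, and the variable names are my own.

```python
# Hypothetical figures, invented purely for illustration.
earned_premium = 1_250_000.0
incurred_losses = 812_500.0

# The computation...
loss_ratio = incurred_losses / earned_premium

# ...and the commentary that reports it, built from the same value.
# If the inputs change, the sentence changes with them: nothing to re-paste.
summary = f"The loss ratio for the period is {loss_ratio:.1%}."
print(summary)
```

Change either input and rerun, and the text and the number stay in step.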
And that brings us to Jupyter. Jupyter is a very specific kind of literate programming, based on the idea of a literate program being a kind of research “notebook.” It captures all of the data analysis that the researcher carries out, the results of the analysis and detailed commentary along the way. There is a deliberate analogue to clinical, laboratory research, which chemists and other scientists carry out. The integrity of such research is critical; often there are protocols in place to have the notebook remain in the laboratory so that it may not be altered.
Jupyter has similar controls. Each notebook is composed of “cells,” which may be either code or commentary. Each time a code cell is executed, it is stamped with the next value of a running counter, which appears in brackets next to the cell. If the code has not been run, the brackets are empty. To see an example, here is a Jupyter notebook put together by John Bogaardt and me for a recent presentation. Notice the numbers in brackets to the left of the code cells. These indicate whether the cells have been run and in what order. The notebook also contains commentary on what we’re trying to analyze as well as the output of the analysis. No need to copy and paste.
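Under the hood, a notebook file is a JSON document holding a list of cells. The sketch below builds a simplified, hand-written notebook as a Python dictionary (the cell contents are invented for illustration, and `prompt` is a hypothetical helper of my own, not part of Jupyter) and shows how the bracketed label follows from each code cell's `execution_count` field.

```python
import json

# A simplified, hand-written notebook in the version 4 file layout.
# The cell contents are invented for illustration.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["Commentary on what we are trying to analyze."]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": 1, "source": ["x = 2 + 2"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": ["print(x)"]},  # not yet run
    ],
}


def prompt(cell):
    """Hypothetical helper: the bracketed label shown beside a cell."""
    if cell["cell_type"] != "code":
        return ""  # commentary cells carry no counter
    count = cell["execution_count"]
    return "In [ ]:" if count is None else f"In [{count}]:"


labels = [prompt(cell) for cell in notebook["cells"]]
for label, cell in zip(labels, notebook["cells"]):
    print(f"{label:8} {cell['source'][0]}")

# The whole structure serializes to JSON, which is what an .ipynb file contains.
as_json = json.dumps(notebook, indent=1)
```

The second code cell has never been run, so its counter is `None` and its brackets stay empty.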
You will notice that the notebook that John and I created uses Python. Jupyter has its origins in Python, having evolved out of the IPython project to support other languages. “Jupyter” is a portmanteau formed by the names Julia, Python and R. However, Jupyter also supports other languages like JavaScript, Ruby, Go and many others. This is another departure from the office suite paradigm: Jupyter is a common environment for using many different tools.
So, is Jupyter the only example of literate programming? Heavens, no. Beginning with Sweave, R has embraced literate programming for years. This has evolved to tools like R Markdown and knitr, which may run R code, include comments and graphics and generate resulting documents in many different formats, including Microsoft Word, PDF and HTML. This makes it possible for a “literate program” to run analysis and produce an actuarial report, a slide show or a website.
This post grew out of a discussion in the Research Oversight Committee, which was started by Morgan Bugbee. He read a great article in The Atlantic, which talks about reproducibility in research. I’ll have more to say about that in a future blog post.