Last August, the CAS made a significant organizational investment. We created an account on GitHub, one of the leading cloud-based platforms for hosting projects managed with the version control utility git. Everyone who read that last sentence will have one of three reactions:
1) I love git, it’s about time the CAS did this!
2) I’ve heard of git, but don’t really know much about it, or
3) I have no idea what any of that sentence means.
If you’re in that first group, feel free to hop over to casact.github.com and take a look around. If you’re in one of the other groups, read on.
First, let’s talk about version control. It’s rare that you would ever create an analysis (in a spreadsheet or elsewhere) that requires no additional changes. This problem is particularly acute in the software field, where updates need to be tracked across a very large user base. The developers working on the Linux operating system recognized this, and Linus Torvalds came up with the program “git” to manage different versions of the software. The details of how this is implemented could fill a book. Suffice it to say that git allows you to see all of the different versions of the files that make up a project. Make a mistake? It’s possible to go back to a prior state when things were correct. Don’t know what changed? That can be easily checked.
Although git is primarily used for software development, it’s important to bear in mind that code files are simply text files. In that regard, they are no different than something with a .txt extension that you might open in Notepad or some other text editor. This means that file formats like HTML, Markdown or (my favorite) R Markdown may be tracked with git, and therefore, your actuarial analysis and documentation can have changes controlled and tracked. No more asking a colleague, “which version of the file should I be looking at?”
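To make the “what changed?” idea concrete, here is a minimal sketch in Python. It is not git itself and says nothing about how git stores history; it only uses the standard library’s difflib module to compare two versions of a plain-text file line by line, which is the kind of comparison a version control tool shows you. The file name, the variable names and the numbers are all made up for illustration.

```python
import difflib

# Two hypothetical versions of a plain-text analysis file.
# The contents are invented purely for illustration.
version_1 = """\
# Loss development
selected_ldf = 1.25
ultimate = paid * selected_ldf
"""

version_2 = """\
# Loss development
selected_ldf = 1.30
ultimate = paid * selected_ldf
"""


def changed_lines(old, new):
    """Return unified-diff lines describing how `old` became `new`."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="analysis.txt (version 1)",  # hypothetical file name
            tofile="analysis.txt (version 2)",
            lineterm="",
        )
    )


diff = changed_lines(version_1, version_2)

# Lines starting with "-" were removed and lines starting with "+" were added.
print("\n".join(diff))
```

Because the comparison works on lines of text, it applies equally to an R script, a Markdown report or a Python module. Spreadsheets and other binary formats do not compare this cleanly, which is one reason text-based analysis pairs so well with version control.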
Now let’s talk about doing this in the cloud. Git can be used over a LAN – I’ve done it myself – but GitHub (like Bitbucket and GitLab) puts projects in the cloud. Anyone can access and benefit from the material that’s out there. R and Python users have likely already come across any number of packages whose code is hosted in the cloud. (Like this one, or this one and also this one.)
So what’s wrong with the way that the CAS currently makes code available? I mean, isn’t it already “in the cloud” on the Variance or E-Forum pages of the CAS website, where I can copy and paste it? Why reinvent the wheel? Well, in this case, we’re not so much reinventing the wheel as augmenting the car. Let’s say that the code associated with a Variance paper has an error. A user would have to notify the author or the CAS webmaster. E-mails would follow and the code would get fixed once the author has capacity to debug and correct the error.
Placing the code on GitHub does a few things.
- It eliminates the need to find contact details for the author. Simply open an issue and the author will be aware that the code needs to be examined.
- Whoever spots the error can suggest a fix. In this case, the author would simply need to review and approve the change.
- This whole discussion happens in the open. Users of the code will be aware of what’s taking place and can join the discussion on making changes.
There’s another advantage to collecting material on GitHub and it embodies the phrase “passive discovery.” If I’m looking for “The Decline and Fall of the Roman Empire” by Edward Gibbon in a bookstore, I’ll go to the history section and start looking for authors whose last name begins with “G.” While I’m there, though, I’ll see Tom Holland’s “Rubicon,” John Julius Norwich’s books about Byzantium and any number of other titles. I didn’t come in looking for them, but now I’m interested. This is passive discovery. I didn’t seek out a book by Tom Holland, I found it without even knowing that I needed to look for it.
As the collection of material grows, I’m hoping this will happen on GitHub. You may be looking for yield curve research written by Gary Venter and Kailan Shang in the E-Forum. While you’re there, you may notice a project that examines Bayesian neural networks. You may also see that there’s a Python package to implement the chain ladder reserving model. You didn’t know you needed it when you went looking for yield curves, but now you do. And it’s all right there for you to use!
The Legal Stuff
None of this would have happened without the CAS Vice President-Research, Avi Adler, supporting and challenging me every step of the way. For an actuary, he sounds an awful lot like a lawyer, and trust me, that’s not a high compliment. Sharing software that actuaries can use on the job introduces two significant legal issues: liability and copyright.
Every CAS actuary is familiar with the financial consequences of legal liability. What happens if the software causes financial loss? Set aside the likelihood that this will happen and consider the conditional probability. Should the authors of the code bear the responsibility of the loss? Should the user? Should their employer? Should the CAS?
The second issue is copyright. As a general rule, authors have the right to decide how their work is used, unless they’ve made other arrangements. For example, articles in Variance require signing a consent-to-publish agreement, which assigns these rights to the CAS. Software, particularly collaborative software, is not so easy. Moreover, what happens if an organization uses a Python routine from GitHub in its pricing systems? Does this kind of system integration mean that its code must be shared with others, just as the original software had been?
To address both of these issues, we have – with the advice of legal counsel – adopted the Mozilla Public License, version 2.0 (MPL2). Every repository that we host must include this license within the project. Further, every source file must contain a reference to it. The most significant element is a clear statement that the material is provided without any warranty. The entire risk is borne by users of the software. You or your attorneys can take a look, but consider your emptor caveated.
MPL2 addresses copyright in an interesting way. Like other open source licenses such as MIT or GPL, it makes the material freely available to anyone, so long as the license terms are adhered to. Where MPL2 differs from some of these licenses is with respect to composite works. Let’s say that an aggregate loss model is placed on GitHub. You’d like to use it at your company (at your own risk!) as part of your pricing model. You can do this. Let’s say that you make an improvement to the code. Are you now required to make all of the code for your company’s pricing model available? The answer is no. MPL2 permits open source and closed source elements to be used together. A license like GPL does not. The R package data.table decided to move to MPL2 for this reason. Again, check with your company’s legal department on how you may proceed, but we have the view that MPL2 is the best fit for the actuarial community.
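For contributors wondering what “every source file must contain a reference” to the license might look like in practice, here is a rough sketch of a checker. It assumes the reference takes the form of the standard notice from Exhibit A of the MPL 2.0 text, placed in a comment near the top of each file. That is an assumption on my part, as are the function names and the choice of file extensions, so check the repository guidelines for the exact wording expected.

```python
from pathlib import Path

# Opening words of the standard MPL 2.0 source-file notice (Exhibit A of the license).
# Assumption: this is the form of "reference" expected in each source file.
MPL_NOTICE = (
    "This Source Code Form is subject to the terms of the "
    "Mozilla Public License, v. 2.0"
)


def has_mpl_notice(text, lines_to_check=10):
    """Return True if the MPL 2.0 notice appears in the first few lines of `text`.

    Comment markers are stripped and the lines are joined, so a notice that
    wraps across several comment lines is still recognized.
    """
    head = text.splitlines()[:lines_to_check]
    joined = " ".join(line.strip("#*/ \t") for line in head)
    normalized = " ".join(joined.split())  # collapse repeated whitespace
    return MPL_NOTICE in normalized


def files_missing_notice(root, suffixes=(".py", ".R")):
    """List source files under `root` that lack the notice.

    The default suffixes are illustrative; adjust them to the project.
    """
    missing = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            if not has_mpl_notice(text):
                missing.append(path)
    return missing


# A hypothetical source file with the notice wrapped over three comment lines.
sample_with_notice = """\
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at http://mozilla.org/MPL/2.0/.

print("hello")
"""

sample_without_notice = 'print("hello")\n'

print(has_mpl_notice(sample_with_notice))
print(has_mpl_notice(sample_without_notice))
```

Running something like files_missing_notice(".") from the top of a project before contributing would flag any files that still need the header.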
What does this mean for you?
I hope this means that you’re more likely to try out some of the techniques you’ve been reading about in CAS publications. Even better, I hope it means that you’ll contribute ideas, corrections and discussion about the various repositories that appear on GitHub.
As time goes on, I hope to contribute occasional articles about git to the Actuarial Review or this blog. In March, Bryce Chamberlain, Rajesh Sahasrabuddhe and I will be giving a presentation about it at the CAS RPM Seminar. If you can’t wait that long, check out the recording of the webinar that Bryce and I gave last October.