Teaching scientific software best practices at a molecular modeling summit

26 Jul 2015

I was fortunate to be invited to Foundations of Molecular Modeling and Simulation, a summit attended by the top academic minds in molecular modeling. It was a wonderful conference filled with incredibly smart people.

I presented on a topic near and dear to me - software best practices in scientific software. It’s no secret that scientific software has a bad reputation and even basic computational results can’t be reproduced between labs. These are all problems with known solutions, but adoption is mixed in the “publish or perish” academic incentive structure.

FOMMS gave me an opportunity to hash out the concepts I presented to NIST. In partnership with the incredibly talented Professor Chris Wilmer, we devised a series of short lessons that we felt could gain traction in any computational lab.

Molecular modeling is dominated by proprietary formats and brittle code that segfaults on trailing whitespace.💬I’ve seen month-long simulations fail without recovery due to files not ending in a LF, massive job queues billing thousands of useless hours due to a hard tab, and more. White space is the kryptonite of custom parsers. In File Formats, we introduced the conference to standard file formats and libraries to replace their custom scripts.

The field is deeply tied to high-performance computing, with code based on extremely tight loops triggering billions of times. Pair this with people who are not software developers and you end up with tons of lab-specific performance “best practices” that are often nothing more than superstition.💬I’m looking at you, Quake square root approximator. In Optimization, we introduce profiling and big-O notation as tools to optimize scientifically.

All too many scientific graduate students are still taught programming with FORTRAN and C. They get used to building everything from source and working in unmaintained environments with dozens of conflicting binaries. In Package Managers, we introduce the concept of installing pre-built packages and maintaining clean versioned environments.

Finally, every scientific project I’ve ever inherited is “version controlled” with a file naming scheme. To make matters worse, the instructions often involve not running the most recent version or running different versions depending on the use case.💬Repeating myself: molecular modelers can be a superstitious bunch when it comes to software. In Version Control, we make our case for version control managers such as git.

Once you’re over the activation energy barrier, I promise you’ll be publishing better results more quickly. Please let me know how it goes!