A hallmark of good practice in scientific computing is simulation reproducibility: ensuring that all computational results in a simulation can be re-generated when needed. Computational work is difficult and time-consuming, and it is easy to jump straight to publication of results without checking to ensure that all the steps to obtain them have been properly documented.
Outline of a Paper Repository
This is an example of a paper repository containing simulation results. The Alamo convention is used here, but the principles can be followed for any simulation results.
PaperDescription                      # Repo name should be all one word beginning with "Paper"
  ./main.tex                          # Always call main.tex
  ./main.pdf                          # Always ensure that your .gitignore is
  ./main.out                          #   up to date
  ./figures/                          # Put illustrations ONLY here
      MyFigure.svg                    # If saving as SVG, save a copy as
      MyFigure.pdf                    #   a PDF that can be generated from the SVG
  ./results/                          # Everything relating to actual results goes here
      TestCaseA/                      # Subdirectory for different result types
          output1/                    # Each simulation has its own directory
              input.in                # ALWAYS include the input file
              metadata                # ALWAYS store all metadata including git IDs
              diff.patch              # If the reference code has changed since the most recent
                                      #   commit, store the git patch
              smalldatafile.dat       # Store small data - if possible and not too many
              bigdatafile.dat         # Do not store large or binary files
              pressure_profile.pdf    # Store all figures presenting this data in the folder
          output2/
              ...
          pressure_profile.py         # Any scripts used for more than one dataset should go here
          comparefigure.pdf           # Figures that compare more than one dataset should go here
      TestCaseB/
          ...
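The `metadata` and `diff.patch` entries above can be produced by hand, but a small helper makes them harder to forget. The following is a minimal sketch, assuming the simulation code lives in a git checkout and that `git` is on the path. The function names and the layout of the `metadata` file are illustrative assumptions, not a format prescribed by the Alamo convention.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def git_output(args, cwd):
    """Run a git command in `cwd` and return its standard output as text."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout


def write_provenance(sim_dir, commit, patch):
    """Write `metadata` (and `diff.patch`, if non-empty) into a simulation directory."""
    sim_dir = Path(sim_dir)
    sim_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical key/value layout; adapt to whatever your code already records.
    lines = [
        f"git_commit: {commit.strip()}",
        f"uncommitted_changes: {'yes' if patch.strip() else 'no'}",
        f"recorded_utc: {datetime.now(timezone.utc).isoformat()}",
    ]
    (sim_dir / "metadata").write_text("\n".join(lines) + "\n")
    if patch.strip():
        # Only store a patch when the code differs from the recorded commit.
        (sim_dir / "diff.patch").write_text(patch)


def record_provenance(code_dir, sim_dir):
    """Query the code checkout in `code_dir` and store its state with the results."""
    commit = git_output(["rev-parse", "HEAD"], code_dir)
    patch = git_output(["diff", "HEAD"], code_dir)  # tracked changes relative to HEAD
    write_provenance(sim_dir, commit, patch)
```

A call such as `record_provenance("/path/to/code", "results/TestCaseA/output1")` would then leave the commit ID, and a patch if there are uncommitted changes, next to the input file. Note that `git diff HEAD` does not include untracked files, so new source files still need to be committed or copied separately.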
Postprocessing best practices
Design your plotting scripts so that they are easy to use. Often, multiple people are involved in creating and tweaking figures. Good organization streamlines this process and makes your life easier, especially if you need to regenerate figures after a round of requested revisions.
## Bad filename: plot.py - not obvious what the plot is doing or where the output is
## Good filename: thermal_contours.py
# BAD: do not use absolute file paths; this is not portable.
data = load_data("/home/brunnels/Research/PaperDescription/results/TestCaseA/output1/bigdatafile.dat")
# GOOD: design your scripts to be run from the results directory
data = load_data("./TestCaseA/output1/bigdatafile.dat")
# IF POSSIBLE: process your data to small data files that can be stored in the repo.
small_data = process_my_data(data)
small_data.save("./TestCaseA/output1/smalldatafile.dat")
# then give the user the option to load the data from the small data file - this way the results can be reproduced by someone without access to the big data.
# BAD: - don't separate plot file from data
# - don't give plot file a cryptic name
# - avoid rasterized formats unless absolutely necessary
small_data.saveplot("output1.pdf")
# GOOD: - localize the plot with the data in the file structure
# - give the plot an obvious name (matching the script file name helps!)
# - use PDF whenever possible
small_data.saveplot("./TestCaseA/output1/thermal_contours.pdf")
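The "load from the small data file if possible" idea can be sketched as follows. This is only an illustration under stated assumptions: the data files are taken to be plain CSV tables of numbers, and `downsample`, `read_table`, `write_table`, and `load_small_data` are hypothetical helpers standing in for your real post-processing.

```python
import csv
from pathlib import Path


def downsample(rows, stride):
    """Keep every `stride`-th row; a stand-in for real post-processing."""
    return rows[::stride]


def read_table(path):
    """Read a CSV file of numbers into a list of rows of floats."""
    with open(path, newline="") as f:
        return [[float(x) for x in row] for row in csv.reader(f) if row]


def write_table(path, rows):
    """Write rows of numbers to a CSV file."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


def load_small_data(sim_dir, stride=10):
    """Return reduced data, rebuilding it from the big file only when necessary."""
    sim_dir = Path(sim_dir)
    small = sim_dir / "smalldatafile.dat"
    big = sim_dir / "bigdatafile.dat"
    if small.exists():
        # Anyone with only the repository can reproduce the figure from here.
        return read_table(small)
    if not big.exists():
        raise FileNotFoundError(f"neither {small} nor {big} is available")
    rows = downsample(read_table(big), stride)
    write_table(small, rows)  # cache next to the data so it can be committed
    return rows
```

Run from the results directory, `load_small_data("./TestCaseA/output1")` uses the committed small file when it exists and falls back to the large file only on the machine that still has it.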
Including figures in LaTeX
Now it is time to include your figure in the paper. Notice how much information you can get just by looking at the file path: the figure is a simulation result, it belongs to the “TestCaseA” group of results, and it shows thermal contours. If you need to reproduce it, look for a script called “thermal_contours.py”, which you can run locally to regenerate the figure.
\begin{figure}
\includegraphics{results/TestCaseA/output1/thermal_contours.pdf}
\end{figure}
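Because the figure path and the script name follow the same convention, locating the generating script can be automated. The sketch below assumes the script shares the figure's base name and sits either beside the figure or in a parent directory, no higher than `results`. The function name `find_figure_script` is hypothetical.

```python
from pathlib import Path


def find_figure_script(figure_path, root="results"):
    """Return the script sharing the figure's base name, searching upward to `root`."""
    figure = Path(figure_path)
    script_name = figure.stem + ".py"
    stop = Path(root)
    directory = figure.parent
    while True:
        candidate = directory / script_name
        if candidate.is_file():
            return candidate
        # Stop at the results root, or at the top of the path if root is never reached.
        if directory == stop or directory == directory.parent:
            return None
        directory = directory.parent
```

Run from the paper directory, `find_figure_script("results/TestCaseA/output1/thermal_contours.pdf")` would check `output1/`, then `TestCaseA/`, then `results/` for `thermal_contours.py`, and return `None` if no such script exists.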
Guiding Principles
Use the following principles when organizing your data.
- Every paper is a git repository. Every paper is written on Overleaf and can be cloned to your desktop using git.
- Simulation data is stored in the associated paper’s git repository. All simulation files that are small enough should be stored inside the git repository associated with the paper. Data that is too large should still be kept in the same directory, but not added to the repository (you can use a .gitignore file for this).
- Every simulation gets its own simulation subdirectory: Each simulation result should be stored in a self-contained directory with a unique name. Here, “self-contained” means that you should be able to send only the contents of the directory to someone else for them to reproduce your results.
- The simulation directory contains everything needed to generate the simulation: this means input files, data, etc.
- Visualizations are stored as close to the data as possible: if your visualization (for example a figure) contains data from a single simulation, it should be stored in that simulation’s output directory. If it contains data from multiple simulations, it should be stored in the lowest directory that contains all of the relevant output directories.
- Scripts to generate visualizations are stored as close to the visualizations as possible: they should also be named similarly where possible. A figure named “stress_xx.pdf” is ideally generated by a script called “stress_xx.py” stored in the same directory.
- Results are not figures: The figures directory is for figures only, meaning illustrations that do not contain actual scientific results.
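Some of these principles can be checked mechanically before a paper is submitted. The following is a minimal sketch, assuming the layout shown in the repository outline (`results/<TestCase>/<simulation>/`) and treating `input.in` and `metadata` as the required files. The function name and the choice of required files are assumptions for illustration, not an exhaustive test of self-containment.

```python
from pathlib import Path

# Files that every simulation directory is expected to contain (assumed list).
REQUIRED_FILES = ("input.in", "metadata")


def check_results_tree(results_dir):
    """Return a list of human-readable problems found under `results_dir`."""
    problems = []
    results_dir = Path(results_dir)
    for case_dir in sorted(p for p in results_dir.iterdir() if p.is_dir()):
        for sim_dir in sorted(p for p in case_dir.iterdir() if p.is_dir()):
            for name in REQUIRED_FILES:
                if not (sim_dir / name).is_file():
                    rel = sim_dir.relative_to(results_dir).as_posix()
                    problems.append(f"{rel}: missing {name}")
    return problems
```

Calling `check_results_tree("results")` from the paper directory returns an empty list when every simulation directory has its input file and metadata, and otherwise lists what is missing.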