LibGuides: Data Services: Track Different Versions of Documents

Data Versioning

Definition

Versioning refers to saving new copies of your files when you make changes so that you can go back and retrieve specific versions of your files later.

Practices

Naming versions

When creating new versions of your files, record what changes are being made to the files and give the new files a unique name. Follow the general advice on the site for naming files, but also consider the following:

Include a version number, according to the practices laid out under naming files.
- Keep track of document versions either sequentially (e.g. v01, v02,) or with a unique date and time ( e.g. 20140403_1800)
  
  DO: FileNm_Guidelines_20140409_v01.docx
  
  DON’T: FileNm_Guidelines_20140409_Review.docx AND FileNm_Guidelines_20140409_Investigation.docx
  
  BECAUSE: Two years from now, you won’t remember what you meant.
Include information about the status of the file, e.g. "draft" or "final," as long as you don't end up with confusing names like "final2" or "final_revised".
Include information about what changes were made, e.g. "cropped" or "normalized".

Simple file versioning

One simple way to version files is to manually save new versions when you make significant changes. This works well if:

You don't need to keep a lot of different versions.
Only one person is working on the files.
The files are always accessed from one location.

Some useful practices include:

Save an untouched copy of the raw data, and leave it that way. Always work on something other than the "safe" untouched copy (it is always possible to go back to the "safe" data and make a new copy in order to start from scratch).
Avoid ambiguous labels, such as 'revision', 'final', 'final2', etc. Instead, use a file naming convention (like v001, v002 or v1_0, v1_2, v2_0).
Use a directory structure naming convention that includes version information. (See image below)

The directory below shows multiple versions of a web page mock-up called DMSSiteHome.jpg. Note the use of v1, v2, etc. to indicate versions. The notations "FISH" and "SandC" indicate different images that were swapped into some versions, i.e. major changes that were made.

Use tools that automatically assign version numbers to manage the data. It may be a good idea to test this method to make sure that it is possible to revert to earlier versions and that the tool functions as expected.

Overall, saving multiple versions makes it possible to decide at a later time that you prefer an earlier version. You can then immediately revert back to that version instead of having to retrace your steps to recreate it.

The Simple File Versioning method requires that you remember to save new versions when it is appropriate. This method can become confusing when collaborating on a document with multiple people.

Version control systems

What the manual process described above requires most is self-discipline. Another approach is to use a version control system that manages changes to a project. These system automate some steps while enforcing others and thereby require less self-discipline for more reliable results.

It's hard to know what version control tool is most widely used in research today, but the one that's most talked about is undoubtedly Git. This is largely because of GitHub, a popular hosting site that combines the technical infrastructure for collaboration via Git with a modern web interface. GitHub is free for public and open source projects and for users in academia and nonprofits. GitLab is a well-regarded alternative that some prefer, because the GitLab platform itself is free and open source.

The one caveat may be that these systems do come with a learning curve.

How version control systems work

A version control system stores snapshots of a project's files in a repository. Users modify their working copy of the project, then save changes to the repository when they wish to make a permanent record and/or share their work with colleagues. The version control system automatically records when the change was made and by whom, along with the changes themselves.

Crucially, if several people have edited 1 or more files simultaneously, the version control system will detect any overlapping changes and require conflicts to be resolved before storing the result. Modern version control systems allow repositories to be synchronized with each other, so that no 1 repository becomes a single point of failure. Tool-based version control has several benefits over manual version control:

Instead of requiring users to make backup copies of the whole project, version control safely stores just enough information to allow old versions of files to be recreated on demand.
Instead of relying on users to choose sensible names for backup copies, the version control system timestamps all saved changes automatically.
Instead of requiring users to be disciplined about completing the change log, version control systems prompt them every time a change is saved. They also keep a 100% accurate record of what was actuallychanged as opposed to what the user thought they changed, which can be invaluable when problems crop up later.
Instead of simply copying files to remote storage, version control checks to see whether doing that would overwrite anyone else's work. If so, they facilitate identifying conflict and merging changes.

What not to put under version control

The benefits of version control systems don't apply equally to all file types. In particular, version control can be more or less rewarding depending on file size and format. First, file comparison in version control systems is optimized for plain text files, such as source code. The ability to see so-called "diffs" is one of the great joys of version control. Unfortunately, while Microsoft Office files (like the .docx files used by Word) or other binary files, e.g., PDFs, can be stored in a version control system, it is not possible to pinpoint specific changes from 1 version to the next. Tabular data (such as CSV files) can be put in version control, but changing the order of the rows or columns will create a big change for the version control system, even if the actual data have not changed.

Second, raw data should not change and therefore should not require version tracking. Keeping intermediate data files and other results under version control is also not necessary if you can regenerate them from raw data and software. However, if data and results are small, we still recommend versioning them for ease of access by collaborators and for comparison across versions.

Third, today's version control systems are not designed to handle megabyte-sized files, never mind gigabytes, so large data or results files should not be included (as a benchmark for "large", the limit for an individual file on GitHub is 100 MB). Some emerging hybrid systems such as Git Large File Storage (LFS) put textual notes under version control while storing the large data in a remote server, but these are not yet mature enough for us to recommend.

Inadvertent sharing

Researchers dealing with data subject to legal restrictions that prohibit sharing (such as medical data) should be careful not to put data in public version control systems. Some institutions may provide access to private version control systems, so it is worth checking with your IT department.

Additionally, be sure not to unintentionally place security credentials such as passwords and private keys in a version control system where it may be accessed by others.

Software options

Brandon University

To learn about local software options consult IT Services at Brandon University. They can inform you how frequently backups occur.

Pros: As a local service option in BU, there are fewer concerns around storing data in countries that may have practices that differs from Canada when it comes to privacy and security.

Cons: Backup will likely not be real time in regards to versioning.

Google Drive

Drive's word processing, spreadsheet, and presentation software automatically create versions as you edit.

Any time you edit files created on Google Drive, new versions are saved as you go.
Version information includes who was editing the file and the date and time the new version was created.
You can also see what changes were made from one version to the next (or between the current version and any older version) and revert back to a previous version at any time.

Pros: The real-time editing feature means that Google Drive works well for collaborating on files with multiple people. And because the files are on Google Drive, they are accessible from anywhere.

Cons: You are restricted to the software provided by Google, which may not have all the bells and whistles of your desktop word processing, spreadsheet, or presentation software.

Advanced software options

If you have more sophisticated version control needs, you might consider a distributed version control system like git. Files are kept in a repository and users clone copies of the repository for editing and commit changes back to the repository when they are done.

Version control systems like git are frequently used for groups writing software and code, but can be used for any kind of files or projects. Many people share their git repositories on GitHub.

Adapted from:

Stanford Libraries, Research Support, Data Management Services. “Data versioning”. Accessed: February 22, 2021, https://library.stanford.edu/research/data-management-services/data-best-practices/data-versioning

Cornell University, Research Data Management Services Group. “File Management”. Accessed Feb 24th, 2021, https://data.research.cornell.edu/content/file-management

Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLoS Comput Biol 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510

Data Services

Data Versioning Definition