Modern science relies heavily on collaborative efforts and distributed expertise. Experiments may require a wide variety of disciplines, often located at different research institutes. Data that is analyzed and updated by a group of scientists, or even by a single experimenter, quickly exceeds the level of complexity that can be handled without a defined structure. Especially when work is wrapped up for a thesis or publication, the manipulations done to the data often lack proper description or traceability. To emphasize the benefit of version control, let me start with a small example:
A student makes some measurements and shows them to the supervisor. The supervisor has some ideas about the analysis, makes some changes, and sends the data to the lab head, who again changes the data by reordering some columns. In the end there is the final document with the analyzed data. So far so good. If the student then finds a typo in the measurements, the struggle begins. Either the student requests the final results from the lab head, fixes the typo, and hopes that the change did not corrupt any of the other data, or he/she hands the corrected measurements to the supervisor, who reprocesses the data and hands it to the lab head. Whichever option is chosen, it is uncertain whether all changes made in the first round were repeated in the second, and which changes depend on the typo. Good scientific practice obliges all changes to be traceable and requires analysis to be performed in parallel. In day-to-day routine, the folder structure of many collaborative student projects looks more or less as follows:
There is "some sort" of version control, but in practice the structure won’t be clear to future students working on the project.
The life journey of a single file of the described experiment is visualized in the following:
Fig 2: (A) Data processing pipeline with three people working sequentially on a shared file. (B) Data processing pipeline with three people working simultaneously on local copies of a file. In neither A nor B is it possible to track the changes that were made.
Since all participants are working on the data simultaneously, so-called merge conflicts (or “conflicting changes”) happen. You may know these from apps in which several users work on the same document simultaneously and modify the same sentence. Shared documents, e.g. in MS Office, take care of versioning and change tracking, but for data files or project directories you need dedicated tooling.
Git is a version control system for tracking changes in computer files. It takes snapshots of files that are located within a repository (repo), a place to store and describe all objects of your work as a maintained digital archive. In the simplest case the repository would be the ‘Bachelor’ folder from the example above. Git enables decentralized version control and coordination between multiple developers. At any time you can return to a prior step to look up the changes made up to that point, or to start over from the state the project had at that particular moment. Additionally you can inspect
Todo: Branches
There are many commands that can be used with git, but knowing only a few is sufficient to use git as a version control system.
git --help
git init
git checkout -b <branchname>
git status
git clone <pathToRemoteRepository>
git fetch
git pull
git add <filename|*>
git commit -m <commitmessage>
git push
Fig 3: git procedures. If you don’t have a local repo, you can clone it from the remote. If you already have it, you can pull the latest changes. After you have changed one or more files, you can add everything you want to commit to the staging area. Then you can write a concise commit message and commit to the local repository. Afterwards you can push your updates to the remote repository. If you just want to see what has changed in the remote, you can use fetch.
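The procedure can be tried without any risk in a throwaway directory. The following is a minimal sketch, not a prescribed workflow: the file name `measurements.csv`, the commit message, and the local bare repository `remote.git` (standing in for a remote such as one hosted on GitHub) are assumptions made for illustration only.

```shell
#!/bin/sh
set -e

# Work in a throwaway directory so nothing else on your machine is touched.
workdir="$(mktemp -d)"
cd "$workdir"

# Hypothetical stand-in for a remote repository (normally this would be a URL).
git init --quiet --bare remote.git

# No local repo yet: clone it from the remote.
# (git warns that the repository is empty; that is expected here.)
git clone --quiet remote.git local
cd local

# Identity for this repository only; name and e-mail are placeholders.
git config user.name "FIRST_NAME LAST_NAME"
git config user.email "MY_NAME@example.com"

# Change a file, add it to the staging area, and commit it locally.
echo "sample,value" > measurements.csv
git add measurements.csv
git commit --quiet -m "Add first measurements"

# Push the commit to the remote; HEAD means "the current branch".
git push --quiet origin HEAD

# See whether the remote has anything new, without changing local files.
git fetch --quiet
git status --short
```

In everyday use the bare repository is replaced by the address of the real remote, and `git pull` takes the place of `git clone` once a local copy exists.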
GitHub is a web-based hosting service for git repositories. It makes it easy for a group of people to work on the same project and provides convenient ways to inspect a repository visually. You do not have to use GitHub; there are several alternatives. The corresponding tree of the bachelor repository using git may look like this:
Fig 4: Each file in the repository is tracked, so you can go to every point in the past to inspect the state of the file at this moment. Each commit is described by: (i) username, (ii) timestamp, (iii) changelog and (iv) a commit message.
Check out the commit history of e.g. Plotly.NET to get an idea of what a repository history may look like.
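The same information can also be inspected on the command line. The sketch below builds a tiny repository with two commits and then looks at its past; the file name, its contents, and the commit messages are made up for the example.

```shell
#!/bin/sh
set -e

# Throwaway repository with placeholder identity.
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "FIRST_NAME LAST_NAME"
git config user.email "MY_NAME@example.com"

# First commit: the raw measurement (hypothetical data).
echo "sample_1,1.5" > measurements.csv
git add measurements.csv
git commit --quiet -m "Add raw measurements"

# Second commit: a corrected value.
echo "sample_1,1.6" > measurements.csv
git add measurements.csv
git commit --quiet -m "Fix typo in sample_1"

# History of the file: username | date | commit message.
# --no-pager prints directly instead of opening a viewer.
git --no-pager log --date=short --format='%an | %ad | %s' -- measurements.csv

# Changelog: what the latest commit changed compared to the one before.
git --no-pager diff HEAD~1 HEAD -- measurements.csv

# State of the file one commit in the past.
git --no-pager show HEAD~1:measurements.csv
```

`HEAD~1` refers to the commit directly before the current one; any commit hash shown by `git log` can be used in its place.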
Create a GitHub account at https://github.com/
Install git from https://git-scm.com/downloads
git config --global user.name "FIRST_NAME LAST_NAME"
git config --global user.email "MY_NAME@example.com"
Open Visual Studio Code