Casthoughts

Software and such

May 25, 2022

No magic, only engineering

Computers are not magic, but there is some very slick engineering out there.

“Any sufficiently advanced technology is indistinguishable from magic.”

Arthur C. Clarke

In the context of this well-known Arthur C. Clarke quote, it is common to hear various aspects of software called “magic”. Slack has a sign-in workflow that they call “magic link” and I will colloquially refer to command line methods as “incantations”. However, computers can be understood. This quote should be re-framed in a more optimistic way as:

“All ‘magic’ is just good engineering”

This framing acknowledges the hard work that goes into making computers “just work”. The next time you use a computer and it is a perfect, smooth experience that exactly solves your problem, remember there was a team of people who worked very hard to make that happen. There is work at every level, from designing the interaction, to implementing the layers involved, to making sure the services at every layer of the stack are running, to designing and fabricating the hardware. If any layer of this (massive) stack does not work just right, your experience will fail.

I find this an empowering point of view. Magic is mystical, arcane, something secret that is closely held, something that you have to be born with, something supernatural. Once something is deemed “magic” it is out of the ken of mere mortals; it must be accepted as-is, there is no point in even trying.

On the other hand, science and engineering exist to be understood. They are systematic frameworks to, respectively, understand the physical world and bend it to our will. Understanding any given system (natural or engineered) may not be easy and may take more time and energy than you want to spend, but it is always possible in principle. The statement is never “I can not understand this”, it is always “I do not understand this yet”.

The tools and methods of science and engineering are open to anyone who has the time, resources, and inclination to use them; anyone who says differently (like those who deny life is pain) is selling something. However, while this position is well and good in theory, in reality there are massive structural inequalities in who actually participates in STEM broadly and open source development specifically. It is incumbent on those of us who work in STEM fields to ensure equitable access and opportunity to all. Just like a good science challenge this is not going to be easy, but it must be “We have not fixed this yet”.

Reference

May 24, 2022

Think Like Git

This article is for people who already know how to use git day-to-day, but want a deeper understanding of the why of git, so they can do a better job reasoning about what should or should not be possible rather than just memorizing incantations.

While this text is going to (mostly) refer to the git CLI because it is the lowest common denominator (everyone who uses git has access to the CLI), there are many richer graphical user interfaces available (likely built into your IDE). There is nothing wrong with using a GUI for working with git nor is the CLI “morally superior” – anyone who says otherwise is engaging in gatekeeping nonsense. I personally use magit and gitk in my day-to-day work. Real programmers use tools that make them effective; if a GUI makes your life easier, use it.

For each of the CLI interfaces I’m highlighting I am only covering the functionality relevant to the point I’m making. Many of these CLIs can do more (and sometimes wildly different) things; see the links back to the documentation for the full details.

This article is focused on the version tracking aspect of git. I will only touch in passing on the fact that git uses content based addressing and how it actually encodes the state of the repository at each commit. These details are interesting in their own right and critical to the implementation of git being efficient (both time and space wise), but are out of scope for this article.

Another article in a similar vein to this, but starting from a user story and building up, is the Git Parable. When I read this essay years ago it made git “click” for me. If you have not read it, I suggest you go read it instead of this!

git’s view of the world

At the core, git keeps track of many (many) copies of your code, creating a snapshot whenever you commit. Along with the code, git attaches to each commit a block of text (the commit message), information about who wrote and committed the code and when, and which commits are the “parents” of the commit. The hash of all of this serves both as a globally unique name for the commit and to validate the commit.

Because each commit knows its parent(s), the commits form a directed acyclic graph (DAG). The code snapshots and metadata are the nodes, the parent relationships define the edges, and because you can only go backwards in history (commits do not know who their children are) it is directed. DAGs are a relatively common data structure in programming (and if you need to work with them in Python check out networkx). By identifying a DAG as the core data structure of git’s view of history we can start to develop intuition about which operations will be easy on git history (those that are easy to express as operations on a DAG). Using this intuition, we can (hopefully) start to guess how git would probably implement the functionality we need to actually get our work done!
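To make the graph picture concrete, here is a minimal Python sketch. It is a toy model and not how git stores anything: the commit names and the `PARENTS` mapping are made up for illustration. The only thing each commit records is its parents, so walking backwards through history is easy, while finding children means scanning every commit.

```python
# Toy history: each commit maps to the list of its parents (hypothetical names).
PARENTS = {
    "A": [],          # root commit
    "B": ["A"],
    "C": ["B"],       # work on one branch
    "D": ["B"],       # work on another branch
    "E": ["C", "D"],  # merge commit with two parents
}


def ancestors(commit):
    """Return every commit reachable by following parent links."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        for parent in PARENTS[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def children(commit):
    """Commits do not know their children, so we must scan the whole graph."""
    return {name for name, parents in PARENTS.items() if commit in parents}


print(sorted(ancestors("E")))  # every commit that E builds on
print(sorted(children("B")))   # found only by searching all commits
```

Operations that only need to walk towards the parents (log, diff against an ancestor, finding a common ancestor) fit this structure naturally.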

Because the hash includes information about the parents, the tree of commits forms a variation on a Merkle Tree. Using these hashes you can validate that a git repository is self-consistent and that the source you have checked out is indeed the source that was checked in. If you and a collaborator both have a clone of a shared project then they can send you just the hash of a commit and you can be sure that you have both an identical working tree and identical history.
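The chaining of hashes can be sketched in a few lines of Python. This is a simplified stand-in, assuming a made-up payload layout; it is not git's real object format, and `commit_id` is a hypothetical helper. The point is only that a commit's name depends on its parents' names, so changing anything in the history changes the name of everything built on top of it.

```python
import hashlib


def commit_id(snapshot, message, parent_ids):
    """Hash the content, the message, and the parents' ids together.

    Hypothetical payload layout, chosen only for illustration.
    """
    payload = "\n".join([snapshot, message, *parent_ids])
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


root = commit_id("print('hi')", "first commit", [])
child = commit_id("print('hello')", "improve greeting", [root])

# Same child content and message, but sitting on top of a tampered root.
tampered_root = commit_id("print('hi')  # evil", "first commit", [])
tampered_child = commit_id("print('hello')", "improve greeting", [tampered_root])

# The child's id changes even though the child's own content did not.
print(child != tampered_child)
```

This is why agreeing on a single hash is enough to agree on the whole history behind it.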

Given such a graph, what operations would we want to do to it? For example, we want to:

  1. get a repository to work with (git clone, git init)
  2. give commits human readable names (git tag, git branch)
  3. compare source between commits (git diff)
  4. look at the whole graph of commits (gitk, git log)
  5. look at a commit (both the code content and meta-data) (gitk, git switch, git checkout)
  6. add commits (git stage, git add, git commit)
  7. discard changes (both local changes and whole commits) (git reset, git restore, git clean, git checkout)
  8. change/move commits around the graph (git rebase, git cherry-pick)
  9. share your code (and history) with your friends (git push, git fetch, git remote, git merge)
  10. have more than one commit checked out at a time (git worktree)

What does it mean to be distributed (but centralized)?

From a technical standpoint no clone of a git repository is more special than any other. Each contains a self-consistent section of the history of the repository and they can all share that information with each other. From a certain point of view, there is only one global history which consists of every commit any developer on any computer has ever created and any given computer only ever has a sub-graph of the full history.

While technically pure, fully distributed collaboration is deeply impractical. Almost every project has socially picked a central repository to be considered the “canonical” repository. For example for Matplotlib matplotlib/matplotlib is the ground truth repository. At the end of the day what is Matplotlib the library is that git history, full stop. Because of the special social role that repository holds, only people with commit rights are able to push to that repository and we have an agreed-on social process for deciding who gets that access and what code gets merged. When people talk about a project having a “hard fork” or a “hostile fork” they are referring to a community that has split about which repository is “the one” and who has the ability to push to it.

Similarly, while every commit has a (globally) unique name – its hash – these names are effectively unusable by humans. The branch and tag names that we use are for the humans and any meaning we attach to the names is purely social. Within the canonical repository there is a particular branch which is identified as the branch for new development, along with (optionally) a handful of other “official” branches for maintaining bug-fix series. The exact details of the names, the life cycles and the development workflow will vary from team-to-team. For example on Matplotlib we have a main branch for new development, the vX.Y.x branches which are the maintenance branches for each vX.Y.0 minor release, and vX.Y-doc for the version specific documentation. To git these names are meaningless, but socially they are critical.

In the standard fork-based development workflow that many open source software projects use, the commits move from less visible but loosely controlled parts of the global graph to more public and controlled parts. For example anyone can create commits on their local clone at will! However no one else can (easily) see them and those commits are inaccessible to almost everyone else who has part of the total graph. A developer can then choose to publish their commits to a public location (for example I push all of my work on Matplotlib to tacaswell/matplotlib first). Once the commits are public anyone can see them but only a handful of people are likely to actually access them. To get the code into the canonical repository, and hence used by everyone, the user can request that the committers to the canonical repository “pull” or “merge” their branch into the default branch. If this “pull request” is accepted and merged to the default branch then that code (and commit history) is forever part of the project’s history.

Get a graph to work with

The most common way to get a copy of a project history is not to start ab initio, but to get a copy of a preexisting history. Any given project only starts once, but over time will receive many more commits (this repository already has 20+ commits, Matplotlib has over 43k, the kernel has over 1 million).

To get a local copy of a repository so you can start working on it you use the git clone sub-command:

git clone url_to_remote    # will create a new directory in the CWD

By default git will fetch everything from the remote repository (there are ways to reduce this for big repositories). If you clone from the canonical repository then you have the complete up-to-date official history of the project on your computer!

If you need to create a new repository use the git init sub-command:

git init

However, I have probably only ever used git init a few dozen times in my career, whereas I use git clone a few dozen times a week.

Label a commit

From the hash we have a globally unique identifier for each commit, however these hashes look something like: 6f8bc7c6f192f664a7ab2e4ff200d050bb2edc8f. While unique and well-suited for a computer, it is neither memorable nor does it roll off the tongue. This is partly ameliorated because anyplace that the git interface takes a SHA you can instead pass a prefix, e.g. 6f8bc7 for the SHA above. However, the number of characters needed to ensure that the prefix is actually unique depends on the size of the repository.
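The dependence on repository size can be sketched as follows. This is an illustration only, not git's implementation: `shortest_unique_prefix` is a hypothetical helper, the first hash is the example from the text, and the other two hashes are made up so that they collide with it on the first few characters.

```python
def shortest_unique_prefix(target, all_hashes, minimum=4):
    """Return the shortest prefix of target that no other hash starts with."""
    others = [h for h in all_hashes if h != target]
    for length in range(minimum, len(target) + 1):
        prefix = target[:length]
        if not any(other.startswith(prefix) for other in others):
            return prefix
    return target


# The first hash is the example from the text; the others are invented.
TARGET = "6f8bc7c6f192f664a7ab2e4ff200d050bb2edc8f"
NEAR = "6f8b" + "1" * 36      # shares the first 4 characters with TARGET
NEARER = "6f8bc7" + "0" * 34  # shares the first 6 characters with TARGET

print(shortest_unique_prefix(TARGET, [TARGET]))
print(shortest_unique_prefix(TARGET, [TARGET, NEAR]))
print(shortest_unique_prefix(TARGET, [TARGET, NEAR, NEARER]))
```

The more commits a repository holds, the more likely it is that some other hash shares a short prefix, so the prefix you need to type grows with the history.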

To give a human-friendly name to a commit git offers two flavors of labels: branches and tags. The conceptual difference is that a branch is expected to move between commits over time and tags are fixed to a particular commit for all time.

Branches point to a fixed concept. As discussed above, most repositories have a socially designated “canonical branch” that is the point of truth for the development effort. The exact name does not matter, but common names include “main”, “trunk”, or “devel”. It is also conventional to do new development on a “development” branch, named anything but the canonical branch name. This enables you to keep multiple independent work directions in flight at a time and easily discard any work that turns out to be less of a good idea than you thought.

To list, create, and delete branches use the git branch sub-command. The most important incantations are:

git branch            # list local branches
git branch <name>     # create a branch
git branch -d <name>  # delete a branch, if safe
git branch -D <name>  # delete a branch over git's concerns

In git, branches are cheap to make; when in doubt, make a new branch!

In contrast tags label a specific commit and never move. This is used most often for identifying released versions of software (e.g. v1.5.2). To work with tags use the git tag sub-command. The most important incantations are:

git tag           # list tags
git tag -a <name> # create a new tag

You should always create “annotated” tags. If the commit is important enough to get a permanent name, it is important enough to get an explanation of why you gave it a name.

In git jargon, branches and tags are both “refs”. See the docs if you want even more details about how git encodes these.
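The difference between the two kinds of label can be sketched with a toy model. `ToyRepo` and its methods are hypothetical names for illustration and say nothing about how git stores refs on disk; the only claim is the conceptual one from above, that committing moves the current branch while a tag stays where it was put.

```python
class ToyRepo:
    """A toy repository: labels are just names that point at commit ids."""

    def __init__(self, branch="main"):
        self.parents = {}    # commit id -> list of parent ids
        self.branches = {}   # branch name -> commit id (moves over time)
        self.tags = {}       # tag name -> commit id (never moves)
        self.current = branch
        self._counter = 0

    def commit(self):
        """Create a commit on the current branch and advance that branch."""
        self._counter += 1
        new_id = f"c{self._counter}"
        tip = self.branches.get(self.current)
        self.parents[new_id] = [tip] if tip else []
        self.branches[self.current] = new_id
        return new_id

    def tag(self, name):
        """Attach a permanent name to the commit that is checked out."""
        if name in self.tags:
            raise ValueError(f"tag {name!r} already exists")
        self.tags[name] = self.branches[self.current]


repo = ToyRepo()
repo.commit()
repo.tag("v1.0")
repo.commit()
print(repo.branches["main"], repo.tags["v1.0"])
```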

Compare source between nodes

There is a “dual space” relationship between the code state at each commit and the differences between the commits. If you have one you can always compute the other. At first glance the natural way to track the changes of source over time is to track the differences (this is in fact how many earlier version control systems worked!). However git (and mercurial) instead track the full state of the source at each commit, which solves a number of performance problems and enables some additional operations.

Because the diffs between subsequent commits are derived, it is just as easy to compute the diff between any two commits using the git diff sub-command! To get the difference between two commits:

git diff <before> <after>

which will give you a patch that if applied to the <before> commit will land you at the <after> commit. If you want to get a patch that will undo a change swap the order of the commits.
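Here is a small Python sketch of the “snapshots first, diffs derived” idea, using the standard library's `difflib`. The `snapshots` mapping and the commit names are invented for illustration, and each snapshot is a single file rather than a whole tree. Because the full state is stored, a diff between two non-adjacent commits takes no more work than a diff between neighbours, and swapping the arguments gives the reverse patch.

```python
import difflib

# Hypothetical history of one file: commit name -> full content at that commit.
snapshots = {
    "c1": "a\nb\n",
    "c2": "a\nb\nc\n",
    "c3": "a\nB\nc\n",
}


def diff(before, after):
    """Compute a unified diff between any two stored snapshots."""
    return list(
        difflib.unified_diff(
            snapshots[before].splitlines(),
            snapshots[after].splitlines(),
            fromfile=before,
            tofile=after,
            lineterm="",
        )
    )


print("\n".join(diff("c1", "c3")))  # skips straight over c2
print("\n".join(diff("c3", "c1")))  # the patch that undoes the change
```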

Calling git diff without any arguments is a very common command that will show any changes in your working tree that have not yet been staged.

Look at the whole tree

It is useful to look at the whole graph. There is the git log sub-command which will show you text versions of history, however this is an application where a GUI interface really shines. There is so much information available:

  • the commit message
  • the author and committer
  • dates
  • the computed diffs
  • the connectivity between the commits

that it is difficult to see it all and navigate it in a pure text interface.

My preferred tool for exploring the full history is gitk which is typically installed with git. It is a bit ugly, but it works! In addition to visualizing the tree it also has a user interface for searching both the commit messages and the code changes and for limiting the history to only particular files.

Look at a node

When working with a git repository on your computer you almost always have one of the commits materialized into a working tree (or more than one with the git worktree sub-command). The working tree is, as the name suggests, where you actually do your work! We will come back to this in the next section.

To checkout a particular commit (or tag or branch) you can use the git checkout sub-command as

git checkout <commit hash>  # checks out a particular commit
git checkout <tag name>     # checks out a particular tag
git checkout <branch name>  # checks out a particular branch

In addition, there is also a new git switch sub-command that is specifically for switching branches.

git switch <branch>

which is more scoped (git checkout has a number of other features) and clearly named.

If you want to see the history of what commits you have had checked out (as opposed to the history of the repository) you can use the git reflog sub-command. While not something to use day-to-day, it can save your bacon in cases where you have accidentally lost references to a commit.

Adding nodes

The most important, and likely most common, operation we do on the graph is to add new commits!

As mentioned above, when you checkout a branch on your computer you have a working tree that starts at the state of the commit you have checked out. There is a special name, HEAD, that can be used anywhere a commit is expected and means “the commit that is currently checked out in your work tree”. There is also the short hand HEAD^ which means “the commit before the one checked out”, HEAD^^ which means “the commit two before the one checked out”, and so on for repeated ^.
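Resolving these names is just a walk along the parent links. The sketch below is a toy resolver: `resolve` and the `PARENTS` mapping are hypothetical, and it handles only HEAD followed by any number of ^, following the first parent at each step.

```python
# Hypothetical linear history: c1 <- c2 <- c3, with c3 checked out.
PARENTS = {"c1": [], "c2": ["c1"], "c3": ["c2"]}


def resolve(rev, head, parents):
    """Resolve 'HEAD' followed by any number of '^' to a commit id."""
    name = rev.rstrip("^")
    if name != "HEAD":
        raise ValueError(f"this toy resolver only understands HEAD, got {rev!r}")
    commit = head
    for _ in range(len(rev) - len(name)):
        if not parents[commit]:
            raise ValueError(f"{commit} has no parent")
        commit = parents[commit][0]  # each '^' steps to the first parent
    return commit


print(resolve("HEAD", "c3", PARENTS))
print(resolve("HEAD^", "c3", PARENTS))
print(resolve("HEAD^^", "c3", PARENTS))
```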

As you make changes, two commands are commonly useful: the git status sub-command and the git diff sub-command. git status will give you a summary of what changes you have in the local tree relative to HEAD and which changes are staged. git diff, when called with no arguments, will show the detailed diff between the current working tree and what is staged, which is the same as the diff against HEAD if you have not staged anything yet.

As you work on your code, git does not require you to commit all of your changes at once; to enable this, committing is a two-stage process. The first step is to use the git add sub-command to stage changes

git add path/to/file  # to stage all the changes in a file
git add -p            # to stage by hunk

Once you have staged all of the changes you want, you create a new commit via the git commit sub-command

git commit -m "Short Message"   # commit with a short commit message
git commit                      # open an editor to write a commit message
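The two-stage process can be pictured as three copies of your files: the last commit, the staging area, and the working tree. The sketch below is a toy model with invented file names and contents, and `add`, `commit`, and `unstaged` are hypothetical helpers; it only shows that a commit snapshots what has been staged, not everything you have edited.

```python
head = {"a.py": "v1", "b.py": "v1"}      # snapshot of the last commit
index = dict(head)                       # staging area starts equal to HEAD
worktree = {"a.py": "v2", "b.py": "v2"}  # both files have been edited


def add(path):
    """Stage the working tree version of one file."""
    index[path] = worktree[path]


def commit():
    """A new commit is a snapshot of whatever is currently staged."""
    return dict(index)


def unstaged():
    """Files whose working tree content differs from what is staged."""
    return {path for path in worktree if worktree[path] != index.get(path)}


add("a.py")
head = commit()
print(head)        # only the staged change made it into the commit
print(unstaged())  # the edit to b.py is still waiting in the working tree
```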

Writing a commit message is one of the most important parts of using git. While it is frequently possible to reconstruct the what of a code change from the source alone, it can be impossible to reconstruct the why of the change. The commit message is a place that you can leave notes to your collaborators explaining the motivations of the change. Remember that your most frequent collaborator is your future / past self! For a comprehensive guide to writing good commit messages see this article.

As git encourages the creation of branches for new development, when the work is done (via the cycle above) we will need to merge this work back into the canonical branch, which is done via the git merge sub-command. By default, this will create a new commit on your current branch which has two parents (the tips of each branch involved).

git merge <other branch>   # merge other branch into the current branch

If you are using a code hosting platform (GitHub, GitLab, BitBucket, …) this command will typically be done through the web UI’s “merge” button.

Discard changes

Not all changes are a good idea, sometimes you need to go back.

If you have not yet committed your changes then they can be discarded using the git checkout sub-command (yes the same one we used to change branches)

git checkout path/to/file  # discard local changes

There is also the new git restore sub-command which is more tightly scoped to discarding local file changes

git restore path/to/file

If you have files in your working directory that git is not currently tracking you can use the git clean sub-command.

git clean -xfd    # purge any untracked files (including ignored files)

If you need to discard commits you can use the git reset sub-command. By default git reset will move the current branch to a different commit but not change your working tree (so you keep all of the code changes).

git reset HEAD^      # move the branch back one, keep working tree the same
git reset HEAD^^     # move the branch back two, keep working tree the same
git reset <a SHA1>   # move the branch to a commit, keep working tree the same

This can be very useful if you like the changes you made, but not the commits or commit messages.

Alternatively if you want to discard the commits and the changes you can use the --hard flag:

git reset --hard HEAD^     # move the branch back one, discard all changes
git reset --hard HEAD^^    # move the branch back two, discard all changes
git reset --hard <a SHA1>  # move the branch to a commit, discard all changes

Be aware that these can be destructive commands! If you move a branch back there may be commits that become inaccessible (remember commits only know their parents). This is where the git reflog sub-command can help recover the lost commits.
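Why commits get “lost”, and why the reflog can find them again, falls out of the graph model. The sketch below is a toy: the commit names, `branches`, and the list standing in for the reflog are all invented, and the real reflog records much more than this. After the branch is moved back, nothing points at the newer commits any more, but a record of where HEAD has been still does.

```python
# Hypothetical linear history: c1 <- c2 <- c3, with main at c3.
PARENTS = {"c1": [], "c2": ["c1"], "c3": ["c2"]}
branches = {"main": "c3"}
reflog = ["c3"]  # every commit HEAD has pointed at, oldest first


def reachable(tips):
    """All commits that can be found by walking parents from the given tips."""
    seen = set()
    stack = list(tips)
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(PARENTS[commit])
    return seen


def reset(branch, target):
    """Move a branch label to another commit and record where HEAD went."""
    branches[branch] = target
    reflog.append(target)


reset("main", "c1")
lost = set(PARENTS) - reachable(branches.values())
recoverable = lost & reachable(reflog)
print(sorted(lost))         # no branch leads to these any more
print(sorted(recoverable))  # but the record of old HEAD positions does
```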

git commands may create objects behind the scenes that ultimately become inaccessible. git will clean these up on its own, but you can manually trigger this clean up via the git gc sub-command.

If you have accidentally committed something sensitive, but not yet pushed, you can use these tools to purge it. If you have pushed the commit you will need some heavier-duty tools.

Change or move nodes

Due to the way the hashes work in git you can not truly change a commit, but you can modify and recommit it or make copies elsewhere in the graph. Remember that if you have already shared the commits you are replacing you will have to force-push them again (see below). Be very careful about doing this to any branch that many other people are using.

If you have just created a commit and realized you need to add one more change you can use the --amend flag to the git commit sub-command.

# hack
git add path/to/file     # stage the changes like normal
git commit --amend       # add the changes to the HEAD commit

This does not actually change the old commit. A commit is uniquely identified by its hash and the hash includes the state of the code, thus “amending” a commit creates a new commit and then resets the current branch to point to the new commit and orphans the old commit.

If you want to move a range of commits from one place to another you can use the git rebase sub-command.

git rebase target_branch     # rebase the current branch onto target_branch

which will attempt to “replay” the changes in each of the commits on your current branch on top of target_branch. If there are conflicts that git can not automatically resolve it will pause for you to manually resolve the conflicts and stage the changes to continue or abort the whole rebase

git rebase --continue   #  continue with your manual resolution
git rebase --abort      #  abort and go back to where you started
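Since a commit's name depends on its parent, “moving” commits really means making new ones. The Python sketch below illustrates this under heavy simplification: `commit_id` and `replay` are hypothetical helpers, each commit is reduced to a one-line description of its change, and conflicts are not modelled at all. Replaying the same changes onto a different base produces commits with entirely new names.

```python
import hashlib


def commit_id(change, parent_id):
    """Toy commit name: a hash of the change together with the parent's name."""
    return hashlib.sha1(f"{parent_id}\n{change}".encode("utf-8")).hexdigest()[:8]


def replay(changes, base_id):
    """Apply each change in order on top of base_id and return the new names."""
    names = []
    parent = base_id
    for change in changes:
        parent = commit_id(change, parent)
        names.append(parent)
    return names


old_base = commit_id("initial import", "")
new_base = commit_id("fix on target_branch", old_base)
feature_changes = ["add parser", "add tests"]

before = replay(feature_changes, old_base)  # the branch as originally written
after = replay(feature_changes, new_base)   # the same changes after a rebase

print(before)
print(after)
```

The changes and their order are the same, but none of the old names survive, which is why rebased branches need special care once they have been shared.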

If you want to re-order, drop or combine commits you can use:

git rebase -i                # interactively rebase, squash and re-order

which will open an editor with instructions. This can be particularly useful if you want to commit early and often as you work, but when you are done re-order and re-arrange the changes into a smaller number of better organized commits to tell a better story.

Common reasons to be asked to rebase (and squash) a branch are that your development branch has grown merge conflicts and the project prefers rebasing over merging the default branch back into the development branches, or that your commit history has too many “noise” commits (small typo fixes, reversions of work, committing and then deleting files).

To move a commit from one branch to another use the git cherry-pick sub-command which is conceptually similar to git rebase

git cherry-pick <commit>      # pick the commit on to the current branch
git cherry-pick -m 1 <commit> # pick a merge commit onto the current branch
git cherry-pick --continue    # continue if you have to manually resolve conflicts
git cherry-pick --skip        # drop a redundant commit
git cherry-pick --abort       # give up and go back to where you started

In all of these cases, the git reflog sub-command can be useful if things do not go as you expect!

Sharing with your friends

So far we have not talked much about any of the collaborative or distributed nature of git. Except for git clone, every command so far can be done only with information that git has on your computer and can be done without a network connection. This lets you work in your own private enclave, either temporarily, because you are working on a laptop on commuter rail or are not yet ready to share your work, or permanently if you just prefer to work alone.

While version control is useful if you are working alone (your most frequent collaborator is your future / past self and version control can save you from typos), it really shines when you are working with other people. To share code with others we need a notion of a shared history. Given that under the hood git is a graph of nodes uniquely named by their content “all” we have to do is be able to share information about the commits, branches, and tags between the different computers!

After an initial git clone there is one “remote”, by default named origin, pointing to wherever you cloned from. To modify an existing remote or add a new remote use the git remote sub-command.

git remote add <name> <url>     # add a new remote
git remote rm <name>            # delete a remote
git remote rename <old> <new>   # rename a remote from old -> new

Once you have one or more remotes set up, the first thing we want to do is get new commits from the remotes via the git fetch sub-command or the git remote sub-command.

git fetch <name>      # fetch just one remote
git fetch --all       # fetch all of the remotes
git remote update     # update all the remotes

The git pull sub-command combines a fetch and a merge into one command. While this seems convenient, it will frequently generate unexpected merge commits that take longer to clean up than being explicit about fetching and merging separately.

git merge --ff-only remote/branch   # merge remote branch into the local branch

The --ff-only flag fails unless the history can be “fast forwarded” meaning that only the remote branch has new commits.
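In graph terms, “can be fast forwarded” means that the local tip is already part of the remote tip's history, so the local branch label can simply slide forward. Here is a toy check of that condition; the commit names and the helpers `history` and `can_fast_forward` are made up for illustration and this is not how git implements it.

```python
# Hypothetical graph: c1 <- c2, then c3 on the remote and l1 locally.
PARENTS = {
    "c1": [],
    "c2": ["c1"],
    "c3": ["c2"],  # new commit that exists only on the remote
    "l1": ["c2"],  # new commit that exists only locally
}


def history(tip):
    """The tip plus every commit reachable from it through parent links."""
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(PARENTS[commit])
    return seen


def can_fast_forward(local_tip, remote_tip):
    """True if moving the local branch to remote_tip loses no local commits."""
    return local_tip in history(remote_tip)


print(can_fast_forward("c2", "c3"))  # only the remote has new commits
print(can_fast_forward("l1", "c3"))  # both sides have new commits
```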

To share your work with others you need to put the commits someplace other people can see it. The exact details of this depend on the workflow of the project and team, but if using a hosting platform this is done via the git push sub-command.

git push <remote> <branch_name>       # push the branch_name to remote

Given that in a typical workflow you are likely to be pushing to the same branch on the same remote many times git has streamlined ways of keeping track of the association between your local branch and a remote branch on a (presumably) more public location. By telling git about this association we save both typing and the chance of mistakes due to typos.

git branch --set-upstream-to <remote>/<branch_name>  # set an "upstream"
git push                                             # "do the right thing" with upstream set

If you try to push commits to a remote branch that has commits that are not on your local branch git will reject the push. The course of action depends on why you are missing commits. If there are new commits on the remote branch that you have not fetched before, then you should either merge the remote branch into your local branch before pushing or rebase your local branch on the remote branch and push again.

git can not tell the difference between new commits on the remote and old commits that are only on the remote because you have re-written history locally (via git commit --amend or git rebase). In the second case you have to do something a bit ... dangerous. git is detecting that if the remote branch were updated to where the local branch is, some commits would become inaccessible, and it is protecting you from yourself. However, if you are sure, you can tell git to trust your judgment and do it anyway:

git push --force-with-lease

Be very careful about doing this to branches that other people are relying on and have checked out. Other people will have the same problem you just had, but in reverse. git can not tell that the re-written commits are “right” and the history on the other user’s computer is “wrong”. They will be presented with the same options you just had and may re-force-push your changes out of existence. We recently had to re-write the history on the default Matplotlib branch and it required a fair amount of planning and work to manage.

Checking out more than one commit

When you checkout a commit git materializes the code into the directory where the repository is cloned and your local directory is made to match the tree of the commit. Thus, it is logically impossible to have more than one commit checked out at once in a single directory. However, it can be extremely useful to have more than one commit checked out at once if you are working on a project with multiple “live” branches. One way around this is to simply clone the repository N times, however because each repository is unaware of the others, you will have N complete copies of the repository and each will have to be synchronized with its remotes independently, etc. To make this efficient you can use the git worktree sub-command

git worktree add ../somepath branch_name

This will share all of the git resources and configuration with the main git worktree. One surprising limitation of the worktrees is that you can only have a given branch checked out in at most one worktree at a time.

git config

There are many (many) knobs to configure the default behavior. I suggest starting with these settings:

[transfer]
    # actually verify the hashes
    fsckobjects = true
[fetch]
    # actually verify the hashes
    fsckobjects = true
    # automatically drop branches that are deleted on the remotes
    prune = true
    # fetch remotes in parallel
    parallel = 0
[receive]
    # actually verify the hashes
    fsckObjects = true
[pull]
    # requires opting-into creating a merge commit locally.
    # Given a platform based workflow, this prevents unintentional merge
    # commits that need to be un-wound
    ff = only
[merge]
    # same as above
    ff = only
[color]
    # colours are always fun
    ui = auto
[init]
    # get ahead of the master -> main change
    defaultBranch = main
[feature]
    # opt in to defaults tuned for repositories with many files
    manyFiles = true
[alias]
    # this gives `git credit` as an alternative to `git blame`, just
    # puts you in a more positive mind set while using it.
    credit = blame

Other things you might want to do

There are obviously many things that git can do that are not covered here. Some things that I have had to do from time-to-time but did not make the cut for this article include:

  • track the history of a line of code back in time (gitk, git blame + UI tooling, git log)
  • find the commit that broke something (git bisect)
  • merge un-related git histories into one (git merge --allow-unrelated-histories)
  • extract the history of a sub-directory into its own repository (git filter-branch)
  • purge a particular file (or change) from the history (git filter-branch or BFG repo-cleaner)
  • fast searching (git grep)
  • ask git to clean up after itself (git gc)

Other resources

Acknowledgments

Thank you to James Powell, Alex Held, Dora Caswell and the other beta-readers who read (or listened to) early drafts of this post and provided valuable feedback. Thank you to Elliott Sales de Andrade for pointing out git restore.

May 09, 2022

Why I am fanatical about version control

A short horror story that could have been prevented by git.

So, there I was. The year was 2007 and I was in the process of wrapping up the research I had done as an undergraduate. One of my tasks before heading off to grad school was to archive all of the code I had written over the last two or so years to DVD-R for the lab archive.

My editor writes temporary backup copies while editing to the filename with a ~ appended. Thus, for every source file foo.m I had been working on there was a paired foo.m~ in the directory. At the time I was working in MATLAB and, as was the convention, had all of my source in one directory. This was the code that I used to generate the figures in my senior thesis and – more critically – the figures of a paper we had in preparation which would require further revision. Not wanting to archive transitory and duplicate files I happily typed

cd all/of/my-source
rm * ~

which very efficiently deleted all of the code I had written over the last two years (“Computers: making mistakes faster than you possibly imagined!”).

I did what any reasonable person would do and spent the next five minutes lying on the floor trying to stave off a panic attack.

In this particular case I got stupendously lucky. I had every file open in an editor and was able to go through and systematically add a space and re-save all of them. From then on, everything I have written that looks like text that I suspect I may want to keep is very quickly committed into a version control system and distributed to at least two computers.