So you have been handed a generator co-routine…
This is an explanation of how to practically work with generator coroutines in
Python. If you are willing to accept the rules laid out in this (not so) short
guide as empirical observations then you should be able to productively work
with generator co-routines. In a future post I intend to dive into the “why”
(and the “why this makes sense”), but this is the pragmatic view.
Generator coroutines in Python have two bi-directional communication channels
- data via yield / send()
- exceptions via raise / throw()
and two unidirectional channels
- data via return
- close() to exit the coroutine immediately
Another way to look at this is that generator co-routines have two “happy path”
communication channels:
- data via yield / send()
- data via return
and two “sad path” communication channels:
- exceptions via raise / throw()
- close() to exit the coroutine immediately
Each of these channels has a different purpose and without one of them
co-routines would be incomplete. You may not need to (explicitly) use all of
these channels in any given application.
yield / send() data channel
The first half of this channel is yield which will (as the name suggests)
yield values out of the coroutine. If we only use the yield then we have a
“generator function”, for example if we write
def my_gen():
yield 'a'
yield 'b'
yield 'c'
which we can then use with the iteration protocol as:
>>> list(my_gen())
['a', 'b', 'c']
More explicitly, what list (or a for loop) is doing under the hood is:
>>> g = my_gen()
>>> next(g)
'a'
>>> next(g)
'b'
>>> next(g)
'c'
>>> next(g)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
The way that the generator communicates that it is exhausted is by raising the
StopIteration exception.
We will come back to the raised Exception object in a bit.
Using yield we can get information out of a generator coroutine; to get data
into the generator coroutine we need to capture the left-hand side of the yield
as
def my_gen():
in1 = yield 'a'
print(f'got {in1!r}')
in2 = yield 'b'
print(f'got {in2!r}')
If we pass that to list we see:
>>> list(my_gen())
got None
got None
['a', 'b']
What this (and next) is doing under the hood is
>>> g = my_gen()
>>> g.send(None)
'a'
>>> g.send(None)
got None
'b'
>>> g.send(None)
got None
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
The sequence is:
1. Create the generator. At this point no code has run yet.
2. The first .send() runs the coroutine up to the first yield and sends the
right hand side out. The value of the first .send() must be None because
there is no way to access the value passed in.
3. The coroutine is suspended until the next send(). The value passed to the
second send() is assigned to the left hand side of the yield expression.
4. The coroutine runs until the next yield and sends out the right hand side.
We then go back to step 3 until there are no more yield expressions in the
coroutine.
5. When the coroutine returns Python will raise the StopIteration exception
for us.
To see this more clearly, re-running the above code but sending in different
values at each step:
>>> g = my_gen()
>>> g.send(None)
'a'
>>> g.send('step 1')
got 'step 1'
'b'
>>> g.send('step 2')
got 'step 2'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
return data channel
So far we have not used return, relying instead on the implicit return None
that Python provides. As with any Python function we can put a return in our
coroutine:
def my_gen():
in1 = yield 'a'
print(f'got {in1!r}')
in2 = yield 'b'
print(f'got {in2!r}')
return 'Done!'
However, this raises the question of how we get at the returned value. It
can not come back as the return value of .send() as that is where the yield
values are carried. Instead the value is carried on the StopIteration
exception that is raised when the iterator is exhausted.
>>> g = my_gen()
>>> g.send(None)
'a'
>>> g.send('step 1')
got 'step 1'
'b'
>>> g.send('step 2')
got 'step 2'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration: Done!
>>>
To get the value we need to catch the StopIteration and access ex.value.
>>> from itertools import count
>>> gen = my_gen()
>>> print('yielded: ', gen.send(None))
yielded: a
>>> for j in count(1):
... try:
... print('yielded: ', gen.send(f'step {j}'))
... except StopIteration as ex:
... print(f'Returned: {ex.value}')
... break
...
got 'step 1'
yielded: b
got 'step 2'
Returned: Done!
It may be tempting to try and raise your own StopIteration rather than
returning, however if you do Python will convert it to a RuntimeError. This
is because Python can not tell the difference between you intentionally
raising StopIteration and something you have called unexpectedly raising
StopIteration. Pre Python 3.5 the StopIteration would be raised to the outer
caller, which would be interpreted as the generator returning normally, which
in turn would mask bugs in very confusing ways.
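For example, on a recent Python, a minimal sketch of what happens (the
generator here is purely illustrative):
>>> def bad_gen():
...     yield 'a'
...     raise StopIteration  # do not do this
...
>>> list(bad_gen())
Traceback (most recent call last):
  ...
RuntimeError: generator raised StopIteration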
raise / throw() channel
As with all Python code we can use the standard exception raising and handling
tools, however there are a couple of caveats.
- The co-routine must not raise StopIteration (as noted above).
- If you catch GeneratorExit the co-routine must return (see the section on
close() below).
Any valid exception raised will in turn be raised from the call point of the
obj.send() in the outer code, identically to how an exception raised in a
function will propagate to the call site.
>>> def my_gen(N):
... for j in count():
... if j >= N:
... return f"Got {N} ints and are done"
... a = yield f'{j}/{N} ints in a row'
... if not isinstance(a, int):
... raise ValueError("We only take integers!")
...
If we exhaust the “happy path” of this co-routine we see:
>>> gen = my_gen(3)
>>> gen.send(None)
'0/3 ints in a row'
>>> gen.send(1)
'1/3 ints in a row'
>>> gen.send(-1)
'2/3 ints in a row'
>>> gen.send(10)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration: Got 3 ints and are done
which raises StopIteration with the payload of a string as expected. However,
if we were to send in something that is not an integer:
>>> gen = my_gen(3)
>>> gen.send(None)
'0/3 ints in a row'
>>> gen.send(5)
'1/3 ints in a row'
>>> gen.send('aardvark')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 7, in my_gen
ValueError: We only take integers!
However, if an unhandled exception is raised from a co-routine it is fully
exhausted and subsequently sending in new values will immediately raise
StopIteration.
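Continuing the session above (a sketch of the expected behavior):
>>> gen.send(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration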
Sometimes it is necessary to inject an exception into a co-routine, for example
to let the co-routine know the outer code did not like the last yielded value.
This can be done with the obj.throw method which causes the passed Exception
to be raised at the yield. Within the co-routine we can use all of the
standard exception handling tools of Python:
>>> def my_gen():
... for j in range(5):
... try:
... inp = yield j
... except ValueError:
... print("Ignoring ValueError")
... else:
... print("No exception")
... finally:
... print("Finish loop")
...
>>> gen = my_gen()
>>> gen.send(None)
0
>>> gen.send('a')
No exception
Finish loop
1
>>> gen.send(None)
No exception
Finish loop
2
>>> gen.throw(ValueError)
Ignoring ValueError
Finish loop
3
>>> gen.throw(RuntimeError)
Finish loop
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 4, in my_gen
RuntimeError
If the generator is exhausted then any exceptions thrown in are immediately
re-raised.
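For example, throwing into the now-exhausted generator from the session above
(a sketch):
>>> gen.throw(ValueError)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError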
close() channel
Sometimes the outer caller of a generator needs to tell the co-routine to clean
up and drop dead. This is done via the gen.close() method which will cause a
GeneratorExit exception to be raised at the point where the co-routine is
suspended (the yield). If the co-routine catches this exception and tries to
yield additional values, then close() will raise a RuntimeError.
>>> def my_gen():
... for j in count():
... try:
... yield j
... except GeneratorExit:
... print("I refuse to exit")
...
>>> gen = my_gen()
>>> gen.send(None)
0
>>> gen.send(None)
1
>>> gen.send(None)
2
>>> gen.close()
I refuse to exit
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: generator ignored GeneratorExit
>>>
The reason that GeneratorExit is not suppressible is that it is raised as part
of garbage collection and Python must be able to clean up the co-routine.
If the co-routine catches the exception and returns, there is no way for the
outer caller to access the returned value.
>>> def my_gen():
... for j in count():
... try:
... yield j
... except GeneratorExit:
... print("I acquiese to your request.")
... return 'Aardvark'
...
>>> gen = my_gen()
>>> gen.send(None)
0
>>> gen.send(None)
1
>>> gen.send(None)
2
>>> gen.close()
I acquiesce to your request.
>>>
This is particularly useful if the co-routine is holding onto resources, such
as open files or sockets, that need to be gracefully shut down.
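As a minimal sketch of that pattern (the function name, file path, and shutdown
message are all illustrative), a co-routine that owns an open file can use
GeneratorExit to flush and close it cleanly:
def log_writer(path):
    # hold the file open for the life of the co-routine
    with open(path, 'w') as f:
        try:
            while True:
                line = yield
                f.write(line + '\n')
        except GeneratorExit:
            # the outer caller asked us to shut down; finish up and return
            f.write('# closed cleanly\n')
            return
>>> w = log_writer('out.log')
>>> w.send(None)
>>> w.send('first line')
>>> w.close()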
This article is for people who already know how to use git day-to-day, but
want a deeper understanding of the why of git to do a better job reasoning
about what should or should not be possible rather than just memorizing
incantations.
While this text is going to (mostly) refer to the git CLI because it is the
lowest common denominator (everyone who uses git has access to the CLI), there
are many richer graphical user interfaces available (likely built into your
IDE). There is nothing wrong with using a GUI for working with git nor is
the CLI “morally superior” – anyone who says otherwise is engaging in
gatekeeping nonsense. I personally use magit and gitk in my day-to-day work.
Real programmers use tools that make them effective; if a GUI makes your life
easier, use it.
For each of the CLI interfaces I’m highlighting I am only covering the
functionality relevant to the point I’m making. Many of these CLIs can do more
(and sometimes wildly different) things, see the links back to the
documentation for the full details.
This article is focused on the version tracking aspect of git
. I will only
touch in passing on the fact that git
uses content based
addressing and how
it actually encodes the state of the repository at each commit. These details
are interesting in their own right and critical to the implementation of git
being efficient (both time and space wise), but are out of scope for this
article.
Another article in a similar vein to this, but starting from a user story and
building up is the Git Parable
. When I read
this essay years ago it made git
“click” for me. If you have not read it, I
suggest you go read it instead of this!
git’s view of the world
At the core, git keeps track of many (many) copies of your code, creating a
snapshot whenever you commit. Along with the code, git attaches to each
snapshot a block of text (the commit message), information about who wrote and
committed the code and when, and which commits are the “parents” of the
commit. The hash of all of this serves both as a globally unique name for the
commit and to validate the commit.
Because each commit knows its parent(s), the commits form a directed acyclic
graph (DAG). The code snapshots and metadata are the nodes, the parent
relationships define the edges, and because you can only go backwards in
history (commits do not know who their children are) it is directed. DAGs are
a relatively common data structure in programming (and if you need to work
with them in Python check out networkx). By identifying a DAG as the core data
structure of git’s view of history we can start to develop intuition for what
operations will be easy on git history (if they would be easy to express via
operations on a DAG). Using this intuition, we can (hopefully) start to guess
how git would probably implement the functionality we need to actually get our
work done!
Because the hash includes information about the parents the tree of commits
forms a variation on a Merkle
Tree. Using these hashes you can
validate that a git
repository is self consistent and that the source you
have checked out is indeed the source that was checked in. If you and a
collaborator both have a clone of a shared project then they can send you just
the hash of a commit and you can be sure that you have both an identical
working tree and identical history.
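You can see this structure directly by asking git to print a raw commit object
(a quick sketch; the output lists the tree, the parent hash(es), the author,
the committer, and the commit message):
git cat-file -p HEAD # print the raw commit object for the checked out commit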
Given such a graph, what operations would we want to do to it? For example we
want to
- get a repository to work with (git clone, git init)
- give commits human readable names (git tag, git branch)
- compare source between commits (git diff)
- look at the whole graph of commits (gitk, git log)
- look at a commit (both the code content and meta-data) (gitk, git switch, git checkout)
- add commits (git stage, git add)
- discard changes (both local changes and whole commits) (git reset, git restore, git clean, git checkout)
- change/move commits around the graph (git rebase, git cherry-pick)
- share your code (and history) with your friends (git push, git fetch, git remote, git merge)
- have more than one commit checked out at a time (git worktree)
What does it mean to be distributed (but centralized)?
From a technical standpoint no clone of a git repository is more special
than any other. Each contains a self-consistent section of the history of the
than any other. Each contains a self consistent section of the history of the
repository and they can all share that information with each other. From a
certain point of view, there is only one global history which consists of every
commit any developer on any computer has ever created and any given computer
only ever has a sub-graph of the full history.
While technically pure, fully distributed collaboration is deeply impractical.
Almost every project has socially picked a central repository to be considered
the “canonical” repository. For example for Matplotlib
matplotlib/matplotlib is the ground
truth repository. At the end of the day what is Matplotlib the library is
that git history, full stop. Because of the special social role that
repository holds only people with commit rights are able to push to that
repository and we have an agreed-upon social process for deciding who gets that
access and what code gets merged. When people talk about a project having a
“hard fork” or a “hostile fork” they are referring to a community that has
split about which repository is “the one” and who has the ability to push to
it.
Similarly, while every commit has a (globally) unique name – its hash – these
names are effectively unusable by humans. The branch and tag names that we use
are for the humans and any meaning we attach to the names is purely social.
Within the canonical repository there is a particular branch which is
identified as the branch for new development, along with optionally a handful
of other “official” branches for maintaining bug-fix series. The exact details
of the names, the
life cycles and the development workflow will vary from team-to-team. For
example on Matplotlib we have a main
branch for new development, the vX.Y.x
branches which are the maintenance branches for each vX.Y.0
minor release,
and vX.Y-doc
for the version specific documentation. To git
these names
are meaningless, but socially they are critical.
In the standard fork-based development workflow that many open source software
projects use, commits move from less visible but loosely controlled parts of
the global graph to more public and controlled parts. For example anyone can
create commits on their local clone at will! However no one else can (easily)
see them and those commits are inaccessible to almost everyone else who has
part of the total graph. A developer can then choose to publish their commits
to a public location (for example I push all of my work on Matplotlib to
tacaswell/matplotlib first). Once
the commits are public anyone can see them but only a handful of people are
likely to actually access them. To get the code into the canonical repository,
and hence used by everyone, the user can request that the committers to the
canonical repository “pull” or “merge” their branch into the default branch.
If this “pull request” is accepted and merged to the default branch then that
code (and commit history) is forever part of the project’s history.
Get a graph to work with
The most common way to get a copy of a project history is not to start ab initio,
but to get a copy of a preexisting history. Any given project only starts
once, but over time will receive many more commits (this repository already has
20+ commits, Matplotlib has over 43k, the kernel has over 1
million).
To get a local copy of a repository so you can start working on it you use the
git clone
sub-command:
git clone url_to_remote # will create a new directory in the CWD
By default git
will fetch everything from the remote repository (there are
ways to reduce this for big
repositories).
If you clone from the canonical repository then you have the complete
up-to-date official history of the project on your computer!
If you need to create a new repository use the git init
sub-command:
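git init # will create a new repository in the CWD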
However, I have probably only ever used git init
a few dozen times in my
career, whereas I use git clone
a few dozen times a week.
Label a commit
From the hash we have a globally unique identifier for each commit, however
these hashes look something like: 6f8bc7c6f192f664a7ab2e4ff200d050bb2edc8f
.
While unique and well-suited for a computer, such a hash is neither memorable nor does
it roll off the tongue. This is partly ameliorated because anyplace that the
git
interface takes a SHA you can instead pass a prefix, e.g. 6f8bc7
for
the SHA above. However the number of characters needed to ensure that the
prefix is actually unique depends on the size of the
repository.
To give a human-friendly name to a commit git offers two flavors of labels:
branches and tags. The conceptual difference is that a branch is
expected to move between commits over time and tags are fixed to a
particular commit for all time.
Branches point to a fixed concept, even though the commit they point to changes
over time. As discussed above, most repositories have a socially designated
“canonical branch” that is the point of truth for the development effort. The
exact name does not matter, but common names include “main”, “trunk”, or
“devel”. It is also conventional to do new development on a “development”
branch, named anything but the canonical branch name. This enables you to keep
multiple independent work directions in flight at a time and easily discard any
work that turns out to be less of a good idea than you thought.
To list, create, and delete branches use the git branch
sub-command. The most important
incantations are:
git branch # list local branches
git branch <name> # create a new branch pointing at the current commit
git branch -c <name> # copy the current branch to a new branch named <name>
git branch -d <name> # delete a branch, if safe
git branch -D <name> # delete a branch over git's concerns
In git
branches are cheap to make; when in doubt, make a new branch!
In contrast tags label a specific commit and never move. This is used most
often for identifying released versions of software (e.g. v1.5.2). To work with tags
use the git tag
sub-command. The most important incantations
are:
git tag # list tags
git tag -a <name> # create a new tag
You should always create “annotated” tags. If the commit is important enough
to get a permanent name, it is important enough to get an explanation of why you
gave it a name.
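For example (the tag name and message here are illustrative):
git tag -a v1.5.2 -m "Tagging the v1.5.2 release" # annotated tag with a message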
In git
jargon these are “refs”. See the
docs if you want
even more details about how git
encodes these.
Compare source between nodes
There is a “dual space” relationship between the code state at each commit and
the differences between the commits. If you have one you can always compute
the other. At first glance the natural way to track the changes of source over
time is to track the differences (this is in fact how many earlier version
control systems worked!). However git (and mercurial) instead track the full
state of the source at each commit, which solves a number of performance
problems and enables some additional operations.
Because the diffs between subsequent commits are derived, it is just as easy to
compute the diff between any two commits using the git diff sub-command. To
get the difference between two commits:
git diff <before> <after>
which will give you a patch that if applied to the <before>
commit will land
you at the <after>
commit. If you want to get a patch that will undo a
change swap the order of the commits.
Calling git diff without any arguments is a very common command that will show
any uncommitted changes in your working tree.
Look at the whole tree
It is useful to look at the whole graph. There is the git log
sub-command which will show you text
versions of history, however this is an application where a GUI
really shines. There is so much information available:
- the commit message
- the author and committer
- dates
- the computed diffs
- the connectivity between the commits
that it is difficult to see it all and navigate it in a pure text interface.
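That said, if you want to stay in the terminal, git log can show a compressed
view of the graph; one reasonable (but not the only) combination of flags is:
git log --graph --oneline --decorate --all # text drawing of the commit graph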
My preferred tool for exploring the full history is
gitk which is typically installed with git.
It is a bit ugly, but it works! In addition to visualizing the tree it also
has a user interface for searching both the commit messages and the code
changes and for limiting the history to only particular files.
Look at a node
When working with a git repository on your computer you almost always have one
of the commits materialized into a working tree (or more than one with the
git worktree
sub-command). The working
tree is, as the name suggests, where you actually do your work! We will come
back to this in the next section.
To checkout a particular commit (or tag or branch) you can use the
git checkout
sub-command as
git checkout <commit hash> # checks out a particular commit
git checkout <tag name> # checks out a particular tag
git checkout <branch name> # checks out a particular branch
In addition, there is also a newer git switch sub-command that is specifically
for switching branches, which is more tightly scoped (git checkout has a number
of other features) and more clearly named.
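A sketch of the most common invocations:
git switch <branch name> # switch to an existing branch
git switch -c <new branch name> # create a new branch and switch to it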
If you want to see the history of what commits you have had checked out (as
opposed to the history of the repository) you can use the git reflog
sub-command. While not something to use
day-to-day, it can save your bacon in cases where you have accidentally lost
references to a commit.
Adding nodes
The most important, and likely most common, operation we do on the graph is to
add new commits!
As mentioned above, when you checkout a branch on your computer you have a
working tree that starts at the state of the commit you have checked out.
There is a special name, HEAD, that can be used as a commit and means “the
commit that is currently checked out in your work tree”. There is also the
short hand HEAD^ which means “the commit before the one checked out”, HEAD^^
which means “the commit two before the one checked out”, and so on for
repeated ^.
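These relative names can be used anywhere git expects a commit, for example (a sketch):
git show HEAD # show the currently checked out commit (metadata and diff)
git diff HEAD^ HEAD # the changes introduced by the most recent commit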
As you make changes there are two common commands to check on your state: the
git status sub-command and the git diff sub-command. git status will give you
a summary of what changes you have in the local tree relative to HEAD and what
changes are staged. git diff, when called with no arguments, will show the
detailed diff between the current working tree and HEAD.
As you work on your code, git does not require you to commit all of your
changes at once, but to enable this committing is a two-stage process. The
first step is to use the git add sub-command to stage changes:
git add path/to/file # to stage all the changes in a file
git add -p # to commit by hunk
Once you have staged all of the changes you want, you create a new commit via the
git commit
sub-command
git commit -m "Short Message" # commit with a short commit message
git commit # open an editor to write a commit message
Writing a commit message is one of the most important parts of using git.
While it is frequently possible to reconstruct the what of a code change from
the source alone, it can be impossible to reconstruct the why of the change.
The commit message is a place that you can leave notes to your collaborators
explaining the motivations of the change. Remember that your most frequent
collaborator is your future / past self! For a comprehensive guide to writing
good commit messages see this article.
As git encourages the creation of branches for new development, when the
work is done (via the cycle above) we will need to merge this work back into
the canonical branch, which is done via the git merge sub-command. By default,
this will create a new commit on your current branch that has two parents (the
tips of each branch involved).
git merge <other branch> # merge other branch into the current branch
If you are using a code hosting platform (GitHub, GitLab, BitBucket, …) this
command will typically be done through the web UI’s “merge” button.
discard changes
Not all changes are a good idea, sometimes you need to go back.
If you have not yet committed your changes then they can be discarded using the
git checkout
sub-command (yes the
same one we used to change branches)
git checkout path/to/file # discard local changes
There is also the newer git restore sub-command which is more tightly scoped
to discarding local file changes.
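A sketch of the equivalent incantation:
git restore path/to/file # discard local (unstaged) changes to a file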
If you have files in your working directory that git
is not currently tracking you
can use the git clean
sub-command.
git clean -xfd # purge any untracked files (including ignored files)
If you need to discard commits you can use the git reset
sub-command. By default git reset
will
change the currently checked out commit but not change your working tree (so you keep
all of the code changes).
git reset HEAD^ # move the branch back one, keep working tree the same
git reset HEAD^^ # move the branch back two, keep working tree the same
git reset <a SHA1> # move the branch a commit, keep working tree the same
This can be very useful if you like the changes you made, but not the commits
or commit messages.
Alternatively if you want to discard the commits and the changes you can use
the --hard
flag:
git reset --hard HEAD^ # move the branch back one, discard all changes
git reset --hard HEAD^^ # move the branch back two, discard all changes
git reset --hard <a SHA1> # move the branch a commit, discard all changes
Be aware that these can be destructive commands! If you move a branch back
there may be commits that become inaccessible (remember commits only know
their parents). This is where the git reflog sub-command can help recover the
lost commits.
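A sketch of the recovery dance (the SHA is whatever the reflog shows for the
commit you want back):
git reflog # find the SHA of the commit you lost
git reset --hard <SHA from the reflog> # move the branch back to it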
git commands may create objects behind the scenes that ultimately become
inaccessible. git will clean these up on its own, but you can manually trigger
this clean up via the git gc sub-command.
If you have accidentally committed something sensitive, but not yet pushed, you
can use these tools to purge it. If you have pushed the commit you will need
some higher test tools.
change or move nodes
Due to the way the hashes work in git
you can not truly change a commit,
but you can modify and recommit it or make copies elsewhere in the graph.
Remember that if you have already shared the commits you are replacing you will
have to force-push them again (see below). Be very careful about doing this to
any branch that many other people are using.
If you have just created a commit and realized you need to add one more change
you can use the --amend
flag to the git commit
sub-command.
# hack
git add path/to/file # stage the changes like normal
git commit --amend # add the changes to the HEAD commit
This does not actually change the old commit. A commit is uniquely
identified by its hash and the hash includes the state of the code, thus
“amending” a commit creates a new commit and then resets the current branch
to point to the new commit and orphans the old commit.
If you want to move a range of commits from one place to another you can use
the git rebase
sub-command.
git rebase target_branch # rebase the current branch onto target_branch
which will attempt to “replay” the changes in each of the commits on your
current branch on top of target_branch
. If there are conflicts that git
can not automatically resolve it will pause for you to manually resolve the
conflicts and stage the changes to continue or abort the whole rebase
git rebase --continue # continue with your manual resolution
git rebase --abort # abort and go back to where you started
If you want to re-order, drop or combine commits you can use:
git rebase -i # interactively rebase, squash and re-order
which will open an editor with instructions. This can be particularly useful
if you want to commit early and often as you work, but when you are done
re-order and re-arrange the changes into a smaller number of better organized
commits to tell a better story.
Common reasons to be asked to rebase (and squash) a branch are that your
development branch has grown merge conflicts and the project prefers rebasing
over merging the default branch back into the development branches, or that
your commit history has too many “noise” commits (small typo fixes, reversions
of work, committing and then deleting files).
To move a commit from one branch to another use the git cherry-pick
sub-command which is conceptually
similar to git rebase
git cherry-pick <commit> # pick the commit on to the current branch
git cherry-pick -m 1 <commit> # pick a merge commit onto the current branch
git cherry-pick --continue # continue if you have to manually resolve conflicts
git cherry-pick --skip # drop a redundant commit
git cherry-pick --abort # give up and go back to where you started
In all of these cases, the git reflog sub-command can be useful if things do
not go as you expect!
Sharing with your friends
So far we have not talked much about any of the collaborative or distributed
nature of git. Except for git clone
, every command so far can be done only
with information than git
has on your computer and can be done without a
network connection. This lets you work in your own private enclave, either
temporarily, because you are working on a laptop on commuter rail or are not
yet ready to share your work, or permanently if you just prefer to work alone.
While version control is useful if you are working alone (your most frequent
collaborator is your future / past self and version control can save you from
typos), it really shines when you are working
with other people. To share code with others we need a notion of a shared
history. Given that under the hood git is a graph of nodes uniquely named by
their content “all” we have to do is be able to share information about the
commits, branches, and tags between the different computers!
By default, after an initial git clone there is one “remote”, named origin,
pointing to wherever you cloned from. To modify an existing remote
or add a new remote use the git remote
sub-command.
git remote add <name> <url> # add a new remote
git remote rm <name> # delete a remote
git remote rename <old> <new> # rename a remote from old -> new
Once you have one or more remotes set up, the first thing we want to be able
to do is get new commits from the remotes via the git fetch or git remote
sub-commands.
git fetch <name> # fetch just one remote
git fetch --all # fetch all of the remotes
git remote update # update all the remotes
The git pull
sub-command combines a
fetch and a merge into one command. While this seems convenient, it will
frequently generate unexpected merge commits that take longer to clean up than
being explicit about fetching and merging separately.
git merge --ff-only remote/branch # merge remote branch into the local branch
The --ff-only
flag fails unless the history can be “fast forwarded” meaning
that only the remote branch has new commits.
To share your work with others you need to put the commits someplace other
people can see it. The exact details of this depend on the workflow of the
project and team, but if using a hosting platform this is done via the git
push
sub-command.
git push <remote> <branch_name> # push the branch_name to remote
Given that in a typical workflow you are likely to be pushing to the same
branch on the same remote many times git
has streamlined ways of keeping track
of the association between your local branch and a remote
branch on a
(presumably) more public location. By telling git about this association we
save both typing and the chance of mistakes due to typos.
git branch --set-upstream-to <remote>/<branch> # set an "upstream"
git push # "do the right thing" with upstream set
If you try to push commits to a remote branch that has commits that are not on
your local branch git will reject the push. The course of action depends on
why you are missing commits. If there are new commits on the remote branch
that you have not fetched before, then you should either merge the remote
branch into your local branch before pushing or rebase your local branch on the
remote branch and push again.
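A sketch of that first case, using the same placeholder names as above:
git fetch <remote> # get the new commits from the remote
git merge <remote>/<branch_name> # or: git rebase <remote>/<branch_name>
git push <remote> <branch_name> # the push should now be accepted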
Because git can not tell the difference between new commits on the remote and
old commits on the remote that exist because you have re-written history
locally, either via git commit --amend or git rebase, you have to do something
a bit .... dangerous. git is detecting that if the remote branch were to be
updated to where the local branch is it would make some commits inaccessible,
and it is protecting you from yourself. However, if you are sure, we can tell
git to trust our judgment and do it anyway:
git push --force-with-lease
Be very careful about doing this to branches that other people are relying on
and have checked out. Other people will have the same problem you just had,
but in reverse. git can not tell that the re-written commits are “right” and
the history on the other users’ computers is “wrong”. They will be presented
with the same options you just had and may re-force-push your changes out of
existence. We recently had to re-write the history on the default Matplotlib
branch and it required a fair amount of planning and work to manage.
checking out more than one commit
When you checkout a commit git
materializes the code into the directory where
the repository is cloned and your local directory is made to match the tree of
the commit. Thus, it is logically impossible to have more than one commit
checked out at once. However, it can be extremely useful to have more than one
commit checked out at once if you are working on a project with multiple “live”
branches. One way around this is to simply clone the repository N times,
however because each repository is unaware of the others, you will have N
complete copies of the repository and each will have to be synchronized with
its remotes independently, etc. To make this efficient you can use the git
worktree
sub-command
git worktree add ../somepath branch_name
This will share all of the git
resources and configuration with the main
git
worktree. One surprising limitation of the worktrees is that you can
only have a given branch checked out in at most one worktree at a time.
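A couple of related incantations (a sketch):
git worktree list # show all worktrees sharing this repository
git worktree remove ../somepath # remove a worktree when you are done with it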
git config
There are many (many) knobs to configure the default behavior. I suggest
starting with these settings:
[transfer]
# actually verify the hashes
fsckobjects = true
[fetch]
# actually verify the hashes
fsckobjects = true
# automatically drop branches that are deleted on the remotes
prune = true
# fetch remotes in parallel
parallel = 0
[receive]
# actually verify the hashes
fsckObjects = true
[pull]
# requires opting-into creating a merge commit locally.
# Given a platform based workflow, this prevents unintentional merge
# commits that need to be un-wound
ff = only
[merge]
# same as above
ff = only
[color]
# colours are always fun
ui = auto
[init]
# get ahead of the master -> main change
defaultBranch = main
[feature]
manyFiles = true
[alias]
# this gives `git credit` as an alternative to `git blame`, just
# puts you in a more positive mind set while using it.
credit = blame
Other things you might want to do
There are obviously many things that git
can do that are not covered here.
Some things that I have had to do from time-to-time but did not make the cut
for this article include:
- track the history of a line of code back in time (gitk, git blame + UI tooling, git log)
- find the commit that broke something (git bisect)
- merge un-related git histories into one (git merge --allow-unrelated-histories)
- extract the history of a sub-directory into its own repository (git filter-branch)
- purge a particular file (or change) from the history (git filter-branch or BFG repo-cleaner)
- fast searching (git grep)
- ask git to clean up after itself (git gc)
Other resources
Acknowledgments
Thank you to James Powell, Alex Held, Dora Caswell and the other beta-readers
who read (or listened to) early drafts of this post and provided valuable
feedback. Thank you to Elliott Sales de Andrade for pointing out git restore
.