How to use git to lose data




Acknowledgments

First of all I have to say that git has lots of positive features, not least of which is an energetic development community with a focus on quality and power while resisting uncontrolled feature bloat. Linus Torvalds deserves credit for advancing the cause of distributed version control systems, and for doing something to advance his cause rather than just complaining.

Next, I am grateful to Scott Chacon, author of Pro Git, a book which makes it possible to achieve a substantial understanding of git in a single evening. I bought a physical copy of the book, and encourage you to, as well (at least, if you're sufficiently ancient that you buy books in physical form).

The DVCS nature of git

Git originally arose as an emergency solution to a very specific problem, namely: how can thousands of people contribute debugged changes to the Linux kernel without causing the central coordinator or the central source repository to collapse? Git is now used by many other people with many other project goals, but the original problem it was designed to solve influences what happens when it is applied to other problems.

Git is a "distributed version control system". DVCSs are currently a hot area, and different ones have different goals and accomplishments. (Sadly, this variety is often obscured by the fact that different version control systems use the same terms for things that are completely different: git and Subversion and Mercurial and Perforce all have "changesets", but that word means something genuinely different in each of the four systems! Whee!)

It's easy to think that a "distributed version control system" stands in the same relationship to a plain old VCS as a "distributed file system" does to a plain old FS: the plain old centralized thing does X, and the distributed thing also does X, but without the bad old "centralized" part. Well... the job of a file system is to store files, permanently and reliably, and most distributed file systems harness a bunch of machines together to also store files permanently and reliably--hopefully even more reliably.

On the other hand, the job of a version control system is to reliably store multiple versions of files, but at least in the case of git the job of a distributed version control system is NOT to harness a bunch of machines together to reliably store everybody's versions. A big part of the git design is FORGETTING things, because of the (quite reasonable) viewpoint of the coordinator of a giant software project that patches obey Sturgeon's Law. In other words, from the point of view of a big software project as a whole, most proposed changes are wrong and should be forgotten, rather than maintained, tracked, and shared with other people.

This plays out in multiple places in git. For example, compared to other VCS's, git branches are so ephemeral that they can barely be considered branches. In particular, there is no real way to answer the question "what was the state of the XXX branch two weeks ago?". The git version of the question is "if we start from the snapshot currently named XXX and travel back two weeks, what do we find?" But because git branches are actually ephemeral labels, there may be no relationship at all between what XXX points to now and what it pointed to two weeks ago. Usually there is a very close relationship, especially for "remote-tracking" branches (branches defined by wise and careful managers of upstream repositories), and usually when there isn't it is possible to look up what the relationship used to be in the reflog, but--for better or worse--that's really not the same thing as a traditional VCS branch.
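To make the "ephemeral label" point concrete, here is a minimal sketch in a throwaway repository. The branch name XXX comes from the text above; the temporary directory, the identity settings, and the variable names are assumptions made purely for illustration. The script moves XXX to an unrelated commit and then asks the reflog where XXX used to point.

```shell
#!/bin/sh
set -eu

# Throwaway repository, so nothing real is touched (the path is illustrative).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.invalid"

# Two ordinary commits; XXX starts out naming the second one.
git commit -q --allow-empty -m "first"
git commit -q --allow-empty -m "second"
second=$(git rev-parse HEAD)
git branch XXX "$second"

# A root commit with no parents, unrelated to the history above.
other=$(git commit-tree -m "unrelated" "$(git write-tree)")

# Re-point the label. Git does not object: XXX is just a name.
git branch -f XXX "$other"

now=$(git rev-parse XXX)
before=$(git rev-parse 'XXX@{1}')   # previous value, read from the local reflog

if git merge-base --is-ancestor "$before" "$now"; then
    relation="related"
else
    relation="unrelated"
fi
echo "XXX now:    $now"
echo "XXX before: $before ($relation to the current XXX)"
```

Note that the XXX@{1} lookup only works in the repository where the label was moved, and only for as long as that reflog entry has not expired. Reflogs are not transferred by push, fetch, or clone.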

Why might I care?

Of course, if you are a developer on a small project, say 2 to 5 developers, you might not need multiple shared branches at all, so this discussion may seem esoteric and irrelevant. You may well think that git when used in a small setting is a lot like other VCS's when used in a small setting, except that it's fresh and cool. In the rest of this piece I'd like to counsel you on ways that git's core design goals could surprise you in potentially unpleasant ways.

The key issue is this: even when used purely locally by one person, git is designed to forget things. Usually the things it forgets are things you have already forgotten and will never need again, in which case that's great. But most git users have no idea what git intends to forget, so unpleasant surprises are theoretically possible. For example, let's say you tend to develop on your laptop (a key motivator of DVCSs in general). Because your laptop might break, be stolen, or fall into a river, you set up a repository somewhere else. This is also useful if you are collaborating closely with a small group of other people. Let's say one day something bad does happen and the git repository on your laptop is lost. No problem, right? You've been diligently pushing stuff to the non-laptop repository, so everything should be there. Right?

Nope. Before you read further, you might want to take a moment to write down everything that you can think of that will be gone forever along with your local repository. For concreteness, imagine that you typed
$ git push
and that after the command completed successfully an elephant crushed your laptop.



What's missing?

Ok, here's my list of what's missing.

If you wrote down a list before reading mine, did you miss anything that was on my list? If so, you might want to pause to adjust your mental model a bit. If your list looks exactly like mine, don't take too much comfort: maybe we're both overlooking something.
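If you would rather check than guess, the following sketch shows one way to compare the refs in a local repository against the repository it pushes to. It is not meant as a complete list: it looks only at refs (branches, tags, the stash), so anything that is not a ref is outside its view. The laptop and backup.git directories, the side-idea branch, and the v0-local tag are hypothetical names made up for the demonstration.

```shell
#!/bin/sh
set -eu

# Hypothetical layout: a bare "backup" repository and a "laptop" working repository.
work=$(mktemp -d)
git init -q --bare "$work/backup.git"
git init -q "$work/laptop"
cd "$work/laptop"
git config user.name "Demo User"
git config user.email "demo@example.invalid"
git remote add origin "$work/backup.git"

# First commit, pushed once with -u so that a bare "git push" works later.
echo one > file.txt
git add file.txt
git commit -q -m "first"
git push -q -u origin HEAD

# Local-only work: a side branch with its own commit, a tag, and a stash.
git checkout -q -b side-idea
echo idea > idea.txt
git add idea.txt
git commit -q -m "half-finished idea"
git checkout -q -
git tag v0-local
echo two >> file.txt
git stash -q

# More work on the main line, followed by the command from the text.
echo three > other.txt
git add other.txt
git commit -q -m "second"
git push -q

# Which local refs does the backup not know about?
only_local=""
for ref in $(git for-each-ref --format='%(refname)' refs/heads refs/tags refs/stash); do
    if ! git --git-dir="$work/backup.git" rev-parse -q --verify "$ref" >/dev/null; then
        only_local="$only_local $ref"
    fi
done
echo "Refs that exist only on the laptop:$only_local"
```

In this setup the push succeeds and the backup still ends up with a single branch. Uncommitted edits in the working tree, the reflog, and the repository's local configuration are not refs, so the loop cannot see them; they would need to be checked separately.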

Suggestions



davide+receptionist@cs.cmu.edu