How to use git to lose data

Acknowledgments

First of all I have to say that git has lots of positive features, not least of which is an energetic development community with a focus on quality and power while resisting uncontrolled feature bloat. Linus Torvalds deserves credit for advancing the cause of distributed version control systems, and for doing something to advance his cause rather than just complaining.

Next, I am grateful to Scott Chacon, author of Pro Git, a book which makes it possible to achieve a substantial understanding of git in a single evening. I bought a physical copy of the book, and encourage you to, as well (at least, if you're sufficiently ancient that you buy books in physical form).

The DVCS nature of git

Git originally arose as an emergency solution to a very specific problem, namely: how can thousands of people contribute debugged changes to the Linux kernel without causing the central coordinator or the central source repository to collapse? Git is now used by many other people with many other project goals, but the original problem it was designed to solve influences what happens when it is applied to other problems.

Git is a "distributed version control system". DVCS's are currently a hot area, and different ones have different goals and accomplishments. (Sadly, this variety is often obscured by the fact that different version control systems use the same terms for things that are completely different: git and Subversion and Mercurial and Perforce all have "changesets", but that word means something genuinely different in each of the four systems! Whee!)

It's easy to think that a "distributed version control system" bears the same relationship to a plain old VCS that a "distributed file system" bears to a plain old FS: the plain old centralized thing does X, and the distributed thing also does X, but without the bad old "centralized" part. Well... the job of a file system is to store files, permanently and reliably, and most distributed file systems harness a bunch of machines together to also store files permanently and reliably--hopefully even more reliably.

On the other hand, the job of a version control system is to reliably store multiple versions of files, but at least in the case of git the job of a distributed version control system is NOT to harness a bunch of machines together to reliably store everybody's versions. A big part of the git design is FORGETTING things, because of the (quite reasonable) viewpoint of the coordinator of a giant software project that patches obey Sturgeon's Law. In other words, from the point of view of a big software project as a whole, most proposed changes are wrong and should be forgotten: not maintained, tracked, and shared with other people.

This plays out in multiple places in git. For example, compared to other VCS's, git branches are so ephemeral that they can barely be considered branches. In particular, there is no real way to answer the question "what was the state of the XXX branch two weeks ago?". The git version of the question is "if we start from the snapshot currently named XXX and travel back two weeks, what do we find?" But because git branches are actually ephemeral labels, there may be no relationship at all between what XXX points to now and what it pointed to two weeks ago. Usually there is a very close relationship, especially for "remote-tracking" branches (branches defined by wise and careful managers of upstream repositories), and usually when there isn't it is possible to look up what the relationship used to be in the reflog, but--for better or worse--that's really not the same thing as a traditional VCS branch.
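Here is a minimal sketch of that reflog lookup, run in a throwaway repository. The temporary directory, the demo identity, and the commit messages are all invented for illustration; the only point is that once a branch label has been moved, the sole record of where it used to point is the local reflog.

```shell
#!/bin/sh
# Sketch: a branch is a movable label; the reflog remembers where it was.
set -e
demo=$(mktemp -d)                       # throwaway sandbox (hypothetical)
cd "$demo"
git init -q .
git config user.name "Demo User"        # made-up identity for the demo
git config user.email "demo@example.invalid"

git commit -q --allow-empty -m "first"
branch=$(git symbolic-ref --short HEAD) # "master" or "main", depending on setup
first=$(git rev-parse HEAD)
git commit -q --allow-empty -m "second"

# Move the label backwards: "second" is no longer part of the branch.
git reset -q --hard "$first"

# Ask the reflog where the label pointed one move ago.
previous=$(git rev-parse "${branch}@{1}")
git log -1 --format=%s "$previous"      # prints: second
```

The same syntax accepts a date, as in master@{yesterday}, but it can only look back as far as this repository's own reflog goes. A fresh clone starts with an empty memory.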

Why might I care?

Of course, if you are a developer on a small project, say 2 to 5 developers, you might not need multiple shared branches at all, so this discussion may seem esoteric and irrelevant. You may well think that git when used in a small setting is a lot like other VCS's when used in a small setting, except that it's fresh and cool. In the rest of this piece I'd like to counsel you on ways that git's core design goals could surprise you in potentially unpleasant ways.

The key issue is this: even when used purely locally by one person, git is designed to forget things. Usually the things it forgets are things you have already forgotten and will never need again, in which case that's great. But most git users have no idea what git intends to forget, so unpleasant surprises are theoretically possible. For example, let's say you tend to develop on your laptop (a key motivator of DVCS's in general). Because your laptop might break, be stolen, or fall into a river, you set up a repository somewhere else. This is also useful if you are collaborating closely with a small group of other people. Let's say one day something bad does happen and the git repository on your laptop is lost. No problem, right? You've been diligently pushing stuff to the non-laptop repository, so everything should be there. Right?

Nope. Before you read further, you might want to take a moment to write down everything that you can think of that will be gone forever along with your local repository. For concreteness, imagine that you typed
$ git push
and that after the command completed successfully an elephant crushed your laptop.

What's missing?

Ok, here's my list of what's missing.

Dirty (uncommitted) state - This is true of almost every revision control system, and is mentioned here as a warning to anybody who is using git as their first revision control system. In git, any changes which haven't been "committed" (declared permanent) via git add and git commit are considered tentative, and hence are not saved anywhere. Use git status frequently to make sure you know what is uncommitted; keep your "ignore file" up to date so that the output from git status is accurate enough that you pay attention to it.
Hooks - these are small pieces of code which execute when you ask git to perform various actions, e.g., it is possible to configure git to sanity-check a commit, maybe to enforce a coding style, before it's actually committed. Hooks are not automatically copied from one repository to another for security reasons, so if you lose your repository you will need to manually reinstall whatever hooks you were using. No big deal, probably. In a private repository, hopefully any hooks were performing convenience functions rather than necessary ones.
Reflog - if you lose your repository, your reflog will be gone. The reflog is a data structure which records the historical associations between "refs" (branch names, more or less) and commits. This is the data structure that answers the question "Which commit was the head of the master branch last Tuesday just before lunch?". If you've accidentally done something odd to one of your branches, the reflog is the easiest place to go spelunking to find your lost history (unless the reflog is lost).
Branches - Git branches are, as mentioned above, ephemeral cursors. The git model expects that most branches most people have will remain private to one repository, will be merged or rebased, fully or partially, onto some other branch, will then be abandoned, and will later be garbage-collected. A branch may be set up as (or promoted to) a "tracking branch", in which case it can be easily pushed somewhere else. But that isn't the common case, so it isn't automatic, so if you lose your repository you may well lose state.
Tags - git has two types of tag. A "lightweight" tag is, much like a branch, a name for a commit. The main difference is that branch cursors move around semi-automatically when you make commits and lightweight tags are expected to stay still once created. "Annotated" tags are more like commits in that they have textual descriptions, are hashed, and can be signed. Regardless, the key issue is that tags are not pushed to remote repositories by default. Furthermore, it is customary not to push lightweight tags (ever). Therefore, in order for tags to survive the loss of your repository, you should remember to invoke both git tag -a and git push --tags. While it is possible to do this, it is also possible to forget.
Commits on inactive branches - if you do some work on a tracking branch but aren't completely happy with it, so you don't push it, and you then switch to working on some other branch, you can forget that you have local-only state on the first branch. If some time goes by, you may believe that, because all the work you've done was committed and you've been busily pushing, all of your work is safely stored somewhere else, but it's not. You can work around this with git push --all or git push --mirror, but note that --all, despite the name, does not push everything (it pushes branches only, not tags, notes, or the stash) and --mirror can actually delete things from the remote repository if you've deleted them locally.
Unreferenced commits - When you commit, git stores your work in a robust hash-certified data structure and moves the cursor of the current branch (the one "HEAD" refers to) so it points to the new commit object. An entry is also made in the reflog. This commit will reach safety (some other repository) if the branch is pushed. However, standard git operations (rebase, reset --hard) can cause a branch to be redirected in such a way that a given commit is no longer part of the historical record on that branch. If you notice that this has happened, and you care, you can use the reflog to find the commit and make it part of history again. Otherwise, by design, the commit will never leave your repository. (Note that the --all and --mirror flags to push do not push commits which aren't part of any branch.)
"stashed" work - It is possible to temporarily abandon the dirty state of a branch, return to a clean checkout, and later get back the "stashed" dirty state. The git "stash" area is implemented in terms of a local ref, with the reflog as a back-up safety net. The commit object which contains the saved work can be pushed via git push --mirror (see warning above) but will otherwise, by design, remain local to your repository.
git-rerere data - rerere is a tool which remembers how you have resolved particular merge conflicts in the past and attempts to replay your resolutions if the same conflict arises later (at the time of this writing, this is the closest that git comes to having actual support for changes as opposed to states). This information is stored in .git/rr-cache, so it is local to any given repository and is not pushed (ever, by anything). Of course this means the information is lost if a repository is lost, but it also has the somewhat surprising implication that git rerere can fully automate a merge in one repository (which has "learned" how to merge certain conflicts) while the same merge would need to be done completely manually in a different repository. Good luck explaining that to an SVN user! Or another git user, for that matter.
git notes - Notes are a way to add information about a commit object after the commit has happened (the commit object itself is immutable). As I read the documentation, notes, independent of which branch's commits they annotate, are not pushed by default or by --all, but are pushed by --mirror and can be pushed manually (see "Pro Git - Note to Self" for more information).
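
Several items on this list are refs of one kind or another, so you can at least ask git which ones a push left behind. What follows is a rough sketch, not a complete audit: the sandbox layout, the branch name "experiment", the tag name "v0.1", and the file names are all invented for the demo. It says nothing about hooks, the reflog, rerere data, or uncommitted files, none of which are refs.

```shell
#!/bin/sh
# Sketch: after a successful "git push", list what exists only on the laptop.
set -e
work=$(mktemp -d)                       # throwaway sandbox (hypothetical)
git init -q --bare "$work/remote.git"   # stands in for the non-laptop repository
git init -q "$work/laptop"
cd "$work/laptop"
git config user.name "Demo User"        # made-up identity for the demo
git config user.email "demo@example.invalid"
git remote add origin "$work/remote.git"

# Work that does get pushed.
echo one > file.txt
git add file.txt
git commit -q -m "pushed work"
branch=$(git symbolic-ref --short HEAD)
git push -q -u origin "$branch"

# Work of the kinds listed above that a plain push leaves behind.
git checkout -q -b experiment
echo idea > idea.txt
git add idea.txt
git commit -q -m "local-only idea"
git checkout -q "$branch"
git tag v0.1                            # lightweight tag
echo two >> file.txt
git stash -q                            # stashed dirty state

git push -q                             # succeeds, yet pushes nothing new

# Refs that exist locally but not on the remote.
remote_refs=$(git ls-remote origin | cut -f2)
missing=""
for ref in $(git for-each-ref --format='%(refname)' \
                 refs/heads refs/tags refs/notes refs/stash); do
    printf '%s\n' "$remote_refs" | grep -qxF "$ref" || missing="$missing $ref"
done
echo "refs missing from remote:$missing"

# Commits reachable from some local ref but from no remote-tracking branch.
unpushed=$(git rev-list --all --not --remotes)
echo "local-only commits:"
echo "$unpushed"
```

In this sandbox the report names the experiment branch, the tag, and the stash, even though every push succeeded. The commit list relies on the remote-tracking branches being up to date, so in real life you would want to fetch first.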

If you wrote down a list before reading mine, did you miss anything that was on my list? If so, you might want to pause to adjust your mental model a bit. If your list looks exactly like mine, don't take too much comfort: maybe we're both overlooking something.

Suggestions

Internalize the concept that git is designed to forget things. If you haven't seen something reach another repository, maybe it didn't. Heck, even if you did see it go somewhere else, maybe it fell out of the historical record there and then got garbage-collected.
Internalize the fact that git (basically by design) does not have a way to save all of your work somewhere else, even though push has two options which sound like they would do that.
If you use a git extension such as stgit/qgit, carefully study the documentation to determine whether and when the information the extension creates or relies on is backed up to another repository.
Never never never delete a repository. There's no telling what you might need from it.
Even if you diligently try to push all changes from your personal repository to a shared repository, you should also periodically back your repository up by copying the whole thing somewhere else.
Don't garbage-collect a repository just because it sounds cool. Remember the old saying: one person's trash is another person's treasure. In fact, it is probably wise for you to consult the git-gc manual page and raise the default garbage-collection time thresholds and/or disable garbage collection entirely (this advice supposes you are configuring your personal repository or a shared repository for a small project, i.e., you are one of the many people using git to solve a different problem than it was originally designed for).
Because of a quirk in git's data model, it is probably best if a single commit does not both change the contents of a file and rename the file. Nothing dreadful will happen if you do this, but if the change to the file is "too drastic" git may think you deleted the original file and then created a new one. Also, if you can, don't rename directories. This same quirk means that, while git can often apply a change to a file even if the name has changed, it's almost impossible for git to create a file in the right directory if the creation originally happened in a directory which was renamed on another branch.
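
The backup and garbage-collection suggestions can be sketched roughly as follows. The sandbox paths, the backup directory name, and the demo identity are invented. The four settings are ones I believe the git-gc and git-config manual pages document, but check the pages for your git version before relying on them.

```shell
#!/bin/sh
# Sketch: ask git to stop forgetting, then copy the whole repository.
set -e
work=$(mktemp -d)                       # throwaway sandbox (hypothetical)
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Demo User"        # made-up identity for the demo
git config user.email "demo@example.invalid"
git commit -q --allow-empty -m "work worth keeping"

# Per-repository settings: no automatic gc, no expiry of unreachable
# objects, no expiry of reflog entries.
git config gc.auto 0
git config gc.pruneExpire never
git config gc.reflogExpire never
git config gc.reflogExpireUnreachable never

# Whole-repository backup: a plain recursive copy also carries the hooks,
# reflog, stash, and rerere cache that no push will ever transfer.
backup="$work/repo-backup-$(date +%Y%m%d)"
cp -R "$work/repo" "$backup"

# Sanity-check the copy.
git -C "$backup" fsck
git -C "$backup" reflog
```

A plain copy is crude, but it is the only approach here that does not depend on git's opinion of what is worth keeping. These settings only change the defaults: somebody who explicitly runs git gc --prune=now or git prune can still throw things away.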