unused data

Data can accumulate in the annex that no file in any branch points to anymore. One way this happens is when you git rm a file without first calling git annex drop. Another is modifying an annexed file: the old content of the file remains in the annex. A third is migrating between key-value backends.

This might be historical data you want to preserve, so git-annex defaults to preserving it. From time to time, you may want to check for such data:

$ git annex unused
unused . (checking for unused data...) 
  Some annexed data is no longer used by any files in the repository.
    NUMBER  KEY
    1       SHA256-s86050597--6ae2688bc533437766a48aa19f2c06be14d1bab9c70b468af445d4f07b65f41e
    2       SHA1-s14--f1358ec1873d57350e3dc62054dc232bc93c2bd1
  (To see where data was previously used, try: git log --stat -S'KEY')
  (To remove unwanted data: git-annex dropunused NUMBER)
ok

After running git annex unused, you can follow the instructions to examine the history of files that used the data, and if you decide you don't need that data anymore, you can easily remove it from your local repository.

$ git annex dropunused 1
dropunused 1 ok

Hint: To drop a lot of unused data, use a command like this:

$ git annex dropunused 1-1000
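
When only some of the listed data should go, the numbers to pass to dropunused can be picked out of the listing. The following is a minimal sketch, not part of git-annex: large_unused is a hypothetical helper that reads the output of git annex unused on standard input and prints the NUMBER of every key whose size field is at least the given number of bytes. It assumes the listing format shown above and keys that carry a size field of the form -sSIZE; keys without one are skipped.

```shell
# Sketch: print the NUMBERs of unused keys that are at least MIN_BYTES large.
# Assumes rows of the form "NUMBER KEY" and keys like BACKEND-sSIZE--HASH.
# large_unused is a hypothetical helper name, not a git-annex command.
large_unused() {
    min_bytes=$1
    awk -v min="$min_bytes" '
        # Only consider listing rows: a numeric first field and one key.
        $1 ~ /^[0-9]+$/ && NF == 2 {
            n = split($2, parts, "-")
            for (i = 2; i <= n; i++) {
                # The size field is "s" followed by digits only.
                if (parts[i] ~ /^s[0-9]+$/) {
                    if (substr(parts[i], 2) + 0 >= min) print $1
                    break
                }
            }
        }'
}

# Usage (needs git-annex, so it is left commented out here):
#   git annex unused | large_unused 1000000
```

The printed numbers could then be passed to git annex dropunused, which accepts several numbers or ranges at once. Parsing human-readable output is fragile, so treat this as a starting point rather than a supported interface.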

Rather than removing the data, you can instead send it to other repositories:

$ git annex copy --unused --to backup
$ git annex move --unused --to archive

comment 5

Tom, it should suffice for your colleague to pull the new version from you, and then run git annex unused to find unused files.

Now, if you have other branches or tags pointing at older versions of the data, those files will still be considered to be used, even if the current version doesn't use them. There's a new --used-refspec option for git-annex unused that you can use to specify which branches to consider to be used.
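
As a rough sketch of how that option might be used: a refspec is a colon-separated list of +REF additions and -REF removals. The helper below, branches_refspec, is hypothetical and not part of git-annex; it only assembles a refspec that adds the named branches, and the branch name "master" in the usage line is an assumption.

```shell
# Sketch: build a --used-refspec value that treats only the named
# branches as used. branches_refspec is a hypothetical helper.
branches_refspec() {
    spec=""
    for branch in "$@"; do
        # Each addition is written as +REF; entries are joined with ":".
        spec="${spec:+$spec:}+refs/heads/$branch"
    done
    printf '%s\n' "$spec"
}

# Usage (needs git-annex; "master" is an assumed branch name):
#   git annex unused --used-refspec="$(branches_refspec master)"
```

With a refspec like this, data referenced only by other branches or by tags would be reported as unused, so check the listing before dropping anything.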

Comment by joey — Thu Jan 26 19:33:52 2023

dropping files after changing branches/tags

I have a use case in which a colleague wishes to have a working copy of my data repository to use with the current version of my model. When a new version of the model is available they would likewise update their git-annex clone of my data. The colleague wants to drop any files that have been made obsolete by this change, but I do not see an efficient way to make this determination. They could of course drop everything and then do git annex get . but that could be very expensive if only a small subset of the files have actually changed.

I'm probably just missing something basic, as this seems to be a reasonably frequent use case.

Comment by tom_clune — Thu Jan 26 19:33:52 2023

comment 3

git-annex unused looks at what data is used by git branches and tags, but not by other commits. It's a reasonable request and I have made a todo for it: find unused in any commit. But I am unsure if it can be implemented to run fast enough to be usable.

Comment by joey — Thu Jan 26 19:33:52 2023

Keep historical data, but delete data never referenced

Is there an easy solution for the following? There are two kinds of "unused" I would like to treat differently:

Kind "really unused": Was added once to the annex, but symlink was never committed
Kind "only history": A commit contains a symlink to the data, but no active branch does

I want to preserve "only history", and only drop "really unused". What is an elegant way to do this? Thanks for your suggestions.

Comment by https://www.google.com/accounts/o8/id?id=AItOawn3p4i4lk_zMilvjnJ9sS6g2nerpgz0Fjc — Thu Jan 26 19:33:52 2023

finding data that isn't unused, but should be.

Sometimes links to annexed data still exist on some branch, when the data was supposed to be dropped. Here is how I found these; perhaps there is a simpler way.

% git annex find --format '${key}\n' | sort > /tmp/known-keys
% find .git/annex/objects -type f -exec basename {} \; | sort  > /tmp/local-keys
% comm -23 /tmp/local-keys /tmp/known-keys

To see which branch these are on, try

% git log --stat --all -S"$key"

for one of the keys output above. In my case it was the same remote branch keeping them all alive.

EDIT: sorted the key lists so that comm works properly.

Comment by bremner — Thu Jan 26 19:33:52 2023
