git-annex now has experimental support for tuning a repository for different workloads.
For example, a repository with a very large number of files in it may work
better if git-annex uses some nonstandard hash format, for either the
.git/annex/objects/ directory, or for the log files in the git-annex branch.
A repository can currently only be tuned when it is first created; this is
done by passing -c name=value parameters to git annex init.
For example, this will make git-annex use only 1 level for hash directories
in .git/annex/objects:
git -c annex.tune.objecthash1=true annex init
It's very important to keep in mind that this makes a nonstandard format git-annex repository. In general, this cannot safely be used with git-annex older than version 5.20150128. Older versions of git-annex will not understand the tuned format, will get confused, and may do bad things.
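One way to guard against the version problem is to compare the installed git-annex version against 5.20150128 before touching a tuned repository. The following is only a sketch: the helper name version_at_least is hypothetical, it assumes a sort that supports -V (GNU coreutils), and it assumes the installed git-annex accepts version --raw.

```shell
#!/bin/sh
# Minimum git-annex version that understands tuned repositories
# (taken from the text above).
MIN_TUNE_VERSION="5.20150128"

# version_at_least HAVE WANT
# Hypothetical helper: succeeds if version HAVE is greater than or equal to
# WANT. Assumes `sort -V` (version sort) is available.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Usage (requires git-annex on PATH; `--raw` is assumed to be supported):
#   have=$(git annex version --raw)
#   if version_at_least "$have" "$MIN_TUNE_VERSION"; then
#       echo "git-annex $have should understand tuned repositories"
#   else
#       echo "git-annex $have is too old for tuned repositories" >&2
#   fi
```

The check has to be run on every machine that will use the tuned repository, since each one may have a different git-annex installed.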
Also, it's not safe to merge two separate git repositories that have been
tuned differently (or one tuned and the other one not). git-annex will
prevent merging their git-annex branches together, but it cannot prevent
git merge remote/master merging two branches, and the result will be ugly
at best (git annex fix can fix up the mess somewhat).
Again, tuned repositories are an experimental feature; use with caution!
The following tuning parameters are available:
annex.tune.objecthash1=true
Use just one level of hash directories in .git/annex/objects/, instead of the default two levels.
annex.tune.objecthashlower=true
Make the hash directories in .git/annex/objects/ use all lower-case, instead of the default mixed-case.
annex.tune.branchhash1=true
Use just one level of hash directories in the git-annex branch, instead of the default two levels.
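Several tuning parameters can be combined by passing more than one -c option. Because git accepts any -c name without complaint, a misspelled parameter name appears to go unnoticed and leaves the repository untuned for that setting. The sketch below builds the arguments through a small wrapper that rejects unknown names; the helper name tuned_init_args is hypothetical and not part of git-annex.

```shell
#!/bin/sh
# tuned_init_args NAME...
# Hypothetical helper: prints the arguments to pass to `git` for a tuned
# `annex init`, accepting only the three tuning parameters listed above.
# Nothing is printed if any name is unknown, so a typo cannot slip through.
tuned_init_args() {
    args=""
    for name in "$@"; do
        case "$name" in
            objecthash1|objecthashlower|branchhash1)
                args="$args-c annex.tune.$name=true " ;;
            *)
                echo "unknown tuning parameter: $name" >&2
                return 1 ;;
        esac
    done
    printf '%sannex init\n' "$args"
}

# Usage, in a freshly created git repository (requires git-annex):
#   init_args=$(tuned_init_args objecthash1 objecthashlower) && git $init_args
# which runs:
#   git -c annex.tune.objecthash1=true -c annex.tune.objecthashlower=true annex init
```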
Note that git-annex will automatically propagate these settings to
.git/config for tuned repositories. You should never directly change
these settings in .git/config, and should never set them in global
gitconfig.
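Reading the propagated settings is harmless, and it gives a quick way to see how a repository was tuned, or to check that two clones agree before merging them. The sketch below only reads .git/config and never writes to it. The helper names tuning_of and same_tuning are hypothetical, and comparing .git/config is an assumption about what is sufficient; git-annex itself remains the authority on whether two repositories are compatible.

```shell
#!/bin/sh
# tuning_of REPO
# Hypothetical helper: prints the annex.tune.* settings found in REPO's local
# git config, sorted, one "name value" pair per line. Prints nothing for an
# untuned repository. Read-only: it never modifies the config.
tuning_of() {
    git -C "$1" config --local --get-regexp '^annex\.tune\.' | sort
}

# same_tuning REPO_A REPO_B
# Hypothetical helper: succeeds if both repositories record identical tuning.
same_tuning() {
    [ "$(tuning_of "$1")" = "$(tuning_of "$2")" ]
}

# Usage:
#   tuning_of /path/to/repo
#   if same_tuning /path/to/repo /path/to/other; then
#       echo "tuning matches"
#   else
#       echo "tuning differs; do not merge these repositories" >&2
#   fi
```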
My main use repo is 1.7 TB large and holds 172.000+ annexed files. Variations in filename case have led to a number of file duplications that are still not resolved (I have base scripts that can be used to flatten filename case and fix references in other files, but it will probably mean handling some corner cases and there are more urgent matters for now).
For these reasons I'm highly interested in the lowercase option and I'm probably not the only one in a similar case.
Does migrating to a tuned repository mean unannexing everything and reimporting into a newly created annex, replica by replica, then syncing again? That's a high price in some setups. Or is there a way to somehow git annex sync between a newly created repo and an old, untuned one?
It should be possible to write a git-filter-branch that converts a repository from one tuning to another, but it would not be trivial, and no one has done it yet. You'd still have to run it in every clone of the repository. Tuned and non-tuned repositories can't interoperate.
Right, it's not simply lower-casing but a different hash strategy as described in hashing.
Combining annex.tune.objecthashlower and annex.tune.objecthash1 will result in one level of hash directories. If you get two levels then you probably typoed "objecthas1" ...
it starts to use 2 levels (even if annex.tune.objecthash1=true) of hash directories having 3 characters in the filename at each level. So it is not just "take the existing hash directories (1 or 2 levels) and use their lower-case version". It is a different way to create the hash directories:
e.g. one with objecthas1=true
1 -> .git/annex/objects/qj/SHA256E-s6--ecdc5536f73bdae8816f0ea40726ef5e9b810d914493075903bb90623d97b1d8/SHA256E-s6--ecdc5536f73bdae8816f0ea40726ef5e9b810d914493075903bb90623d97b1d8
and if I provide all three options at once:
1 -> .git/annex/objects/ccf/a40/SHA256E-s6--ecdc5536f73bdae8816f0ea40726ef5e9b810d914493075903bb90623d97b1d8/SHA256E-s6--ecdc5536f73bdae8816f0ea40726ef5e9b810d914493075903bb90623d97b1d8
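To illustrate why the lower-case layout is a different hash rather than a lower-cased copy of the mixed-case directories, here is a rough sketch of how the three-character directory names could be derived. It assumes, following my reading of the hashing documentation, that the names are the first six hex characters of the md5sum of the key, split three and three, and that objecthash1 keeps only the first of the two. The function name lower_hashdir is hypothetical and md5sum from GNU coreutils is assumed to be installed.

```shell
#!/bin/sh
# lower_hashdir KEY [LEVELS]
# Hypothetical helper: prints the lower-case hash directory for KEY under the
# assumption that it is built from the md5sum of the key: characters 1-3 form
# the first directory and characters 4-6 the second. LEVELS defaults to 2;
# pass 1 to model annex.tune.objecthash1=true.
lower_hashdir() {
    digest=$(printf '%s' "$1" | md5sum | cut -c1-6)
    first=$(printf '%s' "$digest" | cut -c1-3)
    second=$(printf '%s' "$digest" | cut -c4-6)
    if [ "${2:-2}" = 1 ]; then
        printf '%s\n' "$first"
    else
        printf '%s/%s\n' "$first" "$second"
    fi
}

# Usage:
#   lower_hashdir SHA256E-s6--ecdc5536f73bdae8816f0ea40726ef5e9b810d914493075903bb90623d97b1d8
#   lower_hashdir SHA256E-s6--ecdc5536f73bdae8816f0ea40726ef5e9b810d914493075903bb90623d97b1d8 1
```

Comparing the output for the key in the listing above with the observed ccf/a40 directories is a quick way to check whether this assumption holds for a given git-annex version.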