How does one compress a large number of small files? I have over a terabyte of data that is currently stored as small Matlab .mat files (~750 KB each) in a hierarchy of directories (each top-level directory contains several GB of data, divided among a few hundred subdirectories that vary in size from a couple of MB to a few hundred MB). I'm considering compressing the data, but I'm not 100% sure what the best way to do that is for such a large structure. Also, does anybody have any suggestions as to which compression format will work best? — D. Borrero 2009-07-08 13:29
Aren't Matlab .mat files binary floating-point data, and so unlikely to compress much? (I believe newer Matlab versions also compress .mat files by default when saving.) If there is room to be gained from compression (try gzipping an individual file to check), I would suggest one of two alternatives:
1. "tar cvfpz bigdir.tgz bigdir", where bigdir is the name of the top-level directory. That will compress everything into one big tarfile, whose contents you can then list with "tar tvfpz bigdir.tgz" or extract with "tar xvfpz bigdir.tgz".
2. "gzip -r .". That'll do a recursive descent into the current directory and compress each file within it individually, in place. Alternatively you could do "find . -name '*.mat' -exec gzip {} \;".

You can use bzip2 instead of gzip in any of these commands (j in place of z in the tar commands); bzip2 is supposed to give better compression but it doesn't always.
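To make the "try gzipping an individual file" check and the tar approach a bit more concrete, here is a rough sketch. The function names gzip_sizes and archive_each are made up for illustration, and writing one .tgz per top-level directory (rather than one archive for the whole terabyte) is my own assumption about what would be manageable, not something required by tar.

```shell
#!/bin/sh
# Sketch only: function names and layout are illustrative assumptions.

# Print the original and gzip-compressed size (in bytes) of one file.
# The file itself is left untouched; gzip -c writes to stdout.
gzip_sizes() {
    orig=$(wc -c < "$1" | tr -d ' ')
    comp=$(gzip -c "$1" | wc -c | tr -d ' ')
    echo "$1: $orig -> $comp bytes"
}

# Create one .tgz per immediate subdirectory of the given root,
# placed next to that subdirectory. The originals are not deleted.
archive_each() {
    for d in "$1"/*/; do
        [ -d "$d" ] || continue
        name=$(basename "$d")
        tar czf "$1/$name.tgz" -C "$1" "$name"
    done
}

# Example usage (paths are hypothetical):
#   gzip_sizes bigdir/run01/sample.mat
#   archive_each bigdir
```

If gzip_sizes shows the compressed size is close to the original on a handful of files, compressing the whole tree is probably not worth the time. Swapping "gzip -c" for "bzip2 -c" in gzip_sizes lets you compare the two formats on your own data before committing to either.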