How to compress a data file

How does one compress a large number of small files? I have over a terabyte of data that is currently stored as small MATLAB .mat files (~750 KB each) in a hierarchy of directories (the top-level directories contain several GB of data, which are then divided among a few hundred subdirectories that vary in size from a couple of MB to a few hundred MB). I'm considering compressing the data, but I'm not 100% sure what the best way to do that is for such a large structure. Also, does anybody have any suggestions as to which compression format will work best? -- D. Borrero 2009-07-08 13:29

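Aren't MATLAB .mat files binary floating-point data, and thus already compact? Binary floating-point data usually compresses poorly, and MATLAB's version 7 MAT-file format already compresses data by default when saving. If there is still room to be gained from compression (try gzipping an individual file and comparing the sizes), I would suggest one of two alternatives: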

  1. Compress the entire directory structure with “tar cvfpz bigdir.tgz bigdir”, where bigdir is the name of the top-level directory. That will compress everything into one big tar file, whose contents you can then list with “tar tvfpz bigdir.tgz” or extract with “tar xvfpz bigdir.tgz”.
  2. Compress files individually with “gzip -r .”. That will descend recursively from the current directory and compress every file it finds, replacing each file with a .gz version. Alternatively, to compress only the .mat files, you could do “find . -name '*.mat' -exec gzip {} \;”. You can use bzip2 instead of gzip in the latter commands (j in place of z in the tar commands); bzip2 usually gives better compression than gzip, but not always, and it is slower.
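
Before committing a terabyte of data to either alternative, it may be worth measuring how much a sample file actually shrinks, and checking that an archive is readable before removing any originals. The following is only a rough sketch under stated assumptions: the function names size_check and make_archive are made up for illustration, and gzip (or bzip2) and tar are assumed to be installed.

```shell
#!/bin/sh

# Print "<original bytes> <compressed bytes>" for one file, using the given
# compressor (gzip or bzip2). The compressed data goes to a pipe, so the
# file itself is left untouched.
# size_check is a hypothetical helper name, not a standard tool.
size_check() {
    tool=$1
    file=$2
    # $(( ... )) strips the padding that some wc implementations print
    orig=$(( $(wc -c < "$file") ))
    comp=$(( $("$tool" -c "$file" | wc -c) ))
    echo "$orig $comp"
}

# Pack a directory into a gzip-compressed tar file, then list the archive
# (discarding the listing) as a basic check that it can be read back.
# Returns non-zero if either step fails.
# make_archive is a hypothetical helper name, not a standard tool.
make_archive() {
    dir=$1
    archive=$2
    tar czpf "$archive" "$dir" && tar tzf "$archive" > /dev/null
}
```

For example, “size_check gzip somefile.mat” (with somefile.mat standing in for one of your own files) prints the two sizes side by side; if they are nearly equal, compression is unlikely to be worth the effort. “make_archive bigdir bigdir.tgz” builds the same kind of archive as alternative 1, minus the verbose listing.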
gtspring2009/howto/compress.txt · Last modified: 2010/02/02 07:55 (external edit)