Back up of large filesystem with a very large number of files

Posted: June 6th, 2014 | Filed under: Backup, Linux, Storage

Backing up large filesystems with very large numbers of files and directories is challenging. I routinely deal with 20-40TB filesystems holding millions, sometimes tens of millions, of files. A filesystem holding Unix home directories at a site with a large number of users is a typical example. The problem is backup speed more than capacity.

Some commercial backup packages can do a better job here, but if you rely on free backup solutions, the situation is more challenging. I have tried quite a few open-source backup programs, and most of them do not perform acceptably in this type of environment. However, I have identified a few approaches whose performance is acceptable: barely acceptable, but acceptable from my perspective.

First, what not to do / not to use:

  • Anything based on rsync is too slow. Rsync is a great tool and I use it frequently in my work, but it is not fast enough when dealing with a very large number of files and directories.
  • If you have to transfer your backup data over the network, even your local LAN, avoid ssh as the transport; again, it is too slow.
  • Forget backup encryption (unless you work in an environment where it is required by regulations, such as medical records, financial records, etc.).

And now, what I’ve found to work best. (Note that everything here is about backing up Linux-based systems.)

GNU tar

Yes, the old trusted GNU tar – it works better than anything else. Its features (incremental archives, multi-tape volumes), combined with its performance, make tar my preferred tool for backups.
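
As a minimal sketch of the incremental mode mentioned above: GNU tar keeps its state in a snapshot file passed with -g (--listed-incremental). A level-0 run dumps everything, and later runs that reuse the same snapshot file dump only what has changed since. The function names and argument layout below are hypothetical, not taken from my production scripts.

```shell
#!/bin/bash
# Sketch of GNU tar incremental backups.
# Function names and paths are hypothetical.

# full_backup PARENT DIR ARCHIVE SNAPSHOT
# Level 0: dump everything under PARENT/DIR and (re)create the snapshot file.
full_backup() {
    tar -C "$1" -cf "$3" -g "$4" --level=0 "$2"
}

# incremental_backup PARENT DIR ARCHIVE SNAPSHOT
# Reuses the snapshot file, so only files changed since the
# previous run end up in the new archive.
incremental_backup() {
    tar -C "$1" -cf "$3" -g "$4" "$2"
}
```

A typical cycle would call full_backup once (for example weekly) and incremental_backup on the days in between, always with the same snapshot file.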

I’ve used tar in three different scenarios based on where tar archives are stored.

Tar and LTO5/6 tape

GNU tar combined with a locally attached (SAS or fiber) LTO5/LTO6 tape autochanger or tape library is the best-performing backup solution in my environment.

GNU tar can create volumes spanning multiple tapes. Thus, it can be used with a tape changer script to create a fully automatic backup solution.
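
A tape changer script of this kind can be quite small. The following is only a sketch, not the script I run: it assumes the changer is controlled with the mtx utility, and the device names (/dev/sg3 for the changer, /dev/nst0 for the drive) and the script name are hypothetical. GNU tar calls the script given with -F at the end of each volume and sets TAR_VOLUME in its environment.

```shell
#!/bin/bash
# next_tape.sh - hypothetical volume-change helper for GNU tar, used as:
#   tar -cM -F ./next_tape.sh -f /dev/nst0 /home
# Assumed device names: /dev/sg3 is the changer, /dev/nst0 the tape drive.
CHANGER_DEV="${CHANGER_DEV:-/dev/sg3}"
MTX="${MTX:-mtx}"          # set MTX=echo for a dry run

load_next_tape() {
    # "mtx next" unloads drive 0 and loads the tape from the next slot
    "$MTX" -f "$CHANGER_DEV" next 0
}

# tar exports TAR_VOLUME when it calls the script at the end of a volume;
# a non-zero exit status tells tar to abort the backup.
if [ -n "${TAR_VOLUME:-}" ]; then
    echo "loading tape for volume $TAR_VOLUME" >&2
    load_next_tape || exit 1
fi
```

Running the script by hand with MTX=echo only prints the mtx arguments, which is a cheap way to check the device names before a real run.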

An example of typical backup performance:

  • Dell PowerVault 124T LTO6 tape autochanger (SAS attached)
  • Hardware compression
  • 55TB filesystem with 22TB in use: 5592231 files, and 60437 directories
  • Full backup duration: 48 hours, 40 minutes

You may frown at a 48-hour window for a full backup, but only if you have not yourself tried to back up 22TB of data consisting of roughly 6 million files and 60 thousand directories. It is not an easy task, and I am quite satisfied with the above performance.
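
To put the figures above in perspective, the average rates can be worked out with a few lines of shell arithmetic. This assumes 1 TB = 10^12 bytes (the figures may well be in binary units, which would raise the result slightly) and uses integer division, so the results are rounded down.

```shell
#!/bin/bash
# Average rates for the full backup described above.
# Assumption: 1 TB = 10^12 bytes = 1,000,000 MB.
SECONDS_TOTAL=$(( 48 * 3600 + 40 * 60 ))        # 48 h 40 min
MB_TOTAL=$(( 22 * 1000 * 1000 ))                # 22 TB expressed in MB
FILES_TOTAL=5592231

MB_PER_SEC=$(( MB_TOTAL / SECONDS_TOTAL ))      # integer division
FILES_PER_SEC=$(( FILES_TOTAL / SECONDS_TOTAL ))

echo "elapsed:    ${SECONDS_TOTAL} s"
echo "throughput: ~${MB_PER_SEC} MB/s"
echo "file rate:  ~${FILES_PER_SEC} files/s"
```

That works out to about 125 MB/s and about 31 files per second on average. LTO-6 has a native rate of 160 MB/s, which suggests that reading millions of files from disk, rather than the tape drive, sets the pace.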

A couple of additional notes on using LTO tape libraries:

  • I always use the hardware compression available within the tape library. It is fast and it does not tax the CPU on the server.
  • LTO is very fast for sequential reads and writes. The disk storage holding the filesystem being backed up must be fast enough to feed the tape; otherwise, performance drops substantially.

Tar and local disk storage

When storing tar archives on disk, significant performance is gained by combining tar with parallel compression tools, such as parallel gzip (pigz) or parallel bzip2 (pbzip2), with pigz being my choice. Naturally, this only helps if the server has multiple CPU cores available, but having 8 or more cores is quite common on servers these days.

In my practice I pipe tar output through pigz; however, the compression program can also be specified via the “--use-compress-program” argument to tar.
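
The two variants can be sketched side by side as follows. The function names are hypothetical, the thread count of 8 is just an assumed default, and the fallback to plain gzip is there only so the sketch still runs on a machine without pigz. Passing arguments inside --use-compress-program needs a reasonably recent GNU tar.

```shell
#!/bin/bash
# Sketch: two equivalent ways to write a compressed tar archive to disk.
# Function names are hypothetical; gzip is used only as a fallback for
# machines without pigz.

# Number of compression threads (assumed default of 8)
THREADS="${THREADS:-8}"

if command -v pigz > /dev/null 2>&1; then
    COMPRESS="pigz -p $THREADS"
else
    COMPRESS="gzip"
fi

# archive_piped PARENT DIR ARCHIVE: pipe tar output through the compressor
archive_piped() {
    tar -C "$1" -cf - "$2" | $COMPRESS > "$3"
}

# archive_with_option PARENT DIR ARCHIVE: let tar start the compressor itself
archive_with_option() {
    tar -C "$1" --use-compress-program="$COMPRESS" -cf "$3" "$2"
}
```

Both functions produce an ordinary .tar.gz file that can be listed with "tar -tzf". The piped form makes it easy to add further stages, such as netcat, to the pipeline.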

Tar and remote disk storage

Remote storage could be mounted via NFS (or SMB, etc.); however, I find that transferring via netcat is much faster. My workflow is as follows:

 local host: tar | pigz | netcat
 remote host: netcat -> disk

where “local host” is the one being backed up, and “remote host” is where the backup is stored.

As an example of the above process, we are going to create a full tar backup on server A and store it on server B:

 A: tar -cvf - -g /var/log/backups/maximus.snap --level=0 /home/maximus | pigz -p 8 | nc -l 8888
 B: nc A 8888 > /backup/A_maximus.tar.gz

This can all be automated in a backup script using password-less ssh between servers A and B, as in the following script that I use on one of my servers. The script runs on the server “biospace”, and the tar archives it creates are transferred to and stored on the server “mimus”:

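#!/bin/bash
# Tar backups - creates tar.gz archive files and
# sends them via netcat from biospace to the mimus server
#   Karol M.   May 2014

# check if /home is mounted
if ! mountpoint -q /home; then
   exit 1
fi

# check if the lock file exists, exit if it does
if [ -e /etc/backups/__lockfile_mimus ]; then
   exit 2
fi

# change FULL to yes to run a full backup
# this will result in removal of tar snap files, and labelling tar files with "full"
FULL="no"

touch /etc/backups/__lockfile_mimus
TODAY=`date +%Y%m%d`

cd /home
for DIR in `find . -maxdepth 1 -mindepth 1 -type d`
do
  # pause between transfers so the previous netcat listener can release the port
  sleep 10
  # strip the leading "./" that find puts in front of each directory name
  DIR=`echo $DIR | sed 's/\///' | sed 's/^\.//'`
  # NOTE: the SNAP and NAME assignments are missing from this listing;
  # the two lines below are assumed reconstructions, adjust them to your layout
  SNAP="/etc/backups/${DIR}.snap"
  NAME="${DIR}_${TODAY}"
  echo "DIR = $DIR"
  echo "SNAP = $SNAP"
  if [ "$FULL" == "yes" ]; then
    NAME="${NAME}_full"
    # removing the snap file forces tar to start again from level 0
    if [ -e "$SNAP" ]; then
       rm "$SNAP"
    fi
  fi
  echo "NAME = $NAME"
  # the listener runs in the background; mimus then connects and pulls the stream
  tar -cvf - -g "$SNAP" --one-file-system "$DIR" | pigz -p 8 | nc -l 8888 &
  ssh mimus "nc biospace 8888 > /data1/home/${NAME}.tar.gz"
  # make sure the background pipeline has finished before the next directory
  wait
done

rm -f /etc/backups/__lockfile_mimus
exit 0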
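
One weakness of plain lock-file handling like the above is that the lock file stays behind if the script is killed halfway through, which blocks all later runs until someone removes it by hand. A possible refinement, shown here only as a sketch with a hypothetical helper name and lock path, is to create the lock atomically and remove it from an EXIT trap:

```shell
#!/bin/bash
# Sketch: lock handling that cleans up after itself.
# LOCKFILE is a hypothetical path; run_locked is a hypothetical helper name.
LOCKFILE="${LOCKFILE:-/tmp/__lockfile_example}"

# run_locked COMMAND [ARGS...]
# Runs COMMAND while holding the lock; returns 2 if the lock is already held.
run_locked() {
    (
        # noclobber makes the redirect fail if the file already exists,
        # so checking for and creating the lock is a single step
        set -o noclobber
        if ! { echo "$$" > "$LOCKFILE"; } 2> /dev/null; then
            exit 2
        fi
        set +o noclobber
        # remove the lock however the subshell ends
        trap 'rm -f "$LOCKFILE"' EXIT
        "$@"
    )
}
```

The backup loop would then live in a function of its own and be started as "run_locked backup_all", where backup_all is again a made-up name.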

StoreBackup

If you do not want to create your own backup scripts using tar and prefer ready-to-use backup software, you may want to check out StoreBackup.

While tar combined with an LTO tape library is a much better choice for very large filesystems, I have used StoreBackup quite successfully on smaller (<10TB) filesystems. StoreBackup (written in Perl) is the best-performing open-source backup software that I have tested. It does not do any special magic: it uses the "cp" command combined with a compression program of one's choice (pigz in my case) to create and store backups on disk. Any filesystem that can be mounted will work: NFS, Samba, etc. StoreBackup's performance gains come from running multiple instances of copy and compression at the same time, so it benefits from a large number of cores on the server.
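
The general idea of running several compression processes side by side can be illustrated with standard tools. This is not StoreBackup's code, only a sketch of the principle; the helper name and the default of 4 parallel jobs are assumptions.

```shell
#!/bin/bash
# Sketch of the idea only, not StoreBackup's actual code: compress many
# files at once, one gzip process per file, up to JOBS in parallel.
# compress_tree is a hypothetical helper name.
JOBS="${JOBS:-4}"

# compress_tree DIR: gzip every regular file under DIR that is not already .gz
compress_tree() {
    find "$1" -type f ! -name '*.gz' -print0 | xargs -0 -r -n 1 -P "$JOBS" gzip
}
```

With many small files, per-file parallelism like this keeps all cores busy even though each individual gzip process is single-threaded.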
