SCM BACKUP

FUNCTIONS

scm-backup-du
list local backup .gz sizes in $SCM_FOLD
scm-backup-rls
remote ls of the .gz tarballs on the paired backup node $BACKUP_TAG
scm-backup-mail
send mail with the remote listing
scm-backup-check
find tarballs on all backup nodes
scm-backup-df
check free space on the server and all backup nodes

scm-backup-postfix-start

scm-backup-bootstrap
rsync the tarballs from the backup node of the designated server node for this node, and recover from them
scm-backup-nightly-as-root
runs the nightly backup as root ... as done in the crontab
scm-backup-all-as-root
runs scm-backup-all (below) as root ... as done in the crontab
scm-backup-all

invokes the below:

scm-backup-repo
scm-backup-trac
scm-backup-folder   for the apache-confdir
scm-backup-purge   : retain the backups from the last 7 days only (a retention sketch follows this list)
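
A minimal sketch of that 7-day retention, assuming the dated tarball layout shown later on this page ($SCM_FOLD/backup/<node>/.../YYYY/MM/DD/HHMMSS); the real scm-backup-purge takes from-node and number-to-keep arguments and may differ in detail:

node=cms01        ## hypothetical from-node
keep=7            ## days of backups to retain
find $SCM_FOLD/backup/$node -name '*.tar.gz*' -mtime +$keep -print -delete
## the '*.tar.gz*' pattern also catches the .dna sidecars described under INTEGRITY CHECKS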

scm-recover-all fromnode

In addition to the Trac and SVN repos this now also recovers users.conf and authz.conf with scm-recover-config, in a careful manner that prompts for confirmation before replacing these critical apache/svn/Trac config files.

scm-recover-config fromnode

Extracts the users.conf and authz.conf from the svnsetup.tar.gz backup file into a temporary location and compares these temporaries with the corresponding config files within apache-confdir. If there are preexisting config files, the diffs are shown and a confirmation dialog is required before replacing them with the extractions (a sketch of this flow follows the list below).

This calls:

scm-recover-folders    # contrary to the name this just places a "last" link to identify the last tarball folder
scm-recover-users
scm-recover-authz
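
A hedged sketch of the extract/diff/confirm flow; the tarball location and the APACHE_CONFDIR variable are illustrative, not the actual implementation:

tmp=$(mktemp -d)
tar -C "$tmp" -zxf /var/scm/backup/cms01/folders/svnsetup.tar.gz   ## hypothetical path to the svnsetup backup
for name in users.conf authz.conf ; do
    new=$(find "$tmp" -name "$name")
    old="$APACHE_CONFDIR/$name"            ## apache-confdir location is an assumption
    if [ -f "$old" ] && ! diff "$old" "$new" ; then
        read -p "replace $old with the backed up version ? [y/N] " ans
        [ "$ans" = "y" ] && cp "$new" "$old"
    fi
done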

scm-recover-users fromnode

extract the users file from the last svnsetup tarball; called by scm-recover-all. NB the other svnsetup files are sourced from the repository and contain system specific paths ... so it is more direct to re-generate them than to use the backups.

The users file is different because it is edited through the webadmin interface.

scm-recover-authz fromnode

Analogous to scm-recover-users for the authz file
scm-recover-folders fromnode

still experimental .. NEEDS FURTHER CHECKING PRIOR TO REAL USAGE

recovers the users and permissions files from the last backup

scm-recover-lastlinks typ

typ defaults to tar.gz

this must be run from the backup folder that should contain the “last” link eg:

/var/scm/backup/cms01/tracs/env
last -> 2008/08/14/174749

if the “last” link exists then exit without doing anything; however if the last link has been collapsed into a folder (eg by web transfers or non-careful copying) then delete that folder and attempt to recreate the “last” link pointing to the directory containing the last file of that type
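
A sketch of that behaviour, run from the backup folder as above; illustrative only, the real scm-recover-lastlinks may differ in detail:

typ=${1:-tar.gz}
if [ -L last ]; then
    echo "last link present : nothing to do"
else
    [ -d last ] && rm -rf last                       ## link collapsed into a folder by a non-careful copy
    target=$(find . -name "*.$typ" | sort | tail -1) ## zero-padded YYYY/MM/DD/HHMMSS sorts chronologically
    [ -n "$target" ] && ln -s "$(dirname ${target#./})" last
fi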

scm-backup-purge from-node number-to-keep

scm-backup-rsync

rsync the backups to the paired node. To override and send the backup to a non-standard destination, eg while not inside the home internal network and needing to use G3R:

BACKUP_TAG=G3R scm-backup-rsync

scm-backup-rsync-from-node

rsync the backups from a remote node

scm-backup-dybsvn-from-node

copy over the repos for a specific day

scm-backup-eup

updates the env sphinx docs, including the SCM backup tarball monitoring pages and plots.

On repo node C2, this is done automatically via the root crontab running scm-backup-monitor. This means that in order to update the env docs on C2 it must be done as root:

ssh C2 /data/env/system/svn/subversion-1.4.6/bin/svn up ~/env
ssh C2R
    scm-backup-
    scm-backup-eup

STATE OF LOCK ADDITIONS

  • cannot incorporate the scm-backup-rsync LOCKS in the IHEP to NTU transfers due to lack of permissions : working with Q to incorporate scm-backup-rsync into the root cron task by reviving the ssh-agent hookup

ABOUT LOCKING : GLOBAL AND RSYNC LOCKS

  • during scm-backup-all the “global” LOCKED directory $SCM_FOLD/LOCKED is created
  • scm-backup-rsync pays attention to this LOCKED and will abort if it is present
  • during scm-backup-rsync both the global LOCKED described above and an additional rsync LOCKED are planted, eg in $SCM_FOLD/backup/cms02/LOCKED/, during each transfer to the partnered remote nodes
  • following rsync completion the rsync LOCKED is removed and a quick re-rsync is done to remove the LOCKED from the remote

Note that the rsync LOCKED status is propagated to the remote directory during the rsync transfer, thus preventing use of the backups there while a transfer is in progress.
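
A minimal sketch of that locking pattern, assuming $BACKUP_TAG is an ssh alias for the paired node and using the remote path seen in the logs below; the real functions differ in detail:

global_lock=$SCM_FOLD/LOCKED                 ## planted by scm-backup-all for the duration of the backup
rsync_lock=$SCM_FOLD/backup/cms02/LOCKED     ## planted per transfer by scm-backup-rsync

[ -d "$global_lock" ] && { echo "abort : global LOCKED present" ; exit 1 ; }
mkdir -p "$rsync_lock"                       ## the LOCKED directory travels to the remote with the transfer
rsync -e "ssh" --delete-after -razvt $SCM_FOLD/backup/cms02 $BACKUP_TAG:/data/var/scm/backup/
rmdir "$rsync_lock"
rsync -e "ssh" --delete-after -razvt $SCM_FOLD/backup/cms02 $BACKUP_TAG:/data/var/scm/backup/   ## quick re-rsync clears the remote LOCKED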

INTEGRITY CHECKS

Locking now prevents backup/rsync/recover functions, both locally and remotely, from touching partials. The backup procedures are purported to be hotcopies, although mismatches between what gets into the Trac instance backup and the SVN repo backup are possible. Such mismatches would not cause corruption however, probably just warnings from Trac syncing.

The DNA check ensures that the tarball content immediately after creation corresponds precisely to the tarball at the other end of the transfers.

  • scm-backup-trac

    • scm-tgzcheck-trac : does a ztvf to /dev/null, extracts trac.db from tgz, dumps trac sql using sqlite3
    • scm-backup-dna : writes python dict containing md5 digest and size of tgz in sidecar .dna file
  • scm-backup-repo

    • scm-tgzcheck-ztvf : does a ztvf to /dev/null
    • scm-backup-dna : as above
  • scm-backup-rsync

    • performs remote DNA check for each paired backup node with scm-backup-dnachecktgzs : finds .tar.gz.dna and looks for mutants (by comparing sidecar DNA with recomputed)
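
An illustrative check of a single .dna sidecar against a recomputed digest; the tarball path is hypothetical, and the sidecar is the python dict with 'dig' and 'size' shown in the transcript below:

tgz=/var/scm/backup/cms02/tracs/env/2011/10/19/100802/env.tar.gz    ## hypothetical path
recorded=$(sed -n "s/.*'dig': '\([0-9a-f]*\)'.*/\1/p" "$tgz.dna")   ## md5 written at creation time
recomputed=$(md5sum "$tgz" | cut -d' ' -f1)                         ## md5 of the tarball as it is now
[ "$recorded" = "$recomputed" ] && echo "OK $tgz" || echo "MUTANT $tgz"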

DURING AN RSYNC TRANSFER, BOTH SIZE AND DIGEST DIFFER

[dayabay] /home/blyth/env > ~/e/base/digestpath.py  /home/scm/backup/dayabay/svn/dybaux/2011/10/19/100802/dybaux-5086.tar.gz
{'dig': '7b87e78cc03ea544e2ad3abae46eecd1', 'size': 1915051630L}

[blyth@cms01 ~]$  ~/e/base/digestpath.py  /data/var/scm/backup/dayabay/svn/dybaux/2011/10/18/100802/dybaux-5083.tar.gz
{'dig': 'da39aee61a748602a15c98e3db25d008', 'size': 1915004348L}

[blyth@cms01 ~]$  ~/e/base/digestpath.py  /data/var/scm/backup/dayabay/svn/dybaux/2011/10/18/100802/dybaux-5083.tar.gz
{'dig': 'da39aee61a748602a15c98e3db25d008', 'size': 1915004348L}
              sometime later, there is no change : transfer stalled  ?

ISSUE : rsync not working, tarballs not getting purged ?

  1. Aug 19, 2014 : observe that tarballs on C have not been purged since July 20 ?
  2. Feb 5, 2015 : same again; suspect that a hiatus results in too many changed files, which makes rsync fall foul of the timeout : so the rsync, and the purge it triggers on remote nodes, never happen

Checking logs see error:

=== scm-backup-rsync : quick re-transfer /var/scm/backup/cms02 to C:/data/var/scm/backup/ after unlock
=== scm-backup-rsync : time rsync -e "ssh" --delete-after --stats -razvt /var/scm/backup/cms02 C:/data/var/scm/backup/ --timeout 10
Scientific Linux CERN SLC release 4.8 (Beryllium)
building file list ... done
rsync: mkdir "/data/var/scm/backup" failed: No such file or directory (2)
rsync error: error in file IO (code 11) at main.c(576) [receiver=3.0.6]
rsync: connection unexpectedly closed (8 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(359)
real    0m1.153s

Repeating the rsync command manually works, deleting the backlog of unpurged tarballs:

[root@cms02 log]# rsync -e "ssh" --delete-after --stats -razvt /var/scm/backup/cms02 C:/data/var/scm/backup/ --timeout 10

ISSUE : fabric run fails

INFO:env.tools.libfab:ENV setting (key,val)  (timeout,2)
INFO:__main__:to check db:  echo .dump tgzs | sqlite3 /data/env/local/env/scm/scm_backup_monitor.db
INFO:env.scm.tgz:opening DB /data/env/local/env/scm/scm_backup_monitor.db
INFO:ssh.transport:Connected (version 1.99, client OpenSSH_4.3p2-6.cern-hpn-CERN-4.3p2-6.cern)
INFO:ssh.transport:Authentication (publickey) successful!
INFO:ssh.transport:Secsh channel 1 opened.
monitor cfg: {'HOST': 'C',
 'HUB': 'C2',
 'dbpath': '$LOCAL_BASE/env/scm/scm_backup_monitor.db',
 'email': 'blyth@hep1.phys.ntu.edu.tw simon.c.blyth@gmail.com',
 'jspath': '$APACHE_HTDOCS/data/scm_backup_monitor_%(node)s.json',
 'reporturl': 'http://dayabay.phys.ntu.edu.tw/e/scm/monitor/%(srvnode)s/',
 'select': 'repos/env tracs/env repos/aberdeen tracs/aberdeen repos/tracdev tracs/tracdev repos/heprez tracs/heprez',
 'srvnode': 'cms02'}
[C] run: find $SCM_FOLD/backup/cms02 -name '*.gz' -exec du --block-size=1M {} \;
[C] out: /home/blyth/.bash_profile: line 32: /data/env/local/env/home/env.bash: No such file or directory^M
[C] out: /home/blyth/.bash_profile: line 313: sv-: command not found^M
[C] out: /home/blyth/.bash_profile: line 315: python-: command not found^M
[C] out: find: /backup/cms02: No such file or directory^M

Fatal error: run() received nonzero return code 1 while executing!

ISSUES WITH NEW INTEGRITY TESTS

  • SCM_BACKUP_TEST_FOLD is ignored by scm-backup-purge
  • expensive
  • temporarily takes a lot of disk space, ~4GB (liable to cause non-interesting problems)

IHEP CRON RUNNING OF THE BACKUPS

changed Aug 2011 : cron job times changed to 15:00 and 09:00 (Beijing time).

POTENTIAL scm-backup-repo ISSUE AT 2GB

Early versions of APR on its 0.9 branch, which Apache 2.0.x and Subversion 1.x use, have no support for copying large files (2Gb+). A fix which solves the ‘svnadmin hotcopy’ problem has been applied and is included in APR 0.9.5+ and Apache 2.0.50+. The fix doesn’t work on all platforms, but works on Linux.

On C2 a source-built Apache is used: /data/env/system/apache/httpd-2.0.63
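
A quick way to see whether any single repository file is at or beyond that limit (a sketch; the repository path follows the one used elsewhere on this page):

find /var/scm/repos -type f -size +2097152k -exec ls -lh {} \;    ## 2GB expressed in 1k blocks for older find versions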

HOW TO RECOVER dayabay TARBALLS ONTO cms02, run from C2 (sudo is used)

  1. from C2 : scm-backup-rsync-dayabay-pull-from-cms01
  2. from C2 : scm-recover-all dayabay

Note the potential issue of incomplete tarballs; the locking described above is intended to reduce the chance of this.

HOW TO TEST SOME IMPROVED ERROR CHECKING WITH SINGLE REPO/TRAC BACKUPS

Run as root, eg from C2R:

scm-backup-         ## pick up changes
t scm-backup-repo   ## check the function

mkdir -p /tmp/bkp
scm-backup-repo newtest /var/scm/repos/newtest /tmp/bkp dummystamp

export LD_LIBRARY_PATH=/data/env/system/sqlite/sqlite-3.3.16/lib:$LD_LIBRARY_PATH   ## for the right sqlite, otherwise aborts
scm-backup-trac newtest /var/scm/tracs/newtest /tmp/bkp dummystamp

TESTING FULL BACKUP INTO TMP DIRECTORY

Run as root, eg from C2R:

scm-backup-
t scm-backup-all   ## check the function

rm -rf /tmp/bkptest ; mkdir -p /tmp/bkptest
export LD_LIBRARY_PATH=/data/env/system/sqlite/sqlite-3.3.16/lib:$LD_LIBRARY_PATH
cd /tmp ; SCM_BACKUP_TEST_FOLD=/tmp/bkptest scm-backup-all

SLIMMING THE TRAC TGZ ... ALL THOSE BITTEN LOGS

DELETE FROM bitten_log_message WHERE log IN (SELECT id FROM bitten_log WHERE build IN (SELECT id FROM bitten_build WHERE rev < 23000 AND config = 'trunk'));
DELETE FROM bitten_log WHERE build IN (SELECT id FROM bitten_build WHERE rev < 23000 AND config = 'trunk');
DELETE FROM bitten_error WHERE build IN (SELECT id FROM bitten_build WHERE rev < 23000 AND config = 'trunk');
DELETE FROM bitten_step WHERE build IN (SELECT id FROM bitten_build WHERE rev < 23000 AND config = 'trunk');
DELETE FROM bitten_slave WHERE build IN (SELECT id FROM bitten_build WHERE rev < 23000 AND config = 'trunk');
DELETE FROM bitten_build WHERE rev < 23000 AND config = 'trunk';
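
These are run against the Trac sqlite database. A hedged usage sketch: the dybsvn instance path and the slim_bitten.sql filename are illustrative, and a VACUUM is needed afterwards to actually shrink the file:

## illustrative only : back up trac.db first; instance path and filename are assumptions
sqlite3 /var/scm/tracs/dybsvn/db/trac.db < slim_bitten.sql   ## file containing the DELETEs above
sqlite3 /var/scm/tracs/dybsvn/db/trac.db "VACUUM;"           ## reclaim the freed space in the db file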

Common issues

backups stopped

compare:

scm-backup-du
scm-backup-rls

check base/cron.bash ... usually some environment change has broken the env setup for cron. After modifications, reset the cron backups:

cron-
cron-usage
cron-backup-reset
cron-list root
cron-list blyth

Warning

Usage of cron fabrication is deprecated; it's easier to do this manually.

backups done but not synced off box

Probably the ssh-agent needs restarting; this needs to be done manually after a reboot, see:

ssh--usage
ssh--agent-start

then check offbox passwordless access with:

scm-backup-
scm-backup-rls

Do an emergency backup and rsync, with:

scm-backup-all-as-root
scm-backup-rsync
scm-backup-rls      ## check the remote tgz

TODO

  1. the division of responsibilities between here and cron.bash is a mess
  2. it is not easy to add things to the crontab because of this morass