OpenAFS VLDB Address Troubleshooting

Author: Nate Coraor <nate@psu.???>
Version: 0.1
Copyright: Creative Commons Attribution Non-Commercial Share Alike
Thanks: Andrew Deason, Derrick Brashear, Jeffrey Altman, Freenode #openafs, and the OpenAFS Jabber conference room.

Background

I have vlservers and fileservers in EC2. In EC2, your public IP address is not assigned to any of the Ethernet interfaces configured in the instance. Only a private (10.0.0.0/8) address is assigned.

The problem

Twice, I've ended up with private addresses in the VLDB where public addresses should be. To avoid this, you need to create the following (on Debian/Ubuntu, ${afslocaldir} is /var/lib/openafs/local):

  1. ${afslocaldir}/NetInfo with contents:
       f <public-ip>
       <private-ip>
  2. ${afslocaldir}/NetRestrict with contents:
       <private-ip>

If you start a dafs instance without these files, the private IP will be registered in the VLDB and you will have to get rid of it.
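
As a minimal sketch, the two files described above could be written by a small shell function. The name write_net_files and its arguments are hypothetical (not part of OpenAFS); the directory is passed in so the function can be tried against a scratch directory first.

```shell
# Sketch only: write_net_files is a hypothetical helper, not an OpenAFS tool.
# Usage: write_net_files <afslocaldir> <public-ip> <private-ip>
write_net_files() {
    dir=$1
    public_ip=$2
    private_ip=$3
    # NetInfo: the "f " prefix forces registration of an address that is
    # not assigned to any local interface (the EC2 public IP).
    printf 'f %s\n%s\n' "$public_ip" "$private_ip" > "$dir/NetInfo"
    # NetRestrict: addresses that must not be registered in the VLDB.
    printf '%s\n' "$private_ip" > "$dir/NetRestrict"
}
```

On Debian/Ubuntu this might be called as write_net_files /var/lib/openafs/local 203.0.113.10 10.0.0.5, where both addresses are placeholders for your own.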

The solution

You want to remove the incorrect address and make sure that the fileserver is registered with the correct address. To do this:

  1. Fix your NetInfo and NetRestrict files.
  2. Use bos restart <fileserver-address> dafs to restart the fileserver with the corrected address info. This will register the correct address(es) in the VLDB.
  3. Use vos listvldb -server <private-ip> to make sure that the private IP lists 0 entries.
  4. Execute vos changeaddr <private-ip> -remove
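
Steps 2 through 4 can be sketched as a shell function that prints the commands by default and only executes them when asked. The helper names and the AFS_RUN variable are hypothetical. The check in has_no_entries assumes that vos listvldb ends its output with a "Total entries: N" line, and authentication options for bos and vos are left out.

```shell
# Sketch only: run, has_no_entries and remove_private_addr are hypothetical.
# With AFS_RUN unset the commands are printed, not executed.
run() {
    if [ -n "${AFS_RUN:-}" ]; then "$@"; else echo "$*"; fi
}

# Succeeds only if the text on stdin reports zero VLDB entries.
# Assumption: vos listvldb output ends with "Total entries: N".
has_no_entries() {
    grep -q '^Total entries: 0$'
}

# Usage: remove_private_addr <fileserver-address> <private-ip>
remove_private_addr() {
    fileserver=$1
    private_ip=$2
    # Step 2: restart so the fileserver registers the corrected addresses.
    run bos restart "$fileserver" dafs
    # Step 3: refuse to continue while entries still use the private IP.
    if [ -n "${AFS_RUN:-}" ]; then
        vos listvldb -server "$private_ip" | has_no_entries || {
            echo "entries still reference $private_ip; not removing" >&2
            return 1
        }
    else
        echo "vos listvldb -server $private_ip"
    fi
    # Step 4: remove the stale address from the VLDB.
    run vos changeaddr "$private_ip" -remove
}
```

Running it without AFS_RUN set gives a plan to review. The fileserver may need a moment after the restart before its registration shows up, so the check in step 3 may have to be repeated.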

What not to do

  1. Do not use vos setaddrs.
  2. Do not use vos changeaddr <private-ip> <public-ip>
  3. Do not remove sysid unless you are doing a full disaster recovery.

Disaster recovery

You can recover from a screwed up VLDB with the following steps (on Debian/Ubuntu, ${afsdbdir} is /var/lib/openafs/db):

  1. bos stop <vlserver-address> vlserver for all of your vlservers
  2. rm ${afsdbdir}/vldb* ${afslocaldir}/sysid on all fileservers and vlservers. This removes the VLDB and each fileserver's sysid file. The sysid file is regenerated, with the correct IPs in it, when dafs restarts.
  3. bos start <vlserver-address> vlserver for all of your vlservers
  4. Wait until udebug <vlserver-address> 7003 shows Recovery state 1f on whichever vlserver is elected as the sync site
  5. bos restart <fileserver-address> dafs for all of your fileservers
  6. vos syncvldb <fileserver-address> for all of your fileservers
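
Because the order of these steps matters, it may help to print the whole sequence for your hosts and review it before running anything. The sketch below only prints commands. recovery_plan and is_sync_site_ready are hypothetical helpers, and the host names in the usage note are placeholders.

```shell
# Sketch only: prints the recovery sequence, executes nothing.
# Usage: recovery_plan "<vlservers>" "<fileservers>"  (space-separated lists)
recovery_plan() {
    vlservers=$1
    fileservers=$2
    for h in $vlservers; do echo "bos stop $h vlserver"; done
    echo "# on every vlserver and fileserver: remove the vldb* and sysid files (step 2)"
    for h in $vlservers; do echo "bos start $h vlserver"; done
    echo "# wait until udebug <sync-site> 7003 shows: Recovery state 1f"
    for h in $fileservers; do echo "bos restart $h dafs"; done
    for h in $fileservers; do echo "vos syncvldb $h"; done
}

# Succeeds if the udebug output on stdin shows the state named in step 4.
is_sync_site_ready() {
    grep -q 'Recovery state 1f'
}
```

For example, recovery_plan "db1 db2" "fs1 fs2" prints the plan for two vlservers and two fileservers, and udebug <vlserver-address> 7003 | is_sync_site_ready could be polled while waiting in step 4.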

Alternatively, Michael Meffie has a tool in development that can repair damaged VLDBs.