An old cache manager bug squashed

An old but critical bug in the unix version of the OpenAFS cache manager kernel module was recently fixed by Sine Nomine and was committed in the upstream stable code tree for inclusion in the next release of OpenAFS. This was quite an old bug. In fact, it has been present since OpenAFS 1.0, which makes it about ten years old.

The site reporting the bug had several hosts crash after removing a bogus IP address in their VLDB, which initially was quite baffling. As it turned out, a rare combination of events lead to a code path that exposed a race condition in the cache manager. In this case, the cache manager would crash when trying to use a pointer to memory which was freed and then reused on another thread.

This was triggered when the client noticed one of the fileserver network interfaces has a new address. At that point the cache manager invalidates the old address from all the cache entries for that server. The memory holding the server information is freed and is available for other uses in the cache manager.

The cache manager code which flushes vcache entries also accesses the server data members when flushing cache entries for read-only volumes. This is done to save the volume level callback information, since read-only volumes have callbacks for the entire volume, and not per individual files.

Now, there are a series of locks in the cache manager to prevent threads from walking over each other’s memory, but in this case, the locks were not used correctly in the code which was flushing the read-only cache entry. This code took a pointer to the memory holding the server information before the lock was held, a classic race condition. The fix was to make sure the pointer to the shared data member is used only after the mutual exclusion lock is held.

The patch is available in the OpenAFS git repository,

cm: address race condition in afs_QueueVCB

This is a conservative fix for the stable series. No new locks, or changes to locking order are introduced. However, longer term, we may want to revisit this part of the cache manager.

Special Type: