Michael Meffie's blog

OpenAFS 1.4.12 pre-release is available

Item Type: 
Topics: 

Update: OpenAFS 1.4.12 release candidate 4 is now available.

The OpenAFS 1.4.12 release candidate 1 has been announced. The source code is available for download. This pre-release contains a large number of fixes since OpenAFS 1.4.11, including several critical fileserver and unix cache manager fixes.

Testing this pre-release in your environment will help improve the quality of the release. We can’t fix problems we don’t know about, so please report any errors to openafs-bugs@openafs.org and any positive results to the OpenAFS information mail list.

Thanks,
Mike

An old cache manager bug squashed

Item Type: 
Topics: 

An old but critical bug in the unix version of the OpenAFS cache manager kernel module was recently fixed by Sine Nomine and was committed in the upstream stable code tree for inclusion in the next release of OpenAFS. This was quite an old bug. In fact, it has been present since OpenAFS 1.0, which makes it about ten years old.

The site reporting the bug had several hosts crash after removing a bogus IP address in their VLDB, which initially was quite baffling. As it turned out, a rare combination of events lead to a code path that exposed a race condition in the cache manager. In this case, the cache manager would crash when trying to use a pointer to memory which was freed and then reused on another thread.

This was triggered when the client noticed one of the fileserver network interfaces has a new address. At that point the cache manager invalidates the old address from all the cache entries for that server. The memory holding the server information is freed and is available for other uses in the cache manager.

The cache manager code which flushes vcache entries also accesses the server data members when flushing cache entries for read-only volumes. This is done to save the volume level callback information, since read-only volumes have callbacks for the entire volume, and not per individual files.

Now, there are a series of locks in the cache manager to prevent threads from walking over each other’s memory, but in this case, the locks were not used correctly in the code which was flushing the read-only cache entry. This code took a pointer to the memory holding the server information before the lock was held, a classic race condition. The fix was to make sure the pointer to the shared data member is used only after the mutual exclusion lock is held.

The patch is available in the OpenAFS git repository,

cm: address race condition in afs_QueueVCB

This is a conservative fix for the stable series. No new locks, or changes to locking order are introduced. However, longer term, we may want to revisit this part of the cache manager.

Two important OpenAFS fileserver fixes

Item Type: 
Topics: 

Two important fileserver fixes are available for OpenAFS 1.4.11, both of which address intermittent fileserver crashes. Source code patches are available in the OpenAFS git source code repository and are in the pipeline for the next release of OpenAFS.

The first patch fixes an error in the handling of multi-homed client hosts. An OpenAFS client host may have multiple interfaces, and hence multiple IP addresses. The fileserver attempts to associate these IP address to the host in memory. This multi-home tracking has been improved in recent releases of OpenAFS, however a subtle error was introduced around OpenAFS 1.4.8. When the last address associated with a host is removed, the callback connection for that host was also removed. In some cases that connection object was still in use by other threads, and the premature removal of the connection object will lead to a server crash when the fileserver attempts to access a null pointer.

The second fix is for an insidious and long standing bug in the host package of the fileserver. Several cases were found where the fileserver could be using a host object that had been freed. This bug could manifest in a number of terrible ways. Sometimes this bug lead to a situation where the internal list of client hosts was corrupted, in which case the fileserver could crash or even hang as it was trying to traverse a linked list that looped on itself. In other cases, the fileserver heap could be corrupted and the fileserver would crash when calling malloc, or the filerserver would crash when attempting to free an object which was already freed.

The fixes are available in the OpenAFS git repository, and are mirrored on bm1vsrv05.sinenomine.net,

  • viced-null-callback-rxcon-20091022 eliminates the premature removal of the connection object
  • viced-avoid-using-released-hosts-20091102 fixes the host package bug where the host list could be corrupted

Subscribe to RSS - Michael Meffie's blog