
KSharedDataCache debugging

I was approached the other day by an Amarok developer who was receiving a lot of debug output from KImageCache (which uses my KSharedDataCache). When the cache was nearly full, he started receiving a stream of messages saying the cache was “Unable to free up memory for” each entry.

He caught up with me on IRC, and I pointed him to where the error message came from (which he had already found) and then explained what the logic inside that function was supposed to be. Unfortunately I wasn’t able to be more helpful, but I did leave him with some tips about how I would debug it.

Probably half the time I get reports like this, that’s where the matter is left. So I was surprised to log into IRC the next day and find the dev still looking into it! He had taken the steps I had suggested (including adapting them to his installation) and was able to verify that KSharedDataCache was not finding free pages even though there should have been room available.

This meant that either the record-keeping was corrupted somewhere (which would be bad), or there was a problem in the function that finds free pages. My contribution that day was limited to more debugging suggestions (pre-filling the cache to reproduce the problem quickly) and more advice on what some auxiliary functions were doing. Oh, and the recommendation to add a cache consistency check (and advice on when to run it) to flag the problem immediately when it occurs.

And the next day, he had identified the problem and made a fix! As it turned out, I wasn’t searching the last n pages of the cache (where n is the number of pages needed to hold the entry being inserted). So this problem could only occur when the cache was nearly full, or would be nearly full after the entry was added (since we defragment the cache first in that situation).

So thanks to Sam for sticking with the debugging! The fix will be in the next routine release of kdelibs (probably first week of July).

Implementing a shared cache: Part 4

Almost two weeks since I posted Part 3, which means it’s probably time to wrap up my series on implementing KSharedDataCache with this post. To recap for those who don’t want to skip back to the end of Part 3: I said I’d talk about defragmenting the shared cache and porting KIconLoader.

Defragmentation

One of the sub-optimal parts of my implementation of KSharedDataCache is the fact that the data for each entry must be contiguous in memory. It would actually be fairly simple to change this thanks to the existing page design, but right now this is what we’ve got.

The reason this is sub-optimal is an effect called fragmentation (more precisely, external fragmentation). The effect is illustrated in the following diagram:

Diagram: two cache layouts, where the bottom layout cannot fit a new block despite having more total free space

The problem with fragmentation can be seen in the cache layout at the bottom. The maroon blocks indicate allocated pages. In the bottom layout, even though less memory is allocated overall than in the top one, the new block cannot fit because there is no contiguous run of free memory of sufficient size.

This problem could be solved if only there were a way to move the allocated memory around. In the case of KSharedDataCache this can be done, because only the cache itself records the location of each data block (when an item is returned to an application, the application actually receives a copy). This process is called defragmentation, and it is essentially the same idea as what disk-based defragmenters do.

My defragmentation routine is probably fairly naïve but does the job: it simply looks for used pages of memory and moves them as close to the beginning as possible. The more interesting part is deciding when to defragment. Right now defragmentation is performed in two situations:

  • When we would otherwise have to evict currently-used pages because there is not enough consecutive free memory, but total free space is above a certain threshold. In that case we defragment first, before any pages are actually removed.
  • After evicting cache entries, since eviction probably leaves more holes in memory, and the reason we evicted things from the cache in the first place was insufficient consecutive free memory.

As I said, my defragmentation routine is very simple and could probably be improved easily. I haven’t noticed it becoming a problem during desktop usage, but that’s perhaps attributable to it not coming into use very often (if at all), thanks to the cache aging I mentioned in Part 3.
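
To make that concrete, here is a minimal sketch of such a compaction pass. It is illustrative only, not the actual kdelibs code: the names are invented, and it assumes a page table where each slot holds the index of the entry using that page (or -1 if the page is free), with each entry’s pages stored contiguously.

    #include <cstring>

    // Invented types for illustration.
    struct Entry {
        int firstPage; // first page holding this entry's data, -1 if unused
    };

    void defragment(int *pageTable, unsigned pageCount,
                    Entry *entries, char *pageData, unsigned pageSize)
    {
        unsigned nextFreePage = 0;

        for (unsigned i = 0; i < pageCount; ++i) {
            const int owner = pageTable[i];
            if (owner < 0)
                continue; // free page, nothing to move

            if (i != nextFreePage) {
                // Slide the used page as close to the start as possible.
                std::memcpy(pageData + nextFreePage * pageSize,
                            pageData + i * pageSize, pageSize);
                pageTable[nextFreePage] = owner;
                pageTable[i] = -1;

                // An entry's pages are contiguous and scanned in order,
                // so its first page is also the first one moved; record
                // the entry's new location when that happens.
                if (entries[owner].firstPage == static_cast<int>(i))
                    entries[owner].firstPage =
                        static_cast<int>(nextFreePage);
            }
            ++nextFreePage;
        }
    }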

Porting KIconLoader

Perhaps the largest impetus for doing all this work in the first place was KIconLoader, which used to use KPixmapCache to cache loaded icons. KIconLoader is used everywhere in KDE, and so many KPixmapCache-related crashes were first noticed in seemingly very unrelated applications when they tried to load icons.

Porting KIconLoader to KSharedDataCache was unfortunately not a direct method-name replacement, as one of the things KIconLoader stored was the path at which the icon was found. (This used to be done using the custom data API for KPixmapCache that I mentioned in Part 1.) I had first intended to use KImageCache (an image-related subclass of KSharedDataCache) for KIconLoader, but I ended up using KSharedDataCache directly, with cache entries that simply contain the pixmap and the path.
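
As a rough illustration of that approach (my own sketch, not the actual KIconLoader code), packing the path and the image into one QByteArray might look like this:

    #include <QBuffer>
    #include <QByteArray>
    #include <QDataStream>
    #include <QIODevice>
    #include <QImage>
    #include <QString>

    // Illustrative only: bundle the on-disk path and the decoded image
    // into a single QByteArray suitable for a KSharedDataCache entry.
    QByteArray packIconEntry(const QString &path, const QImage &image)
    {
        QByteArray payload;
        QBuffer buffer(&payload);
        buffer.open(QIODevice::WriteOnly);

        QDataStream stream(&buffer);
        stream << path;  // where the icon was found
        stream << image; // the pixel data itself

        return payload;
    }

Reading an entry back simply reverses the process, streaming the path and the image out in the same order.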

One problem that came up, which I only fixed a few minutes ago, was that KIconLoader would cache not only loaded icons but also failed icon lookups, a behavior I had not ported over. This was apparently especially noticeable in Dolphin when browsing large directories. Either way, that is now fixed.

Future Directions

I’m proud of the work I put into KSharedDataCache, especially since it has been running in trunk for about four weeks now with no major issues popping up. However, there are quite a few things that could be improved about it:

  • Cache corruption: Although in my opinion the risk of crashes from a corrupted cache is lower now, due to the cache layout and the non-usage of QDataStream, the possibility is not zero. Because corruption can have such serious consequences, it would be nice to have an efficient way to mark that a disk cache should be deleted because it is corrupt. I’ve thought of some ideas but have no concrete plans at this point.
  • The page table/index table method of storing data is very simplistic. There is surely a more appropriate method buried in some ACM or IEEE publication somewhere, even within the limits of a fixed memory size. As it stands, my method blends some of the disadvantages of 1960s-era memory allocators with those of paged memory allocators, without all of the benefits.
  • Assuming defragmentation remains required, the defragmenter could probably be made faster as well.
  • It is not at this point possible to resize a cache once it has been created. There’s no reason in theory that it can’t be done, it’s just not implemented. (Note that implementing this is more complicated than simply changing a size flag in the cache…)
  • The cache could possibly be made more concurrent with lock-free algorithms or finer-grained locking. This is not something I’d like to touch until I have a way to verify correctness of the result, however.
  • Finally, it is possible that someone has already done this much better and I simply missed it, in which case we should look at whether to adopt that library as a dependency and make KSharedDataCache a wrapper around it.
  • Should we remove old KPixmapCache caches after starting up a shiny new 4.5 desktop?

So, this concludes my series on implementing a shared cache. I’ve got to get working on other efficiency improvements, new kdesvn-build releases, classes, etc. It’s been fun writing though!

Implementing a shared cache: Part 3

So it has been a few days since Part 2, where I promised I’d talk about some issues that come with using pointers in shared memory, initial cache setup, and the arbitrary methods I use to handle various scenarios.

Pointing to things in shared memory

First I’ll talk about pointers. Essentially all you need to know about them is that a pointer holds a memory address, namely the address of the data you’re really interested in.

Now, every process in modern operating systems has its own “address space”, which defines where things are in memory. So, memory addresses in process 1 have no relation to addresses used in process 2, or any other process.

What this means for shared memory algorithms is that you cannot use normal pointers, since they rely on pointing to a specific spot in a process’s address space. See below for an example:

Demonstration of the address spaces of three processes sharing a cache

Three KSharedDataCache-using processes are running, and let’s say kcalc was the first to create the cache, so the pointers were created from the perspective of kcalc’s address space. If KSharedDataCache used normal pointers to point from the cache header to the data in the cache, then things would fail to work right in kate, where we would point into the middle of the data. The case of krunner is even worse, as we would point into krunner’s private data!

The solution is not too hard. The memory-mapping call that creates the mapping will tell you where it starts in your address space. So instead of saying “the data is at address 0x400000”, store values that say “the data is 1024 bytes past the start of the mapping”. These are called offsets. For example, the pthreads library that is standard on POSIX systems can use this type of technique to implement “process-shared” mutexes (mutexes are by default merely thread-shared).
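
A minimal sketch of the offset technique, with invented names (the real KSharedDataCache code differs): store only offsets in the shared header, and let each process convert them to pointers using its own mapping address.

    #include <stdint.h>

    // The header stores offsets, never raw pointers, because the mapping
    // can start at a different address in every process.
    struct CacheHeader {
        uint32_t indexTableOffset; // bytes past the start of the mapping
        uint32_t dataAreaOffset;   // ditto
    };

    // Convert an offset into a usable pointer; 'base' is whatever mmap()
    // (or QSharedMemory::data()) returned in *this* process.
    template <typename T>
    T *fromOffset(void *base, uint32_t offset)
    {
        return reinterpret_cast<T *>(static_cast<char *>(base) + offset);
    }

    // Usage, given the local mapping address 'base':
    //   CacheHeader *header = static_cast<CacheHeader *>(base);
    //   char *dataArea = fromOffset<char>(base, header->dataAreaOffset);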

Initial cache setup

Taking that first step in creating a cache is hard. Once the cache is set up we can rely on having some means of locking, initialized entry tables, and other niceties. Creating all of that in the face of race conditions is another matter though.

My decision to use pthreads for the mutex made this part harder than it could have been otherwise, as the mutex has to be stored within the cache itself. But you can’t use the mutex without initializing it first (remember that pthread mutexes default to not being process-shared). If two processes tried to create a non-existent cache at the same time, they would both try to initialize the mutex, and the process that initialized the mutex second could cause logic errors in the first.
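
For reference, making a pthread mutex usable across processes follows the standard POSIX recipe below; this is the textbook sequence, not a copy of the kdelibs code.

    #include <pthread.h>

    // Initialize a mutex that lives inside the shared memory segment so
    // that every attached process may lock it, not just the creator.
    int initSharedMutex(pthread_mutex_t *mutexInSharedMemory)
    {
        pthread_mutexattr_t attr;
        int ret = pthread_mutexattr_init(&attr);
        if (ret != 0)
            return ret;

        // Without this attribute the mutex is only guaranteed to work
        // within the process that initialized it.
        ret = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        if (ret == 0)
            ret = pthread_mutex_init(mutexInSharedMemory, &attr);

        pthread_mutexattr_destroy(&attr);
        return ret;
    }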

So I went with a simpler solution for this rare case: a spinlock, using Qt’s built-in atomic operations. It is not quite a pure spinlock because there are a few possible states (numbered as they are in the code):

Case 0 is that the cache has just been created, without ever having been attached to. (0 because that is the default value for an initially empty file that has just been mapped into shared memory).

Case 1 is that there is a process which has noticed that the cache is not initialized, and is initializing it (in other words, it atomically switched the 0 flag to 1). This is a locking condition: No other process will attempt to initialize the cache. But the cache is not initialized yet!

Case 2 occurs when the cache has finally been initialized and the standard locks and methods can be used. To access a cache in this state you must use the cache mutex.
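
Put together, the guard looks something like this loose reconstruction using Qt’s atomics; the helper name is invented and the real code has more error handling.

    #include <QAtomicInt>

    enum CacheState { Uninitialized = 0, Initializing = 1, Ready = 2 };

    // Hypothetical helper that sets up the mutex, entry tables, etc.
    void initializeSharedStructures();

    // 'state' lives at a fixed spot in the shared memory; a freshly
    // created file is zero-filled, which is why Uninitialized == 0.
    void ensureCacheInitialized(QAtomicInt &state)
    {
        forever {
            const int current = state; // QAtomicInt converts to int

            if (current == Ready)
                return; // Case 2: from here on, use the cache mutex

            if (current == Uninitialized
                && state.testAndSetAcquire(Uninitialized, Initializing)) {
                // Case 0 -> 1: we won the race, so we do the setup.
                initializeSharedStructures();
                state.fetchAndStoreRelease(Ready);
                return;
            }

            // Case 1: another process is initializing; spin until Ready.
        }
    }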

I don’t use a spinlock all the time because my implementation does not use any magical non-locking algorithms, and therefore some operations might hold the lock for a significant time. Using a mutex allows waiting threads to sleep and save CPU and battery power, which would not work with a spinlock.

Cache handling

Any cache needs a method of deciding when to remove old entries. This is especially vital for hash-based caches that use probing, like KSharedDataCache, where allowing the cache to reach maximum capacity will make it very slow, since probing becomes both more common and more lengthy. I use several techniques to try to get rid of old entries. I make no promises as to their effectiveness, but I felt it was better to try something than to do nothing. The techniques are:

  • Limited Quadratic Probing: One standard method of handling items that hash to the same location in a hash table is “probing”, where the insertion/find algorithms look at the next entry, then the next, and so on until a free spot is found. Obviously this takes longer, especially if the hash function tends to make entries cluster anyways. In the case of KSharedDataCache it’s perfectly acceptable to simply give up after a certain number of attempts, and I quite willingly do so (but see the next technique). On the other hand, if you can avoid colliding you don’t have to worry about finding empty spots, so to that end I use the “FNV” hash function by Fowler, Noll, and Vo.
  • Ramped Entry Aging™: The basic idea is that as the number of entries in the cache goes up, it becomes more and more likely that the insertion method will, instead of probing past a colliding entry, artificially decrease that entry’s use count and kick it out if it becomes unused (see the sketch after this list). There are competing effects here: there’s no point having a cache if you’re not going to employ it, so this aging never happens if the cache is lightly loaded. On the other hand, entries that get added to the cache and only ever used once could cause collisions for weeks afterwards in the long-lived scenarios I envision, so it is important to make entries justify their place. So as the cache load increases, there is a higher and higher chance of an unused entry being evicted. I simply halve the use count each time, so an entry can be evicted quickly even if it was used a million times a month ago.
  • Collateral Evictions®: In the event that the first two techniques don’t work, some entries have to be kicked out to make room for new ones. This process is called eviction. In my Collateral Evictions plan, any time we kick entries out, we kick even more entries out on top of that (I chose twice as many, for no particular reason). The idea is that if we’re running out of space we’ll probably have to kick someone else out on the very next insert call anyways, and since eviction is a time-consuming operation we might as well make it count. The exact entries that get kicked out are decided by the developer-configurable eviction policy.
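
Here is the promised sketch of the aging idea. Nearly everything in it is my invention for illustration (the load threshold, the ramp, the use of std::rand()); only the halving of the use count reflects the actual design.

    #include <cstdlib>

    // Invented entry type for illustration; not the kdelibs declaration.
    struct Entry {
        unsigned useCount;
    };

    // Called when an insert collides with 'victim'. Returns true if the
    // victim has aged all the way out and its slot may be reused.
    bool maybeAgeEntry(Entry &victim, double loadFactor /* 0.0 to 1.0 */)
    {
        // A lightly loaded cache never ages entries: there is no point
        // having a cache if you refuse to keep things in it.
        if (loadFactor < 0.5)
            return false;

        // The chance of aging ramps from 0 at half full to 1 when full.
        const double ageChance = (loadFactor - 0.5) * 2.0;
        if (std::rand() < ageChance * RAND_MAX) {
            victim.useCount /= 2; // halving makes old popularity decay fast
            return victim.useCount == 0;
        }

        return false;
    }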

Next time I’ll talk about defragmentation, porting KIconLoader, and any other random things I can think up. I hope it hasn’t been deathly boring so far!

Implementing a shared cache: Part 2

In my last post, I gave some background on what a shared-memory cache is, and how KDE already uses one (KPixmapCache) to save memory and make the desktop more efficient. I also noted how the current implementation leaves some things to be desired, and hinted at a new implementation I was working on.

In this second part, I’ll discuss some of the basic design principles of the new class, which I called KSharedDataCache.

Why a new class?

If you didn’t read Part 1, you may be wondering why I don’t just fix the current implementation, KPixmapCache, instead of writing new code. It’s a good question, but the short story is that the public API of KPixmapCache makes it non-trivial (to say the least :) to take the steps needed to improve its performance. The penalty for getting it wrong is pretty severe as well: there have probably been hundreds of crash bugs reported against KPixmapCache already.

So someone on IRC gave me an idea: why not just make my improvements in a different class, a sort of KPixmapCache2, and move the majority of the current users of KPixmapCache over to it? It sounded like a good option to me, so that’s what I started on, eventually settling on a generic cache layer underneath a slightly more specialized image-handling cache.

KSharedDataCache

KSharedDataCache is a class that manages a cache keyed by QString values and holding QByteArrays, for generality. The cache is held in shared memory, which is accessed across multiple processes based on the cache name (converted internally to a file name).
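
A quick usage sketch; the constructor arguments and method signatures shown here may not exactly match the final API, so treat them as illustrative:

    #include <kshareddatacache.h>

    #include <QByteArray>
    #include <QString>

    int main()
    {
        // Open (or create) a shared cache named "myapp-cache" of roughly
        // 5 MiB. The name is converted internally to a backing file.
        KSharedDataCache cache(QLatin1String("myapp-cache"),
                               5 * 1024 * 1024);

        cache.insert(QLatin1String("some-key"), QByteArray("some value"));

        QByteArray value;
        if (cache.find(QLatin1String("some-key"), &value)) {
            // 'value' now holds a copy of the cached data.
        }

        return 0;
    }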

The central data structure is the cache itself. Everything that is needed to be able to insert items, find items, and otherwise manage the cache is kept in the same memory segment, instead of being split into two different files like in KPixmapCache. A very ugly drawing of the layout would look like this:

Block diagram of the KSharedDataCache memory layout

Starting from the left, we have the header for the shared cache itself. This contains several important pieces of data, including the cache size, the page size (which is adjustable), the number of free pages, and the mutex which protects against concurrent access to the shared data.

One note about the mutex: it is used instead of KLockFile, and requires support for process-shared POSIX thread primitives (required for XSI-conformant systems, but not present in Linux/glibc until NPTL, IIRC). As long as your system tells the truth about whether it supports process-shared primitives, KSharedDataCache will still work (even if it can’t use shared memory).

After the cache header comes the entry index table (starting from the first byte meeting alignment criteria, to avoid crashing on non-x86 systems… although I have none to test!). This is a fixed-size table, sized based on the total cache size and page size. Entries are placed into the table based on the hash of the entry key, and each entry contains information such as the item size, hash code, use count, time of last access, and the location of its data.
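
Spelled out, a single slot in the entry index table might look roughly like this (the field names and types are my guesses from the description above, not the actual declaration):

    #include <stdint.h>

    // One slot in the fixed-size entry index table (illustrative only).
    struct IndexTableEntry {
        uint32_t keyHash;      // hash code of the entry's key
        uint32_t itemSize;     // size of key plus data, in bytes
        uint32_t useCount;     // bumped on every cache hit
        uint32_t lastUsedTime; // enables least-recently-used eviction
        int32_t  firstPage;    // index into the page table, -1 if unused
    };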

Collisions are possible with any hash table. The standard answer is a method called “chaining”, which simply makes a list of all entries sharing the same hash code. Unfortunately dynamic memory allocation is much more involved when you’re dealing with a fixed-size block of shared memory, so currently quadratic probing is used to seek out other, hopefully empty candidate slots. Since this is just a cache, probing is only continued for a small number of attempts.

Following the entry table is the page table, which simply records the entry currently using each page in memory. It would be possible to compress the page table into a bit vector, building a full page table only when needed (currently only during defragmentation), but I didn’t have time to implement that.

Finally, the rest of the shared memory is devoted to a paged memory allocation system (this is probably the most suboptimal part of my current implementation, but at least it can be fixed later this time ;). Every entry is stored in this data area, with its key followed by the actual QByteArray data.

Resolving an access for an item with a key of “juk_128x128.png” would work something like this (a sketch of the probe loop follows the list):

  1. Lock the cache. If unable to acquire the lock, assume the cache is corrupt, unlink it on disk, and create it all over again.
  2. Convert the key to a byte array using the UTF-8 encoding, then determine the hash code.
  3. Use the hash code to find the appropriate entry index. Compare the hash code to the candidate entry’s hash code, and if they don’t match, use quadratic probing to find another candidate. Give up if the entry is not found within several attempts.
  4. If the hash codes match, read the saved key out of the matching data area and make sure the full keys also match. If they don’t, go back to the previous step and continue quadratic probing.
  5. If the keys did match, we found our entry. Update the entry’s use count and last access time, then copy the data out of its page or pages to return to the caller. Since this all happened in shared memory it should be much, much faster than loading from disk (assuming of course that the operating system hasn’t paged the shared memory out to disk in the meantime).
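
Here is the promised sketch of steps 2 through 4. The FNV constants are the published 32-bit ones; everything else (the slot layout, MAX_PROBES, the helper names) is invented for illustration and is not the actual kdelibs code.

    #include <QByteArray>
    #include <cstring>
    #include <stdint.h>

    // One common FNV variant (FNV-1a), with the published 32-bit constants.
    uint32_t fnvHash(const char *data, int length)
    {
        uint32_t hash = 2166136261u; // FNV offset basis
        for (int i = 0; i < length; ++i) {
            hash ^= static_cast<unsigned char>(data[i]);
            hash *= 16777619u; // FNV prime
        }
        return hash;
    }

    // A cut-down version of the index slot sketched in the layout section.
    struct IndexTableEntry {
        uint32_t keyHash;
        int32_t  firstPage; // -1 if the slot is unused
    };

    static const unsigned MAX_PROBES = 16; // give up quickly: it's a cache

    // Returns the matching slot, or -1 if not found within a few probes.
    // The entry's key is stored at the start of its first data page.
    int findNamedEntry(const IndexTableEntry *index, unsigned tableSize,
                       const QByteArray &encodedKey,
                       const char *dataArea, unsigned pageSize)
    {
        const uint32_t hash = fnvHash(encodedKey.constData(),
                                      encodedKey.size());

        for (unsigned i = 0; i < MAX_PROBES; ++i) {
            // Quadratic probing: the distance from the home slot grows
            // as the square of the attempt number.
            const unsigned slot = (hash + i * i) % tableSize;
            const IndexTableEntry &candidate = index[slot];

            if (candidate.firstPage < 0 || candidate.keyHash != hash)
                continue; // empty slot or different hash code: probe again

            // Hash codes match, so compare the stored key itself to rule
            // out a full collision.
            const char *storedKey = dataArea
                + static_cast<unsigned>(candidate.firstPage) * pageSize;
            if (std::strcmp(storedKey, encodedKey.constData()) == 0)
                return static_cast<int>(slot); // found the real entry

            // Same hash code, different key: probe again.
        }

        return -1;
    }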

You might have noticed that in step 1 I unlink the cache with reckless abandon. I actually do this in many more places; the idea is that a corrupt cache can lead to bugs that are very hard for the end user to diagnose and correct, and by definition a cache must be prepared to drop entries at any time. The only danger would be tampering with a cache that other processes are currently using in shared memory. By unlinking (and only unlinking) the cache file, the other processes can continue to use the inode that used to be associated with it, and the kernel will finish the cleanup when those processes exit.

Of course I’m up past a thousand words now, so I’ll continue in Part 3, where I’ll discuss how pointers work in shared memory, how initial cache setup is performed, and my attempts at handling cache pressure, defragmentation, etc.

Implementing a shared cache: Part 1

So a while ago I mentioned that I was trying to add a new shared-memory cache for the next version of the KDE platform. It’s almost done now, and has been submitted for review (both old-skool on kde-core-devel, and all Web 2.0-style on our review board).

Given the number of things I had to think about while implementing it (and I promise you that even now it’s not as fully thought out as it could be), I decided that I could probably make a half-decent, if very technical, series of posts about the implementation process.

I’ve got a basic outline set out, so without further ado, I’ll go over in this post what exactly a shared-memory cache is, why KDE has one now, and why I’m trying to make a different one.

Why a shared-memory cache?

Most programmer types are already familiar with the idea of a cache: you have some input that isn’t particularly helpful right now, you have a function to turn that unhelpful input into something you can use, but that function takes a while to run. So you save the output of that function (the value) once, somewhere you can refer to it later (by a key) if you need it. The idea is to trade some extra memory usage for reduced time requirements. Caching is used everywhere: in your CPU, in your Web browser, in the operating system, and more.

Now, the shared-memory part allows this cache to be shared between running processes. (If you don’t know what a process is, just think of it as an instance of a running program; each instance, even of the same program, is a different process.) The operating system normally does a very good job of making sure that different processes can’t interfere with each other, but there are times when it makes sense to open a couple of gateways between them to let them share the burden. In KDE’s case, many of the icons used by a standard KDE application will be used unchanged by other KDE programs, so it makes sense to cache generated icons for use by other KDE programs. We could simply write the icons out to disk where other programs could read them, but putting them into shared memory allows near-immediate transfer of that data without any disk I/O.

I’d like to find examples of existing shared-memory caches (besides our current KPixmapCache), but the only ones I can find are the fully distributed type like memcached. Cairo has a cache for glyphs, but that seems to be per-process. GTK+ has a cache which is designed to be read in directly using mmap(2), but not necessarily to be accessed via shared memory. Let me know if you find any though!

So again, in our case we use a shared-memory cache in large part to handle icon loading and Plasma themes (both potentially heavy users of SVGs). This gives us two speedups: 1) we don’t always have to re-generate a pixmap from the source data (a very big speedup when the source is an SVG), and 2) if one process generates a pixmap, every other KDE process can get access to the result nearly instantly.

What KDE currently does

My interest in shared-memory caching came from looking into some bugs reported against our current cache, KPixmapCache, which was developed in the lead-up to KDE 4.0 to allow the new Plasma subsystem and the existing icon subsystem to use fancy SVG graphics without taking forever.

In addition, KPixmapCache has a feature where it caches not only the image data (the 0s and 1s that make up the image itself), but also the resulting QPixmap handle to the image as it exists in “the graphics memory” (I’ll gloss over the distinction for now; it will be important in a future part).

KPixmapCache is implemented using two files kept in a known location, one ending in .index and the other in .data. These hold, respectively, the index metadata needed for the cache to work and the actual image data.

Any time you talk about shared resources, you also need to think about how to protect access to them to keep everything consistent. KPixmapCache uses the trusty KLockFile to protect against concurrent access (this has the benefit of being safe if the partition is mounted over NFS, although I think the real reason is that it’s what already existed in kdelibs).

From there, KPixmapCache uses a binary tree sorted by hash code to manage the index, and treats the .data file as a simple block of memory. Every time a new item is entered into the cache, a new chunk of free space is allocated from the .data file, even if empty space already exists. Likewise, if a new index entry is required to insert an item, it is always added at the end of the index (with one exception: overwriting an existing index entry after a hash collision). I said the index was sorted earlier, which might seem incompatible with always adding new entries at the end. The implementer solved this by holding pointers to the parent and children in each index item, so that a virtual hierarchy could be arranged. The binary tree is never re-arranged (as in AVL trees or red-black trees), so the initial root node is always the root node. The overall structure looks like this figure:

Schematic of KPixmapCache layout
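
Based on that description, each item in the .index file would contain something like the following (the field names are my guesses for illustration, not the actual kdelibs layout):

    #include <stdint.h>

    // Illustrative KPixmapCache index node. The "pointers" are offsets
    // within the .index file itself, forming a virtual binary tree over
    // nodes that are physically stored in insertion order.
    struct IndexNode {
        uint32_t hash;       // the sort key for the binary tree
        uint32_t dataOffset; // where the image data sits in the .data file
        int32_t  parent;     // offset of the parent node, -1 for the root
        int32_t  leftChild;  // offset of the left child, -1 if none
        int32_t  rightChild; // offset of the right child, -1 if none
    };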

One disadvantage of this architecture is that it is difficult and inefficient to delete entries that should be expired from the cache. Properly removing an entry could require moving entries in the data file to minimize fragmentation. That alone wouldn’t be a big deal (I end up having to do the same thing in KSharedDataCache), but updating the .index file is harder, since it requires updating information in both the parent and the children (though again, not impossible by any means). Avoiding fragmentation in the index would require either moving nodes around in the index file (possibly recursively) or scanning for the first free node when adding items. None of these are big problems, but they do make the implementation more annoying.

KPixmapCache worked around all of this by punting the problem: no entries ever got removed until the cache filled up. At that point the entries would be ranked by whatever expiration policy was in effect (least recently used preferred, newest preferred, etc.), a new (smaller) cache would be constructed holding the most-desired entries, and the old cache would be deleted. Although infrequent, this could take a not-insignificant amount of time when it did happen.

So why a new implementation?

Probably the one thing that led me to start from a different architecture, however, was the interface to KPixmapCache: it is designed to be subclassed, and to allow subclasses access to both the index and individual data items through a QDataStream (see KIconCache for an example usage). This meant the internal code had to use a QIODevice to interface with the data and index, so even though KPixmapCache tries to place all of its data in shared memory, it ends up accessing that memory as if it were a disk file anyways (through a memory-to-QIODevice adapter).

Having to support subclassing (in the public API, no less) makes changing many of the implementation details a journey fraught with hazard, and it’s bad enough that seemingly any little problem in KPixmapCache guarantees a crash. Since KPixmapCache is used in very core desktop platform code (Plasma and KIconLoader), I knew I wanted to go in a different direction.

So I started work on a “KIconCache”. However, the work I was doing was hardly specific to icons, and when I heard of a developer abusing KPixmapCache to hold non-image data somehow, I decided to make a generic shared-memory cache: KSharedDataCache. Next post I’ll try to explain the direction I decided to take with it.