Nasty ABA problem in array-based lock-free stack

Stack of books and laptopIn the comments to my post on testing the lock-free stack, Dmitriy V'jukov wondered if my code could test a broken implementation of a stack built on a preallocated array. He provided a link to the code on a Russian forum (yes, I used Google Translate a lot). It seems it’s not his, but a friend’s or a colleague’s.

I downloaded it and cleaned it up a bit for here (the volatile keywords aren’t needed for example since there are memory fences all over the place, so that gets rid of the #pragmas, and I really wanted better names for the variables):

  class SafeStackItem<T> {
    public T Value;
    public Int32 Next;
  }

  class SafeStack<T> where T : class {
    private Int32 head;
    private Int32 count;
    private SafeStackItem<T>[] array;

    public SafeStack(SafeStackItem<T>[] array, int pushFrom, int pushCount) {
      this.head = -1;
      this.array = array;

      count = pushCount;

      head = pushFrom;
      for (int i = 0; i < pushCount - 1; i++)
        array[i + pushFrom].Next = pushFrom + i + 1;
      array[pushFrom + pushCount - 1].Next = -1;
    }

    public Int32 Pop() {
      while (count > 1) {
        Int32 oldHead = head;
        Int32 oldHeadNext = Interlocked.Exchange(ref array[oldHead].Next, -1);

        if (oldHeadNext >= 0) {
          if (Interlocked.CompareExchange(ref head, oldHeadNext, oldHead) == oldHead) {
            Interlocked.Decrement(ref count);
            return oldHead;
          }
          else
            Interlocked.Exchange(ref array[oldHead].Next, oldHeadNext);
        }
      }

      return -1;
    }

    public void Push(Int32 index) {
      Int32 oldHead;
      do {
        oldHead = head;
        array[index].Next = oldHead;
      } while (oldHead != Interlocked.CompareExchange(ref head, index, oldHead));

      Interlocked.Increment(ref count);
    }
  }

My first comments were essentially summed up by the word ‘horror’. The array is public, the indexes into the array are public (the push and pop operations work on array indexes, for heaven’s sake), and the elements are all public. I also pointed out that the stack was just completely broken: even in a single-threaded app you could never pop off the last item on the stack. He came back and asked whether I could find the bug in the algorithm. I replied with a negative, pointing out that

I'm not even sure how it's supposed to be used, so a test app would be more than welcome. In using a multithreaded stack, I'm more used to having 'producers' that push 'work' items on the stack, and 'consumers' that pop them off. Here I'm not even sure how a consumer would signal to a producer that it had completed the work on index N and hence that index can now be reused. Yuk, even writing that down gives me the shivers.

In translating and reading the thread on the Russian forum, I got the impression that he tested it with each thread popping off an item, “processing” it in some way, and then pushing it back. Why, I have absolutely no idea; I just think this stack is broken as a usable multi-threaded container for the reason given above, and that’s even before you get to the problem of it suffering from an ABA problem. (Or maybe you should have two such stacks, equal in size, pop off from one and push onto the other. Yuk.)

The ABA problem occurs because of the reuse of ‘nodes’ without making sure that all threads have relinquished control of a given node before reusing it. It’s a classic ABA scenario and after I’d thought about it for a little while, I played the game of “hunt the node reuse”. In the end, it wasn’t that hard to find.

Let us suppose that slow thread A ‘sees’ the stack 3 -> 4 -> 5 -> etc; that is, the head of the list is node 3. It calls Pop(). The first step is to read the head index into oldHead, setting it to 3. In between this step and the next, it falls asleep, and another thread, X, pops node 3. The oldHead node (that is, node 3) will have its Next index set to -1 by this pop operation.

Thread X then starts to push node 3 back onto the stack. However, it has some extraordinary difficulty because some rogue super-fast thread is pushing a bunch of nodes back onto the stack, bam, bam, bam. X tries, node 3’s Next gets set to 42, the current head, but the compare-exchange fails because the head has changed right under its nose. Damn. It tries again, node 3’s Next gets set to 43. No go, fails again. It tries again, setting node 3’s Next to 44 and finally succeeds.

The stack now should look like 3 -> 44 -> 43 -> 42 -> 4 -> 5 -> etc.

But in the middle of all that, thread A wakes up. It sees the momentary scenario of 3 -> 42 that is about to fail in the middle of thread X’s Push() loop. Its own exchange succeeds (the second step of its loop), and node 3 now points to -1 and oldHeadNext is 42. Thread A falls asleep again...

...and a little time passes. The stack now looks like this: 44 -> 43 -> 42 -> 4 -> etc. Thread X, in its attempt to push node 3 back onto the list has just set its Next pointer to point to 44, since that’s the current head of the stack (that -1 from the previous paragraph has long gone). It’s about to call the compare-exchange when...

...thread A awakes with a start. It fails its compare-exchange and so resets oldHead.Next to 42 (that is, node 3’s Next to 42) and goes round its loop again.

And thread X completes its compare-exchange successfully to push element 3 back onto the stack. Unfortunately, the stack now looks like this: 3 -> 42 -> 4  -> 5 -> etc, and elements 44 and 43 have been lost. ABA!

So, in summary, rule 1 of lock-free containers is “don’t reuse a node until every thread has relinquished that node’s reference”. With ‘my’ lock-free stack, the garbage collector enforces that rule in spades.

And before Dmitriy asks, this array-based stack is broken beyond repair. There is no fix. An array of nodes just implies “node reuse”.

Album cover for HeligolandNow playing:
Massive Attack - Girl I Love You (She Is Danger Remix)
(from Heligoland)

Loading similar posts...   Loading links to posts on similar topics...

6 Responses

 avatar
#1 Dmitriy V'jukov said...
30-Mar-10 12:07 AM

Hats off to Julian!

(Humm... didn't you just read my description of the error?.. :) translate.google.com/.../translate )

Is it really not that hard to find? I bet there is only a handful of people in the world who can easily find such bugs with a manual code review!

Btw, the stack is trivial to repair - just add IBM ABA-tags.

#2 Dew Drop – March 30, 2010 | Alvin Ashcraft's Morning Dew said...
30-Mar-10 5:21 AM

Pingback from Dew Drop – March 30, 2010 | Alvin Ashcraft's Morning Dew

julian m bucknall avatar
#3 julian m bucknall said...
30-Mar-10 7:57 AM

Dmitriy: The stack is not trivial to repair since it doesn't function properly as a stack in the first place, let alone before you slap on ABA tags. It's like putting power windows into an old car that resting on bricks in your front yard. Yes, you could do it, but why would you?

Cheers, Julian

 avatar
#4 Dmitriy V'jukov said...
30-Mar-10 8:55 AM

ABA aside, the stack is broken beyond repair just like your lock-free queue implementation: www.boyet.com/.../LockfreeQueue.h It also contains fake empty node as an implementation detail.

And publicity of fields has nothing to do with correctness and brokenness.

So in what way it is really broken (provided that one fixed ABA with IBM-tags)?

julian m bucknall avatar
#5 julian m bucknall said...
30-Mar-10 5:55 PM

Dmitriy: I think we're arguing by looking at the problem from different angles.

I've been involved in the component business for so long that I expect classes and controls to be well encapsulated. So my biggest issue, as you know, with the stack code is that it is not encapsulated at all. All the internal fields are exposed. To use it, you have to know that you need to reserve a node in the sub-array you pass to the constructor to be a dummy node (since it won't be popped). You have to know that you can't pass in arbitrary index values in the Push, but have to use those you get from the Pop. And so on.

Cheers, Julian

 avatar
#6 Dmitriy V'jukov said...
30-Mar-10 10:40 PM

Probably. For me interface design is a problem of higher infinitesimal order when it comes to lock-free algorithms (where one has to ensure correctness, scalability and performance at the same time). However, I still do not see why it's broken beyond repair. You already made it some neater. Reservation of an item is not a user problem, user just has to pass an array and specify what range a stack can use. Pushing of arbitrary ints is not a serious problem too, well, if you get a db connection from a pool you have to return exactly that connection to exactly that pool (if you use C++\CLI you can return an interior pointer to the element in the array, but AFAIK other .NET languages do not feature such a capability).

Leave a response

Note: some MarkDown is allowed, but HTML is not. Expand to show what's available.

  •  Emphasize with italics: surround word with underscores _emphasis_
  •  Emphasize strongly: surround word with double-asterisks **strong**
  •  Link: surround text with square brackets, url with parentheses [text](url)
  •  Inline code: surround text with backticks `IEnumerable`
  •  Unordered list: start each line with an asterisk, space * an item
  •  Ordered list: start each line with a digit, period, space 1. an item
  •  Insert code block: start each line with four spaces
  •  Insert blockquote: start each line with right-angle-bracket, space > Now is the time...
Preview of response