Threadsafe programming

Rodolphe Ortalo 1998/01/30 (ed. 98/2/4)

General thoughts and information about threadsafe programming

1. Preface

Well... Threadsafe programming is not such a difficult thing, but you
need a clear mind concerning mutual exclusion, semaphores/mutexes, and,
as a corollary, who Dijkstra is. I've spent several months working on
the Chorus microkernel, where threads are the common (and not the
special) thing. Maybe I can state a few general things. (Section titles
are here to help the hurrying reader.)

2. Cultural

First, Dijkstra is a computer science professor who made several
theoretical contributions in the programming languages area (and
probably many other areas :-). He is most famous for his "invention" of
computer-related semaphores (which have nothing to do with traffic
lights of any kind).

Second, in my opinion, the easiest way of considering the problem of
thread protection is to view a multiple-threaded program as a PARALLEL
program. Think of the problem with ALL threads executing simultaneously
(really simultaneously) even though that is not the case, and you will
see the annoying cases much more easily. (And let the linux kernel do
its scheduling in the order it wants. Our parallel brain has a better
way of exhibiting the problems.)

Third, there are two relatively different problems addressed by the use
of semaphores and mutexes:

o thread synchronization;
o and threads' shared-data access protection.

The latter is probably the easiest way of seeing the problem, and the
most common case. Synchronization may also be an issue, but... well,
let's speak of data protection.

3. Basic

The problem: when parallel (or really asynchronous) control flows (ie:
threads) access the _same_ data, you HAVE to protect such accesses.
Full stop.
There are read/write differences, but you may think of the problem in a
simple way: shared data should be protected, so you should:

o do something before accessing it,
o and do another thing after (in order for the other threads to access
  it).

==> Suppose you have a global variable called "the_data" used by
several threads. You _must_ protect it, so you use a mutex:
"the_data_mutex". Think of the mutex as a switch (or a barrier with a
big Guardian Minion). Before accessing the_data (potentially accessed
by multiple threads) you do: MutexGet(the_data_mutex). You will be
allowed to proceed only when the_data is free (and you will sleep until
then). After you've done things with the_data you do:
MutexRelease(the_data_mutex). Then you get out of the protected area
and other threads may be allowed to enter it.

NB: "local" C variables are really local (they are on the stack and
each thread has its own stack). So you never need to protect them, of
course.

Here we are. We have protected a simple thing. Of course, you may put a
big abstract mutex "the_mutex_for_everything_shared" in your program,
and then Get/Release it each time you do something in the global space.
But this would be relatively inefficient. So if you have "first_data"
and "second_data", and their respective "first_mutex" and
"second_mutex", you think that you should do Get/Release(first_mutex)
before accessing first_data, and Get/Release(second_mutex) before
accessing second_data. You are right! But: there is a new problem
arising now. Consider 2 threads doing this:

______________________________________________________________________
        Th. 1                              Th. 2
          |                                  |
  MutexGet(first_mutex)              MutexGet(second_mutex)
          |                                  |
  MutexGet(second_mutex)             MutexGet(first_mutex)
          |                                  |
  MutexRelease(second_mutex)         MutexRelease(first_mutex)
          |                                  |
  MutexRelease(first_mutex)          MutexRelease(second_mutex)
          |                                  |
          +                                  +
______________________________________________________________________

Each of these threads seems to do the right thing, but in fact, they
don't! We are completely asynchronous, so the worst can happen. Suppose
Th. 1 successfully Gets the first_mutex, and just after that Th. 2
successfully Gets the second_mutex. After that, Th. 1 will try to get
the second_mutex and Th. 2 the first_mutex, and neither of them will
ever get it. They are waiting for each other to release it, so... This
is a DEADLOCK.

===> Therefore, the conclusion is: threads should do The Right Thing,
i.e. always access the mutexes in the same order! This is a general
rule: always access the mutexes in the same order! If you ever get
fb_mutex before gc_mutex, you should always do this in all your
threads! Of course, you may access only fb_mutex or gc_mutex
independently... But if you get gc_mutex, don't ever try to Get
fb_mutex. Okay?

4. In GGI

In GGI: it seems that the IS_CRITICAL, ENTER_CRITICAL and
LEAVE_CRITICAL macros implement the "the_mutex_for_everything_shared"
abstraction. So, if you do something that has to be thread-safe, you
may surround all your code with ENTER_CRITICAL and LEAVE_CRITICAL
pairs. This may be inefficient, but well, probably not, and you are
safe at least. It means that nothing else will be done by any GGI code
before you LEAVE_CRITICAL. (To others: is that the actual meaning of
these macros?)

NB: A code section surrounded by a MutexGet/Release pair is generally
called a critical section (hence the macro names). But I don't really
like this naming, because in fact a section is always critical with
respect to something (the data accessed, and nothing else). You can
encapsulate _several_ critical sections inside each other without any
harm if you are careful.
You may even interleave several critical sections without any harm,
but you have to be _very_ careful with this (and I wouldn't recommend
it to anyone).

You should not protect the library calls, you should protect _your_
data! (Well, if you program a library, your data is library data,
but... :-) Of course, if you use thread-UNsafe libraries, your program
will not be thread-safe either and you are doing extra work. But well,
libggi is supposed to be thread-safe one day (and then, it must rely on
a thread-safe libc for example...). However, if you pass a "pointer" to
a library function, the associated data is _your_ data (even though the
library will modify it, you simply delegate the work to it). So you
should protect it: but only if your main program is multi-threaded.

The problem with GGI accelerated commands is that, implicitly, a new
"thread" (in fact a control flow) can always appear: the "program" that
executes on the card's chipset (bitBLT engine, 3D engine). The data to
protect (your data) is the framebuffer... We could use a real mutex to
protect these things, but well, it is simpler to have a
"wait_until_finished" approach. You can (and you will be obliged to)
wait for the on-video-board "thread" to finish. Hence, each time you
want to draw on the framebuffer (either yourself or thanks to the
library), you must do a ggiFlush (or maybe ggiFinish, as Andreas wanted
to change the name), to let the engine leave the framebuffer to you. If
you don't do that, you may face a data corruption problem (as with
first_data or second_data). (NB: It is not so "critical" for graphical
data, it's "only" ugly...)

You may ask me: but, well, we don't have a true mutex, so, if I think
truly "parallel", what about the case where the 2D engine starts
drawing something on the framebuffer and I start also? Should I use
ggiFinish _before_ drawing?
The answer is no: the underlying assumption is that the accelerated
engine will never start something by itself (at least I hope so). So,
you should ggiFinish only after a drawing command (and before the next
one, of course). By the way, we also get a synchronization out of
ggiFinish, as the _whole_ framebuffer will be "ready" after that call.

Why not implement a "complete" locking system on the framebuffer (ie:
Get/Release, or Enter/Leave)? Simply because it is a very big piece of
data, and it would be inefficient. Furthermore, unless you use the
card's memory to store program code, corruption of this big data is
only garbage on screen, so... Let it be simpler, and the programmer
will check that things are correct, and be synchronous if he doesn't
want to worry about this. In this case, the programmer forgets about
the bitBLT or 3D engine running asynchronously, and simply says to
libggi: be SYNC and worry about these stupid things yourself. This is a
"convenience" mode, that effectively makes use of a complete locking
semantic on the framebuffer (maybe a clever one, but well, only because
we are good persons) and manages it itself. If the programmer wants to
go ASYNC, then _he_ will have to manage the ggiFinish problems himself.
The advantage? He _can_ do finer-grained control of the data and make
his application draw in the framebuffer at the same time as the card
engines (hopefully in different areas, or the result will be...
funny :-). In this case, GGI (kgi+libggi) only does controls wrt hw
safety (prevent hardware locks if needed, etc.). We may rename these
macros: GARBAGE_FORBIDDEN, GARBAGE_ALLOWED... :-)

NB: However, even if you are synchronous wrt graphics (ie: a single
drawing thread), this does not mean that your program shouldn't protect
any data. If the AI thread sends data to the drawing thread, these data
should be protected in the application program.

5. Tricky

Now you have two simple rules.
Some other ones (left unjustified here):

o don't use a C int to protect the data! Always use a mutex (or a
  semaphore if you know how it works)! It is a common fault (and
  sometimes not a fault). This is architecture dependent! The final
  operation behind a mutex MUST be atomic, really atomic. Therefore,
  it should finish with a test-and-set _hardware_ instruction. I don't
  think C guarantees its test-and-set language ops. to be atomic at
  the hardware level... So you must use a dedicated library. (NB: This
  is why a mutex is never called an int...) Don't try to implement a
  mutex yourself, it may be tricky (it may work on x86 with gcc and
  not on alpha with cc, for example).

o don't do this in too many very concurrent threads:

  ___________________________________________________________________
  while (1) {
          MutexGet(a_mutex);
          /* some work */
          MutexRelease(a_mutex);
  }
  ___________________________________________________________________

  if you are not sure that a "schedule" entry point may appear in the
  /* some work */ (a system call, for example). Use semaphores (or
  verify the "semantics" of your mutexes). Otherwise, you may end up
  with a new "starvation" problem: on preemptive systems you only
  create a performance slowdown; on real-time systems, some threads
  will never execute. (NB: This problem is very rare on real systems.
  In fact I don't know any system on which it appears. But this is a
  symptom showing that a mutex is _not_ a semaphore with an initial
  value of 1. Theoretically at least; in practice, it is always
  implemented in this way, and that's why you never face the
  problem ;-)

o be careful on our Unix systems... We don't have true parallel
  systems, we have time-sharing deterministic systems. This means that
  your program may work _perfectly_, on all the runs you can try, and
  _not_ be thread safe. This is simply because, with your environment,
  the deadlock or data corruption conditions never occur.
  If you change the workload, or the time-slice, or something else, it
  may fail. These faults are _very_very_very_ difficult to identify a
  posteriori. The best solution is to READ the code and check that you
  have no obscure zone that you don't understand. (NB: This is a
  general good coding practice, by the way: I know that, I rarely do
  so, and I have lots of bugs. :-)

If you have more precise questions or comments, you are welcome, of
course.