From: David van Moolenbroek <david@minix3.org>
Date: Fri, 30 Aug 2013 12:00:50 +0000 (+0200)
Subject: VFS: worker thread model overhaul
X-Git-Tag: v3.3.0~606
X-Git-Url: http://zhaoyanbai.com/repos/?a=commitdiff_plain;h=723e51327fe8b8c5a99378f9787924c35d1e8947;p=minix.git

VFS: worker thread model overhaul

The main purpose of this patch is to fix handling of unpause calls
from PM while another call is ongoing. The solution to this problem
sparked a full revision of the threading model, consisting of a large
number of related changes:

- all active worker threads are now always associated with a process,
  and every process has at most one active thread working for it;
- the process lock is always held by a process's worker thread;
- a process can now have both normal work and postponed PM work
  associated to it;
- timer expiry and non-postponed PM work is done from the main thread;
- filp garbage collection is done from a thread associated with VFS;
- reboot calls from PM are now done from a thread associated with PM;
- the DS events handler is protected from starting multiple threads;
- support for a system worker thread has been removed;
- the deadlock recovery thread has been replaced by a parameter to the
  worker_start() function; the number of worker threads has
  consequently been increased by one;
- saving and restoring of global but per-thread variables is now
  centralized in worker_suspend() and worker_resume(); err_code is now
  saved and restored in all cases;
- the concept of jobs has been removed, and job_m_in now points to a
  message stored in the worker thread structure instead;
- the PM lock has been removed;
- the separate exec lock has been replaced by a lock on the VM
  process, which was already being locked for exec calls anyway;
- PM_UNPAUSE is now processed as a postponed PM request, from a thread
  associated with the target process;
- the FP_DROP_WORK flag has been removed, since it is no longer more
  than just an optimization and only applied to processes operating on
  a pipe when getting killed;
- assignment to "fp" now takes place only when obtaining new work in
  the main thread or a worker thread, when resuming execution of a
  thread, and in the special case of exiting processes during reboot;
- there are no longer special cases where the yield() call is used to
  force a thread to run.

Change-Id: I7a97b9b95c2450454a9b5318dfa0e6150d4e6858
---

diff --git a/servers/vfs/README b/servers/vfs/README
index 2eac0af6d..6c91f6687 100644
--- a/servers/vfs/README
+++ b/servers/vfs/README
@@ -106,50 +106,18 @@ know the details of the protocol.
 Upon start up, VFS spawns a configurable amount of worker threads. The
 main thread fetches requests and replies, and hands them off to idle or
 reply-pending workers, respectively. If no worker threads are available,
-the request is queued. There are 3 types of worker threads: normal, a system
-worker, and a deadlock resolver. All standard system calls are handled by
-normal worker threads. Jobs from PM and notifications from the kernel are taken
-care of by the system worker. The deadlock resolver handles jobs from system
+the request is queued. All standard system calls are handled by such worker
+threads. One of the threads is reserved to handle new requests from system
 processes (i.e., File Servers and drivers) when there are no normal worker
 threads available; all normal threads might be blocked on a single worker
 thread that caused a system process to send a request on its own. To unblock
-all normal threads, we need to reserve one thread to handle that situation.
-VFS drives all File Servers and drivers asynchronously. While waiting for
-a reply, a worker thread is blocked and other workers can keep processing
-requests. Upon reply the worker thread is unblocked.
-As mentioned above, the main thread is responsible for retrieving new jobs and
-replies to current jobs and start or unblock the proper worker thread. Given
-how many sources for new jobs and replies there are, the work for the main
-thread is quite complicated. Consider Table 1.
-{{{
----------------------------------------------------------
-| From                 |  normal  | deadlock |  system  |
----------------------------------------------------------
- msg is new job
----------------------------------------------------------
-| PM                   |          |          |    X     |
-+----------------------+----------+----------+----------+
-| Notification from    |          |          |          |
-| the kernel           |          |          |    X     |
-+----------------------+----------+----------+----------+
-| Notification from    |          |          |          |
-| DS or system process |    X     |    X     |          |
-+----------------------+----------+----------+----------+
-| User process         |    X     |          |          |
-+----------------------+----------+----------+----------+
-| Unsuspended process  |    X     |          |          |
----------------------------------------------------------
- msg is reply
----------------------------------------------------------
-| File Server reply    |  resume  |          |          |
-+----------------------+----------+----------+----------+
-| Block driver reply   |  resume  |          |          |
-+----------------------+----------+----------+----------+
-| Char. driver reply   | resume/X |          |          |
----------------------------------------------------------
-}}}
-Table 1: VFS' message fetching main loop. X means 'start thread'.
+all normal threads, we need to reserve one spare thread to handle that
+situation. VFS drives all File Servers and drivers asynchronously. While
+waiting for a reply, a worker thread is blocked and other workers can keep
+processing requests. Upon reply the worker thread is unblocked.
 
+As mentioned above, the main thread is responsible for retrieving new jobs and
+replies to current jobs and start or unblock the proper worker thread.
 Driver replies are processed directly from the main thread. As a consequence,
 these processing routines may not block their calling thread. In some cases,
 these routines may resume a thread that is blocked waiting for the reply. This
@@ -158,19 +126,56 @@ character driver replies. The character driver reply processing routines may
 also unblock suspended processes which in turn generate new jobs to be handled
 by the main loop (e.g., suspended reads and writes on pipes). So depending
 on the reply a new thread may have to be started.
-When there are no worker threads available and there is no need to invoke
-the deadlock resolver (i.e., normal system calls), the request is queued in
-the fproc table. This works because a process can send only one system call
-at a time. When implementing kernel threads, one has to take this assumption
-into account.
-The protocol PM speaks with VFS is asynchronous and PM is allowed to
-send as many request to VFS as it wants. It is impossible to use the same
-queueing mechanism as normal processes use, because that would allow for
-just 1 queued message. Instead, the system worker maintains a linked list
-of pending requests. Moreover, this queueing mechanism is also the reason
-why notifications from the kernel are handled by the system worker; the
-kernel has no corresponding fproc table entry (so we can't store it there)
-and the linked list has no dependencies on that table.
+
+Worker threads are strictly tied to a process, and each process can have at
+most one worker thread running for it. Generally speaking, there are two types
+of work supported by worker threads: normal work, and work from PM. The main
+subtype of normal work is the handling of a system call made by the process
+itself. The process is blocked while VFS is handling the system call, so no new
+system call can arrive from a process while VFS has not completed a previous
+system call from that process. For that reason, if there are no worker threads
+available to handle the work, the work is queued in the corresponding process
+entry of the fproc table.
+
+The other main type of work consists of requests from PM. The protocol PM
+speaks with VFS is asynchronous. PM is allowed to send up to one request per
+process to VFS, in addition to a request to initiate a reboot. Most jobs from
+PM are taken care of immediately by the main thread, but some jobs require a
+worker thread context (to be able to sleep) and/or serialization with normal
+work. Therefore, each process may have a PM request queued for execution, also
+in the fproc table. Managing proper queuing, addition, and execution of both
+normal and PM work is the responsibility of the worker thread infrastructure.
+
+There are several special tasks that require a worker thread, and these are
+implemented as normal work associated with a certain special process that does
+not make regular VFS calls anyway. For example, the initial ramdisk mount
+procedure and the post-crash filp garbage collector use a thread associated
+with the VFS process. Some of these special tasks require protection against
+being started multiple times at once, as this is not only undesirable but also
+disallowed. The full list of worker thread task types and subtypes is shown in
+Table 1.
+
+{{{
+-------------------------------------------------------------------------
+| Worker thread task        | Type   | Association     | May use spare? |
++---------------------------+--------+-----------------+----------------+
+| system call from process  | normal | calling process | if system proc |
++---------------------------+--------+-----------------+----------------+
+| resumed pipe operation    | normal | calling process | no             |
++---------------------------+--------+-----------------+----------------+
+| postponed PM request      | PM     | target process  | no             |
++---------------------------+--------+-----------------+----------------+
+| DS event notification     | normal | DS              | yes            |
++---------------------------+--------+-----------------+----------------+
+| filp garbage collection   | normal | VFS             | no             |
++---------------------------+--------+-----------------+----------------+
+| initial ramdisk mounting  | normal | VFS             | no             |
++---------------------------+--------+-----------------+----------------+
+| reboot sequence           | normal | PM              | no             |
+-------------------------------------------------------------------------
+}}}
+Table 1: worker thread work types and subtypes
+
 Communication with block drivers is asynchronous, but at this time, access to
 these drivers is serialized on a per-driver basis. File Servers are treated
 differently. VFS was designed to be able to send requests concurrently to File
@@ -305,22 +310,42 @@ fproc, vmnt, vnode, and filp objects. To prevent deadlocks as a result of
 object locking, we need to define a strict locking order. In VFS we use the
 following order:
 
-fproc -> [exec] -> vmnt -> vnode -> filp -> [block special file] -> [dmap]
+{{{
+fproc > [exec] > vmnt > vnode > filp > [block special file] > [dmap]
+}}}
 
 That is, no thread may lock an fproc object while holding a vmnt lock,
 and no thread may lock a vmnt object while holding an (associated) vnode, etc.
+
 Fproc needs protection because processes themselves can initiate system
 calls, but also PM can cause system calls that have to be executed in their
 name. For example, a process might be busy reading from a character device
 and another process sends a termination signal. The exit(2) that follows is
 sent by PM and is to be executed by the to-be-killed process itself. At this
 point there is contention for the fproc object that belongs to the process,
-hence the need for protection.
-The exec(2) call is protected by a mutex for the following reason. VFS uses a
-number of variables on the heap to read ELF headers. They are on the heap due
-to their size; putting them on the stack would increase stack size demands for
-worker threads. The exec call does blocking read calls and thus needs exclusive
-access to these variables. However, only the exec(2) syscall needs this lock.
+hence the need for protection. This problem is solved in a simple way. Recall
+that all worker threads are bound to a process. This also forms the basis of
+fproc locking: each worker thread acquires and holds the fproc lock for its
+associated process for as long as it is processing work for that process.
+
+There are two cases where a worker thread may hold the lock to more than one
+process. First, as mentioned, the reboot procedure is executed from a worker
+thread set in the context of the PM process, thus with the PM process entry
+lock held. The procedure itself then acquires a temporary lock on every other
+process in turn, in order to clean it up without interference. Thus, the PM
+process entry is higher up in the locking order than all other process entries.
+
+Second, the exec(2) call is protected by a lock, and this exec lock is
+currently implemented as a lock on the VM process entry. The exec lock is
+acquired by a worker thread for the process performing the exec(2) call, and
+thus, the VM process entry is below all other process entries in the locking
+order. The exec(2) call is protected by a lock for the following reason. VFS
+uses a number of variables on the heap to read ELF headers. They are on the
+heap due to their size; putting them on the stack would increase stack size
+demands for worker threads. The exec call does blocking read calls and thus
+needs exclusive access to these variables. However, only the exec(2) syscall
+needs this lock.
+
 Access to block special files needs to be exclusive. File Servers are
 responsible for handling reads from and writes to block special files; if
 a block special file is on a device that is mounted, the FS responsible for
diff --git a/servers/vfs/comm.c b/servers/vfs/comm.c
index 41760827f..9e8ab8a74 100644
--- a/servers/vfs/comm.c
+++ b/servers/vfs/comm.c
@@ -1,7 +1,4 @@
 #include "fs.h"
-#include "glo.h"
-#include "vmnt.h"
-#include "fproc.h"
 #include <minix/vfsif.h>
 #include <assert.h>
 
diff --git a/servers/vfs/const.h b/servers/vfs/const.h
index d8c2f5c41..56731d2ca 100644
--- a/servers/vfs/const.h
+++ b/servers/vfs/const.h
@@ -6,7 +6,7 @@
 #define NR_LOCKS           8	/* # slots in the file locking table */
 #define NR_MNTS           16 	/* # slots in mount table */
 #define NR_VNODES       1024	/* # slots in vnode table */
-#define NR_WTHREADS	   8	/* # slots in worker thread table */
+#define NR_WTHREADS	   9	/* # slots in worker thread table */
 
 #define NR_NONEDEVS	NR_MNTS	/* # slots in nonedev bitmap */
 
diff --git a/servers/vfs/coredump.c b/servers/vfs/coredump.c
index 578d43df5..f1a3f800d 100644
--- a/servers/vfs/coredump.c
+++ b/servers/vfs/coredump.c
@@ -1,7 +1,6 @@
 #include "fs.h"
 #include <fcntl.h>
 #include <string.h>
-#include "fproc.h"
 #include <minix/vm.h>
 #include <sys/mman.h>
 #include <sys/exec_elf.h>
diff --git a/servers/vfs/device.c b/servers/vfs/device.c
index 43e453998..2eff10979 100644
--- a/servers/vfs/device.c
+++ b/servers/vfs/device.c
@@ -34,7 +34,6 @@
 #include <minix/ioctl.h>
 #include <minix/u64.h>
 #include "file.h"
-#include "fproc.h"
 #include "scratchpad.h"
 #include "dmap.h"
 #include <minix/vfsif.h>
@@ -50,7 +49,6 @@ static void reopen_reply(message *m_ptr);
 
 static int dummyproc;
 
-
 /*===========================================================================*
  *				dev_open				     *
  *===========================================================================*/
@@ -958,7 +956,7 @@ static void open_reply(message *m_ptr)
   proc_e = m_ptr->REP_ENDPT;
   if (isokendpt(proc_e, &slot) != OK) return;
   rfp = &fproc[slot];
-  wp = worker_get(rfp->fp_wtid);
+  wp = rfp->fp_worker;
   if (wp == NULL || wp->w_task != who_e) {
 	printf("VFS: no worker thread waiting for a reply from %d\n", who_e);
 	return;
@@ -1003,7 +1001,7 @@ static void task_reply(message *m_ptr)
 	 */
 	if (isokendpt(proc_e, &slot) != OK) return;
 	rfp = &fproc[slot];
-	wp = worker_get(rfp->fp_wtid);
+	wp = rfp->fp_worker;
 	if (wp != NULL && wp->w_task == who_e) {
 		assert(!fp_is_blocked(rfp));
 		*wp->w_drv_sendrec = *m_ptr;
@@ -1079,6 +1077,20 @@ void bdev_reply(struct dmap *dp)
 	worker_signal(wp);
 }
 
+/*===========================================================================*
+ *				filp_gc_thread				     *
+ *===========================================================================*/
+static void filp_gc_thread(void)
+{
+/* Filp garbage collection thread function. Since new filps may be invalidated
+ * while the actual garbage collection procedure is running, we repeat the
+ * procedure until it can not find any more work to do.
+ */
+
+  while (do_filp_gc())
+	/* simply repeat */;
+}
+
 /*===========================================================================*
  *				restart_reopen				     *
  *===========================================================================*/
@@ -1126,9 +1138,14 @@ int maj;
 
 	/* We have to clean up this filp and vnode, but can't do that yet as
 	 * it's locked by a worker thread. Start a new job to garbage collect
-	 * invalidated filps associated with this device driver.
+	 * invalidated filps associated with this device driver. This thread
+	 * is associated with a process that we know is idle otherwise: VFS.
+	 * Be careful that we don't start two threads or lose work, though.
 	 */
-	sys_worker_start(do_filp_gc);
+	if (worker_can_start(fproc_addr(VFS_PROC_NR))) {
+		worker_start(fproc_addr(VFS_PROC_NR), filp_gc_thread,
+			&m_out /*unused*/, FALSE /*use_spare*/);
+	}
   }
 
   /* Nothing more to re-open. Restart suspended processes */
diff --git a/servers/vfs/dmap.c b/servers/vfs/dmap.c
index 947f638f3..c12407a31 100644
--- a/servers/vfs/dmap.c
+++ b/servers/vfs/dmap.c
@@ -11,8 +11,6 @@
 #include <unistd.h>
 #include <minix/com.h>
 #include <minix/ds.h>
-#include "fproc.h"
-#include "dmap.h"
 #include "param.h"
 
 /* The order of the entries in the table determines the mapping between major
@@ -31,20 +29,17 @@ void lock_dmap(struct dmap *dp)
 {
 /* Lock a driver */
 	struct worker_thread *org_self;
-	struct fproc *org_fp;
 	int r;
 
 	assert(dp != NULL);
 	assert(dp->dmap_driver != NONE);
 
-	org_fp = fp;
-	org_self = self;
+	org_self = worker_suspend();
 
 	if ((r = mutex_lock(dp->dmap_lock_ref)) != 0)
 		panic("unable to get a lock on dmap: %d\n", r);
 
-	fp = org_fp;
-	self = org_self;
+	worker_resume(org_self);
 }
 
 /*===========================================================================*
diff --git a/servers/vfs/exec.c b/servers/vfs/exec.c
index 6f829ef13..51881b24b 100644
--- a/servers/vfs/exec.c
+++ b/servers/vfs/exec.c
@@ -27,7 +27,6 @@
 #include <dirent.h>
 #include <sys/exec.h>
 #include <sys/param.h>
-#include "fproc.h"
 #include "path.h"
 #include "param.h"
 #include "vnode.h"
@@ -54,8 +53,6 @@ struct vfs_exec_info {
     int vmfd_used;
 };
 
-static void lock_exec(void);
-static void unlock_exec(void);
 static int patch_stack(struct vnode *vp, char stack[ARG_MAX],
 	size_t *stk_bytes, char path[PATH_MAX], vir_bytes *vsp);
 static int is_script(struct vfs_exec_info *execi);
@@ -83,36 +80,8 @@ static const struct exec_loaders exec_loaders[] = {
 	{ NULL, NULL }
 };
 
-/*===========================================================================*
- *				lock_exec				     *
- *===========================================================================*/
-static void lock_exec(void)
-{
-  struct fproc *org_fp;
-  struct worker_thread *org_self;
-
-  /* First try to get it right off the bat */
-  if (mutex_trylock(&exec_lock) == 0)
-	return;
-
-  org_fp = fp;
-  org_self = self;
-
-  if (mutex_lock(&exec_lock) != 0)
-	panic("Could not obtain lock on exec");
-
-  fp = org_fp;
-  self = org_self;
-}
-
-/*===========================================================================*
- *				unlock_exec				     *
- *===========================================================================*/
-static void unlock_exec(void)
-{
-  if (mutex_unlock(&exec_lock) != 0)
-	panic("Could not release lock on exec");
-}
+#define lock_exec() lock_proc(fproc_addr(VM_PROC_NR))
+#define unlock_exec() unlock_proc(fproc_addr(VM_PROC_NR))
 
 /*===========================================================================*
  *				get_read_vp				     *
@@ -213,10 +182,9 @@ static int vfs_memmap(struct exec_info *execi,
 /*===========================================================================*
  *				pm_exec					     *
  *===========================================================================*/
-int pm_exec(endpoint_t proc_e, vir_bytes path, size_t path_len,
-		   vir_bytes frame, size_t frame_len, vir_bytes *pc,
-		   vir_bytes *newsp, vir_bytes *UNUSED(ps_str), int user_exec_flags)
-		   //vir_bytes *newsp, vir_bytes *ps_str, int user_exec_flags)
+int pm_exec(vir_bytes path, size_t path_len, vir_bytes frame, size_t frame_len,
+	vir_bytes *pc, vir_bytes *newsp, vir_bytes *UNUSED(ps_str),
+	int user_exec_flags)
 {
 /* Perform the execve(name, argv, envp) call.  The user library builds a
  * complete stack image, including pointers, args, environ, etc.  The stack
@@ -225,9 +193,8 @@ int pm_exec(endpoint_t proc_e, vir_bytes path, size_t path_len,
  * ps_str is not currently used, but may be if the ps_strings structure has to
  * be moved to another location.
  */
-  int r, slot;
+  int r;
   vir_bytes vsp;
-  struct fproc *rfp;
   static char mbuf[ARG_MAX];	/* buffer for stack and zeroes */
   struct vfs_exec_info execi;
   int i;
@@ -236,12 +203,11 @@ int pm_exec(endpoint_t proc_e, vir_bytes path, size_t path_len,
 	firstexec[PATH_MAX],
 	finalexec[PATH_MAX];
   struct lookup resolve;
-  struct fproc *vmfp = &fproc[VM_PROC_NR];
+  struct fproc *vmfp = fproc_addr(VM_PROC_NR);
   stackhook_t makestack = NULL;
   struct filp *newfilp = NULL;
 
   lock_exec();
-  lock_proc(vmfp, 0);
 
   /* unset execi values are 0. */
   memset(&execi, 0, sizeof(execi));
@@ -252,10 +218,8 @@ int pm_exec(endpoint_t proc_e, vir_bytes path, size_t path_len,
   execi.args.stack_high = kinfo.user_sp;
   execi.args.stack_size = DEFAULT_STACK_LIMIT;
 
-  okendpt(proc_e, &slot);
-  rfp = fp = &fproc[slot];
-  rfp->text_size = 0;
-  rfp->data_size = 0;
+  fp->text_size = 0;
+  fp->data_size = 0;
 
   lookup_init(&resolve, fullpath, PATH_NOFLAGS, &execi.vmp, &execi.vp);
 
@@ -266,7 +230,7 @@ int pm_exec(endpoint_t proc_e, vir_bytes path, size_t path_len,
   if (frame_len > ARG_MAX)
 	FAILCHECK(ENOMEM); /* stack too big */
 
-  r = sys_datacopy(proc_e, (vir_bytes) frame, SELF, (vir_bytes) mbuf,
+  r = sys_datacopy(fp->fp_endpoint, (vir_bytes) frame, SELF, (vir_bytes) mbuf,
 		   (size_t) frame_len);
   if (r != OK) { /* can't fetch stack (e.g. bad virtual addr) */
         printf("VFS: pm_exec: sys_datacopy failed\n");
@@ -278,8 +242,8 @@ int pm_exec(endpoint_t proc_e, vir_bytes path, size_t path_len,
   vsp = execi.args.stack_high - frame_len;
 
   /* The default is to keep the original user and group IDs */
-  execi.args.new_uid = rfp->fp_effuid;
-  execi.args.new_gid = rfp->fp_effgid;
+  execi.args.new_uid = fp->fp_effuid;
+  execi.args.new_gid = fp->fp_effgid;
 
   /* Get the exec file name. */
   FAILCHECK(fetch_name(path, path_len, fullpath));
@@ -379,7 +343,7 @@ int pm_exec(endpoint_t proc_e, vir_bytes path, size_t path_len,
   execi.args.allocmem_ondemand = libexec_alloc_mmap_ondemand;
   execi.args.opaque = &execi;
 
-  execi.args.proc_e = proc_e;
+  execi.args.proc_e = fp->fp_endpoint;
   execi.args.frame_len = frame_len;
   execi.args.filesize = execi.vp->v_size;
 
@@ -392,7 +356,7 @@ int pm_exec(endpoint_t proc_e, vir_bytes path, size_t path_len,
   FAILCHECK(r);
 
   /* Inform PM */
-  FAILCHECK(libexec_pm_newexec(proc_e, &execi.args));
+  FAILCHECK(libexec_pm_newexec(fp->fp_endpoint, &execi.args));
 
   /* Save off PC */
   *pc = execi.args.pc;
@@ -401,25 +365,25 @@ int pm_exec(endpoint_t proc_e, vir_bytes path, size_t path_len,
   if(makestack) FAILCHECK(makestack(&execi, mbuf, &frame_len, &vsp));
 
   /* Copy the stack from VFS to new core image. */
-  FAILCHECK(sys_datacopy(SELF, (vir_bytes) mbuf, proc_e, (vir_bytes) vsp,
-		   (phys_bytes)frame_len));
+  FAILCHECK(sys_datacopy(SELF, (vir_bytes) mbuf, fp->fp_endpoint,
+	(vir_bytes) vsp, (phys_bytes)frame_len));
 
   /* Return new stack pointer to caller */
   *newsp = vsp;
 
-  clo_exec(rfp);
+  clo_exec(fp);
 
   if (execi.args.allow_setuid) {
 	/* If after loading the image we're still allowed to run with
 	 * setuid or setgid, change credentials now */
-	rfp->fp_effuid = execi.args.new_uid;
-	rfp->fp_effgid = execi.args.new_gid;
+	fp->fp_effuid = execi.args.new_uid;
+	fp->fp_effgid = execi.args.new_gid;
   }
 
   /* Remember the new name of the process */
-  strlcpy(rfp->fp_name, execi.args.progname, PROC_NAME_LEN);
-  rfp->text_size = execi.args.text_size;
-  rfp->data_size = execi.args.data_size;
+  strlcpy(fp->fp_name, execi.args.progname, PROC_NAME_LEN);
+  fp->text_size = execi.args.text_size;
+  fp->data_size = execi.args.data_size;
 
 pm_execfinal:
   if(newfilp) unlock_filp(newfilp);
@@ -434,7 +398,6 @@ pm_execfinal:
 	}
   }
 
-  unlock_proc(vmfp);
   unlock_exec();
 
   return(r);
diff --git a/servers/vfs/filedes.c b/servers/vfs/filedes.c
index 9ba0c0bd7..31802624a 100644
--- a/servers/vfs/filedes.c
+++ b/servers/vfs/filedes.c
@@ -22,7 +22,6 @@
 #include <sys/stat.h>
 #include "fs.h"
 #include "file.h"
-#include "fproc.h"
 #include "vnode.h"
 
 
@@ -77,16 +76,22 @@ void check_filp_locks(void)
 }
 
 /*===========================================================================*
- *				do_filp_gc					     *
+ *				do_filp_gc				     *
  *===========================================================================*/
-void *do_filp_gc(void *UNUSED(arg))
+int do_filp_gc(void)
 {
+/* Perform filp garbage collection. Return whether at least one invalidated
+ * filp was found, in which case the entire procedure will be invoked again.
+ */
   struct filp *f;
   struct vnode *vp;
+  int found = FALSE;
 
   for (f = &filp[0]; f < &filp[NR_FILPS]; f++) {
 	if (!(f->filp_state & FS_INVALIDATED)) continue;
 
+	found = TRUE;
+
 	if (f->filp_mode == FILP_CLOSED || f->filp_vno == NULL) {
 		/* File was already closed before gc could kick in */
 		assert(f->filp_count <= 0);
@@ -125,12 +130,11 @@ void *do_filp_gc(void *UNUSED(arg))
 	f->filp_state &= ~FS_INVALIDATED;
   }
 
-  thread_cleanup(NULL);
-  return(NULL);
+  return found;
 }
 
 /*===========================================================================*
- *				init_filps					     *
+ *				init_filps				     *
  *===========================================================================*/
 void init_filps(void)
 {
@@ -320,7 +324,6 @@ void lock_filp(filp, locktype)
 struct filp *filp;
 tll_access_t locktype;
 {
-  struct fproc *org_fp;
   struct worker_thread *org_self;
   struct vnode *vp;
 
@@ -350,14 +353,12 @@ tll_access_t locktype;
   if (mutex_trylock(&filp->filp_lock) != 0) {
 
 	/* Already in use, let's wait for our turn */
-	org_fp = fp;
-	org_self = self;
+	org_self = worker_suspend();
 
 	if (mutex_lock(&filp->filp_lock) != 0)
 		panic("unable to obtain lock on filp");
 
-	fp = org_fp;
-	self = org_self;
+	worker_resume(org_self);
   }
 }
 
diff --git a/servers/vfs/fproc.h b/servers/vfs/fproc.h
index 3f9ff28e9..444d99968 100644
--- a/servers/vfs/fproc.h
+++ b/servers/vfs/fproc.h
@@ -43,8 +43,11 @@ EXTERN struct fproc {
   mode_t fp_umask;		/* mask set by umask system call */
 
   mutex_t fp_lock;		/* mutex to lock fproc object */
-  struct job fp_job;		/* pending job */
-  thread_t fp_wtid;		/* Thread ID of worker */
+  struct worker_thread *fp_worker;/* active worker thread, or NULL */
+  void (*fp_func)();		/* handler function for pending work */
+  message fp_msg;		/* pending or active message from process */
+  message fp_pm_msg;		/* pending/active postponed PM request */
+
   char fp_name[PROC_NAME_LEN];	/* Last exec() */
 #if LOCK_DEBUG
   int fp_vp_rdlocks;		/* number of read-only locks on vnodes */
@@ -64,9 +67,8 @@ EXTERN struct fproc {
 #define FP_SESLDR	 0004	/* Set if process is session leader */
 #define FP_PENDING	 0010	/* Set if process has pending work */
 #define FP_EXITING	 0020	/* Set if process is exiting */
-#define FP_PM_PENDING	 0040	/* Set if process has pending PM request */
+#define FP_PM_WORK	 0040	/* Set if process has a postponed PM request */
 #define FP_SRV_PROC	 0100	/* Set if process is a service */
-#define FP_DROP_WORK	 0200	/* Set if process won't accept new work */
 
 /* Field values. */
 #define NOT_REVIVING       0xC0FFEEE	/* process is not being revived */
diff --git a/servers/vfs/fs.h b/servers/vfs/fs.h
index a96a88943..085f3d4d1 100644
--- a/servers/vfs/fs.h
+++ b/servers/vfs/fs.h
@@ -30,5 +30,6 @@
 #include "glo.h"
 #include "type.h"
 #include "vmnt.h"
+#include "fproc.h"
 
 #endif
diff --git a/servers/vfs/gcov.c b/servers/vfs/gcov.c
index 683a2bf82..642fe558a 100644
--- a/servers/vfs/gcov.c
+++ b/servers/vfs/gcov.c
@@ -1,7 +1,6 @@
 
 #include "fs.h"
 #include "file.h"
-#include "fproc.h"
 
 int gcov_flush(cp_grant_id_t grantid, size_t size );
 
diff --git a/servers/vfs/glo.h b/servers/vfs/glo.h
index be48c1ba3..09009c450 100644
--- a/servers/vfs/glo.h
+++ b/servers/vfs/glo.h
@@ -25,24 +25,17 @@ EXTERN u32_t system_hz;		/* system clock frequency. */
 /* The parameters of the call are kept here. */
 EXTERN message m_in;		/* the input message itself */
 # define who_p		((int) (fp - fproc))
-# define isokslot(p)	(p >= 0 && \
-			 p < (int)(sizeof(fproc) / sizeof(struct fproc)))
-# define who_e		(self != NULL && fp != NULL ? fp->fp_endpoint : \
-							m_in.m_source)
+# define fproc_addr(e)	(&fproc[_ENDPOINT_P(e)])
+# define who_e		(self != NULL ? fp->fp_endpoint : m_in.m_source)
 # define call_nr	(m_in.m_type)
-# define job_m_in	(self->w_job.j_m_in)
+# define job_m_in	(self->w_msg)
 # define job_call_nr	(job_m_in.m_type)
 # define super_user	(fp->fp_effuid == SU_UID ? 1 : 0)
 # define scratch(p)		(scratchpad[((int) ((p) - fproc))])
 EXTERN struct worker_thread *self;
-EXTERN int force_sync;		/* toggle forced synchronous communication */
 EXTERN int deadlock_resolving;
-EXTERN mutex_t exec_lock;
 EXTERN mutex_t bsf_lock;/* Global lock for access to block special files */
 EXTERN struct worker_thread workers[NR_WTHREADS];
-EXTERN struct worker_thread sys_worker;
-EXTERN struct worker_thread dl_worker;
-EXTERN thread_t invalid_thread_id;
 EXTERN char mount_label[LABEL_MAX];	/* label of file system to mount */
 
 /* The following variables are used for returning results to the caller. */
diff --git a/servers/vfs/job.h b/servers/vfs/job.h
deleted file mode 100644
index 046a11b30..000000000
--- a/servers/vfs/job.h
+++ /dev/null
@@ -1,12 +0,0 @@
-#ifndef __VFS_WORK_H__
-#define __VFS_WORK_H__
-
-struct job {
-  struct fproc *j_fp;
-  message j_m_in;
-  int j_err_code;
-  void *(*j_func)(void *arg);
-  struct job *j_next;
-};
-
-#endif
diff --git a/servers/vfs/link.c b/servers/vfs/link.c
index 9ad1c223a..9537fbeb7 100644
--- a/servers/vfs/link.c
+++ b/servers/vfs/link.c
@@ -20,7 +20,6 @@
 #include <dirent.h>
 #include <assert.h>
 #include "file.h"
-#include "fproc.h"
 #include "path.h"
 #include "vnode.h"
 #include "param.h"
diff --git a/servers/vfs/lock.c b/servers/vfs/lock.c
index e762d88e8..1b9ed021e 100644
--- a/servers/vfs/lock.c
+++ b/servers/vfs/lock.c
@@ -11,7 +11,6 @@
 #include <fcntl.h>
 #include <unistd.h>
 #include "file.h"
-#include "fproc.h"
 #include "scratchpad.h"
 #include "lock.h"
 #include "vnode.h"
diff --git a/servers/vfs/main.c b/servers/vfs/main.c
index a6f3af03e..110faff63 100644
--- a/servers/vfs/main.c
+++ b/servers/vfs/main.c
@@ -27,12 +27,9 @@
 #include <minix/debug.h>
 #include <minix/vfsif.h>
 #include "file.h"
-#include "dmap.h"
-#include "fproc.h"
 #include "scratchpad.h"
 #include "vmnt.h"
 #include "vnode.h"
-#include "job.h"
 #include "param.h"
 
 #if ENABLE_SYSCALL_STATS
@@ -40,23 +37,18 @@ EXTERN unsigned long calls_stats[NCALLS];
 #endif
 
 /* Thread related prototypes */
-static void *do_fs_reply(struct job *job);
-static void *do_work(void *arg);
-static void *do_pm(void *arg);
-static void *do_init_root(void *arg);
-static void handle_work(void *(*func)(void *arg));
+static void do_fs_reply(struct worker_thread *wp);
+static void do_work(void);
+static void do_init_root(void);
+static void handle_work(void (*func)(void));
 
 static void get_work(void);
-static void lock_pm(void);
-static void unlock_pm(void);
 static void service_pm(void);
-static void service_pm_postponed(void);
 static int unblock(struct fproc *rfp);
 
 /* SEF functions and variables. */
 static void sef_local_startup(void);
 static int sef_cb_init_fresh(int type, sef_init_info_t *info);
-static mutex_t pm_lock;
 static endpoint_t receive_from;
 
 /*===========================================================================*
@@ -69,7 +61,7 @@ int main(void)
  * the reply.  This loop never terminates as long as the file system runs.
  */
   int transid;
-  struct job *job;
+  struct worker_thread *wp;
 
   /* SEF local startup. */
   sef_local_startup();
@@ -83,33 +75,35 @@ int main(void)
   while (TRUE) {
 	yield_all();	/* let other threads run */
 	self = NULL;
-	job = NULL;
 	send_work();
 	get_work();
 
 	transid = TRNS_GET_ID(m_in.m_type);
 	if (IS_VFS_FS_TRANSID(transid)) {
-		job = worker_getjob( (thread_t) transid - VFS_TRANSID);
-		if (job == NULL) {
+		wp = worker_get((thread_t) transid - VFS_TRANSID);
+		if (wp == NULL || wp->w_fp == NULL) {
 			printf("VFS: spurious message %d from endpoint %d\n",
 				m_in.m_type, m_in.m_source);
 			continue;
 		}
 		m_in.m_type = TRNS_DEL_ID(m_in.m_type);
-	}
-
-	if (job != NULL) {
-		do_fs_reply(job);
+		do_fs_reply(wp);
 		continue;
 	} else if (who_e == PM_PROC_NR) { /* Calls from PM */
 		/* Special control messages from PM */
-		sys_worker_start(do_pm);
+		service_pm();
 		continue;
 	} else if (is_notify(call_nr)) {
 		/* A task notify()ed us */
 		switch (who_e) {
 		case DS_PROC_NR:
-			handle_work(ds_event);
+			/* Start a thread to handle DS events, if no thread
+			 * is pending or active for it already. DS is not
+			 * supposed to issue calls to VFS or be the subject of
+			 * postponed PM requests, so this should be no problem.
+			 */
+			if (worker_can_start(fp))
+				handle_work(ds_event);
 			break;
 		case KERNEL:
 			mthread_stacktraces();
@@ -169,141 +163,73 @@ int main(void)
 /*===========================================================================*
  *			       handle_work				     *
  *===========================================================================*/
-static void handle_work(void *(*func)(void *arg))
+static void handle_work(void (*func)(void))
 {
 /* Handle asynchronous device replies and new system calls. If the originating
  * endpoint is an FS endpoint, take extra care not to get in deadlock. */
   struct vmnt *vmp = NULL;
   endpoint_t proc_e;
+  int use_spare = FALSE;
 
   proc_e = m_in.m_source;
 
   if (fp->fp_flags & FP_SRV_PROC) {
 	vmp = find_vmnt(proc_e);
 	if (vmp != NULL) {
-		/* A call back or dev result from an FS
-		 * endpoint. Set call back flag. Can do only
-		 * one call back at a time.
-		 */
+		/* A callback from an FS endpoint. Can do only one at once. */
 		if (vmp->m_flags & VMNT_CALLBACK) {
 			replycode(proc_e, EAGAIN);
 			return;
 		}
+		/* Already trying to resolve a deadlock? Can't handle more. */
+		if (worker_available() == 0) {
+			replycode(proc_e, EAGAIN);
+			return;
+		}
+		/* A thread is available. Set callback flag. */
 		vmp->m_flags |= VMNT_CALLBACK;
 		if (vmp->m_flags & VMNT_MOUNTING) {
 			vmp->m_flags |= VMNT_FORCEROOTBSF;
 		}
 	}
 
-	if (worker_available() == 0) {
-		if (!deadlock_resolving) {
-			deadlock_resolving = 1;
-			dl_worker_start(func);
-			return;
-		}
-
-		if (vmp != NULL) {
-			/* Already trying to resolve a deadlock, can't
-			 * handle more, sorry */
-
-			replycode(proc_e, EAGAIN);
-			return;
-		}
-	}
+	/* Use the spare thread to handle this request if needed. */
+	use_spare = TRUE;
   }
 
-  worker_start(func);
+  worker_start(fp, func, &m_in, use_spare);
 }
 
 
 /*===========================================================================*
  *			       do_fs_reply				     *
  *===========================================================================*/
-static void *do_fs_reply(struct job *job)
+static void do_fs_reply(struct worker_thread *wp)
 {
   struct vmnt *vmp;
-  struct worker_thread *wp;
 
   if ((vmp = find_vmnt(who_e)) == NULL)
 	panic("Couldn't find vmnt for endpoint %d", who_e);
 
-  wp = worker_get(job->j_fp->fp_wtid);
-
-  if (wp == NULL) {
-	printf("VFS: spurious reply from %d\n", who_e);
-	return(NULL);
-  }
-
   if (wp->w_task != who_e) {
 	printf("VFS: expected %d to reply, not %d\n", wp->w_task, who_e);
-	return(NULL);
+	return;
   }
   *wp->w_fs_sendrec = m_in;
   wp->w_task = NONE;
   vmp->m_comm.c_cur_reqs--; /* We've got our reply, make room for others */
   worker_signal(wp); /* Continue this thread */
-  return(NULL);
-}
-
-/*===========================================================================*
- *				lock_pm					     *
- *===========================================================================*/
-static void lock_pm(void)
-{
-  struct fproc *org_fp;
-  struct worker_thread *org_self;
-
-  /* First try to get it right off the bat */
-  if (mutex_trylock(&pm_lock) == 0)
-	return;
-
-  org_fp = fp;
-  org_self = self;
-
-  if (mutex_lock(&pm_lock) != 0)
-	panic("Could not obtain lock on pm\n");
-
-  fp = org_fp;
-  self = org_self;
-}
-
-/*===========================================================================*
- *				unlock_pm				     *
- *===========================================================================*/
-static void unlock_pm(void)
-{
-  if (mutex_unlock(&pm_lock) != 0)
-	panic("Could not release lock on pm");
-}
-
-/*===========================================================================*
- *			       do_pm					     *
- *===========================================================================*/
-static void *do_pm(void *arg __unused)
-{
-  lock_pm();
-  service_pm();
-  unlock_pm();
-
-  thread_cleanup(NULL);
-  return(NULL);
 }
 
 /*===========================================================================*
  *			       do_pending_pipe				     *
  *===========================================================================*/
-static void *do_pending_pipe(void *arg)
+static void do_pending_pipe(void)
 {
   int r, op;
-  struct job my_job;
   struct filp *f;
   tll_access_t locktype;
 
-  my_job = *((struct job *) arg);
-  fp = my_job.j_fp;
-
-  lock_proc(fp, 1 /* force lock */);
-
   f = scratch(fp).file.filp;
   assert(f != NULL);
   scratch(fp).file.filp = NULL;
@@ -318,48 +244,18 @@ static void *do_pending_pipe(void *arg)
 	replycode(fp->fp_endpoint, r);
 
   unlock_filp(f);
-  thread_cleanup(fp);
-  unlock_proc(fp);
-  return(NULL);
-}
-
-/*===========================================================================*
- *			       do_dummy					     *
- *===========================================================================*/
-void *do_dummy(void *arg)
-{
-  struct job my_job;
-  int r;
-
-  my_job = *((struct job *) arg);
-  fp = my_job.j_fp;
-
-  if ((r = mutex_trylock(&fp->fp_lock)) == 0) {
-	thread_cleanup(fp);
-	unlock_proc(fp);
-  } else {
-	/* Proc is busy, let that worker thread carry out the work */
-	thread_cleanup(NULL);
-  }
-  return(NULL);
 }
 
 /*===========================================================================*
  *			       do_work					     *
  *===========================================================================*/
-static void *do_work(void *arg)
+static void do_work(void)
 {
   int error;
-  struct job my_job;
   message m_out;
 
   memset(&m_out, 0, sizeof(m_out));
 
-  my_job = *((struct job *) arg);
-  fp = my_job.j_fp;
-
-  lock_proc(fp, 0); /* This proc is busy */
-
   if (job_call_nr == MAPDRIVER) {
 	error = do_mapdriver();
   } else if (job_call_nr == COMMON_GETSYSINFO) {
@@ -395,10 +291,6 @@ static void *do_work(void *arg)
 
   /* Copy the results back to the user and send reply. */
   if (error != SUSPEND) reply(&m_out, fp->fp_endpoint, error);
-
-  thread_cleanup(fp);
-  unlock_proc(fp);
-  return(NULL);
 }
 
 /*===========================================================================*
@@ -427,7 +319,6 @@ static int sef_cb_init_fresh(int UNUSED(type), sef_init_info_t *info)
   message mess;
   struct rprocpub rprocpub[NR_BOOT_PROCS];
 
-  force_sync = 0;
   receive_from = ANY;
   self = NULL;
   verbose = 0;
@@ -468,8 +359,6 @@ static int sef_cb_init_fresh(int UNUSED(type), sef_init_info_t *info)
   s = send(PM_PROC_NR, &mess);		/* send synchronization message */
 
   /* All process table entries have been set. Continue with initialization. */
-  fp = &fproc[_ENDPOINT_P(VFS_PROC_NR)];/* During init all communication with
-					 * FSes is on behalf of myself */
   init_dmap();			/* Initialize device table. */
   system_hz = sys_hz();
 
@@ -491,28 +380,28 @@ static int sef_cb_init_fresh(int UNUSED(type), sef_init_info_t *info)
   if (s != OK) panic("VFS: can't subscribe to driver events (%d)", s);
 
   /* Initialize worker threads */
-  for (i = 0; i < NR_WTHREADS; i++)  {
-	worker_init(&workers[i]);
-  }
-  worker_init(&sys_worker); /* exclusive system worker thread */
-  worker_init(&dl_worker); /* exclusive worker thread to resolve deadlocks */
+  worker_init();
 
   /* Initialize global locks */
-  if (mthread_mutex_init(&pm_lock, NULL) != 0)
-	panic("VFS: couldn't initialize pm lock mutex");
-  if (mthread_mutex_init(&exec_lock, NULL) != 0)
-	panic("VFS: couldn't initialize exec lock");
   if (mthread_mutex_init(&bsf_lock, NULL) != 0)
 	panic("VFS: couldn't initialize block special file lock");
 
-  /* Initialize event resources for boot procs and locks for all procs */
+  /* Initialize locks and initial values for all processes. */
   for (rfp = &fproc[0]; rfp < &fproc[NR_PROCS]; rfp++) {
 	if (mutex_init(&rfp->fp_lock, NULL) != 0)
 		panic("unable to initialize fproc lock");
+	rfp->fp_worker = NULL;
 #if LOCK_DEBUG
 	rfp->fp_vp_rdlocks = 0;
 	rfp->fp_vmnt_rdlocks = 0;
 #endif
+
+	/* Initialize process directories. mount_fs will set them to the
+	 * correct values.
+	 */
+	FD_ZERO(&(rfp->fp_filp_inuse));
+	rfp->fp_rd = NULL;
+	rfp->fp_wd = NULL;
   }
 
   init_dmap_locks();		/* init dmap locks */
@@ -521,9 +410,11 @@ static int sef_cb_init_fresh(int UNUSED(type), sef_init_info_t *info)
   init_select();		/* init select() structures */
   init_filps();			/* Init filp structures */
   mount_pfs();			/* mount Pipe File Server */
-  worker_start(do_init_root);	/* mount initial ramdisk as file system root */
-  yield();			/* force do_init_root to start */
-  self = NULL;
+
+  /* Mount initial ramdisk as file system root. */
+  receive_from = MFS_PROC_NR;
+  worker_start(fproc_addr(VFS_PROC_NR), do_init_root, &mess /*unused*/,
+	FALSE /*use_spare*/);
 
   return(OK);
 }
@@ -531,68 +422,36 @@ static int sef_cb_init_fresh(int UNUSED(type), sef_init_info_t *info)
 /*===========================================================================*
  *			       do_init_root				     *
  *===========================================================================*/
-static void *do_init_root(void *arg)
+static void do_init_root(void)
 {
-  struct fproc *rfp;
-  struct job my_job;
   int r;
   char *mount_type = "mfs"; /* FIXME: use boot image process name instead */
   char *mount_label = "fs_imgrd"; /* FIXME: obtain this from RS */
 
-  my_job = *((struct job *) arg);
-  fp = my_job.j_fp;
-
-  lock_proc(fp, 1 /* force lock */); /* This proc is busy */
-  lock_pm();
-
-  /* Initialize process directories. mount_fs will set them to the correct
-   * values */
-  for (rfp = &fproc[0]; rfp < &fproc[NR_PROCS]; rfp++) {
-	FD_ZERO(&(rfp->fp_filp_inuse));
-	rfp->fp_rd = NULL;
-	rfp->fp_wd = NULL;
-  }
-
-  receive_from = MFS_PROC_NR;
   r = mount_fs(DEV_IMGRD, "bootramdisk", "/", MFS_PROC_NR, 0, mount_type,
 	mount_label);
   if (r != OK)
 	panic("Failed to initialize root");
   receive_from = ANY;
-
-  unlock_pm();
-  thread_cleanup(fp);
-  unlock_proc(fp);
-  return(NULL);
 }
 
 /*===========================================================================*
  *				lock_proc				     *
  *===========================================================================*/
-void lock_proc(struct fproc *rfp, int force_lock)
+void lock_proc(struct fproc *rfp)
 {
   int r;
-  struct fproc *org_fp;
   struct worker_thread *org_self;
 
   r = mutex_trylock(&rfp->fp_lock);
-
-  /* Were we supposed to obtain this lock immediately? */
-  if (force_lock) {
-	assert(r == 0);
-	return;
-  }
-
   if (r == 0) return;
 
-  org_fp = fp;
-  org_self = self;
+  org_self = worker_suspend();
 
   if ((r = mutex_lock(&rfp->fp_lock)) != 0)
 	panic("unable to lock fproc lock: %d", r);
 
-  fp = org_fp;
-  self = org_self;
+  worker_resume(org_self);
 }
 
 /*===========================================================================*
@@ -609,48 +468,23 @@ void unlock_proc(struct fproc *rfp)
 /*===========================================================================*
  *				thread_cleanup				     *
  *===========================================================================*/
-void thread_cleanup(struct fproc *rfp)
+void thread_cleanup(void)
 {
-/* Clean up worker thread. Skip parts if this thread is not associated
- * with a particular process (i.e., rfp is NULL) */
-
-#if LOCK_DEBUG
-  if (rfp != NULL) {
-	check_filp_locks_by_me();
-	check_vnode_locks_by_me(rfp);
-	check_vmnt_locks_by_me(rfp);
-  }
-#endif
-
-  if (rfp != NULL && rfp->fp_flags & FP_PM_PENDING) {	/* Postponed PM call */
-	job_m_in = rfp->fp_job.j_m_in;
-	rfp->fp_flags &= ~FP_PM_PENDING;
-	service_pm_postponed();
-  }
+/* Perform cleanup actions for a worker thread. */
 
 #if LOCK_DEBUG
-  if (rfp != NULL) {
-	check_filp_locks_by_me();
-	check_vnode_locks_by_me(rfp);
-	check_vmnt_locks_by_me(rfp);
-  }
+  check_filp_locks_by_me();
+  check_vnode_locks_by_me(fp);
+  check_vmnt_locks_by_me(fp);
 #endif
 
-  if (rfp != NULL) {
-	rfp->fp_flags &= ~FP_DROP_WORK;
-	if (rfp->fp_flags & FP_SRV_PROC) {
-		struct vmnt *vmp;
+  if (fp->fp_flags & FP_SRV_PROC) {
+	struct vmnt *vmp;
 
-		if ((vmp = find_vmnt(rfp->fp_endpoint)) != NULL) {
-			vmp->m_flags &= ~VMNT_CALLBACK;
-		}
+	if ((vmp = find_vmnt(fp->fp_endpoint)) != NULL) {
+		vmp->m_flags &= ~VMNT_CALLBACK;
 	}
   }
-
-  if (deadlock_resolving) {
-	if (self->w_tid == dl_worker.w_tid)
-		deadlock_resolving = 0;
-  }
 }
 
 /*===========================================================================*
@@ -760,75 +594,83 @@ void replycode(endpoint_t whom, int result)
 /*===========================================================================*
  *				service_pm_postponed			     *
  *===========================================================================*/
-static void service_pm_postponed(void)
+void service_pm_postponed(void)
 {
-  int r;
-  vir_bytes pc, newsp;
+  int r, term_signal;
+  vir_bytes core_path;
+  vir_bytes exec_path, stack_frame, pc, newsp, ps_str;
+  size_t exec_path_len, stack_frame_len;
+  endpoint_t proc_e;
   message m_out;
 
   memset(&m_out, 0, sizeof(m_out));
 
   switch(job_call_nr) {
-    case PM_EXEC:
-	{
-		endpoint_t proc_e;
-		vir_bytes exec_path, stack_frame, ps_str;
-		size_t exec_path_len, stack_frame_len;
-
-		proc_e = job_m_in.PM_PROC;
-		exec_path = (vir_bytes)job_m_in.PM_PATH;
-		exec_path_len = (size_t)job_m_in.PM_PATH_LEN;
-		stack_frame = (vir_bytes)job_m_in.PM_FRAME;
-		stack_frame_len = (size_t)job_m_in.PM_FRAME_LEN;
-		ps_str = (vir_bytes)job_m_in.PM_PS_STR;
-
-		r = pm_exec(proc_e, exec_path, exec_path_len, stack_frame,
-			    stack_frame_len, &pc, &newsp, &ps_str,
-			    job_m_in.PM_EXECFLAGS);
-
-		/* Reply status to PM */
-		m_out.m_type = PM_EXEC_REPLY;
-		m_out.PM_PROC = proc_e;
-		m_out.PM_PC = (void *)pc;
-		m_out.PM_STATUS = r;
-		m_out.PM_NEWSP = (void *)newsp;
-		m_out.PM_NEWPS_STR = ps_str;
-	}
+  case PM_EXEC:
+	proc_e = job_m_in.PM_PROC;
+	exec_path = (vir_bytes)job_m_in.PM_PATH;
+	exec_path_len = (size_t)job_m_in.PM_PATH_LEN;
+	stack_frame = (vir_bytes)job_m_in.PM_FRAME;
+	stack_frame_len = (size_t)job_m_in.PM_FRAME_LEN;
+	ps_str = (vir_bytes)job_m_in.PM_PS_STR;
+
+	assert(proc_e == fp->fp_endpoint);
+
+	r = pm_exec(exec_path, exec_path_len, stack_frame, stack_frame_len,
+		&pc, &newsp, &ps_str, job_m_in.PM_EXECFLAGS);
+
+	/* Reply status to PM */
+	m_out.m_type = PM_EXEC_REPLY;
+	m_out.PM_PROC = proc_e;
+	m_out.PM_PC = (void *)pc;
+	m_out.PM_STATUS = r;
+	m_out.PM_NEWSP = (void *)newsp;
+	m_out.PM_NEWPS_STR = ps_str;
+
 	break;
 
-    case PM_EXIT:
-	{
-		endpoint_t proc_e;
-		proc_e = job_m_in.PM_PROC;
+  case PM_EXIT:
+	proc_e = job_m_in.PM_PROC;
 
-		pm_exit(proc_e);
+	assert(proc_e == fp->fp_endpoint);
+
+	pm_exit();
+
+	/* Reply dummy status to PM for synchronization */
+	m_out.m_type = PM_EXIT_REPLY;
+	m_out.PM_PROC = proc_e;
 
-		/* Reply dummy status to PM for synchronization */
-		m_out.m_type = PM_EXIT_REPLY;
-		m_out.PM_PROC = proc_e;
-	}
 	break;
 
-    case PM_DUMPCORE:
-	{
-		endpoint_t proc_e;
-		int term_signal;
-		vir_bytes core_path;
+  case PM_DUMPCORE:
+	proc_e = job_m_in.PM_PROC;
+	term_signal = job_m_in.PM_TERM_SIG;
+	core_path = (vir_bytes) job_m_in.PM_PATH;
 
-		proc_e = job_m_in.PM_PROC;
-		term_signal = job_m_in.PM_TERM_SIG;
-		core_path = (vir_bytes) job_m_in.PM_PATH;
+	assert(proc_e == fp->fp_endpoint);
 
-		r = pm_dumpcore(proc_e, term_signal, core_path);
+	r = pm_dumpcore(term_signal, core_path);
+
+	/* Reply status to PM */
+	m_out.m_type = PM_CORE_REPLY;
+	m_out.PM_PROC = proc_e;
+	m_out.PM_STATUS = r;
 
-		/* Reply status to PM */
-		m_out.m_type = PM_CORE_REPLY;
-		m_out.PM_PROC = proc_e;
-		m_out.PM_STATUS = r;
-	}
 	break;
 
-    default:
+  case PM_UNPAUSE:
+	proc_e = job_m_in.PM_PROC;
+
+	assert(proc_e == fp->fp_endpoint);
+
+	unpause();
+
+	m_out.m_type = PM_UNPAUSE_REPLY;
+	m_out.PM_PROC = proc_e;
+
+	break;
+
+  default:
 	panic("Unhandled postponed PM call %d", job_m_in.m_type);
   }
 
@@ -840,22 +682,34 @@ static void service_pm_postponed(void)
 /*===========================================================================*
  *				service_pm				     *
  *===========================================================================*/
-static void service_pm()
+static void service_pm(void)
 {
+/* Process a request from PM. This function is called from the main thread, and
+ * may therefore not block. Any requests that may require blocking the calling
+ * thread must be executed in a separate thread. Aside from PM_REBOOT, all
+ * requests from PM involve another, target process: for example, PM tells VFS
+ * that a process is performing a setuid() call. For some requests however,
+ * that other process may not be idle, and in that case VFS must serialize the
+ * PM request handling with any operation is it handling for that target
+ * process. As it happens, the requests that may require blocking are also the
+ * ones where the target process may not be idle. For both these reasons, such
+ * requests are run in worker threads associated to the target process.
+ */
+  struct fproc *rfp;
   int r, slot;
   message m_out;
 
   memset(&m_out, 0, sizeof(m_out));
 
-  switch (job_call_nr) {
+  switch (call_nr) {
     case PM_SETUID:
 	{
 		endpoint_t proc_e;
 		uid_t euid, ruid;
 
-		proc_e = job_m_in.PM_PROC;
-		euid = job_m_in.PM_EID;
-		ruid = job_m_in.PM_RID;
+		proc_e = m_in.PM_PROC;
+		euid = m_in.PM_EID;
+		ruid = m_in.PM_RID;
 
 		pm_setuid(proc_e, euid, ruid);
 
@@ -869,9 +723,9 @@ static void service_pm()
 		endpoint_t proc_e;
 		gid_t egid, rgid;
 
-		proc_e = job_m_in.PM_PROC;
-		egid = job_m_in.PM_EID;
-		rgid = job_m_in.PM_RID;
+		proc_e = m_in.PM_PROC;
+		egid = m_in.PM_EID;
+		rgid = m_in.PM_RID;
 
 		pm_setgid(proc_e, egid, rgid);
 
@@ -884,7 +738,7 @@ static void service_pm()
 	{
 		endpoint_t proc_e;
 
-		proc_e = job_m_in.PM_PROC;
+		proc_e = m_in.PM_PROC;
 		pm_setsid(proc_e);
 
 		m_out.m_type = PM_SETSID_REPLY;
@@ -895,39 +749,22 @@ static void service_pm()
     case PM_EXEC:
     case PM_EXIT:
     case PM_DUMPCORE:
+    case PM_UNPAUSE:
 	{
-		endpoint_t proc_e = job_m_in.PM_PROC;
+		endpoint_t proc_e = m_in.PM_PROC;
 
 		if(isokendpt(proc_e, &slot) != OK) {
 			printf("VFS: proc ep %d not ok\n", proc_e);
 			return;
 		}
 
-		fp = &fproc[slot];
-
-		if (fp->fp_flags & FP_PENDING) {
-			/* This process has a request pending, but PM wants it
-			 * gone. Forget about the pending request and satisfy
-			 * PM's request instead. Note that a pending request
-			 * AND an EXEC request are mutually exclusive. Also, PM
-			 * should send only one request/process at a time.
-			 */
-			 assert(fp->fp_job.j_m_in.m_source != PM_PROC_NR);
-		}
+		rfp = &fproc[slot];
 
 		/* PM requests on behalf of a proc are handled after the
 		 * system call that might be in progress for that proc has
-		 * finished. If the proc is not busy, we start a dummy call.
+		 * finished. If the proc is not busy, we start a new thread.
 		 */
-		if (!(fp->fp_flags & FP_PENDING) &&
-					mutex_trylock(&fp->fp_lock) == 0) {
-			mutex_unlock(&fp->fp_lock);
-			worker_start(do_dummy);
-			fp->fp_flags |= FP_DROP_WORK;
-		}
-
-		fp->fp_job.j_m_in = job_m_in;
-		fp->fp_flags |= FP_PM_PENDING;
+		worker_start(rfp, NULL, &m_in, FALSE /*use_spare*/);
 
 		return;
 	}
@@ -939,16 +776,16 @@ static void service_pm()
 		uid_t reuid;
 		gid_t regid;
 
-		pproc_e = job_m_in.PM_PPROC;
-		proc_e = job_m_in.PM_PROC;
-		child_pid = job_m_in.PM_CPID;
-		reuid = job_m_in.PM_REUID;
-		regid = job_m_in.PM_REGID;
+		pproc_e = m_in.PM_PPROC;
+		proc_e = m_in.PM_PROC;
+		child_pid = m_in.PM_CPID;
+		reuid = m_in.PM_REUID;
+		regid = m_in.PM_REGID;
 
 		pm_fork(pproc_e, proc_e, child_pid);
 		m_out.m_type = PM_FORK_REPLY;
 
-		if (job_call_nr == PM_SRV_FORK) {
+		if (call_nr == PM_SRV_FORK) {
 			m_out.m_type = PM_SRV_FORK_REPLY;
 			pm_setuid(proc_e, reuid, reuid);
 			pm_setgid(proc_e, regid, regid);
@@ -963,9 +800,9 @@ static void service_pm()
 		int group_no;
 		gid_t *group_addr;
 
-		proc_e = job_m_in.PM_PROC;
-		group_no = job_m_in.PM_GROUP_NO;
-		group_addr = (gid_t *) job_m_in.PM_GROUP_ADDR;
+		proc_e = m_in.PM_PROC;
+		group_no = m_in.PM_GROUP_NO;
+		group_addr = (gid_t *) m_in.PM_GROUP_ADDR;
 
 		pm_setgroups(proc_e, group_no, group_addr);
 
@@ -974,29 +811,21 @@ static void service_pm()
 	}
 	break;
 
-    case PM_UNPAUSE:
-	{
-		endpoint_t proc_e;
-
-		proc_e = job_m_in.PM_PROC;
-
-		unpause(proc_e);
-
-		m_out.m_type = PM_UNPAUSE_REPLY;
-		m_out.PM_PROC = proc_e;
-	}
-	break;
-
     case PM_REBOOT:
-	pm_reboot();
-
-	/* Reply dummy status to PM for synchronization */
-	m_out.m_type = PM_REBOOT_REPLY;
+	/* Reboot requests are not considered postponed PM work and are instead
+	 * handled from a separate worker thread that is associated with PM's
+	 * process. PM makes no regular VFS calls, and thus, from VFS's
+	 * perspective, PM is always idle. Therefore, we can safely do this.
+	 * We do assume that PM sends us only one PM_REBOOT message at once,
+	 * or ever for that matter. :)
+	 */
+	worker_start(fproc_addr(PM_PROC_NR), pm_reboot, &m_in,
+		FALSE /*use_spare*/);
 
-	break;
+	return;
 
     default:
-	printf("VFS: don't know how to handle PM request %d\n", job_call_nr);
+	printf("VFS: don't know how to handle PM request %d\n", call_nr);
 
 	return;
   }
@@ -1004,7 +833,6 @@ static void service_pm()
   r = send(PM_PROC_NR, &m_out);
   if (r != OK)
 	panic("service_pm: send failed: %d", r);
-
 }
 
 
@@ -1014,33 +842,41 @@ static void service_pm()
 static int unblock(rfp)
 struct fproc *rfp;
 {
+/* Unblock a process that was previously blocked on a pipe or a lock.  This is
+ * done by reconstructing the original request and continuing/repeating it.
+ * This function returns TRUE when it has restored a request for execution, and
+ * FALSE if the caller should continue looking for work to do.
+ */
   int blocked_on;
 
-  fp = rfp;
   blocked_on = rfp->fp_blocked_on;
+
+  assert(blocked_on == FP_BLOCKED_ON_PIPE || blocked_on == FP_BLOCKED_ON_LOCK);
+
+  /* READ, WRITE, FCNTL requests all use the same message layout. */
   m_in.m_source = rfp->fp_endpoint;
   m_in.m_type = rfp->fp_block_callnr;
-  m_in.fd = scratch(fp).file.fd_nr;
-  m_in.buffer = scratch(fp).io.io_buffer;
-  m_in.nbytes = scratch(fp).io.io_nbytes;
+  m_in.fd = scratch(rfp).file.fd_nr;
+  m_in.buffer = scratch(rfp).io.io_buffer;
+  m_in.nbytes = scratch(rfp).io.io_nbytes;
 
   rfp->fp_blocked_on = FP_BLOCKED_ON_NONE;	/* no longer blocked */
   rfp->fp_flags &= ~FP_REVIVED;
   reviving--;
   assert(reviving >= 0);
 
-  /* This should be a pipe I/O, not a device I/O. If it is, it'll 'leak'
-   * grants.
-   */
+  /* This should not be device I/O. If it is, it'll 'leak' grants. */
   assert(!GRANT_VALID(rfp->fp_grant));
 
-  /* Pending pipe reads/writes can be handled directly */
+  /* Pending pipe reads/writes cannot be repeated as is, and thus require a
+   * special resumption procedure.
+   */
   if (blocked_on == FP_BLOCKED_ON_PIPE) {
-	worker_start(do_pending_pipe);
-	yield();	/* Give thread a chance to run */
-	self = NULL;
-	return(0);	/* Retrieve more work */
+	worker_start(rfp, do_pending_pipe, &m_in, FALSE /*use_spare*/);
+	return(FALSE);	/* Retrieve more work */
   }
 
-  return(1);	/* We've unblocked a process */
+  /* A lock request. Repeat the original request as though it just came in. */
+  fp = rfp;
+  return(TRUE);	/* We've unblocked a process */
 }
diff --git a/servers/vfs/misc.c b/servers/vfs/misc.c
index 0ab27abc5..deec593a5 100644
--- a/servers/vfs/misc.c
+++ b/servers/vfs/misc.c
@@ -32,9 +32,7 @@
 #include <sys/svrctl.h>
 #include <sys/resource.h>
 #include "file.h"
-#include "fproc.h"
 #include "scratchpad.h"
-#include "dmap.h"
 #include <minix/vfsif.h>
 #include "vnode.h"
 #include "vmnt.h"
@@ -47,14 +45,7 @@
 unsigned long calls_stats[NCALLS];
 #endif
 
-static void free_proc(struct fproc *freed, int flags);
-/*
-static int dumpcore(int proc_e, struct mem_map *seg_ptr);
-static int write_bytes(struct inode *rip, off_t off, char *buf, size_t
-	bytes);
-static int write_seg(struct inode *rip, off_t off, int proc_e, int seg,
-	off_t seg_off, phys_bytes seg_bytes);
-*/
+static void free_proc(int flags);
 
 /*===========================================================================*
  *				do_getsysinfo				     *
@@ -306,7 +297,7 @@ int dupvm(struct fproc *rfp, int pfd, int *vmfd, struct filp **newfilp)
 {
 	int result, procfd;
 	struct filp *f = NULL;
-	struct fproc *vmf = &fproc[VM_PROC_NR];
+	struct fproc *vmf = fproc_addr(VM_PROC_NR);
 
 	*newfilp = NULL;
 
@@ -383,7 +374,7 @@ int do_vm_call(message *m_out)
 	if(isokendpt(ep, &slot) != OK) rfp = NULL;
 	else rfp = &fproc[slot];
 
-	vmf = &fproc[VM_PROC_NR];
+	vmf = fproc_addr(VM_PROC_NR);
 	assert(fp == vmf);
 	assert(rfp != vmf);
 
@@ -481,9 +472,14 @@ reqdone:
  *===========================================================================*/
 void pm_reboot()
 {
-/* Perform the VFS side of the reboot call. */
-  int i;
-  struct fproc *rfp;
+/* Perform the VFS side of the reboot call. This call is performed from the PM
+ * process context.
+ */
+  message m_out;
+  int i, r;
+  struct fproc *rfp, *pmfp;
+
+  pmfp = fp;
 
   sync_fses();
 
@@ -493,15 +489,25 @@ void pm_reboot()
    * them the chance to unmount (which should be possible as all normal
    * processes have no open files anymore).
    */
+  /* This is the only place where we allow special modification of "fp". The
+   * reboot procedure should really be implemented as a PM message broadcasted
+   * to all processes, so that each process will be shut down cleanly by a
+   * thread operating on its behalf. Doing everything here is simpler, but it
+   * requires an exception to the strict model of having "fp" be the process
+   * that owns the current worker thread.
+   */
   for (i = 0; i < NR_PROCS; i++) {
 	rfp = &fproc[i];
 
 	/* Don't just free the proc right away, but let it finish what it was
 	 * doing first */
-	lock_proc(rfp, 0);
-	if (rfp->fp_endpoint != NONE && find_vmnt(rfp->fp_endpoint) == NULL)
-		free_proc(rfp, 0);
-	unlock_proc(rfp);
+	if (rfp != fp) lock_proc(rfp);
+	if (rfp->fp_endpoint != NONE && find_vmnt(rfp->fp_endpoint) == NULL) {
+		worker_set_proc(rfp);	/* temporarily fake process context */
+		free_proc(0);
+		worker_set_proc(pmfp);	/* restore original process context */
+	}
+	if (rfp != fp) unlock_proc(rfp);
   }
 
   sync_fses();
@@ -513,15 +519,25 @@ void pm_reboot()
 
 	/* Don't just free the proc right away, but let it finish what it was
 	 * doing first */
-	lock_proc(rfp, 0);
-	if (rfp->fp_endpoint != NONE)
-		free_proc(rfp, 0);
-	unlock_proc(rfp);
+	if (rfp != fp) lock_proc(rfp);
+	if (rfp->fp_endpoint != NONE) {
+		worker_set_proc(rfp);	/* temporarily fake process context */
+		free_proc(0);
+		worker_set_proc(pmfp);	/* restore original process context */
+	}
+	if (rfp != fp) unlock_proc(rfp);
   }
 
   sync_fses();
   unmount_all(1 /* Force */);
 
+  /* Reply to PM for synchronization */
+  memset(&m_out, 0, sizeof(m_out));
+
+  m_out.m_type = PM_REBOOT_REPLY;
+
+  if ((r = send(PM_PROC_NR, &m_out)) != OK)
+	panic("pm_reboot: send failed: %d", r);
 }
 
 /*===========================================================================*
@@ -591,7 +607,7 @@ void pm_fork(endpoint_t pproc, endpoint_t cproc, pid_t cpid)
 /*===========================================================================*
  *				free_proc				     *
  *===========================================================================*/
-static void free_proc(struct fproc *exiter, int flags)
+static void free_proc(int flags)
 {
   int i;
   register struct fproc *rfp;
@@ -599,46 +615,43 @@ static void free_proc(struct fproc *exiter, int flags)
   register struct vnode *vp;
   dev_t dev;
 
-  if (exiter->fp_endpoint == NONE)
+  if (fp->fp_endpoint == NONE)
 	panic("free_proc: already free");
 
-  if (fp_is_blocked(exiter))
-	unpause(exiter->fp_endpoint);
+  if (fp_is_blocked(fp))
+	unpause();
 
   /* Loop on file descriptors, closing any that are open. */
   for (i = 0; i < OPEN_MAX; i++) {
-	(void) close_fd(exiter, i);
+	(void) close_fd(fp, i);
   }
 
   /* Release root and working directories. */
-  if (exiter->fp_rd) { put_vnode(exiter->fp_rd); exiter->fp_rd = NULL; }
-  if (exiter->fp_wd) { put_vnode(exiter->fp_wd); exiter->fp_wd = NULL; }
+  if (fp->fp_rd) { put_vnode(fp->fp_rd); fp->fp_rd = NULL; }
+  if (fp->fp_wd) { put_vnode(fp->fp_wd); fp->fp_wd = NULL; }
 
   /* The rest of these actions is only done when processes actually exit. */
   if (!(flags & FP_EXITING)) return;
 
-  exiter->fp_flags |= FP_EXITING;
+  fp->fp_flags |= FP_EXITING;
 
   /* Check if any process is SUSPENDed on this driver.
    * If a driver exits, unmap its entries in the dmap table.
    * (unmapping has to be done after the first step, because the
    * dmap table is used in the first step.)
    */
-  unsuspend_by_endpt(exiter->fp_endpoint);
-  dmap_unmap_by_endpt(exiter->fp_endpoint);
+  unsuspend_by_endpt(fp->fp_endpoint);
+  dmap_unmap_by_endpt(fp->fp_endpoint);
 
-  worker_stop_by_endpt(exiter->fp_endpoint); /* Unblock waiting threads */
-  vmnt_unmap_by_endpt(exiter->fp_endpoint); /* Invalidate open files if this
+  worker_stop_by_endpt(fp->fp_endpoint); /* Unblock waiting threads */
+  vmnt_unmap_by_endpt(fp->fp_endpoint); /* Invalidate open files if this
 					     * was an active FS */
 
-  /* Invalidate endpoint number for error and sanity checks. */
-  exiter->fp_endpoint = NONE;
-
   /* If a session leader exits and it has a controlling tty, then revoke
    * access to its controlling tty from all other processes using it.
    */
-  if ((exiter->fp_flags & FP_SESLDR) && exiter->fp_tty != 0) {
-      dev = exiter->fp_tty;
+  if ((fp->fp_flags & FP_SESLDR) && fp->fp_tty != 0) {
+      dev = fp->fp_tty;
       for (rfp = &fproc[0]; rfp < &fproc[NR_PROCS]; rfp++) {
 	  if(rfp->fp_pid == PID_FREE) continue;
           if (rfp->fp_tty == dev) rfp->fp_tty = 0;
@@ -660,25 +673,21 @@ static void free_proc(struct fproc *exiter, int flags)
   }
 
   /* Exit done. Mark slot as free. */
-  exiter->fp_pid = PID_FREE;
-  if (exiter->fp_flags & FP_PENDING)
-	pending--;	/* No longer pending job, not going to do it */
-  exiter->fp_flags = FP_NOFLAGS;
+  fp->fp_endpoint = NONE;
+  fp->fp_pid = PID_FREE;
+  fp->fp_flags = FP_NOFLAGS;
 }
 
 /*===========================================================================*
  *				pm_exit					     *
  *===========================================================================*/
-void pm_exit(proc)
-endpoint_t proc;
+void pm_exit(void)
 {
-/* Perform the file system portion of the exit(status) system call. */
-  int exitee_p;
+/* Perform the file system portion of the exit(status) system call.
+ * This function is called from the context of the exiting process.
+ */
 
-  /* Nevertheless, pretend that the call came from the user. */
-  okendpt(proc, &exitee_p);
-  fp = &fproc[exitee_p];
-  free_proc(fp, FP_EXITING);
+  free_proc(FP_EXITING);
 }
 
 /*===========================================================================*
@@ -844,21 +853,18 @@ int do_svrctl(message *UNUSED(m_out))
 /*===========================================================================*
  *				pm_dumpcore				     *
  *===========================================================================*/
-int pm_dumpcore(endpoint_t proc_e, int csig, vir_bytes exe_name)
+int pm_dumpcore(int csig, vir_bytes exe_name)
 {
-  int slot, r = OK, core_fd;
+  int r = OK, core_fd;
   struct filp *f;
   char core_path[PATH_MAX];
   char proc_name[PROC_NAME_LEN];
 
-  okendpt(proc_e, &slot);
-  fp = &fproc[slot];
-
   /* if a process is blocked, scratch(fp).file.fd_nr holds the fd it's blocked
    * on. free it up for use by common_open().
    */
   if (fp_is_blocked(fp))
-          unpause(fp->fp_endpoint);
+          unpause();
 
   /* open core file */
   snprintf(core_path, PATH_MAX, "%s.%d", CORE_NAME, fp->fp_pid);
@@ -878,15 +884,15 @@ int pm_dumpcore(endpoint_t proc_e, int csig, vir_bytes exe_name)
 
 core_exit:
   if(csig)
-	  free_proc(fp, FP_EXITING);
+	  free_proc(FP_EXITING);
   return(r);
 }
 
 /*===========================================================================*
  *				 ds_event				     *
  *===========================================================================*/
-void *
-ds_event(void *arg)
+void
+ds_event(void)
 {
   char key[DS_MAX_KEYLEN];
   char *blkdrv_prefix = "drv.blk.";
@@ -895,11 +901,6 @@ ds_event(void *arg)
   int type, r, is_blk;
   endpoint_t owner_endpoint;
 
-  struct job my_job;
-
-  my_job = *((struct job *) arg);
-  fp = my_job.j_fp;
-
   /* Get the event and the owner from DS. */
   while ((r = ds_check(key, &type, &owner_endpoint)) == OK) {
 	/* Only check for block and character driver up events. */
@@ -922,9 +923,6 @@ ds_event(void *arg)
   }
 
   if (r != ENOENT) printf("VFS: ds_event: ds_check failed: %d\n", r);
-
-  thread_cleanup(NULL);
-  return(NULL);
 }
 
 /* A function to be called on panic(). */
diff --git a/servers/vfs/mount.c b/servers/vfs/mount.c
index a5e61782d..ea47ed31d 100644
--- a/servers/vfs/mount.c
+++ b/servers/vfs/mount.c
@@ -24,8 +24,6 @@
 #include <dirent.h>
 #include <assert.h>
 #include "file.h"
-#include "fproc.h"
-#include "dmap.h"
 #include <minix/vfsif.h>
 #include "vnode.h"
 #include "vmnt.h"
diff --git a/servers/vfs/open.c b/servers/vfs/open.c
index d9ce851ef..843a6e389 100644
--- a/servers/vfs/open.c
+++ b/servers/vfs/open.c
@@ -18,9 +18,7 @@
 #include <minix/com.h>
 #include <minix/u64.h>
 #include "file.h"
-#include "fproc.h"
 #include "scratchpad.h"
-#include "dmap.h"
 #include "lock.h"
 #include "param.h"
 #include <dirent.h>
diff --git a/servers/vfs/path.c b/servers/vfs/path.c
index 260a047e0..6f6dc52df 100644
--- a/servers/vfs/path.c
+++ b/servers/vfs/path.c
@@ -18,11 +18,9 @@
 #include <sys/stat.h>
 #include <sys/un.h>
 #include <dirent.h>
-#include "threads.h"
 #include "vmnt.h"
 #include "vnode.h"
 #include "path.h"
-#include "fproc.h"
 #include "param.h"
 
 /* Set to following define to 1 if you really want to use the POSIX definition
diff --git a/servers/vfs/pipe.c b/servers/vfs/pipe.c
index d923ec546..83cc7fe66 100644
--- a/servers/vfs/pipe.c
+++ b/servers/vfs/pipe.c
@@ -27,9 +27,7 @@
 #include <sys/select.h>
 #include <sys/time.h>
 #include "file.h"
-#include "fproc.h"
 #include "scratchpad.h"
-#include "dmap.h"
 #include "param.h"
 #include <minix/vfsif.h>
 #include "vnode.h"
@@ -523,29 +521,27 @@ void revive(endpoint_t proc_e, int returned)
 /*===========================================================================*
  *				unpause					     *
  *===========================================================================*/
-void unpause(endpoint_t proc_e)
+void unpause(void)
 {
 /* A signal has been sent to a user who is paused on the file system.
  * Abort the system call with the EINTR error message.
  */
-
-  register struct fproc *rfp, *org_fp;
-  int slot, blocked_on, fild, status = EINTR;
+  int blocked_on, fild, status = EINTR;
   struct filp *f;
   dev_t dev;
   int wasreviving = 0;
 
-  if (isokendpt(proc_e, &slot) != OK) {
-	printf("VFS: ignoring unpause for bogus endpoint %d\n", proc_e);
-	return;
-  }
+  if (!fp_is_blocked(fp)) return;
+  blocked_on = fp->fp_blocked_on;
 
-  rfp = &fproc[slot];
-  if (!fp_is_blocked(rfp)) return;
-  blocked_on = rfp->fp_blocked_on;
+  /* Clear the block status now. The procedure below might make blocking calls
+   * and it is imperative that while at least dev_cancel() is executing, other
+   * parts of VFS do not perceive this process as blocked on something.
+   */
+  fp->fp_blocked_on = FP_BLOCKED_ON_NONE;
 
-  if (rfp->fp_flags & FP_REVIVED) {
-	rfp->fp_flags &= ~FP_REVIVED;
+  if (fp->fp_flags & FP_REVIVED) {
+	fp->fp_flags &= ~FP_REVIVED;
 	reviving--;
 	wasreviving = 1;
   }
@@ -565,30 +561,27 @@ void unpause(endpoint_t proc_e)
 		break;
 
 	case FP_BLOCKED_ON_OTHER:/* process trying to do device I/O (e.g. tty)*/
-		if (rfp->fp_flags & FP_SUSP_REOPEN) {
+		if (fp->fp_flags & FP_SUSP_REOPEN) {
 			/* Process is suspended while waiting for a reopen.
 			 * Just reply EINTR.
 			 */
-			rfp->fp_flags &= ~FP_SUSP_REOPEN;
+			fp->fp_flags &= ~FP_SUSP_REOPEN;
 			status = EINTR;
 			break;
 		}
 
-		fild = scratch(rfp).file.fd_nr;
+		fild = scratch(fp).file.fd_nr;
 		if (fild < 0 || fild >= OPEN_MAX)
 			panic("file descriptor out-of-range");
-		f = rfp->fp_filp[fild];
+		f = fp->fp_filp[fild];
 		if(!f) {
-			sys_sysctl_stacktrace(rfp->fp_endpoint);
+			sys_sysctl_stacktrace(fp->fp_endpoint);
 			panic("process %d blocked on empty fd %d",
-				rfp->fp_endpoint, fild);
+				fp->fp_endpoint, fild);
 		}
 		dev = (dev_t) f->filp_vno->v_sdev;	/* device hung on */
 
-		org_fp = fp;
-		fp = rfp;	/* hack - ctty_io uses fp */
 		status = dev_cancel(dev);
-		fp = org_fp;
 
 		break;
 	default :
@@ -600,9 +593,5 @@ void unpause(endpoint_t proc_e)
 	susp_count--;
   }
 
-  if(rfp->fp_blocked_on != FP_BLOCKED_ON_NONE) {
-	rfp->fp_blocked_on = FP_BLOCKED_ON_NONE;
-	replycode(proc_e, status);	/* signal interrupted call */
-  }
+  replycode(fp->fp_endpoint, status);	/* signal interrupted call */
 }
-
diff --git a/servers/vfs/protect.c b/servers/vfs/protect.c
index 4f76bc9e2..d719e3e07 100644
--- a/servers/vfs/protect.c
+++ b/servers/vfs/protect.c
@@ -14,7 +14,6 @@
 #include <assert.h>
 #include <minix/callnr.h>
 #include "file.h"
-#include "fproc.h"
 #include "path.h"
 #include "param.h"
 #include <minix/vfsif.h>
diff --git a/servers/vfs/proto.h b/servers/vfs/proto.h
index 1b09bea15..96fac650e 100644
--- a/servers/vfs/proto.h
+++ b/servers/vfs/proto.h
@@ -75,12 +75,11 @@ int map_service(struct rprocpub *rpub);
 void write_elf_core_file(struct filp *f, int csig, char *exe_name);
 
 /* exec.c */
-int pm_exec(endpoint_t proc_e, vir_bytes path, size_t path_len, vir_bytes frame,
-	size_t frame_len, vir_bytes *pc, vir_bytes *newsp, vir_bytes *ps_str,
-	int flags);
+int pm_exec(vir_bytes path, size_t path_len, vir_bytes frame, size_t frame_len,
+	vir_bytes *pc, vir_bytes *newsp, vir_bytes *ps_str, int flags);
 
 /* filedes.c */
-void *do_filp_gc(void *arg);
+int do_filp_gc(void);
 void check_filp_locks(void);
 void check_filp_locks_by_me(void);
 void init_filps(void);
@@ -124,14 +123,15 @@ void lock_revive(void);
 
 /* main.c */
 int main(void);
-void lock_proc(struct fproc *rfp, int force_lock);
+void lock_proc(struct fproc *rfp);
+void unlock_proc(struct fproc *rfp);
 void reply(message *m_out, endpoint_t whom, int result);
 void replycode(endpoint_t whom, int result);
-void thread_cleanup(struct fproc *rfp);
-void unlock_proc(struct fproc *rfp);
+void service_pm_postponed(void);
+void thread_cleanup(void);
 
 /* misc.c */
-void pm_exit(endpoint_t proc);
+void pm_exit(void);
 int do_fcntl(message *m_out);
 void pm_fork(endpoint_t pproc, endpoint_t cproc, pid_t cpid);
 void pm_setgid(endpoint_t proc_e, int egid, int rgid);
@@ -143,8 +143,8 @@ void pm_reboot(void);
 int do_svrctl(message *m_out);
 int do_getsysinfo(void);
 int do_vm_call(message *m_out);
-int pm_dumpcore(endpoint_t proc_e, int sig, vir_bytes exe_name);
-void * ds_event(void *arg);
+int pm_dumpcore(int sig, vir_bytes exe_name);
+void ds_event(void);
 int dupvm(struct fproc *fp, int pfd, int *vmfd, struct filp **f);
 int do_getrusage(message *m_out);
 
@@ -190,7 +190,7 @@ int do_check_perms(message *m_out);
 int do_pipe(message *m_out);
 int do_pipe2(message *m_out);
 int map_vnode(struct vnode *vp, endpoint_t fs_e);
-void unpause(endpoint_t proc_e);
+void unpause(void);
 int pipe_check(struct filp *filp, int rw_flag, int oflags, int bytes,
 	int notouch);
 void release(struct vnode *vp, int op, int count);
@@ -364,15 +364,17 @@ void select_timeout_check(timer_t *);
 void select_unsuspend_by_endpt(endpoint_t proc);
 
 /* worker.c */
+void worker_init(void);
 int worker_available(void);
 struct worker_thread *worker_get(thread_t worker_tid);
-struct job *worker_getjob(thread_t worker_tid);
-void worker_init(struct worker_thread *worker);
 void worker_signal(struct worker_thread *worker);
-void worker_start(void *(*func)(void *arg));
+int worker_can_start(struct fproc *rfp);
+void worker_start(struct fproc *rfp, void (*func)(void), message *m_ptr,
+	int use_spare);
 void worker_stop(struct worker_thread *worker);
 void worker_stop_by_endpt(endpoint_t proc_e);
 void worker_wait(void);
-void sys_worker_start(void *(*func)(void *arg));
-void dl_worker_start(void *(*func)(void *arg));
+struct worker_thread *worker_suspend(void);
+void worker_resume(struct worker_thread *org_self);
+void worker_set_proc(struct fproc *rfp);
 #endif
diff --git a/servers/vfs/read.c b/servers/vfs/read.c
index 61fcd6de4..fd3f9fcb6 100644
--- a/servers/vfs/read.c
+++ b/servers/vfs/read.c
@@ -20,7 +20,6 @@
 #include <fcntl.h>
 #include <unistd.h>
 #include "file.h"
-#include "fproc.h"
 #include "param.h"
 #include "scratchpad.h"
 #include "vnode.h"
@@ -42,20 +41,17 @@ int do_read(message *UNUSED(m_out))
  *===========================================================================*/
 void lock_bsf(void)
 {
-  struct fproc *org_fp;
   struct worker_thread *org_self;
 
   if (mutex_trylock(&bsf_lock) == 0)
 	return;
 
-  org_fp = fp;
-  org_self = self;
+  org_self = worker_suspend();
 
   if (mutex_lock(&bsf_lock) != 0)
 	panic("unable to lock block special file lock");
 
-  fp = org_fp;
-  self = org_self;
+  worker_resume(org_self);
 }
 
 /*===========================================================================*
diff --git a/servers/vfs/request.c b/servers/vfs/request.c
index d60eb8d1d..e6a91a780 100644
--- a/servers/vfs/request.c
+++ b/servers/vfs/request.c
@@ -19,8 +19,6 @@
 #include <string.h>
 #include <unistd.h>
 #include <time.h>
-#include "fproc.h"
-#include "param.h"
 #include "path.h"
 #include "vmnt.h"
 #include "vnode.h"
diff --git a/servers/vfs/select.c b/servers/vfs/select.c
index f743c6e57..073d1cea3 100644
--- a/servers/vfs/select.c
+++ b/servers/vfs/select.c
@@ -25,8 +25,6 @@
 #include <assert.h>
 
 #include "file.h"
-#include "fproc.h"
-#include "dmap.h"
 #include "vnode.h"
 
 /* max. number of simultaneously pending select() calls */
@@ -734,7 +732,6 @@ void select_timeout_check(timer_t *timer)
 
   se = &selecttab[s];
   if (se->requestor == NULL) return;
-  fp = se->requestor;
   if (se->expiry <= 0) return;	/* Strange, did we even ask for a timeout? */
   se->expiry = 0;
   if (is_deferred(se)) return;	/* Wait for initial replies to DEV_SELECT */
diff --git a/servers/vfs/stadir.c b/servers/vfs/stadir.c
index 85c0d6b36..badcf10f8 100644
--- a/servers/vfs/stadir.c
+++ b/servers/vfs/stadir.c
@@ -18,7 +18,6 @@
 #include <minix/u64.h>
 #include <string.h>
 #include "file.h"
-#include "fproc.h"
 #include "path.h"
 #include "param.h"
 #include <minix/vfsif.h>
diff --git a/servers/vfs/table.c b/servers/vfs/table.c
index 285f90100..53f861ecb 100644
--- a/servers/vfs/table.c
+++ b/servers/vfs/table.c
@@ -8,7 +8,6 @@
 #include <minix/callnr.h>
 #include <minix/com.h>
 #include "file.h"
-#include "fproc.h"
 #include "lock.h"
 #include "scratchpad.h"
 #include "vnode.h"
diff --git a/servers/vfs/threads.h b/servers/vfs/threads.h
index cddb73e9a..93ca0be34 100644
--- a/servers/vfs/threads.h
+++ b/servers/vfs/threads.h
@@ -1,7 +1,6 @@
 #ifndef __VFS_WORKERS_H__
 #define __VFS_WORKERS_H__
 #include <minix/mthread.h>
-#include "job.h"
 
 #define thread_t	mthread_thread_t
 #define mutex_t		mthread_mutex_t
@@ -23,12 +22,15 @@
 #define cond_wait	mthread_cond_wait
 #define cond_signal	mthread_cond_signal
 
+struct fproc;
+
 struct worker_thread {
   thread_t w_tid;
   mutex_t w_event_mutex;
   cond_t w_event;
-  struct job w_job;
   struct fproc *w_fp;
+  message w_msg;
+  int w_err_code;
   message *w_fs_sendrec;
   message *w_drv_sendrec;
   endpoint_t w_task;
diff --git a/servers/vfs/time.c b/servers/vfs/time.c
index d3b42669b..4de1cd747 100644
--- a/servers/vfs/time.c
+++ b/servers/vfs/time.c
@@ -13,7 +13,6 @@
 #include <sys/stat.h>
 #include <fcntl.h>
 #include "file.h"
-#include "fproc.h"
 #include "path.h"
 #include "param.h"
 #include "vnode.h"
diff --git a/servers/vfs/utility.c b/servers/vfs/utility.c
index 687b53d95..b35d2809d 100644
--- a/servers/vfs/utility.c
+++ b/servers/vfs/utility.c
@@ -20,7 +20,6 @@
 #include <assert.h>
 #include <time.h>
 #include "file.h"
-#include "fproc.h"
 #include "param.h"
 #include "vmnt.h"
 
diff --git a/servers/vfs/vmnt.c b/servers/vfs/vmnt.c
index 00e9a7dc0..b7e3d6567 100644
--- a/servers/vfs/vmnt.c
+++ b/servers/vfs/vmnt.c
@@ -3,11 +3,9 @@
  */
 
 #include "fs.h"
-#include "threads.h"
 #include "vmnt.h"
 #include <assert.h>
 #include <string.h>
-#include "fproc.h"
 
 static int is_vmnt_locked(struct vmnt *vmp);
 static void clear_vmnt(struct vmnt *vmp);
diff --git a/servers/vfs/vnode.c b/servers/vfs/vnode.c
index 022711748..438c057ba 100644
--- a/servers/vfs/vnode.c
+++ b/servers/vfs/vnode.c
@@ -9,10 +9,8 @@
  */
 
 #include "fs.h"
-#include "threads.h"
 #include "vnode.h"
 #include "vmnt.h"
-#include "fproc.h"
 #include "file.h"
 #include <minix/vfsif.h>
 #include <assert.h>
diff --git a/servers/vfs/worker.c b/servers/vfs/worker.c
index 8b9f20f27..2eb74c270 100644
--- a/servers/vfs/worker.c
+++ b/servers/vfs/worker.c
@@ -1,18 +1,10 @@
 #include "fs.h"
-#include "glo.h"
-#include "fproc.h"
-#include "threads.h"
-#include "job.h"
 #include <assert.h>
 
-static void append_job(struct job *job, void *(*func)(void *arg));
-static void get_work(struct worker_thread *worker);
+static void worker_get_work(void);
 static void *worker_main(void *arg);
-static void worker_sleep(struct worker_thread *worker);
+static void worker_sleep(void);
 static void worker_wake(struct worker_thread *worker);
-static int worker_waiting_for(struct worker_thread *worker, endpoint_t
-	proc_e);
-static int init = 0;
 static mthread_attr_t tattr;
 
 #ifdef MKCOVERAGE
@@ -21,66 +13,60 @@ static mthread_attr_t tattr;
 # define TH_STACKSIZE (28 * 1024)
 #endif
 
-#define ASSERTW(w) assert((w) == &sys_worker || (w) == &dl_worker || \
-		   ((w) >= &workers[0] && (w) < &workers[NR_WTHREADS]));
+#define ASSERTW(w) assert((w) >= &workers[0] && (w) < &workers[NR_WTHREADS])
 
 /*===========================================================================*
  *				worker_init				     *
  *===========================================================================*/
-void worker_init(struct worker_thread *wp)
+void worker_init(void)
 {
 /* Initialize worker thread */
-  if (!init) {
-	threads_init();
-	if (mthread_attr_init(&tattr) != 0)
-		panic("failed to initialize attribute");
-	if (mthread_attr_setstacksize(&tattr, TH_STACKSIZE) != 0)
-		panic("couldn't set default thread stack size");
-	if (mthread_attr_setdetachstate(&tattr, MTHREAD_CREATE_DETACHED) != 0)
-		panic("couldn't set default thread detach state");
-	invalid_thread_id = mthread_self(); /* Assuming we're the main thread*/
-	pending = 0;
-	init = 1;
+  struct worker_thread *wp;
+  int i;
+
+  threads_init();
+  if (mthread_attr_init(&tattr) != 0)
+	panic("failed to initialize attribute");
+  if (mthread_attr_setstacksize(&tattr, TH_STACKSIZE) != 0)
+	panic("couldn't set default thread stack size");
+  if (mthread_attr_setdetachstate(&tattr, MTHREAD_CREATE_DETACHED) != 0)
+	panic("couldn't set default thread detach state");
+  pending = 0;
+
+  for (i = 0; i < NR_WTHREADS; i++) {
+	wp = &workers[i];
+
+	wp->w_fp = NULL;		/* Mark not in use */
+	wp->w_next = NULL;
+	if (mutex_init(&wp->w_event_mutex, NULL) != 0)
+		panic("failed to initialize mutex");
+	if (cond_init(&wp->w_event, NULL) != 0)
+		panic("failed to initialize conditional variable");
+	if (mthread_create(&wp->w_tid, &tattr, worker_main, (void *) wp) != 0)
+		panic("unable to start thread");
   }
 
-  ASSERTW(wp);
-
-  wp->w_job.j_func = NULL;		/* Mark not in use */
-  wp->w_next = NULL;
-  if (mutex_init(&wp->w_event_mutex, NULL) != 0)
-	panic("failed to initialize mutex");
-  if (cond_init(&wp->w_event, NULL) != 0)
-	panic("failed to initialize conditional variable");
-  if (mthread_create(&wp->w_tid, &tattr, worker_main, (void *) wp) != 0)
-	panic("unable to start thread");
-  yield();
+  /* Let all threads get ready to accept work. */
+  yield_all();
 }
 
 /*===========================================================================*
- *				get_work				     *
+ *				worker_get_work				     *
  *===========================================================================*/
-static void get_work(struct worker_thread *worker)
+static void worker_get_work(void)
 {
 /* Find new work to do. Work can be 'queued', 'pending', or absent. In the
- * latter case wait for new work to come in. */
-
-  struct job *new_job;
+ * latter case wait for new work to come in.
+ */
   struct fproc *rfp;
 
-  ASSERTW(worker);
-  self = worker;
-
   /* Do we have queued work to do? */
-  if ((new_job = worker->w_job.j_next) != NULL) {
-	worker->w_job = *new_job;
-	free(new_job);
-	return;
-  } else if (worker != &sys_worker && worker != &dl_worker && pending > 0) {
+  if (pending > 0) {
 	/* Find pending work */
 	for (rfp = &fproc[0]; rfp < &fproc[NR_PROCS]; rfp++) {
 		if (rfp->fp_flags & FP_PENDING) {
-			worker->w_job = rfp->fp_job;
-			rfp->fp_job.j_func = NULL;
+			self->w_fp = rfp;
+			rfp->fp_worker = self;
 			rfp->fp_flags &= ~FP_PENDING; /* No longer pending */
 			pending--;
 			assert(pending >= 0);
@@ -91,11 +77,11 @@ static void get_work(struct worker_thread *worker)
   }
 
   /* Wait for work to come to us */
-  worker_sleep(worker);
+  worker_sleep();
 }
 
 /*===========================================================================*
- *				worker_available				     *
+ *				worker_available			     *
  *===========================================================================*/
 int worker_available(void)
 {
@@ -103,7 +89,7 @@ int worker_available(void)
 
   busy = 0;
   for (i = 0; i < NR_WTHREADS; i++) {
-	if (workers[i].w_job.j_func != NULL)
+	if (workers[i].w_fp != NULL)
 		busy++;
   }
 
@@ -116,151 +102,207 @@ int worker_available(void)
 static void *worker_main(void *arg)
 {
 /* Worker thread main loop */
-  struct worker_thread *me;
 
-  me = (struct worker_thread *) arg;
-  ASSERTW(me);
+  self = (struct worker_thread *) arg;
+  ASSERTW(self);
 
   while(TRUE) {
-	get_work(me);
+	worker_get_work();
+
+	fp = self->w_fp;
+	assert(fp->fp_worker == self);
+
+	/* Lock the process. */
+	lock_proc(fp);
+
+	/* The following two blocks could be run in a loop until both the
+	 * conditions are no longer met, but it is currently impossible that
+	 * more normal work is present after postponed PM work has been done.
+	 */
 
-	/* Register ourselves in fproc table if possible */
-	if (me->w_job.j_fp != NULL) {
-		me->w_job.j_fp->fp_wtid = me->w_tid;
+	/* Perform normal work, if any. */
+	if (fp->fp_func != NULL) {
+		self->w_msg = fp->fp_msg;
+		err_code = OK;
+
+		fp->fp_func();
+
+		fp->fp_func = NULL;	/* deliberately unset AFTER the call */
 	}
 
-	/* Carry out work */
-	me->w_job.j_func(&me->w_job);
+	/* Perform postponed PM work, if any. */
+	if (fp->fp_flags & FP_PM_WORK) {
+		self->w_msg = fp->fp_pm_msg;
+
+		service_pm_postponed();
 
-	/* Deregister if possible */
-	if (me->w_job.j_fp != NULL) {
-		me->w_job.j_fp->fp_wtid = invalid_thread_id;
+		fp->fp_flags &= ~FP_PM_WORK;
 	}
 
-	/* Mark ourselves as done */
-	me->w_job.j_func = NULL;
-	me->w_job.j_fp = NULL;
-  }
+	/* Perform cleanup actions. */
+	thread_cleanup();
 
-  return(NULL);	/* Unreachable */
-}
+	unlock_proc(fp);
 
-/*===========================================================================*
- *				dl_worker_start				     *
- *===========================================================================*/
-void dl_worker_start(void *(*func)(void *arg))
-{
-/* Start the deadlock resolving worker. This worker is reserved to run in case
- * all other workers are busy and we have to have an additional worker to come
- * to the rescue. */
-  assert(dl_worker.w_job.j_func == NULL);
-
-  if (dl_worker.w_job.j_func == NULL) {
-	dl_worker.w_job.j_fp = fp;
-	dl_worker.w_job.j_m_in = m_in;
-	dl_worker.w_job.j_func = func;
-	worker_wake(&dl_worker);
+	fp->fp_worker = NULL;
+	self->w_fp = NULL;
   }
-}
 
-/*===========================================================================*
- *				sys_worker_start			     *
- *===========================================================================*/
-void sys_worker_start(void *(*func)(void *arg))
-{
-/* Carry out work for the system (i.e., kernel or PM). If this thread is idle
- * do it right away, else create new job and append it to the queue. */
-
-  if (sys_worker.w_job.j_func == NULL) {
-	sys_worker.w_job.j_fp = fp;
-	sys_worker.w_job.j_m_in = m_in;
-	sys_worker.w_job.j_func = func;
-	worker_wake(&sys_worker);
-  } else {
-	append_job(&sys_worker.w_job, func);
-  }
+  return(NULL);	/* Unreachable */
 }
 
 /*===========================================================================*
- *				append_job				     *
+ *				worker_can_start			     *
  *===========================================================================*/
-static void append_job(struct job *job, void *(*func)(void *arg))
+int worker_can_start(struct fproc *rfp)
 {
-/* Append a job */
-
-  struct job *new_job, *tail;
-
-  /* Create new job */
-  new_job = calloc(1, sizeof(struct job));
-  assert(new_job != NULL);
-  new_job->j_fp = fp;
-  new_job->j_m_in = m_in;
-  new_job->j_func = func;
-  new_job->j_next = NULL;
-  new_job->j_err_code = OK;
-
-  /* Append to queue */
-  tail = job;
-  while (tail->j_next != NULL) tail = tail->j_next;
-  tail->j_next = new_job;
+/* Return whether normal (non-PM) work can be started for the given process.
+ * This function is used to serialize invocation of "special" procedures, and
+ * not entirely safe for other cases, as explained in the comments below.
+ */
+  int is_pending, is_active, has_normal_work, has_pm_work;
+
+  is_pending = (rfp->fp_flags & FP_PENDING);
+  is_active = (rfp->fp_worker != NULL);
+  has_normal_work = (rfp->fp_func != NULL);
+  has_pm_work = (rfp->fp_flags & FP_PM_WORK);
+
+  /* If there is no work scheduled for the process, we can start work. */
+  if (!is_pending && !is_active) return TRUE;
+
+  /* If there is already normal work scheduled for the process, we cannot add
+   * more, since we support only one normal job per process.
+   */
+  if (has_normal_work) return FALSE;
+
+  /* If this process has pending PM work but no normal work, we can add the
+   * normal work for execution before the worker will start.
+   */
+  if (is_pending) return TRUE;
+
+  /* However, if a worker is active for PM work, we cannot add normal work
+   * either, because the work will not be considered. For this reason, we can
+   * not use this function for processes that can possibly get postponed PM
+   * work. It is still safe for core system processes, though.
+   */
+  return FALSE;
 }
 
 /*===========================================================================*
- *				worker_start				     *
+ *				worker_try_activate			     *
  *===========================================================================*/
-void worker_start(void *(*func)(void *arg))
+static void worker_try_activate(struct fproc *rfp, int use_spare)
 {
-/* Find an available worker or wait for one */
-  int i;
+/* See if we can wake up a thread to do the work scheduled for the given
+ * process. If not, mark the process as having pending work for later.
+ */
+  int i, available, needed;
   struct worker_thread *worker;
 
-  if (fp->fp_flags & FP_DROP_WORK) {
-	return;	/* This process is not allowed to accept new work */
-  }
+  /* Use the last available thread only if requested. Otherwise, leave at least
+   * one spare thread for deadlock resolution.
+   */
+  needed = use_spare ? 1 : 2;
 
   worker = NULL;
-  for (i = 0; i < NR_WTHREADS; i++) {
-	if (workers[i].w_job.j_func == NULL) {
-		worker = &workers[i];
-		break;
+  for (i = available = 0; i < NR_WTHREADS; i++) {
+	if (workers[i].w_fp == NULL) {
+		if (worker == NULL)
+			worker = &workers[i];
+		if (++available >= needed)
+			break;
 	}
   }
 
-  if (worker != NULL) {
-	worker->w_job.j_fp = fp;
-	worker->w_job.j_m_in = m_in;
-	worker->w_job.j_func = func;
-	worker->w_job.j_next = NULL;
-	worker->w_job.j_err_code = OK;
+  if (available >= needed) {
+	assert(worker != NULL);
+	rfp->fp_worker = worker;
+	worker->w_fp = rfp;
 	worker_wake(worker);
-	return;
+  } else {
+	rfp->fp_flags |= FP_PENDING;
+	pending++;
   }
+}
 
-  /* No worker threads available, let's wait for one to finish. */
-  /* If this process already has a job scheduled, forget about this new
-   * job;
-   *  - the new job is do_dummy and we have already scheduled an actual job
-   *  - the new job is an actual job and we have already scheduled do_dummy in
-   *    order to exit this proc, so doing the new job is pointless. */
-  if (fp->fp_job.j_func == NULL) {
-	assert(!(fp->fp_flags & FP_PENDING));
-	fp->fp_job.j_fp = fp;
-	fp->fp_job.j_m_in = m_in;
-	fp->fp_job.j_func = func;
-	fp->fp_job.j_next = NULL;
-	fp->fp_job.j_err_code = OK;
-	fp->fp_flags |= FP_PENDING;
-	pending++;
+/*===========================================================================*
+ *				worker_start				     *
+ *===========================================================================*/
+void worker_start(struct fproc *rfp, void (*func)(void), message *m_ptr,
+	int use_spare)
+{
+/* Schedule work to be done by a worker thread. The work is bound to the given
+ * process. If a function pointer is given, the work is considered normal work,
+ * and the function will be called to handle it. If the function pointer is
+ * NULL, the work is considered postponed PM work, and service_pm_postponed
+ * will be called to handle it. The input message will be a copy of the given
+ * message. Optionally, the last spare (deadlock-resolving) thread may be used
+ * to execute the work immediately.
+ */
+  int is_pm_work, is_pending, is_active, has_normal_work, has_pm_work;
+
+  assert(rfp != NULL);
+
+  is_pm_work = (func == NULL);
+  is_pending = (rfp->fp_flags & FP_PENDING);
+  is_active = (rfp->fp_worker != NULL);
+  has_normal_work = (rfp->fp_func != NULL);
+  has_pm_work = (rfp->fp_flags & FP_PM_WORK);
+
+  /* Sanity checks. If any of these trigger, someone messed up badly! */
+  if (is_pending || is_active) {
+	if (is_pending && is_active)
+		panic("work cannot be both pending and active");
+
+	/* The process cannot make more than one call at once. */
+	if (!is_pm_work && has_normal_work)
+		panic("process has two calls (%x, %x)",
+			rfp->fp_msg.m_type, m_ptr->m_type);
+
+	/* PM will not send more than one job per process to us at once. */
+	if (is_pm_work && has_pm_work)
+		panic("got two calls from PM (%x, %x)",
+			rfp->fp_pm_msg.m_type, m_ptr->m_type);
+
+	/* Despite PM's sys_delay_stop() system, it is possible that normal
+	 * work (in particular, do_pending_pipe) arrives after postponed PM
+	 * work has been scheduled for execution, so we don't check for that.
+	 */
+#if 0
+	printf("VFS: adding %s work to %s thread\n",
+		is_pm_work ? "PM" : "normal",
+		is_pending ? "pending" : "active");
+#endif
+  } else {
+	/* Some cleanup step forgotten somewhere? */
+	if (has_normal_work || has_pm_work)
+		panic("worker administration error");
   }
+
+  /* Save the work to be performed. */
+  if (!is_pm_work) {
+	rfp->fp_msg = *m_ptr;
+	rfp->fp_func = func;
+  } else {
+	rfp->fp_pm_msg = *m_ptr;
+	rfp->fp_flags |= FP_PM_WORK;
+  }
+
+  /* If we have not only added to existing work, go look for a free thread.
+   * Note that we won't be using the spare thread for normal work if there is
+   * already PM work pending, but that situation will never occur in practice.
+   */
+  if (!is_pending && !is_active)
+	worker_try_activate(rfp, use_spare);
 }
 
 /*===========================================================================*
  *				worker_sleep				     *
  *===========================================================================*/
-static void worker_sleep(struct worker_thread *worker)
+static void worker_sleep(void)
 {
+  struct worker_thread *worker = self;
   ASSERTW(worker);
-  assert(self == worker);
   if (mutex_lock(&worker->w_event_mutex) != 0)
 	panic("unable to lock event mutex");
   if (cond_wait(&worker->w_event, &worker->w_event_mutex) != 0)
@@ -285,16 +327,55 @@ static void worker_wake(struct worker_thread *worker)
 	panic("unable to unlock event mutex");
 }
 
+/*===========================================================================*
+ *				worker_suspend				     *
+ *===========================================================================*/
+struct worker_thread *worker_suspend(void)
+{
+/* Suspend the current thread, saving certain thread variables. Return a
+ * pointer to the thread's worker structure for later resumption.
+ */
+
+  ASSERTW(self);
+  assert(fp != NULL);
+  assert(self->w_fp == fp);
+  assert(fp->fp_worker == self);
+
+  self->w_err_code = err_code;
+
+  return self;
+}
+
+/*===========================================================================*
+ *				worker_resume				     *
+ *===========================================================================*/
+void worker_resume(struct worker_thread *org_self)
+{
+/* Resume the current thread after suspension, restoring thread variables. */
+
+  ASSERTW(org_self);
+
+  self = org_self;
+
+  fp = self->w_fp;
+  assert(fp != NULL);
+
+  err_code = self->w_err_code;
+}
+
 /*===========================================================================*
  *				worker_wait				     *
  *===========================================================================*/
 void worker_wait(void)
 {
-  self->w_job.j_err_code = err_code;
-  worker_sleep(self);
+/* Put the current thread to sleep until woken up by the main thread. */
+
+  (void) worker_suspend(); /* worker_sleep already saves and restores 'self' */
+
+  worker_sleep();
+
   /* We continue here after waking up */
-  fp = self->w_job.j_fp;	/* Restore global data */
-  err_code = self->w_job.j_err_code;
+  worker_resume(self);
   assert(self->w_next == NULL);
 }
 
@@ -323,7 +404,9 @@ void worker_stop(struct worker_thread *worker)
 		panic("reply storage consistency error");	/* Oh dear */
 	}
   } else {
-	worker->w_job.j_m_in.m_type = EIO;
+	/* This shouldn't happen at all... */
+	printf("VFS: stopping worker not blocked on any task?\n");
+	util_stacktrace();
   }
   worker_wake(worker);
 }
@@ -338,12 +421,9 @@ void worker_stop_by_endpt(endpoint_t proc_e)
 
   if (proc_e == NONE) return;
 
-  if (worker_waiting_for(&sys_worker, proc_e)) worker_stop(&sys_worker);
-  if (worker_waiting_for(&dl_worker, proc_e)) worker_stop(&dl_worker);
-
   for (i = 0; i < NR_WTHREADS; i++) {
 	worker = &workers[i];
-	if (worker_waiting_for(worker, proc_e))
+	if (worker->w_fp != NULL && worker->w_task == proc_e)
 		worker_stop(worker);
   }
 }
@@ -354,52 +434,36 @@ void worker_stop_by_endpt(endpoint_t proc_e)
 struct worker_thread *worker_get(thread_t worker_tid)
 {
   int i;
-  struct worker_thread *worker;
 
-  worker = NULL;
-  if (worker_tid == sys_worker.w_tid)
-	worker = &sys_worker;
-  else if (worker_tid == dl_worker.w_tid)
-	worker = &dl_worker;
-  else {
-	for (i = 0; i < NR_WTHREADS; i++) {
-		if (workers[i].w_tid == worker_tid) {
-			worker = &workers[i];
-			break;
-		}
-	}
-  }
+  for (i = 0; i < NR_WTHREADS; i++)
+	if (workers[i].w_tid == worker_tid)
+		return(&workers[i]);
 
-  return(worker);
+  return(NULL);
 }
 
 /*===========================================================================*
- *				worker_getjob				     *
+ *				worker_set_proc				     *
  *===========================================================================*/
-struct job *worker_getjob(thread_t worker_tid)
+void worker_set_proc(struct fproc *rfp)
 {
-  struct worker_thread *worker;
+/* Perform an incredibly ugly action that completely violates the threading
+ * model: change the current working thread's process context to another
+ * process. The caller is expected to hold the lock to both the calling and the
+ * target process, and neither process is expected to continue regular
+ * operation when done. This code is here *only* and *strictly* for the reboot
+ * code, and *must not* be used for anything else.
+ */
 
-  if ((worker = worker_get(worker_tid)) != NULL)
-	return(&worker->w_job);
+  if (fp == rfp) return;
 
-  return(NULL);
-}
+  if (rfp->fp_worker != NULL)
+	panic("worker_set_proc: target process not idle");
 
-/*===========================================================================*
- *				worker_waiting_for			     *
- *===========================================================================*/
-static int worker_waiting_for(struct worker_thread *worker, endpoint_t proc_e)
-{
-  ASSERTW(worker);		/* Make sure we have a valid thread */
+  fp->fp_worker = NULL;
 
-  if (worker->w_job.j_func != NULL) {
-	if (worker->w_task != NONE)
-		return(worker->w_task == proc_e);
-	else if (worker->w_job.j_fp != NULL) {
-		return(worker->w_job.j_fp->fp_task == proc_e);
-	}
-  }
+  fp = rfp;
 
-  return(0);
+  self->w_fp = rfp;
+  fp->fp_worker = self;
 }