From 625f4ae4a3988ae3412bde46fecfae70b53c8031 Mon Sep 17 00:00:00 2001
From: Thomas Veerman
Date: Tue, 18 Dec 2012 14:53:12 +0000
Subject: [PATCH] VFS: add documentation about internal working

---
 servers/vfs/README | 674 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 674 insertions(+)
 create mode 100644 servers/vfs/README

diff --git a/servers/vfs/README b/servers/vfs/README
new file mode 100644
index 000000000..7ded7c3e0
--- /dev/null
+++ b/servers/vfs/README
@@ -0,0 +1,674 @@
+Description of VFS                             Thomas Veerman 18-12-2012
+
+Table of contents
+1 ..... General description of responsibilities
+2 ..... General architecture
+3 ..... Worker threads
+4 ..... Locking
+4.1 .... Locking requirements
+4.2 .... Three-level Lock
+4.3 .... Data structures subject to locking
+4.4 .... Locking order
+4.5 .... Vmnt (file system) locking
+4.6 .... Vnode (open file) locking
+4.7 .... Filp (file position) locking
+4.8 .... Lock characteristics per request type
+5 ..... Recovery from driver crashes
+5.1 .... Recovery from block drivers crashes
+5.2 .... Recovery from character driver crashes
+5.3 .... Recovery from File Server crashes
+
+1 General description of responsibilities
+VFS implements the file system in cooperation with one or more File Servers
+(FS). The File Servers take care of the actual file system on a partition. That
+is, they interpret the data structure on disk, write and read data to/from
+disk, etc. VFS sits on top of those File Servers and communicates with
+them. Looking inside VFS, we can identify several roles. First, a role of VFS
+is to handle most POSIX system calls that are supported by Minix. Additionally,
+it supports a few calls necessary for libc.
The following system calls are
+handled by VFS:
+access, chdir, chmod, chown, chroot, close, creat, fchdir, fcntl, fstat,
+fstatfs, fstatvfs, fsync, ftruncate, getdents, ioctl, link, llseek, lseek,
+lstat, mkdir, mknod, mount, open, pipe, read, readlink, rename, rmdir, select,
+stat, statvfs, symlink, sync, truncate, umask, umount, unlink, utime, write.
+Second, it maintains part of the state belonging to a process (process state is
+spread out over the kernel, VM, PM, and VFS). For example, it maintains state
+for select(2) calls, file descriptors and file positions. Also, it cooperates
+with the Process Manager to handle the fork, exec, and exit system calls.
+Third, VFS keeps track of endpoints that are supposed to be drivers for
+character or block special files. File Servers can be regarded as drivers for
+block special files, although they are handled entirely differently from
+other drivers.
+
+The following diagram depicts how a read() on a file in /home is handled:
+
+        ----------------
+        | user process |
+        ----------------
+             ^        ^
+             |        |
+          read(2)      \
+             |          \
+             V           \
+        ----------------  |
+        |     VFS      |  |
+        ----------------  |
+                     ^    |
+                     |    |
+                     V    |
+-------  --------  ---------
+| MFS |  | MFS  |  |  MFS  |
+|  /  |  | /usr |  | /home |
+-------  --------  ---------
+Diagram 1: handling of read(2) system call
+
+The user process executes the read system call which is delivered to VFS. VFS
+verifies the read is done on a valid (open) file and forwards the request
+to the FS responsible for the file system on which the file resides. The FS
+reads the data, copies it directly to the user process, and replies to VFS
+that it has executed the request. Subsequently, VFS replies to the user process
+that the operation is done and the user process continues to run.
+
+2 General architecture
+VFS works roughly identically to every other server and driver in Minix; it
+fetches a message (internally referred to as a job in some cases), executes
+the request embedded in the message, returns a reply, and fetches the next
+job. There are several sources for new jobs: from user processes, from PM, from
+the kernel, and from suspended jobs inside VFS itself (suspended operations
+on pipes, locks, or character special files). File Servers are regarded as
+normal user processes in this case, but their abilities are limited. This
+is to prevent deadlocks. Once a job is received, a worker thread starts
+executing it. During the lifetime of a job, the worker thread might need
+to talk to several File Servers. The protocol VFS speaks with File Servers
+is fully documented on the Wiki at [0]. The protocol fields are defined in
+. If the job is an operation on a character or block special
+file and the need to talk to a driver arises, VFS uses the Character and
+Block Device Protocol. See [1]. This is sadly not official documentation,
+but it is an accurate description of how it works. Luckily, driver writers
+can use the libchardriver and libblockdriver libraries and don't have to
+know the details of the protocol.
+
+3 Worker threads
+Upon start up, VFS spawns a configurable number of worker threads. The
+main thread fetches requests and replies, and hands them off to idle or
+reply-pending workers, respectively. If no worker threads are available,
+the request is queued. There are 3 types of worker threads: normal, a system
+worker, and a deadlock resolver. All standard system calls are handled by
+normal worker threads. Jobs from PM and notifications from the kernel are taken
+care of by the system worker.
The deadlock resolver handles jobs from system
+processes (i.e., File Servers and drivers) when there are no normal worker
+threads available; all normal threads might be blocked on a single worker
+thread that caused a system process to send a request on its own. To unblock
+all normal threads, we need to reserve one thread to handle that situation.
+VFS drives all File Servers and drivers asynchronously. While waiting for
+a reply, a worker thread is blocked and other workers can keep processing
+requests. Upon reply the worker thread is unblocked.
+As mentioned above, the main thread is responsible for retrieving new jobs and
+replies to current jobs, and for starting or unblocking the proper worker
+thread. Given how many sources for new jobs and replies there are, the work
+for the main thread is quite complicated. Consider Table 1.
+
+---------------------------------------------------------
+| From                 |  normal  | deadlock |  system  |
+---------------------------------------------------------
+                     msg is new job
+---------------------------------------------------------
+| PM                   |          |          |    X     |
++----------------------+----------+----------+----------+
+| Notification from    |          |          |          |
+| the kernel           |          |          |    X     |
++----------------------+----------+----------+----------+
+| Notification from    |          |          |          |
+| DS or system process |    X     |    X     |          |
++----------------------+----------+----------+----------+
+| User process         |    X     |          |          |
++----------------------+----------+----------+----------+
+| Unsuspended process  |    X     |          |          |
+---------------------------------------------------------
+                      msg is reply
+---------------------------------------------------------
+| File Server reply    |  resume  |          |          |
++----------------------+----------+----------+----------+
+| Sync. driver reply   |  resume  |          |          |
++----------------------+----------+----------+----------+
+| Async. driver reply  | resume/X |    X     |          |
+---------------------------------------------------------
+Table 1: VFS' message fetching main loop. X means 'start thread'.
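
The routing in Table 1 can be sketched as a small decision function. This is
only an illustration under stated assumptions, not the VFS main loop: every
identifier below is hypothetical, asynchronous driver replies are modelled as
always starting a thread, and the behaviour when a system process request
finds neither a normal worker nor the deadlock resolver idle (queueing) is an
assumption that the table does not spell out.

```c
#include <assert.h>
#include <stdbool.h>

/* All identifiers are hypothetical; they only mirror the rows of Table 1. */
enum msg_source {
    SRC_PM,                 /* request from the Process Manager */
    SRC_KERNEL_NOTIFY,      /* notification from the kernel */
    SRC_SYSTEM_NOTIFY,      /* notification from DS or a system process */
    SRC_USER,               /* system call from a user process */
    SRC_UNSUSPENDED,        /* job of a previously suspended process */
    SRC_FS_REPLY,           /* reply from a File Server */
    SRC_SYNC_DRIVER_REPLY,  /* reply from a synchronous driver */
    SRC_ASYNC_DRIVER_REPLY  /* reply from an asynchronous driver */
};

enum action {
    START_NORMAL,    /* start a normal worker thread */
    START_DEADLOCK,  /* start the deadlock resolver */
    START_SYSTEM,    /* hand the job to the system worker */
    RESUME_WAITER,   /* unblock the worker waiting for this reply */
    QUEUE_REQUEST    /* no suitable worker is idle: queue the job */
};

/* Decide what the main thread does with one incoming message. */
enum action dispatch(enum msg_source src, bool normal_idle, bool resolver_idle)
{
    switch (src) {
    case SRC_PM:
    case SRC_KERNEL_NOTIFY:
        /* The system worker keeps its own list of pending requests. */
        return START_SYSTEM;
    case SRC_SYSTEM_NOTIFY:
    case SRC_ASYNC_DRIVER_REPLY:
        if (normal_idle)
            return START_NORMAL;
        if (resolver_idle)
            return START_DEADLOCK;
        return QUEUE_REQUEST;   /* assumption, see lead-in */
    case SRC_USER:
    case SRC_UNSUSPENDED:
        return normal_idle ? START_NORMAL : QUEUE_REQUEST;
    case SRC_FS_REPLY:
    case SRC_SYNC_DRIVER_REPLY:
        return RESUME_WAITER;
    }
    return QUEUE_REQUEST;
}
```

For example, dispatch(SRC_SYSTEM_NOTIFY, false, true) selects the deadlock
resolver, while a user request arriving with no idle normal worker is queued.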
+
+Asynchronous driver replies get their own thread for the following reason.
+In some cases, a reply has a thread blocked waiting for it which
+can be resumed (e.g., open). In another case there's a lot of work to be
+done which involves sending new messages (e.g., select replies). Finally,
+DEV_REVIVE replies unblock suspended processes which in turn generate new jobs
+to be handled by the main loop (e.g., suspended reads and writes). So depending
+on the reply a new thread has to be started. Having all this logic in the main
+loop is messy, so we start a thread regardless of the actual reply contents.
+When there are no worker threads available and there is no need to invoke
+the deadlock resolver (i.e., normal system calls), the request is queued in
+the fproc table. This works because a process can send only one system call
+at a time. When implementing kernel threads, one has to take this assumption
+into account.
+The protocol PM speaks with VFS is asynchronous and PM is allowed to
+send as many requests to VFS as it wants. It is impossible to use the same
+queueing mechanism as normal processes use, because that would allow for
+just 1 queued message. Instead, the system worker maintains a linked list
+of pending requests. Moreover, this queueing mechanism is also the reason
+why notifications from the kernel are handled by the system worker; the
+kernel has no corresponding fproc table entry (so we can't store it there)
+and the linked list has no dependencies on that table.
+Communication with drivers is asynchronous even when the driver uses the
+synchronous driver protocol. However, to guarantee identical behavior,
+access to synchronous drivers is serialized. File Servers are treated
+differently. VFS was designed to be able to send requests concurrently to
+File Servers, although at the time of writing there are no File Servers that
+can actually make use of that functionality.
To identify which reply from an
+FS belongs to which worker thread, all requests have an embedded transaction
+identification number (a magic number + thread id encoded in the mtype field
+of a message) which the FS has to echo upon reply. Because the range of valid
+transaction IDs is isolated from valid system call numbers, VFS can use that
+ID to differentiate between replies from File Servers and actual new system
+calls from FSes. Using this mechanism VFS is able to support FUSE and ProcFS.
+
+4 Locking
+To ensure correct execution of system calls, worker threads sometimes need
+certain objects within VFS to remain unchanged during thread suspension
+and resumption (i.e., when they need to communicate with a driver or File
+Server). Threads keep most state on the stack, but there are a few global
+variables that require protection: the fproc table, vmnt table, vnode table,
+and filp table. Other tables such as lock table, select table, and dmap table
+don't require protection by means of exclusive access. For those tables it is
+necessary and sufficient to simply mark an entry as in use.
+
+4.1 Locking requirements
+VFS implements the locking model described in [2]. For completeness of this
+document we'll describe it here, too. The requirements are based on a threading
+package that is non-preemptive. VFS must guarantee correct functioning with
+several, semi-concurrently executing threads in any arbitrary order. The
+latter requirement follows from the fact that threads need service from
+other components like File Servers and drivers, and they may take any time
+to complete requests.
+1) Consistency of replicated values. Several system calls rely on VFS keeping
+a replicated representation of data in File Servers (e.g., file sizes,
+file modes, etc.).
+2) Isolation of system calls. Many system calls involve multiple requests to
+FSes.
Concurrent requests from other processes must not lead to otherwise
+impossible results (e.g., a chmod operation on a file cannot fail halfway
+through because it's suddenly unlinked or moved).
+3) Integrity of objects. From the point of view of threads, obtaining mutual
+exclusion is a potentially blocking operation. The integrity of any objects
+used across blocking calls must be guaranteed (e.g., the file mode in a vnode
+must remain intact not only when talking to other components, but also when
+obtaining a lock on a filp).
+4) No deadlock. No call may cause another call to never complete. Deadlock
+situations are typically the result of two or more threads that each hold
+exclusive access to one resource and want exclusive access to the resource
+held by the other thread. These resources are a) data (global variables)
+and b) worker threads.
+4a) Conflicts between locking of different types of objects can be avoided by
+keeping a locking order: objects of different types must always be locked in
+the same order. If multiple objects of the same type are to be locked, then
+first a "common denominator" higher up in the locking order must be locked.
+4b) Some threads can only run to completion when another thread does work on
+their behalf. Examples of this are drivers and file servers that do system
+calls on their own (e.g., ProcFS, PFS/UNIX Domain Sockets, FUSE) or crashing
+components (e.g., a driver for a character special file that crashes during
+a request; a second thread is required to handle resource clean up or driver
+restart before the first thread can abort or retry the request).
+5) No starvation. VFS must guarantee that every system call completes in finite
+time (e.g., an infinite stream of reads must never completely block writes).
+Furthermore, we want to maximize parallelism to improve performance. This
+leads to:
+6) A request to one File Server must not block access to other FS
+processes.
This means that most forms of locking cannot take place at a
+global level, and must at most take place on the file system level.
+7) No read-only operation on a regular file must block an independent read
+call to that file. In particular, (read-only) open and close operations may
+not block such reads, and multiple independent reads on the same file must
+be able to take place concurrently (i.e., reads that do not share a file
+position between their file descriptors).
+
+4.2 Three-level Lock
+From the requirements it follows that we need at least two locking types: read
+and write locks. Concurrent reads are allowed, but writes are exclusive both
+from reads and from each other. However, in a lot of cases it is possible to
+use a third locking type that is in between read and write lock: the serialize
+lock. This is implemented in the three-level lock [2]. The three-level
+lock provides:
+TLL_READ: allows an unlimited number of threads to hold the lock with the
+same type (both the thread itself and other threads); N * concurrent.
+TLL_READSER: also allows an unlimited number of threads with type TLL_READ,
+but only one thread can obtain serial access to the lock; N * concurrent +
+1 * serial.
+TLL_WRITE: provides full mutual exclusion; 1 * exclusive + 0 * concurrent +
+0 * serial.
+In absence of TLL_READ locks, a TLL_READSER is identical to TLL_WRITE. However,
+TLL_READSER never blocks concurrent TLL_READ access. TLL_READSER can be
+upgraded to TLL_WRITE; the thread will block until the last TLL_READ lock
+leaves and new TLL_READ locks are blocked. Locks can be downgraded to a
+lower type. The three-level lock is implemented using two FIFO queues with
+write-bias. This guarantees no starvation.
+
+4.3 Data structures subject to locking
+VFS has a number of global data structures. See Table 2.
+
+--------------------------------------------------------------------
+| Structure  | Object description                                  |
++------------+-----------------------------------------------------|
+| fproc      | Process (includes process's file descriptors)       |
++------------+-----------------------------------------------------|
+| vmnt       | Virtual mount; a mounted file system                |
++------------+-----------------------------------------------------|
+| vnode      | Virtual node; an open file                          |
++------------+-----------------------------------------------------|
+| filp       | File position into an open file                     |
++------------+-----------------------------------------------------|
+| lock       | File region locking state for an open file          |
++------------+-----------------------------------------------------|
+| select     | State for an in-progress select(2) call             |
++------------+-----------------------------------------------------|
+| dmap       | Mapping from major device number to a device driver |
+--------------------------------------------------------------------
+Table 2: VFS object types.
+
+An fproc object is a process. An fproc object is created by fork(2)
+and destroyed by exit(2) (which may, or may not, be initiated by the
+process itself). It is identified by its endpoint number ('fp_endpoint')
+and process id ('fp_pid'). Both are unique although in general the endpoint
+number is used throughout the system.
+A vmnt object is a mounted file system. It is created by mount(2) and destroyed
+by umount(2). It is identified by a device number ('m_dev') and FS endpoint
+number ('m_fs_e'); both are unique to each vmnt object. There is always a
+single process that handles a file system on a device and a device cannot
+be mounted twice.
+A vnode object is the VFS representation of an open inode on the file
+system. A vnode object is created when a first process opens or creates the
+corresponding file and is destroyed when the last process, which has that
+file open, closes it.
It is identified by a combination of FS endpoint number
+('v_fs_e') and inode number of that file system ('v_inode_nr'). A vnode
+might be mapped to another file system; the actual reading and writing is
+handled by a different endpoint. This has no effect on locking.
+A filp object contains a file position within a file. It is created when a file
+is opened or an anonymous pipe is created, and destroyed when the last user
+(i.e., process) closes it. A file descriptor always points to a single filp.
+A filp always points to a single vnode, although not all vnodes are pointed to
+by a filp. A filp has a reference count ('filp_count') which is identical to
+the number of file descriptors pointing to it. It can be increased by a dup(2)
+or fork(2). A filp can therefore be shared by multiple processes.
+A lock object keeps information about locking of file regions. This has
+nothing to do with the threading type of locking. The lock objects require
+no locking protection and won't be discussed further.
+A select object keeps information on a select(2) operation that cannot
+be fulfilled immediately (waiting for timeout or file descriptors not
+ready). They are identified by their owner ('requestor'); a pointer to the
+fproc table. A null pointer means not in use. A select object can be used by
+only one process and a process can do only one select(2) at a time. Select(2)
+operates on filps and is organized in such a way that it is sufficient to
+apply locking on individual filps and not on select objects themselves. They
+won't be discussed further.
+A dmap object is a mapping from a device number to a device driver. A device
+driver can have multiple device numbers associated (e.g., TTY). Access to
+a driver is exclusive when it uses the synchronous driver protocol.
+
+4.4 Locking order
+Based on the description in the previous section, we need protection for
+fproc, vmnt, vnode, and filp objects.
To prevent deadlocks as a result of
+object locking, we need to define a strict locking order. In VFS we use the
+following order:
+
+fproc -> [exec] -> vmnt -> vnode -> filp -> [block special file] -> [dmap]
+
+That is, no thread may lock an fproc object while holding a vmnt lock,
+and no thread may lock a vmnt object while holding an (associated) vnode
+lock, etc.
+Fproc needs protection because processes themselves can initiate system
+calls, but also PM can cause system calls that have to be executed in their
+name. For example, a process might be busy reading from a character device
+and another process sends a termination signal. The exit(2) that follows is
+sent by PM and is to be executed by the to-be-killed process itself. At this
+point there is contention for the fproc object that belongs to the process,
+hence the need for protection.
+The exec(2) call is protected by a mutex for the following reason. VFS uses a
+number of variables on the heap to read ELF headers. They are on the heap due
+to their size; putting them on the stack would increase stack size demands for
+worker threads. The exec call does blocking read calls and thus needs exclusive
+access to these variables. However, only the exec(2) syscall needs this lock.
+Access to block special files needs to be exclusive. File Servers are
+responsible for handling reads from and writes to block special files; if
+a block special file is on a device that is mounted, the FS responsible for
+that mount point takes care of it, otherwise the FS that handles the root of
+the file system is responsible. Due to mounting and unmounting file systems,
+the FS handling a block special file may change. Locking the vnode is not
+enough since the inode can be on an entirely different File Server. Therefore,
+access to block special files must be mutually exclusive from concurrent
+mount(2)/umount(2) operations. However, when we're not accessing a block
+special file, we don't need this lock.
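
The locking order above can be checked mechanically by giving each object
type a rank and refusing any acquisition that does not move to a higher rank.
The sketch below is a hypothetical debugging aid, not code from VFS: the
names lock_rank, held_locks, order_acquire and order_release are invented for
this illustration, and the check is deliberately stricter than VFS in that it
rejects a second lock of the same type outright.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ranks mirroring the order listed above; lower ranks first. */
enum lock_rank {
    RANK_FPROC = 1,
    RANK_EXEC,
    RANK_VMNT,
    RANK_VNODE,
    RANK_FILP,
    RANK_BLOCK_SPECIAL,
    RANK_DMAP
};

#define MAX_HELD 8

/* Per-thread record of the locks currently held, innermost last. */
struct held_locks {
    enum lock_rank stack[MAX_HELD];
    int depth;
};

/* Record an acquisition; refuse it when it would violate the order. */
bool order_acquire(struct held_locks *h, enum lock_rank r)
{
    if (h->depth >= MAX_HELD)
        return false;
    if (h->depth > 0 && r <= h->stack[h->depth - 1])
        return false;   /* e.g. fproc after vmnt, or a second vmnt */
    h->stack[h->depth++] = r;
    return true;
}

/* Forget the most recently acquired lock. */
void order_release(struct held_locks *h)
{
    if (h->depth > 0)
        h->depth--;
}
```

A thread holding a vmnt and a vnode may go on to lock a filp, but a request
for an fproc lock at that point is rejected because it goes against the
order.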
+
+4.5 Vmnt (file system) locking
+Vmnt locking cannot be seen completely separately from vnode locking. For
+example, umount(2) fails if there are still in-use vnodes, which means that
+FS requests [0] only involving in-use inodes do not have to acquire a vmnt
+lock. On the other hand, all other requests do need a vmnt lock. Extrapolating
+this to system calls this means that all system calls involving a file
+descriptor don't need a vmnt lock and all other system calls (that make FS
+requests) do need a vmnt lock.
+
+-------------------------------------------------------------------------------
+| Category          | System calls                                            |
++-------------------+---------------------------------------------------------+
+| System calls with | access, chdir, chmod, chown, chroot, creat, dumpcore*,  |
+| a path name       | exec, link, lstat, mkdir, mknod, mount, open, readlink, |
+| argument          | rename, rmdir, stat, statvfs, symlink, truncate, umount,|
+|                   | unlink, utime                                           |
++-------------------+---------------------------------------------------------+
+| System calls with | close, fchdir, fcntl, fstat, fstatvfs, ftruncate,       |
+| a file descriptor | getdents, ioctl, llseek, pipe, read, select, write      |
+| argument          |                                                         |
++-------------------+---------------------------------------------------------+
+| System calls with | fsync**, sync, umask                                    |
+| other or no       |                                                         |
+| arguments         |                                                         |
+-------------------------------------------------------------------------------
+Table 3: System call categories
+* path name argument is implicit, the path name is "core."
+** although fsync actually provides a file descriptor argument, it's only
+used to find the vmnt and not to do any actual operations on the file
+
+Before we describe what kind of vmnt locks VFS applies to system calls with a
+path name or other arguments, we need to make some notes on path lookup. Path
+lookups take arbitrary paths as input (relative and absolute).
They can start
+at any vmnt (based on root directory and working directory of the process doing
+the lookup) and visit any file system in arbitrary order, possibly visiting
+the same file system more than once. As such, VFS can never tell in advance
+at which File Server a lookup will end. This has the following consequences:
+ - In the lookup procedure, at most one vmnt may be locked at a time. When
+   moving from one vmnt to another, the first vmnt has to be unlocked before
+   acquiring the next lock to prevent deadlocks.
+ - The lookup procedure must lock each visited file system with TLL_READSER
+   and downgrade or upgrade to the lock type desired by the caller for the
+   destination file system (as VFS cannot know which file system is final). This
+   is to prevent deadlocks when a thread acquires a TLL_READSER on a vmnt and
+   another thread acquires a TLL_READ on the same vmnt. If the second thread is
+   blocked on the first thread due to it acquiring a lock on a vnode, the first
+   thread will be unable to upgrade a TLL_READSER lock to TLL_WRITE.
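
The two lookup rules above can be illustrated with a toy, single-threaded
model. This is a sketch under stated assumptions, not the VFS lookup code:
toy_vmnt, toy_lookup and toy_locked_count are hypothetical names, the lock
types are redefined locally, a "route" is simply the list of file systems a
path happens to visit, and a lock is reduced to a field recording how the
vmnt is held, so nothing ever blocks.

```c
#include <assert.h>

/* Local toy redefinition of the three lock types plus "not held". */
enum toy_lock { TOY_NONE, TOY_READ, TOY_READSER, TOY_WRITE };

/* Hypothetical stand-in for a vmnt: only records how it is locked. */
struct toy_vmnt {
    enum toy_lock held;
};

/* Count how many vmnts in the table are currently locked. */
int toy_locked_count(const struct toy_vmnt *tab, int n)
{
    int i, count = 0;

    for (i = 0; i < n; i++)
        if (tab[i].held != TOY_NONE)
            count++;
    return count;
}

/*
 * Walk the file systems listed in route[0..len-1]. Every visited vmnt is
 * locked TOY_READSER, the previous one is unlocked first, and only the
 * final vmnt is converted to the type the caller asked for.
 * Returns the index of the final vmnt, or -1 for an empty route.
 */
int toy_lookup(struct toy_vmnt *tab, const int *route, int len,
               enum toy_lock wanted)
{
    int i, cur = -1;

    for (i = 0; i < len; i++) {
        if (cur >= 0)
            tab[cur].held = TOY_NONE;     /* unlock before the next lock */
        cur = route[i];
        tab[cur].held = TOY_READSER;      /* the final vmnt is not yet known */
    }
    if (cur >= 0)
        tab[cur].held = wanted;           /* downgrade or upgrade at the end */
    return cur;
}
```

With a route that visits file systems 0, 2, 1 and then 2 again, only vmnt 2
is left locked afterwards, with the type the caller asked for.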
+
+We use the following mapping for vmnt locks onto three-level lock types:
+-------------------------------------------------------------------------------
+| Lock type  | Mapped to   | Used for                                         |
++------------+-------------+--------------------------------------------------+
+| VMNT_READ  | TLL_READ    | Read-only operations and fully independent write |
+|            |             | operations                                       |
++------------+-------------+--------------------------------------------------+
+| VMNT_WRITE | TLL_READSER | Independent create and modify operations         |
++------------+-------------+--------------------------------------------------+
+| VMNT_EXCL  | TLL_WRITE   | Delete and dependent write operations            |
+-------------------------------------------------------------------------------
+Table 4: vmnt to tll lock mapping
+
+The following table shows a sub-categorization of system calls without a
+file descriptor argument, together with their locking types and motivation
+as used by VFS.
+-------------------------------------------------------------------------------
+| Group       | System calls | Lock type  | Motivation                        |
++-------------+--------------+------------+-----------------------------------+
+| File open   | chdir,       | VMNT_READ  | These operations do not interfere |
+| ops.        | chroot, exec,|            | with each other, as vnodes can be |
+| (non-create)| open         |            | opened concurrently, and open     |
+|             |              |            | operations do not affect          |
+|             |              |            | replicated state.                 |
++-------------+--------------+------------+-----------------------------------+
+| File create-| creat,       | VMNT_EXCL  | File create ops. require mutual   |
+| and-open    | open(O_CREAT)| for create | exclusion from concurrent file    |
+| ops         |              | VMNT_WRITE | open ops.
If the file already     |
+|             |              | for open   | existed, the VMNT_WRITE lock that |
+|             |              |            | is necessary for the lookup is    |
+|             |              |            | not upgraded                      |
++-------------+--------------+------------+-----------------------------------+
+| File create-| pipe         | VMNT_READ  | These create nameless inodes      |
+| unique-and- |              |            | which cannot be opened by means   |
+| open ops.   |              |            | of a path. Their creation         |
+|             |              |            | therefore does not interfere with |
+|             |              |            | anything else                     |
++-------------+--------------+------------+-----------------------------------+
+| File create-| mkdir, mknod,| VMNT_WRITE | These operations do not affect    |
+| only ops.   | symlink      |            | any VFS state, and can therefore  |
+|             |              |            | take place concurrently with open |
+|             |              |            | operations                        |
++-------------+--------------+------------+-----------------------------------+
+| File info   | access, lstat| VMNT_READ  | These operations do not interfere |
+| retrieval or| readlink,stat|            | with each other and do not modify |
+| modification| utime        |            | replicated state                  |
++-------------+--------------+------------+-----------------------------------+
+| File        | chmod, chown,| VMNT_READ  | These operations do not interfere |
+| modification| truncate     |            | with each other. They do need     |
+|             |              |            | exclusive access on the vnode     |
+|             |              |            | level                             |
++-------------+--------------+------------+-----------------------------------+
+| File link   | link         | VMNT_WRITE | Identical to file create-only     |
+| ops.        |              |            | operations                        |
++-------------+--------------+------------+-----------------------------------+
+| File unlink | rmdir, unlink| VMNT_EXCL  | These must not interfere with     |
+| ops.        |              |            | file create operations, to avoid  |
+|             |              |            | the scenario where inodes are     |
+|             |              |            | reused immediately.
However, due  |
+|             |              |            | to necessary path checks, the     |
+|             |              |            | vmnt is first locked VMNT_WRITE   |
+|             |              |            | and then upgraded                 |
++-------------+--------------+------------+-----------------------------------+
+| File rename | rename       | VMNT_EXCL  | Identical to file unlink          |
+| ops.        |              |            | operations                        |
++-------------+--------------+------------+-----------------------------------+
+| Non-file    | sync, umask  | VMNT_READ  | umask does not involve the file   |
+| ops.        |              | or none    | system, so it does not need       |
+|             |              |            | locks. sync does not alter state  |
+|             |              |            | in VFS and is atomic at the FS    |
+|             |              |            | level                             |
+-------------------------------------------------------------------------------
+Table 5: System call without file descriptor argument sub-categorization
+
+4.6 Vnode (open file) locking
+Compared to vmnt locking, vnode locking is relatively straightforward. All
+read-only accesses to vnodes that merely read the vnode object's fields are
+allowed to be concurrent. Consequently, all accesses that change fields
+of a vnode object must be exclusive. This leaves us with creation and
+destruction of vnode objects (and related to that, their reference counts);
+it's sufficient to serialize these accesses. This follows from the fact
+that a vnode is only created when the first user opens it, and destroyed
+when the last user closes it. An open file in process A cannot be closed
+by process B. Note that this also relies on the fact that a process can do
+only one system call at a time. Kernel threads would violate this assumption.
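
The vnode rules above (concurrent readers, serialized open and close,
exclusive writers) amount to a small compatibility relation between lock
types. The sketch below is a hypothetical helper written for this
description, not a function from VFS; the lock names are the ones used in
Table 6 and are redefined locally so the example stands on its own.

```c
#include <assert.h>
#include <stdbool.h>

/* Local redefinition of the vnode lock types named in Table 6. */
enum vnode_lock {
    VNODE_READ,   /* maps to TLL_READ */
    VNODE_OPCL,   /* maps to TLL_READSER */
    VNODE_WRITE   /* maps to TLL_WRITE */
};

/*
 * Return true when two different threads may hold locks of type a and b
 * on the same vnode at the same time.
 */
bool vnode_locks_compatible(enum vnode_lock a, enum vnode_lock b)
{
    if (a == VNODE_WRITE || b == VNODE_WRITE)
        return false;   /* writers exclude everybody else */
    if (a == VNODE_OPCL && b == VNODE_OPCL)
        return false;   /* open and close operations are serialized */
    return true;        /* READ with READ, or READ with a single OPCL */
}
```

The relation is symmetric: a reader never blocks an open or close, two opens
of the same vnode take turns, and a writer waits for everyone.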
+
+We use the following mapping for vnode locks onto three-level lock types:
+-------------------------------------------------------------------------------
+| Lock type  | Mapped to   | Used for                                         |
++------------+-------------+--------------------------------------------------+
+| VNODE_READ | TLL_READ    | Read access to previously opened vnodes          |
++------------+-------------+--------------------------------------------------+
+| VNODE_OPCL | TLL_READSER | Creation, opening, closing, and destruction of   |
+|            |             | vnodes                                           |
++------------+-------------+--------------------------------------------------+
+| VNODE_WRITE| TLL_WRITE   | Write access to previously opened vnodes         |
+-------------------------------------------------------------------------------
+Table 6: vnode to tll lock mapping
+
+When vnodes are destroyed, they are initially locked with VNODE_OPCL. After
+all, we're going to alter the reference count, so this must be serialized. If
+the reference count then reaches zero we obtain exclusive access. This should
+always be immediately possible unless there is a consistency problem. See
+section 4.8 for an exhaustive listing of locking methods for all operations on
+vnodes.
+
+4.7 Filp (file position) locking
+The main fields of a filp object that are shared between various processes
+(and by extension threads), and that can change after object creation,
+are filp_count and filp_pos. Writes to and reads from filp objects must be
+mutually exclusive, as all system calls have to use the latest version. For
+example, a read(2) call changes the file position (i.e., filp_pos), so two
+concurrent reads must obtain exclusive access. Consequently, as even read
+operations require exclusive access, filp objects don't use three-level locks,
+but only mutexes.
+
+System calls that involve a file descriptor often access both the filp and
+the corresponding vnode. The locking order requires us to first lock the
+vnode and then the filp. This is taken care of at the filp level.
Whenever
+a filp is locked, a lock on the vnode is acquired first. Conversely, when
+a filp is unlocked, the corresponding vnode is also unlocked. A convenient
+consequence is that whenever a vnode is locked exclusively (VNODE_WRITE),
+all corresponding filps are implicitly locked. This is of particular use
+when multiple filps must be locked at the same time:
+ - When opening a named pipe, VFS must make sure that there is at most one
+   filp for the reader end and one filp for the writer end.
+ - Pipe readers and writers must be suspended in the absence of (respectively)
+   writers and readers.
+ - To prevent pipe file sizes from growing too large and wrapping, the file
+   size is reset to zero when the pipe is empty. This can happen after a
+   read(2).
+Because both filps are linked to the same vnode object (they are for the same
+pipe), it suffices to exclusively lock that vnode instead of both filp objects.
+
+In some cases it can happen that a function that operates on a locked filp,
+calls another function that triggers another lock on a different filp for
+the same vnode. For example, close_filp. At some point, close_filp() calls
+release() which in turn will loop through the filp table looking for pipes
+being select(2)ed on. If there are, the select code will lock the filp and do
+operations on it. This works fine when doing a select(2) call, but conflicts
+with close(2) or exit(2). Lock_filp() makes an exception for this situation;
+if you've already locked a vnode with VNODE_OPCL or VNODE_WRITE when locking
+a filp, you obtain a "soft lock" on the vnode for this filp. This means
+that lock_filp won't actually try to lock the vnode (which wouldn't work),
+but flags the vnode as "skip unlock_vnode upon unlock_filp." Upon unlocking
+the filp, the vnode remains locked, the soft lock is removed, and the filp
+mutex is released. Note that this scheme does not violate the locking order;
+the vnode is (already) locked before the filp.
+
+A similar problem arises with do_pipe.
In this case we obtain a new vnode +object, lock it, and obtain two new, locked, filp objects. If everything works +out and the filp objects are linked to the same vnode, we run into trouble +when unlocking both filps. Unlocking the first filp works, but it also +unlocks the vnode, so the second filp no longer has an associated locked +vnode. Therefore we introduced a plural unlock_filps(filp1, filp2) that can +unlock two filps that both point to the same vnode. + +4.8 Lock characteristics per request type +For File Servers that support concurrent requests, it's useful to know which +locking guarantees VFS provides for vmnts and vnodes, so they can take those +into account when protecting internal data structures. In the table below, +READ = TLL_READ, READSER = TLL_READSER, WRITE = TLL_WRITE. The vnode lock +applies to the REQ_INODE_NR field in requests, unless the notes say otherwise. + +------------------------------------------------------------------------------ +| request | vmnt | vnode | notes | ++--------------+---------+---------+-----------------------------------------+ +| REQ_BREAD | | READ | VFS serializes reads from and writes to | +| | | | block special files | ++--------------+---------+---------+-----------------------------------------+ +| REQ_BWRITE | | WRITE | VFS serializes reads from and writes to | +| | | | block special files | ++--------------+---------+---------+-----------------------------------------+ +| REQ_CHMOD | READ | WRITE | vmnt is only locked if file is not | +| | | | already opened | ++--------------+---------+---------+-----------------------------------------+ +| REQ_CHOWN | READ | WRITE | vmnt is only locked if file is not | +| | | | already opened | ++--------------+---------+---------+-----------------------------------------+ +| REQ_CREATE | WRITE | WRITE | The directory in which the file is | +| | | | created is write locked | ++--------------+---------+---------+-----------------------------------------+ +| REQ_FLUSH | | | Mutually exclusive to
REQ_BREAD and | +| | | | REQ_BWRITE | ++--------------+---------+---------+-----------------------------------------+ +| REQ_FSTATFS | | | | ++--------------+---------+---------+-----------------------------------------+ +| REQ_FTRUNC | READ | WRITE | vmnt is only locked if file is not | +| | | | already opened | ++--------------+---------+---------+-----------------------------------------+ +| REQ_GETDENTS | READ | READ | vmnt is only locked if file is not | +| | | | already opened | ++--------------+---------+---------+-----------------------------------------+ +| REQ_INHIBREAD| | READ | | ++--------------+---------+---------+-----------------------------------------+ +| REQ_LINK | READSER | WRITE | REQ_INODE_NR is locked READ | +| | | | REQ_DIR_INO is locked WRITE | ++--------------+---------+---------+-----------------------------------------+ +| REQ_LOOKUP | READSER | | | ++--------------+---------+---------+-----------------------------------------+ +| REQ_MKDIR | READSER | WRITE | | ++--------------+---------+---------+-----------------------------------------+ +| REQ_MKNOD | READSER | WRITE | | ++--------------+---------+---------+-----------------------------------------+ +|REQ_MOUNTPOINT| WRITE | WRITE | | ++--------------+---------+---------+-----------------------------------------+ +|REQ_NEW_DRIVER| | | | ++--------------+---------+---------+-----------------------------------------+ +| REQ_NEWNODE | | | Only sent to PFS | ++--------------+---------+---------+-----------------------------------------+ +| REQ_PUTNODE | | READSER | READSER when dropping all but one | +| | | or WRITE| references. 
WRITE when final reference | +| | | | is dropped (i.e., no longer in use) | ++--------------+---------+---------+-----------------------------------------+ +| REQ_RDLINK | READ | READ | In some circumstances stricter locking | +| | | | might be applied, but not guaranteed | ++--------------+---------+---------+-----------------------------------------+ +| REQ_READ | | READ | | ++--------------+---------+---------+-----------------------------------------+ +|REQ_READSUPER | WRITE | | | ++--------------+---------+---------+-----------------------------------------+ +| REQ_RENAME | WRITE | WRITE | | ++--------------+---------+---------+-----------------------------------------+ +| REQ_RMDIR | WRITE | WRITE | | ++--------------+---------+---------+-----------------------------------------+ +| REQ_SLINK | READSER | READ | | ++--------------+---------+---------+-----------------------------------------+ +| REQ_STAT | READ | READ | vmnt is only locked if file is not | +| | | | already opened | ++--------------+---------+---------+-----------------------------------------+ +| REQ_STATVFS | READ | READ | vmnt is only locked if file is not | +| | | | already opened | ++--------------+---------+---------+-----------------------------------------+ +| REQ_SYNC | READ | | | ++--------------+---------+---------+-----------------------------------------+ +| REQ_UNLINK | WRITE | WRITE | | ++--------------+---------+---------+-----------------------------------------+ +| REQ_UNMOUNT | WRITE | | | ++--------------+---------+---------+-----------------------------------------+ +| REQ_UTIME | READ | READ | | ++--------------+---------+---------+-----------------------------------------+ +| REQ_WRITE | | WRITE | | +-----------------------------------------------------------------------------+ +Table 7: VFS-FS requests locking guarantees + +5 Recovery from driver crashes +VFS can recover from block special file and character special file driver +crashes. 
It can recover to some degree from a crashed File Server (which we +can regard as a driver). + +5.1 Recovery from block drivers crashes +When reading or writing, VFS doesn't communicate with block drivers directly, +but always through a File Server (the root File Server being the default). If +the block driver crashes, the File Server does most of the work of the +recovery procedure. VFS loops through all open block special files that +were handled by this driver and reopens them. After that it sends the new +endpoint to the File Server so it can finish the recovery procedure. Finally, +the File Server will retry pending requests if possible. However, reopening +files can cause the block driver to crash again. When that happens, VFS will +stop the recovery. A driver can return ERESTART to VFS to tell it to retry +a request. VFS retries a request up to an arbitrary maximum of 5 attempts. + +5.2 Recovery from character driver crashes +Character special files are treated differently. Once VFS has found out that +a driver has been restarted, it will stop the current request (if there is +any). It makes no sense to retry requests due to the nature of character +special files. If a character special driver can restart without changing +endpoints, this merely results in the current request (e.g., read, write, or +ioctl) failing and allows the user process to reissue the same request. On +the other hand, if a driver restart causes the driver to change its endpoint +number, all associated file descriptors are marked invalid and subsequent +operations on them will always fail with a bad file descriptor error. + +5.3 Recovery from File Server crashes +At the time of writing we cannot recover from crashed File Servers. When +VFS detects it has to clean up the remnants of a File Server process (i.e., +through an exit(2)), it marks all associated file descriptors as invalid +and cancels ongoing and pending requests to that File Server.
Resources that +were in use by the File Server are cleaned up. + +[0] http://wiki.minix3.org/en/DevelopersGuide/VfsFsProtocol +[1] http://www.cs.vu.nl/~dcvmoole/minix/blockchar.txt +[2] http://www.minix3.org/theses/moolenbroek-multimedia-support.pdf -- 2.44.0