This patch prepares for moving of the creation of socket files on the
file system from the libc bind(2) stub into the UDS service. This
change is necessary for the socket type agnostic libc implementation.
The change is not yet activated - the code that is not yet used is
enclosed in "#if NOT_YET" blocks. The activation needs to be atomic
with UDS's switch to libsockdriver; otherwise, user applications may
break.
As part of the change, various UDS bind(2) semantics are changed to
match the POSIX standard and other operating systems. In
implementation terms, the service-only VFS API checkperms(2) is
renamed to socketpath(2), and extended with a new subcall which
creates a new socket file. An extension to test56 checks the new
bind(2) semantics of UDS, although most new tests are still disabled
until activation as well.
Finally, as further preparation for a more structural redesign of the
UDS service, also return the <device,inode> number pair for the
created or checked file name, and make returning the canonized path
name optional.
Add libsockevent: a socket event dispatching library
This library provides an event-based abstraction model and dispatching
facility for socket drivers. Its main goal is to eliminate any and
all need for socket drivers to keep track of pending socket calls.
Additionally, this library takes over responsibility of a number of
other tasks that would otherwise be duplicated between socket drivers,
but in such a way that individual socket drivers retain a large degree
of freedom in terms of API behavior. The library's main features are:
- suspension, resumption, and cancellation of socket calls;
- an abstraction layer for select(2);
- state tracking of shutdown(2);
- pending (asynchronous) errors and the SO_ERROR socket option;
- listening-socket tracking and the SO_ACCEPTCONN socket option;
- generation of SIGPIPE signals; SO_NOSIGPIPE, MSG_NOSIGNAL;
- send and receive low-watermark tracking, SO_SNDLOWAT, SO_RCVLOWAT;
- send and receive timeout support and SO_SNDTIMEO, SO_RCVTIMEO;
- an abstraction layer for the SO_LINGER socket option;
- tracking of various on/off socket options as well as SO_TYPE;
- a range of pre-checks on socket calls that are required POSIX.
In order to track per-socket state, the library manages an opaque
"sock" object for each socket. The allocation of such objects is left
entirely to the socket driver. Each sock object has an associated
callback table for calls from libsockevent to the socket driver. The
socket driver can raise events on the sock object in order to flag
that any previously suspended operations of a particular type should
be resumed. The library may defer processing such raised events if
immediate processing could interfere with internal consistency.
The sockevent library is layered on top of libsockdriver, and should
be used by all socket driver implementations if at all possible.
This library provides abstractions for socket drivers, and should be
used as the basis for all socket driver implementations. It provides
the following functionality:
- a function call table abstraction, hiding the details of the
socket driver protocol with simple parameters and presenting the
socket driver with callback functions very similar to the BSD
socket API calls made from userland;
- abstracting data structures and helper functions for suspending
and resuming blocking calls;
- abstracting data structures and helper functions for copying data
from and to the caller.
Overall, the library is similar to lib{block,char,fs,input,net}driver
in concept. Some of the abstractions provided here should in fact be
applied to libchardriver as well. As always, for the case that the
provided message loop is too restrictive, a set of more low-level
message processing functions is provided.
This patch stops a socket driver from using copyfd(2) to copy in a
file descriptor that is a reference to a socket owned by that socket
driver, returning EDEADLK instead. In effect, this will stop deadlock
and resource exhaustion issues with UDS once it has been converted to
a socket driver. See the comment in the patch for details.
This change effectively adds the VFS side of support for the SO_LINGER
socket option, by allowing file descriptor close operations to be
suspended (and later resumed) by socket drivers. Currently, support
is limited to the close(2) system call--in all other cases where file
descriptors are closed (dup2, close-on-exec, process exit..), the
close operation still completes instantly. As a general policy, the
close(2) return value will always indicate that the file descriptor
has been closed: either 0, or -1 with errno set to EINPROGRESS. The
latter error may be thrown only when a suspended close is interrupted
by a signal.
As necessary for UDS, this change also introduces a closenb(2) system
call extension, allowing the caller to bypass blocking SO_LINGER close
behavior. This extension allows UDS to avoid blocking on closing the
last reference to an in-flight file descriptor, in an atomic fashion.
The extension is currently part of libsys, but there is no reason why
userland would not be allowed to make this call, so it is deliberately
not protected from use by userland.
If a select(2) call was issued on a file descriptor for which the file
pointer was closed due to invalidation (FILP_CLOSED), typically as the
result of a character/socket driver dying, the call would previously
return with an error: EINTR upon call entry or EIO on invalidation at
at a later time. Especially the former could severely confuse
applications, which would assume the call was interrupted by a signal,
restart the select call and immediately get EINTR again, ad infinitum.
This patch changes the select(2) semantics such that for closed filps,
the file descriptor is returned as readable and/or writable (depending
on the requested operations), as such letting the entire select call
finish successfully. Applications will then typically attempt to read
from and/or write to the file descriptor, resulting in an I/O error
that they should generally be better equipped to handle.
This patch also fixes a potential problem with returning early from a
select(2) call if a bad file descriptor is given: previously, in such
cases not all actions taken so far would be undone; now they are.
This patch adds the implementation of the BSD socket system calls
which have been introduced in an earlier patch. At the same time, it
adds support for communication with socket drivers, using a new
"socket device" (SDEV_) protocol. These two parts, implemented in
socket.c and sdev.c respectively, form the upper and lower halves of
the new BSD socket support in VFS. New mapping functionality for
socket domains and drivers is added as well, implemented in smap.c.
The rest of the changes mainly facilitate the separation of character
and socket driver calls, and do not make any fundamental alterations.
For example, while this patch changes VFS's select.c rather heavily,
the new select logic for socket drivers is the exact same as for
character drivers; the changes mainly separate the driver type
specific parts from the generic select logic further than before.
This patch introduces the first piece of support for the concept of
"socket drivers": services that implement one or more socket protocol
families. The latter are also known as "domains", as per the first
parameter of the socket(2) API. More specifically, this patch adds
the basic infrastructure for specifying that a particular service is
the socket driver for a set of domains.
Unlike major number mappings for block and character drivers, socket
domain mappings are static. For that reason, they are specified in
system.conf files, using the "domain" keyword. Such a keyword is to
be followed by one or more protocol families, without their "PF_"
prefix. For example, a service with the line "domain INET INET6;"
will be mapped as the socket driver responsible for the AF_INET and
AF_INET6 protocol families.
This patch implements only the infrastructure for creating such
mappings; the actual mapping will be implemented in VFS in a later
patch. The infrastructure is implemented in service(8), RS, and VFS.
For now there is a hardcoded limit of eight domains per socket driver.
This may sound like a lot, but the upcoming new LWIP service will
already use four of those. Also, it is allowed for a service to be
both a block/character driver and a socket driver at the same time,
which is a requirement for the new LWIP service.
The st_blocks field should count 512-byte units, not file system
block units. The previous computation would cause utilities such
as du(1), when used on isofs, to be off by a factor four.
A pair of manual pages were already present in /usr/share/man, but
not yet installed. Install them as well. Lots and lots more from
NetBSD's set of manual pages should be imported, though.
In order to comply with the pkgsrc standards, pkgsrc packages are no
longer auto-started. Instead, we require that users follow the common
pkgsrc procedure: to start a pkgsrc package as part of system startup,
copy its startup script from /usr/pkg/etc/rc.d to /etc/rc.d, and make
the appropriate changes to /etc/rc.conf.
This change affects in particular the openssh package, of which its
ssh daemon is no longer auto-started. However, installing this
package also no longer causes all kinds of Kerberos-related warnings
to be reported at boot time now.
Also remove a leftover reference to the defunct ddekit usb package.
This patch performs an initial import of the infrastructure and a
subset of the NetBSD set of rc startup and shutdown scripts. The
"initial" refers to the fact that this is not yet a full switch to the
NetBSD rc system: the MINIX ramdisk rc script, which (typically) runs
as the first thing, is kept as is. After mounting the root file
system, the ramdisk rc script will start the NetBSD rc infrastructure
by invoking /etc/rc, however. The regular MINIX startup-and-shutdown
script has been moved from /etc/rc to /etc/rc.minix, and is now
invoked as part of the NetBSD rc infrastructure through a bridge rc
script /etc/rc.d/minixrc. /etc/rc.minix invokes /usr/etc/rc as before.
Switching over the ramdisk to the NetBSD system and decomposing the
MINIX rc.minix script into smaller components are left to future work.
Also, the current pkgsrc etc/rc.d auto-start functionality is left as
is, even though it should be removed (see the etc/usr/rc comment).
After processing certain asynchronous requests from VFS, VM would send
an asynchronous reply without supplying the AMF_NOREPLY flag. As a
result, this asynchronous reply could be taken as the result of an
ipc_sendrec() call, causing the entire VM/VFS communication to become
desynchronized. The end result was a deadlock-induced panic during a
later request.
This bug was exposed because of the higher-than-usual concurrency
level in the NetBSD rc scripts. The fix consists of properly setting
the AMF_NOREPLY flag for asynchronous replies.
Performing the update at any later time may cause rc scripts to work
with a wrong date, which may have side effects, such as databse files
getting regenerated on every boot.
This rename is unfortunately necessary because NetBSD has decided to
create its own service(8) utility, and we will want to import theirs
as well. The two can obviously not coexist.
Also move ours from /bin to /sbin, as it is a superuser-only utility.
This change is a long overdue switch-over from the old MINIX set of
user and group accounts to the NetBSD set. This switch-over is
increasingly important now that we are importing more and more
utilities from NetBSD, several of which expect various user accounts
to exist. By switching over in one go, we save ourselves various
headaches in the long run, even if the switch-over itself is a bit
painful for existing MINIX users.
The newly imported master.passwd and group files have three exceptions
compared to their NetBSD originals:
1. There is a custom "service" account for MINIX 3 services. This
account is used to limit run-time privileges of various system
services, and is not used for any files on disk. Its user ID may
be changed later, but should always correspond to whatever the
SERVICE_UID definition is set to.
2. The user "bin" has its shell set to /bin/sh, instead of NetBSD's
/sbin/nologin. The reason for this is that the test set in
/usr/tests/minix-posix will not be able to run otherwise.
3. The group "operator" has been set to group ID 0, to match its old
value. This tweak is purely for transitioning purposes: as of
writing, pkgsrc packages are still using root:operator as owner and
group for most installed files. Sometime later, we can change back
"operator" to group ID 5 without breaking anything, because it does
not appear that this group name is used for anything important.
In order to allow for proper matching of available drivers to system
hardware, the output of this utility should reflect the full details
of the input from configuration files. In particular, that includes
sub-IDs of PCI devices when those have been specified.
This was a MINIX3-specific header file placed outside of the minix/
header subdirectory, with its definitions duplicated in the more
standard minix/sysutil.h header.
Site-local addresses are out, as they are RFC-deprecated and not
supported on MINIX 3 at all. Interface-local and link-local multicast
addresses are in, because they are relevant in the context of a
particular zone ID only.
- POLLRDBAND is reported by select(2) as errorfd, not readfd;
- POLLERR is not the same as errorfd of select(2);
- flags that are not requested should not be returned.
The callback, which was dropped in commit git-842c4ed, allows drivers
to fetch the interrupt status once and save it locally for subsequent
calls to drv_int().
This patch adds strace-like support for a -t command line option,
which causes a timestamp to be printed at the beginning of each line.
If the option is given more than once, the output will also include
microseconds.
Antoine Leca [Fri, 4 Nov 2016 18:19:14 +0000 (19:19 +0100)]
Fix the process for GNU tools on MINIX
This is a fix over commit a150b26ee803b20080
On a MINIX station, the tools are not usually built and
on a first-time building of the tree, the fetching script
of texinfo was not triggered in some cases. Let force it.
Reported on minix3 googlegroup by Chris Card.
As of change git-87c599d, when processing CLOCK notifications, PM no
longer set the current process pointer 'mp'. That pointer is however
used when delivering signals through check_sig(), to see whether the
current process may deliver a signal to the target process. As a
result, delivering SIGALARM signals used a previous pointer in these
checks, causing alarm signals not to be delivered in some cases.
This patch ensures that alarm signals are again delivered with PM as
current process.
With this patch, it is now possible to generate coverage information
for MINIX3 system services with LLVM. In particular, the system can
be built with MKCOVERAGE=yes, either with a native "make build" or
with crosscompilation. Either way, MKCOVERAGE=yes will build the
MINIX3 system services with coverage profiling support, generating a
.gcno file for each source module. After a reboot it is possible to
obtain runtime coverage data (.gcda files) for individual system
services using gcov-pull(8). The combination of the .gcno and .gcda
files can then be inspected with llvm-cov(1).
For reasons documented in minix.gcov.mk, only system service program
modules are supported for now; system service libraries (libsys etc.)
are not included. Userland programs are not affected by MKCOVERAGE.
The heart of this patch is the libsys code that writes data generated
by the LLVM coverage hooks into a serialized format using the routines
we already had for GCC GCOV. Unfortunately, the new llvm_gcov.c code
is LLVM ABI dependent, and may therefore have to be updated later when
we upgrade LLVM. The current implementation should support all LLVM
versions 3.x with x >= 4.
The rest of this patch is mostly a light cleanup of our existing GCOV
infrastructure, with as most visible change that gcov-pull(8) now
takes a service label string rather than a PID number.
Antoine Leca [Fri, 15 Jul 2016 15:27:36 +0000 (17:27 +0200)]
Adjust .gitignore for MINIX file system
Some files in LLVM have more than Minix-maximum of 60 characters.
Also drop some obsolete stuff, and add obj which are symlinks added
to every directory when using /usr/obj as OBJDIR (hinted in wiki.)
The way these options work is by creating files that contain debugging
symbols and stashing them in a dedicated set. The minix-debug set has
been created for this purpose, but it will probably have to be refined
since it has been tested only with the default options with an i386
cross-build.
LSC: Amended to support many combination of MKDEBUG, MKDEBUGLIB, with
and without X11, for both intel and arm.
Antoine Leca [Mon, 29 Aug 2016 11:48:31 +0000 (13:48 +0200)]
Improve the process for GNU tools
Split the process to fetch GNU tools (until now embedded
within tools/Makefile.gnuhost) into a new Makefile.fetchgnu,
MINIX-specific hence relocated, which is to be also used
to fetch sources even when not building the tools.
Use it for binutils too.
Improve documentation.
Also do not run configure on each run when MKUPDATE=yes
The .WAIT serialization instruction between fetching and other
configure sources was raising a new run of configure at each
compilation. Avoid it by using two rules.
The warnings in test47 seem to be a symptom of a larger problem,
i.e., not an issue with the test set code but rather with the GCC
configuration. Hopefully the switch to LLVM will resolve those.
Ever since a VM allocation strategy change, this test is fully
dysfunctional. It should be repaired and added to the regular
test set, but that will require some work.
libmthread: resolve memory leaks on exception path
If libmthread runs into a memory allocation failure while attempting
to enlarge its thread pool, it does not free up any preliminary
allocations made so far.
All functions prefixed with bdev_ are moved into bdev.c, and those
prefixed with cdev_ are now in cdev.c. The code in both files are
converted to KNF. The little (IOCTL-related) code left in device.c
is also cleaned up but should probably be moved into other existing
source files. This is left to a future patch. In general, VFS is
long overdue for a source code rebalancing, and the patch here is
only a step in the right direction.
Previously, VFS would use various subsets of a number of fproc
structure fields to store state when the process is blocked
(suspended) for various reasons. As a result, there was a fair
amount of abuse of fields, hidden state, and confusion as to
which fields were used with which suspension states.
Instead, the suspension state is now split into per-state
structures, which are then stored in a union. Each of the union's
structures should be accessed only right before, during, and right
after the fp_blocked_on field is set to the corresponding blocking
type. As a result, it is now very clear which fields are in use
at which times, and we even save a bit of memory as a side effect.
Any attempt to use open(2) to open a socket file now fails with
EOPNOTSUPP, as is common and in the process of being standardized.
The behavior and error code is now tested in test56.
Any attempt to open a file of which the type is not known to VFS
(e.g., as a result of bogus file system contents) now fails with EIO.
For now, this is a safety feature, to prevent VFS tripping over such
types in unchecked cases. In the future, a proper VFS code audit
should determine whether we can lift this restriction again, although
it does not seem particularly useful to be able to open files of
unknown types anyway. Another error code may be assigned to this case
later, too.
By now it has become clear that the VFS select code has an unusually
high concentration of bugs, and there is no indication that any form
of convergence to a bug-free state is in sight. Thus, for now, it
may be helpful to be able to dump the contents of the select tables
in order to track down any bugs in the future. Hopefully that will
allow the next bugs to be resolved slightly after than before.
The debug dump can be triggered with "svrctl vfs get print_select".
- it was querying a character or socket device that, at the start of
the select query, was not known to be ready for the requested
operations;
- this device could not be checked immediately, due to another ongoing
query to the same character or socket driver;
- the select query had a timer that triggered before the device could
be checked, thereby changing the select query to non-blocking.
In this situation, a missing flag check would cause the select code to
conclude erroneously that the operations which it flagged for later,
were satisfied. At the same time, the same flag remained set, so that
the select query would continue to wait for that device. This
resulted in a deadlock. The same bug could most likely be triggered
through other scenarios that were even less likely to occur.
This patch fixes the race condition and puts in a hopefully slightly
more informative comment for the affected block of code.
In practice, the bug could be triggered fairly reliably by generating
lots of output in tmux.
Now that clock_t is an unsigned value, we can also allow the system
uptime to wrap. Essentially, instead of using (a <= b) to see if time
a occurs no later than time b, we use (b - a <= CLOCK_MAX / 2). The
latter value does not exist, so instead we add TMRDIFF_MAX for that
purpose.
We must therefore also avoid using values like 0 and LONG_MAX as
special values for absolute times. This patch extends the libtimers
interface so that it no longer uses 0 to indicate "no timeout".
Similarly, TMR_NEVER is now used as special value only when
otherwise a relative time difference would be used. A minix_timer
structure is now considered in use when it has a watchdog function set,
rather than when the absolute expiry time is not TMR_NEVER. A few new
macros in <minix/timers.h> help with timer comparison and obtaining
properties from a minix_timer structure.
This patch also eliminates the union of timer arguments, instead using
the only union element that is only used (the integer). This prevents
potential problems with e.g. live update. The watchdog function
prototype is changed to pass in the argument value rather than a
pointer to the timer structure, since obtaining the argument value was
the only current use of the timer structure anyway. The result is a
somewhat friendlier timers API.
The VFS select code required a few more invasive changes to restrict
the timer value to the new maximum, effectively matching the timer
code in PM. As a side effect, select(2) has been changed to reject
invalid timeout values. That required a change to the test set, which
relied on the previous, erroneous behavior.
Finally, while we're rewriting significant chunks of the timer code
anyway, also covert it to KNF and add a few more explanatory comments.
Antoine Leca [Tue, 2 Aug 2016 21:07:15 +0000 (23:07 +0200)]
Adapt MINIX-specific part of tools/installboot
This is necessary to enable correct compilation of the tools version
of installboot_nbsd(8)when cross-compiling on a system close enough
to MINIX, like NetBSD 7.0.1 for example.
At a point not too far in the future, we will be switching from the
hardcoded MINIX3 implementation of the getifaddrs(3) libc routine to
the proper NetBSD implementation. The latter uses the
net.route.rtable sysctl functionality to obtain its information. In
order make the transition as painless as possible, this patch adds
basic support for that net.route.rtable functionality to INET and
LWIP, using the remote MIB (RMIB) facility.
With this patch, the IPC service is changed to use the new RMIB
facility to register and handle the "kern.ipc" sysctl subtree itself.
The subtree was previously handled by the MIB service directly. This
change improves locality of handling: especially the
kern.ipc.sysvipc_info node has some peculiarities specific to the IPC
service and is therefore better handled there. Also, since the IPC
service is essentially optional to the system, this rearrangement
yields a cleaner situation when the IPC service is not running: in
that case, the MIB service will expose a few basic kern.ipc nodes
indicating that no SysV IPC facilities are present. Those nodes will
be overridden through RMIB when the IPC service is running.
It should be easier to add the remaining (from NetBSD) kern.ipc nodes
as well now.
Test88 is extended with a new subtest that verifies that sysctl-based
information retrieval for semaphore sets works as expected.
MIB/libsys: support for remote MIB (RMIB) subtrees
Most of the nodes in the general sysctl tree will be managed directly
by the MIB service, which obtains the necessary information as needed.
However, in certain cases, it makes more sense to let another service
manage a part of the sysctl tree itself, in order to avoid replicating
part of that other service in the MIB service. This patch adds the
basic support for such delegation: remote services may now register
their own subtrees within the full sysctl tree with the MIB service,
which will then forward any sysctl(2) requests on such subtrees to the
remote services.
The system works much like mounting a file system, but in addition to
support for shadowing an existing node, the MIB service also supports
creating temporary mount point nodes. Each have their own use cases.
A remote "kern.ipc" would use the former, because even when such a
subtree were not mounted, userland would still expect some of its
children to exist and return default values. A remote "net.inet"
would use the latter, as there is no reason to precreate nodes for all
possible supported networking protocols in the MIB "net" subtree.
A standard remote MIB (RMIB) implementation is provided for services
that wish to make use of this functionality. It is essentially a
simplified and somewhat more lightweight version of the MIB service's
internals, and works more or less the same from a programmer's point
of view. The most important difference is the "rmib" prefix instead
of the "mib" prefix. Documentation will hopefully follow later.
Overall, the RMIB functionality should not be used lightly, for
several reasons. First, despite being more lightweight than the MIB
service, the RMIB module still adds substantially to the code
footprint of the containing service. Second, the RMIB protocol not
only adds extra IPC for sysctl(2), but has also not been optimized for
performance in other ways. Third, and most importantly, the RMIB
implementation also several limitations. The main limitation is that
remote MIB subtrees must be fully static. Not only may the user not
create or destroy nodes, the service itself may not either, as this
would clash with the simplified remote node versioning system and
the cached subtree root node child counts. Other limitations exist,
such as the fact that the root of a remote subtree may only be a
node-type node, and a stricter limit on the highest node identifier
of any child in this subtree root (currently 4095).
The current implementation was born out of necessity, and therefore
it leaves several improvements to future work. Most importantly,
support for exit and crash notification is missing, primarily in the
MIB service. This means that remote subtrees may not be cleaned up
immediately, but instead only when the MIB service attempts to talk
to the dead remote service. In addition, if the MIB service itself
crashes, re-registration of remote subtrees is currently left up to
the individual RMIB users. Finally, the MIB service uses synchronous
(sendrec-based) calls to the remote services, which while convenient
may cause cascading service hangs. The underlying protocol is ready
for conversion to an asynchronous implementation already, though.
A new test set, testrmib.sh, tests the basic RMIB functionality. To
this end it uses a test service, rmibtest, and also reuses part of
the existing test87 MIB service test.
Instead, filter it in libc for old networking implementations, as
those do not support sending SIGPIPE to user processes anyway. This
change allows newer socket drivers to implement the flag as per the
specification.