Normally, each RMIB subtree consists of an array of nodes, indexed
by node identifier. In a sparsely filled subtree, most of the array
is empty and just wasting memory. In that case, it may be beneficial
to have a level of indirection, with an intermediate array containing
pairs of node IDs and pointers to the actual nodes. This patch adds
support for such indirection.
For the use cases that inspired this patch, net.inet and net.inet6,
the indirection shaves off a little under 16KB of memory from the
TCP/IP service.
Since the grant table is allocated dynamically, a system service always
runs the risk of running out of memory at run time when trying to
allocate a grant. In order to allow services to mitigate that risk,
grants can now be preallocated, typically at system service startup,
using the new cpf_prealloc(3) libsys function. The function takes a
'count' parameter that indicates the number of additional grants to
preallocate. Thus, the function may be called from multiple submodules
within a service, each preallocating their own maximum of grants that
it may need at run time.
In order to match NetBSD-style imports of external code, the library
has been restructured. The full lwIP source tree is imported, except
for a few .git* files in its root directory, into dist/. The MINIX 3
Makefiles and other custom files are located in lib/. Finally, since
we need to apply a number of small patches to lwIP, these patches are
stored in patches/, in addition to being applied to the lwIP tree.
The currently imported version of lwIP is taken from its master
branch sometime after the 2.0.1 release, specifically git-7ffe5bf.
When performing a restart (CSR0 STOP, STRT), the behavior regarding
the NIC's current RX/TX descriptor ring counters varies between cards:
older LANCE cards do not reset the counters; newer PCnet cards do
reset them; VirtualBox's emulation is once again broken in that it
claims to emulate newer cards but implements the older behavior.
Changing the card's receive mode requires such a restart, and now that
the system can actually change receive modes dynamically as part of
normal network operation, this results in the lance driver breaking
all the time on at least VirtualBox.
Instead of trying to figure out exactly what is going on with the
counters during a restart, we now simply perform a full-blown
reinitialization every time the NIC is restarted. That leaves no
ambiguity regarding the counters, and appears to be what drivers on
other OSes do as well. As a bonus, this approach actually saves code.
This is a driver-breaking update to the netdriver library, which is
used by all network drivers. The aim of this change is to make the
library more compatible with NetBSD, and in particular with various
features that are expected to be supported by the NetBSD userland.
The main changes made by this patch are the following:
- each network driver now has a NetBSD-style short device name;
- drivers are not expected to receive packets right after startup;
- extended support for receipt modes, including multicast lists;
- support for multiple parallel send, receive requests;
- embedding of I/O vectors in send and receive requests;
- support for capabilities, including checksum offloading;
- support for reporting link status updates to the TCP/IP stack;
- support for setting and retrieving media status;
- support for changing the hardware (MAC) address;
- support for NetBSD interface flags IFF_DEBUG, IFF_LINK[0-2];
- support for NetBSD error statistics;
- support for regular time-based ("tick") callbacks.
IMPORTANT: this patch applies a minimal update to the existing drivers
in order to make them work at all with the new netdriver library. It
however does *not* change all drivers to make use of the new features.
In fact, strictly speaking, all drivers are now violating requirements
imposed by the new library in one way or another, most notably by
enabling packet receipt when starting the driver. Changing all the
drivers to be compliant, and to support the newly added options, is
left to future patches. The existing drivers should currently *not*
be taken as examples of how to implement a new network driver!
With that said, a few drivers have already been changed to make use of
some of the new features: fxp, e1000, rtl8139, and rtl8169 now report
link and media status, and the last three of those now support setting
the hardware MAC address on the fly. In addition, dp8390 has been
changed to default to PCI autoconfiguration if no configuration is
specified through environment variables.
Like arp(8), this utility already uses the NetBSD 8 protocol for
talking to the operating system through routing sockets.
Like arp(8), this utility is not fully functional, due to limitations
of lwIP. While ndp(8) should provide a proper (read-only) view of the
contents of the Neighbor Discovery table, any attempts to modify the
table will fail. In addition, various other ndp(8) features are not
supported. On MINIX 3, the prefix and default router lists are not
managed by the operating system however, but rather by dhcpcd(8);
therefore, an implementation of the features related to those lists
would not provide any actual functionality.
For now, printing of Sun RPC requests is disabled because we do not
yet have the RPC header files. This should affect basically noone,
as we do not have any RPC-based programs yet, for the same reason.
As part of this, we import bpf_filter.c from NetBSD. Even though that
file is part of the NetBSD kernel, it is also used by userland (as is
clear here). Our LWIP service has its own bpf_filter.c implementation
but that implementation has certain limits (e.g. on program size) that
are fine for a system service but should not apply to userland.
The libpcap code has a number of blocks guarded by __NetBSD__, but
none of those blocks apply to MINIX 3. In particular, some of the
alignment logic used for NetBSD may in fact not work in our case.
The port could be improved by adding support for pselect(2).
Other than that, this port has a few MINIX-specific changes:
- we undefine IN_IFF_ flags to stop dhcpcd from thinking that we have
operating system support for link-local IPv4 address management;
- we work around one crash bug that seems triggered by using dhcpcd
on some but not all interfaces;
- we add "noalias" to the default dhcpcd.conf(5) configuration file.
Behaviorally this port should already be largely on par with the
NetBSD 8 version, in that it sets the RTF_LLDATA flag on routing
socket requests to indicate that they target link-local data.
Many parts of the arp(8) functionality are currently not yet supported
by the operating system, largely due to lwIP not exposing appropriate
means of implementing them.
The port forces the use of sysctl(7), as obtaining information through
KVM is not and will never be viable. The sysctl mode of netstat(1) is
currently somewhat limited and buggy, though. We fix a few minimal
issues, but more improvements will have to come from NetBSD reimports.
Some of netstat(1)'s views are currently not supported by the
operating system. Later improvements on this point will not require
changes to the imported code, though.
Not all of its functionality is actually implemented in the operating
system. In addition, a few modules (agr, vlan) have been disabled
because we have not imported the necessary headers yet.
With TIOCPKT enabled, each piece of output is preceded by a zero byte
on the PTY master. In addition, a non-zero byte is a flags field
that conveys information about changes on the pseudoterminal. This
patch implements the former, but not the latter. That is enough to
get telnetd(8) going, however. TIOCPKT support may be extended later.
Also retire support for the MINIX versions of /etc/hosts and
/etc/resolv.conf. These files will be brought back with NetBSD
imports, although like NetBSD, MINIX 3 will be using external
resolvers directly from then on. Since resolv.conf is hand-created
rather than installed, we do not mark it as obsolete.
This new implementation of the UDS service is built on top of the
libsockevent library. It thereby inherits all the advantages that
libsockevent brings. However, the fundamental restructuring
required for that change also paved the way for resolution of a
number of other important open issues with the old UDS code. Most
importantly, the rewrite brings the behavior of the service much
closer to POSIX compliance and NetBSD compatibility. These are the
most important changes:
- due to the use of libsockevent, UDS now supports multiple suspending
calls per socket and a large number of standard socket flags and
options;
- socket address matching is now based on <device,inode> lookups
instead of canonized path names, and socket addresses are no longer
altered either due to canonization or at connect time;
- the socket state machine is now well defined, most importantly
resolving the erroneous reset-on-EOF semantics of the old UDS, but
also allowing socket reuse;
- sockets are now connected before being accepted instead of being
held in connecting state, unless the LOCAL_CONNWAIT option is set
on either the connecting or the listening socket;
- connect(2) on datagram sockets is now supported (needed by syslog),
and proper datagram socket disconnect notification is provided;
- the receive queue now supports segmentation, associating ancillary
data (in-flight file descriptors and credentials) with each segment
instead of being kept fully separately; this is a POSIX requirement
(and needed by tmux);
- as part of the segmentation support, the receive queue can now hold
as many packets as can fit, instead of one;
- in addition to the flags supported by libsockevent, the MSG_PEEK,
MSG_WAITALL, MSG_CMSG_CLOEXEC, MSG_TRUNC, and MSG_CTRUNC send and
receive flags are now supported;
- the SO_PASSCRED and SO_PEERCRED socket options are replaced by
LOCAL_CREDS and LOCAL_PEEREID respectively, now following NetBSD
semantics and allowing use of NetBSD libc's getpeereid(3);
- memory usage is reduced by about 250 KB due to centralized in-flight
file descriptor tracking, with a limit of OPEN_MAX total rather than
of OPEN_MAX per socket;
- memory usage is reduced by another ~50 KB due to removal of state
redundancy, despite the fact that socket path names may now be up to
253 bytes rather than the previous 104 bytes;
- compared to the old UDS, there is now very little direct indexing on
the static array of sockets, thus allowing dynamic allocation of
sockets more easily in the future;
- the UDS service now has RMIB support for the net.local sysctl tree,
implementing preliminary support for NetBSD netstat(1).
RMIB: expose full node path; improve restartability
A single function may be used to handle the implementation of more
than one node. In some cases, the behavior of that function may
depend on the path name used to reach the node. Therefore, provide
the full path name as part of the call information.
As a result, RMIB has to save the paths for each of its remote MIB
mount points. That in turn also allows it to autonomously remount its
mount points after a MIB service restart, thus bringing us a step
closer to proper recovery after a MIB crash without requiring the
service using RMIB to perform explicit steps. As before, the missing
ingredient is actual notification of MIB service restarts, and proper
support for *that* will likely require changes to the DS service.
The service-only getepinfo(2) PM call returns information about a
given endpoint. This patch extends that call so that it returns
enough information to allow correctly filling a sockcred structure.
A new getsockcred(3) function is added to libsys to fill an actual
sockcred structure with the obtained information. However, for the
caller's convenience, the groups list is kept separate.
The getnucred() function was used by UDS to obtain credentials of user
processes in a form used in the UDS API, namely the ucred structure.
Since the NetBSD merge, this structure has changed drastically (aside
from being renamed to "uucred"), and it is no longer in UDS's best
interest to use this structure internally. Therefore, getnucred() is
no longer a useful API either, and instead we directly use the
previously private getepinfo() function to obtain credentials.
This patch prepares for moving of the creation of socket files on the
file system from the libc bind(2) stub into the UDS service. This
change is necessary for the socket type agnostic libc implementation.
The change is not yet activated - the code that is not yet used is
enclosed in "#if NOT_YET" blocks. The activation needs to be atomic
with UDS's switch to libsockdriver; otherwise, user applications may
break.
As part of the change, various UDS bind(2) semantics are changed to
match the POSIX standard and other operating systems. In
implementation terms, the service-only VFS API checkperms(2) is
renamed to socketpath(2), and extended with a new subcall which
creates a new socket file. An extension to test56 checks the new
bind(2) semantics of UDS, although most new tests are still disabled
until activation as well.
Finally, as further preparation for a more structural redesign of the
UDS service, also return the <device,inode> number pair for the
created or checked file name, and make returning the canonized path
name optional.
Add libsockevent: a socket event dispatching library
This library provides an event-based abstraction model and dispatching
facility for socket drivers. Its main goal is to eliminate any and
all need for socket drivers to keep track of pending socket calls.
Additionally, this library takes over responsibility of a number of
other tasks that would otherwise be duplicated between socket drivers,
but in such a way that individual socket drivers retain a large degree
of freedom in terms of API behavior. The library's main features are:
- suspension, resumption, and cancellation of socket calls;
- an abstraction layer for select(2);
- state tracking of shutdown(2);
- pending (asynchronous) errors and the SO_ERROR socket option;
- listening-socket tracking and the SO_ACCEPTCONN socket option;
- generation of SIGPIPE signals; SO_NOSIGPIPE, MSG_NOSIGNAL;
- send and receive low-watermark tracking, SO_SNDLOWAT, SO_RCVLOWAT;
- send and receive timeout support and SO_SNDTIMEO, SO_RCVTIMEO;
- an abstraction layer for the SO_LINGER socket option;
- tracking of various on/off socket options as well as SO_TYPE;
- a range of pre-checks on socket calls that are required POSIX.
In order to track per-socket state, the library manages an opaque
"sock" object for each socket. The allocation of such objects is left
entirely to the socket driver. Each sock object has an associated
callback table for calls from libsockevent to the socket driver. The
socket driver can raise events on the sock object in order to flag
that any previously suspended operations of a particular type should
be resumed. The library may defer processing such raised events if
immediate processing could interfere with internal consistency.
The sockevent library is layered on top of libsockdriver, and should
be used by all socket driver implementations if at all possible.
This library provides abstractions for socket drivers, and should be
used as the basis for all socket driver implementations. It provides
the following functionality:
- a function call table abstraction, hiding the details of the
socket driver protocol with simple parameters and presenting the
socket driver with callback functions very similar to the BSD
socket API calls made from userland;
- abstracting data structures and helper functions for suspending
and resuming blocking calls;
- abstracting data structures and helper functions for copying data
from and to the caller.
Overall, the library is similar to lib{block,char,fs,input,net}driver
in concept. Some of the abstractions provided here should in fact be
applied to libchardriver as well. As always, for the case that the
provided message loop is too restrictive, a set of more low-level
message processing functions is provided.
This patch stops a socket driver from using copyfd(2) to copy in a
file descriptor that is a reference to a socket owned by that socket
driver, returning EDEADLK instead. In effect, this will stop deadlock
and resource exhaustion issues with UDS once it has been converted to
a socket driver. See the comment in the patch for details.
This change effectively adds the VFS side of support for the SO_LINGER
socket option, by allowing file descriptor close operations to be
suspended (and later resumed) by socket drivers. Currently, support
is limited to the close(2) system call--in all other cases where file
descriptors are closed (dup2, close-on-exec, process exit..), the
close operation still completes instantly. As a general policy, the
close(2) return value will always indicate that the file descriptor
has been closed: either 0, or -1 with errno set to EINPROGRESS. The
latter error may be thrown only when a suspended close is interrupted
by a signal.
As necessary for UDS, this change also introduces a closenb(2) system
call extension, allowing the caller to bypass blocking SO_LINGER close
behavior. This extension allows UDS to avoid blocking on closing the
last reference to an in-flight file descriptor, in an atomic fashion.
The extension is currently part of libsys, but there is no reason why
userland would not be allowed to make this call, so it is deliberately
not protected from use by userland.
If a select(2) call was issued on a file descriptor for which the file
pointer was closed due to invalidation (FILP_CLOSED), typically as the
result of a character/socket driver dying, the call would previously
return with an error: EINTR upon call entry or EIO on invalidation at
at a later time. Especially the former could severely confuse
applications, which would assume the call was interrupted by a signal,
restart the select call and immediately get EINTR again, ad infinitum.
This patch changes the select(2) semantics such that for closed filps,
the file descriptor is returned as readable and/or writable (depending
on the requested operations), as such letting the entire select call
finish successfully. Applications will then typically attempt to read
from and/or write to the file descriptor, resulting in an I/O error
that they should generally be better equipped to handle.
This patch also fixes a potential problem with returning early from a
select(2) call if a bad file descriptor is given: previously, in such
cases not all actions taken so far would be undone; now they are.
This patch adds the implementation of the BSD socket system calls
which have been introduced in an earlier patch. At the same time, it
adds support for communication with socket drivers, using a new
"socket device" (SDEV_) protocol. These two parts, implemented in
socket.c and sdev.c respectively, form the upper and lower halves of
the new BSD socket support in VFS. New mapping functionality for
socket domains and drivers is added as well, implemented in smap.c.
The rest of the changes mainly facilitate the separation of character
and socket driver calls, and do not make any fundamental alterations.
For example, while this patch changes VFS's select.c rather heavily,
the new select logic for socket drivers is the exact same as for
character drivers; the changes mainly separate the driver type
specific parts from the generic select logic further than before.
This patch introduces the first piece of support for the concept of
"socket drivers": services that implement one or more socket protocol
families. The latter are also known as "domains", as per the first
parameter of the socket(2) API. More specifically, this patch adds
the basic infrastructure for specifying that a particular service is
the socket driver for a set of domains.
Unlike major number mappings for block and character drivers, socket
domain mappings are static. For that reason, they are specified in
system.conf files, using the "domain" keyword. Such a keyword is to
be followed by one or more protocol families, without their "PF_"
prefix. For example, a service with the line "domain INET INET6;"
will be mapped as the socket driver responsible for the AF_INET and
AF_INET6 protocol families.
This patch implements only the infrastructure for creating such
mappings; the actual mapping will be implemented in VFS in a later
patch. The infrastructure is implemented in service(8), RS, and VFS.
For now there is a hardcoded limit of eight domains per socket driver.
This may sound like a lot, but the upcoming new LWIP service will
already use four of those. Also, it is allowed for a service to be
both a block/character driver and a socket driver at the same time,
which is a requirement for the new LWIP service.
The st_blocks field should count 512-byte units, not file system
block units. The previous computation would cause utilities such
as du(1), when used on isofs, to be off by a factor four.
A pair of manual pages were already present in /usr/share/man, but
not yet installed. Install them as well. Lots and lots more from
NetBSD's set of manual pages should be imported, though.
In order to comply with the pkgsrc standards, pkgsrc packages are no
longer auto-started. Instead, we require that users follow the common
pkgsrc procedure: to start a pkgsrc package as part of system startup,
copy its startup script from /usr/pkg/etc/rc.d to /etc/rc.d, and make
the appropriate changes to /etc/rc.conf.
This change affects in particular the openssh package, of which its
ssh daemon is no longer auto-started. However, installing this
package also no longer causes all kinds of Kerberos-related warnings
to be reported at boot time now.
Also remove a leftover reference to the defunct ddekit usb package.
This patch performs an initial import of the infrastructure and a
subset of the NetBSD set of rc startup and shutdown scripts. The
"initial" refers to the fact that this is not yet a full switch to the
NetBSD rc system: the MINIX ramdisk rc script, which (typically) runs
as the first thing, is kept as is. After mounting the root file
system, the ramdisk rc script will start the NetBSD rc infrastructure
by invoking /etc/rc, however. The regular MINIX startup-and-shutdown
script has been moved from /etc/rc to /etc/rc.minix, and is now
invoked as part of the NetBSD rc infrastructure through a bridge rc
script /etc/rc.d/minixrc. /etc/rc.minix invokes /usr/etc/rc as before.
Switching over the ramdisk to the NetBSD system and decomposing the
MINIX rc.minix script into smaller components are left to future work.
Also, the current pkgsrc etc/rc.d auto-start functionality is left as
is, even though it should be removed (see the etc/usr/rc comment).
After processing certain asynchronous requests from VFS, VM would send
an asynchronous reply without supplying the AMF_NOREPLY flag. As a
result, this asynchronous reply could be taken as the result of an
ipc_sendrec() call, causing the entire VM/VFS communication to become
desynchronized. The end result was a deadlock-induced panic during a
later request.
This bug was exposed because of the higher-than-usual concurrency
level in the NetBSD rc scripts. The fix consists of properly setting
the AMF_NOREPLY flag for asynchronous replies.
Performing the update at any later time may cause rc scripts to work
with a wrong date, which may have side effects, such as databse files
getting regenerated on every boot.
This rename is unfortunately necessary because NetBSD has decided to
create its own service(8) utility, and we will want to import theirs
as well. The two can obviously not coexist.
Also move ours from /bin to /sbin, as it is a superuser-only utility.
This change is a long overdue switch-over from the old MINIX set of
user and group accounts to the NetBSD set. This switch-over is
increasingly important now that we are importing more and more
utilities from NetBSD, several of which expect various user accounts
to exist. By switching over in one go, we save ourselves various
headaches in the long run, even if the switch-over itself is a bit
painful for existing MINIX users.
The newly imported master.passwd and group files have three exceptions
compared to their NetBSD originals:
1. There is a custom "service" account for MINIX 3 services. This
account is used to limit run-time privileges of various system
services, and is not used for any files on disk. Its user ID may
be changed later, but should always correspond to whatever the
SERVICE_UID definition is set to.
2. The user "bin" has its shell set to /bin/sh, instead of NetBSD's
/sbin/nologin. The reason for this is that the test set in
/usr/tests/minix-posix will not be able to run otherwise.
3. The group "operator" has been set to group ID 0, to match its old
value. This tweak is purely for transitioning purposes: as of
writing, pkgsrc packages are still using root:operator as owner and
group for most installed files. Sometime later, we can change back
"operator" to group ID 5 without breaking anything, because it does
not appear that this group name is used for anything important.
In order to allow for proper matching of available drivers to system
hardware, the output of this utility should reflect the full details
of the input from configuration files. In particular, that includes
sub-IDs of PCI devices when those have been specified.
This was a MINIX3-specific header file placed outside of the minix/
header subdirectory, with its definitions duplicated in the more
standard minix/sysutil.h header.
Site-local addresses are out, as they are RFC-deprecated and not
supported on MINIX 3 at all. Interface-local and link-local multicast
addresses are in, because they are relevant in the context of a
particular zone ID only.
- POLLRDBAND is reported by select(2) as errorfd, not readfd;
- POLLERR is not the same as errorfd of select(2);
- flags that are not requested should not be returned.
The callback, which was dropped in commit git-842c4ed, allows drivers
to fetch the interrupt status once and save it locally for subsequent
calls to drv_int().
This patch adds strace-like support for a -t command line option,
which causes a timestamp to be printed at the beginning of each line.
If the option is given more than once, the output will also include
microseconds.