Arne Welzel [Wed, 21 Mar 2018 19:29:58 +0000 (20:29 +0100)]
bsd.own.mk: use -mno-unaligned-access on ARM
Without this option, gcc may emit code accessing unaligned memory. This,
and the fact that SCTLR.A (System Control Register - Alignment Check) is
set to 1 in Minix, causes data aborts when such code is encountered.
This was the cause of #104. The `minix-service' executable caused
unaligned memory accesses calling into getpwnam(). These then trigger
data abort exceptions. On ARM, these were previously forwarded to `vm'
as pagefaults. However, `vm' did not properly handle them, but instead
allocated one page for the faulting address (over and over again) and
then resumed the process at the faulting instruction (over and over
again). This behavior masked the whole story as an OOM.
Below is the assembly version of getpwent.c, in which the unaligned
memory accesses are highlighted...
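As an illustration only (this is not the getpwent.c code, and the
structure and function names below are hypothetical), the following
sketch shows the kind of source construct that can lead to an unaligned
access. The `value' field sits at offset 1, so a compiler that assumes
the hardware tolerates unaligned accesses may load it with a single word
instruction; with -mno-unaligned-access, gcc is expected to fall back to
narrower accesses instead.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical on-disk record; 'packed' is a gcc/clang extension. */
struct record {
	uint8_t  tag;
	uint32_t value;		/* lives at offset 1, i.e. unaligned */
} __attribute__((packed));

/* Read the unaligned field directly through the packed structure. */
static uint32_t
record_value(const struct record *r)
{
	assert(r != NULL);
	return r->value;
}

/* Read the same field from a raw byte buffer by copying it out first. */
static uint32_t
record_value_from_bytes(const unsigned char *buf)
{
	struct record r;

	assert(buf != NULL);
	memcpy(&r, buf, sizeof(r));
	return r.value;
}
```

Both functions return the same value on any target; what differs between
compiler settings is only the instruction sequence used for the load.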
Arne Welzel [Wed, 21 Mar 2018 21:01:17 +0000 (22:01 +0100)]
kernel/arm: do not treat all data aborts as pagefaults
For now, distinguish alignment, translation and permission faults.
The first kind of fault causes the kernel to send SIGBUS to the
process causing the fault, the latter two are forwarded to `vm' as
pagefaults. Previously, any data abort was forwarded to `vm' as
a pagefault, resulting in the hard-to-debug issue #104.
Any unhandled fault status results in a disaster. This seems
better than naively hoping `vm' can do something about it.
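The distinction described above can be sketched as follows. This is not
the kernel code: the names are hypothetical, and the sketch assumes the
ARMv7 short-descriptor layout of the Data Fault Status Register, where
the five-bit fault status is split across bits [3:0] and bit 10.

```c
#include <assert.h>
#include <stdint.h>

enum fault_kind {
	FAULT_ALIGNMENT,
	FAULT_TRANSLATION,
	FAULT_PERMISSION,
	FAULT_UNHANDLED
};

/* Assemble FS[4:0] from DFSR bits [3:0] and bit 10 (assumed layout). */
static unsigned int
dfsr_status(uint32_t dfsr)
{
	return (dfsr & 0x0f) | ((dfsr >> 6) & 0x10);
}

/* Map a fault status onto the three kinds named in the commit message. */
static enum fault_kind
classify_data_abort(uint32_t dfsr)
{
	switch (dfsr_status(dfsr)) {
	case 0x01:			/* alignment fault */
		return FAULT_ALIGNMENT;
	case 0x05: case 0x07:		/* translation fault: section, page */
		return FAULT_TRANSLATION;
	case 0x0d: case 0x0f:		/* permission fault: section, page */
		return FAULT_PERMISSION;
	default:			/* anything else is not handled */
		return FAULT_UNHANDLED;
	}
}

/* Describe what would be done for each kind of fault. */
static const char *
fault_action(enum fault_kind kind)
{
	static const char *const actions[] = {
		"send SIGBUS to the process",
		"forward to vm as pagefault",
		"forward to vm as pagefault",
		"treat as fatal"
	};

	assert((unsigned int)kind <= FAULT_UNHANDLED);
	return actions[kind];
}
```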
I tried to launch Minix3 in Qubes OS. While there is no problem booting
Minix as a qube (in Qubes OS terminology) before 3641562, it fails with
that commit (and after). I didn't dig into the PCI handling, but this
change fixes the problem: Minix now handles the NULL case from
pci_subclass_name.
In particular, remove the hardcoded limit of 4096 entries in a single
directory, as there are (at least) real DVDs out there with more
entries than that. The implementation of this change requires a
second pass on large directories; performance optimizations are left
to future work.
at_wini was previously hardcoded to present ATAPI devices as having a
size of 800 MiB, which was enough for CDs but not for DVDs. This
patch increases the device size to 8500 MiB, which should be large
enough to cover all DVDs.
When possible, network drivers are now started automatically. That
means that netconf(8)'s network driver selection has become obsolete.
This patch changes netconf(8) to allow the user to specify a network
configuration (currently one of DHCP IPv4+IPv6, DHCP IPv4-only,
manual IPv4-only) for any hardware network interfaces that are
currently present.
Selection of network drivers that require manual configuration first
(mainly old ISA cards) is still supported, but now as a special case.
This commit adds a new TCP/IP service to MINIX 3. At its core, the
service uses the lwIP TCP/IP stack for maintenance reasons. The
service aims to be compatible with NetBSD userland, including its
low-level network management utilities. It also aims to support
modern features such as IPv6. In summary, the new LWIP service has
support for the following main features:
- TCP, UDP, RAW sockets with mostly standard BSD API semantics;
- IPv6 support: host mode (complete) and router mode (partial);
- most of the standard BSD API socket options (SO_);
- all of the standard BSD API message flags (MSG_);
- the most used protocol-specific socket and control options;
- a default loopback interface and the ability to create one more;
- configuration-free ethernet interfaces and driver tracking;
- queuing and multiple concurrent requests to each ethernet driver;
- standard ioctl(2)-based BSD interface management;
- radix tree backed, destination-based routing;
- routing sockets for standard BSD route reporting and management;
- multicast traffic and multicast group membership tracking;
- Berkeley Packet Filter (BPF) devices;
- standard and custom sysctl(7) nodes for many internals;
- a slab allocation based, hybrid static/dynamic memory pool model.
Many of its modules come with fairly elaborate comments that cover
many aspects of what is going on. The service is primarily a socket
driver built on top of the libsockdriver library, but for BPF devices
it is at the same time also a character driver.
Normally, each RMIB subtree consists of an array of nodes, indexed
by node identifier. In a sparsely filled subtree, most of the array
is empty and just wasting memory. In that case, it may be beneficial
to have a level of indirection, with an intermediate array containing
pairs of node IDs and pointers to the actual nodes. This patch adds
support for such indirection.
For the use cases that inspired this patch, net.inet and net.inet6,
the indirection shaves off a little under 16KB of memory from the
TCP/IP service.
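The trade-off can be sketched as follows. The structures and function
names are hypothetical and much simpler than the real RMIB ones: a
direct table is indexed by node ID and pays for every unused slot, while
an indirect table stores only the <ID, pointer> pairs that exist and
finds a node by searching them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a MIB node. */
struct node {
	const char *name;	/* NULL marks an unused slot */
};

/* Direct lookup: the node ID is the array index. */
static const struct node *
lookup_direct(const struct node *table, size_t size, unsigned int id)
{
	assert(table != NULL);
	if (id >= size || table[id].name == NULL)
		return NULL;
	return &table[id];
}

/* One entry of the intermediate array: a node ID and its node. */
struct indirect {
	unsigned int id;
	const struct node *node;
};

/* Indirect lookup: search the pairs for a matching node ID. */
static const struct node *
lookup_indirect(const struct indirect *table, size_t count, unsigned int id)
{
	size_t i;

	assert(table != NULL);
	for (i = 0; i < count; i++)
		if (table[i].id == id)
			return table[i].node;
	return NULL;
}
```

With only node IDs 6 and 17 in use, for example, the direct form needs
an 18-slot array, whereas the indirect form needs two pairs.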
Since the grant table is allocated dynamically, a system service always
runs the risk of running out of memory at run time when trying to
allocate a grant. In order to allow services to mitigate that risk,
grants can now be preallocated, typically at system service startup,
using the new cpf_prealloc(3) libsys function. The function takes a
'count' parameter that indicates the number of additional grants to
preallocate. Thus, the function may be called from multiple submodules
within a service, each preallocating the maximum number of grants that
it may need at run time.
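The additive 'count' semantics can be illustrated with a toy pool. This
is not the libsys implementation and does not show the real
cpf_prealloc(3) signature; the pool type and both functions are
hypothetical. The point is only that each call grows the reserve by its
own count, and that later allocations never need to obtain memory.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical pool of preallocated slots, handed out in order. */
struct grant_pool {
	int *slots;
	size_t size;	/* slots reserved so far */
	size_t used;	/* slots already handed out */
};

/* Reserve 'count' additional slots. Returns 0 on success, -1 on failure. */
static int
pool_prealloc(struct grant_pool *p, size_t count)
{
	int *n;

	assert(p != NULL);
	if (count == 0)
		return 0;
	n = realloc(p->slots, (p->size + count) * sizeof(*n));
	if (n == NULL)
		return -1;	/* fails at startup, not at run time */
	p->slots = n;
	p->size += count;
	return 0;
}

/* Hand out one reserved slot; never allocates. Returns -1 when empty. */
static int
pool_alloc(struct grant_pool *p)
{
	assert(p != NULL);
	if (p->used == p->size)
		return -1;
	return (int)p->used++;
}
```

Two submodules that call pool_prealloc() with counts of 2 and 3 leave
the pool with room for five allocations in total.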
In order to match NetBSD-style imports of external code, the library
has been restructured. The full lwIP source tree is imported, except
for a few .git* files in its root directory, into dist/. The MINIX 3
Makefiles and other custom files are located in lib/. Finally, since
we need to apply a number of small patches to lwIP, these patches are
stored in patches/, in addition to being applied to the lwIP tree.
The currently imported version of lwIP is taken from its master
branch sometime after the 2.0.1 release, specifically git-7ffe5bf.
When performing a restart (CSR0 STOP, STRT), the behavior regarding
the NIC's current RX/TX descriptor ring counters varies between cards:
older LANCE cards do not reset the counters; newer PCnet cards do
reset them; VirtualBox's emulation is once again broken in that it
claims to emulate newer cards but implements the older behavior.
Changing the card's receive mode requires such a restart, and now that
the system can actually change receive modes dynamically as part of
normal network operation, this results in the lance driver breaking
all the time on at least VirtualBox.
Instead of trying to figure out exactly what is going on with the
counters during a restart, we now simply perform a full-blown
reinitialization every time the NIC is restarted. That leaves no
ambiguity regarding the counters, and appears to be what drivers on
other OSes do as well. As a bonus, this approach actually saves code.
This is a driver-breaking update to the netdriver library, which is
used by all network drivers. The aim of this change is to make the
library more compatible with NetBSD, and in particular with various
features that are expected to be supported by the NetBSD userland.
The main changes made by this patch are the following:
- each network driver now has a NetBSD-style short device name;
- drivers are not expected to receive packets right after startup;
- extended support for receipt modes, including multicast lists;
- support for multiple parallel send and receive requests;
- embedding of I/O vectors in send and receive requests;
- support for capabilities, including checksum offloading;
- support for reporting link status updates to the TCP/IP stack;
- support for setting and retrieving media status;
- support for changing the hardware (MAC) address;
- support for NetBSD interface flags IFF_DEBUG, IFF_LINK[0-2];
- support for NetBSD error statistics;
- support for regular time-based ("tick") callbacks.
IMPORTANT: this patch applies a minimal update to the existing drivers
in order to make them work at all with the new netdriver library. It
however does *not* change all drivers to make use of the new features.
In fact, strictly speaking, all drivers are now violating requirements
imposed by the new library in one way or another, most notably by
enabling packet receipt when starting the driver. Changing all the
drivers to be compliant, and to support the newly added options, is
left to future patches. The existing drivers should currently *not*
be taken as examples of how to implement a new network driver!
With that said, a few drivers have already been changed to make use of
some of the new features: fxp, e1000, rtl8139, and rtl8169 now report
link and media status, and the last three of those now support setting
the hardware MAC address on the fly. In addition, dp8390 has been
changed to default to PCI autoconfiguration if no configuration is
specified through environment variables.
Like arp(8), this utility already uses the NetBSD 8 protocol for
talking to the operating system through routing sockets.
Like arp(8), this utility is not fully functional, due to limitations
of lwIP. While ndp(8) should provide a proper (read-only) view of the
contents of the Neighbor Discovery table, any attempts to modify the
table will fail. In addition, various other ndp(8) features are not
supported. On MINIX 3, however, the prefix and default router lists are
managed not by the operating system but by dhcpcd(8);
therefore, an implementation of the features related to those lists
would not provide any actual functionality.
For now, printing of Sun RPC requests is disabled because we do not
yet have the RPC header files. This should affect basically no one,
as we do not have any RPC-based programs yet, for the same reason.
As part of this, we import bpf_filter.c from NetBSD. Even though that
file is part of the NetBSD kernel, it is also used by userland (as is
clear here). Our LWIP service has its own bpf_filter.c implementation
but that implementation has certain limits (e.g. on program size) that
are fine for a system service but should not apply to userland.
The libpcap code has a number of blocks guarded by __NetBSD__, but
none of those blocks apply to MINIX 3. In particular, some of the
alignment logic used for NetBSD may in fact not work in our case.
The port could be improved by adding support for pselect(2).
Other than that, this port has a few MINIX-specific changes:
- we undefine IN_IFF_ flags to stop dhcpcd from thinking that we have
operating system support for link-local IPv4 address management;
- we work around one crash bug that seems triggered by using dhcpcd
on some but not all interfaces;
- we add "noalias" to the default dhcpcd.conf(5) configuration file.
Behaviorally this port should already be largely on par with the
NetBSD 8 version, in that it sets the RTF_LLDATA flag on routing
socket requests to indicate that they target link-local data.
Many parts of the arp(8) functionality are currently not yet supported
by the operating system, largely due to lwIP not exposing appropriate
means of implementing them.
The port forces the use of sysctl(7), as obtaining information through
KVM is not and will never be viable. The sysctl mode of netstat(1) is
currently somewhat limited and buggy, though. We fix a few minimal
issues, but more improvements will have to come from NetBSD reimports.
Some of netstat(1)'s views are currently not supported by the
operating system. Later improvements on this point will not require
changes to the imported code, though.
Not all of its functionality is actually implemented in the operating
system. In addition, a few modules (agr, vlan) have been disabled
because we have not imported the necessary headers yet.
With TIOCPKT enabled, each piece of output is preceded by a zero byte
on the PTY master. In addition, a non-zero byte is a flags field
that conveys information about changes on the pseudoterminal. This
patch implements the former, but not the latter. That is enough to
get telnetd(8) going, however. TIOCPKT support may be extended later.
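From the point of view of a program reading the PTY master, the framing
described above can be handled roughly as follows. This is a sketch with
a hypothetical helper name, not code from telnetd(8): it treats a chunk
whose first byte is zero as output data and anything else as a flags
byte without payload.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Given one chunk of 'n' bytes read from a PTY master in packet mode,
 * return a pointer to the output data and store its length in '*len'.
 * Return NULL (and a length of zero) if the chunk carries no data, that
 * is, if it is empty or starts with a non-zero flags byte.
 */
static const unsigned char *
pkt_payload(const unsigned char *buf, size_t n, size_t *len)
{
	assert(buf != NULL && len != NULL);
	if (n == 0 || buf[0] != 0) {
		*len = 0;
		return NULL;
	}
	*len = n - 1;
	return buf + 1;
}
```

Since this patch implements only the data case, a reader on MINIX 3
would currently never see a chunk for which the helper returns NULL due
to a flags byte.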
Also retire support for the MINIX versions of /etc/hosts and
/etc/resolv.conf. These files will be brought back with NetBSD
imports, although like NetBSD, MINIX 3 will be using external
resolvers directly from then on. Since resolv.conf is hand-created
rather than installed, we do not mark it as obsolete.
This new implementation of the UDS service is built on top of the
libsockevent library. It thereby inherits all the advantages that
libsockevent brings. However, the fundamental restructuring
required for that change also paved the way for resolution of a
number of other important open issues with the old UDS code. Most
importantly, the rewrite brings the behavior of the service much
closer to POSIX compliance and NetBSD compatibility. These are the
most important changes:
- due to the use of libsockevent, UDS now supports multiple suspending
calls per socket and a large number of standard socket flags and
options;
- socket address matching is now based on <device,inode> lookups
instead of canonized path names, and socket addresses are no longer
altered either due to canonization or at connect time;
- the socket state machine is now well defined, most importantly
resolving the erroneous reset-on-EOF semantics of the old UDS, but
also allowing socket reuse;
- sockets are now connected before being accepted instead of being
held in connecting state, unless the LOCAL_CONNWAIT option is set
on either the connecting or the listening socket;
- connect(2) on datagram sockets is now supported (needed by syslog),
and proper datagram socket disconnect notification is provided;
- the receive queue now supports segmentation, associating ancillary
data (in-flight file descriptors and credentials) with each segment
instead of keeping it fully separate; this is a POSIX requirement
(and needed by tmux);
- as part of the segmentation support, the receive queue can now hold
as many packets as can fit, instead of one;
- in addition to the flags supported by libsockevent, the MSG_PEEK,
MSG_WAITALL, MSG_CMSG_CLOEXEC, MSG_TRUNC, and MSG_CTRUNC send and
receive flags are now supported;
- the SO_PASSCRED and SO_PEERCRED socket options are replaced by
LOCAL_CREDS and LOCAL_PEEREID respectively, now following NetBSD
semantics and allowing use of NetBSD libc's getpeereid(3);
- memory usage is reduced by about 250 KB due to centralized in-flight
file descriptor tracking, with a limit of OPEN_MAX total rather than
of OPEN_MAX per socket;
- memory usage is reduced by another ~50 KB due to removal of state
redundancy, despite the fact that socket path names may now be up to
253 bytes rather than the previous 104 bytes;
- compared to the old UDS, there is now very little direct indexing on
the static array of sockets, thus allowing dynamic allocation of
sockets more easily in the future;
- the UDS service now has RMIB support for the net.local sysctl tree,
implementing preliminary support for NetBSD netstat(1).
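The <device,inode> matching mentioned in the list above can be
illustrated from userland with stat(2). This is only an analogy, not the
UDS code, and the function name is hypothetical: two path names refer to
the same file exactly when both their device and inode numbers match,
regardless of how the paths are spelled.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * Return 1 if both paths name the same file, 0 if they name different
 * files, and -1 if either path cannot be examined.
 */
static int
same_file(const char *a, const char *b)
{
	struct stat sa, sb;

	assert(a != NULL && b != NULL);
	if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
		return -1;
	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

Compared to matching on canonized path names, this needs no string
rewriting, which fits the statement above that socket addresses are no
longer altered.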
RMIB: expose full node path; improve restartability
A single function may be used to handle the implementation of more
than one node. In some cases, the behavior of that function may
depend on the path name used to reach the node. Therefore, provide
the full path name as part of the call information.
As a result, RMIB has to save the paths for each of its remote MIB
mount points. That in turn also allows it to autonomously remount its
mount points after a MIB service restart, thus bringing us a step
closer to proper recovery after a MIB crash without requiring the
service using RMIB to perform explicit steps. As before, the missing
ingredient is actual notification of MIB service restarts, and proper
support for *that* will likely require changes to the DS service.
The service-only getepinfo(2) PM call returns information about a
given endpoint. This patch extends that call so that it returns
enough information to allow correctly filling a sockcred structure.
A new getsockcred(3) function is added to libsys to fill an actual
sockcred structure with the obtained information. However, for the
caller's convenience, the groups list is kept separate.
The getnucred() function was used by UDS to obtain credentials of user
processes in a form used in the UDS API, namely the ucred structure.
Since the NetBSD merge, this structure has changed drastically (aside
from being renamed to "uucred"), and it is no longer in UDS's best
interest to use this structure internally. Therefore, getnucred() is
no longer a useful API either, and instead we directly use the
previously private getepinfo() function to obtain credentials.
This patch prepares for moving the creation of socket files on the
file system from the libc bind(2) stub into the UDS service. This
change is necessary for the socket type agnostic libc implementation.
The change is not yet activated - the code that is not yet used is
enclosed in "#if NOT_YET" blocks. The activation needs to be atomic
with UDS's switch to libsockdriver; otherwise, user applications may
break.
As part of the change, various UDS bind(2) semantics are changed to
match the POSIX standard and other operating systems. In
implementation terms, the service-only VFS API checkperms(2) is
renamed to socketpath(2), and extended with a new subcall which
creates a new socket file. An extension to test56 checks the new
bind(2) semantics of UDS, although most new tests are still disabled
until activation as well.
Finally, as further preparation for a more structural redesign of the
UDS service, also return the <device,inode> number pair for the
created or checked file name, and make returning the canonized path
name optional.