At a point not too far in the future, we will be switching from the
hardcoded MINIX3 implementation of the getifaddrs(3) libc routine to
the proper NetBSD implementation. The latter uses the
net.route.rtable sysctl functionality to obtain its information. In
order make the transition as painless as possible, this patch adds
basic support for that net.route.rtable functionality to INET and
LWIP, using the remote MIB (RMIB) facility.
With this patch, the IPC service is changed to use the new RMIB
facility to register and handle the "kern.ipc" sysctl subtree itself.
The subtree was previously handled by the MIB service directly. This
change improves locality of handling: especially the
kern.ipc.sysvipc_info node has some peculiarities specific to the IPC
service and is therefore better handled there. Also, since the IPC
service is essentially optional to the system, this rearrangement
yields a cleaner situation when the IPC service is not running: in
that case, the MIB service will expose a few basic kern.ipc nodes
indicating that no SysV IPC facilities are present. Those nodes will
be overridden through RMIB when the IPC service is running.
It should be easier to add the remaining (from NetBSD) kern.ipc nodes
as well now.
Test88 is extended with a new subtest that verifies that sysctl-based
information retrieval for semaphore sets works as expected.
MIB/libsys: support for remote MIB (RMIB) subtrees
Most of the nodes in the general sysctl tree will be managed directly
by the MIB service, which obtains the necessary information as needed.
However, in certain cases, it makes more sense to let another service
manage a part of the sysctl tree itself, in order to avoid replicating
part of that other service in the MIB service. This patch adds the
basic support for such delegation: remote services may now register
their own subtrees within the full sysctl tree with the MIB service,
which will then forward any sysctl(2) requests on such subtrees to the
remote services.
The system works much like mounting a file system, but in addition to
support for shadowing an existing node, the MIB service also supports
creating temporary mount point nodes. Each have their own use cases.
A remote "kern.ipc" would use the former, because even when such a
subtree were not mounted, userland would still expect some of its
children to exist and return default values. A remote "net.inet"
would use the latter, as there is no reason to precreate nodes for all
possible supported networking protocols in the MIB "net" subtree.
A standard remote MIB (RMIB) implementation is provided for services
that wish to make use of this functionality. It is essentially a
simplified and somewhat more lightweight version of the MIB service's
internals, and works more or less the same from a programmer's point
of view. The most important difference is the "rmib" prefix instead
of the "mib" prefix. Documentation will hopefully follow later.
Overall, the RMIB functionality should not be used lightly, for
several reasons. First, despite being more lightweight than the MIB
service, the RMIB module still adds substantially to the code
footprint of the containing service. Second, the RMIB protocol not
only adds extra IPC for sysctl(2), but has also not been optimized for
performance in other ways. Third, and most importantly, the RMIB
implementation also several limitations. The main limitation is that
remote MIB subtrees must be fully static. Not only may the user not
create or destroy nodes, the service itself may not either, as this
would clash with the simplified remote node versioning system and
the cached subtree root node child counts. Other limitations exist,
such as the fact that the root of a remote subtree may only be a
node-type node, and a stricter limit on the highest node identifier
of any child in this subtree root (currently 4095).
The current implementation was born out of necessity, and therefore
it leaves several improvements to future work. Most importantly,
support for exit and crash notification is missing, primarily in the
MIB service. This means that remote subtrees may not be cleaned up
immediately, but instead only when the MIB service attempts to talk
to the dead remote service. In addition, if the MIB service itself
crashes, re-registration of remote subtrees is currently left up to
the individual RMIB users. Finally, the MIB service uses synchronous
(sendrec-based) calls to the remote services, which while convenient
may cause cascading service hangs. The underlying protocol is ready
for conversion to an asynchronous implementation already, though.
A new test set, testrmib.sh, tests the basic RMIB functionality. To
this end it uses a test service, rmibtest, and also reuses part of
the existing test87 MIB service test.
Instead, filter it in libc for old networking implementations, as
those do not support sending SIGPIPE to user processes anyway. This
change allows newer socket drivers to implement the flag as per the
specification.
We do not support any PF functionality, nor do we intend to. However,
some NetBSD utilities rely on the presence of these files. Not all of
the files are installed. The NetBSD source seems rather inconsistent
in where from to include these files. We simply follow what NetBSD
does, though.
While still a small subset of the NetBSD headers, this set should
allow various additional utilities to be compiled without too many
MINIX3-specific changes (even if those utilities will not yet work).
While MFS failing to do I/O on a block is generally fatal, reading
the superblock at mount time is an exception: this case may occur
when the given partition is too small to contain the superblock.
Therefore, MFS should not crash or even report anything in this
case, but rather refuse to mount cleanly.
SEF: identity transfer only after controlled crash
Transparent (endpoint-preserving) restarts with identity transfer
are meant to exercise the crash recovery system only. After *real*
crashes, such restarts are useless at best and dangerous at worst,
because no state integrity can be guaranteed afterwards. Thus,
except after a controlled crash, it is best not to perform such
restarts at all. This patch changes SEF such that identity transfer
is successful only if the old process was the subject of a crash
induced through "service fi". As a result, testrelpol.sh should
continue to be able to use identity transfers for testing purposes,
but any real crash will be handled more appropriately.
The new asserts from git-29e004d exposed an issue in how VFS handles
aborting file system (FS) requests that are queued for a FS (as
opposed to sent to it) when that FS crashes. In that scenario, the
queued worker has its w_task set to NONE, because there is no ongoing
communication. However, worker_stop() is called on it regardless,
which used to abort the request only if w_task was not set to NONE,
leading to an improperly aborted request, a warning, and a VFS crash a
bit later. This patch changes worker_stop() so that w_task need not
be set to a valid endpoint for FS requests to be properly aborted.
Scripts for generating boot-to-ramdisk images are now available. These
can be used for example to boot from PXE or from a USB stick, as the
ramdisk are self-contained and do not rely on any block devices after
being loaded into RAM.
The image generation framework has also been slightly cleaned up in
order to better accomodate tarball sets bundling in images.
Some functions in lib/libc/net were disabled on MINIX3 only, but with
a few added header files they build just fine, even though some of
them rely on system functionality that has not yet been implemented.
Since the functionality is unlikely to be used in practice (because
it typically requires the use of protocol families that themselves are
not yet supported, such as IPv6), already enabling it right now helps
in building packages that rely on the functionality being present at
compile time, while not posing any practical risk of breaking the same
packages at run time.
This patch aims to synchronize the basic process user and group ID
management, as well as the set[ug]id(2) and sete[ug]id(2) behavior,
with NetBSD. As it turns out, the main issue was missing support for
saved user and group IDs. This support is now added.
Since NetBSD's userland, which we are importing, may rely on NetBSD
specifics when it comes to security, we choose not to deviate from
NetBSD's behavior in any way here. A new test, test89, verifies the
correct behavior - it has been confirmed to pass on NetBSD as is.
Commit git-c38dbb9 inadvertently broke local MINIX3-on-MINIX3 builds,
since its libc changes relied on VFS being upgraded already as well.
As a result, after installing the new libc, networking ceased to work,
leading to curl(1) failing later on in the build process. This patch
introduces transitional code that is necessary for the build process
to complete, after which it is obsolete again.
For a reason currently unknown to us, the qemu-linaro emulator
sometimes produces a Prefetch Abort exception with a fault location
(IFAR) rather different from the location of the instruction being
executed (LR corrected by 4). So far it has been observed in the
__udivmodsi4 routine of various processes, where the fault address is
for the first byte of the next page after the current instruction,
which itself is 44-64 bytes away from the start of that next page.
The affected instruction does not perform any sort of memory access.
Short of debugging qemu-linaro itself, we have no choice but to
disable the assert that previously went off in case the IFAR and
corrected LR are not equal. Since we have not yet observed this case
on actual hardware, the kernel prints a warning when detecting such a
mismatch for the first time. For the qemu-linaro case, the kernel's
actual page fault handling logic already handles this strange case
just fine.
- fix the reinstallation (preserve-/home) option;
- remove support for just reinstalling the bootloader, as the main
purpose of this option (allowing an upgrade from the old MINIX
boot monitor to the NetBSD bootloader) is no longer needed and was
already broken;
- do not try to copy over /etc/motd.install: it no longer exists.
Currently, the BSD socket API is implemented in libc, translating the
API calls to character driver operations underneath. This approach
has several issues:
- it is inefficient, as most character driver operations are specific
to the socket type, thus requiring that each operation start by
bruteforcing the socket protocol family and type of the given file
descriptor using several system calls;
- it requires that libc itself be changed every time system support
for a new protocol is added;
- various parts of the libc implementations violate the asynchronous
signal safety POSIX requirements.
In order to resolve all these issues at once, the plan is to turn the
BSD socket calls into system calls, thus making the BSD socket API the
"native" ABI, removing the complexity from libc and instead letting
VFS deal with the socket calls.
The overall change is going to break all networking functionality. In
order to smoothen the transition, this patch introduces the fifteen
new BSD socket system calls, and makes libc try these first before
falling back on the old behavior. For now, the VFS implementations of
the new calls fail such that libc will always use the fallback cases.
Later on, when we introduce the actual implementation of the native
BSD socket calls, all statically linked programs will automatically
use the new ABI, thus limiting actual application breakage.
In other words: by itself, this patch does nothing, except add a bit
of transitional overhead that will disappear in the future. The
largest part of the patch is concerned with adding full support for
the new BSD socket system calls to trace(1) - this early addition has
the advantage of making system call tracing output of several socket
calls much more readable already.
Both the system call interfaces and the trace(1) support have already
been tested using code that will be committed later on.
There is no reason to use a single message for nonoverlapping requests
and replies combined, and in fact splitting them out allows reuse of
messages and avoids various problems with field layouts. Since the
upcoming socketpair(2) system call will be using the same reply as
pipe2(2), split up the single message used for the latter. In order
to keep the used parts of messages at the front, start a transitional
phase to move the pipe(2) flags field to the front of its request.
Previously, the libc sendto(3) and recvfrom(3) implementations would
blindly assume that any unrecognized socket is a raw-IP socket. This
is not only inconsistent but also messes with returned error codes.
Lionel Sambuc [Sun, 7 Feb 2016 21:13:15 +0000 (22:13 +0100)]
Cross-compilation fixes:
- GCC >4.9:
Newer toolchains trigger more warnings, which by default are
treated as errors. Disable this while building the tools as this
will be a recurring problem in the future.
close #105
- FreeBSD: Fix some fetch.sh scripts which fails as FreeBSD's patch
fails to patch two .info files. We ignore this for the time being.
Lionel Sambuc [Sat, 6 Feb 2016 11:38:52 +0000 (12:38 +0100)]
Fix usage of parenthesis in Makefiles
While BSD make support both $() and ${} around variables, the NetBSD
source tree uses only ${} by convention.
Imported software is left as is, and sometimes $() is used when the
containing Makefile/Makefile fragment is used both by GNU make and BSD
make, as it can happen for the tools, and other parts as well which are
compiled using the host make tool.
The NetBSD bootloader attempts to load NetBSD kernel modules for
"unusual" file systems. We do not support NetBSD kernel modules,
and thus, the bootloader gives us warnings about not being able
to load them, in particular when booting CD images. This patch
disables the NetBSD file system module autoload feature.
- test3: support running the test set from a pseudoterminal;
- test60: fix number conversion bug that caused chmod errors;
- test65: remove nonworking package installation instructions;
- testisofs: work around failure due to having a timezone set;
- testisofs: exclude extra RR_MOVED directory from output.
Thomas Cort [Tue, 22 Dec 2015 03:07:01 +0000 (03:07 +0000)]
mined: fix buffer overflow in input()
input() is used to accept filenames when saving, regular
expressions when searching, and other input. It writes
the characters into buffers such as file and exp_buf and
others which are of length LINE_LEN.
To prevent writing beyond the end of the intended buffer,
truncate the input at LINE_LEN - 1 and ensure that the
string is NULL terminated.
In order to resolve page faults on file-mapped pages, VM may need to
communicate (through VFS) with a file system. The file system must
therefore not be the one to cause, and thus end up being blocked on,
such page faults. To resolve this potential deadlock, the safecopy
system was previously extended with the CPF_TRY flag, which causes the
kernel to return EFAULT to the caller of a safecopy function upon
getting a pagefault, bypassing VM and thus avoiding the loop. VFS was
extended to repeat relevant file system calls that returned EFAULT,
after resolving the page fault, to keep these soft faults from being
exposed to applications.
However, general UNIX I/O semantics dictate that if an I/O transfer
partially succeeded before running into a failure, the partial result
is to be returned. Proper file system implementations may therefore
end up returning partial success rather than the EFAULT code resulting
from a soft fault. Since VFS does not get the EFAULT code in this
case, it does not know that a soft fault occurred, and thus does not
repeat the call either. The end result is that an application may get
partial I/O results (e.g., a short read(2)) even on regular files.
Applications cannot reasonably be expected to deal with this.
Due to the fact that most of the current file system implementations
do not implement proper partial-failure semantics, this problem is not
yet widespread. In fact, it has only occurred on direct block device
I/O so far. However, the next generation of file system services will
be implementing proper I/O semantics, thus exacerbating the problem.
To remedy this situation, this patch changes the CPF_TRY semantics:
whenever the kernel experiences a soft fault during a safecopy call,
in addition to returning FAULT, the kernel also stores a mark in the
grant created with CPF_TRY. Instead of testing on EFAULT, VFS checks
whether the grant was marked, as part of revoking the grant. If the
grant was indeed marked by the kernel, VFS repeats the file system
operation, regardless of its initial return value. Thus, the EFAULT
code now only serves to make the file system fail the call faster.
The approach is currently supported for both direct and magic grants,
but is used only with magic grants - arguably the only case where it
makes sense. Indirect grants should not have CPF_TRY set; in a chain
of indirect grants, the original grant is marked, as it should be.
In order to avoid potential SMP issues, the mark stored in the grant
is its grant identifier, so as to discard outdated kernel writes.
Whether this is necessary or effective remains to be evaluated.
This patch also cleans up the grant structure a bit, removing reserved
space and thus making the structure slightly smaller. The structure
is used internally between system services only, so there is no need
for binary compatibility.
With this change, obtaining an existing free grant is no longer an
operation of O(n) complexity. As a result, the now-deprecated
getgrant/setgrant part of the grants API also no longer has a
performance advantage.
The memory grant identifier for safecopies now includes a sequence
number in its upper bits, to prevent accidental reuse of a grant ID
after revocation and subsequent reallocation. This should increase
overall system robustness by a tiny amount, and possibly help catch
bugs in system services early on. For now, the lower 20 bits of the
grant ID are used as grant table slot index (thus allowing for up to
a million grants per process), and the next 11 bits of the (signed
32-bit) grant ID are used to store the per-slot sequence number. As
grant IDs are never exposed to userland, the split can be changed
later on without breaking the userland ABI.
Apply the x86 overflow check from git-d09f72c to ARM code as well.
Not just stack traces, but also system services can trigger this
case, possibly as a result of being handed bad pointers by userland,
ending in a kernel panic.
Changed all K&R style functions to ANSI-style declarations within the
kernel directory. The code compiles and aparently works for i386. For
arm my toolchain does not work, but I have changed the code with great
care. Also, the make command fails for the test suite. Therefore, I
strongly recommand to review the code with care.
Edited by David van Moolenbroek to convert really all K&R functions.
MIB: add support for System V IPC information node
The kernel.ipc.sysvipc_info node is the gateway from NetBSD ipcs(1)
and ipcrm(1) to the IPC server, and thus necessary for a clean
import of these two utilities. The MIB service implementation uses
the preexisting (Linux-specific) information calls on the IPC server
to obtain the information.
As mentioned in previous patches, services may not subscribe to
process events from specific processes only, since this results in
race conditions. However, the IPC server can safely turn on and off
its entire subscription based on whether any System V IPC semaphores
(and, in the future, message queues) are allocated at all. Since
the System V IPC facilities are not so commonly used, this removes
the extra round trip from PM to the IPC server and back for caught
signals and process exits in the common case.
- rewrite the semop(2) implementation so that it now conforms to the
specification, including atomicity, support for blocking more than
once, range checks, but also basic fairness support;
- fix permissions checking;
- fix missing time adjustments;
- fix off-by-one errors and other bugs;
- do not allocate dynamic memory for GETALL/SETALL;
- add test88, which properly tests the semaphore functionality.
PM: generic process event publish/subscribe system
Now that there are services other than PM and VFS that implement
userland system calls directly, these services may need to know about
events related to user processes. In particular, signal delivery may
have to interrupt blocking system calls, and certain cleanup tasks may
have to be performed after a user process exits.
This patch aims to implement a generic, lasting solution for this
problem, by allowing services to subscribe to "signal delivered"
and/or "process exit" events from PM. PM publishes such events by
sending messages to its subscribed services, which must then reply an
acknowledgment message.
For now, only the two aforementioned events are implemented, and only
the IPC service makes use of the process event facility.
The new process event publish/subscribe system replaces the previous
VM notify-sig/watch-exit/query-exit system, which was unsound: 1) it
allowed subscription to events from individual processes, and suffered
from fundamental race conditions as a result; 2) it relied on "not too
many" processes making use of the IPC server functionality in order to
avoid loss of notifications. In addition, it had the "ipc" process
name hardcoded, did not distinguish between signal delivery and exits,
and added a roundtrip to VM for all events from all processes.
Closer to KNF, better coding practices, more similar to other
services, no more global variables, a few more comments, that
kind of stuff. No major functional changes.
- switch to the NetBSD identifier system; it is not only better, but
also required for porting NetBSD ipcs(1) and ipcrm(1); however, it
requires that slots not be moved, and that results in some changes;
- synchronize some other things with NetBSD: where keys are kept, as
well as various non-permission mode flags;
- fix semctl(2) vararg retrieval and message field type;
- use SUSPEND instead of weird reply exceptions in the call table;
- fix several memory leaks and at least one missing permission check;
- improve the atomicity of semop(2) by a small amount, even though
its atomicity is still broken at a fundamental level;
- use the new cheaper way to retrieve the current time;
- resolve all level-5 LLVM warnings.
Specifically, add support for the IPC_INFO, SEM_INFO, and SEM_STAT
semctl(2) operations, similar to how information about shared memory
is already exposed as well. The MINIX3 ipcs(1) utility already had
support for these operations, and can now actually use them, too.
- About 80% of PM's process table consisted of per-signal sigaction
structures. This is information not used by the MIB service, and
can safely be stored outside the main process table.
- The MIB service does not need most of the VFS process table, so VFS
now generates a "light" version of its table upon request, with just
the fields used by the MIB service.
The result is a size reduction of the MIB service of about 700KB.
Instead of pulling in process tables itself, ProcFS now queries the
MIB service for process information. This reduces ProcFS's memory
usage by about 1MB. The change does have two negative consequences.
First, getting all the original /proc/<pid>/psinfo fields filled in
would take a lot of extra effort. Since the only program that uses
those files at all is mtop(1), we reformat psinfo to expose only the
information used by mtop(1). This means that with this patch, older
copies of MINIX3 ps and top will cease to work.
Second, since both MIB and ProcFS update their own view of the
process list only once per clock tick, ProcFS' view may now be
outdated by up to two clock ticks. This is unlikely to pose a
problem in practice.
Due to differences in (mainly) measuring and accumulating CPU times,
the two top programs end up serving different purposes: the NetBSD
top is a system administration tool, while the MINIX3 top (now mtop)
is a performance debugging tool. Therefore, we keep both.
The newly imported BSD top has a few MINIX3-specific changes. CPU
statistics separate system time from kernel time, rather than kernel
time from time spent on handling interrupts. Memory statistics show
numbers that are currently relevant for MINIX3. Swap statistics are
disabled entirely. All of these changes effectively bring it closer
to how mtop already worked as well.
No changes except for one cosmetic adjustment: NetBSD has chosen to
rename the standard TT column to TTY and not shorten tty names; we
undo those changes, making ps(1) behave more in accordance with the
specification and its manual page, and, most importantly for us, not
use an incredibly wide TTY column to print "console".
Adapt libc devname(3) to make use of it, so that such device name
queries are now several orders of magnitude faster. The database
is created and updated at system bootup time.
Imported with no changes, but not all parts are expected to be
functional. The libc nlist functionality is enabled for the
purpose of successful linking, although the nlist functionaly has
not been tested on MINIX3 nor is it needed for how we use libkvm.
In terms of function calls: kvm_getproc2, kvm_getargv2,
kvm_getenvv2, and kvm_getlwps are expected to work, whereas
kvm_getproc, kvm_getargv, kvm_getenvv, and kvm_getfiles are not.
Now that uname(3) uses sysctl(2), we no longer need sysuname(2).
Backward compatibility is retained for old statically linked
binaries for a short while.
Also remove the now-obsolete MINIX3-specific "arch" field from the
utsname structure. While this is an ABI break at the libc level,
it should pose no problems in practice, because:
- statically linked programs (i.e., all of the base system) are not
affected, as they will use headers synchronized with libc;
- the structure is getting smaller, thus, older dynamically linked
programs (typically in pkgsrc) using the new libc will end up with
garbage in the "arch" field, but it is unlikely they will use this
field anyway, since it was specific to MINIX3;
- new dynamically linked programs using an old libc could end up with
memory corruption, but this is not a scenario that is expected to
occur in the first place - certainly not with programs from pkgsrc.
PM uses its own process table entry as source for kernel signals,
and temporarily changes its own process group to make the signals
arrive at the right processes. However, the value is never reset,
with as result that the temporary value shows up in ps(1) output.
So far, VM reported only the number of bytes actually allocated to
each process. This patch adds two additional fields: the sum of the
byte sizes of all the virtual address ranges in the process, and that
number minus the part of the process stack that is not actually
mapped in. Unfortunately, we have to guess where the process stack
is, so the second field is not necessarily accurate.
This functionality is required for BSD top(1), as exposed through
the CTL_KERN KERN_CP_TIME sysctl(2) call. The idea is that the
overall time spent in the system is divided into five categories.
While NetBSD uses a separate category for the kernel ("system") and
interrupts, we redefine "system" to mean userspace system services
and "interrupts" to mean time spent in the kernel, thereby providing
the same categories as MINIX3's own top(1), while adding the "nice"
category which, like on NetBSD, is used for time spent by processes
with a priority lowered by the system administrator.
The new MIB service implements the sysctl(2) system call which, as
we adopt more NetBSD code, is an increasingly important part of the
operating system API. The system call is implemented in the new
service rather than as part of an existing service, because it will
eventually call into many other services in order to gather data,
similar to ProcFS. Since the sysctl(2) functionality is used even
by init(8), the MIB service is added to the boot image.
MIB stands for Management Information Base, and the MIB service
should be seen as a knowledge base of management information.
The MIB service implementation of the sysctl(2) interface is fairly
complete; it incorporates support for both static and dynamic nodes
and imitates many NetBSD-specific quirks expected by userland. The
patch also adds trace(1) support for the new system call, and adds
a new test, test87, which tests the fundamental operation of the
MIB service rather thoroughly.
The user of the script may now override the default name of the
host platform's GNU make utility by passing in a MAKE variable.
Along with the previous commits and upcoming documentation changes,
this fixes #93.
These utilities were already largely broken and are now also obsolete.
In addition, they have too many issues (for example, dependencies on
Linux specifics) to keep around. The few usable features left in the
clientctl script are not LLVM specific and, if anything, should be
recreated somewhere else.
ASR instrumentation is now performed on all applicable system services
if the system is built with MKASR=yes. This setting automatically
enables MKMAGIC=yes, which in turn enables MKBITCODE=yes.
The number of extra rerandomized service binaries to be generated can
be set by passing ASRCOUNT=n to the build system, where n is a number
between 1 and 65536. The default ASRCOUNT is 3, meaning that each
service will have one randomized base binary and three additional
rerandomized binaries. As before, update_asr(8) can be used for
runtime rerandomization.
The magic runtime library is now built as part of the regular build, if
the MKMAGIC=yes flag is passed to the build system. The library has
been renamed from "magic" to "magicrt" to resolve a name clash with BSD
file(1)'s libmagic. All its level-5 LLVM warnings have been resolved.
The final library, "libmagicrt.bcc", is now stored in the destination
library directory rather than in the source tree.
Until now, the program name of a service was always the file name
(without directory) of the service binary. The program name is used
to, among other things, find the corresponding system.conf entry.
With ASR moving to a situation where all rerandomized service binaries
are stored in a single directory, this can no longer be maintained.
Instead, the service(8) command can now be instructed to override the
service program name, using its new -progname option.
It was not used or tested on x86 in practice, and the automated arm
tests should obviate the need for a dummy-only x86 implementation.
It should be noted that this change is merely the simplest way to
deal with conflicts with live update (for the second time now).