Improving capability usage on Linux
Capabilities are a mechanism that allows privileges usually reserved for the superuser to be granted or revoked in a more granular manner. Nowadays, their usage is reasonably widespread across the Linux ecosystem, even though some warts remain in the interface, such as capabilities being applied per thread rather than per process. This is a recurring issue on Linux with credentials: user, group and supplementary group IDs are all per-thread attributes, instead of being applied process-wide. This requires clever workarounds in libcs, as well as in any language runtime that bypasses libc (see this Go commit that finally implemented the credential synchronization mechanism in their runtime).
Back to the matter I wanted to write about: some time ago we got a bug report in Void Linux that the yggdrasil system service wasn’t working, erroring out and printing 'libcap-ng is too old for "all" caps' before exiting. This happened because we had wanted to apply the same restrictions the yggdrasil systemd service did, which removed nearly all its capabilities and left it only with enough privileges to manage network interfaces. However, without systemd to do the heavy lifting for us, this had to be implemented using setpriv(1) from the util-linux project, which, unfortunately, claimed it couldn’t act on “all” as a parameter to its capabilities arguments, since it was running on a kernel that had more capabilities than it “knew” about at build time. Given that it used the CAP_LAST_CAP macro to determine the last capability it knew about and compared that value with what the kernel told it, the error message was actually misleading: it didn’t really matter what version of libcap-ng was being used or what it had been built with, only what kernel header version had been used when building util-linux.
How they determined the last available capability in the old version can be seen here:
// SPDX-License-Identifier: GPL-2.0-or-later
int cap_last_cap(void)
{
    /* CAP_LAST_CAP is untrustworthy. */
    static int ret = -1;
    int matched;
    FILE *f;

    if (ret != -1)
        return ret;

    f = fopen(_PATH_PROC_CAPLASTCAP, "r");
    if (!f) {
        ret = CAP_LAST_CAP; /* guess */
        return ret;
    }

    matched = fscanf(f, "%d", &ret);
    fclose(f);

    if (matched != 1)
        ret = CAP_LAST_CAP; /* guess */

    return ret;
}
Then, in setpriv.c, we can see where it was erroring out:
// SPDX-License-Identifier: GPL-2.0-or-later
static void do_caps(enum cap_type type, const char *caps)
{
    /*
    ...
    */
    if (!strcmp(c + 1, "all")) {
        int i;
        /* It would be really bad if -all didn't drop all
         * caps. It's better to just fail. */
        if (cap_last_cap() > CAP_LAST_CAP)
            errx(SETPRIV_EXIT_PRIVERR,
                 _("libcap-ng is too old for \"all\" caps"));
        for (i = 0; i <= CAP_LAST_CAP; i++)
            cap_update(action, type, i);
    }
    /*
    ...
    */
}
Basically, this code block existed because the value returned by cap_last_cap() could have been a guess, and therefore couldn’t be trusted. It should be noted that the logic was still somewhat erroneous: on a newer kernel where /proc wasn’t mounted, the function would return CAP_LAST_CAP and the program wouldn’t error out, even though there were more system capabilities available than it was aware of.
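To make the three possible outcomes easier to compare, here is a small model of that old logic. It is only a sketch: the function names are my own, the /proc read is replaced by a plain parameter (a negative value stands for "couldn't read the file"), and the numbers 37 and 40 used in the note below are purely illustrative header and kernel values.

```c
/*
 * Hypothetical model of the old cap_last_cap(): proc_value is what
 * /proc/sys/kernel/cap_last_cap would have contained, or a negative
 * number if the file could not be read. On failure the old code
 * guessed the build-time CAP_LAST_CAP (header_last_cap here).
 */
static int old_cap_last_cap(int proc_value, int header_last_cap)
{
    return proc_value >= 0 ? proc_value : header_last_cap;
}

/* Returns 1 if the old setpriv(1) check would refuse to handle "all". */
static int old_setpriv_refuses_all(int proc_value, int header_last_cap)
{
    return old_cap_last_cap(proc_value, header_last_cap) > header_last_cap;
}

/*
 * Number of kernel capabilities that a loop over 0..header_last_cap
 * would never touch.
 */
static int caps_missed(int kernel_last_cap, int header_last_cap)
{
    if (kernel_last_cap <= header_last_cap)
        return 0;
    return kernel_last_cap - header_last_cap;
}
```

With headers that stop at 37 and a kernel that goes up to 40, `old_setpriv_refuses_all(40, 37)` is 1 (the error we hit), while `old_setpriv_refuses_all(-1, 37)` is 0 even though `caps_missed(40, 37)` is 3: the check passes and three capabilities are silently left alone.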
It felt to me like there should be a better way of doing this, because otherwise setpriv(1) would be too fragile a utility to depend on. Initially, I opened an issue in the util-linux repository where I asked about this apparent fragility and for suggestions on how to improve the situation. My own initial suggestion had been simply to improve the documentation to point out the program’s limitations, but that wouldn’t solve the issue we were facing in Void Linux, and would still require changing the yggdrasil service to do something other than run as root with dropped capabilities.
In thinking about the situation, I ended up taking a deeper look into the prctl(2) system call. Maybe there was some option value that would return the highest capability value. Unfortunately, there wasn’t, but what I did find was the PR_CAPBSET_READ option, which didn’t change anything about the running process, only queried some properties; even better, it would return either 0 or 1 for known capabilities, to show whether they were in the thread’s capability bounding set, or -1, in the case of an invalid (read “unknown to the kernel”) capability. This meant that with some smart binary searching, we could find the highest capability known to a kernel without an absurd number of system calls.
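As a quick illustration of that three-way answer, here is a minimal sketch of a single probe. The helper name `capbset_state` is mine and doesn't exist in util-linux or libcap-ng; the only real interface used is prctl(2) with PR_CAPBSET_READ.

```c
#include <errno.h>
#include <sys/prctl.h>

/*
 * Classify one capability number using PR_CAPBSET_READ:
 *    1 -> known to the kernel and present in the bounding set
 *    0 -> known to the kernel but dropped from the bounding set
 *   -1 -> rejected by the kernel (errno is EINVAL when the number
 *         is not a capability this kernel knows about)
 * Hypothetical helper, written only for this example.
 */
static int capbset_state(unsigned long cap)
{
    int ret;

    errno = 0;
    ret = prctl(PR_CAPBSET_READ, cap, 0UL, 0UL, 0UL);
    if (ret < 0)
        return -1;
    return ret ? 1 : 0;
}
```

Calling `capbset_state(0)` should give 0 or 1 on any Linux system, since capability 0 always exists, while a number far beyond anything the kernel defines, such as `capbset_state(1000000)`, should give -1. Nothing about the calling thread changes either way, which is what makes the option safe to use as a probe.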
After some suggestions for improvements, this is what I ended up with:
// SPDX-License-Identifier: GPL-2.0-or-later
static int test_cap(unsigned int cap)
{
    /* prctl returns 0 or 1 for valid caps, -1 otherwise */
    return prctl(PR_CAPBSET_READ, cap, 0, 0, 0) >= 0;
}

int cap_last_cap(void)
{
    static int cap = -1;
    FILE *f;

    if (cap != -1)
        return cap;

    /* try to read value from kernel, check that the path is
     * indeed in a procfs mount */
    f = fopen(_PATH_PROC_CAPLASTCAP, "r");
    if (f) {
        int matched = 0;

        if (proc_is_procfs(fileno(f))) {
            matched = fscanf(f, "%d", &cap);
        }
        fclose(f);

        /* we check if the cap after this one really isn't valid */
        if (matched == 1 && cap < INT_MAX && !test_cap(cap + 1))
            return cap;
    }

    /* if it wasn't possible to read the file in /proc,
     * fall back to binary search over capabilities */

    /* starting with cap=INT_MAX means we always know
     * that cap1 is invalid after the first iteration */
    unsigned int cap0 = 0, cap1 = INT_MAX;
    cap = INT_MAX;
    while ((int)cap0 < cap) {
        if (test_cap(cap)) {
            cap0 = cap;
        } else {
            cap1 = cap;
        }
        cap = (cap0 + cap1) / 2U;
    }

    return cap;
}
The advantage of this version of cap_last_cap() is that it doesn’t even touch CAP_LAST_CAP, and the value returned by it can always be trusted. This allowed me to simply remove the restriction in setpriv(1). Now, the only consequence of a mismatch between the capabilities supported by the running kernel and the kernel headers these utilities were built with is that the capability’s name might not be known by libcap-ng, making it necessary to use a generic cap_XX name.
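To get a feel for how few probes the binary search in cap_last_cap() needs, the same loop can be run against a stand-in for the kernel. Everything here is an assumption made for the sake of the example: `fake_last_cap`, `fake_test_cap`, `probe_count` and `search_last_cap` are names I made up, and the "kernel" is just a comparison against a pretend limit of 40.

```c
#include <limits.h>

/* Pretend kernel: capabilities 0..fake_last_cap are valid (assumed value). */
static int fake_last_cap = 40;

/* Counts how many times the search asked the pretend kernel. */
static int probe_count;

/* Stand-in for test_cap(): no system call, only a comparison. */
static int fake_test_cap(unsigned int cap)
{
    probe_count++;
    return cap <= (unsigned int)fake_last_cap;
}

/*
 * Same binary search as in cap_last_cap(), with the probe passed in
 * as a function pointer so it can run without touching the kernel.
 * cap0 is always a valid capability, cap1 is invalid once it has
 * been probed, and cap is the midpoint that gets tested next.
 */
static int search_last_cap(int (*is_valid)(unsigned int))
{
    unsigned int cap0 = 0, cap1 = INT_MAX;
    int cap = INT_MAX;

    while ((int)cap0 < cap) {
        if (is_valid(cap))
            cap0 = cap;
        else
            cap1 = cap;
        cap = (cap0 + cap1) / 2U;
    }
    return cap;
}
```

Since the interval between cap0 and cap1 is halved on every iteration, `search_last_cap(fake_test_cap)` lands on 40 after roughly as many probes as there are bits in an int, instead of the forty-odd a linear walk from zero would take, and that number stays flat no matter how many capabilities future kernels add.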
Afterwards, I also added the same algorithm to libcap-ng itself, where I learned that the maximum value for capabilities supported by the kernel isn’t INT_MAX, but instead 64 (at least as of 2020), because capabilities are tracked in the kernel and file systems as a bitmask stored in two 32-bit integers. The final version of libcap-ng’s init_lib can be seen below:
// SPDX-License-Identifier: LGPL-2.1-or-later
static void init_lib(void) __attribute__ ((constructor));
static void init_lib(void)
{
#ifdef HAVE_PTHREAD_H
    pthread_atfork(NULL, NULL, deinit);
#endif
    // Detect last cap
    if (last_cap == 0) {
        int fd;

        // Try to read last cap from procfs
        fd = open("/proc/sys/kernel/cap_last_cap", O_RDONLY);
        if (fd >= 0) {
#ifdef HAVE_LINUX_MAGIC_H
            struct statfs st;
            // Bail out if procfs is invalid or fstatfs fails
            if (fstatfs(fd, &st) || st.f_type != PROC_SUPER_MAGIC)
                goto fail;
#endif
            char buf[8];
            int num = read(fd, buf, sizeof(buf) - 1);
            if (num > 0) {
                buf[num] = 0;
                errno = 0;
                unsigned int val = strtoul(buf, NULL, 10);
                if (errno == 0)
                    last_cap = val;
            }
fail:
            close(fd);
        }
        // Run a binary search over capabilities
        if (last_cap == 0) {
            // starting with last_cap=MAX_CAP_VALUE means we always know
            // that cap1 is invalid after the first iteration
            last_cap = MAX_CAP_VALUE;
            unsigned int cap0 = 0, cap1 = MAX_CAP_VALUE;

            while (cap0 < last_cap) {
                if (test_cap(last_cap))
                    cap0 = last_cap;
                else
                    cap1 = last_cap;
                last_cap = (cap0 + cap1) / 2U;
            }
        }
    }
}
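The two-integer bitmask is also easy to picture in code. What follows is only a sketch of the layout under my own assumptions, not how the kernel or libcap-ng actually implement it: `CAP_WORDS`, `cap_word`, `cap_bit`, `cap_mask_set` and `cap_mask_has` are names invented for this example.

```c
#include <stdint.h>

/* Two 32-bit words, as described above (hypothetical constant). */
#define CAP_WORDS 2

/* Which of the 32-bit words holds this capability's bit. */
static unsigned int cap_word(unsigned int cap)
{
    return cap >> 5;
}

/* The bit inside that word. */
static uint32_t cap_bit(unsigned int cap)
{
    return UINT32_C(1) << (cap & 31);
}

/* Sets a capability in the mask; returns -1 if it doesn't fit. */
static int cap_mask_set(uint32_t mask[CAP_WORDS], unsigned int cap)
{
    if (cap_word(cap) >= CAP_WORDS)
        return -1;
    mask[cap_word(cap)] |= cap_bit(cap);
    return 0;
}

/* Returns 1 if the capability is set in the mask, 0 otherwise. */
static int cap_mask_has(const uint32_t mask[CAP_WORDS], unsigned int cap)
{
    if (cap_word(cap) >= CAP_WORDS)
        return 0;
    return (mask[cap_word(cap)] & cap_bit(cap)) != 0;
}
```

Setting capability 12 (CAP_NET_ADMIN, the one yggdrasil needs) flips bit 12 of the first word, setting capability 40 flips bit 8 of the second word, and `cap_mask_set(mask, 64)` fails because two words only have room for bit positions 0 through 63. That is why starting the binary search anywhere near INT_MAX was overkill.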
Interestingly, when I strace(1) programs that use libcap, it seems they do a similar sweep across capabilities using prctl(2). I have yet to look at their code (stracing things is just so much simpler), but it would seem my idea wasn’t as original as I thought it was. Still, I’m glad to have improved the utilities whose limitations were affecting us now.