In this week’s edition: Linux 4.11-rc8, updating kernel.org cross compilers, Intel 5-level paging, v3 namespaced file capabilities, and ongoing development.
Apologies for the delay to this week’s podcast. I got flu around the time I was preparing last week’s podcast, limped along to the weekend, and then had to stay in bed for a long time. On the other hand, it let me play with a bunch of new SDRs [HackRF, RTL-SDR, and friends, for the curious) on Sunday when I skipped the 5K I was supposed to run 🙂
I would also like to note my thanks for the first 10,000 downloads of the new series of this podcast. It’s a work in progress. I am going to make (positive!) changes over the coming months, including a web interface that will track all LKML posts and allow for community-directed collaboration on creating this (and hopefully other) podcasts. I will include automatic patch tracking (showing when patches have landed in upstream trees, and so on), info on post authors, and allow you to edit personal bios, links, etc. And employer info. After some discussions around the best way to handle author employer attribution (to make sure everyone is treated fairly), I’ve decide to take a little time away from including employer names until I have a populated database of mappings. Jon Corbet from LWN has something similar already, which I believe is also stored in git, but there’s more to be done here (thanks to Alex and others for the G+ feedback and discussion on this).
Linus Torvalds announced Linux 4.11-rc8, saying “So originally I was just planning on releasing the final 4.11 today, but while we didn’t have a *lot* of changes the last week, we had a couple of really annoying ones, so I’m doing another rc release instead”. As he also notes, “The most noticeable of the issues is that we’ve quirked off some NVMe power management that apparently causes problems on some machines. It’s not entirely clear what caused the issue (it wasn’t just limited to some NVMe hardware, but also particular platforms), but let’s test it”.
With the release of Linux 4.11-rc8 comes that impending moment of both elation and dread that is a final kernel. It’ll be great to see 4.11 out there. It’s an awesome kernel, with lots of new features, and it will be well summarized in kernelnewbies and elsewhere. But upon its release comes the opening of the merge window for 4.12. Tracking that was exciting for 4.11. Hopefully it doesn’t finish me off trying to do that for 4.12 😉
Geert Utterhoeven posted “Build regressions/improvements in v4.11-rc8”, in which he noted that (compared with v.4.10), an addition build error and several hundred more warnings were recently added to the kernel. The error he points to is in the AVR32 architecture when applying a relocation in the linker, probably due to an unsupported offset.
Greg K-H (Kroah-Hartman) announced Linux 4.4.64, 4.9.25, and 4.10.13
Junio C Hamano announced Git v2.13.0-rc1
Alex Williams posted “Generic DMA-capable streaming device driver looking for home” in which he describes some generic features of his device (the ability to “carry generic data to/from userspace”) and inquired as to where it should live in the kernel. It could do with some followup.
Updating kernel.org cross compilers
Andre Przywara inquired as to the state of the kernel.org cross compilers. This was a project, initiated by Tony Breeds and located on kernel.org, to maintain current Intel x86 Architecture builds of cross compiler toolchains for various architecture targets (a cross compiler is one that runs on one architecture, targeting another, which is incidentally different from a “Canadian cross” compiler – look it up if you’re ever bored or want to bootstrap compilers for fun). It was a great project, but like so many others one day (three years ago) there were no more updates. That is something Andre would like to see changed. He posted, noting that many people still use the compilers on kernel.org (including yours truly, in a pinch) and that “The latest compiler I find there is 4.9.0, which celebrated its third birthday at the weekend, also has been superseded by 4.9.4 meanwhile”.
Andre used build scripts from Segher Bossenkool to build binutils (the GNU assembler) 2.28 and GCC (the GNU Compiler Collection) 6.3.0. With some tweaks, he was able to build for “all architectures except arc, m68k, tilegx and tilepro”. He wondered “what the process is to get these [the compilers linked from the kernel website] updated?”. It seems like he is keen to clean this up, which is to be commended and encouraged. And hopefully (since he works for ARM) that will eventually also include cross compiler targets for x86 that run on ARMv8 server systems.
Intel 5-level paging
Kirill A. Shutemov posted “x86: 5-level paging enabling for v4.12, Part 4”, in which he provides an “updated version the fourth and the last bunch of  patches that brings initial 5-level paging enabling.” This is in support of Intel’s “la57” feature of future microprocessors that allows them to exceed the traditional 48-bit “Canonical Addressing” in order to address up to 56-bits of Virtual Address space (a big benefit to those who want to map large non-volatile storage devices and accelerators into virtual memory). His latest patch series includes a fix for a “KASLR [Kernel Address Space Layout Randomization”] bug due to rewriting  startup_64() in C”.
He goes on to discuss passing the “hint” parameter to mmap() “in order to tell the kernel not to allocate memory beyond the 48 bits address space. Unfortunately, on Linux this will only work when the area pointed to by “hint” is unallocated which means one cannot simply use a hardcoded “hint” to mitigate this problem”. What he means here is that the mmap call to map a virtual memory area into a userspace process allows an application to specify where it would like that mapping to occur, but Linux isn’t required to respect this. Contemporary Linux implements “MAP_FIXED” as an option to mmap, which will either map a region where requested or explicitly fail (as Andy Lutomirski pointed out). This is different from a legacy behavior where Linux used to take a hint and might just not respect placement (as Andi Kleen alluded to in followup).
This whole discussion is actually the reason that Kirill had (thoughtfully) already included a feature bit setting in his patches that allows an application to effectively override the existing kernel logic and always allocate below 48 bits (preserving as close to existing behavior as possible on a per application basis while allowing a larger VA elsewhere). The thread resulted in this being pointed out, but it’s a timely reminder of the problems faced as the pressure continues upon architectures to grow their VA (Virtual Address) space size.
Often, efforts at growing virtual memory address spaces run up against uses of the higher order bits that were never sanctioned but are in widespread use. Many people strongly dislike pointer tagging of this kind (your author included), but it is not going away. It is great that Kirill’s patches have a form of solution that can be used for the time being by applications that want to retain a smaller address space, but that’s framed in the context of legacy support, not to enable runtimes to continue to use high order bits forevermore.
Introduce v3 namespaced file capabilities
Serge E. Hallyn posted “Introduce v3 namespaced file capabilities”. Linux includes a comprehensive capability mechanism that allows applications to limit what privileged operations may be performed by them. In the “good old days” when Unix hacker beards were more likely than today’s scruffy look, root was root and nobody really cared about remote compromise because they were still fighting having to have login passwords at all. But in today’s wonderful world of awesome, in which anything not bolted down is often not long for this world, “root” can mean very little. The traditionally privileged users can be extremely restricted by security policy frameworks, such as SELinux, but even more fundamentally can be subject to restrictions imposed by the growth in use of “capabilities”.
A classic example of a capability is CAP_NET_RAW, which the “ping” utility needs in order to create a raw socket. Traditionally, such utilities were created on Unix and Linux filesystems as “setuid root”, which means that they had the “s” bit set in their permissions to “run as root” when they were executed by regular users. This allowed the utility to operate, but it also allowed any user who could trick the utility into providing a shell conveniently gain a root login. Many security exploits over the years later and we have filesystem capabilities which allow binaries to exist on disk, tagged with just those extra capabilities they require to get the job done, through the filesystem “xattr” extended attributes. “ping” has CAP_NET_RAW, so it can create raw sockets, but it doesn’t need to run as root, so it isn’t market as “setuid root” on modern distros.
Fast forward still further into the modern era of containers and namespaces, and things get more complex. As Serge notes in his patch, “Root in a non-initial user ns [namespace] cannot be trusted to write a traditional security.capability xattr. If it were allowed to do so, then any unprivileged user on the host could map his own uid to root in a private namespace, write the xattr, and execute the file with privilege on the host”. However, as he also notes, “supporting file capabilities in a user namespace is very desirable. Not doing so means that and programs designed to run with limited privilege must continue to support other methods of gaining and dropping privilege. For instance a program installer must detect whether file capabilities can be assigned, and assign them if so but set setuid-root otherwise. The program in turn must known how to drop partial capabilities [which is a mess to get right], and do so only if setuid-root”. This is, of course, far from desirable.
In the patch series, Serge “builds a vfs_ns_cap_data struct by appending a uid_t [user ID] rootid to struct vfs_cap_data. This is the absolute uid_d (that is, the uid_t in user namespace which mounted the filesystem, usually init_user_ns [the global default]) of the root id in whosr namespace the file capabilities may take effect”. He then rewrites xattrs within the namespace for unprivileged “root” users with the appropriate notion of capabilities for that environment (in a “v3” xattr that is transparently converted to/from the conventional “v2” security.capability xattr), in accordance with capabilities that have been granted to the namespace from outside by a CAP_SETFCAP. This allows capability use without undermining host system security and seems like a nice solution.
Ashish Kalra posted “Fix BSS corruption/overwrite issue in early x86 kernel setup”. The BSS (Block Started by Symbol) is the longstanding name used to refer to statically allocated (and pre-zeroed) variables that have memory set aside at compile time. It’s a common feature of almost every ELF (Executable and Linking Format) Linux binary you will come across, the kernel not being much different. Linux also uses stacks for small runtime allocations by having a page (or several) of memory that contains a pointer which descends (it’s actually called a “fully descending” type of stack) in address as more (small) items are allocated within it. At boot time, the kernel typically expects the bootloader will have setup a stack that can be used for very early code, but Linux is willing to handle its own setup if the bootloader isn’t sophisticated enough to handle this. The latter code isn’t well exercised and it turns out doesn’t reserve quite enough space, which causes the stack to descend (run into) the BSS segment, resulting in corruption. Ashish fixes this by increasing the fallback stack allocation size from 512 to 1024 bytes in arch/x86/boot/boot.h.
Vladimir Murzin posted “ARM: Fix dma_alloc_coherent()” and friends for NOMMU”, noting “It seem that addition of cache support for M-class CPUs uncovered [a] latent bug in DMA usage. NOMMU memory model has been treated as being always consistent; however, for R/M [Real Time and Microcontroller] classes [of ARM cores] memory can be covered by MPU [Memory Protection Unit] which in turn might configure RAM as Normal i.e. bufferable and cacheable. It breaks dma_alloc_coherent() and friends, since data can stuck in caches”.
Andrew Pinski posted “arm64/vdso: Rewrite gettimeofday into C”, which improves performance by up to 32% when compared to the existing in-kernel implementation on a Cavium ThunderX system (because there are division operations that the compiler can optimize). On their next generation, it apparently improves performance by 18% while also benefitting other ARM platforms that were tested. This is a significant improvement since that function is often called by userspace applications many times per second.
Baoquan He posted “x86/KASLR: Use old ident map page table if physical randomization failed”. Dave Young discovered a problem with the physical memory map setup of kexec/kdump kernels when KASLR (Kernel Address Space Layout Randomization) is enabled. KASLR does what it says on the tin. It applies a level of randomization to the placement of (most) physical pages of the kernel such that it is harder for an attacker to guess where in memory the kernel is located. This reduces the ability for “off the shelf” buffer overflow/ROP/similar attacks to leverage known kernel layout. But when the kernel kexec’s into a kdump kernel upon a crash, it’s loading a second kernel while attempting to leave physical memory not allocated to the crash kernel alone (so that it can be dumped). This can lead to KASLR allocation failures in the crash kernel, which (until this patch) would result in the crash kernel not correctly setting up an identity mapping for the original (older) kernel, resulting in immediately resetting the machine. With the patch, the crash kernel will fallback to the original kernel’s identity mapping page tables when KASLR setup fails.
On a separate, but related, note, Xunlei Pang posted “x86_64/kexec: Use PUD level 1GB page for identity mapping if available” which seeks to change how the kexec identity mapping is established, favoring a new top-level 1GB PUD (Page Upper Directory) allocation for the identity mappings needed prior to booting into the new kernel. This can save considerable memory (128MB “On one 32TB machine”…) vs using the current approach of many 2MB PTEs (Page Table Entries) for the region. Rather than many PTEs, an effective huge page can be mapped. PTEs are grouped into “directories” in memory that the microprocessor’s walker engines can navigate when handling a “page fault” (the process of loading the TLB – Translation Lookaside Buffer – and microTLB caches). Middle Directories are collections of PTEs, and these are then grouped into even larger collections at upper levels, depending upon nesting depth. For more about how paging works, see Mel Gorman’s “Linux Memory Management”, a classic text that is still very much relevant for the fundamentals.
Janakarajan Natarajan posted “Prevent timer value 0 for MWAITX” which limits the kernel from providing a value of zero to the privileged x86 “MWAITX” instruction. MWAIT (Memory Wait) is a series of instructions on contemporary x86 systems that allows the kernel to temporarily block execution (in place of a spinloop, or other solution) until a memory location has been updated. Then, various trickery at the micro-architectural level (a dedicated engine in the core that snoops for updates to that memory address) will handle resuming execution later. This is intended for use in waiting relatively small amounts of time in an energy efficient and high performance (low wakeup time) manner. The instruction accepts a timeout period after which a wakeup will happen regardless, but it can also accept a zero parameter. Zero is supposed to mean “never timeout” (i.e. always wait for the memory update). It turns out that existing Linux kernels do use zero on some occasions, incorrectly, and that this isn’t noticed on older microprocessors due to other events eventually triggering a wakeup regardless. On the new AMD Zen core, which behaves correctly, MWAITX may never wake up with a zero parameter, and this was causing NMI soft lockup warnings. The patch corrects Linux to do the right thing, removing the zero option.
Paul E. McKenney posted “Make SRCU be built by default”. SRCU (Sleepable) RCU (Read Copy Update) is an optional feature of the Linux kernel that provides an implementation of RCU which can sleep. Conventionally, RCU had spinlock semantics (it could not sleep). By definition, its purpose was to provide a cunning lockless update mechanism for data structures, relying upon the passage of a “grace period” defined by every processor having gone into the scheduler once (a gross simplification of RCU). But under some circumstances (for example, in a Real Time kernel) there is a need for a sleepable (and pre-emptable, but that’s another issue) RCU. And so SRCU was created more than 8 years ago. It has a companion in “Tiny SRCU” for embedded systems. A “surprisingly common case” exists now where parts of the kernel are including srcu.h so Paul’s patch builds it by default.
Laurent Dufour posted “BUG raised when onlining HWPoisoned page” in which he noted that the (being onlined) page “has already the mem_cgroup field set” (this is shown in the stack trace he posts with “page dumped because: page still charged to cgroup”). He cleans this up by clearing the mem_cgroup when a page is poisoned. His second patch skips poisoned pages altogether when performing a memory block onlining operation.
Laurent also posted an RFC (Request For Comment) patch series entitled “Replace mmap_sem by a range lock” which “implements the first step of the attempt to replace the mmap_sem by a range lock”. We will summarize this patch series in more detail the next time it is posted upstream.
Christian König posted version 4 of his “Resizable PCI BAR support” patches. PCI (and its derivatives, such as PCI Express) use BARs (Base Address Registers) to convey regions of the host physical memory map that the device will use to map in its memory. BARs themselves are just registers, but the memory they refer to must be linearly placed into the physical map (or interim IOVA map in the case that the BAR is within a virtual machine). Fitting large, multi GB windows can be a challenge, sometimes resulting in failure, but many devices can also manage with smaller memory windows. Christian’s patches attempt to provide for the best of both by adding support for a contemporary feature of PCI (Express) that allows devices with such an ability to convey a minimal BAR size and then increase the allocation if that is available. His changes since version 3 include “Fail if any BAR is still in use…”.
Ying Huang posted version 10 of his “THP swap: Delay splitting THP during swapping out” which allows for swapping of Transparent Huge Pages directly. We have previously covered iterations of this patch series. The latest changes are minimal, suggesting this is close to being merged.
Jérôme Glisse posted version 21 of his “Heterogeneous Memory Management” (HMM) patch series. This is very similar to the version we covered last week. As a reminder, HMM provides an API through which the kernel can manage devices that want to share memory with a host processing environment in a more seamless fashion, using shared address spaces and regular pointers. His latest version changes the concept of “device unaddressable” memory to “device private” (MEMORY_DEVICE_PRIVATE vs MEMORY_DEVICE_PUBLIC) memory, following the feedback from Dan Nellans that devices are changing over time such that “memory may not remain CPU-unaddressable in the future” and that, even though this would likely result in subsequent changes to HMM, it was worthwhile starting out with nomenclature correctly referring to memory that is considered private to a device and will not be managed by HMM.
Intel’s test Robot noticed a 12.8% performance improvement in one of their scalability benchmarks when running with a recent linux-next tree containing Al Viro’s “amd64: get rid of zeroing” patch. This is patch of his larger “uccess unification” patch series that aims to simply and cleanup the process of copying data to/from kernel and userspace. In particular, when asking the kernel to copy data from one userspace virtual address to another, there is no need to apply the level of data zeroing that typically applies to buffers the kernel copies (for security purposes – preventing leakage of extra data beyond structures returned from kernel calls, as an example). When both source and destination are already in userspace, there is no security issue, but there was a performance degregation that Viro had noticed and fixed.
Julien Grall posted “Xen: Implement EFI reset_system callback”, which provides a means to correctly reboot and power off Dom0 host Xen Hypervisors when running on EFI systems for which reset_system is used by reference (ARM).