Linux Kernel Podcast for 2017/04/04

Audio: http://traffic.libsyn.com/jcm/20170404v2.mp3

Linus Torvalds announces Linux 4.11-rc5, Donald Drumpf drains the maintainer swamp in April, Intel FPGA Device Drivers, FPU state caching, /dev/mem access crashing machines, and assorted ongoing development.

Linus Torvalds announced Linux 4.11-rc5. In his announcement mail, Linus notes that “things have definitely started to calm down, let’s hope it stays this way and it wasn’t just a fluke this week”. He calls out the oddity that “half the arch updates are to parisc” due to parisc user copy fixes.

It’s worth noting that rc5 includes a fix for virtio_pci which removes an “out of bounds access for msix_names” (the “name strings for interrupts” provided in the virtio_pci_device structure). According to Jason Wang (Red Hat), “Fedora has received multiple reports of crashes when running 4.11 as a guest” (in fact, your author has seen this one too). Quoting Jason, “The crashes are not always consistent but they are generally some flavor of oops or GPF [General Protection Fault – Intel x86 term referring to the general case of an access violation into memory by an offending instruction in various different ISAs – Instruction Set Architectures] in virtio related code. Multiple people have done bisections (Thank you Thorsten Leemhuis and Richard W.M. Jones)”. An example rediscovery of this issue came from a Mellanox engineer who reported that their test and regression VMs were crashing occasionally with 4.11 kernels.

Announcements

Sebastian Andrzej Siewior announced preempt-rt Linux version 4.9.20-rt16. This includes a “Re-write of the R/W semaphores code. In RT we did not allow multiple readers because a writer blocking on the semaphore would have [to] deal with all the readers in terms of priority or budget inheritance [by which he is referring to the Priority Inheritance or “PI” feature common to “real time” kernels]. It’s obvious that the single reader restriction has severe performance problems for situations with heavy reader contention.” He notes that CPU hotplug got “better but can deadlock”.

Greg Kroah-Hartman posted Linux stable kernels 4.4.59, 4.9.20, and 4.10.8.

Draining the Swamp (in April)

Donald Drumpf (trump.kremlin.gov@gmail.com) posted “MAINTAINERS: Drain the swamp”, an inspired patch aiming to finally address the problem of having “a small group of elites listed in the corrupt MAINTAINERS file” who, “For too long” have “reaped the rewards of maintainership”. He notes that over the past year the world has seen a great Linux Exit (“Lexit”) movement in which “People all of the Internet have come together and demanded that power be restored to the developers”, creating “a historic fork based on Linux 2.4, back to a better time, before Linux was controlled by corporate interests”. He notes that the “FAKE NEWS site LWN.net said it wouldn’t happen, but we knew better”.

Donald says that all of the groundwork laid over the past year was just an “important first step”. And that “now, we are taking back what’s rightfully ours. We are transferring power from “Lyin’ Linus” and giving it back to you, the people. With the below patch, the job-killing MAINTAINERS file is finally being ROLLED BACK.” He also notes his intention to return “LAW and ORDER” to the Linux kernel repository by building a wall around kernel.org and “THE LINUX FOUNDATION IS GOING TO PAY FOR IT”. Additional changes will include the repeal and replacement of the “bloated merge window”, the introduction of a distribution import tax, and other key innovations that will serve to improve the world and to MAKE LINUX GREAT AGAIN!

Everyone around the world immediately and enthusiastically leaped upon this inspired and life altering patch, which was of course perfect from the moment of its inception. It was then immediately merged without so much as a dissenting voice (or any review). The private email servers used to host Linus’s deleted patch emails were investigated and a special administrator appointed to investigate the investigators.

Intel FPGA Device Drivers

Wu Hao (Intel) posted a sixteen part patch series entitled “Intel FPGA Drivers”, which “provides interfaces for userspace applications to configure, enumerate, open, and access FPGA [Field Programmable Gate Arrays, flexible logic fabrics containing millions of gates that can be connected programmatically by bitstreams describing the intended configuration] accelerators on platforms equipped with Intel(R) FPGA solutions and enables system level management functions such as FPGA partial reconfiguration [the dynamic updating of partial regions of the FPGA fabric with new logic], power management, and virtualization”. This support differs from the existing in-kernel fpga-mgr from Alan Tull in that it seems to relate to the so-called Xeon-FPGA hybrid designs that Intel have presented on in various forums.

The first patch (01/16) provides a lengthy summary of their proposed design in the form of documentation that is added to the kernel’s Documentation directory, specifically in the file Documentation/fpga/intel-fpga.txt. It notes that “From the OS’s point of view, the FPGA hardware appears as a regular PCIe device. The FPGA device memory is organized using a predefined structure (Device Feature List). Features supported by the particular FPGA device are exposed through these data structures”. An FME (FPGA Management Engine) is provided which “performs power and thermal management, error reporting, reconfiguration, performance reporting, and other infrastructure functions. Each FPGA has one FME, which is always accessed through the physical function (PF)”. The FPGA also provides a series of Virtual Functions that can be individually mapped into virtual machines using SR-IOV.

This design allows a CPU attached using PCIe to communicate with various Accelerated Function Units (AFUs) contained within the FPGA, and which are individually assignable into VMs or used in aggregate by the host CPU. One presumes that a series of userspace management utilities will follow this posting. It’s actually quite nice to see how they implemented the discovery of individual AFU features, since this is very close to something a certain author has proposed for use elsewhere for similar purposes. It’s always nicely validating to see different groups having similar thoughts.
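To make the Device Feature List idea a little more concrete, here is a minimal Python sketch of walking a chain of feature headers in a block of device memory. The header layout used here (a 16-bit feature id, 16 reserved bits, and a 32-bit offset to the next header, with 0 marking the end) is purely hypothetical and chosen for illustration. It is not the layout defined in the Intel patches, and the function names are made up for this example.

```python
import struct

# Hypothetical header layout (NOT the real Intel format):
# u16 feature id, u16 reserved, u32 byte offset to the next header (0 = last).
HEADER = struct.Struct("<HHI")


def walk_feature_list(mem, start=0):
    """Yield (offset, feature_id) for each header reachable from start."""
    offset = start
    while True:
        if offset + HEADER.size > len(mem):
            raise ValueError(f"feature header at {offset} runs past the region")
        feature_id, _reserved, next_off = HEADER.unpack_from(mem, offset)
        yield offset, feature_id
        if next_off == 0:
            return
        offset += next_off


def build_demo_region():
    """Build a fake 64-byte device memory region holding three features."""
    mem = bytearray(64)
    HEADER.pack_into(mem, 0, 0x1, 0, 16)   # next header 16 bytes further on
    HEADER.pack_into(mem, 16, 0x2, 0, 32)  # next header 32 bytes further on
    HEADER.pack_into(mem, 48, 0x3, 0, 0)   # end of list
    return bytes(mem)


features = list(walk_feature_list(build_demo_region()))
print(features)
```

The appeal of this style of discovery is that a driver does not need a fixed map of the device: it follows the chain and binds whatever features it recognizes.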

Copy Offload with Peer-to-Peer PCI Memory

Logan Gunthorpe posted an RFC (Request for Comments) patch series entitled “Copy Offload with Peer-to-Peer PCI Memory” which relates to work discussed at the recent LSF/MM (Linux Storage Filesystem and Memory Management) conference, in Cambridge MA (side note: I did find some of you haha!). To quote Logan, “The concept here is to use memory that’s exposed on a PCI BAR [Base Address Register – a configuration register that tells the device where in the physical memory map of a system to place memory owned by the device, under the control of the Operating System or the platform firmware, or both] as data buffers in the NVMe target code such that data can be transferred from an RDMA NIC to the special memory and then directly to an NVMe device avoiding system memory entirely”. He notes a number of positives from this, including better QoS (Quality of Service), and a need for fewer (relatively still quite precious even in 2017) PCIe lanes from the CPU into a PCIe switch placed downstream of its Root Complex on which peer-to-peer PCIe devices talk to one another without the intervening step of hopping through the Root Complex and into the system memory via the CPU. As a consequence, Logan has focused his work on “cases where the NIC, NVMe devices and memory are all behind the same PCI switch”.

To facilitate this new feature, Logan has a second patch in the series, entitled “Introduce Peer-to-Peer memory (p2mem) device”, which supports partitioning and management of memory used in direct peer-to-peer transfers between two PCIe devices (endpoints, or “cards”) with a BAR that “points to regular memory”. As Logan notes, “Depending on hardware, this may reduce the bandwidth of the transfer but could significantly reduce pressure on system memory” (again by not hopping up through the PCIe topology). In his patch, Logan had also noted that “older PCI root complexes” might have problems with peer-to-peer memory operations, so he had decided to limit the feature to be only available for devices behind the same PCIe switch. This led to a back and forth with Sinan Kaya who asked (rhetorically) “What is so special about being connected to the same switch?”. Sinan noted that there are plenty of ways in Linux to handle blacklisting known older bad hardware and platforms, such as requiring that the DMI/SMBIOS-provided BIOS date of manufacture of the system be greater than a certain date in combination with all devices exposing the p2p capability and a fallback blacklist. Ultimately, however, it was discovered that the peer-to-peer feature isn’t enabled by default, leading Sinan to suggest “Push the decision all the way to the user. Let them decide whether they want this feature to work on a root port connected port or under the switch”.

FPU state caching

Kees Cook (Google) posted a patch entitled “x86/fpu: move FPU state into separate cache”, which aims to remove the dependency within the Intel x86 Architecture port upon an internal kernel config setting known as ARCH_WANTS_DYNAMIC_TASK_STRUCT. This configuration setting (set by each architecture’s code automatically, not by the person building the kernel in the configuration file) says that the true size of the task_struct cannot be known in advance on Intel x86 Architecture because it contains a variable sized array (VSA) within the thread_struct that is at the end of the task_struct to support context save/restore of the CPU’s FPU (Floating Point Unit) co-processor. Indeed, the kernel definition of task_struct (see include/linux/sched.h) includes a scary and ominous warning “on x86, ‘thread_struct’ contains a variable-sized structure. It *MUST* be at the end of ‘task_struct’”. Which is fairly explicit.

The reason to remove the dependency upon dynamic task_struct sizing is that this “support[s] future structure layout randomization of the task_struct”, which requires that “none of the structure fields are allowed to have a specific position or a dynamic size”. The idea is to leverage a GCC (GNU Compiler Collection) plugin that will change the ordering of C structure members (such as task_struct) randomly at compile time, in order to reduce the ability for an attacker to guess the layout of the structure (highly useful in various exploits). In the case of distribution kernels of course, an attacker has access to the same kernel binaries that may be running on a system, and could use those to calculate likely structure layout for use in a compromise. But the same is not true of the big hyperscale service providers like Google and Facebook. They don’t have to publish the binaries for their own internal kernels running on their public infrastructure servers.

This patch led to a back and forth with Linus, who was concerned about why the task_struct would need changing in order to prevent the GCC struct layout randomization plugin from blowing up. In particular, he was worried that it sounded like the plugin was moving variable sized arrays from the last member of structures (not legally permitted). Kees, Linus, and Andy Lutomirski went through the fact that, yes, the plugin can handle trailing VSAs and so forth. In the end, it was suggested that Kees look at making task_struct “be something that contains a fixed beginning and end, and just have an unnamed randomized part in the middle”. Kees said “That could work. I’ll play around with it”.
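As a toy model of the “fixed beginning and end, randomized middle” suggestion, the Python sketch below shuffles only the middle group of field names using a per-build seed. This is an illustration of the idea only: it is not how the GCC plugin is implemented, the field names are a small illustrative subset, and the seed value is an arbitrary assumption.

```python
import random


def randomized_layout(head, middle, tail, seed):
    """Return a field order: fixed head, shuffled middle, fixed tail."""
    rng = random.Random(seed)  # a per-build seed gives a reproducible layout
    shuffled = list(middle)
    rng.shuffle(shuffled)
    return list(head) + shuffled + list(tail)


# Illustrative field names only; the real task_struct has many more members.
HEAD = ["thread_info", "state"]
MIDDLE = ["prio", "mm", "pid", "cred", "files"]
TAIL = ["thread"]  # holds the variable-sized FPU state, so it must stay last

layout = randomized_layout(HEAD, MIDDLE, TAIL, seed=2017)
print(layout)
```

An attacker who lacks the seed (or the resulting binary) cannot know where in the middle region a field such as cred lives, while code that depends on the first or last member keeps working.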

/dev/mem access crashing machines

Dave Jones (x86info maintainer) had a back and forth with Kees Cook, Linus, and Tommi Rantala about the latter’s discovery that running Dave’s “x86info” tool crashed his machine with an illegal memory access. It turns out that x86info reads /dev/mem (a requirement to get the data it needs), which is a special file representing the contents of physical memory. Normally, when access is granted to this file, it is restricted to the root user, and then only certain parts of memory as determined by STRICT_DEVMEM. The latter is intended only to allow reads of “reserved RAM” (normal system memory reserved for specific device purposes, not that allocated for use by programs). But in Tommi’s case, he was running a kernel that didn’t have STRICT_DEVMEM set on a system booting with EFI for which the legacy “EBDA” (Extended BIOS Data Area) that normally lives at a fixed location in the sub-1MB memory window on x86 was not provided by the platform. This meant that the x86info tool was trying to read memory that was a legal address but which wasn’t reserved in the EFI System Table (memory map), and was mapped for use elsewhere.

All of this led Linus to point out that simply doing a “dd” read on the first MB of the memory on the offending system would be enough to crash it. He noted that (on x86 systems) the kernel allows access to the sub-1MB region of physical memory unconditionally (regardless of the setting of the kernel STRICT_DEVMEM option) because of the wealth of platform data that lives there and which is expected to be read by various tools. He proposed effectively changing the logic for this region such that memory not explicitly marked as reserved would simply “just read zero” rather than trying to read random kernel data in the case that the memory is used for other purposes.
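The proposed “just read zero” behavior can be modeled in a few lines of Python. This is a sketch of the policy as described above, not the kernel change itself: physical memory is faked as a byte string, the set of reserved page numbers is an assumed input, and the function name is hypothetical.

```python
PAGE_SIZE = 4096


def read_low_memory(phys_mem, reserved_pages, addr, length):
    """Read [addr, addr + length); pages not marked reserved read back as zeros."""
    out = bytearray()
    end = addr + length
    while addr < end:
        page = addr // PAGE_SIZE
        chunk_end = min(end, (page + 1) * PAGE_SIZE)
        if page in reserved_pages:
            out += phys_mem[addr:chunk_end]   # platform data: return it as-is
        else:
            out += bytes(chunk_end - addr)    # in use elsewhere: return zeros
        addr = chunk_end
    return bytes(out)


# Two fake pages: page 0 is reserved platform data, page 1 is ordinary RAM.
fake_memory = bytes([0xAA]) * PAGE_SIZE + bytes([0xBB]) * PAGE_SIZE
data = read_low_memory(fake_memory, {0}, PAGE_SIZE - 2, 4)
print(data.hex())
```

A read that straddles the two pages returns the real bytes from the reserved page followed by zeros, so a tool like x86info sees nothing interesting instead of touching memory that belongs to something else.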

This author certainly welcomes a day when /dev/mem dies a death. We’ve gone to great lengths on 64-bit ARM systems to kill it, in part because it is so legacy, but in another part because there are two possible ways we might trap a bad access – one as in this case (synchronous exception) but another in which the access might manifest as a System Error due to hitting in the memory controller or other SoC logic later as an errant access.

Ongoing Development

Steve Longerbeam posted version 6 of a patch series entitled “i.MX Media Driver”, which implements a V4L2 (Video for Linux 2) driver for i.MX6.

David Gstir (on behalf of Daniel Walter) posted “fscrypt: Add support for AES-128-CBC” which “adds support for using AES-128-CBC for file contents and AES-128-CBC-CTS for file name encryption. To mitigate watermarking attacks, IVs [Initialization Vectors] are generated using the ESSIV algorithm.”

Djalal Harouni posted an RFC (Request for Comments) patch entitled “proc: support multiple separate proc instances per pidnamespace”. In his patch, Djalal notes that “Historically procfs was tied to pid namespaces, and mount options were propagated to all other procfs instances in the same pid namespace. This solved several use cases in that time. However today we face new problems, there are multiple container implementations there, some of them want to hide pid entries, others want to hide non-pid entries, others want to have sysctlfs, others want to share pid namespace with private procfs mounts. All these with current implementation won’t work since all options will be propagated to all procfs mounts. This series allow to have new instances of procfs per pid namespace where each instance can have its own mount option”.

Zhou Chengming (Huawei) posted “reduce the time of finding symbols for module” which aims to reduce the time taken for the Kernel Live Patch (klp) module to be loaded on a system in which the module uses many static local variables. The patch replaces the use of kallsyms_on_each_symbol with a variant that limits the search to those needed for the module (rather than every symbol in the kernel). As Jessica Yu notes, “it means that you have a lot of relocation records with reference your out-of-tree module. Then for each such entry klp_resolve_symbol() is called and then klp_find_object_symbol() to actually resolve it. So if you have 20k entries, you walk through vmlinux kallsyms table 20k times…But if there were 20k modules loaded, the problem would still be there”. She would like to see a more generic fix, but was also interested to see that the Huawei report referenced live patching support for AArch64 (64-bit ARM Architecture), which isn’t in upstream. She had a number of questions about whether this code was public, and in what form, to which links to works in progress from several years ago were posted. It appears that Huawei have been maintaining an internal version of these in their kernels ever since.
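The cost being described is easy to see with a small Python model. The sketch below counts how many symbol table entries are visited when each lookup walks every symbol, versus when the walk is limited to the module’s own symbols. The table sizes, symbol names, and helper functions are all made up for illustration; this is not the kernel’s kallsyms code.

```python
def find_symbol(entries, owner, name):
    """Linear search; returns (address, number_of_entries_visited)."""
    for visited, (sym_owner, sym_name, addr) in enumerate(entries, start=1):
        if sym_owner == owner and sym_name == name:
            return addr, visited
    raise KeyError((owner, name))


def resolve(entries, owner, names):
    """Resolve every name with a fresh walk, totalling the entries visited."""
    resolved, total = {}, 0
    for name in names:
        addr, visited = find_symbol(entries, owner, name)
        resolved[name] = addr
        total += visited
    return resolved, total


# Made-up tables: 1000 core kernel symbols followed by 10 module symbols.
vmlinux = [("vmlinux", f"core_sym_{i}", 0x1000 + i) for i in range(1000)]
module = [("demo_mod", f"mod_sym_{i}", 0x9000 + i) for i in range(10)]
wanted = [f"mod_sym_{i}" for i in range(5)]

full_result, full_steps = resolve(vmlinux + module, "demo_mod", wanted)
scoped_result, scoped_steps = resolve(module, "demo_mod", wanted)
print(full_steps, scoped_steps)
```

Both walks resolve the same addresses, but the full walk pays for the entire vmlinux table on every lookup. With 20k relocation entries and a real kernel’s symbol count, that difference is what shows up as load time.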

Ying Huang (Intel) posted version 7 of “THP swap: Delay splitting THP during swapping out”, which as we previously noted aims to swap out actual whole “huge” (within certain limits) pages rather than splitting them down to the smallest atom of size supported by the architecture during swap. There was a specific request to various maintainers that they review the patch.

Andi Kleen posted a patch removing the printing of MCEs to the kernel log when the “mcelog” daemon is running (and hopefully logging these events).

Laura Abbott posted a RESEND of “config: Add Fedora config fragments”, which does what it says on the tin. Quoting her mail, “Fedora is a popular distribution for people who like to build their own kernels. To make this easier, add a set of reasonable common config options for Fedora”. She adds files in kernel/configs for “fedora-core.config”, “fedora-fs.config” and “fedora-networking.config” which should prove very useful next time someone complains at me that “building kernels for Red Hat distributions is hard”.

Eric Biggers posted “KEYS: encrypted: avoid encrypting/decrypting stack buffers”, which notes that “Since [Linux] v4.9, the crypto API cannot (normally) be used to encrypt/decrypt stack buffers because the stack may be virtually mapped. Fix this for the padding buffers in encrypted-keys by using ZERO_PAGE for the encryption padding and by allocating a temporary heap buffer for the decryption padding.” Eric is referring to the virtually mapped stack support introduced by Andy Lutomirski which has the side effect of incidentally flagging up various previous misuse of stacks.

Mark Rutland posted an RFC (Request For Comments) patch series entitled “ARMv8.3 pointer authentication userspace support”. ARMv8.3 includes a new architectural extension that “adds functionality to detect modification of pointer values, mitigating certain classes of attack such as stack smashing, and making return oriented [ROP] programming attacks harder”. [aside: If you’re bored, and want some really interesting (well, I think so) bedtime reading, and you haven’t already read all about ROP, you really should do so]. Continuing to quote Mark, the “extension introduces the concept of a pointer authentication code (PAC), which is stored in some upper bits of pointers. Each PAC is derived from the original pointer, another 64-bit value (e.g. the stack pointer), and a secret 128-bit key”. The extension includes new instructions to “insert a PAC into a pointer”, to “strip a PAC from a pointer”, and to “authenticate strip a PAC from a pointer” (which has the side effect of poisoning the pointer and causing a later fault if the authentication fails – allowing for detection of malicious intent).

Mark’s patch makes for great reading and summarizes this feature well. It notes that it has various counterparts in userspace to add ELF (Executable and Linking Format, the executable container used on modern Linux and Unix systems) notes sections to programs to provide the necessary annotations and presumably other data necessary to implement pointer authentication in application programs. It will be great to see those posted too.
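For listeners who prefer code to prose, here is a toy Python model of the insert, strip, and authenticate operations described above. It makes several loud assumptions: a 48-bit virtual address with a 16-bit PAC in the upper bits, a truncated HMAC-SHA256 standing in for whatever the hardware actually computes, and a made-up poison bit to mimic a pointer that will fault later. None of this is the ARMv8.3 algorithm or Mark’s code.

```python
import hashlib
import hmac

VA_BITS = 48                      # assumed virtual address width
ADDR_MASK = (1 << VA_BITS) - 1
POISON = 1 << 62                  # made-up marker for a failed authentication


def compute_pac(addr, modifier, key):
    """Derive a 16-bit code from the address, a 64-bit modifier, and a key."""
    msg = addr.to_bytes(8, "little") + modifier.to_bytes(8, "little")
    mac = hmac.new(key, msg, hashlib.sha256).digest()  # stand-in algorithm
    return int.from_bytes(mac[:2], "little")


def sign(ptr, modifier, key):
    """Insert a PAC into the unused upper bits of the pointer."""
    addr = ptr & ADDR_MASK
    return addr | (compute_pac(addr, modifier, key) << VA_BITS)


def strip(ptr):
    """Remove the PAC without checking it."""
    return ptr & ADDR_MASK


def authenticate(ptr, modifier, key):
    """Return the plain pointer if the PAC matches, else a poisoned one."""
    addr = strip(ptr)
    if (ptr >> VA_BITS) == compute_pac(addr, modifier, key):
        return addr
    return addr | POISON


def is_usable(ptr):
    """A pointer with any upper bit set would fault when dereferenced."""
    return (ptr >> VA_BITS) == 0


key = bytes.fromhex("00112233445566778899aabbccddeeff")  # 128-bit secret
return_address = 0x00007FFF12345678
stack_pointer = 0x00007FFFFFFFE000

signed = sign(return_address, stack_pointer, key)
good = authenticate(signed, stack_pointer, key)
bad = authenticate(signed ^ 0x10, stack_pointer, key)  # attacker edits address
print(hex(good), is_usable(good), is_usable(bad))
```

With only 16 bits of code in this toy, a forged pointer still has a small chance of passing, which is why the source describes the feature as making such attacks harder rather than impossible.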

Joerg Roedel followed up to a posting from Samuel Sieb entitled “AMD IOMMU causing filesystem corruption” to note that it has recently been discovered (and was documented in another thread this past week entitled “PCI: Blacklist AMD Stoney GPU devices for ATS”) that the AMD “Stoney” platform features a GPU for which PCI-ATS is known to be broken. ATS (Address Translation Services) is the mechanism by which PCIe endpoint devices (such as plugin adapter cards, including AMD GPUs) may obtain virtual to physical address translations for use in inbound DMA operations initiated by a PCIe device into a virtual machine’s (VM’s) memory (the VM talks the other way through the CPU MMU).

In ATS, the device utilizes an Address Translation Cache (ATC) which is essentially a TLB (Translation Lookaside Buffer) but not called that because of handwavy reasons intended not to confuse CPU and non-CPU TLBs. When a device sitting behind an IOMMU needs to perform an address translation, it asks a Translation Agent (TA) typically contained within the PCIe Root Complex to which it is ultimately attached. In the case of AMD’s Stoney Platform, this blows up under address invalidation load: “the GPU does not reply to invalidations anymore, causing Completion-wait loop timeouts on the AMD IOMMU driver side”. Somehow (but this isn’t clear) this is suspected as the possible cause of the filesystem corruption seen by Samuel, who is waiting to rebuild a system that ate its disk testing this.

Calvin Owens (Facebook) posted “printk: Introduce per-console filtering of messages by loglevel”, which notes that “Not all consoles are created equal”. It essentially allows the user to set a different loglevel for consoles that might each be capable of very different performance. For example, a serial console might be severely limited in its baud rate (115,200 in many cases, though rates as low as 9,600 or lower are still commonplace in 2017), while a graphics console might be capable of much higher. Calvin mentions netconsole as the preferred (higher speed) console that Facebook use to “monitor our fleet” but that “we still have serial consoles attached on each host for live debugging, and the latter has caused problems”. He doesn’t specifically mention USB debug consoles, or the EFI console, but one assumes that listeners are possibly aware of the many console types.
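The idea of per-console filtering can be sketched in Python as follows. The numeric levels match the kernel’s convention (lower numbers are more severe, and a message is shown when its level is below the console’s loglevel), but the Console class, the emit helper, and the chosen thresholds are assumptions made for illustration and say nothing about the interface Calvin’s patch actually exposes.

```python
from dataclasses import dataclass, field

# Kernel log levels: lower numbers are more severe.
KERN_ERR, KERN_WARNING, KERN_INFO, KERN_DEBUG = 3, 4, 6, 7


@dataclass
class Console:
    name: str
    loglevel: int                      # messages with level < loglevel are shown
    lines: list = field(default_factory=list)

    def write(self, level, text):
        if level < self.loglevel:
            self.lines.append(text)


def emit(consoles, level, text):
    """Offer one message to every console; each applies its own filter."""
    for console in consoles:
        console.write(level, text)


serial = Console("ttyS0", loglevel=KERN_WARNING)   # slow link: errors only
netcon = Console("netconsole", loglevel=8)         # fast link: everything
consoles = [serial, netcon]

emit(consoles, KERN_ERR, "disk error")
emit(consoles, KERN_INFO, "link up")
emit(consoles, KERN_DEBUG, "probe details")
print(serial.lines, netcon.lines)
```

With a single global loglevel, the slow serial console would either drown in debug output or the fast console would lose it. A threshold per console avoids that trade-off.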

Christopher Bostic (IBM) posted version 5 of a patch series entitled “FSI device driver implementation”. FSI stands for “Flexible Support Interface”, a “high fan out [a term referring to splitting of digital signals into many additional outputs] serial bus consisting of a clock and a serial data line capable of running at speeds up to 166MHz”. His patches add core support to the Linux bus and device models (including “probing and discovery of slaves and slave engines”), along with additional handling for CFAM (Common Field Replaceable Unit Access Macro) – an ASIC (chip) “residing in any device requiring FSI communications” that provides these various “engines”, and an FSI engine driver that manages devices on the FSI bus.

Finally, Adam Borowski posted “n_tty: don’t mangle tty codes in OLCUC mode” which aims to correct a bug which is “reproducible as of Linux 0.11” and all the way back to 0.01. OLCUC is not part of POSIX, but this termios structure flag tells Linux to map lowercase characters to uppercase ones. The posting cites an obvious desire by Linus to support “Great Runes” (archaic Operating Systems in which everything was uppercase), to which Linus (obviously in jest, and in keeping with the April 1 date) asked Adam why he “didn’t make this the default state of a tty?”.
