Linux Kernel Podcast for 2017/07/07


Linux 4.12 final is released, the 4.13 merge window opens, and various assorted ongoing kernel development is described in detail.

Editorial note

Reports of this podcast’s demise are greatly exaggerated. But it is worth noting that recording this weekly is HARD. That said, I am going to work on automation (I want the podcast to effectively write itself by providing a web UI view of LKML threads that allows anyone to write summaries, add author bios, links, etc. – and expand this to other communities) but that will all take some time. Until that happens, we’ll just have to live with some breaks.


Linus Torvalds announced Linux 4.12 final. In his announcement mail, Linus reflects that “4.12 is just plain big”, noting that this was “one of the bigger releases historically, and I think only 4.9 ends up having had more commits. And 4.9 was big at least partly because Greg announced it was an LTS [Long Term Support – receiving updates for several years] kernel”. In pure numbers, 4.12 adds over a million lines of code over 4.11, about half of which can be attributed to enablement for the AMD Vega GPU support. As usual, both Linux Weekly News (LWN) and KernelNewbies have excellent and highly detailed summaries. Listeners are encouraged to support real kernel journalism by subscribing to Linux Weekly News and visiting

Theodore (Ted) Ts’o posted “Next steps and plans for the 2017 Maintainer and Kernel Summits”. He reminds everyone of the (slightly) revised format to this year’s Kernel Summit (which is, as is often the case, co-located with a Linux Foundation event in the form of the Open Source Summit Prague in October). Notably, a program committee is established to help encourage submissions from those who feel they should be present at the event. To learn more, see the mailing list archives containing the announcement: (technically the deadline is already passed, or tomorrow, depending)

Greg K-H (Kroah-Hartman) announced Linux 4.4.76, 4.9.36, and 4.11.9.

Willy Tarreau announced Linux 3.10.106, including a reminder that this “LTS” [Long Term Stable] kernel is “scheduled for end of life on end of October”.

Steven Rostedt released preempt-rt (“Real Time”) kernels 3.10.107-rt122, 3.18.59-rt65, 4.4.75-rt88, and 4.9.35-rt25, all of which were simply rebases to stable kernel updates and had “no RT specific changes”. It will be interesting to see if some of the hotplug fixes Thomas Gleixner has sent for Linux 4.13 will resolve issues seen by some RT users when doing hotplug.

Sebastian Andrzej Siewior announced preempt-rt (“Real time”) kernels v4.9.33-rt23 and v4.11.7-rt3; the announcement still notes the potential for a deadlock under CPU hotplug.

Stephen Hemminger announced iproute2 version 4.12.0 matching Linux 4.12. This includes support for features present in the new kernel, including flower support and enhancements to the TC (Traffic Control) code:

Bartosz Golaszewski posted libgpiod v0.3:

Mathieu Desnoyers announced LTTng modules 2.10.0-rc2, 2.9.3, 2.8.6, including support for “4.12 release candidate kernels”.

The 4.13 merge window

With the opening of the 4.13 merge window, many pull requests have begun flowing for what will become the new hotness in another couple of months. We won’t summarize each in detail (that resulted in a one hour long podcast the last time…) but will instead call out a few “interesting” changes of note. Stephen Rothwell also promptly updated his daily linux-next tree with the usual disclaimer that “Please do not add any v4.14 material to you[r] linux-next included branches until after v4.13-rc1 has been released”.

ACPI. Rafael J. Wysocki posted “ACPI updates for v4.13-rc1”, which includes an update to the ACPICA (ACPI Component Architecture) release of 20170531 that adds support to the OS-independent ACPICA layer for ACPI 6.2. This includes a number of new tables, including the PPTT (Processor Properties and Topology Table) that some of us have wanted to see for many years (as a means to more fully describe the NUMA properties of ARM servers, as just a random example…). In addition, Kees Cook has done some work to clean up the use of function pointer structures in ACPICA to use “designated initializers” so as “to make the structure layout randomization GCC plugin work with it”. All in all, this is a nice set of updates for all architectures.

AppArmor. John Johansen noted in his earlier pull request (to James Morris, who owns overall security subsystem pull requests headed to Linus) that an attempt was being made to get many of the Ubuntu specific AppArmor patches upstreamed. The 4.13 patches “introduces the domain labeling base code that Ubuntu has been carrying for several years”. He then plans to begin to RFC other Ubuntu-specific patches in later cycles.

ARM. Arnd Bergmann notes a number of changes to 64-bit ARM platforms, including work done by Timur Tabi to change kernel def(ault)config files to “enable[s] a number of options that are typically required for server platforms”. It’s only been many years since this should have been the case in upstream Linux. Meanwhile, in a separate pull for “ARM: 64-bit DT [DeviceTree] updates”, support is added for many new boards (“For the first time I can remember, this is actually larger than the corresponding branch for 32-bit platforms”) including new varieties of “OrangePi” based on Allwinner chipsets.

Docs. Jon(athan) Corbet had noted that “You’ll also encounter more than the usual number of conflicts, which is saying something”. Linus “fixed the ones that were actual data conflicts” but he had some suggestions for how Kbuild could be modified such that a “make allmodconfig” checked for the existence of various files being referenced in the rst documentation source files. He also noted that he was happy to see docbook “finally gone” but that sphinx, the tool used to generate documentation now, “isn’t exactly a speed demon”.

Hotplug. As noted elsewhere, Thomas Gleixner posted a pull request for various smp hotplug fixes that includes replacing an “open coded RWSEM [Read Write Semaphore] with a percpu RWSEM”. This is done to enable full coverage by the kernel’s “lockdep” locking dependency checker in order to catch hotplug deadlocks that have been seen on certain RT (Real Time) systems.

IRQ. Thomas Gleixner posted “irq updates for 4.13”, which includes “Expand the generic infrastructure handling the irq migration on CPU hotplug and convert X86 over to it” in preparation for cleaning up affinity management on blk multiqueue devices (preventing interrupts being moved around during hotplug by instead shutting down affine interrupts intended to be always routed to a specific CPU). Thomas notes that “Jens [the blk maintainer] acked them and agreed that they should go with the irq changes”, but Linus later pushed back strongly after hitting merge conflicts that made him feel that some of these changes should have gone in via the blk tree instead of clashing with it. Linus was also concerned about whether the onlining code worked at all.

Objtool. Ingo Molnar posted a pull request including changes to the “objtool” tool intending to allow the tracking of stack pointer modifications through “machine instructions of disassembled functions found in kernel .o files”. The idea is to remove a dependency upon compiling the kernel with the CONFIG_FRAME_POINTER=y option (which causes a larger stack frame and possible additional register pressure on some architectures) while still retaining the ability to generate correct kernel debuginfo data in the future.
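
To make the stack-tracking idea a little more concrete, here is a toy Python sketch. It is not how the tool itself is implemented: the pre-decoded instruction tuples, the `track_stack_depth` name, and the 8-byte slot size are all assumptions made for illustration. It walks a made-up function body and records how far the stack pointer has moved from its value at function entry before each instruction, which is roughly the information an unwinder needs when no frame pointer is available.

```python
WORD = 8  # assumed stack slot size in bytes (64-bit), for illustration only


def track_stack_depth(instructions):
    """Return the stack depth (bytes below the entry SP) seen before each instruction."""
    depth = 0
    depths = []
    for op, *args in instructions:
        depths.append(depth)
        if op == "push":
            depth += WORD
        elif op == "pop":
            depth -= WORD
        elif op == "sub_sp":    # stands in for something like "sub $32, %rsp"
            depth += args[0]
        elif op == "add_sp":    # stands in for something like "add $32, %rsp"
            depth -= args[0]
        # every other instruction leaves the stack pointer alone in this model
    return depths


# Hypothetical, already-decoded function body.
function_body = [
    ("push", "rbp"),
    ("sub_sp", 32),
    ("call", "helper"),
    ("add_sp", 32),
    ("pop", "rbp"),
    ("ret",),
]
depths = track_stack_depth(function_body)
```

With a per-instruction depth table like this, an unwinder could in principle find the return address from any point in the function without relying on a dedicated frame pointer register.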

PCI. Thomas Gleixner posted “x86/PCI updates for 4.13”, which includes work to separate PCI config space accessors from using a global PCI lock. Apparently, x86 already had an additional PCI config lock and so two layers of redundant locking were being employed, while neither was strictly necessary in the case of ECAM (“mmconfig”) based configuration, since “access to the extended configuration space [MMIO based configuration in PCIe] does not require locking”. Thomas also notes that a commit which had switched x86 to use ECAM [the MMIO mode] by default was removed so it will still use “type1 accessors” (the “old fashioned way” that Linus is so happy with) serialized by x86 internal locking for primary configuration space. This set of patches came in through x86 via Thomas with Bjorn Helgaas’s (PCI maintainer) permission.

RCU. Ingo Molnar noted that “The sole purpose of these changes is to shrink and simplify the RCU code base, which has suffered from creeping bloat”.

Scheduler. Ingo Molnar posted a pull request that included a number of changes, among them being NUMA scheduling improvements to address regressions seen when comparing 4.11 based kernels to older ones, from Rik van Riel.

VFS. Al Viro went to town with VFS updates split into more than 10 parts (yes, really, actually 11 as of this writing). These are caused by various intrusive changes which impact many parts of the kernel tree. Linus said he would “*much* rather do five separate pull requests where each pull has a stated reason and target, than do one big mixed-up one”. Which is good because Viro promised many more than 5. Patch series number 11 got the most feedback so far.

X86. Ingo Molnar also went to town, in typical fashion, with many different updates to the kernel. These included mm changes enabling more Intel 5-level paging features (switching the “GUP” or “Get User Pages” code over to the newer generic kernel implementation shared by other architectures), and “[C]ontinued work to add PCID [Process Context ID] support”. Per-process context IDs allow for TLB (Translation Lookaside Buffer – the micro caches that store virtual to physical memory translations following page table walks by the hardware walkers) flush infrastructure optimizations on legacy architectures such as x86 that do not have certain TLB hardware optimizations. Ingo also posted microcode updates that include support for saving microcode pointers and wiring them up for use early in the “resume-from-RAM” case, and fixes to the Hyper-V guest support that add a synthetic CPU MSR (Model Specific Register) providing the CPU TSC frequency to the guest.

Ongoing Development

ARM. Will Deacon posted the fifth version of a patch series entitled “Add support for the ARMv8.2 Statistical Profiling Extension”, which provides a linear, virtually addressed memory buffer containing statistical samples (subject to various filtering) related to processor operations of interest that are performed by running (application) code. Sample records take the form of “packets”, which contain very detailed amounts of information, such as the virtual PC (Program Counter) address of a branch instruction, its type (conditional, unconditional, etc.), number of cycles waiting for the instruction to issue, the target, cycles spent executing the branch instruction, associated events (e.g. misprediction), and so on. Detailed information about the new extension is available in the ARM ARM, and is summarized in a blog post, here:

RISC-V. Palmer Dabbelt posted v4 of the enablement patch series adding support for the Open Source RISC-V architecture (which will then require various enablement for specific platforms that implement the architecture). In his patch posting, he notes changes from the previous version 3 that include disabling cmpxchg64 (a 64-bit instruction that performs an “atomic” compare and exchange operation, but which isn’t atomic on 32-bit systems) on 32-bit, adding an ELF_HWCAP (hardware capability) within binaries in order for users to determine the ISA of the machine, and various other miscellaneous changes. He asks for consideration that this be merged during the ongoing merge window for 4.13, which remains to be seen. We will track this in future episodes.

FOLL_FORCE. Keno Fischer noted that “Yes, people use FOLL_FORCE”, referencing a commit from Linus in which an effort had been made to “try to remove use of FOLL_FORCE entirely” on the procfs (/proc) filesystem. Keno says “We used these semantics as a hardening mechanism in the julia JIT. By opening /proc/self/mem and using these semantics, we could avoid needing RWX pages, or a dual mapping approach”. In other words, they cheat and don’t set up direct RWX mappings ahead of time but instead get access to them via the backdoor using the kernel’s “/proc/self/mem” interface directly. Linus replied, “Oh, we’ll just re-instate the kernel behavior, it was more an optimistic “maybe nobody will notice” thing, and apparently people did notice”.

GICv4. Marc Zyngier posted version 2 of a patch series entitled “irqchip: KVM: Add support for GICv4”, a “(monster of a) series [that] implements full support for GICv4, bringing direct injection of MSIs [Message Signalled Interrupts] to KVM on arm and arm64, assuming you have the right hardware (which is quite unlikely)”. Marc says that the “stack has been *very lightly* tested on an arm64 model, with a PCI virtio block device passed from the host to a guest (using kvmtool and Jean-Philippe Brucker’s excellent VFIO support patches). As it has never seen any HW, I expect things to be subtly broken, so go forward and test if you can, though I’m mostly interested in people reviewing the code at the moment”. It’s awesome to see 64-bit ARM systems on par with legacy architectures when it comes to VM interrupt injection.

GPIO. Andy Shevchenko posted a patch (with Linus Walleij’s approval) noting that Intel would help to maintain GPIO ACPI support in the GPIO subsystem.

Hardlockup. Nicholas Piggin posted “[RFC] arch hardlockup detector interfaces improvement” which aims to “make it easier for architectures that have their own NMI / hard lockup detector to reuse various configuration interfaces that are provided by generic detectors (cmdline, sysctl, suspend/resume calls)”. He “do[es] this by adding a separate CONFIG_SOFTLOCKUP_DETECTOR [kernel configuration option], and juggling around what goes under config options. HAVE_NMI_WATCHDOG continues to be the config for arch to override the hard lockup detector, which is expanded to cover a few more cases”.

HMM. Jérôme Glisse posted “Cache coherent device memory (CDM) with HMM” which layers above his previous HMM (Heterogeneous Memory Management) work to provide a generic means to manage device memory that behaves much like regular system memory but may still need managing “in isolation from regular memory” (for any number of reasons, including NUMA effects). This is particularly useful in the case of a coherently attached system bus being used to connect on-device memory, such as CAPI or CCIX. [disclaimer: this author chairs the CCIX software working group]

Hyper-V. KY Srinivasan posted an updated version of his “Hyper-V: paravirtualized remote TLB flushing and hypercall improvements” patches, which aim to optimize the case of remote TLB flushing on other vCPUs within a guest. TLBs are micro caches that store VA (Virtual Address) to PA (Physical Address) translations for VMAs (Virtual Memory Areas) that need to be invalidated during a context switch operation from one process to another. Typically, an Operating System may either utilize an IPI (Inter-Processor-Interrupt) to schedule a remote function on other CPUs that will tear down their TLB entries, or – on more enlightened and sophisticated modern computer architectures – may perform a hardware broadcast invalidation instruction that achieves the same without the gratuitous overhead. On x86 systems, IPIs are commonly used by guest operating systems and their impact can be reduced by providing special guest hypercalls allowing for hypervisor assistance in place of broadcast IPIs. Jork Loeser also posted a patch updating the Hyper-V vPCI driver to “use the Server-2016 version of the vPCI protocol, fixing MSI creation”.

ILP32. Yury Norov posted version 8 of a patch series entitled “ILP32 for ARM64” which aims to enable support for the Integer Long Pointer 32-bit optional userspace ABI on 64-bit ARM processors. In ways similar to “x32” on 64-bit “x86” systems, ILP32 aims to provide the benefits of the new ARMv8 ISA without having to use 64-bit data types and pointers for code that doesn’t actually require such large data or a large address space. Pointers (pun intended) are provided to an example kernel, GLIBC, and an OpenSuSE-based Linux distribution built against the newer ABI.

IMC Instrumentation Support. Madhavan Srinivasan posted version 10 of a patch series entitled “IMC Instrumentation Support” which aims to provide support for “In-Memory-Collection” infrastructure present in IBM POWER9 processors. IMC apparently “contains various Performance Monitoring Units (PMUs) at Nest level (these are on-chip but off-core), Core level and Thread level. The Nest PMU counters are handled by a Nest IMC microcode which runs in the OCC (On-Chip Controller) complex. The microcode collects the counter data and moves the nest IMC counter data to memory”. This effectively seems to be a microcontroller managed mechanism for providing certain core and uncore counter data using a standardized interface.

Intel FPGA Device Drivers. Wu Hao posted version 2 of a patch series entitled “Intel FPGA Device Drivers”, which “provides interfaces for userspace applications to configure, enumerate, open and access FPGA accelerators on platforms equipped with Intel(R) PCIe based FPGA solutions and enables system level management functions such as FPGA partial reconfiguration, power management and virtualization”. In other words, many of the capabilities required for datacenter level deployment of PCIe-attached FPGA accelerators.

Interconnects. Georgi Djakov posted version 2 of a patch series entitled “Introduce on-chip interconnect API”, which aims to provide a generic API to help manage the many varied high performance interconnects present on modern high-end System-on-Chip “processors”. As he notes, “Modern SoCs have multiple processors and various dedicated cores (video, gpu, graphics, modem). These cores are talking to each other and can generate a lot of data flowing through the on-chip interconnects. These interconnect buses could form different topologies such as crossbar, point to point buses, hierarchical buses or use the network-on-chip concept”. The API provides an ability (subject to hardware support thereof) to control bandwidth use, QoS (Quality-of-Service), and other settings. It also includes code to enable the Qualcomm msm8916 interconnect with a layered driver.

IRQs. Daniel Lezcano posted version 10 of a patch series entitled “irq: next irq tracking” which aims to predict future IRQ occurrences based upon previous system behavior. “As previously discussed the code is not enabled by default, hence compiled out”. A small circular buffer is used to keep track of non-timer interrupt sources. “A third patch provides the mathematic to compute the regular intervals”. The goal is to predict future expected system wakeups, which is useful from a latency perspective, as well as for various scheduling, or energy calculations later on.
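
As a rough illustration of the circular-buffer idea, the Python sketch below keeps the last few inter-arrival intervals of one interrupt source and only offers a prediction when they look regular. The statistics used here (a plain mean, with a standard deviation cutoff) are an assumption for the sake of the example, as are the class name, buffer size, and jitter threshold; the patch series has its own math, which is not described in detail here.

```python
from collections import deque
from statistics import mean, pstdev


class IrqIntervalTracker:
    """Toy model: remember recent inter-arrival intervals for one IRQ source."""

    def __init__(self, size=8):
        self.intervals = deque(maxlen=size)  # acts as a small circular buffer
        self.last_ts = None

    def record(self, timestamp_us):
        """Note that the interrupt fired at timestamp_us (microseconds)."""
        if self.last_ts is not None:
            self.intervals.append(timestamp_us - self.last_ts)
        self.last_ts = timestamp_us

    def predict_next(self, max_jitter_us=50):
        """Return the expected time of the next interrupt, or None if irregular."""
        if len(self.intervals) < 2:
            return None
        if pstdev(self.intervals) > max_jitter_us:
            return None  # intervals vary too much to call this source periodic
        return self.last_ts + mean(self.intervals)


regular = IrqIntervalTracker()
for ts in (0, 1000, 2000, 3000):
    regular.record(ts)

bursty = IrqIntervalTracker()
for ts in (0, 10, 5000, 5020):
    bursty.record(ts)
```

A consumer such as an idle governor could then compare the predicted wakeup against the cost of entering a deep sleep state, and fall back to other heuristics whenever `predict_next()` returns `None`.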

Memory Allocation Watchdog. Tetsuo Handa posted version 9 of a patch series entitled “mm: Add memory allocation watchdog kernel thread”, which “adds a watchdog which periodically reports number of memory allocating tasks, dying tasks and OOM victim tasks when some task is spending too long time inside __alloc_pages_slowpath() [the code path called when a running program – known as a task within the kernel – must synchronously block and wait for new memory pages to become available for allocation]”. Tetsuo adds, “Thanks to OOM [Out-Of-Memory] reaper which can guarantee forward progress (by selecting next OOM victim) as long as the OOM killer can be invoked, we can start testing low memory situations which are previously too difficult to test. And we are now aware that there are still corner cases remaining where the system hangs without invoking the OOM killer”. The patch aims to help determine when long hangs are caused by stalled memory allocations.

Memory Protection Keys. Ram Pai posted version 5 of a patch series entitled “powerpc: Memory Protection Keys”, which aims to enable a feature in future ISA3.0 compliant POWER architecture platforms comparable to the “memory protection keys” added by Intel to their Intel x64 Architecture (“x86” variant). As Ram notes, “The overall idea: A process allocates a key and associates it with an address range within its address space. The process then can dynamically set read/write permissions on the key without involving the kernel. Any code that violates the permissions of the address space; as defined by its associated key, will receive a segmentation fault”. The patches enable support on the “PPC64 HPTE platform” and are noted to have passed all of the same tests as on x86.
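
The flow Ram describes (allocate a key, attach it to an address range, flip permissions on the key, fault on violation) can be modelled in a few lines of Python. This is purely a conceptual sketch: the class and method names are hypothetical and do not correspond to the real system calls or hardware registers, and a Python exception stands in for the segmentation fault.

```python
class SegmentationFault(Exception):
    """Stands in for the signal a real process would receive."""


class ToyProtectionKeys:
    def __init__(self):
        self._next_key = 1
        self._ranges = []   # list of (start, end, key)
        self._rights = {}   # key -> set of allowed access kinds

    def alloc_key(self):
        """Hand out a fresh key that initially allows reads and writes."""
        key = self._next_key
        self._next_key += 1
        self._rights[key] = {"read", "write"}
        return key

    def assign(self, start, length, key):
        """Associate a key with the address range [start, start + length)."""
        self._ranges.append((start, start + length, key))

    def set_rights(self, key, rights):
        """Model of the permission change that happens without entering the kernel."""
        self._rights[key] = set(rights)

    def access(self, address, kind):
        """Check an access; raise if the key covering the address forbids it."""
        for start, end, key in self._ranges:
            if start <= address < end and kind not in self._rights[key]:
                raise SegmentationFault(f"{kind} at {address:#x} denied by key {key}")
        return True


keys = ToyProtectionKeys()
data_key = keys.alloc_key()
keys.assign(0x1000, 0x1000, data_key)
keys.set_rights(data_key, {"read"})   # make the range read-only for now
```

The interesting property is in `set_rights`: because permissions hang off the key rather than the page tables, a process can open and close a write window around sensitive data very cheaply.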

Modules. Djalal Harouni posted version 4 of a patch series entitled “modules: automatic module loading restrictions”, which adds a new global sysctl flag, as well as per task one, called “modules_autoload_mode”. “This new flag allows to control only automatic module loading [the kernel-invoked auto loading of certain modules in response to user or system actions] and if it is allowed or not, aligning in the process the implicit operation with the explicit [existing option to disable all module loading] one where both are now covered by capabilities checks”. The idea is to prevent certain classes of security exploit wherein – for example – a system can be caused to load a vulnerable network module by sending it a certain packet, or an application calling a certain kernel function. Other such classes of attack exist against automatic module loading, and have been the subject of a number of CVE [Common Vulnerabilities and Exposures] releases requiring frantic system patching. This feature will allow sysadmins to limit module auto loading on some classes of systems (especially embedded/IoT devices).

Network filtering. Shubham Bansal posted an RFC patch entitled “RFC: arm eBPF JIT compiler” which “is the first implementation of eBPF JIT for [32-bit] ARM”. Russell King had various questions, including whether the code handled “endian issues” well, to which Shubham replied that he had not tested it with BE (Big Endian) but was interested in setting up qemu to run Big Endian ARM models and would welcome help improving the code.

NMI. Adrien Mahieux posted “x86/kernel: Add generic handler for NMI events” which “adds a generic handler where sysadmins can specify the behavior to adopt for each NMI event code. List of events is provided at module load or on kernel cmdline, so can also generate kdump upon boot error”. The options include silently ignoring NMIs (which actually passes them through to the next handler), drop NMIs (actually discard them), or to panic the kernel immediately. An example given is using the drop parameter during kdump in order to prevent a second NMI from triggering a panic while another crash dump is already capturing from the first.
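
The three behaviors described above map naturally onto a small dispatch table. The Python sketch below shows the shape of such a table; the `code:action` string format, the function names, and the use of a Python exception for "panic" are all invented for illustration and are not the patch's actual parameter syntax.

```python
IGNORE, DROP, PANIC = "ignore", "drop", "panic"


def parse_rules(spec):
    """Parse a hypothetical 'code:action,code:action' list into a dict."""
    rules = {}
    for item in spec.split(","):
        code, action = item.split(":")
        if action not in (IGNORE, DROP, PANIC):
            raise ValueError(f"unknown action {action!r}")
        rules[int(code, 0)] = action   # base 0 accepts both 0x2c and 44
    return rules


def handle_nmi(code, rules, next_handler):
    """Apply the configured action for one NMI event code."""
    action = rules.get(code, IGNORE)
    if action == DROP:
        return "dropped"                # swallow the event entirely
    if action == PANIC:
        raise RuntimeError(f"panic on NMI event {code:#x}")
    return next_handler(code)           # "ignore" passes it along the chain


rules = parse_rules("0x2c:drop,0x3c:panic")
```

In the kdump example from the posting, the crash kernel would be configured with a drop rule so that a second NMI arriving mid-dump is discarded rather than escalated.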

Randomness. Jason A. Donenfeld posted version 4 of a patch series entitled “Unseeded In-Kernel Randomness Fixes” which aims to address “a problem with get_random_bytes being used before the RNG [Random Number Generator] has actually been seeded [given an initial set of values following boot time]. The solution for fixing this appears to be multi-pronged. One of those prongs involves adding a simple blocking API so that modules that use the RNG in process context can just sleep (in an interruptible manner) until the RNG is ready to be used. This winds up being a very useful API that covers a few use cases, several of which are included in this patch set”.
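
The "sleep until the RNG is ready" pattern can be sketched in userspace Python with a `threading.Event`. This is only an analogy for the blocking API described in the posting: `ToyRng` and its method names are hypothetical, and `os.urandom` stands in for the kernel's generator.

```python
import os
import threading


class ToyRng:
    """Toy model of an RNG whose consumers can wait for it to be seeded."""

    def __init__(self):
        self._seeded = threading.Event()

    def mark_seeded(self):
        """Called once enough entropy has been gathered."""
        self._seeded.set()

    def wait_for_seed(self, timeout=None):
        """Sleep until seeded; return False if the timeout expires first."""
        return self._seeded.wait(timeout)

    def get_random_bytes_wait(self, count, timeout=None):
        """Blocking variant: never hands out bytes from an unseeded RNG."""
        if not self.wait_for_seed(timeout):
            raise TimeoutError("RNG not seeded yet")
        return os.urandom(count)


rng = ToyRng()
threading.Timer(0.05, rng.mark_seeded).start()    # seeding happens "later in boot"
key = rng.get_random_bytes_wait(16, timeout=5)    # sleeps until the timer fires
```

The design point is that callers in a context where sleeping is allowed get correctness for free, instead of each one having to decide what to do with possibly predictable output.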

Scheduler. Nico[las] Pitre posted “scheduler tinification” which “makes it possible to configure out some parts of the scheduler such as the deadline and realtime scheduler classes. The saving in kernel footprint is non negligible”. In the examples cited, kernel text shrinks by almost 8K, which is significant in some very small Linux systems, such as in IoT.

S.A.R.A. Salvatore Mesoraca posted “S.A.R.A. a new stacked LSM” (which your author is choosing to pronounce as in “Sarah”, for various reasons, and apparently actually stands for “S.A.R.A is Another Recursive Acronym”). This is “a stacked Linux Security Module that aims to collect heterogeneous security measures, providing a common interface to manage them. It can be useful to allow minor security features to use advanced management options, like user-space configuration files and tools, without too much overhead”.

Secure Memory Encryption (SME). Tom Lendacky posted version 8 of a patch series that implements support in Linux for this feature of certain future AMD CPUs. “SME can be used to mark individual pages of memory as encrypted through the page tables. A page of memory that is marked encrypted will be automatically decrypted when read from DRAM and will be automatically encrypted when written to DRAM”. In other words, SME allows a datacenter operator to build systems in which all data leaving the SoC is encrypted either at rest (on disk), or when hitting external memory buses that might (theoretically) be monitored. When combined with other features, such as “another AMD processor feature called Secure Encrypted Virtualization (SEV)” it becomes possible to protect user data from intrusive monitoring by hypervisor operators (whether malicious or coerced). This is the correct way to provide memory encryption. While others have built a nonsense known as “enclaves”, the AMD approach correctly solves a more general problem. The AMD patches update various pieces of kernel infrastructure, from the UEFI code to IOMMU support, in order to carry page encryption state through.

SMIs. Kan Liang posted version 2 of a patch entitled “measure SMI cost (user)” which adds a “new sysfs entry /sys/devices/cpu/freeze_on_smi” which will cause the “FREEZE_WHILE_SMM” bit in the Intel “IA32_DEBUGCTL” processor control register to be set. Once it is set, “the PMU core counters will freeze on SMI handler”. This can be used with a “new --smi-cost mode in perf stat…to measure the SMI cost by calculating unhalted core cycles and aperf results”. SMIs, or “System Management Interrupts” are also referred to as “cycle stealing” in that they are used by platform firmware to perform various housekeeping tasks using the application processor cores, usually without the knowledge of either the Operating System or the user. SMIs are used by OEMs and ODMs to “add value”, but they are also used for such things as system fan control and other essentials. What should happen, of course, is that a generic management controller should be defined to handle this, but it was easier for the industry to build the mess that is SMIs, and for Intel to then add tracking for users to see where bad latencies come from.

Speculative Page Faults. Laurent Dufour posted version 5 of a patch series entitled “Speculative page faults”, which is “a port on kernel 4.12 of the work done by Peter Zijlstra to handle page fault without holding the mm semaphore”. As he notes, “The idea is to try to handle user space page faults without holding the mmap_sem [a per-task – the kernel side name for a running process – semaphore that is shared by all threads within a process]. This should allow better concurrency for massively threaded processes since the page fault handler will not wait for other threads[‘] memory layout change to be done, assuming that this change is done in another part of the process’s memory space. This type of page fault is named speculative page fault. If the speculative page fault fails because a concurrency is detected or because underlying PMD [Page Middle Directory] or PTE [Page Table Entry] tables are not yet allocat[ed], it [fails] its processing and a classic page fault is then tried”.
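
The "try without the lock, fall back to the classic path" shape can be sketched as follows. This Python model is an illustration only: using a sequence counter to detect a concurrent layout change is an assumption made for the example, and the class, method names, and `racer` test hook are all hypothetical.

```python
import threading


class ToyAddressSpace:
    def __init__(self):
        self.lock = threading.Lock()   # stands in for mmap_sem
        self.seq = 0                   # bumped on every layout change
        self.pages = {}                # page number -> backing object

    def change_layout(self, page, backing):
        """A writer (e.g. an mmap-like operation) always takes the lock."""
        with self.lock:
            self.seq += 1
            self.pages[page] = backing

    def _speculative_fault(self, page, racer=None):
        """Lockless attempt; returns None if it cannot be trusted."""
        seq_before = self.seq
        backing = self.pages.get(page)
        if racer is not None:
            racer()                    # test hook: simulate a concurrent change
        if backing is None or self.seq != seq_before:
            return None                # nothing mapped yet, or the layout moved
        return backing

    def handle_fault(self, page, racer=None):
        """Try the speculative path first, then retry the classic way."""
        backing = self._speculative_fault(page, racer)
        if backing is not None:
            return backing, "speculative"
        with self.lock:
            return self.pages[page], "classic"


mm = ToyAddressSpace()
mm.change_layout(1, "file-A")
quiet = mm.handle_fault(1)
raced = mm.handle_fault(1, racer=lambda: mm.change_layout(1, "file-B"))
```

In the quiet case the fault completes without ever touching the lock, which is where the concurrency win for heavily threaded processes would come from; in the raced case the speculative result is thrown away and the locked path returns the up-to-date mapping.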

THP. Kirill A. Shutemov posted a “HELP-NEEDED” thread entitled “Do not lose dirty bit on THP pages”, in which he notes that Vlastimil Babka “noted that pmdp_invalidate [Page Middle Directory Pointer invalidate] is not atomic and we can lose dirty and access bits if CPU sets them after pmdp dereference, but before set_pmd_at()”. Kirill notes that this doesn’t currently lead to user-visible problems in the current kernel, but “fixing this would be critical for future work on THP: both huge-ext4 and THP [Transparent Huge Pages] swap out rely on proper dirty tracking”. By access and dirty tracking, Kirill means page table bits that indicate whether a page has been accessed or contains dirty data which should be written back to storage. Such bits are updated by hardware automatically on memory access. He adds that “Unfortunately, there’s no way to address the issue in a generic way. We need to fix all architectures that support THP one-by-one”. Hence the topic of the thread containing the words “HELP-NEEDED”. Martin Schwidefsky had some feedback to the proposed solution that it would not work on s390, but that if pmdp_invalidate returned the old entry, that could be used in order to update certain logic based on the dirty bits. Andrea Arcangeli replied to Martin, “That to me seems the simplest fix”. Separately, Kirill posted the “Last bits for initial 5-level paging” on x86.

Timers. Christoph Hellwig posted “RFC: better timer interface”, a patch series which “attempts to provide a “modern” timer interface where the callback gets the timer_list structure as an argument so that it can use container_of instead of having to cast to/from unsigned long all the time”. Arnd Bergmann noted that “This looks really nice, but what is the long-term plan for the interface? Do you expect that we will eventually change all 700+ users of timer_list to the new type, or do we keep both variants around indefinitely to avoid having to do mass-conversions?”. Christoph thought it was possible to perform a wholesale conversion, but that “it might take some time”.

Thunderbolt. Mika Westerberg posted version 3 of a patch series implementing “Thunderbolt security levels and NVM firmware upgrade”. Apparently, “PCs running Intel Falcon Ridge or newer need these in order to connect devices if the security level is set to “user(SL1) or secure(SL2)” from BIOS” and “The security levels were added to prevent DMA attacks when PCIe is tunneled over Thunderbolt fabric where IOMMU is not available or cannot be enabled for different reasons”. While cool, it is slightly saddening that some of the awesome demos from recent DEFCONs will be slightly harder to reproduce by nation state actors and those who really need to get outside more often.

VAS. Sukadev Bhattiprolu posted version 5 of a patch series entitled “Enable VAS”, a “hardware subsystem referred to as the Virtual Accelerator Switchboard” in the IBM POWER9 architecture. According to Sukadev, “VAS allows kernel subsystems and user space processes to directly access the Nest Accelerator (NX) engines which implement compression and encryption algorithms in the hardware”. In other words, these are simple workload acceleration engines that were previously only available using special (“icswx”) privileged instructions in earlier versions of POWER machines and are now to be available to userspace applications through a multiplexing API.

WMI. Darren Hart posted an updated “Convert WMI to a proper bus” patch series, which “converts WMI [Windows Management Instrumentation] into a proper bus, adds some useful information via sysfs, and exposes the embedded MOF binary. It converts dell-wmi to use the WMI bus architecture”. WMI is required to manage various contemporary (especially laptop) hardware, including backlights.

Xen. Juergen Gross posted “xen: add sysfs node for guest type” which provides information known to the guest kernel but not previously exposed to userspace, including the type of virtualization in use (HVM, PV, or PVH), and so on.

zRam. Minchan Kim posted an RFC patch entitled “writeback incompressible pages to storage”, which seeks to have the best of both worlds – the compression of RAM while handling cases where memory is incompressible. In the case that an admin sets up a suitable block device, it can be arranged that incompressible pages are written out to storage instead of using RAM.
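
The placement decision can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: `zlib` stands in for whatever compressor is configured, the three-quarters-of-a-page cutoff is made up, and two dictionaries play the roles of the in-RAM store and the backing block device.

```python
import os
import zlib

PAGE_SIZE = 4096
MAX_COMPRESSED = PAGE_SIZE * 3 // 4   # hypothetical cutoff for "incompressible"


def store_page(key, page, ram_store, backing_store):
    """Keep well-compressed pages in RAM; send the rest to backing storage."""
    compressed = zlib.compress(page)
    if len(compressed) > MAX_COMPRESSED:
        backing_store[key] = page      # compression did not pay off
        return "backing"
    ram_store[key] = compressed
    return "ram"


ram, disk = {}, {}
text_page = b"kernel " * 585 + b"x"    # 4096 bytes of highly repetitive data
random_page = os.urandom(PAGE_SIZE)    # effectively incompressible
text_result = store_page("text", text_page, ram, disk)
random_result = store_page("rand", random_page, ram, disk)
```

Without a backing device, the random page would have to sit in RAM at full size plus bookkeeping overhead, which is exactly the case the RFC is trying to avoid.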

zswap. Srividya Desireddy posted version 2 of a patch that seeks to explicitly test for so-called “zero-filled” pages before submitting them for compression. This saves time and energy, and reduces application startup time (on the order of about 3% in the example given).
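The zero-filled check is easy to picture. Here is a minimal Python sketch, assuming a 4 KiB page and using zlib as a stand-in compressor; the `swap_out` and `swap_in` names and the record format are made up for illustration and are not taken from the patch.

```python
import zlib

PAGE_SIZE = 4096
ZERO_PAGE = bytes(PAGE_SIZE)


def swap_out(page):
    """Return a (kind, payload) record for a page headed to compressed swap."""
    if page == ZERO_PAGE:
        # Nothing to compress: remember only that the page was all zeros.
        return ("zero", None)
    return ("compressed", zlib.compress(page))


def swap_in(record):
    """Rebuild the original page from a record produced by swap_out."""
    kind, payload = record
    if kind == "zero":
        return ZERO_PAGE
    return zlib.decompress(payload)
```

Skipping the compressor for such pages saves both the CPU time of compressing them and the memory that the compressed copy would have occupied.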


Linux Kernel Podcast for 2017/05/14


In this week’s catchup mega-issue: Linux 4.12-rc1 (including a full summary of the 4.12 merge window), Linux 4.11 final is released, saving TLB flushes, various ongoing development, and a bunch of announcements.

Editorial Note

This podcast is a free service that I provide to the community in my spare time. It takes many, many hours to prepare and produce a single episode, much more during the merge window. This means that when I have major events (such as Red Hat Summit followed by OpenStack Summit) it will be delayed, as was the case this last week. Over the coming months, I hope to automate the production in order to reduce the overhead but there will be some weeks where I need to skip a show. I am however covering the whole 4.12 merge window regardless. So while I would usually have just moved on, the circumstance warrants a mega-length catchup episode. I hope you’re still awake by the end.

Linux 4.12-rc1

Linus Torvalds announced Linux 4.12-rc1, “one day early, because I don’t like last-minute pull requests during the merge window anyway, and tomorrow is mother’s day [in the US], so I may end up being roped into various happenings”. He also noted “Besides, this has actually been a pretty large merge window, so despite there technically being time for one more day of pulls, I actually do have enough changes already. So there.” In his announcement, he says things look smooth so far, but calls those also “Famous last words”. Finally, he calls out the “odd” diffstat which is dominated by the AMD Vega10 headers. As was noted in the pull requests, unlike certain other graphics companies, AMD actually provides nice automatically generated headers and other information about their graphics chipsets, which is why the Vega10 update is plentiful.

Later in the day yesterday, following the 4.12-rc1 announcement, Guenter Roeck posted “watchdog updates for v4.12”, and Jon Mason posted “NTB bug fixes for v4.12”, along with an apology for tardiness.

Linux 4.11

Linus Torvalds announced Linux 4.11 noting that the extra week due to a (rare-ish) “rc8” (Release Candidate 8) meant that he had felt “much happier releasing a final 4.11 now”. As usual, Linux Kernel Newbies has a writeup of 4.11, here:


Greg K-H (Kroah-Hartman) announced Linux 4.4.68, 4.9.28, 4.10.16, and 4.11.1. He later sent “Bad signatures on recent stable updates” in which he noted that “The stable kernels I just released have bad signatures due to a mixup using pixz in the new backend. It will be fixed soon…”, which were later corrected. He would like to hear from anyone still seeing problems.

Greg also announced (separately) Linux 3.18.52, while Jiri Slaby announced Linux 3.12.74.

Stephen Hemminger announced iproute2 4.11 matching the new kernel release.

Michael Kerrisk announced man-pages-4.11.

Steven Rostedt announced trace-cmd 2.6.1.

Steven also announced Linux 4.4.66-rt79, 3.18.51-rt57, and 3.12.73-rt98 (preempt-rt) kernels.

Con Kolivas posted an updated version of his (renamed) “MuQSS CPU scheduler” [renamed from the BFS – Brain F*** Scheduler] in Linux 4.11-ck1.

Karel Zak announced util-linux v2.30-rc1, which includes a fix to libblkid that “has been fixed to extract LABEL= and UUID= from UDF rather than ISO9660 header on hybrid CDROM/DVD media. This change[] makes UDF media on Linux user-space more compatible with another operation systems.” but he calls it out since it could also introduce regressions for some other users.

Junio C Hamano announced Git version 2.13.0. Separately, he released maintenance versions of “Git v2.12.3 and others” which include fixes for “a recently disclosed problem with “git shell”, which may allow a user who comes over SSH to run an interactive pager by causing it to spawn “git upload-pack --help” (CVE-2017-8386).”

Jan Kiszka announced version 0.7 of the Jailhouse hypervisor, which includes various debug and regular console driver updates and gcov debug statistics.

Bartosz Golaszewski announced libgpiod v0.2: “The most prominent new feature is the test suite working together with the gpio-mockup module”.

Christoph Hellwig notes that the Open OSD [an in-kernel OSD – Object-Based Storage Device] SCSI initiator library for Linux seems to be dead. He does this by posting a patch to the MAINTAINERS file “update OSD entries” in which he removes the (now defunct) website, and the bouncing email address for Benny Halevy. Benny appeared and ACKed.

In a similar vein, Ben Hutchings pondered aloud about the “Future of liblockdep”, which apparently “hasn’t been buildable since (I think) Linux 4.6”. Sasha Levin said things would be cleaned up promptly. And they were, with a pull request soon following with fixes for Linux 4.12.

Masahiro Yamada posted an RFC patch entitled “Increase Minimal GNU Make version for Linux Kernel from 3.80 to 3.81” in which he essentially noted that the kernel hadn’t actually worked with 3.80 (which is 15 years old!) in a very long time, but instead actually really needs 3.81 (which was itself released in 2006). It was apparently “broken” 3 years ago, but nobody noticed. Neither Greg K-H (Kroah-Hartman) nor Linus seemed to lose any sleep over this, with Linus saying “you make a strong case of “it hasn’t worked for a while already and nobody even noticed””.

Paolo Bonzini posted “CFP: KVM Forum 2017” announcing that the KVM Forum will be held October 25-27 at the Hilton in Prague, CZ, and that all submissions for proposed topics must be made by midnight June 15.

Thomas Gleixner announced “[CFP] RT-Summit – Call for Presentations” noting that the Real-Time Summit 2017 is being organized by the Linux Foundation Real-Time Linux (RTL) collaborative project in cooperation with OSADL/RTLWS and will be held also in Prague on October 21st. The cutoff for submissions is July 14th via

4.12 Merge Window

In his 4.11 announcement, Linus reminded us that the release of 4.11 meant that “the merge window [for kernel 4.12] is obviously open. I already have two pull request[s] for 4.12 in my inbox, I expect that overnight I’ll get a lot more.” He wasn’t disappointed. The flood gates well and truly opened. And they continued going for the entire two week (less one day) period. Let’s dive into what has been posted so far for 4.12 during the (now closed) merge window.

Stephen Rothwell [linux-next pre-merge development kernel tree maintainer] noted in a heads-up that Linus was going to see a “Large new drm driver” [drm – Direct Rendering Manager, not the “digital rights” technology]. Dave Airlie (the drm maintainer) had a reply but Stephen said everything was just fine and he was simply seeking to avoid surprising Linus (again). Once the pull came in, and Linus had pulled it, he quickly followed up to note that he was getting a lot of warnings about “Atomic update on pipe (A) took”. Daniel Vetter followed up to say that “We [Intel] did improve evasion a lot to the point that it didn’t show up in our CI machines anymore, so we felt we could risk enabling this everywhere. But of course it pops up all over the place as soon as drm-next hits mainline”.

4.12 git Pulls for existing subsystems

Hans-Christian Noren Egtvedt posted “AVR32 change for 4.12 – architecture removal” in which he removes AVR32 and “clean away the most obvious architecture related parts”. He posted followups to pick off more leftovers.

Ingo Molnar posted “RCU changes for 4.12” which includes “Parallelize SRCU callback handling”, performance improvements, documentation updates, and various other fixes. Linus pulled it. But then “after looking at it, ended up un-pulling it again”. He posted a rant about a new header file (linux/rcu_segcblist.h) which was a “header file from hell”, saying “I see absolutely no point in taking a header file of several hundred lines of code”, along with more venting about the use of too much inline code (code that is always expanded in-place rather than called as a function – leading to a larger footprint sometimes). Finally, Linus said “The RCU code needs to start showing some good taste”. Sir Paul McKenney, the one and only author of RCU, followed up swiftly, apologizing for the transgression in attempting to model “the various *list*.h header files”, proposing a fix, which Linus liked. Ingo Molnar implemented the suggestions, in “srcu: Debloat the <linux/rcu_segcblist.h> head”, which Paul provided a minor fix against for the case of !SMP (non-multi-processor kernel) builds.

Ingo Molnar also posted “EFI changes for 4.12” including fixes to the BGRT ACPI table (used for boottime graphics information) to allow it to be shared between x86 and ARM systems, an update to the arm64 boot protocol, improvements to the EFI stub’s command line parsing, and support for randomizing the virtual mapping of UEFI runtime services on arm64. The latter means that the function pointers for UEFI Runtime Services callbacks will be placed into random virtual address locations during the call to ExitBootServices which sets up the new mappings – it’s a good way to look for problems with platforms containing broken firmware that doesn’t correctly handle the change in location of runtime service calls.

Ingo Molnar also posted “x86/process changes for 4.12” which includes a new ARCH_[GET|SET]_CPUID prctl (process control) ABI extension that a running process can use in order to determine whether it has access to call the CPUID instruction directly. This is to support a userspace debugger known as “rr” that would like to trap and emulate calls to “CPUID” which are otherwise normally unprivileged on x86 systems.

Separately, Ingo posted “x86 fixes”, which includes “mostly misc fixes” for such things as “two boot crash fixes”, etc.

Ingo Molnar also posted “perf changes for 4.12” which includes updates to kprobes and uprobes, making their trampolines (the codepaths jumped through when executing the probe sequence) read-only while they are used, changing UPROBES_EVENTS to be default yes in the Kconfig (since distros do this), and various other fixes. He also includes support for AMD IOMMU events, and new events for Intel Goldmont CPUs. The perf tooling itself gets many more updates, including PERF_RECORD_NAMESPACES, which allows the kernel to record information “required to associate samples to namespaces”.

Separately, Ingo posted “perf fixes”, which includes “mostly tooling updates”.

Ingo Molnar also posted “RAS changes for v4.12” which includes a “Correctable Errors Collector” kernel feature that will gather statistics about correctable errors affecting physical memory pages. Once a certain watermark is reached, pages generating many correctable errors will be permanently offlined [this is useful both for DDR and NV-DIMMs]. Finally, he deprecates the existing /dev/mcelog driver and includes cleanups for MCE (Machine Check Exception) errors during kexec on x86 (which we covered in previous editions of this podcast).
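The counting-and-watermark behaviour can be sketched in a few lines of Python. This is only a toy model of the idea described above: the class name, the threshold value, and the use of page frame numbers as dictionary keys are assumptions, and the kernel feature has its own data structures and policies.

```python
from collections import Counter


class CorrectableErrorCollector:
    """Toy model: count correctable errors per page, offline past a watermark."""

    def __init__(self, threshold):
        self.threshold = threshold  # hypothetical watermark
        self.counts = Counter()     # page frame number -> errors seen
        self.offlined = set()       # pages already taken out of service

    def report(self, pfn):
        """Record one correctable error; return True if the page was just offlined."""
        if pfn in self.offlined:
            return False
        self.counts[pfn] += 1
        if self.counts[pfn] >= self.threshold:
            self.offlined.add(pfn)
            del self.counts[pfn]
            return True
        return False


cec = CorrectableErrorCollector(threshold=3)
for _ in range(3):
    just_offlined = cec.report(0x1234)
```

A page that only ever sees an isolated error stays in service, while one that keeps generating them is retired before a correctable problem turns into an uncorrectable one.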

Ingo Molnar also posted “x86/asm changes for v4.12”, which includes various fixes, among which are cleanups to stack trace unwinding.

Ingo Molnar also posted “x86/cpu changes for v4.12”, which includes support for “an extension of the Intel RDT code to extend it with Intel Memory Bandwidth Allocation CPU support: MBA allows bandwidth allocation between cores, while CBM (already upstream) allows CPU cache partitioning”. Effectively, Intel incorporates changes to their memory controller’s hardware scheduling algorithms as part of RDT. These allow the DDR interface to manage bandwidth for specific cores, which will almost certainly include both explicit data operations, as well as separate algorithms for prefetching and speculative fetching of instructions and data. [This author has spent many hours reading about memory controller scheduling over the past year]

Ingo Molnar also posted “x86/debug changes for v4.12”, which includes support for the USB3 “debug port” based early console. As we have mentioned previously, USB3 includes a built-in “debug port” which no longer requires a special dongle to connect a remote machine for debug. It’s common in Windows kernel development to use a debug port, and since USB3 includes baseline support without the need for additional hardware, serial over USB3 is likely to become more common when developing for Linux – especially with the demise of DB9 headers on systems or even IDC10 headers on motherboards internally (to say nothing of laptop systems). As a reminder, with debug ports, usually only one USB port will support debug mode. I guess my old USB debug port dongle can go in the pile of obsolete gear.

Ingo Molnar also posted “x86/platform changes for v4.12” which includes “continued SGI UV4 hardware-enablement changes, plus there’s also new Bluetooth support for the Intel Edison [a low cost IoT board] platform”.

Ingo Molnar also posted “x86/vdso changes for v4.12” which includes support for a “hyper-V TSC page” which is what it sounds like – a special shared page made available to guests under Microsoft’s Hyper-V hypervisor and providing a fast means to enumerate the current time. This is plumbed into the kernel’s vDSO mechanism (Virtual Dynamic Shared Objects look a bit like software libraries that are automatically linked against every running program when it launches) to allow fast clock reading.

Ingo Molnar also posted “x86/mm changes for v4.12”, which includes yet more work toward Intel 5-level paging among many other updates.

Separately Ingo posted a single “core kernel fix” to “increase stackprotector canary randomness on 64-bit kernels with very little cost”.

Thomas Gleixner posted “irq updates for 4.12”, which include a new driver for a MediaTek SoC, ACPI support for ITS (Interrupt Translation Services) when using a GICv3 on ARM systems, support for shared nested interrupts, and “the usual pile of fixes and updates all over t[h]e place”.

Thomas Gleixner also posted “timer updates for 4.12” that include more reworking of year 2038 support (the infamous wrap of the Unix epoch), a “massive rework of the arm architected timer”, and various other work.
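For anyone who has not seen the arithmetic behind 2038: a signed 32-bit count of seconds since 1970 runs out partway through January 2038 and then wraps to a large negative number. The short Python sketch below just does that sum; the helper names are made up for the example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1


def to_int32(value):
    """Wrap an integer the way a signed 32-bit counter would."""
    return (value + 2**31) % 2**32 - 2**31


def as_utc(seconds):
    """Convert seconds since the Unix epoch to a UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


last_second = as_utc(INT32_MAX)    # 2038-01-19 03:14:07 UTC
wrapped = to_int32(INT32_MAX + 1)  # -2147483648
after_wrap = as_utc(wrapped)       # 1901-12-13 20:45:52 UTC
```

Moving the kernel's internal and user-visible time types to 64 bits is what pushes that wrap far beyond any practical horizon.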

Separately, Ingo Molnar followed up with “timer fix” including “A single ARM Juno clocksource driver fix”.

Corey Minyard posted “4.12 for IPMI” including a watchdog fix. He “switched over to github at Stephen Rothwell’s [linux-next maintainer] request”.

Jonathan Corbet posted “Docs for 4.12” which includes “a new guide for user-space API documents” along with many other updates. Anil Nair noted “Missing File REPORTING-BUGS in Linux Kernel” which suggests that the Debian kernel package tools need to be taught about the recent changes in the kernel’s documentation layout. Separately, Jonathan replied to a thread entitled “Find more sane first words we have to say about Linux” noting that the kernel’s documentation files might not be the first place that someone completely new to Linux is going to go looking for information: “So I don’t doubt we could put something better there, but can we think for a moment about who the audience is here? If you’re “completely new to Linux”, will you really start by jumping into the kernel source tree?” The guy should do kernel standup in addition to LWN. It’d be hilarious.

Later, Jon posted “A few small documentation updates” which “Connect the newly RST-formatted documentation to the rest; this had to wait until the input pull was done. There’s also a few small fixes that wandered in”.

Tejun Heo posted “libata changes for 4.12-rc1” which includes “removal of SCT WRITE SAME support, which never worked properly”. SCT stands for “SMART [Self Monitoring And Reporting Technology – an error management mechanism common in contemporary disks] Command Transport”. The “write same” part means to set the drive content to a specific pattern (e.g. to zero it out) in cases that TRIM is not available. One wonders if that is also a feature used during destruction, though apparently the only (NSA) trusted way to destroy disks today is shredding and burning after zeroing.

Tejun Heo also posted “workqueue changes for v4.12-rc1”, which includes “One trivial patch to use setup_deferrable_timer() instead of open-coding the initialization”.

Tejun Heo also posted “cgroup changes for v4.12-rc1”, which includes a “second stab at fixing the long-standing race condition in the mount path and suppression of spurious warning from cgroup_get”.

Rafael J. Wysocki posted “Power management updates for v4.12-rc1, part 1” which includes many updates to the cpufreq subsystem and “to the intel_pstate driver in particular”. Its sysfs interface has apparently also been reworked to be more consistent with general expectations. He adds “Apart from that, the AnalyzeSuspend utility for system suspend profiling gets a companion called AnalyzeBoot for the analogous profiling of system boot and they both go into one place”.

Separately, he posted “Power management updates for v4.12-rc1, part 2”, which “add new CPU IDs [Intel Gemini Lake] to a couple of drivers [intel_idle and intel_rapl – Running Average Power Limit], fix a possible NULL pointer dereference in the cpuidle core, update DT [DeviceTree]-related things in the generic power domains framework and finally update the suspend/resume infrastructure to improve the handling of wakeups from suspend-to-idle”.

Rafael J. Wysocki also posted “ACPI updates for v4.12-rc1, part 1”, which includes a new Operation Region driver for the Intel CHT [Cherry Trail] Whiskey Cove PMIC [Power Management Integrated Circuit], and new sysfs entries for CPPC [Collaborative Processor Performance Control], which is a much more fine grained means for OS and firmware to coordinate on power management and CPU frequency/performance state transitions.

Separately, he posted “ACPI updates for v4.12-rc1, part 2”, which “update the ACPICA [ACPI – Advanced Configuration and Power Interface – Component Architecture, the cross-Operating System reference code]” to “add a few minor fixes and improvements”, and also “update ACPI SoC drivers with new device IDs, platform-related information and similar, fix the register information in xpower PMIC [Power Management IC] driver, introduce a concept of “always present” devices to the ACPI device enumeration code and use it to fix a problem with one platform [INT0002, Intel Cherry Trail], and fix a system resume issue related to power resources”.

Separately, Benjamin Tissoires posted a patch reverting some ACPI laptop lid logic that had been introduced in Linux 4.10 but was preventing laptops from booting with the lid closed (a feature folks especially in QE use).

Rafael J. Wysocki also posted “Generic device properties framework updates for v4.12-rc1”, which includes various updates to the ACPI _DSD [Device Properties] method call to recognize “ports and endpoints”.

Shaohua Li posted “MD update for 4.12” which includes support for the “Partial Parity Log” feature present on the Intel IMSM RAID array, and a rewrite of the underlying MD bio (the basic storage IO concept used in Linux) handling. He notes “Now MD doesn’t directly access bio bvec, bi_phys_segments and uses modern bio API for bio split”.

Ulf Hansson posted “MMC for v[.]4.12” which includes many driver updates as well as refactoring of the code to “prepare for eMMC CMDQ and blkmq”. This is the planned transition to blkmq (block-multiqueue) for such storage devices. Previously it had stalled due to the performance hit when trying to use a multi-queue approach on legacy and contemporary non-mq devices.

Linus Walleij posted “pin control bulk changes for v4.12” in which he notes that “The extra week before the merge window actually resulted in some of the type of fixes that usually arrive after the merge window already starting to trickle in from eager developers using -next, I’m impressed”. He’s also impressed with the new “Samsung subsystem maintainer” (Krzysztof). Of the many changes, he says “The most pleasing to see is Julia Cartwright[‘]s work to audit the irqchip-providing drivers for realtime locking compliance. It’s one of those “I should really get around to looking into that” things that have been on my TODO list since forever”.

Linus Walleij also posted “Bulk GPIO changes for v4.12”, which has “Nothing really exciting goes on here this time, the most exciting for me is the same as for pin control: realtime is advancing thanks [t]o Julia Cartwright”.

Petr Mladek posted “printk for 4.12” which includes a fix for the “situation when early console is not deregistered because the preferred one matches a wrong entry. It caused messages to appear twice”.

Jiri Kosina posted “HID for 4.12” which includes various fixes, amongst them being an inversion of the HID_QUIRK_NO_INIT_REPORTS to the opposite due to the fact that it is apparently easier to whitelist working devices.

Jiri Kosina also posted “livepatching for 4.12” which includes a new “per-task consistency model” that is “being added for architectures that support reliable stack dumping”, which apparently “extends the nature of the types of patches that can be applied by live patching”.

Lee Jones posted “Backlight for 4.12” which includes various fixes.

Lee Jones also posted “MFD for v4.12” which includes some new drivers, new device support, and various new functionality and fixes.

Juergen Gross posted “xen: fixes and features for 4.12” which includes support for building the kernel with Xen enabled but without enabling paravirtualization, a new 9pfs xen frontend driver(!), and support for EFI “reset_system” (needed for ARMv8 Dom0 host to reboot), among various other fixes and cleanups.

Alex Williamson posted “VFIO updates for v4.12-rc1”.

Joerg Roedel posted “IOMMU Updates for Linux v4.12”, which includes “Some code optimizations for the Intel VT-d driver”, code to “switch off a previously enabled Intel IOMMU” (presumably in order to place it into bypass mode for performance or other reasons?), and “ACPI/IORT updates and fixes” (which enable full support for the ACPI IORT on 64-bit ARM).

Dmitry Torokhov posted “Input updates for v.4.11-rc0” which includes a documentation conversion to ReST (RST, the new kernel doc format), an update to the venerable Synaptics “PS/2” driver to be aware of companion “SMBus” devices and various other miscellaneous fixes.

Darren Hart posted “platform-drivers-x86 for 4.12-1” which includes “a significantly larger and more complex set of changes than those of prior merge windows”. These include “several changes with dependencies on other subsystems which we felt were best managed through merges of immutable branches”.

James Bottomley posted “first round of SCSI updates for the 4.11+ merge window”, which includes many driver updates, but also comes with a warning to Linus that “The major thing you should be aware of is that there’s a clash between a char dev change in the char-misc tree (adding the new cdev_device_add) and the make checking the return value of scsi_device_get() mandatory”. Linus and Greg would later clarify what cdev_device_add does in response to Greg’s request to pull “Char/Misc driver patches for 4.12-rc1”.

David Miller posted “Networking” which includes many fixes.

David also posted “Sparc”, which includes a “bug fix for handling exceptions during bzero on some sparc64 cpus”.

David also posted “IDE”, which includes “two small cleanups”.

Greg K-H (Kroah-Hartman) posted “USB driver patches for 4.12-rc1”, which includes “Lots of good stuff here, after many many many attempts, the kernel finally has a working typeC interface, many thanks to Heikki and Guenter and others who have taken the time to get this merged. It wasn’t an easy path for them at all.” It will be interesting to test that out!

Greg K-H also posted “Driver core patches for 4.12-rc1”, which is “very tiny” this time around and consists mostly of documentation fixes, etc.

Greg K-H also posted “Char/Misc driver patches for 4.12-rc1” which features “lots of new drivers” including Google firmware drivers, FPGA drivers, etc. This led to a reaction from Linus about how the tree conflicted with James Bottomley’s tree (which he had already pulled, “as per James’ suggestion”), a back and forth between James and Greg about how to better handle such a conflict next time, and Linus noting that he prefers to fix merge conflicts himself but “*also* really really prefer the two sides of the conflict having been more aware of the clash” and providing him with a heads-up in the pull.

Greg K-H also posted “Staging/IIO driver fixes for 4.12-rc1”, which adds “about 350k new lines of crap^Wcode, mostly all in a big dump of media drivers from Intel”. He notes that the Android low memory killer driver has finally been deleted “much to the celebration of the -mm developers”.

Greg K-H also posted “TTY patches for 4.12-rc1”, which wasn’t big.

Dan Williams posted “libnvdimm for 4.12” which includes “Region media error reporting [a generic interface more friendly to use with multiple namespaces]”, a new “struct dax_device” to allow drivers to have their own custom direct access operations, and various other updates. Dan also posted “libnvdimm: band aid btt vs clear poison locking”, a patch which “continues the 4.11 status quo of disabling of error clearing from the BTT [Block Translation Table] I/O path” and notes that “A solution for tracking and handling media errors natively in the BTT is needed”. The BTT or Block Translation Table is a mechanism used by NV-DIMMs to handle “torn sectors” (partially complete writes) in hardware during error or power failure. As the “btt.txt” in the kernel documentation notes, NV-DIMMs do not have the same atomicity guarantees as regular flash drives do. Flash drives have internal logic and store enough energy in capacitors to complete outstanding writes during a power failure (rotational drives have similar for flushing their memory based caches and syncing remap block state) but NV-DIMMs are designed differently. Thus the BTT provides a level of indirection that is used to provide for atomic sector semantics.
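To see why a level of indirection buys atomicity, consider the toy Python model below. It is not the BTT on-media format (which has its own layout and logging); the class, the single spare block, and the `fail_midway` flag are all invented for illustration. New data always goes to a block that is not currently mapped, and only a single table update makes it visible.

```python
class AtomicSectorStore:
    """Toy model: logical sectors map to physical blocks through a table."""

    def __init__(self, n_logical, sector_size=512):
        self.sector_size = sector_size
        # One spare physical block so a write never lands on live data.
        self.blocks = [bytes(sector_size) for _ in range(n_logical + 1)]
        self.table = list(range(n_logical))  # logical -> physical
        self.free = n_logical                # index of the spare block

    def write(self, lba, data, fail_midway=False):
        assert len(data) == self.sector_size
        target = self.free
        if fail_midway:
            # Simulate power loss: only half the new sector reaches the media.
            half = self.sector_size // 2
            self.blocks[target] = data[:half] + self.blocks[target][half:]
            return False
        self.blocks[target] = data
        # The single table update is the commit point; the old block
        # becomes the new spare.
        self.free, self.table[lba] = self.table[lba], target
        return True

    def read(self, lba):
        return self.blocks[self.table[lba]]


store = AtomicSectorStore(n_logical=4)
store.write(1, b"a" * 512)
store.write(1, b"b" * 512, fail_midway=True)  # torn write to the spare block
survivor = store.read(1)                      # still the old contents
```

A reader sees either the old sector or the new one, never a mixture, because the half-written block was never mapped.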

Separately, Dan posted “libnvdimm fixes for 4.12-rc1” which includes “incremental fixes and a small feature addition relative to the libnvdimm 4.12 pull request”. Gert had “noticed that tinyconfig was bloated by BLOCK selecting DAX [Direct Access]”, while “Vishal adds a feature that missed the initial pull due to pending review feedback. It allows the kernel to clear media errors when initializing a BTT (atomic sector update driver) instance on a pmem namespace”.

Dave Airlie posted “drm tegra for 4.12-rc1” containing additional updates, because he had missed a pull from Thierry Reding for NVidia Tegra patches. He also followed up with a “drm document code of conduct” patch that describes a code of conduct for graphics written by

Stafford Horne posted “Initramfs fix for 4.12-rc1” containing a fix “for an issue that has caused 4.11 to not boot on OpenRISC”.

Catalin Marinas posted “arm64 updates for 4.12” including kdump support, “ARMv8.3 HWCAP bits for JavaScript conversion instructions, complex numbers and weaker release consistency [memory ordering]”, and platform (non-enumerated buses) MSI support when using ACPI, among other patches. He also removes support for ASID-tagged VIVT [Virtually Indexed, Virtually Tagged] caches since “no ARMv8 implementation using it and deprecated in the architecture” [caches are PIPT – Physically Indexed, Physically Tagged – except that an implementation might do VIPT or otherwise internally using various CAM optimizations].

Catalin later posted “arm64 2nd set of updates for 4.12”, which include “Silence module allocation failures when CONFIG_ARM*_MODULE_PLTS is enabled”.

Olof Johansson posted “ARM: SoC contents for 4.12 merge window”. In his pull request, Olof notes that “It’s been a relatively quiet release cycle here. Patch count is about the usual (818 commits, which includes merges).” He goes on to add, “Besides dts [DeviceTree files], the mach-gemini cleanup by Linus Walleij is the only platform that pops up on its own”. He called out the separate post for the TEE [Trusted Execution Environment] subsystem. Olof also removed Alexandre Courbot and Stephen Warren from NVidia Tegra maintainership, and added Jon Hunter in their place.

Rob Herring posted “DeviceTree for 4.12”, which includes updates to the Device Tree Compiler (dtc), and more DeviceTree overlay unit tests, among various other changes.

Darrick J. Wong posted “xfs: updates for 4.12”, which includes the “big new feature for this release” of a “new space mapping ioctl that we’ve been discussing since LSF2016 [Linux Storage and Filesystem conference]”.

Max Filippov posted “Xtensa improvements for 4.12”.

Ted Ts’o posted “ext4 updates for 4.12”, which adds “GETFSMAP support” (discussed previously in this podcast) among other new features.

Ted also posted “fscrypt updates for 4.12” which has “only bug fixes”.

Paul Moore posted “Audit patches for 4.12” which includes 14 patches that “span the full range of fixes, new features, and internal cleanups”. These include a move to 64-bit timestamps, converting refcounts to the new refcount_t type from atomic_t, and so on.
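The motivation for a dedicated reference count type is that a plain atomic integer silently wraps on overflow, which can turn a leaked reference into a use-after-free, whereas refcount_t saturates instead. The Python sketch below is only a toy model of that difference: the small limit, the class, and the helper are assumptions for illustration and say nothing about the kernel's actual limits or API details.

```python
def wrap_increment(value, bits=8):
    """Increment a plain counter of the given width, wrapping on overflow."""
    return (value + 1) % (1 << bits)


class SaturatingRefcount:
    """Toy reference count that pins at a limit instead of wrapping."""

    def __init__(self, limit=255):
        self.limit = limit  # hypothetical saturation point
        self.value = 1      # a new object starts with one reference

    def inc(self):
        """Take a reference; return False once the count is saturated."""
        if self.value >= self.limit:
            self.value = self.limit
            return False
        self.value += 1
        return True

    def dec_and_test(self):
        """Drop a reference; return True when the last one is gone."""
        if self.value >= self.limit:
            # A saturated object is deliberately leaked rather than freed.
            return False
        self.value -= 1
        return self.value == 0
```

With the wrapping counter, 255 increments past a count of 1 bring an 8-bit value back to 0 while references are still held; the saturating version trades that for a bounded memory leak.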

Wolfram Sang posted “i2c for 4.12”.

Mark Brown posted “regulator updates for 4.12”, which includes “Quite a lot going on with the regulator API for this release, much more in the core than in the drivers for a change”. This includes “Fixes for voltage change propagation through dumb power switches, a notification when regulators are enabled, a new settling time property for regulators where the time taken to move to a new voltage is not related to the size of the change”, etc.

Mark also posted “SPI updates for 4.12”, which includes “quite a lot of small driver specific fixes and enhancements”.

Jessica Yu posted “module updates for 4.12”, containing minor fixes.

Mauro Carvalho Chehab posted “media updates” including mostly driver updates and the removal of “two staging LIRC drivers for obscure hardware”. He also posted a 5 part patch series entitled “Convert more books to ReST”, which converted three kernel DocBook format documentation file sets to RST, the new format being used for kernel documentation (on the kernel-doc mailing list, and maintained by Jonathan Corbet of LWN): librs, mtdnand, and sh. He noted that “After this series, there will be just one DocBook pending conversion: lsm” (Linux Security Modules). He also notes that the existing LSM documentation is very out of date and no longer describes the current API.

Michael Ellerman posted “Please pull powerpc/linux.git powerpc-4.12-1 tag”, which includes support for “Larger virtual address space on 64-bit server CPUs. By default we use a 128TB virtual address space, but a process can request access to the full 512TB by passing a hint to mmap() [this seems very similar to the 56-bit la57 feature from Intel]”. It also includes “TLB flushing optimisations for the radix MMU on Power9” and “Support for CAPI cards on Power9, using the “Coherent Accelerator Interface Architecture 2.0″ [which definitely sounds like juicy reading]”.
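To put those sizes in terms of address bits, here is a tiny Python calculation. It assumes the “TB” figures are binary terabytes (TiB), and the helper name is made up for the example.

```python
TIB = 2**40  # one binary terabyte


def address_bits(size_bytes):
    """Number of address bits needed to cover size_bytes of address space."""
    return (size_bytes - 1).bit_length()


default_bits = address_bits(128 * TIB)  # 47
full_bits = address_bits(512 * TIB)     # 49
```

So opting in to the full range gives a process two more bits of virtual address, or four times the default space.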

Separately Michael Ellerman posted “Please pull powerpc-linux.git powerpc-4.12-2 tag” which includes “rework the Linux page table geometry to lower memory usage on 64-bit Book3S (IBM chips) using the Hash MMU [IBM uses a special inverse page tables “reverse lookup” hashing format]”.

Eric W. Biederman posted “namespace related changes for v4.12-rc1”, which includes a “set of small fixes that were mostly stumbled over during more significant development. This proc fix and the fix to posix-timers are the most significant of the lot. There is a lot of good development going on but unfortunately it didn’t quite make the merge window”.

Takashi Iwai posted “sound updates for 4.12-rc1”, noting that it was “a relatively calm development cycle, no scaring changes are seen”.

Steven Rostedt posted “tracing: Updates for v4.12” which includes “Pretty much a full rewrite of the process of function probes”. He followed up with “Three more updates for 4.12” that contained “three simple changes”.

Martin Schwidefsky posted “s390 patches for 4.12 merge window” which includes improvements to VFIO support on mainframe(!) [this author was recently amazed to see there are also DPDK ports for s390x], a new true random number generator, perf counters for the new z13 CPU, and many others besides.

Geert Uytterhoeven posted “m68k updates for 4.12” with a couple fixes.

Jacek Anaszewski posted “LED updates for 4.12” with various fixes.

Kees Cook posted “usercopy updates for v4.12-rc1” with a couple fixes.

Kees also posted “pstore updates for v4.12-rc1”, which included “large internal refactoring along with several smaller fixes”.

James Morris posted “Security subsystem updates for v4.12”.

Sebastian Reichel posted “hsi changes for hsi-4.12”.

Sebastian also posted “power-supply changes for 4.12”, which includes a couple of new drivers and various fixes.

Separately, Sebastian posted “power-supply changes for 4.12 (part 2)”, which includes some new drivers and some fixes.

Paolo Bonzini posted “First batch of KVM changes for 4.12 merge window” which includes kexec/kdump support on 32-bit ARM, support for a userspace virtual interrupt controller to handle the “weird” Raspberry Pi 3, in-kernel acceleration for VFIO on POWER, nested EPT support for accessed and dirty bits on x86, and many other fixes and improvements besides.

Separately Paolo posted “Second round of KVM changes for 4.12”, which includes various ARM (32 and 64-bit) cleanups, support for PPC [POWER] XIVE (eXternal Interrupt Virtualization Engine), and “x86: nVMX improvements, including emulated page modification logging (PML) which brings nice performance improvements [under nested virtualization] on some workloads”.

Ilya Dryomov posted “Ceph updates for 4.12-rc1”, which include “support for disabling automatic rbd [RADOS block device] exclusive lock transfers” and “the long awaited -ENOSPC [no space] handling series”. The latter finally handles out of space situations by aborting with -ENOSPC rather than “having them [writers] block indefinitely”.

Miklos Szeredi posted “fuse updates for 4.12”, which “contains support for pid namespaces from Seth and refcount_t work from Elena”.

Miklos also posted “overlayfs update for 4.12”, which includes “making st_dev/st_ino on the overlay behave like a normal filesystem”. “Currently this only works if all layers are on the same filesystem, but future work will move the general case towards more sane behavior”.

Bjorn Helgaas posted “PCI changes for v4.12” which includes a framework for supporting PCIe devices in Endpoint mode from Kishon Vijay Abraham, fixes for using non-posted PCI config space on ARM from Lorenzo Pieralisi, allowing slots below PCI-to-PCIe “reverse bridges”, a bunch of quirks, and many other fixes and enhancements.

Jaegeuk Kim posted “f2fs for 4.12-rc1”, which “focused on enhancing performance with regards to block allocation, GC [Garbage Collection], and discard/in-place-update IO controls”.

Shuah Khan posted “Kselftest update for 4.12-rc1” with a few fixes.

Richard Weinberger posted “UML changes for v4.12-rc1” which includes “No new stuff, just fixes” to the “User Mode Linux” architecture in 4.12. Separately, Masami Hiramatsu posted an RFC patch entitled “Output messages to stderr and support quiet option” intended to “fix[] some boot time printf output to stderr by adding os_info() and os_warn(). The information-level messages via os_info() are suppressed when “quiet” kernel option is specified”.

Richard also posted “UBI/UBIFS updates for 4.12-rc1”, which “contains updates for both UBI and UBIFS”. It has a new CONFIG_UBIFS_FS_SECURITY option, among “minor improvements” and “random fixes”.

Thierry Reding posted “pwm: Changes for v4.12-rc1”, which amongst other things includes “a new driver for the PWM controller found on MediaTek SoCs”.

Vinod Koul posted “dmaengine updates” which includes “a smaller update consisting of support for TI DA8xx dma controller” among others.

Chris Mason posted “Btrfs” which “Has fixes and cleanups” as well as “The biggest functional fixes [being] between btrfs raid5/6 and scrub”.

Trond Myklebust posted “Please pull NFS client fixes for 4.12”, which includes various fixes, and new features (such as “Remove the v3-only data server limitation on pNFS/flexfiles”).

J. Bruce Fields posted “nfsd changes for 4.12”, which includes various RDMA updates from Chuck Lever.

Stephen Boyd posted “clk changes for v4.12”. Of the changes, the “biggest things are the TI clk driver rework to lay the groundwork for clkctrl support in the next merge window and the AmLogic audio/graphics clk support”.

Alexandre Belloni posted “RTC [Real Time Clock] for 4.12”, which uses a new GPG subkey that he also let Linus know about at the same time.

Nicholas A. Bellinger posted “target updates for v4.12-rc1”, which was “a lot more calm than previously expected. It’s primarily fixes in various areas, with most of the new functionality centering around TCMU [TCM – Linux iSCSI Target Support in Userspace] backend work with Xiubo Li has been driving”.

Zhang Rui posted “Thermal management updates for v4.12-rc1”, which includes a number of fixes, as well as some new drivers, and a new interface in “thermal devfreq_cooling code so that the driver can provide more precise data regarding actual power to the thermal governor every time the power budget is calculated”.

4.12 git pulls for new subsystems and features

David Howells posted “Hardware module parameter annotation for secure boot” in which he requested that Linus pull in new “kmod” macros (the same name is used for the userspace module tooling, but in this case refers to the in-kernel kernel module infrastructure of the same name). The new macros add annotations to “module_param” of the new form “module_param_hw” with a “hwtype” such as “ioport” or “iomem”, and so forth. These are used by the kernel to prevent those parameters from being used under a UEFI Secure Boot situation in which the kernel is “locked down” (to prevent someone from loading a signed kernel image and then compromising it to circumvent the secure boot mechanism).

Arnd Bergmann sent a special pull request to Linus Torvalds for “TEE driver infrastructure and OP-TEE drivers”, which “introduces a generic TEE [Trusted Execution Environment] framework in the kernel, to handle trusted environ[ments] (security coprocessor or software implementations such as OP-TEE/TrustZone)”. He sent the pull separately from the other arm-soc pull specifically to call it out, and to make sure everyone knew that this was finally headed upstream, but he noted it would probably be maintained through the arm-soc kernel tree. He included a lengthy defense of why now was the right time to merge TEE support into upstream Linux.

Saving TLB flushes on Intel x86 Architecture

Andy Lutomirski posted an RFC patch series entitled “x86 TLB flush cleanups, moving toward PCID support”. Modern (non-legacy) architectures implement a per-process context identifier that can be used to tag VMA (Virtual Memory Area) translations that end up in the TLB (Translation Lookaside Buffer) caches within the microprocessor core. The processor’s hardware page table “walkers” navigate the page tables for a process and populate the TLBs (except in some mostly embedded cases, such as on certain PowerPC and MIPS processors, in which the kernel contains special assembly routines to perform this in software). On legacy architectures, the TLB is fairly simple, containing a plain mapping from virtual address to physical (or intermediate, in the case of virtualization) address. But on more sophisticated architectures, the TLB includes address space identification information that allows the TLB to distinguish between hits to the same virtual address that are from two different processes (known as tasks from within the kernel). Using additional tagging in the TLB avoids the traditional need to invalidate the entire TLB on process context switch.
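To make the tagging idea concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: the class `TaggedTLB`, the function `run`, and the page numbers are all hypothetical names invented for illustration, and nothing here reflects how x86 PCID or Andy’s patches are actually implemented. It simply contrasts a TLB that must be flushed on every context switch with one whose entries carry a context identifier.

```python
class TaggedTLB:
    """Toy TLB; with tagging on, entries are keyed by (context id, virtual page)."""

    def __init__(self, tagged):
        self.tagged = tagged
        self.entries = {}
        self.current_ctx = 0
        self.hits = 0
        self.misses = 0

    def switch_to(self, ctx):
        if not self.tagged:
            # Untagged entries are ambiguous between processes, so drop them all.
            self.entries.clear()
        self.current_ctx = ctx

    def translate(self, vpage, page_table):
        key = (self.current_ctx, vpage) if self.tagged else vpage
        if key in self.entries:
            self.hits += 1
        else:
            # Stands in for a page table walk.
            self.misses += 1
            self.entries[key] = page_table[vpage]
        return self.entries[key]


def run(tagged):
    # Two hypothetical processes map the same virtual page to different frames.
    tables = {1: {0x10: 0xA000}, 2: {0x10: 0xB000}}
    tlb = TaggedTLB(tagged)
    results = []
    for ctx in (1, 2, 1, 2):
        tlb.switch_to(ctx)
        results.append(tlb.translate(0x10, tables[ctx]))
    return results, tlb.hits, tlb.misses


if __name__ == "__main__":
    print("untagged:", run(False))
    print("tagged:  ", run(True))
```

Both variants return the same translations, but in this toy the untagged one repeats the walk after every switch while the tagged one reuses what it already cached.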

Modern architectures, such as AArch64, have implemented context tagging support in their architecture code for some time, and now x86 is finally set to follow, enabling a feature that has actually been present in x86 for some time (but was not wired up), thanks to Andy’s work on PCID (Process Context IDentifier) support. In his patch series, Andy notes that as he has been “polishing [his] PCID code, a major problem [he’s] encountered is that there are too many x86 TLB flushing code paths and that they have too many inconsequential differences”. This patch series aims to “clean up the mess”. Now if x86 finally gains hardware broadcast TLB invalidations it will also be able to remove the wasted IPIs (Inter-Processor-Interrupts) that it implements to cause remote processors to invalidate TLB entries, too. Linus liked Andy’s initial work, but said he is “always a bit nervous about TLB changes like this just because any potential bugs tend to be really really hard to see and catch”. Those of us who have debugged nasty TLB issues on other architectures would be inclined to agree with him.

Ongoing Development

Laurent Dufour posted version 3 of a patch series entitled “Speculative page faults”. This is a contemporary development inspired by Peter Zijlstra’s earlier work, which was based upon ideas of still others. The whole concept dates back to at least 2009 and generally involves removing the traditional locking constraints of updates to VMAs (Virtual Memory Areas) used by Linux tasks (processes) to represent the memory of running programs. Essentially, a “speculative fault” means “not holding mmap_sem” (a semaphore guarding a task’s current memory map). Laurent (and Peter) make VMA lookups lockless, and perform updates speculatively, using a seqlock to detect a change to the underlying VMA during the fault. “Once we’ve obtained the page and are ready to update the PTE, we validate if the state we started the fault with is still valid, if not, we’ll fail the fault with VM_FAULT_RETRY, otherwise we update the PTE and we’re done”. Earlier testing showed very significant performance upside to this work due to the reduced lock contention.
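The “snapshot a sequence count, do the work, then validate” pattern can be sketched in a few lines of Python. This is a single-threaded toy, not the kernel’s seqlock or Laurent’s code: the `VMA` class, `speculative_fault`, and the `concurrent_change` hook (used to simulate another thread modifying the VMA mid-fault) are all hypothetical, and only the name VM_FAULT_RETRY comes from the posting.

```python
VM_FAULT_RETRY = "VM_FAULT_RETRY"
VM_FAULT_DONE = "VM_FAULT_DONE"


class VMA:
    """Toy memory area protected by a sequence count (even means stable)."""

    def __init__(self, start, end):
        self.start, self.end = start, end
        self.seq = 0

    def modify(self, start, end):
        self.seq += 1                      # odd: update in progress
        self.start, self.end = start, end
        self.seq += 1                      # even: stable again


def speculative_fault(vma, addr, page_table, concurrent_change=None):
    seq = vma.seq                          # snapshot, no lock taken
    if seq % 2 or not (vma.start <= addr < vma.end):
        return VM_FAULT_RETRY              # fall back to the locked slow path
    page = ("page-for", addr)              # stands in for obtaining the page
    if concurrent_change is not None:
        concurrent_change()                # simulate a racing writer
    if vma.seq != seq:
        return VM_FAULT_RETRY              # the VMA changed underneath us
    page_table[addr] = page                # state still valid: update the "PTE"
    return VM_FAULT_DONE


if __name__ == "__main__":
    vma, pt = VMA(0x1000, 0x2000), {}
    print(speculative_fault(vma, 0x1800, pt))
    print(speculative_fault(vma, 0x1800, {}, lambda: vma.modify(0x1000, 0x1400)))
```

The important property is that a failed validation leaves the page table untouched, so retrying under the lock is always safe.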

Aaron Lu posted “smp: do not send IPI if call_single_queue not empty”. The Linux kernel (and most others) uses a construct known as an IPI – or Inter-Processor-Interrupt – a form of software generated interrupt that a processor will send to one or more others when it needs them to perform some housekeeping work on the kernel’s behalf. Usually, this is to handle such things as a TLB shootdown (invalidating a virtual address translation in a remote processor due to a virtual address space being removed), especially on less sophisticated legacy architectures that do not feature invalidation of TLBs through hardware broadcast, though there are many other uses for IPIs. Aaron’s patch realizes, effectively, that if a remote processor is already going to process a queue of CSD (call_single_data) function calls it has been asked to via IPI then there is no need to send another IPI and generate additional interrupts – the queue will be drained of this entry as well as existing entries by the IPI management code.
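As a rough illustration of the idea as summarized above (and only that; this is not the kernel’s code), the following Python toy queues work for a remote CPU and only “sends an IPI” when the queue was previously empty. The names `CPU`, `queue_remote_call`, and `ipis_received` are hypothetical; only `call_single_queue` echoes the name used in the patch title.

```python
from collections import deque


class CPU:
    """Toy remote CPU with a queue of pending cross-CPU function calls."""

    def __init__(self):
        self.call_single_queue = deque()
        self.ipis_received = 0

    def handle_ipi(self):
        # One interrupt drains everything queued so far.
        while self.call_single_queue:
            func = self.call_single_queue.popleft()
            func()


def queue_remote_call(target, func):
    """Queue func on target; return True only if an IPI had to be sent."""
    was_empty = not target.call_single_queue
    target.call_single_queue.append(func)
    if was_empty:
        target.ipis_received += 1          # stands in for raising the interrupt
        return True
    return False                           # an IPI is already on its way


if __name__ == "__main__":
    cpu, ran = CPU(), []
    for i in range(3):
        queue_remote_call(cpu, lambda i=i: ran.append(i))
    cpu.handle_ipi()
    print(ran, cpu.ipis_received)
```

Three queued calls cost a single interrupt here, because the handler drains the whole queue once it runs.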

Romain Perier posted version 8 of “Replace PCI pool by DMA pool API” which realizes that the current PCI pool API uses “simple macro functions direct expanded to the appropriate dma pool functions”, so it simply replaces them with a direct use of the corresponding DMA pool API instead.

Sandhya Bankar posted “vfs: Convert file allocation code to use the IDR”. This replaces existing filesystem code that allocates file descriptors using a custom allocator with Matthew (Willy) Wilcox’s idr (ID Radix) tree allocator.

Serge E. Hallyn posted a resend of version 2 of a patch series entitled “Introduce v3 namespaced file capabilities”. We covered this last time.

Heinrich Schuchardt posted “arm64: Always provide “model name” in /proc/cpuinfo”, which was quickly shot down (for the moment).

Christian König posted version 5 of his “Resizeable PCI BAR support” patch series. We have featured this in a previous episode of the podcast.

Prakash Sangappa posted “hugetlbfs ‘noautofill’ mount option” which aims to allow (optionally) for hugetlbfs pseudo-filesystems to be mounted with an option which will not automatically populate holes in files with zeros during a page fault when the file is accessed through the mapped address. This is intended to benefit applications such as Oracle databases, which make heavy use of such mechanisms but don’t take kindly to the kernel having side effects that change on-disk files even if only zero fill. Dave Hansen pushed back against this change saying that it was “further specializing hugetlbfs” and that Oracle should be using userfaultfd or “an madvise() option that disallows backing allocations”. Prakash replied that they had considered those but with a database there are such a large number of single threaded processes that “The concern with using userfaultfs is the overhead of setup and having an additional thread per process”.

Sameer Goel posted “arm64: Add translation functions for /dev/mem read/write” which “Port architecture specific xlate [translate] and unxlate [untranslate] functions for /dev/mem read/write. This sets up the mapping for a valid physical address if a kernel direct mapping is not already present”. Depending upon the ARM platform, access to a bad address in /dev/mem could result in a synchronous exception in the core, or a System Error (SError) generated by a system memory controller interface. In either case, it is handled as a fatal error where the same is not true on x86. While access to /dev/mem is restricted, increasingly being deprecated, and has other semantics to prevent its use on 64-bit ARM systems, it still exists and is used. In this case, to read the ACPI FPDT table which provides performance pointer records. Nevertheless, both Will Deacon and Leif Lindholm objected to the reasoning given here, saying that the kernel should instead be taught how to parse this table and expose its information via /sys rather than having userspace tools go poking in /dev/mem to try to read from the table directly.

Minchan Kim posted “vmscan: scan pages until it f[inds] eligible pages” in which he notes that “There are premature OOM [Out Of Memory killer invocations] happening. Although there are ton of free swap and anonymous LRU list of eligible zones, OOM happened. With investigation, skipping page of isolate_lru_pages makes reclaim void because it returns zero nr_taken easily so LRU shrinking is effectively nothing and just increases priority aggressively. Finally, OOM happens”.

Julius Werner posted version 3 of his “Memconsole changes for new coreboot format” which teaches the Google firmware driver for their memconsole to deal with the newer type of persistent ring buffer console they introduced.

Olliver Schinagl and Jamie Iles had a back and forth about the latter’s work on “glue-code” (generic handling code) for the DW (DesignWare) 8250 (a type of serial port interface made popular by PC) IP block as used in many different designs. Depending upon how the block is configured, it can behave differently, and there was some discussion about how to handle that. In particular the location of the UART_USR register.

Xiao Guangrong posted “KVM: MMU: fast write protect” which “introduces a[n] extremely fast way to write protect all the guest memory. Comparing with the ordinary algorithm which write protects last level sptes [the page table entries used by the guest] based on the rmap [the “reverse” map, the means that Linux uses to encode page table information within the kernel] one by one, it just simply updates the generation number to ask all vCPUs to reload its root page table, particularly it can be done out of mmu-lock”. The idea was apparently originally proposed by Avi (Kivity). Paolo Bonzini thought “This is clever” and wondered “how the alternative write protection mechanism would affect performance of the dirty page ring buffer patches”. Xiao thought it could be used to speed up those patches after merging, too [Paolo noted that he aims to merge these early in 4.13 development].
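The generation-number trick is a general lazy invalidation pattern, and a tiny Python model may help show why it is so cheap. This is a hedged sketch only: the `VM` class and its methods are hypothetical and do not mirror KVM’s data structures. Bumping one counter marks every vCPU’s cached root as stale, and each vCPU pays the reload cost only when it next enters the guest.

```python
class VM:
    """Toy guest: each vCPU remembers which generation its root table was built in."""

    def __init__(self, n_vcpus):
        self.generation = 0
        self.vcpu_root_gen = [0] * n_vcpus
        self.reloads = 0

    def write_protect_all(self):
        # Constant time: no walk over individual page table entries.
        self.generation += 1

    def vcpu_enter(self, vcpu):
        if self.vcpu_root_gen[vcpu] != self.generation:
            # Stale root: rebuild it (write protected) before running the guest.
            self.vcpu_root_gen[vcpu] = self.generation
            self.reloads += 1
            return "reloaded"
        return "reused"


if __name__ == "__main__":
    vm = VM(n_vcpus=2)
    print(vm.vcpu_enter(0), vm.vcpu_enter(1))
    vm.write_protect_all()
    print(vm.vcpu_enter(0), vm.vcpu_enter(0), vm.vcpu_enter(1))
```

The cost of the operation is moved from the caller, who would otherwise visit every entry, to the vCPUs, who each notice the stale generation independently.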

Bogdan Mirea posted version 2 of “Add “Preserve Boot Time Support””, which follows up on a previous discussion about retaining “Boot Time Preservation between Bootloader and Linux Kernel. It is based on the idea that the Bootloader (or any other early firmware) will start the HW Timer and Linux Kernel will count the time starting with the cycles elapsed since timer start”. By “Bootloader” he means “firmware” to those who live in x86-land.

Igor Stoppa posted “post-init-read-only protection for data allocated dynamically” which aims to provide a mechanism for dynamically allocated data which is similar to the “__read_only” special linker section that certain annotated (using special GCC directives) code will be placed into. That works great for read-only data (which is protected by the MMU locking down the corresponding region early in boot). His “wish” is to start with the “policy DB of SE Linux and the LSM Hooks, but eventually I would like to extend the protection also to other subsystems, in a way that can merged into mainline.” His patch includes an analysis of how he feels he can be as “little invasive as possible”, noting that “In most, if not all, the cases that could be enhanced, the code will be calling kmalloc/vmalloc, including GFP_KERNEL [Get Free Pages of Kernel Type Memory] as the desired type of memory”. Consequently, he says, “I suspect/hope that the various maintainer[s] won’t object too much if my changes are limited to replacing GFP_KERNEL with some other macro, for example what I previously called GFP_LOCKABLE”. Michal Hocko had some feedback, largely to the effect that a “master toggle” (that would allow protection to be disabled for small periods in order to make changes to “read only” data) was pointless, because it re-exposes the data. Instead, he wanted to see the protection being done at the kmem_cache_create time by adding a “SLAB_SEAL” parameter that would later be enabled on a per kmem_cache basis using “kmem_cache_seal(cache)” or a similar mechanism.

Bharat Bhushan posted “ARM64/PCI: Allow userspace to mmap PCI resources”, which Lorenzo Pieralisi noted was already implemented by another patch.

A lengthy, and “spirited” discussion took place between Timur Tabi and the various maintainers of the 64-bit ARM Architecture and SoC platform trees over the desire for the maintainers to have changes to “defconfigs” for the architecture go through a special “” alias. Except that after they had told Timur to use that, they objected to him posting a patch informing others of this alias in the kernel documentation. Instead, as Timur put it “without a MAINTAINERS entry, how would anyone know to CC: that address? I posted 3 versions of my defconfig patchset before someone told me that I had to send it to” The discussion thread is entitled “MAINTAINERS: add as the list for arm64 defconfig changes”.

Xunlei Pang posted version 3 of his “x86/mm/ident_map: Add PUD level 1GB page support” which helps “kernel_ident_mapping_init” to create a single and very large identity page mapping in order to reduce TLB (Translation Lookaside Buffer – the caches that store virtual to physical memory lookups performed by hardware) pressure on an architecture that is currently using many 2MB (PMD – Page Middle Directory) level pages for this process.
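Some back-of-the-envelope arithmetic shows why larger pages relieve TLB pressure. The snippet below is just that arithmetic in Python; the 64GB region size is a made-up example and the helper name `entries_needed` is hypothetical.

```python
PMD_SIZE = 2 << 20    # 2MB pages mapped at the PMD level
PUD_SIZE = 1 << 30    # 1GB pages mapped at the PUD level


def entries_needed(region_bytes, page_size):
    """Number of mappings needed to cover a region (rounded up)."""
    return -(-region_bytes // page_size)


if __name__ == "__main__":
    region = 64 << 30  # hypothetical 64GB identity-mapped region
    print("2MB mappings:", entries_needed(region, PMD_SIZE))
    print("1GB mappings:", entries_needed(region, PUD_SIZE))
```

For this example region, 32768 mappings at 2MB collapse to 64 at 1GB, a factor of 512 fewer translations competing for TLB entries.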

Anju T Sudhakar posted version 8 of “IMC Instrumentation Support”, which provides support for POWER9’s “In-Memory-Collection” or IMC infrastructure, which “contains various Performance Monitoring Units (PMUs) at Nest level (these are on-chip but off-core), Core level and Thread level.”

Greg K-H (Kroah-Hartman) posted an RFC patch entitled “add more new kernel pointer filter options” which “implement[s] some new restrictions when printing out kernel pointers, as well as the ability to whitelist kernel pointers where needed.”

Kees Cook posted “x86/refcount: Implement fast refcount overflow protection”, which seeks to upstream a “modified version of the x86 PAX_REFCOUNT defense from PaX/grsecurity. This speeds up the refcount_t API by duplicating the existing atomic_t implementation with a single instruction added to detect if the refcount has wrapped past INT_MAX (or below 0) resulting in a negative value, where the handler then restores the refcount_t to INT_MAX”.
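The saturating behavior described in that quote can be modeled in a few lines. This Python sketch emulates 32-bit signed arithmetic and is purely illustrative: `to_s32` and `refcount_inc` here are hypothetical helpers, not the kernel’s refcount_t API, and the real implementation does the check with a single instruction on the hot path rather than a function call.

```python
INT_MAX = 2**31 - 1


def to_s32(value):
    """Interpret an integer as a wrapped 32-bit signed value."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def refcount_inc(count):
    """Increment, but pin the counter at INT_MAX if it wraps negative."""
    new = to_s32(count + 1)
    if new < 0:
        # Overflow detected: saturate instead of letting the count wrap.
        return INT_MAX
    return new


if __name__ == "__main__":
    print("plain wrap:", to_s32(INT_MAX + 1))
    print("saturating:", refcount_inc(INT_MAX))
```

Saturating means an attacker who can inflate a reference count can at worst leak the object, rather than wrap the count around to a value that frees it while still in use.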

David Howells posted an RFC patch entitled “VFS: Introduce superblock configuration context” which is a “set of patches to create a superblock configuration context prior to setting up a new mount, populating it with the parsed options/binary data, creating the superblock and then effecting the mount. This allows namespaces and other information to be conveyed through the mount procedure. It also allows extra error information”.

The Google Chromebook team let folks know that they were (rarely, like one in a million) seeing “Threads stuck in zap_pid_ns_processes()”. Guenter Roeck noted that the “Problem is that if the main task [which has children that are being ptraced] doesn’t exit, it [the child] hangs forever. Chrome OS (where we see the problem in the field, and the application is chrome) is configured to reboot on hung tasks – if a task is hung for 120 seconds on those systems, it tends to be in a bad shape. This makes it a quite severe problem for us”. He asked “Are there other conditions besides ptrace where a task isn’t reaped?”. Reaping refers to the behavior in which tasks, when they exit, are reparented to the init task, which “reaps” them (cleans up and makes sure the state they exit with is seen), except under ptrace in this case where the parent task spawning the children “was outside of the pid namespace and was choosing not to reap the child”. Various proposals as to how to deal with this in the namespace code were discussed.

Mahesh Bandewar posted “kmod: don’t load module unless req process has CAP_SYS_MODULE” which notes that “A process inside random user-ns [a user namespace] should not load a module, which is currently possible”. He shows how a user namespace can be created that causes the kernel to load a module upon access to a file node indirectly. This could be a security risk if this approach were used to cause a host kernel to load a vulnerable but otherwise not loaded kernel driver through the privileged permissions in the namespace.

Marc Zyngier posted “irqdomain: Improve irq_domain_mapping facility” in which he “Update[s] IRQ-domain.txt to document irq_domain_mapping”, among other changes that seek to make it easier to access and understand this kernel feature.

Jens Axboe accepted a patch from Ulf Hansson adding Paolo Valente as a MAINTAINER of the BFQ I/O scheduler.

Cyrille Pitchen updated the git repos for the SPI NOR subsystem, which is “now hosted on MTD repos, spi-nor/next is on l2-mtd and spi-nor/fixes will be on linux-mtd”.

Alexandre Courbot posted “MAINTAINERS: remove self from GPIO maintainers”.

The folks at Codeaurora posted a lengthy analysis of the Linux kernel scheduler and specific problems with load_balance that will be covered next time around, along with work by Peter Zijlstra on the “cgroup/PELT overhaul (again)”.

Finally, Paul McKenney previously posted “Make SRCU be once again optional”, after having noted that the need to build it in by default (caused by other recent changes in header files) increased the kernel by 2K. Nico(las) Pitre was happy to hear this, saying “If every maintainer finds a way to (optionally) reduce the size of the code they maintain by 2K then we’ll get a much smaller kernel pretty soon”.

Linux Kernel Podcast for 2017/04/27


In this week’s edition: Linux 4.11-rc8, updating cross compilers, Intel 5-level paging, v3 namespaced file capabilities, and ongoing development.

Editorial Notes

Apologies for the delay to this week’s podcast. I got flu around the time I was preparing last week’s podcast, limped along to the weekend, and then had to stay in bed for a long time. On the other hand, it let me play with a bunch of new SDRs (HackRF, RTL-SDR, and friends, for the curious) on Sunday when I skipped the 5K I was supposed to run 🙂

I would also like to note my thanks for the first 10,000 downloads of the new series of this podcast. It’s a work in progress. I am going to make (positive!) changes over the coming months, including a web interface that will track all LKML posts and allow for community-directed collaboration on creating this (and hopefully other) podcasts. I will include automatic patch tracking (showing when patches have landed in upstream trees, and so on), info on post authors, and allow you to edit personal bios, links, etc. And employer info. After some discussions around the best way to handle author employer attribution (to make sure everyone is treated fairly), I’ve decided to take a little time away from including employer names until I have a populated database of mappings. Jon Corbet from LWN has something similar already, which I believe is also stored in git, but there’s more to be done here (thanks to Alex and others for the G+ feedback and discussion on this).

Linux 4.11-rc8

Linus Torvalds announced Linux 4.11-rc8, saying “So originally I was just planning on releasing the final 4.11 today, but while we didn’t have a *lot* of changes the last week, we had a couple of really annoying ones, so I’m doing another rc release instead”. As he also notes, “The most noticeable of the issues is that we’ve quirked off some NVMe power management that apparently causes problems on some machines. It’s not entirely clear what caused the issue (it wasn’t just limited to some NVMe hardware, but also particular platforms), but let’s test it”.

With the release of Linux 4.11-rc8 comes that impending moment of both elation and dread that is a final kernel. It’ll be great to see 4.11 out there. It’s an awesome kernel, with lots of new features, and it will be well summarized in kernelnewbies and elsewhere. But upon its release comes the opening of the merge window for 4.12. Tracking that was exciting for 4.11. Hopefully it doesn’t finish me off trying to do that for 4.12 😉

Geert Uytterhoeven posted “Build regressions/improvements in v4.11-rc8”, in which he noted that (compared with v4.10), an additional build error and several hundred more warnings were recently added to the kernel. The error he points to is in the AVR32 architecture when applying a relocation in the linker, probably due to an unsupported offset.


Greg K-H (Kroah-Hartman) announced Linux 4.4.64, 4.9.25, and 4.10.13

Junio C Hamano announced Git v2.13.0-rc1

Alex Williams posted “Generic DMA-capable streaming device driver looking for home” in which he describes some generic features of his device (the ability to “carry generic data to/from userspace”) and inquired as to where it should live in the kernel. It could do with some followup.

Updating cross compilers

Andre Przywara inquired as to the state of the cross compilers. This was a project, initiated by Tony Breeds and located on, to maintain current Intel x86 Architecture builds of cross compiler toolchains for various architecture targets (a cross compiler is one that runs on one architecture, targeting another, which is incidentally different from a “Canadian cross” compiler – look it up if you’re ever bored or want to bootstrap compilers for fun). It was a great project, but like so many others one day (three years ago) there were no more updates. That is something Andre would like to see changed. He posted, noting that many people still use the compilers on (including yours truly, in a pinch) and that “The latest compiler I find there is 4.9.0, which celebrated its third birthday at the weekend, also has been superseded by 4.9.4 meanwhile”.

Andre used build scripts from Segher Boessenkool to build binutils (the GNU assembler, linker, and related tools) 2.28 and GCC (the GNU Compiler Collection) 6.3.0. With some tweaks, he was able to build for “all architectures except arc, m68k, tilegx and tilepro”. He wondered “what the process is to get these [the compilers linked from the kernel website] updated?”. It seems like he is keen to clean this up, which is to be commended and encouraged. And hopefully (since he works for ARM) that will eventually also include cross compiler targets for x86 that run on ARMv8 server systems.

Intel 5-level paging

Kirill A. Shutemov posted “x86: 5-level paging enabling for v4.12, Part 4”, in which he provides an “updated version the fourth and the last bunch of [] patches that brings initial 5-level paging enabling.” This is in support of Intel’s “la57” feature of future microprocessors that allows them to exceed the traditional 48-bit “Canonical Addressing” in order to address up to 57 bits of Virtual Address space (a big benefit to those who want to map large non-volatile storage devices and accelerators into virtual memory). His latest patch series includes a fix for a “KASLR [Kernel Address Space Layout Randomization] bug due to rewriting [] startup_64() in C”.

Separately, John Paul Adrian Glaubitz inquired about Kirill’s patch series, saying, “I recently read the LWN article on your and your colleagues work to add five-level page table support for x86 to the Linux kernel. Since this extends the address space beyond 48-bits, as you know, it will cause potential headaches with Javascript engines which use tagged pointers. On SPARC, the virtual address space already extends to 52 bits and we are running into these very issues with Javascript engines on SPARC”.

He goes on to discuss passing the “hint” parameter to mmap() “in order to tell the kernel not to allocate memory beyond the 48 bits address space. Unfortunately, on Linux this will only work when the area pointed to by “hint” is unallocated which means one cannot simply use a hardcoded “hint” to mitigate this problem”. What he means here is that the mmap call to map a virtual memory area into a userspace process allows an application to specify where it would like that mapping to occur, but Linux isn’t required to respect this. Contemporary Linux implements “MAP_FIXED” as an option to mmap, which will either map a region where requested or explicitly fail (as Andy Lutomirski pointed out). This is different from a legacy behavior where Linux used to take a hint and might just not respect placement (as Andi Kleen alluded to in followup).

This whole discussion is actually the reason that Kirill had (thoughtfully) already included a feature bit setting in his patches that allows an application to effectively override the existing kernel logic and always allocate below 48 bits (preserving as close to existing behavior as possible on a per application basis while allowing a larger VA elsewhere). The thread resulted in this being pointed out, but it’s a timely reminder of the problems faced as the pressure continues upon architectures to grow their VA (Virtual Address) space size.

Often, efforts at growing virtual memory address spaces run up against uses of the higher order bits that were never sanctioned but are in widespread use. Many people strongly dislike pointer tagging of this kind (your author included), but it is not going away. It is great that Kirill’s patches have a form of solution that can be used for the time being by applications that want to retain a smaller address space, but that’s framed in the context of legacy support, not to enable runtimes to continue to use high order bits forevermore.
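For anyone who hasn’t seen pointer tagging up close, here is a small Python illustration of why it collides with larger address spaces. The scheme shown (a tag stored in the 16 bits above a 48-bit address) is a simplified assumption for illustration, not the layout used by any particular Javascript engine, and `tag_pointer` and `untag_pointer` are hypothetical names.

```python
ADDR_BITS = 48                       # assumed width of a "real" address
ADDR_MASK = (1 << ADDR_BITS) - 1


def tag_pointer(addr, tag):
    """Pack a small tag into the bits above the assumed address width."""
    return (tag << ADDR_BITS) | (addr & ADDR_MASK)


def untag_pointer(word):
    """Recover (address, tag) from a tagged 64-bit word."""
    return word & ADDR_MASK, word >> ADDR_BITS


if __name__ == "__main__":
    low = 0x00007F0012345678         # fits within 48 bits
    high = 1 << 50                   # only reachable with a larger VA space
    print(hex(untag_pointer(tag_pointer(low, 0x7))[0]))
    print(hex(untag_pointer(tag_pointer(high, 0x7))[0]))
```

The first address survives the round trip intact, while the second silently loses its upper bits, which is exactly the kind of corruption that makes runtimes want allocations kept below the 48-bit boundary.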

Introduce v3 namespaced file capabilities

Serge E. Hallyn posted “Introduce v3 namespaced file capabilities”. Linux includes a comprehensive capability mechanism that allows applications to limit what privileged operations may be performed by them. In the “good old days” when Unix hacker beards were more likely than today’s scruffy look, root was root and nobody really cared about remote compromise because they were still fighting having to have login passwords at all. But in today’s wonderful world of awesome, in which anything not bolted down is often not long for this world, “root” can mean very little. The traditionally privileged users can be extremely restricted by security policy frameworks, such as SELinux, but even more fundamentally can be subject to restrictions imposed by the growth in use of “capabilities”.

A classic example of a capability is CAP_NET_RAW, which the “ping” utility needs in order to create a raw socket. Traditionally, such utilities were created on Unix and Linux filesystems as “setuid root”, which means that they had the “s” bit set in their permissions to “run as root” when they were executed by regular users. This allowed the utility to operate, but it also allowed any user who could trick the utility into providing a shell to conveniently gain a root login. Many security exploits later, we have filesystem capabilities, which allow binaries to exist on disk tagged with just those extra capabilities they require to get the job done, through the filesystem “xattr” extended attributes. “ping” has CAP_NET_RAW, so it can create raw sockets, but it doesn’t need to run as root, so it isn’t marked as “setuid root” on modern distros.

Fast forward still further into the modern era of containers and namespaces, and things get more complex. As Serge notes in his patch, “Root in a non-initial user ns [namespace] cannot be trusted to write a traditional security.capability xattr. If it were allowed to do so, then any unprivileged user on the host could map his own uid to root in a private namespace, write the xattr, and execute the file with privilege on the host”. However, as he also notes, “supporting file capabilities in a user namespace is very desirable. Not doing so means that and programs designed to run with limited privilege must continue to support other methods of gaining and dropping privilege. For instance a program installer must detect whether file capabilities can be assigned, and assign them if so but set setuid-root otherwise. The program in turn must known how to drop partial capabilities [which is a mess to get right], and do so only if setuid-root”. This is, of course, far from desirable.

In the patch series, Serge “builds a vfs_ns_cap_data struct by appending a uid_t [user ID] rootid to struct vfs_cap_data. This is the absolute uid_t (that is, the uid_t in user namespace which mounted the filesystem, usually init_user_ns [the global default]) of the root id in whose namespace the file capabilities may take effect”. He then rewrites xattrs within the namespace for unprivileged “root” users with the appropriate notion of capabilities for that environment (in a “v3” xattr that is transparently converted to/from the conventional “v2” security.capability xattr), in accordance with capabilities that have been granted to the namespace from outside by a CAP_SETFCAP. This allows capability use without undermining host system security and seems like a nice solution.
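The “append a rootid” idea can be sketched in a few lines of Python. This is only an illustration of the on-disk layout as described above: the constant values are assumed to match the kernel’s uapi capability header, the helper names `pack_v2` and `pack_v3` are hypothetical, and the rootid of 100000 is a made-up example of a mapped namespace root.

```python
import struct

# Assumed values, mirroring the kernel's uapi capability header; treat them
# as illustrative rather than authoritative.
VFS_CAP_REVISION_2 = 0x02000000
VFS_CAP_REVISION_3 = 0x03000000
VFS_CAP_FLAGS_EFFECTIVE = 0x000001
CAP_NET_RAW = 13


def pack_v2(permitted: int, inheritable: int = 0, effective: bool = True) -> bytes:
    """Build a v2 xattr: magic_etc plus two (permitted, inheritable) pairs."""
    magic = VFS_CAP_REVISION_2 | (VFS_CAP_FLAGS_EFFECTIVE if effective else 0)
    return struct.pack(
        "<5I",
        magic,
        permitted & 0xFFFFFFFF, inheritable & 0xFFFFFFFF,  # caps 0..31
        permitted >> 32, inheritable >> 32,                # caps 32..63
    )


def pack_v3(permitted: int, rootid: int, inheritable: int = 0,
            effective: bool = True) -> bytes:
    """Build a v3 xattr: the v2 body with a new magic and an appended rootid."""
    magic = VFS_CAP_REVISION_3 | (VFS_CAP_FLAGS_EFFECTIVE if effective else 0)
    body = pack_v2(permitted, inheritable, effective)[4:]
    return struct.pack("<I", magic) + body + struct.pack("<I", rootid)


ping_caps = 1 << CAP_NET_RAW
v2 = pack_v2(ping_caps)
v3 = pack_v3(ping_caps, rootid=100000)   # hypothetical namespace root uid
print(len(v2), len(v3))
```

The only differences between the two blobs are the revision in the first word and the trailing four-byte rootid, which is what makes the transparent v2/v3 conversion straightforward.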

Ongoing Development

Ashish Kalra posted “Fix BSS corruption/overwrite issue in early x86 kernel setup”. The BSS (Block Started by Symbol) is the longstanding name used to refer to statically allocated (and pre-zeroed) variables that have memory set aside at compile time. It’s a common feature of almost every ELF (Executable and Linkable Format) Linux binary you will come across, the kernel not being much different. Linux also uses stacks for small runtime allocations by having a page (or several) of memory that contains a pointer which descends (it’s actually called a “fully descending” type of stack) in address as more (small) items are allocated within it. At boot time, the kernel typically expects the bootloader will have set up a stack that can be used for very early code, but Linux is willing to handle its own setup if the bootloader isn’t sophisticated enough to handle this. The latter code isn’t well exercised and, it turns out, doesn’t reserve quite enough space, which causes the stack to descend into the BSS segment, resulting in corruption. Ashish fixes this by increasing the fallback stack allocation size from 512 to 1024 bytes in arch/x86/boot/boot.h.
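The failure mode is easy to model. The following Python sketch is a toy, not kernel code: it places a zeroed “BSS” region directly below a fully descending stack and checks whether pushing a given number of bytes scribbles on it. The BSS size and the 600-byte load are hypothetical numbers chosen only to show the effect of moving from 512 to 1024 bytes.

```python
# Toy memory model: a BSS region sits directly below a fully descending
# stack. Sizes other than 512 and 1024 are made up for illustration.
BSS_SIZE = 256


def bss_corrupted(stack_size: int, bytes_pushed: int) -> bool:
    """Push bytes onto a descending stack and report whether BSS was hit."""
    memory = bytearray(BSS_SIZE + stack_size)   # [ BSS | stack ], zeroed
    sp = len(memory)                            # stack pointer starts at the top
    for _ in range(min(bytes_pushed, len(memory))):
        sp -= 1                                 # descend first ...
        memory[sp] = 0xAA                       # ... then store
    return any(memory[:BSS_SIZE])               # BSS should still be all zero


print(bss_corrupted(512, 600))    # True: the stack spills into the BSS
print(bss_corrupted(1024, 600))   # False: the larger stack absorbs the load
```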

Vladimir Murzin posted “ARM: Fix dma_alloc_coherent() and friends for NOMMU”, noting “It seem that addition of cache support for M-class CPUs uncovered [a] latent bug in DMA usage. NOMMU memory model has been treated as being always consistent; however, for R/M [Real Time and Microcontroller] classes [of ARM cores] memory can be covered by MPU [Memory Protection Unit] which in turn might configure RAM as Normal i.e. bufferable and cacheable. It breaks dma_alloc_coherent() and friends, since data can stuck in caches”.

Andrew Pinski posted “arm64/vdso: Rewrite gettimeofday into C”, which improves performance by up to 32% when compared to the existing in-kernel implementation on a Cavium ThunderX system (because there are division operations that the compiler can optimize). On their next generation, it apparently improves performance by 18% while also benefitting other ARM platforms that were tested. This is a significant improvement since that function is often called by userspace applications many times per second.

Baoquan He posted “x86/KASLR: Use old ident map page table if physical randomization failed”. Dave Young discovered a problem with the physical memory map setup of kexec/kdump kernels when KASLR (Kernel Address Space Layout Randomization) is enabled. KASLR does what it says on the tin. It applies a level of randomization to the placement of (most) physical pages of the kernel such that it is harder for an attacker to guess where in memory the kernel is located. This reduces the ability for “off the shelf” buffer overflow/ROP/similar attacks to leverage known kernel layout. But when the kernel kexec’s into a kdump kernel upon a crash, it’s loading a second kernel while attempting to leave physical memory not allocated to the crash kernel alone (so that it can be dumped). This can lead to KASLR allocation failures in the crash kernel, which (until this patch) would result in the crash kernel not correctly setting up an identity mapping for the original (older) kernel, resulting in immediately resetting the machine. With the patch, the crash kernel will fall back to the original kernel’s identity mapping page tables when KASLR setup fails.

On a separate, but related, note, Xunlei Pang posted “x86_64/kexec: Use PUD level 1GB page for identity mapping if available” which seeks to change how the kexec identity mapping is established, favoring a new top-level 1GB PUD (Page Upper Directory) allocation for the identity mappings needed prior to booting into the new kernel. This can save considerable memory (128MB “On one 32TB machine”…) vs using the current approach of many 2MB PTEs (Page Table Entries) for the region. Rather than many PTEs, an effective huge page can be mapped. PTEs are grouped into “directories” in memory that the microprocessor’s walker engines can navigate when handling a “page fault” (the process of loading the TLB – Translation Lookaside Buffer – and microTLB caches). Middle Directories are collections of PTEs, and these are then grouped into even larger collections at upper levels, depending upon nesting depth. For more about how paging works, see Mel Gorman’s “Linux Memory Management”, a classic text that is still very much relevant for the fundamentals.
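The 128MB figure can be checked with some quick arithmetic. The sketch below assumes the usual x86-64 parameters (8-byte entries, 4KB table pages holding 512 entries each); the function name `leaf_table_bytes` is hypothetical.

```python
KB, MB, GB, TB = (1 << s for s in (10, 20, 30, 40))

ENTRY_SIZE = 8            # bytes per x86-64 page table entry
TABLE_SIZE = 4 * KB       # one page table page holds 512 entries
ENTRIES_PER_TABLE = TABLE_SIZE // ENTRY_SIZE


def leaf_table_bytes(ram: int, page_size: int) -> int:
    """Memory consumed by the tables whose entries each map one page_size page."""
    entries = ram // page_size
    tables = -(-entries // ENTRIES_PER_TABLE)   # ceiling division
    return tables * TABLE_SIZE


ram = 32 * TB
with_2mb = leaf_table_bytes(ram, 2 * MB)   # tables of 2MB entries
with_1gb = leaf_table_bytes(ram, 1 * GB)   # tables of 1GB entries

print(with_2mb // MB, "MB of tables for 2MB mappings")
print(with_1gb // KB, "KB of tables for 1GB mappings")
```

Mapping 32TB in 2MB units needs 128MB of tables at that level, whereas mapping it in 1GB units needs only a few hundred KB one level up, tables that are required in either scheme. Dropping the 2MB level is therefore where the quoted saving comes from.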

Janakarajan Natarajan posted “Prevent timer value 0 for MWAITX” which limits the kernel from providing a value of zero to the privileged x86 “MWAITX” instruction. MWAIT (Memory Wait) is a series of instructions on contemporary x86 systems that allows the kernel to temporarily block execution (in place of a spinloop, or other solution) until a memory location has been updated. Then, various trickery at the micro-architectural level (a dedicated engine in the core that snoops for updates to that memory address) will handle resuming execution later. This is intended for use in waiting relatively small amounts of time in an energy efficient and high performance (low wakeup time) manner. The instruction accepts a timeout period after which a wakeup will happen regardless, but it can also accept a zero parameter. Zero is supposed to mean “never timeout” (i.e. always wait for the memory update). It turns out that existing Linux kernels do use zero on some occasions, incorrectly, and that this isn’t noticed on older microprocessors due to other events eventually triggering a wakeup regardless. On the new AMD Zen core, which behaves correctly, MWAITX may never wake up with a zero parameter, and this was causing NMI soft lockup warnings. The patch corrects Linux to do the right thing, removing the zero option.

Paul E. McKenney posted “Make SRCU be built by default”. SRCU (Sleepable RCU, where RCU stands for Read Copy Update) is an optional feature of the Linux kernel that provides an implementation of RCU which can sleep. Conventionally, RCU had spinlock semantics (it could not sleep). By definition, its purpose was to provide a cunning lockless update mechanism for data structures, relying upon the passage of a “grace period” defined by every processor having gone into the scheduler once (a gross simplification of RCU). But under some circumstances (for example, in a Real Time kernel) there is a need for a sleepable (and pre-emptable, but that’s another issue) RCU. And so SRCU was created more than 8 years ago. It has a companion in “Tiny SRCU” for embedded systems. A “surprisingly common case” exists now where parts of the kernel are including srcu.h so Paul’s patch builds it by default.

Laurent Dufour posted “BUG raised when onlining HWPoisoned page” in which he noted that the (being onlined) page “has already the mem_cgroup field set” (this is shown in the stack trace he posts with “page dumped because: page still charged to cgroup”). He cleans this up by clearing the mem_cgroup when a page is poisoned. His second patch skips poisoned pages altogether when performing a memory block onlining operation.

Laurent also posted an RFC (Request For Comment) patch series entitled “Replace mmap_sem by a range lock” which “implements the first step of the attempt to replace the mmap_sem by a range lock”. We will summarize this patch series in more detail the next time it is posted upstream.

Christian König posted version 4 of his “Resizable PCI BAR support” patches. PCI (and its derivatives, such as PCI Express) use BARs (Base Address Registers) to convey regions of the host physical memory map that the device will use to map in its memory. BARs themselves are just registers, but the memory they refer to must be linearly placed into the physical map (or interim IOVA map in the case that the BAR is within a virtual machine). Fitting large, multi GB windows can be a challenge, sometimes resulting in failure, but many devices can also manage with smaller memory windows. Christian’s patches attempt to provide for the best of both by adding support for a contemporary feature of PCI (Express) that allows devices with such an ability to convey a minimal BAR size and then increase the allocation if that is available. His changes since version 3 include “Fail if any BAR is still in use…”.

Ying Huang posted version 10 of his “THP swap: Delay splitting THP during swapping out” which allows for swapping of Transparent Huge Pages directly. We have previously covered iterations of this patch series. The latest changes are minimal, suggesting this is close to being merged.

Jérôme Glisse posted version 21 of his “Heterogeneous Memory Management” (HMM) patch series. This is very similar to the version we covered last week. As a reminder, HMM provides an API through which the kernel can manage devices that want to share memory with a host processing environment in a more seamless fashion, using shared address spaces and regular pointers. His latest version changes the concept of “device unaddressable” memory to “device private” (MEMORY_DEVICE_PRIVATE vs MEMORY_DEVICE_PUBLIC) memory, following the feedback from Dan Nellans that devices are changing over time such that “memory may not remain CPU-unaddressable in the future” and that, even though this would likely result in subsequent changes to HMM, it was worthwhile starting out with nomenclature correctly referring to memory that is considered private to a device and will not be managed by HMM.

Intel’s test Robot noticed a 12.8% performance improvement in one of their scalability benchmarks when running with a recent linux-next tree containing Al Viro’s “amd64: get rid of zeroing” patch. This is part of his larger “uaccess unification” patch series that aims to simplify and clean up the process of copying data to/from kernel and userspace. In particular, when asking the kernel to copy data from one userspace virtual address to another, there is no need to apply the level of data zeroing that typically applies to buffers the kernel copies (for security purposes – preventing leakage of extra data beyond structures returned from kernel calls, as an example). When both source and destination are already in userspace, there is no security issue, but there was a performance degradation that Viro had noticed and fixed.

Julien Grall posted “Xen: Implement EFI reset_system callback”, which provides a means to correctly reboot and power off Dom0 host Xen Hypervisors when running on EFI systems for which reset_system is used by reference (ARM).


Linux Kernel Podcast for 2017/04/19


[ Apologies for the delay – I have been a little sick for the past day or so and was out on Monday volunteering at the Boston Marathon, so my evenings have been in scarce supply to get this week’s issue completed ]

In this week’s edition: Linus Torvalds announces Linux 4.11-rc7, a kernel security update bonanza, the end of Kconfig maintenance, automatic NUMA balancing, movable memory, a bug in synchronize_rcu_tasks, and ongoing development. The Linux 4.12 merge window should open before next week.

Linus Torvalds announced Linux 4.11-rc7, noting that “You all know the drill by now. We’re in the late rc phase, and this may be the last rc if nothing surprising happens”. He also pointed out how things had been calm, and then, “as usual Friday happened”, leading to a number of reverts for “things that didn’t work out and aren’t worth trying to fix at this point”. In anticipation of the imminent opening of the 4.12 merge window (period of time during which disruptive changes are allowed) Linux Weekly News posted their usual excellent summary of the 4.11 development cycle. If you want to support quality Linux journalism, you should subscribe to LWN today.

Ted (Theodore) Ts’o posted “[REGRESSION] 4.11-rc: systemd doesn’t see most devices” in which he noted that “[t]here is a frustrating regression in 4.11 that I’ve been trying to track down. The symptoms are that a large number of systemd devices don’t show up.” (which was affecting the encrypted device mapper target backing his filesystem). He had a back and forth with Greg K-H (Kroah Hartman) about it with Greg suggesting Ted watch with udevadm and Ted pointing out that this happens at boot and is hard to trace. Ted’s final comment was interesting: “I’d do more debugging, but there’s a lot of magic these days in the kernel to udev/systemd communications that I’m quite ignorant about. Is this a good place I can learn more about how this all works, other than diving into the udev and systemd sources?”. Indeed. In somewhat interesting timing, Enric Balletbo i Serra later posted a 5 part patch series entitled “dm: boot a mapped device without an initramfs”.

Rafael J. Wysocki posted some late breaking 4.11-rc7 fixes for ACPI, including one patch reverting a “recent ACPICA commit [to the ACPI – Advanced Configuration and Power Interface – Component Architecture aka reference code upon which the kernel’s runtime interpreter is based] targeted at catching firmware bugs” that did do so, but also caused “functional problems”.


Jiri Slaby announced Linux 3.12.73.

Greg KH (Kroah-Hartman) announced Linux 3.18.49, 3.19.49, 4.4.62, 4.9.23, and 4.10.11. As he noted in his review posting prior to announcing the latest 3.18 kernel, 3.18 was indeed “dead and forgotten and left to rot on the side of the road” but “unfortunately, there’s a few million or so devices out there in the wild that still rely on this kernel”. Important security fixes are included in all of these updates. Greg doesn’t commit to bring 3.18 out of retirement for very long, but he does note that Google is assisting a little for the moment to make sure 3.18 based devices get some updates.

Steven Rostedt announced “Real Time” (preempt-rt) kernels 3.2.88-rt126 (“just an update to the new stable 3.2.88 version”), 3.12.72-rt97, and 4.4.60-rt73. Separately, Paul E. McKenney noted “A Hannes Weisbach of TU Dresden published this master thesis on quasi-real-time scheduling:

Rafael J. Wysocki announced a CFP (Call For Papers) targeting the upcoming LPC (Linux Plumbers Conference) Power Management and Energy-Awareness microconference “Call for topics”. Registration for LPC just opened.

Yann E. MORIN posted “MAINTAINERS: relinquish kconfig” in which he apologized for not having enough time to maintain Kconfig with “I’ve been almost entirely absent, which totally sucks, and there is no excuse for my behavior and for not having relinquished this earlier”. With such harsh friends as yourself, who needs enemies? Joking aside, this is sad news, since Kconfig is the core infrastructure used to configure the kernel. It wasn’t long before someone else (Randy Dunlap) posted a patch for Kconfig, which no longer has a maintainer (Randy’s patch implements a sort method for config options).

[as an aside, as usual, I have pinged folks who might be looking for an opportunity to encourage them to consider stepping up to take this on].

Automatic NUMA balancing, movable memory, and more!

Mel Gorman posted “mm, numa: Fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa”. Modern Linux kernels include a feature known as automatic NUMA balancing which relies upon marking regions of virtual memory as inaccessible via their page table entries (PTEs) and setting a special prot_numa protection hinting bit. The idea is that a later “NUMA hinting fault” on access to the page will allow the Operating System to determine whether it should migrate the page to another NUMA node. Pages are simply small granular units of system memory that are managed by the kernel in setting up translations from virtual to physical memory. When an access to a virtual address occurs, hardware (or, on some architectures, special software) “walkers” navigate the “page tables” pointed to by a special system register. The walker will traverse various “directories” formed from collections of pages in a hierarchical fashion intended to require less space to store page tables than if entries were required for every possible virtual address in a 32 or 64-bit space.

Contemporary microprocessors also support multiple page (granule) sizes, with a fundamental size (commonly 4K or 64K) being supplemented by the ability for larger pages (aka “hugepages”) to be used for very large regions of contiguous virtual memory at less overhead. Common sizes of huge pages are 2MB, 4MB, 512MB, and even multi-GB, with “contiguous hint bits” on some modern architectures allowing for even greater flexibility in the footprint of page table and TLB (Translation Lookaside Buffer) entries by only requiring physical entries for a fraction of a contiguous region. On Intel x86 Architecture, huge pages are implemented using the Page Size Extensions (PSE), which allows for a PMD (Page Middle Directory) to be replaced by an entry that effectively allocates the entire range to a single page entry. When a hardware walker sees this, a single TLB entry can be used for an entire range of a few MB instead of many 4K entries.
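As a rough illustration of what a walker does with an address, the Python sketch below splits a 48-bit x86-64 virtual address into the index it would use at each level of a four-level page table, assuming 4K base pages and nine index bits per level. The function name `split_va` and the sample address are made up for the example.

```python
def split_va(va: int, huge_2mb: bool = False) -> dict:
    """Split a 48-bit x86-64 virtual address into per-level table indices."""
    fields = {
        "pgd": (va >> 39) & 0x1FF,   # bits 47..39
        "pud": (va >> 30) & 0x1FF,   # bits 38..30
        "pmd": (va >> 21) & 0x1FF,   # bits 29..21
    }
    if huge_2mb:
        # The PMD entry maps the page directly: 21 offset bits, no PTE level.
        fields["offset"] = va & ((1 << 21) - 1)
    else:
        fields["pte"] = (va >> 12) & 0x1FF   # bits 20..12
        fields["offset"] = va & 0xFFF        # bits 11..0
    return fields


va = 0x7F12_3456_789A          # arbitrary example address
print(split_va(va))
print(split_va(va, huge_2mb=True))
```

With a 2MB huge page the walk stops one level early and the bits that would have selected a PTE become part of the page offset, which is why a single TLB entry can then cover the whole 2MB range.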

A bug known as a “race condition” existed in the automatic NUMA hinting code in which change_pmd_range would perform a number of checks without a lock being held to protect against a race with a parallel protection update (which does happen under a lock) that would clear the PMD and fill it with a prot_numa entry. Mel adds a new pmd_none_or_trans_huge_or_clear_bad function that correctly handles this rare corner case sequence, and documents it (in mm/mprotect.c). Michal Hocko responded with “you will probably win the_longer_function_name_contest but I do not have [a] much better suggestion”.

Speaking of Michal Hocko, he posted version 2 of a patch series entitled “mm: make movable onlining suck less” in which he described the current status quo of “Movable onlining” as “a real hack with many downsides”. Linux divides memory into regions describing zones with names like ZONE_NORMAL (for regular system memory) and ZONE_MOVABLE (for memory consisting entirely of pages that can be moved, that is, pages that contain no unmovable system data, firmware data, or anything else that cannot be trivially moved or offlined).

The existing implementation has a number of constraints around which pages can be onlined. In particular, around the relative placement of the memory being onlined vs the ZONE_NORMAL memory. This, Michal described as “mainly reintroduction of lowmem/highmem issues we used to have on 32b systems – but it is the only way to make the memory hotremove more reliable which is something that people are asking for”. His patch series aims to make “the onlining semantic more usable [especially when driven by udev]…it allows to online memory movable as long as it doesn’t clash with the existing ZONE_NORMAL. That means that ZONE_NORMAL and ZONE_MOVABLE cannot overlap”. He noted that he had discussed this patch series with Jérôme Glisse (author of the HMM – Heterogeneous Memory Management – patches) which were to be rebased on top of this patch series. Michal said he would assist with resolving any conflicts.

Igor Mammedov (Red Hat) noted that he had “given [the movable onlining] series some dumb testing” and had found three issues with it, which he described fully. In summary, these were “unable to online memblock as NORMAL adjacent to onlined MOVABLE”, “dimm1 assigned to node 1 on qemu CLI memblock is onlined as movable by default”, and “removable flag flipped to non-removable state”. Michal wasn’t initially able to reproduce the second issue (because he didn’t have ACPI_HOTPLUG_MEMORY enabled in his kernel) but was then able to follow up, noting that it was similar to another bug he had already fixed. Jérôme subsequently followed up with an updated HMM patchset as well.

Joonsoo Kim (LGE) posted version 7 of a patch series entitled “Introduce ZONE_CMA” in which he reworks the CMA (Contiguous Memory Allocator) used by Linux to manage large regions of physically contiguous memory that must be allocated (for device DMA buffers in cases where scatter gather DMA or an IOMMU are not available for managed translations). In the existing CMA implementation, physically contiguous pages are reserved at boot time, but they operate much as reserved memory that happens to fall within ZONE_NORMAL (but with a special “migratetype”, MIGRATE_CMA), and will not generally be used by the system for regular memory allocations unless there are no movable freepages available. In other words, only as a last possible resort.

This means that on a system with 1024MB of memory, kswapd “is mostly woke[n] up when roughly 512MB free memory is left”. The new patches instead create a distinct ZONE_CMA which has some special properties intended to address utilization issues with the existing implementation. As he notes, he had a lengthy discussion with Mel Gorman after the LSF/MM 2016 conference last year, in which Mel stated “I’m not going to outright NAK your series but I won’t ACK it either”. A lot of further discussion is anticipated. Michal Hocko might have summarized it best with, “the cover letter didn’t really help me to understand the basic concepts to have a good starting point before diving into the implementation details [to review the patches]”. Joonsoo followed up with an even longer set of answers to Michal.

A bug in synchronize_rcu_tasks()

Paul E. McKenney posted “There is a Tasks RCU stall warning” in which he noted that he and Steven Rostedt were seeing a stall that didn’t report until it had waited 10 minutes (and recommended that Steven try setting the kernel rcupdate.rcu_task_stall_timeout boot parameter). RCU (Read Copy Update) is a clever mechanism used by Linux (under a GPL license from IBM, who own a patent on the underlying technology) to perform lockless updates to certain types of data structure, by tracking versions of the structure and freeing the older version once references to it have reached an RCU quiescent state (defined, roughly, by each CPU in the system having passed through the scheduler once).

Steven noted that for the issue under discussion there was a thread that “never goes to sleep, but will call cond_resched() periodically [a function that is intended to possibly call into the scheduler if there is work to be done there]”. On the RT (Real Time, “preempt-rt”) kernel, Steven noted that cond_resched() is a nop and that the code he had been working on should have made a call directly to the schedule() function, which led to him suggesting he had “found a bug in synchronize_rcu_tasks()” in the case that a task frequently calls schedule() but never actually performs a context switch. In that case, per Paul’s subsequent patch, the kernel is patched to specially handle calls to schedule() not due to regular preemption.

Ongoing Development

Anshuman Khandual posted “mm/madvise: Clean up MADV_SOFT_OFFLINE and MADV_HWPOISON” noting that “madvise_memory_failure() was misleading to accommodate handling of both memory_failure() as well as soft_offline_page() functions. Basically it handles memory error injection from user space which can go either way as memory failure or soft offline. Renamed as madvise_inject_error() instead.” The madvise infrastructure allows for coordination between kernel and userspace about how the latter intends to use regions of its virtual memory address space. Using this interface, it is possible for applications to provide hints as to their future usage patterns, relinquish memory that they no longer require, inject errors, and much more. This is particularly useful to KVM virtual machines, which appear as regular processes and can use madvise() to control their “RAM”.
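For readers who have not used madvise() from userspace, here is a minimal Python sketch (Python 3.8 or later on Linux) that gives the kernel a usage hint and then relinquishes a page. It does not exercise the error injection options discussed above, which require privilege; the function name `scratch_then_release` is hypothetical.

```python
import mmap

PAGE = mmap.PAGESIZE


def scratch_then_release() -> bytes:
    """Write to an anonymous private mapping, then hand its first page back."""
    buf = mmap.mmap(-1, 4 * PAGE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        buf.madvise(mmap.MADV_SEQUENTIAL)          # hint: sequential access ahead
        buf[:5] = b"hello"
        buf.madvise(mmap.MADV_DONTNEED, 0, PAGE)   # relinquish the first page
        return bytes(buf[:5])
    finally:
        buf.close()


print(scratch_then_release())
```

On Linux, a private anonymous page released with MADV_DONTNEED is expected to read back as zeros on the next access, since the kernel is free to discard its contents.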

Sricharan R (Codeaurora) posted version 11 of a patch series entitled “IOMMU probe deferral support”, which “calls the dma ops configuration for the devices at a generic place so that it works for all busses”.

Kishon Vijay Abraham sent a pull request to Greg K-H (Kroah Hartman) for Linux 4.12 that included individual patches in addition to the pull itself. This resulted in an interesting side discussion between Kishon and Lee Jones (Linaro) about how this was “a strange practice” Lee hadn’t seen before.

Thomas Garnier (Google) posted version 7 of a patch series entitled “syscalls: Restore address limit after a syscall” which “ensures a syscall does not return to user-mode with a kernel address limit. If that happened, a process can corrupt kernel-mode memory and elevate privileges”. Once again, he cites how this would have preemptively mitigated a Google Project Zero security bug.

Christopher Bostic posted version 6 of a patch series enabling support for the “Flexible Support Interface” (FSI) high fan out bus on IBM POWER systems.

Dan Williams (Intel) posted “x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions” in which he says “Before we rework the “pmem api” to stop abusing __copy_user_nocache() for memcpy_to_pmem() we need to fix cases where we may strand dirty data in the cpu cache.”

Leo Yan (Linaro) posted an RFC (Request For Comments) patch series entitled “coresight: support dump ETB RAM” which enables support for the Embedded Trace Buffer (ETB) on-chip storage of trace data. This is a small buffer (usually 2KB to 8KB) containing profiling data used for postmortem debug.

Thierry Escande posted “Google VPD sysfs driver”, which provides support for “accessing Google Vital Product Data (VPD) through the sysfs”.

Alex(ander) Graf posted version 6 of “kvm: better MWAIT emulation for guests”, which provides new capability information to user space in order for it to inform a KVM guest of the availability of native MWAIT instruction support. MWAIT allows a (guest) kernel to wake up a remote (v)CPU without an IPI – InterProcessor Interrupt – and the associated vmexit that would then occur to schedule the remote vCPU for execution. The availability of MWAIT is deliberately not provided in the normal CPUID bitmap since “most people will want to benefit from sleeping vCPUs to allow for over commit” (in other words with MWAIT support, one can arrange to keep virtual CPUs runnable for longer and this might impact the latency of hosting many tenants on the same machine).

David Woodhouse posted version 2 of his patch series entitled “PCI resource mmap cleanup” which “pursues my previous patch set all the way to its logical conclusion”, killing off “the legacy arch-provided pci_mmap_page_range() completely, along with its vile ‘address converted by pci_resource_to_user()’ API and the various bugs and other strange behavior that various architectures had”. He noted that to “accommodate the ARM64 maintainers’ desire *not* to support [the legacy] mmap through /proc/bus/pci I have separated HAVE_PCI_MMAP from the sysfs implementation”. This had previously been called out since older versions of DPDK were looking for the legacy API and failing as a result on newer ARM server platforms.

Darren Hart posted an RFC (Request For Comments) patch series entitled “WMI Enhancements” that seeks to clean up the “parallel efforts involving the Windows Management Instrumentation (WMI) and dependent/related drivers”. He wanted to have a “round of discussion among those of you that have been involved in this space before we decide on a direction”. The proposed direction is to “convert[] wmi into a platform device and a proper bus, providing devices for dependent drivers to bind to, and a mechanism for sibling devices to communicate with each other”. In particular, it includes a capability to expose WMI devices directly to userspace, which resulted in some pushback (from Pali Rohár) and a suggestion that some form of explicit whitelisting of WMI identifiers (GUIDs) should be used instead. Mario Limonciello (Dell) had many useful suggestions.

Wei Wang (Intel) posted version 9 of a patch series entitled “Extend virtio-balloon for fast (de)inflating & fast live migration” in which he “implements two optimizations”. The first “transfer[s] pages in chunks between the guest and host”. The second “transfer[s] the guest unused pages to the host so that they can be skipped in live migration”.

Dmitry Safonov posted “ARM32: Support mremap() for sigpage/vDSO” which allows CRIU (Checkpoint and Restart in Userspace) to complete its process of restoring all application VMA (Virtual Memory Area) mappings on restart by adding the ability to move the vDSO (Virtual Dynamic Shared Object) and sigpage kernel pages (data explicitly mapped into every process by the kernel to accelerate certain operations) into “the same place where they were before C/R”.

Matias Bjørling (Cnex Labs) prepared a git pull request for “LightNVM” targeting Linux 4.12. This is “a new host-side translation layer that implements support for exposing Open-Channel SSDs as block devices”.

Greg Thelen (Google) posted “slab: avoid IPIs when creating kmem caches”. Linux’s SLAB memory allocator (see also the paper on the original Solaris memory allocator) can be used to pre-allocate small caches of objects that can then be efficiently used by various kernel code. When these are allocated, per-cpu array caches are created, and a call is made to kick_all_cpus_sync() which will schedule all processors to run code to ensure that there are no stale references to the old array caches. This global call is performed using an IPI (InterProcessor Interrupt), which is relatively expensive, and entirely wasted in the case that a new cache is being created (and not replacing an old one), since there are no old array caches to reference. The example given shows on the order of 47,741 additional IPIs on an unpatched kernel vs. 1,170 on a patched one.

One Day Delay Due to Boston Marathon

The Podcast is delayed until Wednesday evening this week. Usually, I try to get it out on a Monday night (or at least write it up then and actually post on Tuesday), but when holidays or other events fall on a Monday, I will generally delay the podcast by a day. This week, I was volunteering at the Marathon all of Monday, which means the prep is taking place Tuesday night instead.

Linux Kernel Podcast for 2017/04/11


In this week’s edition: Linus Torvalds announces Linux 4.11-rc6, Intel Memory Bandwidth Allocation (MBA), Coherent Device Memory (CDM), Paravirtualized Remote TLB Flushing, kernel lockdown, the latest on Intel 5-level paging, and other assorted ongoing development activities.

Linus Torvalds announced Linux 4.11-rc6. In his mail, Linus notes that “Things are looking fairly normal [for this point in the development cycle]…The only slightly unusual thing is how the patches are spread out, with almost equal parts of arch updates, drivers, filesystems, networking and “misc”.” He ends “Go and get it”. Thorsten Leemhuis followed up with “Linux 4.11: Reported regressions as of Sunday, 2017-04-09”, his third regression report for 4.11, which “lists 15 regressions I’m currently aware of. 5 regressions mentioned in last week[‘]s report got fixed”. Most appear to be driver problems, but there is one relating to audit, and one in inet6_fill_ifaddr that is stalled waiting for “feedback from reporter”.

Stable kernels

Greg K-H (Kroah-Hartman) announced Linux kernels 4.4.60, 4.9.21, and 4.10.9

Ben Hutchings announced Linux 3.2.88 and 3.16.43

Jason A. Donenfeld pointed out that Linux 3.10 “is inexplicably missing crypto_memneq, making all crypto mac [Message Authentication Code] comparisons use non constant-time comparisons. Bad news bears” [presumably due to side channel attack]. Willy Tarreau followed up noting that he would “check if the 3.12 patches…can be safely backported”.
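For anyone wondering why a non constant-time comparison is “bad news”, here is a minimal userspace sketch in Python (not the kernel’s crypto_memneq, and the function names and MAC values are made up for illustration). The first function bails out at the first mismatching byte, so its run time leaks how much of a forged MAC was correct; the second touches every byte before deciding. Pure Python gives no real timing guarantees, so treat this as a picture of the idea only.

```python
import hmac


def early_exit_equal(a: bytes, b: bytes) -> bool:
    """Stops at the first mismatching byte, so run time hints at its position."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False
    return True


def accumulate_equal(a: bytes, b: bytes) -> bool:
    """Visits every byte and ORs the differences together before deciding."""
    if len(a) != len(b):
        return False
    diff = 0
    for x, y in zip(a, b):
        diff |= x ^ y
    return diff == 0


# Hypothetical MAC values, purely for illustration.
good_mac = bytes.fromhex("a1b2c3d4e5f60718")
forged_mac = bytes.fromhex("a1b2c3d4e5f60719")

print(early_exit_equal(good_mac, forged_mac))
print(accumulate_equal(good_mac, forged_mac))
# In real Python code, prefer the standard library helper:
print(hmac.compare_digest(good_mac, good_mac))
```

Both functions return the same answers; the difference is only in how long they take to reach them, which is exactly what a timing side channel measures.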

Memory Bandwidth Allocation (Intel Resource Director Technology, RDT)

Vikas Shivappa (Intel) posted version 4 of a patch series entitled “x86/intel_rdt: Intel Memory bandwidth allocation”, addressing feedback from the previous iteration that he had received from Thomas Gleixner. The MBA (Memory Bandwidth Allocation) technology is described both in the kernel Documentation patch (provided) as well as in various Intel papers and materials available online. Intel provide a construct known as a “Class of Service” (CLOS) on certain contemporary Xeon processors, as part of their CAT (Cache Allocation Technology) feature, which is itself part of a larger family of technologies known as “Intel Resource Director Technology” (RDT). These CLOSes “act as a resource control tag into which a thread/app/VM/container can be grouped”.

It appears that these parts can not only assign specific proportions of the L3 cache (LLC in Intel-speak) slices on the Xeon’s ring interconnect to specific resources (e.g. “tasks” – otherwise known as processes, or applications) but can also control the amount of memory bandwidth granted to these. This is easier than it sounds. From a technical perspective, Intel integrate their memory controller onto their dies, and contemporary memory controllers already perform fine grained scheduling (this is how they bias memory reads for speculative loads of the instruction stream in among the other traffic, as just one simple example). Therefore, exposing memory bandwidth control to the cache slices isn’t all that much more complex. But it is cute, and looks great in marketing materials.
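To make the “Class of Service” idea a little more concrete, here is a small Python sketch that parses a memory bandwidth line of the kind I understand the resctrl “schemata” file to use, i.e. “MB:<domain>=<percent>;<domain>=<percent>”. The exact format and the valid percentage range are my assumptions here (the hardware also imposes a minimum and a granularity that this sketch ignores), and parse_mb_schemata is a name I made up, so check the Documentation patch before relying on any of it.

```python
from typing import Dict


def parse_mb_schemata(line: str) -> Dict[int, int]:
    """Parse an assumed 'MB:0=50;1=70' line into {domain id: percent}."""
    resource, _, body = line.strip().partition(":")
    if resource != "MB":
        raise ValueError("not a memory bandwidth line: %r" % line)
    limits = {}
    for item in body.split(";"):
        domain, _, percent = item.partition("=")
        value = int(percent)
        # Assumed range; real hardware advertises its own minimum and step.
        if not 0 < value <= 100:
            raise ValueError("bandwidth percent out of range: %d" % value)
        limits[int(domain)] = value
    return limits


# Hypothetical group limited to half the bandwidth on domain 0.
print(parse_mb_schemata("MB:0=50;1=70"))
```

Each group of tasks gets its own schemata, so throttling a noisy neighbour is (in principle) a matter of writing a smaller percentage for that group.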

Coherent Device Memory (CDM) on top of HMM

Jérôme Glisse posted an RFC [Request for Comments] patch series entitled “Coherent Device Memory (CDM) on top of HMM”. His previous HMM (Heterogeneous Memory Management) patch series, now in version 19, implemented support for (non-coherent) device memory to be mapped into regular process address space, by leveraging the ability for certain contemporary devices to fault on access to untranslated addresses managed in device page tables, thus allowing for a kind of pageable device memory and transparent management of ownership of the memory pages between application processor cores and (e.g.) a GPU or other acceleration device. The latest patch series builds upon HMM to also support coherent device memory (via a new ZONE_DEVICE memory – see also the recent postings from IBM in this area). As Jérôme notes, “Unlike the unaddressable memory type added with HMM patchset, the CDM [Coherent Device Memory] type can be access[ed] by [the] CPU.” He notes that he wanted to kick off this RFC more for the conversation it might provoke.

In his mail, Jérôme says, “My personal belief is that the hierarchy of memory is getting deeper (DDR, HBM stack memory, persistent memory, device memory, …) and it may make sense to try to mirror this complexity within mm concept. Generalizing the NUMA abstraction is probably the best starting point for this. I know there are strong feelings against changing NUMA so i believe now is the time to pick a direction”. He’s right of course. There have been a number of patch series recently also targeting accelerators (such as FPGAs), and more can be anticipated for coherently attached devices in the future. [This author is personally involved in CCIX]

Hyper-V: Paravirtualized Remote TLB Flushing and Hypercall Improvements

Vitaly Kuznetsov (Red Hat) posted “Hyper-V: paravirtualized remote TLB flushing and hypercall improvements”. It turns out that Microsoft’s Hyper-V hypervisor supports hypercalls (calls into the hypervisor from the guest OS) for “doing local and remote TLB [Translation Lookaside Buffer] flushing”. Translation Lookaside Buffers [TLBs] are caches built into microprocessors that store a translation of a CPU virtual address to “physical” (or, for a virtual machine, into an intermediate hypervisor) address. They save an unnecessary page table walk (of the software managed hardware/software structure – depending upon architecture – that “walkers” navigate to perform a translation during a “page fault” or unhandled memory access, such as happens constantly when demand loading/faulting in application code and data, or sharing read-only data provided by shared libraries, etc.). TLBs are generally transparent to the OS, except that they must be explicitly managed under certain conditions – such as when invalidating regions of virtual memory or performing certain context switches (depending upon the provisioning of address and virtual memory space tag IDs in the architecture).

TLB invalidates on local processor cores normally use special CPU instructions, and this is certainly also true under virtualization. But virtual addresses used by a particular process (known as a task within the kernel) might be also used by other cores that have touched the same virtual memory space. And those translations need to be invalidated too. Some architectures include sophisticated hardware broadcast invalidation of TLBs, but some other legacy architectures don’t provide these kinds of capabilities. On those architectures that don’t provide for a hardware broadcast, it is typically necessary to use a construct known as an IPI (Inter Processor Interrupt) to cause an IRQ (interrupt message) to be delivered to the remote interrupt controller CPU interface (e.g. LAPIC on Intel x86 architecture) of the destination core, which will run an IPI handler in response that does the TLB teardown.

As Vitaly notes, nobody is recommending doing local TLB flush using a hypercall, but there can be significant performance improvement in using a hypercall for the remote invalidates. In the example cited, which uses “a special ‘TLB trasher’” he demonstrates how a 16 vCPU guest experienced a greater than 25% performance improvement using the hypercall approach.
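For those who like to see the moving parts, here is a toy Python model of the two remote flush strategies described above. It is only a sketch under heavy assumptions: each “CPU” has a dict standing in for its TLB, an IPI is just a counter increment, and the hypercall is modelled as a single request that the hypervisor fans out itself. None of the names come from the patches, and it says nothing about real latencies.

```python
class Cpu:
    """A core with a tiny software model of a TLB (virtual page -> frame)."""

    def __init__(self):
        self.tlb = {}

    def translate(self, page_table, vpage):
        # On a TLB miss, fall back to the (slow) page table walk and cache it.
        if vpage not in self.tlb:
            self.tlb[vpage] = page_table[vpage]
        return self.tlb[vpage]


def flush_with_ipis(cpus, initiator, vpage):
    """Local invalidate plus one interrupt per remote CPU; returns IPIs sent."""
    ipis = 0
    for index, cpu in enumerate(cpus):
        if index != initiator:
            ipis += 1  # each remote core takes an interrupt and runs a handler
        cpu.tlb.pop(vpage, None)
    return ipis


def flush_with_hypercall(cpus, vpage):
    """One call into the (modelled) hypervisor, which invalidates everywhere."""
    for cpu in cpus:
        cpu.tlb.pop(vpage, None)
    return 1  # a single guest-to-hypervisor transition


page_table = {0x10: 0xABC}
cpus = [Cpu() for _ in range(16)]  # 16 vCPUs, as in the example cited
for cpu in cpus:
    cpu.translate(page_table, 0x10)

print(flush_with_ipis(cpus, initiator=0, vpage=0x10))
for cpu in cpus:
    cpu.translate(page_table, 0x10)
print(flush_with_hypercall(cpus, vpage=0x10))
```

The real win under virtualization is presumably that the hypervisor knows which vCPUs are actually scheduled, which this toy model does not attempt to capture.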

Ongoing Development

David Howells posted a magnum opus entitled “Kernel lockdown”, which aims to “provide a facility by which a variety of avenues by which userspace can feasibly modify the running kernel image can be locked down”. As he says, “The lock-down can be configured to be triggered by the EFI secure boot status, provided the shim isn’t insecure. The lock-down can be lifted by typing SysRq+x on a keyboard attached to the system” [physical presence]. Among the many other things, these patches (versions of which have been in distribution kernels for a while) change kernel behavior to include “No unsigned modules and no modules for which [we] can’t validate the signature”, disable many hardware access functions, turn off hibernation, prevent kexec_load(), and limit some debugging features. Justin Forbes of the Fedora Project noted that he had (obviously) tested these. One of the many interesting sets of patches included a feature to “Annotate hardware config module parameters” which allows modules to mark unsafe options. Following some pushback, David also followed up with a rationale for doing kernel lockdown, entitled “Why kernel lockdown?”. Worth reading.

Kirill A. Shutemov posted “x86: 5-level paging enabling for v4.12, Part 4”, in which he (bravely) took Ingo’s request to “rewrite assembly parts of boot process into C before bringing 5-level paging support”. He says, “The only part where I succeed is startup_64 in arch/x86/kernel/head_64.S. Most of the logic is now in C.” He also renames the level 4 page tables “init_level4_pgt” and “early_level4_pgt” to “init_top_pgt” and “early_top_pgt”. There was another lengthy discussion around his “Allow to have userspace mappings above 47-bits”, a patch which tells the kernel to prefer to do memory allocations below 47-bits (the previous “Canonical Addressing” limit of Intel x86 processors, which some JITs and other code exploit by abusing the top bits of the address space in pointers for illegal tags, breaking compatibility with an extended virtual address space). The patch allows mmap calls with MAP_FIXED hints to cause larger allocations. There was some concern that larger VM space is ABI and must be handled with care. A footnote here is that (apparently, from the patch) Intel MPX (Memory Protection Extension) doesn’t yet work with LA57 (the larger address space feature) and so Kirill avoids both in the same process.
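The arithmetic behind the 47-bit limit is easy to check for yourself. The Python sketch below assumes the usual x86-64 layout of 4KB pages (12 offset bits) and 9 index bits per page table level, which gives 48 virtual address bits with four levels and 57 with five; the helper names are mine, not the kernel’s.

```python
PAGE_SHIFT = 12       # 4KB pages
BITS_PER_LEVEL = 9    # 512 entries per page table


def va_bits(levels):
    """Total virtual address bits covered by a page table of this depth."""
    return PAGE_SHIFT + BITS_PER_LEVEL * levels


def split_address(vaddr, levels):
    """Return ([top-level index, ..., lowest index], page offset)."""
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    indices = []
    for level in range(levels - 1, -1, -1):
        shift = PAGE_SHIFT + BITS_PER_LEVEL * level
        indices.append((vaddr >> shift) & ((1 << BITS_PER_LEVEL) - 1))
    return indices, offset


def is_canonical(vaddr, levels):
    """All bits from the top implemented bit up to bit 63 must be equal."""
    bits = va_bits(levels)
    top = vaddr >> (bits - 1)
    return top == 0 or top == (1 << (64 - bits + 1)) - 1


first_above_47 = 1 << 47
print(va_bits(4), va_bits(5))
print(is_canonical(first_above_47, 4), is_canonical(first_above_47, 5))
print(split_address(first_above_47, 5))
```

An address with bit 47 set and nothing above it is non-canonical with four levels but perfectly ordinary with five, which is why code that stashes tags in those “spare” top bits gets into trouble, and why the patch keeps allocations below 47 bits unless asked otherwise.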

Christopher Bostic posted version 5 of a patch series entitled “FSI driver implementation”. This is support for the POWER’s [Performance Optimization With Enhanced RISC, for those who ever wondered – this author used to have a lot of interest in PowerPC back in the day] “Flexible Support Interface” (FSI), a “high fan out serial bus” whose specification seems to have appeared on the OpenPower Foundation website recently also.

Kishon Vijay Abraham posted “PCI: Support for configurable PCI endpoint”, which Bjorn finally pulled into his tree in anticipation of the upcoming 4.12 merge cycle. For those who haven’t seen Kishon’s awesome presentation “Overview of PCI(e) Subsystem” for Embedded Linux Conference Europe, you are encouraged to watch it at least several times. He really knows his stuff, and has done an excellent job producing a high quality generic PCIe endpoint driver for Linux.

Ard Biesheuvel posted “EFI fixes for v4.11”, which among other goodies includes a fix for EFI GOP (Graphics Output Protocol) support on systems built using the 64-bit ARM Architecture, which uses firmware assignment of PCIe BAR resources. Ard and Alex Graf have done some really fun work with graphics cards on 64-bit ARM lately – including emulating x86 option ROMs. Ard also had some fixes prepared for v4.12 that he announced, including a bunch of cleanup to the handling of FDT (Flattened Device Tree) memory allocation. Finally, he added support for the kernel’s “quiet” command line option, to remove extraneous output from the EFI stub on boot.

Srikar Dronamraju and Michal Hocko had a back and forth on the former’s “sched: Fix numabalancing to work with isolated cpus” patch, which does what it says on the tin. Michal was a little concerned that NUMA balancing wasn’t automatically applied even to isolated CPUs, but others (including Peter Zijlstra) noted that this absolutely is the intended behavior.

Ying Huang (Intel) posted version 8 of his “THP swap: Delay splitting THP during swapping out”, which essentially allows paging of (certain) huge pages. He also posted version 2 of “mm, swap: Sort swap entries before free”, which sorts consecutive swap entries in a per-CPU buffer into order according to their backing swap device before freeing those entries. This reduces needless acquiring/releasing of locks and improves performance.
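Why sorting helps is perhaps easiest to see with a toy. The Python sketch below is not the kernel code: it assumes one lock per swap device, held across any run of consecutive entries belonging to that device, and simply counts how many times a lock would have to be taken. The device names and offsets are invented.

```python
from itertools import groupby


def count_lock_acquisitions(entries):
    """One acquisition per run of consecutive entries on the same device."""
    return sum(1 for _ in groupby(entries, key=lambda entry: entry[0]))


# (swap device, offset) pairs as they might sit in a per-CPU buffer.
entries = [("swapA", 7), ("swapB", 3), ("swapA", 2), ("swapB", 9), ("swapA", 5)]

print(count_lock_acquisitions(entries))
# Sorting by device groups each device's entries into a single run.
print(count_lock_acquisitions(sorted(entries, key=lambda entry: entry[0])))
```

With entries interleaved across devices, every entry forces a lock handoff; after sorting, each device’s lock is taken once.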

Will Deacon posted version 2 of a patch series entitled “drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension”. The “SPE” (Statistical Profiling Extension) “can be used to profile a population of operations in the CPU pipeline after instruction decode. These are either architected instructions (i.e. a dynamic instruction trace) or CPU-specific uops and the choice is fixed statically in the hardware and advertised to userspace via caps. Sampling is controlled using a sampling interval, similar to a regular PMU counter, but also with an optional random perturbation”. He notes that the “in-memory buffer is linear and virtually addressed, raising an interrupt when it fills up” [which makes using it nice for software folks].

Binoy Jayan posted “IV [Initialization Vector] Generation algorithms for dm-crypt”, the goal of which “is to move these algorithms from the dm layer to the kernel crypto layer by implementing them as template ciphers”.

Joerg Roedel posted “PCI: Add ATS-disable quirk for AMD Stoney GPUs”. Then, he posted a followup with a minor fix based upon feedback. This should close the issue of certain bug reports posted by those using an IOMMU on a Stoney platform and seeing lockups under high TLB invalidation.

Bjorn Helgaas posted “PCI fixes for v4.11”, which includes “fix ThunderX legacy firmware resources”, a PCI quirk for certain ARM server platforms.

Paul Menzel reported “`pci_apply_final_quirks()` taking half a second”, which David Woodhouse (who wrote the code to match PCIe devices against the quirk list “back in the mists of time”) posited was perhaps down to “spending a fair amount of time just attempting to match each device against the list”. He wondered “if it’s worth sorting the list by vendor ID or something, at least for the common case of the quirks which match on vendor/device”. There was a general consensus that cleanup would be nice, if only someone had the time and the inclination to take a poke at it.
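To illustrate David’s suggestion (and only to illustrate it, since nobody posted such a patch), here is a Python sketch comparing a linear scan of a quirk table against a lookup indexed by vendor ID. The quirk entries and IDs are invented, ANY_ID stands in for a wildcard match, and counting comparisons is a crude proxy for time.

```python
from collections import defaultdict

ANY_ID = None  # stand-in for a wildcard vendor/device match

# Hypothetical quirk table: (vendor, device, quirk name).
QUIRKS = [
    (0x8086, 0x0001, "quirk_a"),
    (0x1022, 0x0002, "quirk_b"),
    (0x8086, ANY_ID, "quirk_c"),
    (ANY_ID, ANY_ID, "quirk_d"),
    (0x10DE, 0x0003, "quirk_e"),
]


def entry_matches(entry, vendor, device):
    q_vendor, q_device, _ = entry
    return q_vendor in (ANY_ID, vendor) and q_device in (ANY_ID, device)


def match_linear(quirks, vendor, device):
    """Compare the device against every entry in the table."""
    names = [e[2] for e in quirks if entry_matches(e, vendor, device)]
    return names, len(quirks)


def build_index(quirks):
    """Group entries by vendor; wildcard-vendor entries are kept separately."""
    by_vendor, wildcard = defaultdict(list), []
    for entry in quirks:
        (wildcard if entry[0] is ANY_ID else by_vendor[entry[0]]).append(entry)
    return by_vendor, wildcard


def match_indexed(index, vendor, device):
    """Only look at entries for this vendor, plus the wildcard entries."""
    by_vendor, wildcard = index
    candidates = by_vendor.get(vendor, []) + wildcard
    names = [e[2] for e in candidates if entry_matches(e, vendor, device)]
    return names, len(candidates)


print(match_linear(QUIRKS, 0x8086, 0x0001))
print(match_indexed(build_index(QUIRKS), 0x8086, 0x0001))
```

Both approaches find the same quirks; the indexed one just skips entries for vendors that could never match. Whether that is where the half second actually goes is another question.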

Seth Forshee (Canonical) posted “audit regressions in 4.11”, in which he noted that ever since the merging of “audit: fix auditd/kernel connection state tracking”, the kernel will now queue up audit messages indefinitely for delivery to the (userspace) audit daemon if it is not running – ultimately crashing the machine. Paul Moore thanked him for the report and there was a back and forth on the best way to handle the case of no audit daemon running.

Neil Brown posted a patch entitled “NFS: fix usage of mempools”. As he notes in his patch, “When passed GFP [Get Free Page] flags that allow sleeping (such as GFP_NOIO), mempool_alloc() will never return NULL, it will wait until memory is available…This means that we don’t need to handle failure, but that we do need to ensure one thread doesn’t call mempool_alloc twice on the one pool without queuing or freeing the first allocation”. He then cites “pnfs_generic_alloc_ds_commits” as an unsafe function and provides a fix.

Finally, Kees Cook followed up (as he had promised) on a discussion from last week, with an RFC (Request for Comments) patch series entitled “mm: Tighten x86 /dev/mem with zeroing”, including the suggestion from Linus that reads from /dev/mem that aren’t permitted simply return zero data. This was just one of many security discussions he was involved in (as usual). Another concerned a patch that he had suggested, posted by Eddie Kovsky and entitled “module: verify address is read-only”, which modifies kernel functions that use modules to verify that they are in the correct kernel ro_after_init memory area and “reject structures not marked ro_after_init”.

Linux Kernel Podcast for 2017/04/04


Linus Torvalds announces Linux 4.11-rc5, Donald Drumpf drains the maintainer swamp in April, Intel FPGA Device Drivers, FPU state caching, /dev/mem access crashing machines, and assorted ongoing development.

Linus Torvalds announced Linux 4.11-rc5. In his announcement mail, Linus notes that “things have definitely started to calm down, let’s hope it stays this way and it wasn’t just a fluke this week”. He calls out the oddity that “half the arch updates are to parisc” due to parisc user copy fixes.

It’s worth noting that rc5 includes a fix for virtio_pci which removes an “out of bounds access for msix_names” (the “name strings for interrupts” provided in the virtio_pci_device structure). According to Jason Wang (Red Hat), “Fedora has received multiple reports of crashes when running 4.11 as a guest” (in fact, your author has seen this one too). Quoting Jason, “The crashes are not always consistent but they are generally some flavor of oops or GPF [General Protection Fault – Intel x86 term referring to the general case of an access violation into memory by an offending instruction in various different ISAs – Instruction Set Architectures] in virtio related code. Multiple people have done bisections (Thank you Thorsten Leemhuis and Richard W.M. Jones)”. An example rediscovery of this issue came from a Mellanox engineer who reported that their test and regression VMs were crashing occasionally with 4.11 kernels.


Sebastian Andrzej Siewior announced preempt-rt Linux version 4.9.20-rt16. This includes a “Re-write of the R/W semaphores code. In RT we did not allow multiple readers because a writer blocking on the semaphore would have [to] deal with all the readers in terms of priority or budget inheritance [by which he is referring to the Priority Inheritance or “PI” feature common to “real time” kernels]. It’s obvious that the single reader restriction has severe performance problems for situations with heavy reader contention.” He notes that CPU hotplug got “better but can deadlock”.

Greg Kroah-Hartman posted Linux stable kernels 4.4.59, 4.9.20, and 4.10.8.

Draining the Swamp (in April)

Donald Drumpf posted “MAINTAINERS: Drain the swamp”, an inspired patch aiming to finally address the problem of having “a small group of elites listed in the corrupt MAINTAINERS file” who, “For too long” have “reaped the rewards of maintainership”. He notes that over the past year the world has seen a great Linux Exit (“Lexit”) movement in which “People all of the Internet have come together and demanded that power be restored to the developers”, creating “a historic fork based on Linux 2.4, back to a better time, before Linux was controlled by corporate interests”. He notes that the “FAKE NEWS site said it wouldn’t happen, but we knew better”.

Donald says that all of the groundwork laid over the past year was just an “important first step”. And that “now, we are taking back what’s rightfully ours. We are transferring power from “Lyin’ Linus” and giving it back to you, the people. With the below patch, the job-killing MAINTAINERS file is finally being ROLLED BACK.” He also notes his intention to return “LAW and ORDER” to the Linux kernel repository by building a wall around and “THE LINUX FOUNDATION IS GOING TO PAY FOR IT”. Additional changes will include the repeal and replacement of the “bloated merge window”, the introduction of a distribution import tax, and other key innovations that will serve to improve the world and to MAKE LINUX GREAT AGAIN!

Everyone around the world immediately and enthusiastically leaped upon this inspired and life altering patch, which was of course perfect from the moment of its inception. It was then immediately merged without so much as a dissenting voice (or any review). The private email servers used to host Linus’s deleted patch emails were investigated and a special administrator appointed to investigate the investigators.

Intel FPGA Device Drivers

Wu Hao (Intel) posted a sixteen part patch series entitled “Intel FPGA Drivers”, which “provides interfaces for userspace applications to configure, enumerate, open, and access FPGA [Field Programmable Gate Arrays, flexible logic fabrics containing millions of gates that can be connected programmatically by bitstreams describing the intended configuration] accelerators on platforms equipped with Intel(R) FPGA solutions and enables system level management functions such as FPGA partial reconfiguration [the dynamic updating of partial regions of the FPGA fabric with new logic], power management, and virtualization”. This support differs from the existing in-kernel fpga-mgr from Alan Tull in that it seems to relate to the so-called Xeon-FPGA hybrid designs that Intel have presented on in various forums.

The first patch (01/16) provides a lengthy summary of their proposed design in the form of documentation that is added to the kernel’s Documentation directory, specifically in the file Documentation/fpga/intel-fpga.txt. It notes that “From the OS’s point of view, the FPGA hardware appears as a regular PCIe device. The FPGA device memory is organized using a predefined structure [Device Feature List]. Features supported by the particular FPGA device are exposed through these data structures”. An FME (FPGA Management Engine) is provided which “performs power and thermal management, error reporting, reconfiguration, performance reporting, and other infrastructure functions. Each FPGA has one FME, which is always access[ed] through the physical function (PF)”. The FPGA also provides a series of Virtual Functions that can be individually mapped into virtual machines using SR-IOV.

This design allows a CPU attached using PCIe to communicate with various Accelerated Function Units (AFUs) contained within the FPGA, and which are individually assignable into VMs or used in aggregate by the host CPU. One presumes that a series of userspace management utilities will follow this posting. It’s actually quite nice to see how they implemented the discovery of individual AFU features, since this is very close to something a certain author has proposed for use elsewhere for similar purposes. It’s always nicely validating to see different groups having similar thoughts.

Copy Offload with Peer-to-Peer PCI Memory

Logan Gunthorpe posted an RFC (Request for Comments) patch series entitled “Copy Offload with Peer-to-Peer PCI Memory” which relates to work discussed at the recent LSF/MM (Linux Storage Filesystem and Memory Management) conference, in Cambridge MA (side note: I did find some of you haha!). To quote Logan, “The concept here is to use memory that’s exposed on a PCI BAR [Base Address Register – a configuration register that tells the device where in the physical memory map of a system to place memory owned by the device, under the control of the Operating System or the platform firmware, or both] as data buffers in the NVMe target code such that data can be transferred from an RDMA NIC to the special memory and then directly to an NVMe device avoiding system memory entirely”. He notes a number of positives from this, including better QoS (Quality of Service), and a need for fewer (relatively still quite precious even in 2017) PCIe lanes from the CPU into a PCIe switch placed downstream of its Root Complex on which peer-to-peer PCIe devices talk to one another without the intervening step of hopping through the Root Complex and into the system memory via the CPU. As a consequence, Logan has focused his work on “cases where the NIC, NVMe devices and memory are all behind the same PCI switch”.

To facilitate this new feature, Logan has a second patch in the series, entitled “Introduce Peer-to-Peer memory (p2mem) device”, which supports partitioning and management of memory used in direct peer-to-peer transfers between two PCIe devices (endpoints, or “cards”) with a BAR that “points to regular memory”. As Logan notes, “Depending on hardware, this may reduce the bandwidth of the transfer but could significantly reduce pressure on system memory” (again by not hopping up through the PCIe topology). In his patch, Logan had also noted that “older PCI root complexes” might have problems with peer-to-peer memory operations, so he had decided to limit the feature to be only available for devices behind the same PCIe switch. This led to a back and forth with Sinan Kaya who asked (rhetorically) “What is so special about being connected to the same switch?”. Sinan noted that there are plenty of ways in Linux to handle blacklisting known older bad hardware and platforms, such as requiring that the DMI/SMBIOS-provided BIOS date of manufacture of the system be greater than a certain date in combination with all devices exposing the p2p capability and a fallback blacklist. Ultimately, however, it was discovered that the peer-to-peer feature isn’t enabled by default, leading Sinan to suggest “Push the decision all the way to the user. Let them decide whether they want this feature to work on a root port connected port or under the switch”.

FPU state caching

Kees Cook (Google) posted a patch entitled “x86/fpu: move FPU state into separate cache”, which aims to remove the dependency within the Intel x86 Architecture port upon an internal kernel config setting known as ARCH_WANTS_DYNAMIC_TASK_STRUCT. This configuration setting (set by each architecture’s code automatically, not by the person building the kernel in the configuration file) says that the true size of the task_struct cannot be known in advance on Intel x86 Architecture because it contains a variable sized array (VSA) within the thread_struct that is at the end of the task_struct to support context save/restore of the CPU’s FPU (Floating Point Unit) co-processor. Indeed, the kernel definition of task_struct (see include/linux/sched.h) includes a scary and ominous warning “on x86, ‘thread_struct’ contains a variable-sized structure. It *MUST* be at the end of ‘task_struct’”. Which is fairly explicit.

The reason to remove the dependency upon dynamic task_struct sizing is because this “support[s] future structure layout randomization of the task_struct”, which requires that “none of the structure fields are allowed to have a specific position or a dynamic size”. The idea is to leverage a GCC (GNU Compiler Collection) plugin that will change the ordering of C structure members (such as task_struct) randomly at compile time, in order to reduce the ability for an attacker to guess the layout of the structure (highly useful in various exploits). In the case of distribution kernels of course, an attacker has access to the same kernel binaries that may be running on a system, and could use those to calculate likely structure layout for use in a compromise. But the same is not true of the big hyperscale service providers like Google and Facebook. They don’t have to publish the binaries for their own internal kernels running on their public infrastructure servers.

This patch led to a back and forth with Linus, who was concerned about why the task_struct would need changing in order to prevent the GCC struct layout randomization plugin from blowing up. In particular, he was worried that it sounded like the plugin was moving variable sized arrays from the last member of structures (not legally permitted). Kees, Linus, and Andy Lutomirski went through the fact that, yes, the plugin can handle trailing VSAs and so forth. In the end, it was suggested that Kees look at making task_struct “be something that contains a fixed beginning and end, and just have an unnamed randomized part in the middle”. Kees said “That could work. I’ll play around with it”.
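The “fixed beginning and end, randomized middle” idea can be pictured with a few lines of Python. This is purely a sketch of the concept, not how the GCC plugin works: the field names are illustrative picks loosely modelled on task_struct, and a seeded shuffle stands in for the per-build randomization.

```python
import random


def randomized_layout(first, middle, last, seed):
    """Keep the first and last fields pinned; shuffle only the middle ones."""
    rng = random.Random(seed)  # a per-build seed, assumed for illustration
    shuffled = list(middle)
    rng.shuffle(shuffled)
    return list(first) + shuffled + list(last)


# Illustrative field names only.
first = ["thread_info"]
middle = ["state", "stack", "usage", "flags", "prio", "mm", "pid", "cred"]
last = ["thread"]  # holds the variable-sized FPU state, so it must stay last

for build_seed in (1, 2):
    print(randomized_layout(first, middle, last, build_seed))
```

Whatever the seed, the variable-sized member never moves from the end, which is the property that had Linus worried.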

/dev/mem access crashing machines

Dave Jones (x86info maintainer) had a back and forth with Kees Cook, Linus, and Tommi Rantala about the latter’s discovery that running Dave’s “x86info” tool crashed his machine with an illegal memory access. It turns out that x86info reads /dev/mem (a requirement to get the data it needs), which is a special file representing the contents of physical memory. Normally, when access is granted to this file, it is restricted to the root user, and then only certain parts of memory as determined by STRICT_DEVMEM. The latter is intended only to allow reads of “reserved RAM” (normal system memory reserved for specific device purposes, not that allocated for use by programs). But in Tommi’s case, he was running a kernel that didn’t have STRICT_DEVMEM set on a system booting with EFI for which the legacy “EBDA” (Extended BIOS Data Area) that normally lives at a fixed location in the sub-1MB memory window on x86 was not provided by the platform. This meant that the x86info tool was trying to read memory that was a legal address but which wasn’t reserved in the EFI System Table (memory map), and was mapped for use elsewhere.

All of this led Linus to point out that simply doing a “dd” read on the first MB of the memory on the offending system would be enough to crash it. He noted that (on x86 systems) the kernel allows access to the sub-1MB region of physical memory unconditionally (regardless of the setting of the kernel STRICT_DEVMEM option) because of the wealth of platform data that lives there and which is expected to be read by various tools. He proposed effectively changing the logic for this region such that memory not explicitly marked as reserved would simply “just read zero” rather than trying to read random kernel data in the case that the memory is used for other purposes.
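Here is a toy Python model of the “just read zero” proposal, to show the intended behaviour rather than the implementation. Physical memory is a bytearray, the reserved ranges are an invented list of (start, end) pairs, and read_low_memory is a made-up name; the real change lives in the kernel’s /dev/mem read path.

```python
def read_low_memory(memory, reserved, start, length):
    """Return real bytes for reserved ranges and zeros for everything else."""
    out = bytearray(length)  # starts out as all zeros
    for i in range(length):
        addr = start + i
        if any(lo <= addr < hi for lo, hi in reserved):
            out[i] = memory[addr]
    return bytes(out)


# 16 bytes of pretend physical memory; only [4, 8) is marked reserved.
memory = bytearray(range(1, 17))
reserved = [(4, 8)]

print(read_low_memory(memory, reserved, 0, 12))
```

A tool scanning the region still gets a full-length read back, it just sees zeros where it has no business looking, instead of touching memory that is in use for something else.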

This author certainly welcomes a day when /dev/mem dies a death. We’ve gone to great lengths on 64-bit ARM systems to kill it, in part because it is so legacy, but in another part because there are two possible ways we might trap a bad access – one as in this case (synchronous exception) but another in which the access might manifest as a System Error due to hitting in the memory controller or other SoC logic later as an errant access.

Ongoing Development

Steve Longerbeam posted version 6 of a patch series entitled “i.MX Media Driver”, which implements a V4L2 (Video for Linux 2) driver for i.MX6.

David Gstir (on behalf of Daniel Walter) posted “fscrypt: Add support for AES-128-CBC” which “adds support for using AES-128-CBC for file contents and AES-128-CBC-CTS for file name encryption. To mitigate watermarking attacks, IVs [Initialization Vectors] are generated using the ESSIV algorithm.”

Djalal Harouni posted an RFC (Request for Comments) patch entitled “proc: support multiple separate proc instances per pidnamespace”. In his patch, Djalal notes that “Historically procfs was tied to pid namespaces, and mount options were propagated to all other procfs instances in the same pid namespace. This solved several use cases in that time. However today we face new problems, there are multiple container implementations there, some of them want to hide pid entries, others want to hide non-pid entries, others want to have sysctlfs, others want to share pid namespace with private procfs mounts. All these with current implementation won’t work since all options will be propagated to all procfs mounts. This series allow to have new instances of procfs per pid namespace where each instance can have its own mount option”.

Zhou Chengming (Huawei) posted “reduce the time of finding symbols for module” which aims to reduce the time taken for the Kernel Live Patch (klp) module to be loaded on a system in which the module uses many static local variables. The patch replaces the use of kallsyms_on_each_symbol with a variant that limits the search to those needed for the module (rather than every symbol in the kernel). As Jessica Yu notes, “it means that you have a lot of relocation records with reference your out-of-tree module. Then for each such entry klp_resolve_symbol() is called and then klp_find_object_symbol() to actually resolve it. So if you have 20k entries, you walk through vmlinux kallsyms table 20k times…But if there were 20k modules loaded, the problem would still be there”. She would like to see a more generic fix, but was also interested to see that the Huawei report referenced live patching support for AArch64 (64-bit ARM Architecture), which isn’t in upstream. She had a number of questions about whether this code was public, and in what form, to which links to works in progress from several years ago were posted. It appears that Huawei have been maintaining an internal version of these in their kernels ever since.

Ying Huang (Intel) posted version 7 of “THP swap: Delay splitting THP during swapping out”, which as we previously noted aims to swap out actual whole “huge” (within certain limits) pages rather than splitting them down to the smallest atom of size supported by the architecture during swap. There was a specific request to various maintainers that they review the patch.

Andi Kleen posted a patch removing the printing of MCEs to the kernel log when the “mcelog” daemon is running (and hopefully logging these events).

Laura Abbott posted a RESEND of “config: Add Fedora config fragments”, which does what it says on the tin. Quoting her mail, “Fedora is a popular distribution for people who like to build their own kernels. To make this easier, add a set of reasonable common config options for Fedora”. She adds files in kernel/configs for “fedora-core.config”, “fedora-fs.config” and “fedora-networking.config” which should prove very useful next time someone complains at me that “building kernels for Red Hat distributions is hard”.

Eric Biggers posted “KEYS: encrypted: avoid encrypting/decrypting stack buffers”, which notes that “Since [Linux] v4.9, the crypto API cannot (normally) be used to encrypt/decrypt stack buffers because the stack may be virtually mapped. Fix this for the padding buffers in encrypted-keys by using ZERO_PAGE for the encryption padding and by allocating a temporary heap buffer for the decryption padding.” Eric is referring to the virtually mapped stack support introduced by Andy Lutomirski which has the side effect of incidentally flagging up various previous misuse of stacks.

Mark Rutland posted an RFC (Request For Comments) patch series entitled “ARMv8.3 pointer authentication userspace support”. ARMv8.3 includes a new architectural extension that “adds functionality to detect modification of pointer values, mitigating certain classes of attack such as stack smashing, and making return oriented [ROP] programming attacks harder”. [aside: If you’re bored, and want some really interesting (well, I think so) bedtime reading, and you haven’t already read all about ROP, you really should do so]. Continuing to quote Mark, the “extension introduces the concept of a pointer authentication code (PAC), which is stored in some upper bits of pointers. Each PAC is derived from the original pointer, another 64-bit value (e.g. the stack pointer), and a secret 128-bit key”. The extension includes new instructions to “insert a PAC into a pointer”, to “strip a PAC from a pointer”, and to “authenticate and strip a PAC from a pointer” (which has the side effect of poisoning the pointer and causing a later fault if the authentication fails – allowing for detection of malicious intent).

Mark’s patch makes for great reading and summarizes this feature well. It notes that it has various counterparts in userspace to add ELF (Executable and Linking Format, the executable container used on modern Linux and Unix systems) notes sections to programs to provide the necessary annotations and presumably other data necessary to implement pointer authentication in application programs. It will be great to see those posted too.
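To make the three operations concrete, here is a minimal sketch of the idea in Python. Everything in it is an assumption for illustration: the 48-bit address width, the use of HMAC-SHA256 as a stand-in for whatever the hardware actually computes, the choice of poison bit, and the function names. It shows only the shape of insert, strip, and authenticate, not the real ARMv8.3 behaviour.

```python
import hashlib
import hmac

VA_BITS = 48                     # assumed: low 48 bits carry the address
PAC_BITS = 64 - VA_BITS          # the remaining upper bits hold the PAC
ADDR_MASK = (1 << VA_BITS) - 1
POISON_BIT = 1 << 62             # assumed marker for a failed check


def compute_pac(ptr, modifier, key):
    """Stand-in MAC over (pointer, modifier) under a secret 128-bit key."""
    msg = (ptr & ADDR_MASK).to_bytes(8, "little")
    msg += (modifier & (2**64 - 1)).to_bytes(8, "little")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "little") & ((1 << PAC_BITS) - 1)


def sign(ptr, modifier, key):
    """Insert a PAC into the upper bits of a pointer."""
    return (compute_pac(ptr, modifier, key) << VA_BITS) | (ptr & ADDR_MASK)


def strip(signed_ptr):
    """Strip a PAC from a pointer without checking it."""
    return signed_ptr & ADDR_MASK


def authenticate(signed_ptr, modifier, key):
    """Authenticate and strip: return the plain pointer, or a poisoned one."""
    ptr = strip(signed_ptr)
    if (signed_ptr >> VA_BITS) == compute_pac(ptr, modifier, key):
        return ptr
    return ptr | POISON_BIT      # later use of this value should fault
```

A return address signed against the stack pointer at function entry would authenticate cleanly on return, while one whose upper bits were overwritten comes back poisoned rather than silently usable.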

Joerg Roedel followed up to a posting from Samuel Sieb entitled “AMD IOMMU causing filesystem corruption” to note that it has recently been discovered (and was documented in another thread this past week entitled “PCI: Blacklist AMD Stoney GPU devices for ATS”) that the AMD “Stoney” platform features a GPU for which PCI-ATS is known to be broken. ATS (Address Translation Services) is the mechanism by which PCIe endpoint devices (such as plugin adapter cards, including AMD GPUs) may obtain virtual to physical address translations for use in inbound DMA operations initiated by a PCIe device into a virtual machine (VM’s) memory (the VM talks the other way through the CPU MMU).

In ATS, the device utilizes an Address Translation Cache (ATC) which is essentially a TLB (Translation Lookaside Buffer) but not called that because of handwavy reasons intended not to confuse CPU and non-CPU TLBs. When a device sitting behind an IOMMU needs to perform an address translation, it asks a Translation Agent (TA) typically contained within the PCIe Root Complex to which it is ultimately attached. In the case of AMD’s Stoney Platform, this blows up under address invalidation load: “the GPU does not reply to invalidations anymore, causing Completion-wait loop timeouts on the AMD IOMMU driver side”. Somehow (but this isn’t clear) this is suspected as the possible cause of the filesystem corruption seen by Samuel, who is waiting to rebuild a system that ate its disk testing this.

Calvin Owens (Facebook) posted “printk: Introduce per-console filtering of messages by loglevel”, which notes that “Not all consoles are created equal”. It essentially allows the user to set a different loglevel for consoles that might each be capable of very different performance. For example, a serial console might be severely limited in its baud rate (115,200 in many cases, but perhaps as low as 9,600 or lower is still commonplace in 2017), while a graphics console might be capable of much higher. Calvin mentions netconsole as the preferred (higher speed) console that Facebook use to “monitor our fleet” but that “we still have serial consoles attached on each host for live debugging, and the latter has caused problems”. He doesn’t specifically mention USB debug consoles, or the EFI console, but one assumes that listeners are possibly aware of the many console types.
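As a rough illustration of per-console filtering, the sketch below gives each console its own threshold and delivers a message only to the consoles that admit it. The class, field, and console names are hypothetical and are not taken from Calvin’s patch; the only thing borrowed from the kernel is the convention that lower loglevel numbers are more severe.

```python
from dataclasses import dataclass, field


@dataclass
class Console:
    name: str
    maxlevel: int                      # hypothetical per-console threshold
    lines: list = field(default_factory=list)

    def write(self, msg):
        self.lines.append(msg)


def emit(consoles, level, msg):
    """Deliver msg to every console whose threshold admits this level.

    Lower numbers are more severe, so a message is shown when its level
    is strictly below the console's threshold.
    """
    for con in consoles:
        if level < con.maxlevel:
            con.write(msg)


# A slow serial console only sees warnings and worse; a fast network
# console sees everything up to and including debug messages.
serial = Console("slow-serial", maxlevel=4)
network = Console("fast-network", maxlevel=8)
emit([serial, network], 6, "info: link is up")
emit([serial, network], 3, "error: disk failure")
```

The slow console ends up with only the error, while the fast one keeps the full stream.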

Christopher Bostic (IBM) posted version 5 of a patch series entitled “FSI device driver implementation”. FSI stands for “Flexible Support Interface”, a “high fan out [a term referring to splitting of digital signals into many additional outputs] serial bus consisting of a clock and a serial data line capable of running at speeds up to 166MHz”. His patches add core support to the Linux bus and device models (including “probing and discovery of slaves and slave engines”), along with additional handling for CFAM (Common Field Replaceable Unit Access Macro) – an ASIC (chip) “residing in any device requiring FSI communications” that provides these various “engines”, and an FSI engine driver that manages devices on the FSI bus.

Finally, Adam Borowski posted “n_tty: don’t mangle tty codes in OLCUC mode” which aims to correct a bug which is “reproducible as of Linux 0.11” and all the way back to 0.01. OLCUC is not part of POSIX, but this termios structure flag tells Linux to map lowercase characters to uppercase ones. The posting cites an obvious desire by Linus to support “Great Runes” (archaic Operating Systems in which everything was uppercase), to which Linus (obviously in jest, and in keeping with the April 1 date) asked Adam why he “didn’t make this the default state of a tty?”.

Linux Kernel Podcast for 2017/03/28


Author’s Note: Apologies to Ulrich Drepper for incorrectly attributing his paper “Futexes are Tricky” to Rusty. Oops. In any case, everyone should probably read Uli’s paper:

In this week’s edition: Linus Torvalds announces Linux 4.11-rc4, early debug with USB3 earlycon, upcoming support for USB-C in 4.12, and ongoing development including various work on boot time speed ups, logging, futexes, and IOMMUs.

Linus Torvalds announced Linux 4.11-rc4, noting that “So last week, I said that I was hoping that rc3 was the point where we’d start to shrink the rc’s, and yes, rc4 is smaller than rc3. By a tiny tiny smidgen. It does touch a few more files, but it has a couple fewer commits, and fewer lines changed overall. But on the whole the two are almost identical in size. Which isn’t actually all that bad, considering that rc4 has both a networking merge and the usual driver suspects from Greg [Kroah Hartman], _and_ some drm fixes”.


Junio C Hamano announced Git v2.12.2.

Greg Kroah-Hartman announced Linux 4.4.57, 4.9.18, and 4.10.6.

Sebastian Andrzej Siewior announced Linux v4.9.18-rt14, which includes a “larger rework of the futex / rtmutex code. In v4.8-rt1 we added a workaround so we don’t de-boost too early in the unlock path. A small window remained in which the locking thread could de-boost the unlocking thread. This rework by Peter Zijlstra fixes the issue.”

Upcoming features

Greg K-H finally accepted the latest “USB Type-C Connector class” patch series from Heikki Krogerus (Intel). This patch series aims to provide various control over the capability for USB-C to be used both as a power source and as a delivery interface to supply power to external devices (enabling the oft-cited use case of selecting between charging your cellphone/mobile device or using said device to charge your laptop). This will land a new generic management framework exposed to userspace in Linux 4.12, including a driver for “Intel Whiskey Cove PMIC [Power Management IC] USB Type-C PHY”. Your author looks forward to playing. Greg thanked Heikki for the 18(!) iterations this patch went through prior to being merged – not quite a record, but a lot of effort!

Kishon Vijay Abraham (TI) posted “PCI: Support for configurable PCI endpoint”, which provides generic infrastructure to handle PCI endpoint devices (Linux operating as a PCI endpoint “device”), such as those based upon IP blocks from DesignWare (DW). He’s only tested the design on his “dra7xx” boards and requires “the help of others to test the platforms they have access to”. The driver adds a configfs interface including an entry to which userspace should write “start” to bring up an endpoint device. He adds himself as the maintainer for this new kernel feature.

Rob Herring posted “dtc updates for 4.12”, which “syncs dtc [Device Tree Compiler] with current mainline [dtc]”. His “primary motivation is to pull in the new checks [he’s] worked on. This gives lots of new warnings which are turned off by default”.

60Hz vs 59.94Hz (Handling of reduced FPS in V4L2)

Jose Abreu (Synopsys) posted a patch series entitled “Handling of reduced FPS in V4L2”, which aims to provide a mechanism for the kernel to measure (in a generic way) the actual Frames Per Second for a Video For Linux (V4L) video device. The patches rely upon hardware drivers being able to signal that they can distinguish “between regular fps and 1000/1001 fps”.

This took your author on a journey of discovery. It turns out that (most of the time), when a video device claims to be “60fps” it’s actually running at 59.94fps, but not always. The latter frame rate is an artifact of the NTSC (National Television System Committee) color television standard in the United States. Early televisions used the 60Hz frequency (which is nationally synchronized, at least in each of the traditional three independent grids operated in the US, which are now interconnected using HVDC interconnects but presumably are still not directly in phase with one another – feel free to educate me!) of the AC supply to lock individual frame scan times. When color TV was introduced, a small frequency offset was used to make room in each frame for a color sub-carrier signal while retaining backward compatibility for black and white transmissions. This is where frequencies of 29.97 and 59.94 frames per second originate. In case you always wondered.
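The arithmetic behind the “1000/1001” rates is short enough to show. This is just a worked calculation; the classify() helper is a hypothetical illustration of telling the two rates apart from a measurement, and is not how the patch series does it.

```python
from fractions import Fraction

NTSC_FACTOR = Fraction(1000, 1001)


def reduced_rate(nominal_fps):
    """Return the 1000/1001 variant of a nominal frame rate."""
    return Fraction(nominal_fps) * NTSC_FACTOR


def classify(measured_fps, nominal_fps):
    """Guess which of the two candidate rates a measurement is closer to."""
    full = Fraction(nominal_fps)
    reduced = reduced_rate(nominal_fps)
    measured = Fraction(measured_fps).limit_denominator(10**6)
    if abs(measured - reduced) < abs(measured - full):
        return "reduced"
    return "regular"


for nominal in (24, 30, 60):
    print(nominal, "->", round(float(reduced_rate(nominal)), 3))
```

The two candidates differ by only about 0.1%, which is why the discussion that follows turns on how accurately real hardware can measure the pixel clock.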

Jose and Hans Verkuil had a back and forth discussion about various real-world measured pixelclock frequencies that they had obtained using a variety of equipment (signal analyzers, certified HDMI analyzer, and the Synopsys IP supported by the patch series under discussion) to see whether it was in reality possible to reliably distinguish frame rates.

Early Debug with USB3 earlycon (early printk)

Lu Baolu (Intel) posted version 8 of a patch series entitled “usb: early: add support for early printk through USB3 debug port”. Contemporary (especially x86) desktop and server class systems don’t expose low level hardware debug interfaces, such as JTAG debug chains, which are used during chip bringup and early firmware and OS enablement activities, and which allow developers with suitable tools to directly control and interrogate hardware state. Or just dump out the kernel ringbuffer (the dmesg “log”).

Actually, all such systems do have low level debug capabilities, they’re just fused out during the production process (by blowing efuses embedded into the processor) and either not exposed on the external pins of the chip at all, or are simply disabled in the chip logic. Probably most of these can be re-enabled by writing the magic cryptographically signed hashes to undocumented memory regions in on-chip coprocessor spaces. In any case, vendors such as Intel aren’t going to tell you how.

Yet it is often desirable to have certain low level debug functionality for systems that are deployed into field settings, even to reliably dump out the kernel console log DEBUG log level messages somewhere. Traditionally this was done using PC serial ports, but most desktop (and all laptop) systems no longer ship with those exposed on the rear panel. If you’re lucky you’ll see an IDC10 connector on your motherboard to which you can attach a DB9 breakout cable. Consumers and end users have no idea what any of this means, and in the case that they don’t know what this means, they probably shouldn’t be encouraged to open the machine up and poke things. Yet even in the case that IDC10 connectors exist and can be hooked up, this is still a cumbersome interface that cannot be relied upon today.

Microsoft (who are often criticized but actually are full of many good ideas and usually help to drive industry standardization for the broader market) instituted sanity years ago by working with the USB Implementers Forum (USB-IF) to ensure that the USB3 specification included a standardized feature known as xHCI debug capability (DbC), an “optional but standalone functionality by an xHCI host controller”. This suited Windows, which traditionally requires two UARTs (serial ports) for kernel development, and uses one of them for simple direct control of the running kernel without going through complex driver frameworks. Debug port (which also existed on USB2) traditionally required a special external partner hardware dongle but is cleaner in USB3, requiring only a USB A-to-A crossover cable connecting USB3.0 data lines.

As Lu Baolu notes in his patch, “With DbC hardware initialized, the system will present a debug device through the USB3 debug port (normally the first USB3 port)”. The patch series enables this as a high speed console log target on Linux, but it could be used for much more interesting purposes via KDB.

[Separately, but only really related to console drivers and not debugging, Thierry Escande posted “firmware: google memconsole” which adds support for importing the boot time BIOS memory based console into the kernel ringbuffer on Google Coreboot systems].

Ongoing Development

Pavel Tatashin (Oracle) posted “parallelized “struct page” zeroing”, which improves boot time performance significantly in the case that the “deferred struct page initialization feature is enabled”. In this case, zeroing out of the kernel’s vmemmap (Virtual Memory Map) is delayed until after the secondary CPU cores on a machine have been started. When this is done, those cores can be used to run zeroing threads that write to memory, taking one SPARC system’s boot time down from 97.89 seconds to 46.91. Pavel notes that the savings are also considerable on x86 systems.

Thomas Gleixner had a lengthy back and forth with Pasha Tatashin (Oracle) over the latter’s posting of “Early boot time stamps for x86” which use the TSC (Time Stamp Counter) on Intel x86 Architecture. The goal is to log how long the machine actually took to boot, including firmware, rather than just how long Linux took to boot from the time it was started. Peter Zijlstra responded (to Pasha), “Lol, how cute. You assume TSC starts at 0 on reset” (alluding to the fact that firmware often does crazy things playing with the TSC offset or directly writing to it). Thomas was unimpressed with Pavel’s posting of a v2 patch series, noting “Did you actually read my last reply on V1 of this? I made it clear that the way this is done, i.e. hacking it into the earliest boot stage is not going to happen…I don’t care about you wasting your time, but I very much care about my time”. He provided a further more lengthy response, including various commentary on the best ways to handle feedback.

Peter Zijlstra posted version 6 of a patch series entitled “The arduous story of FUTEX_UNLOCK_PI” in which he adds “Another installment of the futex patches that give you nightmares”. Futexes (Fast User-space Mutexes) are a mechanism provided by the Linux kernel which leverage shared memory to provide a low overhead mutex (mutual exclusion primitive) to userspace in the case that such mutexes are uncontended (no conflicts between processes – tasks within the kernel – exist trying to acquire the same resource) but with a “slow path” through the kernel in the case of contention. They are used by many userspace applications, including extensively in the C library (see the famous paper by Rusty Russell entitled “Futexes are Tricky”). Peter is working on solving problems introduced by having to have Priority Inheritance (PI) aware futexes in Real Time kernels. These adjust priority of the associated tasks holding mutexes for short periods in order to prevent Priority Inversion (see Mars Pathfinder study papers) in which a low priority task holds a mutex that a high priority task wants to acquire. Peter’s patches “rework[] and document[] the locking” of existing code.
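Priority inheritance itself can be shown with a very small toy model. The sketch below models only the boosting and de-boosting of the lock owner; it does not model the futex word, the userspace fast path, or any real kernel locking, and the Task and PiMutex names are hypothetical. Note that in this sketch a higher number means a more urgent task, which is a convention chosen here for readability.

```python
class Task:
    def __init__(self, name, prio):
        self.name = name
        self.base_prio = prio    # in this sketch, higher means more urgent
        self.prio = prio         # effective priority, possibly boosted


class PiMutex:
    def __init__(self):
        self.owner = None
        self.waiters = []

    def _boost_owner(self):
        # The owner runs at least as urgently as its most urgent waiter.
        top = max((t.prio for t in self.waiters), default=self.owner.base_prio)
        self.owner.prio = max(self.owner.base_prio, top)

    def lock(self, task):
        if self.owner is None:           # uncontended: just take the lock
            self.owner = task
            return True
        self.waiters.append(task)        # contended: queue up, boost owner
        self._boost_owner()
        return False

    def unlock(self):
        self.owner.prio = self.owner.base_prio   # de-boost the old owner
        if not self.waiters:
            self.owner = None
            return None
        self.owner = max(self.waiters, key=lambda t: t.prio)
        self.waiters.remove(self.owner)
        self._boost_owner()
        return self.owner
```

While a high priority task waits, the low priority owner temporarily runs at the waiter’s priority, so a medium priority task cannot starve it and indirectly block the high priority one. The ordering of the de-boost relative to the handoff is exactly the kind of detail the real patches have to get right.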

Separately, Waiman Long (Red Hat) posted version 6 of “futex: Introducing throughput-optimized (TP) futexes” which “introduces a new futex implementation called throughput-optimized (TP) futexes. It is similar to PI futexes in its calling convention, but provides better throughput than the wait-wake (WW) futexes by encouraging lock stealing and optimistic spinning. The new TP futexes can be used in implementing both userspace mutexes and rwlocks. They provide[] better performance while simplifying the userspace locking implementation at the same time. The WW futexes are still needed to implement other synchronization primitives like conditional variables and semaphores that cannot be handled by the TP futexes”.

David Woodhouse posted “PCI resource mmap cleanup” which aims to clean up the use of various kernel interfaces that provide “user visible” resource addresses through (legacy) proc and (contemporary) sysfs. The purpose of these interfaces is to provide information about regions of PCI address space memory that can be directly mapped by userspace applications such as those linked against the DPDK (Data Plane Development Kit) library. An example of his cleanup included “Only allow WC [Write Combining] mmap on prefetchable resources” for the /proc/bus/pci mmap interface because this was the case for the preferred sysfs interface already. This led some to debate why the 64-bit ARM Architecture didn’t provide the legacy procfs interface (since there was a little confusion about the dependencies for DPDK) but ultimately re-concluded that it shouldn’t.

Tyler Baicar (Codeaurora) posted version 13 of a patch series entitled “Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64”, which aims to introduce support to the 64-bit ARM Architecture for logging of RAS events using the shared “GHES” (Generic Hardware Error Source) memory location “with the proper GHES structures to notify the OS of the error”. This dovetails nicely with platforms performing “firmware first” error handling in which errors are trapped to secure firmware which first handles them and subsequently informs the Operating System using this ACPI feature.

Shaohua Li (Facebook) posted a patch entitled “add an option to disable iommu force on” in the case of the (x86) Trusted Boot (TBOOT) feature being enabled. The reason cited was that under a certain 40GBit networking load XDP (eXpress Data Path) test there were high numbers of IOTLB (IO Translation Look Aside Buffer) misses “which kills the performance”. What he is referring to is the mechanism through which an IOMMU (which sits logically between a hardware device, such as a network card, and memory, often as part of an integrated PCI Root Complex) translates underlying memory accesses by the adapter card into real host memory transactions. These are cached by the IOMMU in small caches (known as IOTLBs) after it performs such translations using its “page tables” (similar to how a host CPU’s MMU – Memory Management Unit – performs host memory translations). Badly designed IOMMU implementations or poor utilization can result in large numbers of misses that result in users disabling the feature. Alas, without an IOMMU, there’s little protection during boot from rogue devices that maliciously want to trash host memory. Nobody has noted this in the RFC (Request For Comments) discussion, yet.
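A toy simulation shows why a small translation cache can fall off a cliff under load. The sketch below assumes 4KB pages, a fully associative cache with least-recently-used eviction, and a flat dictionary standing in for the IOMMU page tables; real IOTLBs differ in all of those respects, and the ToyIotlb name is hypothetical.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # assume 4KB pages


class ToyIotlb:
    def __init__(self, entries):
        self.entries = entries           # capacity, must be at least 1
        self.cache = OrderedDict()       # I/O page number -> host page number
        self.hits = 0
        self.misses = 0

    def translate(self, iova, page_table):
        page = iova >> PAGE_SHIFT
        if page in self.cache:
            self.hits += 1
            self.cache.move_to_end(page)         # mark as most recently used
        else:
            self.misses += 1                     # would need a page table walk
            self.cache[page] = page_table[page]
            if len(self.cache) > self.entries:
                self.cache.popitem(last=False)   # evict least recently used
        offset = iova & ((1 << PAGE_SHIFT) - 1)
        return (self.cache[page] << PAGE_SHIFT) | offset


def run(working_set_pages, rounds, entries=8):
    """Cycle through a working set and report (hits, misses)."""
    page_table = {n: n + 0x1000 for n in range(working_set_pages)}
    tlb = ToyIotlb(entries)
    for _ in range(rounds):
        for page in range(working_set_pages):
            tlb.translate(page << PAGE_SHIFT, page_table)
    return tlb.hits, tlb.misses
```

With eight entries, a device cycling over four buffers misses only on first touch, while one cycling over sixteen buffers misses on every single access, because each page is evicted just before it is needed again.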

Bodong Wang (Mellanox) posted a patch entitled “Add an option to probe VFs or not before enabling SR-IOV”, which aims to allow administrators to limit the probing of (PCIe) Virtual Functions (VFs) on adapters that will have those resources passed through to Virtual Machines (VMs) (using VFIO). This “can save host side resource usage by VF instances which would be eventually probed to VMs”. It adds a new sysfs interface to control this.

Viresh Kumar posted a patch entitled “cpufreq: Restore policy min/max limits on CPU online”. Apparently, existing code behavior was that “On CPU online the cpufreq core restores the previous governor [the in kernel logic that determines CPU frequency transitions based upon various metrics, such as saving energy, or prioritizing performance]…but it does not restore min/max limits at the same time”. The patch addresses this shortcoming.

Wanpeng Li posted a patch entitled “KVM: nVMX: Fix nested VPID vmx exec control” that aims to “hide and forbid” Virtual Processor IDentifiers in nested virtualization contexts where the hardware doesn’t support this. Apparently it was unconditionally being enabled (based upon real hardware realities of existing implementation) regardless of feature information (INVVPID) provided in the “vmx” capabilities.

Joerg Roedel posted a patch entitled “ACPI: Don’t create a platform_device for IOAPIC/IOxAPIC” since this was causing problems during hot remove (of CPUs). Rafael J. Wysocki noted that “it’s better to avoid using platform_device for hot-removable stuff” since it is “inherently fragile”.

Kees Cook (Google) posted a patch disabling hibernation support on 32-bit systems in the case that KASLR (Kernel Address Space Layout Randomization) was enabled at boot time, but allowing for “nokaslr” on the kernel command line to change this. Evgenii Shatokhin initially noted that “nokaslr” didn’t re-enable hibernation support correctly, but eventually determined that the ordering and placement of the “nokaslr” on the command line was to blame, which led to Kees saying he would look into the command line parsing sequence and interaction with other options, such as “resume=”.

Separately, Baoquan He (Red Hat) noted that with KASLR an implicit assumption that EFI_VA_START < EFI_VA_END existed, while “In fact [the] EFI [(Unified) Extensible Firmware Interface] region reserved for runtime services [these are callbacks into firmware from Linux] virtual mapping will be allocated using a top-down schema”. His patches addressed this problem, and being “RESEND”s, he was keen to see that they get taken up soon.

Also separately, Kees posted “syscalls: Restore address limit after a syscall” which “ensures a syscall does not return to user-mode with a kernel address limit. If that happened, a process can corrupt kernel-mode memory and elevate privileges”. He cites a bug it would have prevented.

Kan Liang (Intel) posted “measure SMI cost”. This patch series aims to leverage hardware counters to inform perf of the amount of time spent (on Intel x86 Architecture systems) inside System Management Mode (SMM). SMIs (System Management Interrupts) are events that are generated (usually) by Intel Platform Controller Hub and similar chipset logic which can be programmed by firmware to generate regular interrupts that target a secure execution context known as SMM (System Management Mode). It is here that firmware temporarily steals CPU cycles from the Operating System (without its knowledge) to perform such things as CPU fan control, errata handling, and wholesale VGA graphics emulation (in BMC “value add” from OEMs). Over the years, the amount of gunk hidden in SMIs has grown so much that this author even once wrote a latency detector (hwlat) and has a patent on SMI detection without using such dedicated counters…due to the impact of such on system performance. SMM is necessary on x86 due to its lack of a standardized on-SoC platform management controller, but so is accounting for bloat.

Finally, yes, Kirill A. Shutemov snuck in another iteration of his Intel “5-level paging support” in preparation for a 4.12 merge.


Linux Kernel Podcast for 2017/03/21


In this week’s kernel podcast: Linus Torvalds announces Linux 4.11-rc3, this week’s exciting installment of “5-level paging weekly”, the 2038 doomsday compliance “statx” system call, and heterogeneous memory management. Also a summary of all ongoing active kernel development toward 4.12 onwards.

Linus Torvalds announced Linux 4.11-rc3. In his announcement, Linus noted that “rc3 is larger than rc2, but this is hopefully the point where things start to shrink and calm down. We had a late typo in rc2 that affected arm and powerpc (the prep code for the 5-level page tables [on x86 systems]), and hopefully there are no similar brown-paper-bugs in rc3.”


Kent Overstreet announced the latest developments in Bcachefs, in a post entitled “Bcachefs – encryption, fsck, and more”. One of the key new features is that “We now have whole filesystem encryption – and this is modern authenticated encryption”. He notes that they can’t currently encrypt only part of the filesystem (as is the case, for example, with ext4 – as used on Android devices, and of course with Apple’s multi-layered iOS filesystem implementation) but “it’s more of a better dm-crypt” in removing the layers between the filesystem and the underlying hardware. He also notes that there’s a “New inode format”, and many other changes. Further details at:

Hongbo Wang (Intel) announced the 2016-Q4 release of XenGT and 2016-Q4 release of KVMGT. These are both “full GPU virtualization solution[s] with mediated pass-through”…of the hardware graphics resources into guest virtual machines. Further information is available from Intel’s github: (igvtg-xen for the Xen tree, and igvtg-kernel, and igvtg-qemu for the pieces needed for KVM support)

Julia Cartwright announced the Linux preempt-rt (Real Time) kernel version 4.1.39-rt47 stable kernel release.

Junio C Hamano announced Git v2.12.1. In his announcement, he noted that the tarballs “are NOT YET found at” the typical URL since “I am having trouble reaching there”. It’s unclear if this is due to recent changes in the architecture of and its mirroring, or a local issue.

Intel 5-level paging

In this week’s episode of “merging Intel 5-level paging support” the fun but unexpected plot twist resulting in a “will it merge or not” cliffhanger comes from Linus. Kirill A. Shutemov (Intel) has been diligently posting this series for some time, and if you recall from last week’s episode, the foundational pieces needed to land this in 4.12 were merged after the closure of the 4.11 merge window following a special request from Linus. Kirill has since posted “x86: 5-level paging enabling for v4.12, Part 1”. In response to a comment from Kirill that “Let’s see if I’m on the right track addressing Ingo’s [Molnar’s] feedback”, Linus stated, “Considering the bug we just had with the HAVE_GENERIC_RCU_GUP code, I’m wondering if people would be willing to look at what it would take to make x86 use the generic version?”, and “The x86 version of __get_user_pages_fast() seems to be quite similar to the generic one. And it would be lovely if all the main architectures shared the same core gup code”.

The Linux kernel implements a set of code functions for pinning of usermode (userspace) pages (the smallest granule size upon which contemporary hardware operates via a Memory Management Unit under the control of software provided and (co-)maintained “page tables”, and the size tracked by the Operating System in its page table management code) whenever they must be shared between userspace (which has dynamically pageable memory that can come and go as the kernel needs to free up RAM temporarily for other tasks by “paging” those pages out to “swap”) and code running within a kernel driver (the Linux kernel does not have pageable memory). GUP (get_user_pages) handles this operation, which takes a set of pointers to the individual pages that should be present and marked as in use. It has a variant usually referred to as “fast GUP” which aims to perform this operation without taking an expensive lock in the corresponding userspace process’s “mm” struct (an object that forms part of a task’s – the in-kernel term for a process – metadata, and linked from the corresponding task_struct). Fast GUP doesn’t always work, but when it doesn’t need to fall back to an expensive slow path, it can save considerable time. So Linus was expressing a desire for x86 to share the same generic code as used by other architectures for this operation.

Linus further added three “subtle issues” that he saw with switching over x86 to the generic GUP code:

“(a) we need to make sure that x86 actually matches the required semantics for the generic GUP.

(b) we need to make sure the atomicity of the page table reads is ok.

(c) need to verify the maximum VM address properly”

He said “I _think_ (a) is ok”. But he wanted to see “real work to make sure” that (b) is “ok on 32-bit PAE”. PAE means Physical Address Extension, a mechanism used on certain 32-bit Intel x86 systems to address greater than a 32-bit physical address space by leveraging the fact that many individual applications don’t need larger than a 32-bit address space but that an overall system might in aggregate use multiple such 32-bit applications. It was a hack that bought time before the widespread adoption of the 64-bit architecture, and one that others (such as ARM) have implemented in a similar sense of end purpose in “LPAE” and friends as well. PAE moved the x86 architecture from 32-bit PTE (Page Table Entries) to 64-bit hardware entries, which means that on 32-bit systems there are real concerns around atomicity of updates to these structures without very careful handling. And as this author can attest, you don’t want to have to debug that situation.
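The atomicity concern is easier to see with a toy model of a 64-bit entry stored as two 32-bit halves. The sketch below is a single-threaded simulation with an injected “writer” callback, not real concurrent code, and the class and function names are hypothetical. The retry loop shown is one common pattern for this kind of lockless read; it is only safe under the assumption, stated in the comments, that every update changes the low half.

```python
class SplitPte:
    """A 64-bit entry stored as two 32-bit halves, as a 32-bit CPU sees it."""

    def __init__(self, value):
        self.low = value & 0xFFFFFFFF
        self.high = value >> 32

    def store(self, value):
        # Two separate 32-bit stores: a reader may run between them.
        self.low = value & 0xFFFFFFFF
        self.high = value >> 32


def read_naive(pte, between=None):
    """Read low then high with no check; may return a torn value."""
    low = pte.low
    if between:
        between()                  # simulate a concurrent writer
    high = pte.high
    return (high << 32) | low


def read_with_retry(pte, between=None):
    """Re-read the low half and retry if it changed underneath us.

    Assumes every update changes the low half, so an unchanged low half
    implies the high half read in between belongs to the same entry.
    """
    while True:
        low = pte.low
        if between:
            between()              # simulate a concurrent writer
            between = None         # the simulated writer only runs once
        high = pte.high
        if pte.low == low:
            return (high << 32) | low


OLD = 0x0000000100001067
NEW = 0x0000000200002067
```

With a writer landing between the two loads, the naive reader returns the high half of the new entry glued to the low half of the old one, a value that was never in the page table, while the retrying reader notices the change and returns the new entry.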

This discussion led Kirill to point out that there were some obvious-looking bugs in the existing x86 GUP code that needed fixing for PAE anyway. The thread is ongoing, and Kirill is certain to be enjoying this week’s episode of “so you thought you were only adding 5-level paging?”. Michal Hocko noted that he had pulled the current version of the 5-level paging patch series into the mmotm (mm of the moment) VM (Virtual Memory) subsystem development tree as co-maintained with Andrew Morton and others.

Borislav Petkov posted “x86/mce: Handle broadcasted MCE gracefully with kexec” which (as we covered previously) seeks to handle the unfortunate case of an MCE (Machine Check Exception) on Intel x86 systems arriving during the process of handoff from the crash kernel into “purgatory” prior to the new kernel beginning. At this phase, the old kernel’s MCE handler is running and will never complete a synchronization with other cores in the system that are waiting in a holding spinloop (probably MWAIT one would assume) for the new kernel to take over.


Various subsystems gained support for the new “statx” system call, which is part of the ongoing “Year 2038” doomsday avoidance work to prevent a Y2K-style disaster when 32-bit Unix time wraps in 2038 (this being an actual potential “disaster” in the making, unlike the much hyped Y2K nonsense). Many of us have aspirations to be retired and living on boats by then, but this is neither assured, nor a prudent means to guarantee we won’t have to deal with this later (but presumably with at least some kind of lucrative consulting contract to bring us out of our early or late retirements).
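For anyone who wants to check the arithmetic behind “wraps in 2038”, here is a small Python sketch. It assumes the classic representation of Unix time as a signed 32-bit count of seconds since 1970-01-01 UTC; the helper names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1

def as_int32(n):
    """Wrap an integer the way a signed 32-bit counter would."""
    return (n + 2**31) % 2**32 - 2**31

def to_utc(seconds):
    """Convert seconds since the epoch to a UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)

last_good = to_utc(INT32_MAX)               # 2038-01-19 03:14:07 UTC
wrapped = to_utc(as_int32(INT32_MAX + 1))   # 1901-12-13 20:45:52 UTC
```

One second after the last representable moment, a wrapped counter lands back in December 1901, which is why wider timestamps are needed.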

The “statx” call adds 64-bit timestamps and replaces “stat”. It also does a lot more than just “make large” (David Howells’s words) the various fields in the previous stat structures. The overall system call was covered much more generally by Linux Weekly News (which you should support as a purveyor of actual in-depth journalism on such topics) as recently as last week. Stafford Horne posted one example of the patches we refer to here, for the “asm-generic” reference includes used by emerging architectures, such as the OpenRISC architecture that he is maintaining. Another statx patch came from David Howells, for the ext4 filesystem, which led to a longer discussion of how to implement various underlying flag changes required to ext4.

Eric Biggers noted that David used the ext4_get_inode_flags function “to sync the generic inode flags (inode->i_flags) to the ext4-specific inode flags (ei->i_flags)” but that a problem can exist when doing this without holding an underlying lock due to “flag syncs…in both directions concurrently” which could “cause an update to be lost”. He walked through an example of how this could occur, and then suggested that for ->getattr() it might be easier to skip the call to the offending function and “instead populating the generic attributes like STATX_ATTR_APPEND and STATX_ATTR_IMMUTABLE from the generic inode flags, rather than from the ext4-specific flags?”. Andreas Dilger suggested the other way around, pulling the flags directly from the ext4 flags rather than the generic ones. He also raised the general question of “when/where are the VFS inode flags changed that they need to be propagated into the ext4 disk inode?”.
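The kind of lost update being described can be shown with a deterministic toy model in Python. This is not Eric’s actual example and not ext4 code: the two fields merely stand in for inode->i_flags and ei->i_flags, the flag value is arbitrary, and the interleaving is written out by hand rather than produced by real threads.

```python
IMMUTABLE = 0x2   # arbitrary bit chosen for this sketch

class Inode:
    def __init__(self):
        self.generic_flags = 0   # stands in for inode->i_flags
        self.fs_flags = 0        # stands in for ei->i_flags

def read_generic(inode):
    # Step 1 of a generic -> filesystem sync: take a snapshot.
    return inode.generic_flags

def write_fs_from(inode, snapshot):
    # Step 2 of the sync: overwrite the filesystem copy with the snapshot.
    inode.fs_flags = snapshot

def set_flag_both(inode, flag):
    # A setter that updates the filesystem copy and then the generic copy.
    inode.fs_flags |= flag
    inode.generic_flags |= flag

def racy_interleaving():
    inode = Inode()
    snapshot = read_generic(inode)    # sync reads 0
    set_flag_both(inode, IMMUTABLE)   # concurrent setter runs to completion
    write_fs_from(inode, snapshot)    # sync writes back the stale snapshot
    return inode

def serialized():
    # The same operations with the sync held off until the setter is done.
    inode = Inode()
    set_flag_both(inode, IMMUTABLE)
    write_fs_from(inode, read_generic(inode))
    return inode
```

In the racy ordering the filesystem-side copy ends up without the flag even though the generic copy has it, while the serialized ordering leaves both copies in agreement. That is the case for either holding a lock or not syncing in that direction at all.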

Jan Kara replied that “you seem to be right. And actually I have checked and XFS does not bother to copy inode->i_flags to its on-disk flags so it seems generally we are not expected to reflect inode->i_flags in on-disk state”. Jan suggested to Andreas that it might be “better…to have ext4_quota_on() and ext4_quota_off() just update the flags as needed and avoid doing it anywhere else…I’ll have a look into it”.

Heterogeneous Memory Management

Jérôme Glisse posted version 18 of his patch series entitled “HMM (Heterogeneous Memory Management)” which aims to serve two generic use cases: “First it allows to use device memory transparently inside any process without modifications to process program code. Second it allows to mirror process address space on a device”. His intro described these summaries as a “Cliff node” (a brand of examination-time study materials often used by students for preparation), which led to an objection from Andrew Morton that “Cliff’s notes” “isn’t appropriate for a large feature such as this. Where’s the long-form description? One which permits readers to fully understand the requirements, design, alternative designs, the implementation, the interface(s), etc?”. He also asked for clarification of what was meant by “device memory” since “That’s very vague. What are the characteristics of this memory? Why is it a requirement that userspace code be unaltered? What are the security implications – does the process need particular permissions to access this memory? What is the proposed interface to set up this access?”

In a followup, Jérôme noted that he had previously given a longer form summary, which he attached, in the earlier revisions of the now version 18 patch series. In his summary, he makes clear his intent is to ease the overall management and programming of hybrid systems involving GPUs and other accelerators by introducing “a new kind of ZONE_DEVICE memory that does allow to allocate a struct page for each page of the device memory. Those page are special because the CPU can not map them. They however allow to migrate main memory to device memory using ex[]isting migration mechanism[s] and everything looks like it page was swap[ped] out to disk from CPU point of view. Using a struct page gives the easiest and cleanest integration with existing mm mechanisms”. He notes that he isn’t trying to solve other problems, and in fact one could summarize HMM using the buzzword du jour: “mediated”.

In an HMM world, devices and host-side application software can share what appears to them as a “unified” memory map. One in which pointer addresses from within an application can be dereferenced by code running on a GPU, and vice versa, through cunning use of page tables and a new underlying system framework for the device drivers touching the hardware. It’s not magic, but it does help to treat device memory “like regular memory” and accommodates “Advance in high level language construct (in C++ but others too) gives opportunities to compiler to leverage GPU transparently without programmer knowledge. But for this to happen we need a share[d] address space”.

This means that, if a host application (processor side of the equation) performs an access to part of a process (known as a “task” within the kernel) address space that is currently under control of a device, then the associated page fault will trigger generic framework code to handle handoff of that page back to the host CPU side. On the flip side, the framework still requires device drivers to use a new framework to manage their access to memory since few devices have generic page fault mechanisms today that can be leveraged to make this more transparent, and a lot of other device specific gunk is needed. It’s not a perfect solution, but it does arguably advance the state of the art, and is useful. Jérôme also states that “I do not wish to compete for the patchset with the highest revision count and i would like a clear cut position on w[h]ether it can be merge[d] or not. If not i would like to know why because i am more than willing to address any issues people might have. I just don’t want to keep submitting it over and over until i end up in hell…So please consider applying for 4.12”.
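The handoff described above can be pictured with a deliberately tiny Python model. None of these names come from the HMM patches. The sketch only tracks which side currently owns each page: a CPU access to a device-owned page goes through a stand-in for the page fault path, and the device side is assumed to need an explicit, driver-initiated migration.

```python
class ToyAddressSpace:
    """Tracks, per page, whether the CPU or the device currently owns it."""
    def __init__(self, num_pages):
        self.owner = ["cpu"] * num_pages
        self.data = [0] * num_pages
        self.cpu_faults = 0

    def migrate_to_device(self, page):
        # Driver-initiated: the device side does not fault pages in by itself here.
        self.owner[page] = "device"

    def _cpu_fault(self, page):
        # Stand-in for the generic fault handler handing the page back to the CPU.
        self.cpu_faults += 1
        self.owner[page] = "cpu"

    def cpu_read(self, page):
        if self.owner[page] != "cpu":
            self._cpu_fault(page)
        return self.data[page]

    def device_write(self, page, value):
        if self.owner[page] != "device":
            raise PermissionError("driver must migrate the page first")
        self.data[page] = value
```

A typical sequence would be `migrate_to_device(2)`, then `device_write(2, 99)`, then `cpu_read(2)`: the read takes one fault, hands the page back to the CPU side, and returns the value the device wrote.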

This author’s own personal opinion is that, while HMM is certainly useful, many such shared device/host memory situations can be greatly simplified by introducing coherent shared virtual memory between device and host. That model allows for direct address space sharing without some of the heavy lifting required in this patch set. Yet, as is noted in the posting, few devices today have such features (and there is no reason to presume that all future devices suddenly will implement shared virtual memory, nor that every device will want to expend the energy required to maintain coherent memory for communication). So the HMM patches provide a means of tracking who owns memory shared between device and “host”, and they exploit split device and “host” system page tables as well as associated faults to ensure pages are handed off as cleanly as can be achieved with technology available in the market today.

Ongoing Development

Michal Hocko posted a patch entitled “rework memory hotplug onlining”, which seeks to rework the semantics for memory hotplug since the current implementation is “awkward and hard/impossible to use from the udev to online memory as movable. The main problem is that only the last memblock or the adjacent to highest movable memblock can be onlined as movable”. He posted a number of examples showing how things fall down today, as well as a patch (“just for x86 now but I will address other arches once there is an agreement this is the right approach”) removing “all the zone specific operations from __add_pages (aka arch_add_memory) path. Instead we do page->zone association from move_pfn_range which is called from online_pages. This criterion for movable/normal zone association is really simple now. We just have to guarantee that zone Normal is always lower than zone Movable”. This led to a lengthy discussion around the ideal longer term approach and is likely to be a topic at the LSF/MM conference this week (one assumes?). [ It’s happening down the street from me…I’ll smile and wave at you 😉 ]
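The “Normal is always lower than Movable” criterion can be expressed as a simple check. The Python below is only a sketch of that ordering rule, assuming memory blocks are identified by an increasing index; the function name and data layout are invented here and do not mirror the patch’s actual data structures.

```python
def can_online(online, index, zone):
    """Return True if onlining block `index` into `zone` keeps Normal below Movable.

    `online` maps the index of each already-online block to "normal" or "movable".
    """
    if zone == "movable":
        # Every normal block must sit below the new movable block.
        return all(i < index for i, z in online.items() if z == "normal")
    if zone == "normal":
        # Every movable block must sit above the new normal block.
        return all(i > index for i, z in online.items() if z == "movable")
    raise ValueError(zone)
```

With blocks `{1: "normal", 5: "movable"}` already online, block 3 could go either way, block 6 could only be movable, and block 0 could only be normal.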

Gustavo Padovan posted “V4L2 explicit synchronization support”, an RFC (Request For Comments) that “adds support for Explicit Synchronization of shared buffers in V4L2” (Video For Linux 2, the general purpose video framework API used on Linux machines for certain multimedia purposes). This new RFC leverages the “Sync File Framework” as a means to “communicate the fences between kernel and userspace”. In English, what this means is that it’s often necessary to communicate using shared buffers between userspace, kernel, and hardware. And some (most) hardware might not guarantee that these buffers are fully coherent (observed identically between multiple concurrently operating agents that are manipulating them). The use of “fences” (barriers) enables explicit communication of certain points in time during which the state of a buffer is consistent and ready for access to be handed off between different parts of the system. The RFC is quite interesting and has a lot more detail, including the observation that it is intended to be a PoC (Proof of Concept) to get the conversation moving, rather than the eventual end result of that conversation that might actually be merged.

Wei Wang (Intel) posted a patch series entitled “Extend virtio-balloon for fast (de)inflating & fast live migration”. Balloons aren’t just helium filled goodies that all of us love to play with from a young age. Well, they are that, but, they’re also a concept applied to the memory management of virtual machines, which “inflate” the amount of memory available to them by requesting more from a hypervisor during their lifetime (that they might also return). In Linux, the same concept is applied to the migration of virtual machines, which can use the virtio-balloon abstraction over the virtio bus (a hypervisor communications channel) to transfer “guest unused pages to the host so that they can be skipped to migrate in live migration”. One of the patches in his version 3 series (patch number 3 of 4), entitled “mm: add in[t]erface to offer info about unused pages” had some detailed discussion with Michael S. Tsirkin commenting on better documentation and Andrew Morton suggesting that it might be better for the code to live in the virtio-balloon driver rather than being made too generic as its use case is very targeted.

Elena Reshetova continued her work toward conversion of Linux kernel subsystems to her newer “refcount” explicit reference counting API with a posting entitled “net subsystem refcount conversions”.

Suzuki K Poulose posted a bunch of patches implementing support for detection and reporting of new ARMv8.3 architecture features, including one patch that was entitled “arm64: v8.3: Support for Javascript conversion instruction” (which really means a new double precision float to integer conversion instruction that will likely be used by high performance JavaScript JITs…). He also posted “arm64: v8.3: Support for weaker release consistency”. The new revision of the architecture adds new instructions to “support Release Consistent processor consistent (RCpc) model, which is weaker than the RCsc [Release Consistent sequential consistency] model”. Listeners are encouraged to read the C++ memory model and other fascinating bedtime literature for much more detail on the available RC options.

Markus Mayer (Broadcom) posted “Basic divider clock”, an RFC which aims to provide a generic means of expressing clock dividers that can be leveraged in an embedded system’s “DeviceTree”, for which he also posted bindings (descriptions to be used in creating these textual description “trees”). Stephen Boyd pushed back that the community had so far avoided generic implementations but instead preferred to keep things at the level of having drivers that target certain hardware IP from certain vendors based upon the compatible matching strings.

Michael S. Tsirkin posted “kvm: better MWAIT emulation for guests”. We have previously explained this patchset and the dynamics of MWAIT implementations. His goal for this patch is to handle guests that assume the presence of the (x86) MWAIT feature, which isn’t present on all x86 CPUs. If you were running (for example) macOS inside a VM on an x86 machine, it would generally assume the presence of MWAIT without checking for it, because it’s present in all x86-based Apple Macs. Emulating MWAIT is useful in such situations.

Romain Perier posted “Replace PCI pool by DMA pool API”. As he notes in his posting, “The current PCI pool API are simple macro functions direct expanded to the appropriate dma pool functions. The prototypes are almost the same and semantically, they are very similar. I propose to use the DMA pool API directly and get rid of the old API”.

Daeseok Youn posted “staging: atomisp: use k{v}zalloc instead of k{v}alloc and memset”. Alan Cox replied “…please don’t apply this. There are about five other layers of indirection for memory allocators that want removing first so that the driver just uses the correct kmalloc/kzalloc/kv* functions in the right places”. Now does seem like a good time not to add more layers.

Peter Zijlstra posted various “x86 optimizations” that aimed to “shrink the kernel and generate better code”.