In this week’s edition: Linus Torvalds announces Linux 4.11-rc6, Intel Memory Bandwidth Allocation (MBA), Coherent Device Memory (CDM), Paravirtualized Remote TLB Flushing, kernel lockdown, the latest on Intel 5-level paging, and other assorted ongoing development activities.
Linus Torvalds announced Linux 4.11-rc6. In his mail, Linus notes that “Things are looking fairly normal [for this point in the development cycle]…The only slightly unusual thing is how the patches are spread out, with almost equal parts of arch updates, drivers, filesystems, networking and “misc”.” He ends “Go and get it”. Thorsten Leemhuis followed up with “Linux 4.11: Reported regressions as of Sunday, 2017-04-09”, his third regression report for 4.11, which “lists 15 regressions I’m currently aware of. 5 regressions mentioned in last week[‘]s report got fixed”. Most appear to be driver problems, but there is one relating to audit, and one in inet6_fill_ifaddr that is stalled waiting for “feedback from reporter”.
Greg K-H (Kroah-Hartman) announced Linux kernels 4.4.60, 4.9.21, and 4.10.9.
Ben Hutchings announced Linux 3.2.88 and 3.16.43.
Jason A. Donenfeld pointed out that Linux 3.10 “is inexplicably missing crypto_memneq, making all crypto mac [Message Authentication Code] comparisons use non constant-time comparisons. Bad news bears” [presumably due to the resulting exposure to timing side channel attacks]. Willy followed up noting that he would “check if the 3.12 patches…can be safely backported”.
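For readers wondering what a constant-time comparison looks like, here is a minimal sketch of the idea. The function name ct_memneq is made up for illustration and this is not the kernel's crypto_memneq implementation. It simply shows the principle: every byte is examined no matter where the first mismatch occurs, so the running time does not leak how much of a guessed MAC was correct.

```c
#include <stddef.h>

/*
 * Hypothetical helper illustrating the principle behind a constant-time
 * "not equal" check. Returns 0 when the buffers match and nonzero
 * otherwise. Unlike a typical memcmp(), the loop never exits early, so
 * the time taken does not depend on the position of the first
 * differing byte.
 */
unsigned long ct_memneq(const void *a, const void *b, size_t len)
{
    /* volatile discourages the compiler from short-circuiting the loop */
    const volatile unsigned char *pa = a;
    const volatile unsigned char *pb = b;
    unsigned long diff = 0;
    size_t i;

    for (i = 0; i < len; i++)
        diff |= (unsigned long)(pa[i] ^ pb[i]); /* accumulate differences */

    return diff;
}
```

A caller verifying a MAC would treat any nonzero return as a mismatch, in contrast to a comparison that returns as soon as it finds the first wrong byte.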
Memory Bandwidth Allocation (Intel Resource Director Technology, RDT)
Vikas Shivappa (Intel) posted version 4 of a patch series entitled “x86/intel_rdt: Intel Memory bandwidth allocation”, addressing feedback from the previous iteration that he had received from Thomas Gleixner. The MBA (Memory Bandwidth Allocation) technology is described both in the kernel Documentation patch (provided) as well as in various Intel papers and materials available online. Intel provide a construct known as a “Class of Service” (CLOS) on certain contemporary Xeon processors, as part of their CAT (Cache Allocation Technology) feature, which is itself part of a larger family of technologies known as “Intel Resource Director Technology” (RDT). These CLOSes “act as a resource control tag into which a thread/app/VM/container can be grouped”.
It appears that a feature of Intel’s L3 cache (LLC in Intel-speak) in these parts is that they can not only assign specific proportions of the L3 cache slices on the Xeon’s ring interconnect to specific resources (e.g. “tasks” – otherwise known as processes, or applications) but can also control the amount of memory bandwidth granted to these. This is easier than it sounds. From a technical perspective, Intel integrate their memory controller onto their dies, and contemporary memory controllers already perform fine grained scheduling (this is how they bias memory reads for speculative loads of the instruction stream in among the other traffic, as just one simple example). Therefore, exposing memory bandwidth control to the cache slices isn’t all that much more complex. But it is cute, and looks great in marketing materials.
Coherent Device Memory (CDM) on top of HMM
Jérôme Glisse posted an RFC [Request for Comments] patch series entitled “Coherent Device Memory (CDM) on top of HMM”. His previous HMM (Heterogeneous Memory Management) patch series, now in version 19, implemented support for (non-coherent) device memory to be mapped into regular process address space, by leveraging the ability of certain contemporary devices to fault on access to untranslated addresses managed in device page tables, thus allowing for a kind of pageable device memory and transparent management of ownership of the memory pages between application processor cores and (e.g.) a GPU or other acceleration device. The latest patch series builds upon HMM to also support coherent device memory (via a new ZONE_DEVICE memory type – see also the recent postings from IBM in this area). As Jérôme notes, “Unlike the unaddressable memory type added with HMM patchset, the CDM [Coherent Device Memory] type can be access[ed] by [the] CPU.” He notes that he wanted to kick off this RFC more for the conversation it might provoke.
In his mail, Jérôme says, “My personal belief is that the hierarchy of memory is getting deeper (DDR, HBM stack memory, persistent memory, device memory, …) and it may make sense to try to mirror this complexity within mm concept. Generalizing the NUMA abstraction is probably the best starting point for this. I know there are strong feelings against changing NUMA so i believe now is the time to pick a direction”. He’s right of course. There have been a number of patch series recently also targeting accelerators (such as FPGAs), and more can be anticipated for coherently attached devices in the future. [This author is personally involved in CCIX]
Hyper-V: Paravirtualized Remote TLB Flushing and Hypercall Improvements
Vitaly Kuznetsov (Red Hat) posted “Hyper-V: paravirtualized remote TLB flushing and hypercall improvements”. It turns out that Microsoft’s Hyper-V hypervisor supports hypercalls (calls into the hypervisor from the guest OS) for “doing local and remote TLB [Translation Lookaside Buffer] flushing”. Translation Lookaside Buffers [TLBs] are caches built into microprocessors that store a translation of a CPU virtual address to a “physical” (or, for a virtual machine, an intermediate hypervisor) address. They save an unnecessary page table walk (of the page tables, a structure managed by software, hardware, or both, depending upon architecture, that “walkers” navigate to perform a translation during a “page fault” or unhandled memory access, such as happens constantly when demand loading/faulting in application code and data, or sharing read-only data provided by shared libraries, etc.). TLBs are generally transparent to the OS, except that they must be explicitly managed under certain conditions – such as when invalidating regions of virtual memory or performing certain context switches (depending upon the provisioning of address and virtual memory space tag IDs in the architecture).
TLB invalidates on local processor cores normally use special CPU instructions, and this is certainly also true under virtualization. But virtual addresses used by a particular process (known as a task within the kernel) might be also used by other cores that have touched the same virtual memory space. And those translations need to be invalidated too. Some architectures include sophisticated hardware broadcast invalidation of TLBs, but some other legacy architectures don’t provide these kinds of capabilities. On those architectures that don’t provide for a hardware broadcast, it is typically necessary to use a construct known as an IPI (Inter Processor Interrupt) to cause an IRQ (interrupt message) to be delivered to the remote interrupt controller CPU interface (e.g. LAPIC on Intel x86 architecture) of the destination core, which will run an IPI handler in response that does the TLB teardown.
As Vitaly notes, nobody is recommending doing a local TLB flush using a hypercall, but there can be a significant performance improvement in using a hypercall for the remote invalidates. In the example cited, which uses “a special ‘TLB trasher’”, he demonstrates how a 16 vCPU guest experienced a greater than 25% performance improvement using the hypercall approach.
David Howells posted a magnum opus entitled “Kernel lockdown”, which aims to “provide a facility by which a variety of avenues by which userspace can feasibly modify the running kernel image can be locked down”. As he says, “The lock-down can be configured to be triggered by the EFI secure boot status, provided the shim isn’t insecure. The lock-down can be lifted by typing SysRq+x on a keyboard attached to the system [physical presence]”. Among many other things, these patches (versions of which have been in distribution kernels for a while) change kernel behavior to include “No unsigned modules and no modules for which [we] can’t validate the signature”, disable many hardware access functions, turn off hibernation, prevent kexec_load(), and limit some debugging features. Justin Forbes of the Fedora Project noted that he had (obviously) tested these. One of the many interesting sets of patches included a feature to “Annotate hardware config module parameters”, which allows modules to mark unsafe options. Following some pushback, David also followed up with a rationale for doing kernel lockdown, entitled “Why kernel lockdown?”. Worth reading.
Kirill A. Shutemov posted “x86: 5-level paging enabling for v4.12, Part 4”, in which he (bravely) took Ingo’s request to “rewrite assembly parts of boot process into C before bringing 5-level paging support”. He says, “The only part where I succeed is startup_64 in arch/x86/kernel/head_64.S. Most of the logic is now in C.” He also renames the level 4 page tables “init_level4_pgt” and “early_level4_pgt” to “init_top_pgt” and “early_top_pgt”. There was another lengthy discussion around his “Allow to have userspace mappings above 47-bits”, a patch which tells the kernel to prefer to do memory allocations below 47-bits (the previous “Canonical Addressing” limit of Intel x86 processors, which some JITs and other code exploit by abusing the top bits of the address space in pointers for illegal tags, breaking compatibility with an extended virtual address space). The patch allows mmap calls with MAP_FIXED hints to cause larger allocations. There was some concern that larger VM space is ABI and must be handled with care. A footnote here is that (apparently, from the patch) Intel MPX (Memory Protection Extension) doesn’t yet work with LA57 (the larger address space feature) and so Kirill avoids both in the same process.
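To make the canonical addressing point concrete, here is a small sketch. All function names are hypothetical and none of this is taken from Kirill's patches. It checks whether a 64-bit address is canonical for a given virtual address width (48 bits with 4-level paging, 57 bits with LA57), and models the kind of pointer tagging trick that quietly assumes user addresses never use bits 48 and above.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * An address is canonical for a va_bits-wide virtual address space when
 * bits 63..(va_bits - 1) are all zeros or all ones, i.e. the address is
 * a sign extension of its low va_bits bits. va_bits is assumed to be in
 * the range 1..64.
 */
bool is_canonical(uint64_t addr, unsigned va_bits)
{
    uint64_t upper_mask = ~UINT64_C(0) << (va_bits - 1);
    uint64_t upper = addr & upper_mask;

    return upper == 0 || upper == upper_mask;
}

/* Hypothetical JIT-style tagging: stash a 16-bit tag in bits 63..48. */
#define LOW48_MASK ((UINT64_C(1) << 48) - 1)

uint64_t tag_pointer(uint64_t addr, uint16_t tag)
{
    return (addr & LOW48_MASK) | ((uint64_t)tag << 48);
}

uint64_t untag_pointer(uint64_t tagged)
{
    /* Silently discards any real address bits at or above bit 48 */
    return tagged & LOW48_MASK;
}
```

Under these assumptions, a user address below the 47-bit boundary survives a tag and untag round trip, whereas an address with (say) bit 50 set, which only becomes possible with the larger address space, comes back truncated. That is the compatibility hazard that motivates keeping allocations below 47 bits unless the application explicitly asks for more.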
Christopher Bostic posted version 5 of a patch series entitled “FSI driver implementation”. This is support for the POWER’s [Performance Optimization With Enhanced RISC, for those who ever wondered – this author used to have a lot of interest in PowerPC back in the day] “Flexible Support Interface” (FSI), a “high fan out serial bus” whose specification seems to have appeared on the OpenPower Foundation website recently also.
Kishon Vijay Abraham posted “PCI: Support for configurable PCI endpoint”, which Bjorn finally pulled into his tree in anticipation of the upcoming 4.12 merge cycle. For those who haven’t seen Kishon’s awesome presentation “Overview of PCI(e) Subsystem” for Embedded Linux Conference Europe, you are encouraged to watch it at least several times. He really knows his stuff, and has done an excellent job producing a high quality generic PCIe endpoint driver for Linux: https://www.youtube.com/watch?v=uccPR6X8vy8
Ard Biesheuvel posted “EFI fixes for v4.11”, which among other goodies includes a fix for EFI GOP (Graphics Output Protocol) support on systems built using the 64-bit ARM Architecture, which uses firmware assignment of PCIe BAR resources. Ard and Alex Graf have done some really fun work with graphics cards on 64-bit ARM lately – including emulating x86 option ROMs. Ard also had some fixes prepared for v4.12 that he announced, including a bunch of cleanup to the handling of FDT (Flattened Device Tree) memory allocation. Finally, he added support for the kernel’s “quiet” command line option, to remove extraneous output from the EFI stub on boot.
Srikar Dronamraju and Michal Hocko had a back and forth on the former’s “sched: Fix numabalancing to work with isolated cpus” patch, which does what it says on the tin. Michal was a little concerned that NUMA balancing wasn’t automatically applied even to isolated CPUs, but others (including Peter Zijlstra) noted that this absolutely is the intended behavior.
Ying Huang (Intel) posted version 8 of his “THP swap: Delay splitting THP during swapping out”, which essentially allows paging of (certain) huge pages. He also posted version 2 of “mm, swap: Sort swap entries before free”, which sorts consecutive swap entries in a per-CPU buffer into order according to their backing swap device before freeing those entries. This reduces needless acquiring/releasing of locks and improves performance.
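The intuition behind sorting before freeing can be sketched in a few lines. This is a toy model, not the kernel code: the struct and function names are invented, and a "lock acquisition" is simply counted each time the loop moves to an entry backed by a different device than the previous one.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for a swap entry: which device it lives on, and where. */
struct toy_swap_entry {
    unsigned dev;
    unsigned long offset;
};

static int cmp_by_dev(const void *a, const void *b)
{
    const struct toy_swap_entry *ea = a;
    const struct toy_swap_entry *eb = b;

    return (ea->dev > eb->dev) - (ea->dev < eb->dev);
}

/*
 * Walk the buffer "freeing" entries in order. Each time the backing
 * device changes we would have to drop one per-device lock and take
 * another, so count those transitions as lock acquisitions.
 */
size_t count_lock_acquisitions(const struct toy_swap_entry *e, size_t n)
{
    size_t locks = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (i == 0 || e[i].dev != e[i - 1].dev)
            locks++;
        /* the entry itself would be freed here, under the lock */
    }
    return locks;
}

/* Same walk, but with entries grouped by device first. */
size_t count_lock_acquisitions_sorted(struct toy_swap_entry *e, size_t n)
{
    qsort(e, n, sizeof(*e), cmp_by_dev);
    return count_lock_acquisitions(e, n);
}
```

In this model, a buffer that alternates between two devices costs one lock acquisition per entry when freed as-is, but only one per device once sorted. The real patch is of course dealing with actual spinlocks and a bounded per-CPU buffer, so the achievable gain depends on how interleaved the entries are in practice.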
Will Deacon posted version 2 of a patch series entitled “drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension”. The “SPE” (Statistical Profiling Extension) “can be used to profile a population of operations in the CPU pipeline after instruction decode. These are either architected instructions (i.e. a dynamic instruction trace) or CPU-specific uops and the choice is fixed statically in the hardware and advertised to userspace via caps. Sampling is controlled using a sampling interval, similar to a regular PMU counter, but also with an optional random perturbation”. He notes that the “in-memory buffer is linear and virtually addressed, raising an interrupt when it fills up” [which makes using it nice for software folks].
Binoy Jayan posted “IV [Initialization Vector] Generation algorithms for dm-crypt”, the goal of which “is to move these algorithms from the dm layer to the kernel crypto layer by implementing them as template ciphers”.
Joerg Roedel posted “PCI: Add ATS-disable quirk for AMD Stoney GPUs”. Then, he posted a followup with a minor fix based upon feedback. This should close the issue of certain bug reports posted by those using an IOMMU on a Stoney platform and seeing lockups under high TLB invalidation.
Bjorn Helgaas posted “PCI fixes for v4.11”, which includes “fix ThunderX legacy firmware resources”, a PCI quirk for certain ARM server platforms.
Paul Menzel reported “`pci_apply_final_quirks()` taking half a second”, which David Woodhouse (who wrote the code to match PCIe devices against the quirk list “back in the mists of time”) posited was perhaps down to “spending a fair amount of time just attempting to match each device against the list”. He wondered “if it’s worth sorting the list by vendor ID or something, at least for the common case of the quirks which match on vendor/device”. There was a general consensus that cleanup would be nice, if only someone had the time and the inclination to take a poke at it.
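As a rough illustration of the "sort by vendor ID" suggestion, here is a sketch with an invented table layout (this is not how the kernel's quirk tables are actually declared). Once the table is sorted by vendor, a binary search finds the first entry for a device's vendor and only that vendor's entries need to be scanned, rather than the whole list.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_ANY_ID 0xffffu /* hypothetical wildcard for the device field */

struct toy_quirk {
    uint16_t vendor;
    uint16_t device; /* TOY_ANY_ID matches every device of the vendor */
    int id;          /* stands in for the fixup function pointer */
};

static int cmp_vendor(const void *a, const void *b)
{
    const struct toy_quirk *qa = a;
    const struct toy_quirk *qb = b;

    return (qa->vendor > qb->vendor) - (qa->vendor < qb->vendor);
}

void sort_quirks(struct toy_quirk *q, size_t n)
{
    qsort(q, n, sizeof(*q), cmp_vendor);
}

/* Requires a table previously sorted with sort_quirks(). */
const struct toy_quirk *find_quirk(const struct toy_quirk *q, size_t n,
                                   uint16_t vendor, uint16_t device)
{
    size_t lo = 0, hi = n;

    /* Lower-bound binary search: first entry with q[i].vendor >= vendor */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (q[mid].vendor < vendor)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* Scan only this vendor's run of entries */
    for (; lo < n && q[lo].vendor == vendor; lo++) {
        if (q[lo].device == device || q[lo].device == TOY_ANY_ID)
            return &q[lo];
    }
    return NULL;
}
```

This only covers the common vendor/device case that David mentions. Quirks that wildcard the vendor, or that match on device class, would still need a separate (presumably short) linear pass, and a real device may match several quirks rather than just the first one found.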
Seth Forshee (Canonical) posted “audit regressions in 4.11”, in which he noted that ever since the merging of “audit: fix auditd/kernel connection state tracking”, the kernel will now queue up audit messages indefinitely for delivery to the (userspace) audit daemon if it is not running – ultimately crashing the machine. Paul Moore thanked him for the report and there was a back and forth on the best way to handle the case of no audit daemon running.
Neil Brown posted a patch entitled “NFS: fix usage of mempools”. As he notes in his patch, “When passed GFP [Get Free Pages] flags that allow sleeping (such as GFP_NOIO), mempool_alloc() will never return NULL, it will wait until memory is available…This means that we don’t need to handle failure, but that we do need to ensure one thread doesn’t call mempool_alloc twice on the one pool without queuing or freeing the first allocation”. He then cites “pnfs_generic_alloc_ds_commits” as an unsafe function and provides a fix.
Finally, Kees Cook followed up (as he had promised) on a discussion from last week, with an RFC (Request for Comments) patch series entitled “mm: Tighten x86 /dev/mem with zeroing”, including the suggestion from Linus that reads from /dev/mem that aren’t permitted simply return zero data. This was just one of many security discussions he was involved in (as usual). Another concerned a patch that he had suggested, posted by Eddie Kovsky and entitled “module: verify address is read-only”, which modifies kernel functions that use modules to verify that they are in the correct kernel ro_after_init memory area and “reject structures not marked ro_after_init”.
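The "return zeros instead of failing" idea for /dev/mem is easy to picture with a toy model. The sketch below is not Kees's patch: the buffer standing in for physical memory, the page size constant, and the policy function (which here permits only page 0) are all assumptions made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u

/* Hypothetical policy: in this toy model only page 0 may be read. */
bool toy_page_is_allowed(size_t pfn)
{
    return pfn == 0;
}

/*
 * Copy up to count bytes starting at offset off from the simulated
 * memory into out, one page-bounded chunk at a time. Chunks that fall
 * in a disallowed page are filled with zeros rather than causing the
 * whole read to fail. Returns the number of bytes produced.
 */
size_t toy_read_mem(const unsigned char *mem, size_t mem_len, size_t off,
                    unsigned char *out, size_t count)
{
    size_t done = 0;

    while (done < count && off < mem_len) {
        size_t chunk = TOY_PAGE_SIZE - (off % TOY_PAGE_SIZE);

        if (chunk > count - done)
            chunk = count - done;
        if (chunk > mem_len - off)
            chunk = mem_len - off;

        if (toy_page_is_allowed(off / TOY_PAGE_SIZE))
            memcpy(out + done, mem + off, chunk);
        else
            memset(out + done, 0, chunk); /* hide the real contents */

        done += chunk;
        off += chunk;
    }
    return done;
}
```

A read that straddles a permitted page and a forbidden one therefore succeeds in full, with the forbidden portion reading back as zeros. The appeal of this behavior is that tools reading through a range keep working, while the protected contents never reach userspace.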