Linux Kernel Podcast for 2017/04/19

Audio: http://traffic.libsyn.com/jcm/20170419.mp3

[ Apologies for the delay – I have been a little sick for the past day or so and was out on Monday volunteering at the Boston Marathon, so my evenings have been in scarce supply to get this week’s issue completed ]

In this week’s edition: Linus Torvalds announces Linux 4.11-rc7, a kernel security update bonanza, the end of Kconfig maintenance, automatic NUMA balancing, movable memory, a bug in synchronize_rcu_tasks, and ongoing development. The Linux 4.12 merge window should open before next week.

Linus Torvalds announced Linux 4.11-rc7, noting that “You all know the drill by now. We’re in the late rc phase, and this may be the last rc if nothing surprising happens”. He also pointed out how things had been calm, and then, “as usual Friday happened”, leading to a number of reverts for “things that didn’t work out and aren’t worth trying to fix at this point”. In anticipation of the imminent opening of the 4.12 merge window (period of time during which disruptive changes are allowed) Linux Weekly News posted their usual excellent summary of the 4.11 development cycle. If you want to support quality Linux journalism, you should subscribe to LWN today.

Ted (Theodore) Ts’o posted “[REGRESSION] 4.11-rc: systemd doesn’t see most devices” in which he noted that “[t]here is a frustrating regression in 4.11 that I’ve been trying to track down. The symptoms are that a large number of systemd devices don’t show up.” (which was affecting the encrypted device mapper target backing his filesystem). He had a back and forth with Greg K-H (Kroah Hartman) about it with Greg suggesting Ted watch with udevadm and Ted pointing out that this happens at boot and is hard to trace. Ted’s final comment was interesting: “I’d do more debugging, but there’s a lot of magic these days in the kernel to udev/systemd communications that I’m quite ignorant about. Is this a good place I can learn more about how this all works, other than diving into the udev and systemd sources?”. Indeed. In somewhat interesting timing, Enric Balletbo i Serra later posted a 5 part patch series entitled “dm: boot a mapped device without an initramfs”.

Rafael J. Wysocki posted some late breaking 4.11-rc7 fixes for ACPI, including one patch reverting a “recent ACPICA commit [to the ACPI – Advanced Configuration and Power Interface – Component Architecture aka reference code upon which the kernel’s runtime interpreter is based] targeted at catching firmware bugs” that did do so, but also caused “functional problems”.

Announcements

Jiri Slaby announced Linux 3.12.73.

Greg KH (Kroah-Hartman) announced Linux 3.18.49, 3.19.49, 4.4.62, 4.9.23, and 4.10.11. As he noted in his review posting prior to announcing the latest 3.18 kernel, 3.18 was indeed “dead and forgotten and left to rot on the side of the road” but “unfortunately, there’s a few million or so devices out there in the wild that still rely on this kernel”. Important security fixes are included in all of these updates. Greg doesn’t commit to bringing 3.18 out of retirement for very long, but he does note that Google is assisting a little for the moment to make sure 3.18 based devices get some updates.

Steven Rostedt announced “Real Time” (preempt-rt) kernels 3.2.88-rt126 (“just an update to the new stable 3.2.88 version”), 3.12.72-rt97, and 4.4.60-rt73. Separately, Paul E. McKenney noted “A Hannes Weisbach of TU Dresden published this master thesis on quasi-real-time scheduling”:
http://os.inf.tu-dresden.de/papers_ps/weisbach-master.pdf

Rafael J. Wysocki announced a CFP (Call For Papers) targeting the upcoming LPC (Linux Plumbers Conference) Power Management and Energy-Awareness microconference “Call for topics”. Registration for LPC just opened.

Yann E. MORIN posted “MAINTAINERS: relinquish kconfig” in which he apologized for not having enough time to maintain Kconfig with “I’ve been almost entirely absent, which totally sucks, and there is no excuse for my behavior and for not having relinquished this earlier”. With such harsh friends as yourself, who needs enemies? Joking aside, this is sad news, since Kconfig is the core infrastructure used to configure the kernel. It wasn’t long before someone else (Randy Dunlap) posted a patch for the now maintainer-less Kconfig (Randy’s patch implements a sort method for config options).

[as an aside, as usual, I have pinged folks who might be looking for an opportunity to encourage them to consider stepping up to take this on].

Automatic NUMA balancing, movable memory, and more!

Mel Gorman posted “mm, numa: Fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa”. Modern Linux kernels include a feature known as automatic NUMA balancing, which relies upon marking regions of virtual memory as inaccessible via their page table entries (PTEs) and setting a special prot_numa protection hinting bit. The idea is that a later “NUMA hinting fault” on access to the page will allow the Operating System to determine whether it should migrate the page to another NUMA node. Pages are simply small granular units of system memory that are managed by the kernel in setting up translations from virtual to physical memory. When an access to a virtual address occurs, hardware (or, on some architectures, special software) “walkers” navigate the “page tables” pointed to by a special system register. The walker will traverse various “directories” formed from collections of pages in a hierarchical fashion intended to require less space to store page tables than if entries were required for every possible virtual address in a 32 or 64-bit space.

Contemporary microprocessors also support multiple page (granule) sizes, with a fundamental size (commonly 4K or 64K) being supplemented by the ability for larger pages (aka “hugepages”) to be used for very large regions of contiguous virtual memory at less overhead. Common sizes of huge pages are 2MB, 4MB, 512MB, and even multi-GB, with “contiguous hint bits” on some modern architectures allowing for even greater flexibility in the footprint of page table and TLB (Translation Lookaside Buffer) entries by only requiring physical entries for a fraction of a contiguous region. On Intel x86 Architecture, huge pages are implemented using the Page Size Extensions (PSE), which allow for a PMD (Page Middle Directory) to be replaced by an entry that effectively allocates the entire range to a single page entry. When a hardware walker sees this, a single TLB entry can be used for an entire range of a few MB instead of many 4K entries.
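To make the hierarchy a little more concrete, here is a minimal sketch of how a virtual address breaks down into per-level table indices. It assumes the common x86-64 arrangement of 4K base pages and four levels with 9 index bits each; the level names follow Linux conventions, while the helper names (split_vaddr, bytes_mapped_by) are made up for illustration and do not correspond to kernel functions.

```python
# Sketch only: split a virtual address into page table indices.
# Assumes 4 KiB base pages and four 9-bit levels (PGD, PUD, PMD, PTE).
PAGE_SHIFT = 12          # 4 KiB pages -> 12 offset bits
BITS_PER_LEVEL = 9       # 512 entries per table
LEVELS = ("pgd", "pud", "pmd", "pte")   # top level first


def split_vaddr(vaddr):
    """Return the index used at each level plus the offset within the page."""
    parts = {"offset": vaddr & ((1 << PAGE_SHIFT) - 1)}
    shift = PAGE_SHIFT
    # Start at the lowest level (PTE) and work up to the top level (PGD).
    for name in reversed(LEVELS):
        parts[name] = (vaddr >> shift) & ((1 << BITS_PER_LEVEL) - 1)
        shift += BITS_PER_LEVEL
    return parts


def bytes_mapped_by(level):
    """Bytes covered by a single entry at the given level."""
    levels_below = len(LEVELS) - 1 - LEVELS.index(level)
    return 1 << (PAGE_SHIFT + BITS_PER_LEVEL * levels_below)


parts = split_vaddr(0x00007F1234567ABC)
print(parts)
print(bytes_mapped_by("pmd") // (1024 * 1024), "MiB per PMD-level entry")
print(bytes_mapped_by("pmd") // bytes_mapped_by("pte"), "4K entries replaced")
```

Under these assumptions a single PMD-level entry covers 2 MiB, which is the same range that would otherwise need 512 separate 4K entries.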

A bug known as a “race condition” existed in the automatic NUMA hinting code, in which change_pmd_range would perform a number of checks without holding a lock, leaving it open to a race against a parallel protection update (which does happen under a lock) that would clear the PMD and fill it with a prot_numa entry. Mel adds a new pmd_none_or_trans_huge_or_clear_bad function that correctly handles this rare corner case sequence, and documents it (in mm/mprotect.c). Michal Hocko responded with “you will probably win the_longer_function_name_contest but I do not have [a] much better suggestion”.
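The general shape of this kind of bug can be shown with a toy model. This is not Mel’s code, and the Entry class, the state names, and the injected concurrent_update callback are all assumptions made purely for illustration. The racy version reads the shared entry several times, so a concurrent update can land between the checks; the other version reads it once and makes every decision on that snapshot.

```python
# Toy model of a check-then-act race on a shared entry. Not kernel code.
NONE, HUGE, TABLE = "none", "huge", "table"


class Entry:
    """Stands in for a PMD slot that another thread may rewrite."""

    def __init__(self, value):
        self.value = value


def racy_classify(entry, concurrent_update=lambda: None):
    # Each check re-reads the shared entry, so its value can change midway.
    if entry.value == HUGE:
        return "handle huge page"
    concurrent_update()           # a parallel updater runs here
    if entry.value == NONE:
        return "skip"
    if entry.value != TABLE:
        return "flag as bad"      # misfires on a freshly installed huge entry
    return "walk PTEs"


def snapshot_classify(entry, concurrent_update=lambda: None):
    # Read the shared entry exactly once; decide using the local copy only.
    val = entry.value
    concurrent_update()
    if val == HUGE:
        return "handle huge page"
    if val == NONE:
        return "skip"
    if val != TABLE:
        return "flag as bad"
    return "walk PTEs"


def demo(classify):
    entry = Entry(NONE)           # transiently cleared by the updater

    def install_huge():
        entry.value = HUGE        # updater then installs the huge entry

    return classify(entry, install_huge)


print(demo(racy_classify))
print(demo(snapshot_classify))
```

The snapshot may of course be stale by the time it is acted upon, so anything that goes on to modify the entry still has to take the lock and revalidate; the point is only that the checks themselves see one consistent value.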

Speaking of Michal Hocko, he posted version 2 of a patch series entitled “mm: make movable onlining suck less” in which he described the current status quo of “Movable onlining” as “a real hack with many downsides”. Linux divides memory into zones with names like ZONE_NORMAL (for regular system memory) and ZONE_MOVABLE (for memory made up entirely of pages that can be migrated or offlined, because they hold no unmovable system data, firmware data, or anything else that cannot be trivially moved).

The existing implementation has a number of constraints around which pages can be onlined, in particular around the relative placement of the memory being onlined vs the ZONE_NORMAL memory. This, Michal described as “mainly reintroduction of lowmem/highmem issues we used to have on 32b systems – but it is the only way to make the memory hotremove more reliable which is something that people are asking for”. His patch series aims to make “the onlining semantic more usable [especially when driven by udev]…it allows to online memory movable as long as it doesn’t clash with the existing ZONE_NORMAL. That means that ZONE_NORMAL and ZONE_MOVABLE cannot overlap”. He noted that he had discussed this patch series with Jérôme Glisse (author of the HMM – Heterogeneous Memory Management – patches), which were to be rebased on top of this patch series. Michal said he would assist with resolving any conflicts.

Igor Mammedov (Red Hat) noted that he had “given [the movable onlining] series some dumb testing” and had found three issues with it, which he described fully. In summary, these were “unable to online memblock as NORMAL adjacent to onlined MOVABLE”, “dimm1 assigned to node 1 on qemu CLI memblock is onlined as movable by default”, and “removable flag flipped to non-removable state”. Michal wasn’t initially able to reproduce the second issue (because he didn’t have ACPI_HOTPLUG_MEMORY enabled in his kernel) but was then able to follow up, noting that it was similar to another bug he had already fixed. Jérôme subsequently followed up with an updated HMM patchset as well.

Joonsoo Kim (LGE) posted version 7 of a patch series entitled “Introduce ZONE_CMA” in which he reworks the CMA (Contiguous Memory Allocator) used by Linux to manage large regions of physically contiguous memory that must be allocated (for device DMA buffers in cases where scatter gather DMA or an IOMMU are not available for managed translations). In the existing CMA implementation, physically contiguous pages are reserved at boot time, but they operate much as reserved memory that happens to fall within ZONE_NORMAL (but with a special “migratetype”, MIGRATE_CMA), and will not generally be used by the system for regular memory allocations unless there are no movable freepages available. In other words, only as a last possible resort.

This means that on a system with 1024MB of memory, kswapd “is mostly woke[n] up when roughly 512MB free memory is left”. The new patches instead create a distinct ZONE_CMA which has some special properties intended to address utilization issues with the existing implementation. As he notes, he had a lengthy discussion with Mel Gorman after the LSF/MM 2016 conference last year, in which Mel stated “I’m not going to outright NAK your series but I won’t ACK it either”. A lot of further discussion is anticipated. Michal Hocko might have summarized it best with, “the cover letter didn’t really help me to understand the basic concepts to have a good starting point before diving into the implementation details [to review the patches]”. Joonsoo followed up with an even longer set of answers to Michal.

A bug in synchronize_rcu_tasks()

Paul E. McKenney posted “There is a Tasks RCU stall warning” in which he noted that he and Steven Rostedt were seeing a stall that didn’t report until it had waited 10 minutes (and recommended that Steven try setting the kernel rcupdate.rcu_task_stall_timeout boot parameter). RCU (Read Copy Update) is a clever mechanism used by Linux (under a GPL license from IBM, who own a patent on the underlying technology) to perform lockless updates to certain types of data structure, by tracking versions of the structure and freeing the older version only once a “grace period” has elapsed. A grace period ends when every CPU in the system has passed through an RCU quiescent state (such as a context switch) and so can no longer hold a reference to the old version; synchronize_rcu() is the call that waits for this to happen.

Steven noted that for the issue under discussion there was a thread that “never goes to sleep, but will call cond_resched() periodically [a function that is intended to possibly call into the scheduler if there is work to be done there]”. On the RT (Real Time, “preempt-rt”) kernel, Steven noted that cond_resched() is a nop and that the code he had been working on should have made a call directly to the schedule() function. This led him to suggest that he had “found a bug in synchronize_rcu_tasks()” in the case that a task frequently calls schedule() but never actually performs a context switch. In that case, per Paul’s subsequent patch, the kernel is patched to specially handle calls to schedule() not due to regular preemption.

Ongoing Development

Anshuman Khandual posted “mm/madvise: Clean up MADV_SOFT_OFFLINE and MADV_HWPOISON” noting that “madvise_memory_failure() was misleading to accommodate handling of both memory_failure() as well as soft_offline_page() functions. Basically it handles memory error injection from user space which can go either way as memory failure or soft offline. Renamed as madvise_inject_error() instead.” The madvise infrastructure allows for coordination between kernel and userspace about how the latter intends to use regions of its virtual memory address space. Using this interface, it is possible for applications to provide hints as to their future usage patterns, relinquish memory that they no longer require, inject errors, and much more. This is particularly useful to KVM virtual machines, which appear as regular processes and can use madvise() to control their “RAM”.
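For a feel of the interface from the application side, here is a minimal sketch using the madvise() wrapper that Python exposes on mmap objects. It assumes Linux and Python 3.8 or later, only shows the “hint” and “relinquish” uses (not error injection, which needs privileges), and deliberately makes no claims about what the kernel does in response to each hint.

```python
import mmap

PAGE = mmap.PAGESIZE

# A private anonymous mapping: a stand-in for a chunk of a guest's "RAM".
region = mmap.mmap(-1, 4 * PAGE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
region[:5] = b"hello"

# Hint at a future usage pattern: we expect to read front to back.
if hasattr(mmap, "MADV_SEQUENTIAL"):
    region.madvise(mmap.MADV_SEQUENTIAL)

# Relinquish the first page: tell the kernel we no longer need its contents.
if hasattr(mmap, "MADV_DONTNEED"):
    region.madvise(mmap.MADV_DONTNEED, 0, PAGE)

# The mapping itself stays valid and can still be read after the advice.
first_bytes = region[:5]
print(len(region), len(first_bytes))
```

The hasattr() guards are there because the MADV_* constants are only defined on platforms that support them.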

Sricharan R (Codeaurora) posted version 11 of a patch series entitled “IOMMU probe deferral support”, which “calls the dma ops configuration for the devices at a generic place so that it works for all busses”.

Kishon Vijay Abraham sent a pull request to Greg K-H (Kroah Hartman) for Linux 4.12 that included individual patches in addition to the pull itself. This resulted in an interesting side discussion between Kishon and Lee Jones (Linaro) about how this was “a strange practice” Lee hadn’t seen before.

Thomas Garnier (Google) posted version 7 of a patch series entitled “syscalls: Restore address limit after a syscall” which “ensures a syscall does not return to user-mode with a kernel address limit. If that happened, a process can corrupt kernel-mode memory and elevate privileges”. Once again, he cites how this would have preemptively mitigated a Google Project Zero security bug.

Christopher Bostic posted version 6 of a patch series enabling support for the “Flexible Support Interface” (FSI) high fan out bus on IBM POWER systems.

Dan Williams (Intel) posted “x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions” in which he says “Before we rework the “pmem api” to stop abusing __copy_user_nocache() for memcpy_to_pmem() we need to fix cases where we may strand dirty data in the cpu cache.”

Leo Yan (Linaro) posted an RFC (Request For Comments) patch series entitled “coresight: support dump ETB RAM” which enables support for the Embedded Trace Buffer (ETB) on-chip storage of trace data. This is a small buffer (usually 2KB to 8KB) containing profiling data used for postmortem debug.

Thierry Escande posted “Google VPD sysfs driver”, which provides support for “accessing Google Vital Product Data (VPD) through the sysfs”.

Alex(ander) Graf posted version 6 of “kvm: better MWAIT emulation for guests”, which provides new capability information to user space in order for it to inform a KVM guest of the availability of native MWAIT instruction support. MWAIT allows a (guest) kernel to wake up a remote (v)CPU without an IPI – InterProcessor Interrupt – and the associated vmexit that would then occur to schedule the remote vCPU for execution. The availability of MWAIT is deliberately not provided in the normal CPUID bitmap since “most people will want to benefit from sleeping vCPUs to allow for over commit” (in other words with MWAIT support, one can arrange to keep virtual CPUs runnable for longer and this might impact the latency of hosting many tenants on the same machine).

David Woodhouse posted version 2 of his patch series entitled “PCI resource mmap cleanup” which “pursues my previous patch set all the way to its logical conclusion”, killing off “the legacy arch-provided pci_mmap_page_range() completely, along with its vile ‘address converted by pci_resource_to_user()’ API and the various bugs and other strange behavior that various architectures had”. He noted that to “accommodate the ARM64 maintainers’ desire *not* to support [the legacy] mmap through /proc/bus/pci I have separated HAVE_PCI_MMAP from the sysfs implementation”. This had previously been called out since older versions of DPDK were looking for the legacy API and failing as a result on newer ARM server platforms.

Darren Hart posted an RFC (Request For Comments) patch series entitled “WMI Enhancements” that seeks to clean up the “parallel efforts involving the Windows Management Instrumentation (WMI) and dependent/related drivers”. He wanted to have a “round of discussion among those of you that have been involved in this space before we decide on a direction”. The proposed direction is to “convert[] wmi into a platform device and a proper bus, providing devices for dependent drivers to bind to, and a mechanism for sibling devices to communicate with each other”. In particular, it includes a capability to expose WMI devices directly to userspace, which resulted in some pushback (from Pali Rohár) and a suggestion that some form of explicit whitelisting of WMI identifiers (GUIDs) should be used instead. Mario Limonciello (Dell) had many useful suggestions.

Wei Wang (Intel) posted version 9 of a patch series entitled “Extend virtio-balloon for fast (de)inflating & fast live migration” in which he “implements two optimizations”. The first “transfer[s] pages in chunks between the guest and host”. The second “transfer[s] the guest unused pages to the host so that they can be skipped in live migration”.

Dmitry Safonov posted “ARM32: Support mremap() for sigpage/vDSO” which allows CRIU (Checkpoint and Restart in Userspace) to complete its process of restoring all application VMA (Virtual Memory Area) mappings on restart by adding the ability to move the vDSO (Virtual Dynamic Shared Object) and sigpage kernel pages (data explicitly mapped into every process by the kernel to accelerate certain operations) into “the same place where they were before C/R”.

Matias Bjørling (Cnex Labs) prepared a git pull request for “LightNVM” targeting Linux 4.12. This is “a new host-side translation layer that implements support for exposing Open-Channel SSDs as block devices”.

Greg Thelen (Google) posted “slab: avoid IPIs when creating kmem caches”. Linux’s SLAB memory allocator (see also the paper on the original Solaris memory allocator) can be used to pre-allocate small caches of objects that can then be efficiently used by various kernel code. When these are allocated, per-cpu array caches are created, and a call is made to kick_all_cpus_sync() which will schedule all processors to run code to ensure that there are no stale references to the old array caches. This global call is performed using an IPI (InterProcessor Interrupt), which is relatively expensive, and is wasted effort in the case that a new cache is being created (and not replacing an old one). In the example given, an unpatched kernel generates 47,741 additional IPIs vs. 1,170 in a patched kernel.

One Day Delay Due to Boston Marathon

The Podcast is delayed until Wednesday evening this week. Usually, I try to get it out on a Monday night (or at least write it up then and actually post on Tuesday), but when holidays or other events fall on a Monday, I will generally delay the podcast by a day. This week, I was volunteering at the Marathon all of Monday, which means the prep is taking place Tuesday night instead.

Linux Kernel Podcast for 2017/04/11

Audio: http://traffic.libsyn.com/jcm/20170411.mp3

In this week’s edition: Linus Torvalds announces Linux 4.11-rc6, Intel Memory Bandwidth Allocation (MBA), Coherent Device Memory (CDM), Paravirtualized Remote TLB Flushing, kernel lockdown, the latest on Intel 5-level paging, and other assorted ongoing development activities.

Linus Torvalds announced Linux 4.11-rc6. In his mail, Linus notes that “Things are looking fairly normal [for this point in the development cycle]…The only slightly unusual thing is how the patches are spread out, with almost equal parts of arch updates, drivers, filesystems, networking and “misc”.” He ends “Go and get it”. Thorsten Leemhuis followed up with “Linux 4.11: Reported regressions as of Sunday, 2017-04-09”, his third regression report for 4.11, which “lists 15 regressions I’m currently aware of. 5 regressions mentioned in last week[‘]s report got fixed”. Most appear to be driver problems, but there is one relating to audit, and one in inet6_fill_ifaddr that is stalled waiting for “feedback from reporter”.

Stable kernels

Greg K-H (Kroah-Hartman) announced Linux kernels 4.4.60, 4.9.21, and 4.10.9.

Ben Hutchings announced Linux 3.2.88 and 3.16.43.

Jason A. Donenfeld pointed out that Linux 3.10 “is inexplicably missing crypto_memneq, making all crypto mac [Message Authentication Code] comparisons use non constant-time comparisons. Bad news bears” [presumably due to side channel attack]. Willy Tarreau followed up noting that he would “check if the 3.12 patches…can be safely backported”.
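For anyone wondering why the comparison routine matters, here is a minimal sketch of the difference. It is written in Python purely for illustration and is not the kernel’s crypto_memneq; the function names leaky_equal and constant_time_equal are made up. An early-exit comparison returns as soon as it finds a mismatch, so how long it takes hints at how many leading bytes of a forged MAC were correct, whereas a constant-time comparison always examines every byte.

```python
import hmac


def leaky_equal(a: bytes, b: bytes) -> bool:
    """Early-exit comparison: run time depends on the first differing byte."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False          # bails out early, leaking the position
    return True


def constant_time_equal(a: bytes, b: bytes) -> bool:
    """Accumulate differences so that every byte is always examined."""
    if len(a) != len(b):
        return False
    diff = 0
    for x, y in zip(a, b):
        diff |= x ^ y             # no data-dependent branch inside the loop
    return diff == 0


good = hmac.new(b"key", b"message", "sha256").digest()
forged = good[:-1] + bytes([good[-1] ^ 1])   # differs only in the last byte

print(leaky_equal(good, forged), constant_time_equal(good, forged))
# In real Python code, use the standard library helper instead:
print(hmac.compare_digest(good, forged))
```

An interpreted loop like this gives no real timing guarantees, which is why the last line reaches for hmac.compare_digest(); the hand-written version is only there to show the idea.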

Memory Bandwidth Allocation (Intel Resource Director Technology, RDT)

Vikas Shivappa (Intel) posted version 4 of a patch series entitled “x86/intel_rdt: Intel Memory bandwidth allocation”, addressing feedback from the previous iteration that he had received from Thomas Gleixner. The MBA (Memory Bandwidth Allocation) technology is described both in the kernel Documentation patch (provided) as well as in various Intel papers and materials available online. Intel provide a construct known as a “Class of Service” (CLOS) on certain contemporary Xeon processors, as part of their CAT (Cache Allocation Technology) feature, which is itself part of a larger family of technologies known as “Intel Resource Director Technology” (RDT). These CLOSes “act as a resource control tag into which a thread/app/VM/container can be grouped”.

It appears that a feature of Intel’s L3 cache (LLC in Intel-speak) in these parts is that they can not only assign specific proportions of the L3 cache slices on the Xeon’s ring interconnect to specific resources (e.g. “tasks” – otherwise known as processes, or applications) but also can control the amount of memory bandwidth granted to these. This is easier than it sounds. From a technical perspective, Intel integrate their memory controller onto their dies, and contemporary memory controllers already perform fine grained scheduling (this is how they bias memory reads for speculative loads of the instruction stream in among the other traffic, as just one simple example). Therefore, exposing memory bandwidth control to the cache slices isn’t all that more complex. But it is cute, and looks great in marketing materials.

Coherent Device Memory (CDM) on top of HMM

Jérôme Glisse posted an RFC [Request for Comments] patch series entitled “Coherent Device Memory (CDM) on top of HMM”. His previous HMM (Heterogeneous Memory Management) patch series, now in version 19, implemented support for (non-coherent) device memory to be mapped into regular process address space, by leveraging the ability for certain contemporary devices to fault on access to untranslated addresses managed in device page tables, thus allowing for a kind of pageable device memory and transparent management of ownership of the memory pages between application processor cores and (e.g.) a GPU or other acceleration device. The latest patch series builds upon HMM to also support coherent device memory (via a new ZONE_DEVICE memory – see also the recent postings from IBM in this area). As Jérôme notes, “Unlike the unaddressable memory type added with HMM patchset, the CDM [Coherent Device Memory] type can be access[ed] by [the] CPU.” He notes that he wanted to kick off this RFC more for the conversation it might provoke.

In his mail, Jérôme says, “My personal belief is that the hierarchy of memory is getting deeper (DDR, HBM stack memory, persistent memory, device memory, …) and it may make sense to try to mirror this complexity within mm concept. Generalizing the NUMA abstraction is probably the best starting point for this. I know there are strong feelings against changing NUMA so i believe now is the time to pick a direction”. He’s right of course. There have been a number of patch series recently also targeting accelerators (such as FPGAs), and more can be anticipated for coherently attached devices in the future. [This author is personally involved in CCIX]

Hyper-V: Paravirtualized Remote TLB Flushing and Hypercall Improvements

Vitaly Kuznetsov (Red Hat) posted “Hyper-V: paravirtualized remote TLB flushing and hypercall improvements”. It turns out that Microsoft’s Hyper-V hypervisor supports hypercalls (calls into the hypervisor from the guest OS) for “doing local and remote TLB [Translation Lookaside Buffer] flushing”. Translation Lookaside Buffers [TLBs] are caches built into microprocessors that store a translation of a CPU virtual address to “physical” (or, for a virtual machine, into an intermediate hypervisor) address. They save an unnecessary page table walk (of the software managed hardware/software structure – depending upon architecture – that “walkers” navigate to perform a translation during a “page fault” or unhandled memory access, such as happens constantly when demand loading/faulting in application code and data, or sharing read-only data provided by shared libraries, etc.). TLBs are generally transparent to the OS, except that they must be explicitly managed under certain conditions – such as when invalidating regions of virtual memory or performing certain context switches (depending upon the provisioning of address and virtual memory space tag IDs in the architecture).

TLB invalidates on local processor cores normally use special CPU instructions, and this is certainly also true under virtualization. But virtual addresses used by a particular process (known as a task within the kernel) might be also used by other cores that have touched the same virtual memory space. And those translations need to be invalidated too. Some architectures include sophisticated hardware broadcast invalidation of TLBs, but some other legacy architectures don’t provide these kinds of capabilities. On those architectures that don’t provide for a hardware broadcast, it is typically necessary to use a construct known as an IPI (Inter Processor Interrupt) to cause an IRQ (interrupt message) to be delivered to the remote interrupt controller CPU interface (e.g. LAPIC on Intel x86 architecture) of the destination core, which will run an IPI handler in response that does the TLB teardown.

As Vitaly notes, nobody is recommending doing a local TLB flush using a hypercall, but there can be a significant performance improvement in using a hypercall for the remote invalidates. In the example cited, which uses “a special ‘TLB trasher’”, he demonstrates how a 16 vCPU guest experienced a greater than 25% performance improvement using the hypercall approach.

Ongoing Development

David Howells posted a magnum opus entitled “Kernel lockdown”, which aims to “provide a facility by which a variety of avenues by which userspace can feasibly modify the running kernel image can be locked down”. As he says, “The lock-down can be configured to be triggered by the EFI secure boot status, provided the shim isn’t insecure. The lock-down can be lifted by typing SysRq+x on a keyboard attached to the system” [physical presence]. Among many other things, these patches (versions of which have been in distribution kernels for a while) change kernel behavior to include “No unsigned modules and no modules for which [we] can’t validate the signature”, disable many hardware access functions, turn off hibernation, prevent kexec_load(), and limit some debugging features. Justin Forbes of the Fedora Project noted that he had (obviously) tested these. One of the many interesting sets of patches included a feature to “Annotate hardware config module parameters” which allows modules to mark unsafe options. Following some pushback, David also followed up with a rationale for doing kernel lockdown, entitled “Why kernel lockdown?”. Worth reading.

Kirill A. Shutemov posted “x86: 5-level paging enabling for v4.12, Part 4”, in which he (bravely) took Ingo’s request to “rewrite assembly parts of boot process into C before bringing 5-level paging support”. He says, “The only part where I succeed is startup_64 in arch/x86/kernel/head_64.S. Most of the logic is now in C.” He also renames the level 4 page tables “init_level4_pgt” and “early_level4_pgt” to “init_top_pgt” and “early_top_pgt”. There was another lengthy discussion around his “Allow to have userspace mappings above 47-bits”, a patch which tells the kernel to prefer to do memory allocations below 47-bits (the previous “Canonical Addressing” limit of Intel x86 processors, which some JITs and other code exploit by abusing the top bits of the address space in pointers for illegal tags, breaking compatibility with an extended virtual address space). The patch allows mmap calls with MAP_FIXED hints to cause larger allocations. There was some concern that larger VM space is ABI and must be handled with care. A footnote here is that (apparently, from the patch) Intel MPX (Memory Protection Extension) doesn’t yet work with LA57 (the larger address space feature) and so Kirill avoids both in the same process.
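The compatibility worry is easier to see with some numbers. The sketch below assumes 48 implemented virtual address bits for 4-level paging and 57 for 5-level paging (LA57), and models a hypothetical JIT that stashes a tag in bits 47 and above of a user pointer; tag_pointer and untag_pointer are invented names, not anything taken from a real JIT.

```python
def user_space_limit(va_bits):
    """Size in bytes of the lower (user) half of the address space."""
    return 1 << (va_bits - 1)


def is_canonical(addr, va_bits):
    """True if bits 63 down to va_bits-1 are all zeros or all ones."""
    top = addr >> (va_bits - 1)
    return top == 0 or top == (1 << (64 - va_bits + 1)) - 1


TAG_SHIFT = 47   # hypothetical: first bit a 47-bit-only program thinks is free


def tag_pointer(addr, tag):
    """Hide a small tag in the 'unused' upper bits of a user pointer."""
    return addr | (tag << TAG_SHIFT)


def untag_pointer(ptr):
    """Recover the address by masking off everything from bit 47 upward."""
    return ptr & ((1 << TAG_SHIFT) - 1)


TIB = 1 << 40
print(user_space_limit(48) // TIB, "TiB of user space with 4-level paging")
print(user_space_limit(57) // TIB, "TiB of user space with 5-level paging")

low = 0x00007F0000001000     # fits below 47 bits
high = 1 << 50               # only reachable with the larger address space
print(hex(untag_pointer(tag_pointer(low, 5))))    # round-trips intact
print(hex(untag_pointer(tag_pointer(high, 5))))   # address bits are lost
```

A program that plays this trick keeps working only as long as the kernel never hands it an address with bit 47 or above set, which is presumably why the patch prefers low addresses unless a larger mapping is explicitly requested.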

Christopher Bostic posted version 5 of a patch series entitled “FSI driver implementation”. This is support for the POWER’s [Performance Optimization With Enhanced RISC, for those who ever wondered – this author used to have a lot of interest in PowerPC back in the day] “Flexible Support Interface” (FSI), a “high fan out serial bus” whose specification seems to have appeared on the OpenPower Foundation website recently also.

Kishon Vijay Abraham posted “PCI: Support for configurable PCI endpoint”, which Bjorn finally pulled into his tree in anticipation of the upcoming 4.12 merge cycle. For those who haven’t seen Kishon’s awesome presentation “Overview of PCI(e) Subsystem” for Embedded Linux Conference Europe, you are encouraged to watch it at least several times. He really knows his stuff, and has done an excellent job producing a high quality generic PCIe endpoint driver for Linux: https://www.youtube.com/watch?v=uccPR6X8vy8

Ard Biesheuvel posted “EFI fixes for v4.11”, which among other goodies includes a fix for EFI GOP (Graphics Output Protocol) support on systems built using the 64-bit ARM Architecture, which uses firmware assignment of PCIe BAR resources. Ard and Alex Graf have done some really fun work with graphics cards on 64-bit ARM lately – including emulating x86 option ROMs. Ard also had some fixes prepared for v4.12 that he announced, including a bunch of cleanup to the handling of FDT (Flattened Device Tree) memory allocation. Finally, he added support for the kernel’s “quiet” command line option, to remove extraneous output from the EFI stub on boot.

Srikar Dronamraju and Michal Hocko had a back and forth on the former’s “sched: Fix numabalancing to work with isolated cpus” patch, which does what it says on the tin. Michal was a little concerned that NUMA balancing wasn’t automatically applied even to isolated CPUs, but others (including Peter Zijlstra) noted that this absolutely is the intended behavior.

Ying Huang (Intel) posted version 8 of his “THP swap: Delay splitting THP during swapping out”, which essentially allows paging of (certain) huge pages. He also posted version 2 of “mm, swap: Sort swap entries before free”, which sorts consecutive swap entries in a per-CPU buffer into order according to their backing swap device before freeing those entries. This reduces needless acquiring/releasing of locks and improves performance.
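A back-of-the-envelope model shows why sorting helps. This is a sketch under the assumption that each swap device has its own lock and that the freeing loop holds a device’s lock across any run of consecutive entries for that device; the device names and offsets are invented, and the function is not taken from the patch.

```python
from itertools import groupby


def lock_acquisitions(entries):
    """Count lock/unlock pairs when a device's lock is held across each run
    of consecutive entries that belong to the same device."""
    return sum(1 for _ in groupby(entries, key=lambda entry: entry[0]))


# (device, offset) pairs as they might pile up in a per-CPU buffer.
batch = [("swapA", 7), ("swapB", 3), ("swapA", 8), ("swapB", 4), ("swapA", 9)]

unsorted_count = lock_acquisitions(batch)
sorted_count = lock_acquisitions(sorted(batch))

print("interleaved:", unsorted_count, "lock acquisitions")
print("sorted:     ", sorted_count, "lock acquisitions")
```

With entries from two devices interleaved, every entry forces a lock switch; once sorted, each device’s lock is taken just once per batch.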

Will Deacon posted version 2 of a patch series entitled “drivers/perf: Add support for ARMv8.2 Statistical Profiling Extension”. The “SPE” (Statistical Profiling Extension) “can be used to profile a population of operations in the CPU pipeline after instruction decode. These are either architected instructions (i.e. a dynamic instruction trace) or CPU-specific uops and the choice is fixed statically in the hardware and advertised to userspace via caps. Sampling is controlled using a sampling interval, similar to a regular PMU counter, but also with an optional random perturbation”. He notes that the “in-memory buffer is linear and virtually addressed, raising an interrupt when it fills up” [which makes using it nice for software folks].

Binoy Jayan posted “IV [Initialization Vector] Generation algorithms for dm-crypt”, the goal of which “is to move these algorithms from the dm layer to the kernel crypto layer by implementing them as template ciphers”.

Joerg Roedel posted “PCI: Add ATS-disable quirk for AMD Stoney GPUs”. Then, he posted a followup with a minor fix based upon feedback. This should close the issue of certain bug reports posted by those using an IOMMU on a Stoney platform and seeing lockups under high TLB invalidation.

Bjorn Helgaas posted “PCI fixes for v4.11”, which includes “fix ThunderX legacy firmware resources”, a PCI quirk for certain ARM server platforms.

Paul Menzel reported “`pci_apply_final_quirks()` taking half a second”, which David Woodhouse (who wrote the code to match PCIe devices against the quirk list “back in the mists of time”) posited was perhaps down to “spending a fair amount of time just attempting to match each device against the list”. He wondered “if it’s worth sorting the list by vendor ID or something, at least for the common case of the quirks which match on vendor/device”. There was a general consensus that cleanup would be nice, if only someone had the time and the inclination to take a poke at it.
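As a rough illustration of David’s suggestion, the following Python sketch contrasts scanning a whole quirk list for every device with first grouping the quirks by vendor ID. It is not the kernel’s implementation: the quirk names, the IDs, and the use of `None` as a wildcard (standing in for the kernel’s “match any” ID) are assumptions made for the example.

```python
from collections import defaultdict

ANY = None  # wildcard for this sketch only

# Hypothetical quirk table: (vendor, device, quirk name), in registration order.
QUIRKS = [
    (0x8086, 0x1234, "quirk_a"),
    (0x1022, ANY, "quirk_b"),
    (ANY, ANY, "quirk_c"),
    (0x8086, 0x5678, "quirk_d"),
]

def match_linear(quirks, vendor, device):
    """Check every quirk against the device, as a plain list walk would."""
    return [name for v, d, name in quirks
            if v in (ANY, vendor) and d in (ANY, device)]

def build_index(quirks):
    """Group quirks by vendor, remembering each quirk's original position."""
    index = defaultdict(list)
    for position, (vendor, device, name) in enumerate(quirks):
        index[vendor].append((position, device, name))
    return index

def match_indexed(index, vendor, device):
    """Only look at quirks for this vendor plus the wildcard-vendor quirks,
    then restore registration order so quirks run in the same sequence."""
    candidates = index.get(vendor, []) + index.get(ANY, [])
    hits = [(pos, name) for pos, dev, name in candidates if dev in (ANY, device)]
    return [name for pos, name in sorted(hits)]

index = build_index(QUIRKS)
print(match_linear(QUIRKS, 0x8086, 0x5678))
print(match_indexed(index, 0x8086, 0x5678))
```

Keeping the original position alongside each entry is one way to preserve the order in which quirks are applied, which matters if any quirk depends on another having run first.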

Seth Forshee (Canonical) posted “audit regressions in 4.11”, in which he noted that ever since the merging of “audit: fix auditd/kernel connection state tracking”, the kernel will now indefinitely queue up audit messages for delivery to the (userspace) audit daemon if it is not running – ultimately crashing the machine. Paul Moore thanked him for the report and there was a back and forth on the best way to handle the case of no audit daemon running.

Neil Brown posted a patch entitled “NFS: fix usage of mempools”. As he notes in his patch, “When passed GFP [Get Free Page] flags that allow sleeping (such as GFP_NOIO), mempool_alloc() will never return NULL, it will wait until memory is available…This means that we don’t need to handle failure, but that we do need to ensure one thread doesn’t call mempool_alloc twice on the one pool without queuing or freeing the first allocation”. He then cites “pnfs_generic_alloc_ds_commits” as an unsafe function and provides a fix.

Finally, Kees Cook followed up (as he had promised) on a discussion from last week, with an RFC (Request for Comments) patch series entitled “mm: Tighten x86 /dev/mem with zeroing”, including the suggestion from Linus that reads from /dev/mem that aren’t permitted simply return zero data. This was just one of many security discussions he was involved in (as usual). Another involved a patch he had suggested, posted by Eddie Kovsky and entitled “module: verify address is read-only”, which modifies kernel functions that use modules to verify that they are in the correct kernel ro_after_init memory area and “reject structures not marked ro_after_init”.

Linux Kernel Podcast for 2017/04/04

Audio: http://traffic.libsyn.com/jcm/20170404v2.mp3

Linus Torvalds announces Linux 4.11-rc5, Donald Drumpf drains the maintainer swamp in April, Intel FPGA Device Drivers, FPU state caching, /dev/mem access crashing machines, and assorted ongoing development.

Linus Torvalds announced Linux 4.11-rc5. In his announcement mail, Linus notes that “things have definitely started to calm down, let’s hope it stays this way and it wasn’t just a fluke this week”. He calls out the oddity that “half the arch updates are to parisc” due to parisc user copy fixes.

It’s worth noting that rc5 includes a fix for virtio_pci which removes an “out of bounds access for msix_names” (the “name strings for interrupts” provided in the virtio_pci_device structure). According to Jason Wang (Red Hat), “Fedora has received multiple reports of crashes when running 4.11 as a guest” (in fact, your author has seen this one too). Quoting Jason, “The crashes are not always consistent but they are generally some flavor of oops or GPF [General Protection Fault – Intel x86 term referring to the general case of an access violation into memory by an offending instruction in various different ISAs – Instruction Set Architectures] in virtio related code. Multiple people have done bisections (Thank you Thorsten Leemhuis and Richard W.M. Jones)”. An example rediscovery of this issue came from a Mellanox engineer who reported that their test and regression VMs were crashing occasionally with 4.11 kernels.

Announcements

Sebastian Andrzej Siewior announced preempt-rt Linux version 4.9.20-rt16. This includes a “Re-write of the R/W semaphores code. In RT we did not allow multiple readers because a writer blocking on the semaphore would have [to] deal with all the readers in terms of priority or budget inheritance [by which he is referring to the Priority Inheritance or “PI” feature common to “real time” kernels]. It’s obvious that the single reader restriction has severe performance problems for situations with heavy reader contention.” He notes that CPU hotplug got “better but can deadlock”.

Greg Kroah-Hartman posted Linux stable kernels 4.4.59, 4.9.20, and 4.10.8.

Draining the Swamp (in April)

Donald Drumpf (trump.kremlin.gov@gmail.com) posted “MAINTAINERS: Drain the swamp”, an inspired patch aiming to finally address the problem of having “a small group of elites listed in the corrupt MAINTAINERS file” who, “For too long” have “reaped the rewards of maintainership”. He notes that over the past year the world has seen a great Linux Exit (“Lexit”) movement in which “People all of the Internet have come together and demanded that power be restored to the developers”, creating “a historic fork based on Linux 2.4, back to a better time, before Linux was controlled by corporate interests”. He notes that the “FAKE NEWS site LWN.net said it wouldn’t happen, but we knew better”.

Donald says that all of the groundwork laid over the past year was just an “important first step”. And that “now, we are taking back what’s rightfully ours. We are transferring power from “Lyin’ Linus” and giving it back to you, the people. With the below patch, the job-killing MAINTAINERS file is finally being ROLLED BACK.” He also notes his intention to return “LAW and ORDER” to the Linux kernel repository by building a wall around kernel.org and “THE LINUX FOUNDATION IS GOING TO PAY FOR IT”. Additional changes will include the repeal and replacement of the “bloated merge window”, the introduction of a distribution import tax, and other key innovations that will serve to improve the world and to MAKE LINUX GREAT AGAIN!

Everyone around the world immediately and enthusiastically leaped upon this inspired and life altering patch, which was of course perfect from the moment of its inception. It was then immediately merged without so much as a dissenting voice (or any review). The private email servers used to host Linus’s deleted patch emails were investigated and a special administrator appointed to investigate the investigators.

Intel FPGA Device Drivers

Wu Hao (Intel) posted a sixteen part patch series entitled “Intel FPGA Drivers”, which “provides interfaces for userspace applications to configure, enumerate, open, and access FPGA [Field Programmable Gate Arrays, flexible logic fabrics containing millions of gates that can be connected programmatically by bitstreams describing the intended configuration] accelerators on platforms equipped with Intel(R) FPGA solutions and enables system level management functions such as FPGA partial reconfiguration [the dynamic updating of partial regions of the FPGA fabric with new logic], power management, and virtualization”. This support differs from the existing in-kernel fpga-mgr from Alan Tull in that it seems to relate to the so-called Xeon-FPGA hybrid designs that Intel have presented on in various forums.

The first patch (01/16) provides a lengthy summary of their proposed design in the form of documentation that is added to the kernel’s Documentation directory, specifically in the file Documentation/fpga/intel-fpga.txt. It notes that “From the OS’s point of view, the FPGA hardware appears as a regular PCIe device. The FPGA device memory is organized using a predefined structure [Device Feature List]. Features supported by the particular FPGA device are exposed through these data structures”. An FME (FPGA Management Engine) is provided which “performs power and thermal management, error reporting, reconfiguration, performance reporting, and other infrastructure functions. Each FPGA has one FME, which is always accessed through the physical function (PF)”. The FPGA also provides a series of Virtual Functions that can be individually mapped into virtual machines using SR-IOV.

This design allows a CPU attached using PCIe to communicate with various Accelerated Function Units (AFUs) contained within the FPGA, and which are individually assignable into VMs or used in aggregate by the host CPU. One presumes that a series of userspace management utilities will follow this posting. It’s actually quite nice to see how they implemented the discovery of individual AFU features, since this is very close to something a certain author has proposed for use elsewhere for similar purposes. It’s always nicely validating to see different groups having similar thoughts.

Copy Offload with Peer-to-Peer PCI Memory

Logan Gunthorpe posted an RFC (Request for Comments) patch series entitled “Copy Offload with Peer-to-Peer PCI Memory” which relates to work discussed at the recent LSF/MM (Linux Storage Filesystem and Memory Management) conference, in Cambridge MA (side note: I did find some of you haha!). To quote Logan, “The concept here is to use memory that’s exposed on a PCI BAR [Base Address Register – a configuration register that tells the device where in the physical memory map of a system to place memory owned by the device, under the control of the Operating System or the platform firmware, or both] as data buffers in the NVMe target code such that data can be transferred from an RDMA NIC to the special memory and then directly to an NVMe device avoiding system memory entirely”. He notes a number of positives from this, including better QoS (Quality of Service), and a need for fewer (relatively still quite precious even in 2017) PCIe lanes from the CPU into a PCIe switch placed downstream of its Root Complex on which peer-to-peer PCIe devices talk to one another without the intervening step of hopping through the Root Complex and into the system memory via the CPU. As a consequence, Logan has focused his work on “cases where the NIC, NVMe devices and memory are all behind the same PCI switch”.

To facilitate this new feature, Logan has a second patch in the series, entitled “Introduce Peer-to-Peer memory (p2mem) device”, which supports partitioning and management of memory used in direct peer-to-peer transfers between two PCIe devices (endpoints, or “cards”) with a BAR that “points to regular memory”. As Logan notes, “Depending on hardware, this may reduce the bandwidth of the transfer but could significantly reduce pressure on system memory” (again by not hopping up through the PCIe topology). In his patch, Logan had also noted that “older PCI root complexes” might have problems with peer-to-peer memory operations, so he had decided to limit the feature to be only available for devices behind the same PCIe switch. This led to a back and forth with Sinan Kaya who asked (rhetorically) “What is so special about being connected to the same switch?”. Sinan noted that there are plenty of ways in Linux to handle blacklisting known older bad hardware and platforms, such as requiring that the DMI/SMBIOS-provided BIOS date of manufacture of the system be greater than a certain date in combination with all devices exposing the p2p capability and a fallback blacklist. Ultimately, however, it was discovered that the peer-to-peer feature isn’t enabled by default, leading Sinan to suggest “Push the decision all the way to the user. Let them decide whether they want this feature to work on a root port connected port or under the switch”.

FPU state caching

Kees Cook (Google) posted a patch entitled “x86/fpu: move FPU state into separate cache”, which aims to remove the dependency within the Intel x86 Architecture port upon an internal kernel config setting known as ARCH_WANTS_DYNAMIC_TASK_STRUCT. This configuration setting (set by each architecture’s code automatically, not by the person building the kernel in the configuration file) says that the true size of the task_struct cannot be known in advance on Intel x86 Architecture because it contains a variable sized array (VSA) within the thread_struct that is at the end of the task_struct to support context save/restore of the CPU’s FPU (Floating Point Unit) co-processor. Indeed, the kernel definition of task_struct (see include/linux/sched.h) includes a scary and ominous warning “on x86, ‘thread_struct’ contains a variable-sized structure. It *MUST* be at the end of ‘task_struct’”. Which is fairly explicit.

The reason to remove the dependency upon dynamic task_struct sizing is because this “support[s] future structure layout randomization of the task_struct”, which requires that “none of the structure fields are allowed to have a specific position or a dynamic size”. The idea is to leverage a GCC (GNU Compiler Collection) plugin that will change the ordering of C structure members (such as task_struct) randomly at compile time, in order to reduce the ability for an attacker to guess the layout of the structure (highly useful in various exploits). In the case of distribution kernels of course, an attacker has access to the same kernel binaries that may be running on a system, and could use those to calculate likely structure layout for use in a compromise. But the same is not true of the big hyperscale service providers like Google and Facebook. They don’t have to publish the binaries for their own internal kernels running on their public infrastructure servers.

This patch led to a back and forth with Linus, who was concerned about why the task_struct would need changing in order to prevent the GCC struct layout randomization plugin from blowing up. In particular, he was worried that it sounded like the plugin was moving variable sized arrays from the last member of structures (not legally permitted). Kees, Linus, and Andy Lutomirski went through the fact that, yes, the plugin can handle trailing VSAs and so forth. In the end, it was suggested that Kees look at making task_struct “be something that contains a fixed beginning and end, and just have an unnamed randomized part in the middle”. Kees said “That could work. I’ll play around with it”.
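The “fixed beginning and end, randomized middle” idea can be sketched in a few lines of Python. This is only a model of the concept, not the GCC plugin: the field names and sizes are invented for illustration, alignment and padding are ignored, and the seed stands in for whatever per-build randomness the plugin uses.

```python
import random

# Illustrative (name, size in bytes) pairs; NOT the real task_struct layout.
FIELDS = [
    ("thread_info", 16),  # assumed to need a fixed position at the start
    ("state", 8),
    ("stack", 8),
    ("pid", 4),
    ("comm", 16),
    ("thread", 0),        # trailing variable-sized member, must stay last
]

def randomized_layout(fields, seed, fixed_head=1, fixed_tail=1):
    """Shuffle only the middle members, keeping the first `fixed_head` and
    last `fixed_tail` members where they are, and return name -> offset."""
    split = len(fields) - fixed_tail
    head = list(fields[:fixed_head])
    middle = list(fields[fixed_head:split])
    tail = list(fields[split:])
    random.Random(seed).shuffle(middle)  # same seed gives the same layout
    offsets, offset = {}, 0
    for name, size in head + middle + tail:
        offsets[name] = offset
        offset += size
    return offsets

layout = randomized_layout(FIELDS, seed=2017)
for name, offset in sorted(layout.items(), key=lambda item: item[1]):
    print(f"{offset:3d}  {name}")
```

Whatever the seed, the first member stays at offset zero and the variable-sized member stays at the end, while the offsets an attacker would most like to guess move around between builds.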

/dev/mem access crashing machines

Dave Jones (x86info maintainer) had a back and forth with Kees Cook, Linus, and Tommi Rantala about the latter’s discovery that running Dave’s “x86info” tool crashed his machine with an illegal memory access. It turns out that x86info reads /dev/mem (a requirement to get the data it needs), which is a special file representing the contents of physical memory. Normally, when access is granted to this file, it is restricted to the root user, and then only certain parts of memory as determined by STRICT_DEVMEM. The latter is intended only to allow reads of “reserved RAM” (normal system memory reserved for specific device purposes, not that allocated for use by programs). But in Tommi’s case, he was running a kernel that didn’t have STRICT_DEVMEM set on a system booting with EFI for which the legacy “EBDA” (Extended BIOS Data Area) that normally lives at a fixed location in the sub-1MB memory window on x86 was not provided by the platform. This meant that the x86info tool was trying to read memory that was a legal address but which wasn’t reserved in the EFI System Table (memory map), and was mapped for use elsewhere.

All of this led Linus to point out that simply doing a “dd” read on the first MB of the memory on the offending system would be enough to crash it. He noted that (on x86 systems) the kernel allows access to the sub-1MB region of physical memory unconditionally (regardless of the setting of the kernel STRICT_DEVMEM option) because of the wealth of platform data that lives there and which is expected to be read by various tools. He proposed effectively changing the logic for this region such that memory not explicitly marked as reserved would simply “just read zero” rather than trying to read random kernel data in the case that the memory is used for other purposes.
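The proposed “just read zero” behavior can be modeled with a small Python simulation. This is a sketch of the policy only, not of the kernel’s /dev/mem code: the `read_low_memory` helper, the page-granular `reserved_pages` set, and the 4KB page size are assumptions made for the example.

```python
PAGE_SIZE = 4096
LOW_MEM_LIMIT = 1 << 20  # the sub-1MB window discussed above

def read_low_memory(memory, reserved_pages, offset, length):
    """Simulate a read from the low 1MB: bytes in pages marked reserved are
    returned as-is, bytes in any other page read back as zero."""
    out = bytearray()
    end = min(offset + length, LOW_MEM_LIMIT, len(memory))
    pos = offset
    while pos < end:
        page = pos // PAGE_SIZE
        chunk_end = min(end, (page + 1) * PAGE_SIZE)
        if page in reserved_pages:
            out += memory[pos:chunk_end]
        else:
            # Not reserved: hand back zeros instead of touching the memory.
            out += bytes(chunk_end - pos)
        pos = chunk_end
    return bytes(out)

# Three pages of fake "physical memory"; only page 1 is marked reserved.
memory = bytes([0xAA]) * (3 * PAGE_SIZE)
data = read_low_memory(memory, {1}, PAGE_SIZE - 2, 4)
print(data.hex())
```

A tool that scans the whole window then sees zeros wherever the platform did not reserve the memory, rather than triggering an access to pages that are in use for something else.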

This author certainly welcomes a day when /dev/mem dies a death. We’ve gone to great lengths on 64-bit ARM systems to kill it, in part because it is so legacy, but in another part because there are two possible ways we might trap a bad access – one as in this case (synchronous exception) but another in which the access might manifest as a System Error due to hitting in the memory controller or other SoC logic later as an errant access.

Ongoing Development

Steve Longerbeam posted version 6 of a patch series entitled “i.MX Media Driver”, which implements a V4L2 (Video for Linux 2) driver for i.MX6.

David Gstir (on behalf of Daniel Walter) posted “fscrypt: Add support for AES-128-CBC” which “adds support for using AES-128-CBC for file contents and AES-128-CBC-CTS for file name encryption. To mitigate watermarking attacks, IVs [Initialization Vectors] are generated using the ESSIV algorithm.”

Djalal Harouni posted an RFC (Request for Comments) patch entitled “proc: support multiple separate proc instances per pidnamespace”. In his patch, Djalal notes that “Historically procfs was tied to pid namespaces, and mount options were propagated to all other procfs instances in the same pid namespace. This solved several use cases in that time. However today we face new problems, there are multiple container implementations there, some of them want to hide pid entries, others want to hide non-pid entries, others want to have sysctlfs, others want to share pid namespace with private procfs mounts. All these with current implementation won’t work since all options will be propagated to all procfs mounts. This series allow to have new instances of procfs per pid namespace where each instance can have its own mount option”.

Zhou Chengming (Huawei) posted “reduce the time of finding symbols for module” which aims to reduce the time taken for the Kernel Live Patch (klp) module to be loaded on a system in which the module uses many static local variables. The patch replaces the use of kallsyms_on_each_symbol with a variant that limits the search to those needed for the module (rather than every symbol in the kernel). As Jessica Yu notes, “it means that you have a lot of relocation records with reference your out-of-tree module. Then for each such entry klp_resolve_symbol() is called and then klp_find_object_symbol() to actually resolve it. So if you have 20k entries, you walk through vmlinux kallsyms table 20k times…But if there were 20k modules loaded, the problem would still be there”. She would like to see a more generic fix, but was also interested to see that the Huawei report referenced live patching support for AArch64 (64-bit ARM Architecture), which isn’t in upstream. She had a number of questions about whether this code was public, and in what form, to which links to works in progress from several years ago were posted. It appears that Huawei have been maintaining an internal version of these in their kernels ever since.
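The cost Jessica describes is easy to see with a toy symbol table. The Python sketch below counts how many table entries are visited when each wanted symbol triggers its own walk, versus building a lookup table once. Note that the one-time index is a swapped-in technique chosen for illustration: it is neither what the posted patch does (which narrows the search to the module’s symbols) nor the “more generic fix” Jessica asked for, and the symbol names and addresses are made up.

```python
def resolve_by_walking(symbol_table, wanted):
    """Resolve each wanted name with its own walk over the whole table."""
    visits = 0
    resolved = {}
    for name in wanted:
        for sym_name, addr in symbol_table:
            visits += 1
            if sym_name == name:
                resolved[name] = addr
                break
    return resolved, visits

def resolve_with_index(symbol_table, wanted):
    """Walk the table once to build a name -> address map, then look up."""
    visits = 0
    index = {}
    for sym_name, addr in symbol_table:
        visits += 1
        index.setdefault(sym_name, addr)  # keep the first match for a name
    return {name: index[name] for name in wanted}, visits

# Hypothetical table of 1000 symbols; the ten wanted ones sit at the end.
table = [(f"sym{i}", 0x1000 + 16 * i) for i in range(1000)]
wanted = [f"sym{i}" for i in range(990, 1000)]

walked, walk_visits = resolve_by_walking(table, wanted)
indexed, index_visits = resolve_with_index(table, wanted)
print(walk_visits, index_visits)
```

Scaled up to 20k relocation entries against a full vmlinux symbol table, the repeated walk grows with the product of the two sizes, which is why module load time suffers.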

Ying Huang (Intel) posted version 7 of “THP swap: Delay splitting THP during swapping out”, which as we previously noted aims to swap out actual whole “huge” (within certain limits) pages rather than splitting them down to the smallest atom of size supported by the architecture during swap. There was a specific request to various maintainers that they review the patch.

Andi Kleen posted a patch removing the printing of MCEs to the kernel log when the “mcelog” daemon is running (and hopefully logging these events).

Laura Abbott posted a RESEND of “config: Add Fedora config fragments”, which does what it says on the tin. Quoting her mail, “Fedora is a popular distribution for people who like to build their own kernels. To make this easier, add a set of reasonable common config options for Fedora”. She adds files in kernel/configs for “fedora-core.config”, “fedora-fs.config” and “fedora-networking.config” which should prove very useful next time someone complains at me that “building kernels for Red Hat distributions is hard”.

Eric Biggers posted “KEYS: encrypted: avoid encrypting/decrypting stack buffers”, which notes that “Since [Linux] v4.9, the crypto API cannot (normally) be used to encrypt/decrypt stack buffers because the stack may be virtually mapped. Fix this for the padding buffers in encrypted-keys by using ZERO_PAGE for the encryption padding and by allocating a temporary heap buffer for the decryption padding.” Eric is referring to the virtually mapped stack support introduced by Andy Lutomirski which has the side effect of incidentally flagging up various previous misuse of stacks.

Mark Rutland posted an RFC (Request For Comments) patch series entitled “ARMv8.3 pointer authentication userspace support”. ARMv8.3 includes a new architectural extension that “adds functionality to detect modification of pointer values, mitigating certain classes of attack such as stack smashing, and making return oriented [ROP] programming attacks harder”. [aside: If you’re bored, and want some really interesting (well, I think so) bedtime reading, and you haven’t already read all about ROP, you really should do so]. Continuing to quote Mark, the “extension introduces the concept of a pointer authentication code (PAC), which is stored in some upper bits of pointers. Each PAC is derived from the original pointer, another 64-bit value (e.g. the stack pointer), and a secret 128-bit key”. The extension includes new instructions to “insert a PAC into a pointer”, to “strip a PAC from a pointer”, and to “authenticate strip a PAC from a pointer” (which has the side effect of poisoning the pointer and causing a later fault if the authentication fails – allowing for detection of malicious intent).
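For readers who think better in code, here is a toy Python model of the insert / strip / authenticate operations Mark describes. It is a sketch under loud assumptions: real hardware uses its own cipher rather than HMAC-SHA256 (swapped in here because it is in the standard library), the 48-bit address width and 16-bit PAC field are chosen for simplicity, a failed authentication is modeled by returning `None` rather than by poisoning the pointer, and the function names are invented.

```python
import hashlib
import hmac

VA_BITS = 48                      # assumed virtual address width
ADDR_MASK = (1 << VA_BITS) - 1    # low bits hold the real address

def compute_pac(pointer, modifier, key):
    """Derive a 16-bit code from the address, a 64-bit modifier (e.g. the
    stack pointer), and a secret 128-bit key. HMAC-SHA256 is a stand-in."""
    message = (pointer & ADDR_MASK).to_bytes(8, "little") + modifier.to_bytes(8, "little")
    digest = hmac.new(key, message, hashlib.sha256).digest()
    return int.from_bytes(digest[:2], "little")

def insert_pac(pointer, modifier, key):
    """Store the PAC in the otherwise unused upper bits of the pointer."""
    return (pointer & ADDR_MASK) | (compute_pac(pointer, modifier, key) << VA_BITS)

def strip_pac(pointer):
    """Drop the PAC without checking it."""
    return pointer & ADDR_MASK

def authenticate(pointer, modifier, key):
    """Return the plain pointer if the PAC matches, else None (this model's
    stand-in for a poisoned pointer that faults on use)."""
    if pointer >> VA_BITS != compute_pac(pointer, modifier, key):
        return None
    return strip_pac(pointer)

key = bytes(range(16))            # hypothetical 128-bit key
stack_pointer = 0x0000_7FFF_FFFF_E000
return_address = 0x0000_5555_DEAD_BEE0

signed = insert_pac(return_address, stack_pointer, key)
print(hex(signed), authenticate(signed, stack_pointer, key) == return_address)
```

Because the code depends on the modifier as well as the key, a signed return address copied to a different stack location would, with high probability, fail authentication there.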

Mark’s patch makes for great reading and summarizes this feature well. It notes that it has various counterparts in userspace to add ELF (Executable and Linking Format, the executable container used on modern Linux and Unix systems) notes sections to programs to provide the necessary annotations and presumably other data necessary to implement pointer authentication in application programs. It will be great to see those posted too.

Joerg Roedel followed up to a posting from Samuel Sieb entitled “AMD IOMMU causing filesystem corruption” to note that it has recently been discovered (and was documented in another thread this past week entitled “PCI: Blacklist AMD Stoney GPU devices for ATS”) that the AMD “Stoney” platform features a GPU for which PCI-ATS is known to be broken. ATS (Address Translation Services) is the mechanism by which PCIe endpoint devices (such as plugin adapter cards, including AMD GPUs) may obtain virtual to physical address translations for use in inbound DMA operations initiated by a PCIe device into a virtual machine’s (VM’s) memory (the VM talks the other way through the CPU MMU).

In ATS, the device utilizes an Address Translation Cache (ATC) which is essentially a TLB (Translation Lookaside Buffer) but not called that because of handwavy reasons intended not to confuse CPU and non-CPU TLBs. When a device sitting behind an IOMMU needs to perform an address translation, it asks a Translation Agent (TA) typically contained within the PCIe Root Complex to which it is ultimately attached. In the case of AMD’s Stoney Platform, this blows up under address invalidation load: “the GPU does not reply to invalidations anymore, causing Completion-wait loop timeouts on the AMD IOMMU driver side”. Somehow (but this isn’t clear) this is suspected as the possible cause of the filesystem corruption seen by Samuel, who is waiting to rebuild a system that ate its disk testing this.

Calvin Owens (Facebook) posted “printk: Introduce per-console filtering of messages by loglevel”, which notes that “Not all consoles are created equal”. It essentially allows the user to set a different loglevel for consoles that might each be capable of very different performance. For example, a serial console might be severely limited in its baud rate (115,200 in many cases, but perhaps as low as 9,600 or lower is still commonplace in 2017), while a graphics console might be capable of much higher. Calvin mentions netconsole as the preferred (higher speed) console that Facebook use to “monitor our fleet” but that “we still have serial consoles attached on each host for live debugging, and the latter has caused problems”. He doesn’t specifically mention USB debug consoles, or the EFI console, but one assumes that listeners are possibly aware of the many console types.
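A minimal Python model of per-console filtering might look like the following. It only illustrates the idea in Calvin’s posting: the `Console` class and `printk` helper are invented for the sketch, and it assumes the usual kernel convention that lower level numbers are more severe and that a console shows messages whose level is numerically below its loglevel.

```python
KERN_ERR, KERN_INFO, KERN_DEBUG = 3, 6, 7  # standard kernel level numbers

class Console:
    """A console with its own loglevel threshold (hypothetical model)."""
    def __init__(self, name, loglevel):
        self.name = name
        self.loglevel = loglevel
        self.lines = []

    def write(self, level, message):
        # Emit only messages more severe than this console's threshold.
        if level < self.loglevel:
            self.lines.append(message)

def printk(consoles, level, message):
    """Offer the message to every console; each one filters for itself."""
    for console in consoles:
        console.write(level, message)

serial = Console("ttyS0", loglevel=4)       # slow link: errors and worse only
netconsole = Console("netcon0", loglevel=8)  # fast link: everything
consoles = [serial, netconsole]

printk(consoles, KERN_ERR, "disk error")
printk(consoles, KERN_INFO, "link up")
printk(consoles, KERN_DEBUG, "probe details")
print(serial.lines, netconsole.lines)
```

The slow serial console then carries only the messages worth waiting for, while the faster console still receives the full stream.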

Christopher Bostic (IBM) posted version 5 of a patch series entitled “FSI device driver implementation”. FSI stands for “Flexible Support Interface”, a “high fan out [a term referring to splitting of digital signals into many additional outputs] serial bus consisting of a clock and a serial data line capable of running at speeds up to 166MHz”. His patches add core support to the Linux bus and device models (including “probing and discovery of slaves and slave engines”), along with additional handling for CFAM (Common Field Replaceable Unit Access Macro) – an ASIC (chip) “residing in any device requiring FSI communications” that provides these various “engines”, and an FSI engine driver that manages devices on the FSI bus.

Finally, Adam Borowski posted “n_tty: don’t mangle tty codes in OLCUC mode” which aims to correct a bug which is “reproducible as of Linux 0.11” and all the way back to 0.01. OLCUC is not part of POSIX, but this termios structure flag tells Linux to map lowercase characters to uppercase ones. The posting cites an obvious desire by Linus to support “Great Runes” (archaic Operating Systems in which everything was uppercase), to which Linus (obviously in jest, and in keeping with the April 1 date) asked Adam why he “didn’t make this the default state of a tty?”.

Linux Kernel Podcast for 2017/03/28

Audio: http://traffic.libsyn.com/jcm/20170328v2.mp3

Author’s Note: Apologies to Ulrich Drepper for incorrectly attributing his paper “Futexes are Tricky” to Rusty. Oops. In any case, everyone should probably read Uli’s paper: https://www.akkadia.org/drepper/futex.pdf

In this week’s edition: Linus Torvalds announces Linux 4.11-rc4, early debug with USB3 earlycon, upcoming support for USB-C in 4.12, and ongoing development including various work on boot time speed ups, logging, futexes, and IOMMUs.

Linus Torvalds announced Linux 4.11-rc4, noting that “So last week, I said that I was hoping that rc3 was the point where we’d start to shrink the rc’s, and yes, rc4 is smaller than rc3. By a tiny tiny smidgen. It does touch a few more files, but it has a couple fewer commits, and fewer lines changed overall. But on the whole the two are almost identical in size. Which isn’t actually all that bad, considering that rc4 has both a networking merge and the usual driver suspects from Greg [Kroah Hartman], _and_ some drm fixes”.

Announcements

Junio C Hamano announced Git v2.12.2.

Greg Kroah-Hartman announced Linux 4.4.57, 4.9.18, and 4.10.6.

Sebastian Andrzej Siewior announced Linux v4.9.18-rt14, which includes a “larger rework of the futex / rtmutex code. In v4.8-rt1 we added a workaround so we don’t de-boost too early in the unlock path. A small window remained in which the locking thread could de-boost the unlocking thread. This rework by Peter Zijlstra fixes the issue.”

Upcoming features

Greg K-H finally accepted the latest “USB Type-C Connector class” patch series from Heikki Krogerus (Intel). This patch series aims to provide various control over the capability for USB-C to be used both as a power source and as a delivery interface to supply power to external devices (enabling the oft-cited use case of selecting between charging your cellphone/mobile device or using said device to charge your laptop). This will land a new generic management framework exposed to userspace in Linux 4.12, including a driver for “Intel Whiskey Cove PMIC [Power Management IC] USB Type-C PHY”. Your author looks forward to playing. Greg thanked Heikki for the 18(!) iterations this patch went through prior to being merged – not quite a record, but a lot of effort!

Kishon Vijay Abraham (TI) posted “PCI: Support for configurable PCI endpoint”, which provides generic infrastructure to handle PCI endpoint devices (Linux operating as a PCI endpoint “device”), such as those based upon IP blocks from DesignWare (DW). He’s only tested the design on his “dra7xx” boards and requires “the help of others to test the platforms they have access to”. The driver adds a configfs interface including an entry to which userspace should write “start” to bring up an endpoint device. He adds himself as the maintainer for this new kernel feature.

Rob Herring posted “dtc updates for 4.12”, which “syncs dtc [Device Tree Compiler] with current mainline [dtc]”. His “primary motivation is to pull in the new checks [he’s] worked on. This gives lots of new warnings which are turned off by default”.

60Hz vs 59.94Hz (Handling of reduced FPS in V4L2)

Jose Abreu (Synopsys) posted a patch series entitled “Handling of reduced FPS in V4L2”, which aims to provide a mechanism for the kernel to measure (in a generic way) the actual Frames Per Second for a Video For Linux (V4L) video device. The patches rely upon hardware drivers being able to signal that they can distinguish “between regular fps and 1000/1001 fps”.

This took your author on a journey of discovery. It turns out that (most of the time), when a video device claims to be “60fps” it’s actually running at 59.94fps, but not always. The latter frame rate is an artifact of the NTSC (National Television System Committee) color television standard in the United States. Early televisions used the 60Hz frequency (which is nationally synchronized, at least in each of the traditional three independent grids operated in the US, which are now interconnected using HVDC interconnects but presumably are still not directly in phase with one another – feel free to educate me!) of the AC supply to lock individual frame scan times. When color TV was introduced, a small frequency offset was used to make room in each frame for a color sub-carrier signal while retaining backward compatibility for black and white transmissions. This is where frequencies of 29.97 and 59.94 frames per second originate. In case you always wondered.

Jose and Hans Verkuil had a back and forth discussion about various real-world measured pixelclock frequencies that they had obtained using a variety of equipment (signal analyzers, certified HDMI analyzer, and the Synopsys IP supported by the patch series under discussion) to see whether it was in reality possible to reliably distinguish frame rates.
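The arithmetic behind the “1000/1001” rates, and why distinguishing them from a measured pixel clock is delicate, can be worked through in a short Python sketch. The helper names are invented, the “nearest candidate” classification is an assumption rather than what the patches do, and the 1080p timing figures (2200 x 1125 total pixels per frame at a 148.5MHz pixel clock) are the commonly used values for that mode.

```python
from fractions import Fraction

NTSC_FACTOR = Fraction(1000, 1001)

def frame_rate(pixel_clock_hz, total_width, total_height):
    """Frames per second implied by a pixel clock and the total (active plus
    blanking) frame size."""
    return Fraction(pixel_clock_hz, total_width * total_height)

def is_reduced_fps(measured, nominal):
    """Classify a measured rate as 'reduced' if it is closer to
    nominal * 1000/1001 than to the nominal rate itself."""
    reduced = nominal * NTSC_FACTOR
    return abs(measured - reduced) < abs(measured - nominal)

# Nominal rates and their NTSC-derived counterparts.
print(float(60 * NTSC_FACTOR), float(30 * NTSC_FACTOR))

# 1080p: 2200 x 1125 total pixels per frame.
full = frame_rate(148_500_000, 2200, 1125)
reduced = frame_rate(148_351_648, 2200, 1125)  # 148.5MHz scaled by 1000/1001, rounded
print(float(full), float(reduced))
print(is_reduced_fps(full, 60), is_reduced_fps(reduced, 60))
```

The two pixel clocks differ by only about 0.1%, so a receiver needs a measurement noticeably more accurate than that to tell the two rates apart, which is consistent with the care Jose and Hans took in comparing instruments.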

Early Debug with USB3 earlycon (early printk)

Lu Baolu (Intel) posted version 8 of a patch series entitled “usb: early: add support for early printk through USB3 debug port”. Contemporary (especially x86) desktop and server class systems don’t expose low level hardware debug interfaces, such as JTAG debug chains, which are used during chip bringup and early firmware and OS enablement activities, and which allow developers with suitable tools to directly control and interrogate hardware state. Or just dump out the kernel ringbuffer (the dmesg “log”).

Actually, all such systems do have low level debug capabilities, they’re just fused out during the production process (by blowing efuses embedded into the processor) and either not exposed on the external pins of the chip at all, or are simply disabled in the chip logic. Probably most of these can be re-enabled by writing the magic cryptographically signed hashes to undocumented memory regions in on-chip coprocessor spaces. In any case, vendors such as Intel aren’t going to tell you how.

Yet it is often desirable to have certain low level debug functionality for systems that are deployed into field settings, even to reliably dump out the kernel console log DEBUG log level messages somewhere. Traditionally this was done using PC serial ports, but most desktop (and all laptop) systems no longer ship with those exposed on the rear panel. If you’re lucky you’ll see an IDC10 connector on your motherboard to which you can attach a DB9 breakout cable. Consumers and end users have no idea what any of this means, and in the case that they don’t know what this means, they probably shouldn’t be encouraged to open the machine up and poke things. Yet even in the case that IDC10 connectors exist and can be hooked up, this is still a cumbersome interface that cannot be relied upon today.

Microsoft (who are often criticized but actually are full of many good ideas and usually help to drive industry standardization for the broader market) instituted sanity years ago by working with the USB Implementers Forum (USB-IF) to ensure that the USB3 specification included a standardized feature known as xHCI debug capability (DbC), an “optional but standalone functionality by an xHCI host controller”. This suited Windows, which traditionally requires two UARTs (serial ports) for kernel development, and uses one of them for simple direct control of the running kernel without going through complex driver frameworks. Debug port (which also existed on USB2) traditionally required a special external partner hardware dongle but is cleaner in USB3, requiring only a USB A-to-A crossover cable connecting USB3.0 data lines.

As Lu Baolu notes in his patch, “With DbC hardware initialized, the system will present a debug device through the USB3 debug port (normally the first USB3 port)”. The patch series enables this as a high speed console log target on Linux, but it could be used for much more interesting purposes via KDB.

[Separately, but only really related to console drivers and not debugging, Thierry Escande posted “firmware: google memconsole” which adds support for importing the boot time BIOS memory based console into the kernel ringbuffer on Google Coreboot systems].

Ongoing Development

Pavel Tatashin (Oracle) posted “parallelized “struct page” zeroing”, which improves boot time performance significantly in the case that the “deferred struct page initialization feature is enabled”. In this case, zeroing out of the kernel’s vmemmap (Virtual Memory Map) is delayed until after the secondary CPU cores on a machine have been started. When this is done, those cores can be used to run zeroing threads that write to memory, taking one SPARC system’s boot time down from 97.89 seconds to 46.91. Pavel notes that the savings are also considerable on x86 systems.

Thomas Gleixner had a lengthy back and forth with Pasha Tatashin (Oracle) over the latter’s posting of “Early boot time stamps for x86” which use the TSC (Time Stamp Counter) on Intel x86 Architecture. The goal is to log how long the machine actually took to boot, including firmware, rather than just how long Linux took to boot from the time it was started. Peter Zijlstra responded (to Pasha), “Lol, how cute. You assume TSC starts at 0 on reset” (alluding to the fact that firmware often does crazy things playing with the TSC offset or directly writing to it). Thomas was unimpressed with Pavel’s posting of a v2 patch series, noting “Did you actually read my last reply on V1 of this? I made it clear that the way this is done, i.e. hacking it into the earliest boo[]t stage is not going to happen…I don’t care about you wasting your time, but I very much care about my time”. He provided a further more lengthy response, including various commentary on the best ways to handle feedback.

Peter Zijlstra posted version 6 of a patch series entitled “The arduous story of FUTEX_UNLOCK_PI” in which he adds “Another installment of the futex patches that give you nightmares”. Futexes (Fast User-space Mutexes) are a mechanism provided by the Linux kernel which leverage shared memory to provide a low overhead mutex (mutual exclusion primitive) to userspace in the case that such mutexes are uncontended (no conflicts between processes – tasks within the kernel – exist trying to acquire the same resource) but with a “slow path” through the kernel in the case of contention. They are used by many userspace applications, including extensively in the C library (see the famous paper by Rusty Russell entitled “Futexes are Tricky”). Peter is working on solving problems introduced by having to have Priority Inheritance (PI) aware futexes in Real Time kernels. These adjust priority of the associated tasks holding mutexes for short periods in order to prevent Priority Inversion (see Mars Pathfinder study papers) in which a low priority task holds a mutex that a high priority task wants to acquire. Peter’s patches “rework[] and document[] the locking” of existing code.

Separately, Waiman Long (Red Hat) posted version 6 of “futex: Introducing throughput-optimized (TP) futexes” which “introduces a new futex implementation called throughput-optimized (TP) futexes. It is similar to PI futexes in its calling convention, but provides better throughput than the wait-wake (WW) futexes by encouraging lock stealing and optimistic spinning. The new TP futexes can be used in implementing both userspace mutexes and rwlocks. They provide[] better performance while simplifying the userspace locking implementation at the same time. The WW futexes are still needed to implement other synchronization primitives like conditional variables and semaphores that cannot be handled by the TP futexes”.

David Woodhouse posted “PCI resource mmap cleanup” which aims to clean up the use of various kernel interfaces that provide “user visible” resource addresses through (legacy) proc and (contemporary) sysfs. The purpose of these interfaces is to provide information about regions of PCI address space memory that can be directly mapped by userspace applications such as those linked against the DPDK (Data Plane Development Kit) library. An example of his cleanup included “Only allow WC [Write Combining] mmap on prefetchable resources” for the /proc/bus/pci mmap interface because this was the case for the preferred sysfs interface already. This led some to debate why the 64-bit ARM Architecture didn’t provide the legacy procfs interface (since there was a little confusion about the dependencies for DPDK) but ultimately re-concluded that it shouldn’t.

Tyler Baicar (Codeaurora) posted version 13 of a patch series entitled “Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64”, which aims to introduce support to the 64-bit ARM Architecture for logging of RAS events using the shared “GHES” (Generic Hardware Error Source) memory location “with the proper GHES structures to notify the OS of the error”. This dovetails nicely with platforms performing “firmware first” error handling in which errors are trapped to secure firmware which first handles them and subsequently informs the Operating System using this ACPI feature.

Shaohua Li (Facebook) posted a patch entitled “add an option to disable iommu force on” in the case of the (x86) Trusted Boot (TBOOT) feature being enabled. The reason cited was that under a certain 40GBit networking load XDP (eXpress Data Path) test there were high numbers of IOTLB (IO Translation Look Aside Buffer) misses “which kills the performance”. What he is referring to is the mechanism through which an IOMMU (which sits logically between a hardware device, such as a network card, and memory, often as part of an integrated PCI Root Complex) translates underlying memory accesses by the adapter card into real host memory transactions. These are cached by the IOMMU in small caches (known as IOTLBs) after it performs such translations using its “page tables” (similar to how a host CPU’s MMU – Memory Management Unit – performs host memory translations). Badly designed IOMMU implementations or poor utilization can result in large numbers of misses that result in users disabling the feature. Alas, without an IOMMU, there’s little protection during boot from rogue devices that maliciously want to trash host memory. Nobody has noted this in the RFC (Request For Comments) discussion, yet.
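To get a feel for why a working set that outgrows the IOTLB “kills the performance”, here is a toy model. Everything in it is an assumption for illustration: the 64-entry size, the 4 KiB page size, and the LRU replacement policy are invented, and real IOMMU hardware may behave quite differently.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # assumed 4 KiB translation granule


class ToyIotlb:
    """A hypothetical LRU translation cache; not any real IOMMU."""

    def __init__(self, entries):
        self.entries = entries
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def translate(self, iova):
        page = iova >> PAGE_SHIFT
        if page in self.cache:
            self.cache.move_to_end(page)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1  # real hardware would walk its page tables here
            self.cache[page] = True
            if len(self.cache) > self.entries:
                self.cache.popitem(last=False)  # evict least recently used


def miss_rate(n_buffers, rounds=100, entries=64):
    """Cycle through n_buffers one-page DMA buffers, rounds times."""
    tlb = ToyIotlb(entries)
    for _ in range(rounds):
        for buf in range(n_buffers):
            tlb.translate(buf << PAGE_SHIFT)
    return tlb.misses / (tlb.hits + tlb.misses)


print("32 buffers: ", miss_rate(32))
print("128 buffers:", miss_rate(128))
```

In this model a ring of buffers that fits in the cache only pays for the initial fills, while a ring twice the cache size misses on every single access, since each entry is evicted before it is reused.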

Bodong Wang (Mellanox) posted a patch entitled “Add an option to probe VFs or not before enabling SR-IOV”, which aims to allow administrators to limit the probing of (PCIe) Virtual Functions (VFs) on adapters that will have those resources passed through to Virtual Machines (VMs) (using VFIO). This “can save host side resource usage by VF instances which would be eventually probed to VMs”. It adds a new sysfs interface to control this.

Viresh Kumar posted a patch entitled “cpufreq: Restore policy min/max limits on CPU online”. Apparently, existing code behavior was that “On CPU online the cpufreq core restores the previous governor [the in kernel logic that determines CPU frequency transitions based upon various metrics, such as saving energy, or prioritizing performance]…but it does not restore min/max limits at the same time”. The patch addresses this shortcoming.

Wanpeng Li posted a patch entitled “KVM: nVMX: Fix nested VPID vmx exec control” that aims to “hide and forbid” Virtual Processor IDentifiers in nested virtualization contexts where the hardware doesn’t support this. Apparently it was unconditionally being enabled (based upon real hardware realities of existing implementation) regardless of feature information (INVVPID) provided in the “vmx” capabilities.

Joerg Roedel posted a patch entitled “ACPI: Don’t create a platform_device for IOAPIC/IOxAPIC” since this was causing problems during hot remove (of CPUs). Rafael J. Wysocki noted that “it’s better to avoid using platform_device for hot-removable stuff” since it is “inherently fragile”.

Kees Cook (Google) posted a patch disabling hibernation support on 32-bit systems in the case that KASLR (Kernel Address Space Layout Randomization) was enabled at boot time, but allowing for “nokaslr” on the kernel command line to change this. Evgenii Shatokhin initially noted that “nokaslr” didn’t re-enable hibernation support correctly, but eventually determined that the ordering and placement of the “nokaslr” on the command line was to blame, which led to Kees saying he would look into the command line parsing sequence and interaction with other options, such as “resume=”.

Separately, Baoquan He (Red Hat) noted that with KASLR an implicit assumption that EFI_VA_START < EFI_VA_END existed, while “In fact [the] EFI [(Unified) Extensible Firmware Interface] region reserved for runtime services [these are callbacks into firmware from Linux] virtual mapping will be allocated using a top-down schema”. His patches addressed this problem, and being “RESEND”s, he was keen to see that they get taken up soon.

Also separately, Kees posted “syscalls: Restore address limit after a syscall” which “ensures a syscall does not return to user-mode with a kernel address limit. If that happened, a process can corrupt kernel-mode memory and elevate privileges”. He cites a bug it would have prevented.

Kan Liang (Intel) posted “measure SMI cost”. This patch series aims to leverage hardware counters to inform perf of the amount of time spent (on Intel x86 Architecture systems) inside System Management Mode (SMM). SMIs (System Management Interrupts) are events that are generated (usually) by the Intel Platform Controller Hub and similar chipset logic which can be programmed by firmware to generate regular interrupts that target a secure execution context known as SMM (System Management Mode). It is here that firmware temporarily steals CPU cycles from the Operating System (without its knowledge) to perform such things as CPU fan control, errata handling, and wholesale VGA graphics emulation (in BMC “value add” from OEMs). Over the years, the amount of gunk hidden in SMIs has grown so much that this author even once wrote a latency detector (hwlat) and has a patent on SMI detection without using such dedicated counters…due to the impact of such on system performance. SMM is necessary on x86 due to its lack of a standardized on-SoC platform management controller, but so is accounting for bloat.
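The general idea behind detecting stolen time without dedicated counters can be sketched very simply: sample a clock back to back and look for gaps that are too large to explain. What follows is a userspace toy, not the hwlat code; the function name and threshold are invented, and in an ordinary process the scheduler and regular interrupts will show up as gaps too.

```python
import time


def detect_gaps(samples, threshold_ns, clock=time.perf_counter_ns):
    """Poll the clock in a tight loop; return gaps above threshold_ns."""
    gaps = []
    prev = clock()
    for _ in range(samples):
        now = clock()
        if now - prev > threshold_ns:
            gaps.append(now - prev)  # time went missing between two reads
        prev = now
    return gaps


# Deterministic demonstration using a fake clock with one 50 microsecond hole.
fake_times = iter([0, 100, 200, 50_200, 50_300])
print(detect_gaps(4, 1_000, clock=lambda: next(fake_times)))

# Against the real clock the result depends entirely on the machine.
print(len(detect_gaps(100_000, 100_000)), "gaps over 100us observed")
```

The clock is passed in as a parameter so the logic can be exercised deterministically; a detector running inside the kernel can additionally mask out ordinary interrupts, leaving firmware as the main suspect for whatever gaps remain.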

Finally, yes, Kirill A. Shutemov snuck in another iteration of his Intel “5-level paging support” in preparation for a 4.12 merge.

 

Linux Kernel Podcast for 2017/03/21

Audiohttp://traffic.libsyn.com/jcm/20170321.mp3

In this week’s kernel podcast: Linus Torvalds announces Linux 4.11-rc3, this week’s exciting installment of “5-level paging weekly”, the 2038 doomsday compliance “statx” system call, and heterogeneous memory management. Also a summary of all ongoing active kernel development toward 4.12 onwards.

Linus Torvalds announced Linux 4.11-rc3. In his announcement, Linus noted that “rc3 is larger than rc2, but this is hopefully the point where things start to shrink and calm down. We had a late typo in rc2 that affected arm and powerpc (the prep code for the 5-level page tables [on x86 systems]), and hopefully there are no similar brown-paper-bugs in rc3.”

Announcements

Kent Overstreet announced the latest developments in Bcachefs, in a post entitled “Bcachefs – encryption, fsck, and more”. One of the key new features is that “We now have whole filesystem encryption – and this is modern authenticated encryption”. He notes that they can’t currently encrypt only part of the filesystem (as is the case, for example, with ext4 – as used on Android devices, and of course with Apple’s multi-layered iOS filesystem implementation) but “it’s more of a better dm-crypt” in removing the layers between the filesystem and the underlying hardware. He also notes that there’s a “New inode format”, and many other changes. Further details at: https://bcache.evilpiepirate.org/Bcachefs/

Hongbo Wang (Intel) announced the 2016-Q4 release of XenGT and 2016-Q4 release of KVMGT. These are both “full GPU virtualization solution[s] with mediated pass-through”…of the hardware graphics resources into guest virtual machines. Further information is available from Intel’s github: https://github.com/01org/ (igvtg-xen for the Xen tree, and igvtg-kernel, and igvtg-qemu for the pieces needed for KVM support)

Julia Cartwright announced the Linux preempt-rt (Real Time) kernel version 4.1.39-rt47 stable kernel release.

Junio C Hamano announced Git v2.12.1. In his announcement, he noted that the tarballs “are NOT YET found at” the typical URL since “I am having trouble reaching there”. It’s unclear if this is due to recent changes in the architecture of kernel.org and its mirroring, or a local issue.

Intel 5-level paging

In this week’s episode of “merging Intel 5-level paging support” the fun but unexpected plot twist resulting in a “will it merge or not” cliffhanger comes from Linus. Kirill A. Shutemov (Intel) has been diligently posting this series for some time, and if you recall from last week’s episode, the foundational pieces needed to land this in 4.12 were merged after the closure of the 4.11 merge window following a special request from Linus. Kirill has since posted “x86: 5-level paging enabling for v4.12, Part 1”. In response to a comment from Kirill that “Let’s see if I’m on the right track addressing Ingo’s [Molnar’s] feedback”, Linus stated, “Considering the bug we just had with the HAVE_GENERIC_RCU_GUP code, I’m wondering if people would be willing to look at what it would take to make x86 use the generic version?”, and “The x86 version of __get_user_pages_fast() seems to be quite similar to the generic one. And it would be lovely if all the main architectures shared the same core gup code”.

The Linux kernel implements a set of code functions for pinning of usermode (userspace) pages (the smallest granule size upon which contemporary hardware operates via a Memory Management Unit under the control of software provided and (co-)maintained “page tables”, and the size tracked by the Operating System in its page table management code) whenever they must be shared between userspace (which has dynamically pageable memory that can come and go as the kernel needs to free up RAM temporarily for other tasks by “paging” those pages out to “swap”) and code running within a kernel driver (the Linux kernel does not have pageable memory). GUP (get_user_pages) handles this operation, which takes a set of pointers to the individual pages that should be present and marked as in use. It has a variant usually referred to as “fast GUP” which aims to perform this operation without taking an expensive lock in the corresponding userspace processes’ “mm” struct (an object that forms part of a task’s – the in-kernel term for a process – metadata, and linked from the corresponding task_struct). Fast GUP doesn’t always work, but when it doesn’t need to fall back to an expensive slow path, it can save considerable time. So Linus was expressing a desire for x86 to share the same generic code as used by other architectures for this operation.

Linus further added three “subtle issues” that he saw with switching over x86 to the generic GUP code:

“(a) we need to make sure that x86 actually matches the required semantics for the generic GUP.

(b) we need to make sure the atomicity of the page table reads is ok.

(c) need to verify the maximum VM address properly”

He said “I _think_ (a) is ok”. But he wanted to see “real work to make sure” that (b) is “ok on 32-bit PAE”. PAE means Physical Address Extension, a mechanism used on certain 32-bit Intel x86 systems to address greater than a 32-bit physical address space by leveraging the fact that many individual applications don’t need larger than a 32-bit address space but that an overall system might in aggregate use multiple such 32-bit applications. It was a hack that bought time before the widespread adoption of the 64-bit architecture, and one that others (such as ARM) have implemented in a similar sense of end purpose in “LPAE” and friends as well. PAE moved the x86 architecture from 32-bit PTE (Page Table Entries) to 64-bit hardware entries, which means that on 32-bit systems there are real concerns around atomicity of updates to these structures without very careful handling. And as this author can attest, you don’t want to have to debug that situation.
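To illustrate the atomicity worry, here is a toy model of a 64-bit entry that can only be read 32 bits at a time. It is a single-threaded simulation with a hook standing in for a concurrent writer; the class and function names are invented, and the “re-read the low word and retry” reader is just one way such tearing can be detected, not a claim about what the final patches do.

```python
class SplitPte:
    """A 64-bit entry stored as two 32-bit words (hypothetical model)."""

    def __init__(self, value):
        self.store(value)

    def store(self, value):
        self.low = value & 0xFFFFFFFF
        self.high = value >> 32


def naive_read(pte, between_reads=lambda: None):
    low = pte.low
    between_reads()  # a concurrent writer may run at this point
    high = pte.high
    return (high << 32) | low


def retrying_read(pte, between_reads=lambda: None):
    while True:
        low = pte.low
        between_reads()
        high = pte.high
        if pte.low == low:  # low word unchanged, so the halves match
            return (high << 32) | low


OLD = 0x0000000111111000
NEW = 0x0000000222222000


def make_writer(pte):
    """Return a hook that rewrites the entry once, on its first call."""
    fired = []

    def writer():
        if not fired:
            fired.append(True)
            pte.store(NEW)

    return writer


pte = SplitPte(OLD)
torn = naive_read(pte, make_writer(pte))
pte = SplitPte(OLD)
safe = retrying_read(pte, make_writer(pte))
print(hex(torn), hex(safe))
```

The naive reader returns a value that is neither the old entry nor the new one, which is exactly the kind of thing nobody wants to debug. Note that the retry trick only works here because the writer always changes the low word; real code has to argue about that (and about memory ordering) very carefully.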

This discussion led Kirill to point out that there were some obvious looking bugs in the existing x86 GUP code that needed fixing for PAE anyway. The thread is ongoing, and Kirill is certain to be enjoying this week’s episode of “so you thought you were only adding 5-level paging?”. Michal Hocko noted that he had pulled the current version of the 5-level paging patch series into the mmotm (mm of the moment) VM (Virtual Memory) subsystem development tree as co-maintained with Andrew Morton and others.

Borislav Petkov posted “x86/mce: Handle broadcasted MCE gracefully with kexec” which (as we covered previously) seeks to handle the unfortunate case of an MCE (Machine Check Exception) on Intel x86 systems arriving during the process of handoff from the crash kernel into “purgatory” prior to the new kernel beginning. At this phase, the old kernel’s MCE handler is running and will never complete a synchronization with other cores in the system that are waiting in a holding spinloop (probably MWAIT one would assume) for the new kernel to take over.

statx

Various subsystems gained support for the new “statx” system call, which is part of the ongoing “Year 2038” doomsday avoidance work to prevent a Y2K style disaster when 32-bit Unix time wraps in 2038 (this being an actual potential “disaster” in the making, unlike the much hyped Y2K nonsense). Many of us have aspirations to be retired and living on boats by then, but this is neither assured, nor a prudent means to guarantee we won’t have to deal with this later (but presumably with at least some kind of lucrative consulting contract to bring us out of our early or late retirements).
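For anyone who wants to see the doomsday for themselves, the wrap is easy to reproduce. This sketch assumes the classic case of a signed 32-bit count of seconds since the 1970 epoch; the helper name is invented.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Wrap an integer the way a signed 32-bit counter would."""
    return (value + 2**31) % 2**32 - 2**31


# The last second a signed 32-bit time value can represent.
last_good = EPOCH + timedelta(seconds=INT32_MAX)
# One tick later the counter wraps to the most negative value.
wrapped = EPOCH + timedelta(seconds=wrap_int32(INT32_MAX + 1))

print("last representable:", last_good.isoformat())
print("one second later:  ", wrapped.isoformat())
```

One second after 03:14:07 UTC on January 19th, 2038, such a counter lands back in December 1901, which is why 64-bit timestamps in new interfaces matter.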

The “statx” call adds 64-bit timestamps and replaces “stat”. It also does a lot more than just “make large” (David Howells’ words) the various fields in the previous stat structures. The overall system call was covered much more generally by Linux Weekly News (which you should support as a purveyor of actual in-depth journalism on such topics) as recently as last week. Stafford Horne posted one example of the patches we refer to here, for the “asm-generic” reference includes used by emerging architectures, such as the OpenRISC architecture that he is maintaining. Another statx patch came from David Howells, for the ext4 filesystem, which led to a longer discussion of how to implement various underlying flag changes required to ext4.

Eric Biggers noted that David used the ext4_get_inode_flags function “to sync the generic inode flags (inode->i_flags) to the ext4-specific inode flags (ei->i_flags)” but that a problem can exist when doing this without holding an underlying lock due to “flag syncs…in both directions concurrently” which could “cause an update to be lost”. He walked through an example of how this could occur, and then suggested that for ->getattr() it might be easier to skip the call to the offending function and “instead populating the generic attributes like STATX_ATTR_APPEND and STATX_ATTR_IMMUTABLE from the generic inode flags, rather than from the ext4-specific flags?”. Andreas Dilger suggested the other way around, pulling the flags directly from the ext4 flags rather than the generic ones. He also raised the general question of “when/where are the VFS inode flags changed that they need to be propagated into the ext4 disk inode?”.

Jan Kara replied that “you seem to be right. And actually I have checked and XFS does not bother to copy inode->i_flags to its on-disk flags so it seems generally we are not expected to reflect inode->i_flags in on-disk state”. Jan suggested to Andreas that it might be “better…to have ext4_quota_on() and ext4_quota_off() just update the flags as needed and avoid doing it anywhere else…I’ll have a look into it”.

Heterogeneous Memory Management

Jérôme Glisse posted version 18 of his patch series entitled “HMM (Heterogeneous Memory Management)” which aims to serve two generic use cases: “First it allows to use device memory transparently inside any process without modifications to process program code. Second it allows to mirror process address space on a device”. His intro described these summaries as a “Cliff node” (a brand of examination-time study materials often used by students for preparation), which led to an objection from Andrew Morton that “Cliff’s notes” “isn’t appropriate for a large feature such as this. Where’s the long-form description? One which permits readers to fully understand the requirements, design, alternative designs, the implementation, the interface(s), etc?”. He also asked for clarification of what was meant by “device memory” since “That’s very vague. What are the characteristics of this memory? Why is it a requirement that userspace code be unaltered? What are the security implications – does the process need particular permissions to access this memory? What is the proposed interface to set up this access?”

In a followup, Jérôme noted that he had previously given a longer form summary, which he attached, in the earlier revisions of the now version 18 patch series. In his summary, he makes clear his intent is to ease the overall management and programming of hybrid systems involving GPUs and other accelerators by introducing “a new kind of ZONE_DEVICE memory that does allow to allocate a struct page for each page of the device memory. Those page are special because the CPU can not map them. They however allow to migrate main memory to device memory using ex[]isting migration mechanism[s] and everything looks like it page was swap[ped] out to disk from CPU point of view. Using a struct page gives the easiest and cleanest integration with existing mm mechanisms”. He notes that he isn’t trying to solve other problems, and in fact one could summarize HMM using the buzzword du jour: “mediated”.

In an HMM world, devices and host-side application software can share what appears to them as a “unified” memory map. One in which pointer addresses from within an application can be dereferenced by code running on a GPU, and vice versa, through cunning use of page tables and a new underlying system framework for the device drivers touching the hardware. It’s not magic, but it does help to treat device memory “like regular memory” and accommodates “Advance in high level language construct (in C++ but others too) gives opportunities to compiler to leverage GPU transparently without programmer knowledge. But for this to happen we need a share[d] address space”.

This means that, if a host application (processor side of the equation) performs an access to part of a process (known as a “task” within the kernel) address space that is currently under control of a device, then the associated page fault will trigger generic framework code to handle handoff of that page back to the host CPU side. On the flip side, the framework still requires device drivers to use a new framework to manage their access to memory since few devices have generic page fault mechanisms today that can be leveraged to make this more transparent, and a lot of other device specific gunk is needed. It’s not a perfect solution, but it does arguably advance the state of the art, and is useful. Jérôme also states that “I do not wish to compete for the patchset with the highest revision count and i would like a clear cut position on w[h]ether it can be merge[d] or not. If not i would like to know why because i am more than willing to address any issues people might have. I just don’t want to keep submitting it over and over until i end up in hell…So please consider applying for 4.12”.

This author’s own personal opinion is that, while HMM is certainly useful, many such shared device/host memory situations can be greatly simplified by introducing coherent shared virtual memory between device and host. That model allows for direct address space sharing without some of the heavy lifting required in this patch set. Yet, as is noted in the posting, few devices today have such features (and there is no reason to presume that all future devices suddenly will implement shared virtual memory, nor that every device will want to expend the energy required to maintain coherent memory for communication). So the HMM patches provide a means of tracking who owns memory shared between device and “host”, and they exploit split device and “host” system page tables as well as associated faults to ensure pages are handed off as cleanly as can be achieved with technology available in the market today.

Ongoing Development

Michal Hocko posted a patch entitled “rework memory hotplug onlining”, which seeks to rework the semantics for memory hotplug since the current implementation is “awkward and hard/impossible to use from the udev to online memory as movable. The main problem is that only the last memblock or the adjacent to highest movable memblock can be onlined as movable”. He posted a number of examples showing how things fall down today, as well as a patch (“just for x86 now but I will address other arches once there is an agreement this is the right approach”) removing “all the zone specific operations from __add_pages (aka arch_add_memory) path. Instead we do page->zone association from move_pfn_range which is called from online_pages. This criterion for movable/normal zone association is really simple now. We just have to guarantee that zone Normal is always lower than zone Movable”. This led to a lengthy discussion around the ideal longer term approach and is likely to be a topic at the LSF/MM conference this week (one assumes?). [ It’s happening down the street from me…I’ll smile and wave at you 😉 ]

Gustavo Padovan posted “V4L2 explicit synchronization support”, an RFC (Request For Comments) that “adds support for Explicit Synchronization of shared buffers in V4L2” (Video For Linux 2, the general purpose video framework API used on Linux machines for certain multimedia purposes). This new RFC leverages the “Sync File Framework” as a means to “communicate the fences between kernel and userspace”. In English, what this means is that it’s often necessary to communicate using shared buffers between userspace, kernel, and hardware. And some (most) hardware might not guarantee that these buffers are fully coherent (observed identically between multiple concurrently operating agents that are manipulating it). The use of “fences” (barriers) enables explicit communication of certain points in time during which the state of a buffer is consistent and ready for access to be handed off between different parts of the system. The RFC is quite interesting and has a lot more detail, including the observation that it is intended to be a PoC (Proof of Concept) to get the conversation moving more than the eventual end result of that conversation that might actually be merged.

Wei Wang (Intel) posted a patch series entitled “Extend virtio-balloon for fast (de)inflating & fast live migration”. Balloons aren’t just helium filled goodies that all of us love to play with from a young age. Well, they are that, but, they’re also a concept applied to the memory management of virtual machines, which “inflate” the amount of memory available to them by requesting more from a hypervisor during their lifetime (that they might also return). In Linux, the same concept is applied to the migration of virtual machines, which can use the virtio-balloon abstraction over the virtio bus (a hypervisor communications channel) to transfer “guest unused pages to the host so that they can be skipped to migrate in live migration”. One of the patches in his version 3 series (patch number 3 of 4), entitled “mm: add in[t]erface to offer info about unused pages” had some detailed discussion with Michael S. Tsirkin commenting on better documentation and Andrew Morton suggesting that it might be better for the code to live in the virtio-balloon driver rather than being made too generic as its use case is very targeted.

Elena Reshetova continued her work toward conversion of Linux kernel subsystems to her newer “refcount” explicit reference counting API with a posting entitled “net subsystem refcount conversions”.

Suzuki K Poulose posted a bunch of patches implementing support for detection and reporting of new ARMv8.3 architecture features, including one patch that was entitled “arm64: v8.3: Support for Javascript conversion instruction” (which really means a new double precision float to integer conversion instruction that will likely be used by high performance JavaScript JITs…). He also posted “arm64: v8.3: Support for weaker release consistency”. The new revision of the architecture adds new instructions to “support Release Consistent processor consistent (RCpc) model, which is weaker than the RCsc [Release Consistent sequential consistency] model”. Listeners are encouraged to read the C++ memory model and other fascinating bedtime literature for much more detail on the available RC options.
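What makes a “JavaScript conversion” different from an ordinary float to integer conversion is what happens on overflow: JavaScript’s ToInt32 operation wraps modulo 2^32, where a typical saturating conversion clamps to the nearest representable value. A rough model of the two behaviours, with invented function names and no claim to match every corner case of the hardware instruction:

```python
import math


def js_to_int32(x: float) -> int:
    """Model of ECMAScript ToInt32: truncate, then wrap modulo 2**32."""
    if math.isnan(x) or math.isinf(x):
        return 0
    n = math.trunc(x) % 2**32
    return n - 2**32 if n >= 2**31 else n


def saturating_to_int32(x: float) -> int:
    """Model of a saturating conversion: truncate, then clamp to the range."""
    if math.isnan(x):
        return 0
    if math.isinf(x):
        return 2**31 - 1 if x > 0 else -2**31
    return max(-2**31, min(2**31 - 1, math.trunc(x)))


for value in (3.7, -3.7, 2.0**31, 2.0**32 + 5):
    print(value, js_to_int32(value), saturating_to_int32(value))
```

For in-range values the two agree; they only diverge once the input no longer fits in 32 bits, which is the case a JIT would otherwise need extra instructions to handle.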

Markus Mayer (Broadcom) posted “Basic divider clock”, an RFC which aims to provide a generic means of expressing clock dividers that can be leveraged in an embedded system’s “DeviceTree”, for which he also posted bindings (descriptions to be used in creating these textual description “trees”). Stephen Boyd pushed back that the community had so far avoided generic implementations but instead preferred to keep things at the level of having drivers that target certain hardware IP from certain vendors based upon the compatible matching strings.

Michael S. Tsirkin posted “kvm: better MWAIT emulation for guests”. We have previously explained this patchset and the dynamics of MWAIT implementations. His goal for this patch is to handle guests that assume the presence of the (x86) MWAIT feature, which isn’t present on all x86 CPUs. If you were running (for example) MacOS inside a VM on an x86 machine, it would generally assume the presence of MWAIT without checking for it, because it’s present in all x86-based Apple Macs. Emulating MWAIT is useful in such situations.

Romain Perier posted “Replace PCI pool by DMA pool API”. As he notes in his posting, “The current PCI pool API are simple macro functions direct expanded to the appropriate dma pool functions. The prototypes are almost the same and semantically, they are very similar. I propose to use the DMA pool API directly and get rid of the old API”.

Daeseok Youn posted “staging: atomisp: use k{v}zalloc instead of k{v}alloc and memset”. Alan Cox replied “…please don’t apply this. There are about five other layers of indirection for memory allocators that want removing first so that the driver just uses the correct kmalloc/kzalloc/kv* functions in the right places”. Now does seem like a good time not to add more layers.

Peter Zijlstra posted various “x86 optimizations” that aimed to “shrink the kernel and generate better code”.

Kernel Podcast for March 13th, 2017

Audio: http://traffic.libsyn.com/jcm/20170313.mp3

In this week’s kernel podcast: Linus Torvalds announces Linux 4.11-rc2 (including pre-enablement for Intel 5-level paging), VMA based swap readahead, and ongoing development ahead of the next cycle.

Linus Torvalds announced Linux 4.11-rc2. In his announcement, he said that the past week had been “fairly quiet” because “people are still looking for bugs and taking a breather after the merge window”. But he also noted that “we’ve got a healthy number of fixes in, and there’s some cleanup/prep patches for the upcoming 5-level page table support that I took after the merge window just to make the next merge window easier”.

Various fixes and updates have been posted against the previous rc1, over the past week, including an urgent fix from Matthew (Willy) Wilcox for his idr rewrite in 4.11 (freeing the correct IDA bitmap).

Geert Uytterhoeven posted “Build regressions/improvements in v4.11-rc1”. This compared build error/warning regressions and improvements between v4.11-rc1 and v4.10. According to Geert, the 4.11-rc1 kernel saw an increase of 19 build errors and 1108 warnings when compared to 4.10.

Announcements

Jiri Slaby announced Linux 3.12.71, Greg Kroah Hartman (KH) announced 4.4.53, 4.9.14, and 4.10.2 (which started a conversation about git tags being stale that we will address in a moment). Greg took the opportunity of various stable kernel work to prod the i915 graphics driver team with a message entitled “The i915 stable patch marking is totally broken”.

Sebastian Andrzej Siewior announced the v4.9.13-rt12 preempt-rt “Real Time” kernel patch set, which has a known issue that “CPU hotplug got a little better but can deadlock”, suggesting you might not want to try that then.

Julia Cartwright announced 4.1.38-rt46.

Steven Rostedt announced the 3.18.48-rt53 stable release of the RT kernel. He also announced the 3.10.105-rt119 and 3.2.86-rt124 releases.

Jari Ruusu announced “loop-AES-v3.7k file/swap crypto package”, which is available on sourceforge at: http://loop-aes.sourceforge.net/

Andy Lutomirski sent out detailed notes (along with a followup with yet more explanation) of the Intel SGX (“Secure Enclave”) feature discussion that occurred at Kernel Summit and Linux Plumbers Conference last fall. The thread is called “SGX notes from KS/LPC”. In the thread, he explains what SGX is (a small region of virtual memory within a Linux process – known as a task inside the kernel – that is not visible to the host OS after the enclave is “launched”) and how it can be used to hide certain data from system administrators or providers – for example, cryptographic keys that one would rather were not compromised. SGX comes with a litany of new requirements upon the Operating System that Andy covers, along with some guidelines for how to expose this feature, and how to make it usable.

Packet.net are now sponsoring the kernel.org project to the tune of various geo-diverse bare metal frontend systems in datacenters around the globe. Each of these (powerful) frontends provides read-only public access to kernel.org git repositories and the public website (git.kernel.org and www.kernel.org). More information, including machine specifications can be found here: https://www.kernel.org/fast-new-frontends-with-packet.html

(this came to light because of a brief outage affecting the Newark, NJ mirror which was lagging behind other mirrors in picking up new git tags pushed, but one hopes that an official announcement and thanks was otherwise forthcoming)

Masahiro Yamada has been added as a Kbuild (co-)maintainer.

Intel 5-level paging

Kirill A. Shutemov posted version 4 of his “5-level paging” patch series that implements support for the la57 (57 bit Virtual Address space for x64 Canonical Addressing) feature on some future CPUs. We covered the underlying patch series before, explaining the benefit of a larger (virtual) address space, as well as the additional complexities required to implement backward compatibility (including new prctls to limit the virtual address space of certain legacy applications), and the lack (so far) of boot time switching between 4-and-5-level support, which is seen as important for the distros.
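As a rough worked example of what the extra level buys, the arithmetic below assumes the usual x86-64 layout of 4KiB pages (12 offset bits) and 9 index bits per page table level. The function name is purely illustrative.

```python
PAGE_OFFSET_BITS = 12      # 4 KiB pages
INDEX_BITS_PER_LEVEL = 9   # 512 entries per page table

def virtual_address_bits(levels):
    """Width of the virtual address covered by a page table walk of this depth."""
    return PAGE_OFFSET_BITS + levels * INDEX_BITS_PER_LEVEL

for levels in (4, 5):
    bits = virtual_address_bits(levels)
    size_tib = (1 << bits) // (1 << 40)
    print(f"{levels}-level paging: {bits} bits, {size_tib} TiB")
```

Under those assumptions, four levels give 48 bits (256 TiB) and five levels give 57 bits (131072 TiB, or 128 PiB).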

Linus responded by saying that he thought “we should just aim for this being in 4.12” as he didn’t “see any real reason to delay merging it”. After some discussion about whose tree to merge it through, it was decided (by Thomas Gleixner) that it could come in through the “-tip” x86 tree. Which resulted in Linus pulling a preparatory “5-level paging: prepare generic code” patch series from Kirill into 4.11 (even after the merge window had closed) to lay the groundwork for pulling the main feature into the next (4.12) cycle. This promptly broke PowerPC, which was promptly fixed by a followup patch. Following the merge of enabling support in 4.11, Kirill posted “5-level paging enabling for v4.12” which aims to complete the merge next cycle.

The earlier version 4 iteration of the patch series noted that the Xen hypervisor currently doesn’t support 5-level paging and thus CONFIG_XEN is disabled automatically when building CONFIG_X86_5LEVEL. It was pointed out by Andrew Cooper that runtime (boottime) switching between 4 and 5 level support would be required in order to provide a clean experience, especially until Xen Dom0 support is available. That boottime switching is on the existing todo and presumably is going to land at some point.

Separately, Dmitry Safonov posted version 6 of a patch series entitled “Fix compatible mmap() return pointer over 4Gb” which has “some minor conflicts with Kirill’s set for 5-table paging”. Dmitry aims to solve a slightly different problem than Kirill’s PR_{SET,GET}_MAX_VADDR calls. Those limit the virtual address ranges returned by mmap, so that legacy programs compiled with built-in (and broken) assumptions about address size do not break when suddenly handed much larger “Canonical Addresses” (in Intel parlance). Dmitry is instead focused on 32-bit legacy syscalls on 64-bit x64 not returning memory above 4GB that cannot be used by older 32-bit code.

VMA based swap readahead

Ying Huang (Intel) posted an RFC (Request For Comments) entitled “mm, swap: VMA based swap readahead” in which he discussed the current kernel paging implementation for Virtual Memory Areas (VMAs) as well as how it could be improved to facilitate greater awareness of the in-memory access patterns of associated data by changing the corresponding readahead algorithm.

“Readahead” as a concept is what it sounds like. Locality (both spatial, in this case, as well as temporal, in other cases) of data means that when a memory access occurs, it is usually more likely than not that an access to a nearby memory location will soon follow (except in the case of pure random access workloads). Thus, the kernel contains support for preloading nearby data when performing various disk and memory operations. Examples include readahead of nearby disk blocks when loading filesystem data, and loading nearby disk blocks when reading pages back in from swap.

VMAs (Virtual Memory Areas) are regions of memory managed by the Linux kernel. A running application (process), known as a “task” by the kernel, contains a large number of different VMAs which form its overall address space. You can see this by inspecting /proc/self/maps (replacing “self” with a process ID that you have access to). The output will show a series of memory regions representing various memory owned by the task. Memory that doesn’t represent files is known as “anonymous memory” and it is what is paged (swapped) out under memory pressure situations.
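For those following along at a keyboard, here is a minimal sketch of reading that file. The helper names are made up for illustration, and treating every region without a file path (or with a bracketed pseudo-name such as [heap]) as “anonymous” is a simplification, since private file mappings can also contain anonymous pages once written to.

```python
import os

def parse_maps(text):
    """Parse /proc/<pid>/maps text into (start, end, perms, path) tuples."""
    regions = []
    for line in text.splitlines():
        # Fields: address range, perms, offset, dev, inode, optional pathname.
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        start, end = (int(x, 16) for x in fields[0].split("-"))
        path = fields[5].strip() if len(fields) > 5 else ""
        regions.append((start, end, fields[1], path))
    return regions

def is_anonymous(path):
    # Simplification: no path, or a bracketed pseudo-name like [heap] or [stack].
    return path == "" or path.startswith("[")

if os.path.exists("/proc/self/maps"):
    with open("/proc/self/maps") as f:
        regions = parse_maps(f.read())
    anon = sum(end - start for start, end, _, path in regions if is_anonymous(path))
    print(f"{len(regions)} VMAs, {anon // 1024} KiB without a backing file")
```

Each tuple corresponds to one line of the file, which in turn corresponds to one VMA in the task’s address space.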

As Ying notes in his RFC, the “original swap readahead algorithm does readahead based on the consecutive blocks in [the] swap device” but “the consecutive blocks in [the] swap device just reflect the order of page reclaiming” and not necessarily “the access sequence in RAM”. His patch series aims to change this by teaching the readahead algorithm about VMAs and how to bias the readahead to sequentially walk through the address space of a task (process), reading those parts of the swap space containing this data rather than simply walking through swap sequentially.
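The difference between the two approaches can be seen in a toy model. This is not Ying’s algorithm, just an illustration under simple assumptions: pages are identified by their virtual page number, swap slots are handed out in reclaim order, and the function names are invented.

```python
def swap_order_readahead(reclaim_order, fault_page, window):
    """Read `window` consecutive swap slots starting at the faulting page's slot."""
    slot = reclaim_order.index(fault_page)
    return reclaim_order[slot:slot + window]

def vma_order_readahead(reclaim_order, fault_page, window):
    """Read the swapped-out pages that follow the faulting page in address order."""
    swapped = set(reclaim_order)
    wanted = range(fault_page, fault_page + window)
    return [p for p in wanted if p in swapped]

# Swap slot i holds virtual page reclaim_order[i].
reclaim_order = [5, 2, 7, 0, 3, 6, 1, 4]
print(swap_order_readahead(reclaim_order, 2, 3))  # [2, 7, 0]
print(vma_order_readahead(reclaim_order, 2, 3))   # [2, 3, 4]
```

If the task goes on to touch pages 3 and 4 after faulting on page 2, only the second strategy has brought them in ahead of time.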

But wait! There’s more! Ying also posted a separate patch series entitled “THP swap: Delay splitting THP during swapping out”, which does what it sounds like it would do. THP (Transparent Huge Pages) is a technology used by the Linux kernel to dynamically allocate “huge” (optionally very large – up to 1GB in size, but in this case 2MB) pages of memory to contiguous regions of virtual memory address space, especially those backing shared large memory data (even including a huge zero page used for virtual machine RAM at boot). THP reduces pressure on limited CPU internal microarchitectural caches known as TLBs (Translation Lookaside Buffers) – as well as uTLBs at a lower level than the TLBs – which cache the translation performed by page table entries to physical or intermediate memory addresses. Reducing the number of TLB entries required to map regions of virtual memory reduces the number of times TLB entries must be reused by the underlying architecture during memory access operations.
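To put a number on that, here is a back-of-the-envelope calculation of how many translations are needed to cover a 1GiB region with 4KiB pages versus 2MiB huge pages. The region size is an arbitrary choice for illustration.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30

def entries_needed(region_bytes, page_bytes):
    # Round up: a partially used page still needs its own translation.
    return -(-region_bytes // page_bytes)

region = 1 * GIB
print(entries_needed(region, 4 * KIB))  # 262144
print(entries_needed(region, 2 * MIB))  # 512
```

That is a factor of 512 fewer translations competing for the same small number of TLB entries.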

The existing Linux kernel THP code splits THPs back into smaller pages whenever they are swapped (paged) out to disk. Yet it turns out that this is particularly inefficient on contemporary systems in which secondary disk or NVMe storage has far greater bandwidth than a single high end core can saturate if forced to do this work. Ying’s patch instead delays this split and pushes entire THPs out to swap, allowing for larger writes and reads of contiguous memory out to the backing storage.

Ongoing Development

“David F” inquired about RAID mode support for Intel m.2 chipsets. These devices continue the recent-ish legacy of certain Intel storage devices providing dual modes of operation: as an AHCI device, and as a hardware RAID device operating in a proprietary mode for which no Linux drivers exist. David was quite concerned that the lack of a Linux driver was becoming particularly problematic on newer machines, which might not provide a means to switch into AHCI mode (supported by Linux). Christoph Hellwig was…unsympathetic…suggesting that the RAID mode “provides worse performance”, and that its implementation was questionable. He also had a series of other suggestions for what to do with these devices – those are less family friendly to repeat in this podcast.

Michal Hocko posted “kvmalloc” which is a generic replacement for the many “open coded kmalloc with vmalloc fallback instances in the tree”. k-and-vmalloc are two different means by which kernel code allocates memory. The former is used to obtain small allocations (on the order of a few pages – the minimal granule size operated on by the virtual memory subsystem of Linux on contemporary processors) that are also linearly contiguous in physical memory. The latter is for larger allocations of strictly “virtual” memory – contiguous only when accessed using the underlying Memory Management Unit to perform a translation (this is usually automatic for kernel code, since the kernel runs with virtual memory of its own, just like user processes do, but it can be problematic if a driver would like to use this memory for certain hardware operations, such as DMA transfers). The generic wrapper aims to clean up the common case that kernel code just wants a chunk of memory and will try to allocate it with kmalloc, but fall back to the more generic vmalloc if that fails.
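The fallback pattern is easy to picture with a toy model of physical memory as a list of page frames. Nothing here is kernel code, and all three function names are hypothetical. The point is only that a fragmented system can have plenty of free frames and still fail a contiguous request.

```python
def alloc_contiguous(free, n):
    """kmalloc-like: claim n adjacent free frames, or return None."""
    run = 0
    for i, is_free in enumerate(free):
        run = run + 1 if is_free else 0
        if run == n:
            frames = list(range(i - n + 1, i + 1))
            for f in frames:
                free[f] = False
            return frames
    return None

def alloc_scattered(free, n):
    """vmalloc-like: claim any n free frames; page tables would make them look contiguous."""
    frames = [i for i, is_free in enumerate(free) if is_free][:n]
    if len(frames) < n:
        return None
    for f in frames:
        free[f] = False
    return frames

def alloc_with_fallback(free, n):
    """The open-coded pattern: try contiguous first, then fall back."""
    frames = alloc_contiguous(free, n)
    if frames is not None:
        return "contiguous", frames
    frames = alloc_scattered(free, n)
    if frames is not None:
        return "scattered", frames
    return None

# Fragmented memory: every other frame is already in use.
free = [True, False] * 4
print(alloc_with_fallback(free, 1))  # ('contiguous', [0])
print(alloc_with_fallback(free, 3))  # ('scattered', [2, 4, 6])
```

A single wrapper that hides this try-then-fall-back sequence is what saves each caller from repeating it.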

Christian Konig (AMD) posted “PCI: add resizeable BAR infrastructure” (version 2, and later an update with some fixes in a version 3 also), which aims to add support to the kernel for a PCI SIG (Peripheral Component Interconnect Special Interest Group) ECN (Engineering Change Notice) that enables BARs (Base Address Registers) to be resized at runtime. PCI(e) BARs are mapping windows (apertures) in the system memory map that are used to talk to hardware add-on cards (or built-in devices within modern platforms) by determining where the device’s memory will live. Traditionally, BARs were fixed size and so on architectures not relying upon firmware configuration of underlying BARs, Linux would have to determine where to place certain PCI(e) resources at boot/hotplug time by checking how much memory a device needed to expose and programming the BARs. With the new extension comes the possibility to increase the size of a BAR to map larger regions of memory. This is a useful feature for graphics cards, which may want to map very large regions of memory. A subsequent patch wires up the AMD GPU driver to use this.
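BAR sizes are always powers of two, which makes the bookkeeping simple. The sketch below assumes the encoding in which a size value of 0 means 1MB and each increment doubles the size. Treat that as an assumption to check against the ECN rather than a statement about Christian’s patches, and note that the function names are invented.

```python
def rebar_size_to_bytes(encoded):
    """Decode a resizable BAR size value, assuming 0 means 1 MiB."""
    return 1 << (encoded + 20)

def bytes_to_rebar_size(nbytes):
    """Encode a BAR size; it must be a power of two of at least 1 MiB."""
    if nbytes < (1 << 20) or nbytes & (nbytes - 1):
        raise ValueError("size must be a power of two of at least 1 MiB")
    return nbytes.bit_length() - 21

print(rebar_size_to_bytes(8) // (1 << 20))  # 256 (MiB)
print(bytes_to_rebar_size(8 << 30))         # 13
```

A graphics card that wants to expose 8GiB of memory through one BAR would, under this encoding, ask for size value 13.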

Javi Merino posted “Documentation/EDID fixes”, which aims to correct some broken assumptions in the kernel documentation for EDID (Extended Display Identification Data – the data provided over e.g. I2C from a VGA monitor when the cable is connected). The examples didn’t build correctly due to existing assumptions. This author is probably one of few people who always thinks of EDID and the interaction with Xorg every time he plugs in an external projector to his laptop.

David Howells posted “net: Work around lockdep limitation in sockets that use sockets” in which he corrected an erroneous assumption in the kernel “lockdep” (lock dependency checker) that prevented it from correctly identifying bad call chains involving TCP sockets when there exists a dependency between sockets created purely in the kernel and sockets created purely in userspace (which the lockdep could not distinguish between due to its use of broad lock classes). The AFS (Andrew File System) was generating a false lockdep warning because it was exposing such an implied dependency.

Charles Keepax posted “genirq: Add support for nested shared IRQs” to address an audio CODEC that also acts as an interrupt controller. The details sounded rather painful. Yet it was “fairly easy” to fix.

Steven Rostedt posted “tracing: Allow function tracing to start earlier in boot up”, which does roughly what it says on the can, “moving tracing up further in the boot process”, “right after memory is initialized”. He noted that his RFC was a start and could be further improved upon.

Matthew (Willy) Wilcox posted an RFC entitled “memset_l and memfill” that provides a generic means for architectures to provide optimized functions that “fill regions of memory with patterns larger than those contained in a single byte”. This is intended to be used by zram as well as other code.
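In userspace terms, the idea looks something like the following sketch, which fills a buffer with a repeating 4-byte value. The helper name is hypothetical and this says nothing about how the kernel functions are implemented or optimized per architecture.

```python
import struct

def fill_pattern(buf, pattern):
    """Fill buf in place with repeated copies of a multi-byte pattern."""
    if not pattern or len(buf) % len(pattern) != 0:
        raise ValueError("buffer length must be a multiple of the pattern length")
    buf[:] = pattern * (len(buf) // len(pattern))

buf = bytearray(16)
fill_pattern(buf, struct.pack("<I", 0xDEADBEEF))  # little-endian 32-bit value
print(buf.hex())  # efbeaddeefbeaddeefbeaddeefbeadde
```

A plain memset could only repeat a single byte, so a caller wanting this result would otherwise need its own loop.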

Paul McKenney noticed some of his RCU torture tests failing during hotplug early in boot due to calls to smp_store_cpu_info during that operation. The call is not safe because it indirectly invokes schedule_work() which wants to use RCU prior to RCU being enabled as a side effect of dealing with an unstable TSC (Time Stamp Counter) on the afflicted CPU. Peter Zijlstra had an opinion on hotplug, and also a patch to handle this situation.

Vlad Zakharov posted “update timer frequencies”, which inquired about the best means to implement a cpufreq driver for ARC CPUs. These have a special property that “ARC timers (including those are used for timekeeping) are driven by the same clock as ARC CPU core(s)”. Yup, they change frequency according to the current CPU frequency. Which as Thomas Gleixner noted in response is “broken by design and you really should go and tell your hardware folks to fix that”. He added that “It’s well known for more than TWO decades that changing the frequency of the timekeeper clocksource is a complete disaster”.
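A small worked example shows why. Timekeeping converts a cycle count into elapsed time using a frequency it learned once. The 100 MHz figure below is hypothetical, as is the scenario of the clock halving partway through.

```python
def cycles_to_ns(cycles, assumed_hz):
    """Convert a cycle count to nanoseconds using a fixed, assumed frequency."""
    return cycles * 1_000_000_000 // assumed_hz

ASSUMED_HZ = 100_000_000  # hypothetical 100 MHz timer, as registered at boot

# One real second at full speed, then one real second after the clock halves.
cycles = 1 * ASSUMED_HZ + 1 * (ASSUMED_HZ // 2)
reported_ns = cycles_to_ns(cycles, ASSUMED_HZ)
print(reported_ns / 1e9)  # 1.5
```

Two seconds of wall clock time pass, but only 1.5 seconds are reported, and the error keeps growing for as long as the CPU stays at the lower frequency.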

Thomas Gleixner posted “kexec, x86/purgatory: Cleanup the unholy mess”, which aims to address his opinion that “the whole machinery is undocumented and lacks any form of forward declarations” (of variables which were previously global but had been made static). Purgatory is a special piece of code which is provided by the kernel but runs in the interim period between the kernel crashing (or beginning kexec) and the new crash or kexec kernel that is then subsequently loaded – this is what performs the load and exec.

Kernel Podcast for March 6th, 2017

Audio: http://traffic.libsyn.com/jcm/20170306.mp3

In this week’s kernel podcast: Linus Torvalds announces Linux 4.11-rc1, rants about folks not correctly leveraging linux-next, the remainder of this cycle’s merge window pulls, and announcements concerning end of life for some features.

Linus Torvalds announced Linux 4.11-rc1, noting that “two weeks have passed, the merge window is over, and 4.11 has been tagged and pushed out.” He notes that the latest kernel cycle is set to be “on the smallish side”, but that is only in comparison with the most recent two cycles, which have been significantly larger than typical. He notes that 4.11 has a similar number of commits to 4.1, 4.3, 4.5, and 4.7 before it. With the release of 4.11-rc1 comes the closing of the “merge window” (defined by it, the period of time during which disruptive changes are allowed into the kernel prior to RC).

We covered most of the major pulls for 4.11 in last week’s podcast. But there were a few more stragglers. Here’s a sample of those:

J. Bruce Fields posted “nfsd changes for 4.11” which included two semantic changes: NFS security labels are “now off by default” and a “new security_label export flag reenables it per export” since this “only makes sense if all your clients and servers have similar enough selinux policies”. Secondly, NFSv4/UDP support is off because “It was never really supported, and the spec explicitly forbids it. We only ever left it on out of laziness; thanks to Jeff Layton for finally fixing that.”

Anna Schumaker followed up a little later with “Please pull NFS client changes for Linux 4.11”, which includes a memory leak in “_nfs4_open_and_get_state”, as well as various other fixes and new features.

Matthew (Willy) Wilcox posted “Please pull IDR rewrite” which seeks to harmonize the IDR (“Small id to pointer translation service avoiding fixed sized tables”) and in-kernel radix tree code. According to Willy, merging the two codebases “lets us share the memory allocation pools, and results in a net deletion of 500 lines of code. It also opens up the possibility of exposing more of the features of the radix tree to users of the IDR”.

Will Deacon posted “arm64 fixes for -rc1” of which the “main fix here addresses a kernel panic triggered on Qualcomm QDF2400 due to incorrect register usage in an erratum workaround introduced during the merge window”.

Michael S. Tsirkin posted “vhost: cleanups and fixes”, of which there were very few for this kernel cycle.

Nicholas A. Bellinger posted “target updates for v4.11-rc1”, which includes support for “dual mode (initiator + target) qla2xxx operation”, and a number of other fixes and improvements. He pre-warns that things are “shaping up to be a busy cycle for v4.12 with a new fabric driver (efct) in flight, and a number of other patches on the list being discussed”.

Rafael J. Wysocki posted “Additional ACPI update for v4.11-rc1”, which includes a fix for “an apparent, but actually artificial, resource conflict between the ACPI NVS memory region and the ACPI BERT (Boot Error Record Table)”.

Jens Axboe posted “Block fixes for 4.11-rc1”, which includes a “collection of fixes for this merge window, either fixes for existing issues, or parts that were waiting for acks to come in”. These include a performance fix for the allocation of nvme queues on the right node, along with others.

Miklos Szeredi posted “fuse update for 4.11” and “overlayfs update for 4.11”. The latter “allows concurrent copy up of regular files eliminating [the] potential problem” of (previously) serialized copy ups taking a long time.

Bjorn Helgaas posted “PCI fixes for v4.11”, including a couple of fixes for bugs introduced during code refactoring.

Dan Williams posted “libnvdimm fixes for 4.11-rc1”, which includes a fix for the generation of “nvdimm namespace label”s (metadata) checksums that “Linux was not calculating correctly leading to other environments rejecting the Linux label”.

Helge Deller posted “parisc updates for 4.11”, noting that there was “nothing really important” in this particular cycle to pull in.

James Bottomley posted “final round of SCSI updates for the 4.10+ merge window”, which “is the set of stuff that didn’t quite make the initial pull and a set of fixes for stuff which did”.

Radim Krcmar posted “Second batch of KVM changes for 4.11 merge window”, which includes a number of fixes for PPC and x86.

David Miller posted “Networking”, including many fixes.

A linux-next rant

In his 4.11-rc1 announcement, Linus noted that “it *does* feel like there was more stuff that I was asked to pull than was in linux-next. That always happens, but seems to have happened more now than usually. Comparing to the linux-next tree at the time of the 4.10 release, almost 18% of the non-merge commits were not in Linux-next. That seems higher than usual, although I guess Stephen Rothwell has actual numbers from past merges.” Let’s break down what Linus said a little. Stephen Rothwell is an (overworked) kernel hacker based in Australia who produces a (daily, outside of the merge window) kernel tree (and accompanying test infrastructure, patch tracking, and announcement mechanisms) known as “linux-next”. Its raison d’etre is to be the proving ground for new features before they are sent to Linus for merging.

Typically, major new features soak in linux-next for a cycle prior to the one in which they are actually merged (so features landing in 4.11 would have been largely complete and tested via -next during 4.10). Linux kernel development cycles are generally on the order of about two months, so this isn’t an unreasonably long period of time for disruptive changes to languish. Contrast this with the multi-year wait that used to happen back when Linux had an odd/even minor version cycle in which even numbers (2.2, 2.4, 2.6) were the “supported” releases and the odd numbers (2.1, 2.3, 2.5) were development ones. That seems like ancient history now, but it’s really only in the past decade of git that kernel development tooling and community has reached a level of sophistication that the ship can keep moving while the engine is replaced.

Linus noted that there are a “few different classes” of changes that didn’t come to him following a previous test in linux-next. Those include fixes (which is “obviously ok and inevitable”), a specific example (statx) for a longstanding issue that has been ongoing for years (to which he said, “Yeah, I’ll allow this one too”), and the “quite noticeable <linux/sched.h> split up series” which “had real reasons for late inclusion”. Finally, he includes the class of subsystems such as “drm, Infiniband, watchdog and btrfs”, which he “found rather annoying this merge window”. He reminded folks of the “linux-next sanity checks” and that if folks ignore them “you had better have your own sanity checks that you replaced them with” rather than “screw all the rules and processes we have in place to verify things”.

The bottom line? Linus says “You people know who you are. Next merge window I will not accept anything even remotely like that. Things that haven’t been in linux-next will be rejected, and since you’re already on my sh*t-list you’ll get shouted at again”. And nobody enjoys being shouted at by Linus. Well, almost nobody. There do seem to be a few people who perversely enjoy it.

Announcements

A couple of questions of code maintenance arose this week. The first was from Natale Patriciello, who asked whether UML (User Mode Linux) is “not maintained anymore?” by citing a few bugs that haven’t been resolved in some time. There were no followups at the time of this recording. The second question came in the form of an RFC (Request For Comments) patch entitled “remove support for AVR32 architecture” from Hans-Christian Noren Egtvedt. He noted that AVR32 is “not keeping up with the development of the kernel”, “shares so much of the drivers with Atmel ARM SoC”, and “all AVR32 AP7 SoC processors are end of lifed from Atmel (now Microchip)”. This did seem like a fairly compelling set of reasons to kill it, which others agreed with also. This means that unless someone comes forward soon to maintain AVR32 (along with the associated GCC toolchain and other distribution pieces), its days in the upstream Linux kernel are numbered – it will probably be removed in 4.12.

Sebastian Andrzej Siewior announced Linux v4.9.13-rt11, which includes a fix for a previous fix (allowing the previous lockdep fix to compile on UP).

Drivers

Logan Gunthorpe posted “New Microsemi PCI Switch Management Driver”, which is in its 7th revision. The RFC (Request for Comments) “proposes a management driver for Microsemi’s Switchtec line of PCI switches. This hardware is still looking to be used in the Open Compute Platform”. Logan notes that “Switchtec products are compliant with the PCI specifications and are supported today with the standard in-kernel driver. However, these devices also expose a management endpoint on a separate PCI function address which can be used to perform some advanced operations”.

Ongoing Development

Michael S. Tsirkin continued his work on “vfio error recovery: kernel support” with version 4 of the patch series which seeks to do more than simply ignoring non-fatal PCIe AER (Advanced Error Reporting) errors that hit assigned devices passed using VFIO into a guest Virtual Machine. Currently, only fatal errors (which cause a PCIe link reset) are reported – they stop the guest. In his summary email, Michael notes that his goal is to handle non-fatal errors by reporting them to the guest and having it handle them. And rather than surprising existing code, he calls out under “issues” that “this behavior should only be enabled with new userspace, old userspace should work without changes”. By “userspace” he means the code driving VFIO, which might be a QEMU process that is backing a KVM virtual machine context, or a container, or merely a bare metal userspace process that is using VFIO directly.

Johannes Weiner posted “mm: kswapd spinning on unreclaimable nodes – fixes and cleanups” in which he notes a previous posting from Jia He that he (and the team at Facebook) have reproduced. In the case of the problem scenario, the kernel’s kswapd (swap space daemon) for a given (memory) node spins indefinitely at 100% CPU usage when there are absolutely no reclaimable pages (granules of the smallest size of memory that can be managed by Linux and the underlying hardware), because the “condition for backing off is never met”. This results in kswapd busy-looping forever. In his patches, Johannes changes reclaim behavior so that kswapd will eventually really back off after failing 16 times (which is the same magic number of times we try during an OOM “Out Of Memory” situation) as defined by MAX_RECLAIM_RETRIES. He includes various examples.
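The shape of the fix is a bounded retry loop. The sketch below is a toy model rather than anything resembling the real reclaim code: the function names are invented, and only the constant name and its value of 16 come from the patch description.

```python
MAX_RECLAIM_RETRIES = 16  # the retry budget named in the patch description

def kswapd_run(reclaim_once):
    """Call reclaim_once() until it frees pages or the failure budget runs out."""
    failures = 0
    while failures < MAX_RECLAIM_RETRIES:
        if reclaim_once() > 0:
            return "progress", failures
        failures += 1
    return "backed off", failures

# A node with nothing reclaimable: every attempt frees zero pages.
print(kswapd_run(lambda: 0))  # ('backed off', 16)
```

Without the failure counter, the same loop on an unreclaimable node would never terminate, which is the 100% CPU behavior described above.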

Len Brown posted “cpufreq: Add the “cpufreq.off=1” cmdline option”. This is a corollary to “cpuidle.off=1” and comes about for similar reasons for the purpose of testing. This author wonders aloud whether this will allow for buggy platforms that don’t support CPPC (Collaborative Processor Performance Control) to easily disable this at runtime too.

Aleksey Makarov posted “printk: fix double printing with earlycon”. On ACPI compliant platforms (including ARM servers), the SPCR (“Serial Port Console Redirection”) table provides information about the serial console UART that the kernel should be using, rather than having the user provide memory register addresses and baud rates on the kernel command line. This is a feature which is generally useful beyond ARM systems (although most x86 systems follow the traditional “PC” UART design). Prior to this fix, the kernel would double print output if given a “console=” and “earlycon”.

Minchan Kim posted “make try_to_unmap simple” which aims to remove some of the (apparently somewhat gratuitous) complexity in the return value of this function. Currently it can return SWAP_SUCCESS, SWAP_FAIL, SWAP_AGAIN, SWAP_DIRTY, and SWAP_MLOCK. But Minchan feels that it can be simply a boolean return by removing the latter three of those return values.

Matthew Gerlach (Intel) posted “Altera Partial Reconfiguration IP”, which adds support to the kernel’s (Alan Tull’s) “fpga-mgr” driver for the “Altera Partial Reconfiguration IP”. Partial Reconfiguration (sometimes known as “PR” in the reconfigurable logic community) allows an FPGA (Field Programmable Gate Array)’s logic fabric to be reconfigured in smaller than whole regions. This (for example) would allow a closely coupled datacenter (Xeon) processor to continue to drive certain FPGA contained IP while other IP were being replaced dynamically. If one were to couple this with support in OpenStack Nomad or Kubernetes for dynamic reconfiguration at VM/container setup it would begin to enable various use cases for the mainstream datacenter around FPGA acceleration.

Andi Kleen posted “pci: Allow lockless access path to PCI mmconfig”. “mmconfig” refers to the memory mapped configuration region used by contemporary PCIe devices during enumeration and configuration. This is a kind of out-of-band mechanism by which the kernel can talk to PCIe devices in a fully standards compliant means prior to having configured them. Intel processors include many “PCIe” devices that are in fact a logical means of expressing so called “uncore” non-compute features on the processor SoC. They’re not real PCIe devices but appear to the kernel as such. This wonderful abstraction comes with some overhead cost, especially when the kernel spends time grabbing the “pci_cfg_lock” which it actually doesn’t need to hold, according to Andi.
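For the curious, the memory mapped configuration region has a simple layout: each function gets 4KiB of config space, indexed by bus, device, and function number. The sketch below computes where a given register lives. The base address is hypothetical (real systems report it in the ACPI MCFG table), the function name is invented, and the calculation assumes the region starts at bus 0.

```python
def ecam_offset(bus, device, function, register):
    """Offset of a config register within a PCIe memory-mapped config region."""
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("bus/device/function out of range")
    if not 0 <= register < 4096:
        raise ValueError("register offset out of range")
    # 8 bits of bus, 5 bits of device, 3 bits of function, 12 bits of register.
    return (bus << 20) | (device << 15) | (function << 12) | register

MMCONFIG_BASE = 0xE0000000  # hypothetical base address for illustration
print(hex(MMCONFIG_BASE + ecam_offset(0, 31, 3, 0x10)))  # 0xe00fb010
```

Because every register has its own fixed address, a config read is a single memory load, which is the intuition behind asking whether a global lock is needed around it.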

Jarkko Sakkinen posted version 3 of “in-kernel resource manager”, which adds support to the kernel for “TPM spaces that provide an isolated execution context for transient objects and HMAC policy sessions”.

Tomas Winkler posted a question about what the community considered to be the “correct usage of arrays of variable length within [the] Linux kernel”. The replies generally included language to the effect of “don’t”. Both for reasons of general language ugliness, and also because (especially in the case of local variables) the Linux kernel’s fixed (and also small) size stack raises serious potential for stack overflow if one is not careful. There was a suggestion that the kernel should be built with a compiler option to disallow VLAs, but that this would require various code to be fixed first.

Kernel Podcast for Feb 27th, 2017

Audio: http://traffic.libsyn.com/jcm/20170228.mp3

In this week’s kernel podcast: the merge window for kernel 4.11 is open and patches are flying into Linus’s inbox, fixing NUMA node determination at runtime, Virtual Machine Aware Caches, Advisory Memory Allocations, and a non-fixed TASK_SIZE to bring excitement to your life. We will have this, and a summary of ongoing development in this week’s Linux Kernel podcast.

The merge window (period of time during which disruptive changes are allowed to be “merged” – incorporated into Linus’s official git tree – prior to a multi-week stabilization and Release Candidate cycle) for Linux 4.11 is currently open. This means that the most recent official kernel remains Linux 4.10. Meanwhile, many “pull requests” and merges are in flight for various kernel subsystems planning updates in 4.11. These include:

  • Ingo Molnar posted “EFI changes for 4.11”, including support for determining at boot time whether secure boot authentication was performed.
  • Ingo also posted “x86/cpufeature changes for v4.11”, which include the new support for “ring-3 MONITOR/MWAIT instructions on supported CPUs”. This is otherwise known as “MWAIT in userspace”, in which an unprivileged application can (in certain approved situations) use the CPU’s built-in monitor to cause a low-latency low-power wait on a memory location. This can be used (for example) by various userspace lock infrastructure to obviate spinning.
  • Joerg Roedel posted “IOMMU Updates for Linux v4.11”, which includes patches from Eric Auger (Red Hat) implementing “KVM PCIe/MSI passthrough support on ARM/ARM64”. These patches have been under development for many many months, and have been completely refactored on several occasions. They begin to enable various (OP)NFV (Open Platform for Network Function Virtualization) use cases, such as DPDK accelerated OVS (and other VNFs – Virtual Network Functions) within VMs passing through PCIe devices from the host via VFIO. Accompanying this was support for “a core representation for individual hardware iommus” (ARM uses a distributed System-MMU architecture), support for SMMUv2 on ARM systems, a stream table optimization for SMMUv3 on ARM systems, and various other small improvements.
  • Rafael J. Wysocki posted “Power management updates for v4.11-rc1”, noting that the “majority of changes go into the Operating Performance Points (OPP) framework and cpufreq this time, followed by devfreq and some scattered updates all over”. He also posted “ACPI updates for v4.11-rc1”, which include a rebase of the ACPICA (ACPI – Advanced Configuration and Power Interface – Component Architecture) reference shared among various Operating Systems for interpreting ACPI AML (ACPI Machine Language) at runtime. The ACPICA is updated to 20170119, with many fixes, including those “related to the handling of the bit width and bit offset fields in [GAS] Generic Address Structure”, utility updates, and support for “method invocations as target operands in AML”.
  • James Morris posted “Security subsystem updates for 4.11”, including a “major AppArmor update: policy namespaces & lots of fixes”, a new “/sys/kernel/security/lsm node for easy detection of loaded LSMs”, “SELinux cgroupfs labeling support”, and “SELinux context mounts on tmpfs, ramfs, devpts within user namespaces”. There was also “improved TPM 2.0 support”. This author is hoping an outfit such as Linux Weekly News (LWN) has an article on TPM2.0 at some point soon. James also posted a “seccomp bugfix” from Kees Cook that ensures seccomp will only dump core in the case that a process is single threaded (Kees wasn’t done with his usual awesome security fixes – he also had one to “censor kernel pointer in debug files” within the cgroup filesystem).
  • Bjorn Helgaas posted “PCI changes for v4.11”. These include ACS (Access Control Services) quirks for Intel Union Point, Qualcomm QDF2400, and QDF2432. ACS allows PCIe devices to communicate peer to peer without an intervening transaction through the Root Complex for IOV capabilities. Linus grumbled about Bjorn’s pull request due to the use of an SHA1 without a branch or tag name. But Bjorn noted it was a simple script mistake and was already fixed – he sent a followup with corrected “pci-v4.11-changes”.
  • Stafford Horne posted a very large set of patches for OpenRISC. These include “optimized memset and memcpy routines” with a 20% boot time saving, “support for cpu idling”, and various preparatory work on atomics, bitops, futexes, and locks in anticipation of future SMP support. Finally, he added a link to the OpenRISC git tree (on github) to MAINTAINERS. The OpenRISC architecture gets a bit less press these days than RISCV but it is still alive, and has a number of implementations. Your author has several OpenRISC development boards but hasn’t played in a while.
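
As a small illustration of the new “/sys/kernel/security/lsm” node mentioned in the security pull above, the following Python sketch reads it and splits the result into a list. It assumes securityfs is mounted at the usual location and that the file holds a comma-separated list of LSM names; the helper names are hypothetical.

```python
# Assumed location: requires a 4.11+ kernel with securityfs mounted.
LSM_NODE = "/sys/kernel/security/lsm"


def parse_lsm_list(text):
    """Split the comma-separated contents of the lsm node into names."""
    return [name for name in text.strip().split(",") if name]


def loaded_lsms(path=LSM_NODE):
    """Return the list of active LSMs, or None if the node cannot be read."""
    try:
        with open(path) as node:
            return parse_lsm_list(node.read())
    except OSError:
        return None


print(loaded_lsms())
```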

For a detailed summary of current merge window pulls and patches, consult this week’s Linux Weekly News at LWN.net (Thursday).

Geert Uytterhoeven posted a summary of “Build regressions/improvements in v4.10”. These show an increase in build errors and warnings vs the previous 4.9 kernel cycle. He posted a list of configs used, the error and warning messages, and thanked the “linux-next team for providing the build service”.

Pavel Machek has been posting about various problems running 4.10 kernels. In one instance, he saw a corrupted stack that implied a double call to “startup_32_smp” (the secondary CPU boot method on Intel x64 Architecture). This led Josh Poimboeuf to ponder whether the GCC in use was somehow bad.

Announcements

Greg Kroah-Hartman announced Linux 4.4.52, 4.9.13, and 4.10.1. Ben Hutchings announced Linux 3.16.41, and 3.2.86.

Stephen Hemminger announced iproute2-4.10, including support for “new features in Linux 4.10”. Amongst those new features are “enhanced support for BPF [Berkeley Packet Filter], VRF [Virtual Routing and Forwarding], and Flow based classifier (flower)”. The latest version is available here: https://www.kernel.org/pub/linux/utils/net/iproute2/iproute2-4.10.0.tar.gz

Karel Zak announced util-linux v2.29.2, including a fix for a (nasty) “su” security issue, otherwise documented in CVE-2017-2616. According to Karel, it is “possible for any local user to send SIGKILL to other processes with root privileges. To exploit this, the user must be able to perform su with a successful login. SIGKILL can only be send to processes which were executed after the su process. It is not possible to send SIGKILL to processes which were already running”. A fix entitled “properly clear child PID” against “su” is included among the fixes listed.

Lucas De Marchi announced kmod 24, which includes enhanced support for kernel module dependency loop detection: ftp://ftp.kernel.org/pub/linux/utils/kernel/kmod/kmod-24.tar.xz

Junio C Hamano announced git version 2.12.0: https://www.kernel.org/pub/software/scm/git/

Con Kolivas announced his Linux-4.10-ck1 MuQSS (Multiple Queue Skiplist Scheduler) version 0.152. More details at: http://ck.kolivas.org/patches/4.0/4.10/4.10-ck1/

Ove Kent Karlsen has been performing various Linux gaming experiments. They posted links to YouTube videos showing results with “Doom 3”, which can be found here: https://www.youtube.com/watch?v=xDct6vVvFxA

NUMA node determination

Dou Liyang (Fujitsu) posted several revisions of a patch series entitled “Revert works for the mapping of cpuid <-> nodeid”. This is intended to clean up the process by which (Intel x64 Architecture) systems enumerate the mapping of physical processor IDs to NUMA (Non-Uniform Memory Architecture) multi-socket “node” IDs. Conventionally, Linux uses the MADT (Multiple APIC Description Table – otherwise known as the “APIC” table for legacy reasons) ACPI table to map processors to their “Local APIC ID” (the ID of the core connected to the Intel APIC interrupt controller’s LAPIC CPU interface). It then maps these to NUMA nodes using the _PXM node ID in the ACPI DSDT (Differentiated System Description Table) and determines NUMA topology using the SRAT (Static Resource Affinity Table) and SLIT (System Locality Information Table). But this is fragile. Firmware developers are known to make mistakes on occasion, and these have included “duplicated processor IDs in DSDT”, and having the “_PXM in DSDT…inconsistent with the one in [the] MADT”. For this reason, Dou seeks to move the proximity discovery into the system’s hotplug path by reverting two previous commits. Xiaolong Ye (Intel) said he would test these and follow up.

As a footnote, it’s worth adding that modern processors have a very loose notion of a “physical” core, since they usually (internally) support dynamic remapping of true physical cores to the IDs exposed even to system programmers. This affords the illusion of contiguously numbered processors, and prevents an easy analysis of binning and yield characteristics. It’s one of the reasons that processors such as Intel’s use various mapping schemes in order to determine NUMA node proximity. But one should never assume that any information given about a processor in any table reflects reality other than as a microprocessor company wanted you to perceive it.
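
If you want to see the CPU-to-node mapping that the kernel ended up with after all of this table parsing, sysfs exposes it under /sys/devices/system/node. The following Python sketch reads each node's “cpulist” file and builds a simple dictionary. The function names are hypothetical, and the sketch simply returns an empty mapping on systems without that directory.

```python
import os
import re


def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-3,8-11" into CPU numbers."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # memory-only nodes have an empty cpulist
        low, _, high = part.partition("-")
        cpus.extend(range(int(low), int(high or low) + 1))
    return cpus


def cpu_to_node(sysfs_root="/sys/devices/system/node"):
    """Return {cpu: node} as reported by sysfs, or {} if it is unavailable."""
    mapping = {}
    if not os.path.isdir(sysfs_root):
        return mapping
    for entry in sorted(os.listdir(sysfs_root)):
        match = re.fullmatch(r"node(\d+)", entry)
        if not match:
            continue
        with open(os.path.join(sysfs_root, entry, "cpulist")) as cpulist:
            for cpu in parse_cpulist(cpulist.read()):
                mapping[cpu] = int(match.group(1))
    return mapping


print(cpu_to_node())
```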

Virtual Machine Aware Caches

Shanker Donthineni (Codeaurora) posted “arm64: Add support for VMID aware PIPT instruction cache”. Caches on the ARMv8 architecture are defined to be PIPT (Physically Indexed, Physically Tagged) from a software perspective (although the underlying implementation might be different – for example, you could index virtually with VIPT underneath a PIPT facade if you implemented expensive logic for automatic homonym detection). The ARMv8.2 specification allows “VMID aware PIPT” which means a cache is PIPT but aware of the existence of Virtual Machine IDs (VMIDs), which might form part of the cache entry. Will Deacon responded that the approach “may well cause problems for KVM with non-VHE [Virtualization Host Extensions – the ability to run “type 2” hypervisors with split page tables for the kernel and userspace, as opposed to non-VHE implemented on original ARMv8.0 machines in which a shim running with its own page tables is required for KVM] because the host VMID is different from the guest VMID, yet we assume that I-cache invalidation by the host *will* affect the guest when, for example, invalidating the I-cache for pages holding the guest kernel Image”. He noted that he had some other patches in flight that he would post soon (for 4.12).

Advisory Memory Allocations in real life

Shaohua Li (Facebook) posted “mm: fix some MADV_FREE issues”. MADV_FREE is part of relatively recent(ish) kernel infrastructure to support advisory mmaps that the kernel may need to arbitrarily reclaim later when low on available memory. It’s the kind of thing that other Operating Systems (such as Windows) have done for many years (Windows will even dynamically enlarge its swap (paging) file in low memory situations). Facebook apparently like to use the (alternative) “jemalloc” userspace memory allocator and have found a number of issues when attempting to combine this with MADV_FREE advice (passed to madvise) on mmapped memory. Shaohua notes that MADV_FREE cannot be used on a machine without swap enabled, that it actually increases memory pressure (due to page reclaim being biased against anonymous pages), and that it lacks global accounting. The patches aim to address these.
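
To make the idea concrete, here is a minimal Python sketch of what an allocator does when it hands pages back with MADV_FREE. It assumes Linux and Python 3.8 or later (where mmap objects gained a madvise method); the function name is hypothetical, and the sketch reports “unsupported” rather than failing where the advice is not available.

```python
import mmap


def lazy_free_demo(length=4 * mmap.PAGESIZE):
    """Dirty a private anonymous mapping, then mark it as lazily freeable."""
    advice = getattr(mmap, "MADV_FREE", None)  # absent on some platforms
    buf = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        buf[:] = b"\xaa" * length  # touch every page so it is really backed
        if advice is None or not hasattr(buf, "madvise"):
            return "unsupported"
        try:
            # The kernel may now discard these pages under memory pressure
            # instead of writing them out to swap.
            buf.madvise(advice)
        except OSError:
            return "unsupported"  # for example, a kernel older than 4.5
        buf[0:1] = b"\x01"  # a fresh write makes that page live again
        return "advised"
    finally:
        buf.close()


print(lazy_free_demo())
```

After the advice is given, reading a page returns either the old contents or zeroes, depending on whether the kernel has reclaimed it yet, which is why allocators only use this on memory they consider free.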

Non-fixed TASK_SIZE

Martin Schwidefsky and Linus Torvalds had a back and forth discussion about “Using TASK_SIZE for kernel threads”. As kernel programmers know, kernel threads (“tasks”, or “kernel processes” – these show up in brackets in “ps” and “top”) don’t have an associated “mm” struct (they have no userspace). On s390, just to be different, TASK_SIZE is not fixed. It can actually be one of several values that are determined by reading a field in a task’s mm struct (context.asce_limit). This was causing very subtle breakage as the kernel indirected into a null structure which happened to contain a value very close to zero that kinda worked. Martin has a fix queued up but had some suggestions for changes to make to the kernel to avoid such a subtle issue in future. Linus was more convinced that s390 was just doing something that needed fixing.

Ongoing Development

Elena Reshetova (Intel) posted many patches converting various uses of the kernel’s “atomic_t” datatype as a reference counter over to the new “refcount_t”. As she notes, “[b]y doing this we prevent intentional or accidental underflows or overflows that can le[a]d to use-after-free vulnerabilities”. Examples include architecture and VM code fixes.
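
The following toy Python model illustrates the general idea of a saturating reference counter. It is not the kernel's implementation: the class and method names are hypothetical, it is not atomic, and it raises exceptions where a kernel would warn. It only shows why a counter that sticks at its maximum and refuses to resurrect a zero count cannot wrap around into a premature free.

```python
SATURATED = 2**32 - 1  # models the largest value of a 32-bit unsigned counter


class SaturatingRefcount:
    """Toy, non-atomic model of a reference counter that cannot wrap."""

    def __init__(self, value=1):
        self.value = value

    def inc(self):
        if self.value == 0:
            # Taking a reference on a dead object would be a use-after-free.
            raise RuntimeError("increment on zero")
        if self.value != SATURATED:
            self.value += 1  # once saturated, the counter stays put

    def dec_and_test(self):
        """Drop one reference; return True when the last one is gone."""
        if self.value == SATURATED:
            return False  # leak the object rather than risk freeing it
        if self.value == 0:
            raise RuntimeError("underflow")
        self.value -= 1
        return self.value == 0


ref = SaturatingRefcount()
ref.inc()
print(ref.dec_and_test(), ref.dec_and_test())
```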

Xunlei Pang (Red Hat) posted version 2 of a patch entitled “x86/mce: Don’t participate in rendezvous process once nmi_shootdown_cpus() was made”. This aims to juggle a post-crash conundrum: system errors severe enough to generate an MCE (Machine Check Exception) should not be ignored (and thus the machine check handler should run in the kernel) but they might be generated during the process of actively taking a crash/kdump. The existing code might instead cause a panic on exit from the (old kernel provided) MCE handler. Borislav Petkov didn’t like some of the details of the patch. He also wanted to see explicit documentation as to the handling of MCEs.

Andy Lutomirski posted “KVM TSS cleanups and speedups”, which aims to refactor how the kernel handles the guest TSS (Task State Segment) on Intel x64 Architecture systems. These are layered upon a series from Thomas Gleixner aimed at cleaning up GDT (Global Descriptor Table) use. He notes that there “may be a slight speedup, too, because they remove an STR [Store Task Register] instruction from the VMX [Virtual Machine Extensions] entry path”.

Heikki Krogerus posted version 17 of a patch series implementing “USB Type-C Connector class” support. This is “meant to provide [a] unified interface to…userspace to present the USB Type-C ports in a system”. Your author is looking forward to trying this on his Dell XPS Skylake with USB-C.

Rob Herring posted a patch “Add SPDX license tag check for dts files and headers” to the kernel’s “checkpatch.pl” patch submission checking tool.

Finally this week, Lorenzo Pieralisi posted “PCI: fix config and I/O Address space memory mappings” intended to address the inconvenient fact that “ioremap” on 32-bit and 64-bit ARM platforms was failing to strictly comply with the PCI local bus specification’s “Transaction Ordering and Posting” requirements. These mandate that PCI configuration cycles (during startup or hotplug) and I/O address space accesses must be “non-posted” (in other words, they must always receive a write notification response and not be buffered arbitrarily). Lorenzo addresses this with a 20 part patch series that cleans this up.

Kernel Podcast for Feb 20th, 2017

UPDATE: Thanks to LWN for the mention. This podcast is in “alpha”. It will start to show up on iTunes and Google Play (which didn’t exist last time I did this thing!) stores within the next day or two. You can also subscribe (for the moment) by using this link: kernel podcast audio rss feed. This podcast format will be tweaked, and the format/layout will very likely change a bit as I figure out what works, and what does not. Equipment just started to arrive at home (Zoom H4N Pro, condenser mics, etc.), a new content publishing platform needs to get built (I intend ultimately for listeners to help to create summaries by annotating threads as they happen). And yes, my former girlfriend will once again be reprising her role as author of another catchy intro jingle…soon 😉

Audio: Kernel Podcast 20170220

Support for this podcast comes from Jon Masters, trying to bring back the Kernel Podcast since 2012.

In this week’s edition: Linus Torvalds announces Linux 4.10, Alan Tull updates his FPGA manager framework, and Intel’s latest 5-level paging patch series is posted for review. We will have this, and a summary of ongoing development in the first of the newly revived Linux Kernel Podcast.

Linux 4.10

Linus Torvalds announced the release of 4.10 final, noting that “it’s been quiet since rc8, but we did end up fixing several small issues, so the extra week was all good”. Linus added a (relatively rare) additional “RC8” (Release Candidate 8) to this kernel cycle due to the timing – many of us were attending the “Open Source Leadership Summit” (OSLS, formerly “Linux Foundation Collaboration Summit”, or “Collab”) over the past week. The 4.10 kernel contains about 13,000 commits, which used to seem large but somehow now…isn’t. Kernelnewbies.org has the usual summary of new features and fixes: https://kernelnewbies.org/Linux_4.10

With the announcement of 4.10 comes the opening of the merge window for Linux 4.11 (the period of up to two weeks at the beginning of a development cycle, during which new features and disruptive changes are “pulled” into Linus’s kernel (git) tree). The 4.11 merge window begins today.

FPGA Manager Updates

Alan Tull posted a patch series implementing “FPGA Region enhancements and fixes”, which “intends to enable expanding the use of FPGA regions beyond device tree overlays”. Alan’s FPGA manager framework allows the kernel to manage regions within FPGAs (Field Programmable Gate Arrays) known as “partial reconfigurable” regions – areas of the logic fabric that can be loaded with new bitstream configs. Part of the discussion around the latest patches centered on their providing a new sysfs interface for loading FPGA images, and in particular the need to ensure that this ABI handle FPGA bitstream metadata in a standard and portable fashion across different OSes.

Intel 5-level paging

Kirill A. Shutemov posted version 3 of Intel’s 5 level paging patch series that expands the supportable VA (Virtual Address) space on Intel Architecture from 256TiB (64TiB physical) to 128PiB (4PiB physical). Channeling his inner Bill Gates, he suggests that this “ought to be enough for anybody”. Key among the TODO items remains “boot-time switch between 4 and 5-level paging” to avoid the need for custom kernels. The latest patches introduce two new prctl calls to manage the maximum virtual address space available to userspace processes during mmap calls (PR_SET_MAX_VADDR and PR_GET_MAX_VADDR). This is intended to aid in compatibility by preventing certain legacy programs from breaking when confronted with a 56-bit address space they weren’t expecting. In particular, some JITs use high order “canonical” bits in existing x86 addresses to encode pointer tags and other information (that they should not per a strict interpretation of Intel’s “Canonical Addressing”).
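
The sizes quoted above fall straight out of the address widths, as this short Python sketch shows. The bit widths used (48 and 57 bits virtual, 46 and 52 bits physical) are inferred from those sizes rather than taken from the patch series, and the helper names are hypothetical. The last function checks whether a 64-bit address is “canonical”, meaning its unused upper bits are a sign extension of the top implemented bit, which is the property that pointer-tagging JITs break.

```python
UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]


def human(nbytes):
    """Format a byte count in binary units (exact for powers of two)."""
    index = 0
    while nbytes >= 1024 and index < len(UNITS) - 1:
        nbytes //= 1024
        index += 1
    return f"{nbytes} {UNITS[index]}"


def address_space(bits):
    """Number of bytes addressable with the given address width."""
    return 1 << bits


def is_canonical(addr, va_bits):
    """True if bits 63 down to va_bits-1 of a 64-bit address are all equal."""
    top = addr >> (va_bits - 1)
    return top == 0 or top == (1 << (64 - va_bits + 1)) - 1


for va_bits, pa_bits in ((48, 46), (57, 52)):
    print(va_bits, human(address_space(va_bits)), human(address_space(pa_bits)))

# The same address can be invalid with 4-level paging and valid with 5-level.
tagged = 0x0000800000000000
print(is_canonical(tagged, 48), is_canonical(tagged, 57))
```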

Announcements

Steven Rostedt announced various preempt-rt (“Real Time”) kernel trees (4.4.47-rt59, 4.1.38-rt45, 3.18.47-rt52, 3.12.70-rt94, and 3.10.104-rt118). Sebastian Andrzej Siewior also announced version v4.9.9-rt6 of the preempt-rt “Real Time” Linux patch series. It includes fixes for a spurious softirq wakeup, and a GPL symbol issue. A known issue is that CPU hotplug can still deadlock.

Junio C Hamano announced version v2.12.0-rc2 of git.

Bugfixes

Hoeun Ryu posted version 6 of a patch that takes care to properly free up virtually mapped (vmapped) stacks that might be in the kernel’s stack cache when cpus are offlined (otherwise the kernel was leaking these during offline/online operations).

New Drivers

Mahipal Challa posted version 2 of a patch series implementing a compression driver for the Cavium ThunderX “ZIP” IP on their 64-bit ARM server SoC (System-on-Chip) to plumb into the kernel cryptoapi.

Anup Patel posted version 3 of a patch implementing RAID offload support for the Broadcom “SBA” RAID device on their SoCs.

Ongoing Development

Andi Kleen posted various perf vendor events for Intel uncore devices, Kan Liang posted new core events for Intel Goldmont, and Srinivas Pandruvada posted perf events for Intel Kaby Lake.

Velibor Markovski (Broadcom) posted a patch implementing ARM Cache Coherent Network (CCN) 502 support.

Sven Schmidt posted version 7 of a patch series updating the LZ4 compression module to support a mode known as “LZ4 fast”, in particular for the benefit of its use by the lustre filesystem.

Zhou Xianrong posted a patch (for the ARM Architecture) that attempts to save kernel memory by freeing parts of the linear memmap for physical PFNs (page frame numbers) that are marked reserved in a DeviceTree. This had some pushback. The argument is that it saves memory on resource constrained machines – 6MB of RAM in the example.

Jessica Yu (who took over maintaining the in-kernel module loader infrastructure from Rusty Russell some time back) posted a link to her module-next tree in the kernel MAINTAINERS document.

Bhupesh Sharma posted a patch moving in-kernel handling of ACPI BGRT (Boot(time) Graphics Resource) tables out of the x86 architecture tree and into drivers/firmware/efi (so that it can be shared with the 64-bit ARM Architecture).

Jarkko Sakkinen posted version 2 of a patch series implementing a new in-kernel resource manager for “TPM spaces” (these are “isolated execution context(s) for transient objects and HMAC and policy sessions”). Various test scripts were also provided.

That’s all for this week. Tune in next time for the latest happenings in the Linux kernel community. Don’t forget to follow us @kernelpodcast