Kernel Podcast for March 13th, 2017

Audiohttp://traffic.libsyn.com/jcm/20170313.mp3

In this week’s kernel podcast: Linus Torvalds announces Linux 4.11-rc2 (including pre-enablement for Intel 5-level paging), VMA based swap readahead, and ongoing development ahead of the next cycle.

Linus Torvalds announced Linux 4.11-rc2. In his announcement, he said that the past week had been “fairly quiet” because “people are still looking for bugs and taking a breather after the merge window”. But he also noted that “we’ve got a healthy number of fixes in, and there’ssome cleanup/prep patches for the upcoming 5-level page table support that I took after the merge window just to make the next merge window easier”.

Various fixes and updates have been posted against the previous rc1, over the past week, including an urgent fix from Matthew (Willy) Wilcox for his idr rewrite in 4.11 (freeing the correct IDA bitmap).

Geert Uytterhoeven posted “Build regressions/improvements in v4.11-rc1”. This compared build error/warning regressions and improvements between v4.11-rc1 and v4.10. According to Geert, the 4.11-rc1 kernel saw an increase of 19 build errors and 1108 warnings when compared to 4.10.

Announcements

Jiri Slaby announced Linux 3.12.71, Greg Kroah Hartman (KH) announced 4.4.53, 4.9.14, and 4.10.2 (which started a conversation about git tags being stale that we will address in a moment). Greg took the opportunity of various stable kernel work to prod the i915 graphics driver team with a message entitled “The i915 stable patch marking is totally broken”.

Sebastian Andrzej Siewior announced the v4.9.13-rt12 preempt-rt “Real Time” kernel patch set, which has a known issue that “CPU hotplug got a little better but can deadlock”, suggesting you might not want to try that then.

Julia Cartwright announced 4.1.38-rt46.

Steven Rostedt announced the 3.18.48-rt53 stable release of the RT kernel. He also announced the 3.10.105-rt119 and 3.2.86-rt124 releases.

Jair Ruusu announced “loop-AES-v3.7k file/swap crypto package”, which is available on sourceforge at: http://loop-aes.sourceforge.net/

Andy Lutomirski sent out detailed notes (along with a followup with yet more explanation) of the Intel SGX (“Secure Enclave”) feature discussion that occured at Kernel Summit and Linux Plumbers Conference last fall. The thread is called “SGX notes from KS/LPC”. In the thread, he explains what SGX is (a small region of virtual memory within a Linux process – known as a task inside the kernel – that is not visible to the host OS after the enclave is “launched”) and how it can be used to hide certain data from system administrators or providers – for example, cryptographic keys that one would rather were not compromised. SGX comes with a litany of new requirements upon the Operating System that Andy covers, along with some guidelines for how to expose this feature, and how to make it useable.

Packet.net are now sponsoring the kernel.org project to the tune of various geo-diverse bare metal frontend systems in datacenters around the globe. Each of these (powerful) frontends provides read-only public access to kernel.org git repositories and the public website (git.kernel.org and www.kernel.org). More information, including machine specifications can be found here: https://www.kernel.org/fast-new-frontends-with-packet.html

(this came to light because of a brief outage affecting the Newark, NJ mirror which was lagging behind other mirrors in picking up new git tags pushed, but one hopes that an official announcement and thanks was otherwise forthcoming)

Masahiro Yamada has been added as a Kbuild (co-)maintainer.

Intel 5-level paging

Kirill A. Shutemov posted version 4 of his “5-level paging” patch series that implements support for the la57 (56 bit Virtual Address space for x64 Canonical Addressing) feature on some future CPUs. We covered the underlying patch series before, explaining the benefit of a larger (virtual) address space, as well as the additional compexities required to implement backward compatibility (including new prctls to limit the virtual address space of certain legacy applications), and the lack (so far) of boot time switching between 4-and-5-level support, which is seen as important for the distros.

Linus responded by saying that he thought “we should just aim for this being in 4.12” as he didn’t “see any real reason to delay merging it”. After some discussion about whose tree to merge it through, it was decided (by Thomas Gleixner) that it could come in through the “-tip” x86 tree. Which resulted in Linus pulling a preparatory “5-level paging: prepare generic code” patch series from Kirill into 4.11 (even after the merge window had closed) to lay the groundwork for pulling the main feature into the next (4.12) cycle. This promptly broke PowerPC, which was promptly fixed by a followup patch. Following the merge of enabling support in 4.11, Kirill posted “5-level paging enabling for v4.12” which aims to complete the merge next cycle.

The earlier version 4 iteration of the patch series noted that the Xen hypervisor currently doesn’t support 5-level paging and thus CONFIG_XEN is disabled automatically when building CONFIG_X86_5LEVEL. It was pointed out by the Andrew Cooper that runtime (boottime) switching between 4 and 5 level support would be required in order to provide a clean experience, especially until Xen Dom0 support is available. That boottime switching is on the existing todo and presumably is going to land at some point.

Separately, Dmitry Safonov posted version 6 of a patch series entitled “Fix compatible mmap() return pointer over 4Gb” which has “some minor conflicts with Kirill’s set for 5-table paging”. Dmitry aims to solve a slightly different problem than Kirill’s PR_{SET,GET}_MAX_VADDR calls (which limit the virtual address ranges returned by mmap to avoid legacy programs breaking when suddenly able to receive much larger “Canonical Addresses” – in Intel parlance – than they were compiled with built-in and broken assumptions about once upon a time) insomuch as he is focused on 32-bit legacy syscalls on 64-bit x64 not returning memory above 4GB that cannot be used by older 32-bit code.

VMA based swap readahead

Ying Huang (Intel) posted an RFC (Request For Comments) entitled “mm, swap: VMA based swap readahead” in which he discussed the current kernel paging implementation for Virtual Memory Areas (VMAs) as well as how it could be improved to facilitate greater awareness of the in-memory access patterns of associated data by changing the corresponding readahead algorithm.

“Readahead” as a concept is what it sounds like. Locality (both spacial, in this case, as well as temporal, in other cases) of data means that when a memory access occurs, it is usually more likely than not that an access to a nearby memory location will soon follow (except in the case of pure random access workloads). Thus, the kernel contains support for preloading nearby data when performing various disk and memory operations. Examples include readahead of nearby disk blocks when loading filesystem data, and loading nearby disk blocks when reading pages back in from swap.

VMAs (Virtual Memory Areas) are regions of memory managed by the Linux kernel. A running application (process), known as a “task” by the kernel, contains a large number of different VMAs which form its overall address space. You can see this by inspecting /proc/self/maps (replacing “self” with a process ID that you have access to). The output will show a series of memory regions representing various memory owned by the task. Memory that doesn’t represent files is known as “anonymous memory” and it is what is paged (swapped) out under memory pressure situations.

As Ying notes in his RFC, the “original swap readahead algorithm does readahead based on the consecutive blocks in [the] swap device” but “the consecutive blocks in [the] swap device just reflect the order of page reclaiming” and not necessarily “the access sequence in RAM”. His patch series aims to change this by teaching the readahead algorithm about VMAs and how to bias the readahead to sequentially walk through the address space of a task (process), reading those parts of the swap space containing this data rather than simply walking through swap sequentially.

But wait! There’s more! Ying also posted a separate patch series entitled “THP swap: Delay splitting THP during swapping out”, which does what it sounds like it would do. THP (Transparent Huge Pages) is a technology used by the Linux kernel to dynamically allocate “huge” (optionally very large – up to 1GB in size, but in this case 2MB) pages of memory to contiguous regions of virtual memory address space, especially those backing shared large memory data (even including a huge zero page used for virtual machine RAM at boot). THP reduces pressure on limited CPU internal microarchitectural caches known as TLBs (Translation Lookaside Buffers) – as well as uTLBs at a lower level than the TLBs – which cache the translation performed by page table entries to physical or intermediate memory addresses. Reducing the number of TLBs required to map regions of virtual memory reduces the number of times TLBs must be reused by the underlying architecture during memory access operations.

The existing Linux kernel THP code splits THPs back into smaller pages whenever they are swapped (paged) out to disk. Yet it turns out that this is particularly inefficient on contemporary systems in which secondary disk or NVMe storage has far greater bandwidth than a single high end core can saturate if forced to do this work. Ying’s patch instead delays this split and pushes entire THPs out to swap, allowing for larger writes and reads of contiguous memory out to the backing storage.

Ongoing Development

“David F” inquired about RAID mode support for Intel m.2 chipsets. These devices continue the recent-ish legacy of certain Intel storage devices providing dual modes of operation: as an AHCI device, and as a hardware RAID device operating in a propietary mode for which no Linux drivers exist. David was quite concerned that the lack of a Linux driver was becoming particular problematic on newer machines, which might not provide a means to switch into AHCI mode (supported by Linux). Christoph Hellwig was…unsympathetic…suggesting that the RAID mode “provides worse performance”, and that its implementation was questionable. He also had a series of other suggestions for what to do with these devices – those are less family friendly to repeat in this podcast.

Michal Hocko posted “kvmalloc” which is a generic replacement for the many “open coded kmalloc with vmalloc fallback instances in the tree”. k-and-vmalloc are two different means by which kernel code allocates memory. The former is used to obtain small allocations (on the order of a few pages – the minimal granule size operated on by the virtual memory subsystem of Linux on contemporary processors) that are also linerally contiguous in physical memory. The latter is for larger allocations of strictly “virtual” memory – contiguous only when accessed using the underlying Memory Mangement Unit to perform a translation (this is usually automatic for kernel code, since the kernel runs with virtual memory of its own, just like user processes do, but it can be problematic if a driver would like to use this memory for certain hardware operations, such as DMA transfers). The generic wrapper aims to clean up the common case that kernel code just wants a chunk of memory and will try to allocate it with kmalloc, but fallback to the more generic vmalloc if that fails.

Christian Konig (AMD) posted “PCI: add resizeable BAR infrastructure” (version 2, and later an update with some fixes in a version 3 also), which aims to add support to the kernel for a PCI SIG (Peripheral Component Interconnect Special Interest Group) ECN (Engineering Change Notice) that enables BARs (Base Address Registers) to be resized at runtime. PCI(e) BARs are mapping windows (aperatures) in the system memory map that are used to talk to hardware add-on cards (or built-in devices within modern platforms) by determining where the device’s memory will live. Traditionally, BARs were fixed size and so on architectures not relying upon firmware configuration of underlying BARs, Linux would have to determine where to place certain PCI(e) resources at boot/hotplug time by checking how much memory a device needed to expose and programming the BARs. With the new extension comes the possibility to increase the size of a BAR to map larger regions of memory. This is a useful feature for graphics cards, which may want to map very large regions of memory. A subsequent patch wires up the AMD GPU driver to use this.

Javi Merino posted “Documentation/EDID fixes”, which aims to correct some broken assumptions in the kernel documentation for EDID (Extended Display Identification Data – the data provided over e.g. I2C from a VGA monitor when the cable is connected). The examples didn’t build correctly due to existing assumptions. This author is probably one of few people who always thinks of EDID and the interaction with Xorg every time he plugs in an external projector to his laptop.

David Howells posted “net: Work around lockdep limitation in sockets that use sockets” in which he corrected an erroneous assumption in the kernel “lockdep” (lock dependency checker) that prevented it from correctly identifying bad call chains involving TCP sockets when there exists a dependency between sockets created purely in the kernel and sockets created purely in userspace (which the lockdep could not distinguish between due to its use of broad lock classes). The AFS (Andrew File System) was generating a false lockdep warning because it was exposing such an implied dependency.

Charles Keepax posted “genirq: Add support for nested shared IRQs” to address an audio CODEC that also acts as an interrupt controller. The details sounded rather painful. Yet it was “fairly easy” to fix.

Steven Rostedt posted “tracing: Allow function tracing to start earlier in boot up”, which does roughly what it says on the can, “moving tracing up further in the boot process”, “right after memory is initialized”. He noted that his RFC was a start and could be futher improved upon.

Matthew (Willy) Wilcox posted an RFC entitled “memset_l and memfill” that provides a generic means for architectures to provide optimized functions that “fill regions of memory with patterns larger than those contained in a single byte”. This is intended to be used by zram as well as other code.

Paul McKenney noticed some of his RCU torture tests failing during hotplug early in boot due to calls to smp_store_cpu_info during that operation. The call is not safe because it indirectly invokes schedule_work() which wants to use RCU prior to RCU being enabled as a side effect of dealing with an unstable TSC (Time Stamp Counter) on the afflicted CPU. Peter Zijlstra had an opinion on hotplug, and also a patch to handle this situation.

Vlad Zakharov posted “update timer frequencies”, which inquired about the best means to implement a cpufreq driver for ARC CPUs. These having a special property that “ARC timers (including those are used for timekeeping) are driven by the same clock as ARC CPU core(s)”. Yup, they change frequency according to the current CPU frequency. Which as Thomas Gleixner noted in response is “broken by design and you really should go and tell your hardware folks to fix that”. He added that “It’s well known for more than TWO decades that changing the frequency of the timekeeper clocksource is a complete disaster”.

Thomas Gleixner posted “kexec, x86/purgatory: Cleanup the unholy mess”, which aims to address his opinion that “the whole machinery is undocumented and lacks any form of forward declarations” (of variables which were previously global but had been made static). Purgatory is a special piece of code which is provided by the kernel but runs in the interim period between the kernel crashing (or beginning kexec) and the new crash or kexec kernel that is then subsequently loaded – this is what performs the load and exec.

Kernel Podcast for March 6th, 2017

Audiohttp://traffic.libsyn.com/jcm/20170306.mp3

In this week’s kernel podcast: Linus Torvalds announces Linux 4.11-rc1, rants about folks not correctly leveraging linux-next, the remainder of this cycle’s merge window pulls, and announcements concerning end of life for some features.

Linus Torvalds announced Linux 4.11-rc1, noting that “two weeks have passed, the merge window is over, and 4.11 has been tagged and pushed out.” He notes that the latest kernel cycle is set to be “on the smallish side”, but that is only in comparison with the most recent two cycles, which have been significantly larger than typical. He notes that 4.11 has a similar number of commits to 4.1, 4.3, 4.5, and 4.7 before it. With the release of 4.11-rc1 comes the closing of the “merge window” (defined by it, the period of time during which disruptive changes are allowed into the kernel prior to RC).

We covered most of the major pulls for 4.11 in last week’s podcast. But there were a few more stragglers. Here’s a sample of those:

J. Bruce Fields posted “nfsd changes for 4.11” which included two semantic changes: NFS security labels are “now off by default” and a “new security_label export flag reenables it per export” since this “only makes sense if all your clients and servers have similar enough selinux policies”. Secondly, NFSv4/UDP support is off because “It was never really supported, and the spec explicitly forbids it. We only ever left it on out of laziness; thanks to Jeff Layton for finally fixing that.”

Anna Schumaker followed up a little later with “Please pull NFS client changes for Linux 4.11”, which includes a memory leak in “_nfs4_open_and_get_state”, as well as various other fixes and new features.

Matthew (Willy) Wilcox posted “Please pull IDR rewrite” which seeks to harmonize the IDR (“Small id to pointer translation service avoding fixed sized tables”) and in-kernel radix tree code. Accoring to Willy, merging the two codebases “lets us share the memory alloction pools, and results in a net deletion of 500 lines of code. It also opens up the possibility of exposing more of the fetures of the radix tree to users of the IDR”.

Will Deacon posted “arm64 fixes for -rc1” of which the “main fix here addresses a kernel panic triggered on Qualcomm QDF2400 due to incorrect register usage in an erratum workaround introduced during the merge window”.

Michael S. Tsirkin posted “vhost: cleanups and fixes”, of which there were very few for this kernel cycle.

Nicholas A. Bellinger posted “target updates for v4.11-rc1”, which includes support for “dual mode (initiator + target) qla2xxx operation”, and a number of other fixes and improvements. He pre-warns that things are “shaping up to be a busy cycle for v4.12 with a new fabric driver (efct) in flight, and a number of other patches on the list being discussed”.

Rafael J. Wysocki posted “Additional ACPI update for v4.11-rc1”, which includes a fix for “an apparant, but actually artificial, resource conflict between the ACPI NVS memory region and the ACPI BERT (Boot Error Record Table)”.

Jens Axboe posted “Block fixes for 4.11-rc1”, which includes a “collection of fixes for this merge window, either fixes for existing issues, or parts that were waiting for acks to come in”. These include a performance fix for the allocation of nvme queues on the right node, along with others.

Miklos Szeredi posted “fuse update for 4.11” and “overlayfs update for 4.11”. the latter “allows concurrent copy up of regular files eliminating [the] potential problem” of (previously) serialized copy ups taking a long time.

Bjorn Helgaas posted “PCI fixes for v4.11”, including a couple of fixes for bugs introduced during code refactoring.

Dan Williams posted “libnvdimm fixes for 4.11-rc1”, which includes a fix for the generation of “nvdimm namespace label”s (metadata) checksums that “Linux was not calculating correcting leading to other environments rejecting the Linux label”.

Helge Deller posted “parisc updates for 4.11”, noting that there was “nothing really important” in this particular cycle to pull in.

James Bottomley posted “final round of SCSI updates for the 4.10+ merge window”, which “is the set of stuff that didn’t quite make the initial pull and a set of fixes for stuff which did”.

Radim Krcmar posted “Second batch of KVM changes for 4.11 merge window”, which includes a number of fixes for PPC and x86.

David Miller posted “Networking”, including many fixes.

A linux-next rant

In his 4.11-rc1 announcement, Linus noted that “it *does* feel like there was more stuff that I was asked to pull than was in linux-next. That always happens, but seems to have happened more now than usually. Comparing to the linux-next tree at the time of the 4.10 release, almost 18% of the non-merge commits were not in Linux-next. That seems higher than usual, although I guess Stephen Rothwell has actual numbers from past merges.” Let’s break what Linus said a little. Stephen Rothwell is an (overworked) kernel hacker based in Australia who produces a (daily, outside of the merge window) kernel tree (and accompanying test infrastructure, patch tracking, and announcement mechanisms) known as “linux-next”. Its raison d’etre is to be the proving ground for new features before they are sent to Linus for merging.

Typically, major new features soak in linux-next for a cycle prior to the one in which they are actually merged (so features landing in 4.11 would have been largely complete and tested via -next during 4.10). Linux kernel development cycles are generally on the order of about two months, so this isn’t an unreasonable long period of time for disruptive changes to languish. Contrast this with the multi-year wait that used to happen back when Linux had an odd/even minor version cycle in which even numbers (2.2, 2.4, 2.6) were the “supported” releases and the odd numbers (2.1, 2.3, 2.5) were development ones. That seems like ancient history now, but it’s really only in the past decade of git that kernel development tooling and community has reached a level of sophistication that the ship can keep moving while the engine is replaced.

Linus noted that there are a “few different classes” of changes that didn’t come to him following a previous test in linux-next. Those include fixes (which is “obviously ok and inevitable”), a specific example (statx) for a longstanding issue that has been ongoing for years (to which he said, “Yeah, I’ll allow this one too”), the “quite noticeable <linux/sched.h> split up series” which “had real reasons for late inclusion”. Finally, he includes the class of subsystems such as “drm, Infiniband, watchdog and btrfs”, which he “found rather annoying this merge window”. He reminded folks of the “linux-next sanity checks” and that if folks ingore them “you had better have your own sanity checks that you replaced them with” rather than “screw all the rules and processes we have in place to verify things”.

The bottom line? Linus says “You people know who you are. Next merge window I will not accept anything even remotely like that. Things that haven’t been in linux-next will be rejected, and since you’re already on my sh*t-list you’ll get shouted at again”. And nobody enjoys being shouted at by Linus. Well, almost nobody. There do seem to be a few people who perversely enjoy it.

Announcements

A couple of questions of code maintenance arose this week. The first was from Natale Patriciello, who asked whether UML (User Mode Linux) is “not maintained anymore?” by citing a few bugs that haven’t been resolved in some time. There were no followups at the time of this recording. The second question came in form of an RFC (Request For Comments) patch entitled “remove support for AVR32 architecture” from Hans-Christian Noren Egtvedt. He noted that AVR32 is “not keeping up with the development of the kernel”, “shares so much of the drivers with Atmel ARM SoC”, and “all AVR32 AP7 SoC processors are end of lifed from Atmel (now Microchip)”. This did seem like a fairly compelling set of reasons to kill it, which others agreed with also. This means that unless someone comes forward soon to maintain AVR32 (along with the associated GCC toolchain and other distribution pieces), its days in the upstream Linux kernel are numbered – and probably removed in 4.12.

Sebastian Andrzej Siewior announced Linux v4.9.13-rt11, which includes a fix for a previous fix (allowing the previous lockdep fix to compile on UP).

Drivers

Logan Gunthorpe posted “New Microsemi PCI Switch Management Driver”, which is in its 7th revision. The RFC (Request for Comments “proposes a management driver for Microsemi’s Switchtec line of PCI switches. This hardware is still looking to be used in the Open Compute Platform”. Logan notes that “Switchtec products are compliant with the PCI specifications and are supported today with the standard in-kernel driver. However, these devices also expose a management endpoint on a separate PCI function address which can be used to perform some advanced operations”.

Ongoing Development

Michael S. Tsirkin continued his work on “vfio error recovery: kernel support” with version 4 of the patch series wich seeks to do more than simply ignoring non-fatal PCIe AER (Advanced Error Reporting) errors that hit assigned devices passed using VFIO into a guest Virtual Machine. Currently, only fatal errors (which cause a PCIe link reset) are reported – they stop the guest. In his summary email, Michael notes that his goal is to handle non-fatal errors by reporting them to the guest and having it handle them. And rather than surprising existing code, he calls out under “issues” that “this behavior should only be enabled with new userspace, old userspace should work without changes”. By “userspace” he means the code driving VFIO, which might be a QEMU process that is backing a KVM virtual machine context, or a container, or merely a bare metal userspace process that is using VFIO directly.

Johannes Weiner posted “mm: kswapd spinning on unreclaimable nodes – fixes and cleanups” in which he notes a previous posting from Jia He that he (and the team at Facebook) have reproduced. In the case of the problem scenario, the kernel’s kswapd (swap space daemon) for a given (memory) node spins indefinitely at 100% CPU usage when there are absolutely no reclaimable pages (granules of the smallest size of memory that can be managed by Linux and the underlying hardware) however the “condition for backing off is never met”. This results in kswapd busy-looping forever. In his patches, Johannes changes reclaim behavior so that kswapd will eventually really back off after failing 16 times (which is the same magic number of times we try during an OOM “Out Of Memory” situation) as defined by MAX_RECLAIM_RETRIES. He includes various examples.

Len Brown posted “cpufreq: Add the “cpufreq.off=1” cmdline option. This is a corollary to “cpuidle.off=1” and comes about for similar reasons for the purpose of testing. This author wonders aloud whether this will allow for buggy platforms that don’t support CPPC (Collaborative Processor Performance Control) to easily disable this at runtime too.

Aleksey Makarov posted “printk: fix double printing with earlycon”. On ACPI compliant platforms (including ARM servers), the SPCR (“Serial Port Console Redirection”) table provides information about the serial console UART that the kernel should be using, rather than having the user provide memory register addresses and baud rates on the kernel command line. This is a feature which is generally useful beyond ARM systems (although most x86 systems follow the traditional “PC” UART design). Prior to this fix, the kernel would double print output if given a “console=” and “earlycon”.

Minchan Kim posted “make try_to_unmap simple” which aims to remove some of the (apparently somewhat gratitous) complexity in the return value of this function. Currently it can return SWAP_SUCCESS, SWAP_FAIL, SWAP_AGAIN, SWAP_DIRTY, and SWAP_MLOCK. But Minchan feels that it can be simply a boolean return by removing the latter three of those return values.

Matthew Gerlach (Intel) posted “Altera Partial Reconfiguration IP”, which adds support to the kernel’s (Alan Tull’s) “fpga-mgr” driver for the “Altera Partial Reconfiguration IP”. Partial Reconfiguration (sometimes known as “PR” in the reconfigurable logic community) allows an FPGA (Field Programmable Gate Array)’s logic fabric to be reconfigured in smaller than whole regions. This (for example) would allow a closely coupled datacenter (Xeon) processor to continue to drive certain FPGA contained IP while other IP were being replaced dynamically. If one were to couple this with support in OpenStack Nomad or Kubernetes for dynamic reconfiguration at VM/container setup it would begin to enable various use cases for the mainstream datacenter around FPGA acceleration.

Andi Kleen posted “pci: Allow lockless access path to PCI mmconfig”. “mmconfig” refers to the memory mapped configuration region used by contemporary PCIe devices during enumeration and configuration. This is a kind of out-of-band mechanism by which the kernel can talk to PCIe devices in a fully standards compliant means prior to having configured them. Intel processors include many “PCIe” devices that are in fact a logical means of expressing so called “uncore” non-compute features on the processor SoC. They’re not real PCIe devices but appear to the kernel as such. This wonderful abstraction comes with some overhead cost, especially when the kernel spends time grabbing the “pci_cfg_lock” which it actually doesn’t need to hold, according to Andi.

Jarkko Sakkinen posted version 3 of “in-kernel resource manager”, which adds support to the kernel for “TPM spaces that provide an isolated execution context for transient objects and HMAC policy sessions”.

Tomas Winkler posted a question about what the community considered to be the “correct usage of arrats of variable length within [the] Linux kernel”. The replies generally included language to the form of “don’t”. Both for reasons of general language ugliness, and also because (especially in the case of local variables) the Linux kernel’s fixed (and also small) size stack raises serious potential for stack overflow if one is not careful. There was a suggestion that the kernel should be built with a compiler option to disallow VLAs, but that this would require various code to be fixed first.

Kernel Podcast for Feb 27th, 2017

Audiohttp://traffic.libsyn.com/jcm/20170228.mp3

In this week’s kernel podcast: the merge window for kernel 4.11 is open and patches are flying into Linus’s inbox, fixing NUMA node determination at runtime, Virtual Machine Aware Caches, Advisory Memory Allocations, and a non-fixed TASK_SIZE to bring excitement to your life. We will have this, and a summary of ongoing development in this week’s Linux Kernel podcast.

The merge window (period of time during which disruptive changes are allowed to be “merged” – incorporated into Linus’s official git tree – prior to a multi-week stabilization and Release Candidate cycle) for Linux 4.11 is currently open. This means that the most recent official kernel remains Linux 4.10. Meanwhile, many “pull requests” and merges are in flight for various kernel subsystems planning updates in 4.11. These include:

  • Ingo Molnar posted “EFI changes for 4.11”, including support for determining at boot time whether secure boot authentication was performed.
  • Ingo also posted “x86/cpufeature changes for v4.11”, which include the new support for “ring-3 MONITOR/MWAIT instructions on supported CPUs”. This is otherwise known as “MWAIT in userspace”, in which an unprivileged application can (in certain approved situations) use the CPU’s built-in monitor to cause a low-latency low-power wait on a memory location. This can be used (for example) by various userpace lock infrastructure to obviate spinning.
  • Joerg Roedel posted “IOMMU Updates for Linux v4.11”, which includes patches from Eric Auger (Red Hat) implementing “KVM PCIe/MSI passthrough support on ARM/ARM64”. These patches have been under development for many many months, and have been completely refactored on several occasions. They begin to enable various (OP)NFV (Open Platform for Network Function Virtualization) use cases, such as DPDK accelerated OVS (and other VNFs – Virtual Network Functions) within VMs passing through PCIe devices from the host via VFIO. Accompanying this was support for “a core representation for individual hardware iommus” (ARM uses a distributed System-MMU architecture), support for SMMUv2 on ARM systems, a stream table optimization for SMMUv3 on ARM systems, and various other small improvements.
  • Rafael J. Wysocki posted “Power management updates for v4.11-rc1”, noting that the “majority of changes go into the Operating Performance Points (OPP) framework and cpufreq this time, followed by devfreq and some scattered updates all over”. He also posted “ACPI updates for v4.11-rc1”, which include a rebase of the ACPICA (ACPI – Advanced Configuration and Power Interface – Component Architecture) reference shared among various Operating Systems for interpreting ACPI AML (ACPI Machine Language) at runtime. The ACPICA is updated to 20170119, with many fixes, including those “related to the handling of the bit width and bit offset fields in [GAS] Generic Address Structure”, utility updates, and support for “method invocations as target operands in AML”.
  • James Morris posted “Security subsystem updates for 4.11”, including a “major AppArmor update: policy namespaces & lots of fixes”, a new “/sys/kernel/security/lsm node for easy detection of loaded LSMs”, “SELinux cgroupfs labeling support”, and “SELinux context mounts on tmpfs, ramfs, devpts within user namespaces”. There was also “improved TPM 2.0 support”. This author is hoping an outfit such as Linux Weekly News (LWN) has an article on TPM2.0 at some point soon. James also posted a “seccomp bugfix” from Kees Cook that ensures seccomp will only dump core in the case that a process is single threaded (Kees wasn’t done with his usual awesome security fixes – he also had one to “censor kernel pointer in debug files” within the cgroup filesystem).
  • Bjorn Helgaas posted “PCI changes for v4.11”. These include ACS (Access Control Services) quirks for Intel Union Point, Qualcomm QDF2400, and QDF2432. ACS allows PCIe devices to communicate peer to peer without an intervening transaction through the Root Complex for IOV capabilities. Linus grumbled about Bjorn’s pull request due to the use of an SHA1 without a branch or tag name. But Bjorn noted it was a simple script mistake and was already fixed – he sent a followup with corrected “pci-v4.11-changes”.
  • Stafford Horne posted a very large set of patches for OpenRISC. These include “optimized memset and memcpy routines” with a 20% boot time saving, “support for cpu idling”, and various preparatory work on atomics, bitops, futexes, and locks in anticipation of future SMP support. Finally, he added a link to the OpenRISC git tree (on github) to MAINTAINERS. The OpenRISC architecture gets a bit less press these days than RISCV but it is still alive, and has a number of implementations. Your author has several OpenRISC development boards but hasn’t played in a while.

For a detailed sumary of current merge widow pulls and patches, consult this week’s Linux Weekly News at LWN.net (Thursday).

Geert Uytterhoeven posted a summary of “Build regressions/improvements in v4.10”. These show an increase in build errors and warnings vs the previous 4.9 kernel cycle. He posted a list of configs used, the error and warning messages, and thanked the “linux-next team for providing the build service”.

Pavel Machek has been posting about various problems running 4.10 kernels. In one instance, he saw a corrupted stack that implied a double call to “startup_32_smp” (the secondary CPU boot method on Intel x64 Architecture). This lead Josh Poimbeouf to ponder whether the GCC in use was somehow bad.

Announcements

Greg Kroah-Hartman announced Linux 4.4.52, 4.9.13, and 4.10.1. Ben Hutchings announced Linux 3.16.41, and 3.2.86.

Stephen Hemminger announced iproute2-4.10, including support for “new features in Linux 4.10”. Amongst those new features are “enhanced support for BPF [Berkley Packet Filter], VRF [Virtual Routing and Forwarding], and Flow based classifier (flower)”. The latest version is available here: https://www.kernel.org/pub/linux/utils/net/iproute2/iproute2-4.10.0.tar.gz

Karel Zak announced util-linux v2.29.2, including a fix for a (nasty) “su” security issue, otherwise documented in CVE-2017-2616. According to Karel, it is “possible for any local user to send SIGKILL to other processes with root privileges. To exploit this, the user must be able to perform su with a successful login. SIGKILL can only be send to processes which were executed after the su process. It is not possible to send SIGKILL to processes which were already running”. A fix entitled “properly clear child PID” against “su” is included among the fixes listed.

Lucas De Marchi announced kmod 24, which includes enhanced support for kernel module dependency loop detection: ftp://ftp.kernel.org/pub/linux/utils/kernel/kmod/kmod-24.tar.xz

Junio C Hamano announced git version 2.12.0: https://www.kernel.org/pub/software/scm/git/

Con Kolivas announced his Linux-4.10-ck1 MuQSS (Multiple Queue Skiplist Scheduler) version 0.152. More details at: http://ck.kolivas.org/patches/4.0/4.10/4.10-ck1/

Ove Kent Karlsen has been performing various Linux gaming experiments. They posted links to YouTube videos showing results with “Doom 3”, which can be found here: https://www.youtube.com/watch?v=xDct6vVvFxA

NUMA node determination

Dou Liyang (Fujitsu) posted several revisions of a patch series entitled “Revert works for the mapping of cpuid <-> nodeid”. This is intended to clean up the process by which (Intel x64 Architecture) systems enumerate the mapping of physical processor IDs to NUMA (Non-Uniform Memory Architecture) multi-socket “node” IDs. Conventionally, Linux uses the MADT (Multiple APIC Description Table – otherwise known as the “APIC” table for legacy reasons). ACPI table to map processors to their “Local APIC ID” (the ID of the core connected to the Intel APIC interrupt controller’s LAPIC CPU interface). It then maps these to NUMA nodes using the _PXM node ID in the ACPI DSDT (Differentiated System Description Table) and determines NUMA topology using the SRAT (Static Resource Affinity Table) and SLIT (System Locality Information Table). But this is fragile. Firmware developers are known to make mistakes on occasion, and these have included “duplicated processor IDs in DSDT”, and having the “_PXM in DSDT…inconsistent with the one in [the] MADT”. For this reason, Dou seeks to move the proximity discovery into the system’s hotplug path by reverting two previous commits. Xiaolong Ye (Intel) said he would test these and followup.

As a footnote, it’s worth adding that modern processors have a very  oose notion of a “physical” core, since they usually (internally) support dynamic remapping of true physical cores to the IDs exposed even to system programmers. This affords the illusion of contiguously numbered processors, and prevents an easy analysis of binning and yield characteristics. It’s one of the reasons that processors such as Intel’s use various mapping schemes in order to determine NUMA node proximinity. But one should never assume that any information given about a processor in any table reflects reality other than as a microprocessor company wanted you to perceive it.

Virtual Machine Aware Caches

Shanker Donthineni (Codeaurora) posted “arm64: Add support for VMID aware PIPT instruction cache”. Caches on the ARMv8 architecture are defined to be PIPT (Physically Indexed, Physically Tagged) from a software perspective (although the underlying implementation might be different – for example, you could index virtually with VIPT underneath a PIPT facade if you implemented expensive logic for automatic homonym detection). The ARMv8.2 specification allows “VMID aware PIPT” which means a cache is PIPT but aware of the existence of Virtual Machine IDs (VMIDs), which might form part of the cache entry. Will Deacon responded that the approach “may well cause problems for KVM with non-VHE [Virtual Host Extension – the ability to run “type 2″ hypervisors with split page tables for the kernel and userspace, as opposed to non-VHE implemented on original ARMv8.0 machines in which a shim running with its own page tables is required for KVM] because the host VMID is different from the guest VMID, yet we assume that I-cache invalidation by the host *will* affect the guest when, for example, invalidating the I-cache for pages holding the guest kernel Image”. He noted that he had some other patches in flight that he would post soon (for 4.12).

Advisory Memory Allocations in real life

Shaohua Li (Facebook) posted “mm: fix some MADV_FREE issues”. MADV_FREE is part of relatively recent(ish) kernel infrastructure to support advisory mmaps that the kernel may need to arbitrarily reclaim later when low on available memory. It’s the kind of thing that other Operating Systems (such as Windows) have done for many years (Windows will even dynamically enlarge its swap (paging) file on low memory situations). Facebook apparently like to use the (alternative) “jemalloc” userspace memory allocator and have found a number of issues when attempting to combine this with MADV_FREE flags to mmap. Shaohua notes that MADV_FREE cannot be used on a machine without swap enabled, actually increases memory pressure (due to page reclaim being biases against anonymous pages), and the lack of global accounting. The patches aim to address these.

Non-fixed TASK_SIZE

Martin Schwidefsky and Linus Torvalds had a back and forth discussion about “Using TASK_SIZE for kernel threads”. As kernel programmers know, kernel threads (“tasks”, or “kernel processes” – these show up in brackets in “ps” and “top”) don’t have an associated “mm” struct (they have no userspace). On s390, just to be different, TASK_SIZE is not fixed. It can actually be one of several values that are determined by reading a field in a task’s mm struct (context.asce_limit). This was causing very subtle breakage as the kernel indirected into a null structure which happened to contain a value very close to zero that kinda worked. Martin has a fixed queued up but had some suggestions for changes to make to the kernel to avoid such a subtle issue in future. Linus was more convinced that s390 was just doing something that needed fixing.

Ongoing Development

Elena Reshetova (Intel) posted many patches converting various uses of the kernel’s “atomic_t” datatype as a reference counter over to the new “refcount_t”. As she notes, “[b]y doing this we prevent intentional or accidental underflows or overflows that can le[a]d to use-after-free vulnerabilities”. Examples including architecture and VM code fixes.

Xunlei Pang (Red Hat) posted version 2 of a patch entitled “x86/mce: Don’t participate in rendezvous process once nmi-shootdown_cpus() was  made’. This aims to juggle a post-crash conumdrum: system errors sufficient enough to generate an MCE (Machine Check Exception) should not be ignored (and thus the machine check handler should run in the kernel) but they might be generated during the process of actively taking a crash/kdump. The existing code might instead cause a panic on exit from the (old kernel provided) MCE handler. Borislav Petkov didn’t like some of the details of the patch. He wanted to also see explicit documentation as to the handling of MCEs.

Andy Lutomirski posted “KVM TSS cleanups and speedups”, which aims to refactor how the kernel handles guest TSS (Task Segment Selector) handling on Intel x64 Architecture systems. These are layered upon a series from Thomas Gleixner aimed at cleaning up GDT (Global Descriptor Table) use. He notes that there “may be a slight speedup, too, because they remove an STR [store] instruction from the VMX [Virtual Machine] entry path”.

Heikki Krogerus posted version 17 of a patch series implementing “USB Type-C Connector class” support. This is “meant to provide [a] unified interface to…userspace to present the USB Type-C ports in a system”. Your author is looking forward to trying this on his Dell XPS Skylake with USB-C.

Rob Herring posted a patch “Add SPDX license tag check for dts files and headers” to the kernel’s “checkpatch.pl” patch submission checking tool.

Finally this week, Lorenzo Pieralisi posted “PCI: fix config and I/O Address space memory mappings” intended to address the inconvenient fact that “ioremap” on 32-bit and 64-bit ARM platforms was failing to strictly comply with the PCI local bus specification’s “Transaction Ordering and Posting” requirements. These mandate that PCI configuration cycles (during startup or hotplug) and I/O address space accesses must be “non-posted” (in other words, they must always receive a write notification response and not be buffered arbitrarily). Lorenzo addresses this with a 20 part patch series that cleans this up.

Kernel Podcast for Feb 20th, 2017

UPDATE: Thanks to LWN for the mention. This podcast is in “alpha”. It will start to show up on iTunes and Google Play (which didn’t exist last time I did this thing!) stores within the next day or two. You can also subscribe (for the moment) by using this link: kernel podcast audio rss feed. This podcast format will be tweaked, and the format/layout will very likely change a bit as I figure out what works, and what does not. Equipment just started to arrive at home (Zoom H4N Pro, condenser mics, etc.), a new content publishing platform needs to get built (I intend ultimately for listeners to help to create summaries by annotating threads as they happen). And yes, my former girlfriend will once again be reprising her role as author of another catchy intro jingle…soon 😉

Audio: Kernel Podcast 20170220

Support for this podcast comes from Jon Masters, trying to bring back the Kernel Podcast since 2012.

In this week’s edition: Linus Torvalds announces Linux 4.10, Alan Tull updates his FPGA manager framework, and Intel’s latest 5-level paging patch series is posted for review. We will have this, and a summary of ongoing development in the first of the newly revived Linux Kernel Podcast.

Linux 4.10

Linus Torvalds announced the release of 4.10 final, noting that “it’s been quiet since rc8, but we did end up fixing several small issues, so the extra week was all good”. Linus added a (relatively rare) additional “RC8” (Release Candidate 8) to this kernel cycle due to the timing – many of us were attending the “Open Source Leadership Summit” (OSLS, formerly “Linux Foundation Collaboration Summit”, or “Collab”) over the past week. The 4.10 kernel contains about 13,000 commits, which used to seem large but somehow now…isn’t. Kernelnewbies.org has the usual summary of new features and fixes: https://kernelnewbies.org/Linux_4.10

With the announcement of 4.10 comes the opening of the merge window for Linux 4.11 (the period of up to two weeks at the beginning of a development cycle, during with new features and disruptive changes are “pulled” into Linus’s kernel (git) tree). The 4.11 merge window begins today.

FPGA Manager Updates

Alan Tull posted a patch series implementing “FPGA Region enhancements and fixes”, which “intends to enable expanding the user of FPGA regions beyond device tree overlays”. Alan’s FPGA manager framework allows the kernel to manage regions within FPGAs (Field Programmable Gate Arrays) known as “partial reconfigurable” regions – areas of the logic fabric that can be loaded with new bitstream configs. Part of the discussion around the latest patches centered on their providing a new sysfs interface for loading FPGA images, and in particular the need to ensure that this ABI handle FPGA bitstream metadata in a standard and portable fashion across different OSes.

Intel 5-level paging

Kirill A. Shutemov posted version 3 of Intel’s 5 level paging patch series that expands the supportable VA (Virtual Address) space on Intel Architecture from 256TiB (64TiB physical) to 128PiB (4PiB physical). Channeling his inner Bill Gates, he suggests that this “ought to be enough for anybody”. Key among the TODO items remains “boot-time switch between 4 and 5-level paging” to avoid the need for custom kernels. The latest patches introduce two new prctl calls to manage the maximum virtual address space available to userspace processes during mmap calls (PR_SET_MAX_VADDR and PR_GET_MAX_VADDR). This is intended to aid in compatibility by preventing certain legacy programs from breaking when confronted with a 56-bit address space they weren’t expecting. In particular, some JITs use high order “canonical” bits in existing x86 addresses to encode pointer tags and other information (that they should not per a strict interpretation of Intel’s “Canonical Addressing”).

Announcements

Steven Rostedt announced verious preempt-rt (“Real Time”) kernel trees (4.4.47-rt59, 4.1.38-rt45, 3.18.47-rt52, 3.12.70-rt94, and 3.10.104-rt118). Sebastian Andrzej also announced version v4.9.9-rt6 of the preempt-rt “Real Time” Linux patch series. It includes fixes for a spurious softirq wakeup, and a GPL symbol issue. A known issue is that CPU hotplug can still deadlock.

Junio C Hamano announced version v2.12.0-rc2 of git.

Bugfixes

Hoeun Ryu posted version 6 of a patch that takes care to properly free up virtually mapped (vmapped) stacks that might be in the kernel’s stack cache when cpus are offlined (otherwise the kernel was leaking these during offline/online operations).

New Drivers

Mahipal Challa posted version 2 of a patch series implementing a compression driver for the Cavium ThunderX “ZIP” IP on their 64-bit ARM server SoC (System-on-Chip) to plumb into the kernel cryptoapi.

Anup Patel posted version 3 of a patch implementing RAID offload
support for the Broadcom “SBA” RAID device on their SoCs.

Ongoing Development

Andi Kleen posted various perf vendor events for Intel uncore devices, Kan Liang posted new core events for Intel Goldmont, and Srinivas Pandruvada posted perf events for Intel Kaby Lake.

Velibor Markovski (Broadcom) posted a patch implementing ARM Cache Coherent Network (CCN) 502 support.

Sven Schmidt posted version 7 of a patch series updating the LZ4 compression module to support a mode known as “LZ4 fast”, in particular for the benefit of its use by the lustre filesystem.

Zhou Xianrong posted a patch (for the ARM Architecture) that attempts to save kernel memory by freeing parts of the the linear memmap for physical PFNs (page frame numbers) that are marked reserved in a DeviceTree. This had some pushback. The argument is that it saves memory on resource constrained machines – 6MB of RAM in the example.

Jessica Yu (who took over maintaining the in-kernel module loader infrastructure from Rusty Russell some time back) posted a link to her module-next tree in the kernel MAINTAINERS document.

Bhupesh Sharma posted a patch moving in-kernel handling of ACPI BGRT (Boot(time) Graphics Resource) tables out of the x86 architecture tree and into drivers/firmware/efi (so that it can be shared with the 64-bit ARM Architecture).

Jarkko Sakkinen posted version 2 of a patch series implementing a new in-kernel resource manager for “TPM spaces” (these are “isolated execution context(s) for transient objects and HMAC and policy sessions.”. Various test scripts were provided also.

That’s all for this week. Tune in next time for the latest happenings in the Linux kernel community. Don’t forget to follow us @kernelpodcast