2009/06/17 Linux Kernel Podcast
Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090617.mp3
Support for this Podcast comes from the humble Blueberry. Did you know that a mere 4 pints of blueberries for breakfast can be a healthy form of OCD?
For Wednesday, June 17th 2009, I’m Jon Masters with a summary of the day’s LKML traffic.
In today’s issue: the continuing 2.6.31 merge window, changing the NOHZ idle load balancing logic, OpenAFS pioctls, MCE, and scsi_wait_scan configuration.
Apologies for the tardiness of today’s production. Your author is currently preparing updates to cover Thursday and the weekend podcast update and hopes to get back into the swing of things next week. I guess the merge window really is that unpleasant to keep up with – bear with me, I’ll get there. I expect to introduce more automation and tracking, and filtering, in time.
The Continuing 2.6.31 merge window
Poisonous Hardware. Fengguang Wu posted a policy change RFC patch, in which the HWPOISON code would only “early kill” (that is to say, before an unrecoverable error has occurred) processes that had installed a SIGBUS handler. This would allow certain applications (that caught SIGBUS) to recover from corruption of (for example) single pages within internal caches and other non-critical (isolatable) data. This might include, for example, the KVM (Kernel Virtual Machine) Hypervisor, Oracle’s database software, or similar programs that use extensive internal caching and can recover from memory errors.
Early SLAB allocation. Pekka J Enberg posted a series of SLAB updates for 2.6.31 which, remember, include the new early SLAB allocator approach. In a separate mail thread, Linus Torvalds suggested that “All the recent init ordering changes should mean that the slab allocator is available _much_ earlier – to the point that hopefully any code that runs before slab is initialized should know very deep down that it’s special, and uses the bootmem allocator without doing any conditions what-so-ever”. Ben Herrenschmidt (the maintainer of the PowerPC architecture port) responded that, while he would normally agree with this, there are a number of hairy skeletons in the PowerPC port closet that prevent this from being true…yet. He pleaded for more time before things like slab_is_available() are taken away from him, and he’s probably not the only person who will be affected in such a migration.
e820 table reservations. e820 is a standard BIOS extension used by a PC-based Operating System, such as Linux, to query the system physical memory map, for example to determine where certain standard resources are located. The existing e820 parser in the kernel doesn’t handle regions marked as EFI_RESERVED_TYPE, so they might be recorded as usable. A patch from Cliff Wickman changes this by marking such regions as E820_RESERVED.
Searching for empty slots in resource trees. In PCI, we use BARs (Base Address Registers) to program devices with a range of the system (PCI) address space to use for interaction with the host system. For example, a card providing a large buffer needs to have that buffer mapped somewhere in memory. Andrew Patterson noticed that pci_assign_resource(), which calls find_resource() and is used to allocate address ranges for PCI device BARs in the parent bridge’s resource tree during hot add operations, only checks the immediate children and siblings of the root resource passed. In certain topologies where a resource (that is to say, a range of memory) is only available further down the resource tree, the existing algorithms can fail to allocate an acceptable resource. Andrew posted a patch that modifies find_resource() and allocate_resource() so that they recursively descend the entire tree instead. Others (including Linus Torvalds) expressed some concern that Andrew’s patch might be curing symptoms rather than the actual disease, since the situation described shouldn’t easily be arising. Later, Matthew (willy) Wilcox posted a series of four patches covering this problem, fixing it by “changing where ia64 sets up the resource pointers in the root pci bus”.
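To make the difference between the shallow and the recursive search concrete, here is a minimal Python model. It is only a sketch and not the kernel code: the Resource class, the window flag, and the find_gap and find_gap_recursive helpers are all hypothetical names invented for illustration, and alignment constraints are ignored entirely.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Resource:
    """Toy stand-in for an address range; 'end' is inclusive."""
    name: str
    start: int
    end: int
    window: bool = False  # True if sub-ranges may be carved out of it (assumption)
    children: List["Resource"] = field(default_factory=list)


def find_gap(parent: Resource, size: int) -> Optional[int]:
    """Shallow search: look for a free gap among the direct children only."""
    cursor = parent.start
    for child in sorted(parent.children, key=lambda r: r.start):
        if child.start - cursor >= size:
            return cursor
        cursor = max(cursor, child.end + 1)
    if parent.end - cursor + 1 >= size:
        return cursor
    return None


def find_gap_recursive(parent: Resource, size: int) -> Optional[Tuple[Resource, int]]:
    """Try the parent first, then descend into child windows."""
    start = find_gap(parent, size)
    if start is not None:
        return parent, start
    for child in sorted(parent.children, key=lambda r: r.start):
        if child.window:
            found = find_gap_recursive(child, size)
            if found is not None:
                return found
    return None


# A bridge window that covers the whole root, with one device inside it.
device = Resource("device", 0x0000, 0x0FFF)
bridge = Resource("bridge", 0x0000, 0xFFFF, window=True, children=[device])
root = Resource("root", 0x0000, 0xFFFF, window=True, children=[bridge])

shallow = find_gap(root, 0x1000)          # no gap visible at the top level
deep = find_gap_recursive(root, 0x1000)   # free space found inside the bridge
```

In this toy topology the root looks completely full from the top, because the bridge window spans all of it, while the free space sits one level down. That is the kind of layout in which a search restricted to immediate children would come back empty-handed.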
Dynamic per-cpu. Tejun Heo posted version 3 of his dynamic per-cpu patchset. Per-CPU is a mechanism wherein Linux kernel code can split certain data into a data area per CPU, so that hot-path code can quickly make updates without being concerned about the actions of other CPUs. As it sounds, this patchset makes per-cpu data area allocations entirely dynamic, rather than a compile-time determination. At David Miller’s request, individual maintainers were removed from the CC list and substituted with the more generic arch maintainers list. Separately, Tejun posted a patch (entitled “teach lpage allocator about NUMA”) which “makes the percpu allocator able to use non-linear and/or sparse cpu -> unit mappings and then makes the lpage allocator consider CPU topology and group CPUs in LOCAL_DISTANCE into the same large pages”.
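For anyone new to the per-CPU idea, a small Python model may help. This is purely illustrative and is not the kernel's per-cpu API: the PerCpuCounter class and its methods are hypothetical, and the number of CPUs is simply passed in at run time to mirror the "dynamic" aspect.

```python
class PerCpuCounter:
    """Toy model: one private slot per CPU, summed only when read."""

    def __init__(self, nr_cpus: int) -> None:
        # Sized at run time rather than fixed when the program was written.
        self._slots = [0] * nr_cpus

    def add(self, cpu: int, amount: int = 1) -> None:
        # Hot path: each CPU touches only its own slot, so no shared lock is needed.
        self._slots[cpu] += amount

    def total(self) -> int:
        # Slow path: walk every CPU's slot to get the overall value.
        return sum(self._slots)


counter = PerCpuCounter(nr_cpus=4)
counter.add(cpu=0)
counter.add(cpu=3, amount=5)
grand_total = counter.total()
```

The trade-off the model shows is that updates stay cheap and local, while a reader who wants the overall value has to pay for visiting every CPU's slot.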
VFS patches, part 2. Al Viro posted a series of VFS patches, mostly targeting BKL (Big Kernel Lock) removal in both the VFS and in filesystems. The Big Kernel Lock (BKL) was introduced in the earliest days of Linux SMP support written by Alan Cox as a means to have an extremely coarse-level “kernel lock” (exactly one CPU could be executing kernel code at a time), but it has long since become a performance bottleneck and is slowly being removed. Previous kernels have attempted to replace it with a semaphore (which was reverted, again for performance related reasons), and the RT tree still does so. Separately, Jan Blunck posted a series of patches preparing for the VFS based union mounts. He and Val think these are good to go in separately.
PCI updates for 2.6.31. Jesse Barnes posted a summary of pending changes in his git tree. These include improved PCI AER (Advanced Error Reporting) support (refer to the pciaer-howto for further information), the removal of pci_find_slot, and a collection of the usual cleanups and fixes.
FireWire updates post 2.6.30. Stefan Richter posted a few IEEE1394 (firewire) updates for 2.6.31. These included the newer sysfs attributes mentioned previously that should lead to “simpler and saner udev rules”.
Miscellaneous updates include: some trivial fixes for the ksym_tracer from K. Prasad, V4L/DVB updates from Mauro Carvalho Chehab, kmemleak fixes from Catalin Marinas (who also wishes to rename kmemleak_panic to kmemleak_stop to avoid confusion over the use of the “panic” word), UBI and UBIFS fixes from Artem Bityutskiy, some exofs patches from Boaz Harrosh, and a patch series adding software (not hardware) counters for PowerPC 32-bit. Discussion continued on the idea of handling page faults on x86 with interrupts enabled, adding a little complexity to the interrupt handler but intending to reduce overall overhead in the process.
Non-merge specific concerns
Changing the NOHZ idle load balance logic. Venkatesh Pallipadi posted a two part patch series aimed at changing the NOHZ idle load balance logic from the “pull” model currently in use (in which one idle load balancer CPU is nominated to not go into NOHZ mode and ends up doing all the balancing work for CPUs in the NOHZ mode) to a “push” model in which busy CPUs can kick those that are idle (and in NOHZ mode) into taking care of idle balancing on behalf of a group of idle CPUs. Apparently, there are still some “rough edges”, and so this is an RFC for the moment.
OpenAFS pioctls. OpenAFS is an implementation of the Andrew distributed filesystem, which is especially popular with banks and international corporations. David Howells posted a 17 part patch series implementing an in-kernel pioctl system call, as used by OpenAFS. Alan Cox objected to the “ugly” nature of the ABI, and asked why David couldn’t instead use the C-library system call wrapper (all system calls end up with a small wrapper in the system C-library) to do what this system call would otherwise do using those already available. David replied that it was almost possible to do this, but that it got very hairy and that he also wanted the kAFS and OpenAFS implementations to be able to share userspace tools without recompiling.
MCE test coverage data. Huang Ying posted to let everyone know about his mce-inject test tool (with git repository) and about further test information being available on his kernel.org people page.
Finally today, the “lack” of a configuration option for scsi_wait_scan was addressed in the form of documentation (from Stefan Richter) explaining why it has intentionally been omitted. The SCSI wait scan module is used (especially by distributions, in their initrds) to wait for SCSI device enumeration to complete. It does this by simply not returning from module_init until the SCSI subsystem is ready to proceed. It is needed by some users and accidental removal can lead to hard to debug boot failures, although removing the config option does seem excessive.
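The "block in init until the scan is finished" trick can be modelled in a few lines of Python. This is only a sketch of the idea and not the module's actual code: ScanState, scan_bus, and wait_scan_init are hypothetical names, a thread stands in for asynchronous SCSI scanning, and the device names are made up.

```python
import threading
import time


class ScanState:
    """Shared state between the (simulated) scanner and anyone waiting on it."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.devices = []


def scan_bus(state: ScanState, names, delay: float = 0.05) -> None:
    """Pretend to enumerate devices slowly, then signal completion."""
    for name in names:
        time.sleep(delay)
        state.devices.append(name)
    state.done.set()


def wait_scan_init(state: ScanState, timeout: float = 5.0) -> bool:
    """Stand-in for module_init: do not return until the scan has finished."""
    return state.done.wait(timeout)


state = ScanState()
scanner = threading.Thread(target=scan_bus, args=(state, ["sda", "sdb"], 0.01))
scanner.start()

finished = wait_scan_init(state)   # blocks here while the scan is running
scanner.join()
```

An initrd script that loads such a module gets a simple guarantee for free: by the time the load command returns, the devices it is about to mount have been discovered.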
In today’s announcements: Thomas Gleixner announced version 2.6.29.5-rt21 of the Real Time patchset. The latest version includes a fix for a rather unpleasant “lockup” scenario in the softirq handling code. There was no announcement for the previous -rt20 release due to this softirq issue.
The latest kernel release is 2.6.30, which was released by Linus June 9th.
Stephen Rothwell posted a linux-next tree for June 17th. Since the previous day, the powerpc tree continues to fail to build in an allyesconfig build configuration, the ext4 build failure means that a version from Monday is being used, the v4l-dvb tree lost its conflict, and the KVM tree gained a build failure (due to PowerPC now using -Werror), for which Stephen applied a quick patch. Total tree count remains at 128 trees.
That’s a summary of today’s Linux Kernel Mailing List traffic. For further information, visit www.kernel.org. I’m Jon Masters.

