2010/04/11 Linux Kernel Podcast
Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100411.mp3
For the weekend of April 11th, 2010, I’m Jon Masters with a summary of today’s LKML traffic.
In today’s issue: Fsck, Futexes, IOMMU, Modules, PRNG, and SMIs.
Fsck. Pavel Machek raised the issue of power failure and its potential to wreak havoc on filesystems that don’t enable barriers (which ensure the journal is fully on disk) by default. Pavel felt it would be prudent to artificially increase the mount count after an unclean shutdown so as to make an fsck more likely on the next boot. Ted Ts’o recommended that people simply move to ext4, while Rob Landley was surprised that anyone would want to wait hours for an fsck. Ted added that it is of course possible to use online checking via e2croncheck and similar tools, in which case he recommends weekly checks using, for example, an LVM snapshot of the running filesystem.
Futexes. Darren Hart posted an RFC entitled “Ideal Adaptive Spinning Conditions” in which he requested comments on his ideas around adaptive lock spinning with futexes as a means to reduce dependence on sched_yield when implementing userspace spinlocks. Adaptive spinning means spinning for a while rather than sleeping immediately when blocking on an already locked mutex, in case someone else releases it in short order; it is the kind of behavior implemented for adaptive kernel spinlocks by Gregory Haskins for the Novell RT kernel patchset. Darren finds that adaptive spinning actually harms his userspace implementation and is therefore interested to know the ideal conditions for this technique to be of use. Darren, Steven Rostedt, Gregory Haskins, Rik van Riel, Chris Wright, and the other usual suspects discussed this a little, as well as how things change under virtualization.
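As a rough illustration of the spin-then-sleep idea described above, here is a minimal Python sketch. It is not Darren's futex code: the function name adaptive_acquire and the max_spins parameter are hypothetical, and a fixed spin budget stands in for whatever condition a real implementation would use to decide when to stop spinning.

```python
import threading


def adaptive_acquire(lock, max_spins=1000):
    """Acquire `lock`, spinning briefly before falling back to sleeping.

    Returns "spun" if the lock was obtained during the spin phase and
    "blocked" if the caller had to sleep on it. `max_spins` is an
    illustrative tuning knob, not a value taken from the RFC.
    """
    for _ in range(max_spins):
        # Non-blocking attempt: succeeds at once if the holder has let go.
        if lock.acquire(blocking=False):
            return "spun"
    # Spin budget exhausted: sleep until the lock is released.
    lock.acquire()
    return "blocked"
```

The spin phase only pays off when the holder releases the lock within the spin budget; otherwise the spinning is pure overhead on top of the sleep, which is one plausible reading of why the technique can hurt.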
IOMMU. Neil Horman was concerned about recent kernels causing rare corruption when in-flight IOMMU operations are not properly flushed during a kexec (or a kdump) operation, and posted a patch intended to ensure all outstanding IOMMU domain entries are flushed on shutdown. Chris Wright favored doing this on initialization and stated that this had worked in the past, so something must have broken it recently for Neil to experience issues. Neil looked at the code some more and determined that the state AMD sets the IOMMU to on init should be relatively safe unless DMA operations are very long lived or devices are getting confused. He decided to think some more. Chris Wright later posted a patch to the IOMMU initialization such that it is properly enabled before devices are attached, in order to prevent the kind of stale entries that Neil had been seeing. Neil tested over the weekend and found that it did indeed solve his problems.
Modules. Nick Piggin was looking for ways to implement scalable in-kernel refcounting when he came across the current way that struct module_ref implements reference counting for loadable modules. He thinks that the existing implementation is racy, though Rusty Russell pointed out that it is only manipulated under stop_machine (which itself causes the kernel to essentially become single threaded code). Although this is (mostly) true for the module code itself, the counts are exported to users who do not necessarily read them correctly with any real locking. Rusty pointed out that unloading is relatively rare and so few people seem to care about bad usage. Nonetheless, Linus liked Nick Piggin’s patch, which replaces a single per-CPU counter with two (one for incrementing the count, one for decrementing, with the total count of module users represented by the difference of their sums) and thus removes a small window during which a reader may see a decrement on one CPU without seeing the matching increment occurring on another CPU at the same time. This is considered an improvement for those reading module_refcount unsafely, at least until that is unexported, the code is fixed up, or module removal support is itself removed entirely from the kernel.
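The two-counter scheme can be modelled in a few lines of Python. This is only a sketch of the idea as summarized above, not the kernel code: the class name ModuleRef and its methods are hypothetical, and reading the decrements before the increments is an assumption about how such a sum would be made safe (Python lists also have none of the memory-ordering concerns real per-CPU data has).

```python
class ModuleRef:
    """Toy model of a module use count split into per-CPU incs and decs."""

    def __init__(self, ncpus):
        # One monotonically increasing pair of counters per CPU.
        self.incs = [0] * ncpus
        self.decs = [0] * ncpus

    def get(self, cpu):
        # Taking a reference only ever touches the local CPU's inc counter.
        self.incs[cpu] += 1

    def put(self, cpu):
        # Dropping a reference may happen on a different CPU than the get.
        self.decs[cpu] += 1

    def refcount(self):
        # Assumption: sum the decrements first. Because both counters only
        # grow, any decrement counted here has its increment counted below,
        # so a concurrent reader should not see an artificially low total.
        decs = sum(self.decs)
        incs = sum(self.incs)
        return incs - decs
```

With a single signed counter per CPU, a reader walking the CPUs in order could pick up a decrement on a later CPU after having already passed the CPU where the matching increment landed; splitting the counters is what closes that window.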
PRNG. It was noted (by Eric Dumazet) that recent kernels provide 16 bytes of random data to new tasks (AT_RANDOM) for the benefit of the glibc PRNG (Pseudo Random Number Generator). This is the reason that Jan Ceuleers was seeing repeated reads of entropy_avail appear to decrease the available entropy: the fork() of every task reading from that file would itself consume entropy as a side effect.
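For readers curious where AT_RANDOM shows up, the following Python sketch parses an auxiliary vector such as the one Linux exposes in /proc/self/auxv. The helper names parse_auxv and at_random_address are hypothetical, and the layout (pairs of native unsigned longs terminated by an AT_NULL entry) is an assumption that holds on common Linux ABIs. Note that the AT_RANDOM value is the address of the 16 bytes, not the bytes themselves.

```python
import struct

AT_NULL = 0     # terminator entry, as defined in <elf.h>
AT_RANDOM = 25  # address of 16 random bytes, as defined in <elf.h>


def parse_auxv(data, fmt="@LL"):
    """Return {a_type: a_val} for each entry in a raw auxiliary vector."""
    size = struct.calcsize(fmt)
    entries = {}
    for offset in range(0, len(data) - size + 1, size):
        a_type, a_val = struct.unpack_from(fmt, data, offset)
        if a_type == AT_NULL:
            break  # end of the vector
        entries[a_type] = a_val
    return entries


def at_random_address(path="/proc/self/auxv"):
    """Return the AT_RANDOM address of this process, or None if absent."""
    with open(path, "rb") as f:
        return parse_auxv(f.read()).get(AT_RANDOM)
```

Calling at_random_address() only makes sense on a Linux system with procfs mounted; parse_auxv itself works on any byte string in the assumed layout.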
SMIs. Joe Korty posted a patch entitled “A nonintrusive SMI sniffer for x86”, in which he proposed hooking into the idle loop to detect unexplained gaps in time, using a similar approach to my own SMI or hwlat detector, but only in the idle loop. The patch looks interesting as an additional means for runtime detection of SMIs; however, it cannot replace the alternatives because it is only able to detect SMIs during the short window of its execution. As an aside, Steven Rostedt and I are poking at a new implementation for hwlat.
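The general "unexplained gap in time" technique can be sketched in userspace Python. This is neither Joe's patch nor the hwlat detector: the function names and the threshold are hypothetical, and a userspace loop will also report ordinary scheduler preemption, so it only illustrates the principle of comparing back-to-back clock samples.

```python
import time


def find_gaps(timestamps_ns, threshold_ns):
    """Return (start, length) for each gap between samples over threshold."""
    gaps = []
    for earlier, later in zip(timestamps_ns, timestamps_ns[1:]):
        delta = later - earlier
        if delta > threshold_ns:
            # Time went missing between two consecutive samples.
            gaps.append((earlier, delta))
    return gaps


def sample_clock(count):
    """Read a monotonic clock `count` times in a tight loop."""
    return [time.perf_counter_ns() for _ in range(count)]
```

A call such as find_gaps(sample_clock(100000), 50000) would list every stretch longer than 50 microseconds during which the loop made no progress. As with the idle-loop approach, nothing is detected outside the window in which the sampling loop runs.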
In today’s miscellaneous items:
*). Bartlomiej Zolnierkiewicz noted that his “atang” tree has been rebased on top of the 2.6.33 kernel.
*). James Hogan noted that several of the watchdog ioctl definitions are technically incorrect, but Alan Cox pointed out that these historical mistakes cannot now be corrected without breaking compatibility.
*). Version 10 of the sys_membarrier patches from Mathieu Desnoyers. These allow a task to issue a process wide memory barrier from userspace, which is useful when implementing userspace locking primitives (such as the userspace RCU implementation Mathieu is working on).
*). A bunch of patches from Tejun Heo intended to handle the future case of mainline no longer implicitly including slab.h from percpu.h.
*). Version 2 of a fun patch from Xiaohui Xin implementing a zero copy method for DMAing data into virtualized KVM guests by means of pinning specific copy buffers within the guest memory. Avi Kivity noted that this can be more useful than PCI passthrough as it copes with migration.
*). A simple patch from Eric Dumazet addressing a regression that had stopped the ability to perform a rewinding seek on /dev/mem and therefore had broken the ability to use x86info correctly.
*). A patch to pagemap walking in procfs initially from San Mehat and then reworked a little. The conversation gave Linus a chance to rant about the entire pagemap code in general, which Matt Mackall didn’t enjoy.
*). A discussion of the preferred means to detect whether a given graphics driver is using KMS (Kernel Mode Setting) rather than simply walking through all PCI graphics devices, started by Rafael J. Wysocki.
*). A discussion about bitops compile time optimizations for hweight_long (a Hamming weight calculation routine) that also covered implementing support for hardware popcnt using the alternatives() mechanism on x86. Borislav Petkov posted a patch entitled “Add optimized popcnt variants”.
*). General agreement that removing the “please try ‘cgroup_disable=memory’ option if you don’t want memory cgroups” message on boot is a good idea both for Red Hat Enterprise Linux and for upstream. Red Hat had expressed some concern about unnecessary support calls.
*). Exposure of an old bug with interrupts being enabled early on some ARM systems as reported by code in start_kernel. This was raised by Rabin Vincent, and triggered Peter Anvin to dig through old trees and find that rwsems can be used early in init when IRQs are still off, but will unconditionally re-enable them. Kevin Hilman posted a generic patch, changing the rwsem slow path to use save/restore spinlocks.
*). VMware posted their Balloon driver in response to the suggestion from Avi Kivity (the KVM maintainer) that they not attempt to integrate this into virtio but instead let it stand separately as simpler code. Andrew Morton requested a writeup, saying “I think I’ve forgotten what balloon drivers do. Are they as nasty a hack as I remember them to be?” (short answer: yes).
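For the hweight_long item above, a Hamming weight is simply the number of set bits in a word. The Python sketch below shows the classic branch-free software method for a 64-bit value; the name hweight64 mirrors the kernel's naming for readability, but this is an illustrative rendering rather than the kernel source, and the hardware popcnt instruction mentioned in the discussion computes the same result in a single operation.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def hweight64(w):
    """Count the set bits in the low 64 bits of `w` without looping."""
    w &= MASK64
    # Each 2-bit field now holds the number of bits set in that field.
    w -= (w >> 1) & 0x5555555555555555
    # Add neighbouring 2-bit counts into 4-bit fields.
    w = (w & 0x3333333333333333) + ((w >> 2) & 0x3333333333333333)
    # Add neighbouring 4-bit counts into 8-bit fields.
    w = (w + (w >> 4)) & 0x0F0F0F0F0F0F0F0F
    # The multiply sums all eight byte counts into the top byte.
    return ((w * 0x0101010101010101) & MASK64) >> 56
```

In Python the result can be cross-checked against bin(w).count("1") for any value that fits in 64 bits.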
In today’s announcements:
*). sg3_utils-1.29. Douglas Gilbert announced that version 1.29 of sg3_utils is now available. This package provides command line utilities for sending SCSI (and some ATA) commands to devices. Further information is available at: http://sg.danny.cz/sg/sg3_utils.html
*). 2.6.33-rt13. Thomas Gleixner announced that version 2.6.33-rt13 of the Real Time patchset is available. The patch is available from kernel.org at: http://www.kernel.org/pub/linux/kernel/projects/rt/
*). GIT 1.7.1.rc0. Junio C Hamano announced that version 1.7.1.rc0 of GIT is now available for download from http://www.kernel.org/pub/software/scm/git/. It includes a contributed script from Eric Raymond, support for GIT_ASKPASS, and a large number of other useful patches.
The latest kernel release was 2.6.34-rc3. The rc4 release was delayed for reasons that will be covered in the next episode of this podcast.
Rafael J. Wysocki sent an updated list of recent kernel regressions.
There was some concern from Taylor Lewick that kernel performance had regressed between the older 2.6.16 kernel he was running and more recent kernels, with transaction times increasing on the order of 15us. He posted some detailed statistics, though there have been few comments thus far.
Till Kamppeter noted that the deadline for student applications to the Google Summer of Code (GSoC) had passed and that it was time to assign them to the various kernel projects. In the end, all unassigned applications went to Grant Likely because he made the mistake of volunteering.
That’s a summary of today’s Linux Kernel Mailing List traffic. For further information visit www.kernel.org. I’m Jon Masters.

