2010/05/30 Linux Kernel Podcast
Audio: http://traffic.libsyn.com/jcm/linux_kernel_podcast_20100530.mp3
The podcast has returned from a brief break of a few weeks while I was busily working on a certain Enterprise Linux and using my spare time to not be in front of a computer (sailing). There is a backlog of shows in various stages though I’m not yet sure when I’ll get around to posting them online. Thanks for bearing with me and let’s hope we can get back into a routine once more. As always, if you are interested in helping out, drop me a line by email.
For the US Memorial Day Holiday weekend of May 31st 2010, I’m Jon Masters with a summary of the past week’s LKML traffic.
In today’s issue: Linux 2.6.35-rc1, errors, TSC, Unified Ringbuffer, virtio, and YAFFS.
Linux 2.6.35-rc1. Linus Torvalds announced the release of kernel 2.6.35-rc1 on Sunday, May 30th 2010 at 1:21pm Best Coast Time (PDT). Quoting Linus, “…and thus endeth the merge window”. After a two week merge window, Linus says that the “bulk should be there. And please, let’s try to make the merge window mean something this time – don’t send me any new pull request unless they are for real regressions or for major bugs, ok?”. The 2.6.35 release will not feature any new filesystems for a change, but does have all of the usual driver updates. Of the 8500 commits, there were about 1000 individual developers involved in the 2.6.35 tree this time around. Linus described the statistics – specifically calling them out in his mail – as demonstrating what is “a healthy development environment”.
Errors. Modern hardware is generally highly reliable, but scalability and the growth of datacenters play havoc with statistics. Given a large enough amount of memory, disks, or other devices, something will eventually go wrong. When it does, it is useful to handle as much as possible with an air of grace. Memory errors are of particular concern, especially with the growth in the amount of RAM in (increasingly) large servers. ECC (Error Correcting Code) memory can help, and includes the useful side effect of reporting on correctable errors. Existing userspace utilities, such as Andi Kleen’s mcelog (and other related work in the kernel itself into recoverable memory errors) offer an ability to collect reports of such errors, as well as Machine Check Exceptions (essentially hardware errors, usually related to failing memory, caches, etc.) of various other kinds. At this year’s Linux Foundation Collab Summit (April 15th 2010), there was a mini-summit aimed at figuring out a path for the future of various separate error reporting subsystems, such as MCE (Intel), and EDAC (AMD). Mauro Carvalho Chehab posted a summary of the minutes in the form of an email thread entitled “Hardware Error Kernel Mini-Summit”, in which it is proposed that a new kernel error subsystem be created, abstracting all of the existing mechanisms, and wired up using performance events (perf). The latter piece comes largely at the insistence of Ingo Molnar and Thomas Gleixner, and is not without its controversy amongst those who feel perf is growing to become some catch-all solution to every problem. Still, it seems likely that there will be some generic replacement to mcelog in the future.
TSC. Venkatesh Pallipadi (Google) posted a patch, originally from Dan Magenheimer (Oracle), in which various information about the perceived (or, generally, otherwise) reliability of the TSC known by the kernel was exported via sysfs. This would allow userspace applications using rdtsc to know whether the counter is generally regarded as a reliable source of time or not. Thomas Gleixner and Ingo Molnar both absolutely hated this, on the grounds that the TSC is known to be generally not a great clocksource (although it is becoming more reliable in many systems) and that just because the reading of it is generally unprivileged and thus widespread does not mean that the kernel should be complicit in encouraging others not to use the standard timestamp reading abstractions. Especially with modern kernels, where there are vsyscalls and other fancy mapped page hacks, the overhead of obtaining timestamp information from the kernel is generally fairly reasonable. There was even some suggestion of limiting ring3 access to the TSC by means of a SPR (Special Purpose Register) setting. Dan Magenheimer noted that the uses of userspace reading of the TSC were more widespread than Thomas and Ingo may have considered, and he called out the dynamic linker used in RHEL5 as one example of a semi-frequent reader of TSC information. Brian Bloniarz, John Stultz, and Peter Anvin took the conversation in a slightly different direction after Brian noted that sometimes userspace needs to know how reliable the current clocksource is considered to be for use in calibration (for example, when using NTP and desiring to know oscillator accuracy). It seemed to be decided that it would therefore be worthwhile to have a general means to determine the accuracy of the current clocksource, not just the Intel-world-view centric TSC. That latter part may well happen.
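For those following along at home, here is a minimal sketch of what "use the standard abstractions" can look like from userspace. It assumes a Linux system exposing the usual clocksource attributes under sysfs; the helper names (read_clocksource, elapsed_ns) are made up for illustration and are not part of any patch in the thread. It only reports which clocksource the kernel has chosen, and says nothing about accuracy, which is the part the thread suggested might be added later.

```python
import os
import time

# Usual sysfs location of the clocksource attributes on Linux.
CLOCKSOURCE_DIR = "/sys/devices/system/clocksource/clocksource0"


def read_clocksource(base_dir=CLOCKSOURCE_DIR):
    """Return (current, available) clocksource names, or (None, []) if unreadable."""
    def read(name):
        try:
            with open(os.path.join(base_dir, name)) as f:
                return f.read().strip()
        except OSError:
            return None

    current = read("current_clocksource")
    available = (read("available_clocksource") or "").split()
    return current, available


def elapsed_ns(work):
    """Time a callable using the kernel's monotonic clock rather than raw rdtsc."""
    start = time.monotonic_ns()
    work()
    return time.monotonic_ns() - start


if __name__ == "__main__":
    current, available = read_clocksource()
    print("current clocksource:", current)
    print("available clocksources:", available)
    print("empty call took (ns):", elapsed_ns(lambda: None))
```

The point of going through the monotonic clock is that the kernel, not the application, decides whether the TSC is trustworthy enough to back it.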
Unified Ringbuffer. Hardware error detection wasn’t the only topic of general unification efforts this week. Steven Rostedt posted an RFC thread entitled “Unified Ring Buffer” in which he discussed implementing a globally generic kernel ringbuffer that could be used in any subsystem (recall that Steven also implemented a fancy ringbuffer design in ftrace). He posted links to LKML discussion on the effort so far, and an LWN summary article, noting that both the ftrace ringbuffer and the oprofile ringbuffer have so far been unified, but also noting that the introduction of perf events (which require a lockless, NMI-safe, and mmap()able implementation) came with yet another new ringbuffer from Peter Zijlstra. Steven’s original ringbuffer became lockless last year, but currently does not support mmap. So there are two implementations, “neither of which can perform all of the features needed. This is putting a bit of stress on the users of these tools, not to mention the stress on the developers as well”. Steven would like to find a solution to this problem, and so started the thread. Mathieu Desnoyers added that he was happy to help, and had already started working on his own tree (originally intended to help his LTTng tracing tools), while Andi Kleen wondered aloud why Steven would “want a single ring buffer for everyone?”. Steven said the solution might not be to have one implementation, but merely one single interface (with varying backends used, including, as Andi had noted, kfifo based implementations). This led Ingo Molnar to suggest that grand design planning discussion of ringbuffers was less important than discussing the future direction for tracing and instrumentation (the main users of these ringbuffers, and the real motivation behind them), and to note that performance was currently quite sucky both in ftrace and perf. The conversation seemed to dry up without any specific conclusions.
Separately, Peter Zijlstra posted perf ringbuffer optimization patches in a thread entitled “Optimize perf ring-buffer”. Still separately, Chase Douglas posted some “Tracing configuration review” questions for the forthcoming Ubuntu kernel configuration, seeking review comments.
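To make Steven’s “one interface, varying backends” idea a little more concrete, here is a toy sketch. Everything in it is an assumption for illustration: the class names (OverwriteRing, FifoRing) and the write/read interface are invented, and the code is neither lockless nor NMI safe, so it resembles none of the real ftrace, perf, or kfifo implementations beyond the basic behavior when the buffer fills up.

```python
from collections import deque


class OverwriteRing:
    """Fixed-size ring that discards the oldest entry when full (flight-recorder style)."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0  # total entries ever written
        self.tail = 0  # total entries ever consumed or overwritten

    def write(self, event):
        if self.head - self.tail == len(self.slots):
            self.tail += 1  # full: drop the oldest entry to make room
        self.slots[self.head % len(self.slots)] = event
        self.head += 1
        return True

    def read(self):
        if self.tail == self.head:
            return None  # empty
        event = self.slots[self.tail % len(self.slots)]
        self.tail += 1
        return event


class FifoRing:
    """Bounded FIFO that refuses new entries when full (loosely kfifo-like)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def write(self, event):
        if len(self.queue) == self.capacity:
            return False  # full: the caller has to cope with the drop
        self.queue.append(event)
        return True

    def read(self):
        return self.queue.popleft() if self.queue else None


def drain(ring):
    """Consume every pending event through the shared write/read interface."""
    events = []
    event = ring.read()
    while event is not None:
        events.append(event)
        event = ring.read()
    return events
```

A tracer that only ever calls write, read, and drain does not care which backend it was handed; writing five events into a three-slot OverwriteRing leaves the newest three, while the FifoRing keeps the oldest three and rejects the rest.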
Virtio. Michael S. Tsirkin posted an RFC patch entitled “virtio: put last seen used index into ring itself”, which, as the title implies, modifies the ring buffer used for host/guest communication of virtio (via a feature flag, using available room in the existing structure) such that a guest will update the ring buffer with a host-visible state of where it is in consuming the buffer. The host doesn’t technically require this information, but it can save on unwanted interrupts if the host knows that the guest is not done processing previous ringbuffer entries, and provides useful statistical information. There then followed a lengthy (and somewhat interesting) debate between Michael, Dor Laor, and Rusty Russell concerning the latter’s assertion that the state of the ring buffer could be stored in the same cacheline as the last item in the buffer, rather than in its own cacheline. Rusty contended that this would be more efficient (since occasionally the index and data would be read at the same time), but when he wrote a test program he was only able to prove that Avi Kivity was right in suggesting separation. Various other points relating to the complexity of virtio were also discussed.
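The interrupt-saving part can be shown with a small simulation. This is a sketch under stated assumptions, not the code from Michael’s patch: the class and field names (UsedRing, used_idx, last_seen) are invented here, the indices are plain integers rather than the wrapping counters a real ring would use, and the policy of interrupting only when the guest had already caught up is just one plausible way for a host to use the published index.

```python
class UsedRing:
    """Toy model of a used ring in which the guest publishes how far it has consumed."""

    def __init__(self):
        self.used_idx = 0    # advanced by the host as it completes buffers
        self.last_seen = 0   # published by the guest as it consumes them
        self.interrupts = 0  # interrupts the host actually raised

    def host_complete(self):
        """Host marks one buffer used; interrupt only if the guest had caught up."""
        guest_was_idle = self.last_seen == self.used_idx
        self.used_idx += 1
        if guest_was_idle:
            self.interrupts += 1
        return guest_was_idle

    def guest_consume(self):
        """Guest processes all outstanding entries, then publishes its position."""
        consumed = self.used_idx - self.last_seen
        self.last_seen = self.used_idx
        return consumed
```

If the host completes three buffers back to back before the guest runs, only the first completion raises an interrupt; the guest then picks up all three in one pass, and the next completion interrupts again because the guest is idle once more.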
YAFFS. Charles Manning, ever diligent YAFFS (Yet Another Flash Filesystem – an excellent alternative that this author has had the privilege of poking at with his embedded hat on in the past) developer, posted some questions on SLUB behavior. Charles uses a SLUB-like allocator in YAFFS to manage objects, but his objects are separated according to the mount to which they refer. This makes it very easy for him to just throw away a large number of objects on unmount without de-allocating them (“trust me, I know what I’m doing”). He is looking at replacing his custom allocator with SLUB in order to facilitate eventual mainlining of YAFFS, but wants to know whether SLUB could grow some additional flags: “don’t combine this SLUB with others” and “‘trust me, I know what I’m doing’: Allow the cache to be dumped with objects still allocated”. So far, nobody has answered his questions.
In today’s miscellaneous items:
*). Mike Snitzer, Jens Axboe, Vivek Goyal, and Kiyoshi Ueda discussed (in a thread entitled “only initialize full request_queue for request-based device”) various approaches to minimalist initialization of Device Mapper devices, specifically given the new split handling of bio vs. request based devices. Only the latter type requires “full” queue setup.
*). Ingo Molnar requested that Linus pull the “lockup-detector-for-linus” tree, which contains a unified kernel lockup detector in kernel/watchdog.c that brings the existing NMI, hung task, softlockup, and similar detection together in one place. Big thanks go to Don Zickus for his work on this.
*). Discussion continued surrounding some documentation that Henrik Rydberg posted on the Multitouch event slots protocol for multitouch devices. It seems that these input devices become more complex by the day.
*). Don Zickus posted a patch entitled “Makefile.build: make KBUILD_SYMTYPES work again”, in which he provided some fixes to the code that provides a means to determine why kernel symbol versions have changed (i.e. which specific change to which kernel structure or function was the cause). This is of particular use to “Enterprise” distributions doing module versioning.
*). Michel Lespinasse (Google) posted a patch entitled “Stronger CONFIG_DEBUG_SPINLOCK_SLEEP without CONFIG_PREEMPT” in which he proposed tracking the preempt count even when not using CONFIG_PREEMPT, but when nonetheless building with CONFIG_DEBUG_SPINLOCK_SLEEP. Rather than the use of preempt_{dis,en}able actually resulting in preemption, it would merely serve as a means to warn when attempting to sleep incorrectly from within a critical section, but without explicitly enforcing it.
*). Discussion continued surrounding a previous patch from Kay Sievers adding new “devname” module aliases to facilitate module on-demand autoloading. The idea here is that modules can now provide the name of the device entry or entries they will create and so tools like udev can demand load modules as the nodes they support are accessed.
*). Thomas Gleixner finally posted the patch series he had threatened to post previously, entitled “Run interrupt handlers always with interrupts disabled”, that does largely what it says on the tin. It removes the IRQF_DISABLED functionality at interrupt registration and runs all interrupt handlers with IRQs off. This should facilitate greater migration over to modern threaded interrupt handlers as needed.
*). Neil Brown posted a patch entitled “VFS: fix recent breakage of FS_REVAL_DOT” in which he provided a fix for a change to NFS client mount behaviors, under which the client would no longer check whether a directory within which “ls -l” was being run had changed at the time of the command, without waiting for the cached timestamp attributes to time out. Al Viro took the patch, but did not like the implementation, so some further discussion ensued.
*). Arve Hjønnevåg posted the latest version of the “suspend block API”, which provides the “same functionality as the android wakelock api”. This is intended to control when a system will be blocked from suspending due to activity, and comes with the benefit of lengthy LKML discussion.
*). Glauber Costa posted version 3 of a patch implementing various KVM-specific MSR (Model Specific Register) documentation.
In today’s announcements:
* Smatch 1.55. Dan Carpenter announced that release 1.55 of the “smatch” static C source checker tool is now available. The latest version includes an enhanced array overflow check, new checks for precedence bugs caused by macro expansion, rewritten checks for null pointer dereferences, and some kernel specific checks for kunmap, release_resource, etc. http://smatch.sf.net/ or git://repo.or.cz/smatch.git
* Jeff Merkey announced version 2.6.34 of ndiswrapper. Quoting Jeff, “Always here to support the hated projects of Evil Emperor Linus. Needed this f**king think to work on my laptop so fixed the busted sh*t.” His 4-letter-word strewn announcement was greeted by a reply from Simon Horman noting that he would be happy to send Jeff a dictionary if he was looking to “learn some words that are more than four letters long”.
The latest kernel release is 2.6.35-rc1.
Greg Kroah-Hartman posted a series of 2.6.32.14 stable kernel review patches. He notes that he only included patches that were released in kernels up to the 2.6.34 release, since the line had to be drawn somewhere. This is a “long term” stable kernel tree. Many vendors are basing on 2.6.32 now. Greg also posted “take 2” of some 2.6.27.47 stable series patches, as well as stable review patches for 2.6.33.5.
That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

