2009/08/13 Linux Kernel Podcast
Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090813.mp3
For Thursday, August 13th, 2009, I’m Jon Masters with a summary of today’s LKML traffic.
In today’s issue: A big security item, AlacrityVM, Fast boot times, JTAG, Kprobes, mailing lists, runtime power management, and using compressed RAM for swap.
Security. Today saw the release of another NULL pointer exploit into the wild. This one affects almost every 2.4 and 2.6 series kernel, and in the latter case is compounded by other published issues with mmap_min_addr protection.
Tavis Ormandy, while looking at various socket operations structures around the kernel tree, discovered that sock_sendpage() doesn’t validate the function pointer it uses in the underlying protocol. This is a problem if the underlying socket operations struct hasn’t been initialized correctly, as is the case for a number of different protocols implemented in the kernel now. The result is a call through a NULL pointer, which on systems without vm.mmap_min_addr set (many, but by no means all systems) allows a local exploit through a simple mapping of the zero page. Although setting that tunable may mitigate the problem, a recently noticed issue with LSM might actually make it more likely that systems running SELinux are affected. All users are strongly encouraged to set this tunable, ensure their SELinux policy is not overriding its behavior, and upgrade their kernels forthwith.
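To make the class of bug concrete, here is a minimal Python model of an operations table with a missing entry. It is only an analogy, not the kernel patch: ProtoOps, no_sendpage and both sendpage_* helpers are hypothetical names invented for this sketch, and Python raises a TypeError where the kernel would jump to address zero.

```python
from typing import Callable, Optional


class ProtoOps:
    """Toy stand-in for a per-protocol socket operations table."""

    def __init__(self, name: str,
                 sendpage: Optional[Callable[[bytes], int]] = None):
        self.name = name
        # None models a function pointer that was never initialized.
        self.sendpage = sendpage


def no_sendpage(data: bytes) -> int:
    """Hypothetical safe default used when a protocol has no handler."""
    return -1


def sendpage_unchecked(ops: ProtoOps, data: bytes) -> int:
    # Mirrors the bug: the table entry is called without validation.
    return ops.sendpage(data)


def sendpage_checked(ops: ProtoOps, data: bytes) -> int:
    # Defensive pattern: fall back to a known-good default when unset.
    handler = ops.sendpage if ops.sendpage is not None else no_sendpage
    return handler(data)
```

Calling sendpage_checked() on a table with no handler returns the default's result, while sendpage_unchecked() on the same table fails outright.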
Linus Torvalds took the opportunity (in announcing 2.6.31-rc6, which contained the fix and which is covered later) to blast the vendor-sec mailing list and the very concept of “embargoes”, saying “if it hadn’t been for vendor-sec apparently leaking like a sieve, we’d have delayed the fix until the next -rc due to trying to be polite to vendors”. Linus of course makes a good point – keeping security fixes “secret” only works as long as you have a perfect system for keeping them secure (which doesn’t exist).
AlacrityVM. Anthony Liguori pointed out that Gregory Haskins’ previous benchmark results comparing e.g. venet against the existing userspace implementation of virtio had been done on a kernel built without High Resolution Timers, and so had resulted in graphs that showed an extreme difference in performance. Greg (who it should be added did take the trouble to contact me immediately after the previous podcast and point this out) updated his graphs, which now show correct round trip times for virtio-u of 266us, as opposed to more than 4000us in the previous benchmark run. Either way, userspace virtio is still shown to be slower than his replacement, though that may change with the in-kernel virtio implementation coming down the pipe. (The updated graphs are still 3D, which a number of vocal people seem to hate.) Separately, Michael S. Tsirkin posted version 3 of a 2 part patch series implementing a kernel-level virtio server (against in-kernel KVM), aimed at improving performance for virtio by reducing the virtualization overhead caused by extraneous system calls. As Michael says, for virtio-net, this removes up to 4 system calls *per packet*. Since the previous release, the patch adds some RCU comments, compat ioctl support, and uses “more idiomatic english” from Rusty Russell (we all know how Rusty can be eloquent with his use of the language).
Fast boot times. Robert Schewel posted asking whatever became of the “fastboot” boot parameter and the git development tree that Arjan van de Ven had set up back in March. This was a tree that included asynchronous device probing within the kernel to speed up bus enumeration times at boot. Arjan responded that the features were now present by default (so there was no need to do anything), and also, quote “on x86 we’re doing pretty well”. This led some to joke that fastboot was now a completely solved problem.
JTAG. Ordinarily, this would be a miscellaneous item, but I think it’s pretty cool. Davide Rizzo posted a patch series implementing a generic JTAG bitbang miscellaneous driver proposal. He wasn’t sure if he’d posted to the right list, since he couldn’t find an appropriate subsystem maintainer, but that will likely be resolved one way or another later. JTAG provides boundary scan and debug facilities, especially on embedded boards, and this driver will certainly be of use to those who have appropriate hardware.
Kprobes. Masami Hiramatsu posted version 14 of a 12 part patch series implementing a kprobe-based event tracer and x86 instruction decoder. The tracer allows one to probe various kernel events through the ftrace interface, while implementing a generic x86 instruction decoder that can be used to find the instruction boundaries when inserting new kprobes (remember, unlike many cleaner ISAs, x86 uses variable length instructions). The decoder does not support SSE/FP opcodes, and Masami thinks it might be possible to share the included opcode decoder map with the one currently used by the KVM Hypervisor. The latest version seems pretty close to mergeable and has various fixes.
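To illustrate why variable length instructions make boundary finding necessary, here is a toy sketch in Python. It is not Masami’s decoder: the opcode table covers only a handful of one-byte opcodes with no prefixes, ModRM or SIB bytes, and FIXED_LENGTHS, insn_length, boundaries and is_boundary are hypothetical names chosen for this example.

```python
# Toy length table for a few one-byte x86 opcodes (no prefixes, no ModRM).
# A real decoder needs a far larger opcode map than this.
FIXED_LENGTHS = {
    0x55: 1,  # push ebp/rbp
    0x90: 1,  # nop
    0xC3: 1,  # ret
    0xCC: 1,  # int3
    0xEB: 2,  # jmp rel8
    0xE9: 5,  # jmp rel32
}


def insn_length(code: bytes, offset: int) -> int:
    """Return the length of the instruction starting at offset."""
    opcode = code[offset]
    if 0xB8 <= opcode <= 0xBF:  # mov r32, imm32
        return 5
    if opcode not in FIXED_LENGTHS:
        raise ValueError(f"opcode {opcode:#04x} is outside the toy table")
    return FIXED_LENGTHS[opcode]


def boundaries(code: bytes) -> list:
    """Walk the buffer from the start, collecting instruction offsets."""
    offsets = []
    offset = 0
    while offset < len(code):
        length = insn_length(code, offset)
        if offset + length > len(code):
            raise ValueError("truncated instruction at end of buffer")
        offsets.append(offset)
        offset += length
    return offsets


def is_boundary(code: bytes, addr: int) -> bool:
    """A probe is only safe on the first byte of an instruction."""
    return addr in boundaries(code)
```

The point of the walk is that an offset landing inside an immediate operand looks like valid bytes but is not a safe place for a probe, which only a decode from a known starting point can tell you.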
Mailing lists. The MMC tree has a new mailing list. It is linux-mmc and is hosted as usual on the vger.kernel.org mailing list server. And on the subject of mailing lists, ongoing debate is happening surrounding which ARM mailing lists are preferred: Russell King’s moderated linux-arm-kernel on his own machine, or the linux-arm list on vger.kernel.org. A vocal minority would like to see posts happen on a public, open, non-moderated list, and see the kernel MAINTAINERS file include this address. Russell finally seemed to express indifference if that’s what the community preferred as a solution.
Runtime Power Management. Matthew Garrett posted two interesting RFC patches implementing runtime power management for PCI and USB buses. This allows for devices to be selectively shut down when they are not in use, in much the same way that they would be when being suspended and resumed. As Matthew says, this work builds upon Rafael J. Wysocki’s reworking of the power management API. Matthew had been experiencing various problems in testing due to a buggy BIOS, but has apparently now received an update with which he is able to show that this works. It’s still RFC, but it’s good to see it happening.
Swap. Nitin Gupta previously posted concerning his work on “compcache”, which implements a compressed RAM device upon which one can mount swap. For efficiency, the compcache folks want to have an immediate callback when a swap slot is freed rather than waiting for the special event that is otherwise passed into the block layer on swap slot freeing. Andrew Morton expressed concern at a simple callback under a spinlock, since he thought that artificially limited what one might be able to do with such an API. There was also some concern at duplicating functionality with a callback and subsequent block layer handling. Hugh Dickins shared Peter Zijlstra’s view that a general notifier might be the best way forward (while cautioning against current users of any hook), and finished up noting, “I won’t be surprised if we find that we need to move swap discard support much closer to swap_free”.
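As a rough model of why an immediate notification matters for a compressed RAM device, consider the following Python sketch. It assumes a hypothetical CompressedSwapStore class with a slot_free_notify() hook, and uses zlib purely as a standard-library stand-in; it makes no claim about the compressor or the interface that compcache actually uses.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size for this sketch


class CompressedSwapStore:
    """Toy model of a RAM-backed swap device that compresses each page."""

    def __init__(self):
        self._slots = {}  # swap slot number -> compressed page bytes

    def write(self, slot: int, page: bytes) -> None:
        if len(page) != PAGE_SIZE:
            raise ValueError("expected exactly one page of data")
        self._slots[slot] = zlib.compress(page)

    def read(self, slot: int) -> bytes:
        return zlib.decompress(self._slots[slot])

    def slot_free_notify(self, slot: int) -> None:
        # Hypothetical hook: called as soon as the slot is freed, so the
        # compressed copy stops occupying RAM right away.
        self._slots.pop(slot, None)

    def bytes_used(self) -> int:
        return sum(len(blob) for blob in self._slots.values())
```

Without some such notification, the store would keep holding the compressed copy of a page nobody needs any more, which defeats the purpose when the backing store is RAM itself.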
In today’s miscellaneous items: some genirq fixes aimed at preventing the wakeup of a freed irq thread from Thomas Gleixner (using Linus Torvalds’ “obvious solution”, for which he added a “precautionary” Signed-off-by), an RFC patch series implementing support for irq chips on slow buses such as I2C and SPI, also from Thomas Gleixner, a performance counters fix for “perf report” from Peter Zijlstra (since Pekka Enberg had noticed that this was broken by a “Full task tracing” patch), some performance counters, x86, and core kernel fixes from Ingo Molnar, another patch fixing an ABI incompatibility between “perf” and kernel, also from Peter Zijlstra, version 5 of the “Help Root Memory Cgroup Resource Counters Scale Better” patch series from Balbir Singh (which features a renamed subject), some RT mutex build fixes from Sven-Thorsten Dietrich, a cleanup fix for swiotlb fallback in intel_iommu_init, some sh updates (including initcall fixes in relation to recent I2C re-ordering now mergeable because the underlying I2C fixes got merged) from Paul Mundt, and some md fixes from Neil Brown. There was a suggestion from Jens Axboe that inlining spinlocks also results in a performance improvement of 3.5% with a particular workload on SPARC (as Dave Miller points out, this is likely because of the expense of a register window overflow onto the stack – which is 128 bytes of writes).
Finally today, Michael Schnell posted asking about the best practice to follow in implementing new futex support for an architecture (in this case on MMU-enabled NIOS systems). He would like some feedback.
In today’s announcements: Linux 2.6.31-rc6. Linus Torvalds announced 2.6.31-rc6. This had “Lots of small fixes all over, spread out fairly evenly”. As he says, things seem to be calming down a bit now, taking the opportunity to demonstrate his git prowess with an example command showing patch sizes. The release contains a fix for the (by now infamous) NULL pointer exploit, although as Linus points out, this should not be too much of a problem if previous efforts to fix mapping at the NULL page have turned out right.
Linux 2.4.37.5. Willy Tarreau took the opportunity to release 2.4.37.5, which he had wanted to delay, but the NULL pointer exploit (that also affects 2.4 systems – although the local exploit that is being distributed is not exactly the same on these older systems) forced his hand. Willy repeats the assertion that users should set /proc/sys/vm/mmap_min_addr to 4096 or higher anyway, “unless you know that it breaks one very old legacy application”. Doing so will mitigate the exploit by not allowing zero page mappings.
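For those who want to check their own systems, a small Python sketch along these lines reads the tunable and compares it against the 4096 figure Willy mentions. The function names are hypothetical, and the check says nothing about whether an SELinux policy or other mechanism overrides the setting.

```python
from pathlib import Path

MMAP_MIN_ADDR = Path("/proc/sys/vm/mmap_min_addr")
RECOMMENDED_MINIMUM = 4096  # the value suggested in the announcement


def parse_mmap_min_addr(text: str) -> int:
    """The proc file holds a single decimal number and a newline."""
    return int(text.strip())


def blocks_zero_page(value: int, minimum: int = RECOMMENDED_MINIMUM) -> bool:
    """True if the setting is at or above the recommended minimum."""
    return value >= minimum


def check_system(path: Path = MMAP_MIN_ADDR):
    """Return True or False, or None if the tunable cannot be read."""
    try:
        text = path.read_text()
    except OSError:
        return None
    return blocks_zero_page(parse_mmap_min_addr(text))


if __name__ == "__main__":
    print("mmap_min_addr check:", check_system())
```

A result of None simply means the file could not be read, for example on a kernel that does not expose the tunable.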
Greg Kroah-Hartman released review patches for the 2.6.27.30 and 2.6.30.5 stable series kernels, containing 28 and 74 patches respectively. I didn’t check but expect that the NULL pointer fix is amongst those.
The latest kernel release is 2.6.31-rc6, which was released by Linus in the evening, or rather his afternoon, at 16:37 PDT.
Stephen Rothwell posted a linux-next tree for August 13th. Since Wednesday, the v4l-dvb tree regained the same conflicts, the kvm tree gained a build failure, and the percpu tree lost 2 conflicts. Stephen notes that the linux-next tree composition has moved and is now located at a more official address on the kernel.org website, while symlinks provide redirection from the old addresses for those who want to use an up-to-the-minute tree and yet live in the past in other ways, perhaps as some form of compensation.
That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

