Archive for June 22nd, 2009

2009/06/21 Linux Kernel Podcast

June 22nd, 2009 jcm

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090621.mp3

This podcast is brought to you in part by way too many California strawberries.

For the weekend of June 21st 2009, I’m Jon Masters with a summary of the weekend’s LKML traffic.

In today’s issue: The continuing 2.6.31 merge window, the “Ceph” distributed filesystem, IO scheduler based IO controllers, poisonous hardware, transcendent memory, and ksplice tainting.

The continuing 2.6.31 merge window

Core kernel. Ingo posted a few updates to the core kernel. Amongst these was a bugfix developed in collaboration with Thomas that included a new function named get_user_writeable for use by the futex code (which can’t rely upon the existing access_ok for private futexes). A dialog ensued between Linus, Ingo Molnar and Thomas Gleixner concerning use of get_user_pages_fast() in this code, which Linus pointed out could be replaced with a single instruction on Intel-esque systems at any rate.

DRM. Dave Airlie posted a final drm tree for 2.6.31. Amongst the major changes was a switch in the AGP code to use arrays of pages instead of arrays of unsigned long. Quoting Dave, “since pageattr grew patch array interfaces this is possible and should solve GEM on PAE issues”.

KVM Support for 1GB pages. Joerg Roedel posted version 3 of a patch series that gives KVM the ability to support 1GB pages. This relies upon nested paging support, a feature of modern CPUs which behaves very similarly to an additional level in the global page table hierarchy. The patch series relies upon exporting vma_kernel_pagesize to modules.

Per-cpu. Ingo Molnar responded to yesterday’s “percpu for 2.6.31” pull request posted by Tejun Heo (that had gotten slightly warped in the posting and caused Linus to be slightly unhappy), pleading with Linus and company to reconsider taking the per-cpu changes due to the fact that the patches had been posted in a timely fashion, and the sheer amount of work Tejun will be committed to if he must maintain them for yet another cycle (170 files worth of changes).

Performance counters. Paul Mackerras noted that architectures like PowerPC64 define __u64 to be unsigned long rather than unsigned long long, which causes compiler warnings every time one prints such a value with the print format string of %Lx. To correct this, Paul posted a patch to these userspace tools providing their own implementation of the definition of types such as u64.
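The mismatch Paul describes can be illustrated with a small sketch. This is not his patch: the typedef and the helper names below are hypothetical, and the sketch uses the standard %llx format rather than %Lx. The first helper mirrors the idea of a tool carrying its own u64 definition so that the type and the format string agree on every architecture; the second shows the standard PRIx64 macro from inttypes.h as an alternative.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption for illustration: the tool defines its own 64-bit type
 * as unsigned long long, regardless of how the kernel headers define __u64. */
typedef unsigned long long u64;

/* Hypothetical helper: format a counter value as hex.
 * %llx always matches u64 because u64 is unsigned long long by definition. */
int format_counter(char *buf, size_t len, u64 value)
{
    return snprintf(buf, len, "%llx", value);
}

/* Alternative: keep uint64_t and let PRIx64 choose the right length
 * modifier for whatever the platform uses underneath. */
int format_counter_prix(char *buf, size_t len, uint64_t value)
{
    return snprintf(buf, len, "%" PRIx64, value);
}
```

Either way the compiler sees a format string that agrees with the argument type, so the warning goes away without a cast at every call site.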

RCU. Paul E. McKenney posted version 8 of his “big hammer” expedited RCU grace periods patchset. This patchset uses the existing per-CPU migration kthreads, which are awakened in a loop and waited for in a second loop, in order to expedite the passage of an RCU grace period. Apparently, this patchset can reduce RCU grace periods to 40us on an 8-CPU POWER machine.

Syscall tracepoints. While it is yet to be decided exactly when Jason Baron’s proposed syscall tracepoints will make it in, Li Zefan did use the opportunity to discover a bug in seqfile handling in the kernel trace infrastructure for which he posted a series of patches.

David Miller noted that stack backtrace support had broken sometime in the past day or so, which Stephen Rothwell was already aware of. Stephen forwarded a patch from Mike Frysinger that fixed it, which was also good news for Ingo.

Miscellaneous updates include: MMC updates (Pierre Ossman), Cryptography (Herbert Xu), ALSA (Takashi Iwai), NFS (Trond Myklebust, including support for version 4.1 of the NFS standard), Watchdog (part 2, apologies for not having space to mention part 1 yesterday), the usual level of tree posting insanity from Ingo (IRQs, scheduler – including another attempt to hide runqueues from those that would poke at them, timers, tracing, and x86), IDE (Bartlomiej Zolnierkiewicz), input updates (Dmitry Torokhov) and some kbuild fixes from Sam Ravnborg.

Architecture updates include: PowerPC (Benjamin Herrenschmidt), Blackfin (Mike Frysinger), and Microblaze (fixing a build problem caused by the previous round of Microblaze architectural updates).

Non-merge specific concerns

Ceph distributed filesystem client. Sage Weil posted a 21 part patch series implementing a “Ceph” distributed filesystem client, in the staging tree. “Ceph” is apparently a distributed filesystem designed for reliability, scalability, and performance, which relies on btrfs underneath. It features the usual kinds of things – data replication, no single points of failure, and fast recovery from node failures, although the fact that it’s only just going into the “staging” tree obviously means you shouldn’t rely on this client for critical stuff at this point. Separately, Greg posted a large number of changes to Linus for the “staging” tree (and by large, we mean 658 files changed, 165585 insertions, and 240493 deletions). Quoting Greg, “We are removing more crap than we are adding, looks like progress to me!”.

IO Scheduler based IO Controller. Vivek Goyal posted version 5 of his IO scheduler IO controller patchset. This patchset aims to introduce an ability to assign and control IO bandwidth consumed by tasks through IO throttling. A number of additional changes have been made since version 4, but these are mostly fixes and it looks like the patchset is stabilizing now.

Poisonous Hardware. Fengguang Wu posted version 6 of his HWPOISON patchset. This version has many of the changes discussed previously in this podcast. Included amongst those are the switched default to “late” kill except for those processes that have specifically requested an “early” kill via a per-process tunable option, as proposed by Nick Piggin and Hugh Dickins. Other changes include killing off the “uevent” emission idea, tainting the kernel on poisoned page detection, and not “mess”ing with dirty/writeback pages for now.

Transcendent memory (“tmem”). Dan Magenheimer posted a 4 part patch series (first as an email attachment, then as a normal series), implementing what he described as “tmem” for Linux. Essentially, this is support for transient memory of a “dynamically variable size”, addressable only indirectly by the kernel, and which might disappear without warning. It may seem (on the face of it) to have little utility, but the application is in virtual machines (or other non-virtualized environments, including hotplug memory, SSDs, page cache compression, and even highmem on non-highmem kernels and using spare VRAM) being provided with memory for caching (and similar purposes) that might be taken away at any moment without any warning. Since it requires kernel assistance, its application is mostly for in-kernel caches. The patch series is fairly comprehensive, and there will be a talk on the design on the first day of the 2009 Linux Symposium in Montreal, Canada.
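To make the “might disappear without warning” property concrete, here is a minimal userspace sketch of a cache with those semantics. It is not taken from Dan’s patches: every name below (tmem_sketch_put, tmem_sketch_get, tmem_sketch_reclaim, the slot count and the tiny page size) is hypothetical and chosen only for illustration. Pages are copied in and out by key rather than addressed directly, and a lookup can miss at any time.

```c
#include <stddef.h>
#include <string.h>

#define TMEM_SKETCH_SLOTS 4   /* hypothetical pool size */
#define TMEM_SKETCH_PAGE  64  /* tiny "page" to keep the sketch small */

struct tmem_sketch_slot {
    int used;
    unsigned long key;
    unsigned char data[TMEM_SKETCH_PAGE];
};

struct tmem_sketch_pool {
    struct tmem_sketch_slot slots[TMEM_SKETCH_SLOTS];
};

/* Copy a page into the pool. A direct-mapped slot is used, so an older
 * page with a colliding key is silently dropped. */
void tmem_sketch_put(struct tmem_sketch_pool *pool, unsigned long key,
                     const void *page)
{
    struct tmem_sketch_slot *s = &pool->slots[key % TMEM_SKETCH_SLOTS];

    s->used = 1;
    s->key = key;
    memcpy(s->data, page, TMEM_SKETCH_PAGE);
}

/* Copy a page back out. Returns 0 on a hit, -1 on a miss. */
int tmem_sketch_get(const struct tmem_sketch_pool *pool, unsigned long key,
                    void *page)
{
    const struct tmem_sketch_slot *s = &pool->slots[key % TMEM_SKETCH_SLOTS];

    if (!s->used || s->key != key)
        return -1;
    memcpy(page, s->data, TMEM_SKETCH_PAGE);
    return 0;
}

/* Models the memory being taken away: everything in the pool vanishes. */
void tmem_sketch_reclaim(struct tmem_sketch_pool *pool)
{
    memset(pool, 0, sizeof(*pool));
}
```

A caller has to treat a miss as entirely normal and fall back to the authoritative copy (for a page cache, the disk), which is why this kind of memory suits caches rather than primary storage.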

Finally today, the ksplice guys requested a new TAINT flag so those loading ksplice updates into their kernels would be able to detect this easily (especially vendors of those concerned). Peter Zijlstra objected on the grounds that ksplice isn’t upstream, although it does still seem (to this author) that it would be a worthwhile thing to have in mainline anyway.

The latest kernel release is 2.6.30, which was released by Linus on June 9th.

Stephen Rothwell posted a linux-next tree for June 19th. Stephen added one fix (for symbol checking, affecting ARM), and noted that Linus’ tree gained a build failure due to a compiler bug (for which he reverted the offending commit). A few other trees lost conflicts, and the tree continues to fail to build for those seeking an allyesconfig build configuration on PowerPC. The total number of sub-trees remains steady at 128 again today (apologies for missing the total in yesterday’s summary podcast).

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.


2009/06/18 Linux Kernel Podcast

June 22nd, 2009 jcm

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090618.mp3

Support for this Podcast comes from an unhealthy amount of coffee. Mine’s a double Americano, what’s yours?

For Thursday, June 18th 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: The continuing 2.6.31 merge window, direct mmap for FUSE/CUSE, racing in TCP receive, problems with sys_mount(), and kernel.org front page kernels.

We’re playing catchup here, largely because this is the first merge window this podcast has had to cover and it takes, well, a certain mind set.

The Continuing 2.6.31 merge window

Dynamic per-cpu. Tejun Heo posted an updated per-cpu git tree for 2.6.31, that takes into account many of the recent per-cpu fixes (including dynamic allocation of per-cpu data). Linus objected to the tree on the grounds that it hadn’t been in linux-next, and had been created only moments before posting with (potentially) little time for testing. Andrew Morton re-affirmed the lack of linux-next usage, adding “If this doesn’t mean ‘you missed 2.6.31’ then what does?” (he did also observe that there are some special cases such as this where some critical core kernel feature is modified and it’s not just “an ordinary old git merge like all the others”). The situation was clarified by Tejun: the git tree was being created from quilt patches that had been posted a number of times already, but there had been a glitch in the quilt import. He agreed that the lack of exposure in linux-next warranted delaying until 2.6.32 and stated that he would prep a tree for Stephen to pick up in linux-next soon.

Making executable pages the first class citizen. This podcast has covered this patch series several times before, but it is worth noting some feedback since this has now hit mainline, as Jesse Barnes pointed out. He found that one of his sample workloads went from creating an unusable machine to simply a slightly sluggish machine. Fengguang Wu was happy to hear this, but keen to point out that Rik van Riel had also helped with his protecting active file LRU pages from being flushed by streaming IO. On a VM tangent, Fengguang Wu also posted in response to the ongoing HWPOISON patchset with a modified version of the “only early kill processes who installed SIGBUS handler” which only does so for processes that register an interest in doing so via a prctl. This allows applications to easily be modified, without breaking existing expectations of applications currently deployed in the field.

Fixing returning from kernel to tasks with a 16-bit stack. Alexander van Heukelum posted a detailed explanation and patch series, describing a bug in the kernel support (on x86 systems) for returning from the kernel into userspace tasks that use a 16-bit stack. Obviously, this doesn’t happen too often, but it does in emulation software such as WINE and dosemu. Due to a quirk in the manner in which an Intel processor restores state in such situations, only the lower 16 bits of the userspace stack pointer are preserved, while the upper 16 bits are kept from the kernel stack. The kernel has an existing special “espfix” segment that is abused to ensure that the upper 16 bits of the returning stack pointer will be correct, but this wasn’t always being set up correctly, especially not in a return from NMI.
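The bit arithmetic behind the quirk can be modelled in a few lines of C. This is a simplified sketch of the behaviour as described above, not kernel code: both function names are hypothetical, and the second function only models the goal of the “espfix” trick (making the upper half of the kernel-side stack pointer agree with the user’s), not how the segment is actually constructed.

```c
#include <stdint.h>

/* Model of the stack pointer seen by userspace after returning to a
 * 16-bit stack segment: the low 16 bits come from the saved user value,
 * the high 16 bits are left over from the kernel stack pointer. */
uint32_t esp_after_return16(uint32_t kernel_esp, uint32_t user_esp)
{
    return (kernel_esp & 0xffff0000u) | (user_esp & 0x0000ffffu);
}

/* Model of the fix-up: arrange for the kernel-side stack pointer to
 * carry the user's upper 16 bits before the return happens, so the
 * stale half is already the value userspace expects. */
uint32_t espfix_model(uint32_t kernel_esp, uint32_t user_esp)
{
    return (user_esp & 0xffff0000u) | (kernel_esp & 0x0000ffffu);
}
```

With a kernel stack pointer of 0xc1234f00 and a user value of 0x0000fff0, the first function yields 0xc123fff0, exposing the upper half of a kernel address to the task. Passing the kernel value through espfix_model first yields 0x0000fff0, which is what the task had saved.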

Architecture updates include: microblaze (generic headers switch), and Super H fixes from Paul Mundt. On a tangent, it looks like John Williams (the author of the microblaze port) has got a new .com email, possibly indicating a move.

Miscellaneous updates include: md updates from Neil Brown (including support for non-power of two chunk sizes in RAID0), ftrace updates from Steven Rostedt (including support for bypassing read locks inside the NMI handler – as you may know, Steven’s unique page swapping on read means we only need a lock on read, not on write to an active ring_buffer), a trivial documentation update to kthread_stop from Oleg Nesterov (reminding everyone that kernel threads can now call do_exit and be kthread_stop()ed, the two were previously mutually exclusive), cleanups to MAINTAINERS from Joe Perches, ext4 updates from Ted Ts’o, some relatively straightforward network stuff from David Miller (including wireless bits from John Linville, and bug fixes for NetXen and E100), and minimal HTC Dream Support (Google Android) via a reposted patch series from Brian Swetland (including some patches signed off by the somewhat quieter these days Robert Love).

Apologies to Gregory Haskins for not covering the latest iteration of his irqfd and eventfd work in detail, since it hasn’t changed hugely. But if you’d like to read about precisely how network packets are received and routed to KVM via vbus, take a look at the latest eventfd thread.

Non-merge specific concerns

Implementing direct mmap for FUSE/CUSE. Tejun Heo was busy today. In addition to posting per-cpu updates, he also posted the third version of a patchset implementing direct mmap support for FUSE/CUSE. This allows users of a FUSE filesystem to request an mmaped region, which will be satisfied on the backend by a kernel anonymous mapping, and still populated by the FUSE userspace server. The server gets to decide how mappings are shared so this has additional performance benefits for those implementing on FUSE/CUSE.

A rare race in TCP receive. Jiri Olsa posted to say that he had found a rare race in the TCP layer using an older RHEL4 kernel (that happens to be based upon 2.6.9, which is fairly long in the tooth). It turned out that, because of a missing smp_mb() and a combination of known errata in certain Intel CPUs, it was possible for tp->rcv_nxt updates made by one CPU to not propagate correctly to the others and result in a system sleeping forever. Jiri posted a patch citing the various errata, documentation, and including a fairly comprehensive analysis of the situation, although he said that he could not reproduce this upstream due to the rarity of its occurrence.

Fixing an overflow in sys_mount(). Today’s tip of the hat goes to Vegard Nossum, who diligently tracked down a bug reported by Ingo Molnar. It turns out that kernel code calling sys_mount() can be bitten by the fact that the aforementioned function will copy an entire page passed for the “type” parameter, even though less data is typically required for this string. If the content of the page happens to contain stray “wild” pointers, we might follow those and wreak some random havoc. Vegard (obviously) suggests stopping after we find the first NUL terminator.
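The idea of stopping at the terminator rather than copying a whole page can be sketched as follows. This is a userspace illustration under stated assumptions, not Vegard’s patch: the function name and the SKETCH_PAGE_SIZE constant are hypothetical, and a real kernel version would also have to deal with faults while reading the source.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size for illustration */

/* Copy a string of at most SKETCH_PAGE_SIZE - 1 characters into dst,
 * which must hold SKETCH_PAGE_SIZE bytes. Reading stops at the first
 * NUL in src, so bytes after the string are never touched.
 * Returns the number of characters copied, excluding the terminator. */
size_t copy_string_bounded(char *dst, const char *src)
{
    size_t i = 0;

    while (i < SKETCH_PAGE_SIZE - 1 && src[i] != '\0') {
        dst[i] = src[i];
        i++;
    }
    dst[i] = '\0';
    return i;
}
```

For a filesystem type such as “ext3” this reads five bytes from the source (the four characters plus the terminator that ends the loop) instead of a full page, so whatever happens to sit after the string is left alone.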

Finally today, Randy Dunlap resurrected an email thread from several weeks ago in which it was proposed that references to the old “mm” tree be removed from the front page of kernel.org. He added that 2.2 kernels might go the same way.

The latest kernel release is 2.6.30, which was released by Linus on June 9th.

Stephen Rothwell posted a linux-next tree for June 18th. Since Wednesday, the tree contains a few fixes, some conflicts due to deltas between Linus’ ongoing changes to his tree and developer trees, and the tree still fails to build in an allyesconfig build configuration for powerpc.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.
