Archive for the ‘episodes’ Category

2010/04/11 Linux Kernel Podcast

April 14th, 2010 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100411.mp3

For the weekend of April 11th, 2010, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Fsck, Futexes, IOMMU, Modules, PRNG, and SMIs.

Fsck. Pavel Machek raised the issue of power failure and its potential to wreak havoc on filesystems that don’t enable barriers (which ensure the journal is fully on disk) by default. Pavel felt it would be prudent to artificially increase the mount count after unclean shutdowns so as to make an fsck more likely on the next boot. Ted Ts’o recommended that people just move to ext4, while Rob Landley was surprised that anyone would want to wait hours for an fsck, to which Ted added that it was of course possible to use online checking via e2croncheck and so on (in which case, he recommends people do weekly checks using, for example, an LVM snapshot of the running filesystem).

Futexes. Darren Hart posted an RFC entitled “Ideal Adaptive Spinning Conditions” in which he requested some comments on his ideas around adaptive lock spinning with futexes (essentially spinning for a while rather than sleeping immediately when blocking on an already locked mutex, in case someone else releases it in short order – the kind of behavior implemented for adaptive kernel spinlocks by Gregory Haskins for the Novell RT kernel patchset) as a means to reduce dependence on sched_yield when implementing userspace spinlocks. Darren finds that adaptive spinning actually harms his userspace implementation and is therefore interested to know what the ideal conditions are for this technique to be of use. Darren, Steven Rostedt, Gregory Haskins, Rik van Riel, Chris Wright, and the other usual suspects discussed this a little, as well as how things change under virtualization.
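
To make the spin-then-sleep idea concrete, here is a minimal userspace sketch in Python. It is not Darren's futex code: the function name adaptive_acquire and the max_spins budget are assumptions made for illustration, and the sketch does not model refinements such as spinning only while the lock owner is actually running on a CPU.

```python
import threading


def adaptive_acquire(lock, max_spins=1000):
    """Try to take `lock` by spinning briefly, then fall back to blocking.

    Returns the number of failed spin attempts before the lock was taken
    (max_spins means the spin budget ran out and we blocked instead).
    """
    for spins in range(max_spins):
        # Non-blocking attempt: succeeds at once if the holder has let go.
        if lock.acquire(blocking=False):
            return spins
    # Spin budget exhausted: give up the CPU and sleep in the blocking path.
    lock.acquire()
    return max_spins
```

Whether the spin phase helps depends on how long the lock is typically held relative to the cost of sleeping and waking, which is the question the thread was trying to answer.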

IOMMU. Neil Horman was concerned about recent kernels causing rare corruption when in-flight IOMMU operations are not properly flushed during a kexec (or a kdump) operation, and posted a patch intended to ensure all outstanding IOMMU domain entries are flushed on shutdown. Chris Wright favored doing this on initialization and stated that this had worked in the past, so something must have broken it recently for Neil to experience issues. Neil looked at the code some more and determined that the state AMD set the IOMMU to on init should be relatively safe unless DMA operations are very long lived or devices are getting confused. He decided to think some more. Chris Wright later posted a patch to the IOMMU initialization such that it is properly enabled before devices are attached, in order to prevent the kind of stale entries that Neil had been seeing. Neil tested over the weekend and found that it did indeed solve his problems.

Modules. Nick Piggin was looking for ways to implement scalable in-kernel refcounting when he came across the current way that struct module_ref implements module reference counting for loadable modules. He thinks that the existing implementation is racy, though Rusty Russell pointed out that it is only manipulated under stop machine (which itself causes the kernel to essentially become single threaded code). Although this is (mostly) true for the module code itself, the counts are exported to those who do not necessarily use them correctly with any real locking. Rusty pointed out that unloading is relatively rare and so few people seem to care about bad usage. Nonetheless, Linus liked Nick Piggin’s patch, which replaces a single percpu counter with two (one for incrementing the count, one for decrementing, with the total count of module users represented by the difference between their sums) and thus removes a small window during which one CPU may decrement a use count without seeing an increment from another CPU occurring at the same time. This is considered an improvement against those reading module_refcount unsafely, at least until that is unexported, the code is fixed up, or module removal support is itself removed entirely from the kernel.
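
The window that the split counters close can be shown with a small single-threaded simulation. This is a sketch under stated assumptions, not the kernel code: the two "CPUs" are just list slots, the function names are made up, and the interleaving of reader and writers is written out by hand. It also assumes that the reader sums the decrements before the increments, which is what makes the split scheme safe against under-counting.

```python
def racy_single_counter_read():
    """One signed counter per CPU; the reader sums them while writers run."""
    counts = [1, 0]        # one reference is held; it was taken on CPU 0
    seen = counts[0]       # reader sums CPU 0 first and sees 1
    counts[0] += 1         # meanwhile: a new reference is taken on CPU 0 ...
    counts[1] -= 1         # ... and one is dropped on CPU 1
    seen += counts[1]      # reader now sums CPU 1 and sees the decrement only
    return seen            # the true count never dropped below 1


def split_counter_read():
    """Separate inc/dec counters per CPU; the reader sums decs before incs."""
    incs, decs = [1, 0], [0, 0]
    seen_decs = sum(decs)  # reader collects decrements first
    incs[0] += 1           # same concurrent get on CPU 0 ...
    decs[1] += 1           # ... and put on CPU 1 as above
    seen_incs = sum(incs)  # any decrement already counted has its increment here
    return seen_incs - seen_decs
```

In this interleaving the single-counter reader reports 0 even though a reference was held throughout, while the split-counter reader reports 2: it may briefly over-count, but it does not report zero users while the module is in use.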

PRNG. It was noted (by Eric Dumazet) that recent kernels provide 16 bytes of random data to new tasks (AT_RANDOM) for the benefit of the glibc PRNG (Pseudo Random Number Generator). This is the reason that Jan Ceuleers was seeing repeated reads of entropy_avail seeming to decrease available entropy, as the fork() of every task reading from that file would also consume it via indirect action.

SMIs. Joe Korty posted a patch entitled “A nonintrusive SMI sniffer for x86”, in which he proposed hooking into the idle loop to detect unexplained gaps in time, using a similar approach to my own SMI or hwlat detector, but only in the idle loop. The patch looks interesting as an additional means for runtime detection of SMIs; however, it cannot replace the alternatives because it is only able to detect SMIs during the short window of its execution. As an aside, Steven Rostedt and I are poking at a new implementation for hwlat.
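
The underlying technique, polling a clock back to back and flagging intervals that are too long to explain, can be sketched in a few lines of Python. This is only an illustration of the principle: the function name find_gaps and its parameters are assumptions, and a userspace script cannot tell an SMI apart from ordinary preemption or interpreter pauses the way an in-kernel detector running with interrupts disabled can.

```python
import time


def find_gaps(threshold_ns, samples, clock=time.perf_counter_ns):
    """Poll `clock` back to back and report gaps of at least `threshold_ns`."""
    gaps = []
    last = clock()
    for _ in range(samples):
        now = clock()
        if now - last >= threshold_ns:
            # Time passed that this loop cannot account for.
            gaps.append(now - last)
        last = now
    return gaps
```

Passing a fake clock makes the behavior easy to check: with readings of 0, 1, 2, 50 and 51 and a threshold of 10, the only gap reported is the 48 unit jump between 2 and 50.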

In today’s miscellaneous items:

*). Bartlomiej Zolnierkiewicz noted that his “atang” tree has been rebased on top of the 2.6.33 kernel.

*). James Hogan pointed out that several of the watchdog ioctl definitions are technically incorrect, but Alan Cox pointed out that these historical mistakes cannot now be corrected without breaking compatibility.

*). Version 10 of the sys_membarrier patches from Mathieu Desnoyers. These allow a task to issue a process wide memory barrier from userspace, which is useful when implementing userspace locking primitives (such as the userspace RCU implementation Mathieu is working on).

*). A bunch of patches from Tejun Heo intended to handle the future case of mainline no longer implicitly including slab.h from percpu.h.

*). Version 2 of a fun patch from Xiaohui Xin implementing a zero copy method for DMAing data into virtualized KVM guests by means of pinning specific copy buffers within the guest memory. Avi Kivity noted that this can be more useful than PCI passthrough as it copes with migration.

*). A simple patch from Eric Dumazet addressing a regression that had stopped the ability to perform a rewinding seek on /dev/mem and therefore had broken the ability to use x86info correctly.

*). A patch to pagemap walking in procfs initially from San Mehat and then reworked a little. The conversation gave Linus a chance to rant about the entire pagemap code in general, which Matt Mackall didn’t enjoy.

*). A discussion of the preferred means to detect whether a given graphics driver is using KMS (Kernel Mode Setting) rather than simply walking through all PCI graphics devices, started by Rafael J. Wysocki.

*). A discussion about bitops compile time optimizations for hweight_long (a hamming weight calculation routine), that also covered implementing support for hardware popcnt using the alternatives() mechanism on x86. Borislav Petkov posted a patch entitled “Add optimized popcnt variants”.

*). General agreement that removing the “please try ‘cgroup_disable=memory’ option if you don’t want memory cgroups” message on boot is a good idea both for Red Hat Enterprise Linux and also for upstream. Red Hat had expressed some concern about unnecessary support calls.

*). Exposure of an old bug with interrupts being enabled early on some ARM systems as reported by code in start_kernel. This was raised by Rabin Vincent, and triggered Peter Anvin to dig through old trees and find that rwsems can be used early in init when IRQs are still off, but will unconditionally re-enable them. Kevin Hilman posted a generic patch, changing the rwsem slow path to use save/restore spinlocks.

*). VMware posted their Balloon driver in response to the suggestion from Avi Kivity (the KVM maintainer) that they not attempt to integrate this into virtio but instead stand separately as simpler code. Andrew Morton requested a writeup, saying “I think I’ve forgotten what balloon drivers do. Are they as nasty a hack as I remember them to be?” (short answer: yes).
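
For readers unfamiliar with the hweight_long item above: a Hamming weight is simply the number of set bits in a word, which is what a hardware popcnt instruction computes directly. The following Python sketch shows the classic software fallback that sums bits in parallel within the word. The name hweight64 mirrors the kernel helper, but this is an illustrative reimplementation under that assumption, not the kernel source.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def hweight64(w):
    """Count the set bits in a 64-bit value by summing them in parallel."""
    w &= MASK64
    # Each 2-bit field now holds the number of set bits it contained.
    w -= (w >> 1) & 0x5555555555555555
    # Add neighbouring 2-bit counts into 4-bit fields.
    w = (w & 0x3333333333333333) + ((w >> 2) & 0x3333333333333333)
    # Add neighbouring 4-bit counts into bytes.
    w = (w + (w >> 4)) & 0x0F0F0F0F0F0F0F0F
    # The multiply accumulates all byte counts into the top byte.
    return ((w * 0x0101010101010101) & MASK64) >> 56
```

When the argument is known at compile time, the whole computation can be folded into a constant, which is the sort of compile time optimization the thread was about.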

In today’s announcements:

*). sg3_utils-1.29. Douglas Gilbert announced that version 1.29 of sg3_utils is now available. This package provides command line utilities for sending SCSI (and some ATA) commands to devices. Further information is available at: http://sg.danny.cz/sg/sg3_utils.html

*). 2.6.33-rt13. Thomas Gleixner announced that version 2.6.33-rt13 of the Real Time patchset is available. The patch is available from kernel.org at: http://www.kernel.org/pub/linux/kernel/projects/rt/

*). GIT 1.7.1.rc0. Junio C Hamano announced that version 1.7.1.rc0 of GIT is now available for download from http://www.kernel.org/pub/software/scm/git/. It includes a contributed script from Eric Raymond, support for GIT_ASKPASS, and a large number of other useful patches.

The latest kernel release was 2.6.34-rc3. The rc4 release was delayed for reasons that will be covered in the next episode of this podcast.

Rafael J. Wysocki sent an updated list of recent kernel regressions.

There was some concern from Taylor Lewick that kernel performance had regressed between the older 2.6.16 kernel he was running and more recent kernels, with transaction times increasing on the order of 15us. He posted some detailed statistics, though there have been few comments thus far.

Till Kamppeter noted that the deadline for student applications to the Google Summer of Code (GSoC) had passed and that it was time to assign them to the various kernel projects. In the end, all unassigned applications went to Grant Likely because he made the mistake of volunteering :)

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

Categories: episodes Tags:

2010/04/04 Linux Kernel Podcast

April 13th, 2010 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100404.mp3

For the weekend of April 4th 2010, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: BKL, KVM, Networking, and recvmmsg.

BKL. In the latest round of Big Kernel Lock (BKL) removal discussion, Arnd Bergmann posted some patches to the TTY layer, noting that it was “one of the trick[ie]r bits in the BKL removal series, so let’s discuss it here”. Arnd’s code is similar to the earlier Big Kernel Semaphore (BKS) concept but it uses a Big TTY Mutex instead. This is based upon a mutex, not a semaphore, that does not autorelease on sleep, and is intentionally confined to TTY use. Alan Cox replied suggesting that he wasn’t too bothered if these patches went in because he was working to remove the need for giant locks, whatever they happen to be called. So the Big TTY Mutex may be a short lived piece in otherwise killing the BKL sooner rather than later. Having said that, Alan wanted to hold off a little while he took care of “low hanging fruit” first. Others agreed.

KVM. Jiri Kosina inquired about a kernel warning generated on 32-bit KVM guests when using an AMD guest CPU on an AMD host. The emulated guest CPU is an AMD model 2, stepping 3, which is one of the models AMD apparently explicitly did not support using in SMP configurations. Jiri wondered whether it was worth adding a specific hack for KVM (since SMP emulation does work). Andi Kleen suggested perhaps just killing the code that generates a warning on those systems as it is by now very old, while Andre Przywara really didn’t like removing the warning and favored simply emulating a better model instead. Pavel Machek agreed that emulating an explicitly SMP-capable CPU model was likely the solution.

Networking. Christoph Lameter inquired as to future network stack support for the PGM protocol (RFC 3208). Currently, there exists the openpgm implementation, which runs as a userspace application using raw sockets, but there are a number of limitations in so doing, not the least of which is a performance hit. Christoph feels that PGM belongs at the same level as both UDP and TCP support, though the conversation didn’t go much beyond discussing possible prototypes.

recvmmsg(). Linux 2.6.33 added a new system call called recvmmsg() that intends to complement recvmsg() in allowing for multiple packets to be received and processed at once, rather than performing one system call (or even more) per individual packet. Unfortunately for Brandon Black, who was trying to use this new feature in his DNS server implementation, calls to recvmmsg() on a blocking socket will result in the call blocking until the maximum requested number of packets are available, not just one single packet. Although Brandon says he is willing to work around this, he prefers a more configurable blocking behavior in use of recvmmsg(). Ulrich Drepper agreed; Brandon posted a patch.
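
The blocking behavior Brandon was asking for, waiting for one packet and then taking whatever else is already queued, can be illustrated in userspace. Python's socket module has no recvmmsg() wrapper, so the sketch below emulates the semantics with ordinary recv() calls; the function name recv_batch and its parameters are assumptions made for illustration. Note that this still costs one system call per packet, which is exactly the overhead recvmmsg() is meant to avoid, so it demonstrates only the blocking semantics and not the performance benefit.

```python
import socket


def recv_batch(sock, max_msgs, bufsize=2048):
    """Block for the first datagram, then drain whatever is already queued."""
    msgs = [sock.recv(bufsize)]  # blocks until at least one datagram arrives
    while len(msgs) < max_msgs:
        try:
            # MSG_DONTWAIT makes only this call non-blocking.
            msgs.append(sock.recv(bufsize, socket.MSG_DONTWAIT))
        except BlockingIOError:
            break  # nothing else is queued; return what we have
    return msgs
```

A DNS server built this way returns promptly with a single query when traffic is light, and picks up several queries per wakeup when traffic is heavy.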

In today’s miscellaneous items:

*). A couple of IDE reverts to deal with missing devices.

*). Some new cpu-hotplug wrapper functions (cpu_notify, __cpu_notify, and cpu_notify_nofail).

*). Some followup discussion on a new CPU flag bit on recent Intel CPUs that enables the CPU to declare that it explicitly has a synchronized TSC.

*). Some percpu module handling fixes for module static percpu from Tejun Heo.

*). An async firmware loading patch from Johannes Berg, intended to allow for non-blocking immediate rejection of unavailable firmware early during boot that is requested via request_firmware_nowait prior to boot completion.

*). Tilman Schmidt noted that CONFIG_PROVE_RCU is incompatible with proprietary kernel modules because it will result in the creation of a reference to a GPL only exported symbol even in modules that do not use RCU. He suggests that those building proprietary modules disable PROVE_RCU. Paul McKenney thanked him for sharing this solution with others who might be affected.

*). A fix for __module_ref_addr() use on stable kernels prior to 2.6.34 (where percpu use has been refactored) by Mathieu Desnoyers.

*). A scheduler bug present since November 12 2009 was identified in an email thread posted by Torok Edwin (and bisected by Mike Galbraith) in which use of latencytop results in the runtime of random tasks being set to really high values afterward due to the broken commit.

*). Version 10 of the “use lmb with x86” patches was posted by Yinghai Lu. There was some further discussion about the plan to essentially replace e820 handling on x86 with a modified version of the Logical Memory Block code that will now be modified to support parsing e820 tables.

*). A small tweak to the ordering of TLB flushing on S4 resume for i386 via a patch from Shaohua Li.

*). A discussion started by Torok Edwin concerning 32-bit perf tracing with a 64-bit kernel. Torok had been slightly confused by needing to re-install perf for a 32-bit build and this led Ingo Molnar to ponder whether it was time to have a variant of perf built for each architecture variant.

*). A nice summary of the various printk macros (pr_, dev_, netdev_, netif_, etc.) from Joe Perches after Neshama Parhoti asked about them.

*). A patch from Robert Schone modifying power_frequency events such that changing the frequency on another CPU results in it being traced rather than the CPU that initiated the frequency change operation.

*). A patch making it easier to disable fragmentation when doing PPP multilink from Richard Hartman. Apparently this reduces “packet loss and massive ping spikes” that are seen by Richard and others.

*). Lin Ming asked Corey Ashford whether he was still working on performance event support for “uncore” or “nest” CPU units (these are additional functional units on the same die as the CPU cores but not in-core). Corey said that he was not actively working on it but is working on nest events for IBM’s “Wire-Speed” processor using the existing infrastructure due to some time constraints. It looks like more will happen here in due course.

*). Some shadow page cache discussion for KVM MMU from Xiao Guangrong.

*). Some discussion between Peter Zijlstra, Rusty Russell and Tejun Heo concerning the latter’s “cpuhog” patches and the fact that Peter doesn’t like the name. Rusty on the other hand quite likes it, because “ugly things should have ugly names”. Tejun did propose an alternative set of names, including functions such as stop_cpu() and stop_cpus() but these don’t really stop CPUs, they hog them. So the CPU hog name is more apt.

*). Lee Schermerhorn posted some comparative benchmarks between a Red Hat 2.6.18 kernel and the upstream 2.6.32 and 2.6.33 kernels, showing recent upstream performance regressions. Plots: http://free.linux.hp.com/~lts/Pft/

In today’s announcements:

OSPERT 2010. Peter Zijlstra announced the official Call For Papers for the 2010 Operating System Platform for Embedded Real-Time applications conference. It is to be held on July 6th in Brussels, Belgium in conjunction with the 22nd Euromicro International Conference on Real-Time Systems, which happens between the 7th and the 9th of July also. Those working on embedded Real Time systems may find this particularly interesting. The paper deadline was April 4th.

Git 1.7.0.4. A maintenance GIT release was announced by Junio C Hamano.

LTP. Rishikesh K Rajak announced that the Linux Test Project (LTP) for March 2010 has now been released. It includes some last minute fixes and is available at the usual sourceforge.net/projects/ltp location.

LTTng 0.208. Mathieu Desnoyers announced the latest LTTng release 0.208 for Linux kernel 2.6.33.2 is now available. It uses waits with msleep() in place of cpu_relax() in order to handle !PREEMPT uniprocessor (UP) configurations.

The latest kernel release was 2.6.34-rc3 during the time period covered by this podcast episode.

Greg Kroah-Hartman announced the release of stable series kernels 2.6.27.46, 2.6.31.13, and 2.6.33.2. Existing users of these stable kernels should upgrade.

Finally today, Jeff Merkey surfaced from wherever he’s been recently and let everyone know that he has been issued US patent number 7,684,347, which, it was noted, seems to be simply an abstract “really fast” packet sniffer. Jan III Sobiesk suggested that someone should patent a “really fast operating system”. Jeff should have waited a few days for April 1st, the same day that the kernel.org website featured 180 degree (or pi if you prefer) rotated text on the main page – that wasn’t a hack, it was John and Peter showing some humor.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

Categories: episodes Tags:

2010/03/28 Linux Kernel Podcast

April 13th, 2010 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100328.mp3

For the weekend of March 28th 2010, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Filesystems, Interrupts, LMB vs. e820, Multitouch, PHY and phylib, the VM, and VMware.

Filesystems. Josef Bacik posted a patch entitled “Introduce freeze_super and thaw_super for the fsfreeze ioctl”. In the patch, Josef notes that the existing fsfreeze code actually works too much at the block level, assuming every superblock is backed by a (typically a single) block device. For some modern filesystems – such as is the case with btrfs (Josef is a btrfs developer) – there can be a number of backing block devices, some of which may be added and removed while a filesystem is mounted. Consequently, Josef wishes to split out the freeze process to include dedicated superblock manipulating functions that don’t require the superblock s_bdev to be populated with one backing device. Al Viro had some typically useful comments about the patch, including some further followup to a reply by Nigel Cunningham containing some information about how TuxOnIce does filesystem freezing that Al was not too happy about.

Interrupts. Andi Kleen posted a patch entitled “Prevent nested interrupts when the IRQ stack is near overflowing”, in which he attempted to address the issue of too many IRQ vectors assigned to a given CPU all firing in rapid succession and causing the interrupt stack to overflow. Thomas Gleixner, in rejecting the patch, first noted that Andi’s changelog was “utter nonsense” because it referred to interrupt nesting from the same interrupt source rather than many vectors, and then noted that simply disabling further interrupts in such cases was not the correct solution. Thomas favored doing away with IRQF_DISABLED and instead finishing the task of converting to threaded IRQ handlers with the small hard handler always running with IRQs disabled, and he wouldn’t take the patch “unless you come up with a real convincing story”. Alan Cox wondered if there was “anyone [Thomas had] forgotten to offend”, to which Thomas responded matter of factly that he wasn’t sure since he hadn’t measured IRQ handler run times “for quite a while”. Linus first told Thomas he was “wrong” in always disabling interrupts, and then seemed to change direction, giving some comments on removing IRQF_DISABLED entirely.

LMB vs. e820. Two different mechanisms for accounting and tracking physical memory layout are in common use within the kernel. Intel (x86) systems use the Intel e820 BIOS provided tables (and support code with the same name) to track which memory ranges are assigned to particular uses, while other architectures – including SPARC, POWER/PowerPC – use LMB (Logical Memory Blocks). The latter was made an architecture independent library in 2008 and lives in lib/lmb.c. The fact that there are two different systems came to a head when Yinghai Lu posted an early_res patch aiming to move the more architecture independent pieces of the existing e820 code into fw_memmap.c. David Miller (the SPARC maintainer) did not like this, since he believed that Yinghai wasn’t listening to earlier advice that LMB provided all of the support in an independent fashion and should be adapted to replace the e820 bits instead. Thomas Gleixner added that, “All we get are some meager bones thrown our way”, and suggested that this wasn’t the best way to interact with the community. The thread started a mini-architecture flamewar with Ingo Molnar noting that he really wished “non-x86 architectures apprec[ia]ted (and helped) the core kernel work x86 is doing”, and Benjamin Herrenschmidt more than taking offense at this statement. But that aside, Ingo did point out that Yinghai had been doing a lot of very difficult work that was certainly of use, even if in the end another approach to unifying various bits of LMB and e820 is taken. Yinghai later posted a new patch series entitled “use lmb with x86”.
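
Whatever the implementation, both mechanisms come down to maintaining lists of physical address ranges. As a rough illustration of that shared idea (and not the code in lib/lmb.c or the e820 support code), the Python sketch below keeps a sorted list of (base, size) regions and merges a new range with any region it overlaps or touches. The function name add_region and the tuple layout are assumptions made for this example.

```python
def add_region(regions, base, size):
    """Insert [base, base + size) into a list of (base, size) regions.

    Regions that overlap or directly adjoin the new range are merged
    into it; the result is returned as a new sorted list.
    """
    start, end = base, base + size
    merged = []
    for rbase, rsize in sorted(regions):
        rend = rbase + rsize
        if rend < start or rbase > end:
            merged.append((rbase, rsize))  # disjoint and not adjacent: keep
        else:
            # Overlapping or touching: absorb into the range being added.
            start, end = min(start, rbase), max(end, rend)
    merged.append((start, end - start))
    return sorted(merged)
```

A "reserved" list can be kept the same way alongside the "memory" list, with the usable ranges being whatever is in the first list but not the second.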

Multitouch. Just in time for this author to buy a shiny new Macbook Pro that suffers from the same problem (and also uses the nouveau driver, that has had its own interesting ride recently), the discussion of multitouch finger tracking was raised again. Modern (laptop) hardware touchpads feature an ability to accurately track the position of multiple fingers at a time, and this allows for the kinds of gestures that are becoming popular today. At the same time, the X Window system that powers most graphical Linux desktops today has only minimal support and cannot handle such things as click and drag with two fingers. This means that your author has to use a custom hacked up mouse driver to support click and drag. I’m not the only one, and this prompted Henrik Rydberg to wonder recently whether it was time to add software finger tracking into the kernel. He pointed to an X.org discussion that had originally raised the idea back in summer 2009. Having discounted the idea then, he was now much more amenable to reconsidering. It seems likely that something will happen, it’s just a question of whether it will be directly in the input layer, in a new mtdev handler, or in an external library that is provided for userspace code to link against. In any case, your author is glad to see this in kernel, where it belongs.

PHY and phylib. Stefani Seibold posted in a thread entitled “fix PHY polling system blocking”, inquiring about the existing implementation for PHY link detection with MII (Media Independent Interface – the means through which network MAC chips communicate portably with various possible PHYs). The existing mechanism does not always use interrupts and can block for a few milliseconds (up to 4ms in one example with e100), while the chip that Stefani is using sees approximately 450us delay. Stefani made various proposals for adjusting the existing phylib, one of which was explicitly disliked by David Miller because it would break link-type changes.

VM. Mel Gorman followed up on a previous patch of his, which attempted to address concerns that, under an IO intensive workload running with little available RAM, the VM may be calling congestion_wait in cases where something other than strict congestion is at fault. He posted test results showing that the number of times kswapd and the page allocator call congestion_wait, and the time spent in that function, have been increasing since 2.6.29. Quoting Mel, “120+ kernels and a lot of hurt later;”. He posted very detailed test reproducer information, noting that the increase in calls to congestion_wait wasn’t due to any one change, and itemizing a few of the recent changes that have played a part. These include the TTY layer using higher order allocations more frequently, some CFQ fairness changes, and so on. He, Rik van Riel, Corrado Zoccolo, and Johannes Weiner bounced ideas around about the real reasons for performance regressions on the IO workload that was being tested. Simply adding more RAM was not the point.

VMware. Dmitry Torokhov posted an RFC patch implementing a virtio extension for the VMware balloon driver. Balloon drivers allow for virtualized guests to expand and contract their memory requirements at runtime, through a co-operative interaction with the hypervisor. In the case of VMware, Dmitry says they are interested in using the existing Linux virtio framework to communicate between Linux guests and the VMware hypervisor, but with a few tweaks – for example, their hypervisor may refuse to lock certain pages, or may (under certain circumstances) reset the balloon via a notification to the guest, without requiring the guest to explicitly notify on every page released back to the hypervisor as a consequence. Dmitry is interested in various other capabilities that could be exposed over virtio but is first interested to hear from the Linux community. So far that community is only represented in replies by Avi Kivity (KVM), who favors VMware having their own balloon driver, or splitting out a shared “balloon core”.

In today’s miscellaneous items:

* Brian Gerst posted version 2 of a patch implementing merged fpu and simd exception handlers in one function.

* The final round of task_struct->signal stability cleanups from Oleg Nesterov.

* Support for nested pid namespaces from Serge E. Hallyn.

* A patch from Jason Baron implementing support for enabling the kmemleak checker and memory hotplug support simultaneously in the kernel config.

* Some changes to TAINT_ flag handling from Ben Hutchings (intended to distinguish non-harmful errors such as missing firmware from more serious issues that would traditionally have set the taint flag).

* Some work in progress discussion about reading remapped performance counters on x86 systems from Stephane Eranian (but the current patch breaks the already working implementation on POWER/PowerPC).

* The latest version (5) of the Memory Compaction patches from Mel Gorman.

* A patch allowing different tracers to be compiled independently from Jan Kara.

* The latest version (5) of the Jump Label patches from Jason Baron.

* An ARM port of the Linux Checkpoint-Restart patches from Christoffer Dall.

In today’s announcements:

The latest kernel release on the original date of this podcast was 2.6.34-rc2, which was released on March 19th. The current release is a higher revision.

Rafael J. Wysocki posted a list of reported regressions from 2.6.32 and 2.6.33 that were still possibly affecting 2.6.34-rc2.

Git 1.7.0.3. Junio C Hamano announced that version 1.7.0.3 of GIT is available. The latest release includes fixes for ACL support on the underlying filesystem, and various other fixes also.

IIO mailing list. Jonathan Cameron announced the creation of a new “Industrial input / output” mailing list since a lot of such discussions had been happening off list already. The new (majordomo) list is linux-iio@vger.kernel.org, and can be subscribed to via sending email to majordomo@vger.kernel.org as usual.

SystemTAP version 1.2. Frank Ch. Eigler announced the release of SystemTAP version 1.2 by posting some release notes. This includes various fixes for use with kernel versions 2.6.9 through 2.6.34-rc.

util-linux-ng v2.17.2. Karel Zak announced version 2.17.2 of the util-linux-ng package. This is a bugfix release.

Sachin Sant reported a hotplug test failure on -rc2, and Rafael J. Wysocki posted a link to an existing patch that corrected the problem.

Frederic Weisbecker inquired as to whether anyone would mentor the Linux Wireless Google Summer of Code (GSoC) project, to which there were no replies. Therefore it seems that some folks at Portland State University will be asking around amongst the student population for interested parties.

Finally today, Michael Gilbert noted that CVE-2009-4537 had been publicly disclosed for a while but an official (non-vendor) fix was not upstream. Neil Horman said he would take care of making a posting about it, and he did post an official fix for the r8169 frame length error a few days later.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

Categories: episodes Tags:

2010/03/21 Linux Kernel Podcast

March 21st, 2010 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100321.mp3

For the weekend of March 21st, 2010, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Linux 2.6.34-rc2, 64-bit system calls, core dumping to a pipe, exported symbols, page cache control, and performance counters for KVM guests amongst other things.

Linux 2.6.34-rc2. Although there is no official announcement as of this writing, Linus’ git tree currently contains a 2.6.34-rc2 release that he created on Friday March 19th 2010 at 6:17pm Best Coast Time (PDT). Once the announcement is officially made, there will be more detail.

64-bit system calls. Benjamin Herrenschmidt raised a question in a thread entitled “64-syscall args on 32-bit vs syscall()”, concerning the ability for existing kernels to handle passing 64-bit parameters to system calls when using a 32-bit userspace. A problem arises on platforms such as POWER and its smaller cousin, PowerPC, in which arguments are often passed by register and not on the stack (unless a large number are passed). When passing 64-bit values (as in calling fallocate() within hdparm), GCC may try to use multiple registers (which themselves need to be aligned on even boundaries) to pass a 64-bit value using two sequential 32-bit registers. But the syscall() function within glibc may try to effectively use the same trick again, causing arguments to be off-by-one. Benjamin had a proposal for modifying the existing syscall() interface in a way he thought would be backward compatible (perhaps confined to P{ower,OWER}{PC,} initially) but Ulrich Drepper wasn’t quite so trigger happy to make changes. Peter Anvin favored using explicit versioning to isolate any syscall() interface changes. Separately, Torok Edwin posted some perf (Performance Counters userspace utilities in the “perf” directory) patches enabling callgraph tracing of 32-bit processes when running 64-bit kernels.
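
The off-by-one is easier to see with a toy model of register assignment. The Python sketch below is a simplification under stated assumptions: arguments go into numbered 32-bit slots, a 64-bit argument needs two consecutive slots starting at an even index, and the generic syscall() wrapper is modeled as dropping the leading system call number and shifting everything else down by one. The slot numbering, the hi/lo ordering, and the function name assign_slots are illustrative and do not claim to reproduce any particular ABI document.

```python
def assign_slots(args):
    """Place (name, bits) arguments into 32-bit register slots.

    A 64-bit argument takes two consecutive slots and must start on an
    even slot index, so a padding slot is inserted when needed.
    """
    slots = []
    for name, bits in args:
        if bits == 64:
            if len(slots) % 2:
                slots.append("<pad>")  # skip a slot to reach an even index
            slots += [name + ".hi", name + ".lo"]
        else:
            slots.append(name)
    return slots


FALLOCATE = [("fd", 32), ("mode", 32), ("offset", 64), ("len", 64)]

# Layout the kernel expects for fallocate(fd, mode, offset, len).
expected = assign_slots(FALLOCATE)

# Layout the compiler builds for syscall(nr, fd, mode, offset, len) ...
via_syscall = assign_slots([("nr", 32)] + FALLOCATE)
# ... and what is left once the wrapper drops nr and shifts the rest down.
seen_by_kernel = via_syscall[1:]
```

In this model the extra leading argument changes the parity of every later slot, so the compiler inserts padding before offset; after the shift, the kernel finds that padding where it expects the first half of offset, and every following value is misplaced by one slot.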

Core dumping to a pipe. Neil Horman posted the 4th version of a patch series entitled “exec: refactor how call_usermodehelper works, and update the sense of the core_pipe recursion check”. In addition to addressing some existing race conditions with the implementation, Neil was interested in reworking the call_usermodehelper() function to handle core dumping to a pipe. In the existing arrangement, it is necessary to have all running processes with non-zero core dump ulimits to ensure the pipe dump will work as planned. But Neil has had enough requests to be more flexible, and has come up with the idea of adding a function callback to the call_usermodehelper (umh) that will be made after the task (at this point, in userspace nomenclature, that is just about referable as a process – they are the same however) has been forked but prior to the exec() call starting the userspace code. That function pointer can, in the case of do_coredump, fiddle with ulimits.
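
A userspace analogue of this "callback between fork and exec" pattern exists in Python's subprocess module, and it makes the idea easy to try out. The sketch below is only an analogy to the kernel change, not a description of Neil's patch: run_helper, drop_core_limit, and the REPORT command are hypothetical names, and the choice of lowering the soft core limit to 0 is just an example of adjusting a ulimit in the child.

```python
import resource
import subprocess
import sys


def drop_core_limit():
    # Runs in the child after fork() and before exec(): lower the soft
    # core-size limit to 0 while leaving the hard limit unchanged.
    _, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (0, hard))


def run_helper(argv, init=None):
    """Spawn `argv`, calling `init` in the child between fork and exec."""
    return subprocess.run(argv, preexec_fn=init, capture_output=True, text=True)


# Hypothetical helper program: a child that reports its own soft core limit.
REPORT = [sys.executable, "-c",
          "import resource; print(resource.getrlimit(resource.RLIMIT_CORE)[0])"]
```

Calling run_helper(REPORT, drop_core_limit) starts a child whose limit was changed by the callback, while the parent's own limits are untouched, which is the property that makes such a hook useful.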

Exported symbols. Robert P. J. Day inquired whether the kfifo implementation should really be exporting as many symbols as it does. Tilman Schmidt alluded to the reasoning behind this in mentioning inlined functions. For background, whenever some kernel function needs to be usable from within modules, that function must explicitly be exported through an EXPORT_SYMBOL or a similar macro definition – simply omitting the C keyword “static” does not have the desired effect. Sometimes, symbols are exported solely because they are used by corresponding inline functions that are included within module files and need to use the corresponding export. For example, an inline function called “foo” might need an export “_foo”. In order to clarify the situation, this author suggested a new EXPORT_SYMBOL_INTERNAL export to clearly label these use cases such that symbols are not used where they are not intended.

Page cache control. Balbir Singh posted a patch exposing a cache= kernel command line parameter that can be used to control page cache operation, and effectively disable it entirely in certain situations. This is of particular benefit to virtualized guests (especially those not wanting to enter into direct reclaim frequently), which otherwise might have their pagecache data effectively stored twice – once in the host, and once in the guest itself. Now, there being no such thing as a free lunch, Avi Kivity pointed out that this would slow down guests booted with cache=off because they would now need to use a virtio call to pull in more pages. However, guest memory utilization was shown to fall considerably, as might be expected without a page cache. Both Avi and Balbir seemed to agree that the tunable knob allowed for decisions to be based upon the specific needs of an environment – more overhead in the VM or a slight loss in performance, according to workload, IO types, filesystems, and a number of other items mentioned by both. Randy Dunlap specifically requested that documentation be added also.

Performance Counters for KVM guests. Yamin Zhang posted a patch entitled, “Enhance perf to collect KVM guest os statistics from host side” intended to facilitate the collection of performance counters statistics from the host when using Linux guest instances, with the exception of guest userspace. Avi Kivity was excited that this patch did not require the exact same kernel on both the host and the guest (he called that “critical”, noting that, “I can’t remember the last time I ran same kernels”). There did seem to be some agreement between both Avi and Ingo Molnar that having a vmchannel client in the host kernel exporting various data for tracing to guest kernels did make life easier for the implementers of such features but potentially opened up another DoS target and needed to be avoided. Instead, Ingo suggested that the host perf tools connect to the qemu instances managing guest instances and communicate over a well-known UNIX socket. The conversation went off onto a tangent about obtaining guest instance information using libvirt, whether there were other tools in common usage to manage guest instances other than starting them directly using the modified qemu, and the relative benefits of shipping all KVM kernel and userspace code in a single project. This gave Ingo an opportunity to get in another mention of what he considers to be “ugly” separation between glibc and the kernel. The entire thread is certainly worth reading, at dozens of posts and likely growing.

In today’s miscellaneous items:

*). A fix for allmodconfig with Xilinx soft core FPGA systems.

*). A device power management documentation update from Rafael J. Wysocki.

*). Version 7 of Andrea Righi’s per memory cgroup dirty page limit patch. Andrea provided some documentation updates that were discussed also. Separately, and on the note of cgroups, the CFQ_GROUP_IOSCHED configuration option was made visible in a patch from Li Zefan.

*). A bunch of scheduler and cpusets fixes from Oleg Nesterov, who also noted that there were remaining issues – including a potential lockup in do_fork() caused by receiving a signal from an IRQ or an RT thread pre-emption event because the runqueue lock (rq->lock) cannot be taken in the interim. Oleg asked the maintainers very nicely to please review his patches and comment, although there have been no comments posted in the last week on these.

*). Michael Braun reported an issue involving an interaction (or lack thereof) between the kernel crypto subsystem and the SLOB allocator. He finds that there is “general memory corruption” when using SLOB that isn’t present with the other allocators. Herbert Xu (and by extension, Pekka Enberg, since it was he who inquired as to whether these options were enabled) asked Michael to turn on some allocator debugging options and provide the relevant debugging output to facilitate further analysis.

*). A fix from Suresh Siddha ensuring that legacy PIC interrupts are handled on all CPUs, and not just the boot CPU, when using the “noapic” kernel boot option. This addresses a bug originally raised by Ingo Molnar.

*). A patch from Dmitry Torokhov re-implementing sysrq as an input handler, rather than as a custom hack in the legacy keyboard driver. Henrique de Moraes Holschuh wondered aloud whether this would introduce any problems for SAK (Secure Attention Key), which should be uninterruptible. That piece does not yet seem fully resolved in the thread.

*). A patch converting alpha to use clocksource rather than arch_gettimeoffset from John Stultz.

*). A misaligned percpu allocation when using lock events through perf on a particular SPARC box was reported by Frederic Weisbecker.

In today’s announcements:

Kernel.org. John (warthog9) Hawley announced the general availability of various SSL based services on kernel.org. Quoting John, “[t]his should help provide an additional level of security, in particular for our dynamic content like the wiki’s, patchwork and bugzilla”. John noted that the SSL certificates were generously donated by Thawte, and included a quote from the latter in which they state that they are, “proud of [our] open source lineage”. As of this writing, services officially using SSL (through explicit redirection) include Bugzilla, Wikis, Account Requests, Patchwork, while services that can use SSL if requested using the appropriate address do currently include the main www.kernel.org, boot.kernel.org, git.kernel.org, and android.git.kernel.org. Services not using SSL include mirrors.kernel.org (due to the volume of traffic incurred), and the geo-DNS entries because that would expand the number of SSL certificates required unreasonably.

Loop-AES. Jari Ruusu announced version 3.3a of the loop-AES file/swap utility. Details: http://loop-aes.sourceforge.net/

LTP. Rishikesh K Rajak sent an announcement saying that the previous ltp-cvs commit list would be supplemented by a new ltp-commits list that includes git commits also. The name would suggest that it may be somewhat VCS agnostic. Details: http://lists.sourceforge.net/lists/listinfo/ltp-commits

SCST. Vladislav Bolkhovitin posted to announce that the “new SCST SysFS-based interface has become fully usable, so you can start migrating to it and update your target drivers, dev handlers and management utilities”. For further information, please see: http://scst.sourceforge.net/

TCM. Nicholas A. Bellinger announced the release of version 3.4.0-rc1 of the Target_Core_Mod/ConfigFS infrastructure project, which includes a new Open-FCoE.org based target module (tcm_fc) for TCM/ConfigFS 3.x (mentioned in a separate release announcement). As of the latest release, the TCM/ConfigFS project is now tracking upstream Linux development once again. For further information: http://www.linux-iscsi.org/index.php/Target_Core_Mod/ConfigFS

RT 2.6.33.1-rt11. Thomas Gleixner announced the latest RT kernel patch version 2.6.33.1-rt11 is now available. Since he had been traveling, Thomas had made a few interim releases (rt6 through rt11), the sum of which he summarized. For further detail: http://www.kernel.org/pub/linux/kernel/projects/rt

TuxOnIce 3.1. Nigel Cunningham announced the 3.1 release of TuxOnIce. This is a series of alternative software suspend and resume patches that have been out of the kernel tree for some time, but have their various supporters. The latest patches include LZO compression support, UUID support for detecting suspend images without using a resume= parameter, and other fixes.

The latest kernel release is 2.6.34-rc2.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

2010/03/14 Linux Kernel Podcast

March 19th, 2010 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100314.mp3

For the weekend of March 14th 2010, I’m Jon Masters with a summary of the week’s LKML traffic.

In today’s issue: The 2.6.34 merge window, anonymous inodes, ATA 4KiB sector issues, cpuhogs, ext4, PCI, and USB console support.

The 2.6.34-rc1 merge window. Linus Torvalds announced the release of the first 2.6.34 RC kernel on Monday, March 8th 2010 at 12:33pm Best Coast Time (PST). In closing the merge window early, he hoped to make a point in line with previous comments on the issue of getting merge requests in in a timely fashion. Quoting Linus, “but in general the merge window is over. And as promised, if you left your pull request to the last day of a two-week window, you’re now going to have to wait for the 2.6.35 window.” According to Linus, nearly two thirds of the changes are in drivers (when factoring in 50% drivers/ code, 5% sound/ code, and 10% firmware). Of the remaining bits, about half is architectural and the rest is, well, the rest. So far, about 850 developers are involved. Linus again referred to his Fedora Nouveau rant in ending with a reference to the need to upgrade libdrm/nouveau_drv versions if using that driver.

Several architecture maintainers gave their excuses and requested pulls later, but Linus drew the line at a request from James Bottomley to pull SCSI pieces two days later, on March 10th. James noted that he had been en route back from India, nobody had told him the merge window would close early, and that the only commit added to his tree since the merge window closed on Monday was a bug fix. Linus said he was “not going to pull” and that the whole point behind closing the merge window early was because of people posting pull requests late that “should have been ready when the merge window _opened_”. James objected to the unpredictability of the merge window closing, but Linus said that “WAS THE WHOLE F*CKING POINT!”, in order to avoid last minute pull requests, and added that he would in future not even say how long the merge window was going to be in order to have requests ready the moment the window opened. Unfortunately for James, Linus wanted to make a point and he seemed to meet Linus’ criteria for doing so. Doug Gilbert later pointed out that people should not attack James just because he was the subject of “yet another Linus rant”.

Anonymous inodes. Dmitry Torokhov recently started a thread entitled “S[E]Linux going crazy in 2.6.34-rc0” (but note the corrected capitalization of “SELinux”). He was experiencing a side effect of some recent work by Al Viro, as well as others, to switch various subsystems such as inotify over to use anon inodes rather than their own “filesystem” type. Previously, inotify had used its own filesystem called simply and obviously “inotifyfs”. This allowed for SELinux rules to match on various notification events on an “inotify_t” type of filesystem. But with the trend to convert to anonymous inodes, there is no longer an easy way to write SELinux rules to confine applications (if that is what you actually want to do), and the existing rules go insane, as this author recently saw on a rawhide system that happened to be running SELinux. Eric Paris proposed two kinds of workaround: either “revert” everything back to how it used to be, or create support for differing security contexts for anonymous inodes. The latter seems more likely to happen, though the thread dried up at that point and nothing further was said on the topic until Eric Paris sent a pull request for some notify bits a week later.

ATA 4 KiB sector issues. Tejun Heo started a new thread entitled “ATA 4 KiB sector issues”, in which he lamented the current state of support for larger sector size ATA devices (those using 4K rather than 512 bytes as their natural unit of size – someone please add a comment to this article with a description for the term used to describe the natural size of a disk, its “word size”). Apparently, the transition will be “quite painful”. In his lengthy email, the gist of which is covered by an article on the kernel.org wiki at: http://ata.wiki.kernel.org/index/php/ATA_4_KiB_sector_issues, Tejun covers the issue of backwards compatibility, DOS partition table support, and that beast of beasts – Windows. Interestingly, I didn’t see a specific mention of the issue of unaligned writes when using journalled filesystems and ensuring commits have hit the disk, but I’m sure that’s covered somewhere in there. I suspect this is now required reading if you work on disk and block bits. James Bottomley added some useful notes about the lack of bootloader support, etc.

CPU Hogs. Tejun Heo posted a patchset intended to generalize the case of monopolizing a CPU (or a set of CPUs) with a single kernel thread. The cpuhog functionality can be used by any kernel code that needs to grab one or more CPUs exclusively for some period of time, such as [k]stop_machine, which does just this during module load in order to ensure that it is safe to fiddle with the kernel symbol table. For good measure, Tejun also fixes the kernel migration threads to use cpuhog while he’s at it. LWN had a writeup on this topic later, and your author has a pet project in mind that should benefit already from using this patchset. Thanks Tejun Heo!

ext4. Christian Borntraeger posted asking about e4defrag support for compatible ioctls (as in the case on his system, with a 64-bit x86_64 kernel and 32-bit IA32 userspace environment). He suggested, “[l]et[']s just wire up EXT4_IOC_MOVE_EXT for the compat case.” This led Jeff Garzik to wonder aloud what the overall status was of ext4 defragmentation support. Jeff noted that he had actually poked at defragmentation support himself in the past and was “hopeful that I will see defragging in a Linux distribution sometime in my lifetime”. Eric Sandeen noted that such support had previously been in Fedora (briefly) but was removed because he (Eric) wasn’t so happy with the code. Since I happen to know Jeff has a good many years ahead of him, one hopes that he will get to see many great things, including ext4 defragmentation. Separately, Michael Tokarev pointed out another 32-bit userspace on 64-bit kernel issue with compatible ioctls, this time affecting AIO. Jeff Moyer was on the case with an initial test patch that he could use successfully with the libaio test harness built with -m32 while he continues to work in general on further AIO cleanups for the longer term.

PCI. Alex Chiang posted an updated patch based upon some awesome work that Matthew Wilcox had done to provide sysfs PCI slot to device mapping directory entries that can be used to determine which physical slot a device is actually installed in within the chassis of a given system. This will be of use to a number of projects, including efforts to name network interfaces according to the slot they reside in (rather than their MAC address) for distributions needing to support single system images – at least, that’s one possibility that comes to mind. I have pinged a few people myself to see if this will be of use to that effort in general, and there are bound to be many more.

USB Console. Jason Wessel posted a 6 part patch series entitled “usb console imprevements series”, containing “aggregated and ported…usb patches I have previously posted which are not mainlined into a single series aimed at providing a stable [USB] console”. Jason began with a recap about what the problem with USB consoles currently is – that they are not synchronous (as opposed to regular serial UART consoles which are) and so will drop data on the floor if there is no room to buffer it when interrupts are disabled. The new code introduces intentional delay loops calculated through empirical testing using an FTDI USB part (a common part on many embedded boards, such as the BeagleBoard JTAG debugger sitting on this author’s desk).

In today’s miscellaneous items:

* some early dev_name() patches from Paul Mundt allowing early platform device code to use dev_name() before the guts of the driver core are online.

* This author was bitten by a recent bad commit from Al Viro that caused opendir() to succeed on regular files. I posted a question about it and was told that it had already been fixed. Indeed, it had.

* Ongoing debate happened about reducing the number of memory allocators in use on x86 systems, per a previous note from Ingo that there were 5 possibilities depending upon phase of boot and this needed to be reconciled.

* A rant from Finn Thain about a “coding style” fix patch for Macintosh that reduced a comment length to fit in 80 characters. Finn thought this was an utter waste of time, and repeated a comment often heard elsewhere, “checkpatch.pl is great but code that fails it is NOT always wrong.” and, ‘“Check patch” is a good idea but “check existing code” is a waste of everyone’s time’. Sometimes, cleanup patches do more harm than good; for example, a well intentioned “if” cleanup this week completely misunderstood how the indentation is supposed to work and was also summarily rejected. Ben Herrenschmidt’s only response to this mini-rant was “Amen !”.

* Mitake Hitoshi concurred with Guangrong Xiao’s posted results showing an *improvement* in performance of userspace mutexes when lock trace events were enabled. Reproducer code was posted and confirmed.

* Some useful documentation was provided on Linux’s circular buffering and memory barriers support from David Howells.

* Support for specifying, in the environment variables of a kernel emitted uevent, whether it came because of a request_firmware() or a request_firmware_nowait() request was postulated by Johannes Berg (to handle the case of built-in drivers requesting firmware not in an initramfs). Kay Sievers pointed out that many events are re-triggered during boot and so the firmware loader cannot know what state the system is in, and therefore it might be better to leave requests for unsatisfiable firmware around “forever” until they are cancelled from userspace rather than trying to cunningly work around the issue of firmware not being present in an initrd context with special uevent environment variables.

* and the jabs at SELinux security labeling continued with Al Viro coming up with a few amusing retorts in the “Upstream first policy” thread and Ingo Molnar comparing SELinux relabeling wait times to fire doors, “we should prefer a one inch thick fire door that opens and closes fully automated to a five inches thick fire door that people keep always-open with a chair”. Ingo contends that all too often, people “turn off the whole thing” because of various frustrations and so there is less overall security than might be the case with a slightly less perfect system. Dave Airlie called SELinux relabels “the new fsck” and called for journalling.

In today’s announcements:

Benchmarks. Anca Emanuel announced some new Phoronix benchmarks for kernels 2.6.24 through 2.6.33, showing that performance has generally improved by 770% from 2.6.29 to 2.6.30 and only regressed very slightly in 2.6.32. Regretfully, however, 2.6.33 does not perform nearly so well, and, according to the Phoronix quote, “PostgreSQL performance atop the EXT3 file-system has falled off a cliff”. Full details are available on the http://www.phoronix.com/ website.

RT 2.6.33-rt6. Thomas Gleixner announced the release of version 2.6.33-rt6 of the RT patchset that he and others are continuing to develop against the 2.6.33 series kernel. As he mentions, there was an -rt5, but it was more of a separation point in the git tree. With the merging of some bits into that older tag, MIPS support rejoins the RT tree thanks to Wu Zhangjin. As usual, the RT patch is available on the kernel.org website, in the section devoted to such projects, or in the head (rt/head) and stable (rt/2.6.33) branches of the “tip” tree maintained by Ingo Molnar. Details: http://www.kernel.org/pub/linux/kernel/projects/rt/

The latest kernel release is 2.6.34-rc1.

Andrew Morton posted an mm-of-the-moment (mmotm) for 2010-03-09-19-15. Hiroyuki Kamezawa posted an updated version of his OOM notifier memory cgroup patches against this latest tree. Andrew later posted an mmotm for 2010-03-11-13-13. And in other “mm” news, Mel Gorman posted the 4th version of his “memory compaction” patches.

Greg Kroah-Hartman posted some review patches for stable kernels 2.6.33.1, and for 2.6.32.10. These were subsequently released.

Finally today, Robert P. J. Day asked whether it was still worth him running his “cleanup” scripts (that look for problems with kernel config options) after each merge window closes. Randy Dunlap thought “yes”, and was even more happy that Robert had posted his scripts for him and others to use. Details: http://www.crashcourse.ca/wiki/index.php/Kernel_cleanup_scripts Robert followed up later with another email saying that most of his popular cleanup scripts have now been posted, which is great.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

2010/03/07 Linux Kernel Podcast

March 18th, 2010 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100307.mp3

For the weekend of March 7th, 2010, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Console, DRM, ext4, integrating tools, sensors, split function and data sections, union mounts, and versioning.

Console. Eric W. Biederman posted an intuitive patch for /dev/console opening, effectively ensuring that it is always available even if the root filesystem has no /dev. “This effectively guarantees that there will be a device node, and it won’t be on a filesystem that we will ever unmount”. Al Viro replied “hell yeah”, and took the patch “with thanks”.

DRM. This week’s thread length of the week prize goes to a thread entitled, “drm request 3” in which Dave Airlie tried to pull some patches into the 2.6.34 merge window. These contained, “[f]ixes for default y + CONFIG_STAGING + CONFIG_DRM_NOUVEAU enabled”. Linus wasn’t very happy when he booted with these patches (nouveau interface version 0.0.16) and saw an error message saying “[drm] wrong version, expecting 0.0.15”. This led to a rant about backwards compatibility, and that he hadn’t even been warned it would break existing user space (in his case, Fedora 12). Linus even found that the commit that introduced the breakage did so explicitly, but again noted, ‘why the hell wasn’t I made aware of it before-hand? Quite frankly, I probably wouldn’t have pulled it. We can’t just go around break people[s] setups. This driver is, like it or not, used by Fedora-12 (and probably other distros). It may say “staging”, but that doesn’t change the fact that it’s in production use by huge distributions. Flag days aren’t acceptable’. This led on to a thread in which Linus and others (including Jeff Garzik) noted that Fedora 12 was shipping this driver in “production” and so more should be done to ensure that the kernel could be tested on older systems, while others said the driver was all along a “use at your own risk” driver (Jesse Barnes). Personally, this author solved the problem by using another graphics chipset a long time ago. Daniel Stone probably had the best solution, “fuck it, it’s Friday. To the pub”.

The DRM thread also deviated into a discussion of “Upstream first” as a distro policy, and then onto specific patches in other distributions that aren’t in upstream. For example, Ubuntu carrying AppArmor. That led on to yet another tangent in which James Morris felt he was being personally attacked for the lack of the patches being upstream. Ingo Molnar (and later, Linus, who seemed to share a similar viewpoint – that there needn’t be only one security answer) decided to weigh in, noting that it had been “a few reasonable months after the last big security flamewar”, and wanting to see a “rehash or fair summary of the pathname versus labels arguments” (referring to the fact that SELinux uses file labeling and complex rules, while AppArmor uses simple file paths). Ingo feels that pathnames are a “far more fitting abstraction to any ‘human based security process’ on Linux than ‘labels’”. Ingo called out that there was a lot of security research based on labels but essentially said none of that mattered due to the difficulty of practically using label based security. Quoting Ingo again, “[i]n other words: [I] see [SEL]inux’s main failure in that it somewhat blindly aims for a security model that is sees as the technical most secure, while not being intellectually open to the fact that we very likely _cannot know in advance_ which of the models will make Linux more secure in the long run.” It would seem Ingo would like AppArmor to be less of a “hostile competitor” and more of a “natural ally” to SELinux. The idea is that there can be two different security mechanisms for different use cases.

Ext4 performance concerns. Justin Piszcz had recently raised the issue of the relative performance of ext4 for “large” writes vs. XFS. Justin was seeing almost half the write throughput when using ext4 as opposed to XFS and was concerned. After asking various questions, to which the replies included that he should use “nice” numbers of disks (e.g. 9 for the specific RAID case he was looking at) that made no difference, the thread seemed to dry up without any concrete conclusions other than that a performance issue exists and requires some further investigation using blktrace, etc.

Integrating tools. Ingo Molnar, in a thread entitled “Re: KVM usability”, made some remarks about the relative virtues of having “unified repositor[ies]” in which both the kernel and userspace tools are combined in one place, such as with the Performance Counters tools. Ingo believes that one reason why Apple can “consistently out-develop Linux” is “in part due to there not being a strict [C]hinese [W]all between the Apple kernel, libraries and applications – it’s one coherent project where everyone is well-connected to each piece”. This may be true, but it’s just as likely in this author’s opinion that Apple is benefitting from that, coupled with the fact that it owns every piece and can hand down edicts from on high about what every piece will do, and when. In any case, the thread is worth reading – it was surprisingly short given the potentially contentious comments that could have made great flamebait.

Sensors. Dima Zavin (Google) replied to Jean Delvare’s attempt to have the ALS (Ambient Light Sensors) subsystem pulled, saying that the kernel was on the road toward having one subsystem under drivers/ for ALS, one for Proximity sensors, one for Accelerometers, etc. all with similar interfaces, and that a better approach would be a single “sensors” subsystem. He offered to help work on just that. Jean was interested, but didn’t want to hold up having the ALS patches pulled, favoring reworking them later on. He was subsequently dismayed when Linus and others started asking why ALS wasn’t just using the input subsystem for events, saying that he didn’t care where the code went but that discussions had been ongoing for 5 months already and he didn’t want to hold things up for another 5 months when people decided to bring this up during the merge window rather than before. The conversation then took a tangent into different rate devices (some of these “sensors” can operate at many KHz, above what the “input” subsystem is intended for). Linus contended that these devices, just like joysticks, were input devices. The conversation appears to have stalled at this point without a resolution.

Split function and data sections. As some of you will know, various attempts have been made over the past year to add support for compiling the kernel with the GCC options “-ffunction-sections”, and “-fdata-sections”. These cause the kernel to generate one ELF section for each function or data related object, and make life very easy for optimization tools (that can remove whole sections) as well as kernel patching utilities such as Ksplice. Tim (Ksplice) Abbott was happy with the latest round of patches, though he did have some questions about the “rename kernel’s magic sections with compatibility with -ffunction-sections -fdata-sections” patch series, especially about where certain renames were being used. For example, he wondered aloud how renaming “.text.reset” to “.text..reset” would affect AVR32 systems, because he couldn’t see how the original “.text.reset” was being populated anyway (answer: it wasn’t). As Tim mentioned, he wanted input from Haavard Skinnemoen, who provided the comment on “.text.reset” amongst other feedback.
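
For anyone wondering why a rename such as “.text.reset” to “.text..reset” helps at all: with -ffunction-sections, GCC places each function in a section named after it (a function called foo lands in “.text.foo”), so a hand-named “magic” section can collide with a compiler-generated one. The sketch below models only that naming rule. The function names are made up for illustration, and nothing here inspects a real kernel build.

```python
def function_section(func_name):
    # Naming rule used by GCC's -ffunction-sections for code sections.
    return ".text." + func_name


def collisions(magic_sections, function_names):
    """Map each magic section name to the function whose generated
    section name would clash with it."""
    generated = {function_section(name): name for name in function_names}
    return {sec: generated[sec] for sec in magic_sections if sec in generated}


# Hypothetical function names, chosen only to trigger the clash.
functions = ["reset", "do_fork", "schedule"]

old_names = collisions([".text.reset"], functions)
new_names = collisions([".text..reset"], functions)

print(old_names)  # the old magic name clashes with function "reset"
print(new_names)  # the double-dot name cannot match any C identifier
```

Because a C identifier cannot begin with a dot, no function can ever generate a section name of the “.text..something” form, which is what makes the double-dot names safe.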

Union mounts. Valerie Aurora posted version 1 of an RFC patch series (against Al Viro’s for-next tree) entitled, “Union mount core rewrite”. This, as it implies, is a complete rewrite of parts of the code implementing union mounts. Val has previously written about the goals and implementation of her work in various LWN articles. Separately, Val wondered aloud whether it was now possible to have multiple read-only layers in union mounts.

Versioning. Paul McKenney posted a patch placing the SHA1 git hash of the latest commit in the kernel version line on boot if available, or “[Not git tree]” in the case that a non-git tree was used to build.
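
The behavior described for the version line can be approximated in userspace as follows. This is only a sketch and not Paul's patch: the helper name version_suffix is hypothetical, and it simply asks git for the hash of the latest commit in a source tree, falling back to the “[Not git tree]” string when that fails for any reason (no repository, or git not installed).

```python
import subprocess


def version_suffix(tree):
    """Return the SHA1 of the latest commit in tree, or a fallback marker."""
    try:
        done = subprocess.run(["git", "rev-parse", "HEAD"], cwd=tree,
                              capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        # git is missing, or tree is not (inside) a git repository.
        return "[Not git tree]"
    return done.stdout.strip()


print(version_suffix("."))
```

Run from inside a git checkout this prints a 40 character hash; anywhere else it prints the fallback marker.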

In today’s miscellaneous items:

* Large numbers of git pull requests started to come in for 2.6.34 (including everything from core kernel to networking and sound).

* Some further nested SVM patches from Joerg Roedel.

* A large number of KVM updates (including a lot of PowerPC bits, Microsoft Hyper-V patches, and some x86 emulator cleanup).

* A new “platform-drivers-x86” git tree reference was added to the MAINTAINERS file (as maintained by Matthew Garrett, who posted a pull request for the latest bits also).

* A new generic x86 “NMI Watchdog” built upon performance events from Don Zickus (by way of Ingo Molnar actually making the pull request for Don’s previously posted patches).

* Version 3 of the memory controller groups dirty page limits patches from Andrea Righi.

* An affirmation from Andrew Morton that the “Linux Checkpoint-Restart” patches could be posted to LKML following 2.6.34-rc1 (Oren Laadan also mentioned how the patches will refuse to do a checkpoint if they believe they cannot do so safely, reporting this back to userspace).

* The latest “compat-wireless” tree for stable kernel (2.6.32) users that contains the latest 2.6.33 bits from Luis R. Rodriguez.

* Version 3 of a patch series providing for 512KB readahead rather than 128KB from Fengguang Wu.

* Various trivial and staging patches from Greg Kroah-Hartman (as an aside, Alan Stern raised some concerns about the way Greg’s scripts generate those patches).

* A request to pull the Ceph distributed file system client into 2.6.34 (along with various input about changes made since the 2.6.33 merge request) from Sage Weil.

* Some Performance (perf) Counters “live mode” patches from Tom Zanussi that allow perf data to be directly processed as it is captured “without ever touching the disk”.

* Some paravirt (PV) extension patches for HVM (Hybrid virtualization support) in Xen from Sheng Yang.

* Ted Ts’o complained about dynamic device filesystems with initramfses in a mini-rant about how 2.6.33 could not boot with an LVM root on his Ubuntu 9.10 userspace. He added that, “of course, the initrfamfs environment is so crappy that there are no debugging aids — not even a working pager”.

In today’s announcements:

Git 1.7.0.2. Junio C Hamano announced the latest maintenance releases of Git, versions 1.7.0.{1,2}. The .2 release adds a few minor patches on top of .1, including a fix for GIT_PAGER support. Whether or not Git is technically an SCM, I will stop using that term in this podcast, following some feedback from listeners.

LTP. The Linux Test Project was released for February 2010. The latest release comes with a reminder that there “has been multiple changes for building/installing the test suite after the recent changes in Makefile infrastructure”. This month’s release didn’t come with any corrupt script warnings.

Userspace RCU 0.4.2. Mathieu Desnoyers announced version 0.4.2 of his Userspace RCU “urcu” library. It includes some patches from Paolo Bonzini adding generic uatomic ops support for architectures not explicitly supported by liburcu, including (effectively free) support for IA64 and Alpha when using GCC versions 4.0-4.5, and a bugfix in urcu-bp, which is the “User-Space Tracing” version of the urcu library. Mathieu has asked me to point out that a patent exemption was made to cover use of RCU in LGPL code such as urcu, so my previous comments about GPL patent concerns were a little too severe.

The latest kernel release was 2.6.33.

Andrew Morton posted an mm-of-the-moment (mmotm) for 2010-03-04-18-05.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

Categories: episodes Tags:

2010/02/28 Linux Kernel Podcast

March 18th, 2010 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100228.mp3

For the weekend of February 28th, 2010, I’m Jon Masters with a summary of the week’s LKML traffic.

In today’s issue: Linux 2.6.33, ACPI, Cgroups, Checkpoint and Restart, OF Device Tree, Firmware, and x86 embedded.

Linux 2.6.33. Linus Torvalds announced the final release of 2.6.33 on Wednesday February 24th at 12:06pm Best Coast Time (PST). The final release includes a relatively small number of final fixes on top of rc8. As Linus says, the most notable thing may be the Nouveau integration and modesetting support. Others may notice the mainlining of DRBD and the fact that the AS IO scheduler is now gone (“since keeping it around and just causing confusion seemed to not be worth it any more. You’re supposed to use CFQ instead”). Daniel Walker asked Linus whether he still planned to try a one week merge window this time, to which Linus said, “No. But I might do a ten-to-twelve day thing or something like that – just to make sure that anybody who tries to game the system and send their merge request late will get summarily ignored. So I’m going to stop being so predictable that people can tell that exactly two weeks after the last release is where the merge window closes, and if people want to make sure their stuff merged, I had better have a merge request in my inbox earlier than thirteen days after the release.” The pull requests started pretty much immediately, and with the usual vigor. Separately, Con Kolivas announced 2.6.33-ck1, which includes his BFS scheduler and various other “desktop” focused bits.

ACPI. Rafael J. Wysocki posted an RFC patch concerned with removing race conditions from ACPI event handlers. The first race concerns the execution of handlers while they are being removed, the second is a locking issue.

Cgroups. Andrea Righi posted an intriguing RFC patch series intended to provide per-cgroup dirty page limits. The idea is that the maximum number of dirty pages a cgroup is allowed to have can be limited, and if a cgroup exceeds this count, it will be forced to perform write-out immediately.

Checkpoint and restart. Oren Laadan posted version 19 of his “Linux Checkpoint-Restart” patchset. As a reminder, these patches are intended to allow systems to handle failures by taking whole system checkpoints and restarting all activity from that point in the event of failure. The latest patchset is intended to address previous concerns from Andrew Morton and others, and is apparently able to checkpoint and restart both screen and vnc sessions, and support live migration of network servers between hosts. The project has a checklist of TODOs on its wiki: http://ckpt.wiki.kernel.org/.

OF Device Tree. Grant Likely asked Linus to pull in his OF device tree rework for 2.6.34. Grant has recently been working on ARM support, in addition to the PowerPC, Microblaze, and SPARC changes covered in this pull. Hopefully, OF device tree emulation will finally provide one mechanism for supplying data to the kernel that can be common across many different architectures, in addition to those that do “real” OpenFirmware in the vendor firmware.

Firmware. There was some discussion about kernel firmware versioning, and whether kernel firmware should be wrapped in a container format making it more suited to SO library style versioning. This happened in response to the folks behind the open sourcing of the Atheros WiFi firmware seeking advice on the best way to handle compatible and incompatible versions. David Woodhouse has advocated for the use of more library-like versioning, but was not a big fan of introducing the complexity of such wrappers. In the end it was decided that the kernel developer maintained linux-firmware package should provide firmware files of the form foo-$(API). Those wanting a sub-versioned file like foo-$(API)-$(VAR) could provide one if they so wish.
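
The naming scheme settled on in that thread can be sketched as a simple lookup order. This is only an illustration, assuming a driver that prefers a sub-versioned file and falls back to the plain API-versioned one; the helper name and the fallback order are hypothetical and are not taken from linux-firmware or from any driver.

```python
def firmware_candidates(base, api, variant=None):
    """Return firmware file names to try, most specific first.

    Hypothetical helper: 'base', 'api' and 'variant' stand in for the
    foo, $(API) and $(VAR) parts of the naming scheme discussed above.
    """
    names = []
    if variant is not None:
        # Optional sub-versioned file: foo-$(API)-$(VAR)
        names.append(f"{base}-{api}-{variant}")
    # The file the linux-firmware package is expected to ship: foo-$(API)
    names.append(f"{base}-{api}")
    return names


print(firmware_candidates("foo", 2, 1))  # ['foo-2-1', 'foo-2']
print(firmware_candidates("foo", 2))     # ['foo-2']
```

The point of the scheme is that an incompatible change bumps the API part of the name, so an older driver never picks up a file it cannot talk to.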

x86 embedded. Graeme Russ posted a very detailed and well reasoned description of his embedded x86 port, which is not in any way based upon PC hardware, in which he uses U-Boot to transition to 32-bit Protected Mode and directly calls the kernel’s “32-bit BOOT PROTOCOL” described in Documentation/x86/boot.txt. He was having some issues, though, handling kernel relocation, which turned out to be due to differences between the documentation of the bzImage format and the current reality. Peter Anvin was his usual, very helpful self.

In today’s miscellaneous items: A fix for SPARC32 from Rob Landley (apparently, SPARC32 has been broken since 2.6.28, which isn’t surprising since this author and most other Linux SPARC users seem to be running SPARC64 kernels), various debugging from Thomas Gleixner and John Kacur on the recent 2.6.33 RT patch, version 6 of a patch series intended to add lockdep-based diagnostics to rcu_dereference() from Paul McKenney, a series of PPS implementation patches from Rodolfo Giometti (useful for those needing accurate time sources on a serial line), a patch to increase readahead size to a default of 512K from Fengguang Wu (the previous default was 128K), a bunch of s390 updates for 2.6.33 final from Martin Schwidefsky (including kernel image compression “finally…after only 10 years”), some patches intended to document the rfkill sysfs ABI from Florian Mickler, some more nested SVM (virtualization within virtualization on AMD compatible systems) from Joerg Roedel intended to aid running Microsoft Hyper-V with nested SVM (which doesn’t quite work yet even with these according to Joerg), a number of rather cool gdb and early debug updates from Jason Wessel (who has now split kdb and early debug out into two separate trees), version 4 of the “concurrency managed workqueue” from Tejun Heo, a discussion about order 1 allocation failures started by Frans Pop (the failures were under GFP_ATOMIC, but Frans felt that they were particularly ugly given plenty of cache was available for reclaim), David Howells proposed removing EXPERIMENTAL from NFS_FSCACHE in order that it could be compiled into the standard Ubuntu kernel (since, as he says, “As Arjan van de Ven pointed out…the EXPERIMENTAL flag doesn’t mean that much any more”), and a lengthy discussion of linux-next “requirements” that is worth reading, if you have the time.

In today’s announcements:

iproute2. Stephen Hemminger announced release 2.6.33 of the iproute2 utilities that “includes bug fixes and support for all the new features in kernel 2.6.33. This integrates a number of minor bug fixes from Debian as well”. The update is available at http://devresources.linux-foundation.org/.

RT 2.6.33-rt4. Thomas Gleixner announced version 2.6.33-rt{2,3,4} of the RT kernel patchset. This updates to Linus’ latest tree and includes a number of fixes to bugs reported by John Kacur and others. It is available from the usual location: http://www.kernel.org/pub/linux/kernel/projects/rt/ Thomas noted that “rt/2.6.33 branch is now stabilization only. The rt/head branch will follow linus tree from now on, so it will inherit all (mis)features which come in the merge window.” Separately, John Stultz announced that he had forward ported Nick Piggin’s VFS scalability patches to 2.6.33-rc8-rt2, and that it applies to 2.6.33 without any collisions. He requested feedback as he had yet to do any serious stress testing with the patchset.

The latest kernel release was 2.6.33.

Greg Kroah-Hartman released an updated stable Linux 2.6.32.9.

Finally today, Mikael Abrahamsson suggested that some TLC be given to the Wikipedia article on the Linux kernel as it “doesn’t even mention the new -rc system” (in the “development model” section of the article). He wondered if anyone who knew exactly what was going on could write up the new world order on that wiki page for the rest of the world to see. That does not seem to have happened as of this writing.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

2010/02/21 Linux Kernel Podcast

March 15th, 2010 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100221.mp3

For the weekend of February 21st, 2010, I’m Jon Masters with a summary of the week’s LKML traffic.

In today’s issue: AMD TSC, anon_inode flags, extents, LSI MegaRAID, md RAID, SSE, UML, and XZ.

AMD TSC. Mark Langsdorf (AMD) posted a patch entitled “Option to synchronize P-states for AMD family 0xf”, in which he reminded readers that AMD Family 0xf processors (that is, AMD Athlon 64s and AMD Opterons) do not have P-State and C-State invariant TSCs – that is to say, the TSC increments at the current frequency of the CPU core, and not at some fixed frequency that would be more useful to those using it as a timing source. It is nonetheless possible to scale the TSC readings to be used as a time source, if all CPUs in the system adjust their frequency at the same time and by the same amount. To do this, Mark modifies the PowerNow! driver with a new “tscsync” parameter. He reminds us that there are many other possible clock sources in a system, but customers want something particularly lightweight in some situations, like the TSC.
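
To see why scaling is needed, here is a small arithmetic sketch. It assumes we know how many ticks were counted at each core frequency; the function name and the sample numbers are made up for illustration, and this is not how the PowerNow! driver does its bookkeeping.

```python
NS_PER_SEC = 1_000_000_000


def tsc_to_ns(segments):
    """Convert TSC ticks to elapsed nanoseconds.

    'segments' is a list of (ticks, freq_hz) pairs: the number of TSC
    ticks counted while the core was running at freq_hz.
    """
    total_ns = 0
    for ticks, freq_hz in segments:
        # Ticks are only comparable once divided by the frequency
        # that was in effect while they were counted.
        total_ns += ticks * NS_PER_SEC // freq_hz
    return total_ns


# One second at 2 GHz followed by one second at 1 GHz.
segments = [(2_000_000_000, 2_000_000_000), (1_000_000_000, 1_000_000_000)]
scaled = tsc_to_ns(segments)

# Naive reading: pretend every tick was counted at 2 GHz.
naive = sum(ticks for ticks, _ in segments) * NS_PER_SEC // 2_000_000_000

print(scaled)  # 2000000000
print(naive)   # 1500000000
```

The naive reading loses half a second here, which is the kind of drift that per-frequency scaling avoids, provided every CPU changes frequency together.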

anon_inode flags. Matt Helsley noted that existing anon_inode interfaces often do not support flags that can be set by using fcntl(). He proposed a series of 4 patches to signalfd, timerfd, epoll, and eventfd that would allow the same flag behavior as their corresponding creation syscalls. Davide Libenzi, the original author of the anon_inode bits, signed off.

Extents. Jari Sundell reported an issue with sparse files on ext4 in which many extents nonetheless sequentially placed on disk were not merged by the filesystem. This manifested in the form of 3000 or more extents for a 250MB bittorrent download file (aside: bittorrent pulls many file pieces at once from many different sources and so relies heavily on sparse files).
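
The merging Jari expected can be sketched in a few lines. This is a toy model, assuming each extent is a (logical start, physical start, length) triple measured in blocks; it is not ext4’s implementation, which has to respect limits such as a maximum extent length.

```python
def merge_extents(extents):
    """Merge extents that are contiguous both logically and physically.

    Each extent is (logical_start, physical_start, length) in blocks.
    """
    merged = []
    for logical, physical, length in sorted(extents):
        if merged:
            last_logical, last_physical, last_length = merged[-1]
            # Two extents can be merged only if the second one starts
            # exactly where the first ends, in the file and on disk.
            if (last_logical + last_length == logical
                    and last_physical + last_length == physical):
                merged[-1] = (last_logical, last_physical,
                              last_length + length)
                continue
        merged.append((logical, physical, length))
    return merged


# Three pieces written back to back, plus one piece elsewhere in the file.
pieces = [(0, 1000, 8), (8, 1008, 8), (16, 1016, 8), (64, 5000, 8)]
print(merge_extents(pieces))  # [(0, 1000, 24), (64, 5000, 8)]
```

In the reported case the on-disk layout was sequential, so most of the 3000 or more extents would collapse in a model like this one.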

MegaRAID. LSI posted to let everyone know that they were interested in an overhaul of the MegaRAID driver to support future HBAs. Rather than make a lot of changes to the existing code, they were interested in, and were encouraged to, create a new driver for the newer parts. Matthew Wilcox may have detected a hint of the reasoning behind why they had been a little reluctant to give up on a single heavily hacked driver, and suggested an approach that could be used to “make your management happy” by effectively combining two drivers into a single object file, with two separate sets of PCI tables being handled and different functions within. Whatever the eventual decision, the thread ended there with no followup.

md. Justin Piszcz started a discussion thread entitled “Linux mdadm superblock question”, in which he asked about RAID superblock types. The older version 0.90 superblock format supports autoassemble within the kernel, whereby the kernel can automatically create the appropriate RAID device without having to use tools within an initrd/initramfs (an initramfs is not required in that case; otherwise it is, if you want to use RAID). Justin wanted to know whether there were any benefits for a < 2TB RAID1 boot volume in moving to a higher versioned superblock without autoassemble support.

The conversation led Peter Anvin to point out some issues with a recent change in mdadm, which now apparently creates 1.1 version superblocks by default. Peter noted that the 0.90 superblock format doesn’t make it possible to easily distinguish RAID partitions from whole volume RAID devices, but the problem migrating to 1.1 is that 1.1 uses the bootblock for its superblock and so can cause problems with bootloaders such as grub that result in people having to regenerate their entire disk if they want to easily boot with it. Version 1.2 of the md RAID superblock uses the same 1.1 superblock format but at a different location than the bootblock, and so Peter favors a default of using 1.0 or 1.2, but not 1.1 as the mdadm default.
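
Peter’s preference is easier to follow with the superblock placements side by side. The table below is a sketch: the placements are as described in the mdadm(8) manual page, while the dictionary and helper names are made up for illustration.

```python
# Where each md superblock version lives on the member device,
# as described in the mdadm(8) manual page.
SUPERBLOCK_PLACEMENT = {
    "0.90": "end",
    "1.0": "end",
    "1.1": "start",          # offset 0, i.e. on top of the boot block
    "1.2": "4K from start",  # same format as 1.1, moved out of the way
}


def overlaps_boot_block(version):
    """True if the superblock sits where a bootloader expects its code."""
    return SUPERBLOCK_PLACEMENT[version] == "start"


# Version-1 formats that leave the boot block alone.
safe_defaults = [v for v in SUPERBLOCK_PLACEMENT
                 if v.startswith("1.") and not overlaps_boot_block(v)]
print(safe_defaults)  # ['1.0', '1.2']
```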

The entire md RAID thread is worth reading because it took a tangent off into a lengthy debate about the merits of using (or being required to use) initramfses, time taken to boot using an initramfs (or if not using one – the plan is to remove autoassembly from the kernel for good, so good luck booting without an initramfs if you want RAID in the longer term), and tools such as AEUIO that can build a customized initramfs image. Of course, every distro and his dog have also re-invented initramfs creation.

SSE. There’s a long-standing philosophy of avoiding floating point (FP) or other general usage of optional compute units such as SSE, SSE2, and so forth from within the kernel itself. Using these units requires saving state, and that isn’t typically done (for performance reasons). However, these optional units can often handle very large word sizes and so can be useful for those seeking to optimize existing kernel routines. Luca Barbieri posted, starting a new thread entitled “use SSE for atomic64_read/set if available” to do just that on x86-32 systems as an alternative to some of the more complex code being used today (including disabling pre-emption very briefly). Peter Anvin and Luca got into a somewhat lengthy debate about FPU etiquette (especially with regard to Peter’s view that kernel_fpu_begin() and kernel_fpu_end() be wrapped around kernel calls to the FPU, and Luca’s view that this expensive state change could be skipped in the case that only specific registers need to be saved and restored in such situations as in his patch). Peter Zijlstra, though not objecting to a cleanish implementation, suggested that one might want to “run a 64bit kernel already”. In the end Luca decided to re-write his other patches explicitly in assembly to avoid future complications with GCC changes, and to hold off on the SSE piece in question until another day.

UML. Remember the work a few weeks back to bring initial task userspace stack sizes in line with those permitted by rlimit? Well it turns out that the patch was a little too restrictive and was causing UML (User Mode Linux) to segfault on startup. The issue was raised by a number of people, including Adam Nielsen, who was also told that it is not possible to run 32-bit UML instances on a host 64-bit kernel or vice versa. They must match.

xz. Discussion continued on the potential for migrating kernel.org over to use xz format compressed files. Phillip Lougher offered some defense of the venerable gzip format, emphasizing its cross-platform nature (there are even completely separate implementations available in Java for the inclined), and Andi Kleen pointed out the relative availability of tools that handle gzip files or bzip2 vs. xz, but others seemed to agree that various contrived scenarios not that relevant directly to kernel developers don’t warrant holding off an eventual migration to some better compression format.

In today’s miscellaneous items: An updated version of the OOM killer rewrite was posted by David Rientjes (including a patch that treats tasks running on different sets of CPUs as unlikely to be interfering with one another), the third round of KVM patches for 2.6.34 from Avi Kivity (including 1GB page size support, and an initial implementation of “Hyper-V” support for those desperate enough to need or want to run a Microsoft virtual machine guest), some seqlock implementation cleanups from Thomas Gleixner, a “foruth [sic] general posting of the newest version of the AppArmor security module” that is essentially a rewrite of the existing AppArmor code to use the existing hooks in the LSM security infrastructure rather than custom VFS patching, Grant Likely posted “basic ARM device tree support” (yaaaay!), Denys Vlasenko posted another attempt at supporting split out function and data ELF sections (one section per function or data item – something that is great for Ksplice), and Microsoft revived their work in Hyper-V recently (Hank Janssen seems to be trying really really hard to do the right things).

In today’s announcements:

Gujin 2.8. Etienne Lorrain announced a new release of the Gujin bootloader. It has some really nice options for device emulation, El-Torito emulation for booting Live-CD images, and a lot more besides.

RT patchset 2.6.32.12-rt21. Thomas Gleixner announced an updated RT patchset containing “fixes and cherry-picks from all over the place”, as well as some tracer fixes. The short log includes two scheduler fixes, some futex fixes, and some architectural stuff for ARM support.

RT patchset 2.6.33-rc8. Thomas Gleixner also announced the first RT release for the 2.6.33 stable series kernel. Thomas says he is pretty excited about the stability of this latest patch series, and the overall patch size is still falling quite considerably. He ends, “We are zooming in, but there is still a way to go”.

util-linux-ng 2.17.1. Karel Zak announced the release of util-linux-ng 2.17.1. This latest release includes an option to fdisk to disable DOS-compatible mode from the command line.

The latest kernel release was 2.6.33-rc8.

Finally today, the end of an era. Christine Caulfield announced that she is orphaning DECnet support in the kernel, due to “lack of time, space, motivation, hardware and probably expertise”. Apparently, “judging from the deafening silence on the linux-decnet mailing list [she] suspect[s] it’s either not being used anyway, or the few people that are using it are happy with their older kernels”.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

2010/02/14 Linux Kernel Podcast

February 17th, 2010 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100214.mp3

This podcast is brought to you by the colour blue and way too much coffee, together reminding you to check out the awesome power of the BeagleBoard Open Source hardware project at http://www.beagleboard.org/. My new Rev C. board was responsible for the delay getting this issue out…too much fun was had.

For the weekend of February 14th, 2010, I’m Jon Masters with a summary of the week’s LKML traffic.

In this issue: Linux 2.6.33-rc8, x86 bootmem, NFS, OOM, Performance Counters, Relaxation, Stack Sizes, and SysFS mutability.

Linux 2.6.33-rc8. Linus Torvalds announced the release of version 2.6.33-rc8 on Friday February 12th 2010 at 11:49 am Best Coast Time (PST), saying that he hoped it would be the last before 2.6.33 final. He added that, “A number of regressions should be fixed, and while the regression list doesn’t make me _happy_, we didn’t have the kind of nasty things that went on before -rc7 and made me worried”. This kernel includes fixes for the netfilter bugs that I discovered, as well as some KMS regression fixes. In a separate discussion thread started by John Hawley (warthog9), it was debated when kernel.org should move over to using xz (LZMA2) as a replacement for bzip2 compression (remember when bzip2 was trendy and new?). John proposed various migration options before the thread veered off into a discussion around when an eventual 3.0 Linux kernel would come, and what that would actually mean in practical terms – just an arbitrary future release? I expect that LWN will have a typically witty writeup of this discussion sometime this week.

Bootmem. Back in October last year, Ingo Molnar had stated that the kernel may not need the “bootmem” allocator on x86. At the time, he noted that there were 5 different allocators on x86, depending upon the boot stage (to say nothing of the other core allocator options): the generic allocator, the early allocator (bootmem), the very early allocator (reserve_early), the very very early allocator (early brk model), and the very very very early allocator (basically just build time allocation). By initializing the x86 page allocator earlier in the boot process, Yinghai Lu attempts to do just what Ingo had suggested, now in version 6 of his patchset.

NFS. Hirofumi Ogawa noticed (2.6.33-rc6) that recent kernels could not mount remote NFS version 3 shares, because of a userspace visible change in the kernel nfsd server. If he specified “vers=3” at mount time, all was well, but the kernel was not falling back to v3 correctly when v4 fails due to a change in error handling. Bruce Fields noted that this change was actually intentional and that the userspace tools had been updated, but decided to revert the patch that caused this change for the time being – at least until the new versions of the mount tools are much more widespread than right now. Bruce sent a patch entitled (“informingly”) “2.6.33 fix” to Linus.

OOM. David Rientjes posted a patchset re-implementing the OOM killer, in the wake of a number of discussions concerning its brokenness. It includes a complete rewrite of the badness() heuristic, which he then describes in some detail within the corresponding patch. Quoting David, “The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of ‘allowable’ memory. ‘Allowable,’ in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current’s cpuset, or a memory controller’s limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space.”
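
The baseline David describes boils down to one proportion. The sketch below covers only that baseline, under the assumption that usage and the “allowable” total are both counted in pages; the function and parameter names are illustrative, and any further adjustments made by the real patch are not modelled.

```python
def badness(rss_pages, swap_pages, allowable_pages):
    """Proportional badness score on the 0..1000 scale quoted above."""
    if allowable_pages <= 0:
        raise ValueError("allowable_pages must be positive")
    # Share of allowable memory the task holds in RAM plus swap,
    # expressed in thousandths.
    score = (rss_pages + swap_pages) * 1000 // allowable_pages
    # Clamp to the documented range.
    return max(0, min(1000, score))


# A task holding half of the allowable memory scores 500.
print(badness(rss_pages=400_000, swap_pages=100_000,
              allowable_pages=1_000_000))  # 500
```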

Performance counters. Christoph Hellwig had complained that a patch had been merged back in September from Arjan van de Ven entitled “perf_core: provide a kernel-internal interface to get to performance counters”. That was intended to facilitate in-kernel use of the performance counters framework, but it was Christoph’s opinion that it had no users and should be reverted. Ingo Molnar countered that there actually were a growing number of users, now including the latest work by Don Zickus to create a generalized NMI watchdog handler.

Relax. Michael Breuer posted an interesting analysis of the implementation of the function cpu_relax on x86 systems. This function is called during spinlock spinning cycles in order to give the CPU a break (power management, etc.). Apparently, that function currently uses a nop, but both the Intel and AMD documentation recommend the PAUSE instruction instead (partly because it can be detected on recent CPUs and used to give special treatment to guest instances running under virtualization that are wasting CPU cycles when multiple vCPUs are allocated and some are spinning away). Arjan van de Ven, and others too, seemed to find this odd, and Artur Skawina wondered if this might be an odd alignment issue. Nonetheless, Michael detects a noticeable performance impact in various tests between these two instructions.

Stack sizes. The kernel contains various task startup code that will create a vma region for its stack use. Existing kernels make this size determination based upon the PAGE_SIZE for the architecture, even though this really is independent of the userspace code that will use the stack, and even given existing rlimits that might see the stack theoretically larger than has been allowed by system limits. Michael Neuling sent a patch to decouple stack sizing from PAGE_SIZE and to default to basing it upon the rlimit.

SysFS. Amerigo Wang posted an RFC patch implementing “mutable sysfs files”. The basic idea is that all potentially “mutable” (that is to say, files that may be yanked out from underneath at any time a hotplug or other operation occurs) files should use a specific API to avoid warnings.

In today’s miscellaneous items: An interesting discussion started by Salman Qazi (Google) centered around a misunderstanding of the ptrace API (and eventual iteration from Oleg Nesterov that the existing API sucks), a January XFS update from Christoph Hellwig (noting new support for netlink provided quota communication, better power saving in XFS kernel threads), Mel Gorman posted version 2 (v2r12) of his “Memory Compaction” patch series that is intended to “defragment” memory by reconciling GFP_MOVABLE pages, and another one of Al Viro’s entertaining rants, this time about pohmelfs and its use of direct access to the current->fs->{root,mnt} entries.

In today’s announcements:

Git version 1.6.6.2. Junio C Hamano announced an update to the 1.6.6 series of the Git SCM tool, releasing version 1.6.6.2. This contains a few fixes.

Git version 1.7.0. Junio C Hamano also announced version 1.7.0 of the Git SCM had been released. This is the latest official version and includes a number of behavioral changes to “git push”, “git send-email”, and other commands as previously noted in this podcast. Users should read the release notes before upgrading if they want to make sure they catch all of the improvements.

Linux 2.6.32.8. Greg Kroah-Hartman, apologizing for the slight delay due to a few crashes that had been reported and a need to verify a security fix, as well as various travel plans, announced the release of 2.6.32.8. It contains a few fixes 2.6.32 users really should have on their systems.

The Linux Storage and Filesystems Summit. James Bottomley announced that the annual Linux Storage and Filesystems summit will take place concurrently with the VM summit on the two days before LinuxCon in Boston (Sunday and Monday), on the 8th and 9th of August. Interested parties can visit either the Linux Foundation website, or email agenda topics to the program committee at lsf10-pc@lists.linuxfoundation.org.

Userspace RCU 0.4.1. Mathieu Desnoyers announced the latest release of his Userspace RCU implementation (remember, patent encumbered, but with a waiver for GPL projects). Version 0.4.1 contains a compilation fix for s390.

As a followup to last weekend’s kerneloops statistics, Arjan van de Ven also posted statistics purely for 2.6.33 at that time. In his statistics, he showed that the most popular oops was in memcpy_toiovecend (found 391 times).

The latest kernel release is 2.6.33-rc8.

Andrew Morton announced an mm-of-the-moment mmotm for 2010-02-11-21-15.

Don’t forget to read my latest blog posting on jonmasters.org for more information on using the Cyclades TS-3000 with kgdb for remote target debugging, and don’t forget to support Jason Wessel’s proposed kgdb and kdb merge for 2.6.34. You know it makes sense to get this out there widely.

That’s a summary of the week’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

2010/02/07 Linux Kernel Podcast

February 10th, 2010 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100207.mp3

This podcast is brought to you by the awesome power of Jason Wessel’s kgdb patches, helping to support those who believe in kernel debuggers find hard to reach kernel bugs since 2009. Kernel debuggers: the way of the future.

For the weekend of February 7th, 2010, I’m Jon Masters with a summary of the week’s LKML traffic.

In today’s issue: Linux 2.6.33-rc7, regressions, Google Summer of Code, IMA, OOM, and sys_membarrier.

Linux 2.6.33-rc7. Linus Torvalds announced the 2.6.33-rc7 release of the Linux kernel on Saturday, February 6th, 2010 at 2:44pm (14:44) Best Coast Time (PST). In his announcement, Linus remarked, “I have to admit that I wish we had way fewer regressions listed by this time, so I hereby would like to point every developer to” a link to a recent post to the linux wireless mailing list archive on gmane.org showing a copy of a recent email from Rafael J. Wysocki detailing known kernel regressions between 2.6.32 and 2.6.33-rc6 as posted originally to the LKML. He added, “But we’ve certainly fixed a few things, and it’s been a week, so here’s -rc7”. Most of the changes are in PowerPC defconfigs (default configs), but there are even more i915 updates, radeon KMS updates, and lots of other smaller bits all over the tree. Linus also wondered (in another email) whether it was worth making the .gz files any more given that bzip2 has been around more than long enough by now. Some thought the gzip files were still useful on systems without bzip2 or for some really slow systems that apparently handle gzip files more easily.

Regressions. Rafael J. Wysocki followed up to Linus’ 2.6.33-rc7 announcement (as he had also done with 2.6.33-rc6) with a list of outstanding regressions between 2.6.32 and 2.6.33-rc7. There are currently 20 “unresolved” issues in the list of regressions given. Rafael also noted that Maciej Rutecki has, “generously volunteered to work on the tracking of kernel regressions”. The work done by Rafael (and now, hopefully Maciej also) is very valuable to the community and we really do owe them our gratitude for helping out. Arjan van de Ven also posted a list of oops and warning reports on kerneloops.org from the week, including a very common ext4/quota issue in Fedora.

Google Summer of Code. Luis Rodriguez stated that, “Google has confirmed it will have a Google Summer of Code for 2010”, then mentioned that last year’s effort (4 suggested projects, of which 3 were accepted) resulted in only one success. Witold Sowa followed up saying that he didn’t know he was the only student who completed his project, but that the work to add an AP mode to NetworkManager, “with use of wpa_supplicant’s newly developed AP mode” was relatively easy to accomplish and so he had worked on other things also. Apparently, the initial GSoC work is now available in NetworkManager. Nonetheless, it sounds as if Luis is keen to see a higher than 33% success rate if any entries are accepted this year under the Linux Foundation.

IMA. Mimi Zohar replied to an email from Shi Weihua concerning a NULL pointer dereference bug in the IMA security code (ima_file_free), which Al Viro and others had previously discussed solutions for.

OOM. Lubos Lunak and David Rientjes resurrected the OOM killer discussion again after Lubos posted some analysis of various KDE processes running on his system, and wondered why the OOM killer uses VmSize rather than RSS to determine tasks that should be killed (in other words, why should it not favor tasks actually resident in memory at the time?). This discussion has been had recently, and David Rientjes explained that the kernel favors overall VmSize in its calculations so as to catch memory leakers as a preference (which are often not resident at the time). David did seem to like the suggestion of catching the child with the highest badness calculation before killing its parent, and posted an untested patch. He also suggested that the KDE process tree example was “a textbook case for using /proc/pid/oom_adj to ensure a critical task, such as kdeinit is to you, is protected from getting selected for oom kill”. Lubos replied with some very good points about how simply setting oom_adj doesn’t scale, and Balbir Singh was amongst those still favoring a switch to RSS-like accounting but with support for shared pages (for example “PSS”) eventually. Rik van Riel noted that he had no strong opinion one way or the other. David posted various patches proposing an alternative fine grained oom_adj mechanism.

sys_membarrier. Mathieu Desnoyers posted a three part patch series implementing sys_membarrier, a new system call that can be used to “distribute the overhead of memory barriers asymmetrically”. In particular, he wants it for his urcu userspace RCU implementation (for use within the synchronize_rcu call). Sensibly, Mathieu proposes incremental additions to each architecture (even though he believes that it “should be portable to other architectures as-is”), reserving the system call numbers now, then implementing gradually.

In today’s miscellaneous items: Matti Aarnio posted to let everyone know that a recently discovered hole in the Bayesian filtering system used by the vger.kernel.org mailing list server to reduce spam has been plugged (it had been possible to reach the list using a specific “backend” majordomo domain), Catalin Marinas decided to simply patch the USB HCD driver that had resulted in cache coherency problems when using USB storage (and noted that a followup posting to linux-arch would call for a flush_dcache_range function), Christoph Hellwig posted some miscellaneous rewrites of obsolete syscall handlers to use generic versions, Stefani Seibold requested an opinion on merging the kFIFO rewrite in 2.6.34, Nigel Cunningham reported a potential issue with the kernel implementation of LZO compression (for which he will switch back to LZF in TuxOnIce for the moment), Stephen Rothwell wondered aloud whether Linus would really be interested in taking the percpu changes currently sitting in percpu “next”, and Mathieu Desnoyers announced he is switching email from his academic address in Montreal (where he recently completed his PhD around LTTng) to a consulting firm he is involved with at http://efficios.com.

In today’s announcements: Greg Kroah-Hartman posted review patches for the 2.6.32.8 stable series kernel.

Scott James Remnant announced the release of upstart version 0.6.5. It includes a large number of fixes, amongst which is the completion of the splitting out of libnih into its own project. There is a new /sbin/reload command for reloading upstart daemons, a restored sync() before reboot, improved documentation, and more goodies.

Junio C Hamano announced version 1.7.0.rc2 of the Git SCM, which includes a number of forthcoming behavior changes as mentioned in this podcast when discussing the rc1 release from the previous week.

Subrata Modak announced that the Linux Test Project (LTP) for January 2010 has been released. It now contains over 3000 tests. Separately, Garrett Cooper noted a rather severe bug in the top level LTP Makefile that could result in an “rm -rf /” in the wrong circumstances, suggesting that all LTP users comment out three lines from that file.

Willy Tarreau (re-)announced the release of 2.4.37.9. The previous 2.4.37.8 hadn’t actually contained the required e1000 backport with the CVE fix that had triggered that release. Willy noted, “I don’t know how I managed to do that because it once was OK and I could successfully build it. Well, whatever I did, the result is wrong and the issue it was supposed to fix is still present in 2.4.37.8. So here comes 2.4.37.9 with the real fix this time”.

The latest kernel release is 2.6.33-rc7.

Andrew Morton posted an mm-of-the-moment (mmotm) for 2010-02-03-20-09.

That’s a summary of today’s Linux Kernel Mailing List traffic. For further information, visit www.kernel.org. I’m Jon Masters.
