Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20091101.mp3
If at first you don’t succeed. Welcome to version 2.0 of the LKML summary podcast. In this revamped version I will concentrate on the major issues under discussion on a given day, rather than commenting on every single patch, which had become an unsustainable load. I am still interested to hear from volunteers who might help to make the podcast workload less challenging on a daily basis.
For the weekend of November 1st 2009, I’m Jon Masters with a summary of the weekend’s LKML traffic.
In today’s issue: Fanotify, FatELF, Futexes, KVM, Memory Overcommit, Regressions, and Thread Naming.
Fanotify. Eric Paris posted a patch series implementing a new file mode entitled FMODE_NONOTIFY, which can only be set by the kernel itself. Its job is to indicate that an fd was opened by fanotify itself and should not cause future fanotify events. This avoids the livelock that would otherwise occur when a fanotify close event results in another open of the same file, which is then closed and causes yet another event to be emitted.
FatELF. Ryan C. Gordon posted what he hoped would be his final round of FatELF patches. These extend the Linux kernel’s ELF binary loader to accept “FAT” images containing multiple ELF binaries, allowing for such features as multi-arch code encapsulated within a single binary. In some respects, the feature behaves similarly to Apple’s Universal Binary format, which, it was noted, is covered by several patents. More information on FatELF can be found at http://icculus.org/fatelf/.
Futexes. Darren Hart, known for his involvement in the RT kernel community, recently posted an RFC patch series intended to make futex_lock_pi into a fully interruptible syscall. This would allow for canceling of locking requests, while preserving FIFO ordered wakeup and Priority Inheritance requirements, and without having to try to emulate this behavior in userspace. He included a test case demonstrator, which used an RT signal handler to abort the futex locking attempt. Arnd Bergmann responded that it should be possible to simply longjmp out of the test application signal handler and avoid modifying the kernel, something that Darren confirmed did work, but he was apprehensive as to whether there might be unintended issues in doing this.
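As a rough user-space analogy only (this is not Darren's test case and does not touch futex_lock_pi), the idea of a signal handler aborting a blocked locking attempt can be sketched in Python, where an exception raised from the handler plays the role of the longjmp. The names LockAborted and try_lock_with_abort are made up for this illustration, and the sketch assumes a POSIX system and that it runs in the main thread.

```python
import signal
import threading


class LockAborted(Exception):
    """Hypothetical exception used to unwind out of a blocked acquire."""


def _abort_handler(signum, frame):
    # Raising here unwinds the interrupted acquire(), much as a longjmp
    # out of a C signal handler would abandon the blocked call.
    raise LockAborted()


def try_lock_with_abort(lock, abort_after_s):
    """Try to take lock; give up if SIGALRM fires first.

    Returns True if the lock was acquired, False if the attempt was aborted.
    Must be called from the main thread, since Python only installs signal
    handlers there.
    """
    previous = signal.signal(signal.SIGALRM, _abort_handler)
    signal.setitimer(signal.ITIMER_REAL, abort_after_s)
    try:
        lock.acquire()
        return True
    except LockAborted:
        return False
    finally:
        # Cancel the timer and restore the old handler either way.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)
```

Calling try_lock_with_abort on a lock that is already held returns False once the timer fires. The sketch says nothing about the FIFO wakeup or Priority Inheritance guarantees that motivated the kernel patches.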
KVM. Gleb Natapov posted a patch series implementing asynchronous page faults for paravirtualized KVM guests. Typically, a guest encountering a page fault becomes blocked until the faulting page is made available by KVM and the guest can be resumed. But paravirtualized guests are aware of the hypervisor and can interact with it, in this case by blocking only the faulting task within the guest rather than the entire guest VM. The faulting page can then be swapped in while the guest is still running, using the assistance of a parallel thread within the hypervisor.
Memory Overcommit. Here comes the annual OOM killer discussion. Back in the middle of October, Vedran Furac sent a message entitled “Memory overcommit”, in which he pointed out that even today a trivial C program, run by an ordinary user and attempting large memory allocations, can trigger the OOM killer and really take down a system (by killing many essential system services other than the guilty task) once overcommit_memory is disabled. In his example, Vedran cited how 8 processes were killed, including the X server and some long running system daemons. He felt that the OOM killer really only served to give Linux a bad reputation amongst some users and that it was better to simply disable it by default, enforcing strict allocation only of the available free pages. Others disagreed, although Vedran had a point in saying the OOM killer might as well be renamed to TRIPK: Totally Random Innocent Process Killer.
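For reference, the policy being argued over is selected by the vm.overcommit_memory sysctl, which takes the values 0 (heuristic, the default), 1 (always overcommit) and 2 (strict accounting). A minimal sketch for checking what a given box is set to follows; the function names and description strings are made up for illustration.

```python
from pathlib import Path

# Meaning of each vm.overcommit_memory value.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit",
}


def describe_overcommit(raw):
    """Translate the raw sysctl text (e.g. '0\\n') into a description."""
    mode = int(raw.strip())
    return OVERCOMMIT_MODES.get(mode, "unknown mode %d" % mode)


def current_overcommit(path="/proc/sys/vm/overcommit_memory"):
    """Read and describe the running system's setting (Linux only)."""
    return describe_overcommit(Path(path).read_text())
```

Under strict accounting a large allocation is expected to fail at allocation time rather than succeed and invoke the OOM killer later, which is roughly the default behavior Vedran was asking for.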
Kamezawa Hiroyuki had made several mitigation suggestions against overcommit problems, including the use of oom_adj and explicit cgroups. But Vedran was more concerned with how the OOM killer algorithm seemed to be making the wrong choices in the first place as to which tasks should die. This is an issue that comes up every once in a while. Vedran and Kamezawa had previously taken the discussion off-list (to the mm list instead) but it now returned to LKML, Kamezawa having written a script to analyze the oom_score of existing processes on his own system and discovering (for example) that his GNOME desktop processes were being considered worse candidates by the OOM killer than the sample “allocate 1GB of memory” task that had taken down Vedran’s box.
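Kamezawa's script was not reproduced here, but the same kind of survey can be sketched by reading /proc/<pid>/oom_score for every process and sorting the results. The function name and the proc_root parameter are assumptions for this illustration, and the per-process comm file used for the name only exists on newer kernels.

```python
import os


def rank_by_oom_score(proc_root="/proc"):
    """Return (score, pid, comm) tuples, highest oom_score first."""
    ranked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "oom_score")) as f:
                score = int(f.read().strip())
            with open(os.path.join(base, "comm")) as f:
                comm = f.read().strip()
        except (OSError, ValueError):
            continue  # process exited, or the file was unreadable
        ranked.append((score, int(entry), comm))
    # Highest score first; ties broken by ascending pid.
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return ranked
```

Printing the first few entries of rank_by_oom_score() shows which tasks the kernel currently considers the most likely victims.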
Kosaki Motohiro suggested that the problem was the number of libraries the average desktop application is linked against, and also suggested that the OOM killer should not account for evictable file-backed mappings (such as libraries) in calculating the oom_score. This led to a discussion as to the best metric to consider in making OOM kill decisions. It was deemed necessary to consider the VM size in order to catch swapped-out fork bomb attacks, but Kosaki noted that basing oom_score on RSS + swap-entries figures would be acceptable to him as an alternative. This led on to a lengthy discussion thread (and a number of patch iterations, including a nice analysis from Hugh Dickins) concerning the best ways to overhaul the OOM killer for modern systems and what exactly the criteria should be. Should it be that the biggest resident memory eater is always killed (which is hard to predict)? Or should the total VM size (including resident and non-resident pages) factor into the decision?
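To make the two candidate criteria concrete, here is a toy comparison. The task names and memory figures are invented purely for illustration, and the real kernel heuristic weighs further factors that are ignored here.

```python
from collections import namedtuple

# Hypothetical per-task figures, in megabytes.
Task = namedtuple("Task", "name total_vm_mb rss_mb swap_mb")


def pick_by_total_vm(tasks):
    """Victim under a 'largest total VM size' rule."""
    return max(tasks, key=lambda t: t.total_vm_mb).name


def pick_by_rss_plus_swap(tasks):
    """Victim under an 'RSS plus swap entries' rule."""
    return max(tasks, key=lambda t: t.rss_mb + t.swap_mb).name


tasks = [
    # Many mapped libraries: large address space, little resident memory.
    Task("desktop-app", total_vm_mb=1500, rss_mb=120, swap_mb=10),
    # Actually touches what it allocates.
    Task("memory-hog", total_vm_mb=1100, rss_mb=1000, swap_mb=50),
]
```

With these made-up numbers the total VM rule picks the desktop application, while the RSS plus swap rule picks the task that is really consuming memory, which is the mismatch the thread was trying to address.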
Regressions. Caleb Cushing posted to let everyone know that his network performance has dropped off considerably since moving to 2.6.31.x. But the problem seems elusive, having bitten in 2.6.30.x previously, then seeming to vanish before apparently re-appearing in 2.6.31.x. Having never performed a bisection before, Caleb wasn’t entirely sure of the process, but did post the log from a bisection hoping that others might chime in with some input.
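For anyone else new to bisection, the underlying idea is a binary search over the commit history between a known good and a known bad version. A minimal sketch follows, where first_bad and is_bad are hypothetical names and is_bad stands in for building, booting and testing a given commit.

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad(commit) is True.

    Assumes commits are in history order, commits[0] is known good,
    commits[-1] is known bad, and that once the bug appears it stays.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the culprit is at mid or earlier
        else:
            lo = mid  # the culprit is after mid
    return commits[hi]
```

Each test halves the remaining range, so even thousands of commits need only a dozen or so builds. The search does assume a single clean transition from good to bad, which a regression that comes and goes may not provide.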
Thread naming. John Stultz posted another iteration of a patch he has been working on that allows threads to rename their siblings by writing into /proc/pid/task/tid/comm. This will allow thread managers to nicely set the task name of their children, for logging as well as for appearance.
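A minimal sketch of how such an interface might be used from user space, assuming the per-thread directory is /proc/<pid>/task/<tid>/ and that names are limited by the kernel's 16-byte TASK_COMM_LEN buffer. The helper names and the proc_root parameter are made up for illustration.

```python
import os

TASK_COMM_LEN = 16  # kernel buffer size, including the trailing NUL


def comm_path(pid, tid, proc_root="/proc"):
    """Build the path of the comm file for one thread."""
    return os.path.join(proc_root, str(pid), "task", str(tid), "comm")


def set_thread_name(pid, tid, name, proc_root="/proc"):
    """Write a new name for thread tid of process pid.

    The name is truncated to what fits in the kernel buffer, and the
    name actually written is returned.
    """
    data = name.encode()[:TASK_COMM_LEN - 1]
    with open(comm_path(pid, tid, proc_root), "wb") as f:
        f.write(data)
    return data.decode(errors="replace")
```

A thread manager could call set_thread_name(os.getpid(), tid, "worker-3") after spawning each child, so that the names show up in tools that read the comm field.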
In today’s announcements: The kerneloops.org report for the week of October 31 2009. Arjan van de Ven posted this week’s summary of recorded kernel oops logs from his kerneloops.org online service. A total of 18,023 oopses and warnings were logged over the past week, more than a 200% increase over the previous week, though this week’s report coincides with the latest Ubuntu release (which includes the ability to file such reports for the first time). The top warnings were in suspend_test_finish, acpi_idle_enter_bm and dev_watchdog.
The latest kernel release was 2.6.32-rc5.
Andrew Morton posted an mm-of-the-moment for 2009-11-01-10-01. It contains a fair number of patches against the 2.6.32-rc5 kernel.
Stephen Rothwell posted a linux-next tree for Friday. Since Thursday, he had a PowerPC KVM fix, some architectural fixes, and network and percpu conflicts that needed to be resolved. There are currently 145 sub-trees in linux-next.
That’s a summary of today’s Linux Kernel Mailing List traffic. For further information, visit www.kernel.org. I’m Jon Masters.