Archive for November 5th, 2009

2009/11/04 Linux Kernel Podcast

November 5th, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20091104.mp3

For Wednesday, November 4th, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Cgroups, FatELF, PerCPU MM counters, and Swap.

Cgroups. Balbir Singh posted to let everyone know that discussion is happening concerning the most appropriate place to mount the cgroup filesystem. Since the Linux Filesystem Hierarchy Standard (FHS) was written prior to the existence of cgroups, it has no specific advice, which leads to three alternatives. These are /dev/cgroup, /cgroup, or some place under /sys. Balbir prefers the first option, but that will require some co-operation with udev. He asks for advice from others as to the best place for this to live. Several people seem to be quite happy with /sys/kernel/cgroup (it would not be the only filesystem that gets mounted under /sys/kernel).

FatELF. Continuing the discussion on the relative merits of “FAT” image files containing multiple ELF objects, Mikulas Patocka made some interesting comments on Linux package managers, describing them as “evil”. In his opinion, FatELF might provide a means to ship single image files containing all of the files an application needs to execute in one object, similar to how Apple and other operating systems already do today. Mikulas is concerned about the relative difficulty Linux users face in installing software not provided by their distribution using package management software. He makes a good point, although FatELF may not be the solution to that particular problem.

PerCPU MM counters. Christoph Lameter, noting that support for generic per-cpu operations is now in the “percpu” and linux-next trees, posted a patch implementing per-cpu mm counters for tasks rather than single entries in mm_struct. This obviates the need for larger SMP systems to perform atomic updates to mm counters and (intuitively) implies a performance improvement. The only downside is occasionally having to iterate over each of these per-cpu values when the actual count values are being requested.
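
As a rough illustration of that trade-off, here is a toy Python model (not kernel code, and not Christoph's patch). The class name and the idea of indexing "CPUs" by list position are assumptions made purely for the sketch: updates touch only one slot, while a read has to fold all of them.

```python
# Illustrative model only: "CPUs" are list indices, not real processors.
class PerCpuCounter:
    def __init__(self, nr_cpus):
        # One private slot per CPU instead of a single shared value.
        self.slots = [0] * nr_cpus

    def add(self, cpu, delta):
        # Fast path: touch only the local CPU's slot, no shared atomic update.
        self.slots[cpu] += delta

    def read(self):
        # Slow path: iterate over every per-CPU slot to get the real total.
        return sum(self.slots)


rss = PerCpuCounter(nr_cpus=4)
rss.add(0, 3)
rss.add(2, 5)
rss.add(0, -1)
total = rss.read()
print(total)
```

The cost has moved from every update to the comparatively rare read, which is the downside mentioned above.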

Swap. Following on from the recent discussion about OOM killer behavior and the various metrics that might be used in the future, Kamezawa Hiroyuki posted a patch that exports per-process (task) swap usage statistics via procfs. This happens through the addition of a new “VmSwap” entry in /proc/pid/status.
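
For anyone wanting to consume such a field, a minimal Python sketch follows. It assumes the new entry uses the same "Name: value kB" layout as the existing Vm* lines in the status file; the function name and the sample text are hypothetical.

```python
def parse_vmswap_kb(status_text):
    """Return the VmSwap value in kB from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmSwap:"):
            # Assumed layout: "VmSwap:      128 kB"
            return int(line.split()[1])
    return None


# Hypothetical sample; on a kernel carrying the patch one would instead
# read the text from /proc/<pid>/status.
sample = "Name:\tbash\nVmRSS:\t    2048 kB\nVmSwap:\t     128 kB\n"
swap_kb = parse_vmswap_kb(sample)
print(swap_kb)
```

Returning None when the entry is absent keeps the helper usable on kernels that do not export the field.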

The latest kernel release is 2.6.32-rc6.

Stephen Rothwell posted a linux-next tree for November 4th. There had been no tree the previous day due to a national holiday in Australia, where he is based (and one trusts the horse race went well, too). Since Monday, there was a new “msm” tree (which is an ARM platform), the PowerPC KVM fix was still required, and a couple of other conflicts went away. The total sub-tree count increased today to 146 trees with the addition of the “msm” tree.

That’s a summary of today’s Linux Kernel Mailing List traffic. For further information, visit www.kernel.org. I’m Jon Masters.

Categories: episodes Tags:

2009/11/03 Linux Kernel Podcast

November 5th, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20091103.mp3

For Tuesday, November 3rd, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Block IO controller, FatELF, Ftrace, Performance, and Sysctls.

Block IO controller. The ever-patient Vivek Goyal, fresh from the IO minisummit in Tokyo, posted the first version of a new IO bandwidth control patchset entitled “Block IO Controller”. This RFC patch series aims to address the problem of there being no “one size fits all” IO control policy, and the need for different policies to be implemented for different uses. The patch introduces what Vivek calls the blkio cgroup controller, through which a management interface is provided that can be used to switch policies.

FatELF. Eric Windisch posted some example use cases for FatELF that he felt others should know about, in an attempt to counter some of the points made by Alan Cox previously. In particular, it would seem that Eric is into Cloud Computing in a big way and looks forward to having virtual machine images that can simultaneously run on a variety of different hardware. Although there is certainly some benefit provided by FatELF, it wasn’t clear why these problems couldn’t be solved as Alan had suggested – with different directories containing versions of the same binaries for the different arches.

Ftrace. Michal Simek posted to let everyone know that he is currently working on Ftrace support for the Microblaze CPU architecture (an FPGA-based soft core from the folks at Xilinx). In particular, he is looking at function trace support at the moment and how the mcount function is used to record entry into each individual function. He has a number of questions, and Steven Rostedt (the Ftrace author) was happy to help answer a number of them.

Performance. Alex Shi posted with an observation that performance testing had yielded results with a 20-30% drop off in the 2.6.32-rc5 timeframe. This seemed to be due to a cfq-iosched patch from Jens Axboe. Alex attached an example run of perf stat both with and without the patch, showing a clear difference between the two sets of data.

Sysctl. Eric Dumazet recently observed that sysctl table entries were quite expensive, due to a sentinel value added after each one in order to detect and avoid corruption of table entries. Eric noted that the sentinel need actually only contain a couple of pieces of data, and so he created a special sentinel entry struct called ctl_table_sentinel that was smaller in size. This would apparently reduce RAM utilization of such entries by 40%.

In today’s announcements: Userspace RCU. Mathieu Desnoyers posted to let everyone know that version 0.3.0 of his Userspace RCU patches is now available. This is an RCU implementation using the POSIX pthread functions that applications can use to take advantage of the same features that the kernel has had for some time. The latest version removes a function (call_rcu) whose arguments and semantics differed from those of the kernel’s version.

The latest kernel release is 2.6.32-rc6. Linus Torvalds announced version 2.6.32-rc6 of the Linux kernel at 12:05pm US Best Coast Time (PDT). In his announcement, Linus noted that there had been a longer gap since rc5, due in large part to the number of kernel developers who have been away at the kernel summit in Japan or traveling to and fro. There was also an ext4 filesystem corruption problem that required additional time, and that had turned out to be due to enabling checksum testing of journal transactions during recovery. Linus thanked Eric Sandeen for tracking down that particular problem. He also seemed pleased at the number of regressions addressed since 2.6.31.

Stephen Rothwell announced that there would be no linux-next tree for November 3rd due to a public holiday in Australia where he is based, which apparently also has “nothing to do with a horse race in Melbourne”.

That’s a summary of today’s Linux Kernel Mailing List traffic. For further information, visit www.kernel.org. I’m Jon Masters.

Categories: episodes Tags:

2009/11/02 Linux Kernel Podcast

November 5th, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20091102.mp3

For Monday, November 2nd, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: BKL, FatELF, Fast symbol resolution, OOM, and Performance benchmarks.

BKL. There is an ongoing effort to remove the BKL (Big Kernel Lock), which is the last holdover from early Linux support for SMP. Discussion of BKL removal was revived during the recent Real Time pre-emption mini-summit, and Jan Blunck is amongst those who have been looking at this from the filesystem level. He posted a series of patches intended to push BKL use down into individual filesystems from the generic kernel code (for example do_new_mount()) that it lives in today. He requests comments.

FatELF. There was some ongoing (and quite considerable) push back against the notion of supporting FatELF binaries. Chris Adams wondered aloud just what the target audience really was. As he sees it, embedded users don’t want the bloat, Enterprise distributions already have specific support processes in place for different architectures, and community distributions aren’t likely to want to deal with the increased build complexity and space requirements. Meanwhile, Alan Cox congratulated Ryan C. Gordon on re-inventing the concept of a directory – since directories already allow one to have multiple versions of a binary installed on a given system and to pick and choose between them. Sure, that’s not as shiny as an Apple-esque approach, but it has worked for many decades at this point, and most of the distributions implement multi-arch (sometimes called multi-lib) using some kind of similar approach.

Fast symbol resolution. Alan Jenkins posted the latest version of his fast LKM symbol resolution patches. These take advantage of a binary search for symbol resolution at module load time, using a pre-generated (at build time) sorted table of exported kernel symbols. Using this approach, Alan has once again succeeded in reducing overall system boot time slightly on his netbook. The latest version of the patches has seen some limited testing on ARM and has also been built for Blackfin, so it’s not just x86 at this point.
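
The general technique can be sketched in a few lines of Python. This is not Alan's implementation; the function names are hypothetical and the symbol addresses are made-up placeholders. The point is simply that sorting once up front lets each later lookup use a binary search instead of a linear scan.

```python
import bisect


def build_symbol_table(exports):
    """'Build time': sort symbol names once; keep addresses in matching order."""
    names = sorted(exports)
    addrs = [exports[name] for name in names]
    return names, addrs


def resolve(table, name):
    """'Load time': binary search, O(log n) comparisons per lookup."""
    names, addrs = table
    i = bisect.bisect_left(names, name)
    if i < len(names) and names[i] == name:
        return addrs[i]
    return None  # unresolved symbol


# Placeholder addresses, for illustration only.
exports = {"printk": 0xC0100000, "kmalloc": 0xC0200000, "schedule": 0xC0300000}
table = build_symbol_table(exports)
print(hex(resolve(table, "kmalloc")))
```

With many thousands of exported symbols and many modules loaded at boot, replacing a linear scan per symbol with a logarithmic search is where a saving of this kind would come from.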

OOM. Kamezawa Hiroyuki posted to let everyone know that he was putting code where his mouth was with a “total renewal” of the OOM killer code. This isn’t complete at this stage, but it is intended to keep the conversation moving. The first patch lays groundwork (including new OOM type classifications), while the second and subsequent patches add the ability to count swap use per process and implement a newly updated badness calculation that uses rss+swap as the base value, factors in cpusets, and gives tasks a bonus based on their runtime and on how far in the past their last allocation occurred.

Performance benchmarks. Hitoshi Mitake posted to let everyone know that he has been working on integrating a benchmark subsystem into the existing – and already fairly extensive – “perf” (or performance events) utility. He asked Rusty Russell for permission to pull Rusty’s hackbench code directly into the kernel tree as part of this effort. The benchmark can then be run by calling “perf bench sched” with whatever parameters one might wish to specify.

Finally today, Tilman Schmidt requests that we draw attention to the Kernel Cleanup wiki that Robert P J Day has been working on. The page at www.crashcourse.ca/wiki/index.php/Kernel_cleanup includes information about unused Kconfig variables, badly referenced ones, and general problems with kernel code that need further investigation.

In today’s announcements: LTP. Subrata Modak posted announcing that the Linux Test Project for October 2009 has been released. The latest version includes fixes, 119 test scenarios for EXT4 testing, new GETUID16/GETUID64/GETEUID16 and PTRACE system call tests, and much more. As usual, it is available at http://ltp.sourceforge.net/.

Sysprof. Soeren Sandmann announced version 1.1.4 of the sysprof CPU profiler. This is the latest version to be based upon the rewrite to make use of the new performance counters interface for exposing the low-level hardware counters. Since the previous 1.1.2 release, there have been a number of fixes. A download is available at http://www.daimi.au.dk/~sandmann/sysprof/.

The latest kernel release was 2.6.32-rc5.

Stephen Rothwell posted a linux-next tree for November 2nd. Since Friday, his fixes tree still has that PowerPC KVM fix, while there were a number of arch issues affecting ARM and OMAP in particular. The sub-tree count remains steady today at 145 trees in linux-next.

That’s a summary of today’s Linux Kernel Mailing List traffic. For further information, visit www.kernel.org. I’m Jon Masters.

Categories: episodes Tags:

2009/11/01 Linux Kernel Podcast

November 5th, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20091101.mp3

If at first you don’t succeed. Welcome to version 2.0 of the LKML summary podcast. In this revamped version I will concentrate on the major issues under discussion on a given day, rather than commenting on every single patch, which had become an unsustainable load. I am still interested to hear from volunteers who might help to make the podcast workload less challenging on a daily basis.

For the weekend of November 1st 2009, I’m Jon Masters with a summary of the weekend’s LKML traffic.

In today’s issue: Fanotify, FatELF, Futexes, KVM, Memory Overcommit, Regressions, and Thread Naming.

Fanotify. Eric Paris posted a patch series implementing a new file mode entitled FMODE_NONOTIFY, which can only be set by the kernel itself. Its job is to indicate that an fd was opened by fanotify itself and should not cause future fanotify events. This allows one to obviate such livelock scenarios as would otherwise occur from fanotify close events resulting in repeated opens on a file that would then be closed and cause another event to be emitted.
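
The feedback loop is easier to see in a toy model. The Python below is not the fanotify API; the class, its method, and the depth cap (which exists only so the "livelock" case terminates) are all invented for illustration.

```python
class ToyNotifier:
    """Toy model of a listener that opens and closes a file on every close event."""

    def __init__(self, honor_nonotify):
        self.honor_nonotify = honor_nonotify
        self.events = 0

    def close(self, fd_nonotify, depth=0, max_depth=5):
        # A close emits an event unless the fd is marked and the mark is honored.
        if fd_nonotify and self.honor_nonotify:
            return
        self.events += 1
        if depth < max_depth:
            # The listener reacts by opening the file itself (its fd carries
            # the mark) and then closing it again.
            self.close(fd_nonotify=True, depth=depth + 1, max_depth=max_depth)


looping = ToyNotifier(honor_nonotify=False)
looping.close(fd_nonotify=False)

fixed = ToyNotifier(honor_nonotify=True)
fixed.close(fd_nonotify=False)

print(looping.events, fixed.events)
```

Without the mark, every event the listener handles produces another one until the artificial cap is hit; with it, only the original close is reported.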

FatELF. Ryan C. Gordon posted what he hoped would be his final round of FatELF patches. These extend the Linux kernel’s ELF binary format handler loader code to accept “FAT” images containing multiple ELF binaries, allowing for such features as multi-arch code encapsulated within a single binary. In some respects, the feature behaves similarly to Apple’s Universal Binary format, which it was noted is covered by several patents. More information on FatELF can be found at http://icculus.org/fatelf/.

Futexes. Darren Hart, known for his involvement in the RT kernel community, recently posted an RFC patch series intended to make futex_lock_pi into a fully interruptible syscall. This would allow for canceling of locking requests, while preserving FIFO ordered wakeup and Priority Inheritance requirements, and without having to try to emulate this behavior in userspace. He included a test case demonstrator, which used an RT signal handler to abort the futex locking attempt. Arnd Bergmann responded that it should be possible to simply longjmp out of the test application signal handler and avoid modifying the kernel, something that Darren confirmed did work, but he was apprehensive as to whether there might be unintended issues in doing this.

KVM. Gleb Natapov posted a patch series implementing asynchronous page faults for paravirtualized KVM guests. Typically, a guest encountering a page fault becomes blocked until the faulting page is made available by KVM and the guest can be resumed. But paravirtualized guests are aware of the hypervisor and can interact with it, in this case by blocking only the faulting task within the guest and not the entire guest VM. The faulting page can then be swapped in while the guest is still running, using the assistance of a parallel thread within the hypervisor.

Memory Overcommit. Here comes the annual OOM killer discussion. Back in the middle of October, Vedran Furac sent a message entitled “Memory overcommit”, in which he described how, even today, a trivial C program run by an ordinary user that attempts to perform large memory allocations can trigger the OOM killer and really take down a system (by killing many essential system services other than the guilty task) once overcommit_memory is disabled. In the example, Vedran had cited how 8 processes were killed, including the X server and some long running system daemons. He felt that the OOM killer really only served to give Linux a bad reputation amongst some users and that it was better to simply disable it by default – enforcing strict allocation only of the available free pages. Others disagreed, although Vedran had a point in saying the OOM killer might as well be renamed to TRIPK – Totally Random Innocent Process Killer.

Kamezawa Hiroyuki had made several mitigation suggestions against overcommit problems, including the use of oom_adj and explicit cgroups. But Vedran was more concerned with how the OOM killer algorithm seemed to be making the wrong choices in the first place as to which tasks should die. This is an issue that comes up every once in a while. Vedran and Kamezawa had previously taken the discussion off-list (to the mm list instead) but it now returned to LKML, Kamezawa having written a script to analyze the oom_score of existing processes on his own system and discovering (for example) that his GNOME desktop processes were being considered worse by the OOM killer than the sample “allocate 1GB of memory” task that had taken down Vedran’s box.

Kosaki Motohiro suggested that the problem was the number of libraries the average desktop application is linked against, and also suggested that the OOM killer should not account for evictable file-backed mappings (such as libraries) in calculating the oom_score. This led to a discussion as to the best metric to consider in making OOM kill decisions. It was deemed necessary to consider the VM size in order to catch swapped-out fork bomb process attacks, but Kosaki noted that basing oom_score on RSS + swap-entries figures would be acceptable to him as an alternative. This led on to a lengthy discussion thread (and a number of patch iterations – including a nice analysis from Hugh Dickins), concerning the best ways to overhaul the OOM killer for modern systems and what exactly the criteria should be. Should it be that the biggest resident memory eater is always killed (which is hard to predict)? Or should the total vm size (including resident and non-resident pages) factor into the decision?
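
To make the disagreement concrete, the Python sketch below ranks three invented tasks under the two candidate metrics. It is not the kernel’s badness calculation; the task names and all of the figures are hypothetical, chosen only to mirror the situation described above, where a library-heavy desktop process has a large virtual size but a small resident footprint.

```python
from collections import namedtuple

Task = namedtuple("Task", "name total_vm_mb rss_mb swap_mb")

# Hypothetical figures, chosen only to show how the two metrics can disagree.
tasks = [
    Task("desktop-app", total_vm_mb=1500, rss_mb=120, swap_mb=10),
    Task("memory-hog", total_vm_mb=1100, rss_mb=1000, swap_mb=50),
    Task("daemon", total_vm_mb=200, rss_mb=30, swap_mb=0),
]


def victim_by_total_vm(tasks):
    # Metric 1: total virtual size, which counts mapped libraries too.
    return max(tasks, key=lambda t: t.total_vm_mb)


def victim_by_rss_swap(tasks):
    # Metric 2: resident pages plus swap entries.
    return max(tasks, key=lambda t: t.rss_mb + t.swap_mb)


print(victim_by_total_vm(tasks).name)
print(victim_by_rss_swap(tasks).name)
```

With these numbers the virtual-size metric picks the desktop application, while the RSS + swap metric picks the task that is actually consuming the memory.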

Regressions. Caleb Cushing posted to let everyone know that his network performance has dropped off considerably since moving to 2.6.31.x. But the problem seems elusive, having bitten in 2.6.30.x previously, then seeming to vanish before apparently re-appearing in 2.6.31.x. Having never performed a bisection before, Caleb wasn’t entirely sure of the process, but did post the log from a bisection hoping that others might chime in with some input.

Thread naming. John Stultz posted another iteration of a patch he has been working on that allows threads to rename their siblings by writing into /proc/pid/task/tid/comm. This will allow thread managers to nicely set the task name of their children, for logging as well as for appearance.
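
From userspace, using such an interface might look like the Python sketch below. It assumes a kernel that exposes a writable comm file under the per-thread directory /proc/<pid>/task/<tid>/, and it assumes the usual limit of 15 characters for a task name; the helper names are hypothetical.

```python
import threading

TASK_COMM_MAX = 15  # assumption: task names hold at most 15 characters plus a NUL


def comm_path(pid, tid):
    """Build the per-thread comm path; pid may be the string "self"."""
    return f"/proc/{pid}/task/{tid}/comm"


def set_thread_name(tid, name, pid="self"):
    """Try to rename thread `tid`; return True on success, False otherwise."""
    try:
        with open(comm_path(pid, tid), "w") as f:
            f.write(name[:TASK_COMM_MAX])
        return True
    except OSError:
        # No such file (older kernel, non-Linux system) or no permission.
        return False


if __name__ == "__main__":
    # Rename the calling thread, identified by its kernel thread id.
    ok = set_thread_name(threading.get_native_id(), "worker-0")
    print("renamed" if ok else "comm file not writable here")
```

A thread manager would call the same helper with the thread ids of the siblings it creates.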

In today’s announcements: The kerneloops.org report for the week of October 31 2009. Arjan van de Ven posted this week’s summary of recorded kernel oops logs from his kerneloops.org online service. A total of 18,023 oopses and warnings were logged over the past week, more than a 200% increase over the previous week, though this week’s report coincides with the latest Ubuntu release (which includes the ability to file such reports for the first time). The top warnings were in suspend_test_finish, acpi_idle_enter_bm and dev_watchdog.

The latest kernel release was 2.6.32-rc5.

Andrew Morton posted an mm-of-the-moment for 2009-11-01-10-01. It contains a fair number of patches against the 2.6.32-rc5 kernel.

Stephen Rothwell posted a linux-next tree for Friday. Since Thursday, he had a PowerPC KVM fix, some architectural fixes, and network and percpu conflicts that needed to be resolved. There are currently 145 sub-trees in linux-next.

That’s a summary of today’s Linux Kernel Mailing List traffic. For further information, visit www.kernel.org. I’m Jon Masters.

Categories: episodes Tags: