Archive for June, 2009

2009/06/28 Linux Kernel Podcast

June 29th, 2009 jcm

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090628.mp3

For the weekend of June 28th 2009, I’m Jon Masters with a summary of the weekend’s LKML traffic.

In today’s issue: kerneloops.org weekly report, DRBD, Kmemleak, OOM, performance counters, trying harder, and VFAT.

Kernel Oops reports for the week. Arjan van de Ven posted an analysis of last week’s kerneloops.org reports. He cited a mem_cgroup_add_lru_list list corruption as a concern (asking why this is new), a memcmp in the raid code, and the item he previously brought to attention (get_free_pages) concerning warnings on order > 0 allocations in the low level page allocator. Number one on the list this week was an i915_gem_set_tiling issue.

DRBD. Philipp Reisner reposted (for the first time in the 2.6.31 cycle) his DRBD patches, which implement a highly available block device for HA clusters. He says “As the first bit of the DRBD patch already got upstream…it is time to get more of DRBD towards mainline”. He wants the LKML masses to consider the lru_cache next.

Kmemleak. Kmemleak, when enabled in the kernel build configuration, aims to detect runtime leakage of kernel memory. But it can be very noisy and it prints very verbose output, which a number of developers have objected to, including Ingo Molnar (who says he has lost crash information due to that). So various suggestions and patches are floating around to trim both the output and the rate at which it is produced. One suggestion was also to be able to “watch” potentially leaked regions with some kind of registration interface.

OOM. David Howells spent a long time bisecting kernels until he found the git commit that has been causing a marked increase in OOM situations. It was a patch from Minchan Kim entitled “vmscan: prevent shrinking of active anon lru list in case of no swap space”. The patch (incorrectly) assumes that nr_swap_pages cannot be zero on systems with swap, when in fact it can. So debate is now happening over the best way to fix the patch for systems with swap.

Performance counters. Jaswinder Singh Rajput posted a patch adding support to the “perf” utility for “multiple events in one shot”. He adds new options to display HARDWARE and SOFTWARE events using a command such as “perf stat -w hw-events -e all-sw-events” wrapped around “ls” to display a number of stats for the running “ls” command.

Trying harder. Linus Torvalds replied to the ongoing get_page_from_freelist discussion concerning order > 0 GFP_NOFAIL allocations, in which David Rientjes had suggested that a __GFP_WAIT allocation set the ALLOC_HARDER bit _if_ it repeats. Linus said that he “tends to like” this kind of “incrementally try harder” approach to getting memory in such situations, in part because it ensures fairness – a new thread starting off won’t steal the page that an older thread has just had freed and really needs to grab right away.
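The “incrementally try harder” idea can be sketched as a toy model in Python. This is only an illustration, not kernel code: the function names, the watermark values, and the rule that a “harder” request may dip a quarter below the watermark are all assumptions made for the example.

```python
def try_alloc(free_pages, watermark, harder=False):
    """Toy check: may one page be taken without dropping below the limit?

    Assumption for illustration: a "harder" request is allowed to dip
    a quarter below the normal watermark.
    """
    limit = watermark - watermark // 4 if harder else watermark
    return free_pages - 1 >= limit


def alloc_with_retry(free_pages, watermark, max_attempts=3):
    """Return the attempt index that succeeded, or None if all failed.

    Only repeated attempts (attempt > 0) set the "harder" flag, which
    mirrors the suggestion that escalation happens _if_ the request repeats.
    """
    for attempt in range(max_attempts):
        if try_alloc(free_pages, watermark, harder=attempt > 0):
            return attempt
    return None
```

In this model a first-time caller never gets the relaxed limit, so a request that has already been waiting is favoured over one that has only just arrived.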

VFAT. Andrew Tridgell posted an updated CONFIG_VFAT_FS_DUALNAMES patch implementing a new config option. It is now possible to selectively configure whether a Linux system using VFAT will create both long and short (8.3) filename entries for long filenames – with this configuration option disabled, Linux will not create the compatibility short filename alternative on long filename entries. Andrew also posted an FAQ and announced that the Linux Foundation have arranged for John Lanza to serve as a patent attorney and answer legal questions that come up relating to this patch (he was copied on the email and is hopefully ready to handle a volume of LKML traffic).

In today’s miscellaneous items: TSC based udelay should have rdtsc_barrier (Venkatesh Pallipadi), a number of Intel Moorestown boot fixes (Jacob Jun Pan), 62 “remove semicolon” patches (Joe Perches), reposted lockdep DFS to BFS conversion patches (Tom Leiming – claims the implementation is simpler), fixes to the S+Core architecture (Arnd Bergmann), a stop_machine patch for very large CPU count machines suffering from severe cacheline contention (Robin Holt), a triple update from Ingo Molnar (x86, timers, and tracing), some EDAC AMD64 fixes (Borislav Petkov), and SPARC fixes (David Miller). Ben Herrenschmidt requested Linus pull some fixes originally intended for rc1 that weren’t ready in time because he got sick for a couple of days.

Finally today, various people have been mentioning ext3/4 filesystem errors upon resume from suspend (especially on ATA devices). A bug is suspected somewhere, but it is proving fairly elusive to track down.

In today’s announcements: dm-ioband version 1.12.0 (Ryo Tsuruta, disk bandwidth per partition control packages) and version 3.2g of the loop-AES file/swap crypto package.

The latest kernel release is 2.6.31-rc1, which was released by Linus last week.

Andrew Morton posted an mm-of-the-moment for 2009-06-25-15-49 which contains a number of updates against 2.6.31-rc1.

Stephen Rothwell posted a linux-next tree for June 26th. Since Thursday, it includes a fix for fbdev, and various subtrees gained build failures. The total tree count remains steady at 130 in the latest compose.

That’s a summary of today’s Linux Kernel Mailing List traffic. For further information, visit www.kernel.org. I’m Jon Masters.

Categories: episodes Tags:

2009/06/25 Linux Kernel Podcast

June 27th, 2009 jcm

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090625.mp3

For Thursday, June 25th 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Futexes, kmemleak, and mixed endianness.

Futexes. Thomas Gleixner came to Linus, cap in hand, apologizing for the mixup over correct usage of get_user_pages (the fourth argument is actually a number of pages and not a straight length – Peter Zijlstra previously posted fixes to the documentation of this function to avoid similar future mishaps). Thomas joked that he’s running out of brown paper bags to throw up into, and so is declaring all future futex bugs fixed by definition and thus features.
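The difference between a byte length and a page count is easy to get wrong, so here is a minimal Python sketch of the conversion. It is illustrative arithmetic only, not the kernel’s code: the 4 KiB page size and the helper name pages_spanned are assumptions.

```python
PAGE_SIZE = 4096  # assumed page size; real systems may use other sizes


def pages_spanned(addr, length, page_size=PAGE_SIZE):
    """Number of pages touched by the byte range [addr, addr + length)."""
    if length <= 0:
        return 0
    first_page = addr // page_size
    last_page = (addr + length - 1) // page_size
    return last_page - first_page + 1
```

A 4-byte word that sits entirely inside one page spans a single page, so passing the byte length (4) where a page count is expected would request four pages rather than one.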

Kmemleak. Dave Jones reported a lot of (likely false positive) kmemleak reports on his latest Fedora (rawhide) test kernels. Catalin Marinas followed up with some suggestions for kernel config changes and a noise reducing patch, enabling task stacks scanning by default, which Dave confirmed he had similarly done for his test kernels in response to the noise.

Mixed endianness. Andrew Paprocki wondered aloud whether it is really possible to mix endianness within a process on IA64. According to the documentation which Andrew cited, IA64 uses a .be bit in the PSR (Processor Status Register) to switch from one mode to another, although other (kernel) documentation says that this is not preserved by the kernel upon return from system calls – so the system is always returned to little endian mode following a system call. The question was whether it was practical to wrap system calls with a switch back from little endian to big endian once again. Nobody has answered him yet.
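For anyone unsure why the byte order mode matters, this short Python sketch shows the same 32-bit value stored in both byte orders, and what happens when bytes written in one order are read back in the other. It uses only the standard struct module and has nothing IA64-specific in it; the variable names are made up for the example.

```python
import struct

value = 0x11223344

# The same 32-bit value laid out in memory in each byte order.
little = struct.pack("<I", value)  # least significant byte first
big = struct.pack(">I", value)     # most significant byte first

# Reading little endian bytes as if they were big endian scrambles the value.
misread = struct.unpack(">I", little)[0]
```

This is why code running in big endian mode cannot simply consume data that was written while the processor was in little endian mode without some explicit conversion.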

Miscellaneous updates include: cpufreq lockdep fixes (Venkatesh Pallipadi), some fixes to avoid various races in irqfd/eventfd (Gregory Haskins), the 12th iteration of the per-bdi writeback flusher threads (Jens Axboe), various IDE fixes (David Miller, many by way of the previous maintainer), and a TTM page pool allocator patch for allocating e.g. AGP buffers for graphics from Jerome Glisse, which looks to be more of an RFC at this point. Steven Rostedt ACKed the general availability of the ring_buffer independently of tracing code following its use by this author’s hwlat patches.

Finally today, Robert P J Day announced that he is running his existing cleanup scripts against the kernel CONFIG options, looking for orphans. He says that he has found a number that are mentioned in a Kconfig file but not in fact used in the kernel tree at this point.

In today’s announcements: Ksplice for Ubuntu 9.04 Jaunty. Local Cambridge resident and Ksplice, Inc. founder Jeff Arnold announced that his company has begun offering updates for Ubuntu 9.04. For those just tuning in, ksplice is a dynamic kernel patching infrastructure allowing for “rebootless kernel updates”. It can handle ABI changes, structure modifications, all the kinds of things one might expect, and it doesn’t (necessarily) require a special kernel to begin with since it does all of its work in loadable modules under a kernel stop_machine context. Ksplice includes a lot of very interesting technology and the “Uptrack” service for Ubuntu aims to generate interest in their other commercial rebootless update offerings for the “Enterprise” distros.

usbutils 0.84. Greg Kroah-Hartman announced version 0.84 of usbutils. This release fixes several bugs.

The latest kernel release is 2.6.31-rc1, which was released by Linus on Wednesday evening PDT.

Stephen Rothwell posted a linux-next tree for June 25th. Since Wednesday, his fixes tree contains several commits, the tree still fails to build in an allyesconfig build configuration on PowerPC, and a number of conflicts and build failures were removed in time for 2.6.31-rc1.

That’s a summary of today’s Linux Kernel Mailing List traffic. For further information, visit www.kernel.org. I’m Jon Masters.

2009/06/24 Linux Kernel Podcast

June 25th, 2009 jcm

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090624.mp3

For Wednesday, June 24th 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: NMI watchdog and NOHZ, upcoming kerneloops reports, the Simple Firmware Interface, Slow work module unload fixes, unevictable pages, and USB APIs.

The merge window is now closed, and it’s obvious from the drop-off in patches. Although as Stephen Rothwell noted, there are still a number of trees (14) in linux-next that need to be merged (or in American English, “punted”, to 2.6.32). As Stephen says, “please do not shoot the messenger.”

NMI watchdog and NOHZ. David Miller followed up (again) to his previous issue regarding nohz, this time arriving at the conclusion that little prevents the NMI firing if an interrupt storm arrives immediately after the call to tick_nohz_stop_sched_tick(). Andi Kleen remarked that it would be safer to do tick disabling with interrupts off already, and that since the NMI watchdog is default off on most x86 systems, many people won’t have noticed.

Oops! The ever useful Arjan van de Ven posted a heads up regarding upcoming issues on kerneloops.org. Apparently, a lot of people have been bitten by an oops within get_page_from_freelist due to the changes to the VM intended to catch (and thoroughly blame) those calling kmalloc with __GFP_NOFAIL and order greater than zero. Unfortunately, this check turns out to be a little pedantic and there are many cases where legitimate users need more than order 0 single page allocations returned (e.g. in the SLUB allocator). Linus Torvalds even followed up explaining how little difference there really is between order 0 and order 1, and how it shouldn’t be a big issue until you face allocations requesting order 3 or above. For now, it looks like order 1 and above is the magic number that the check will be updated to catch.
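To make the order terminology concrete: an order-n request asks the page allocator for 2^n physically contiguous pages. The following is a minimal Python sketch of that arithmetic, assuming a 4 KiB page size (common on x86, but not universal). The helper names order_to_bytes and order_for_size are hypothetical and are not kernel functions.

```python
PAGE_SIZE = 4096  # assumed page size for illustration


def order_to_bytes(order, page_size=PAGE_SIZE):
    """Size in bytes of an order-n allocation (2**n contiguous pages)."""
    return (1 << order) * page_size


def order_for_size(size, page_size=PAGE_SIZE):
    """Smallest order whose allocation covers `size` bytes."""
    pages = max(1, -(-size // page_size))  # ceiling division
    return (pages - 1).bit_length()
```

Under these assumptions order 0 is a single 4 KiB page, order 1 is 8 KiB, and order 3 is 32 KiB of contiguous memory, which gives a feel for why the larger orders are the ones that become hard to satisfy.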

Simple Firmware Interface (SFI). Matthew Garrett followed up to his previous message concerning the SFI posting from the Intel camp. Previously, Matthew had objected to parts of the SFI concept – which is largely a cut-down ACPI – on the grounds that it tended to create a vendor mess of incompatible ACPI-like implementations with all manner of extension tables. Matthew feels that “SFI appears to be presented as a generic firmware interface, but in reality it’s currently tightly wed to Moorestown [the Intel chipset] and I don’t see any way that that can be fixed without reinventing chunks of ACPI. I’m certainly not enthusiastic about seeing this present as a fait accompli in generic driver code”. One looks forward to the Linux Symposium “discussion”.

Slow work (if you can get it). Gregory Haskins posted a fix to the slow_work implementation, adding a module owner reference for module clients. Previously, the implementation did not have a means to ensure that slow-work threads had completely exited the text in question before it was yanked away by the module unload code.

Unevictable pages. Alok Kataria and Kamezawa Hiroyuki debated whether hugepages should be accounted as unevictable pages, and if so whether the name “unevictable” should be changed in procps output to “Pinned” or “Mlocked”. The problem is that, while hugepages are indeed unevictable, neither these nor the existing statistics fully account for every unevictable page present. Sometimes these aren’t even known about until vmscan tries to reclaim them.

USB. There was some concern that lsusb was using an older API that stopped working unless CONFIG_EMBEDDED was set. This prompted several developers to question whether CONFIG_EMBEDDED should be required for “features” (it is intended to remove *unwanted* features), but Greg Kroah-Hartman explained that modern systems provide /dev/bus/usb and should be using that instead.

Miscellaneous updates include: A trivial update to the Intel TXT boot patches (Joseph Cihula, renaming a variable to make it global), IDE fixes (David Miller), Futex fixes (Thomas Gleixner, including the changes previously discussed to fault_in_user_writeable, fixing some incorrect assumptions in the previous patches regarding access_ok and whether a RW mapped region could go away under us), UWB (David Vrabel, trivial fixes), omapfb (Imre Deak, support for new LCDs and miscellaneous fixes), reducing the time taken for a single cpu online operation (Gautham R Shenoy, pseries), some networking updates (David Miller, mostly regular fixes), and a simplification of scripts/extract-ikconfig (Dick Streefland) removing the need for a special binary simply to extract a kernel config, which can be done in the bash script instead. David Airlie did try to post some drm-fixes, but it’s probably too late in the merge window at this point, as he noted.

Finally today, Mathieu Desnoyers asked about relicensing the LTTng, marker, and tracepoints code under a dual license to include the Lesser GPL (LGPL v2.1). Although not objected to outright, some wondered why this was necessary, to which Mathieu responded that he wanted to allow userspace code that wanted to link to non-GPL code to still use the LTTng codebase.

The latest kernel release is 2.6.31-rc1, which was just released by Linus. Overall, Linus is extremely happy with how this merge window has gone. He adds, “On the whole? Tons of stuff. Let’s start testing and stabilizing.”

Stephen Rothwell posted a linux-next tree for June 24th. Since Tuesday, the fixes tree contains two commits for fbdev and UML, the rr tree gained a conflict against Linus’ tree and the dwmw2-iommu tree lost its conflicts. The PowerPC tree still fails to build in an allyesconfig configuration.

That’s a summary of today’s Linux Kernel Mailing List traffic. For further information, visit www.kernel.org. I’m Jon Masters.

2009/06/23 Linux Kernel Podcast

June 24th, 2009 jcm

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090623.mp3

Do you pine for the days when men were men and wrote their own device drivers? Well, do you, punk?

For Tuesday, June 23rd 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: the continuing 2.6.31 merge window, IDE, Intel Trusted Execution Technology support, RCU, SFI, System Call tracepoints, and removing perl from the kernel build process.

The Continuing 2.6.31 merge window

per-bdi writeback threads. Jens Axboe wondered what was going to be done about his per-bdi writeback patch series, which are apparently “looking good” and have been in linux-next for almost a week without any problem reports. Jens wasn’t the only person wondering where his patches went. K. Prasad mailed to ask what was the plan with regard to his Hardware Breakpoint Interfaces, especially considering that (apparently), most of the previous concerns from Ingo Molnar and others have now been addressed in -tip.

Architecture updates include: Super-H from Paul Mundt (mostly SMP fixups), blackfin (Mike Frysinger, including a fair number of fixes from himself), and the new S+core architecture that was mentioned in this podcast previously. This ARM-reminiscent architecture is living in Arnd Bergmann’s tree at the moment while its author (Liqin Chen, who seems to be doing a great job figuring out Linux kernel development, including being the first user of Arnd’s new asm-generic defined ABI) figures out where to keep the 200kb tree. Currently, S+core runs LTP and a limited userland, but not against this tree.

Miscellaneous updates include: backlight updates (Richard Purdie, including a trivial kmalloc fix), LED driver support (also Richard Purdie, essentially bug fixes but also one new driver), infiniband (Roland Dreier), SCSI updates (James Bottomley, mostly a small set of driver updates and fixes), asm-generic fixes (Arnd Bergmann, whose tree is also hosting a new architecture port), fixes to watchdog (Wim Van Sebroeck), libata updates (Jeff Garzik), run-time power management of IO devices (Rafael J. Wysocki), and Kprobes jump optimization (replacing int3 breakpoints on x86 with jumps). For those interested in a discussion of git usage in kernel development, take a look at Linus’ replies to the “fix for shared flat binary format in 2.6.30” thread.

Non-merge specific concerns

IDE. Following yesterday’s rather abrupt announcement that David Miller would be taking over IDE maintainership, Tuesday brought a number of clarifications. First, Bart and David showed public support for one another, with Bart saying he would have more time for “other projects”, and David explaining that this was all amicably done (so nobody need worry about the event – the sky is not falling and we can move on with our lives). Secondly, David posted a new patchwork location for those wishing to track IDE patches in the future.

Intel Trusted Execution Technology support. Joseph Cihula posted version 5 of a patchset implementing Trusted Execution Technology support for Linux. As I have previously discussed, this patch series is intended to safeguard against a compromised system via ensuring all of the elements in the boot path are secure from attack. The technology aims to verify that the bootloader is secure, which then verifies the kernel, and so forth. It is implemented in the form of a tboot “kernel” loaded by the bootloader that sets up the dynamic root of trust via a special GETSEC[SENTER] processor instruction and then causes the real kernel to be loaded, after it has been verified.

RCU. Paul McKenney posted a -tip proof of concept version of RCU designed for non-SMP, embedded systems, aiming to be small in footprint (as little as a quarter the size in memory use terms as other RCU options, according to the benchmarks that he attached to the posting). In addition to Paul’s patches, Jesper Dangaard Brouer posted a 10 part patch series aiming to ensure correct usage of rcu_barrier on module unload. Let’s remind ourselves that this issue was “discovered” last week and so far has resulted in a few fixes to David Miller’s net tree, with more to follow, including these patches.

SFI. Len Brown posted a patch series implementing a new “Simple Firmware Interface” (for which a talk is forthcoming at next month’s Linux Symposium), which seems to be a simplified version of ACPI, currently with a single chipset implementation. While the idea is certainly interesting, Matthew Garrett was concerned that having essentially another ACPI (sub)implementation in the kernel was setting a precedent for more to follow. He preferred codebase sharing as the starting point, allowing for other sub-ACPI variants.

System call tracepoints. Jason Baron posted version 2 of his patch series implementing system call tracepoints. It includes the ability to toggle entry/exit tracing of each system call via the usual events/syscalls/syscall_blah/enable type interface. Since the previous version, Jason has added a number of fixes (locking, static allocation, etc), including support for system calls that take no arguments.

Finally today, Rob Landley posted a three part patch series removing the use of perl from the 2.6.30 build, and replacing the offending perl script (kernel/timeconst.pl) with a much shorter (a quarter of the size) shell script that does the same thing. Separately, Benjamin Herrenschmidt continued the good fight figuring out how to make early SLAB initialization work on PowerPC. Amongst his findings was a need to move cpu_hotplug_init early enough, to which Linus responded that this could just be statically initialized.

In today’s announcements: RT version 2.6.29.5-rt22. Thomas Gleixner announced version 2.6.29.5-rt22 of the -rt patchset. The announcement contains three kinds of fixes – a network live lock fix, disabling preemption over the atomic section of iomap, and identifying false positives in the softirq pending check (caused by a CPU going idle with the softirq pending bit of a blocked softirq thread still set).

The latest kernel release is 2.6.30, which was released by Linus on June 9th.

Stephen Rothwell posted a linux-next tree for June 23rd. Since Monday, he added a fix for an fbdev exposed compiler bug, the slab tree lost its build conflict, and yes, the powerpc tree continues to fail in an allyesconfig build configuration. The total sub-tree count remains steady at 130 trees.

That’s a summary of today’s Linux Kernel Mailing List traffic. For further information, visit www.kernel.org. I’m Jon Masters.

2009/06/22 Linux Kernel Podcast

June 23rd, 2009 jcm

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090622.mp3

Would you be prepared if gravity reversed itself? The only thing I can’t figure out is how to keep the change in my pockets.

For Monday, June 22nd 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: the continuing 2.6.31 merge window, the IDE tree, documenting the rc-series and the merge window, taking a core dump, kernel boot delay madness, IO scheduler based IO controllers, and some feedback.

The Continuing 2.6.31 merge window

Device Mapper. Alasdair Kergon posted a series of device mapper patches. These are mostly a consolidation of various fixes, including ioctl support cookies for udev, and some documentation updates.

Firewire. Stefan Richter posted a series of firewire (IEEE1394) updates. In this posting, Stefan notes that the “new” stack is now preferable over the legacy one, except in the case of audio devices (for which he notes that it is possible for distributions to package both stacks in their releases).

Networking. David Miller posted a series of updates to the networking stack (including an indirect reminder that John Linville is away at the Wireless summit at the LinuxTag in Berlin this week). Amongst the updates were lots of fixes, and he still expects some netfilter regression fixes to follow. Separately, David wondered aloud how the NMI watchdog and NOHZ might interact badly if a system were truly idle (triggering the NMI watchdog unnecessarily), but ultimately convinced himself that he was looking for another SPARC cause.

NFS private namespaces. Trond Myklebust reposted his private namespaces for NFS series of patches. When these are applied, the kernel gains the ability to create a private mount namespace that is not visible to user processes. These were originally targeted for 2.6.31 and in the absence of objections, Trond is hopeful that they will be ACKed and accepted forthwith.

PCI. Jesse Barnes posted an updated PCI git repository. He points out that the latest updates are “much less aggressive” than those targeting 2.6.30, although he noted that the latest tree does include AER (Advanced Error Reporting) enhancements, especially on multiple error conditions.

Performance counters. Ingo Molnar responded with a series of replies to a long series of replies from Stephane Eranian concerning Ingo’s posting of Performance counters patches for the merge window. There were many different comments here, but they showed a difference in opinion between the various potential users for performance counters. Ingo states that his main concern is making tools (such as perf) a “useful solution to developers/users”, which is “a key area where…perfcounters and perfmon differs”. He also notes that it aims to be “‘Oprofile done right’ and ‘pfmon done right’”. The thread makes for some interesting reading if you are interested in performance counters (and, honestly, who listening to a podcast such as this one wouldn’t be?).

Architecture updates include: s390 (Martin Schwidefsky).

Miscellaneous updates include: irqfd/eventfd patches from Gregory Haskins, suppressing page allocator warnings about order >= MAX_ORDER when the code causing this is doing the right thing and intentionally gets the warning (by adding a new __GFP_NOWARN kmalloc flag), exofs/osd tree updates from Boaz Harrosh, and a rather interesting one from Krzysztof Mazur noting that arch_get_unmapped_area() in the generic core doesn’t correctly ensure that the address it returns is greater than TASK_UNMAPPED_BASE.

Non-merge specific concerns

The IDE tree. David Miller and Bartlomiej Zolnierkiewicz had a “discussion” in which David expressed some frustration about the lack of testing of various bug fixes, to which Bartlomiej suggested that David might take over IDE, which David offered to do on the spot. He posted a new IDE tree address, and various folks sent well wishing mails. But Linus wasn’t so keen on the situation, saying that he really didn’t want to take the tree David Miller had put together. Quoting Linus, “I really don’t want to take this. I think you [David] and Bartlomiej should spend a _lot_ more time and effort trying to resolve this. Me taking it just closes the doors for trying to be constructive about issues.” There followed a debate about the current users of IDE vs. PATA and libata, and why more people don’t just move away from legacy IDE code. Arnd Bergmann pointed out that a number of architectures (especially those without dma-mapping.h support, often true for the NOMMU architectures) can’t use libata at all.

Documenting the rc-series and merge window. Luis R. Rodriguez reposted his quite excellent documentation on the rc-series and merge window process. In the latest version, he adds the average time between the last ten releases (86.0 days currently).

Taking a core dump. Neil Horman posted an interesting little patch that aims to fix three deficiencies in the current core dumping code. Firstly, he fixes recursive dump handling (where the dump handler specified in core_pattern actually crashes while it is helping us to take the full dump). Secondly, Neil allows the core_pattern process to complete, waiting for it in case it wants to poke at the procfs entries for the crashee process. Finally, he adds a brand new sysctl called core_pipe_limit that bounds parallel core dumps.

Kernel boot delay madness. David Miller objected to a patch from Simon Arlott adding yet another boot parameter to the kernel, this time to obviate a reset delay (possibly 2 seconds in duration) for physical network PHYs that have already been initialized on boot. David objected that “this is getting out of control” (referring to the boot delay parameter craziness), adding “We’re not going to add a hundred different obscure module options to eliminate delays and device resets”.

IO scheduler based IO controllers. Vivek Goyal followed up to his Friday posting concerning the latest iteration of his IO scheduler IO controller, noting that he had not done testing with AIO (Asynchronous Input Output). A dialog ensued between Vivek and Jeff Moyer over the best options to use for benchmarking to ensure that DIRECT IO was also being requested.

Finally today, thanks for the feedback on this podcast. It really means a lot to me that I’m providing something of some value to the community, and having a little fun in the process, especially at 4am on a Sunday morning. Do drop me a line and let me know what you think! If you’d be willing to record a few words about what you work on for me during the OLS or Plumbers conference, please do let me know, or just find me at the event and we’ll hook it up.

In today’s announcements: git version 1.6.3.3. Junio C Hamano announced git version 1.6.3.3, which includes fixes for cygwin, memory leaks, and a number of other fixes.

The latest kernel release is 2.6.30, which was released by Linus on June 9th.

Stephen Rothwell posted a linux-next tree for June 22nd. Since the previous day, two new trees were added for davinci and my hwlat hardware latency detector. Stephen also pulled in the “new” ide tree, although that might change if the discussion is revived following Linus’ comments. Today’s tree is mostly free of conflicts, and yes, powerpc still fails to build in an allyesconfig build configuration. The total sub-tree count is now up to 130 trees, due to the addition of the two aforementioned new sub-trees.

That’s a summary of today’s Linux Kernel Mailing List traffic. For further information, visit www.kernel.org. I’m Jon Masters.

2009/06/21 Linux Kernel Podcast

June 22nd, 2009 jcm

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090621.mp3

This podcast is brought to you in part by way too many California strawberries.

For the weekend of June 21st 2009, I’m Jon Masters with a summary of the weekend’s LKML traffic.

In today’s issue: The continuing 2.6.31 merge window, the “Ceph” distributed filesystem, IO scheduler based IO controllers, poisonous hardware, transcendent memory, and ksplice tainting.

The continuing 2.6.31 merge window

Core kernel. Ingo posted a few updates to the core kernel. Amongst these was a bugfix developed in collaboration with Thomas that included a new function named get_user_writeable for use by the futex code (which can’t rely upon the existing access_ok for private futexes). A dialog ensued between Linus, Ingo Molnar and Thomas Gleixner concerning use of get_user_pages_fast() in this code, which Linus pointed out could be replaced with a single instruction on Intel-esque systems at any rate.

DRM. Dave Airlie posted a final drm tree for 2.6.31. Amongst the major changes was a switch in the AGP code to use arrays of pages instead of arrays of unsigned long. Quoting Dave, “since pageattr grew patch array interfaces this is possible and should solve GEM on PAE issues”.

KVM Support for 1GB pages. Joerg Roedel posted version 3 of a patch series that gives KVM the ability to support 1GB pages. This relies upon nested paging support, a feature of modern CPUs which behaves very similarly to an additional level in the global page table hierarchy. The patch series relies upon exporting vma_kernel_pagesize to modules.

Per-cpu. Ingo Molnar responded to yesterday’s “percpu for 2.6.31” pull request posted by Tejun Heo (that had gotten slightly warped in the posting and caused Linus to be slightly unhappy), pleading with Linus and company to reconsider taking the per-cpu changes due to the fact that the patches had been posted in a timely fashion, and the sheer amount of work Tejun will be committed to if he must maintain them for yet another cycle (170 files worth of changes).

Performance counters. Paul Mackerras noted that architectures like PowerPC64 define __u64 to be unsigned long rather than unsigned long long, which causes compiler warnings every time one prints such a value with the print format string of %Lx. To correct this, Paul posted a patch to these userspace tools providing their own implementation of the definition of types such as u64.

RCU. Paul E. McKenney posted version 8 of his “big hammer” expedited RCU grace periods patchset. This patchset uses the existing per-CPU migration kthreads, which are awakened in a loop and waited for in a second loop, in order to expedite the passage of an RCU grace period. Apparently, this patchset can reduce RCU grace periods to 40us on an 8-CPU POWER machine.
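The “wake everything in one loop, wait in a second loop” pattern can be sketched in userspace. This is only an illustration in Python, not Paul’s kernel code: ordinary threads stand in for the per-CPU migration kthreads, the function name is made up, and nothing here models what the real kthreads do to force a quiescent state.

```python
import threading
from typing import List


def wake_all_then_wait(n_cpus: int) -> List[int]:
    """Wake one stand-in thread per CPU, then wait for each to acknowledge."""
    wake = [threading.Event() for _ in range(n_cpus)]
    done = [threading.Event() for _ in range(n_cpus)]
    acknowledged: List[int] = []
    lock = threading.Lock()

    def per_cpu_thread(cpu: int) -> None:
        wake[cpu].wait()              # sleep until kicked
        with lock:
            acknowledged.append(cpu)  # record that this "CPU" has run
        done[cpu].set()

    threads = [threading.Thread(target=per_cpu_thread, args=(cpu,))
               for cpu in range(n_cpus)]
    for t in threads:
        t.start()

    # First loop: wake every thread without waiting on any of them.
    for event in wake:
        event.set()
    # Second loop: only now wait for each thread in turn.
    for event in done:
        event.wait()

    for t in threads:
        t.join()
    return sorted(acknowledged)


print(wake_all_then_wait(8))
```

Splitting the work into two loops lets all the threads make progress in parallel, rather than waking and waiting on them one at a time.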

Syscall tracepoints. While it is yet to be decided exactly when Jason Baron’s proposed syscall tracepoints will make it in, Li Zefan did use the opportunity to discover a bug in seqfile handling in the kernel trace infrastructure for which he posted a series of patches.

David Miller noted that stack backtrace support had broken sometime in the past day or so, which Stephen Rothwell was already aware of. Stephen forwarded a patch from Mike Frysinger that fixed it, which was also good news for Ingo.

Miscellaneous updates include: MMC updates (Pierre Ossman), Cryptography (Herbert Xu), ALSA (Takashi Iwai), NFS (Trond Myklebust, including support for version 4.1 of the NFS standard), Watchdog (part 2, apologies for not having space to mention part 1 yesterday), the usual level of tree posting insanity from Ingo (IRQs, scheduler – including another attempt to hide runqueues from those that would poke at them, timers, tracing, and x86), IDE (Bartlomiej Zolnierkiewicz), input updates (Dmitry Torokhov) and some kbuild fixes from Sam Ravnborg.

Architecture updates include: PowerPC (Benjamin Herrenschmidt), Blackfin (Mike Frysinger), and Microblaze (fixing a build problem caused by the previous round of Microblaze architectural updates).

Non-merge specific concerns

Ceph distributed filesystem client. Sage Weil posted a 21 part patch series implementing a “Ceph” distributed filesystem client, in the staging tree. “Ceph” is apparently a distributed filesystem designed for reliability, scalability, and performance, which relies on btrfs underneath. It features the usual kinds of things – data replication, no single points of failure, and fast recovery from node failures, although the fact that it’s only just going into the “staging” tree obviously means you shouldn’t rely on this client for critical stuff at this point. Separately, Greg posted a large number of changes to Linus for the “staging” tree (and by large, we mean 658 files changed, 165585 insertions, and 240493 deletions). Quoting Greg, “We are removing more crap than we are adding, looks like progress to me!”.

IO Scheduler based IO Controller. Vivek Goyal posted version 5 of his IO scheduler IO controller patchset. This patchset aims to introduce an ability to assign and control IO bandwidth consumed by tasks through IO throttling. A number of additional changes have been made since version 4, but these are mostly fixes and it looks like the patchset is stabilizing now.

Poisonous Hardware. Fengguang Wu posted version 6 of his HWPOISON patchset. This version has many of the changes discussed previously in this podcast. Included amongst those are the switched default to “late” kill except for those processes that have specifically requested an “early” kill via a per-process tunable option, as proposed by Nick Piggin and Hugh Dickins. Other changes include killing off the “uevent” emission idea, tainting the kernel on poisoned page detection, and not “mess”ing with dirty/writeback pages for now.

Transcendent memory (“tmem”). Dan Magenheimer posted a 4 part patch series (first as an email attachment, then as a normal series), implementing what he described as “tmem” for Linux. Essentially, this is support for transient memory of a “dynamically variable size”, addressable only indirectly by the kernel, and which might disappear without warning. It may seem (on the face of it) to have little utility, but the application is in virtual machines (or other non-virtualized environments, including hotplug memory, SSDs, page cache compression, and even highmem on non-highmem kernels and using spare VRAM) being provided with memory for caching (and similar purposes) that might be taken away at any moment without any warning. Since it requires kernel assistance, its application is mostly for in-kernel caches. The patch series is fairly comprehensive, and there will be a talk on the design on the first day of the 2009 Linux Symposium in Montreal, Canada.

Finally today, the ksplice guys requested a new TAINT flag so those loading ksplice updates into their kernels would be able to detect this easily (especially vendors of those concerned). Peter Zijlstra objected on the grounds that ksplice isn’t upstream, although it does still seem (to this author) that it would be a worthwhile thing to have in mainline anyway.

The latest kernel release is 2.6.30, which was released by Linus on June 9th.

Stephen Rothwell posted a linux-next tree for June 19th. Stephen added one fix (for symbol checking, affecting ARM), and noted that Linus’ tree gained a build failure due to a compiler bug (for which he reverted the offending commit). A few other trees lost conflicts, and the tree continues to fail to build for those seeking an allyesconfig build configuration on PowerPC. The total number of sub-trees remains steady at 128 again today (apologies for missing the total in yesterday’s summary podcast).

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

Categories: episodes Tags:

2009/06/18 Linux Kernel Podcast

June 22nd, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090618.mp3

Support for this Podcast comes from an unhealthy amount of coffee. Mine’s a double Americano, what’s yours?

For Thursday, June 18th 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: The continuing 2.6.31 merge window, direct mmap for FUSE/CUSE, racing in TCP receive, problems with sys_mount(), and kernel.org front page kernels.

We’re playing catchup here, largely because this is the first merge window this podcast has had to cover and it takes, well, a certain mind set.

The Continuing 2.6.31 merge window

Dynamic per-cpu. Tejun Heo posted an updated per-cpu git tree for 2.6.31 that takes into account many of the recent per-cpu fixes (including dynamic allocation of per-cpu data). Linus objected to the tree on the grounds that it hadn’t been in linux-next, and had been created only moments before posting with (potentially) little time for testing. Andrew Morton re-affirmed the lack of linux-next usage, adding “If this doesn’t mean ‘you missed 2.6.31’ then what does?” (he did also observe that there are some special cases such as this where some critical core kernel feature is modified and it’s not just “an ordinary old git merge like all the others”). The situation was clarified by Tejun: the git tree was being created from quilt patches that had been posted a number of times already, but there had been a glitch in the quilt import. He agreed that the lack of exposure in linux-next warranted delaying until 2.6.32 and stated that he would prep a tree for Stephen to pick up in linux-next soon.

Making executable pages the first class citizen. This podcast has covered this patch series several times before, but it is worth noting some feedback since this has now hit mainline, as Jesse Barnes pointed out. He found that one of his sample workloads went from creating an unusable machine to simply a slightly sluggish machine. Fengguang Wu was happy to hear this, but keen to point out that Rik van Riel had also helped with his work protecting active file LRU pages from being flushed by streaming IO. On a VM tangent, Fengguang Wu also posted in response to the ongoing HWPOISON patchset with a modified version of the “only early kill processes who installed SIGBUS handler” patch, which only does so for processes that register an interest in doing so via a prctl. This allows applications to easily be modified, without breaking existing expectations of applications currently deployed in the field.

Fixing returning from the kernel to tasks with a 16-bit stack. Alexander van Heukelum posted a detailed explanation and patch series, describing a bug in the kernel support (on x86 systems) for returning from the kernel into userspace tasks that use a 16-bit stack. Obviously, this doesn’t happen too often, but it does in emulation software such as WINE and dosemu. Due to a quirk in the manner in which an Intel processor restores state in such situations, only the lower 16 bits of the userspace stack pointer are restored, while the upper 16 bits are kept from the kernel stack. The kernel has an existing special “espfix” segment that is abused to ensure that the upper 16 bits of the returning stack pointer will be correct, but this wasn’t always being set up correctly, especially not in a return from NMI.
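The stack pointer quirk is easier to see with a little bit arithmetic. What follows is a minimal sketch in Python, not kernel code: the function name and both sample addresses are invented for illustration, and it models only the mixing of the two halves, not the espfix workaround itself.

```python
def esp_after_return_to_16bit_stack(kernel_esp: int, user_esp: int) -> int:
    """Model the quirk: the low 16 bits come from userspace,
    the high 16 bits are left over from the kernel stack pointer."""
    return (kernel_esp & 0xFFFF0000) | (user_esp & 0x0000FFFF)


kernel_esp = 0xC1234F00  # hypothetical kernel stack pointer
user_esp = 0x0000FFF0    # hypothetical 16-bit userspace stack pointer

resulting_esp = esp_after_return_to_16bit_stack(kernel_esp, user_esp)
print(hex(resulting_esp))  # prints 0xc123fff0
```

The upper half of the result is whatever the kernel stack pointer happened to contain, which is why the kernel goes to the trouble of arranging for those bits to be correct before returning.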

Architecture updates include: microblaze (generic headers switch), and Super H fixes from Paul Mundt. On a tangent, it looks like John Williams (the author of the microblaze port) has got a new .com email, possibly indicating a move.

Miscellaneous updates include: md updates from Neil Brown (including support for non-power of two chunk sizes in RAID0), ftrace updates from Steven Rostedt (including support for bypassing read locks inside the NMI handler – as you may know, Steven’s unique page swapping on read means we only need a lock on read, not on write to an active ring_buffer), a trivial documentation update to kthread_stop from Oleg Nesterov (reminding everyone that kernel threads can now call do_exit and be kthread_stop()ed, the two were previously mutually exclusive), cleanups to MAINTAINERS from Joe Perches, ext4 updates from Ted Ts’o, some relatively straightforward network stuff from David Miller (including wireless bits from John Linville, and bug fixes for NetXen and E100), and minimal HTC Dream Support (Google Android) via a reposted patch series from Brian Swetland (including some patches signed off by the somewhat quieter these days Robert Love).

Apologies to Gregory Haskins for not covering the latest iteration of his irqfd and eventfd work in detail, since it hasn’t changed hugely. But if you’d like to read about precisely how network packets are received and routed to KVM via vbus, take a look at the latest eventfd thread.

Non-merge specific concerns

Implementing direct mmap for FUSE/CUSE. Tejun Heo was busy today. In addition to posting per-cpu updates, he also posted the third version of a patchset implementing direct mmap support for FUSE/CUSE. This allows users of a FUSE filesystem to request an mmaped region, which will be satisfied on the backend by a kernel anonymous mapping, and still populated by the FUSE userspace server. The server gets to decide how mappings are shared so this has additional performance benefits for those implementing on FUSE/CUSE.

A rare race in TCP receive. Jiri Olsa posted to say that he had found a rare race in the TCP layer using an older RHEL4 kernel (that happens to be based upon 2.6.9, which is fairly long in the tooth). It turned out that, because of a missing smp_mb() and a combination of known errata in certain Intel CPUs, it was possible for tp->rcv_nxt updates made by one CPU to not propagate correctly to the others and result in a system sleeping forever. Jiri posted a patch citing the various errata, documentation, and including a fairly comprehensive analysis of the situation, although he said that he could not reproduce this upstream due to the rarity of its occurrence.

Fixing an overflow in sys_mount(). Today’s tip of the hat goes to Vegard Nossum, who diligently tracked down a bug reported by Ingo Molnar. It turns out that kernel code calling sys_mount() can be bitten by the fact that the aforementioned function will copy an entire page passed for the “type” parameter, even though less data is typically required for this string. If the content of the page happens to contain stray “wild” pointers, we might follow those and wreak some random havoc. Vegard (obviously) suggests stopping after we find the first NULL.
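The difference between the two copying strategies can be sketched as follows. This is an illustration in Python rather than the kernel’s actual copy routine: the helper names, the 4096 byte page size, and the contents of the sample page are all assumptions made for the example.

```python
PAGE_SIZE = 4096  # assumed page size for this illustration


def copy_whole_page(page: bytes) -> bytes:
    """Copy a full page regardless of where the string inside it ends."""
    return page[:PAGE_SIZE]


def copy_until_terminator(page: bytes) -> bytes:
    """Copy only up to (not including) the first zero byte in the page."""
    end = page.find(b"\x00", 0, PAGE_SIZE)
    if end == -1:
        end = min(len(page), PAGE_SIZE)
    return page[:end]


# A hypothetical page: the filesystem type string followed by stale bytes.
page = (b"ext3\x00" + b"\xde\xad\xbe\xef" * PAGE_SIZE)[:PAGE_SIZE]

print(len(copy_whole_page(page)))   # the stale bytes come along too
print(copy_until_terminator(page))  # only the string itself
```

Stopping at the terminator means whatever happens to follow the string in the page is never looked at.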

Finally today, Randy Dunlap resurrected an email thread from several weeks ago in which it was proposed that references to the old “mm” tree be removed from the front page of kernel.org. He added that 2.2 kernels might go the same way.

The latest kernel release is 2.6.30, which was released by Linus on June 9th.

Stephen Rothwell posted a linux-next tree for June 18th. Since Wednesday, the tree contains a few fixes, some conflicts due to deltas between Linus’ ongoing changes to his tree and developer trees, and the tree still fails to build in an allyesconfig build configuration for powerpc.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.


2009/06/17 Linux Kernel Podcast

June 20th, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090617.mp3

Support for this Podcast comes from the humble Blueberry. Did you know that a mere 4 pints of blueberries for breakfast can be a healthy form of OCD?

For Wednesday, June 17th 2009, I’m Jon Masters with a summary of the day’s LKML traffic.

In today’s issue: the continuing 2.6.31 merge window, changing the NOHZ idle load balancing logic, OpenAFS pioctls, MCE, and scsi_wait_scan configuration.

Apologies for the tardiness of today’s production. Your author is currently preparing updates to cover Thursday and the weekend podcast update and hopes to get back into the swing of things next week. I guess the merge window really is that unpleasant to keep up with – bear with me, I’ll get there. I expect to introduce more automation and tracking, and filtering, in time.

The Continuing 2.6.31 merge window

Poisonous Hardware. Fengguang Wu posted a policy change RFC patch, in which the HWPOISON code would only “early kill” (that is to say, before an unrecoverable error has occurred) processes that had installed a SIGBUS handler. This would allow certain applications (that caught SIGBUS) to recover from corruption of (for example) single pages within internal caches and other non-critical (isolatable) data. This might include, for example, the KVM (Kernel Virtual Machine) Hypervisor, Oracle’s database software, or similar programs using extensive internal caching to recover on memory errors.

Early SLAB allocation. Pekka J Enberg posted a series of SLAB updates for 2.6.31, which, remember, include the new early SLAB allocator approach. In a separate mail thread, Linus Torvalds suggested that “All the recent init ordering changes should mean that the slab allocator is available _much_ earlier – to the point that hopefully any code that runs before slab is initialized should know very deep down that it’s special, and uses the bootmem allocator without doing any conditions what-so-ever”. Ben Herrenschmidt (the maintainer of the PowerPC architecture port) responded that, while he would normally agree with this, there are a number of hairy skeletons in the PowerPC port closet that prevent this from being true…yet. He pleaded for more time before things like slab_is_available() are taken away from him, and he’s probably not the only person who will be affected in such a migration.

e820 table reservations. e820 is a standard BIOS extension used by a PC-based Operating System, such as Linux, to query the system physical memory map, for example to determine where certain standard resources are located. The existing e820 parser in the kernel doesn’t handle regions marked as EFI_RESERVED_TYPE, so they might be recorded as usable. A patch from Cliff Wickman changes this by marking such regions as E820_RESERVED.

Searching for empty slots in resource trees. In PCI, we use BARs (Base Address Registers) to program devices with a range of the system (PCI) address space to use for interaction with the host system. For example, a card providing a large buffer needs to have that buffer mapped somewhere in memory. Andrew Patterson noticed that the function pci_assign_resource(), which calls find_resource and is used to allocate address ranges for PCI device BARs in the parent bridge’s resource tree during hot add operations, only checks the immediate children and siblings of the root resource passed. In certain topologies where a resource (that is to say, range of memory) is only available further down the resource tree, the existing algorithms can fail to allocate an acceptable resource. Andrew posted a patch that modifies find_resource and allocate_resource so that they recursively descend the entire tree instead. Others (including Linus Torvalds) expressed some concern that Andrew’s patch might be curing symptoms rather than the actual disease, since the situation described shouldn’t easily be arising. Later, Matthew (willy) Wilcox posted a series of four patches covering this problem, fixing it by “changing where ia64 sets up the resource pointers in the root pci bus”.
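The difference between a shallow search and a recursive descent can be sketched with a toy resource tree. This is an illustration in Python of the general idea only, not the kernel’s find_resource: the Resource class, its is_window flag, the function names and the addresses are all hypothetical, and alignment constraints are ignored entirely.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Resource:
    name: str
    start: int
    end: int                  # inclusive
    is_window: bool = False   # hypothetical flag: may host child resources
    children: List["Resource"] = field(default_factory=list)


def find_gap(parent: Resource, size: int) -> Optional[int]:
    """Shallow search: look only between the direct children of parent."""
    cursor = parent.start
    for child in sorted(parent.children, key=lambda r: r.start):
        if child.start - cursor >= size:
            return cursor
        cursor = max(cursor, child.end + 1)
    if parent.end - cursor + 1 >= size:
        return cursor
    return None


def find_gap_recursive(parent: Resource, size: int) -> Optional[Tuple[str, int]]:
    """Try parent first, then descend into any child that is itself a window."""
    start = find_gap(parent, size)
    if start is not None:
        return parent.name, start
    for child in sorted(parent.children, key=lambda r: r.start):
        if child.is_window:
            found = find_gap_recursive(child, size)
            if found is not None:
                return found
    return None


# A bridge window fills the whole root; the free space is inside the bridge.
device = Resource("device", 0x1000, 0x17FF)
bridge = Resource("bridge", 0x1000, 0x1FFF, is_window=True, children=[device])
root = Resource("root", 0x1000, 0x1FFF, is_window=True, children=[bridge])

print(find_gap(root, 0x400))            # None: nothing free at the top level
print(find_gap_recursive(root, 0x400))  # ('bridge', 6144), i.e. 0x1800
```

In this toy topology the shallow search reports failure even though half of the bridge window is unused, which is the kind of case the recursive version is meant to catch.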

Dynamic per-cpu. Tejun Heo posted version 3 of his dynamic per-cpu patchset. Per-CPU is a mechanism wherein Linux kernel code can split certain data into a data area per CPU, so that hot-path code can quickly make updates without being concerned about the actions of other CPUs. Like it sounds, this patchset makes per-cpu data area allocations entirely dynamic, rather than a compile-time determination. At David Miller’s request, individual maintainers were removed from the CC list and substituted with the more generic arch maintainers list. Separately, Tejun posted a patch (entitled “teach lpage allocator about NUMA”) which “makes the percpu allocator able to use non-linear and/or sparse cpu -> unit mappings and then makes the lpage allocator consider CPU topology and group CPUs in LOCAL_DISTANCE into the same large pages”.

VFS patches, part 2. Al Viro posted a series of VFS patches, mostly targeting BKL (Big Kernel Lock) removal in both the VFS and in filesystems. The Big Kernel Lock (BKL) was introduced in the earliest days of Linux SMP support written by Alan Cox as a means to have an extremely coarse-level “kernel lock” (exactly one CPU could be executing kernel code at a time), but it has long since become a performance bottleneck and is slowly being removed. Previous kernels have attempted to replace it with a semaphore (which was reverted, again for performance related reasons), and the RT tree still does so. Separately, Jan Blunck posted a series of patches preparing for the VFS based union mounts. He and Val think these are good to go in separately.

PCI updates for 2.6.31. Jesse Barnes posted a summary of pending changes in his git tree. These include improved PCI AER (Advanced Error Reporting) support (refer to the pciaer-howto for further information), the removal of pci_find_slot, and a collection of the usual cleanups and fixes.

FireWire updates post 2.6.30. Stefan Richter posted a few IEEE1394 (firewire) updates for 2.6.31. These included the newer sysfs attributes mentioned previously that should lead to “simpler and saner udev rules”.

Miscellaneous updates include: some trivial fixes for the ksym_tracer from K. Prasad, V4L/DVB updates from Mauro Carvalho Chehab, kmemleak fixes from Catalin Marinas (who also wishes to rename kmemleak_panic to kmemleak_stop to avoid confusion over the use of the “panic” word), UBI and UBIFS fixes from Artem Bityutskiy, some exofs patches from Boaz Harrosh, and a patch series adding software (not hardware) counters for PowerPC 32-bit. Discussion continued on the idea of handling page faults on x86 with interrupts enabled, adding a little complexity to the interrupt handler but intending to reduce overall overhead in the process.

Non-merge specific concerns

Changing the NOHZ idle load balance logic. Venkatesh Pallipadi posted a two part patch series aimed at changing the NOHZ idle load balance logic from the “pull” model currently in use (in which one idle load balancer CPU is nominated to not go into NOHZ mode and ends up doing all the balancing work for CPUs in the NOHZ mode) to a “push” model in which busy CPUs can kick those that are idle (and in NOHZ mode) into taking care of idle balancing on behalf of a group of idle CPUs. Apparently, there are still some “rough edges”, and so this is an RFC for the moment.

OpenAFS pioctls. OpenAFS is an implementation of the Andrew distributed filesystem, which is especially popular with banks and international corporations. David Howells posted a 17 part patch series implementing an in-kernel pioctl system call, as used by OpenAFS. Alan Cox objected to the “ugly” nature of the ABI, and asked why David couldn’t instead use the C-library system call wrapper (all system calls end up with a small wrapper in the system C-library) to do what this system call would otherwise do using those already available. David replied that it was almost possible to do this, but that it got very hairy and that he also wanted the kAFS and OpenAFS implementations to be able to share userspace tools without recompiling.

MCE test coverage data. Huang Ying posted to let everyone know about his mce-inject test tool (with git repository) and about further test information being available on his kernel.org people page.

Finally today, the “lack” of a configuration option for scsi_wait_scan was addressed in the form of documentation (from Stefan Richter) explaining why it has intentionally been omitted. The SCSI wait scan module is used (especially by distributions, in their initrds) in order to wait for SCSI device enumeration activity to complete. It does this by simply not returning from module_init until the SCSI subsystem is ready to proceed. It is needed by some users and accidental removal can lead to hard to debug boot failures, although removing the config option does seem excessive.

In today’s announcements: Thomas Gleixner announced version 2.6.29.5-rt21 of the Real Time patchset. The latest version includes a fix for a rather unpleasant “lockup” scenario in the softirq handling code. There was no announcement for the previous -rt20 release due to this softirq issue.

The latest kernel release is 2.6.30, which was released by Linus on June 9th.

Stephen Rothwell posted a linux-next tree for June 17th. Since the previous day, the powerpc tree continues to fail to build in an allyesconfig build configuration, the ext4 build failure means that a version from Monday is being used, the v4l-dvb tree lost its conflict, and the KVM tree gained a build failure (due to PowerPC now using -Werror), for which Stephen applied a quick patch. Total tree count remains at 128 trees.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.


2009/06/16 Linux Kernel Podcast

June 18th, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090616.mp3

Correction: Due to an editing error, the June 16th edition of the LKML Podcast incorrectly stated that Pekka Enberg was the driving force behind a push for GFP_BOOT. In fact, Nick Piggin is the primary push behind that, while Pekka has stated several times that he is in fact comfortable with either approach.

For Tuesday, June 16th 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: the continuing 2.6.31 merge window, bulk CPU hotplug, and interrupts during pagefault.

The Continuing 2.6.31 merge window

Kernel development times. Greg Kroah-Hartman and Luis R. Rodriguez had an exchange of emails concerning Luis’ new rc-series and merge window docs. Greg questioned Luis’ figures for previous kernel release dates and development times, and Luis ultimately accepted Greg’s version of events. In Greg’s figures, over the last 10 kernel cycles, the minimum development time between kernels was 68 days (2.6.20) and the maximum was 108 (2.6.24). This places the next kernel release sometime in the early days of September.
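The arithmetic behind that estimate is easy to check. This is a small sketch in Python using only the figures quoted above plus the June 9th release date of 2.6.30; taking the midpoint of the two extremes is a simplification made for this example, not Greg’s method.

```python
from datetime import date, timedelta

release_2_6_30 = date(2009, 6, 9)   # 2.6.30 release date
shortest_cycle = 68                 # days (2.6.20)
longest_cycle = 108                 # days (2.6.24)

earliest = release_2_6_30 + timedelta(days=shortest_cycle)
latest = release_2_6_30 + timedelta(days=longest_cycle)
# Midpoint of the two extremes: an assumption used here for illustration.
midpoint = release_2_6_30 + timedelta(days=(shortest_cycle + longest_cycle) // 2)

print(earliest)  # 2009-08-16
print(midpoint)  # 2009-09-05
print(latest)    # 2009-09-25
```

A cycle of typical length therefore lands in the first week or so of September, with mid-August and late September as the outer bounds.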

Early SLAB allocation. Nick Piggin and Ben Herrenschmidt continued a debate between themselves (with occasional others chipping in) concerning whether it was appropriate to introduce special “boot time” versions of the kmalloc and vmalloc function calls (or more specifically, adding special boot time GFP_ flags that should be passed if an allocation might take place early in boot). Ben Herrenschmidt pointed out that there are many points at which allocations might happen and we wouldn’t think to use special flags – for example, even during suspend/resume one might be trying to perform a memory allocation that blocks pending IO to a disk that has already since gone offline. Ben pointed out that, in such cases, it’s far more likely to work out for the best if infrastructure components automatically degrade such that (for example) kmalloc automatically uses GFP_NOIO once suspend has started.

USB. Greg Kroah-Hartman posted a large number of updates via his git tree and requested Linus merge. Amongst the updates were USB 3.0 support (see Sarah Sharp’s blog posting for the details), various new drivers, Unicode bugfixes, power management, and core code cleanups. There were a few non-USB related patches that the tree depends upon but these had all received the blessings of those subsystems affected. Greg also posted a series of driver core patches – most of which were minor in nature – and these included API cleanups and documentation.

Btrfs. Chris Mason followed up again concerning changes to the physical on-disk format for Btrfs, noting that newer kernels (those post 2.6.30) will roll forward existing filesystems to a format not supported by older kernels. In order to help developers who might be using Btrfs, Chris posted some rescue disk images based upon the Arch Linux 2.6.30 distro to his kernel.org pages. These contain enough filesystem checking tools to repair damage, as well as git, gcc, make, and enough to compile a kernel. Separately the Fedora folks posted to fedora-devel announcing that rawhide will be picking up the format change in due course, and reminding everyone that breakage is entirely possible, and that Btrfs won’t be ready for prime time for a year yet. Also on the filesystem front came some minor updates for OCFS2 from Joel Becker (although he noted that these were almost entirely fixes), and David Howells posted some updates to the AFS filesystem support code.

Kdump crashkernel breakage. Chris Wright pointed out that a recent change to CONFIG_PHYSICAL_START and CONFIG_PHYSICAL_ALIGN will impact those who follow the documentation to use their (relocatable) kdump crashkernel loaded with a 64MB or 128MB window at 16MB. Doing so will now interfere with the stock kernel because it has moved from the old default physical start of 2MB. Chris suggests that the problem here is that the documentation needs to be updated reflecting this change since 2.6.30, but sought input.

Adding formatting to WARN(). Linus Torvalds and Ingo Molnar debated Arjan van de Ven’s idea of adding “\n” formatting to the WARN() macro, for the ease of formatting in kernel log files (and less corruption to logs posted on kerneloops.org). Linus liked the idea as much as Ingo, but he felt that blanketly applying formatting to all users would adversely affect existing “naked” printk’s at this point, and he didn’t much like the idea of forcing those users to migrate to using KERN_CONT explicitly. So, in true Linus style, Linus wrote a cunning macro that tries to do the right thing, only adding a “\n” if a KERN_xyz level is included at the start of the string, and changing the implementation of KERN_CONT so that it can still be used for continuation.

On a related note, Mike Frysinger posted an RFC patch series implementing a series of useful new functions for printk()ing during initcalls. Rather than simply using printk() directly, these wrappers – which include, for example, pr_info_init() and pr_cont_init() (for printk continuation) – cause the accompanying string to be stored in a separate ELF section of the kernel linked binary image(s), so they can be unloaded as well as the initdata.

Also on the printk() front, Dave Young posted a generic version of the previous printk delay implementation for use during normal system operation. So, Linux now has an ability to insert delays between printk() messages on boot, on halt, and during normal operation with the use of a sysctl. This is specifically intended for certain kinds of embedded (and also similar) systems where it might not be easy to capture kernel output without a delay insertion.

Architecture updates include: Power management updates for s390, Blackfin, and SPARC. The latter gained dynamic per-cpu allocator support, and a new syscall. Jeremy Fitzhardinge posted some minor io_apic cleanups for x86 which he had noticed while pursuing his Xen work, these included further 32/64-bit merge fallout, loop restructuring, and comment fixing.

Non-merge specific concerns

Bulk CPU Hotplug support. Gautham R Shenoy posted an RFD patch series aimed at opening discussion surrounding the best way to move forward from the current CPU hotplug implementation. The current code allows one to online and offline a single “CPU” at a time, but this “CPU” might in fact be part of a multi-core processor or even larger package, where performing a whole series of CPU Hotplug events to take down the package is much slower than need be. Gautham posted some benchmarks (for PPC64 systems) and a fairly detailed proposal in which one could echo comma separated lists of CPUs to online or offline as a unit via the /sys/devices/system/cpu/online and /sys/devices/system/cpu/offline sysfs entries.
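Handling such a list is straightforward. This is a minimal sketch in Python of parsing and producing a comma separated CPU list of the kind described; the helper names are made up, only plain comma separated numbers are handled, and nothing is written to sysfs since the interface in question is still a proposal.

```python
from typing import Iterable, List


def parse_cpu_list(text: str) -> List[int]:
    """Turn a string such as "4,5,6,7" into a sorted list of CPU numbers."""
    cpus = set()
    for entry in text.strip().split(","):
        if entry:
            cpus.add(int(entry))
    return sorted(cpus)


def format_cpu_list(cpus: Iterable[int]) -> str:
    """Build the comma separated string for a group of CPUs."""
    return ",".join(str(cpu) for cpu in sorted(set(cpus)))


# Hypothetical example: the four cores of one package, taken down as a unit.
package_cpus = [7, 4, 5, 6]
request = format_cpu_list(package_cpus)
print(request)                  # 4,5,6,7
print(parse_cpu_list(request))  # [4, 5, 6, 7]
```

Under the proposal, a single string like this would replace four separate hotplug operations.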

Interrupts during page fault (to trap or not to trap?). As part of a thread entitled “perf_count: x86: Fix call-chain support to use NMI-safe methods”, Ingo Molnar, Mathieu Desnoyers, and others engaged in a lively discussion surrounding the overhead of disabling interrupts during page faults and re-enabling them afterward (a cli/sti cycle doesn’t come free). Currently, Linux uses x86 architecture “interrupt gates” rather than “trap gates” in order to ensure interrupts are disabled starting from the moment that a page fault condition is generated. This is in order to prevent the Intel architectural “CR2” control register from being “messed up” by other subsequent interrupts. But if this register state is saved on the kernel within the IRQ handler instead, then the overhead (in this case of a special purpose register – SPR – write) is moved from the page fault handler having to disable/enable interrupts into the interrupt handler, which will now have to write to CR2 under certain circumstances. Ingo performed various benchmarks and agreed with Mathieu that this was an overall win due to the order of magnitude more page faults than interrupts likely on a typical x86 system.

In today’s announcements: lio-utils v3.0 configfs HOWTO for v2.6.30. Nicholas A. Bellinger announced a new HOWTO for Linux-iSCSI.org Target v3.0 users.

The latest kernel release is 2.6.30, which was released by Linus last Tuesday.

Stephen Rothwell posted a linux-next tree for June 16th. Since Monday, the kmemleak tree was removed (since it had served its purpose of testing the newer kmemleak patches), the tree continues to fail to build in an allyesconfig powerpc build time configuration, and a large number of other trees lost conflicts as the merge process continues. The total tree count is now down to 128 sub-trees, with the removal of kmemleak contributing to that.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.


2009/06/15 Linux Kernel Podcast

June 16th, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090615.mp3

Correction: Oops. I meant “Kernel Mode Setting”, and not “Kernel Memory System”. I must have been smoking something at the time!

For Monday, June 15th 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: the continuing 2.6.31 merge window, kernel oops reports, and disabling the staging driver tree.

The Continuing 2.6.31 merge window.

Poisonous hardware. Fengguang Wu posted version 5 of the HWPOISON patchset. This version includes new “uevent” code designed to send events to userspace when memory corruption is detected. While many cleanups have indeed taken place and some are happy for the code to merge (because it does very little until an error occurs, and at that point a panic while handling the error is probably not much worse than simply ignoring the error condition), others have expressed concern that this needs more bake-time. They include Alan Cox (who feels it needs at least another cycle or two and some buy-in from other architectures – for example, the PowerPC folks), Nick Piggin (who echoes similar concerns about the level of change), and so forth.

Early SLAB. Pekka Enberg, Hugh Dickins, and Nick Piggin continue to discuss Pekka’s “early SLAB” work. This work aims to remove the use of bootmem and replace it (mostly) with a much earlier initialization of SLAB allocation. Some controversy has arisen over just how much early memory users should be aware of their situation. While Pekka argues that it is appropriate to implement and use a GFP_BOOT option to kmalloc, others believe that users of kmalloc should not have to be aware of low-level machine state (interrupts enabled, and the like). The counterargument is that users at this point (prior to running the initcalls list) should really know what they’re doing, are already using bootmem, and so can live with an extra GFP flag.

Scheduler. Fabio Checconi posted an 8-part patch series implementing an Earliest Deadline First (EDF) group level scheduler to support generic period assignments for Real Time tasks. Quoting Fabio, “With this patch, the rt_runtime and rt_period parameters can be used to specify arbitrary CPU reservations for RT tasks”. As an early RFC, this probably warrants further investigation by anyone with an interest. Separately, Lennart Poettering posted several versions of a patch series intended to extend the POSIX-mandated sched_setscheduler call with an optional additional policy parameter of SCHED_RESET_ON_FORK, which, as it implies, resets any elevated priority on subsequent task fork. No doubt this is useful to Real Time tasks like the PulseAudio sound daemon Lennart works on. I am similarly interested in revising an optional extension to mlockall for inheriting MCL_ flags.
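
For a feel of how a caller would use the proposed flag, here is a minimal Python sketch. It assumes the form in which the flag is exposed on Linux systems that support it (a bit ORed into the policy argument, with value 0x40000000); the patches under discussion may differ in detail. The helper names with_reset_on_fork and try_fifo_reset_on_fork are made up for illustration, and the priority of 10 is arbitrary.

```python
import os

# Linux value of the flag. Python only defines os.SCHED_RESET_ON_FORK on
# platforms that support it, so fall back to the raw value elsewhere.
SCHED_RESET_ON_FORK = getattr(os, "SCHED_RESET_ON_FORK", 0x40000000)


def with_reset_on_fork(policy: int) -> int:
    """OR the reset-on-fork flag into a scheduling policy value."""
    return policy | SCHED_RESET_ON_FORK


def try_fifo_reset_on_fork(priority: int = 10) -> bool:
    """Request SCHED_FIFO plus reset-on-fork for the calling process.

    Returns False rather than raising when the platform lacks the call or
    the process lacks the privilege that real-time policies usually need.
    """
    if not hasattr(os, "sched_setscheduler"):
        return False
    policy = with_reset_on_fork(os.SCHED_FIFO)
    try:
        # pid 0 means "the calling process".
        os.sched_setscheduler(0, policy, os.sched_param(priority))
    except OSError:
        return False
    return True
```

A daemon would call try_fifo_reset_on_fork() once at startup. Any helper processes it forks afterward would then start with default scheduling instead of inheriting the elevated priority.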

Networking. David Miller posted an updated net tree for the 2.6.31 merge. There were many great highlights, amongst them ongoing work on the 802.15.4 protocol stack (ZigBee support anyone?), conversion to net_device_ops finalization leading to a removal of legacy compatibility code, fixes, and a lot more SKB list abstraction work so that SKBs can ultimately use list_heads rather than, as David puts it, “our custom crap”. There is also an RFKILL rewrite in the tree, which John Linville’s wireless-next tree also uses.

Kbuild. Sam Ravnborg posted a few minor kbuild updates, including preparation for the vmlinux linker script cleanups that haven’t taken hold yet. Separately, Pekka J Enberg pointed out that a number of distributions are now enabling -Wformat-security for gcc by default, which causes a number of unnecessary warnings to be generated for the kernel, quoting “sometimes in cases where fixing the warning is not likely to actually fix a bug”. Pekka posted a patch to force disable building the kernel with these warnings.

KMS. Dave Airlie posted a git tree containing updates to Kernel Mode Setting (KMS). The latest tree introduces support for ATI Radeon and includes the initial TTM memory manager. I confess to being a little out of touch with the latest KMS work and hope that someone will elect to enlighten us with an article.

Time. Thomas Gleixner posted updates to the ntp, timers (including migration support for power saving), clockevents (including a new register_device exported function symbol), etc.

Architecture updates include: powerpc and the new S+Core architecture.

The S+Core architecture. Liqin Chen posted initial support for the S+Core CPU. This is a low power embedded 32-bit processor architecture created by Sunplus Core Technology (S+Core) that uses hybrid 16/32 bit instruction modes and parallel conditional code execution. The Sunplus Core Technology company was created in 2007 and the website currently has little in the way of additional documentation on the ISA or the value-add available in this “multimedia application processor” architecture. Since Liqin did not have access to a git repository of his own, the patches are contained within a branch of Arnd Bergmann’s “asm-generic” git tree (Arnd had signed off on the patches) – having such patches as a branch on an unrelated tree is highly unusual, but there’s no technical reason why they can’t be there.

Documentation. Luis R. Rodriguez noted a need to adequately document the rc-series and kernel merge window, so he posted an RFC patch adding such documentation into the kernel tree.

Arjan van de Ven posted a kerneloops.org report for the week of June 14 2009. In it, he thanked the kernel.org team for their newly provided hosting (replacing a previous virtual machine instance), and drew attention to new oops sightings amongst the XFI driver, and the i915 gem and DMAR code. For this week’s report, a total of 4026 oopses and warnings were reported for kernels 2.6.29 and later (earlier kernels were omitted from the report). Separately, Arjan proposed adding a “\n” to the existing kernel “WARN()” macro, saying that “many (most) users of WARN() don’t have a \n at the end of their string; as is understandable from the API usage point of view. But this means that the backend needs to add this \n or the warning message gets corrupted (as is seen by kerneloops.org).”
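
The corruption Arjan describes is easy to picture with a toy log writer. The following Python sketch is only an illustration of the idea, not the kernel macro or Arjan's patch; the function names and the sample messages are invented.

```python
def terminate_line(message: str) -> str:
    """Return the message with a trailing newline, adding one only if missing."""
    return message if message.endswith("\n") else message + "\n"


def write_log(messages: list[str]) -> str:
    """Concatenate messages the way a log buffer would, one record per line."""
    return "".join(terminate_line(m) for m in messages)


if __name__ == "__main__":
    # Hypothetical warnings: the first caller forgot its newline.
    warnings = ["device reset failed", "retrying with defaults\n"]

    # Without the backend fix, the two records run together on one line.
    print(repr("".join(warnings)))

    # With the backend adding the newline, each record stays on its own line.
    print(repr(write_log(warnings)))
```

Doing the fix-up in one place in the backend means callers do not each have to remember the newline, which is the point Arjan makes about the API.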

Finally today, there’s a little controversy over a patch to the RT tree that disables all “staging” drivers. These are drivers in the “staging” area of the kernel source (formerly a separate tree entirely) that are not ready for prime time but are included early in order to drum up support and encourage developers to get them ready for wider use. Many of these drivers barely work on regular Linux kernels, let alone the RT kernel, but Greg Kroah-Hartman (the original creator of -staging) objects to the idea of purposely keeping them out of the hands of RT users. Quoting Greg, “This seems like a patch to ensure that the staging drivers never get a chance to be fixed for any potential -rt issues. How about just sending me bug reports instead?”.

In today’s announcements: ndiswrapper 1.54-2.6.30. Jeff Merkey announced a new release of ndiswrapper for the 2.6.30 kernel. He separately took the opportunity to bash both Xen and KVM with, well, you can imagine. Thomas Gleixner posted to say he has released 2.6.29.4-rt19, which includes a couple of trivial fixes. Meanwhile, RT moves on to post-30 work.

The latest kernel release is 2.6.30, which was released by Linus last Tuesday.

Greg Kroah-Hartman announced the release of the 2.6.29.5 kernel and encouraged users to upgrade to the latest update.

Stephen Rothwell posted a linux-next tree for June 15th. Since Friday, he removed the rr-latest-cpumask tree (it having served its purpose), undropped the kvm tree (since the conflicts have been resolved), and the tree continues to fail to build in an allyesconfig build configuration for powerpc. A large number of other conflicts and build problems were encountered. The total number of trees is now down to 129, Rusty’s latest experiment having run its course.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.
