2010/05/30 Linux Kernel Podcast

June 3rd, 2010 jcm No comments

Audio: http://traffic.libsyn.com/jcm/linux_kernel_podcast_20100530.mp3

The podcast has returned from a brief break of a few weeks while I was busily working on a certain Enterprise Linux and using my spare time to not be in front of a computer (sailing). There is a backlog of shows in various stages though I’m not yet sure when I’ll get around to posting them online. Thanks for bearing with me and let’s hope we can get back into a routine once more. As always, if you are interested in helping out, drop me a line by email.

For the US Memorial Day Holiday weekend of May 31st 2010, I’m Jon Masters with a summary of the past week’s LKML traffic.

In today’s issue: Linux 2.6.35-rc1, errors, TSC, Unified Ringbuffer, virtio, and YAFFS.

Linux 2.6.35-rc1. Linus Torvalds announced the release of kernel 2.6.35-rc1 on Sunday, May 30th 2010 at 1:21pm Best Coast Time (PDT). Quoting Linus, “…and thus endeth the merge window”. After a two week merge window, Linus says that the “bulk should be there. And please, let’s try to make the merge window mean something this time – don’t send me any new pull request unless they are for real regressions or for major bugs, ok?”. The 2.6.35 release will not feature any new filesystems for a change, but does have all of the usual driver updates. About 1000 individual developers contributed the 8500 commits in the 2.6.35 tree this time around. Linus described the statistics – specifically calling them out in his mail – as demonstrating what is “a healthy development environment”.

Errors. Modern hardware is generally highly reliable, but scalability and the growth of datacenters play havoc with statistics. Given a large enough amount of memory, disks, or other devices, something will eventually go wrong. When it does, it is useful to handle as much as possible with an air of grace. Memory errors are of particular concern, especially with the growth in the amount of RAM in (increasingly) large servers. ECC (Error Correcting Code) memory can help, and includes the useful side effect of reporting on correctable errors. Existing userspace utilities, such as Andi Kleen’s mcelog (and other related work in the kernel itself into recoverable memory errors) offer an ability to collect reports of such errors, as well as Machine Check Exceptions (essentially hardware errors, usually related to failing memory, caches, etc.) of various other kinds. At this year’s Linux Foundation Collab Summit (April 15th 2010), there was a mini-summit aimed at figuring out a path for the future of various separate error reporting subsystems, such as MCE (Intel), and EDAC (AMD). Mauro Carvalho Chehab posted a summary of the minutes in the form of an email thread entitled “Hardware Error Kernel Mini-Summit”, in which it is proposed that a new kernel error subsystem be created, abstracting all of the existing mechanisms, and wired up using performance events (perf). The latter piece comes largely at the insistence of Ingo Molnar and Thomas Gleixner, and is not without its controversy amongst those who feel perf is growing to become some catch-all solution to every problem. Still, it seems likely that there will be some generic replacement to mcelog in the future.

TSC. Venkatesh Pallipadi (Google) posted a patch, originally from Dan Magenheimer (Oracle), in which various information about the perceived (or, generally, otherwise) reliability of the TSC known by the kernel was exported via sysfs. This would allow userspace applications using rdtsc to know whether the counter is generally regarded as a reliable source of time or not. Thomas Gleixner and Ingo Molnar both absolutely hated this, on the grounds that the TSC is known to be generally not a great clocksource (although it is becoming more reliable in many systems) and that just because the reading of it is generally unprivileged and thus widespread does not mean that the kernel should be complicit in encouraging others not to use the standard timestamp reading abstractions. Especially with modern kernels, where there are vsyscalls and other facing mapped page hacks, the overhead of obtaining timestamp information from the kernel is generally fairly reasonable. There was even some suggestion of limiting ring3 access to the TSC by means of a SPR (Special Purpose Register) setting. Dan Magenheimer noted that the uses of userspace reading of the TSC were more widespread than Thomas and Ingo may have considered, and he called out the dynamic linker used in RHEL5 as one example of a semi-frequent reader of TSC information. Brian Bloniarz, John Stultz, and Peter Anvin took the conversation in a slightly different direction after Brian noted that sometimes userspace needs to know how reliable the current clocksource is considered to be for use in calibration (for example, when using NTP and desiring to know oscillator accuracy). It seemed to be decided that it would therefore be worthwhile to have a general means to determine the accuracy of the current clocksource, not just the Intel-world-view centric TSC. That latter part may well happen.
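
As a rough illustration of the “standard timestamp reading abstractions” mentioned above, a userspace program can ask the kernel for monotonic time rather than reading the TSC itself, and can look up which clocksource the kernel is currently using. This is only a minimal Python sketch, assuming a Linux system: the sysfs path is the usual location of the current clocksource name, the helper names are hypothetical, and none of this is the interface proposed in the patch.

```python
import time
from pathlib import Path

# Usual sysfs location of the kernel's current clocksource on Linux
# (an assumption here; it may be absent on other systems or in containers).
CLOCKSOURCE_PATH = Path(
    "/sys/devices/system/clocksource/clocksource0/current_clocksource"
)


def current_clocksource(path=CLOCKSOURCE_PATH):
    """Return the clocksource name (for example "tsc"), or None if unknown."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def elapsed_ns(work):
    """Time a callable using the kernel's monotonic clock instead of rdtsc."""
    start = time.monotonic_ns()
    work()
    return time.monotonic_ns() - start
```

Calling elapsed_ns(lambda: sum(range(100000))) returns a non-negative number of nanoseconds, and current_clocksource() returns None rather than failing when the sysfs file cannot be read.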

Unified Ringbuffer. Hardware error detection wasn’t the only topic of general unification efforts this week. Steven Rostedt posted an RFC thread entitled “Unified Ring Buffer” in which he discussed implementing a globally generic kernel ringbuffer that could be used in any subsystem (recall that Steven also implemented a fancy ringbuffer design in ftrace). He posted links to LKML discussion on the effort so far, and an LWN summary article, noting that both the ftrace ringbuffer and the oprofile ringbuffer have so far been unified, but also noting that the introduction of perf events (which require a lockless, NMI safe, and mmap()able implementation) came with yet another new ringbuffer from Peter Zijlstra. Steven’s original ringbuffer became lockless last year, but currently does not support mmap. So there are two implementations, “neither of which can perform all of the features needed. This is putting a bit of stress on the users of these tools, not to mention the stress on the developers as well”. Steven would like to find a solution to this problem, and so started the thread. Mathieu Desnoyers added that he was happy to help, and had already started working on his own tree (originally intended to help his LTTng tracing tools), while Andi Kleen wondered aloud why Steven would “want a single ring buffer for everyone?”. Steven said the solution might not be to have one implementation, but merely one single interface (with varying backends used, including, as Andi had noted, kfifo based implementations). This led Ingo Molnar to suggest that grand design planning discussion of ringbuffers was less important than discussing the future direction for tracing and instrumentation (the main users of these ringbuffers, and the real motivation behind them), and to note that performance was currently quite sucky both in ftrace and perf. The conversation seemed to dry up without any specific conclusions.
Separately, Peter Zijlstra posted perf ringbuffer optimization patches in a thread entitled “Optimize perf ring-buffer”. Still separately, Chase Douglas posted some “Tracing configuration review” questions for the forthcoming Ubuntu kernel configuration, seeking review comments.
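
For readers less familiar with the data structure under discussion, here is a toy overwrite-mode ring buffer in Python. It is purely illustrative: the class and method names are hypothetical, and it has none of the lockless, NMI safe, or mmap()able properties that the kernel implementations are being asked to provide.

```python
class RingBuffer:
    """Fixed-size ring buffer that overwrites the oldest entry when full."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots = [None] * capacity
        self._head = 0   # index of the next slot to write
        self._count = 0  # number of valid entries currently stored

    def write(self, event):
        """Store an event, dropping the oldest one if the buffer is full."""
        self._slots[self._head] = event
        self._head = (self._head + 1) % len(self._slots)
        self._count = min(self._count + 1, len(self._slots))

    def read_all(self):
        """Return and consume all stored events, oldest first."""
        size = len(self._slots)
        start = (self._head - self._count) % size
        events = [self._slots[(start + i) % size] for i in range(self._count)]
        self._count = 0
        return events
```

Writing the values 0 through 4 into a RingBuffer(3) and then calling read_all() yields [2, 3, 4], since the two oldest events were overwritten.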

Virtio. Michael S. Tsirkin posted an RFC patch entitled “virtio: put last seen used index into ring itself”, which as it implies modifies the ring buffer used for host/guest communication of virtio (via a feature flag, using available room in the existing structure) such that a guest will update the ring buffer with a host-visible state of where it is in consuming the buffer. The host doesn’t technically require this information, but it can save on unwanted interrupts if the host knows that the guest is not done processing previous ringbuffer entries, and provides useful statistical information. There then followed a lengthy (and somewhat interesting) debate between Michael, Dor Laor, and Rusty Russell concerning the latter’s assertion that the state of the ring buffer could be stored in the same cacheline as the last item in the buffer, rather than in its own cacheline. Rusty contended that this would be more efficient (since occasionally the index and data would be read at the same time), but when he wrote a useful test program was only able to prove that Avi Kivity was right in suggesting separation. There was also various other discussion of the complexity of virtio.
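
The interrupt-saving idea can be sketched with a toy model. The following Python is not the virtio ring layout or the logic in Michael’s patch: the class, field, and method names are hypothetical, index wraparound and memory barriers are ignored, and the suppression policy (only interrupt when the guest had already caught up) is an assumption made for illustration.

```python
class UsedRing:
    """Toy model of a used ring shared between a host and a guest."""

    def __init__(self):
        self.used_idx = 0       # written by the host: entries produced so far
        self.last_seen_idx = 0  # written by the guest: entries consumed so far

    def host_add_used(self):
        """Host publishes one entry; returns True if it should interrupt."""
        # If the guest had caught up it may be idle, so it needs an interrupt.
        # If it is still behind, it will find the new entry while processing.
        guest_was_caught_up = self.last_seen_idx == self.used_idx
        self.used_idx += 1
        return guest_was_caught_up

    def guest_consume_one(self):
        """Guest processes one entry and publishes its progress."""
        if self.last_seen_idx == self.used_idx:
            return False  # nothing left to consume
        self.last_seen_idx += 1
        return True
```

In this model, three back-to-back calls to host_add_used() on a fresh ring return True, False, False: only the first entry needs an interrupt, because the guest is then known to be busy with earlier entries.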

YAFFS. Charles Manning, ever diligent YAFFS (Yet Another Flash Filesystem – an excellent alternative that this author has had the privilege of poking at with his embedded hat on in the past) developer, posted some questions on SLUB behavior. Charles uses a SLUB-like allocator in YAFFS to manage objects, but his objects are separated according to the mount to which they refer. This makes it very easy for him to just throw away a large number of objects on unmount without de-allocating them (“trust me, I know what I’m doing”). He is looking at replacing his custom allocator with SLUB in order to facilitate eventual mainlining of YAFFS, but wants to know whether SLUB could grow two additional flags: “don’t combine this SLUB with others”, and “trust me, I know what I’m doing: allow the cache to be dumped with objects still allocated”. So far, nobody has answered his questions.

In today’s miscellaneous items:

*). Mike Snitzer, Jens Axboe, Vivek Goyal, and Kiyoshi Ueda discussed (in a thread entitled “only initialize full request_queue for request-based device”) various approaches to minimalist initialization of Device Mapper devices, specifically given the new split handling of bio vs. request based devices. Only the latter type requires “full” queue setup.

*). Ingo Molnar requested that Linus pull the “lockup-detector-for-linus” tree, which contains a unified kernel lockup detector in kernel/watchdog.c that replaces the existing NMI watchdog, hung task, and softlockup detection code, and so forth, putting it all in one place. Big thanks go to Don Zickus for his work on this.

*). Discussion continued surrounding some documentation that Henrik Rydberg posted on the Multitouch event slots protocol for multitouch devices. It seems that these input devices become more complex by the day.

*). Don Zickus posted a patch entitled “Makefile.build: make KBUILD_SYMTYPES work again”, in which he provided some fixes to the code that provides a means to determine why kernel symbol versions have changed (i.e. which specific change to which kernel structure or function was the cause). This is of particular use to “Enterprise” distributions doing module versioning.

*). Michel Lespinasse (Google) posted a patch entitled “Stronger CONFIG_DEBUG_SPINLOCK_SLEEP without CONFIG_PREEMPT” in which he proposed tracking the preempt count even when not using CONFIG_PREEMPT, but when nonetheless building with CONFIG_DEBUG_SPINLOCK_SLEEP. Rather than the use of preempt_{dis,en}able actually resulting in preemption, it would merely serve as a means to warn when attempting to sleep incorrectly from within a critical section, but without explicitly enforcing it.

*). Discussion continued surrounding a previous patch from Kay Sievers adding new “devname” module aliases to facilitate module on-demand autoloading. The idea here is that modules can now provide the name of the device entry or entries they will create and so tools like udev can demand load modules as the nodes they support are accessed.

*). Thomas Gleixner finally posted the patch series he had threatened to post previously, entitled “Run interrupt handlers always with interrupts disabled”, that does largely what it says on the tin. It removes the IRQF_DISABLED functionality at interrupt registration and runs all interrupt handlers with IRQs off. This should facilitate greater migration over to modern threaded interrupt handlers as needed.

*). Neil Brown posted a patch entitled “VFS: fix recent breakage of FS_REVAL_DOT” in which he provided a fix for a change to NFS client mount behaviors, under which the client would no longer check if a directory within which “ls -l” were being run had changed at the time of the command, without waiting for the cached timestamp attributes to timeout. Al Viro took the patch, but did not like the implementation, so some further discussion ensued.

*). Arve Hjønnevåg posted the latest version of the “suspend block API”, which provides the “same functionality as the android wakelock api”. This is intended to control when a system will be blocked from suspending due to activity, and comes with the benefit of lengthy LKML discussion.

*). Glauber Costa posted version 3 of a patch implementing various MSR (Model Specific Register) KVM specific documentation.
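
The CONFIG_DEBUG_SPINLOCK_SLEEP item above describes keeping a preempt count purely for debugging purposes. A toy Python model of that idea follows, under loud assumptions: the class and its methods are hypothetical stand-ins named after the kernel concepts, not kernel code, and the “warning” is just a string appended to a list.

```python
class PreemptDebug:
    """Toy model: track critical-section nesting only to warn about sleeping."""

    def __init__(self):
        self.preempt_count = 0
        self.warnings = []

    def preempt_disable(self):
        """Enter a critical section (counting only; nothing is enforced)."""
        self.preempt_count += 1

    def preempt_enable(self):
        """Leave a critical section."""
        if self.preempt_count == 0:
            raise RuntimeError("unbalanced preempt_enable")
        self.preempt_count -= 1

    def might_sleep(self, where):
        """Record a warning if a sleeping call is made in a critical section."""
        if self.preempt_count > 0:
            self.warnings.append(f"sleeping call in critical section: {where}")
```

A call to might_sleep() between preempt_disable() and preempt_enable() records a warning but does not stop anything, which mirrors the “warn, but without explicitly enforcing it” behavior described in the item.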

In today’s announcements:

* Smatch 1.55. Dan Carpenter announced that release 1.55 of the “smatch” static C source checker tool is now available. The latest version includes an enhanced array overflow check, new checks for precedence bugs caused by macro expansion, rewritten checks for null pointer dereferences, and some kernel specific checks for kunmap, release_resource, etc. http://smatch.sf.net/ or git://repo.or.cz/smatch.git

* Jeff Merkey announced version 2.6.34 of ndiswrapper. Quoting Jeff, “Always here to support the hated projects of Evil Emperor Linus. Needed this f**king think to work on my laptop so fixed the busted sh*t.” His 4-letter-word strewn announcement was greeted by a reply from Simon Horman noting that he would be happy to send Jeff a dictionary if he was looking to “learn some words that are more than four letters long”.

The latest kernel release is 2.6.35-rc1.

Greg Kroah-Hartman posted a series of 2.6.32.14 stable kernel review patches. He notes that he only included patches that were released in kernels up to the 2.6.34 release, since the line had to be drawn somewhere. This is a “long term” stable kernel tree. Many vendors are basing on 2.6.32 now. Greg also posted “take 2” of some 2.6.27.47 stable series patches, as well as stable review patches for 2.6.33.5.

That’s a summary of today’s Linux Kernel Mailing List traffic. For further information, visit www.kernel.org. I’m Jon Masters.

Categories: episodes Tags:

2010/05/02 Linux Kernel Podcast

May 17th, 2010 jcm No comments

Audio: COMING SOON

For the weekend of May 2nd 2010, I’m Jon Masters with a summary of the past week’s LKML traffic.

In today’s issue: Linux 2.6.34-rc6, vger.kernel.org, Checkpoint and Restart, Frontswap, FUSE, and the Scheduler.

Linux 2.6.34-rc6. Linus Torvalds announced the latest 2.6.34 RC kernel on Thursday April 29th at 8:18pm PDT (Best Coast Time). The latest release is bloated by an updated PowerPC defconfig but does contain other fixes.

vger.kernel.org. There was a vger.kernel.org outage this week, from the 28th through the weekend, due to a power failure in the datacenter that hosts the equipment. This disrupted traffic to LKML, although some folks on IRC noted that their productivity had improved due to the lack of distraction.

Checkpoint and restart. Oren Laadan posted the latest version (21) of the “Kernel based checkpoint/restart” patch series, all 100 of the patches. He included various hints about which bits should be reviewed by whom, but the sheer size of the series boggled a few people. Although there wasn’t much discussion on the list, it does seem unlikely that a 100-part patch series of this kind would be pulled whole any time soon. http://www.linux-cr.org.

Frontswap. Discussion continued on some patches we missed in last week’s episode, on a rewritten piece of the previous “Transcendent Memory” patch series, named “Frontswap”. This piece of the large patch series – which is apparently shipping now in both OpenSuSE and Oracle Enterprise Linux – adds a new generic means to register what is the “opposite” of a swap-like backing store. Frontswap is essentially non-addressable RAM that is provided by a hypervisor (or perhaps a compressed in-kernel RAM device) and which may grow and shrink over time according to the availability of system resources. For example, a hypervisor may grant guests large amounts of otherwise unused RAM in the form of such “frontswap”able devices that may need to be reclaimed later on if other guests require the resources. Using frontswap, one can potentially avoid additional disk overhead usually associated with “swap”. One of the biggest criticisms, from Avi Kivity, was that these patches assume access to the frontswap device is synchronous and not being performed using DMA or some other asynchronous process. Dan Magenheimer confirmed that this is an intentional design limitation in order to make the implementation much simpler for its use case(s) dealing with real physical RAM. Dan noted that the conversation had gone off on a tangent, discussing such other (interesting, but not directly relevant) issues as swap-to-flash.

FUSE. Miklos Szeredi posted an RFC patchset implementing splice(2) support for FUSE (Filesystems in USErspace). This means that it is possible to move an existing page directly into the page cache of the FUSE filesystem without ever having to perform a copy. Given that there is obvious overhead in having filesystems implemented in userspace, adding splice support is a nice touch. Apparently the early tests show improved bandwidth and reduced system time but it will be interesting to see what further testing reveals over time.

Scheduler. Ted Baker, Joerg Roedel, Doug Niehaus, and Peter Zijlstra discussed scheduler policy and classes available in the kernel in a followup to a much earlier thread entitled “RFC for a new Scheduling policy/class in the Linux-kernel”, specifically about any plans to support SCHED_SPORADIC. Both Ted Baker and Doug Niehaus had plans for the ability to assign a task a priority that is specifically non-runnable without having to send it a signal – such as SIGSTOP – that requires the task to run in order to process the STOP. Peter Zijlstra stated that the current plan involved supporting the sporadic task model through the use of SCHED_DEADLINE rather than POSIX’s SCHED_SPORADIC (the name of which, according to Peter, was jokingly “stole[n] [...] from us”). Ted Baker replied to Peter, noting that deadline scheduling and sporadic server scheduling are “two quite different things” – the latter belonging to the existing fixed priority scheduling domain (that is a separate problem domain from that of the deadline scheduling folks). Ted thought that any problems with the POSIX SCHED_SPORADIC API could be corrected through “interpretation” of the standard, such that a solution would be available in short order rather than in the longer term, especially if Linux were to produce an implementation that he could feed back to the Austin Group (the POSIX folks).

In today’s miscellaneous items:

* Mike Travis (SGI) posted a patch providing a kernel parameter to increase pid_max from 32k for early-in-boot use, before it can be otherwise set to a higher value. Otherwise, on a system with 1664 CPUs, Mike finds that there are 25163 processes started before the login prompt!

* Jack Steiner (SGI) noted that the existing SLAB allocator implementation of cpuset_mem_spread_node used a single rotor for allocating both file pages and SLAB pages, so that (on a multi-node memory system), writing a particular test file results in advancing the rotor 2 nodes per allocation and skipping e.g. odd-numbered nodes in the SLAB pages allocation. The patch introduces a second rotor just for the SLAB page allocation.

* Philip Langdale (VM) noted that he has been following the Transparent Hugepage work over the past few weeks and is very encouraged. He claims a 22% improvement in ops/sec reported by SPECjbb under virtualization.

* A kernel developer posted a somewhat distressing thread suggesting some emotional disturbance caused by a particular relationship. In the interest of not being the US Weekly of LKML I shall refrain from further comment, and agree with the suggestion of using the “It’s Complicated” button on Facebook next time something like this comes up instead.

* Ying Huang posted initial support for APEI (ACPI Platform Error Interface).

* Joerg Roedel posted the second version of the “Nested Paging support for Nested SVM” patchset.

* Steven J. Magnani posted version 2 of a stack unwinder for Microblaze.

* Jonathan Corbet posted a second series of viafb patches for OLPC, and later pushed a version 2.1 of the series, containing three additional patches fixing issues pointed out by Bruno Prémont. The patches are available from git://git.lwn.net/linux-2.6.git in the branch viafb-posted. Jon wondered if the patches were ready to go into viafb-next.
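
The shared-rotor problem Jack Steiner described above is easy to reproduce in a toy simulation. The Python below is only a sketch under stated assumptions: the function names are hypothetical, and each “write” is assumed to perform exactly one SLAB page allocation followed by one file page allocation, which is a simplification of whatever the real test workload does.

```python
from collections import Counter


def shared_rotor(num_nodes, writes):
    """One rotor serves both SLAB page and file page allocations."""
    rotor = 0
    slab_nodes, file_nodes = Counter(), Counter()
    for _ in range(writes):
        slab_nodes[rotor % num_nodes] += 1  # SLAB page allocation
        rotor += 1
        file_nodes[rotor % num_nodes] += 1  # file page allocation
        rotor += 1
    return slab_nodes, file_nodes


def separate_rotors(num_nodes, writes):
    """SLAB pages and file pages each advance their own rotor."""
    slab_rotor = file_rotor = 0
    slab_nodes, file_nodes = Counter(), Counter()
    for _ in range(writes):
        slab_nodes[slab_rotor % num_nodes] += 1
        slab_rotor += 1
        file_nodes[file_rotor % num_nodes] += 1
        file_rotor += 1
    return slab_nodes, file_nodes
```

With 4 nodes and 8 writes, shared_rotor() places SLAB pages only on nodes 0 and 2 (the odd-numbered nodes are skipped), whereas separate_rotors() spreads both kinds of allocation evenly across all four nodes.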

In today’s announcements:

* DRM. Stefan Bader posted to let everyone know that he is now maintaining a 2.6.32-based tree on kernel.org containing backported DRM improvements for 2.6.32 based kernels, since a number of vendors are using that tree. Luis R. Rodriguez replied saying that this was “Great stuff! Thanks for putting this up!”. One wonders if this is a further sign of a growing trend.

* Linux 2.6.33-rt19. Thomas Gleixner announced version 2.6.33.3-rt19 of the Real Time patchset, containing mostly VFS scalability bits. This followed a previous 2.6.33-rt16 release also this week containing largely a merge with upstream 2.6.33, and -rt17 and -rt18 releases that contained a few fixes. Thomas notes in his posting that he had previously pushed out rt14 and rt15 without sending an announcement out to the list, so he included changelogs from -rt13 to 16, and rt17-rt18 (in the separate emails he made announcing -rt16, and -rt17). Patches are available at http://www.kernel.org/pub/linux/kernel/projects/rt/ and the tip git tree on git.kernel.org contains existing rt/head and rt/2.6.33 release branches.

* Upstart 0.6.6. Scott James Remnant announced the 0.6.6 release of the “upstart” SYSV init daemon replacement that supports modern asynchronous event driven operation rather than traditional runlevels (though it does also support emulating those for backward compatibility). Upstart is used by a number of distributions, and is available at upstart.ubuntu.com/

The latest kernel release was 2.6.34-rc6.

Greg Kroah-Hartman released stable series kernels 2.6.32.12 and 2.6.33.3. The former came with some thanks (and possibly an indirect dig at vendors) to Maximilian Attems for his “hard work digging out patches from the various vendor kernel trees for this release”. Maximilian was also thanked specifically in the latter case for contributing patches also. Separately, Greg requested of Stephen Rothwell that he begin pulling a new staging-next tree into his daily Linux -staging tree (a nice present for Stephen as he returned from vacation).

Frederic Weisbecker replied (in an otherwise innocuous patch thread entitled “ptrace: Cleanup useless header”) noting that things touching the BKL should CC both him and Arnd Bergmann. They are still working on Big Kernel Lock (BKL) removal, which you can keep track of via http://kernelnewbies.org/BigKernelLock. There was some other BKL removal traffic over the past week, also, including some patches from Arnd entitled “Push down BKL into device drivers” (similar to the FS patches he had posted previously that did the same in that layer – nice).

That’s a summary of today’s Linux Kernel Mailing List traffic. For further information, visit www.kernel.org. I’m Jon Masters.

2010/04/25 Linux Kernel Podcast

May 13th, 2010 jcm No comments

Audio: COMING SOON

For the weekend of April 25th 2010, I’m Jon Masters with a summary of the past week’s LKML traffic.

In today’s issue: Linux 2.6.34-rc5, CFS, Firmware, and IPC.

Linux 2.6.34-rc5. Linus Torvalds announced the release of Linux kernel 2.6.34-rc5 on Mon, April 19th 2010 at 4:42pm PDT (Best Coast Time). As he said, “Another week, another -rc. This time there wasn’t some big nasty regression I was working on to hold things up” (referring to the issues with anon_vmas and anon_vma_chains from last week). The latest release includes a number of general fixes, including boot fixes for ACPI parsing, and the usual kinds of driver updates (radeon, amd-iommu, filesystems). SPARC now has ftrace support if you are interested in playing with that. Upon mentioning regressions, Rafael J. Wysocki seemed to fly into action with his usual vigor and post his regular regression summary of issues outstanding since 2.6.33. The current statistics show that the number of unresolved issues has tended to increase over the several weeks leading up to -rc5, with 34 outstanding.

CFS. Mathieu Desnoyers posted version 2 of a patch entitled “CFS fix place entity spread issue”, which aims to address an apparent situation in which Mathieu felt that min_vruntime could go backwards and cause large unwanted latencies for certain workloads. Peter Zijlstra disputed that this was happening, and Linus, upon testing the patch using his “favorite non-scientific desktop load”, found that it made things worse in terms of X performance, which was apparently to be expected (according to Mathieu) because Xorg had been getting unfair runtime treatment that was now corrected. This didn’t make Linus particularly happy (from a user experience viewpoint) and meanwhile Mathieu and Peter continued to debate what was happening. Mathieu posted some links to an ELC (Embedded Linux Conference) presentation that he did on this topic at http://www.efficios.com/elc2010 and then later followed up (in an entirely separate thread) with version 11 of his “introduce sys_membarrier(): process-wide memory barrier” that he uses to assist with his userspace RCU implementation, all the while still stranded at San Francisco airport waiting for a means to get back home.

Firmware. Tomas Winkler posted a thread entitled “request_firmware API exhaust memory” in which it was discovered that some performance enhancement work done by David Woodhouse a while back actually caused the kernel to leak memory used for firmware handling, especially in the case that a large number of calls were made to request_firmware, as in the case of Tomas’ code. The issue was that the firmware code was attempting to free pages not allocated with vmalloc using vfree, whereas the underlying pages were actually being allocated and then mapped into linear kernel virtual memory with vmap calls. The fix involves unmapping and then freeing.

IPC. Manfred Spraul posted a three part patch series entitled “ipc/sem.c: Optimization for reducing spinlock contention” in which he attempts to “fix the spinlock contention reported by Chris Mason: His benchmark exposes problems of the current code”. Manfred then summarizes three main issues, including the prominent first issue that the algorithm used by update_queue() has a worst case performance on the order of O(N^2) and bulk wakeups can enter this worst case if they are unlucky. After applying the patch and performing some runs with sembench using 250 threads, waking 64 threads at a time, Manfred reports 1.1% CPU lost spinning vs. 47% before, and 6% of spinlocks spinning vs. 91% before, amongst other statistics.

In today’s miscellaneous items:

* Jon Corbet posted version 2 of an RFC patch series entitled “Initial OLPC Viafb merge”, and noted that he would begin a linux-next tree.

* Yanmin Zhang posted version 5 of a patch intended to implement perf statistics collection in the host of various guest KVM instances.

* Hiroyuki Kamezawa reported an issue with memory compaction support in the mm-of-the-day (mmotm) for 2010-04-15-14-42. He and Mel Gorman discussed it a little. Separately, Mel posted version 8 of the memory compaction patch series, without an obvious fix for the crash issue.

* Justin P. Mattock reported that the issues booting MacBook Pro systems from the previous week seemed to now be resolved in the latest kernels.

* Rusty Russell posted a module patch that causes the module_lock mutex to be dropped when waiting for parallel module loads to complete.

* Don Zickus posted a 6 part patch series entitled “lockup detector changes” that “covers mostly the changes necessary for combining the nmi_watchdog and socklockup code”.

* Stefani Seibold posted yet another (unversioned in the subject line) 4 part patch series that was entitled “enhanced reimplementation of the kfifo API”, and which contained basically a rebase to recent kernels.

* Kyle McMartin posted a patch changing the default file permissions on the kernel provided pseudo file /proc/sys/vm/mmap_min_addr to 0600 from 0644. There wasn’t a huge security issue as writes were already denied by virtue of the fact that CAP_SYS_RAWIO was also required underneath.

* Kent Overstreet posted version 3 of the “bcache” patch series.

In today’s announcements:

* Linux Plumbers Conference (LPC). Ted Ts’o posted a “Call for Tracks”, noting that this year’s conference will take place in Cambridge, MA from November 3-5. The organizers are looking for “problem statements” summarizing “things that could be improved in Linux that cross multiple interfaces or other project boundaries”. For further information about the conference, and to submit ideas, see: http://www.linuxplumbersconf.org/

* git 1.7.0.6. Junio C Hamano announced version 1.7.0.6 of the GIT utility used for version control by the Linux kernel community. The latest version includes fixes for “git diff --stat” overflow, and “git rev-list --abbrev-commit” using the older 40-byte abbreviation format. Junio also announced version 1.7.1 of the GIT utility, which included updates to gitk, the ability to invoke an external command for passwords (GIT_ASKPASS), a new bash completion script (for those who use that), and dozens of other fixes besides. Git is available on the kernel.org website: http://www.kernel.org/pub/software/scm/git/

* hwloc. Samuel Thibault announced the release of hwloc version 1.0rc1, a “hardware locality” utility intended to provide command line support for obtaining information about NUMA memory, shared caches, processor sockets, processor cores, and processor “threads”. For further detail see the project website: http://www.open-mpi.org/projects/hwloc/

The latest kernel release was 2.6.34-rc5.

Andrew Morton posted an mm-of-the-moment (mmotm) for 2010-04-22-16-38.

There was some ongoing discussion of kernel vmalloc performance and a few patches were posted, most recently from Minchan Kim.

Joe Perches asked about the -staging tree review and acceptance process, noting that there are a “number of patches appear[ing] to go unnoticed or untracked”. Greg Kroah-Hartman followed up explaining that he’s had conferences, travel, and has moved house, and basically asked for a break. Greg has generally been responsive on the staging tree discussion list in my experience, and there is a lot of work that goes in there.

Greg Kroah-Hartman posted a 2.6.32 stable kernel review patch series comprising 197 individual patches to the “long term” stable kernel 2.6.32. He also posted a 139 part patch series for the 2.6.33 stable series kernel.

That’s a summary of today’s Linux Kernel Mailing List traffic. For further information, visit www.kernel.org. I’m Jon Masters.

2010/04/18 Linux Kernel Podcast

May 10th, 2010 jcm No comments

Audio: COMING SOON

For the weekend of April 18th 2010, I’m Jon Masters with a summary of the past week’s LKML traffic.

In today’s issue: Linux 2.6.34-rc4, adaptive spinning mutexes, Microblaze, Remote Controller Subsystem, Stack Size, and VM.

Linux 2.6.34-rc4. Linus Torvalds announced the release of kernel 2.6.34-rc4 on Monday April 12th 2010 at 7:16pm PDT (Best Coast Time), which had been delayed while he, Borislav Petkov, Rik van Riel, and others were tracking down an annoying rmap VM regression caused by the introduction of anon_vma_chain support. Most of Linus’ announcement covers that bug – stay tuned for some coverage on that – but also mentions the new cxgb4 network driver.

Adaptive spinning mutexes. Benjamin Herrenschmidt posted a new thread entitled “Possible bug with mutex adaptive spinning” in which he noted that the current adaptive spinning (in which a mutex will spin briefly rather than immediately going to sleep if the owner of a lock is already running and might release it soon) code in mutex_spin_on_owner() does not correctly handle the case of the owner CPU being offlined. In this case, the function will return 1, meaning that the caller should spin, which it may do forever. Ben changes the return to 0 in the case that the CPU is offline so that a sleep occurs immediately.
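To make the fix concrete, here is a minimal sketch in Python (not kernel code) of the decision a waiter makes. The names `Owner` and `should_spin` are hypothetical and exist only for illustration; the real logic lives in mutex_spin_on_owner(). The sketch assumes the only inputs that matter are whether the owner is currently running and whether its CPU is still online.

```python
from dataclasses import dataclass


@dataclass
class Owner:
    """Hypothetical stand-in for the task currently holding the mutex."""
    cpu: int        # CPU the owner was last seen on
    running: bool   # True if the owner is executing right now


def should_spin(owner, online_cpus):
    """Return True if the waiter should spin, False if it should sleep."""
    # The case Ben's fix covers: if the owner's CPU has gone offline,
    # spinning could last forever, so go to sleep instead.
    if owner.cpu not in online_cpus:
        return False
    # Otherwise spin only while the owner is actually running, since it
    # may then release the lock soon.
    return owner.running
```

With CPUs 0 and 1 online, a running owner on CPU 1 is worth spinning on, while a running owner recorded against the offlined CPU 2 is not.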

Microblaze. Michal Simek posted a thread entitled “Microblaze – The fi[r]st year”, in which he summarized what has happened in the year since support for the soft-core Xilinx Microblaze CPU was first added to the mainline kernel. He calls out a number of folks for specific thanks – both from Xilinx, and from PetaLogix, as well as the wider community (the usual suspects: Andrew Morton, Arnd Bergmann, Grant Likely, Ingo Molnar, John Linn, John Williams, Stephen Neuendorffer, etc.). He includes a timeline of events over the past year as well as links to git trees, the wiki, and even a Facebook fan page (such is the world in which we live today – and yes, I am a “fan” myself).

Remote Controller Subsystem. Mauro Carvalho Chehab posted an informative mail entitled “Remote Controller subsystem status” in which he updated everyone on the current progress toward implementing a new “remote controller” subsystem that replaces the legacy V4L/DVB code and will become a new “core” subsystem available in /sys/rc. There is a userspace tool called ir-keytable and some discussion of plans for merging in 2.6.35. A mail worth reading.

Stack size. Dave Chinner posted a thread entitled “mm: disallow direct reclaim page writeback” in which he advocates for using the background IO flusher threads even in the case that VM pressure is so high that direct page reclaim becomes a necessity. Dave feels that in such cases, “we may have used an arbitrary amount of stack space, and hence enterring the filesystem to do writeback can then lead to stack overruns. This problem was recently encountered [on] x86_64 systems with 8k stacks running XFS with simple storage configurations”. This led to a longer thread in which the issue of kernel stack footprint was addressed, as well as the specific issue of what to do in the direct reclaim situation. Andi Kleen followed up to Chris Mason’s comments concerning the relatively large footprint of single fs functions with an assertion that the ‘4K stack simply has to go. I tend to call it “russian roulette” mode’. Andi considers such small stacks to be dangerous given the “obscure paths through the more an more subsystems”. He is fond of the separate interrupt stack in the case of 4K process stacks, but feels that there should always be a separate interrupt stack in any case, as might have helped in the case that Dave Chinner was mentioning in the original posting. Mel Gorman later followed up with an RFC patch series entitled “Reduce stack usage used by page reclaim” in which he attempted to “reduce some of the more obvious stack usage in page reclaim”, including in putback_lru_pages, kswapd, shrink_page_list, shrink_zone, and so forth (up to 1096 bytes saved).

VM. The Linux kernel includes support for reverse page mapping (rmap), a means by which it is possible for the Virtual Memory subsystem to answer important scalability questions such as “which virtual memory pages reference this physical page?” without having to walk through a large number of process page tables each time. Over the years, this code has become more complex through the addition of anon_vma and anon_vma_chain structures intended to allow object based reverse mapping of anonymous memory pages with reduced overhead as compared with Rik van Riel’s original (and simpler) mechanism of having additional pointers in every struct page. anon_vma is used to track per-task anonymous VMA use, while anon_vma_chains link these together to allow the VM to determine which tasks have a shared reference to a given anonymous VMA.
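The basic idea of a reverse map can be sketched in a few lines of Python. This is a toy model under loud assumptions: the class `ReverseMap` and its methods are hypothetical, and it keeps a simple index from physical page to mappers, which is closer in spirit to a naive per-page scheme than to the object based anon_vma design described above.

```python
from collections import defaultdict


class ReverseMap:
    """Toy reverse map: physical page number -> set of (task, virtual address)."""

    def __init__(self):
        self._mappers = defaultdict(set)

    def map_page(self, task, vaddr, pfn):
        # Record that `task` maps physical page `pfn` at `vaddr`.
        self._mappers[pfn].add((task, vaddr))

    def unmap_page(self, task, vaddr, pfn):
        # Forget a mapping; a no-op if it was never recorded.
        self._mappers[pfn].discard((task, vaddr))

    def who_maps(self, pfn):
        # Answer "who references this physical page?" without walking
        # every task's page tables.
        return sorted(self._mappers.get(pfn, ()))
```

After a fork, both parent and child would show up in `who_maps()` for a shared page, and the entry for the child disappears once it unmaps the page or exits.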

The implementation of this complex VMA tracking was suffering from a bug that Borislav Petkov kept hitting in performing a suspend/resume cycle on his system, in which the resume code would wind up referencing a previously unmapped shared page first within a child process (setting up a new anon_vma) and later within a parent (causing an anon_vma_chain link to be set up pointing in the wrong direction from child to parent) that subsequently could no longer reach the child anon_vma after the child task exited. As Linus said, “End result: process A has a page that points to anon_vma B, but anon_vma B does not exist any more. This can go on forever. Forget about RCU grace periods, forget about locking, forget anything like that. The bug is simply that page->mapping points to an anon_vma that was correct at one point, but was _not_ the one that was shared by all users of that possible mapping.” Thus the fix is to ensure that new anon_vma_chain entries are always referencing the “_oldest_ possible anon_vma for the page mapping”, as is the case for Linus’ eventual (simple) patch, entitled “[PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma”. Borislav said it survived more than 20 test cycles where the system would previously have managed at most 6 resume attempts.

Linus seemed genuinely excited about tracking down this bug – it can’t always be easy doing his job, and I’m sure he relishes an occasional really dirty bug to poke at. One thing that did come of this exercise was an improvement in comments and documentation both on list and in the affected code. Linus seemed very happy with the effort Borislav was putting in to help test and track down this issue (ending the thread with a little joke about Borislav’s email gateway, which claims to be “SuperMail on a ZX Spectrum 128k”). The thread fixed a few other issues as well, and gave Peter Zijlstra a chance to post a documentation patch for page_lock_anon_vma noting that it is very difficult to serialize fully against page_remove_rmap so that the lock function doesn’t try, but instead all users of it should verify that the anon_vma returned to them is actually still relevant to them. Finally, Ulrich Drepper followed up some time later – on a tangent – wondering aloud why mprotect need create so many VMAs when changing permissions on thread stacks and the like instead of modifying page table entries.

As usual, Linux Weekly News (LWN) did a much better job of explaining the overall multi-day issue in depth so you are encouraged to take a look at their story for more of the history, analysis, and nice graphics.

In today’s miscellaneous items:

* Robert Richter posted some model specific performance events patches in order to support AMD IBS (an unfortunate acronym in this case standing for Instruction Based Sampling).

* Nigel Cunningham was looking for a job.

* Several people have reported issues booting Macbook Pros with recent kernels. Len Brown noted that this was likely already fixed (referencing BZ 15749). In response, Harald Arnesen was especially happy about git bisect as a debugging tool for non-kernel hackers to help track down bugs such as this one.

* Jason Baron posted version 7 of his “jump label” patch series.

In today’s announcements:

Git 1.7.1.rc1. Junio C Hamano announced Git version 1.7.1.rc1, which includes a number of fixes. http://www.kernel.org/pub/software/scm/git/ This comes at around the time of the 5th anniversary of the kernel switching to Git for development, which Christian Ludwig noted occurred on the 15th April. Christian notes that he has made a YouTube video visualizing git development history, available at http://www.youtube.com/watch?v=ntTpM8hfl_E

Guilt 0.33. Josef “Jeff” Sipek announced that version 0.33 of the Guilt (Git Quilt) series of bash scripts was now available from the usual location.
http://www.kernel.org/pub/linux/kernel/people/jsipek/guilt/

LTTng 0.210. Mathieu Desnoyers announced LTTng 0.210 for kernel 2.6.33.2, which was largely a revert of a PowerPC specific TRACE_EVENT definition that occurred outside of include/trace, and which particularly bothered Mathieu.

sdparm 1.05. Douglas Gilbert announced that the 1.05 release of sdparm was now available. This is a direct analogue of “hdparm” but for SCSI devices, and so supports a lot of SCSI specific fancy options.

trace-cmd version 1.0. Steven Rostedt announced version 1.0 of his trace-cmd utility, which is a cross-platform, endian safe binary reader for ftrace that can be used to capture data on one machine (e.g. as a flight recorder) and then decode and process it on another, at runtime, or after the fact.

The latest kernel release was 2.6.34-rc4.

Andrew Morton posted an mm-of-the-moment (mmotm) for 2010-04-15-14-42.

An issue was discovered with a net-2.6 patch entitled “tcp: Set CHECKSUM_UNNECESSARY in tcp_init_nondata_skb” that caused ssh to fail. David Miller subsequently stated that he would revert this patch and specifically test zero length data area CHECKSUM_PARTIAL packets with the IGB driver.

Pavel Machek noted that the LOCALVERSION_AUTO configuration option, which appends a new version to the kernel on each compilation, has an unfortunate interaction with loadable kernel modules when CONFIG_MODVERSIONS is unset insomuch as it causes the simple kernel version check to fail. Linus was very clear that the problem here is people building kernels without enabling modversions and expecting that to be even remotely safe.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

Categories: episodes Tags:

Dude, where’s the podcast?

May 7th, 2010 jcm No comments

Short answer is RHEL. I’m busy working on a bunch of things at the moment and the podcast has suffered. I’m planning to get caught up over the weekend if I can, or just skipping a few days/weeks and moving forward from now. I do my best, I know it’s not always good enough.

Jon.

Categories: Uncategorized Tags:

2010/04/11 Linux Kernel Podcast

April 14th, 2010 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100411.mp3

For the weekend of April 11th, 2010, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Fsck, Futexes, IOMMU, Modules, PRNG, and SMIs.

Fsck. Pavel Machek raised the issue of power failure and its potential to wreak havoc on filesystems that don’t enable barriers (that ensure the journal is fully on disk) by default. Pavel felt it would be prudent to artificially increase the mount count for unclean shutdowns so as to make an fsck more likely next boot. Ted Ts’o recommended that people could just move to ext4, while Rob Landley was surprised that anyone would want to wait hours for an fsck, to which Ted added that it was of course possible to use online checking via e2croncheck and so on (in which case, he recommends people do weekly checks using e.g. an LVM snapshot of the running filesystem).

Futexes. Darren Hart posted an RFC entitled “Ideal Adaptive Spinning Conditions” in which he requested some comments on his ideas around adaptive lock spinning with futexes (essentially spinning for a while rather than sleeping immediately when blocking on an already locked mutex, in case someone else releases it in short order – the kind of behavior implemented for adaptive kernel spinlocks by Gregory Haskins for the Novell RT kernel patchset) as a means to reduce dependence on sched_yield when implementing userspace spinlocks. Darren finds adaptive spinning actually harms his userspace implementation and is interested to know, therefore, what are the ideal conditions for this technique to be of use. Darren, Steven Rostedt, Gregory Haskins, Rik van Riel, Chris Wright, and the other usual suspects discussed this a little, as well as how things change under virtualization.

IOMMU. Neil Horman was concerned about recent kernels causing rare corruption when in-flight IOMMU operations are not properly flushed during a kexec (or a kdump) operation and posted a patch intended to ensure all outstanding IOMMU domain entries are flushed on shutdown. Chris Wright favored doing this on initialization and stated that this was working in the past and so something must have broken it recently in order for Neil to experience issues. Neil looked at the code some more and determined that the state AMD set the IOMMU to on init should be relatively safe unless DMA operations are very long lived or devices are getting confused. He decided to think some more. Chris Wright later posted a patch to the IOMMU initialization such that it is properly enabled before devices are attached in order to prevent the kind of stale entries that Neil had been seeing. Neil tested over the weekend and found that it did indeed solve his problems.

Modules. Nick Piggin was looking for ways to implement scalable in kernel refcounting when he came across the current way that struct module_ref implements module reference counting for loadable modules. He thinks that the existing implementation is racy, though Rusty Russell pointed out that it is only manipulated under stop machine (which itself causes the kernel to essentially become single threaded code). Although this is (mostly) true for the module code itself, the counts are exported to those who do not necessarily use it correctly with any real locking. Rusty pointed out that unloading is relatively rare and so few people seem to care about bad usage. Nonetheless, Linus liked Nick Piggin’s patch, which replaces a single percpu counter with two (one for incrementing the count, one for decrementing, and the total count of module users is thus represented by summing these) and thus removes a small window during which one CPU may decrement a use count without seeing an increment from another CPU occurring at the same time. This is considered an improvement against those reading module_refcount unsafely, at least until that is unexported, the code is fixed up, or module removal support is itself removed entirely from the kernel.
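The window being closed is easier to see with a deterministic toy interleaving. The Python sketch below is not the kernel code: both function names are hypothetical, the "CPUs" are just list slots, and the read order in the second function (decrements summed before increments) is my assumption about how a split counter has to be read to be safe, not something stated in the thread summary above.

```python
def racy_single_counter_read():
    """One signed counter per CPU, summed by an unlocked reader."""
    counts = [1, 0]        # one reference already held, taken on CPU 0
    seen = counts[0]       # reader samples CPU 0 and sees 1
    counts[0] += 1         # another user takes a reference on CPU 0 ...
    counts[1] -= 1         # ... and drops it again on CPU 1
    seen += counts[1]      # reader samples CPU 1 and sees -1
    return seen            # apparent count, although a reference was held throughout


def split_counter_read():
    """Separate per-CPU increment and decrement counters."""
    incs, decs = [1, 0], [0, 0]
    seen_decs = decs[0]    # reader starts summing decrements first
    incs[0] += 1           # the same get on CPU 0 ...
    decs[1] += 1           # ... and put on CPU 1 happen mid-read
    seen_decs += decs[1]   # reader finishes the decrement pass
    seen_incs = sum(incs)  # increments are summed only afterwards
    return seen_incs - seen_decs
```

In the first function the reader arrives at zero even though the module was in use the whole time; in the second, every decrement the reader observed has its matching increment already recorded by the time the increments are summed, so the result can be too high for a moment but not too low.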

PRNG. It was noted (by Eric Dumazet) that recent kernels provide 16 bytes of random entropy to new tasks (AT_RANDOM) for the benefit of the glibc PRNG (Pseudo Random Number Generator). This is the reason that Jan Ceuleers was seeing repeated reads to entropy_avail seeming to decrease available entropy as the fork() of every task reading from that file would also consume it via indirect action.

SMIs. Joe Korty posted a patch entitled “A nonintrusive SMI sniffer for x86”, in which he proposed hooking into the idle loop to detect unexplained gaps in time, using a similar approach to my own SMI or hwlat detector, but only in the idle loop. The patch looks interesting as an additional means for runtime detection of SMIs; however, it cannot replace the alternatives because it is only able to detect SMIs during the short window of its execution. As an aside, Steven Rostedt and I are poking at a new implementation for hwlat.
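The general technique of looking for unexplained gaps in time can be sketched in userspace Python. This is only an illustration of the idea, not Joe's patch or the hwlat detector: `find_gaps` and `sample_timestamps` are hypothetical names, and a userspace loop will of course also see scheduler preemption, which the in-kernel detectors take care to exclude.

```python
import time


def sample_timestamps(n):
    """Read the clock n times back to back; values are in microseconds."""
    return [time.perf_counter_ns() / 1000 for _ in range(n)]


def find_gaps(timestamps_us, threshold_us):
    """Return (start, length) for every gap between consecutive samples
    that is longer than threshold_us."""
    gaps = []
    for prev, cur in zip(timestamps_us, timestamps_us[1:]):
        delta = cur - prev
        if delta > threshold_us:
            gaps.append((prev, delta))
    return gaps
```

Something like `find_gaps(sample_timestamps(100000), 100)` would then list every point where more than 100 microseconds vanished between two adjacent clock reads.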

In today’s miscellaneous items:

*). Bartlomiej Zolnierkiewicz noted that his “atang” tree has been rebased on top of the 2.6.33 kernel.

*). James Hogan pointed out that several of the watchdog ioctl definitions are technically incorrect, but Alan Cox pointed out that these historical mistakes cannot now be corrected without breaking compatibility.

*). Version 10 of the sys_membarrier patches from Mathieu Desnoyers. These allow a task to issue a process wide memory barrier from userspace, which is useful when implementing userspace locking primitives (such as the userspace RCU implementation Mathieu is working on).

*). A bunch of patches from Tejun Heo intended to handle the future case of mainline no longer implicitly including slab.h from percpu.h.

*). Version 2 of a fun patch from Xiaohui Xin implementing a zero copy method for DMAing data into virtualized KVM guests by means of pinning specific copy buffers within the guest memory. Avi Kivity noted that this can be more useful than PCI passthrough as it copes with migration.

*). A simple patch from Eric Dumazet addressing a regression that had stopped the ability to perform a rewinding seek on /dev/mem and therefore had broken the ability to use x86info correctly.

*). A patch to pagemap walking in procfs initially from San Mehat and then reworked a little. The conversation gave Linus a chance to rant about the entire pagemap code in general, which Matt Mackall didn’t enjoy.

*). A discussion of the preferred means to detect whether a given graphics driver is using KMS (Kernel Mode Setting) rather than simply walking through all PCI graphics devices, started by Rafael J. Wysocki.

*). A discussion about bitops compile time optimizations for hweight_long (a hamming weight calculation routine), that also covered implementing support for hardware popcnt using the alternatives() mechanism on x86. Borislav Petkov posted a patch entitled “Add optimized popcnt variants”.

*). General agreement that removing the “please try ‘cgroup_disable=memory’ option if you don’t want memory cgroups” message on boot is a good idea both for Red Hat Enterprise Linux and also for upstream. Red Hat had expressed some concern about unnecessary support calls.

*). Exposure of an old bug with interrupts being enabled early on some ARM systems as reported by code in start_kernel. This was raised by Rabin Vincent, and triggered Peter Anvin to dig through old trees and find that rwsems can be used early in init when IRQs are still off, but will unconditionally re-enable them. Kevin Hilman posted a generic patch, changing the rwsem slow path to use save/restore spinlocks.

*). VMware posted their Balloon driver in response to Avi Kivity (the KVM maintainer)’s suggestion that they not attempt to integrate this into virtio but instead stand separately as simpler code. Andrew Morton requested a writeup, saying “I think I’ve forgotten what balloon drivers do. Are they as nasty a hack as I remember them to be?” (short answer: yes).
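For anyone wondering what the Hamming weight routine in the bitops item above actually computes, here is a sketch in Python of one common bit-parallel way to count set bits in a 32-bit word. It is offered as an assumption about the general technique, not as the kernel’s implementation or the contents of Borislav’s patch; the function name simply borrows the kernel’s hweight naming.

```python
def hweight32(w):
    """Count the set bits in the low 32 bits of w."""
    w &= 0xFFFFFFFF
    # Each 2-bit field now holds the number of set bits it contained.
    w -= (w >> 1) & 0x55555555
    # Add neighbouring 2-bit counts into 4-bit fields.
    w = (w & 0x33333333) + ((w >> 2) & 0x33333333)
    # Add neighbouring 4-bit counts into 8-bit fields.
    w = (w + (w >> 4)) & 0x0F0F0F0F
    # The multiply sums all four bytes into the top byte.
    return ((w * 0x01010101) & 0xFFFFFFFF) >> 24
```

A hardware popcnt instruction does the same job in a single step, which is why selecting it at boot on CPUs that have it is attractive.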

In today’s announcements:

*). sg3_utils-1.29. Douglas Gilbert announced that version 1.29 of sg3_utils is now available. This package provides command line utilities for sending SCSI (and some ATA) commands to devices. Further information is available at: http://sg.danny.cz/sg/sg3_utils.html

*). 2.6.33-rt13. Thomas Gleixner announced that version 2.6.33-rt13 of the Real Time patchset is available. The patch is available from kernel.org at: http://www.kernel.org/pub/linux/kernel/projects/rt/

*). GIT 1.7.1.rc0. Junio C Hamano announced that version 1.7.1.rc0 of GIT is now available for download from http://www.kernel.org/pub/software/scm/git/. It includes a contributed script from Eric Raymond, support for GIT_ASKPASS, and a large number of other useful patches.

The latest kernel release was 2.6.34-rc3. The rc4 release was delayed for reasons that will be covered in the next episode of this podcast.

Rafael J. Wysocki sent an updated list of recent kernel regressions.

There was some concern from Taylor Lewick that kernel performance had regressed between the older 2.6.16 kernel he was running and more recent kernels, with transaction times increasing on the order of 15us. He posted some detailed statistics, though there have been few comments thus far.

Till Kamppeter noted that the deadline for student applications to the Google Summer of Code (GSoC) had passed and that it was time to assign them to the various kernel projects. In the end, all unassigned applications went to Grant Likely because he made the mistake of volunteering :)

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

Categories: episodes Tags:

2010/04/04 Linux Kernel Podcast

April 13th, 2010 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100404.mp3

For the weekend of April 4th 2010, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: BKL, KVM, Networking, and recvmmsg.

BKL. In the latest round of Big Kernel Lock (BKL) removal discussion, Arnd Bergmann posted some patches to the TTY layer, noting that it was “one of the trick[ie]r bits in the BKL removal series, so let’s discuss it here”. Arnd’s code is similar to the earlier Big Kernel Semaphore (BKS) concept but it uses a Big TTY Mutex instead. This is based upon a mutex, not a semaphore, that does not autorelease on sleep, and is intentionally confined to TTY use. Alan Cox replied suggesting that he wasn’t too bothered if these patches went in because he was working to remove the need for giant locks whatever they happen to be called. So the Big TTY Mutex may be a short lived piece in otherwise killing the BKL sooner rather than later. Having said that, Alan wanted to hold off a little while he took care of “low hanging fruit” first. Others agreed.

KVM. Jiri Kosina inquired about a kernel warning generated on 32-bit KVM guests when using an AMD guest CPU on an AMD host. The emulated guest CPU is an AMD model 2, stepping 3, which is one of the models AMD apparently explicitly did not support using in SMP configurations. Jiri wondered whether it was worth adding a specific hack for KVM (since SMP emulation does work), Andi Kleen suggested perhaps just killing the code that generates a warning on those systems as it is by now very old, while Andre Przywara really didn’t like removing the warning and favored simply emulating a better model instead. Pavel Machek agreed that emulating an explicitly SMP-capable CPU model was likely the solution.

Networking. Christoph Lameter inquired as to future network stack support for the PGM protocol (RFC 3208). Currently, there exists the openpgm implementation, which runs as a userspace application using raw sockets, but there are a number of limitations in so doing, not the least of which is a performance hit. Christoph feels that PGM belongs at the same level as both UDP and TCP support, though the conversation didn’t go much beyond discussing possible prototypes.

recvmmsg(). Linux 2.6.33 added a new system call called recvmmsg() that intends to complement recvmsg() in allowing for multiple packets to be received and processed at once, rather than performing one system call (or even more) per individual packet. Unfortunately for Brandon Black, who was trying to use this new feature in his DNS server implementation, calls to recvmmsg() on a blocking socket will result in the call blocking until the maximum requested number of packets are available, not just one single packet. Although Brandon says he is willing to work around this, he prefers a more configurable blocking behavior in use of recvmmsg(). Ulrich Drepper agreed; Brandon posted a patch.

In today’s miscellaneous items:

*). A couple of IDE reverts to deal with missing devices.

*). Some new cpu-hotplug wrapper functions (cpu_notify, __cpu_notify, and cpu_notify_nofail).

*). Some followup discussion on a new CPU flag bit on recent Intel CPUs that enables the CPU to declare that it explicitly has a synchronized TSC.

*). Some percpu module handling fixes for module static percpu from Tejun Heo.

*). An async firmware loading patch from Johannes Berg, intended to allow for non-blocking immediate rejection of unavailable firmware early during boot that is requested via request_firmware_nowait prior to boot completion.

*). Tilman Schmidt noted that CONFIG_PROVE_RCU is incompatible with proprietary kernel modules because it will result in the creation of a reference to a GPL only exported symbol even in modules that do not use RCU. He suggests that those building proprietary modules disable PROVE_RCU. Paul McKenney thanked him for sharing this solution with others who might be affected.

*). A fix for __module_ref_addr() use on stable kernels prior to 2.6.34 (where percpu use has been refactored) by Mathieu Desnoyers.

*). A scheduler bug present since November 12 2009 was identified in an email thread posted by Torok Edwin (and bisected by Mike Galbraith) in which use of latencytop results in the runtime of random tasks being set to really high values afterward due to the broken commit.

*). Version 10 of the “use lmb with x86” patches was posted by Yinghai Lu. There was some further discussion about the plan to essentially replace e820 handling on x86 with a modified version of the Logical Memory Block code that will now be modified to support parsing e820 tables.

*). A small tweak to the ordering of TLB flushing on S4 resume for i386 via a patch from Shaohua Li.

*). A discussion started by Torok Edwin concerning 32-bit perf tracing with a 64-bit kernel. Torok had been slightly confused by needing to re-install perf for a 32-bit build and this led Ingo Molnar to ponder whether it was time to have a variant of perf for each architecture variant built.

*). A nice summary of the various printk macros (pr_, dev_, netdev_, netif_, etc.) from Joe Perches after Neshama Parhoti asked about them.

*). A patch from Robert Schone modifying power_frequency events such that changing the frequency on another CPU results in it being traced rather than the CPU that initiated the frequency change operation.

*). A patch making it easier to disable fragmentation when doing PPP multilink from Richard Hartman. Apparently this reduces “packet loss and massive ping spikes” that are seen by Richard and others.

*). Lin Ming asked Corey Ashford whether he was still working on performance event support for “uncore” or “nest” CPU units (these are additional functional units on the same die as the CPU cores but not in-core). Corey said that he was not actively working on it but is working on nest events for IBM’s “Wire-Speed” processor using the existing infrastructure due to some time constraints. It looks like more will happen here in due course.

*). Some shadow page cache discussion for KVM MMU from Xiao Guangrong.

*). Some discussion between Peter Zijlstra, Rusty Russell and Tejun Heo concerning the latter’s “cpuhog” patches and the fact that Peter doesn’t like the name. Rusty on the other hand quite likes it, because “ugly things should have ugly names”. Tejun did propose an alternative set of names, including functions such as stop_cpu() and stop_cpus() but these don’t really stop CPUs, they hog them. So the CPU hog name is more apt.

*). Lee Schermerhorn posted some comparative benchmarks between a Red Hat 2.6.18 kernel and upstream 2.6.32 and 2.6.33 kernels showing recent upstream performance regressions. Plots: http://free.linux.hp.com/~lts/Pft/

In today’s announcements:

OSPERT 2010. Peter Zijlstra announced the official Call For Papers for the 2010 Operating System Platform for Embedded Real-Time applications conference. It is to be held on July 6th in Brussels, Belgium in conjunction with the 22nd Euromicro International Conference on Real-Time Systems, which happens between the 7th and the 9th of July also. Those working on embedded Real Time systems may find this particularly interesting. The paper deadline was April 4th.

Git 1.7.0.4. A maintenance GIT release was announced by Junio C Hamano.

LTP. Rishikesh K Rajak announced that the Linux Test Project (LTP) for March 2010 has now been released. It includes some last minute fixes and is available at the usual sourceforge.net/projects/ltp location.

LTTng 0.208. Mathieu Desnoyers announced the latest LTTng release 0.208 for Linux kernel 2.6.33.2 is now available. It uses waits with msleep() in place of cpu_relax() in order to handle !PREEMPT uniprocessor (UP) configurations.

The latest kernel release was 2.6.34-rc3 during the time period covered by this podcast episode.

Greg Kroah-Hartman announced the release of stable series kernels 2.6.27.46, 2.6.31.13, and 2.6.33.2. Existing users of these stable kernels should upgrade.

Finally today, Jeff Merkey surfaced from wherever he’s been recently and let everyone know that he has been issued US patent number 7,684,347, which, it was noted, seems to be simply an abstract “really fast” packet sniffer. Jan III Sobiesk suggested that someone should patent a “really fast operating system”. Jeff should have waited a few days for April 1st, the same day that the kernel.org website featured 180 degree (or pi if you prefer) rotated text on the main page – that wasn’t a hack, it was John and Peter showing some humor.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

Categories: episodes Tags:

2010/03/28 Linux Kernel Podcast

April 13th, 2010 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100328.mp3

For the weekend of March 28th 2010, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Filesystems, Interrupts, LMB vs. e820, Multitouch, PHY and phylib, the VM, and VMWare.

Filesystems. Josef Bacik posted a patch entitled “Introduce freeze_super and thaw_super for the fsfreeze ioctl”. In the patch, Josef notes that the existing fsfreeze code actually works too much at the block level, assuming every superblock is backed by a (typically a single) block device. For some modern filesystems – such as is the case with btrfs (Josef is a btrfs developer) – there can be a number of backing block devices, some of which may be added and removed while a filesystem is mounted. Consequently, Josef wishes to split out the freeze process to include dedicated superblock manipulating functions that don’t require the superblock s_bdev to be populated with one backing device. Al Viro had some typically useful comments about the patch, including some further followup to a reply by Nigel Cunningham containing some information about how TuxOnIce does filesystem freezing that Al was not too happy about.

Interrupts. Andi Kleen posted a patch entitled “Prevent nested interrupts when the IRQ stack is near overflowing”, in which he attempted to address the issue of too many IRQ vectors assigned to a given CPU all firing in rapid succession and causing the interrupt stack to overflow. Thomas Gleixner, in rejecting the patch first noted that Andi’s changelog was “utter nonsense” because it referred to interrupt nesting from the same interrupt source rather than many vectors, and then noted that simply disabling further interrupts in such cases was not the correct solution. Thomas favored doing away with IRQF_DISABLED and instead finishing the task of converting to threaded IRQ handlers with the small hard handler always running with IRQs disabled, and he wouldn’t take the patch “unless you come up with a real convincing story”. Alan Cox wondered if there was “anyone [Thomas had] forgotten to offend”, to which Thomas responded matter of factly that he wasn’t sure since he hadn’t measured IRQ handler run times “for quite a while”. Linus first told Thomas he was “wrong” in always disabling interrupts, and then seemed to change direction, giving some comments on removing IRQF_DISABLED entirely.

LMB vs. e820. Two different mechanisms for accounting and tracking physical memory layout are in common use within the kernel. Intel (x86) systems use the Intel e820 BIOS provided tables (and support code with the same name) to track which memory ranges are assigned to particular uses, while other architectures – including SPARC, POWER/PowerPC – use LMB (Logical Memory Blocks). The latter was made an architecture independent library in 2008 and lives in lib/lmb.c. The fact that there are two different systems came to a head when Yinghai Lu posted an early_res patch aiming to move the more architecture independent pieces of the existing e820 code into fw_memmap.c. David Miller (the SPARC maintainer) did not like this, since he believed that Yinghai wasn’t listening to earlier advice that LMB provided all of the support in an independent fashion and should be adapted to replace the e820 bits instead. Thomas Gleixner added that, “All we get are some meager bones thrown our way”, and suggested that this wasn’t the best way to interact with the community. The thread started a mini-architecture flamewar with Ingo Molnar noting that he really wished “non-x86 architectures apprec[ia]ted (and helped) the core kernel work x86 is doing”, and Benjamin Herrenschmidt more than taking offense at this statement. But that aside, Ingo did point out that Yinghai had been doing a lot of very difficult work that was certainly of use, even if in the end another approach to unifying various bits of LMB and e820 is taken. Yinghai later posted a new patch series entitled “use lmb with x86”.

Multitouch. Just in time for this author to buy a shiny new Macbook Pro that suffers from the same problem (and also uses the nouveau driver, that has had its own interesting ride recently), the discussion of multitouch finger tracking was raised again. Modern (laptop) hardware touchpads feature an ability to accurately track the position of multiple fingers at a time, and this allows for the kinds of gestures that are becoming popular today. At the same time, the X Window system that powers most graphical Linux desktops today has only minimal support and cannot handle such things as click and drag with two fingers. This means that your author has to use a custom hacked up mouse driver to support click and drag. I’m not the only one, and this prompted Henrik Rydberg to wonder recently whether it was time to add software finger tracking into the kernel. He pointed to an X.org discussion that had originally raised the idea back in summer 2009. Having discounted the idea then, he was now much more amenable to reconsidering. It seems likely that something will happen, it’s just a question of whether it will be directly in the input layer, in a new mtdev handler, or in an external library that is provided for userspace code to link against. In any case, your author is glad to see this in kernel, where it belongs.

PHY and phylib. Stefani Seibold posted in a thread entitled “fix PHY polling system blocking”, inquiring about the existing implementation for PHY link detection with MII (Media Independent Interface – the means through which network MAC chips communicate portably with various possible PHYs). The existing mechanism does not always use interrupts and can block for a few milliseconds (up to 4ms in one example with e100), while the chip that Stefani is using sees approximately 450us delay. Stefani made various proposals for adjusting the existing phylib, one of which was explicitly disliked by David Miller because it would break link-type changes.

VM. Mel Gorman followed up to a previous patch he had posted (in which he attempted to address some concerns with an IO intensive workload running with little available RAM that the VM may be calling congestion_wait in cases where something other than strict congestion is at fault) with some test results showing that the number of times kswapd and the page allocator have been calling congestion_wait and the time it spends in that function have been increasing since 2.6.29. Quoting Mel, “120+ kernels and a lot of hurt later;”. He posted very detailed test reproducer information, noting that the increase in calls to congestion_wait wasn’t due to any one change, and itemizing a few of the recent changes that have played a part. These include the TTY layer using higher order allocations more frequently, some CFQ fairness changes, and so on. He, Rik van Riel, Corrado Zoccolo, and Johannes Weiner bounced ideas around about the real reasons for performance regressions on the IO workload that was being tested. Simply adding more RAM was not the point.

VMWare. Dmitry Torokhov posted an RFC patch implementing a virtio extension for the VMWare balloon driver. Balloon drivers allow for virtualized guests to expand and contract their memory requirements at runtime, through a co-operative interaction with the hypervisor. In the case of VMWare, Dmitry says VMWare are interested in using the existing Linux virtio framework to communicate between Linux guests and the VMware hypervisor, but with a few tweaks – for example, their hypervisor may refuse to lock certain pages, or may (under certain circumstances) reset the balloon via a notification to the guest, without requiring the guest to explicitly notify on every page released back to the hypervisor as a consequence. Dmitry is interested in various other capabilities that could be exposed over virtio but is first interested to hear from the Linux community. So far that community is only represented in replies by Avi Kivity (KVM), who favors VMWare having their own balloon driver, or splitting out a shared “balloon core”.

In today’s miscellaneous items:

* Brian Gerst posted version 2 of a patch implementing merged fpu and simd exception handlers in one function.

* The final round of task_struct->signal stability cleanups from Oleg Nesterov.

* Support for nested pid namespaces from Serge E. Hallyn.

* A patch from Jason Baron implementing support for enabling the kmemleak checker and memory hotplug support simultaneously in the kernel config.

* Some changes to TAINT_ flag handling from Ben Hutchings (intended to distinguish non-harmful errors such as missing firmware from more serious issues that would traditionally have set the taint flag).

* Some work in progress discussion about reading remapped performance counters on x86 systems from Stephane Eranian (but the current patch breaks the already working implementation on POWER/PowerPC).

* The latest version (5) of the Memory Compaction patches from Mel Gorman.

* A patch allowing different tracers to be compiled independently from Jan Kara.

* The latest version (5) of the Jump Label patches from Jason Baron.

* An ARM port of the Linux Checkpoint-Restart patches from Christoffer Dall.

In today’s announcements:

The latest kernel release on the original date of this podcast was 2.6.34-rc2, which was released on March 19th. The current release is a higher revision.

Rafael J. Wysocki posted a list of reported regressions from 2.6.32 and 2.6.33 that were still possibly affecting 2.6.34-rc2.

Git 1.7.0.3. Junio C Hamano announced that version 1.7.0.3 of GIT is available. The latest release includes fixes for ACL support on the underlying filesystem, and various other fixes also.

IIO mailing list. Jonathan Cameron announced the creation of a new “Industrial input / output” mailing list since a lot of such discussions had been happening off list already. The new (majordomo) list is linux-iio@vger.kernel.org, and can be subscribed to via sending email to majordomo@vger.kernel.org as usual.

SystemTAP version 1.2. Frank Ch. Eigler announced the release of SystemTAP version 1.2 by posting some release notes. This includes various fixes for use with kernel versions 2.6.9 through 2.6.34-rc.

util-linux-ng v2.17.2. Karel Zak announced version 2.17.2 of the util-linux-ng package. This is a bugfix release.

Sachin Sant reported a hotplug test failure on -rc2, and Rafael J. Wysocki posted a link to an existing patch that corrected the problem.

Frederic Weisbecker inquired as to whether anyone would mentor the Linux Wireless Google Summer of Code (GSoC) project, to which there were no replies. Therefore it seems that some folks at Portland State University will be asking around amongst the student population for interested parties.

Finally today, Michael Gilbert noted that CVE-2009-4537 had been publicly disclosed for a while but an official (non-vendor) fix was not upstream. Neil Horman said he would take care of making a posting about it, and he did post an official fix for the r8169 frame length error a few days later.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

2010/03/21 Linux Kernel Podcast

March 21st, 2010 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100321.mp3

For the weekend of March 21st, 2010, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Linux 2.6.34-rc2, 64-bit system calls, core dumping to a pipe, exported symbols, page cache control, and performance counters for KVM guests amongst other things.

Linux 2.6.34-rc2. Although there is no official announcement as of this writing, Linus’ git tree currently contains a 2.6.34-rc2 release that he created on Friday March 19th 2010 at 6:17pm Best Coast Time (PDT). Once the announcement is officially made, there will be more detail.

64-bit system calls. Benjamin Herrenschmidt raised a question in a thread entitled “64-syscall args on 32-bit vs syscall()”, concerning the ability for existing kernels to handle passing 64-bit parameters to system calls when using a 32-bit userspace. A problem arises on platforms such as POWER and its smaller cousin, PowerPC, in which arguments are often passed by register and not on the stack (unless a large number are passed). When passing 64-bit values (as in calling fallocate() within hdparm), GCC may try to use multiple registers (which themselves need to be aligned on even boundaries) to pass a 64-bit value using two sequential 32-bit registers. But the syscall() function within glibc may try to effectively use the same trick again, causing arguments to be off-by-one. Benjamin had a proposal for modifying the existing syscall() interface in a way he thought would be backward compatible (perhaps confined to P{ower,OWER}{PC,} initially) but Ulrich Drepper wasn’t quite so trigger happy to make changes. Peter Anvin favored using explicit versioning to isolate any syscall() interface changes. Separately, Torok Edwin posted some perf (Performance Counters userspace utilities in the “perf” directory) patches enabling callgraph tracing of 32-bit processes when running 64-bit kernels.
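
To make the off-by-one concrete, here is a small sketch of the register layout problem. It is a simplified model and not the actual POWER/PowerPC ABI: the slot numbering, the "hi"/"lo" ordering and the helper name assign_slots are all assumptions made for illustration. The only rule it encodes is the one described above, that a 64-bit value occupies two consecutive 32-bit slots starting on an even slot.

```python
def assign_slots(args):
    """Lay out (name, bits) arguments into 32-bit register slots.

    Hypothetical model: a 64-bit value takes two consecutive slots and
    must start on an even slot index, so a padding slot is added if needed.
    """
    slots = []
    for name, bits in args:
        if bits == 64:
            if len(slots) % 2:
                slots.append("pad")
            slots += [name + ".hi", name + ".lo"]
        else:
            slots.append(name)
    return slots


# fallocate(fd, mode, offset, len) with two 64-bit arguments.
FALLOCATE = [("fd", 32), ("mode", 32), ("offset", 64), ("len", 64)]

# Layout the kernel expects when the system call is made directly.
direct = assign_slots(FALLOCATE)

# Layout the compiler produces for syscall(nr, fd, mode, offset, len):
# the extra leading "nr" changes where the padding slot lands.
via_syscall = assign_slots([("nr", 32)] + FALLOCATE)

# A naive syscall() wrapper strips "nr" and shifts everything down by one.
shifted = via_syscall[1:]

print("kernel expects:", direct)
print("kernel receives:", shifted)
```

In this model the kernel looks for "offset" in slots 2 and 3, but after the shift those slots hold the padding and only the first half of the value, which is the kind of mismatch the thread describes.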

Core dumping to a pipe. Neil Horman posted the 4th version of a patch series entitled “exec: refactor how call_usermodehelper works, and update the sense of the core_pipe recursion check”. In addition to addressing some existing race conditions with the implementation, Neil was interested in reworking the call_usermodehelper() function to handle core dumping to a pipe. In the existing arrangement, all running processes must have non-zero core dump ulimits for the pipe dump to work as planned. But Neil has had enough requests to be more flexible, and has come up with the idea of adding a function callback to the call_usermodehelper (umh) that will be made after the task (at this point, in userspace nomenclature, that is just about referable as a process – they are the same however) has been forked but prior to the exec() call starting the userspace code. That function pointer can, in the case of do_coredump, fiddle with ulimits.
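
The same "run a hook after fork but before exec" pattern exists in userspace, and a rough analogy may help picture what Neil is proposing. The sketch below is not the kernel interface: it uses Python's subprocess preexec_fn hook on a POSIX system, and the function name run_with_core_limit is made up for this example. The hook adjusts the core dump ulimit of the child only, leaving the parent untouched.

```python
import resource
import subprocess
import sys


def run_with_core_limit(soft_limit):
    """Start a child whose RLIMIT_CORE soft limit is set between fork and exec.

    Returns the soft limit as observed from inside the child.
    """
    def before_exec():
        # Runs in the child, after fork() and before exec().
        _, hard = resource.getrlimit(resource.RLIMIT_CORE)
        resource.setrlimit(resource.RLIMIT_CORE, (soft_limit, hard))

    child_code = (
        "import resource; "
        "print(resource.getrlimit(resource.RLIMIT_CORE)[0])"
    )
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        preexec_fn=before_exec,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


if __name__ == "__main__":
    parent_before = resource.getrlimit(resource.RLIMIT_CORE)
    print("child sees soft limit:", run_with_core_limit(0))
    print("parent unchanged:", resource.getrlimit(resource.RLIMIT_CORE) == parent_before)
```

The point of the design is the same in both cases: the code that knows which limits are wanted gets a chance to set them in the new task without changing anything in the caller.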

Exported symbols. Robert P. J. Day inquired whether the kfifo implementation should really be exporting as many symbols as it does. Tilman Schmidt alluded to the reasoning behind this in mentioning inlined functions. For background, whenever modules need to make use of some function within the kernel, that function must explicitly be exported through an EXPORT_SYMBOL or a similar macro definition – simply omitting the C keyword “static” does not have the desired effect. Sometimes, symbols are exported solely because they are used by corresponding inline functions that are included within module files and need to use the corresponding export. For example, an inline function called “foo” might need an export “_foo”. In order to clarify the situation, this author suggested a new EXPORT_SYMBOL_INTERNAL export to clearly label these use cases such that symbols are not used where they are not intended.

Page cache control. Balbir Singh posted a patch exposing a cache= kernel command line parameter that can be used to control page cache operation, and effectively disable it entirely in certain situations. This is of particular benefit to virtualized guests (especially those not wanting to enter into direct reclaim frequently), which otherwise might have their pagecache data effectively stored twice – once in the host, and once in the guest itself. Now, there being no such thing as a free lunch, Avi Kivity pointed out that this would slow down guests booted with cache=off because they would now need to use a virtio call to pull in more pages. However, guest memory utilization was shown to fall considerably, as might be expected without a page cache. Both Avi and Balbir seemed to agree that the tunable knob allowed for situation specific decisions to be based upon the specific needs of an environment – more overhead in the VM or a slight loss in performance, according to workload, IO types, filesystems, and a number of other items mentioned by both. Randy Dunlap specifically requested that documentation be added also.

Performance Counters for KVM guests. Yamin Zhang posted a patch entitled, “Enhance perf to collect KVM guest os statistics from host side” intended to facilitate the collection of performance counters statistics from the host when using Linux guest instances, with the exception of guest userspace. Avi Kivity was excited that this patch did not require the exact same kernel on both the host and the guest (he called that “critical”, noting that, “I can’t remember the last time I ran same kernels”). There did seem to be some agreement between both Avi and Ingo Molnar that having a vmchannel client in the host kernel exporting various data for tracing to guest kernels did make life easier for the implementers of such features but potentially opened up another DoS target and needed to be avoided. Instead, Ingo suggested that the host perf tools connect to the qemu instances managing guest instances and communicate over a well-known UNIX socket. The conversation went off onto a tangent about obtaining guest instance information using libvirt, whether there were other tools in common usage to manage guest instances other than starting them directly using the modified qemu, and the relative benefits of shipping all KVM kernel and userspace code in a single project. This gave Ingo an opportunity to get in another mention of what he considers to be “ugly” separation between glibc and the kernel. The entire thread is certainly worth reading, at dozens of posts and likely growing.

In today’s miscellaneous items:

*). A fix for allmodconfig with Xilinx soft core FPGA systems.

*). A device power management documentation update from Rafael J. Wysocki.

*). Version 7 of Andrea Righi’s per memory cgroup dirty page limit patch. Andrea provided some documentation updates that were discussed also. Separately, and on the note of cgroups, the CFQ_GROUP_IOSCHED configuration option was made visible in a patch from Li Zefan.

*). A bunch of scheduler and cpusets fixes from Oleg Nesterov, who also noted that there were remaining issues – including a potential lockup in do_fork() caused by receiving a signal from an IRQ or an RT thread pre-emption event because the runqueue lock (rq->lock) cannot be taken in the interim. Oleg asked the maintainers very nicely to please review his patches and comment, although there have been no comments posted in the last week on these.

*). Michael Braun reported an issue involving an interaction (or lack thereof) between the kernel crypto subsystem and the SLOB allocator. He finds that there is “general memory corruption” when using SLOB that isn’t present with the other allocators. Herbert Xu (and by extension, Pekka Enberg, since it was him who inquired as to whether these option were enabled) asked Michael to turn on some allocator debugging options and provide the relevant debugging output to facilitate further analysis.

*). A fix ensuring that legacy PIC interrupts are handled on all CPUs and not just the boot CPU when using the “noapic” kernel boot option from Suresh Sidda. This addresses a bug originally raised by Ingo Molnar.

*). A patch from Dmitry Torokhov re-implementing sysrq as an input handler, rather than as a custom hack in the legacy keyboard driver. Henrique de Moraes Holschuh wondered aloud whether this would introduce any problems for SAK (Secure Attention Key), which should be uninterruptible. That piece does not yet seem fully resolved in the thread.

*). A patch converting alpha to use clocksource rather than arch_gettimeoffset from John Stultz.

*). A misaligned percpu allocation when using lock events through perf on a particular SPARC box was reported by Frederic Weisbecker.

In today’s announcements:

Kernel.org. John (warthog9) Hawley announced the general availability of various SSL based services on kernel.org. Quoting John, “[t]his should help provide an additional level of security, in particular for our dynamic content like the wiki’s, patchwork and bugzilla”. John noted that the SSL certificates were generously donated by Thawte, and included a quote from the latter in which they state that they are, “proud of [our] open source lineage”. As of this writing, services officially using SSL (through explicit redirection) include Bugzilla, Wikis, Account Requests, Patchwork, while services that can use SSL if requested using the appropriate address do currently include the main www.kernel.org, boot.kernel.org, git.kernel.org, and android.git.kernel.org. Services not using SSL include mirrors.kernel.org (due to the volume of traffic incurred), and the geo-DNS entries because that would expand the number of SSL certificates required unreasonably.

Loop-AES. Jari Ruusu announced version 3.3a of the loop-AES file/swap utility. Details: http://loop-aes.sourceforge.net/

LTP. Rishikesh K Rajak sent an announcement saying that the previous ltp-cvs commit list would be supplemented by a new ltp-commits list that includes git commits also. The name would suggest that it may be somewhat VCS agnostic. Details: http://lists.sourceforge.net/lists/listinfo/ltp-commits

SCST. Vladislav Bolkhovitin posted to announce that the “new SCST SysFS-based interface has become fully usable, so you can start migrating to it and update your target drivers, dev handlers and management utilities”. For further information, please see: http://scst.sourceforge.net/

TCM. Nicholas A. Bellinger announced the release of version 3.4.0-rc1 of the Target_Core_Mod/ConfigFS infrastructure project, which includes a new Open-FCoE.org based target module (tcm_fc) for TCM/ConfigFS 3.x (mentioned in a separate release announcement). As of the latest release, the TCM/ConfigFS project is now tracking upstream Linux development once again. For further information: http://www.linux-iscsi.org/index.php/Target_Core_Mod/ConfigFS

RT 2.6.33.1-rt11. Thomas Gleixner announced the latest RT kernel patch version 2.6.33.1-rt11 is now available. Since he had been traveling, Thomas had made a few interim releases (rt6 through rt11), the sum of which he summarized. For further detail: http://www.kernel.org/pub/linux/kernel/projects/rt

TuxOnIce 3.1. Nigel Cunningham announced the 3.1 release of TuxOnIce. This is a series of alternative software suspend and resume patches that have been out of the kernel tree for some time, but have their various supporters. The latest patches include LZO compression support, UUID support for detecting suspend images without using a resume= parameter, and other fixes.

The latest kernel release is 2.6.34-rc2.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

2010/03/14 Linux Kernel Podcast

March 19th, 2010 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100314.mp3

For the weekend of March 14th 2010, I’m Jon Masters with a summary of the week’s LKML traffic.

In today’s issue: The 2.6.34 merge window, anonymous inodes, ATA 4KiB sector issues, cpuhogs, ext4, PCI, and USB console support.

The 2.6.34-rc1 merge window. Linus Torvalds announced the release of the first 2.6.34 RC kernel on Monday, March 8th 2010 at 12:33pm Best Coast Time (PST). In closing the merge window early, he hoped to make a point in line with previous comments on the issue of submitting merge requests in a timely fashion. Quoting Linus, “but in general the merge window is over. And as promised, if you left your pull request to the last day of a two-week window, you’re now going to have to wait for the 2.6.35 window.” According to Linus, nearly two thirds of the changes are in drivers (when factoring in 50% drivers/ code, 5% sound/ code, and 10% firmware). Of the remaining bits, about half is architectural and the rest is, well, the rest. So far, about 850 developers are involved. Linus again referred to his Fedora Nouveau rant in ending with a reference to the need to upgrade libdrm/nouveau_drv versions if using that driver.

Several architecture maintainers gave their excuses and requested pulls later, but Linus drew the line at a request from James Bottomley to pull SCSI pieces two days later, on March 10th. James noted that he had been en route back from India, nobody had told him the merge window would close early, and that the only commit added to his tree since the merge window closed on Monday was a bug fix. Linus said he was “not going to pull” and that the whole point behind closing the merge window early was because of people posting pull requests late that “should have been ready when the merge window _opened_”. James objected to the unpredictability of the merge window closing, but Linus said that “WAS THE WHOLE F*CKING POINT!”, in order to avoid last minute pull requests, and added that he would in future not even say how long the merge window was going to be in order to have requests ready the moment the window opened. Unfortunately for James, Linus wanted to make a point and he seemed to meet Linus’ criteria for doing so. Doug Gilbert later pointed out that people should not attack James just because he was the subject of “yet another Linus rant”.

Anonymous inodes. Dmitry Torokhov recently started a thread entitled “S[E]Linux going crazy in 2.6.34-rc0” (but note the corrected capitalization of “SELinux”). He was experiencing a side effect of some recent work by Al Viro, as well as others, to switch various subsystems such as inotify over to use anon inodes rather than their own “filesystem” type. Previously, inotify had used its own filesystem called simply and obviously “inotifyfs”. This allowed for SELinux rules to match on various notification events on an “inotify_t” type of filesystem. But with the trend to convert to anonymous inodes, there is no longer an easy way to write SELinux rules to confine applications (if that is what you actually want to do), and the existing rules go insane, as this author recently saw on a rawhide system that happened to be running SELinux. Eric Paris proposed two kinds of workaround: either “revert” everything back to how it used to be, or create support for differing security contexts for anonymous inodes. The latter seems more likely to happen, though the thread dried up at that point and nothing further was said on the topic until Eric Paris sent a pull request for some notify bits a week later.

ATA 4 KiB sector issues. Tejun Heo started a new thread entitled “ATA 4 KiB sector issues”, in which he lamented the current state of support for larger sector size ATA devices (those using 4K rather than 512 bytes as their natural unit of size – someone please add a comment to this article with a description for the term used to describe the natural size of a disk, its “word size”). Apparently, the transition will be “quite painful”. In his lengthy email, the gist of which is covered by an article on the kernel.org wiki at: http://ata.wiki.kernel.org/index/php/ATA_4_KiB_sector_issues, Tejun covers the issue of backwards compatibility, DOS partition table support, and that beast of beasts – Windows. Interestingly, I didn’t see a specific mention of the issue of unaligned writes when using journalled filesystems and ensuring commits have hit the disk, but I’m sure that’s covered somewhere in there. I suspect this is now required reading if you work on disk and block bits. James Bottomley added some useful notes about the lack of bootloader support, etc.
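
As a back-of-the-envelope illustration of why DOS partition tables come up in this discussion, the sketch below checks whether a partition's starting sector lands on a 4 KiB boundary. It assumes a drive that presents 512-byte logical sectors on top of 4096-byte physical sectors, and the helper name misalignment is made up for this example. LBA 63 is the traditional DOS starting sector; 2048 is a 1 MiB starting point.

```python
LOGICAL_SECTOR = 512    # bytes per sector as seen by the partition table
PHYSICAL_SECTOR = 4096  # assumed bytes per sector on the platter


def misalignment(start_lba, logical=LOGICAL_SECTOR, physical=PHYSICAL_SECTOR):
    """Return how many bytes past a physical sector boundary a partition starts.

    A result of 0 means the partition start is aligned.
    """
    return (start_lba * logical) % physical


for lba in (63, 64, 2048):
    offset = misalignment(lba)
    state = "aligned" if offset == 0 else "misaligned by %d bytes" % offset
    print("start LBA %d: %s" % (lba, state))
```

A partition that starts misaligned means that filesystem blocks straddle physical sectors, so a single 4 KiB write can turn into a read-modify-write of two physical sectors.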

CPU Hogs. Tejun Heo posted a patchset intended to generalize the case of monopolizing a CPU (or a set of CPUs) with a single kernel thread. The cpuhog functionality can be used by any kernel code that needs to grab one or more CPUs exclusively for some period of time, such as [k]stop_machine, which does just this during module load in order to ensure that it is safe to fiddle with the kernel symbol table. For good measure, Tejun also fixes the kernel migration threads to use cpuhog while he’s at it. LWN had a writeup on this topic later, and your author has a pet project in mind that should benefit already from using this patchset. Thanks Tejun Heo!

ext4. Christian Borntraeger posted asking about e4defrag support for compatible ioctls (as in the case on his system, with a 64-bit x86_64 kernel and 32-bit IA32 userspace environment). He suggested, “[l]et[']s just wire up EXT4_IOC_MOVE_EXT for the compat case.” This led Jeff Garzik to wonder aloud what the overall status was of ext4 defragmentation support. Jeff noted that he had actually poked at defragmentation support himself in the past and was “hopeful that I will see defragging in a Linux distribution sometime in my lifetime”. Eric Sandeen noted that such support had previously been in Fedora (briefly) but was removed because he (Eric) wasn’t so happy with the code. Since I happen to know Jeff has a good many years ahead of him, one hopes that he will get to see many great things, including ext4 defragmentation. Separately, Michael Tokarev pointed out another 32-bit userspace on 64-bit kernel issue with compatible ioctls, this time affecting AIO. Jeff Moyer was on the case with an initial test patch that he could use successfully with the libaio test harness built with -m32 while he continues to work in general on further AIO cleanups for the longer term.

PCI. Alex Chiang posted an updated patch based upon some awesome work that Matthew Wilcox had done to provide sysfs PCI slot to device mapping directory entries that can be used to determine which physical slot a device is actually installed in within the chassis of a given system. This will be of use to a number of projects, including efforts to name network interfaces according to the slot they reside in (rather than their MAC address) for distributions needing to support single system images – at least, that’s one possibility that comes to mind. I have pinged a few people myself to see if this will be of use to that effort in general, and there are bound to be many more.
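
For a feel of how userspace might consume slot information, here is a minimal sketch that walks the PCI slot directories in sysfs. It assumes the /sys/bus/pci/slots/<slot>/address layout that appears when a slot driver is loaded; the exact entries added by Alex's patch are not described above, so this reads only the address attribute, and the function name read_pci_slots is made up for this example. On machines without slot information it simply returns an empty mapping.

```python
import os


def read_pci_slots(root="/sys/bus/pci/slots"):
    """Return a {slot_name: pci_address} mapping read from sysfs.

    Slots without a readable "address" attribute are skipped.
    """
    slots = {}
    if not os.path.isdir(root):
        return slots
    for name in sorted(os.listdir(root)):
        address_file = os.path.join(root, name, "address")
        try:
            with open(address_file) as f:
                slots[name] = f.read().strip()
        except OSError:
            continue
    return slots


if __name__ == "__main__":
    for slot, address in read_pci_slots().items():
        print("slot %s -> device address %s" % (slot, address))
```

A naming scheme for network interfaces could then key off the slot name rather than the MAC address, which is the use case mentioned above.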

USB Console. Jason Wessel posted a 6 part patch series entitled “usb console imprevements series”, containing “aggregated and ported…usb patches I have previously posted which are not mainlined into a single series aimed at providing a stable [USB] console”. Jason began with a recap about what the problem with USB consoles currently is – that they are not synchronous (as opposed to regular serial UART consoles which are) and so will drop data on the floor if there is no room to buffer it when interrupts are disabled. The new code introduces intentional delay loops calculated through empirical testing using an FTDI USB part (a common part on many embedded boards, such as the BeagleBoard JTAG debugger sitting on this author’s desk).

In today’s miscellaneous items:

* some early dev_name() patches from Paul Mundt allowing early platform device code to use dev_name() before the guts of the driver core are online.

* This author was bitten by a recent bad commit from Al Viro that caused opendir() to succeed on regular files. I posted a question about it and was told that it had already been fixed. Indeed, it had.

* Ongoing debate happened about reducing the number of memory allocators in use on x86 systems, per a previous note from Ingo that there were 5 possibilities depending upon phase of boot and this needed to be reconciled.

* A rant from Finn Thain about a “coding style” fix patch for Macintosh that reduced a comment length to fit in 80 characters. Finn thought this was an utter waste of time, and repeated a comment often heard elsewhere, “checkpatch.pl is great but code that fails it is NOT always wrong.” and, “"Check patch" is a good idea but "check existing code" is a waste of everyone’s time.” Sometimes, cleanup patches do more harm than good; for example, a well intentioned “if” cleanup this week completely misunderstood how the indentation is supposed to work and was also summarily rejected. Ben Herrenschmidt’s only response to this mini-rant was “Amen !”.

* Mitake Hitoshi concurred with Guangrong Xiao’s posted results showing an *improvement* in performance of userspace mutexes when lock trace events were enabled. Reproducer code was posted and confirmed.

* Some useful documentation was provided on Linux’s circular buffering and memory barriers support from David Howells.

* Support for specifying in the environment variable context of a kernel emitted uevent whether it came because of a request_firmware() or a request_firmware_nowait() request was postulated by Johannes Berg (to handle the case of built-in drivers requesting firmware not in an initramfs). Kay Sievers pointed out that many events are re-triggered during boot and so the firmware loader cannot know what state the system is in, and therefore it might be better to leave requests for unsatisfiable firmware around “forever” until they are cancelled from userspace rather than trying to cunningly work around the issue of firmware not being present in an initrd context with special uevent environment variables.

* and the jabs at SELinux security labeling continued with Al Viro coming up with a few amusing retorts in the “Upstream first policy” thread and Ingo Molnar comparing SELinux relabeling wait times to fire doors, “we should prefer a one inch thick fire door that opens and closes fully automated to a five inches thick fire door that people keep always-open with a chair”. Ingo contends that all too often, people “turn off the whole thing” because of various frustrations and so there is less overall security than might be the case with a slightly less perfect system. Dave Airlie called SELinux relabels “the new fsck” and called for journalling.

In today’s announcements:

Benchmarks. Anca Emanuel announced some new Phoronix benchmarks for kernels 2.6.24 through 2.6.33, showing that performance has generally improved by 770% from 2.6.29 to 2.6.30 and only regressed very slightly in 2.6.32. Regretfully, however, 2.6.33 does not perform nearly so well, and, according to the Phoronix quote, “PostgreSQL performance atop the EXT3 file-system has falled off a cliff”. Full details are available on the http://www.phoronix.com/ website.

RT 2.6.33-rt6. Thomas Gleixner announced the release of version 2.6.33-rt6 of the RT patchset that he and others are continuing to develop against the 2.6.33 series kernel. As he mentions, there was an -rt5, but it was more of a separation point in the git tree. With the merging of some bits into that older tag, MIPS support rejoins the RT tree thanks to Wu Zhangjin. As usual, the RT patch is available on the kernel.org website, in the section devoted to such projects, or in the head (rt/head) and stable (rt/2.6.33) branches of the “tip” tree maintained by Ingo Molnar. Details: http://www.kernel.org/pub/linux/kernel/projects/rt/

The latest kernel release is 2.6.34-rc1.

Andrew Morton posted an mm-of-the-moment (mmotm) for 2010-03-09-19-15. Hiroyuki Kamezawa posted an updated version of his OOM notifier memory cgroup patches against this latest tree. Andrew later posted an mmotm for 2010-03-11-13-13. And in other “mm” news, Mel Gorman posted the 4th version of his “memory compaction” patches.

Greg Kroah-Hartman posted some review patches for stable kernels 2.6.33.1, and for 2.6.32.10. These were subsequently released.

Finally today, Robert P. J. Day asked whether it was still worth him running his “cleanup” scripts (that look for problems with kernel config options) after each merge window closes. Randy Dunlap thought “yes”, and was even more happy that Robert had posted his scripts for him and others to use. Details: http://www.crashcourse.ca/wiki/index.php/Kernel_cleanup_scripts Robert followed up later with another email saying that most of his popular cleanup scripts have now been posted, which is great.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.