
2010/04/18 Linux Kernel Podcast

Audio: COMING SOON

For the weekend of April 18th 2010, I’m Jon Masters with a summary of the past week’s LKML traffic.

In today’s issue: Linux 2.6.34-rc4, adaptive spinning mutexes, Microblaze, Remote Controller Subsystem, Stack Size, and VM.

Linux 2.6.34-rc4. Linus Torvalds announced the release of kernel 2.6.34-rc4 on Monday April 12th 2010 at 7:16pm PDT (Best Coast Time), which had been delayed while he, Borislav Petkov, Rik van Riel, and others were tracking down an annoying rmap VM regression caused by the introduction of anon_vma_chain support. Most of Linus’ announcement covers that bug – stay tuned for some coverage on that – but also mentions the new cxgb4 network driver.

Adaptive spinning mutexes. Benjamin Herrenschmidt posted a new thread entitled “Possible bug with mutex adaptive spinning”. With adaptive spinning, a task waiting on a mutex spins briefly rather than going to sleep immediately if the owner of the lock is currently running and might release it soon. Ben noted that the current code in mutex_spin_on_owner() does not correctly handle the case of the owner CPU being offlined. In this case, the function returns 1, meaning that the caller should spin, which it may do forever. Ben’s patch changes the return value to 0 when the CPU is offline so that the waiter goes to sleep immediately.
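The spin-or-sleep decision described above can be sketched as a small userspace model. This is not the kernel's mutex_spin_on_owner(); the struct and function names are hypothetical, and the only behaviour taken from the thread is that an offline owner CPU must produce 0 (sleep) rather than 1 (spin).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for what a waiter knows about the lock owner. */
struct toy_owner_state {
    bool cpu_online;  /* is the owner's CPU still online? */
    bool running;     /* is the owner currently executing on that CPU? */
};

/*
 * Returns 1 if the waiter should keep spinning, 0 if it should sleep.
 * Models the fixed behaviour: an offline owner CPU means "sleep".
 */
int toy_should_spin_on_owner(const struct toy_owner_state *owner)
{
    assert(owner != NULL);

    if (!owner->cpu_online)
        return 0;  /* owner CPU went away: spinning could last forever */

    return owner->running ? 1 : 0;
}
```

Before the fix, the offline case effectively took the "return 1" path, which is what allowed a waiter to spin indefinitely.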

Microblaze. Michal Simek posted a thread entitled “Microblaze – The fi[r]st year”, in which he summarized what has happened in the year since support for the soft-core Xilinx Microblaze CPU was first added to the mainline kernel. He calls out a number of folks for specific thanks – both from Xilinx, and from PetaLogix, as well as the wider community (the usual suspects: Andrew Morton, Arnd Bergmann, Grant Likely, Ingo Molnar, John Linn, John Williams, Stephen Neuendorffer, etc.). He includes a timeline of events over the past year as well as links to git trees, the wiki, and even a Facebook fan page (such is the world in which we live today – and yes, I am a “fan” myself).

Remote Controller Subsystem. Mauro Carvalho Chehab posted an informative mail entitled “Remote Controller subsystem status” in which he updated everyone on the current progress toward implementing a new “remote controller” subsystem that replaces the legacy V4L/DVB code and will become a new “core” subsystem available in /sys/rc. There is a userspace tool called ir-keytable and some discussion of plans for merging in 2.6.35. A mail worth reading.

Stack size. Dave Chinner posted a thread entitled “mm: disallow direct reclaim page writeback” in which he advocates for using the background IO flusher threads even in the case that VM pressure is so high that direct page reclaim becomes a necessity. Dave feels that in such cases, “we may have used an arbitrary amount of stack space, and hence enterring the filesystem to do writeback can then lead to stack overruns. This problem was recently encountered [on] x86_64 systems with 8k stacks running XFS with simple storage configurations”. This led to a longer thread in which the issue of kernel stack footprint was addressed, as well as the specific issue of what to do in the direct reclaim situation. Andi Kleen followed up to Chris Mason’s comments concerning the relatively large footprint of single fs functions with an assertion that the ‘4K stack simply has to go. I tend to call it “russian roulette” mode’. Andi considers such small stacks to be dangerous given the “obscure paths through the more an more subsystems”. He is fond of the separate interrupt stack in the case of 4K process stacks, but feels that there should always be a separate interrupt stack in any case, as might have helped in the case that Dave Chinner was mentioning in the original posting. Mel Gorman later followed up with an RFC patch series entitled “Reduce stack usage used by page reclaim” in which he attempted to “reduce some of the more obvious stack usage in page reclaim”, including in putback_lru_pages, kswapd, shrink_page_list, shrink_zone, and so forth (up to 1096 bytes saved).

VM. The Linux kernel includes support for reverse page mapping (rmap), a means by which it is possible for the Virtual Memory subsystem to answer important scalability questions such as “which virtual memory pages reference this physical page?” without having to walk through a large number of process page tables each time. Over the years, this code has become more complex through the addition of anon_vma and anon_vma_chain structures intended to allow object-based reverse mapping of anonymous memory pages with reduced overhead as compared with Rik van Riel’s original (and simpler) mechanism of having additional pointers in every struct page. anon_vma is used to track per-task anonymous VMA use, while anon_vma_chains link these together to allow the VM to determine which tasks have a shared reference to a given anonymous VMA.
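To make the idea of a reverse map concrete, here is a toy model in which each physical page records who maps it, so the question “who references this page?” is answered by reading the page's own list instead of scanning every page table. It is closer in spirit to the simpler per-page bookkeeping mentioned above than to the object-based anon_vma design, and every name and the fixed-size array are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_MAPPERS 8  /* arbitrary limit for the toy model */

/* One (task, virtual address) pair that maps a page. Hypothetical type. */
struct toy_mapping {
    int task_id;
    unsigned long vaddr;
};

/* A physical page that remembers every mapping pointing at it. */
struct toy_page {
    struct toy_mapping mappers[TOY_MAX_MAPPERS];
    size_t nr_mappers;
};

/* Record a new mapping of the page; returns 0 on success, -1 if full. */
int toy_page_add_rmap(struct toy_page *page, int task_id, unsigned long vaddr)
{
    assert(page != NULL);

    if (page->nr_mappers == TOY_MAX_MAPPERS)
        return -1;

    page->mappers[page->nr_mappers].task_id = task_id;
    page->mappers[page->nr_mappers].vaddr = vaddr;
    page->nr_mappers++;
    return 0;
}

/* How many mappings reference this page? No page table walk needed. */
size_t toy_page_mapcount(const struct toy_page *page)
{
    assert(page != NULL);
    return page->nr_mappers;
}
```

The cost of this naive layout is the per-page storage, which is the overhead the object-based approach was meant to reduce.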

The implementation of this complex VMA tracking was suffering from a bug that Borislav Petkov kept hitting when performing a suspend/resume cycle on his system. The resume code would wind up referencing a previously unmapped shared page first within a child process (setting up a new anon_vma) and later within a parent (causing an anon_vma_chain link to be set up pointing in the wrong direction, from child to parent), and the parent subsequently could no longer reach the child anon_vma after the child task exited. As Linus said, “End result: process A has a page that points to anon_vma B, but anon_vma B does not exist any more. This can go on forever. Forget about RCU grace periods, forget about locking, forget anything like that. The bug is simply that page->mapping points to an anon_vma that was correct at one point, but was _not_ the one that was shared by all users of that possible mapping.” Thus the fix is to ensure that new anon_vma_chain entries are always referencing the “_oldest_ possible anon_vma for the page mapping”, as is the case for Linus’ eventual (simple) patch, entitled “[PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma”. Borislav reported that the fix survived more than 20 test cycles where the system would previously have managed at most 6 resume attempts.
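The “pick the oldest” rule can be illustrated with a minimal sketch. This is not Linus' patch: the kernel works with linked anon_vma_chain entries, whereas this model assumes a plain array and an invented created_seq age stamp purely to show the selection.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an anon_vma; created_seq is an invented age stamp. */
struct toy_anon_vma {
    unsigned long created_seq;  /* smaller value means created earlier */
};

/*
 * Return the index of the oldest entry among the n candidates that a
 * page mapping could be associated with. Ties go to the earliest index.
 */
size_t toy_pick_oldest_anon_vma(const struct toy_anon_vma *candidates, size_t n)
{
    size_t oldest = 0;

    assert(candidates != NULL && n > 0);

    for (size_t i = 1; i < n; i++) {
        if (candidates[i].created_seq < candidates[oldest].created_seq)
            oldest = i;
    }
    return oldest;
}
```

Pointing page->mapping at the oldest candidate means it refers to the one every sharer can still reach, even after a younger task's anon_vma has been freed.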

Linus seemed genuinely excited about tracking down this bug – it can’t always be easy doing his job, and I’m sure he relishes an occasionally really dirty bug to poke at. One thing that did come of this exercise was an improvement in comments and documentation both on list and in the affected code. Linus seemed very happy with the effort Borislav was putting in to help test and track down this issue (ending the thread with a little joke about Borislav’s email gateway, which claims to be “SuperMail on a ZX Spectrum 128k”). The thread fixed a few other issues as well, and gave Peter Zijlstra a chance to post a documentation patch for page_lock_anon_vma noting that it is very difficult to serialize fully against page_remove_rmap, so the lock function doesn’t try; instead, all users of it should verify that the anon_vma returned to them is actually still relevant to them. Finally, Ulrich Drepper followed up some time later – on a tangent – wondering aloud why mprotect need create so many VMAs when changing permissions on thread stacks and the like instead of modifying page table entries.

As usual, Linux Weekly News (LWN) did a much better job of explaining the overall multi-day issue in depth so you are encouraged to take a look at their story for more of the history, analysis, and nice graphics.

In today’s miscellaneous items:

* Robert Richter posted some model specific performance events patches in order to support AMD IBS (an unfortunate acronym in this case standing for Instruction Based Sampling).

* Nigel Cunningham was looking for a job.

* Several people have reported issues booting MacBook Pros with recent kernels. Len Brown noted that this was likely already fixed (referencing BZ 15749). In response, Harald Arnesen was especially happy about git bisect as a debugging tool that lets non-kernel hackers help track down bugs such as this one.

* Jason Baron posted version 7 of his “jump label” patch series.
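On the git bisect item above: the reason it works well for people who don't know the code is that it is just a binary search over history, needing only a “good or bad?” answer at each step. The sketch below models that search over integer commit indices. It is not git's implementation; the function names and the demo predicate (a regression introduced at index 42) are invented for illustration, and it assumes history flips from good to bad exactly once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Caller-supplied test: does the build at this commit index show the bug? */
typedef bool (*toy_is_bad_fn)(int commit_index);

/*
 * Find the first bad commit, given a known-good index and a known-bad
 * index with good < bad. Assumes a single good-to-bad transition.
 */
int toy_bisect(int good, int bad, toy_is_bad_fn is_bad)
{
    assert(is_bad != NULL && good < bad);

    while (bad - good > 1) {
        int mid = good + (bad - good) / 2;

        if (is_bad(mid))
            bad = mid;   /* regression is at mid or earlier */
        else
            good = mid;  /* regression is after mid */
    }
    return bad;
}

/* Demo predicate: pretend the regression was introduced at index 42. */
bool toy_demo_is_bad(int commit_index)
{
    return commit_index >= 42;
}
```

Each test halves the remaining range, so even a range of a thousand commits needs only about ten builds.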

In today’s announcements:

Git 1.7.1.rc1. Junio C Hamano announced Git version 1.7.1.rc1, which includes a number of fixes. http://www.kernel.org/pub/software/scm/git/ This comes at around the time of the 5th anniversary of the kernel switching to Git for development, which Christian Ludwig noted occurred on the 15th April. Christian notes that he has made a YouTube video visualizing git development history, available at http://www.youtube.com/watch?v=ntTpM8hfl_E

Guilt 0.33. Josef “Jeff” Sipek announced version 0.33 of the Guilt (Git Quilt) series of bash scripts was now available from the usual location.
http://www.kernel.org/pub/linux/kernel/people/jsipek/guilt/

LTTng 0.210. Mathieu Desnoyers announced LTTng 0.210 for kernel 2.6.33.2, which was largely a revert of a PowerPC-specific TRACE_EVENT definition that occurred outside of include/trace, and which particularly bothered Mathieu.

sdparm 1.05. Douglas Gilbert announced that the 1.05 release of sdparm was now available. This is a direct analogy of “hdparm” but for SCSI devices, and so supports a lot of SCSI specific fancy options.

trace-cmd version 1.0. Steven Rostedt announced version 1.0 of his trace-cmd utility, which is a cross-platform, endian safe binary reader for ftrace that can be used to capture data on one machine (e.g. as a flight recorder) and then decode and process it on another, at runtime, or after the fact.
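“Endian safe” here means a trace recorded on one machine can be decoded on another with a different byte order. A common way to get that property, sketched below, is to assemble integers from individual bytes rather than casting the buffer to an integer pointer. This is not taken from trace-cmd's source; the function names are illustrative, and how a reader learns the byte order of the recording is left as an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a 32-bit little-endian value; result is the same on any host. */
uint32_t toy_read_le32(const unsigned char *buf)
{
    assert(buf != NULL);
    return (uint32_t)buf[0] |
           ((uint32_t)buf[1] << 8) |
           ((uint32_t)buf[2] << 16) |
           ((uint32_t)buf[3] << 24);
}

/* Decode a 32-bit big-endian value; result is the same on any host. */
uint32_t toy_read_be32(const unsigned char *buf)
{
    assert(buf != NULL);
    return ((uint32_t)buf[0] << 24) |
           ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8) |
           (uint32_t)buf[3];
}
```

A reader would pick one of the two helpers based on the byte order of the recording machine, not of the machine doing the decoding.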

The latest kernel release was 2.6.34-rc4.

Andrew Morton posted an mm-of-the-day (mmotm) for 2010-04-15-14-42.

An issue was discovered with a net-2.6 patch entitled “tcp: Set CHECKSUM_UNNECESSARY in tcp_init_nondata_skb” that caused ssh to fail. David Miller subsequently stated that he would revert this patch and specifically test zero length data area CHECKSUM_PARTIAL packets with the IGB driver.

Pavel Machek noted that the LOCALVERSION_AUTO configuration option, which appends a new version to the kernel on each compilation, has an unfortunate interaction with loadable kernel modules when CONFIG_MODVERSIONS is unset insomuch as it causes the simple kernel version check to fail. Linus was very clear that the problem here is people building kernels without enabling modversions and expecting that to be even remotely safe.
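The “simple kernel version check” mentioned above amounts to comparing two strings for exact equality, which is why any automatically appended suffix is enough to make it fail. The sketch below models only that comparison; it is not the kernel's module loader code, and the function name is hypothetical.

```c
#include <assert.h>
#include <string.h>

/*
 * Returns 1 if a module built for module_release may load into a kernel
 * reporting kernel_release, under a plain exact-match rule.
 */
int toy_module_release_matches(const char *kernel_release,
                               const char *module_release)
{
    assert(kernel_release != NULL && module_release != NULL);
    return strcmp(kernel_release, module_release) == 0;
}
```

With two made-up release strings such as "2.6.34-rc4-00001-gaaaaaaa" and "2.6.34-rc4-00002-gbbbbbbb", the check fails even if the code the module depends on is unchanged.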

That’s a summary of today’s Linux Kernel Mailing List traffic. For further information, visit www.kernel.org. I’m Jon Masters.
