Archive for May, 2010

2010/05/02 Linux Kernel Podcast

May 17th, 2010 jcm No comments

Audio: COMING SOON

For the weekend of May 2nd 2010, I’m Jon Masters with a summary of the past week’s LKML traffic.

In today’s issue: Linux 2.6.34-rc6, vger.kernel.org, Checkpoint and Restart, Frontswap, FUSE, and the Scheduler.

Linux 2.6.34-rc6. Linus Torvalds announced the latest 2.6.34 RC kernel on Thursday April 29th at 8:18pm PDT (Best Coast Time). The latest release is bloated by an updated PowerPC defconfig but does contain other fixes.

vger.kernel.org. There was a vger.kernel.org outage this week, from the 28th through the weekend, due to a power failure in the datacenter that hosts the equipment. This disrupted traffic to LKML, although some folks on IRC noted that their productivity had improved due to the lack of distraction.

Checkpoint and restart. Oren Laadan posted the latest version (21) of the “Kernel based checkpoint/restart” patch series, all 100 of the patches. He included various hints about which bits should be reviewed by whom, but the sheer size of the series boggled a few people. Although there wasn’t much discussion on the list, it does seem unlikely that a 100-part patch series of this kind would be pulled whole any time soon. http://www.linux-cr.org.

Frontswap. Discussion continued on some patches we missed in last week’s episode, on a rewritten piece of the previous “Transcendent Memory” patch series, named “Frontswap”. This piece of the large patch series – which is apparently shipping now in both OpenSuSE and Oracle Enterprise Linux – adds a new generic means to register what is the “opposite” of a swap-like backing store. Frontswap is essentially non-addressable RAM that is provided by a hypervisor (or perhaps a compressed in-kernel RAM device) and which may grow and shrink over time according to the availability of system resources. For example, a hypervisor may grant guests large amounts of otherwise unused RAM in the form of such “frontswap”able devices that may need to be reclaimed later on if other guests require the resources. Using frontswap, one can potentially avoid the additional disk overhead usually associated with “swap”. One of the biggest criticisms, from Avi Kivity, was that these patches assume access to the frontswap device is synchronous and not being performed using DMA or some other asynchronous process. Dan Magenheimer confirmed that this is an intentional design limitation in order to make the implementation much simpler for its use case(s) dealing with real physical RAM. Dan noted that the conversation had gone off on a tangent, discussing such other (interesting, but not directly relevant) issues as swap-to-flash.

FUSE. Miklos Szeredi posted an RFC patchset implementing splice(2) support for FUSE (Filesystems in USErspace). This means that it is possible to move an existing page directly into the page cache of the FUSE filesystem without ever having to perform a copy. Given that there is obvious overhead in having filesystems implemented in userspace, adding splice support is a nice touch. Apparently the early tests show improved bandwidth and reduced system time but it will be interesting to see what further testing reveals over time.

Scheduler. Ted Baker, Joerg Roedel, Doug Niehaus, and Peter Zijlstra discussed scheduler policy and classes available in the kernel in a followup to a much earlier thread entitled “RFC for a new Scheduling policy/class in the Linux-kernel”, specifically about any plans to support SCHED_SPORADIC. Both Ted Baker and Doug Niehaus had plans for the ability to assign a task a priority that is specifically non-runnable without having to send it a signal – such as SIGSTOP – that requires the task to run in order to process the STOP. Peter Zijlstra stated that the current plan involved supporting the sporadic task model through the use of SCHED_DEADLINE rather than POSIX’s SCHED_SPORADIC (the name of which, according to Peter, was jokingly “stole[n] [...] from us”). Ted Baker replied to Peter, noting that deadline scheduling and sporadic server scheduling are “two quite different things” – the latter belonging to the existing fixed priority scheduling domain (a separate problem domain from that of the deadline scheduling folks). Ted thought that any problems with the POSIX SCHED_SPORADIC API could be corrected through “interpretation” of the standard, so that a solution would be available in short order rather than in the longer term, especially if Linux were to produce an implementation that he could feed back to the Austin Group (the POSIX folks).

In today’s miscellaneous items:

* Mike Travis (SGI) posted a patch providing a kernel parameter to increase pid_max from 32k for early-in-boot use, before it can be otherwise set to a higher value. Otherwise, on a system with 1664 CPUs, Mike finds that there are 25163 processes started before the login prompt!

* Jack Steiner (SGI) noted that the existing SLAB allocator implementation of cpuset_mem_spread_node used a single rotor for allocating both file pages and SLAB pages, so that (on a multi-node memory system) writing a particular test file results in advancing the rotor 2 nodes per allocation and skipping e.g. odd-numbered nodes in the SLAB page allocation. The patch introduces a second rotor just for the SLAB page allocation.

* Philip Langdale (VM) noted that he has been following the Transparent Hugepage work over the past few weeks and is very encouraged. He claims a 22% improvement in ops/sec reported by SPECjbb under virtualization.

* A kernel developer posted a somewhat distressing thread suggesting some emotional disturbance caused by a particular relationship. In the interest of not being the US Weekly of LKML I shall refrain from further comment, and agree with the suggestion of using the “It’s Complicated” button on Facebook next time something like this comes up instead.

* Ying Huang posted initial support for APEI (ACPI Platform Error Interface).

* Joerg Roedel posted the second version of the “Nested Paging support for Nested SVM” patchset.

* Steven J. Magnani posted version 2 of a stack unwinder for Microblaze.

* A second series of viafb patches for OLPC from Jonathan Corbet, who later pushed a version 2.1 of the series, containing three additional patches fixing issues pointed out by Bruno Prémont. The patches are available from git://git.lwn.net/linux-2.6.git in the branch viafb-posted. Jon wondered if the patches were ready to go into viafb-next.
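The rotor problem Jack Steiner described in the items above is easy to see in a toy model. The following is only a sketch in Python, not the kernel code: the function name slab_nodes, the four-node system, and the strictly alternating slab/file workload are all assumptions made for illustration.

```python
def slab_nodes(num_nodes, allocations, shared_rotor):
    """Return the node chosen for each slab allocation, in order.

    Toy model only: a "rotor" is a round-robin counter over memory nodes.
    With shared_rotor=True, file and slab allocations advance the same counter.
    """
    rotors = {"file": 0, "slab": 0}
    chosen = []
    for kind in allocations:
        key = "file" if shared_rotor else kind
        node = rotors[key]
        rotors[key] = (node + 1) % num_nodes
        if kind == "slab":
            chosen.append(node)
    return chosen


# Hypothetical workload: each file write triggers one slab and one file page.
workload = ["slab", "file"] * 4
one_rotor = slab_nodes(4, workload, shared_rotor=True)
two_rotors = slab_nodes(4, workload, shared_rotor=False)
print(one_rotor)   # [0, 2, 0, 2]
print(two_rotors)  # [0, 1, 2, 3]
```

With one shared rotor the slab pages only ever land on the even nodes in this model, because the interleaved file allocation consumes every other slot. Giving slab its own rotor restores the even spread.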

In today’s announcements:

* DRM. Stefan Bader posted to let everyone know that he is now maintaining a 2.6.32-based tree on kernel.org containing backported DRM improvements for 2.6.32 based kernels, since a number of vendors are using that tree. Luis R. Rodriguez replied saying that this was “Great stuff! Thanks for putting this up!”. One wonders if this is one more sign of a growing trend.

* Linux 2.6.33-rt19. Thomas Gleixner announced version 2.6.33.3-rt19 of the Real Time patchset, containing mostly VFS scalability bits. This followed a previous 2.6.33-rt16 release also this week containing largely a merge with upstream 2.6.33, and -rt17 and -rt18 releases that contained a few fixes. Thomas notes in his posting that he had previously pushed out rt14 and rt15 without sending an announcement out to the list, so he included changelogs from -rt13 to 16, and rt17-rt18 (in the separate emails he made announcing -rt16, and -rt17). Patches are available at http://www.kernel.org/pub/linux/kernel/projects/rt/ and the tip git tree on git.kernel.org contains existing rt/head and rt/2.6.33 release branches.

* Upstart 0.6.6. Scott James Remnant announced the 0.6.6 release of the “upstart” SYSV init daemon replacement that supports modern asynchronous event driven operation rather than traditional runlevels (though it does also support emulating those for backward compatibility). Upstart is used by a number of distributions, and is available at upstart.ubuntu.com/

The latest kernel release was 2.6.34-rc6.

Greg Kroah-Hartman released stable series kernels 2.6.32.12 and 2.6.33.3. The former came with some thanks (and possibly an indirect dig at vendors) to Maximilian Attems for his “hard work digging out patches from the various vendor kernel trees for this release”. Maximilian was also thanked specifically in the latter case for contributing patches. Separately, Greg requested of Stephen Rothwell that he begin pulling a new staging-next tree into his daily linux-next tree (a nice present for Stephen as he returned from vacation).

Frederic Weisbecker replied (in an otherwise innocuous patch thread entitled “ptrace: Cleanup useless header”) noting that things touching the BKL should CC both him and Arnd Bergmann. They are still working on Big Kernel Lock (BKL) removal, which you can keep track of via http://kernelnewbies.org/BigKernelLock. There was some other BKL removal traffic over the past week, also, including some patches from Arnd entitled “Push down BKL into device drivers” (similar to the FS patches he had posted previously that did the same in that layer – nice).

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

2010/04/25 Linux Kernel Podcast

May 13th, 2010 jcm No comments

Audio: COMING SOON

For the weekend of April 25th 2010, I’m Jon Masters with a summary of the past week’s LKML traffic.

In today’s issue: Linux 2.6.34-rc5, CFS, Firmware, and IPC.

Linux 2.6.34-rc5. Linus Torvalds announced the release of Linux kernel 2.6.34-rc5 on Mon, April 19th 2010 at 4:42pm PDT (Best Coast Time). As he said, “Another week, another -rc. This time there wasn’t some big nasty regression I was working on to hold things up” (referring to the issues with anon_vmas and anon_vma_chains from last week). The latest release includes a number of general fixes, including boot fixes for ACPI parsing, and the usual kinds of driver updates (radeon, amd-iommu, filesystems). SPARC now has ftrace support if you are interested in playing with that. Upon mentioning regressions, Rafael J. Wysocki seemed to fly into action with his usual vigor and post his regular regression summary of issues outstanding since 2.6.33. The current statistics show that the number of unresolved issues has tended to increase over the several weeks leading up to -rc5, with 34 outstanding.

CFS. Mathieu Desnoyers posted version 2 of a patch entitled “CFS fix place entity spread issue”, which aims to address an apparent situation in which Mathieu felt that min_vruntime could go backwards and cause large unwanted latencies for certain workloads. Peter Zijlstra disputed that this was happening, and Linus, upon testing the patch using his “favorite non-scientific desktop load”, found that it made things worse in terms of X performance, which was apparently to be expected (according to Mathieu) because Xorg had been getting unfair runtime treatment that was now corrected. This didn’t make Linus particularly happy (from a user experience viewpoint) and meanwhile Mathieu and Peter continued to debate what was happening. Mathieu posted some links to an ELC (Embedded Linux Conference) presentation that he did on this topic at http://www.efficios.com/elc2010 and then later followed up (in an entirely separate thread) with version 11 of his “introduce sys_membarrier(): process-wide memory barrier” that he uses to assist with his userspace RCU implementation, all the while still stranded at San Francisco airport waiting for a means to get back home.

Firmware. Tomas Winkler posted a thread entitled “request_firmware API exhaust memory” in which it was discovered that some performance enhancement work done by David Woodhouse a while back actually caused the kernel to leak memory used for firmware handling, especially in the case that a large number of calls were made to request_firmware, as in the case of Tomas’ code. The issue was that the firmware code was attempting to free pages not allocated with vmalloc using vfree, whereas the underlying pages were actually being allocated and then mapped into linear kernel virtual memory with vmap calls. The fix involves unmapping and then freeing.

IPC. Manfred Spraul posted a three part patch series entitled “ipc/sem.c: Optimization for reducing spinlock contention” in which he attempts to “fix the spinlock contention reported by Chris Mason: His benchmark exposes problems of the current code”. Manfred then summarizes three main issues, including the prominent first issue that the algorithm used by update_queue() has a worst case performance on the order of O(N^2) and bulk wakeups can enter this worst case if they are unlucky. After applying the patch and performing some runs with sembench using 250 threads, waking 64 threads at a time, Manfred reports 1.1% CPU lost spinning vs. 47% before, and 6% of spinlocks spinning vs. 91% before, amongst other statistics.
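To get a feel for how a wake-up scan can go quadratic, here is a toy model in Python. It is a sketch only and makes several assumptions that are not from Manfred’s posting: the function names are hypothetical, every waiter simply wants to decrement one counter, and the scan restarts from the head of the queue after each successful wake-up. It is not the update_queue() algorithm itself.

```python
def restart_scan_wakeups(pending, sem_value):
    """Wake eligible waiters, rescanning from the head after every success.

    pending: amounts each queued waiter wants to subtract from the semaphore.
    Returns (amounts woken in order, number of queue entries examined).
    """
    pending = list(pending)
    woken, examined = [], 0
    progress = True
    while progress:
        progress = False
        for i, need in enumerate(pending):
            examined += 1
            if need <= sem_value:
                sem_value -= need
                woken.append(need)
                del pending[i]
                progress = True
                break  # state changed, so start again from the head
    return woken, examined


def single_pass_wakeups(pending, sem_value):
    """One pass over the queue; enough here because the value only shrinks."""
    woken, examined = [], 0
    for need in pending:
        examined += 1
        if need <= sem_value:
            sem_value -= need
            woken.append(need)
    return woken, examined


# Unlucky queue: four waiters that cannot proceed sit ahead of four that can.
queue = [100] * 4 + [1] * 4
slow = restart_scan_wakeups(queue, sem_value=4)
fast = single_pass_wakeups(queue, sem_value=4)
print(slow)  # ([1, 1, 1, 1], 24)
print(fast)  # ([1, 1, 1, 1], 8)
```

Both versions wake the same four waiters, but the restarting scan walks past the blocked entries again for every wake-up. With N/2 blocked and N/2 ready waiters that is roughly N^2/4 examinations, which is the flavor of bad luck a bulk wakeup can run into.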

In today’s miscellaneous items:

* Jon Corbet posted version 2 of an RFC patch series entitled “Initial OLPC Viafb merge”, and noted that he would begin a linux-next tree.

* Yanmin Zhang posted version 5 of a patch intended to implement perf statistics collection in the host of various guest KVM instances.

* Hiroyuki Kamezawa reported an issue with memory compaction support in the mm-of-the-moment (mmotm) for 2010-04-15-14-42. He and Mel Gorman discussed it a little. Separately, Mel posted version 8 of the memory compaction patch series, without an obvious fix for the crash issue.

* Justin P. Mattock reported that the issues booting MacBook Pro systems from the previous week seemed to now be resolved in the latest kernels.

* Rusty Russell posted a module patch that causes the module_lock mutex to be dropped when waiting for parallel module loads to complete.

* Don Zickus posted a 6 part patch series entitled “lockup detector changes” that “covers mostly the changes necessary for combining the nmi_watchdog and softlockup code”.

* Stefani Seibold posted yet another (unversioned in the subject line) 4 part patch series that was entitled “enhanced reimplementation of the kfifo API”, and which contained basically a rebase to recent kernels.

* Kyle McMartin posted a patch changing the default file permissions on the kernel provided pseudo file /proc/sys/vm/mmap_min_addr to 0600 from 0644. There wasn’t a huge security issue as writes were already denied by virtue of the fact that CAP_SYS_RAWIO was also required underneath.

* Kent Overstreet posted version 3 of the “bcache” patch series.

In today’s announcements:

* Linux Plumbers Conference (LPC). Ted Ts’o posted a “Call for Tracks”, noting that this year’s conference will take place in Cambridge, MA from November 3-5. The organizers are looking for “problem statements” summarizing “things that could be improved in Linux that cross multiple interfaces or other project boundaries”. For further information about the conference, and to submit ideas, see: http://www.linuxplumbersconf.org/

* git 1.7.0.6. Junio C Hamano announced version 1.7.0.6 of the GIT utility used for version control by the Linux kernel community. The latest version includes fixes for “git diff --stat” overflow, and “git rev-list --abbrev-commit” using the older 40-byte abbreviation format. Junio also announced version 1.7.1 of the GIT utility, which included updates to gitk, the ability to invoke an external command for passwords (GIT_ASKPASS), a new bash completion script (for those who use that), and dozens of other fixes besides. Git is available on the kernel.org website: http://www.kernel.org/pub/software/scm/git/

* hwloc. Samuel Thibault announced the release of hwloc version 1.0rc1, a “hardware locality” utility intended to provide command line support for obtaining information about NUMA memory, shared caches, processor sockets, processor cores, and processor “threads”. For further detail see the project website: http://www.open-mpi.org/projects/hwloc/

The latest kernel release was 2.6.34-rc5.

Andrew Morton posted an mm-of-the-moment (mmotm) for 2010-04-22-16-38.

There was some ongoing discussion of kernel vmalloc performance and a few patches were posted, most recently from Minchan Kim.

Joe Perches asked about the -staging tree review and acceptance process, noting that there are a “number of patches appear[ing] to go unnoticed or untracked”. Greg Kroah-Hartman followed up explaining that he’s had conferences, travel, and has moved house, and basically asked for a break. Greg has generally been responsive on the staging tree discussion list in my experience, and there is a lot of work that goes in there.

Greg Kroah-Hartman posted a 2.6.32 stable kernel review patch series comprising 197 individual patches to the “long term” stable kernel 2.6.32. He also posted a 139 part patch series for the 2.6.33 stable series kernel.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

2010/04/18 Linux Kernel Podcast

May 10th, 2010 jcm No comments

Audio: COMING SOON

For the weekend of April 18th 2010, I’m Jon Masters with a summary of the past week’s LKML traffic.

In today’s issue: Linux 2.6.34-rc4, adaptive spinning mutexes, Microblaze, Remote Controller Subsystem, Stack Size, and VM.

Linux 2.6.34-rc4. Linus Torvalds announced the release of kernel 2.6.34-rc4 on Monday April 12th 2010 at 7:16pm PDT (Best Coast Time), which had been delayed while he, Borislav Petkov, Rik van Riel, and others were tracking down an annoying rmap VM regression caused by the introduction of anon_vma_chain support. Most of Linus’ announcement covers that bug – stay tuned for some coverage on that – but also mentions the new cxgb4 network driver.

Adaptive spinning mutexes. Benjamin Herrenschmidt posted a new thread entitled “Possible bug with mutex adaptive spinning” in which he noted that the current adaptive spinning (in which a mutex will spin briefly rather than immediately going to sleep if the owner of a lock is already running and might release it soon) code in mutex_spin_on_owner() does not correctly handle the case of the owner CPU being offlined. In this case, the function will return 1, meaning that the caller should spin, which it may do forever. Ben changes the return to 0 in the case that the CPU is offline so that a sleep occurs immediately.
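The shape of the bug can be sketched in a few lines of Python. This is a simplified model and not the kernel’s mutex_spin_on_owner(): the helper names, the boolean flags, and the spin budget (which stands in for “forever”) are assumptions for illustration. The only parts taken from Ben’s report are the return convention (1 means keep spinning, 0 means go to sleep) and the change for an offline owner CPU.

```python
def spin_on_owner(owner_cpu_online, owner_running, fixed):
    """Return 1 if the waiter should keep spinning, 0 if it should sleep."""
    if not owner_cpu_online:
        # Before the fix this case reported "spin"; the fix reports "sleep".
        return 0 if fixed else 1
    return 1 if owner_running else 0


def try_lock(owner_cpu_online, owner_running, fixed, spin_budget=1000):
    """Spin while advised to; the owner never releases the lock in this model."""
    for spins in range(spin_budget):
        if not spin_on_owner(owner_cpu_online, owner_running, fixed):
            return ("sleep", spins)
    return ("still spinning", spin_budget)


before = try_lock(owner_cpu_online=False, owner_running=False, fixed=False)
after = try_lock(owner_cpu_online=False, owner_running=False, fixed=True)
print(before)  # ('still spinning', 1000)
print(after)   # ('sleep', 0)
```

In the unfixed case the waiter burns its whole budget, since an owner on an offlined CPU is never going to make progress while the waiter watches. With the return value changed to 0 the waiter gives up at once and sleeps.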

Microblaze. Michal Simek posted a thread entitled “Microblaze – The fi[r]st year”, in which he summarized what has happened in the year since support for the soft-core Xilinx Microblaze CPU was first added to the mainline kernel. He calls out a number of folks for specific thanks – both from Xilinx, and from PetaLogix, as well as the wider community (the usual suspects: Andrew Morton, Arnd Bergmann, Grant Likely, Ingo Molnar, John Linn, John Williams, Stephen Neuendorffer, etc.). He includes a timeline of events over the past year as well as links to git trees, the wiki, and even a Facebook fan page (such is the world in which we live today – and yes, I am a “fan” myself).

Remote Controller Subsystem. Mauro Carvalho Chehab posted an informative mail entitled “Remote Controller subsystem status” in which he updated everyone on the current progress toward implementing a new “remote controller” subsystem that replaces the legacy V4L/DVB code and will become a new “core” subsystem available in /sys/rc. There is a userspace tool called ir-keytable and some discussion of plans for merging in 2.6.35. A mail worth reading.

Stack size. Dave Chinner posted a thread entitled “mm: disallow direct reclaim page writeback” in which he advocates for using the background IO flusher threads even in the case that VM pressure is so high that direct page reclaim becomes a necessity. Dave feels that in such cases, “we may have used an arbitrary amount of stack space, and hence enterring the filesystem to do writeback can then lead to stack overruns. This problem was recently encountered [on] x86_64 systems with 8k stacks running XFS with simple storage configurations”. This led to a longer thread in which the issue of kernel stack footprint was addressed, as well as the specific issue of what to do in the direct reclaim situation. Andi Kleen followed up to Chris Mason’s comments concerning the relatively large footprint of single fs functions with an assertion that the ‘4K stack simply has to go. I tend to call it “russian roulette” mode’. Andi considers such small stacks to be dangerous given the “obscure paths through the more an more subsystems”. He is fond of the separate interrupt stack in the case of 4K process stacks, but feels that there should always be a separate interrupt stack in any case, as might have helped in the case that Dave Chinner was mentioning in the original posting. Mel Gorman later followed up with an RFC patch series entitled “Reduce stack usage used by page reclaim” in which he attempted to “reduce some of the more obvious stack usage in page reclaim”, including in putback_lru_pages, kswapd, shrink_page_list, shrink_zone, and so forth (up to 1096 bytes saved).

VM. The Linux kernel includes support for reverse page mapping (rmap), a means by which it is possible for the Virtual Memory subsystem to answer important scalability questions such as “which virtual memory pages reference this physical page?” without having to walk through a large number of process page tables each time. Over the years, this code has become more complex through the addition of anon_vma, and anon_vma_chain structures intended to allow object based reverse mapping of anonymous memory pages with reduced overhead as compared with Rik van Riel’s original (and more simple) mechanism of having additional pointers in every struct page. anon_vma is used to track per-task anonymous VMA use, while anon_vma_chains link these together to allow the VM to determine which tasks have a shared reference to a given anonymous VMA.
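For anyone who has not met reverse mapping before, the basic idea fits in a short Python sketch. To be clear about the assumptions: the ToyRmap class and its method names are invented for illustration, and keeping a per-frame set of mappers is closer in spirit to simple per-page bookkeeping than to the object based anon_vma scheme. It only shows why a reverse index avoids walking every page table.

```python
from collections import defaultdict


class ToyRmap:
    """Toy forward and reverse page mappings (not the kernel's data structures)."""

    def __init__(self):
        self.page_tables = defaultdict(dict)  # pid -> {virtual address: frame}
        self.reverse = defaultdict(set)       # frame -> {(pid, virtual address)}

    def map_page(self, pid, vaddr, frame):
        """Point (pid, vaddr) at frame, keeping the reverse index in sync."""
        old = self.page_tables[pid].get(vaddr)
        if old is not None:
            self.reverse[old].discard((pid, vaddr))
        self.page_tables[pid][vaddr] = frame
        self.reverse[frame].add((pid, vaddr))

    def who_maps_slow(self, frame):
        """Answer by walking every page table of every process."""
        return {(pid, vaddr)
                for pid, table in self.page_tables.items()
                for vaddr, mapped in table.items()
                if mapped == frame}

    def who_maps(self, frame):
        """Answer directly from the reverse index."""
        return set(self.reverse.get(frame, ()))


rmap = ToyRmap()
rmap.map_page(1, 0x1000, 7)
rmap.map_page(2, 0x2000, 7)   # a second process shares frame 7
rmap.map_page(2, 0x3000, 9)
sharers = rmap.who_maps(7)
```

Here sharers holds both mappings of frame 7 and matches what who_maps_slow(7) finds, but without touching the page tables of any process that does not map the frame.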

The implementation of this complex VMA tracking was suffering from a bug that Borislav Petkov kept hitting in performing a suspend/resume cycle on his system, in which the resume code would wind up referencing a previously unmapped shared page first within a child process (setting up a new anon_vma) and later within a parent (causing an anon_vma_chain link to be set up pointing in the wrong direction from child to parent) that subsequently could no longer reach the child anon_vma after the child task exited. As Linus said, “End result: process A has a page that points to anon_vma B, but anon_vma B does not exist any more. This can go on forever. Forget about RCU grace periods, forget about locking, forget anything like that. The bug is simply that page->mapping points to an anon_vma that was correct at one point, but was _not_ the one that was shared by all users of that possible mapping.” Thus the fix is to ensure that new anon_vma_chain entries are always referencing the “_oldest_ possible anon_vma for the page mapping”, as is the case for Linus’ eventual (simple) patch, entitled “[PATCH 4/4] anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvma”. Borislav said it survived more than 20 test cycles where the system would previously have managed at most 6 resume attempts.

Linus seemed genuinely excited about tracking down this bug – it can’t always be easy doing his job, and I’m sure he relishes an occasional really dirty bug to poke at. One thing that did come of this exercise was an improvement in comments and documentation both on list and in the affected code. Linus seemed very happy with the effort Borislav was putting in to help test and track down this issue (ending the thread with a little joke about Borislav’s email gateway, which claims to be “SuperMail on a ZX Spectrum 128k”). The thread fixed a few other issues as well, and gave Peter Zijlstra a chance to post a documentation patch for page_lock_anon_vma noting that it is very difficult to serialize fully against page_remove_rmap, so the lock function doesn’t try; instead, all users of it should verify that the anon_vma returned to them is actually still relevant to them. Finally, Ulrich Drepper followed up some time later – on a tangent – wondering aloud why mprotect need create so many VMAs when changing permissions on thread stacks and the like instead of modifying page table entries.

As usual, Linux Weekly News (LWN) did a much better job of explaining the overall multi-day issue in depth so you are encouraged to take a look at their story for more of the history, analysis, and nice graphics.

In today’s miscellaneous items:

* Robert Richter posted some model specific performance events patches in order to support AMD IBS (an unfortunate acronym in this case standing for Instruction Based Sampling).

* Nigel Cunningham was looking for a job.

* Several people have reported issues booting MacBook Pros with recent kernels. Len Brown noted that this was likely already fixed (referencing BZ 15749). In response, Harald Arnesen was especially happy about git bisect as a debugging tool for non kernel hackers to help track down bugs such as this one.

* Jason Baron posted version 7 of his “jump label” patch series.

In today’s announcements:

* Git 1.7.1.rc1. Junio C Hamano announced Git version 1.7.1.rc1, which includes a number of fixes. http://www.kernel.org/pub/software/scm/git/ This comes at around the time of the 5th anniversary of the kernel switching to Git for development, which Christian Ludwig noted occurred on the 15th April. Christian notes that he has made a YouTube video visualizing git development history, available at http://www.youtube.com/watch?v=ntTpM8hfl_E

* Guilt 0.33. Josef “Jeff” Sipek announced that version 0.33 of the Guilt (Git Quilt) series of bash scripts was now available from the usual location. http://www.kernel.org/pub/linux/kernel/people/jsipek/guilt/

* LTTng 0.210. Mathieu Desnoyers announced LTTng 0.210 for kernel 2.6.33.2, which was largely a revert of a PowerPC specific TRACE_EVENT definition that occurred outside of include/trace, and which particularly bothered Mathieu.

* sdparm 1.05. Douglas Gilbert announced that the 1.05 release of sdparm was now available. This is a direct analogy of “hdparm” but for SCSI devices, and so supports a lot of SCSI specific fancy options.

* trace-cmd version 1.0. Steven Rostedt announced version 1.0 of his trace-cmd utility, which is a cross-platform, endian safe binary reader for ftrace that can be used to capture data on one machine (e.g. as a flight recorder) and then decode and process it on another, at runtime, or after the fact.

The latest kernel release was 2.6.34-rc4.

Andrew Morton posted an mm-of-the-moment (mmotm) for 2010-04-15-14-42.

An issue was discovered with a net-2.6 patch entitled “tcp: Set CHECKSUM_UNNECESSARY in tcp_init_nondata_skb” that caused ssh to fail. David Miller subsequently stated that he would revert this patch and specifically test zero length data area CHECKSUM_PARTIAL packets with the IGB driver.

Pavel Machek noted that the LOCALVERSION_AUTO configuration option, which appends a new version to the kernel on each compilation, has an unfortunate interaction with loadable kernel modules when CONFIG_MODVERSIONS is unset insomuch as it causes the simple kernel version check to fail. Linus was very clear that the problem here is people building kernels without enabling modversions and expecting that to be even remotely safe.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

Dude, where’s the podcast?

May 7th, 2010 jcm No comments

Short answer is RHEL. I’m busy working on a bunch of things at the moment and the podcast has suffered. I’m planning to get caught up over the weekend if I can, or just skipping a few days/weeks and moving forward from now. I do my best, I know it’s not always good enough.

Jon.
