Archive

Archive for September 15th, 2009

2009/09/10 Linux Kernel Podcast

September 15th, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090910.mp3

For Thursday, September 10th, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: BFS, Checkpoint and restart, MMAP and Performance Counters.

BFS. Jens Axboe posted a link to his new “latt” tool that he has been using to perform some scheduling latency benchmarks and comparisons between BFS and the mainline scheduler, since it was of interest to a number of folks. He has since converted the link to a file explaining where to find the new git tree containing the source, which is not on the standard kernel.org website. On the subject of BFS, Ingo Molnar posted another round of scheduler comparison benchmarks entitled bfs-vs-tip-oltp-v2 in which he thanked Con Kolivas for providing incentive to examine scheduler latencies once again, but noted that Con’s alternative BFS “isn’t particularly strong in this graph” either.

Apologies to those who disliked my previous BFS commentary. No source of information is completely unbiased, and I do feel it completely appropriate to discuss any potential performance issues without restraint; however, I do not want to offend anyone too much in the process.

Checkpoint and restart. Sukadev Bhattiprolu posted an RFC patch series with an updated version of his new clone_with_pids system call. This is used in the latest incarnation of checkpoint and restart patches to re-create tasks within a given namespace using the same process IDs as were in use prior to taking a checkpoint. Obviously, such support is a precursor to tasks being restarted without explicitly supporting a change in process descriptor ID.

MMAP. Brian McGrew posted asking a question about creating large shared page mappings and the overhead incurred in doing so. He is replacing previous use of physical mapped memory (this is presumably involving an embedded device) with a form of software emulation in which many tasks will share the same direct physical pages via mmap. He finds that creating 4MB, 16MB, 64MB or even 256MB mappings is fine, but allocating 1GB introduces huge overhead. It is very likely (in my opinion) that he is on a 32-bit system and isn’t locking every page using an mlock, and a few other things. But perhaps this is some other issue that is worth looking into.

Performance counters. Masami Hiramatsu posted some updates to the kprobes based event tracer which will allow users to add trace events dynamically on ftrace and use those events with the new performance counters “perf” tools. This patch series continues the trend toward turning perf into a swiss army knife of Linux kernel interaction – and who knows where it might end. We had another such example also from Frederic Weisbecker, who posted an RFC patch series implementing hardware breakpoints on top of performance counters.

In today’s miscellaneous items: some tracing and ring buffer updates for 2.6.32 from Steven Rostedt, some trace filters updates from Tom Zanussi, an Android build fix from Kosaki Motohiro, some gconfig build updates disabling “typeahead find” search in treeview from Diego Eli Petteno, an update on GFS2 from Steven Whitehouse (in which he essentially says the tree will be as it is now unless last minute bugs are reported), some crypto updates for 2.6.32 from Herbert Xu (including a completed hash algorithm transition over to shash), some internal PCI hotplug interface cleanups from Alex Chiang, some cpuset and hotplug fixes from Oleg Nesterov, and some /dev/mem (and also /dev/kmem) cleanups from Fengguang Wu.

Finally today, Andreas Mohr posted some weird Xorg tty experiences from 2.6.31-rc6, which is likely so ancient at this point that it has long since been fixed in the recent tty layer work.

The latest kernel release is 2.6.31.

Andrew Morton released an mm-of-the-moment for 2009-09-09-22-56.

David Tees posted a question concerning an ext4 error he was seeing in his logs from ext4_mb_generate_buddy. He wondered if anyone had suggestions concerning how serious this actually is, and what to do other than his anticipated reboot and fsck cycle.

Zhenyu Wang sent a very detailed followup addressing why some folks might have experienced strange “blanking” problems on MacBook 2,1 systems running 2.6.31-rc7. This was due to an issue with the Intel 945GM chipset and the way that the MacBook integrated TV DAC routed signals. His description was quite elaborate, and he apologized for the delay in providing this helpful detail.

Greg Kroah-Hartman posted some stable review patches for the forthcoming 2.6.27.34 and 2.6.30.7 stable series kernels. The deadline for posting replies has already lapsed at this point, however. One wonders if the review window could be slightly larger anyway.

Stephen Rothwell posted a linux-next tree for September 10th. Since Wednesday, the acpi and security-testing trees lost issues, while the rr, block, and scsi-post-merge trees had some issues. The total sub-tree count remains steady at 140 trees in this compose.

Stephen reminds everyone (in a thread entitled “linux-next: merge window reminder”, and in today’s linux-next announcement) not to add code intended to hit 2.6.33 until 2.6.32-rc1 has been released, so that folks adding bits for post-rc1 have a chance.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

Categories: episodes Tags:

2009/09/09 Linux Kernel Podcast

September 15th, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090909.mp3

For Wednesday, September 9th, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Linux 2.6.31, Compcache, MMAP, and unreachable code.

More on Linux 2.6.31 in a moment, but first these other top stories.

Compcache. Nitin Gupta posted version 2 of his “compcache” compressed in-memory swap device. This is used preferentially prior to a backing disk since it is faster and can store more data in a compressed form than would be the case in simply having more free memory in the system pagecache. Since the previous release Nitin has switched to using struct page references rather than 32-bit PFNs (to make the code 64-bit safe), and a variety of other cleanups. Testing shows up to a 33% performance improvement in certain idealized test conditions. Presumably this is now targeting 2.6.32.
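The idea of keeping swapped-out pages compressed in memory can be illustrated with a small user-space sketch. This is not Nitin's code: the class name, the 4096-byte page size, and the use of zlib as the compressor are all assumptions made purely for illustration.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size for this sketch


class CompressedStore:
    """Toy in-memory store that keeps 'swapped out' pages compressed."""

    def __init__(self):
        self._pages = {}  # slot number -> compressed bytes

    def store(self, slot, page):
        if len(page) != PAGE_SIZE:
            raise ValueError("page must be exactly PAGE_SIZE bytes")
        self._pages[slot] = zlib.compress(page)

    def load(self, slot):
        return zlib.decompress(self._pages[slot])

    def stored_bytes(self):
        # Total memory consumed by the compressed copies.
        return sum(len(blob) for blob in self._pages.values())


store = CompressedStore()
page = b"A" * PAGE_SIZE  # a highly compressible page
store.store(0, page)
```

For compressible data the stored copy is much smaller than the original page, which is where the space saving described above comes from; incompressible pages would gain nothing.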

MMAP. Lee Schermerhorn noticed some “very erratic behavior” affecting certain (AIM7) workloads on distribution and mainline kernels, chiefly on larger systems such as an 8-socket, 32-core x86_64 platform with 256GB of RAM. Lee noticed a comment in mm/mmap.c:vma_adjust suggesting that there isn’t a need to take the anon_vma lock when only adjusting the end of a vma (as with brk()). The comment “questions whether it’s worth[while] to optimize for this case” but “apparently, on the newer, larger, x86_64 platforms, with interesting NUMA topologies, it is worth[while]”. The patch is a one-liner, but can double performance for the test workload, or at least stabilize the results.

Unreachable code. Roland McGrath posted a two part patch series introducing an UNREACHABLE macro that can be used to inform GCC that a particular code path cannot be reached in normal code execution. Although GCC itself has heuristics to determine when this is the case, it cannot catch assembly level impacts or certain other side-effects. Roland suggests folks begin looking for infinite for loops in the kernel and start to replace them since it takes a bit of enlightened reasoning to make the changes beyond a simple find/replace. He starts off in patching the BUG() macro to use his UNREACHABLE macro.

In today’s miscellaneous items: an update to the documentation for procfs covering the additional “time spent by a cpu servicing a guest” in /proc/stat from Eric Dumazet, an update concerning hid in 2.6.32 from Jiri Kosina (including mention of a rewrite of the debugging stub), a question about turning off ext4’s delayed allocation features from Clemens Eisserer, a trivial aoe fix from Jens Axboe, updated support for the “switch” command within compliant SD cards from Wolfgang Mues, some writeback fixes from Fengguang Wu, some updates concerning the sound tree in 2.6.32 (chiefly these will comprise driver updates, and many users won’t notice that), a trivial fix freeing the old name within kobject_set_name in the case of ENOMEM from Sebastian Ott, some internal PCI interface cleanups from Alex Chiang, some Xen bugfixes addressing spinlock bugs and stackprotector support from Jeremy Fitzhardinge, some cleanups to trace.h from Li Zefan, a fix to an unintended behavioral change in net_device_ops from Martin Decky, a fix for paravirt ops alternatives patching on 486 systems (previously failing in text_poke_early) from Ben Hutchings, and a fix to ensure the raw_time clocksource is updated in timekeeping_suspend from Janboe Ye.

Finally today, Ingo Molnar replied to the “Epic regression in throughput since v2.6.23” thread from Serge Belyshev with an assertion that he believes he has found the issue and has a fix in -tip that should be of interest. He would like folks to re-test and see if these improve scheduler performance.

In today’s announcements: Linux 2.6.31. Linus Torvalds announced the release of version 2.6.31 of the Linux kernel. In pointing to the kernelnewbies.org website for the full breakdown of changes, Linus took the opportunity to call out a few specifics. Amongst these were the “painful” changes to the new fsnotify backend to both inotify and dnotify, ongoing work on KMS, debug and performance counters work, and much much more. Linus announced the opening of the 2.6.32 merge window, but with the caveat that folks really should wait a few days to test and play with 2.6.31 before moving on to 2.6.32.

Greg Kroah-Hartman announced stable kernel releases 2.6.30.6 and 2.6.27.32, both containing a raft of updates, followed later in the day by 2.6.27.33, which contains a fix for building ocfs2 that some folks were hitting.

The latest kernel release is 2.6.31, released at 16:06 (BCT).

David Miller noticed that __hw_perf_counter_init on x86 systems might be leaking active_counters on error conditions, causing the LAPIC NMI watchdog to never get re-enabled even after all performance counters users go away.

Stephen Rothwell posted a linux-next tree for September 9th. Since Tuesday, the acpi, rr, security-testing, and scsi-post-merge trees had issues, while the async_tx, wireless, drm, tip and tty trees lost their issues. The total sub-tree count remains steady at 140 trees in this compose.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.


2009/09/08 Linux Kernel Podcast

September 15th, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090908.mp3

For Tuesday, September 8th, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: BFS, Inotify, OCFS2, and Tasklets.

BFS. Frans Pop posted an email thread entitled “Another BFS versus CFS shakedown” in which he says that he tried “very consciously” to pay attention to interactivity. His results seem to show what others have implied – that BFS falls down in many literal timing tests but seems to nonetheless offer a very smooth and interactive desktop experience, which is what Con was getting at in posting the proof of concept. Frans BCC’d Con, since he wasn’t sure whether he would want to actually participate in an LKML discussion.

Inotify. Giuseppe Scrivano posted a patch intended to extend inotify to support file descriptors in addition to plain old paths. The example cited is watching standard input from within GNU tail, in which case tail must perform an entirely different internal process for watching the standard input stream because it is not necessarily represented by a known file path. The proposal is to add a new system call entitled sys_inotify_add_watch_fd, which does roughly what it would imply.

OCFS2. Joel Becker posted an update about forthcoming OCFS2 patches, in which he noted that he currently has 85 patches queued up for the forthcoming 2.6.32 merge window, and that that will probably grow to over 100 patches. Amongst these is a “big ticket” item in the form of the reflinkat() system call, which had been discussed at this year’s filesystem workshop and is mentioned in this week’s Linux Weekly News.

Tasklets. Steven Rostedt replied to the ongoing backlash against using tasklets as interrupt handler “bottom halves” and the assertion from Stephen Hemminger that using process context for such processing is too slow, with a note that he plans to present on just this topic at LPC (Linux Plumbers Conference), demonstrating that process context is far from “too slow”. Meanwhile, Ingo Molnar pointed out how one might use performance counters on Intel systems to produce real-life measurements of any overhead.

In today’s miscellaneous items: some performance counters updates from Markus Metzger that split sample creation and output functions for performance, a fix for randomized stack configurations such that the kernel won’t accidentally pick an unfortunate mmap_base address starting in the stack reserved area from Michal Hocko, version 19 of the per-bdi writeback flusher threads patches from Jens Axboe, a fix for a PCI reference leak in the quirks code from Jiri Slaby, a fix to ensure data stored into an inode is properly seen before it is unlocked (fixes a corruption issue with ext3 over NFS) from Jan Kara, support for D-cache aliasing CPUs (such as many SPARCs) from David Miller, version 3 of a patch adding FAT root directory timestamps to the volume label from Jorg Schummer, a question concerning limiting the DMA mode picked for legacy IDE devices from Alan Stern, an ACPI 4.0 compliant power meter from Darrick J. Wong, a patch to make tmpfs depend upon shm from Hugh Dickins, some rcutorture updates from Paul E. McKenney, version 0.14 of the Ceph distributed filesystem, a fix for paravirt-alternatives support on i486 systems since these older processors like the 486 don’t necessarily invalidate pre-fetched instructions possibly containing paravirt ops information from Ben Hutchings, some very helpful flexible array documentation from Jon Corbet, some compiler (fPIC) options checks from Jory A Pratt, an August XFS status update from Christoph Hellwig (including 2.6.32 comments), a respin of the data=guarded patches for ext3 filesystems from Chris Mason, and a question concerning maintenance plans for 2.6.27 after 2.6.32 is released from Luis R. Rodriguez.

In today’s announcements: 2.6.31-rc9-rt9.1. John Kacur announced version 2.6.31-rc9-rt9.1, since Thomas Gleixner was on vacation. This is largely the same as the rc8 tree but contains a couple of other fixes also.

The latest kernel release is 2.6.31-rc9.

Serge Belyshev posted an email thread entitled “Epic regression in throughput since v2.6.23” in which he suggests a 10% performance degradation in tests between 2.6.23 and 2.6.31. He also comes out in favor of BFS, but it isn’t clear what kind of hardware he is using, nor how scalable the figures are.

Stephen Rothwell posted a linux-next tree for September 8th. Since Monday, the edac-amd tree has been removed temporarily (at the request of the maintainer), the v4l-dvb and trivial trees lost conflicts, while there remain a number of issues with acpi, async_tx, wireless, drm, security-testing, and scsi-post-merge. The total subtree count falls to 140 trees.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.


2009/09/07 Linux Kernel Podcast

September 15th, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090907.mp3

For the US Labor Day weekend of 2009, I’m Jon Masters with a summary of today’s LKML traffic. Happy Labor Day everyone. I spent my weekend in Maine, hiking the Knife Edge of Katahdin with my favorite AMC hiking buddies.

In today’s issue: BFS, boot interrupts, KVM, modules, tasklets, and VFS.

BFS. There has been some debate recently (and LWN has great coverage of it) about a new scheduling algorithm proposed by Con Kolivas (who pops up every few years in between getting disgruntled and saying that he won’t do so again) called the “Brain Fuck Scheduler”. It is intended to be really simple, and the initial posting came with all kinds of assertions about how it would perform better under typical desktop load conditions. Ingo Molnar got around to performing some tests under various workloads and found that, quoting, “BFS is slower than mainline in virtually every measurement” that he performed. He also discovered it performed worse in desktop interactivity tests, but he encouraged others to not take anyone’s word for it and run their own tests – including links to current versions of the upstream scheduler and Con’s patches. It should be noted that Ingo is the maintainer of the upstream scheduler and so is bound to be in part referring to himself.

Boot interrupts. Stefan Assmann posted a two part patch series disabling boot interrupts on Intel X58 and 55x0 systems. These are necessary, as he reminds everyone, because systems will otherwise generate legacy compatible interrupts that will simultaneously arrive at both the PIC and primary IO-APIC, even if the former is not in use and the latter’s corresponding line is masked. Needless to say, this can and has caused some pretty hairy issues (especially for the RT kernel) and so patches like these are most welcome. These patches, like his others, poke at generally hidden PCI configuration devices that must first be made accessible before allowing disabling of boot interrupts.

KVM. Jan Kiszka inquired of K. Prasad as to his ongoing work into implementing a generic hardware breakpoint support infrastructure that could be used to also handle the contextual save and restore of hardware debug registers upon switching within KVM from host to guest CPU environment. Avi Kivity had, apparently, previously suggested that these registers might be restored from current->thread.debugregX without having to explicitly save/restore, but Jan feels it might be better to just do this generically in K. Prasad’s code.

Modules. Michal Marek posted a two part patch series modifying kbuild to generate modules.builtin files that can be parsed by module-init-tools and used to recognize drivers and other optionally modular components that have in fact been built into the running kernel. This then allows those to be listed within lsmod and other tools.
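As a rough illustration of how a tool might consume such a file, here is a minimal sketch. It assumes the file lists one object path per line ending in .ko; that layout, the function name, and the sample paths are assumptions of this sketch rather than details taken from the patch series.

```python
import os


def builtin_module_names(text):
    """Return module names from modules.builtin-style text.

    Assumes one object path per line (for example kernel/fs/ext3/ext3.ko);
    the exact file layout is an assumption of this sketch.
    """
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        base = os.path.basename(line)
        name, _ext = os.path.splitext(base)
        # Module tools treat '-' and '_' as equivalent; normalise to '_'.
        names.add(name.replace("-", "_"))
    return names


# Made-up sample content, for illustration only.
sample = "kernel/fs/ext3/ext3.ko\nkernel/drivers/char/hw-random.ko\n"
names = builtin_module_names(sample)
```

A listing tool could then merge these names with the set of loaded modules to report built-in components alongside them.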

Tasklets. Luis R. Rodriguez raised the issue of using tasklets as containers for “bottom half” (a reference to old-school top/bottom half handlers) interrupt handler processing. He cites an older LWN story on the efforts by Steven Rostedt and others to remove tasklets or otherwise move them into a process context of their own. Luis feels there is particularly no reason for tasklets in wireless drivers and that instead much of the work can be moved out to a process context. This will of course be easier if the interrupts can be threaded and could do all of their handling without separate contexts, though this is not always possible in performance critical situations.

VFS. Linus Torvalds posted an 8 part patch series aimed at cleaning up VFS name lookup permission checking, with the stated goal of eventually doing multiple path component lookups in one go without taking the per-dentry lock or toggling the per-dentry atomic count for each component. The existing code is pretty horrific in terms of cacheline “ping-pong” on the common top-level dentries that “everybody looks up” and Linus is already able to show a roughly 3% performance hit on a single-socket Nehalem system. Included within his patch is an observation that there was never a need for the IMA code to call ima_path_check repeatedly during path lookups, only on the final path.

In today’s miscellaneous items: Jason Gunthorpe posted some scathing commentary of the existing implementation of pubek sysfs file for reading from TPM, and a bunch of fixes to “do it again”, some input updates from Dmitry Torokhov, a fix implementing “make file.s_c” building of dual C and assembly hybrid files from Amerigo Wang, the second patch series for ioatdma implementing RAID5/6 offload support from Dan Williams (in followup to the previous day’s patches), the 18th version of the per-bdi writeback flusher threads from Jens Axboe, a lot of helpful cleanups (mostly to x86) from Jan Beulich, some performance counters fixes from Ingo Molnar, some AMD-IOMMU passthrough support patches (iommu=pt) and page table/page fault handling updates for the same from Joerg Roedel, version 6 of the crashkernel=auto patches from Amerigo Wang, version 2 of an RFC patch series reducing the number of calls to global_page_state from balance_dirty_pages to reduce cache pressure from Richard Kennedy, some SPARC and networking updates from David Miller, a common method for reading and parsing user input within the tracing code from Jiri Olsa, a patch adding a boot option to disable the automatic VT cursor on boot (for use with graphical splash screens) from Matthew Garrett, some RFC sysfs documentation also from Matthew Garrett, version 3 of a patch removing a sleep in TASK_TRACED under a lock known as ->cred_guard_mutex from Oleg Nesterov, a single PCI fix for broken resource alignment calculations from Jesse Barnes, a cpuidle fix from Sanjeev Premi, some directory lookup optimizations for the performance counters perf tool from Ulrich Drepper by way of Arnaldo Carvalho de Melo, some tracing fixes from Frederic Weisbecker, a critical OCFS2 fix for rc8 from Joel Becker (correctly handling cancel requests rather than erroring out), a series of 18 patches from Steven Rostedt that had started out as a simple bugfix but turned into a significant rework to better handle switching per-cpu ring buffers, a new CROSS_COMPILE option in kconfig facilitating easier configuration of a cross compilation environment from Roland McGrath, a fix for ext2_rename correcting unbalanced use of kmap and kunmap (causing pkmap slots to get exhausted) from Nicholas Pitre, a fix to the RCU kconfig help text from Valdis Kletnieks, a SLUB RCU fix for 2.6.31 from Pekka J. Enberg, some firewire fixes from Stefan Richter, another version of a patch adding support for LZO-compressed kernel images from Albin Tonnerre, an update on the async_tx.git/next tree and merge plans for 2.6.32 from Dan Williams (the Intel one), some USB console fixes correcting an oops from Jason Wessel, an update on a suspend saga affecting the Sharp Zaurus from Pavel Machek, a fix for building User Mode Linux with bash 4 from Paul Bolle, some minor firewire fixes from Stefan Richter, some IDE patches from David Miller, an important IMA security fix from James Morris, a large number of linker script fixes and cleanups from Tim Abbott, some drm fixes for 2.6.31 final from Dave Airlie, perf trace filtering support from Li Zefan, and some documentation updates rendering consistent the default mountpoint for making available debugfs from GeunSik Lim, a patch to fix error handling in load_module from Kamalesh Babulal, version 5 of the clone_with_pids() system call from Sukadev Bhattiprolu, a fix for the case where ACPI state C2 is mapped to C3 from Luming Yu, a patch adding locking to ext3_do_update_inode to avoid a race from Chris Mason, some fixes for handling hot remove of mmaped files from Eric W. Biederman, a summary of the current VFS scalability queue from Nick Piggin, and a patch from Adrian Hunter aiming at making write_cache_pages more sequential in flushing back pages.

In today’s announcements: Linux 2.6.31-rc9. Linus Torvalds announced the release of version 2.6.31-rc9 of the Linux kernel. He was originally planning on shipping a final 2.6.31 already, but some fundamentals (such as broken inotify support) necessitated holding off for a few more days. He requests a final round of testing prior to the 2.6.31 release.

util-linux-ng 2.16.1. Karel Zak announced util-linux-ng version 2.16.1. The latest release includes a number of updates, amongst them a modules.dep parser that is particularly hairy but unfortunately “necessary” for ext2/3/4 detection.

The latest kernel release was 2.6.31-rc9, which was released on Saturday.

Rafael J. Wysocki posted a list of regressions from 2.6.30 to 2.6.31-rc9. These include 27 unresolved issues at this time, including inotify and page allocator problems that aren’t closed yet. The outstanding list of regressions between 2.6.29 and 2.6.30 also contained 27 items, and almost all of them are driver issues dating back for some time.

Tarkan Erimer reported an oops in the ALSA stack when running 2.6.31-rc7-git1-rt9. Christoph Lameter reported 5 second “hiccups” on CIFS with 2.6.31-rc8. Luis R. Rodriguez passed along some kmemleak reports from 2.6.31-rc8 which were affecting process_zones(). Gene Heskett couldn’t reliably run 2.6.31-rc9 without various segfaults taking down his mail.

Greg Kroah-Hartman announced the 2.6.27.32 and 2.6.30.6 stable review patches.

Stephen Rothwell posted a linux-next tree for September 4th. Since Thursday, the xfs, acpi, security-testing, and staging trees had issues (all new, except for acpi). The total sub-tree count remains steady at 141 trees.

Stephen Rothwell posted a linux-next tree for September 7th. Since Friday, the acpi, async_tx, mtd, battery, slab, trivial, percpu, tty, and scsi-post-merge trees gained issues, while security-testing lost its build failure but gained another for which Stephen reverted the offending commit. The total sub-tree count remains steady at 141 trees in the latest compose.

Valdis Kletnieks reports some “weirdness” in linux-next affecting KVM and bisected back to a patch from Beth Kon entitled “KVM: PIT support for HPET legacy mode”, which is causing hangs or triple fault reboots on a Dell Latitude D820 laptop.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.


2009/09/03 Linux Kernel Podcast

September 15th, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090903.mp3

For Thursday, September 3rd, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: CFQ, Matchreply, PCI, RCU tree scalability, and Tracepoints.

CFQ. Corrado Zoccolo posted an RFC patch series modifying the CFQ IO scheduler to adapt its time slice depending upon the number of processes that are currently performing IO. Effectively, rather than using fixed time slices, the IO time slice is scaled to a fraction of the number of processes performing IO and rescaled whenever that changes. The attached figures appear impressive.
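One way to picture slice scaling of this kind is to divide a fixed latency budget among the processes doing IO. The sketch below is only an illustration of that general idea: the function name, the latency budget, and the minimum and maximum slice values are all hypothetical and are not taken from Corrado's patches.

```python
def scaled_slice_ms(nr_busy, target_latency_ms=300.0,
                    max_slice_ms=100.0, min_slice_ms=10.0):
    """Divide a latency budget among the processes doing IO.

    All default values here are hypothetical, chosen only to make the
    scaling easy to follow.
    """
    if nr_busy < 1:
        raise ValueError("need at least one process doing IO")
    share = target_latency_ms / nr_busy
    # Clamp so a lone process does not get an unbounded slice and a
    # crowd of processes does not get uselessly tiny ones.
    return max(min_slice_ms, min(max_slice_ms, share))
```

With these made-up numbers a single process keeps the full 100ms slice, six processes get 50ms each, and a very large number of processes bottoms out at the 10ms floor.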

Matchreply. Tejun Heo posted a simple script that he has been using “for a couple of years now” to solve the problem of receiving many duplicated messages from different mailing lists. It indexes Maildirs and hooks into procmail to catch duplicates and redirect them into a separate folder.
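The summary above does not say how the script recognizes duplicates. A common approach, assumed here, is to compare Message-ID headers. The following minimal sketch shows that idea; the class name and the sample messages are invented for illustration and this is not Tejun's script.

```python
from email import message_from_string


class DuplicateFilter:
    """Remember Message-IDs and flag any message seen before."""

    def __init__(self):
        self._seen = set()

    def is_duplicate(self, raw_message):
        msg = message_from_string(raw_message)
        msg_id = msg.get("Message-ID")
        if msg_id is None:
            return False  # cannot tell, so treat it as new
        msg_id = msg_id.strip()
        if msg_id in self._seen:
            return True
        self._seen.add(msg_id)
        return False


# Two copies of one message arriving via different (made-up) lists.
copy_a = "Message-ID: <1@example.org>\nList-Id: list-a\n\nbody\n"
copy_b = "Message-ID: <1@example.org>\nList-Id: list-b\n\nbody\n"
dedup = DuplicateFilter()
first_seen = dedup.is_duplicate(copy_a)
second_seen = dedup.is_duplicate(copy_b)
```

A delivery hook could use the second result to route the repeated copy into a separate folder rather than the inbox.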

PCI. Tejun Heo posted a two part patch series splitting out pci_add_dynid support from store_new_id such that in-kernel code can add PCI IDs dynamically. It will be used by pci-stub to initialize initial IDs via module parameters and allows one to (for example) prevent built-in drivers from attaching to devices with certain IDs handled by loadable modules.

RCU tree scalability. Paul McKenney replied to Nick Piggin’s earlier RCU tree scalability concerns, saying that he believes that Nick is routinely driving up the number of callbacks queued on a given CPU to above 10,000, which would cause excessive calls to force_quiescent_state (400,000 calls per second, for example). He removes the grace period machinery from rcutree __call_rcu, which apparently was a previous effort to avoid implementing synchronize_rcu_expedited.

Tracepoints. Jason Baron, the cunning fox that he can be, posted a 4 part patch series implementing a new “jump label” optimization for tracepoints. The current tracepoint code is implemented using a global variable conditional for each tracepoint, which can become painfully hairy under memory pressure or with large numbers of tracepoints built into the kernel. To better handle this, in discussion with Roland McGrath and Richard Henderson, Jason and co. created a new “asm goto” statement that allows branching to a label. Using some code patching they effectively make switching tracepoints on and off a simple case of patching a jump instruction, conditionally.
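As a loose user-space analogy only (it does not patch machine code and is not how the kernel series works internally), the difference between testing a flag on every hit and rewriting the call target can be sketched by rebinding a callable. The class and method names are hypothetical, and the tracepoint name is just an example.

```python
def _nop(*args):
    """Disabled tracepoint: does nothing."""


class Tracepoint:
    def __init__(self, name):
        self.name = name
        self.hit = _nop  # "patched out" by default, no flag to test

    def enable(self, probe):
        # Analogous to patching the jump so it lands on the probe.
        self.hit = probe

    def disable(self):
        # Analogous to patching the jump back to skip the probe.
        self.hit = _nop


events = []
tp = Tracepoint("sched_switch")
tp.hit("before")          # disabled: nothing is recorded
tp.enable(events.append)
tp.hit("during")          # enabled: the probe runs
tp.disable()
tp.hit("after")           # disabled again
```

The point of the analogy is that the hot path never evaluates a per-tracepoint condition; enabling or disabling changes where the call goes instead.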

In today’s miscellaneous items: some kmemleak patches from Luis R. Rodriguez, some networking updates from David Miller, some sound updates from Takashi Iwai, some AMD Magny-Cours CPU support fixes from Andreas Herrmann, some block fixes for 2.6.31 from Jens Axboe (fixing the max_sectors_kb greater than 512KB issue mentioned previously), another bug report against reading /proc/kcore from Nick Craig-Wood, version 3 of Peter Zijlstra’s load-balancing and cpu_power patches, a fix to allow setrlimit on non-current tasks from Jiri Slaby, a fix to avoid sleeping in TASK_TRACED under the ->cred_guard_mutex lock from Oleg Nesterov, version 3 of the VMware virtual HBA support patches (including relatively minor fixes since version 2) from Alok Kataria, a fix to avoid truncation of the value in abs() if it is greater than 2^32 from Rolf Eike Beer (on 64-bit systems), a bunch of suggestions for asm-generic update candidates in various architecture trees from Robin Getz, and the latest round of rants about Linux software RAID (but on that subject, Dan Williams posted a 29 part patch series beginning the road towards RAID support in ioatdma).

Finally today, Amerigo Wang posted a series of patches implementing gcov support within kbuild such that “make foo/fbar.c.gcov” becomes possible.

In today’s announcements: Autofs version 5.0.5. Ian Kent announced version 5.0.5 of the autofs utilities. It’s been a long time, apparently, but better late than never, and that update seems fairly comprehensive.

The latest kernel release was 2.6.31-rc8.

Frank A. Kingswood reported another “inconsistent lock state” regression against 2.6.31-rc8, complete with a backtrace, in the JBD code.

Andrew Morton released an mm-of-the-moment for 2009-09-03-16-35.

Greg Kroah-Hartman posted an update on the staging tree for the upcoming 2.6.32 merge window. He reminds everyone that staging is not a dumping ground for dead code (citing the Ethernet Power Link driver as an example of an unmaintained driver that will be removed in the .32 cycle and warning that Android and others face a similar fate in the not too distant future if nothing changes soon).

Stephen Rothwell posted a linux-next tree for September 3rd. Since Wednesday, the xfs and net trees lost their issues, while the acpi, security-testing, tip, percpu and sfi trees gained several problems. The total subtree count remains steady at 141 trees in the latest compose.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.


2009/09/02 Linux Kernel Podcast

September 15th, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090902.mp3

For Wednesday, September 2nd, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Fsync, memory controller groups, and tree RCU scalability.

Fsync. Christoph Hellwig noticed that there is a disconnect in necessary fsync handling between older and newer filesystems. Many modern filesystems only update and write out metadata once other IO commits have taken place. They sometimes implement a wait inside their ->fsync methods but this is suboptimal because it happens under the i_mutex lock and must wait for an entire file to be flushed out. Instead, it can be preferable to simply wait for data writeout completion within O_SYNC handling prior to calling ->fsync. This is what Christoph’s patch does in modifying vfs_fsync_range. He includes a mini-audit of the impact upon existing filesystems and any necessary actions.

Memory cgroups. Kamezawa Hiroyuki notes that there are a few scalability issues with the current res_counter charge and uncharge accounting functions in the memory controller groups code, especially lock contention. He believes that there is a chance to perform batch-uncharge by building up a list of pages that have been affected by paging and accounting them at the time when other large chunks are processed (as a result of unmapping, truncation, at task completion, and so forth). Since it is late in the 2.6.31 cycle, he is willing to wait until the floodgates have opened for 2.6.32. Separately, Kamezawa also cleaned up multiple calls to res_counter_soft_limit_excess.
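
The batching idea can be pictured with a small userspace toy. This is only a sketch of the general technique, not the kernel's res_counter code: the ResCounter and UnchargeBatch classes, the PAGE_SIZE value and the page count are all made up for illustration. It counts how often the lock is taken when 1024 page-sized uncharges are accumulated and applied in one go.

```python
import threading


class ResCounter:
    """Toy lock-protected usage counter (hypothetical, not the kernel's res_counter)."""

    def __init__(self):
        self.usage = 0
        self.lock_acquisitions = 0  # how many times the lock was taken
        self._lock = threading.Lock()

    def charge(self, nbytes):
        with self._lock:
            self.lock_acquisitions += 1
            self.usage += nbytes

    def uncharge(self, nbytes):
        with self._lock:
            self.lock_acquisitions += 1
            self.usage -= nbytes


class UnchargeBatch:
    """Accumulates pending uncharges and applies them with one lock round trip."""

    def __init__(self, counter):
        self.counter = counter
        self.pending = 0

    def add(self, nbytes):
        self.pending += nbytes  # no lock is taken here

    def flush(self):
        if self.pending:
            self.counter.uncharge(self.pending)
            self.pending = 0


PAGE_SIZE = 4096  # assumed page size for the example
counter = ResCounter()
counter.charge(1024 * PAGE_SIZE)  # a single charge, to keep the example short

batch = UnchargeBatch(counter)
for _ in range(1024):  # e.g. pages released by one truncate
    batch.add(PAGE_SIZE)
batch.flush()

print(counter.usage, counter.lock_acquisitions)
```

Without the batch, the same 1024 uncharges would each take the lock once, which is the contention Kamezawa is describing.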

Tree RCU scalability. Nick Piggin posted saying that he is testing out the scalability (or lack thereof) of various VFS code paths, and that he is noticing a problem with call_rcu. According to Nick, __call_rcu is taking 54 times more CPU to do 8 times the amount of work when going from 1 to 8 threads, or a slowdown of roughly a factor of 6.7 (54/8) when using tree RCU. Nick requested further information from Paul McKenney, RCU inventor and chief guru.

In today’s miscellaneous items: some further fake numa node creation patches for powerpc from Ankita Garg, version 17 of the per-bdi writeback flusher threads patches from Jens Axboe, version 4 of a patch making O_SYNC handling use the standard syncing path from Jan Kara, some x86 performance counters updates from Markus T. Metzger, an updated version of the previous day’s walltime clock syncing patches for KVM guests from Glauber Costa, full NAT support for IPVS with netfilter matching support from Hannes Eder, a rework of the GPE handling in the ACPI code from Matthew Garrett, additional warnings within Documentation/md.txt (largely fueled by Pavel Machek’s ongoing rants about RAID support), a summary of merge plans for RDMA in 2.6.32 from Roland Dreier (who suspects rc8 signals impending merge window craziness), some performance counters fixes for POWER7 support from Paul Mackerras, and Luis R. Rodriguez wondered whether kmemleak.h really needed to be exported to userspace.

Finally today, Frederic Riss questions whether ARM kprobes unregistration is SMP safe. The current code makes use of an illegal instruction to trigger the kprobe code, and Frederic cannot see how one avoids a situation in which a probe is being unregistered as another core takes an illegal instruction. He wonders whether stop_machine should be in use instead.

The latest kernel release was 2.6.31-rc8.

Inotify continues to be a pain upstream. Tej Bewith posted a git bisection in which he claims that a recent fix from Eric W. Biederman to ensure NULL termination actually prevents his system from booting.

Maciej Rutecki posted a potential regression against USB in 2.6.31-rc8. Apparently, since rc7, a Debian testing box experiences unreliable detection and handling of plugin flash drives on KDE4 (one assumes with identical userland between the two kernels).

Wu Zhangjin discovered a kernel panic on 2.6.31-rc7-rt8 in the SetPageLRU code, running on MIPS.

Stephen Rothwell posted a linux-next tree for September 2nd. Since Tuesday, the tree gained a few build failures (xfs, acpi, v4l-dvb, net, block). The total subtree count remains steady at 141 trees in the current compose.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

Categories: episodes Tags:

2009/09/01 Linux Kernel Podcast

September 15th, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090901.mp3

For Tuesday, September 1st, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: CFQ, Flexible arrays, IO controllers, kthreads, KVM, NOHZ, and POSIX.

CFQ. Jeff Moyer responded to a bug posting against 2.6.30 in which it had been discovered that the CFQ IO scheduler could (under certain circumstances) skip over incoming requests (mostly those issued out of order) and dramatically diminish the performance of, for example, packet writing to a DVD. Jeff’s patch causes a new next_req to be chosen in cfq_dispatch_insert so that there will always be a request to handle if there are some left in the queue. The results attached to his patch speak for themselves.

Flexible arrays. David Rientjes posted some updates to his “flex arrays”, changing the way that static definitions are done because the existing implementation of FLEX_ARRAY_INIT had no way to determine whether its parameters were valid (since it simply served as a struct initializer). Instead, the new DEFINE_FLEX_ARRAY interface (which can be prefixed with ’static’ for file scoping purposes) performs checks on its parameters, which include a new “name” parameter specifying the name of the resultant structure that will be defined by the macro call.

IO controllers. On another IO related note, Vivek Goyal posted an update in regard to dm-ioband testing. He took one 40GB SATA drive (without hardware queueing) and created two partitions on the disk, to each of which he associated a new ioband device, at weight 200 and 100 respectively. Vivek assumed that this would result in the first device seeing double the IO bandwidth of the second, but this is not what happened in practice. He attached the scripts that he used to generate the tests and requested clarification from Ryo Tsuruta.
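
The expectation Vivek started from is plain proportional sharing: each device's share is its weight divided by the sum of all weights. A minimal sketch of that arithmetic follows, assuming strictly weight-proportional behaviour (which is exactly what his test did not observe); the labels "ioband1" and "ioband2" are placeholders, not names from his scripts.

```python
from fractions import Fraction


def expected_shares(weights):
    """Return each device's expected bandwidth share as weight / total weight."""
    total = sum(weights.values())
    return {name: Fraction(weight, total) for name, weight in weights.items()}


# Weights from the test described above; the device labels are hypothetical.
shares = expected_shares({"ioband1": 200, "ioband2": 100})
ratio = shares["ioband1"] / shares["ioband2"]

for name, share in shares.items():
    print(name, share)
print("ratio:", ratio)
```

Under this model the first device would be expected to see two thirds of the bandwidth and the second one third, hence the "double" in Vivek's assumption.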

Kthreads. Ingo Molnar noticed a synchronization problem at boot time involving kthreads in which there appears to be a race between the initial task (which becomes the idle thread of CPU0) and the init task (which, as he points out, is not the same as the initial task). Although the BKL protects the interaction between these two tasks, little protects which will run first, and there is a possibility that init might run sooner than rest_init, with a resultant ksoftirqd creation failing due to a NULL kthreadd_task. Ingo adds a completion variable to avoid this situation and tags the patch for -stable.
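
The shape of the fix can be sketched in userspace with a threading.Event standing in for the kernel's completion variable. This is an analogy under stated assumptions, not Ingo's patch: the names kthreadd_done and init_task_entry are invented here, and only rest_init and kthreadd_task come from the description above.

```python
import threading

kthreadd_task = None               # stands in for the global mentioned above
kthreadd_done = threading.Event()  # hypothetical name; plays the completion's role
observed = []


def rest_init():
    """Sets up kthreadd, then signals that it is safe to create threads."""
    global kthreadd_task
    kthreadd_task = "kthreadd"
    kthreadd_done.set()            # analogous to complete()


def init_task_entry():
    """The init side: must not proceed until kthreadd_task is valid."""
    kthreadd_done.wait()           # analogous to wait_for_completion()
    observed.append(kthreadd_task is not None)


# Start the init side first to mimic the unlucky ordering described above.
init_thread = threading.Thread(target=init_task_entry)
init_thread.start()
rest_init()
init_thread.join()

print(observed)
```

Even though the init side is started first, it blocks until the setup side has signalled, so it never sees the unset value.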

KVM. Glauber Costa, in likely earning himself a few beers, posted two patches that introduce a worker thread fired by kvmclock that will update the guest wallclock time periodically to be in sync with the host’s wallclock. This allows system administrators to set only the host wallclock time and avoid having to run NTP within guest VMs to deal with changes in time.

NOHZ. Josh Triplett posted in regard to the tickless kernel and the reality that the kernel is only truly tickless (running without a timer interrupt) when it is running only the idle task (at other times, the system will still be interrupted every 1/HZ seconds for a timer interrupt). Josh points out that on a system largely doing number crunching, these interrupts can add up to something quite unpleasant – as much as an 8% overhead in his case. With a simple sledgehammer approach, Josh posts a patch that forces the kernel to remain tickless all of the time. The patch as it stands breaks RCU, process accounting, POSIX CPU timers, and other things, but he wants to encourage discussion and debate about the best way forward for development.
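
To get a feel for what an 8% overhead means per tick, here is a back-of-the-envelope sketch. It assumes the whole overhead is attributable to the timer tick and tries a few common HZ values; the summary above does not say which HZ Josh was running, so the HZ values are assumptions.

```python
def per_tick_cost_us(overhead_fraction, hz):
    """Average microseconds lost per tick if ticks alone explain the overhead."""
    return overhead_fraction / hz * 1_000_000


# 0.08 is the 8% figure quoted above; the HZ values are assumed, not reported.
costs = {hz: per_tick_cost_us(0.08, hz) for hz in (100, 250, 1000)}

for hz, cost in costs.items():
    print(f"HZ={hz}: about {cost:.0f} microseconds per tick")
```

The per-tick figure includes indirect costs such as cache disturbance, not just the time spent in the interrupt handler itself.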

POSIX. Jim Meyering noticed that, for a mount point with a filesystem mounted on it, the dirent.d_ino returned by getdents and readdir differs from the st_ino inode number reported by stat. This he claims is in violation of POSIX 2008 and caused him to disable an optimization in coreutils ‘ls -i’. He attaches a snippet of the recent POSIX specification and hopes that “Linux can catch up before too long”, since the only system currently taking advantage of strict compliance seems to be (somewhat ironically) Cygwin.
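
The mismatch can be checked from userspace. The sketch below is not Jim's test case; it is a small illustration that compares the inode number carried in each directory entry (what os.scandir exposes through DirEntry.inode()) with the st_ino from an lstat of the same path. The function name inode_mismatches and the scratch directory layout are invented for the example.

```python
import os
import tempfile


def inode_mismatches(directory):
    """Return (name, d_ino, st_ino) for entries whose two inode numbers differ."""
    mismatches = []
    with os.scandir(directory) as entries:
        for entry in entries:
            d_ino = entry.inode()                 # taken from the directory entry
            st_ino = os.lstat(entry.path).st_ino  # taken from lstat() of the path
            if d_ino != st_ino:
                mismatches.append((entry.name, d_ino, st_ino))
    return mismatches


# A scratch directory with ordinary entries and no mount points.
with tempfile.TemporaryDirectory() as scratch:
    for name in ("a", "b"):
        open(os.path.join(scratch, name), "w").close()
    os.mkdir(os.path.join(scratch, "subdir"))
    clean = inode_mismatches(scratch)

print(clean)
```

Pointing the function at a directory that contains mount points (for example "/") is where differences would be expected to show up on an affected kernel.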

In today’s miscellaneous items: a correction to the documentation in Documentation/numastat.txt from Minchan Kim, new sysfs ALS (Ambient Light Sensor) patches from Zhang Rui, version 2 of a patch adding support for KPF_KSM page type recognition to the page-types utility from Fengguang Wu, version 2 of his load-balancing and cpu_power patches from Peter Zijlstra, version 16 of the per-bdi writeback flusher threads patches from Jens Axboe, a patch removing an explicit assumption of the presence of cpu0 in the percpu code from Tejun Heo (especially useful on SPARC systems – this patch was later requested as part of a pull request sent out by Tejun), a patch allowing for max_sectors_kb to exceed the default of 512 from Nikanth Karthikesan, a fix to avoid dangling blocks not used during a write operation on reiserfs from Jan Kara, a simple nilfs2 bugfix pull request from Ryusuke Konishi, a fix to ensure GCC flags don’t get squashed in the Makefiles by Jory A. Pratt, a new version of a fix to vmscan that moves pgdeactivation modification to shrink_active_list from Hugh Dickins, a fix for the anti-fragmentation patches from Mel Gorman that will once again unbreak nommu, and the addition of some XFS compatibility ioctls as well as an XFS pull request containing those from Felix Blyakher. Xiaohui Xin posted a detailed RFC for Virtual Machine Device Queues (VMDq) support on KVM for which there was not room in this episode – look for that in a later edition.

Finally today, Roland Dreier and David Miller discussed the setup of the new linux-rdma@vger.kernel.org mailing list and how it can be advertised, archived, and generally advocated as the new list for RDMA topics.

The latest kernel release was 2.6.31-rc8.

Stephen Rothwell posted a linux-next tree for September 1st. Since Monday, the pxa, xfs, i2c, and dwmw2-iommu trees lost their conflicts and build failures, while the pci, acpi, and block trees gained failures for which Stephen mostly used other versions as necessary. The total subtree count remains steady at 141 trees in the latest compose.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

Categories: episodes Tags:

2009/08/31 Linux Kernel Podcast

September 15th, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090831.mp3

For Monday, August 31st, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: KVM, Poisonous hardware, and XFS.

KVM. Avi Kivity announced that, from now on, he will be sharing KVM maintainership with Marcelo Tosatti. They will commit on alternating weeks, or something along these lines, with the aim of giving Avi more time to develop new features and of improving the overall maintainership role.

Poisonous hardware. Fengguang Wu provided some memory cgroup patches implementing support for HWPOISON (detected known bad physical pages) testing. The idea here is that adding specific tasks into a memory cgroup allows for only a sub-set of running tasks to have errors injected, providing for measurement of the system response to such situations without running the risk of core system processes and daemons being killed during basic tests.

XFS. Michael Tokarev noted that XFS doesn’t provide a compat_ioctl layer for the resizing ioctl calls (those issued via xfs_growfs), meaning that there is no easy way to perform online resizing of XFS volumes when using a 64-bit kernel and a 32-bit userspace environment. Michael wonders if there is any plan to add such support through compat_ioctl wrappers.

In today’s miscellaneous items: some perf tools cleanups (creating a library for certain functions) from Frederic Weisbecker, some x86 header cleanups from Ying Huang, various v4l/dvb fixes for 2.6.31 from Mauro Carvalho Chehab, a patch moving the page-types utility from Documentation/vm to tools/vm from Fengguang Wu (who also added support for recognizing KPF_KSM pages), a trivial KVM symbol offset calculation fix (substituting __pa for __pa_symbol) from Glauber Costa, the addition of the new KPF_HWPOISON page flag for hardware detected memory corruption marking from Fengguang Wu (part of Andi Kleen’s ongoing HWPOISON effort), a fix using native_rdmsr|wrmsr_safe_regs prior to reading or writing to the MSR for an x86 AMD K8 erratum fix from Borislav Petkov (based on an idea from Peter Anvin), version 5 of the ALS (Ambient Light Sensor) support patches from Zhang Rui, a tracing/filters memory allocation fix from Li Zefan, another attempt at cleaning up kcore on mmotm from Kamezawa Hiroyuki, some “fake numa” fixes for powerpc from Ankita Garg, version 15 of the per-bdi writeback flusher threads patches from Jens Axboe, some patches implementing optional delays during ALUA state transition from Nicholas A. Bellinger, ongoing discussion about what to do with KVM guest page table metadata and whether this could provide safe hinting to the host, and an ongoing rant about RAID continued.

The latest kernel release was 2.6.31-rc8.

Paul Mundt discovered a page allocator regression on nommu systems, which he says is caused by a recent patch from Mel Gorman (entitled “move check for disabled anti-fragmentation out of fastpath”). It causes a failure during initramfs unpacking on his development board.

Mario Holbe discovered a regression between 2.6.26 and 2.6.30 in which device-mapper would no longer handle devices with identical UUIDs. This is typically an unlikely situation, but it can happen, especially when using backup images and mounting them onto a running system.

Stephen Rothwell posted a linux-next tree for August 31st. Since Friday the pxa, sound, and sfi trees lost their conflicts while the i2c, drm, and dwmw2-iommu trees gained conflicts or build failures for which temporary fixes were applied. The total subtree count remains steady at 141 trees.

Jiri Slaby noticed that an ongoing suspend race in linux-next seems to be caused by might_sleep() calls in flush_workqueue() and flush_cpu_workqueue(), which he discovered through painstaking code instrumentation. As he points out, due to the number of suspend cycles required, bisection is tricky, but he has at least provided some data points to aid in debugging.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

Categories: episodes Tags: