Archive for March, 2010

2010/03/21 Linux Kernel Podcast

March 21st, 2010 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100321.mp3

For the weekend of March 21st, 2010, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Linux 2.6.34-rc2, 64-bit system calls, core dumping to a pipe, exported symbols, page cache control, and performance counters for KVM guests amongst other things.

Linux 2.6.34-rc2. Although there is no official announcement as of this writing, Linus’ git tree currently contains a 2.6.34-rc2 release that he created on Friday March 19th 2010 at 6:17pm Best Coast Time (PDT). Once the announcement is officially made, there will be more detail.

64-bit system calls. Benjamin Herrenschmidt raised a question in a thread entitled “64-syscall args on 32-bit vs syscall()”, concerning the ability for existing kernels to handle passing 64-bit parameters to system calls when using a 32-bit userspace. A problem arises on platforms such as POWER and its smaller cousin, PowerPC, in which arguments are often passed by register and not on the stack (unless a large number are passed). When passing 64-bit values (as in calling fallocate() within hdparm), GCC may try to use multiple registers (which themselves need to be aligned on even boundaries) to pass a 64-bit value using two sequential 32-bit registers. But the syscall() function within glibc may try to effectively use the same trick again, causing arguments to be off-by-one. Benjamin had a proposal for modifying the existing syscall() interface in a way he thought would be backward compatible (perhaps confined to P{ower,OWER}{PC,} initially) but Ulrich Drepper wasn’t quite so trigger happy to make changes. Peter Anvin favored using explicit versioning to isolate any syscall() interface changes. Separately, Torok Edwin posted some perf (Performance Counters userspace utilities in the “perf” directory) patches enabling callgraph tracing of 32-bit processes when running 64-bit kernels.
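To make the off-by-one effect concrete, here is a rough sketch of the register assignment described above. It is a simplified model rather than the actual PowerPC ABI or the glibc code: the slot numbering, the "pad" marker, the hi/lo ordering and the helper name assign_registers are all assumptions made for illustration. It only shows how a 64-bit value that must start on an even slot ends up shifted once the syscall number is placed in front of the real arguments and then stripped off again.

```python
def assign_registers(args):
    """Place arguments into 32-bit register slots (simplified model).

    args is a list of (name, bits) tuples, with bits being 32 or 64.
    A 64-bit value takes two consecutive slots and must start on an
    even slot index, so a padding slot is inserted when needed.
    """
    slots = []
    for name, bits in args:
        if bits == 64:
            if len(slots) % 2 == 1:
                slots.append("pad")  # skip a slot to reach an even index
            slots.extend([name + ":hi", name + ":lo"])
        else:
            slots.append(name)
    return slots


# fallocate(fd, mode, offset, len) with 64-bit offset and len
FALLOCATE_ARGS = [("fd", 32), ("mode", 32), ("offset", 64), ("len", 64)]

# Layout when the arguments are placed directly, as the kernel expects.
direct = assign_registers(FALLOCATE_ARGS)

# Layout when called through syscall(nr, ...): the syscall number takes
# the first slot, which changes where the 64-bit values are aligned.
via_syscall = assign_registers([("nr", 32)] + FALLOCATE_ARGS)

# A naive syscall() drops the number and shifts everything down by one.
shifted = via_syscall[1:]

print(direct)
print(shifted)
```

In this model the slot where the kernel would look for the high half of offset holds padding instead, so every 64-bit argument that follows is read one slot too early.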

Core dumping to a pipe. Neil Horman posted the 4th version of a patch series entitled “exec: refactor how call_usermodehelper works, and update the sense of the core_pipe recursion check”. In addition to addressing some existing race conditions with the implementation, Neil was interested in reworking the call_usermodehelper() function to handle core dumping to a pipe. In the existing arrangement, it is necessary for all running processes to have non-zero core dump ulimits to ensure the pipe dump will work as planned. But Neil has had enough requests to be more flexible, and has come up with the idea of adding a function callback to the call_usermodehelper (umh) that will be made after the task (at this point, in userspace nomenclature, that is just about referable as a process – they are the same however) has been forked but prior to the exec() call starting the userspace code. That function pointer can, in the case of do_coredump, fiddle with ulimits.
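The same "run a callback after fork but before exec" pattern exists in userspace, and a small analogy may help picture it. The sketch below is not Neil's kernel code; it uses Python's subprocess preexec_fn hook, and the callback name limit_core_dumps and the choice of setting the soft core limit to zero are arbitrary assumptions for illustration.

```python
import resource
import subprocess
import sys


def limit_core_dumps():
    """Runs in the child after fork() and before exec().

    Lowers the soft core dump limit to 0 while keeping the hard limit,
    which an unprivileged process is always allowed to do.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (0, hard))


# The child simply reports the soft core limit it ended up with.
CHILD_CODE = "import resource; print(resource.getrlimit(resource.RLIMIT_CORE)[0])"

result = subprocess.run(
    [sys.executable, "-c", CHILD_CODE],
    preexec_fn=limit_core_dumps,  # hook between fork and exec (POSIX only)
    capture_output=True,
    text=True,
    check=True,
)
child_soft_limit = result.stdout.strip()
print(child_soft_limit)
```

The parent's own limits are untouched; only the freshly forked child sees the change, which is the property that makes such a hook attractive for adjusting ulimits just for a helper.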

Exported symbols. Robert P. J. Day inquired whether the kfifo implementation should really be exporting as many symbols as it does. Tilman Schmidt alluded to the reasoning behind this in mentioning inlined functions. For background, whenever the kernel needs to make use of some function from within modules, that function must explicitly be exported through an EXPORT_SYMBOL or a similar macro definition – simply omitting the C keyword “static” does not have the desired effect. Sometimes, symbols are exported solely because they are used by corresponding inline functions that are included within module files and need to use the corresponding export. For example, an inline function called “foo” might need an export “_foo”. In order to clarify the situation, this author suggested a new EXPORT_SYMBOL_INTERNAL export to clearly label these use cases such that symbols are not used where they are not intended.

Page cache control. Balbir Singh posted a patch exposing a cache= kernel command line parameter that can be used to control page cache operation, and effectively disable it entirely in certain situations. This is of particular benefit to virtualized guests (especially those not wanting to enter into direct reclaim frequently), which otherwise might have their pagecache data effectively stored twice – once in the host, and once in the guest itself. Now, there being no such thing as a free lunch, Avi Kivity pointed out that this would slow down guests booted with cache=off because they would now need to use a virtio call to pull in more pages. However, guest memory utilization was shown to fall considerably as might be expected without a page cache. Both Avi and Balbir seemed to agree that the tunable knob allowed for situation specific decisions to be based upon the specific needs of an environment – more overhead in the VM or a slight loss in performance, according to workload, IO types, filesystems, and a number of other items mentioned by both. Randy Dunlap specifically requested that documentation be added also.

Performance Counters for KVM guests. Yanmin Zhang posted a patch entitled, “Enhance perf to collect KVM guest os statistics from host side” intended to facilitate the collection of performance counter statistics from the host when using Linux guest instances, with the exception of guest userspace. Avi Kivity was excited that this patch did not require the exact same kernel on both the host and the guest (he called that “critical”, noting that, “I can’t remember the last time I ran same kernels”). There did seem to be some agreement between both Avi and Ingo Molnar that having a vmchannel client in the host kernel exporting various data for tracing to guest kernels did make life easier for the implementers of such features but potentially opened up another DoS target and needed to be avoided. Instead, Ingo suggested that the host perf tools connect to the qemu instances managing guest instances and communicate over a well-known UNIX socket. The conversation went off onto a tangent about obtaining guest instance information using libvirt, whether there were other tools in common usage to manage guest instances other than starting them directly using the modified qemu, and the relative benefits of shipping all KVM kernel and userspace code in a single project. This gave Ingo an opportunity to get in another mention of what he considers to be “ugly” separation between glibc and the kernel. The entire thread is certainly worth reading, at dozens of posts and likely growing.

In today’s miscellaneous items:

*). A fix for allmodconfig with Xilinx soft core FPGA systems.

*). A device power management documentation update from Rafael J. Wysocki.

*). Version 7 of Andrea Righi’s per memory cgroup dirty page limit patch. Andrea provided some documentation updates that were discussed also. Separately, and on the note of cgroups, the CFQ_GROUP_IOSCHED configuration option was made visible in a patch from Li Zefan.

*). A bunch of scheduler and cpusets fixes from Oleg Nesterov, who also noted that there were remaining issues – including a potential lockup in do_fork() caused by receiving a signal from an IRQ or an RT thread pre-emption event because the runqueue lock (rq->lock) cannot be taken in the interim. Oleg asked the maintainers very nicely to please review his patches and comment, although there have been no comments posted in the last week on these.

*). Michael Braun reported an issue involving an interaction (or lack thereof) between the kernel crypto subsystem and the SLOB allocator. He finds that there is “general memory corruption” when using SLOB that isn’t present with the other allocators. Herbert Xu (and by extension, Pekka Enberg, since it was he who inquired as to whether these options were enabled) asked Michael to turn on some allocator debugging options and provide the relevant debugging output to facilitate further analysis.

*). A fix from Suresh Siddha ensuring that legacy PIC interrupts are handled on all CPUs and not just the boot CPU when using the “noapic” kernel boot option. This addresses a bug originally raised by Ingo Molnar.

*). A patch from Dmitry Torokhov re-implementing sysrq as an input handler, rather than as a custom hack in the legacy keyboard driver. Henrique de Moraes Holschuh wondered aloud whether this would introduce any problems for SAK (Secure Attention Key), which should be uninterruptible. That piece does not yet seem fully resolved in the thread.

*). A patch converting alpha to use clocksource rather than arch_gettimeoffset from John Stultz.

*). A misaligned percpu allocation when using lock events through perf on a particular SPARC box was reported by Frederic Weisbecker.

In today’s announcements:

Kernel.org. John (warthog9) Hawley announced the general availability of various SSL based services on kernel.org. Quoting John, “[t]his should help provide an additional level of security, in particular for our dynamic content like the wiki’s, patchwork and bugzilla”. John noted that the SSL certificates were generously donated by Thawte, and included a quote from the latter in which they state that they are, “proud of [our] open source lineage”. As of this writing, services officially using SSL (through explicit redirection) include Bugzilla, Wikis, Account Requests, Patchwork, while services that can use SSL if requested using the appropriate address do currently include the main www.kernel.org, boot.kernel.org, git.kernel.org, and android.git.kernel.org. Services not using SSL include mirrors.kernel.org (due to the volume of traffic incurred), and the geo-DNS entries because that would expand the number of SSL certificates required unreasonably.

Loop-AES. Jari Ruusu announced version 3.3a of the loop-AES file/swap utility. Details: http://loop-aes.sourceforge.net/

LTP. Rishikesh K Rajak sent an announcement saying that the previous ltp-cvs commit list would be supplemented by a new ltp-commits list that includes git commits also. The name would suggest that it may be somewhat VCS agnostic. Details: http://lists.sourceforge.net/lists/listinfo/ltp-commits

SCST. Vladislav Bolkhovitin posted to announce that the “new SCST SysFS-based interface has become fully usable, so you can start migrating to it and update your target drivers, dev handlers and management utilities”. For further information, please see: http://scst.sourceforge.net/

TCM. Nicholas A. Bellinger announced the release of version 3.4.0-rc1 of the Target_Core_Mod/ConfigFS infrastructure project, which includes a new Open-FCoE.org based target module (tcm_fc) for TCM/ConfigFS 3.x (mentioned in a separate release announcement). As of the latest release, the TCM/ConfigFS project is now tracking upstream Linux development once again. For further information: http://www.linux-iscsi.org/index.php/Target_Core_Mod/ConfigFS

RT 2.6.33.1-rt11. Thomas Gleixner announced the latest RT kernel patch version 2.6.33.1-rt11 is now available. Since he had been traveling, Thomas had made a few interim releases (rt6 through rt11), the sum of which he summarized. For further detail: http://www.kernel.org/pub/linux/kernel/projects/rt

TuxOnIce 3.1. Nigel Cunningham announced the 3.1 release of TuxOnIce. This is a series of alternative software suspend and resume patches that have been out of the kernel tree for some time, but have their various supporters. The latest patches include LZO compression support, UUID support for detecting suspend images without using a resume= parameter, and other fixes.

The latest kernel release is 2.6.34-rc2.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

2010/03/14 Linux Kernel Podcast

March 19th, 2010 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100314.mp3

For the weekend of March 14th 2010, I’m Jon Masters with a summary of the week’s LKML traffic.

In today’s issue: The 2.6.34 merge window, anonymous inodes, ATA 4KiB sector issues, cpuhogs, ext4, PCI, and USB console support.

The 2.6.34-rc1 merge window. Linus Torvalds announced the release of the first 2.6.34 RC kernel on Monday, March 8th 2010 at 12:33pm Best Coast Time (PST). In closing the merge window early, he hoped to make a point in line with previous comments on the issue of getting merge requests in in a timely fashion. Quoting Linus, “but in general the merge window is over. And as promised, if you left your pull request to the last day of a two-week window, you’re now going to have to wait for the 2.6.35 window.” According to Linus, nearly two thirds of the changes are in drivers (when factoring in 50% drivers/ code, 5% sound/ code, and 10% firmware). Of the remaining bits, about half is architectural and the rest is, well, the rest. So far, about 850 developers are involved. Linus again referred to his Fedora Nouveau rant in ending with a reference to the need to upgrade libdrm/nouveau_drv versions if using that driver.

Several architecture maintainers gave their excuses and requested pulls later, but Linus drew the line at a request from James Bottomley to pull SCSI pieces two days later, on March 10th. James noted that he had been en route back from India, nobody had told him the merge window would close early, and that the only commit added to his tree since the merge window closed on Monday was a bug fix. Linus said he was “not going to pull” and that the whole point behind closing the merge window early was because of people posting pull requests late that “should have been ready when the merge window _opened_”. James objected to the unpredictability of the merge window closing, but Linus said that “WAS THE WHOLE F*CKING POINT!”, in order to avoid last minute pull requests, and added that he would in future not even say how long the merge window was going to be in order to have requests ready the moment the window opened. Unfortunately for James, Linus wanted to make a point and he seemed to meet Linus’ criteria for doing so. Doug Gilbert later pointed out that people should not attack James just because he was the subject of “yet another Linus rant”.

Anonymous inodes. Dmitry Torokhov recently started a thread entitled “S[E]Linux going crazy in 2.6.34-rc0” (but note the corrected capitalization of “SELinux”). He was experiencing a side effect of some recent work by Al Viro, as well as others, to switch various subsystems such as inotify over to use anon inodes rather than their own “filesystem” type. Previously, inotify had used its own filesystem called simply and obviously “inotifyfs”. This allowed for SELinux rules to match on various notification events on an “inotify_t” type of filesystem. But with the trend to convert to anonymous inodes, there becomes no easy way to write SELinux rules to confine applications (if that is what you actually want to do), and the existing rules go insane, as this author recently saw on a rawhide system that happened to be running SELinux. Eric Paris proposed two kinds of workaround – either “revert” everything back to how it used to be, or create support for differing security contexts for anonymous inodes. The latter seems more likely to happen though the thread dried up at that point and nothing further was said on the topic until Eric Paris sent a pull request for some notify bits a week later.

ATA 4 KiB sector issues. Tejun Heo started a new thread entitled “ATA 4 KiB sector issues”, in which he lamented the current state of support for larger sector size ATA devices (those using 4K rather than 512 bytes as their natural unit of size – someone please add a comment to this article with a description for the term used to describe the natural size of a disk, its “word size”). Apparently, the transition will be “quite painful”. In his lengthy email, the gist of which is covered by an article on the kernel.org wiki at: http://ata.wiki.kernel.org/index/php/ATA_4_KiB_sector_issues, Tejun covers the issue of backwards compatibility, DOS partition table support, and that beast of beasts – Windows. Interestingly, I didn’t see a specific mention of the issue of unaligned writes when using journalled filesystems and ensuring commits have hit the disk, but I’m sure that’s covered somewhere in there. I suspect this is now required reading if you work on disk and block bits. James Bottomley added some useful notes about the lack of bootloader support, etc.
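As a rough illustration of why alignment matters on these drives, the sketch below checks whether a partition start is aligned to a 4 KiB physical sector and counts how many physical sectors a single 4 KiB write touches. The assumptions are mine rather than Tejun's: the drive is taken to expose 512-byte logical sectors on top of 4096-byte physical ones, the helper names are hypothetical, and sector 63 (the traditional DOS partition start) and sector 2048 are simply example values.

```python
LOGICAL_SECTOR = 512    # bytes per logical sector (assumed emulation mode)
PHYSICAL_SECTOR = 4096  # bytes per physical sector


def is_aligned(start_lba):
    """True if a logical block address falls on a physical sector boundary."""
    return (start_lba * LOGICAL_SECTOR) % PHYSICAL_SECTOR == 0


def physical_sectors_touched(start_lba, nbytes):
    """Count the physical sectors covered by a write of nbytes at start_lba."""
    first_byte = start_lba * LOGICAL_SECTOR
    last_byte = first_byte + nbytes - 1
    return last_byte // PHYSICAL_SECTOR - first_byte // PHYSICAL_SECTOR + 1


for lba in (63, 2048):
    print(lba, is_aligned(lba), physical_sectors_touched(lba, 4096))
```

A 4 KiB write that straddles two physical sectors forces the drive to read, modify and rewrite both, which is one reason misaligned partitions hurt on these devices.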

CPU Hogs. Tejun Heo posted a patchset intended to generalize the case of monopolizing a CPU (or a set of CPUs) with a single kernel thread. The cpuhog functionality can be used by any kernel code that needs to grab one or more CPUs exclusively for some period of time, such as [k]stop_machine, which does just this during module load in order to ensure that it is safe to fiddle with the kernel symbol table. For good measure, Tejun also fixes the kernel migration threads to use cpuhog while he’s at it. LWN had a writeup on this topic later, and your author has a pet project in mind that should benefit already from using this patchset. Thanks Tejun Heo!

ext4. Christian Borntraeger posted asking about e4defrag support for compatible ioctls (as in the case on his system, with a 64-bit x86_64 kernel and 32-bit IA32 userspace environment). He suggested, “[l]et[']s just wire up EXT4_IOC_MOVE_EXT for the compat case.” This led Jeff Garzik to wonder aloud what the overall status was of ext4 defragmentation support. Jeff noted that he had actually poked at defragmentation support himself in the past and was “hopeful that I will see defragging in a Linux distribution sometime in my lifetime”. Eric Sandeen noted that such support had previously been in Fedora (briefly) but was removed because he (Eric) wasn’t so happy with the code. Since I happen to know Jeff has a good many years ahead of him, one hopes that he will get to see many great things, including ext4 defragmentation. Separately, Michael Tokarev pointed out another 32-bit userspace on 64-bit kernel issue with compatible ioctls, this time affecting AIO. Jeff Moyer was on the case with an initial test patch that he could use successfully with the libaio test harness built with -m32 while he continues to work in general on further AIO cleanups for the longer term.

PCI. Alex Chiang posted an updated patch based upon some awesome work that Matthew Wilcox had done to provide sysfs PCI slot to device mapping directory entries that can be used to determine which physical slot a device is actually installed in within the chassis of a given system. This will be of use to a number of projects, including efforts to name network interfaces according to the slot they reside in (rather than their MAC address) for distributions needing to support single system images – at least, that’s one possibility that comes to mind. I have pinged a few people myself to see if this will be of use to that effort in general, and there are bound to be many more.

USB Console. Jason Wessel posted a 6 part patch series entitled “usb console imprevements series”, containing “aggregated and ported…usb patches I have previously posted which are not mainlined into a single series aimed at providing a stable [USB] console”. Jason began with a recap about what the problem with USB consoles currently is – that they are not synchronous (as opposed to regular serial UART consoles which are) and so will drop data on the floor if there is no room to buffer it when interrupts are disabled. The new code introduces intentional delay loops calculated through empirical testing using an FTDI USB part (a common part on many embedded boards, such as the BeagleBoard JTAG debugger sitting on this author’s desk).

In today’s miscellaneous items:

* some early dev_name() patches from Paul Mundt allowing early platform device code to use dev_name() before the guts of the driver core are online.

* This author was bitten by a recent bad commit from Al Viro that caused opendir() to succeed on regular files. I posted a question about it and was told that it had already been fixed. Indeed, it had.

* Ongoing debate happened about reducing the number of memory allocators in use on x86 systems, per a previous note from Ingo that there were 5 possibilities depending upon phase of boot and this needed to be reconciled.

* A rant from Finn Thain about a “coding style” fix patch for Macintosh that reduced a comment length to fit in 80 characters. Finn thought this was an utter waste of time, and repeated a comment often heard elsewhere, “checkpatch.pl is great but code that fails it is NOT always wrong.” and, ‘“Check patch” is a good idea but “check existing code” is a waste of everyone’s time’. Sometimes, cleanup patches do more harm than good, for example a well intentioned “if” cleanup this week completely misunderstood how the indentation is supposed to work and was also summarily rejected. Ben Herrenschmidt’s only response to this mini-rant was “Amen !”.

* Mitake Hitoshi concurred with Guangrong Xiao’s posted results showing an *improvement* in performance of userspace mutexes when lock trace events were enabled. Reproducer code was posted and confirmed.

* Some useful documentation was provided on Linux’s circular buffering and memory barriers support from David Howells.

* Support for specifying in the environment variables of a kernel emitted uevent whether it came because of a request_firmware() or a request_firmware_nowait() request was postulated by Johannes Berg (to handle the case of built-in drivers requesting firmware not in an initramfs). Kay Sievers pointed out that many events are re-triggered during boot and so the firmware loader cannot know what state the system is in, and therefore it might be better to leave requests for unsatisfiable firmware around “forever” until they are cancelled from userspace rather than trying to cunningly work around the issue of firmware not being present in an initrd context with special uevent environment variables.

* and the jabs at SELinux security labeling continued with Al Viro coming up with a few amusing retorts in the “Upstream first policy” thread and Ingo Molnar comparing SELinux relabeling wait times to fire doors, “we should prefer a one inch thick fire door that opens and closes fully automated to a five inches thick fire door that people keep always-open with a chair”. Ingo contends that all too often, people “turn off the whole thing” because of various frustrations and so there is less overall security than might be the case with a slightly less perfect system. Dave Airlie called SELinux relabels “the new fsck” and called for journalling.

In today’s announcements:

Benchmarks. Anca Emanuel announced some new Phoronix benchmarks for kernels 2.6.24 through 2.6.33, showing that performance has generally improved by 770% from 2.6.29 to 2.6.30 and only regressed very slightly in 2.6.32. Regretfully, however, 2.6.33 does not perform nearly so well, and, according to the Phoronix quote, “PostgreSQL performance atop the EXT3 file-system has falled off a cliff”. Full details are available on the http://www.phoronix.com/ website.

RT 2.6.33-rt6. Thomas Gleixner announced the release of version 2.6.33-rt6 of the RT patchset that he and others are continuing to develop against the 2.6.33 series kernel. As he mentions, there was an -rt5, but it was more of a separation point in the git tree. With the merging of some bits into that older tag, MIPS support rejoins the RT tree thanks to Wu Zhangjin. As usual, the RT patch is available on the kernel.org website, in the section devoted to such projects, or in the head (rt/head) and stable (rt/2.6.33) branches of the “tip” tree maintained by Ingo Molnar. Details: http://www.kernel.org/pub/linux/kernel/projects/rt/

The latest kernel release is 2.6.34-rc1.

Andrew Morton posted an mm-of-the-moment (mmotm) for 2010-03-09-19-15. Hiroyuki Kamezawa posted an updated version of his OOM notifier memory cgroup patches against this latest tree. Andrew later posted an mmotm for 2010-03-11-13-13. And in other “mm” news, Mel Gorman posted the 4th version of his “memory compaction” patches.

Greg Kroah-Hartman posted some review patches for stable kernels 2.6.33.1, and for 2.6.32.10. These were subsequently released.

Finally today, Robert P. J. Day asked whether it was still worth him running his “cleanup” scripts (that look for problems with kernel config options) after each merge window closes. Randy Dunlap thought “yes”, and was even more happy that Robert had posted his scripts for him and others to use. Details: http://www.crashcourse.ca/wiki/index.php/Kernel_cleanup_scripts Robert followed up later with another email saying that most of his popular cleanup scripts have now been posted, which is great.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

2010/03/07 Linux Kernel Podcast

March 18th, 2010 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100307.mp3

For the weekend of March 7th, 2010, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Console, DRM, ext4, integrating tools, sensors, split function and data sections, union mounts, and versioning.

Console. Eric W. Biederman posted an intuitive patch for /dev/console opening, effectively ensuring that it is always available even if the root filesystem has no /dev. “This effectively guarantees that there will be a device node, and it won’t be on a filesystem that we will ever unmount”. Al Viro replied “hell yeah”, and took the patch “with thanks”.

DRM. This week’s thread length of the week prize goes to a thread entitled, “drm request 3” in which Dave Airlie tried to pull some patches into the 2.6.34 merge window. These contained, “[f]ixes for default y + CONFIG_STAGING + CONFIG_DRM_NOUVEAU enabled”. Linus wasn’t very happy when he booted with these patches (nouveau interface version 0.0.16) and saw an error message saying “[drm] wrong version, expecting 0.0.15”. This led to a rant about backwards compatibility, and that he hadn’t even been warned it would break existing user space (in his case, Fedora 12). Linus even found that the commit that introduced the breakage did so explicitly, but again noted, ‘why the hell wasn’t I made aware of it before-hand? Quite frankly, I probably wouldn’t have pulled it. We can’t just go around break people[s] setups. This driver is, like it or not, used by Fedora-12 (and probably other distros). It may say “staging”, but that doesn’t change the fact that it’s in production use by huge distributions. Flag days aren’t acceptable’. This led on to a thread in which Linus and others (including Jeff Garzik) noted that Fedora 12 was shipping this driver in “production” and so more should be done to ensure that the kernel could be tested on older systems, while others said the driver was all along a “use at your own risk” driver (Jesse Barnes). Personally, this author solved the problem by using another graphics chipset a long time ago. Daniel Stone probably had the best solution, “fuck it, it’s Friday. To the pub”.

The DRM thread also deviated into a discussion of “Upstream first” as a distro policy, and then onto specific patches in other distributions that aren’t in upstream. For example, Ubuntu carrying AppArmor. That led on to yet another tangent in which James Morris felt he was being personally attacked for the lack of the patches being upstream. Ingo Molnar (and later, Linus, who seemed to share a similar viewpoint – that there needn’t be only one security answer) decided to weigh in, noting that it had been “a few reasonable months after the last big security flamewar”, and wanting to see a “rehash or fair summary of the pathname versus labels arguments” (referring to the fact that SELinux uses file labeling and complex rules, while AppArmor uses simple file paths). Ingo feels that pathnames are a “far more fitting abstraction to any ‘human based security process’ on Linux than ‘labels’”. Ingo called out that there was a lot of security research based on labels but essentially said none of that mattered due to the difficulty of practically using label based security. Quoting Ingo again, “[i]n other words: [I] see [SEL]inux’s main failure in that it somewhat blindly aims for a security model that is sees as the technical most secure, while not being intellectually open to the fact that we very likely _cannot know in advance_ which of the models will make Linux more secure in the long run”. It would seem Ingo would like AppArmor to be less of a “hostile competitor” and more of a “natural ally” to SELinux. The idea is that there can be two different security mechanisms for different use cases.

Ext4 performance concerns. Justin Piszcz had recently raised the issue of the relative performance of ext4 for “large” writes vs. XFS. Justin was seeing almost half the write throughput when using ext4 as opposed to XFS and was concerned. After asking various questions, to which the replies included that he should use “nice” numbers of disks (e.g. 9 for the specific RAID case he was looking at) that made no difference, the thread seemed to dry up without any concrete conclusions other than that a performance issue exists and requires some further investigation using blktrace, etc.

Integrating tools. Ingo Molnar, in a thread entitled “Re: KVM usability”, made some remarks about the relative virtues of having “unified repositor[ies]” in which both the kernel and userspace tools are combined in one place, such as with the Performance Counters tools. Ingo believes that one reason why Apple can “consistently out-develop Linux” is “in part due to there not being a strict [C]hinese [W]all between the Apple kernel, libraries and applications – it’s one coherent project where everyone is well-connected to each piece”. This may be true, but it’s just as likely in this author’s opinion that Apple is benefitting from that, coupled with the fact that it owns every piece and can hand down edicts from on high about what every piece will do, and when. In any case, the thread is worth reading – it was surprisingly short given the potentially contentious comments that could have made great flamebait.

Sensors. Dima Zavin (Google) replied to Jean Delvare’s attempt to have the ALS (Ambient Light Sensors) subsystem pulled, saying that the kernel was on the road toward having one subsystem under drivers/ for ALS, one for Proximity sensors, one for Accelerometers, etc. all with similar interfaces, and that a better approach would be a single “sensors” subsystem. He offered to help work on just that. Jean was interested, but didn’t want to hold up having the ALS patches pulled, favoring reworking them later on. He was subsequently dismayed when Linus and others started asking why ALS wasn’t just using the input subsystem for events, saying that he didn’t care where the code went but that discussions had been ongoing for 5 months already and he didn’t want to hold things up for another 5 months when people decided to bring this up during the merge window rather than before. The conversation then took a tangent into different rate devices (some of these “sensors” can operate at many KHz, above what the “input” subsystem is intended for). Linus contended that these devices, just like joysticks, were input devices. The conversation appears to have stalled at this point without a resolution.

Split function and data sections. As some of you will know, various attempts have been made over the past year to add support for compiling the kernel with the GCC options “-ffunction-sections”, and “-fdata-sections”. These cause the compiler to generate one ELF section for each function or data object, and make life very easy for optimization tools (that can remove whole sections) as well as kernel patching utilities such as Ksplice. Tim (Ksplice) Abbott was happy with the latest round of patches, though he did have some questions about the “rename kernel’s magic sections with compatbility with -ffunction-sections -fdata-sections” patch series, especially about where certain renames were being used. For example, he wondered aloud how renaming “.text.reset” to “.text..reset” would affect AVR32 systems, because he couldn’t see how the original “.text.reset” was being populated anyway (answer: it wasn’t). As Tim mentioned, he wanted input from Haavard Skinnemoen, who provided the comment on “.text.reset” amongst other feedback.

Union mounts. Valerie Aurora posted version 1 of an RFC patch series (against Al Viro’s for-next tree) entitled, “Union mount core rewrite”. This, as it implies, is a complete rewrite of parts of the code implementing union mounts. Val has previously written about the goals and implementation of her work in various LWN articles. Separately, Val wondered aloud whether it was now possible to have multiple read-only layers in union mounts.

Versioning. Paul McKenney posted a patch placing the SHA1 git hash of the latest commit in the kernel version line on boot if available, or “[Not git tree]” in the case that a non-git tree was used to build.
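
As a rough userspace illustration of the idea (this is not Paul's patch, which lives in the kernel build), the following sketch asks git for the hash of the latest commit and falls back to the “[Not git tree]” marker when that fails. The function name, the `tree_dir` parameter, and the injectable `run` argument are assumptions made for the example.

```python
import subprocess

NOT_GIT = "[Not git tree]"


def git_version_tag(tree_dir=".", run=subprocess.run):
    """Return the hash of HEAD for tree_dir, or NOT_GIT if it cannot be read."""
    try:
        result = run(
            ["git", "rev-parse", "HEAD"],
            cwd=tree_dir,
            capture_output=True,
            text=True,
            check=True,  # a non-zero exit (e.g. not a git tree) raises
        )
    except (OSError, subprocess.CalledProcessError):
        # git is missing, or tree_dir is not inside a git work tree.
        return NOT_GIT
    sha = result.stdout.strip()
    # Assumption for this sketch: only a 40-character SHA1 is accepted.
    return sha if len(sha) == 40 else NOT_GIT
```

Calling `git_version_tag("/path/to/tree")` would give a string that a build script could append to its version banner.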

In today’s miscellaneous items:
Large numbers of git pull requests started to come in for 2.6.34 (including everything from core kernel to networking and sound), there were some further nested SVM patches from Joerg Roedel, a large number of KVM updates (including a lot of PowerPC bits, Microsoft Hyper-V patches, and some x86 emulator cleanup), a new “platform-drivers-x86” git tree reference was added to the MAINTAINERS file (as maintained by Matthew Garrett, who posted a pull request for the latest bits also), a new generic x86 “NMI Watchdog” built upon performance events from Don Zickus (by way of Ingo Molnar actually making the pull request for Don’s previously posted patches), version 3 of the memory controller groups dirty page limits patches from Andrea Righi, an affirmation from Andrew Morton that the “Linux Checkpoint-Restart” patches could be posted to LKML following 2.6.34-rc1 (Oren Laadan also mentioned how the patches will refuse to do a checkpoint if they believe they cannot do so safely, reporting this back to userspace), the latest “compat-wireless” tree for stable kernel (2.6.32) users that contains the latest 2.6.33 bits from Luis R. Rodriguez, version 3 of a patch series providing for 512KB readahead rather than 128KB from Fengguang Wu, various trivial and staging patches from Greg Kroah-Hartman (as an aside, Alan Stern raised some concerns about the way Greg’s scripts generate those patches), a request to pull the Ceph distributed file system client into 2.6.34 (along with various input about changes made since the 2.6.33 merge request) from Sage Weil, some Performance (perf) Counters “live mode” patches from Tom Zanussi that allow perf data to be directly processed as it is captured “without ever touching the disk”, some paravirt (PV) extension patches for HVM (Hybrid virtualization support) in Xen from Sheng Yang, and Ted Ts’o complained about dynamic device filesystems with initramfses in a mini-rant about how 2.6.33 could not boot with an LVM root on his Ubuntu 9.10 userspace. He added that, “of course, the initramfs environment is so crappy that there are no debugging aids — not even a working pager”.

In today’s announcements:

Git 1.7.0.2. Junio C Hamano announced the latest maintenance releases of Git, versions 1.7.0.{1,2}. The .2 posting had a few minor patches since .1, including fixing support for GIT_PAGER. Whether or not it is technically an SCM, I will cease using that term in this podcast, following some feedback from listeners.

LTP. The Linux Test Project was released for February 2010. The latest release comes with a reminder that there “has been multiple chnges for building/installing the test suite after the recent changes in Makefile infrastructure”. This month’s release didn’t come with any corrupt script warnings.

Userspace RCU 0.4.2. Mathieu Desnoyers announced version 0.4.2 of his Userspace RCU “urcu” library. It includes some patches from Paolo Bonzini adding generic uatomic ops support for architectures not explicitly supported by liburcu, including (effectively free) support for IA64 and Alpha when using GCC versions 4.0-4.5, and a bugfix in urcu-bp which is the “User-Space Tracing” version of the urcu library. Mathieu has asked me to point out that a patent exemption was made to cover use of RCU in LGPL code such as urcu, so my previous comments about GPL patent concerns were a little too severe.

The latest kernel release was 2.6.33.

Andrew Morton posted an mm-of-the-moment (mmotm) for 2010-03-04-18-05.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.


2010/02/28 Linux Kernel Podcast

March 18th, 2010 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100228.mp3

For the weekend of February 28th 2010, I’m Jon Masters with a summary of the week’s LKML traffic.

In today’s issue: Linux 2.6.33, ACPI, Cgroups, Checkpoint and Restart, OF Device Tree, Firmware, and x86 embedded.

Linux 2.6.33. Linus Torvalds announced the final release of 2.6.33 on Wednesday February 24th at 12:06pm Best Coast Time (PST). The final release includes a relatively small number of final fixes on top of rc8. As Linus says, the most notable thing may be the Nouveau integration and modesetting support. Others may notice the mainlining of DRBD and the fact that the AS IO scheduler is now gone (“since keeping it around and just causing confusion seemed to not be worth it any more. You’re supposed to use CFQ instead”). Daniel Walker asked Linus whether he still planned to try a one week merge window this time, to which Linus said, “No. But I might do a ten-to-twelve day thing or something like that – just to make sure that anybody who tries to game the system and send their merge request late will get summarily ignored. So I’m going to stop being so predictable that people can tell that exactly two weeks after the last release is where the merge window closes, and if people want to make sure their stuff merged, I had better have a merge request in my inbox earlier than thirteen days after the release.” The pull requests started pretty much immediately, and with the usual vigor. Separately, Con Kolivas announced 2.6.33-ck1, which includes his BFS scheduler and various other “desktop” focused bits.

ACPI. Rafael J. Wysocki posted an RFC patch concerned with removing race conditions from ACPI event handlers. The first race concerns the execution of handlers while they are being removed, the second is a locking issue.

Cgroups. Andrea Righi posted an intriguing RFC patch series intended to provide per-cgroup dirty page limits. The idea is that the maximum number of dirty pages a cgroup is allowed to have can be limited, and if a cgroup exceeds this limit, it will be forced to perform write-out immediately.
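
To make the accounting idea concrete, here is a toy model in Python. It is only a sketch of the behaviour described above, not Andrea's implementation: the class name, its fields, and the choice to flush every dirty page once the limit is exceeded are all assumptions made for illustration.

```python
class CgroupDirtyAccounting:
    """Toy model: track dirty pages for one cgroup against a fixed limit."""

    def __init__(self, dirty_limit_pages):
        self.dirty_limit_pages = dirty_limit_pages
        self.dirty_pages = 0    # pages currently dirty
        self.written_back = 0   # total pages flushed so far

    def dirty(self, pages):
        """Mark pages dirty; force write-out as soon as the limit is exceeded."""
        self.dirty_pages += pages
        if self.dirty_pages > self.dirty_limit_pages:
            self.write_out()

    def write_out(self):
        """Pretend to flush every dirty page (an assumption of this model)."""
        self.written_back += self.dirty_pages
        self.dirty_pages = 0
```

With a limit of 100 pages, dirtying 60 pages leaves them cached, while dirtying another 50 pushes the group over the limit and triggers the flush.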

Checkpoint and restart. Oren Laadan posted version 19 of his “Linux Checkpoint-Restart” patchset. As a reminder, these patches are intended to allow systems to handle failures by taking whole system checkpoints and restarting all activity from that point in the event of failure. The latest patchset is intended to address previous concerns from Andrew Morton and others, and is apparently able to checkpoint and restart both screen and vnc sessions, and support live migration of network servers between hosts. The project has a checklist of TODOs on its wiki: http://ckpt.wiki.kernel.org/.

OF Device Tree. Grant Likely asked Linus to pull in his OF device tree rework for 2.6.34. Grant has recently been working on ARM support, in addition to the PowerPC, Microblaze, and SPARC changes covered in this pull. Hopefully, OF device tree emulation will finally provide one mechanism for supplying data to the kernel that can be common across many different architectures, in addition to those that do “real” OpenFirmware in the vendor firmware.

Firmware. There was some discussion about kernel firmware versioning, and whether kernel firmware should be wrapped in a container format making it more suited to SO library style versioning. This happened in response to the folks behind the open sourcing of the Atheros WiFi firmware seeking advice on the best way to handle compatible and incompatible versions. David Woodhouse has advocated for the use of more library-like versioning, but was not a big fan of introducing the complexity of such wrappers. In the end it was decided that the kernel developer maintained linux-firmware package should provide firmware files of the form foo-$(API). Those wanting a sub-versioned file like foo-$(API)-$(VAR) could provide one if they so wish.
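
The naming scheme can be sketched in a few lines of Python. The foo-$(API) and foo-$(API)-$(VAR) forms come from the discussion; the function names and the "most specific file first" lookup order are assumptions of this sketch, not something the thread settled.

```python
def firmware_candidates(name, api, var=None):
    """Return firmware file names to try, most specific first."""
    names = []
    if var is not None:
        names.append(f"{name}-{api}-{var}")  # optional sub-versioned file
    names.append(f"{name}-{api}")            # the form linux-firmware provides
    return names


def pick_firmware(available, name, api, var=None):
    """Return the first candidate present in `available`, or None."""
    for candidate in firmware_candidates(name, api, var):
        if candidate in available:
            return candidate
    return None
```

For example, `pick_firmware({"foo-2"}, "foo", 2, var=7)` falls back to "foo-2" when no "foo-2-7" file is shipped.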

x86 embedded. Graeme Russ posted a very detailed and well reasoned description of his embedded x86 port, which is not in any way based upon PC hardware, in which he uses U-Boot to transition to 32-bit Protected Mode and directly calls the kernel’s “32-bit BOOT PROTOCOL” described in Documentation/x86/boot.txt. He was having some issues though handling kernel relocation that turned out to be due to documentation differences between the bzImage format and the current reality. Peter Anvin was his usual very helpful self.

In today’s miscellaneous items: A fix for SPARC32 from Rob Landley (apparently, SPARC32 has been broken since 2.6.28, which isn’t surprising since this author and most other Linux SPARC users seem to be running SPARC64 kernels), various debugging from Thomas Gleixner and John Kacur on the recent 2.6.33 RT patch, version 6 of a patch series intended to add lockdep-based diagnostics to rcu_dereference() from Paul McKenney, a series of PPS implementation patches from Rodolfo Giometti (useful for those needing accurate time sources on a serial line), a patch to increase readahead size to a default of 512K from Fengguang Wu (the previous default was 128K), a bunch of s390 updates for 2.6.33 final from Martin Schwidefsky (including kernel image compression “finally…after only 10 years”), some patches intended to document the rfkill sysfs ABI from Florian Mickler, some more nested SVM (virtualization within virtualization on AMD compatible systems) patches from Joerg Roedel intended to aid running Microsoft Hyper-V with nested SVM (which doesn’t quite work yet even with these according to Joerg), a number of rather cool gdb and early debug updates from Jason Wessel (who has now split kdb and early debug out into two separate trees), version 4 of the “concurrency managed workqueue” from Tejun Heo, a discussion about order 1 allocation failures started by Frans Pop (the failures were under GFP_ATOMIC, but Frans felt that they were particularly ugly given plenty of cache was available for reclaim), David Howells proposed removing EXPERIMENTAL from NFS_FSCACHE in order that it could be compiled into the standard Ubuntu kernel (since, as he says, “As Arjan van de Ven pointed out…the EXPERIMENTAL flag doesn’t mean that much any more”), and a lengthy discussion of linux-next “requirements” that is worth reading, if you have the time.

In today’s announcements:

iproute2. Stephen Hemminger announced release 2.6.33 of the iproute2 utilities that “includes bug fixes and support for all the new features in kernel 2.6.33. This integrates a number of minor bug fixes from Debian aswell”. The update is available at http://devresources.linux-foundation.org/.

RT 2.6.33-rt4. Thomas Gleixner announced version 2.6.33-rt{2,3,4} of the RT kernel patchset. This updates to Linus’ latest tree and includes a number of fixes to bugs reported by John Kacur and others. It is available from the usual location: http://www.kernel.org/pub/linux/kernel/projects/rt/ Thomas noted that “rt/2.6.33 branch is now stabilization only. The rt/head branch will follow linus tree from now on, so it will inherit all (mis)features which come in the merge window.” Separately, John Stultz announced that he had forward ported Nick Piggin’s VFS scalability patches to 2.6.33-rc8-rt2, and that it applies to 2.6.33 without any collisions. He requested feedback as he had yet to do any serious stress testing with the patchset.

The latest kernel release was 2.6.33.

Greg Kroah-Hartman released an updated stable Linux 2.6.32.9.

Finally today, Mikael Abrahamsson suggested that some TLC be given to the Wikipedia article on the Linux kernel as it “doesn’t even mention the new -rc system” (in the “development model” section of the article). He wondered if anyone who knew exactly what was going on could write up the new world order on that wiki page for the rest of the world to see. That does not seem to have happened as of this writing.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.


2010/02/21 Linux Kernel Podcast

March 15th, 2010 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20100221.mp3

For the weekend of February 21st, 2010, I’m Jon Masters with a summary of the week’s LKML traffic.

In today’s issue: AMD TSC, anon_inode flags, extents, LSI MegaRAID, md RAID, SSE, UML, and XZ.

AMD TSC. Mark Langsdorf (AMD) posted a patch entitled “Option to synchronize P-states for AMD family 0xf”, in which he reminded readers that AMD Family 0xf processors (that is AMD Athlon 64s and AMD Opterons) do not have P-State and C-State invariant TSCs – that is to say the TSC increments at the current frequency of the CPU core, and not at some fixed frequency that would be more useful to those using it as a timing source. It is nonetheless possible to scale the TSC readings to be used as a time source, if all CPUs in the system adjust their frequency at the same time and by the same amount. To do this, Mark modifies the PowerNow! driver with a new “tscsync” parameter. He reminds us that there are many other possible clock sources in a system, but customers want something particularly lightweight in some situations, like the TSC.
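
The scaling arithmetic behind this can be shown with a small Python sketch. It is not taken from Mark's patch; the function name and the idea of recording one (cycles, frequency) pair per P-state interval are assumptions made for the example. The point is simply that a cycle count only converts to time if the frequency in force while those cycles elapsed is known.

```python
NS_PER_SEC = 1_000_000_000


def tsc_to_ns(segments):
    """Convert TSC deltas to nanoseconds.

    segments: list of (tsc_delta_cycles, freq_hz) pairs, one per interval
    during which the core frequency stayed constant.
    """
    total_ns = 0
    for cycles, freq_hz in segments:
        # cycles / freq_hz gives seconds; keep integer maths throughout.
        total_ns += cycles * NS_PER_SEC // freq_hz
    return total_ns
```

For instance, 2,000,000 cycles at 2 GHz followed by 1,000,000 cycles at 1 GHz come to 1 ms plus 1 ms, even though the raw cycle counts differ by a factor of two.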

anon_inode flags. Matt Helsley noted that existing anon_inode interfaces often do not support flags that can be set by using fcntl(). He proposed a series of 4 patches to signalfd, timerfd, epoll, and eventfd that would allow the same flag behavior as their corresponding creation syscalls. Davide Libenzi, the original author of the anon_inode bits, signed off.
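
For readers unfamiliar with the userspace side, this is roughly what setting a flag with fcntl() looks like, sketched in Python. The helper name is an assumption, and the example is demonstrated on a plain pipe; whether a given anon_inode descriptor (signalfd, timerfd, epoll, eventfd) honors the flag in the same way is what the patches were about.

```python
import fcntl
import os


def set_nonblocking(fd):
    """Add O_NONBLOCK to an open descriptor's status flags using fcntl()."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)                  # read current flags
    fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)   # set the new bit
    return fcntl.fcntl(fd, fcntl.F_GETFL)                   # flags now in effect


if __name__ == "__main__":
    read_end, write_end = os.pipe()
    print(bool(set_nonblocking(read_end) & os.O_NONBLOCK))
    os.close(read_end)
    os.close(write_end)
```

The alternative is to pass the flag when the descriptor is created, which is the behavior the fcntl() path was being brought in line with.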

Extents. Jari Sundell reported an issue with sparse files on ext4 in which many extents nonetheless sequentially placed on disk were not merged by the filesystem. This manifested in the form of 3000 or more extents for a 250MB bittorrent download file (aside: bittorrent pulls many file pieces at once from many different sources and so relies heavily on sparse files).
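
The merging Jari expected can be sketched as follows. This is a generic illustration in Python, not ext4 code: extents are modeled as (logical_start, physical_start, length) tuples in blocks, and real filesystem constraints such as a maximum extent length are ignored.

```python
def merge_extents(extents):
    """Merge extents that are contiguous both in the file and on disk.

    extents: iterable of (logical_start, physical_start, length) in blocks.
    """
    merged = []
    for logical, physical, length in sorted(extents):
        if merged:
            last_logical, last_physical, last_length = merged[-1]
            # Mergeable only if this extent starts exactly where the previous
            # one ends, in the file's block numbering and on disk alike.
            if (last_logical + last_length == logical
                    and last_physical + last_length == physical):
                merged[-1] = (last_logical, last_physical, last_length + length)
                continue
        merged.append((logical, physical, length))
    return merged
```

Two 4-block extents at disk blocks 100 and 104 collapse into one 8-block extent, while an extent placed elsewhere on disk stays separate.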

MegaRAID. LSI posted to let everyone know that they were interested in an overhaul of the MegaRAID driver to support future HBAs. Rather than make a lot of changes to the existing code, they were interested in, and were encouraged to, create a new driver for the newer parts. Matthew Wilcox may have detected a hint of reasoning behind why they had been a little resistant to giving up a single heavily hacked driver, and suggested an approach that could be used to “make your management happy” by effectively combining two drivers into a single object file, with two separate sets of PCI tables being handled and different functions within. Whatever the eventual decision, the thread ended there with no followup.

md. Justin Piszcz started a discussion thread entitled “Linux mdadm superblock question”, in which he asked about RAID superblock types. The older version 0.90 superblock format supports autoassemble within the kernel, whereby the kernel can automatically create the appropriate RAID device without having to use tools within an initrd/initramfs (so an initramfs is not required in that case, whereas it is required with the other formats if you want to use RAID). Justin wanted to know whether there were any benefits for a < 2TB RAID1 boot volume in moving to a higher versioned superblock without autoassemble support.

The conversation led Peter Anvin to point out some issues with a recent change in mdadm, which now apparently creates 1.1 version superblocks by default. Peter noted that the 0.9 superblock format doesn’t make it possible to easily distinguish RAID partitions from whole volume RAID devices, but the problem migrating to 1.1 is that 1.1 uses the bootblock for its superblock and so can cause problems with bootloaders such as grub that result in people having to regenerate their entire disk if they want to easily boot with it. Version 1.2 of the md RAID superblock uses the same 1.1 superblock format but at a different location than the bootblock, and so Peter favors a default of using 1.0 or 1.2, but not 1.1 as the mdadm default.
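
The location differences can be summarized in a short Python sketch. The offsets below reflect my understanding of the md superblock layouts (0.90 and 1.0 near the end of the device, 1.1 at the very start, 1.2 at 4K from the start) and should be treated as assumptions to check against the md documentation; the function names are made up for the example.

```python
def superblock_offset_sectors(version, device_sectors):
    """Return the assumed superblock position in 512-byte sectors."""
    if version == "0.90":
        reserved = 128  # 64 KiB expressed in sectors
        # Last 64 KiB-aligned 64 KiB block of the device.
        return (device_sectors & ~(reserved - 1)) - reserved
    if version == "1.0":
        # At least 8 KiB from the end, rounded down to a 4 KiB boundary.
        return (device_sectors - 16) & ~7
    if version == "1.1":
        return 0  # very start of the device, where a boot block would live
    if version == "1.2":
        return 8  # 4 KiB from the start
    raise ValueError(f"unknown superblock version: {version}")


def overlaps_boot_sector(version, device_sectors):
    """True if the superblock sits on sector 0 of the device."""
    return superblock_offset_sectors(version, device_sectors) == 0
```

Under these assumptions only version 1.1 lands on sector 0, which matches Peter's preference for 1.0 or 1.2 as the default.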

The entire md RAID thread is worth reading because it took a tangent off into a lengthy debate about the merits of using (or being required to use) initramfses, time taken to boot using an initramfs (or if not using one – the plan is to remove autoassembly from the kernel for good, so good luck booting without an initramfs if you want RAID in the longer term), and tools such as AEUIO that can build a customized initramfs image. Of course, every distro and his dog have also re-invented initramfs creation.

SSE. There’s a long-standing philosophy of avoiding floating point (FP) or other general usage of optional compute units such as SSE, SSE2, and so forth from within the kernel itself. Using these units requires saving state, and that isn’t typically done (for performance reasons). However, these optional units can often handle very large word sizes and so can be useful for those seeking to optimize existing kernel routines. Luca Barbieri posted, starting a new thread entitled “use SSE for atomic64_read/set if available” to do just that on x86-32 systems as an alternative to some of the more complex code being used today (including disabling pre-emption very briefly). Peter Anvin and Luca got into a somewhat lengthy debate about FPU etiquette (especially with regard to Peter’s view that kernel_fpu_begin() and kernel_fpu_end() be wrapped around kernel calls to the FPU, and Luca’s view that this expensive state change could be skipped in the case that only specific registers need to be saved and restored in such situations as in his patch). Peter Zijlstra, though not objecting to a cleanish implementation, suggested that one might want to “run a 64bit kernel already”. In the end Luca decided to re-write his other patches explicitly in assembly to avoid future complications with GCC changes, and to hold off on the SSE piece in question until another day.

UML. Remember the work a few weeks back to bring initial task userspace stack sizes in line with those permitted by rlimit? Well it turns out that the patch was a little too restrictive and was causing UML (User Mode Linux) to segfault on startup. The issue was raised by a number of people, including Adam Nielsen, who was also told that it is not possible to run 32-bit UML instances on a host 64-bit kernel or vice versa. They must match.

xz. Discussion continued on the potential for migrating kernel.org over to use XZ format compressed files. Phillip Lougher offered some defense of the venerable gzip format, emphasizing its cross-platform nature (there are even completely separate implementations available in Java for the inclined), and Andi Kleen pointed out the relative availability of tools that handle gzip files or bzip2 vs. xz, but others seemed to agree that various contrived scenarios not that relevant directly to kernel developers don’t warrant holding off an eventual migration to some better compression format.

In today’s miscellaneous items: An updated version of the OOM killer rewrite was posted by David Rientjes (including a patch that treats tasks running on different sets of CPUs as unlikely to be interfering with one another), the third round of KVM patches for 2.6.34 from Avi Kivity (including 1GB page size support, and an initial implementation of “Hyper-V” support for those desperate enough to need or want to run a Microsoft virtual machine guest), some seqlock implementation cleanups from Thomas Gleixner, a “foruth [sic] general posting of the newest version of the AppArmor security module” that is essentially a rewrite of the existing AppArmor code to use the existing hooks in the LSM security infrastructure rather than custom VFS patching, Grant Likely posted “basic ARM device tree support” (yaaaay!), Denys Vlasenko posted another attempt at supporting split out function and data ELF sections (one section per function or data item – something that is great for Ksplice), and Microsoft revived their work in Hyper-V recently (Hank Janssen seems to be trying really really hard to do the right things).

In today’s announcements:

Gujin 2.8. Etienne Lorrain announced a new release of the Gujin bootloader. It has some really nice options for device emulation, El-Torito emulation for booting Live-CD images, and a lot more besides.

RT patchset 2.6.32.12-rt21. Thomas Gleixner announced an updated RT patchset containing “fixes and cherry-picks from all over the place”, as well as some tracer fixes. The short log includes two scheduler fixes, some futex fixes, and some architectural stuff for ARM support.

RT patchset 2.6.33-rc8. Thomas Gleixner also announced the first RT release for the 2.6.33 stable series kernel. Thomas says he is pretty excited about the stability of this latest patch series, and the overall patch size is still falling quite considerably. He ends, “We are zooming in, but there is still a way to go”.

util-linux-ng 2.17.1. Karel Zak announced the release of util-linux-ng 2.17.1. This latest release includes an option to fdisk to disable DOS-compatible mode from the command line.

The latest kernel release was 2.6.33-rc8.

Finally today, the end of an era. Christine Caulfield announced that she is orphaning DECnet support in the kernel, due to “lack of time, space, motivation, hardware and probably expertise”. Apparently, “judging from the deafening silence on the linux-decnet mailing list [she] suspect[s] it’s either not being used anyway, or the few people that are using it are happy with their older kernels”.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.


Updates coming!

March 4th, 2010 jcm No comments

Folks,

Sorry for the delay. I should have updates out before the end of the week. Thanks. Remember, this is a spare time project and takes a lot of effort to do properly.

Jon.
