Archive

Archive for August, 2009

2009/08/23 Linux Kernel Podcast

August 26th, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090823.mp3

You know the drill, so all together now: “Another week, another -rc kernel”.

For the weekend of August 23rd, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Mailing lists, Offline scheduling, TuxOnIce, and x86.

Mailing Lists. Pavel Machek’s latest comments about the linux-arm-kernel mailing list (referring to a regular reminder mail sent to subscribers that he claimed was out of sync with recent discussions) obviously struck a raw nerve with Russell King, who decided to shut down those mailing lists immediately, telling everyone to refer to Pavel’s “extremely selfish attitude”. There have been recent efforts to move discussion onto an open (unmoderated) mailing list hosted on vger.kernel.org (there is already one – linux-arm – but it is not widely known and not in active use by the wider community of ARM developers) and it seemed as if this might finally happen this time, but instead David Woodhouse (with Russell’s blessing) set up alternative (open) lists on lists.infradead.org. Many other kernel hackers were unhappy with Russell’s actions – David Miller publicly stated that “a lot of us are tired of your crap, and like Alan Cox I can’t take you seriously at all”, while Ted Ts’o was somewhat more diplomatic but stated that he was “appalled” by the behavior.

Meanwhile, David Woodhouse, with a certain eloquence, debated the relative merits of automatically resubscribing all existing members of the old list to the new one, and especially whether some would be confused and flood the list with “please unsubscribe me” messages. David said “there’s no accounting for the stupidity of the human animal” and, ‘Even with the notification and the first major thread being “ARM Mailing lists have moved”, you’re right that there are probably a few people who are _so_ stupid that they can’t manage to work it out. I’ve sent a mail with words of one syllable or fewer to each list, which might hopefully help -- but I’m sure there’ll still be some. When that happens, we’ll unsubscribe them -- and send in the ninjas to ensure they don’t accidentally breed’. Catalin Marinas pointed out that the vger list will likely stick around in case the new lists finally drive David mad.

Separately, there appeared to be some consensus that the Linux InfiniBand/RDMA mailing list could be moved over to vger.kernel.org.

Offline scheduling (yes, that’s not a typo). Raz Ben Yehuda mailed to let everyone know about some work ongoing at The Open University of Israel Department of Mathematics and Computer Science. They look at the problem of assigning dedicated processors in general purpose computing systems such as Linux, in which even a “dedicated” processor must perform a lot of work that relates to general system housekeeping, and other activities. Instead, they exploit the ability to offline a CPU and logically detach it from the running system, leaving it to run whatever dedicated code has been assigned to it. The example cited is that of offloading packet analysis in a firewall to one of these “offlined” cores, but there are many other potential uses. Some wondered why the same features couldn’t be achieved with existing cpusets, virtualization, and other efforts. Raz’s reply included that this work could be used for “hard realtime” systems (this smells a lot like some of the other real time solutions available for Linux systems in that regard).

TuxOnIce. On Thursday, Nigel Cunningham had mailed to suggest various ways forward for the development of the “TuxOnIce” alternative suspend patchset. Nigel and others have not had enough time for development recently and don’t see that situation changing any time soon, and seemed to favor someone taking over development altogether. After a little silence, Jiri Slaby stepped up to the plate and suggested that he could help shepherd pieces of TuxOnIce into the mainline swsusp implementation to make major features more available. Others, including Xavier Gnata, offered their time in testing. Meanwhile, Nigel sent another oration in response, with a “todo list” included.

X86. Thomas Gleixner posted a 32 part patch series intended to refactor the setup code used in x86 to provide a base suited to embedded platforms, in light of the arrival of Intel Moorestown support, which Thomas says “indicate the arrival of the embedded nightmare to arch/x86”. As Thomas puts it, “Moorestown is a SoC with an x86 core and a bunch of random peripherals glued around it. It finally gets rid of legacy hardware like PIT, 8042 et. al. but on the other hand it introduces the full embedded horror by adding random peripheral IP cores as an replacement which are glued onto the x86 CPU with duct tape and other nasty tricks.” Thomas was particularly kind towards the design of the APB timer and then got down to the business of describing how the new patches refactored the setup code without replacing paravirt_ops.

In today’s miscellaneous items: a 46 part patch series representing part 3 of a 4 part patch series of KVM updates targeted for 2.6.32 from Avi Kivity, a bunch of staging patches from Bartlomiej Zolnierkiewicz, some “use printk_once” patch series from Marcin Slusarz, some s390x updates from Martin Schwidefsky, a PCI fix (correcting a PCI suspend/resume problem) from Jesse Barnes, a patch to reuse the boot-time mappings of fixed_addresses from Xiao Guangrong, a large number of RCU patches from Paul McKenney, an interesting munlock fix from Hiroaki Wakabayashi (who noticed that due to recent work to make get_user_pages interruptible, we can end up with a situation in which some pages passed to mlock are not actually pinned following a well-timed SIGKILL, and these will later result in a completely pointless page fault at munlock during exit), some tracing fixes for generic syscall events from Josh Stone, conversion to the new dev_pm_ops patches from Marek Vasut (which Nicolas Pitre suspected might be buggy), some SCSI fixes from James Bottomley (entirely mpt2sas fixes), the latest version of the flex_array patches from David Rientjes, an Intel Atom CPU configuration target from Tobias Doerffel, some m68k and m68knommu updates from Geert Uytterhoeven (including support for Performance Counters), a backport of TREE_RCU to 2.6.27 from Paul McKenney (it had been pointed out previously that this was likely required for certain users of that kernel, which I believe is used by a particular Enterprise Linux kernel), another round of O_SYNC patches from Jan Kara (17 patches implementing a single path for O_SYNC and standard syncing), a single wireless fix from John Linville (for rtl8187b parts), version two of Performance Counters support for IA64 from William Cohen (with the V2 email re-written by Ingo Molnar to include appropriate references to the previous version), some fixes to build Intel TXT (Trusted Execution Technology) on non-x86 platforms from Shane Wang, an informative mail from Gregory Haskins concerning “vbus design points” and how he is particularly proud of the design of the vbus shared-memory model, a fix to update_process_times from Peter Zijlstra that delays waking up softirqs from jiffy ticks in an attempt to fix broken runqueue balancing on recent RT kernels, an updated git repository from Dave Airlie containing the same DRM fixes but this time without some weird corruption, a btrfs fix from Jens Axboe that corrects a red/black (rb) tree corruption, version 4 of the automatic crashkernel patches from Amerigo Wang, ongoing discussion of the merits of sysfs and configfs over ioctls (mostly Avi Kivity, but some others also), and some updates to checkpatch from Andy Whitcroft.

In today’s announcements: Linux 2.6.31-rc7. Linus Torvalds announced the release of version 2.6.31-rc7 of the Linux kernel. The latest rc includes many driver and architecture updates, with lots of other smaller fixes. Quoting Linus, there are “some inotify fixes here too, but I don’t think we’ve confirmed whether they help the (apparently very hard to trigger) oopses some people have seen”. When Linus isn’t announcing new kernel releases, he’s giving advice on using ftrace to track down stack overflow issues – nice to see that ftrace is getting wide use and recommendation for all manner of situations.

2.6.31-rc6-rt6. Thomas Gleixner announced version 2.6.31-rc6-rt6 of the ongoing preempt-rt development. In the latest Real Time kernel, one can find a rebase to Linus’ recent git, and a load accounting/balancing fix from Peter Zijlstra. A problem with ARM highmem use remains unresolved.

Git version 1.6.4.1. Junio C Hamano announced the release of version 1.6.4.1 of the git SCM tool used in Linux kernel development (and by other projects). The latest version includes fixes to “git am”, documentation updates (e.g. for fast-forward), and a lot more besides.

Gujin GPL bootloader version 2.7. Etienne Lorrain announced the release of version 2.7 of the “Gujin” bootloader, which features support for several Linux distributions in the form of packages. This latest release can parse ISO format files containing filesystem images and extract kernels from within, allowing one to (in theory anyway) keep an image of a LiveCD around as a boot target. This doesn’t quite work (yet) for many standard live images, but one can see the direction this is heading.

The latest kernel release is 2.6.31-rc7, which was released by Linus Torvalds on Friday evening at 6:26pm PDT.

Several people have found issues with rc7 already, including a build error with the oom_adj reversion discovered by Geert Uytterhoeven, and lost hardware sensors support in the case of Gene Heskett. Jes Sorensen thought he had seen a problem in the scheduler, but it wasn’t easily reproduced. Ingo Molnar believed it might be a rare timing bug and asked for any logs, which didn’t exist. So whatever may or may not exist there probably will stick around.

Stephen Rothwell posted a linux-next tree for Friday August 21st. Since Thursday, the nfs tree gained a build failure, the drm tree lost all of its conflicts, and the agp tree gained a conflict against the powerpc tree. The sub-tree count remains steady today at 140 trees in the linux-next tree.

That’s a summary of today’s Linux Kernel Mailing List traffic. For further information, visit www.kernel.org. I’m Jon Masters.

Categories: episodes Tags:

2009/08/20 Linux Kernel Podcast

August 21st, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090820.mp3

The UNiplexed Information and Computing Service (UNIX) turns 40 this month. How many of us were around back in the days of Woodstock? Not this author.

For Thursday, August 20th, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Lazy workqueues, mailing lists, O_DIRECT, and TuxOnIce.

Lazy workqueues. Jens Axboe posted in followup to a previous rant about the number of kernel threads that had been running on his system (all 531 – really – of them). He preferred keeping the workqueue interface rather than redoing it yet again with some kind of wheel re-inventing new scheme. Jens adds lazy workqueues, which behave like the existing code, but create only one core kernel thread per online CPU that shares the responsibility of providing context for all lazy work not otherwise assigned its own thread.

Mailing Lists. In another mail along the “should we move to vger?” lines, Roland Dreier solicited opinions on moving the Linux InfiniBand/RDMA mailing list over to vger.kernel.org. Largely, the impetus seems to be that the existing list on openfabrics.org is closed to posts from non-subscribers, and, just like the recent discussion concerning the ARM Linux kernel list, many would prefer to have a list that was open to posts from non-members (especially as that allows easy cross-posting of topics with the LKML).

O_DIRECT loop devices. Jens Axboe and Alan D. Brunelle had a back and forth concerning some metrics Alan had collected in test runs of Jens’ patch, which aims to unify O_DIRECT handling to allow loopback device data writes to proceed directly to backing storage without hitting the page cache. Alan’s test runs (available as a large PNG) show a huge drop in performance for POSIX AIO random and sequential writes (half way down the graphic). This isn’t unusual for a patch at an early stage of testing and development.

TuxOnIce. What to do? Nigel Cunningham posted to let everyone know that after his most recent attempt to get TuxOnIce merged (apparently, this is “something like the third time” he has tried to do this by now), there had been an interim agreement that he and Rafael would work on getting functionality merged bit by bit. Alas, both are busy with other things and do not have enough time for the effort, and so Nigel proposes three possibilities. First, he’d like to know if someone would like to improve the existing swsusp code (taking bits from TuxOnIce if they deem it appropriate) without help from him. Second, he’d like to know whether someone would take over TuxOnIce maintainership. Finally, he’d like to know if there are any better ideas that have not occurred to him.

In today’s miscellaneous items: some multi-node processor scheduling fixes from Andreas Herrmann, some input updates for 2.6.31-rc5 from Dmitry Torokhov, a series of NFS bug reports from Fengguang Wu in which recent kernels would suddenly return access denied errors and/or cause kernel panics in nfs_release, an eloquently phrased patch to the PCI DMAR code for the case of a DMAR returning all ones from David Woodhouse informing certain BIOS vendors that they had further lowered his already unprintable opinion of closed source BIOSes and BIOS engineers, a patch from Kamezawa Hiroyuki aimed at better aligning percpu counters, a device table update from Mario Schwalbe adding support for Apple models MacBook 5,1, MacBook Pro 5,1, MacBook Pro 5,2, and MacBook Pro 5,5 (Apple has a tendency to use really stupid model numbering conventions and always has), additional support for cut_here in AFS, CacheFiles, FS-Cache and RxRPC from David Howells such that these filesystems and caching services will display some useful diagnostic information as an accompaniment to a BUG() report (for which he also posted a patch implementing disconnected use of cut_here), some error handling fixes from Florian Tobias Schandinat for the framebuffer drivers implementing support for the error code possibly returned by fb_set_par that was being silently ignored by fbmem.c and fbcon.c, a fix to “reservetop” kernel boot parameter handling from Xiao Guangrong, a fix from Jan Beulich to the target specifications in arch/x86/boot/compressed/Makefile such that vmlinux.lds is included and will not cause a number of files to be pointlessly rebuilt on each kernel compilation if they are already up-to-date, some sound (HD-audio) fixes from Takashi Iwai, some additional wireless patches for 2.6.32 from John Linville, a suggestion from Balbir Singh that his scalability fixes for root overhead in memory cgroup controllers be merged for 2.6.31 rather than holding off to 2.6.32, version 3 of a patch series from Jason Wessel implementing various EHCI and earlyprintk improvements for attached devices, a fix for a theoretical deadlock involving the del_timer_sync inside cancel_delayed_work from Roland Dreier, and some DRM fixes from Dave Airlie.

Finally today, Frans Pop reported a concern that he was getting a cryptic looking error message that related to his PCI hardware not supporting the Advanced Error Reporting (AER) feature of recent devices. It’s unfortunate that the error result from pci_enable_pcie_error_reporting would lead to such an unhelpful error message in the system logs.

The latest kernel release is 2.6.31-rc6, which was released on August 14th.

Krzysztof Halasa posted saying that he believes he has worked out what was causing the strange network timeouts in 2.6.30.5. He believes the problem lies with network desc’s being allocated non-coherently using a streaming allocation that fails on x86 with swiotlb because swiotlb has no concept of a “dirty” flag and so doesn’t know when to flush. Apparently, there is no other fix than converting the allocations over to coherent forms in post-2.6.31.

Dinakar Guniguntala concurred with John Stultz that he was also seeing an issue with recent 2.6.31 RT kernels in which all tasks would end up bound to a single CPU due to some kind of regression in the SMP scheduler behavior.

Eric W. Biederman reported a NULL pointer dereference bug in 2.6.31-rc6 with an overrun backtrace containing a recent call to lapic_next_event and run_timer_softirq.

Andrew Morton posted an mm-of-the-moment for 2009-08-20-19-18.

Stephen Rothwell posted a linux-next tree for August 20th. Since Wednesday, the drm tree gained 3 conflicts while the fsnotify, drbd, tip and the usb trees all lost build failures and conflicts. The total sub-tree count is steady today at 140 trees in the latest linux-next compose.

That’s a summary of today’s Linux Kernel Mailing List traffic. For further information, visit www.kernel.org. I’m Jon Masters.


2009/08/19 Linux Kernel Podcast

August 21st, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090819.mp3

For Wednesday, August 19th 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Config/SysFS, Cpuidle, O_SYNC, Performance Counters, Spinlocks, and x86.

Config/SysFS. Avi Kivity posted concerning some issues he has with “all the text based pseudo filesystems that the kernel exposes”. His main concern is that the kernel development community is “optimizing for the active sysadmin, not for libraries and management programs”. On a lower level, he is concerned about a number of specifics, including efficiency of open/read/close actions, atomicity of having to read multiple files that may be changing in order to capture specific system state information, the ambiguous format of attributes, lifetime and access control concerns, notification of change in attributes, and readdir support being “painful”. Avi says that “I don’t think a lot of effort is needed to make an extensible syscall interface just as usable and a lot more efficient than config/sysfs”, to which Ingo Molnar suggested that such an implementation was available in the form of the mechanism used by the performance counters code’s perf_counter_open system call, which does such things as passing an embedded .size field so that the data structure exchanged with userspace can change in size later on (embedded ABI protection). Avi replied that he had seen this and that it was “nice”. A number of others expressed frustrations at the current interfaces, so it will be interesting to see whether this turns into anything more concrete.
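As a rough illustration of the embedded size field idea, here is a minimal Python sketch. The attribute layouts, field names and the copy_attr helper are all invented for this example and are not the real perf_counter_open structures; the point is only that a leading size field lets the receiving side accept both older (shorter) and newer (longer) versions of a structure.

```python
import struct

# Hypothetical attribute layouts, invented for illustration only.
# v1: size (u32), type (u32), config (u64)
# v2: v1 plus sample_period (u64)
V1 = struct.Struct("<IIQ")
V2 = struct.Struct("<IIQQ")
KNOWN_SIZE = V2.size  # the largest layout this receiver understands


def copy_attr(buf: bytes) -> dict:
    """Decode a caller-supplied attribute block, tolerating other versions."""
    if len(buf) < 4:
        raise ValueError("too short to hold the size field")
    (size,) = struct.unpack_from("<I", buf)
    if size < V1.size or size > len(buf):
        raise ValueError("bad size field")
    body = buf[:size]
    if size > KNOWN_SIZE:
        # Newer caller: accept only if every field we do not know is zero.
        if any(body[KNOWN_SIZE:]):
            raise ValueError("unknown fields are set")
        body = body[:KNOWN_SIZE]
    # Older caller: pad with zeros so missing fields take default values.
    body = body.ljust(KNOWN_SIZE, b"\0")
    _, type_, config, sample_period = V2.unpack(body)
    return {"type": type_, "config": config, "sample_period": sample_period}
```

An old caller passing V1.pack(V1.size, 1, 7) and a new caller passing V2.pack(V2.size, 1, 7, 100) are both decoded by the same function; the old caller simply gets a zero sample_period.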

Cpuidle. Arun R Bharadwaj posted a two part patch series implementing cpuidle infrastructure support for powerpc systems. This not only allows powerpc systems to save power by selectively entering “snooze” and “nap” states when the kernel cpuidle code deems it appropriate, but also provides tpmd_idle, which adds support for Thermal and Power Management idling.

O_SYNC. Jan Kara posted a seventeen part patch series entitled “Make O_SYNC handling use standard syncing path” that aims to unify O_SYNC handling with the existing code that implements fsync(). After this patch series is applied, there is just one place where handling for forcing file commits to disk is implemented, making life easier for filesystem code. The patch touches a lot of filesystems and is probably going to need some fairly hefty testing.

Performance Counters. Everyone’s worried about information leakage and security at the moment, and Peter Zijlstra had previously noted the risk for information leakage through performance counters metrics. He posted another version of a patch series changing the default permissions on performance counters (disallowing regular users from creating cpu-wide counters), and causing any samples to have anonymized kernel IPs (Instruction Pointers) in the case that they are being collected by an unprivileged user.

Spinlocks. Discussion continued surrounding the meaning and purpose of spin_is_locked() as applied to uniprocessor systems. Thomas Gleixner had suggested that it should always return true, whereas Peter Zijlstra, Linus, and others had pointed out problems with this logic. In the end Peter suggested that the best idea might be for spin_is_locked to be a synonym for panic(). As I mentioned previously, Linux Weekly News has an excellent writeup in the latest edition, so it’s worth referring to that for more detail.

x86. Jan Beulich noted that according to gcc’s instruction selection, inc/dec instructions can be used without a performance penalty on most x86 CPU models, but should be avoided on others. Hence he suggests (and posts a patch for) selectable configuration of inc/dec instruction use depending upon the CPU models that are being targeted by a given x86 build.

In today’s miscellaneous items: another version of the CLOCK_REALTIME_COARSE patch that adds a fast but not very fine-grained timestamp from John Stultz, version 0.5 of the new kfifo API implementation from Stefani Seibold, a patch from Bartlomiej Zolnierkiewicz removing the mailing list for ncpfs from MAINTAINERS, a patch from Miguel Boton moving the many different alignment macros within the kernel into a standard “align.h” header file, yet another round of patches for Compal-made Dell laptops from Mario Limonciello (with special thanks to Alan Jenkins for once again putting a lot of effort into testing and finding some bugs), some minor bug fixes for nilfs2 from Ryusuke Konishi, some documentation updates to AFS from David Howells, a patch from Miroslav Rezanina causing Xen guest kernels booted with a mem= parameter (but nonetheless allocated additional memory in the hypervisor) to return the additional memory back to Xen early in boot, a second batch of 47 KVM updates targeting 2.6.32, a trivial fix for linux-next from Ingo Molnar that adds new tracepoints for syscall_enter and exit on s390 systems (avoiding a build failure otherwise), some microblaze fixes from Michal Simek, version 3 of a patch series from Zhang Rui implementing a standard interface for Ambient Light Sensors (ALS), a patch adding syscall filtering support for ftrace events from Li Zefan, version 5 of a patch from Amerigo Wang correcting the semantics for file truncations when both suid and write permissions are set for the user on a given file entry, some DRM fixes from Dave Airlie, and a new version 4 of the vhost kernel-level virtio server from Michael S. Tsirkin that is sure to kick off another round of enjoyable virtualization dialogue.

In today’s announcements: 2.6.31-rc6-rt5. Thomas Gleixner posted the latest version of the preempt-rt kernel, which updates to the latest Linus git tree, makes IPI handlers unthreaded on PowerPC (pseries), and fixes a problem with cgroup memcontrol preemption.

The latest kernel release is 2.6.31-rc6, which was released on August 14th.

Rafael J. Wysocki posted a list of regressions from 2.6.29 to 2.6.30 and from 2.6.30 to 2.6.31-rc6-git5 for which there are no fixes in mainline that he is currently aware of. The regression list has not increased dramatically, and most of the bugs seem to have driver specific or suspend/resume roots.

Walt Holman posted saying that he is experiencing some “periodic timeouts” with kernel 2.6.30.5 and Simon Kirby noticed how a “storage head box” also running 2.6.30 would occasionally get stuck allocating memory to send a packet for up to several seconds (visible watching sshd getting stuck), blocking on a mutex named iprune_mutex called from prune_icache in fs/inode.c. He made some suggestions about converting to a try_lock in that code and so forth. Finally, Steven Rostedt posted a series of lockups in the IPI code on recent kernels.

Stephen Rothwell posted a linux-next tree for August 19th. Since Tuesday, the mips, omap, and suspend trees lost their issues, while the tip and usb trees gained some conflicts. The total sub-tree count remains steady at 140 trees.

That’s a summary of today’s Linux Kernel Mailing List traffic. For further information, visit www.kernel.org. I’m Jon Masters.


2009/08/18 Linux Kernel Podcast

August 21st, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090818.mp3

For Tuesday, August 18th, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: AlacrityVM, Kconfig, Spinlocks, and VM.

AlacrityVM. Today’s ongoing debate about which IO implementation is the fairest of them all saw the discussion head into the realm of DMA: specifically, whether vbus supported things like RDMA, how guests are protected from DMA to random host memory on platforms like PowerPC using a real physical DMA controller with virtio, and similar topics.

Kconfig. Steven Rostedt, who I know has had his fair share of annoyances with building test kernels, posted some patches aimed at making the process of building test kernels for a particular system much easier by having a build target that will automatically select a configuration covering all modules currently loaded on the test system. Rather than having to build a distribution style test kernel (which takes time), Steven’s patches allow developers to use “make localmodconfig” and “make localyesconfig” to build test kernels featuring only modules actually in use, either built as modules or built into the kernel in the latter case. Thanks a whole bunch, Steven!

Spinlocks. Kumar Gala posted asking whether spin_is_locked behavior is broken on uniprocessor systems. As Linux Weekly News pointed out, the problem here is that, actually, the meaning of spin_is_locked on systems without actual spinlocks being present is somewhat ill-defined. The LWN article does a great job of explaining the issues, so I won’t cover it much further here except to say that it’s likely some new spinlock primitives are coming down the pipe.

VM. I wonder what he’s working on. Jan Beulich posted a range of patches. The first alters handling of num_physpages since memory allocations should depend upon the amount of usable memory, and not just the total PFN count (which may include all manner of non-RAM ranges) in a system. The second builds upon this by replacing various users of num_physpages with totalram_pages. The third migrates the PID hash table over to using alloc_large_system_hash. And the fourth patch from Jan removes use of alloc_bootmem_low where it’s not strictly required for a given system to operate, especially on large 64-bit systems.
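To see why sizing things from the PFN count rather than from usable memory can mislead, consider this small Python sketch. The memory map below is entirely made up for illustration, and pfn_span and usable_pages are hypothetical helpers that only loosely stand in for the roles of num_physpages and totalram_pages.

```python
PAGE_SIZE = 4096

# Hypothetical firmware memory map, invented for illustration:
# (start address, length in bytes, is usable RAM)
MEMORY_MAP = [
    (0x0000_0000, 0x0009_F000, True),    # low RAM
    (0x0009_F000, 0x0006_1000, False),   # reserved / legacy hole
    (0x0010_0000, 0x3FF0_0000, True),    # RAM up to 1 GiB
    (0x4000_0000, 0xC000_0000, False),   # large non-RAM hole, 1 GiB to 4 GiB
    (0x1_0000_0000, 0x4000_0000, True),  # RAM placed above 4 GiB
]


def pfn_span(memory_map):
    """Page frames from zero up to the highest address, holes included."""
    end = max(start + length for start, length, _ in memory_map)
    return end // PAGE_SIZE


def usable_pages(memory_map):
    """Page frames that are actually backed by usable RAM."""
    return sum(length // PAGE_SIZE
               for _, length, usable in memory_map if usable)
```

With this made-up map the PFN span covers 5 GiB worth of page frames while only about 2 GiB is usable RAM, so any table sized from the span would be roughly two and a half times larger than the usable memory justifies.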

Also on a VM note today, Mel Gorman posted a three part RFC patch series aimed at reducing the need to search within the fast path of the low level page allocator by maintaining multiple free-lists in the per-cpu structure. At the time of the original introduction of per_cpu_pages, Mel says that the per-cpu static allocation thereof (recall that dynamic per-cpu structure allocation was recently implemented) resulted in too much wasted memory. But now that this is no longer the case, he is able to add multiple free lists to struct per_cpu_pages, one per migratetype that can be stored on the PCP lists. For the most part, performance testing showed only marginal improvement, except in the case of netperf-udp on x86_64 and sysbench on ppc64, which showed larger gains.
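The difference between one shared per-cpu free list and one list per migratetype can be sketched in a few lines of Python. This is a toy model rather than the kernel's struct per_cpu_pages: the class names and the set of migrate types are assumptions made for the example.

```python
from collections import deque

# Hypothetical set of migrate types kept on the per-cpu lists.
MIGRATE_TYPES = ("unmovable", "reclaimable", "movable")


class SingleListPCP:
    """One free list per CPU: allocation must search for a matching page."""

    def __init__(self):
        self.pages = []

    def free(self, page, migratetype):
        self.pages.insert(0, (page, migratetype))

    def alloc(self, migratetype):
        # Linear scan through pages of every type until one matches.
        for i, (page, mt) in enumerate(self.pages):
            if mt == migratetype:
                self.pages.pop(i)
                return page
        return None


class MultiListPCP:
    """One free list per migratetype: allocation takes the list head."""

    def __init__(self):
        self.lists = {mt: deque() for mt in MIGRATE_TYPES}

    def free(self, page, migratetype):
        self.lists[migratetype].appendleft(page)

    def alloc(self, migratetype):
        lst = self.lists[migratetype]
        return lst.popleft() if lst else None
```

Both classes hand back the same pages in the same order; the second simply avoids the search, at the cost of a little extra per-cpu bookkeeping.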

In today’s miscellaneous items: some tracing fixes (to correct broken names in ftrace filters) from Steven Rostedt, another version of Paul Menage’s cgroup memberlist enhancements that add a cgroup.procs file to each cgroup (that contains unique thread group information rather than task IDs), an implementation of ACPI 4.0 power meter support via an extended hwmon sysfs interface from Darrick J. Wong, some irq fixes from Thomas Gleixner (who confirmed that today’s tree “contains really what I want you to pull”, after yesterday’s tree inadvertently had the wrong patch), a fix to the LSM_MMAP_MIN_ADDR (yes, that one) help text from Dave Jones that corrects the default value to 65536 rather than 65535 (which would still fall within the first page on a 4K page system), another version of Jon Hunter’s patch that catches timer wrapping in clocksources and allows 32-bit systems to sleep for longer than 2.15 seconds when using dynamic ticks, two more wireless updates from John Linville, a twelve part patch series aimed at cleaning up __build_sched_domains by making the code “less ugly and more readable” from Andreas Herrmann, version two of yesterday’s page based O_DIRECT implementation from Jens Axboe, a whole bunch of network fixes from David Miller (including a TUN ioctl race fix from Herbert Xu, and a fix to the genetlink data structure that had previously broken userland), version 2 of the patch series adding in-memory-only xattr support on sysfs files from Casey Schaufler, and a trivial “make html” fix for performance counters from Kyle McMartin.
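The 2.15 second figure in the timer wrapping item is consistent with a sleep length held as a signed 32-bit count of nanoseconds. That explanation is an assumption on my part rather than something stated in the patch description, and max_sleep_seconds below is a hypothetical helper, but the arithmetic is easy to check:

```python
NSEC_PER_SEC = 1_000_000_000


def max_sleep_seconds(bits: int) -> float:
    """Longest interval a signed nanosecond counter of `bits` bits can hold."""
    return (2 ** (bits - 1) - 1) / NSEC_PER_SEC


# Assumed 32-bit signed nanosecond counter: just under 2.15 seconds.
print(round(max_sleep_seconds(32), 2))
```

The same calculation with a 64-bit counter gives a limit measured in centuries, which is why the restriction only bites on 32-bit systems.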

Finally today, Mikael Pettersson posted an intriguingly excessive request. He notes that his laptop hardware “clips disk capacities to 128GB. There’s no BIOS update or BIOS setup option to fix this. Passing libata.ignore_hpa=1 allows the Linux kernel to access larger disks, so Linux does work Ok with larger disks. However the laptop dual-boots Windows (for work-related stuff), and Windows has a major problem: if an entry in the msdos partition table refers to a sector above the BIOS 128GB limit, the Windows kernel crashes an reboots early in its boot sequence”. He goes on to propose adding some kind of sub-partition type that could be somehow hidden from Windows.

In today’s announcements: 2.6.31-rc6-rt4. Thomas Gleixner announced the latest iteration of the preempt-rt patchset (he skipped -rt3 as it failed in testing). This included an update to the “ONESHOT” irq infrastructure Thomas has been working on for mainline inclusion.

The latest kernel release is 2.6.31-rc6, which was released on August 14th.

Christoph Thielecke posted an interesting hard lockup on 2.6.31-rc6, which again seemed to be related to his ongoing Xorg development build testing.

Stephen Rothwell posted a linux-next tree for August 18th. Since Monday, the xfs, fsnotify, and suspend trees gained conflicts while the usb tree lost one. The total sub-tree count remains steady at 140 trees.

That’s a summary of today’s Linux Kernel Mailing List traffic. For further information, visit www.kernel.org. I’m Jon Masters.


2009/08/17 Linux Kernel Podcast

August 21st, 2009 jcm 1 comment

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090817.mp3

Did you know there have been over 64,000 downloads of the LKML podcast? That there are 500 listeners per episode? People who listen to the podcast vary from developers to company executives; they listen on their way to work, on the way to school, and even in the bathtub. Thank you for listening.

For Monday, August 17th 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: AlacrityVM, Clocksource, Discard, and Loop.

AlacrityVM. Well, it was going to happen eventually. Ever since Gregory Haskins first posted patches implementing an alternative virtualized IO framework (replacing virtio in userspace with vbus/venet in kernelspace) and subsequently posted a KVM “fork” using those for virtualizing real time guests, there has been some discontent within the virtualization community at the apparent fracturing of the community. Both Avi Kivity and Ingo Molnar have spoken out against developing two competing implementations within the kernel (rather than getting behind one of them), and the former has been particularly defensive of the existing virtio implementation (which now has a kernelspace implementation available). Although not technically objecting to the work, Ingo’s argument can perhaps be summed up best with this quote: “If virtio pulls even with vbus’s performance and vbus has no advantages over virtio [I] do NAK vbus on that basis. Let’s stop the sillyness before it starts hurting users. Coming up with something better is good, but doing an incompatible, duplicative framework just for NIH reasons is stupid and should be resisted”. Obviously there are different commercial interests at play, but it should be noted that Greg has seemingly tried to navigate around these hurdles. He argues that he has made “every effort” to propose that his patches get integrated within KVM directly and implies that he is only continuing to work on the Alacrity implementation for purely technical reasons of driving up performance. And, it should be added, as Greg points out, his work has already helped to motivate kernelspace virtio efforts.

Clocksource. Stephen Hemminger noticed a regression caused by a recent patch from John Stultz that had aimed to sanity check changes to the active clocksource configuration using the sysfs interface. This change had broken systems built with High Resolution Timers (HRT) but not actually using them.

Discard. After several days’ worth of discussions surrounding the need for compcache (essentially, compressed RAM backed swap) to be made more immediately aware of swap slots becoming free, discussion had turned toward general discard handling. This is a term that refers to the need to educate underlying block devices whenever a block is actually no longer in use by any higher level software abstractions (a filesystem, or swap device, or something more exotic) and has become increasingly relevant in a world where SSD flash devices would love to know when they can actually recycle underlying flash blocks for more effective allocation and wear leveling support. Linux uses the ATA “TRIM” command to educate many of these devices about such events, but that command has a number of unpleasant standard-mandated issues not the least of which is the inordinate amount of time it can take to complete. Mark Lord posted some rather horrific benchmarks showing how drive firmware successfully lied about the first call to TRIM but subsequent (more real world) calls immediately following a TRIM resulted in hundreds of milliseconds of drive latency. Linux Weekly News had a more exciting and lengthy summary of the troubles with discard, so I recommend reading that article for further detail.

Loop. Jens Axboe, noting that the existing loop implementation (support in the kernel for exposing a file or similar as a block device upon which a higher level filesystem may subsequently be mounted) always uses the page cache regardless of IOPS requesting O_DIRECT, posted a patch implementing page based O_DIRECT on loop devices. His patchset modifies the IO path for all O_DIRECT operations making it page based rather than passing down iovecs, but he cautions that it is “basically a first version so don’t expect too much of it, but it does seem to work fine for me.” NFS was apparently the main difficulty in converting over existing code, and he’s not at all sure that that has been successful – so apply usual caution in testing.

In today’s miscellaneous items: some sh updates from Paul Mundt, a three-patch series adding support for the Dell “Mini” series based upon compal-laptop from Mario Limonciello, some tracing fixes from Frederic Weisbecker, some performance counters and x86 fixes from Ingo Molnar, some performance counters fixes from Peter Zijlstra, a new iteration of the generic hardware breakpoints patchset from K. Prasad, some minor fixes to Microsoft Hyper-V configuration options so that all the sub-component drivers depend upon the base one from Jan Beulich, an IRQ fix from Thomas Gleixner (Linus wasn’t convinced that the right patch was in the git tree), a huge TLB driver example from Alexey Korolev, a suggestion that ESP and EIP values be removed from a task stat file and made available to processes with PTRACE capability, some XFS updates from Felix Blyakher, version 2 of the RDC (a low power 486 like SoC implementation) detection patches from Mark Kelly, a fix to drop write permission on /proc/timer_list and /proc/slabinfo from Amerigo Wang (which Ingo Molnar described as a “good catch”), a new time-source selector allowing one to (for example) specify wallclock times be used in ftrace entries from Zhao Lei, version 4 of a patch fixing file truncation handling when both suid and write permissions are held on a given file entry from Amerigo Wang, a patch to flex_array optimizing hot paths by allowing the compiler to substitute bit shifts for divides on power-of-two size allocated arrays from Dave Hansen, and a patch adding a diagnostic message differentiating between a keyboard vs. non-keyboard triggered sys-b reboot event from Tina Yang.
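The flex_array item above rests on a small piece of arithmetic: when the number of elements per part is a power of two, a divide and a modulo can be replaced by a shift and a mask. The following is only an illustrative sketch of that identity in Python, not code from the patch; the function names and the idea of returning a (part, offset) pair are assumptions made for the example.

```python
def locate_divide(index, elems_per_part):
    """General case: divide and modulo give (part number, offset in part)."""
    return index // elems_per_part, index % elems_per_part


def locate_shift(index, elems_per_part):
    """Power-of-two case: a shift and a mask give the same answer."""
    if elems_per_part <= 0 or elems_per_part & (elems_per_part - 1):
        raise ValueError("elems_per_part must be a power of two")
    shift = elems_per_part.bit_length() - 1  # log2 of the part size
    return index >> shift, index & (elems_per_part - 1)
```

For any non-negative index and a power-of-two part size such as 64, both functions return the same pair, which is why a compiler that knows the size at build time can pick the cheaper form.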

Finally today, a number of kernel developers repeated the point concerning vendors making hardware available for test, in particular suggesting that the Linux Foundation should foot the bill and hand out hardware at conferences like the upcoming Linux Plumbers Conference in order to save on shipping. In his dissenting opinion, James Bottomley reminded everyone that these devices often cost fairly significant amounts of money, but conceded that the Linux Foundation might be a means to distribute otherwise free hardware to those developers in need.

In today’s announcements: Greg Kroah-Hartman announced usbutils version 0.86.

The latest kernel release is 2.6.31-rc6, which was released on August 14th.

Rogerio Brito reported a regression in the hfsplus code affecting rc5. He found that creating a loopback mounted filesystem resulted in data loss.

Stephen Rothwell posted a linux-next tree for August 17th. Since Friday, several trees lost conflicts while the suspend, tip, and sfi trees gained a build failure and conflicts respectively. The total subtree count remains steady at 140 trees in the latest linux-next tree compose.

If you haven’t been to a dentist in a while, I strongly advise you to go. You’ll avoid having your root canal redone twice for good measure.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

Categories: episodes Tags:

2009/08/16 Linux Kernel Podcast

August 18th, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090816.mp3

For the weekend of August 16th, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Asynchronous suspend and resume, kfifo, KVM, Security, and threaded interrupts.

Asynchronous suspend and resume. Rafael J. Wysocki posted an (updated) 7 part RFC patch series implementing asynchronous suspend and resume callbacks, which should allow for faster and smoother transitions. The patch series is currently targeted for 2.6.32 inclusion.

kfifo. Stefani Seibold posted the latest version 0.4 of a generic kernel FIFO implementation. This aims to replace the current kernel FIFO API, which apparently has “too many constrain[t]s”. This was accompanied by a patch from David VomLehn converting the usb-serial code to use kfifo as a write buffer.

KVM. Avi Kivity posted the first of four batches(!) of patches against 2.6.32. This first round contains 48 patches including Gregory Haskins’ irqfd infrastructure. Later patches will apparently add support for “new hardware features, [and] improvements in nested [SVM] support”, as well as a conversion to the new trace infrastructure and x2apic emulation.

Security. Unless you were living under a rock in a remote monastery last week (and even then), you are probably aware of a major local NULL pointer exploit affecting nearly all 2.4 and 2.6 series Linux systems since 2001. This has been gaining a little attention in the (mostly Slashdot weenie) media over the past few days, and the script kiddies are taking interest. Consequently, it is worthwhile applying the one-line patch to affected systems, perhaps using ksplice for a dynamic update in the case that you are unable to reboot.

On another security related note, David Wagner drew attention to a security paper from this year’s USENIX playing up the impact of making various files world readable in the task directories under /proc. In one case, they are able to use the ESP and EIP information from a task stat file to recover information about another user’s keystrokes, which is potentially a security issue. This said, there have long been patches available (such as grsecurity) that hide various process statistics from other users, and it would be relatively trivial to adapt one of these for mainstream consumption. Coincidentally, Kosaki Motohiro posted a patch (from Tatsuhiro Aoshima) adding more statistics to the aforementioned stat file, for user time and system time consumed by the task. One awaits the next alarmist paper.

On a final security note, Casey Schaufler posted a patch series implementing an in-memory version of extended attributes, without requiring an underlying filesystem representation be in place. The current work provides support for sysfs files in the security domain so that certain otherwise privileged capabilities can be passed from one running task to the next and used to manipulate sysfs entries. Several justifying examples were included.

Threaded Interrupts. Michael Buesch posted asking whether threaded interrupts were broken because his system was unable to use them correctly with a particular wireless network driver that he was developing. He posted a fix for a crash that Thomas Gleixner thanked him for, but then got into a discussion in which it turned out that the interrupt line was being marked IRQ_DISABLED and the threaded handler was not being set up correctly. Thomas initially implied that there was a bug in Michael’s code but later discovered that a newly created kernel thread backing the interrupt was not being set up right in __setup_irq. He posted a patch – adding “Gah. I think I found it. I wonder why nobody else ever tripped over this” – that Michael confirmed is working.

In today’s miscellaneous items: version 8 of the IO Scheduler based IO Controller from Vivek Goyal, a patch implementing detection of RDC (a low power 486SX compatible SoC) x86 processors from Mark Kelly, an updated patch series implementing support for IRQ chips on slow buses (I2C and SPI) from Thomas Gleixner, a patch inlining __fatal_signal_pending, since it took longer to call than to execute in instruction footprint, from Roland McGrath, a git tree corruption observed on a btrfs filesystem by Jeremy Fitzhardinge, some patches allowing rescheduling during scan and ignoring any aperture memory holes on x86_64 systems from Catalin Marinas, version 3 of the smells-like-ACPI Simple Firmware Interface from Len Brown, more static structure patches for the staging tree from Julia Lawall, a post from Robert Schwebel inciting further discussion of ways to reduce boot times (especially on ARM), a fix to splice ensuring that it updates mtime from Miklos Szeredi, some memory corruption fixes in the wireless stack from John Linville, version 4 of the clocksource/timekeeping rework from Martin Schwidefsky, some GFS2 fixes from Steven Whitehouse, some s390 updates for the forthcoming merge window, a fix for the AUX LOOP command not being properly implemented in many laptop keyboard controllers (affecting touchpad detection) from Dmitry Torokhov, a resent patch correcting the result returned by getcwd() on an unbound bind mount from Miklos Szeredi, a fix to scripts/verlinux providing more useful output from Christian Kujau, a typedef removal tool from Luis R. Rodriguez (who obviously did not find an existing tool in response to his questions from the last week), a fix for a theoretical IPI-related race condition in the performance counters code (__per_counter_read) from Paul Mackerras, some percpu fixes from Tejun Heo, ongoing debate about compcache and the relative merits of compressed swap, version three of the AlacrityVM guest drivers from Gregory Haskins, and a fix to mmap POSIX interface semantics ensuring that mmaped files must have been open()ed with read permission from Graff Yang.

Finally today, Pander Musubi (whose mail client is misconfigured not to supply his full name, but sourceforge knows) offered a build of the kernel intended to allow a Linux laptop to operate as a charger for other devices (over USB or Firewire interfaces for example), consuming as little power as possible by turning off “disks, screen, etc.”.

In today’s announcements: Thomas Gleixner announced release 2.6.31 rc6-rt2 of the preempt-rt patch. The latest release includes a threaded interrupt fix from Linus Torvalds, and a number of other minor fixes against a -rc6 rebase.

Greg Kroah-Hartman released the 2.6.27.30 and 2.6.30.5 stable kernels, which he “strongly encouraged” users to upgrade to. These contain, amongst other things, fixes for the local NULL pointer exploit currently in the wild. Tim Tassonis had previously inquired as to whether the fixes were going to be present in stable, having confirmed that 2.6.30.4 was vulnerable.

The latest kernel release is 2.6.31-rc6, which was released on Thursday.

Stephen Rothwell posted a linux-next tree for August 14th. He reminds everyone that the tree has now moved to a new location (still on kernel.org). Since Thursday, the kvm tree lost its build failure, and the net tree lost its conflict but gained another against the net-current tree. The total sub-tree count remains steady today at 140 trees.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.


2009/08/13 Linux Kernel Podcast

August 14th, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090813.mp3

For Thursday, August 13th, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: A big security item, AlacrityVM, Fast boot times, JTAG, Kprobes, mailing lists, runtime power management, and using compressed RAM for swap.

Security. Today saw the release of another NULL pointer exploit into the wild. This one affects almost every 2.4 and 2.6 series kernel, and in the latter case is compounded by other published issues with mmap_min_addr protection.

Tavis Ormandy, while looking at various socket operations structures around the kernel tree, discovered that sock_sendpage() doesn’t validate the function pointer it uses in the underlying protocol. This is a problem if the underlying socket operations struct hasn’t been initialized correctly, as is the case for a number of different protocols implemented in the kernel now. This causes a NULL pointer execution, which for those systems without vm.mmap_min_addr set (many, but by no means all systems), allows a local exploit through a simple mapping of the zero page. Although setting that tunable may mitigate the error, a recently noticed issue with LSM might actually make it more likely that systems running SELinux are affected. All users are strongly encouraged to set this tunable, ensure their SELinux policy is not overriding its behavior, and upgrade their kernels forthwith.
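The class of bug described above, calling through an operations table entry that was never filled in, can be illustrated with a small model. This is only a sketch in Python and not the kernel fix itself; ProtoOps, fallback_sendpage and send_page are hypothetical names invented for the example, with None standing in for a NULL function pointer.

```python
class ProtoOps:
    """Hypothetical stand-in for a protocol's operations table."""

    def __init__(self, sendpage=None):
        # None models a function pointer that was never initialized.
        self.sendpage = sendpage


def fallback_sendpage(data):
    """Hypothetical generic handler used when a protocol supplies none."""
    return ("fallback", len(data))


def send_page(ops, data):
    """Validate the handler before calling it rather than jumping through NULL."""
    handler = ops.sendpage if ops.sendpage is not None else fallback_sendpage
    return handler(data)
```

A protocol that registers its own handler is called as before, while one that leaves the slot empty falls through to the generic handler instead of executing whatever happens to be mapped at address zero.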

Linus Torvalds took the opportunity (in announcing 2.6.31-rc6, which contained the fix and which is covered later) to blast the vendor-sec mailing list and the very concept of “embargoes”, saying “if it hadn’t been for vendor-sec apparently leaking like a sieve, we’d have delayed the fix until the next -rc due to trying to be polite to vendors”. Linus of course makes a good point – keeping security fixes “secret” only works as long as you have a perfect system for keeping them secure (which doesn’t exist).

AlacrityVM. Anthony Liguori pointed out that Gregory Haskins’ previous benchmark results comparing e.g. venet against the existing userspace implementation of virtio had been done on a kernel build without High Resolution Timers, and so had resulted in graphs that showed an extreme difference in performance. Greg (who it should be added did take the trouble to contact me immediately after the previous podcast and point this out) updated his graphs (which a number of vocal people still seem to hate for being 3D), and they now show correct round trip times for virtio-u of 266us, as opposed to more than 4000us in the previous benchmark run. Either way, userspace virtio is still shown to be slower than his replacement, though that may change with the in-kernel virtio implementation coming down the pipe. Separately, Michael S. Tsirkin posted version 3 of a 2 part patch series implementing a kernel-level virtio server (against in-kernel KVM), aimed at improving performance for virtio by reducing the virtualization overhead caused by extraneous system calls. As Michael says, for virtio-net, this removes up to 4 system calls *per packet*. Since the previous release, the patch adds some RCU comments, compat ioctl support, and uses “more idiomatic english” from Rusty Russell (we all know how Rusty can be eloquent with his use of the language).

Fast boot times. Robert Schwebel posted asking whatever became of the “fastboot” boot parameter and the git development tree that Arjan van de Ven had set up back in March. This was a tree that included asynchronous device probing within the kernel to speed up bus enumeration times at boot. Arjan responded that the features were now present by default (so there was no need to do anything), and also, quote “on x86 we’re doing pretty well ;-) ”. This led some to joke that fastboot was now a completely solved problem.

JTAG. Ordinarily, this would be a miscellaneous item, but I think it’s pretty cool. Davide Rizzo posted a patch series implementing a generic JTAG bitbang miscellaneous driver proposal. He wasn’t sure if he’d posted to the right list, since he couldn’t find an appropriate subsystem maintainer, but that will likely be resolved one way or another later. JTAG provides boundary scan and debug facilities, especially on embedded boards, and this driver will certainly be of use to those who have appropriate hardware.

Kprobes. Masami Hiramatsu posted version 14 of a 12 part patch series implementing a kprobe-based event tracer and x86 instruction decoder. The tracer allows one to probe various kernel events through the ftrace interface, while implementing a generic x86 instruction decoder that can be used to find the instruction boundaries when inserting new kprobes (remember, unlike many cleaner ISAs, x86 uses variable length instructions). The decoder does not support SSE/FP opcodes, and Masami thinks it might be possible to share the included opcode decoder map with the one currently used by the KVM Hypervisor. The latest version seems pretty close to mergeable and has various fixes.

Mailing lists. The MMC tree has a new mailing list. It is linux-mmc and is hosted as usual on the vger.kernel.org mailing list server. And on the subject of mailing lists, ongoing debate is happening surrounding which ARM mailing lists are preferred: Russell King’s moderated linux-arm-kernel on his own machine, or the linux-arm list on vger.kernel.org. A vocal minority would like to see posts happen on a public, open, non-moderated list, and see the kernel MAINTAINERS file include this address. Russell finally seemed to express indifference if that’s what the community preferred as a solution.

Runtime Power Management. Matthew Garrett posted two interesting RFC patches implementing runtime power management for PCI and USB buses. This allows for devices to be selectively shutdown when they are not in use, in much the same way that they would when being suspended and resumed. As Matthew says, this work builds upon Rafael J. Wysocki’s reworking of the power management API. Matthew had been experiencing various problems in testing due to a buggy BIOS, but has apparently now received an update upon which he is able to show that this works now. It’s still RFC, but it’s good to see it happening.

Swap. Nitin Gupta previously posted concerning his work on “compcache”, which implements a compressed RAM device upon which one can mount swap. For efficiency, the compcache folks want to have an immediate callback when a swap slot is freed rather than waiting for the special event that is otherwise passed into the block layer on swap slot freeing. Andrew Morton expressed concern at a simple callback under a spinlock, since he thought that artificially limited what one might be able to do with such an API. There was also some concern at duplicating functionality with a callback and subsequent block layer handling. Hugh Dickins shared Peter Zijlstra’s view that a general notifier might be the best way forward (while cautioning against current users of any hook), and finished up noting, “I won’t be surprised if we find that we need to move swap discard support much closer to swap_free”.

In today’s miscellaneous items: some genirq fixes aimed at preventing the wakeup of a freed irq thread from Thomas Gleixner (using Linus Torvalds’ “obvious solution”, for which he added a “precautionary” Signed-off-by), an RFC patch series implementing support for irq chips on slow buses such as I2C and SPI, also from Thomas Gleixner, a performance counters fix for “perf report” from Peter Zijlstra (since Pekka Enberg had noticed that this was broken by a “Full task tracing” patch), some performance counters, x86, and core kernel fixes from Ingo Molnar, another patch fixing an ABI incompatibility between “perf” and kernel, also from Peter Zijlstra, version 5 of the “Help Root Memory Cgroup Resource Counters Scale Better” patch series from Balbir Singh (which features a renamed subject), some RT mutex build fixes from Sven-Thorsten Dietrich, a cleanup fix for swiotlb fallback in intel_iommu_init, some sh updates (including initcall fixes in relation to recent I2C re-ordering now mergeable because the underlying I2C fixes got merged) from Paul Mundt, and some md fixes from Neil Brown. There was a suggestion from Jens Axboe that inlining spinlocks also results in a performance improvement of 3.5% with a particular workload on SPARC (as Dave Miller points out, this is likely because of the expense of a register window overflow onto the stack – which is 128 bytes of writes).

Finally today, Michael Schnell posted asking about the best practice to follow in implementing new futex support for an architecture (in this case on MMU-enabled NIOS systems). He would like some feedback.

In today’s announcements: Linux 2.6.31-rc6. Linus Torvalds announced 2.6.31-rc6. This had “Lots of small fixes all over, spread out fairly evenly”. As he says, things seem to be calming down a bit now, taking the opportunity to demonstrate his git prowess with an example command showing patch sizes. The release contains a fix for the (by now infamous) NULL pointer exploit, although as Linus points out, this should not be too much of a problem if previous efforts to fix mapping at the NULL page have turned out right.

Linux 2.4.37.5. Willy Tarreau took the opportunity to release 2.4.37.5, which he had wanted to delay but the NULL pointer exploit (that also affects 2.4 systems – although the local exploit that is being distributed is not exactly the same on these older systems) forced his hand. Willy repeats the assertion that users should set /proc/sys/vm/mmap_min_addr to 4096 or higher anyway, “unless you know that it breaks one very old legacy application”. Doing so will mitigate the exploit by not allowing zero page mappings.
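For anyone wanting to check a machine against that advice, the test amounts to reading one number and comparing it with 4096. Here is a minimal sketch in Python; the helper names are invented for the example, and it assumes the tunable is exposed as a plain decimal value at the path quoted above.

```python
from pathlib import Path

MMAP_MIN_ADDR = Path("/proc/sys/vm/mmap_min_addr")
RECOMMENDED_MINIMUM = 4096  # the value quoted above


def blocks_zero_page(raw_value, minimum=RECOMMENDED_MINIMUM):
    """Return True if the tunable's text value is at least `minimum`."""
    return int(raw_value.strip()) >= minimum


def check_system(path=MMAP_MIN_ADDR):
    """Read the live tunable; return None if it cannot be read or parsed."""
    try:
        return blocks_zero_page(path.read_text())
    except (OSError, ValueError):
        return None
```

Calling check_system() on a running system returns True, False, or None when the file is missing, as it would be on a kernel that lacks the tunable.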

Greg Kroah-Hartman released review patches for the 2.6.27.30 and 2.6.30.5 stable series kernels, containing 28 and 74 patches respectively. I didn’t check but expect that the NULL pointer fix is amongst those.

The latest kernel release is 2.6.31-rc6, which was released by Linus in the evening, or rather his afternoon, at 16:37 PDT.

Stephen Rothwell posted a linux-next tree for August 13th. Since Wednesday, the v4l-dvb tree regained the same conflicts, the kvm tree gained a build failure, and the percpu tree lost 2 conflicts. Stephen notes that the linux-next tree composition has moved and is now located at a more official address on the kernel.org website, while symlinks provide redirection from the old addresses for those who want to use an up-to-the-minute tree and yet live in the past in other ways, perhaps as some form of compensation.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.


2009/08/12 Linux Kernel Podcast

August 13th, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090812.mp3

For Wednesday, August 12th, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: AlacrityVM, Asynchronous device suspend and resume, and Kbuild.

AlacrityVM. Gregory Haskins posted some updated benchmarks for the AlacrityVM hypervisor (based upon KVM) that he and others have been working on. Chiefly, Alacrity includes a new venet implementation for network virtualization, and aims to be optimized for Real Time workloads with low latency requirements. The figures are based upon 2.6.31-rc4, and show example response times of 56.8us vs. 29.8us native as opposed to 4016.0us for existing KVM instances running with virtio. Greg renames virtio to “virtio-u” since he is aware of the new in-kernel virtio server and plans to update the figures once he is able to compare against the “virtio-k” code posted recently. He posted some “3Dish” graphics that seemed to disturb one reader to the point of ranting about the evils of 3D graphics (somewhat harsh for a simple bar chart).

Asynchronous device suspend and resume. Rafael J. Wysocki posted a three part patch series intended to run the suspend and resume callbacks provided by device drivers asynchronously on events such as suspend to RAM.

Kbuild. Catalin Marinas posted a nice patch to kbuild that implements reverse dependency tracking for selected options. With this patch, an option cannot be selected if any of its direct dependencies are not met.
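The rule described above can be modelled in a few lines. This is a toy sketch in Python, not how kbuild itself is implemented; the option names, the dictionary layout, and the can_select helper are all hypothetical.

```python
def can_select(option, depends_on, enabled):
    """True if every direct dependency of `option` is currently enabled."""
    return all(dep in enabled for dep in depends_on.get(option, ()))


# Hypothetical option names, for illustration only.
DEPENDS_ON = {"DRIVER_X": ["BUS_Y", "FEATURE_Z"]}
```

With only BUS_Y enabled, can_select("DRIVER_X", DEPENDS_ON, {"BUS_Y"}) is False, so the selection would be refused until FEATURE_Z is also turned on.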

In today’s miscellaneous items: a number of V4L/DVB fixes from Mauro Carvalho Chehab, a request for development of Kprobes and Kretprobes support in performance counters from Frederic Weisbecker, a fix for blktrace from Jens Axboe (fixing a double removal of a debugfs directory causing a crash), Rick L. Vinyard asked about tracking changes to exported attributes in sysfs (to which Kay Sievers replied that what he wants doesn’t exist “ouf of the box”), some cleanups to the tracepoint-analysis documentation (based upon feedback from LWN’s Jonathan Corbet) from Mel Gorman – who recently implemented the VM tracepoints, a fix for an O_DIRECT oops in NFS from Trond Myklebust, a new version of the “send callback when swap slot is freed” patch from Nitin Gupta, a git pull request implementing various performance counters code refactoring from Frederic Weisbecker, some libata fixes from Jeff Garzik, version 3 of the automatic crashkernel size calculation boot parameter patch from Amerigo Wang, some XFS updates for 2.6.31-rc6 from Felix Blyakher, another version of the kfifo patches from Stefani Seibold, and some sound fixes from Takashi Iwai.

Finally today, David Wuertele notes some difficulty in creating readonly root filesystems using initramfs. He would like to know how to do so but his tests are failing and the documentation doesn’t provide any detail – perhaps someone can help him out with an explanation.

In today’s announcements: linux-2.6.31-rc5-rt1.2. John Kacur announced version 2.6.31-rc5-rt1.2 of the -rt kernel patchset, which is an “unofficial” tree (although with implicit blessing nonetheless) intended to avoid the RT patch generating bitrot while Thomas Gleixner and others work on new RT features. The development (though maybe not this tree yet) removes the boot warning for options that might hurt performance in the case that ftrace is built with dynamic ftrace support rather than static. This paves the way for having ftrace built into the kernel by default, rather than optionally doing so.

The latest kernel release is 2.6.31-rc5, which was released over a week ago.

Andrew Morton posted an mm-of-the-moment for 2009-08-12-13-55.

Stephen Rothwell posted a linux-next tree for August 12th. Since Tuesday, the linux-next tree has moved to a new, more official location on git.kernel.org (symlinks will redirect from the old location), and the v4l-dvb tree lost its conflicts. The sub-tree count remains steady at 140 trees.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.


2009/08/11 Linux Kernel Podcast

August 13th, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090811.mp3

For Tuesday, August 11th, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Kexec, KVM, RTC, and VGA.

Kexec. Amerigo Wang posted two interesting patches for kexec. The first implements the display of a loaded crash kernel’s memory section information in /proc/iomem, while the second allows one to shrink the reserved memory for a crash kernel on an already running system if it is more than enough. For example, if you already had reserved 128MB, but only needed 100MB, you can simply write into sysfs (/sys/kernel/kexec_crash_size) to reclaim 28MB.
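The arithmetic in that example, and the write that would follow, can be sketched as below. This is an illustration in Python rather than anything from the patch; the helper names are invented, and it assumes the sysfs file takes the new size in bytes, which the summary above does not state.

```python
MIB = 1024 * 1024
CRASH_SIZE_PATH = "/sys/kernel/kexec_crash_size"  # path quoted above


def reclaimed_mib(reserved_mib, needed_mib):
    """How many MiB come back when the reservation shrinks to `needed_mib`."""
    if needed_mib > reserved_mib:
        raise ValueError("the reservation can only be shrunk, not grown")
    return reserved_mib - needed_mib


def shrink_crash_reservation(new_size_mib, path=CRASH_SIZE_PATH):
    """Write the new size to sysfs (assumed unit: bytes). Needs root."""
    with open(path, "w") as f:
        f.write(str(new_size_mib * MIB))
```

For the case in the text, reclaimed_mib(128, 100) gives 28, and shrink_crash_reservation(100) would be the corresponding write on a kernel carrying the patch.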

KVM. Michael S. Tsirkin posted version 2 of a 2 part patch series implementing a kernel-level virtio server. The main motivation for this effort is to reduce the virtualization overhead for virtio by removing system calls in the data path, without changing the guest system. As he says, for virtio-net, this removes up to 4 system calls *per packet*, which is a very significant performance improvement and should lead to some nice benchmarks. This version has only a few minor improvements from the previous one, such as moving rather than copying fs/aio.c, and removing some debug logging.

RTC. Feng Tang posted an RFC patch series implementing a new generic rtc_ops struct for x86 systems. As Feng points out, most x86 systems get their time keeping information from a Motorola 146818-like RTC device, EFI, or even virtualization (these come in via get_wallclock/set_wallclock) but in the future there will be other mechanisms also and so Feng implements the ability to register different RTC sources in a generic fashion.

VGA. Dave Airlie posted a patch series that had originally come in from Tiago Vignatti, aimed at implementing VGA arbitration on systems using “legacy” VGA devices. As Dave says, the Resource Access Control (RAC) module inside the X server currently does the task of arbitration when more than one legacy device co-exists on the same machine, but a problem happens when different userspace clients attempt to do the same and so an arbitration mechanism that is independent of the X server is really needed.

In today’s miscellaneous items: an ACPI event notifier for AC/DC connect/disconnect events from Mark Langsdorf, a number of tracing fixes from Frederic Weisbecker that include Jason Baron’s syscall name to number mapping function, some wireless fixes from John Linville, some OCFS2 fixes from Joel Becker, version 2 of the new Winbond IR driver from David Hardeman, a patch allowing architectures (for example, SPARC) to override the default check_for_illegal_area function if it doesn’t work reliably from Joerg Roedel, a fix for userland ABI breakage in gnet_stats_basic that is passed via netlink from Michael Spang, version 4 of the “Help Resource Counters Scale better” patch series from Balbir Singh (which Prarit Bhargava confirmed improved a kernel compile time by around 30 seconds), a patch fixing CPUCLOCK_PROF and CPUCLOCK_VIRT timer precision from Stanislaw Gruszka (who notes that few people use these, but they should probably still be fixed for anyone who does – his posting includes a reproducer), a patch “constifying” various seq_operations structs from James Morris, a patch to print AMD virtualization features such as NPT, LBRV, SVML, and NRIPS in /proc/cpuinfo from Joerg Roedel, some IPC semaphore improvements (aimed to improve the O(n^2) behavior with n waiting processes) from Nick Piggin, a patch disabling cpufreq on 32-bit PowerPC systems from Bastian Blank, a patch adding cache miss and cache references events to performance counters on Pentium-M systems from Ingo Molnar, and a question from Frans Pop (of Ted T’so) as to what happened to the “data=guarded” patches Chris Mason had proposed in April for ext3.

Finally today, Luis R. Rodriguez inquired as to whether there exists a “typedef” removal tool. Presumably this would be a script or program that would look for typedefs and remove or replace them intelligently. If anyone knows of such a tool, do let Luis know about it also.

The latest kernel release is 2.6.31-rc5, which was released over a week ago.

Zdenek Kabelac posted to let everyone know that he is getting a “complete system freeze during reboot – usually just after iptables frees modules”. He posted a kconfig indicating that he is running 2.6.31-rc5. Also, Catalin Marinas posted to let everyone know that LTP on 2.6.31-rc5 on ARM with root NFS generates an oops in __put_nfs_open_context when running diotest4.

Stephen Rothwell posted a linux-next tree for August 11th. Since Monday, the nfsd, kvm, rr, and staging trees all lost conflicts and/or build failures. The total sub-tree count is steady today at 140 trees.

That’s a summary of today’s Linux Kernel Mailing List traffic. For further information, visit www.kernel.org. I’m Jon Masters.

2009/08/10 Linux Kernel Podcast

August 13th, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090810.mp3

For Monday, August 10th, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: CGroups, Ftrace, Modules, RCU, Spinlocks, Swap, System Calls, TTY, and VM.

CGroups. Ben Blum posted version 3 of a 7 part patch series implementing support for a “cgroup.procs” file that allows the user to quickly display all the unique thread group IDs in a particular cgroup, as well as move a collection of existing threads sharing the same thread group ID into a particular cgroup.
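
To make the difference between listing every thread and listing each process only once a little more concrete, here is a minimal user-space sketch in Python. It is not taken from Ben's patches: the function names are hypothetical, and it assumes a Linux system where /proc/<pid>/task lists a process's threads and each thread's status file carries a “Tgid:” line.

```python
import os
import threading


def thread_ids(pid):
    """Return the sorted thread IDs of one process, read from /proc/<pid>/task."""
    return sorted(int(name) for name in os.listdir(f"/proc/{pid}/task"))


def thread_group_id(pid, tid):
    """Return the thread group ID (Tgid) recorded for one thread of a process."""
    with open(f"/proc/{pid}/task/{tid}/status") as status:
        for line in status:
            if line.startswith("Tgid:"):
                return int(line.split()[1])
    raise ValueError(f"no Tgid line found for thread {tid}")


def unique_thread_group_ids(pid):
    """Collapse the per-thread view of a process into unique thread group IDs."""
    return sorted({thread_group_id(pid, tid) for tid in thread_ids(pid)})
```

For a multi-threaded process, thread_ids() returns one entry per thread, while unique_thread_group_ids() collapses them to a single entry, which is the kind of per-process view the summary above describes.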

Ftrace. John Reiser complained that recordmcount, which is run during the kernel build against every .o object as a means to extract mcount data for use with the dynamic function patching code in ftrace, can add many minutes to a full kernel compile. He suggests that the problem is in repeated calls to “ld -r”, which can be batched into one call based on the output from recordmcount or the other way around. Either way, he says, the data output is the same. He was concerned that his 900 line “recordmcount.c” replacement might be too long for the mailing list (perhaps he has not seen the size of some patches) but will likely be persuaded to send it if the developers are interested.

Modules. Eric Paris posted requesting thoughts on how permission checks are currently implemented on request_module(), and whether they make sense. As he says, request_module() is used to request that the kernel helper thread spawn a modprobe userspace process to do a module load. It is called in a number of places within the kernel (apparently, approximately 128 unique callsites) and only three check to see if the requesting process has some sort of module loading permissions (CAP_SYS_RAWIO). Amongst the suggestions, Eric would like to see the request_module() code perform this security check for itself. Also on the subject of modules, Ozan Caglayan posted version 2 of a recent patch implementing a fix in the markup_oops script that will use modinfo to look up module information when the EIP within an oops is within a module that has a “-” instead of a “_” in its name. This is a semi-frequent occurrence with module naming, so the fix should avoid confusion.
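
The “-” versus “_” mismatch is easy to picture with a small sketch. The Python below is only an illustration of the naming problem, not the markup_oops change itself (that script is written in Perl and uses modinfo); both function names are hypothetical.

```python
def module_name_variants(name):
    """Return the spellings under which a module might be known.

    A loaded module is reported with underscores, while the file on disk
    may use dashes, so both spellings are worth trying in a lookup.
    """
    underscored = name.replace("-", "_")
    dashed = name.replace("_", "-")
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    return list(dict.fromkeys([name, underscored, dashed]))


def same_module(a, b):
    """Compare two module names, treating "-" and "_" as equivalent."""
    return a.replace("-", "_") == b.replace("-", "_")
```

A lookup tool could try each variant in turn until one of them resolves, rather than giving up on the first spelling.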

RCU. Martin Schwidefsky had posted on Friday evening concerning a 2.6.30 system that was hanging due to a bad interaction between RCU and NOHZ. Paul McKenney followed up today with a congratulatory reply saying, “Congratulations, Martin! You have exercised what to date has been a theoretical bug identified last year by Manfred Spraul. This fix is to switch from CONFIG_RCU_CLASSIC to CONFIG_RCU_TREE, which was added in 2.6.29”. Martin replied that SLES11 uses 2.6.27 and classic RCU, and he believes the bug is present there also, so it does need to be fixed. On an only marginally related note, Martin also mentioned that he is working on NOHZ some more to improve delay performance by not having a CPU go fully tickless if it did some work in the last timer tick (which causes an unnecessary timer tick if the CPU goes truly idle, but which he thinks will generally improve performance – and Martin requests comments on this approach from the wider LKML Congress).

Spinlocks. Heiko Carstens posted an RFC patch series allowing inlined spinlocks once again, since this apparently can lead to a 1%-5% speedup on some (s390 in this particular case) systems under certain workloads. The patch introduces CONFIG_SPINLOCK_INLINE as a conditional selector for this feature.

Swap. Nitin Gupta posted an RFC patch implementing a callback function whenever a swap slot is freed, for use on (in this example) systems with compressed RAM devices backing the swap device, allowing the memory to be instantly freed rather than when the “swap discard” bio is eventually processed by the block layer. Apparently, this is “essential” for the “compcache” project to which he posted a link.
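
As a rough picture of the idea, the toy model below shows a swap area that invokes a callback the moment a slot is freed, so that a backing store can drop its copy immediately. It is a user-space sketch in Python with hypothetical class and method names; it does not reflect the kernel interfaces in Nitin's patch.

```python
class CompressedStore:
    """Toy stand-in for a compressed RAM device backing swap."""

    def __init__(self):
        self.pages = {}

    def store(self, slot, data):
        self.pages[slot] = data

    def slot_freed(self, slot):
        # Called as soon as the swap slot is released: drop the data now.
        self.pages.pop(slot, None)


class SwapArea:
    """Toy swap area that notifies an optional callback when a slot is freed."""

    def __init__(self, on_slot_free=None):
        self.on_slot_free = on_slot_free
        self.in_use = set()

    def allocate(self, slot):
        self.in_use.add(slot)

    def free(self, slot):
        self.in_use.discard(slot)
        if self.on_slot_free is not None:
            self.on_slot_free(slot)
```

Without a callback registered, the store in this model keeps holding the page after the slot is freed, which stands in for memory staying occupied until a later discard is processed.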

System Calls. Jason Baron posted an interesting 12 part patch series implementing a runtime system call name-to-number mapping function: one passes in the string name of a system call and it returns the ID (number) of that call. Initially, it is for the syscall event tracer within ftrace, although one can imagine other projects would be interested in picking this up in-kernel.
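
In spirit, such a mapping is a table lookup keyed by name. The Python sketch below illustrates that with a handful of x86-64 system call numbers typed in by hand; the table, the function name, and the handling of a “sys_” prefix are all assumptions made for illustration, and the in-kernel series builds its data quite differently.

```python
# A few x86-64 system call numbers, listed by hand for illustration.
SYSCALL_NUMBERS = {
    "read": 0,
    "write": 1,
    "open": 2,
    "close": 3,
}


def syscall_name_to_nr(name):
    """Return the number for a system call name, or -1 if it is unknown."""
    # Accept both "read" and the kernel-style symbol name "sys_read".
    if name.startswith("sys_"):
        name = name[len("sys_"):]
    return SYSCALL_NUMBERS.get(name, -1)
```

Returning -1 for an unknown name is simply one convention a caller could test for; raising an exception would be an equally reasonable choice in Python.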

TTY. The ongoing saga with the TTY layer came up again today (but only marginally). Artur Skawina noticed a ^S/^Q sequence resulted in data loss within his xterm. That seemed to be caused by a recent commit that had removed a check for tty->stopped in pty_write_buffer() for “no clear reason”, according to Linus Torvalds, who posted a patch that fixed the problem for Artur.

VM. Bill Speirs noticed a problem with VMA merging. The Linux VM uses VMAs (Virtual Memory Areas) to represent ranges of pages allocated to a task, complete with their protections and flags. A typical task has a number of different VMAs representing load code, library functions, program text, data, and so forth. Typically, the kernel will coalesce adjacent VMA regions if they share contiguous (virtual) memory and protection. However, in the case Bill cited, where he maps three pages with PROT_NONE and then sets the middle one to PROT_WRITE protection before setting it back, the kernel fails to merge these three pages back into a single VMA. The failure does not occur if the same experiment is done using PROT_READ. Bill sees this issue because he is in reality mapping 200,000+ pages, and rapidly changing permissions is causing him to exceed the max_map_count limit. This is worthy of investigation.
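
For anyone who wants to try the three-page experiment, here is a minimal sketch in Python. It is not Bill's test program: it assumes a Linux system with /proc/self/maps, calls mmap(2), mprotect(2), and munmap(2) from the C library through ctypes, and uses hypothetical helper names. What it reports after the protection is restored depends on the kernel it runs on.

```python
import ctypes
import mmap
import os

PROT_NONE = 0  # "no access"; defined here rather than taken from the mmap module

libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.mprotect.restype = ctypes.c_int
libc.mprotect.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]
libc.munmap.restype = ctypes.c_int
libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

MAP_FAILED = ctypes.c_void_p(-1).value


def count_vmas(start, length):
    """Count the lines of /proc/self/maps that overlap [start, start + length)."""
    end = start + length
    count = 0
    with open("/proc/self/maps") as maps:
        for line in maps:
            low, high = (int(part, 16) for part in line.split()[0].split("-"))
            if low < end and high > start:
                count += 1
    return count


def set_protection(addr, length, prot):
    """Call mprotect(2) and raise OSError if it fails."""
    if libc.mprotect(addr, length, prot) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def vma_counts(middle_prot=mmap.PROT_WRITE):
    """Map three PROT_NONE pages, flip the middle one, then flip it back.

    Returns the number of VMAs covering the range before the change,
    while the middle page differs, and after it has been restored.
    """
    page = mmap.PAGESIZE
    length = 3 * page
    addr = libc.mmap(None, length, PROT_NONE,
                     mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS, -1, 0)
    if addr is None or addr == MAP_FAILED:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    try:
        before = count_vmas(addr, length)
        set_protection(addr + page, page, middle_prot)
        during = count_vmas(addr, length)
        set_protection(addr + page, page, PROT_NONE)
        after = count_vmas(addr, length)
    finally:
        libc.munmap(addr, length)
    return before, during, after
```

Calling vma_counts() and then vma_counts(mmap.PROT_READ) gives the two cases contrasted above; a final count that stays above the initial one means the pages were not merged back into a single VMA.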

In today’s miscellaneous items: a power management fix (removing a run-time warning) from Rafael J. Wysocki, some performance counters fixes from Ingo Molnar (who states that he hopes it is still fine to make a few changes, but is willing to trim the patchset down to minimal changes if Linus prefers), the usual round of other updates from Ingo (x86, irq), some PCI fixes from Jesse Barnes, version 6 of a patch series adding trace events to the page allocator from Mel Gorman (who requests a “yey or nay” on whether these should be merged), a memory leak in security/selinux/hooks.c, identified by “iceberg” (which is about as useful as calling yourself only “debiandeveloper” or one of the many other nickname-only posters on LKML) and later patched in a posting from James Morris, version 2 of a VFS patch converting superblock s_maxbytes to an loff_t, a patch giving waitqueue spinlocks their own lockdep classes when they are initialized from init_waitqueue_head() from Peter Zijlstra by way of David Howells, who needed it to address a lockdep false positive situation in CacheFiles, a powerpc fix that allows “direct” DMA (non-iommu) to work for devices that have a < 32-bit DMA mask when the machine simply does not have enough memory to go over the chip addressing limit from Ben Herrenschmidt, a patch implementing vhost, a kernel-level virtio server, from Michael S. Tsirkin, and a rethink of command line precedence on MicroBlaze.

Finally today, Ted T’so posted an update to the Kconfig description for EXT3_DEFAULTS_TO_ORDERED better explaining the tradeoffs in terms of journal options on ext3, which he says has been vetted by the developers as being more informative for users. Hopefully, some users will agree with that assertion.

The latest kernel release is 2.6.31-rc5, which was released over a week ago.

Matthias Dahl reported an oops in 2.6.31-rc5-git5 in kmem_cache_alloc and Eric Paris noticed a NULL pointer dereference in kmemcheck in linux-next. There was also some whining that ARM doesn’t test with “randconfig” builds that often.

Stephen Rothwell posted a linux-next tree for August 10th. Since Friday, there are two new trees added – ide and hwpoison (the old ide became ide-current). The nfsd and drm trees gained conflicts, while the trivial tree lost its conflict. Given the two new tree additions, there are now 140 sub-trees. Stephen reminded Andi Kleen (author of HWPOISON) that linux-next is intended only for patches “destined for the next merge window”, which Andi affirmed.

That’s a summary of today’s Linux Kernel Mailing List traffic. For further information, visit www.kernel.org. I’m Jon Masters.
