Archive for August 21st, 2009

2009/08/20 Linux Kernel Podcast

August 21st, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090820.mp3

The UNiplexed Information and Computing Service (UNIX) turns 40 this month. How many of us were around back in the days of Woodstock? Not this author.

For Thursday, August 20th, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Lazy workqueues, mailing lists, O_DIRECT, and TuxOnIce.

Lazy workqueues. Jens Axboe posted in followup to a previous rant about the number of kernel threads that had been running on his system (all 531 – really – of them). He preferred keeping the workqueue interface rather than redoing it yet again with some new wheel-reinventing scheme. Jens adds lazy workqueues, which behave like the existing code, but create only one core kernel thread per online CPU that shares the responsibility of providing context for all lazy work not otherwise assigned its own thread.
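As a rough userspace model of the idea (not Jens' kernel code), the sketch below has several workqueue objects that keep a queue_work()-style interface while sharing a single worker thread. The class and method names are hypothetical and chosen only for illustration.

```python
import queue
import threading


class SharedWorker:
    """One thread that provides execution context for many lazy workqueues."""

    def __init__(self):
        self._items = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            func = self._items.get()
            if func is None:          # sentinel: stop the worker
                break
            func()

    def submit(self, func):
        self._items.put(func)

    def stop(self):
        # The sentinel is queued behind all pending work, so it drains first.
        self._items.put(None)
        self._thread.join()


class LazyWorkqueue:
    """Keeps a familiar queue_work()-style interface but owns no thread."""

    def __init__(self, name, worker):
        self.name = name
        self._worker = worker

    def queue_work(self, func):
        self._worker.submit(func)


def demo():
    worker = SharedWorker()           # stands in for one per-CPU kernel thread
    queues = [LazyWorkqueue(f"wq{i}", worker) for i in range(5)]
    ran_on = []
    for wq in queues:
        wq.queue_work(lambda name=wq.name: ran_on.append(
            (name, threading.current_thread().ident)))
    worker.stop()
    return ran_on
```

Calling demo() runs work from five queues on one thread, which is the thread-count saving the patch is after; a workqueue that needs its own dedicated context would simply not be created as a lazy one.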

Mailing Lists. In another mail along the “should we move to vger?” lines, Roland Dreier solicited opinions on moving the Linux InfiniBand/RDMA mailing list over to vger.kernel.org. Largely, the impetus seems to be that the existing list on openfabrics.org is closed to posts from non-subscribers, and, just like the recent discussion concerning the ARM Linux kernel list, many would prefer to have a list that was open to posts from non-members (especially as that allows easy cross-posting of topics with the LKML).

O_DIRECT loop devices. Jens Axboe and Alan D. Brunelle had a back and forth concerning some metrics Alan had collected in test runs of Jens’ patch, which aims to unify O_DIRECT handling to allow loopback device data writes to proceed directly to backing storage without hitting the page cache. Alan’s test runs (available as a large PNG) show a huge drop in performance for POSIX AIO random and sequential writes (half way down the graphic). This isn’t unusual for a patch at an early stage of testing and development.

TuxOnIce. What to do? Nigel Cunningham posted to let everyone know that after his most recent attempt to get TuxOnIce merged (apparently, this is “something like the third time” he has tried to do this by now), there had been an interim agreement that he and Rafael would work on getting functionality merged bit by bit. Alas, both are busy with other things and do not have enough time for the effort, and so Nigel proposes three possibilities. First, he’d like to know if someone would like to improve the existing swsusp code (taking bits from TuxOnIce if they deem it appropriate) without help from him. Second, he’d like to know whether someone would take over TuxOnIce maintainership. Finally, he’d like to know if there are any better ideas that have not occurred to him.

In today’s miscellaneous items: some multi-node processor scheduling fixes from Andreas Herrmann, some input updates for 2.6.31-rc5 from Dmitry Torokhov, a series of NFS bug reports from Fengguang Wu in which recent kernels would suddenly return access denied errors and/or cause kernel panics in nfs_release, an eloquently phrased patch to the PCI DMAR code for the case of a DMAR returning all ones from David Woodhouse informing certain BIOS vendors that they had further lowered his already unprintable opinion of closed source BIOSes and BIOS engineers, a patch from Kamezawa Hiroyuki aimed at better aligning percpu counters, a device table update from Mario Schwalbe adding support for Apple models MacBook 5,1, MacBook Pro 5,1, MacBook Pro 5,2, and MacBook Pro 5,5 (Apple has a tendency to use really stupid model numbering conventions and always has), additional support for cut_here in AFS, CacheFiles, FS-Cache and RxRPC from David Howells such that these filesystems and caching services will display some useful diagnostic information as an accompaniment to a BUG() report (for which he also posted a patch implementing disconnected use of cut_here), some error handling fixes from Florian Tobias Schandinat for the framebuffer drivers implementing support for the error code possibly returned by fb_set_par that was being silently ignored by fbmem.c and fbcon.c, a fix to “reservetop” kernel boot parameter handling from Xiao Guangrong, a fix from Jan Beulich to the target specifications in arch/x86/boot/compressed/Makefile such that vmlinux.lds is included and will not cause a number of files to be pointlessly rebuilt on each kernel compilation if they are already up-to-date, some sound (HD-audio) fixes from Takashi Iwai, some additional wireless patches for 2.6.32 from John Linville, a suggestion from Balbir Singh that his scalability fixes for root overhead in memory cgroup controllers be merged for 2.6.31 rather than holding off to 2.6.32, version 3 of a patch series from Jason Wessel implementing various EHCI and earlyprintk improvements for attached devices, a fix for a theoretical deadlock involving the del_timer_sync inside cancel_delayed_work from Roland Dreier, and some DRM fixes from Dave Airlie.

Finally today, Frans Pop reported a concern that he was getting a cryptic looking error message that related to his PCI hardware not supporting the Advanced Error Reporting (AER) feature of recent devices. It’s unfortunate that the error result from pci_enable_pcie_error_reporting would lead to such an unhelpful error message in the system logs.

The latest kernel release is 2.6.31-rc6, which was released on August 14th.

Krzysztof Halasa posted saying that he believes he has worked out what was causing the strange network timeouts in 2.6.30.5. He believes the problem lies with network descriptors being allocated non-coherently using a streaming allocation that fails on x86 with swiotlb because swiotlb has no concept of a “dirty” flag and so doesn’t know when to flush. Apparently, there is no other fix than converting the allocations over to coherent forms in post-2.6.31.

Dinakar Guniguntala concurred with John Stultz that he was also seeing an issue with recent 2.6.31 RT kernels in which all tasks would end up bound to a single CPU due to some kind of regression in the SMP scheduler behavior.

Eric W. Biederman reported a NULL pointer dereference bug in 2.6.31-rc6 with an overrun backtrace containing a recent call to lapic_next_event and run_timer_softirq.

Andrew Morton posted an mm-of-the-moment for 2009-08-20-19-18.

Stephen Rothwell posted a linux-next tree for August 20th. Since Wednesday, the drm tree gained 3 conflicts while the fsnotify, drbd, tip and the usb trees all lost build failures and conflicts. The total sub-tree count is steady today at 140 trees in the latest linux-next compose.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

Categories: episodes Tags:

2009/08/19 Linux Kernel Podcast

August 21st, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090819.mp3

For Wednesday, August 19th, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: Config/SysFS, Cpuidle, O_SYNC, Performance Counters, Spinlocks, and x86.

Config/SysFS. Avi Kivity posted concerning some issues he has with “all the text based pseudo filesystems that the kernel exposes”. His main concern being that the kernel development community is “optimizing for the active sysadmin, not for libraries and management programs”. On a lower level, he is concerned about a number of specifics, including efficiency of open/read/close actions, atomicity of having to read multiple files that may be changing in order to capture specific system state information, the ambiguous format of attributes, lifetime and access control concerns, notification of change in attributes, and readdir support being “painful”. Avi says that “I don’t think a lot of effort is needed to make an extensible syscall interface just as usable and a lot more efficient than config/sysfs”, to which Ingo Molnar suggested that such an implementation was available in the form of the mechanism used by the performance counters code perf_counter_open system call, which does such things as passing an embedded .size field so that the data structure exchanged with userspace can change in size later on (embedded ABI protection). Avi replied that he had seen this and that it was “nice”. A number of others expressed frustrations at the current interfaces, so it will be interesting to see whether this turns into anything more concrete.
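To make the embedded size field idea concrete, here is a minimal sketch of a receiver that accepts both older (shorter) and newer (longer) versions of a binary structure. The two layouts and the field names (type, config, sample_period) are hypothetical stand-ins, not the actual performance counters ABI, and the acceptance rules are one reasonable policy rather than a description of what the kernel does.

```python
import struct

# Hypothetical layouts: v1 is (size, type, config); v2 appends sample_period.
V1 = struct.Struct("<IIQ")     # 16 bytes
V2 = struct.Struct("<IIQQ")    # 24 bytes, the newest layout this receiver knows


def parse_attr(buf):
    """Decode an attribute blob whose first field states its own size."""
    (size,) = struct.unpack_from("<I", buf, 0)
    if size < V1.size or size > len(buf):
        raise ValueError("bad size field")
    if size > V2.size and any(buf[V2.size:size]):
        # The caller is newer than us and set fields we do not understand.
        raise ValueError("unknown trailing fields are non-zero")
    # Zero-fill what an older caller did not supply; drop a newer caller's
    # all-zero tail.
    known = bytes(buf[:min(size, V2.size)]).ljust(V2.size, b"\0")
    _, type_, config, sample_period = V2.unpack(known)
    return {"type": type_, "config": config, "sample_period": sample_period}
```

An old caller passing a 16-byte V1 blob gets sample_period defaulted to zero, while a caller built against a larger future layout is still accepted as long as the fields this receiver does not know about are left at zero.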

Cpuidle. Arun R Bharadwaj posted a two part patch series implementing cpuidle infrastructure support for powerpc systems. This not only allows powerpc systems to save power by selectively entering “snooze” and “nap” states when the kernel cpuidle code deems it appropriate, but also provides tpmd_idle, which adds support for Thermal and Power Management idling.

O_SYNC. Jan Kara posted a seventeen part patch series entitled “Make O_SYNC handling use standard syncing path” that aims to unify O_SYNC handling with the existing code that implements fsync(). After this patch series is applied, there is just one place where handling for forcing file commits to disk is implemented, making life easier for filesystem code. The patch touches a lot of filesystems and is probably going to need some fairly hefty testing.
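From userspace, the two routes to durable writes that the series aims to unify look like this. The sketch uses only standard os module calls on a POSIX system; the file names and helper names are arbitrary.

```python
import os
import tempfile


def write_with_o_sync(path, data):
    """O_SYNC: each write() returns only after the data has been committed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC, 0o600)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)


def write_then_fsync(path, data):
    """Ordinary buffered write followed by one explicit fsync()."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)


def demo():
    with tempfile.TemporaryDirectory() as d:
        a, b = os.path.join(d, "a"), os.path.join(d, "b")
        write_with_o_sync(a, b"hello\n")
        write_then_fsync(b, b"hello\n")
        with open(a, "rb") as fa, open(b, "rb") as fb:
            return fa.read(), fb.read()
```

Both helpers leave the same bytes on disk; the difference is only in when the caller is made to wait, which is why sharing one kernel implementation between them is attractive.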

Performance Counters. Everyone’s worried about information leakage and security at the moment, and Peter Zijlstra had previously noted the risk for information leakage through performance counters metrics. He posted another version of a patch series changing the default permissions on performance counters (disallowing regular users from creating cpu-wide counters), and causing any samples to have anonymized kernel IPs (Instruction Pointers) in the case that they are being collected by an unprivileged user.

Spinlocks. Discussion continued surrounding the meaning and purpose of spin_is_locked() as applied to uniprocessor systems. Thomas Gleixner had suggested that it should always return true, whereas Peter Zijlstra, Linus, and others had pointed out problems with this logic. In the end Peter suggested that the best idea might be for spin_is_locked to be a synonym for panic(). As I mentioned previously, Linux Weekly News has an excellent writeup in the latest edition, so it’s worth referring to that for more detail.
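One way to see why a constant answer is awkward is the toy model below. It assumes (my reading, not something spelled out in the summary above) that a uniprocessor spinlock keeps no state, and that callers use spin_is_locked() in two opposite ways: to assert the lock is held, and to assert it is free. All names are hypothetical.

```python
class UPSpinlock:
    """Model of a uniprocessor spinlock: lock/unlock keep no state at all."""

    def __init__(self, is_locked_answer):
        self._answer = is_locked_answer   # the constant is_locked() returns

    def lock(self):
        pass          # nothing to spin on with a single CPU

    def unlock(self):
        pass

    def is_locked(self):
        return self._answer


def must_hold(lock):
    """Caller style 1: 'I expect to be inside the critical section.'"""
    lock.lock()
    ok = lock.is_locked()
    lock.unlock()
    return ok


def must_be_free(lock):
    """Caller style 2: 'nobody should hold this lock right now.'"""
    return not lock.is_locked()


def check(answer):
    lock = UPSpinlock(answer)
    return must_hold(lock), must_be_free(lock)
```

Here check(True) satisfies the first caller and fails the second, and check(False) does the reverse, so no constant keeps both styles of assertion happy.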

x86. Jan Beulich noted that according to gcc’s instruction selection, inc/dec instructions can be used without a performance penalty on most x86 CPU models, but should be avoided on others. Hence he suggests (and posts a patch for) selectable configuration of inc/dec instruction use depending upon the CPU models that are being targeted by a given x86 build.

In today’s miscellaneous items: another version of the CLOCK_REALTIME_COARSE patch that adds a fast but not very fine-grained timestamp from John Stultz, version 0.5 of the new kfifo API implementation from Stefani Seibold, a patch from Bartlomiej Zolnierkiewicz removing the mailing list for ncpfs from MAINTAINERS, a patch from Miguel Boton moving the many different alignment macros within the kernel into a standard “align.h” header file, yet another round of patches for Compal-made Dell laptops from Mario Limonciello (with special thanks to Alan Jenkins for once again putting a lot of effort into testing and finding some bugs), some minor bug fixes for nilfs2 from Ryusuke Konishi, some documentation updates to AFS from David Howells, a patch from Miroslav Rezanina causing Xen guest kernels booted with a mem= parameter (but nonetheless allocated additional memory in the hypervisor) to return the additional memory back to Xen early in boot, a second batch of 47 KVM updates targeting 2.6.32, a trivial fix for linux-next from Ingo Molnar that adds new tracepoints for syscall_enter and exit on s390 systems (avoiding a build failure otherwise), some microblaze fixes from Michal Simek, version 3 of a patch series from Zhang Rui implementing a standard interface for Ambient Light Sensors (ALS), a patch adding syscall filtering support for ftrace events from Li Zefan, version 5 of a patch from Amerigo Wang correcting the semantics for file truncations when both suid and write permissions are set for the user on a given file entry, some DRM fixes from Dave Airlie, and a new version 4 of the vhost kernel-level virtio server from Michael S. Tsirkin that is sure to kick off another round of enjoyable virtualization dialogue.

In today’s announcements: 2.6.31-rc6-rt5. Thomas Gleixner posted the latest version of the preempt-rt kernel, which updates to the latest Linus git tree, makes IPI handlers unthreaded on PowerPC (pseries), and fixes a problem with cgroup memcontrol preemption.

The latest kernel release is 2.6.31-rc6, which was released on August 14th.

Rafael J. Wysocki posted a list of regressions from 2.6.29 to 2.6.30 and from 2.6.30 to 2.6.31-rc6-git5 for which there are no fixes in mainline that he is currently aware of. The regression list has not increased dramatically, and most of the bugs seem to have driver specific or suspend/resume roots.

Walt Holman posted saying that he is experiencing some “periodic timeouts” with kernel 2.6.30.5 and Simon Kirby noticed how a “storage head box” also running 2.6.30 would occasionally get stuck allocating memory to send a packet for up to several seconds (visible watching sshd getting stuck), blocking on a mutex named iprune_mutex called from prune_icache in fs/inode.c. He made some suggestions about converting to a try_lock in that code and so forth. Finally, Steven Rostedt posted a series of lockups in the IPI code on recent kernels.
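Simon's try_lock suggestion follows a common pattern: if the lock is already held, skip the optional work instead of stalling the caller. The sketch below shows that pattern with a userspace lock standing in for iprune_mutex; it is not the code in fs/inode.c, and the function names are hypothetical.

```python
import threading

iprune_lock = threading.Lock()   # stands in for iprune_mutex


def prune_blocking(work):
    """Waits for the lock, however long the current holder takes."""
    with iprune_lock:
        work()
    return True


def prune_trylock(work):
    """Skips pruning if someone else is already doing it."""
    if not iprune_lock.acquire(blocking=False):
        return False          # contended: back off rather than stall the caller
    try:
        work()
    finally:
        iprune_lock.release()
    return True
```

The trade-off is that a skipped prune frees no memory on that attempt, so a caller using the trylock form needs some other way to make progress or to retry later.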

Stephen Rothwell posted a linux-next tree for August 19th. Since Tuesday, the mips, omap, and suspend trees lost their issues, while the tip and usb trees gained some conflicts. The total sub-tree count remains steady at 140 trees.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

2009/08/18 Linux Kernel Podcast

August 21st, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090818.mp3

For Tuesday, August 18th, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: AlacrityVM, Kconfig, Spinlocks, and VM.

AlacrityVM. Today’s ongoing debate about which IO implementation is the fairest of them all saw the discussion head into the realm of DMA. Specifically, participants asked whether vbus supported things like RDMA, how guests are protected from DMA to random host memory on platforms like PowerPC using a real physical DMA controller with virtio, and similar topics.

Kconfig. Steven Rostedt, who I know has had his fair share of annoyances with building test kernels, posted some patches aimed at making the process of building test kernels for a particular system much easier by having a build target that will automatically select a configuration covering all modules currently loaded on the test system. Rather than having to build a distribution style test kernel (which takes time), Steven’s patches allow developers to use “make localmodconfig” and “make localyesconfig” to build test kernels featuring only modules actually in use, either built as modules or built into the kernel in the latter case. Thanks a whole bunch, Steven!

Spinlocks. Kumar Gala posted asking whether spin_is_locked behavior is broken on uniprocessor systems. As Linux Weekly News pointed out, the problem here is that, actually, the meaning of spin_is_locked on systems without actual spinlocks being present is somewhat ill-defined. The LWN article does a great job of explaining the issues, so I won’t cover it much further here except to say that it’s likely some new spinlock primitives are coming down the pipe.

VM. I wonder what he’s working on. Jan Beulich posted a range of patches. The first alters handling of num_physpages since memory allocations should depend upon the amount of usable memory, and not just the total PFN count (which may include all manner of non-RAM ranges) in a system. The second builds upon this by replacing various users of num_physpages with totalram_pages. The third migrates the PID hash table over to using alloc_large_system_hash. And the fourth patch from Jan removes use of alloc_bootmem_low where it’s not strictly required for a given system to operate, especially on large 64-bit systems.

Also on a VM note today, Mel Gorman posted a three part RFC patch series aimed at reducing the need to search within the fast path of the low level page allocator by maintaining multiple free-lists in the per-cpu structure. At the time of the original introduction of per_cpu_pages, Mel says that the per-cpu static allocation thereof (recall that dynamic per-cpu structure allocation was recently implemented) resulted in too much wasted memory. But now that this is no longer the case, he is able to add multiple free lists to struct per_cpu_pages, one per migratetype that can be stored on the PCP lists. For the most part, performance testing showed only marginal improvement, except in the case of netperf-udp on x86_64 and sysbench on ppc64, which were higher.
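The difference between one mixed free list and one list per migratetype can be modelled in a few lines. This is a toy illustration of the data structure change only, with hypothetical class names and an illustrative subset of migratetypes; it makes no claim about the real allocator's behaviour or performance.

```python
from collections import deque

MIGRATE_TYPES = ("unmovable", "reclaimable", "movable")  # illustrative subset


class SingleListPCP:
    """One mixed free list: allocation has to search for a matching page."""

    def __init__(self):
        self.pages = deque()

    def free(self, page, migratetype):
        self.pages.appendleft((page, migratetype))

    def alloc(self, migratetype):
        # Returns (page, number of entries examined).
        for steps, entry in enumerate(self.pages, start=1):
            if entry[1] == migratetype:
                self.pages.remove(entry)
                return entry[0], steps
        return None, len(self.pages)


class PerTypePCP:
    """One free list per migratetype: allocation pops the head, no search."""

    def __init__(self):
        self.lists = {mt: deque() for mt in MIGRATE_TYPES}

    def free(self, page, migratetype):
        self.lists[migratetype].appendleft(page)

    def alloc(self, migratetype):
        lst = self.lists[migratetype]
        return (lst.popleft(), 1) if lst else (None, 0)
```

With a mixed list, the cost of an allocation depends on how many pages of other types sit in front of the one wanted; with per-type lists it is constant, at the price of a little more memory per CPU for the extra list heads.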

In today’s miscellaneous items: some tracing fixes (to correct broken names in ftrace filters) from Steven Rostedt, another version of Paul Menage’s cgroup memberlist enhancements that add a cgroup.procs file to each cgroup (that contains unique thread group information rather than task IDs), an implementation of ACPI 4.0 power meter support via an extended hwmon sysfs interface from Darrick J. Wong, some irq fixes from Thomas Gleixner (who confirmed that today’s tree “contains really what I want you to pull”, after yesterday’s tree inadvertently had the wrong patch), a fix to the LSM_MMAP_MIN_ADDR (yes, that one) help text from Dave Jones that corrects the default value to 65536 rather than 65535 (which would still fall within the first page on a 4K page system), another version of Jon Hunter’s patch that catches timer wrapping in clocksources and allows 32-bit systems to sleep for longer than 2.15 seconds when using dynamic ticks, two more wireless updates from John Linville, a twelve part patch series aimed at cleaning up __build_sched_domains by making the code “less ugly and more readable” from Andreas Herrmann, version two of yesterday’s page based O_DIRECT implementation from Jens Axboe, a whole bunch of network fixes from David Miller (including a TUN ioctl race fix from Herbert Xu, and a fix to the genetlink data structure that had previously broken userland), version 2 of the patch series adding in-memory-only xattr support on sysfs files from Casey Schaufler, and a trivial “make html” fix for performance counters from Kyle McMartin.

Finally today, Mikael Pettersson posted an intriguingly excessive request. He notes that his laptop hardware “clips disk capacities to 128GB. There’s no BIOS update or BIOS setup option to fix this. Passing libata.ignore_hpa=1 allows the Linux kernel to access larger disks, so Linux does work Ok with larger disks. However the laptop dual-boots Windows (for work-related stuff), and Windows has a major problem: if an entry in the msdos partition table refers to a sector above the BIOS 128GB limit, the Windows kernel crashes an reboots early in its boot sequence”. He goes on to propose adding some kind of sub-partition type that could be somehow hidden from Windows.

In today’s announcements: 2.6.31-rc6-rt4. Thomas Gleixner announced the latest iteration of the preempt-rt patchset (he skipped -rt3 as it failed in testing). This included an update to the “ONESHOT” irq infrastructure Thomas has been working on for mainline inclusion.

The latest kernel release is 2.6.31-rc6, which was released on August 14th.

Christoph Thielecke posted an interesting hard lockup on 2.6.31-rc6, which again seemed to be related to his ongoing Xorg development build testing.

Stephen Rothwell posted a linux-next tree for August 18th. Since Monday, the xfs, fsnotify, and suspend trees gained conflicts while the usb tree lost one. The total sub-tree count remains steady at 140 trees.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.

2009/08/17 Linux Kernel Podcast

August 21st, 2009 jcm No comments

Audio: http://media.libsyn.com/media/jcm/linux_kernel_podcast_20090817.mp3

Did you know there have been over 64,000 downloads of the LKML podcast? That there are 500 listeners per episode? People who listen to the podcast vary from developers to company executives; they listen on their way to work, on the way to school, and even in the bathtub. Thank you for listening.

For Monday, August 17th, 2009, I’m Jon Masters with a summary of today’s LKML traffic.

In today’s issue: AlacrityVM, Clocksource, Discard, and Loop.

AlacrityVM. Well, it was going to happen eventually. Ever since Gregory Haskins first posted patches implementing an alternative virtualized IO framework (replacing virtio in userspace with vbus/venet in kernelspace) and subsequently posted a KVM “fork” using those for virtualizing real time guests, there has been some discontent within the virtualization community at the apparent fracturing of the community. Both Avi Kivity and Ingo Molnar have spoken out against developing two competing implementations within the kernel (rather than getting behind one of them), and the former has been particularly defensive of the existing virtio implementation (which now has a kernelspace implementation available). Although not technically objecting to the work, Ingo’s argument can perhaps be summed up best with this quote: “If virtio pulls even with vbus’s performance and vbus has no advantages over virtio [I] do NAK vbus on that basis. Let’s stop the sillyness before it starts hurting users. Coming up with something better is good, but doing an incompatible, duplicative framework just for NIH reasons is stupid and should be resisted”. Obviously there are different commercial interests at play, but it should be noted that Greg has seemingly tried to navigate around these hurdles. He argues that he has made “every effort” to propose that his patches get integrated within KVM directly and implies that he is only continuing to work on the Alacrity implementation for purely technical reasons of driving up performance. And, it should be added, as Greg points out, his work has already helped to motivate kernelspace virtio efforts.

Clocksource. Stephen Hemminger noticed a regression caused by a recent patch from John Stultz that had aimed to sanity check changes to the active clocksource configuration using the sysfs interface. This change had broken systems built with High Resolution Timers (HRT) but not actually using them.

Discard. After several days’ worth of discussions surrounding the need for compcache (essentially, compressed RAM backed swap) to be made more immediately aware of swap slots becoming free, discussion had turned toward general discard handling. This is a term that refers to the need to educate underlying block devices whenever a block is actually no longer in use by any higher level software abstractions (a filesystem, or swap device, or something more exotic) and has become increasingly relevant in a world where SSD flash devices would love to know when they can actually recycle underlying flash blocks for more effective allocation and wear leveling support. Linux uses the ATA “TRIM” command to educate many of these devices about such events, but that command has a number of unpleasant standard-mandated issues, not the least of which is the inordinate amount of time it can take to complete. Mark Lord posted some rather horrific benchmarks showing how drive firmware successfully lied about the first call to TRIM but subsequent (more real world) calls immediately following a TRIM resulted in hundreds of milliseconds of drive latency. Linux Weekly News had a more exciting and lengthy summary of the troubles with discard, so I recommend reading that article for further detail.

Loop. Jens Axboe, noting that the existing loop implementation (support in the kernel for exposing a file or similar as a block device upon which a higher level filesystem may subsequently be mounted) always uses the page cache regardless of IO requests specifying O_DIRECT, posted a patch implementing page based O_DIRECT on loop devices. His patchset modifies the IO path for all O_DIRECT operations making it page based rather than passing down iovecs, but he cautions that it is “basically a first version so don’t expect too much of it, but it does seem to work fine for me.” NFS was apparently the main difficulty in converting over existing code, and he’s not at all sure that that has been successful – so apply usual caution in testing.

In today’s miscellaneous items: some sh updates from Paul Mundt, a three-patch series adding support for the Dell “Mini” series based upon compal-laptop from Mario Limonciello, some tracing fixes from Frederic Weisbecker, some performance counters and x86 fixes from Ingo Molnar, some performance counters fixes from Peter Zijlstra, a new iteration of the generic hardware breakpoints patchset from K. Prasad, some minor fixes to Microsoft Hyper-V configuration options so that all the sub-component drivers depend upon the base one from Jan Beulich, an IRQ fix from Thomas Gleixner (Linus wasn’t convinced that the right patch was in the git tree), a huge TLB driver example from Alexey Korolev, a suggestion that ESP and EIP values are removed from a task stat file and made available to processes with PTRACE capability, some XFS updates from Felix Blyakher, version 2 of the RDC (a low power 486 like SoC implementation) detection patches from Mark Kelly, a fix to drop write permission on /proc/timer_list and /proc/slabinfo from Amerigo Wang (which Ingo Molnar described as a “good catch”), a new time-source selector allowing one to (for example) specify wallclock times be used in ftrace entries from Zhao Lei, version 4 of a patch fixing file truncation handling when both suid and write permissions are held on a given file entry by Amerigo Wang, a patch to flex_array optimizing hot paths by allowing the compiler to substitute bit shifts for divides on power-of-two size allocated arrays from Dave Hansen, and a patch adding a diagnostic message differentiating between a keyboard vs. non-keyboard triggered sys-b reboot event from Tina Yang.
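The flex_array item above relies on a standard identity: when the number of elements per part is a power of two, locating an element by divide and modulo gives the same result as a shift and a mask. The sketch below checks that identity; the function and parameter names are hypothetical and are not taken from the flex_array code.

```python
def locate_divide(index, elems_per_part):
    """General case: any part size, needs a divide and a modulo."""
    return index // elems_per_part, index % elems_per_part


def locate_shift(index, elems_per_part):
    """Power-of-two part size: shift and mask give the same answer."""
    if elems_per_part <= 0 or (elems_per_part & (elems_per_part - 1)) != 0:
        raise ValueError("elems_per_part must be a power of two")
    shift = elems_per_part.bit_length() - 1   # log2 of the part size
    return index >> shift, index & (elems_per_part - 1)
```

For example, with 64 elements per part, index 130 lands in part 2 at offset 2 either way. In C the compiler can make this substitution itself once it can see that the divisor is a constant power of two, which is what the patch is described as enabling.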

Finally today, a number of kernel developers repeated the point concerning vendors making hardware available for test, in particular suggesting that the Linux Foundation should foot the bill and hand out hardware at conferences like the upcoming Linux Plumbers Conference in order to save on shipping. In his dissenting opinion, James Bottomley reminded everyone that these devices often cost fairly significant amounts of money, but conceded that the Linux Foundation might be a means to distribute otherwise free hardware to those developers in need.

In today’s announcements: Greg Kroah-Hartman announced usbutils version 0.86.

The latest kernel release is 2.6.31-rc6, which was released on August 14th.

Rogerio Brito reported a regression in the hfsplus code affecting rc5. He found that creating a loopback mounted filesystem resulted in data loss.

Stephen Rothwell posted a linux-next tree for August 17th. Since Friday, several trees lost conflicts while the suspend, tip, and sfi trees gained a build failure and conflicts respectively. The total subtree count remains steady at 140 trees in the latest linux-next tree compose.

If you haven’t been to a dentist in a while, I strongly advise you to go. You’ll avoid having your root canal redone twice for good measure.

That’s a summary of today’s Linux Kernel Mailing List traffic, for further information visit www.kernel.org. I’m Jon Masters.
