Tuesday, February 3, 2009

Build a better WAVE reader, pt 2

In the last post, I looked at the beginnings of using an Iteratee to create a WAVE file reader. I achieved a runtime of 25 seconds, compared to 1.9 seconds for my benchmark code.

Oleg's Iteratee code uses a Stream type which internally represents data as Chunk [a]. Haskell lists have many great properties, but they are not particularly efficient for this type of numerical processing. There are many better options, including Arrays, UVectors, and StorableVectors.

Unfortunately all of these options I would want to use introduce a type class restriction on types of elements in the array. This isn't a problem for sound files in particular, although it does greatly limit the utility of the Iteratees in general. One of the Iteratee examples shows how Iteratees can be layered for text processing, using IterateeGM String m a iteratees layered on IterateeGM Char m a iteratees to provide operations on words in a text stream. This is no longer possible, as String is not an instance of the type classes necessary to use one of these more compact data structures. In the end, the performance benefits should be worth it.

I chose StorableVector to replace the basic list, because it seemed handy while I was experimenting. StorableVector turns out to have another good property which is key to achieving the goal of performance close to C.

Adapting the Iteratees to use StorableVector is straightforward. Unfortunately when we've finished, performance is actually worse! In fact, it's as bad as the original RBIO-based version. It turns out that one key function is the culprit: conv_stream.

conv_stream converts between two streams. It is capable of converting multiple elements from one stream to one (or no) elements of the other, using a user-supplied conversion function. The wave file reader uses conv_stream to convert from a stream of Word8's to a stream of Doubles so the user can operate on Doubles, with the library handling all conversions transparently. The original type of conv_stream is:
conv_stream :: Monad m =>
IterateeGM el m (Maybe [el']) -> EnumeratorN el el' m a
Unfortunately, this means our data needs to unpacked from the StorableVector Word8 into separate Word8's, those Word8's assembled into Just [Double], and finally packed into a StorableVector Double for the next Iteratee. This is very inefficient, but easily fixed. Just change the type of conv_stream to
conv_stream :: (Storable el, Storable el', Monad m) =>
IterateeGM el m (Maybe (Vec.Vector el')) -> EnumeratorN el el' m a

and we're done. In fact, the implementation doesn't change at all from the original, just the type. Now no packing or unpacking need take place.

With the new conv_stream, it's necessary to change the conversion function. This is the IterateeGM which converts Word8's into Doubles. I have actually split this code into two parts, a function (called unroller because the original list-based version was a manually-unrolled endian_read) which converts Word8s into Word16 or Word32 as necessary, and a normalization function to normalize the result and convert from Words to Double. The normalization function should be fine as is, all that's necessary to change is the unroller function. Ideally, it would operate on StorableVectors directly. Fortunately this is possible. All that's necessary is to get a ForeignPtr Word8 from a StorableVector, and cast it to ForeignPtr Wordn. A somewhat hackish 16-bit converter can be implemented as follows:
import qualified Foreign.Ptr as FP
import qualified Foreign.ForeignPtr as FFP
import qualified Data.StorableVector as Vec

unroll_16 :: (Monad m) => IterateeGM Word8 m (Maybe (Vec.Vector Word16))
unroll_16 = liftI $ IE_cont step
where
step (Chunk vec)
| Vec.null vec = unroll_16
| Vec.length vec == 1 = liftI $ IE_cont $ step' vec
| Vec.length vec `rem` 2 == 0 = liftI $ IE_done (convert_vec vec) (Chunk $ Vec.empty)
| True = let (h, t) = Vec.splitAt (Vec.length vec - 1) vec
in
liftI $ IE_done (convert_vec h) (Chunk t)
step stream = liftI $ IE_done Nothing stream
step' i (Chunk vec)
| Vec.null vec = liftI $ IE_cont $ step' i
| Vec.length vec `rem` 2 == 1 = let vec' = Vec.append i vec
in
liftI $ IE_done (convert_vec vec') (Chunk $ Vec.empty)
| True = let (h, t) = Vec.splitAt (Vec.length vec - 1) vec
in
liftI $ IE_done (convert_vec $ Vec.append i h) (Chunk t)
step' _i stream = liftI $ IE_done Nothing stream
convert_vec vec = let (fp, off, len) = VB.toForeignPtr vec
f = FP.plusPtr (FFP.unsafeForeignPtrToPtr fp) off
fp' = (FFP.castForeignPtr $ unsafePerformIO $ FFP.newForeignPtr_ f) :: FFP.ForeignPtr Word16
in
Just $ VB.fromForeignPtr fp' (len `div` 2)

and that's it! The big hack is that this only works on little-endian platforms. To be correct, the bytes would need to be swapped on big-endian systems only. Implementing a suitable swap function is currently an excercise for the reader. Note that unsafeForeignPtrToPtr should be safe as there is no finalizer associated with the ForeignPtr, and in any case the memory pointed to won't be GC'd until after the Iteratee is complete.

At this point, the data should be handled entirely as a StorableVector, with no conversions to or from any intermediate lists at any point. Here are results from running the test:
$ time ./test_iter_sb
Just (AudioFormat {numberOfChannels = 2, sampleRate = 44100, bitDepth = 16})
Just 0.977568895535142

real 0m1.661s
user 0m1.335s
sys 0m0.140s
And the proof is in the pudding. This is the fastest Haskell implementation yet, even faster than hsndfile in a lazy stream.

A bit of work remains to be done. It should be cleaned up, with the hack fixed properly. Also I've currently only implemented unroll_16. For 24-bit audio, I expect the creation of a Word24 type and Storable instance will be necessary in order to employ the same technique. However, I think I've proven that a pure-Haskell, high performance audio I/O library is possible.

Monday, February 2, 2009

Build a better WAVE reader, pt 1

I have recently been working on creating an optimal audio I/O library in Haskell. By optimal, I mean the library should do all of the following:
  1. Have a clean, functional, usable interface. Operations should be composable, and the user should not have to deal with recursion, folds or the like. Procedural-style interfaces (anything exposing filehandle-type operations to the user) are out too.
  2. Be space efficient. Some Haskell implementations suffer from space leaks, or can have space leaks if the user isn't careful with data.
  3. Have performance comparable to hsndfile, a Haskell binding to libsndfile.
  4. Be implemented purely in Haskell. This makes it interesting.
The standard for audio I/O performance in Haskell has been hsndfile, as was discussed on haskell-cafe not long ago. Testing shows that using hsndfile to read data in a lazy stream fashion (similar to lazy ByteStrings) is about 10 times faster than the next-fastest implementation. Not good for Haskell's claim of having speed comparable to C! Still, I'm convinced it's possible to do better.

My test case consists of finding the peak normalized amplitude of a 16-bit stereo WAVE file. The audio duration is about 6 minutes, with a total file size of about 66 MB. This task involves reading and processing the entire file. The processing is quite minimal, so the speed of an implementation should depend primarily on the efficiency of reading and normalizing audio data.

Using hsndfile in a lazy stream does provide an interface which is familiar to many Haskell users, but I'm not particularly fond of it. If the user is not careful, the entire file can be retained in memory after processing. For a large audio file, this leads to an unacceptable situation. It should be impossible for data to be retained without explicit action from the user. Also, dependencies on foreign libraries can be difficult for some (i.e. Windows) users to resolve.

So having decided to create a native Haskell implementation, the first place to start is with the semantics and interface. I recently read Oleg K.'s presentations on Iteratee processing in Haskell, and I'm convinced this is the best model for functional I/O. It just feels right, what else can I say? Read the paper, see the code.

A preliminary version of Iteratee-based processing was benchmarked as "Enumerator" in the Haskell-art discussion. It was roughly comparable with other Haskell solutions, but has some problems. Notably it doesn't support seeking within the file, which would be nice. After I developed that, Oleg released a new version of Iteratee which does support seek operations via the RBIO monad. So I dropped Enumerator in favor of the new Iteratee + RBIO.

Oleg helpfully provides a TIFF reader with his Iteratee code. I based my wave reader on that, using the same technique of storing Enumerators for each wave sub-chunk in an IntMap. Since they aren't tagged, a list probably would have served just as well, but since most wave files only have 2 or 3 chunks anyway it doesn't seem like it matters much. Changes here aren't going to have a large performance impact.

The first version to use Iteratee + RBIO executes in about 45 seconds. Ouch.

RBIO uses IORef's to communicate seek requests and answers between Iteratees and Enumerators. Unfortunately IORefs are slow, so I'll start by getting rid of them. The data type
data IterateeG el m a = IE_done a (StreamG el)
| IE_cont (StreamG el -> IterateeGM el m a)

can be changed to
data IterateeG el m a = IE_done a (StreamG el)
| IE_cont (StreamG el -> IterateeGM el m a)
| IE_seek FileOffset (StreamG el -> IterateeGM el m a)

This also necessitates adding one more case to (>>=) on Iteratees:
iter_bind m f = m >>== docase
where
docase (IE_done a (Chunk [])) = f a
docase (IE_done a stream) = f a >>== (\r -> case r of
IE_done x _ -> liftI $ IE_done x stream
IE_cont k -> k stream
iter -> liftI iter)
docase (IE_cont k) = liftI $ IE_cont ((>>= f) . k)
docase (IE_seek off k) = liftI $ IE_seek off ((>>= f) . k)

After these changes, the provided Iteratees can be adapted to use the new structure relatively easily. The enumerator that reads from a file, enum_fd_random, must also be updated to understand seek requests. At this point we can remove RBIO entirely, and our execution time is about 33 sec. Adding a few carefully-chosen INLINE's and SPECIALIZATION's gets runtime down to 25 sec. Better, but we're very far from the goal of being comparable with C.

The next step is to change the Iteratee's internal representation of data.

Introduction

Hi, and welcome. I've decided to start this blog as a way to share some of my findings with various communities I'm involved with. I will try to post when I have something important/interesting to say, so I may not update all that frequently.

My main interests are music and programming. Specifically, modern art music, electroacoustic music, and programming in order to serve some musical purpose. For the moment, my language of choice is Haskell, because it's far more interesting than other languages I know. If Perl is for people who want to Get Things Done, then Haskell is for people who want to Do Things Right, Eventually, After You Figure Out Exactly What That Means. Which suits me just fine.

So again, welcome and thanks for taking a look.