Databending :: Experiment Two: Edit Audio As Image

Audio As Image File

If we can edit images as audio, then why not edit audio as an image?

There are a couple different ways we can open arbitrary raw data streams in an image editor. We explore two methods here:

Method 1: Manually Create File Header

We can open data as image files by creating a BMP file header with X*Y dimensions matching the data length. An 8bpp image (one byte per pixel) is easiest to calculate. An image of size (1 * (# of audio samples)) would be easy to use but no good to look at. A square image where X = Y = sqrt(# samples) is probably better; adjust as needed.

Once you have a BMP header in which (X*Y) = (number of audio samples), insert this at the beginning of a new file. After the end of the header, add the Raw (header-less) Unsigned 8-bit output from Audacity. Name the file with a ".bmp" extension.
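As a rough illustration of Method 1, here is a minimal Python sketch that builds an 8bpp grayscale BMP around a raw byte stream. The function name make_bmp_8bpp is made up for this example, and the layout assumed is the plain 14-byte file header plus 40-byte info header plus 256-entry palette (1078 bytes in total); an image editor may write a longer header variant. BMP rows are padded to a multiple of 4 bytes, so the sketch only accepts widths that are multiples of 4, which keeps the pixel data byte-for-byte identical to the audio data.

```python
import struct

def make_bmp_8bpp(samples: bytes, width: int) -> bytes:
    """Wrap unsigned 8-bit samples in a minimal 8bpp grayscale BMP (sketch)."""
    if width % 4 != 0:
        # BMP rows are padded to 4-byte boundaries; avoid padding entirely.
        raise ValueError("use a width that is a multiple of 4")
    height = len(samples) // width        # whole rows only
    pixels = samples[: width * height]    # drop any partial final row
    # Grayscale palette: 256 entries of (blue, green, red, reserved).
    palette = b"".join(bytes((i, i, i, 0)) for i in range(256))
    offset = 14 + 40 + len(palette)       # where the pixel data starts
    file_header = struct.pack("<2sIHHI", b"BM", offset + len(pixels), 0, 0, offset)
    info_header = struct.pack(
        "<IiiHHIIiiII",
        40,            # size of this info header
        width,
        height,        # positive height: rows are stored bottom-to-top
        1,             # colour planes
        8,             # bits per pixel
        0,             # no compression
        len(pixels),
        2835, 2835,    # nominal resolution in pixels per metre (about 72 dpi)
        256,           # palette entries
        0,
    )
    return file_header + info_header + palette + pixels
```

To use it, read the raw Audacity export into a bytes object, pass it in along with the chosen width, and write the result to a file with a ".bmp" extension.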

Method 2: Open Raw Image Data

This was submitted by a user.

You can open arbitrary (raw) data streams in GIMP. In the File > Open dialogue, select the File Type "Raw image data (.data)" and enable Show All Files. This will allow you to open any file and specify the width, height, colour depth and encoding, and other options manually. Then you can Export the data and GIMP will create a file header of the appropriate type for you.

Eliciting Patterns

By adjusting the pixel width (X) in the image file header, we can visualize different waveforms in the audio stream. The sample rate used in the input audio affects the data width of the sound waves, and the X-dimension we choose to insert into the image file header affects the image pixel flow from one line to the next.
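To see why the width matters, here is a small Python sketch using a synthetic tone rather than real audio. The 64-sample period and the helper name rows are assumptions chosen for illustration. When the image width equals the period of the wave, every row is identical and the image shows vertical stripes; when the width is off by one, each row starts one sample later in the cycle and the stripes lean diagonally.

```python
import math

def rows(data: bytes, width: int) -> list:
    """Cut a byte stream into complete image rows of `width` pixels."""
    return [data[i:i + width] for i in range(0, len(data) - width + 1, width)]

period = 64  # samples per cycle of the synthetic tone (illustrative value)
cycle = bytes(int(127.5 + 127.5 * math.sin(2 * math.pi * n / period))
              for n in range(period))
tone = cycle * 8  # eight identical cycles of unsigned 8-bit samples

aligned = rows(tone, period)      # width matches the period
skewed = rows(tone, period + 1)   # width is one pixel too wide

print(len(set(aligned)))  # number of distinct rows when aligned
print(len(set(skewed)))   # number of distinct rows when skewed
```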

Image files at 8bpp (256 colours) might be grayscale, or a custom colour map might be included for fun or art or other purposes.

Image Effects On Audio

Teddy KGB - Fo Mike

I asked my buddy Teddy KGB to whip up a simple music track for databending; he created Fo Mike.wav which has some simple synth and drums in a hiphop beat.

The first step is to convert the encoding to a simple 8-bit format. I used Audacity to mix down the original track from stereo to mono, and exported as RAW (header-less) Unsigned 8-bit encoding at 18000Hz. The 1-channel Unsigned 8-bit format ensures that each audio frame is one byte, and the entire data stream is linear (mono); stereo format looks more like "LRLRLR..." but mono is just "BBBB..." (Bytes). The choice of 18000Hz for a sample rate gives us a total data size of 1536000 bytes, a number with many factors, which leaves plenty of image dimensions to choose from. I chose image dimensions of 1500x1024 (1500 * 1024 = 1536000).
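The search for dimensions can be sketched in a few lines of Python. The helper name factor_pairs and the minimum side length of 256 pixels are assumptions made for this example; the idea is simply to list every width and height whose product is exactly the data size.

```python
def factor_pairs(n: int, min_side: int = 256) -> list:
    """List (width, height) pairs with width * height == n, both sides >= min_side."""
    return [(w, n // w)
            for w in range(min_side, n // min_side + 1)
            if n % w == 0]

total_bytes = 1536000          # size of the 8-bit mono export
pairs = factor_pairs(total_bytes)

print((1500, 1024) in pairs)   # the dimensions used in this experiment
print(total_bytes / 18000)     # track length in seconds at 18000Hz
```

Widths that are multiples of 4 are the safest picks for an 8bpp BMP, since they avoid any padding bytes at the end of each row.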

I created a new 1500x1024 Grayscale image in GIMP (just a plain white rectangle) from which I extracted the BMP header of 1146 bytes and attached this to the beginning of the RAW data output by Audacity. The result is a BMP interpretation of the audio data, which looks like this:

NOTE: This Windows BMP file encodes the data bottom-to-top, which is counterintuitive. Using a Y-dimension of "-1024" encodes the BMP top-to-bottom, but this will result in even more confusion when we bring it back to Audacity or export as PNG, so just know that the start of the audio is at the bottom and the end of the audio is at the top. (WHY??? I don't know why BMP does this...)
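To make the bottom-to-top ordering concrete, here is a tiny Python sketch. The helper name flip_rows is made up, and it assumes the pixel data is a whole number of unpadded rows. It reverses the row order only, which is the difference between what is stored in the file and what is seen on screen.

```python
def flip_rows(pixels: bytes, width: int) -> bytes:
    """Reverse the order of the rows, leaving each row's pixels untouched."""
    rows = [pixels[i:i + width] for i in range(0, len(pixels), width)]
    return b"".join(reversed(rows))

stored = b"aabbcc"              # three rows of two pixels, in file order
shown = flip_rows(stored, 2)    # top-to-bottom order as displayed
print(shown)                    # b'ccbbaa'
```

The first row in the file ("aa", the start of the audio) ends up at the bottom of the picture.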

In order to bring the data back into Audacity, we simply have to "Export As..." in GIMP and use the "Other format" of Windows BMP 8bpp, then open this in Audacity using the following settings:

Encoding: Unsigned 8-bit PCM
Byte order: No Endianness (probably doesn't matter since it's only 8-bit encoding)
Channels: 1 (mono)
Start offset: 1146 bytes (This skips over the BMP header, straight to the data)
Amount: 100%
Sample rate: 18000Hz
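The same import can be sketched outside Audacity with Python's standard library. The function names bmp_pixels and pcm_u8_to_wav are made up for this example. Instead of hard-coding 1146, the sketch reads the pixel data offset stored at byte 10 of the BMP file header, which should hold the same value for the file described here. 8-bit WAV audio is unsigned, so the bytes can be written out unchanged.

```python
import io
import struct
import wave

def bmp_pixels(bmp: bytes) -> bytes:
    """Return everything after the BMP header, using the offset stored at byte 10."""
    if bmp[:2] != b"BM":
        raise ValueError("not a BMP file")
    (offset,) = struct.unpack_from("<I", bmp, 10)
    return bmp[offset:]

def pcm_u8_to_wav(samples: bytes, sample_rate: int = 18000) -> bytes:
    """Wrap unsigned 8-bit mono samples in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(1)            # one byte per sample
        wav.setframerate(sample_rate)
        wav.writeframes(samples)
    return buf.getvalue()
```

A typical use would be to read the exported BMP as bytes, pass it through bmp_pixels and then pcm_u8_to_wav, and write the result to a ".wav" file.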

With this process defined, here are some "image transforms" performed on this track in GIMP:

GIMP Operation	Looks like	Sounds like
Filters > Distorts > Ripple
Filters > Artistic > Glass Tile
Filters > Blur > Gaussian Blur
Filters > Generic > Erode
Filters > Edge-Detect > Neon, then Colors > Brightness-Contrast * (See NOTE)

NOTE: The Neon transformation caused heavy bending! To make the audio more tolerable, I also adjusted Brightness and Contrast in GIMP, as well as Normalized and Compressed the resulting audio in Audacity.

Eliciting Specific Frequencies

In the previous example I was able to elicit some patterns from the audio in a graphical interpretation of its data because I used a value for the X-dimension (width) of the image which matches values in the music (such as BPM). However, my other attempts to elicit imagery from audio have been less interesting. One of the issues is that a melody or song contains different frequencies (otherwise you would just hear a single tone), and these frequencies overlap at different times. This makes it difficult to choose a single image width which will elicit patterns from many of the sounds in a piece of music.

And I haven't taken a course on Fourier Analysis (yet)...

It should be possible to elicit imagery from an audio sample if it were a static or near-static frequency, such as plucking a single string on a guitar.

In the following experiment I recorded a single strum of the low E string on a six-string acoustic guitar, which may or may not have been well tuned at the time. Once the audio was loaded in Audacity, I measured the number of audio samples in each repetition of the waveform produced by the guitar. This value just so happened to be that nice nerdy number 640. Quite by fluke, when I calculated the Y-resolution needed to hold the amount of audio data (length of time) which I had cut from the recording, I got the nice resolution 640x480!
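The arithmetic here is simple enough to sketch in Python. The function names are made up, and the sample count of 307200 is an assumption inferred from the 640x480 result at one sample per pixel. The second helper shows the general relation between sample rate, pitch and period; its example numbers are purely illustrative and are not measurements from the recording.

```python
def image_size(n_samples: int, period: int) -> tuple:
    """Use one waveform repetition per row: width = period, height = whole rows."""
    return period, n_samples // period

def samples_per_cycle(sample_rate: float, frequency: float) -> float:
    """Number of samples in one repetition of a tone at `frequency` Hz."""
    return sample_rate / frequency

print(image_size(307200, 640))         # (640, 480)
print(samples_per_cycle(48000, 75.0))  # 640.0 (illustrative numbers only)
```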

Here is the original audio recording (compressed as MP3):

And here are the image results in four different bit depths (compressed PNGs shown, click images for original BMPs). Note that due to Windows BMP orientation, the recording actually plays from bottom to top:

Unsigned 8-bit PCM as 8bpp raster image.
Signed 8-bit PCM as 8bpp raster image.
Signed 16-bit PCM as 16bpp raster image (5 Red, 6 Green, 5 Blue).
Signed 24-bit PCM as 24bpp raster image.

Here is the low definition Signed 8-bit audio stream as 24bpp True Colour image. Note that the pattern is visible thrice horizontally, as each 640-pixel row now consumes three bytes per pixel (three repetitions of the waveform), and the vertical dimension was reduced to one third to accommodate the shorter data set:
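A short Python sketch of why the pattern repeats three times across the image. The helper name layout is made up, and the data size of 307200 bytes is an assumption based on 640x480 one-byte samples. Reading the same bytes at three bytes per pixel makes each row swallow three waveform repetitions and leaves a third as many rows.

```python
def layout(n_bytes: int, width_px: int, bytes_per_pixel: int, period_bytes: int) -> tuple:
    """Return (rows in the image, waveform repetitions per row)."""
    row_bytes = width_px * bytes_per_pixel
    return n_bytes // row_bytes, row_bytes / period_bytes

print(layout(307200, 640, 1, 640))  # (480, 1.0) for the 8bpp image
print(layout(307200, 640, 3, 640))  # (160, 3.0) for the 24bpp image
```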

And here is a little animated GIF I made of these results, just for fun!

Examining The Composition

In order to better understand the effects at play which create the beautiful images seen above, some experiments were carried out in decomposing the RGB image into its component parts. Using GIMP I selected Colors -> Components -> Decompose and explored the resulting images (Click the thumbnail to obtain the BMP):

Red	Green	Blue

So, what on Earth has happened here? Why is Red seen in stark, almost entirely On/Off pixels? Why is Green carrying the bulk of the interesting water wave patterns? And why is Blue just chaotic noise contrasted with the other two? On examination of these questions, the answers lie in the manner of 24-bit data storage and byte ordering.

Simply put, the byte ordering of the audio samples interpreted as RGB pixel data is such that as the audio amplitude values of the waveforms count up or down, Blue values cycle through their entire range 0-255 over and over again; then Green and then Red, in that order. This is akin to how the 1s, 10s, or 100s column changes frequently as you count up towards one million in decimal, but the column for 100,000s only changes a few times.

But then why does Red change so abruptly in the quiet part (the top) of the image? If the Reds are the extreme values, shouldn't the quiet part of the Red image just be black? This is because the audio format used (Signed 24-bit PCM) uses a common computer math trick to store signed (positive and negative) integer values in binary called Two's Complement. This has some bizarre effects on our image interpretations of audio data:

As the audio waveform moves from the center line (0 amplitude) upwards slightly, the Red value of those audio samples (pixels) becomes 0 (binary 00000000). It is only once the audio amplitude becomes quite positive that Red pixels would have a value in the range 1 to 127.
As the audio waveform moves from the center line downwards slightly, the audio sample is now a negative number, and the Red value of those pixels is 255 (binary 11111111). It is only once the audio amplitude becomes quite negative that Red pixels would have a value in the range 255-128 (backwards because of Two's Complement storage of the audio data which is signed).
As audio samples increase in the range 0 to +8388607 (that is from the center line to full positive amplitude), Blue values increase 0-255 very quickly and many times; Green less so, and Red only at the high edge from 0 to 127.
As audio samples decrease in the range -1 to -8388608 (that is from the center line to full negative amplitude), Blue values decrease 255-0 very quickly and many times; Green less so, and Red only at the low edge from 255 to 128.
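The behaviour described above can be checked with a few lines of Python. This sketch assumes the samples are stored little-endian (lowest byte first) and relies on the fact that 24bpp BMP pixels are stored in Blue, Green, Red byte order; the function name sample_to_rgb is made up for the example. A signed 24-bit sample ranges from -8388608 to +8388607.

```python
def sample_to_rgb(sample: int) -> tuple:
    """Split a signed 24-bit sample into the (red, green, blue) pixel it becomes."""
    # Two's complement, little-endian: low byte, middle byte, high byte.
    blue, green, red = sample.to_bytes(3, "little", signed=True)
    return red, green, blue

print(sample_to_rgb(1))         # (0, 0, 1): just above the center line
print(sample_to_rgb(-1))        # (255, 255, 255): just below the center line
print(sample_to_rgb(65535))     # (0, 255, 255): Red is still 0
print(sample_to_rgb(65536))     # (1, 0, 0): Red finally ticks over
print(sample_to_rgb(8388607))   # (127, 255, 255): full positive amplitude
print(sample_to_rgb(-8388608))  # (128, 0, 0): full negative amplitude
```

Red sits at 0 for every sample from 0 to 65535 and at 255 for every sample from -1 to -65536, so in quiet passages it flips between the two extremes each time the waveform crosses the center line.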

It may look as though the Blue channel is complete chaos and noise, but there exists the same pattern there as in the Red and Green channels. Click the animated GIF below for a higher resolution: