
Junk data

Jocelyn Beedie edited this page May 24, 2020 · 2 revisions

If you see me refer to some data in a file as 'junk data', I literally mean that the data is completely meaningless. A lot of files contain data that has no real 'function' in relation to the rest of the file. This makes the files difficult to understand and pick apart, so knowing what is and is not 'junk' is very beneficial.

Purposes of junk data

A lot of junk data is padding. For example, much of the data in DGC archives is zeroes, inserted so that the contents of the archive align to 0x800-byte (2048-byte) boundaries. Since the GameCube was disc-based, this made disc reads faster. Unfortunately, not all junk data is zeroes.
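As a rough sketch of the arithmetic behind this kind of padding, the helpers below round an offset up to the next 0x800-byte boundary. The names 'align_up' and 'padding_needed' are made up for illustration and are not taken from any Totem tool.

```c
#include <assert.h>
#include <stddef.h>

/* Round `offset` up to the next multiple of `alignment`.
   Hypothetical helper; `alignment` must be a power of two, e.g. 0x800. */
size_t align_up(size_t offset, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

/* Number of padding bytes needed after data that ends at `offset`. */
size_t padding_needed(size_t offset, size_t alignment)
{
    return align_up(offset, alignment) - offset;
}
```

For example, data ending at offset 0x1234 would be followed by 0x5CC padding bytes, so that the next entry starts at 0x1800.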

Why junk data is sometimes not zeroed out

This usually arises when uninitialized memory is written to a file. Uninitialized memory is memory that has been allocated, but has not been written to. Consider the following code:

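char data[20];
strcpy(data, "Hello, world!");

The first 14 bytes of 'data' hold the string "Hello, world!" (null terminator included). The last 6 bytes are uninitialized and could be anything. This doesn't matter while the program runs, because string functions stop at the null terminator and ignore those bytes, but if all 20 bytes are written to a file, the 6 leftover bytes go with them. (Writing char data[20] = "Hello, world!"; instead would not show this, because C zero-fills the rest of an array that has an initializer.)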
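To see how stale bytes can end up in a file, here is a minimal sketch that simulates the situation with a reused buffer, since reading truly uninitialized memory is not something well-defined C can demonstrate. The function name 'fill_record', the constant 'RECORD_SIZE', and the leftover contents are all invented for this example.

```c
#include <assert.h>
#include <string.h>

#define RECORD_SIZE 20

/* Hypothetical example: `record` is treated as memory that was used for
   something else earlier. Copying a shorter string into it overwrites only
   strlen(text) + 1 bytes; the rest keeps whatever was there before. */
void fill_record(char record[RECORD_SIZE], const char *text)
{
    assert(strlen(text) < RECORD_SIZE);

    /* Simulate leftover contents from an earlier use of the same memory
       (an arbitrary 20-byte path fragment). */
    memcpy(record, "DB:>PERSO>PATRICK>TE", RECORD_SIZE);

    /* Only the string and its null terminator are written. */
    strcpy(record, text);
}
```

After calling fill_record(record, "Hello, world!"), the first 14 bytes hold the string and the last 6 still hold "ICK>TE". If all 20 bytes were then written out with fwrite, that fragment would land in the file as junk.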

Determining junk data

It can be difficult to tell junk data from real data. Fortunately, there are several ways to work out what is junk and what isn't.

A lot of files tend to be copied between different Totem archives. For example, consider the file "DB:>PERSO>PATRICK>TEXTURES>PATRICK.TGA", which is copied between seven different archives. This is a BITMAP file, so it contains texture data. However, if you compare the MD5 sums of these files, you get the following:

e74bfa380e5b01feff04f18f62ab2da6  extracted/rotfd/DATA/JF/LVL_JFCL/DB__PERSO_PATRICK_TEXTURES_PATRICK_DEE350CD_.TGA
a10b4032c762e19918c30ca298264a64  extracted/rotfd/DATA/GL/LVL_GLBE/DB__PERSO_PATRICK_TEXTURES_PATRICK_DEE350CD_.TGA
a037999ce3a3ec135d26420801e17fcc  extracted/rotfd/DATA/BB/LVL_BBSH/DB__PERSO_PATRICK_TEXTURES_PATRICK_DEE350CD_.TGA
459549f6f810bcabf07c6bc4b9cfb422  extracted/rotfd/DATA/BB/LVL_BBTP/DB__PERSO_PATRICK_TEXTURES_PATRICK_DEE350CD_.TGA
eadaad30a042836ca0c124fc0a77ca87  extracted/rotfd/DATA/BB/LVL_BBEX/DB__PERSO_PATRICK_TEXTURES_PATRICK_DEE350CD_.TGA
9dd38052e384393569c70130b2df4b95  extracted/rotfd/DATA/DG/LVL_DGBA/DB__PERSO_PATRICK_TEXTURES_PATRICK_DEE350CD_.TGA
2114a4af911ab5f0ace8960b28d09eb2  extracted/rotfd/DATA/DG/LVL_DGDS/DB__PERSO_PATRICK_TEXTURES_PATRICK_DEE350CD_.TGA

None of these checksums match! But the actual image data the files contain is exactly the same, since they are copies of the same texture. A neat little command-line tool called "VBinDiff" can be used to see roughly where the junk data is. If we compare two of the files, the bytes that differ show where the junk data lies:

[Image: VBinDiff output for the Patrick textures]

A lot of the junk data matches between the two files (of course, there are only 256 values that a byte can take, and zeroes tend to be very common in junk data), so this method isn't perfect. But it provides a good starting visualization of where the junk data is.
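The same idea can be sketched in code: walk two copies of a file byte by byte and report the ranges where they differ. This is only an illustration of the comparison, not how VBinDiff itself works, and the function name 'next_diff_run' is an assumption of this example. It expects both buffers to be at least 'len' bytes long.

```c
#include <assert.h>
#include <stddef.h>

/* Find the next run of differing bytes at or after `start`.
   Returns 1 and stores the run as [*run_start, *run_end) if one exists,
   otherwise returns 0 and leaves the outputs untouched. */
int next_diff_run(const unsigned char *a, const unsigned char *b, size_t len,
                  size_t start, size_t *run_start, size_t *run_end)
{
    assert(a != NULL && b != NULL);

    size_t i = start;

    /* Skip bytes that are identical in both copies. */
    while (i < len && a[i] == b[i])
        i++;
    if (i >= len)
        return 0;

    *run_start = i;

    /* Extend the run for as long as the bytes keep differing. */
    while (i < len && a[i] != b[i])
        i++;
    *run_end = i;
    return 1;
}
```

Calling this in a loop, passing the previous run's end as the next 'start', lists every differing range. As noted above, junk bytes that happen to match in both copies will not show up, so the ranges are a lower bound on where the junk is.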

Sometimes, however, only one version of a file exists. In that case, data can sometimes be identified as junk by looking for sections of the file that appear to contain uninitialized memory. The tell is usually data that clearly belongs somewhere else, such as stray file paths or other text (often cut off at the beginning and/or the end).
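One way to hunt for such tells is to scan a file's bytes for runs of printable ASCII, much like the Unix 'strings' utility does. The sketch below is an assumption-laden illustration: the name 'next_text_run' and the minimum-length parameter are invented here.

```c
#include <assert.h>
#include <stddef.h>

/* Printable ASCII range: space (0x20) through tilde (0x7E). */
static int is_printable(unsigned char c)
{
    return c >= 0x20 && c <= 0x7E;
}

/* Find the first run of at least `min_len` printable bytes at or after
   `start`. Returns 1 and stores the run as [*run_start, *run_end),
   otherwise returns 0 and leaves the outputs untouched. */
int next_text_run(const unsigned char *data, size_t len, size_t start,
                  size_t min_len, size_t *run_start, size_t *run_end)
{
    assert(data != NULL);

    size_t i = start;
    while (i < len) {
        if (!is_printable(data[i])) {
            i++;
            continue;
        }

        /* Measure the printable run that begins here. */
        size_t begin = i;
        while (i < len && is_printable(data[i]))
            i++;

        if (i - begin >= min_len) {
            *run_start = begin;
            *run_end = i;
            return 1;
        }
    }
    return 0;
}
```

Real data can contain printable bytes by chance, so a text run is only a hint. A fragment like "ICK>TE" sitting in the middle of a texture file is a much stronger one.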
