The following was done for the fun of solving a puzzle, or at least part of one. It was worked out by inspection; I have no association with (nor love of) Microsoft.
I also find they tend to radically change the format of their files each time they come out with a new version. Therefore I think the concepts below will only be compatible with Version 5.0 of Outlook Express, and if there isn't a new version out now, you can assume one will be soon!
Disclaimer:
There is no warranty; I'm not liable if this is wrong...
Use this information at your own risk. I haven't worked on
or really looked at this since June 2000 when I wrote the sample code.
Prior Work:
I went to WOTSIT when I became interested in
Outlook Express 5.0 files. In the Windows section for the DBX file
type I found dbx.zip. A great starting point. I might never
have gotten anywhere without the work of Simon Craythorn who
was later assisted by Jason Miller. I wrote Simon and he
sent me an updated file with Jason's fixes.
Quoting part of dbx.txt, which is part of the Wotsit distribution,
Simon describes his algorithm as follows:
"We do however know what a message header structure looks like.
We also know that this is always on a 4 byte boundary.
Looking at the file I can also see that a message ALWAYS starts
with an RFC822 Message Header.
So, by looking through the file, 4 bytes at a time we can locate
the start of messages. Chain them together and export the result.
It's slow, it's crude, but it works!
"
Simon's sample code, 0E5.cpp scans the file looking for the start
of a message header by looking for the key RFC822 strings. Then
he traverses the message which he determined has a linked list
structure. There is some code in the routine oe5__import_message()
which frankly I don't understand. Simon says it works, and since
I never actually got around to the final step of dumping messages
I guess I believe him! The code in question advances through
the file in steps of 65536 while the header.lSectionSize > 512.
I think this is not required if you get to the start of the
message via the table of contents as outlined below.
File Format Overview:
First let me tell you how to find the data. Go to the 'tools'
dropdown box and select 'options'. Under options select
'maintenance' and click on the 'store folder' box to find
the path to the actual data files.
Next note that I'm really only dealing with the *.dbx
files that contain messages. There are others in the outlook
directory with the *.dbx extension which don't follow this
format, ie folders.dbx and pop3uidl.dbx. What I describe here
is a subset of the entire format, and there are many areas
where I have NO idea as to what is going on.
I have some understanding of the following four sections:
File Header: Appears to extend to 0x2ad4, although what I
understand occurs in the first 0x100 bytes.
Descriptive Data: Contains various string data, see
DESC_HEAD. To date I've only seen these
allocated in 0xC000 byte block lengths.
Message Data: Contains the messages as a series of linked
lists. See DBX_HEAD, this is the structure
identified by Simon Craythorn (thanks).
All message data has flags == 0x200.
To date I've only seen these allocated in 0xF780
byte block lengths.
TOC Pointer list: Contains pointers to the table of contents
data, ie offsets to DESC_HEAD locations. I believe
this is some form of doubly linked list, but
don't have a big enough file to be sure.
See TOC_HEAD which is a super set of DBX_HEAD,
appears to have flags == 0x0. To date I've only seen
these allocated in 0x3e1c byte block lengths. Typically
this will leave space for 25 entries of length 0x27C
which is the apparent size of an individual TOC_HEAD
entry with its associated data.
The extremely useful tip Simon gives is that a valid header
only occurs on a modulo 4 byte boundary. The unsigned long
that begins such a header is equal to the offset in the file
of this header. Hence one searches for headers by looking for
longs with file offsets equal to the value at that location.
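The search just described can be sketched as follows. This is a minimal
illustration and not the code in dbx.c: it assumes the whole file has been
read into memory, and the names rd32 and find_headers are made up for this
example. Skipping offset 0 is also my own choice, since the file begins
with a signature rather than a header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 32-bit value (the files come from x86 Windows). */
static uint32_t rd32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Scan a file image for candidate headers: offsets on a 4 byte boundary
 * where the stored long equals the offset itself.
 * Up to max_hits offsets are written to hits[]. The return value is the
 * total number of candidates found, which may exceed max_hits.
 * Offset 0 is skipped (assumption: it holds the file signature).
 */
static size_t find_headers(const unsigned char *buf, size_t len,
                           uint32_t *hits, size_t max_hits)
{
    size_t found = 0;

    assert(buf != NULL);
    for (size_t off = 4; off + 4 <= len; off += 4) {
        if (rd32(buf + off) == (uint32_t)off) {
            if (found < max_hits)
                hits[found] = (uint32_t)off;
            found++;
        }
    }
    return found;
}
```

A match only says the location might be a header; random data can hit by
accident, so the fields that follow still have to be checked.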
Outlook appears to allocate fairly large blocks for each type
of data. A *.dbx file always has a file header region.
As soon as there are one or more messages it will also have
the three other regions described above. I don't see a pattern
to the order things are allocated, sometimes the Message data
is the next available location, 0x2ad4, and sometimes it will be
the Descriptive data. However one can determine some of this
from key offsets in the File header region. When one of the
previous allocated blocks fills, an additional block is allocated
at the end of the file allowing the system to grow.
File Header:
I've identified the following offsets at the beginning of the
file that provide data about the rest of the file:
Offset  Description
0x0     1st 16 bytes probably a file GUID flag, see shortcut.pdf
        Same bytes for all message files, but not for all *.dbx.
0x24    points to current block of what I'm calling Descriptive Data,
        TOC data and flag 0x48 above entries appended in this block
        Typically 0xC000 bytes reserved for TOC data on first whack
0x28    is allocation length for each Descriptive Data block = 0xC000
0x30    points to TOC current header list block, header with flag == 0
        may be 0 if no message, ie no TOC
        Each flag = 0 block 0x27c bytes long, 1 block contains 25 of these
        May be more data after this, see inbox.dbx, and offset 0x7c
0x34    is allocation length for each TOC pointer list block = 0x3e1c bytes
0x3C    points to active Message Data block header.
        Typically DBX_HEAD.flags == 0x200
        This is where one adds next message
        Allocation seems to be 0xF780. TOC list records use a 3 byte
        offset, so I don't see how to exceed an offset of 0xFFFFFF.
        This allows files up to 16.77 Megs, then what happens?
0x40    is allocation length for Message Data size = 0xF780
0xC4    active messages in file <= value at offset 0x5C
0x5C    looks like total messages in file (including deleted)
0x7C    points to what might be next available block?
        typically near EOF, but zeros in this area in all files
        that I've seen.
0x88    points to string describing flags in file
        all *.dbx with messages seem to have one of these...
        it's a string per below in all cases where it exists
        ie if 0x88 != 0, there are messages in file?
        defines some flags for the file.
        Note that DBX_HEAD.flags value 0x48 is its length!
0xE4    points to master TOC pointer list block, header with flag == 0
        may be 0 if no messages, ie no TOC
        may be equal to value at offset 0x30 if only one such block
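As a rough sketch, the offsets in the table can be pulled out of the first
0x100 bytes of the file like this. The FILE_INFO structure, its member
names and the function read_file_info are invented for this example; only
the offsets come from the table, and their meanings are as tentative as
described there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the file header values listed above. */
typedef struct {
    uint32_t desc_block;   /* 0x24: current Descriptive Data block    */
    uint32_t desc_alloc;   /* 0x28: allocation length, 0xC000 seen    */
    uint32_t toc_current;  /* 0x30: current TOC pointer list block    */
    uint32_t toc_alloc;    /* 0x34: allocation length, 0x3e1c seen    */
    uint32_t msg_block;    /* 0x3C: active Message Data block         */
    uint32_t msg_alloc;    /* 0x40: allocation length, 0xF780 seen    */
    uint32_t total_msgs;   /* 0x5C: total messages, including deleted */
    uint32_t next_free;    /* 0x7C: maybe the next available block    */
    uint32_t flag_string;  /* 0x88: DESC_HEAD of the flag string      */
    uint32_t active_msgs;  /* 0xC4: active messages                   */
    uint32_t toc_master;   /* 0xE4: master TOC pointer list block     */
} FILE_INFO;

/* Read a little-endian 32-bit value. */
static uint32_t rd32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 on success, -1 if fewer than 0x100 bytes are available. */
static int read_file_info(const unsigned char *buf, size_t len, FILE_INFO *fi)
{
    assert(buf != NULL && fi != NULL);
    if (len < 0x100)
        return -1;
    fi->desc_block  = rd32(buf + 0x24);
    fi->desc_alloc  = rd32(buf + 0x28);
    fi->toc_current = rd32(buf + 0x30);
    fi->toc_alloc   = rd32(buf + 0x34);
    fi->msg_block   = rd32(buf + 0x3C);
    fi->msg_alloc   = rd32(buf + 0x40);
    fi->total_msgs  = rd32(buf + 0x5C);
    fi->next_free   = rd32(buf + 0x7C);
    fi->flag_string = rd32(buf + 0x88);
    fi->active_msgs = rd32(buf + 0xC4);
    fi->toc_master  = rd32(buf + 0xE4);
    return 0;
}
```

A value of 0 in toc_current or toc_master would then mean no TOC, ie no
messages, per the notes in the table.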
Message Data:
One should enter the first record in this linked list via
the data following a TOC_ENT in the Descriptive Data which is
in turn pointed to by the data following TOC_HEAD entries.
typedef struct _dbx_head {
    DWORD lpos,   // current position
          flags,  // identifies type of data
          length, // of section
          next;   // link to next section
} DBX_HEAD;
This is a linked list structure. I've only seen the flag 0x200
in this block. A series of links are tied together through
the next offset. When next==0 you're at the end of the data.
Length is normally, but not always, equal to 0x200.
I think longer blocks may incorporate deleted messages, but
I'm not sure. If you enter the linked list data through a TOC
entry you always seem to get a series of nodes of length 0x200.
The last node in the list will have a next == 0x0 and a length
<= 0x200 (the last message record is rarely 0x200 bytes long).
However, if one scans ahead to find the next message header,
you have to skip 0x200 bytes after the previous one to get to
the next, so message headers seem to be located at 0x210 byte
intervals in the Message Data blocks.
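Following one message through the linked list might look like the sketch
below. It is not the code from dbx.c (which only displays the headers). It
assumes the file is in memory, and that DBX_HEAD.length counts the data
bytes that follow the 16 byte header, which fits the 0x210 spacing noted
above but is still a guess. The names rd32 and read_message are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DBX_HEAD_SIZE 16u   /* four DWORDs: lpos, flags, length, next */

/* Read a little-endian 32-bit value. */
static uint32_t rd32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Copy the data of one message, starting at the DBX_HEAD at 'start',
 * into out[]. Returns the number of bytes copied, or -1 if the chain
 * looks malformed or out[] is too small.
 */
static long read_message(const unsigned char *buf, size_t len,
                         uint32_t start, unsigned char *out, size_t out_cap)
{
    size_t total = 0;
    size_t max_nodes = len / DBX_HEAD_SIZE + 1;   /* guards against loops */
    uint32_t off = start;

    assert(buf != NULL && out != NULL);
    while (off != 0) {
        if (max_nodes-- == 0)
            return -1;
        if (off > len || len - off < DBX_HEAD_SIZE)
            return -1;

        uint32_t lpos   = rd32(buf + off);
        uint32_t length = rd32(buf + off + 8);
        uint32_t next   = rd32(buf + off + 12);

        if (lpos != off)                          /* not a valid header */
            return -1;
        if (length > len - off - DBX_HEAD_SIZE)   /* data runs past EOF */
            return -1;
        if (length > out_cap - total)             /* output buffer full */
            return -1;

        memcpy(out + total, buf + off + DBX_HEAD_SIZE, length);
        total += length;
        off = next;                               /* 0 ends the chain   */
    }
    return (long)total;
}
```

The starting offset would normally come from a TOC entry rather than from
scanning, so that deleted or partial chains are not picked up by mistake.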
TOC Pointer List:
I'm not sure about this one; it's something like this, but
I don't have big enough files to be sure.
// TOC pointer header
typedef struct _toc_head {
    DWORD lpos,     // offset equal to position in file
          flags,    // always 0?
          prev_ptr, // back pointer for doubly linked list
          next,     // next pointer
          count,    // see shift required!
          unknown;  // used as filler in dbx.c during read
} TOC_HEAD;
It's easy to spot because the flags above == 0x0.
Per above, believed to be a doubly linked list.
File header offset 0x30 points to the TOC pointer currently
in use, ie the one with the most recent entries.
File header offset 0xE4 points to a master entry which may
be the same as the value at offset 0x30 (if there is only
one pointer block). If TOC_HEAD.prev_ptr != 0 there are
additional pointer blocks which precede the master entries.
If I'm right about this it could be a true doubly linked list
if the current block was linked in, but in my examples it isn't.
You get to the current entry from file header offset 0x30. This
header has a forward pointer, next, to the master entry which is
also pointed to by file header offset 0xE4.
The master entry may have no back pointer, prev_ptr = 0. If it's
not 0, then the entry pointed to will in turn point back via its
next. I assume but don't have a file that proves that if there
are 4 or more entries in the TOC pointer data (TOC_HEAD.flags == 0)
that they will be connected via the prev_ptr and next fields.
The current logic assumes one reads the TOC_HEAD structure.
This gives one the number of data entries that follow.
CAREFUL, to get the count right shift by 8 bits; the low byte
of this long has been zero in all the files I've seen. It may
really be a flag as in the data following DESC_HEAD below.
Each data entry has 3 longs. Of the three, only the
function of the first is known. It's the offset of a TOC_ENT
record. In the only example I have, the last data entry in the
master block pointed to by file header offset 0xE4 has non-zero
data entries at the 2nd and 3rd location which are respectively
the offset of the current TOC_HEAD block and the number of entries
it contains. Hard to say all this in text, see dbx.c.
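Since this is hard to say in text, here is a small sketch of reading one
TOC pointer block. It is an illustration under my current assumptions, not
a copy of dbx.c: the header is taken to be the six DWORDs of TOC_HEAD, the
count is taken from the bytes above the (so far always zero) low byte, ie
the stored long shifted right by 8 bits, and each data entry is 3 longs of
which only the first is used. The names rd32 and toc_block_entries are
made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOC_HEAD_SIZE 24u   /* six DWORDs, see TOC_HEAD      */
#define TOC_ITEM_SIZE 12u   /* three longs per data entry    */

/* Read a little-endian 32-bit value. */
static uint32_t rd32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * List the first long of every data entry in the TOC pointer block at
 * file offset 'block'. Up to max_ents values are stored in ents[].
 * Returns the entry count from the header, or -1 if the block does not
 * look like a TOC_HEAD (lpos mismatch, flags != 0, or count too large).
 */
static long toc_block_entries(const unsigned char *buf, size_t len,
                              uint32_t block, uint32_t *ents, size_t max_ents)
{
    assert(buf != NULL);
    if (block > len || len - block < TOC_HEAD_SIZE)
        return -1;
    if (rd32(buf + block) != block || rd32(buf + block + 4) != 0)
        return -1;

    /* Assumption: low byte is a flag or filler, count sits above it. */
    uint32_t count = rd32(buf + block + 16) >> 8;
    if (count > (len - block - TOC_HEAD_SIZE) / TOC_ITEM_SIZE)
        return -1;

    const unsigned char *item = buf + block + TOC_HEAD_SIZE;
    for (uint32_t i = 0; i < count; i++, item += TOC_ITEM_SIZE) {
        if (i < max_ents)
            ents[i] = rd32(item);   /* offset of a TOC_ENT record */
    }
    return (long)count;
}
```

Walking prev_ptr and next from the block at file header offset 0xE4 would
then visit the other blocks, if the linked list guess above is right.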
Descriptive Data:
I know what two types of this data are which is enough to parse
a message file. All data in these blocks start with a DESC_HEAD.
Note rather than a flags value as in the DBX_HEAD, the second
long is the length. You get to an individual entry via the
offset in the TOC pointer list described above.
typedef struct _desc_head {
    DWORD lpos,
          length;
} DESC_HEAD;
The simplest entry is a string describing some flags for the file
(no idea what it means). It's pointed to by file header at offset 0x88.
It's a single NUL terminated string of length = DESC_HEAD.length-1.
The important entry is the table of contents, TOC, entry data.
This needs a lot more work, but I see enough to get one from
here to the Message data.
#define TOC_TYPE 0x1f // mask for TOC entries to get type #
#pragma pack(1) // an alternate view of longs in structure
typedef struct _toc_ent {
    BYTE flag;  // some sort of bitmap
    WORD data;  // if flag == 0x84 offset to message, often in 7th long
                // it's in 3 bytes, ie up to 16.77 mb file
    BYTE extra; // high byte of data?
} TOC_ENT;
#pragma pack()
The data following the DESC_HEAD is variable. After examining a
number of records I found that the first byte in the longs following
the header is a flag which indicates the function of the remaining
bytes in the long. I don't know the meaning of the high order bits,
but masking this byte with 0x1F produces monotonically increasing
values in the range 0x0 to 0x1C. I only recognize two of them,
several always seem to be present {0x1,0x2,0x4-0x6,0xC-0xD,
0x10-0x14,0x1A-0x1C} and some I've never seen {0x3,0x9,0xA,0xB,
0xF,0x15,0x1c}.
flag & TOC_TYPE  description
0x1C             last in list of longs, the high order byte(s?)
                 contain the length of the following string
                 data, ie length to skip to next region.
0x04             the high order three bytes represent the file offset
                 of the first header, DBX_HEAD, for this message in
                 the Message Data block. Note I have some concern
                 about this, what happens if the file is longer than
                 the 16.77 megabytes offset one can store in 3 byte
                 binary location?
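The two known types can be picked out of a TOC entry roughly as follows.
This is a sketch only. It assumes the longs start right after the 8 byte
DESC_HEAD, that the flag is the low byte of each little-endian long, and
that all three upper bytes of the 0x1C long hold the length (I'm not sure
how many of them really do). TOC_INFO, rd32 and parse_toc_entry are names
invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOC_TYPE       0x1f  /* mask for TOC entries to get type # */
#define DESC_HEAD_SIZE 8u    /* two DWORDs: lpos, length            */

/* Hypothetical result of walking the longs of one TOC entry. */
typedef struct {
    uint32_t msg_offset;  /* from type 0x04: first DBX_HEAD, 0 if absent */
    uint32_t str_offset;  /* file offset just after the type 0x1C long   */
    uint32_t str_length;  /* from type 0x1C: bytes to skip               */
} TOC_INFO;

/* Read a little-endian 32-bit value. */
static uint32_t rd32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 once the type 0x1C long is found, -1 otherwise. */
static int parse_toc_entry(const unsigned char *buf, size_t len,
                           uint32_t ent, TOC_INFO *out)
{
    assert(buf != NULL && out != NULL);
    out->msg_offset = 0;
    out->str_offset = 0;
    out->str_length = 0;

    if (ent > len || len - ent < DESC_HEAD_SIZE)
        return -1;
    if (rd32(buf + ent) != ent)            /* DESC_HEAD.lpos must match */
        return -1;

    for (size_t p = (size_t)ent + DESC_HEAD_SIZE; p + 4 <= len; p += 4) {
        uint32_t v     = rd32(buf + p);
        unsigned type  = v & TOC_TYPE;     /* low byte masked with 0x1F */
        uint32_t value = v >> 8;           /* remaining three bytes     */

        if (type == 0x04) {
            out->msg_offset = value;
        } else if (type == 0x1C) {
            out->str_offset = (uint32_t)(p + 4);
            out->str_length = value;
            return 0;
        }
    }
    return -1;                             /* ran off the end of file   */
}
```

All other types are simply skipped, since I don't know what they mean.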
After finding a flag with type 0x1C the table of contents data
extracted from the message is stored in the following order:
8 bytes, a quad word, Win32 FILETIME = time message received
NUL terminated string, contains subject data
8 bytes, a quad word, Win32 FILETIME = time message last accessed?
NUL terminated string, contains subject data (again)
NUL terminated string, contains mail server name
NUL terminated string, contains "From:" name
NUL terminated string, contains "To:" name
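The layout above could be read with something like the following sketch.
The structure TOC_STRINGS and the function names are invented for this
example, and the field meanings are only as good as the list above. The
one thing I'm sure of is the FILETIME arithmetic: a FILETIME counts 100
nanosecond ticks since 1601-01-01 UTC, which is 11644473600 seconds before
the Unix epoch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of the data that follows the type 0x1C long. */
typedef struct {
    uint64_t    received;   /* FILETIME, time message received       */
    const char *subject;
    uint64_t    accessed;   /* FILETIME, time last accessed?         */
    const char *subject2;   /* subject data (again)                  */
    const char *server;     /* mail server name                      */
    const char *from;       /* "From:" name                          */
    const char *to;         /* "To:" name                            */
} TOC_STRINGS;

/* Take an 8 byte little-endian quad word at *pos and advance *pos. */
static int take_time(const unsigned char *buf, size_t len, size_t *pos,
                     uint64_t *out)
{
    if (*pos > len || len - *pos < 8)
        return -1;
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | buf[*pos + (size_t)i];
    *out = v;
    *pos += 8;
    return 0;
}

/* Take a NUL terminated string at *pos and advance *pos past the NUL. */
static const char *take_str(const unsigned char *buf, size_t len, size_t *pos)
{
    if (*pos >= len)
        return NULL;
    const unsigned char *s = buf + *pos;
    const unsigned char *nul = memchr(s, 0, len - *pos);
    if (nul == NULL)
        return NULL;
    *pos = (size_t)(nul - buf) + 1;
    return (const char *)s;
}

/* Returns 0 on success, -1 if the data ends early. Strings point into buf. */
static int parse_toc_strings(const unsigned char *buf, size_t len,
                             size_t pos, TOC_STRINGS *out)
{
    assert(buf != NULL && out != NULL);
    if (take_time(buf, len, &pos, &out->received) != 0) return -1;
    if ((out->subject = take_str(buf, len, &pos)) == NULL) return -1;
    if (take_time(buf, len, &pos, &out->accessed) != 0) return -1;
    if ((out->subject2 = take_str(buf, len, &pos)) == NULL) return -1;
    if ((out->server = take_str(buf, len, &pos)) == NULL) return -1;
    if ((out->from = take_str(buf, len, &pos)) == NULL) return -1;
    if ((out->to = take_str(buf, len, &pos)) == NULL) return -1;
    return 0;
}

/* Convert a FILETIME (100 ns ticks since 1601) to Unix seconds. */
static int64_t filetime_to_unix(uint64_t ft)
{
    return (int64_t)(ft / 10000000u) - 11644473600LL;
}
```

The starting position would be the file offset just past the long with
flag type 0x1C.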
The terminating 0 byte is always there, but some of these strings
may be empty. The length associated with flag type 0x1C is
the number of bytes to skip to get to the location immediately following
the NUL for the final "To:" string above. More binary data follows
whose purpose is unknown.
Sample Code:
Listing of archive : wdbx.com
Name Original Packed Ratio Date Time Attr Type CRC
-------------- -------- -------- ------ -------- -------- ---- ----- ----
DBX-FMT.HTM 17825 7128 40.0% 05-12-16 09:47:24 a--w -lh5- 80B7
DBX.C 22946 7473 32.6% 00-06-22 15:18:32 a--w -lh1- A275
DBX.EXE 34934 16104 46.1% 00-06-22 15:00:06 a--w -lh1- 8BA7
-------------- -------- -------- ------ -------- --------
3 files 75705 30705 40.6% 05-12-16 09:55:26
(note dbx-fmt.htm size and crc are a little off, can't include here
and have it come out right!)
It's an MS-DOS mode 16-bit program.
It really just demonstrates what I'm talking about and lets one
validate the format of a file. The command line arguments
are displayed as indicated below if it is executed with no arguments:
usage: dbx [-b] [-c#] [-d] [-f] [-m[#]] [-t]
optional args mutually exculsive, default dumps all headers
-b display known block allocations in file
-c# dump contents TOC list block at hex offset = # points to
-d just displays gross file composition
-f# scan for headers with a specific flag value, # in hex
-m scan for message blocks flag = 0x200 and count messages
-m# display message data from block starting at hex #
-t to test for TOC, lists all block entries
All program output goes to standard output, you must redirect it
to a file if you want to capture it. The default option displays
all headers, ie 4 byte long locations in file that are equal to
the file offset at that location. They are displayed as if they
were all DBX_HEAD, if you are in the Descriptive Data section
the flags is really the length, and the other two entries should be
ignored.
-f# is similar to the default mode, but only searches for
the specified flag value. Nice to find where flags=0x0
are located.
-m is similar to -f200, but has some extra logic to detect if
message blocks are extra long, or not contiguous. Finds
all DBX_HEAD.flags == 0x200.
-m# traces a single message starting at the offset given and
continuing via the DBX_HEAD.next pointer until its 0.
Just the DBX_HEAD data is displayed to trace a single message.
-b simplistically shows how blocks were allocated. It does not
check the file header for allocation sizes, but assumes the
fixed sizes I've seen in the past.
-d looks at and displays information from the fixed offsets in
the File header block.
-t test TOC pointer list. Assumes doubly linked list format
described above and displays offsets to all DESC_HEAD entries
found in the pointer lists.
-c# displays DESC_HEAD data for one message. It must be
given one of the offsets from the -t option above and
parses the descriptive strings from the appropriate location.
The file offset of the starting DBX_HEAD in the Message
Data block is also shown.
Unknown:
A lot. The above is a step toward the entire format, but there
are big holes. Using the -m option one finds message chains
that jump around, sometimes going backwards in file, then
forward again. Pretty clearly the system can regain space lost
in deleted messages. There must be an available block list somewhere.
I suspect it's near or in the TOC pointer list and the DBX_HEAD.flags ==
0x0 area.
I'm not sure I've got the organization of the TOC pointer list
right. I'd love to have someone with some larger files check this,
or send me a copy of their region which I call the TOC pointer
list where DBX_HEAD.flags == 0x0.