After a long absence from OS development I recently returned to it - and
it feels great to be doing this stuff again!!! The reason for the
absence (other than life!) was that I was developing a language to write
the OS in.
And here I am, just randomly poking into this group again, not having been around for a while.
Well, I now have a working compiler and a language which, while it is
currently primitive, is usable.
C is pretty hard to beat, IMO.
I have my own languages as well, but C is pretty hard to beat for
low-level tasks even when one has the ability to write one's own
compilers (if anything, doing so gives more insight into why C is the
way it is).
Well, and also into why a few of the "less well received" C99 features,
such as VLAs, are, in practice, "kinda awful" (one can implement support
for them, but making them "not suck" is harder than might otherwise be
implied).
At this point it seems to me that there's an opportunity for a win-win.
If I use the language to work on the OS then that will let me make
progress on the OS while at the same time using the experience to
provide useful feedback on how the language should develop.
So that's what I plan to do, and the above is background to my query:
what format of image file is best for the OS itself?
My compiler currently emits x86 32-bit code (and its output is readily
linkable with other code which can be written in 32-bit assembly) so
pmode is my target. I have enough 16-bit asm code to load the bytes of
an image and switch to pmode, but the next problem is what format the
image itself should take. The options I can see:
1. Flat binary
A 32-bit flat binary would be easy to invoke as I could just jump to its
first byte. It would not be relocatable but it looks as though I could
change my compiler so that as long as I avoid globals I can emit
position-independent code - which could be handy! But I am not sure how
to create a 32-bit flat binary. My copy of ld doesn't seem to support
such an output, though maybe there's a way to persuade it.
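(GNU ld, at least, can apparently be persuaded via its "--oformat
binary" option; something like the following, where the load address
and file names are just placeholders:

ld -m elf_i386 --oformat binary -Ttext 0x100000 -o kernel.bin entry.o kernel.o

with the object containing the entry code listed first, so that it
lands at the first byte of the output. NASM itself can also emit a flat
binary directly with "-f bin", at least for pure-asm images.)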
2. ELF or PE
ELF and PE have the opposite problem. Either of them should be easy to
produce, but then I'd have to either:
2a. Extract the executable part (how?) for inclusion in the loadable image.
2b. Include the whole executable file, including the headers, and write
some asm code to parse the headers and jump to the executable part of it.
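(For 2a, GNU objcopy can apparently do the extraction; something like:

objcopy -O binary kernel.elf kernel.bin

which strips the headers and writes out just the loadable sections.)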
Or maybe there's another option. I've a feeling we've discussed this
before, but at the moment I cannot think of what we concluded. Plus, I
need to work with what my compiler can produce (32-bit NASM source),
which may be a new constraint.
So, any thoughts on what format an OS image should have?
One of my own projects, which has occupied much of the last several
years of my life, is a custom CPU ISA (mostly used on FPGA boards thus
far...).
For this, I mostly went with a tweaked PE/COFF variant:
* Omits the 'MZ' stub, as it is basically useless in this case.
** Typically the file starts with a 'PE\0\0' or 'PEL4' magic or similar.
** The MZ stub is disallowed entirely in the LZ compressed variants.
** The MZ stub may be present for uncompressed 'PE\0\0' files.
* Optional (per-image) LZ compression (typically LZ4 in this case).
** The LZ decoding is integrated with reading the image off the SDcard.
* Adds an RVA==Offset restriction.
* ...
The addition of an RVA==Offset restriction means that it is possible to
essentially just read (or unpack) the EXE into its target address and
then jump to its entry point. Though, my loader also zeroes the ".bss"
section and similar. Without the restriction, it would be necessary to
first read the binary to an intermediate location and then copy its
sections to their destination addresses.
Though, if using a generic linker which does not follow this rule, one
would need to first read into a buffer and then copy out the sections
(or do "seek and read" for each section if one has a "proper" filesystem
driver).
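As a rough sketch of the sort of loader logic the RVA==Offset rule
allows (PE32-style header offsets, no error handling; the names here
are made up, and 'read_blocks' is a stand-in for however the boot code
pulls sectors off the SDcard):

#include <stdint.h>
#include <string.h>

/* Stand-in for however the boot code reads blocks off the SDcard. */
void read_blocks(void *dst, uint32_t lba, uint32_t count);

void boot_load_pe(uint32_t lba, uint32_t n_blocks)
{
    uint8_t *base, *pe, *coff, *opt, *sec;
    uint32_t pe_ofs, entry, vsz, rva, rawsz;
    int nsec, optsz, i;

    /* With RVA==Offset, read (or LZ-unpack) the file directly to its
       link base; no per-section copy pass is needed. */
    base = (uint8_t *)0x01100000;
    read_blocks(base, lba, n_blocks);

    /* Locate 'PE\0\0': at offset 0 if the MZ stub is omitted, else
       via e_lfanew at offset 60. */
    pe_ofs = 0;
    if ((base[0] == 'M') && (base[1] == 'Z'))
        pe_ofs = *(uint32_t *)(base + 60);
    pe   = base + pe_ofs;
    coff = pe + 4;        /* COFF file header, after the signature */
    opt  = coff + 20;     /* optional header                       */

    nsec  = *(uint16_t *)(coff + 2);   /* NumberOfSections         */
    optsz = *(uint16_t *)(coff + 16);  /* SizeOfOptionalHeader     */
    entry = *(uint32_t *)(opt + 16);   /* AddressOfEntryPoint RVA  */
    sec   = opt + optsz;               /* section table            */

    /* Zero the '.bss' style tails: bytes past SizeOfRawData, up to
       VirtualSize, for each section. */
    for (i = 0; i < nsec; i++, sec += 40)
    {
        vsz   = *(uint32_t *)(sec +  8);
        rva   = *(uint32_t *)(sec + 12);
        rawsz = *(uint32_t *)(sec + 16);
        if (vsz > rawsz)
            memset(base + rva + rawsz, 0, vsz - rawsz);
    }

    /* Entry point is an RVA relative to the load base. */
    ((void (*)(void))(base + entry))();
}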
For programs within the OS, the same basic format was used, except that
the ABI splits the binary into two separate areas:
* One for '.text' and other read-only sections.
* An area for '.data' and '.bss' and similar.
The modifiable sections would then be addressed relative to a "Global
Register" (oddly enough, PE/COFF already had the fields for this; albeit
they were unused for x86/x64, mostly intended for MIPS and similar).
This allows multiple logical instances of the same program within a
single address space (without also needing multiple instances of the
".text" section). Implicitly, the ".data" section points to a table to
allow the main EXE (and any DLLs) to reload its own data sections
(typically needed for DLL exports and calls via function pointers, which
may not necessarily have the correct data section in the global register
on entry to the function).
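As a sketch of the general idea (all names here are made up for
illustration; the actual loader differs): each new logical instance
clones the data image and gets the global register pointed at its own
copy, while '.text' stays shared:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical descriptor for an already-loaded program image. */
typedef struct {
    uint8_t  *text;       /* shared read-only '.text' area        */
    uint8_t  *data_img;   /* initialized '.data' template         */
    uint32_t  data_sz;    /* size of '.data'                      */
    uint32_t  bss_sz;     /* size of '.bss' (zeroed per instance) */
    uint32_t  entry_ofs;  /* entry offset within '.text'          */
} ProgImage;

/* Entry thunk: the instance's data area would be loaded into the
   global register before jumping to the entry point (here it is just
   passed as an argument). */
typedef void (*prog_entry)(uint8_t *global_reg);

void spawn_instance(ProgImage *img)
{
    /* Clone '.data' and zero '.bss'; '.text' is not duplicated. */
    uint8_t *data = malloc(img->data_sz + img->bss_sz);
    memcpy(data, img->data_img, img->data_sz);
    memset(data + img->data_sz, 0, img->bss_sz);

    /* Enter with the global register pointing at this instance. */
    ((prog_entry)(img->text + img->entry_ofs))(data);
}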
Base relocations could be performed easily enough, but are N/A for
loading up an OS image. The image needs to have its base set to its
starting address.
* In my case, this is generally 01100000 (or 17MB)
* This is 1MB past the start of DRAM, 01000000 (16MB)
** The first 1MB of DRAM is generally reserved for stacks and similar.
Base relocations are typically applied (once) when loading up program
binaries and DLLs though. These fix up the binary both for the load
address, and also its index into the table used for reloading the global
pointer. The base reloc format is basically the same as in normal
PE/COFF, with a few minor tweaks and extensions.
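For reference, applying the base relocations is basically a walk over
'.reloc' style blocks, adding the load delta at each fixup site. A
sketch (function and variable names made up), assuming only the
standard 32-bit HIGHLOW entries, with the tweaks and extensions not
shown:

#include <stdint.h>

/* Apply PE-style base relocs: 'rel'/'rel_end' bound the reloc data,
   'base' is the actual load address, 'delta' is the difference
   between the actual and preferred base. */
void apply_base_relocs(uint8_t *base, uint8_t *rel, uint8_t *rel_end,
    uint32_t delta)
{
    uint32_t page_rva, blk_sz;
    uint16_t *fix;
    int i, n, type, ofs;

    while (rel < rel_end)
    {
        page_rva = *(uint32_t *)(rel + 0);  /* page this block fixes  */
        blk_sz   = *(uint32_t *)(rel + 4);  /* block size, inc header */
        if (blk_sz < 8)
            break;                          /* terminator / malformed */

        fix = (uint16_t *)(rel + 8);
        n   = (blk_sz - 8) / 2;
        for (i = 0; i < n; i++)
        {
            type = fix[i] >> 12;            /* fixup type             */
            ofs  = fix[i] & 0x0FFF;         /* offset within the page */
            if (type == 3)                  /* IMAGE_REL_BASED_HIGHLOW */
                *(uint32_t *)(base + page_rva + ofs) += delta;
        }
        rel += blk_sz;
    }
}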
Addresses are generally:
* 00000000..0000FFFF: ROM, Boot SRAM
* 00010000..000FFFFF: Special Hardware Pages (Fixed Contents).
** There is a page of 00 bytes, a page of NOPs, BREAK, RTS, ...
** These are partly intended for use by virtual memory and similar.
* 01000000..7FFFFFFF: DRAM Range (RAM may wrap within this space).
* 80000000..EFFFFFFF: Reserved for now
* F0000000..FFFFFFFF: MMIO (Low Range)
Ranges above the 4GB mark also exist (47:32):
* 0001..7FFF: Virtual Address Space
** Virtual memory generally goes in this range.
** Stuff below 4GB is physically mapped.
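Or, restated as the sort of header one might write for it (names made
up for illustration):

/* Physical memory map, per the ranges above. */
#define ROM_BASE     0x00000000UL   /* ROM, Boot SRAM               */
#define HWPAGE_BASE  0x00010000UL   /* fixed-content hardware pages */
#define DRAM_BASE    0x01000000UL   /* start of DRAM (16MB)         */
#define OSIMG_BASE   0x01100000UL   /* OS image base (17MB)         */
#define MMIO_BASE    0xF0000000UL   /* MMIO low range               */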
In the case of boot-loading, the PE/COFF image is treated as (more or
less) functionally equivalent to a flat binary, just with the entry
point pulled from the PE/COFF header.
For the optional LZ scheme:
* The image is compressed in terms of 1K blocks, starting at 1K.
* The first 1K remains uncompressed (PE headers go here).
* Decoding happens in terms of discrete 1K blocks.
** This avoids the need for using a large intermediate buffer.
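In outline, the unpacking loop is something like the sketch below,
where 'lz4_decode_block' is a stand-in prototype for the block decoder
(a fuller sketch of which appears further down), decoding until the
final block signals EOB:

#include <stdint.h>
#include <string.h>

/* Stand-in: decodes one block of up to 1K of output, returns the
   updated input pointer, sets *eob on the final block. */
uint8_t *lz4_decode_block(uint8_t *dst, uint8_t *dst_end,
    uint8_t *src, int *eob);

/* Unpack a compressed image: the first 1K (PE headers) is stored
   uncompressed, the rest decodes as discrete 1K blocks. */
void unpack_image(uint8_t *dst, uint8_t *src)
{
    int eob = 0;
    memcpy(dst, src, 1024);
    dst += 1024;
    src += 1024;
    while (!eob)
    {
        src = lz4_decode_block(dst, dst + 1024, src, &eob);
        dst += 1024;
    }
}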
In my case, I experimented with several LZ schemes:
* LZ4
** Base encoding is very similar to the standard LZ4.
*** However, absent any headers or file packaging (as in ".lz4" files).
** A distance of 0 is a special "escape case"; generally used for EOB.
*** This typically happens once, at the end of the PE image.
* LZ4LLB
** Tweaked version of LZ4 which adds match/literal length restrictions.
** Adds special case to deal with long runs of literal bytes (no match).
** Allows a slightly faster/simpler decoder, but worse compression.
* RP2 (a custom LZ scheme)
** Custom format, but remains byte-oriented (like LZ4)
** Not generally used for binaries, as it does worse than LZ4 in this case.
For the case of binaries (at least with my ISA, though it also appears
to hold true for x86-64 and ARM), LZ4 came out ahead of the various
options I had tried.
My RP2 format tends to do better for general purpose data compression,
however its design assumes that match length and distance are positively
correlated (it uses a unary coded match-format selector, with several
combinations of length and distance bits as bit-packed values), eg:
* dddddddd-dlllrrr0 (l=3..10, d=0..511, r=0..7)
* dddddddd-dddddlll-lllrrr01 (l=4..67, d=0..8191, r=0..7)
* dddddddd-dddddddd-dlllllll-llrrr011 (l=4..515, d=0..131071, r=0..7)
* rrrr0111 (Raw Bytes, r=(r+1)*8, 8..128)
* rrr01111 (Long Match, *1)
* rr011111 (r=1..3 bytes, 0=EOB)
* rrrrrrrr-r0111111 (Long Raw, r=(r+1)*8, 8..4096)
So, typically, the match formats only support 0..7 literal bytes, with
longer runs of literal bytes using explicit encodings.
On my ISA, RP2 can also beat out LZ4 in terms of decode speed, however
on x86 and x86-64, LZ4 tends to be faster.
LZ4 uses a fixed-length (16-bit) distance over a 64K sliding window,
which, as it happens, better matches the patterns typically seen in
program binaries (where there seems to be no particular correlation
between match length and distance, with lots of short and medium
matches spread more-or-less at random within the sliding window).
Eg, quick and dirty (from memory):
* Tag Byte (rrrr-llll), Raw=0..14, Length=4..18
** Values of 15 in either field encode a longer length.
*** 00..FE: 0..254, added to preceding length.
*** FF, add 255 to length, read another length byte.
With a match structured like (IIRC):
* Token [LitLen] LiteralBytes* Distance [MatchLen]
** Distance encoded as a 16-bit word (little endian).
Traditionally, in LZ4 the EOB case was encoded by running into the end
of the input buffer. Supporting Distance==0 as an escape case deals
better with input buffers without an explicit length.
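Putting the above together, a minimal decoder for this (standard-format
LZ4 sequences plus the Distance==0 EOB escape; no bounds checking, so a
sketch rather than hardened code, and the function name is made up):

#include <stdint.h>

/* Decode LZ4 sequences until the Distance==0 EOB escape is hit.
   Returns a pointer just past the consumed input. */
uint8_t *lz4_decode(uint8_t *dst, uint8_t *src)
{
    int tok, lit, mlen, dist, b;
    uint8_t *ref;

    while (1)
    {
        tok  = *src++;
        lit  = (tok >> 4) & 15;    /* literal run length        */
        mlen = (tok & 15) + 4;     /* match length (4..18)      */

        if (lit == 15)             /* extended literal length   */
            { do { b = *src++; lit += b; } while (b == 255); }
        while (lit--)
            *dst++ = *src++;       /* copy the literal bytes    */

        dist = src[0] | (src[1] << 8);  /* 16-bit LE distance   */
        src += 2;
        if (!dist)
            break;                 /* Distance==0: end of block */

        if ((tok & 15) == 15)      /* extended match length     */
            { do { b = *src++; mlen += b; } while (b == 255); }

        ref = dst - dist;          /* byte-wise copy, so overlapping
                                      (RLE-like) matches work   */
        while (mlen--)
            *dst++ = *ref++;
    }
    return src;
}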
In the binaries, the "hits end of buffer" scheme was used for encoding
within most of the 1K blocks, and the explicit EOB case was used for the
final block. Arguably, could have also used 512B blocks (a single
sector), but 1K worked.
The LZ4LLB case was further tweaked:
* Match and Literal length cases were limited to a single byte.
* A match with (Literal!=0, Dist=0, Length=4)
** Encodes an EOB (as in other variant)
* A match with (Literal!=0, Dist=0, Length=5)
** Encodes a run of literal bytes with no match copy operation.
Though, if one has a decoder which does both formats, there was little
advantage to LZ4LLB. Likewise, having a dedicated decoder gave (at best)
a fairly small speed increase.
The PE loaders in my case also make use of checksums, albeit using a
different checksum algorithm from that used in normal PE/COFF.
This was mostly because the original checksum algo was "very weak" and
could miss the sorts of damage which could be done by a misbehaving LZ
decoder.
More or less:
* Keep a running sum of the current values (32-bit DWORDs);
* Keep a running sum of the preceding sums;
* Twiddle and XOR them together at the end (32-bit result).
Though, in my ISA this is done as 4 parallel sets of sums, as this was
faster in my case (though it does produce a different checksum than a
single set of sums would).
The addition of a second sum-of-sums significantly increases its error
detection ability with only a minor increase in computational cost.
The single set of sums option would likely be faster for x86 or x86-64
due to these ISAs having a smaller GPR space (running 4 sets of sums in
parallel would eat more registers than are available in the ISA).
So, a simplified "x86 friendly" variant would probably be, say:
#include <stdint.h>
#include <stddef.h>

uint32_t cksum32(uint8_t *buf, size_t sz)
{
    uint32_t *cs, *cse;
    uint32_t sum1, sum2;
    cs=(uint32_t *)(buf);
    cse=(uint32_t *)(buf+sz);   /* assumes sz is a multiple of 4 */
    sum1=0; sum2=0;
    while(cs<cse)
    {
        sum1+=*cs++;            /* running sum of the values     */
        sum2+=sum1;             /* running sum of the sums       */
    }
    return(sum1^sum2);          /* combine the two sums          */
}
Though the "actual version" uses 64-bit sums (of 32-bit values) and then
adds the high half back to the low-half at the end, which does help with
its error-detection rate (but, 32-bit x86 doesn't really have the
register space for this, but this property could be emulated with, say
"LODSD; ADD ECX, EAX; ADC ECX, 0; ADD EDX, ECX; ADC EDX, 0;" or similar,
which works within the limits of 32-bit accumulators).
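In C, that variant would look something like the sketch below (name
made up; the exact fold-at-the-end detail may differ from the actual
version):

#include <stdint.h>
#include <stddef.h>

/* Variant with 64-bit sums of 32-bit values; the high halves are
   folded back into the low halves at the end (similar in effect to
   an end-around carry), improving the error-detection rate. */
uint32_t cksum64(uint8_t *buf, size_t sz)
{
    uint32_t *cs, *cse;
    uint64_t sum1, sum2;
    cs=(uint32_t *)(buf);
    cse=(uint32_t *)(buf+sz);
    sum1=0; sum2=0;
    while(cs<cse)
    {
        sum1+=*cs++;
        sum2+=sum1;
    }
    /* fold twice so any carry out of the first fold is absorbed */
    sum1=(sum1>>32)+(uint32_t)sum1;  sum1=(sum1>>32)+(uint32_t)sum1;
    sum2=(sum2>>32)+(uint32_t)sum2;  sum2=(sum2>>32)+(uint32_t)sum2;
    return((uint32_t)(sum1^sum2));
}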
Though, a major incentive for binary compression is that the SDcard is
fairly slow (particularly noticeable in simulation; IRL, ~600 kB/s
isn't too bad, though FAT filesystem overheads also take their cut),
and so compressing the binary can be used to accelerate loading times
(with decompression speed being somewhat higher than the IO speed).
Partly though, this is due to the use of polling IO. In the current
design, the MMIO interface can move multiple bytes at a time. Originally
it was a single byte at a time, and the IO speed limit was somewhat
slower (~ 100-200 kB/s).
Saving a few (virtual) seconds in a Verilog simulation running ~ 1000x
slower than real-time is quite noticeable.