Discussion:
Format for the OS image
James Harris
2022-01-05 16:39:07 UTC
On 05/01/2022 15:40, James Harris wrote:

...
   What formats of image file are best for the OS itself?
...
Or maybe there's another option. I've a feeling we've discussed this
before but at the moment I cannot think of what we concluded. Plus, I
need to work with what my compiler can produce (32-bit Nasm) which may
be a new constraint.
I should say I remember someone (Alex?) since long ago espousing a
certain .exe format as being very easy to relocate but I am not sure
whether it was suitable for 32-bit code, nor which linker I would need
to use to produce it. My development environment is Ubuntu Linux.

I guess that maybe .exe has been supplanted by the 32-bit PE format -
and PE may be a good way to go given that it's required by UEFI - but,
again, feedback would be welcome.
--
James Harris
wolfgang kern
2022-01-05 18:54:12 UTC
Post by James Harris
    What formats of image file are best for the OS itself?
Or maybe there's another option. I've a feeling we've discussed this
before but at the moment I cannot think of what we concluded. Plus, I
need to work with what my compiler can produce (32-bit Nasm) which may
be a new constraint.
I should say I remember someone (Alex?) since long ago espousing a
certain .exe format as being very easy to relocate but I am not sure
whether it was suitable for 32-bit code, nor which linker I would need
to use to produce it. My development environment is Ubuntu Linux.
I guess that maybe .exe has been supplanted by the 32-bit PE format -
and PE may be a good way to go given that it's required by UEFI - but,
again, feedback would be welcome.
I prefer a never fragmented (iow: always consecutive) Flat Binary image,
and as all M$-forms are prone to become distributed all over the disk
I'd avoid using any of their formats. Loonix may act quite similar...

Even though we are now forced to use UEFI and the FAT32 boot code, I'd have
all my OS including loader stages as a block of consecutive sectors on disk.
So all I need to know then is the LBA number of the start sector.
This needs a certain (stupidly easy) tool to "format" such an OS, either
manually or fully automatically.
__
wolfgang
James Harris
2022-01-08 12:30:51 UTC
Post by wolfgang kern
    What formats of image file are best for the OS itself?
...
Post by wolfgang kern
I prefer a never fragmented (iow: always consecutive) Flat Binary image,
and as all M$-forms are prone to become distributed all over the disk
I'd avoid using any of their formats. Loonix may act quite similar...
Even we are now forced to use UEFI and the FAT32 boot code, I'd have all
my OS including loader stages as a block of consecutive sectors on disk.
So all I need to know is the LBA-Number of the start sector then.
This needs a certain (stupid easy) tool to "format" such an OS either
manually or full featured autonome.
IIRC you load your OS image to 0x7e00 although you might move it later.
Either way, does it always run from a specific location (in which case
you won't need any fixups)?

I cannot load the image to a fixed location as I want to be able to
adapt to different machine configs so it looks as though I have no
choice but to relocate it after reading the file into memory.
--
James Harris
wolfgang kern
2022-01-08 20:10:53 UTC
Post by James Harris
Post by wolfgang kern
    What formats of image file are best for the OS itself?
...
Post by wolfgang kern
I prefer a never fragmented (iow: always consecutive) Flat Binary image,
and as all M$-forms are prone to become distributed all over the disk
I'd avoid using any of their formats. Loonix may act quite similar...
Even we are now forced to use UEFI and the FAT32 boot code, I'd have
all my OS including loader stages as a block of consecutive sectors on
disk.
So all I need to know is the LBA-Number of the start sector then.
This needs a certain (stupid easy) tool to "format" such an OS either
manually or full featured autonome.
IIRC you load your OS image to 0x7e00 although you might move it later.
Either way, does it always run from a specific location (in which case
you won't need any fixups)?
Yes I let the BIOS load to 7e00 beginning with the sector after the MBR,
so my whole OS-image is a single block starting with the MBR (==VBR).
And my code starts already in the middle of the MBR and jumps over the
MBR+VBR-data (196 byte).
Post by James Harris
I cannot load the image to a fixed location as I want to be able to
adapt to different machine configs so it looks as though I have no
choice but to relocate it after reading the file into memory.
I too have the option to boot several variants within partitions,
so the boot code "calculates" (adds one to) the already known MBR LBA
before asking the BIOS to load the rest.

But I then always move the 32-bit part to the same address 1_0000_0000
because there isn't enough consecutive space below.

Earlier versions of my OS used self-relocation, and they were small
enough to fit into the first 640K. Not any more, for a while now ...
__
wolfgang
James Harris
2022-01-08 22:44:07 UTC
On 08/01/2022 20:10, wolfgang kern wrote:

...
Post by wolfgang kern
Earlier versions of my OS used self-relocation, and they were small
enough to fit into the first 640K. Not more since a while ...
What format did you use for relocatable files? AISI there are two choices:

1. Design your own format. Then relocation would be easy but you'd need
some way to convert to that form from build tools - compiler/assembler -
or to produce your own build tools.

2. Use a predefined format. Then relocation would require parsing the
structure but (bugs permitting!) it would be easy to produce.


FWIW, I am looking to use option 2 ATM but I have all kinds of
interesting ideas for matters related to option 1. :-)
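
To make option 1 concrete, here is a minimal sketch in C of why relocation gets easy with a home-made format. Everything in it is hypothetical and invented for illustration (the `simple_hdr` layout, its field names and `simple_relocate`); it is not a format anyone in this thread has posted. It assumes the image was linked for base address 0, that every fixup is a 32-bit absolute address, and that the loader runs on the same little-endian x86 machine as the image.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical header, invented for this sketch (not a real format). */
struct simple_hdr {
    uint32_t image_size;   /* bytes of code and data in the image */
    uint32_t fixup_count;  /* number of entries in the fixup table */
};

/* Add load_base to each 32-bit field named by the fixup table.
 * fixups[i] is a byte offset into image. Returns 0 on success,
 * -1 if a fixup would fall outside the image. */
int simple_relocate(const struct simple_hdr *hdr, const uint32_t *fixups,
                    uint8_t *image, uint32_t load_base)
{
    if (hdr->image_size < 4)
        return hdr->fixup_count == 0 ? 0 : -1;

    for (uint32_t i = 0; i < hdr->fixup_count; i++) {
        uint32_t off = fixups[i];
        uint32_t value;

        if (off > hdr->image_size - 4)
            return -1;
        memcpy(&value, image + off, sizeof value);  /* unaligned-safe read */
        value += load_base;
        memcpy(image + off, &value, sizeof value);
    }
    return 0;
}
```

The hard part is not this loop but, as noted above, getting the build tools to emit the fixup table in the first place.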
--
James Harris
James Harris
2022-02-07 13:06:44 UTC
Post by James Harris
...
Post by wolfgang kern
Earlier versions of my OS used self-relocation, and they were small
enough to fit into the first 640K. Not more since a while ...
1. Design your own format. Then relocation would be easy but you'd
need some way to convert to that form from build tools -
compiler/assembler - or to produce your own build tools.
2. Use a predefined format. Then relocation would require parsing the
structure but (bugs permitting!) it would be easy to produce.
FWIW, I am looking to use option 2 ATM but I have all kinds of
interesting ideas for matters related to option 1. :-)
My way of self relocation was just one single variable on the very start
of the image in RAM (I put all other rd/wr data anyway 512 bytes upfront
code, similar to a PSP in DOS), so all internal and also external access
just added this to reference anything within code and data by SIB modes.
I don't get that. If you had instructions like

mov ebx, [0x7c00] ;Get pointer from start of image

then you would need to know where the image started (0x7c00) ... and if
you knew where the image started you wouldn't need relocation.
more previous versions used a relative CALL-Null/POP sequence to figure
where it was loaded, but I soon found the better solution as said above.
I could see value in loading a pointer from a fixed location in memory,
e.g.

mov ebx, [0x1004] ;Get pointer to globals

but not from the start of a movable image.
--
James Harris
wolfgang kern
2022-02-08 13:49:56 UTC
Post by James Harris
Post by James Harris
...
Post by wolfgang kern
Earlier versions of my OS used self-relocation, and they were small
enough to fit into the first 640K. Not more since a while ...
1. Design your own format. Then relocation would be easy but you'd
need some way to convert to that form from build tools -
compiler/assembler - or to produce your own build tools.
2. Use a predefined format. Then relocation would require parsing the
structure but (bugs permitting!) it would be easy to produce.
FWIW, I am looking to use option 2 ATM but I have all kinds of
interesting ideas for matters related to option 1. :-)
My way of self relocation was just one single variable on the very
start of the image in RAM (I put all other rd/wr data anyway 512 bytes
upfront code, similar to a PSP in DOS), so all internal and also
external access just added this to reference anything within code and
data by SIB modes.
I don't get that. If you had instructions like
  mov ebx, [0x7c00]  ;Get pointer from start of image
then you would need to know where the image started (0x7c00) ... and if
you knew where the image started you wouldn't need relocation.
more previous versions used a relative CALL-Null/POP sequence to
figure where it was loaded, but I soon found the better solution as
said above.
I could see value in loading a pointer from a fixed location in memory,
e.g.
  mov ebx, [0x1004]  ;Get pointer to globals
but not from the start of a movable image.
OK, we know that the (old yet) BIOS starts our code at 7C000.
So all we have to do is to make sure that all our references
either align to (flat) 0:7c00 or (seg:offset) 7c00:0000.
My 16 bit MBR starts with a hard coded far jump to 7c00:0040
to leave some room for variables and align to cache bound.

And for all the stuff I load after the initial boot, self-relocation
is done just by remembering the location where I put it and storing
this (32-bit flat xxxx) value into the image,
e.g. a first local data pointer init: mov esi,xxxx
aka: BE xx xx xx xx
This is also stored into my MEM-LISTING to be used for external
and inter-instance referencing.
It works the same if code modules were moved, which I never did.
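
A rough C sketch of that store-the-address-into-the-image step, assuming the loader knows the offset of the `mov esi, imm32` instruction (opcode BE followed by a 32-bit little-endian immediate). The function name `patch_mov_esi` and its parameters are made up for illustration and are not taken from the poster's loader.

```c
#include <stddef.h>
#include <stdint.h>

/* Write load_addr into the immediate of a "mov esi, imm32" (BE xx xx xx xx)
 * located at insn_offset inside image. Returns 0 on success, -1 if the
 * instruction does not fit in the image or the opcode byte is not BE. */
int patch_mov_esi(uint8_t *image, size_t image_size,
                  size_t insn_offset, uint32_t load_addr)
{
    if (image_size < 5 || insn_offset > image_size - 5)
        return -1;
    if (image[insn_offset] != 0xBE)
        return -1;

    /* Store the immediate byte by byte, least significant byte first. */
    image[insn_offset + 1] = (uint8_t)(load_addr);
    image[insn_offset + 2] = (uint8_t)(load_addr >> 8);
    image[insn_offset + 3] = (uint8_t)(load_addr >> 16);
    image[insn_offset + 4] = (uint8_t)(load_addr >> 24);
    return 0;
}
```

Checking the opcode byte before patching is a cheap guard against passing the wrong offset.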

Previous versions used: call 0000 | pop esi
E8 00000000
5E must be aware of the five bytes off here.
__
wolfgang
James Harris
2022-02-10 13:17:16 UTC
...
Post by wolfgang kern
Post by James Harris
I don't get that. If you had instructions like
   mov ebx, [0x7c00]  ;Get pointer from start of image
then you would need to know where the image started (0x7c00) ... and
if you knew where the image started you wouldn't need relocation.
more previous versions used a relative CALL-Null/POP sequence to
figure where it was loaded, but I soon found the better solution as
said above.
I could see value in loading a pointer from a fixed location in
memory, e.g.
   mov ebx, [0x1004]  ;Get pointer to globals
but not from the start of a movable image.
OK, we know that the (old yet) BIOS starts our code at 7C000.
I take it you mean address 7c00 (31k).
Post by wolfgang kern
So all we have to do is to make sure that all our references
either align to (flat) 0:7c00 or (seg:offset) 7c00:0000.
Yes (though seg 07c0).
Post by wolfgang kern
My 16 bit MBR starts with a hard coded far jump to 7c00:0040
to leave some room for variables and align to cache bound.
And all the stuff I load after the initial boot, self-relocating
is just done by remember the location where I put it and store
this (32 bit flat xxxx) value into the image.
e.g.: a first local data pointer init   mov esi,xxxx
aka: BE xx xx xx xx
this is also stored into my MEM-LISTING to be used for external
and inter instance referencing.
works the same if code modules were moved, what I never did.
Previous versions used: call 0000 | pop esi
E8 00000000
5E          must be aware of the five bytes off here.
That's basically what I ended up doing. In fact, my code includes
sequences such as

call _ip_get
add eax, label - $

where $ is the address of the add instruction, label is the intended
label and _ip_get is

mov eax, [esp] ;Fetch return EIP
ret

I did it that way to match calls with returns so as not to upset the RSB
although I gather that E8 0000_0000 is recognised as an idiom in many
CPUs and doesn't cause any problems with branch prediction.

That approach wouldn't work where there are instances at different
addresses relative to EIP but it works fine for the single-instance case
of booting.
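
C cannot portably read EIP, so the following sketch only models the address arithmetic behind the two sequences discussed here. The function names and the example numbers are assumptions made for illustration. `runtime_address` corresponds to `call _ip_get` followed by `add eax, label - $`; `base_from_call_pop` corresponds to `call 0000 | pop esi`, including the five-byte correction for the length of the E8 instruction.

```c
#include <stdint.h>

/* call _ip_get returns the address of the following add instruction
 * (return_eip). Adding the link-time distance (label_link - add_link)
 * yields the run-time address of label wherever the image was loaded. */
uint32_t runtime_address(uint32_t return_eip,
                         uint32_t label_link, uint32_t add_link)
{
    return return_eip + (label_link - add_link);
}

/* "call 0000 | pop esi": the popped value is the address just past the
 * five-byte E8 00000000 instruction. Subtracting those five bytes and
 * the call's offset within the image gives the image's load address. */
uint32_t base_from_call_pop(uint32_t popped_eip, uint32_t call_offset)
{
    return popped_eip - 5u - call_offset;
}
```

For example, an image linked at 0 with the add at 0x105 and the label at 0x300, loaded at 0x00200000, gives a return EIP of 0x00200105 and a computed label address of 0x00200300.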
--
James Harris
Alexei A. Frounze
2022-01-07 02:44:04 UTC
Post by James Harris
I should say I remember someone (Alex?) since long ago espousing a
certain .exe format as being very easy to relocate but I am not sure
whether it was suitable for 32-bit code,
The 32-bit PE image is easy to relocate (the 64-bit PE should be easy too),
there's only one x86-32 kind of relocation: IMAGE_REL_BASED_HIGHLOW
(=3) in the .reloc section. That is, if the base address differs from the
one in the image header, you add a constant to all locations enumerated in
the .reloc section.
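
A minimal sketch in C of the add-a-constant step, under some assumptions: the image has already been copied into memory with its sections at their RVAs, `reloc` points at the base relocation data, and `delta` is the actual load address minus the ImageBase from the header. The function and helper names are invented for this example. It handles only padding entries (type 0) and IMAGE_REL_BASED_HIGHLOW (type 3) and rejects anything else.

```c
#include <stdint.h>

#define REL_BASED_ABSOLUTE 0u  /* padding entry, nothing to patch */
#define REL_BASED_HIGHLOW  3u  /* add delta to a 32-bit field */

static uint16_t rd16(const uint8_t *p) { return (uint16_t)(p[0] | (p[1] << 8)); }
static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
static void wr32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;         p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16); p[3] = (uint8_t)(v >> 24);
}

/* Walk the relocation blocks. Each block is: page RVA (4 bytes), block
 * size in bytes including this header (4 bytes), then 16-bit entries
 * with the type in the top 4 bits and a page offset in the low 12.
 * Returns 0 on success, -1 on malformed data or an unsupported type. */
int pe_apply_relocs(uint8_t *image, uint32_t image_size,
                    const uint8_t *reloc, uint32_t reloc_size,
                    uint32_t delta)
{
    uint32_t pos = 0;

    while (reloc_size - pos >= 8) {
        uint32_t page_rva = rd32(reloc + pos);
        uint32_t block_size = rd32(reloc + pos + 4);

        if (block_size < 8 || block_size > reloc_size - pos)
            return -1;

        for (uint32_t i = 8; i + 2 <= block_size; i += 2) {
            uint16_t entry = rd16(reloc + pos + i);
            uint32_t type = entry >> 12;
            uint64_t rva = (uint64_t)page_rva + (entry & 0x0FFFu);

            if (type == REL_BASED_ABSOLUTE)
                continue;
            if (type != REL_BASED_HIGHLOW)
                return -1;
            if (rva + 4 > image_size)
                return -1;
            wr32(image + rva, rd32(image + rva) + delta);
        }
        pos += block_size;
    }
    return 0;
}
```

If the image ends up at the base address given in the header, `delta` is zero and the whole pass can be skipped.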

I still haven't found the minimum requirements for simple relocatable ELF
images. If I got it right, Linux kernel modules are actually objects, not images.
Fun.
Post by James Harris
nor which linker I would need
to use to produce it. My development environment is Ubuntu Linux.
I have my own linker to statically link ELF32 objects into all kinds of
executables (PE, ELF (non-relocatable), MZ EXE, a.out, flat).
See smlrl in Smaller C.

Alex
James Harris
2022-01-08 12:26:04 UTC
Post by Alexei A. Frounze
Post by James Harris
I should say I remember someone (Alex?) since long ago espousing a
certain .exe format as being very easy to relocate but I am not sure
whether it was suitable for 32-bit code,
The 32-bit PE image is easy to relocate (the 64-bit PE should be easy too),
there's only one x86-32 kind of relocation: IMAGE_REL_BASED_HIGHLOW
(=3) in the .reloc section. That is, if the base address differs from the
one in the image header, you add a constant to all locations enumerated in
the .reloc section.
A relocatable PE may be easy to relocate ... but it doesn't seem so easy
to create. :-(

Those I make seem to come without any relocation entries.

Am I correct that data_directory[5] should contain the relocations? I
see that entry's address and length as zero - despite source which
AFAICS needs to be relocated such as

mov eax, label

where label is elsewhere in the code.

That's with a PE file created by

ld -m i386pe ifile... -o ofile

FWIW objdump -x shows most sections as empty:

The Data Directory
Entry 0 00000000 00000000 Export Directory [.edata (or where ever we found it)]
Entry 1 00003000 00000014 Import Directory [parts of .idata]
Entry 2 00000000 00000000 Resource Directory [.rsrc]
Entry 3 00000000 00000000 Exception Directory [.pdata]
Entry 4 00000000 00000000 Security Directory
Entry 5 00000000 00000000 Base Relocation Directory [.reloc]
Entry 6 00000000 00000000 Debug Directory
Entry 7 00000000 00000000 Description Directory
Entry 8 00000000 00000000 Special Directory
Entry 9 00000000 00000000 Thread Storage Directory [.tls]
Entry a 00000000 00000000 Load Configuration Directory
Entry b 00000000 00000000 Bound Import Directory
Entry c 00000000 00000000 Import Address Table Directory
Entry d 00000000 00000000 Delay Import Directory
Entry e 00000000 00000000 CLR Runtime Header
Entry f 00000000 00000000 Reserved
Post by Alexei A. Frounze
I still haven't found the minimum requirements for simple relocatable ELF
images. If I got it right, Linux kernel modules are actually objects, not images.
Fun.
Despite spending time learning about PE I fear I may have to switch to
ELF if I cannot get PE relocation working.

The thing is, maybe I'm misunderstanding something. If the code contains
absolute references, as above, I cannot get how it's even sensible for
ld to create a PE which contains no relocations. Such an executable
could only ever be loaded to a certain location - which is not how I
understand PE is supposed to work.

What's more, even with the switch --dynamicbase which tells ld to allow
for ASLR the PE file still has an empty .reloc section.

Or maybe ld is doing the right thing as it's my expectation which is
wrong. Let me know if you can see what it is!

What do you see in the data directory for your PE files?
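
For checking a file without objdump, here is a hedged sketch in C that reads the relevant fields straight from a 32-bit PE header: data directory entry 5, the "relocations stripped" bit in the COFF Characteristics field, and the dynamic-base bit in DllCharacteristics. The struct and function names are invented for this example; the offsets are those of the PE32 (not PE32+) optional header, and the file is assumed to be fully read into memory.

```c
#include <stddef.h>
#include <stdint.h>

#define PE_RELOCS_STRIPPED 0x0001u  /* bit in COFF Characteristics */
#define PE_DYNAMIC_BASE    0x0040u  /* bit in DllCharacteristics */

struct pe_reloc_info {
    int relocs_stripped;   /* 1 if the linker marked relocs as removed */
    int dynamic_base;      /* 1 if the ASLR flag is set */
    uint32_t reloc_rva;    /* data directory entry 5: address */
    uint32_t reloc_size;   /* data directory entry 5: size */
};

static uint16_t rd16(const uint8_t *p) { return (uint16_t)(p[0] | (p[1] << 8)); }
static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 and fills *out for a PE32 file, -1 if the headers look wrong. */
int pe32_reloc_info(const uint8_t *file, size_t size, struct pe_reloc_info *out)
{
    if (size < 0x40 || file[0] != 'M' || file[1] != 'Z')
        return -1;

    uint32_t pe = rd32(file + 0x3C);           /* e_lfanew */
    /* signature (4) + COFF header (20) + optional header up to and
     * including data directory entry 5 (96 + 6 * 8 bytes) */
    if (pe > size || size - pe < 4 + 20 + 96 + 6 * 8)
        return -1;
    if (rd32(file + pe) != 0x00004550u)        /* "PE\0\0" */
        return -1;

    const uint8_t *coff = file + pe + 4;
    const uint8_t *opt = coff + 20;
    if (rd16(coff + 16) < 96 + 6 * 8)          /* SizeOfOptionalHeader */
        return -1;
    if (rd16(opt) != 0x010Bu)                  /* PE32 magic */
        return -1;
    if (rd32(opt + 92) < 6)                    /* NumberOfRvaAndSizes */
        return -1;

    out->relocs_stripped = (rd16(coff + 18) & PE_RELOCS_STRIPPED) != 0;
    out->dynamic_base = (rd16(opt + 70) & PE_DYNAMIC_BASE) != 0;
    out->reloc_rva = rd32(opt + 136);          /* 96 + 5 * 8 */
    out->reloc_size = rd32(opt + 140);
    return 0;
}
```

A file with the dynamic-base flag set but a zero entry 5 would match the symptom described above: the flag is there, the table is not.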
--
James Harris
James Harris
2022-01-08 22:29:45 UTC
Post by James Harris
A relocatable PE may be easy to relocate ... but it doesn't seem so easy
to create. :-(
https://groups.google.com/g/alt.os.development/c/enQJoDLLm_c/m/f-Di_fSdAQAJ
Thanks, Paul. The linked article is very interesting. I see it says that
a bug in some versions of binutils can be got round by exporting a
function - something I don't do at the moment. It's getting late now but
defining an export is something I can take a look at tomorrow.
The second solution - fixing ld, is available in executable
form (32-bit) on the PDOS distribution (ldwin.exe) at http://pdos.org
There's no reason to abandon PE. I have it working on
PDOS/386.
Cool.
--
James Harris
Alexei A. Frounze
2022-01-10 05:42:55 UTC
Post by James Harris
Post by Alexei A. Frounze
Post by James Harris
I should say I remember someone (Alex?) since long ago espousing a
certain .exe format as being very easy to relocate but I am not sure
whether it was suitable for 32-bit code,
The 32-bit PE image is easy to relocate (the 64-bit PE should be easy too),
there's only one x86-32 kind of relocation: IMAGE_REL_BASED_HIGHLOW
(=3) in the .reloc section. That is, if the base address differs from the
one in the image header, you add a constant to all locations enumerated in
the .reloc section.
A relocatable PE may be easy to relocate ... but it doesn't seem so easy
to create. :-(
Not easy because you don't have the right tools or because you haven't
yet worked out the formats on both sides (input obj and output exe)?

I simply wrote my own linker to be able to produce the different kinds
of executables I wanted. And it works great in the relatively simple
scenarios I have (statically linked executables for the most part;
Windows ones are an exception since the Windows API isn't defined
as some kind of int#/fxn# as in DOS or Linux and must be imported
from system libraries).
Post by James Harris
Those I make seem to come without any relocation entries.
Might be that you aren't asking the linker to produce them for you.
That's what MinGW does by default. There are two tweaks here:
----8<----
/*
How to compile for Windows with MinGW:
gcc hw-mingw-reloc.c -o hw-mingw-reloc.exe -Wl,--dynamicbase
*/

#include <stdio.h>

int __declspec(dllexport) main(void)
{
    printf("Hello, World! @ %p\n", (void*)&main);
    return 0;
}
----8<----

1. dllexport on main forces creation of .reloc.
2. --dynamicbase sets a flag to enable ASLR for the executable.

I'm getting different addresses printed in different runs, as expected.

My smlrl produces .reloc by default and sets the ASLR flag too.
Post by James Harris
Am I correct that data_directory[5] should contain the relocations?
Yes. But it's optional.
Post by James Harris
I
see that entry's address and length as zero - despite source which
AFAICS needs to be relocated such as
mov eax, label
where label is elsewhere in the code.
That's with a PE file created by
ld -m i386pe ifile... -o ofile
The Data Directory
Entry 0 00000000 00000000 Export Directory [.edata (or where ever we found it)]
Entry 1 00003000 00000014 Import Directory [parts of .idata]
Entry 2 00000000 00000000 Resource Directory [.rsrc]
Entry 3 00000000 00000000 Exception Directory [.pdata]
Entry 4 00000000 00000000 Security Directory
Entry 5 00000000 00000000 Base Relocation Directory [.reloc]
Entry 6 00000000 00000000 Debug Directory
Entry 7 00000000 00000000 Description Directory
Entry 8 00000000 00000000 Special Directory
Entry 9 00000000 00000000 Thread Storage Directory [.tls]
Entry a 00000000 00000000 Load Configuration Directory
Entry b 00000000 00000000 Bound Import Directory
Entry c 00000000 00000000 Import Address Table Directory
Entry d 00000000 00000000 Delay Import Directory
Entry e 00000000 00000000 CLR Runtime Header
Entry f 00000000 00000000 Reserved
Yep, it could be like that.
Post by James Harris
Post by Alexei A. Frounze
I still haven't found the minimum requirements for simple relocatable ELF
images. If I got it right, Linux kernel modules are actually objects, not images.
Fun.
Despite spending time learning about PE I fear I may have to switch to
ELF if I cannot get PE relocation working.
The thing is, maybe I'm misunderstanding something. If the code contains
absolute references, as above, I cannot get how it's even sensible for
ld to create a PE which contains no relocations. Such an executable
could only ever be loaded to a certain location - which is not how I
understand PE is supposed to work.
What's more, even with the switch --dynamicbase which tells ld to allow
for ASLR the PE file still has an empty .reloc section.
There's at least one way to force gcc to produce .reloc, one shown above.
--dynamicbase on its own is likely insufficient as it probably merely sets
the ASLR flag but it can't magically function without the relocation table.
Post by James Harris
Or maybe ld is doing the right thing as it's my expectation which is
wrong. Let me know if you can see what it is!
What do you see in the data directory for your PE files?
Like I said, my version of MinGW doesn't produce .reloc or ASLR flag
by default, but I can talk it into doing that.
OTOH, my compiler comes with its own linker that does this by default.

HTH,
Alex
James Harris
2022-02-07 13:44:37 UTC
Post by Alexei A. Frounze
Post by James Harris
Post by Alexei A. Frounze
Post by James Harris
I should say I remember someone (Alex?) since long ago espousing a
certain .exe format as being very easy to relocate but I am not sure
whether it was suitable for 32-bit code,
The 32-bit PE image is easy to relocate (the 64-bit PE should be easy too),
there's only one x86-32 kind of relocation: IMAGE_REL_BASED_HIGHLOW
(=3) in the .reloc section. That is, if the base address differs from the
one in the image header, you add a constant to all locations enumerated in
the .reloc section.
A relocatable PE may be easy to relocate ... but it doesn't seem so easy
to create. :-(
Not easy because you don't have the right tools or because you haven't
yet worked out the formats on both sides (input obj and output exe)?
It was "not easy" probably because of incompatibility between how things
work and my expectations of how they 'should' work.

For example, I was surprised (and still am) that a load module would be
compiled to run in just one fixed location and never any other. I could
understand a preferred location but not a fixed one.

Is there a particular advantage in stripping off the relocations? The
change in file size will surely be tiny and I cannot see any security
advantages to removing the relocs. Just the opposite, in fact.

(It's no longer a problem, BTW. I've worked round it. My comment about
stripped relocs is purely philosophical.)
Post by Alexei A. Frounze
I simply wrote my own linker to be able to produce the different kinds
of executables I wanted. And it works great in the relatively simple
scenarios I have (statically linked executables for the most part;
Windows ones are an exception since the Windows API isn't defined
as some kind of int#/fxn# in DOS or Linux and must be imported
from system libraries).
OK, though what is the format of the object inputs to your linker?

ISTM that whenever we have to work with systems or formats someone else
has designed we end up having to deal with lots of complexity.
Post by Alexei A. Frounze
Post by James Harris
Those I make seem to come without any relocation entries.
Might be that you aren't asking the linker to produce them for you.
...

Thanks for the info (now snipped). You have worked out how to wrangle
the tools into submission - which is what I felt like I was trying to do.

I spent some time looking at PE and then at a.out for booting but I
ended up getting my compiler to emit PIC so I can boot via a flat
binary. I expect later to have to use PE - not least because of its
association with UEFI - but it will be easier for me to load PE using
HLL code rather than assembly (which is what I need at boot time).

As I looked into it I was becoming increasingly uncertain that PE could
be made to do what I wanted. For example, for my environment the linker
expects sections to be loaded into memory at 4k alignment but they have
smaller alignment in the file. I needed the executable to run from
wherever the file happened to be loaded. It /may/ have been relocatable
as it stood but I couldn't be sure and thought maybe I was trying to get
the format to do something it was not designed for.
--
James Harris
Alexei A. Frounze
2022-02-08 02:10:40 UTC
Post by James Harris
Post by Alexei A. Frounze
Post by James Harris
Post by Alexei A. Frounze
Post by James Harris
I should say I remember someone (Alex?) since long ago espousing a
certain .exe format as being very easy to relocate but I am not sure
whether it was suitable for 32-bit code,
The 32-bit PE image is easy to relocate (the 64-bit PE should be easy too),
there's only one x86-32 kind of relocation: IMAGE_REL_BASED_HIGHLOW
(=3) in the .reloc section. That is, if the base address differs from the
one in the image header, you add a constant to all locations enumerated in
the .reloc section.
A relocatable PE may be easy to relocate ... but it doesn't seem so easy
to create. :-(
Not easy because you don't have the right tools or because you haven't
yet worked out the formats on both sides (input obj and output exe)?
It was "not easy" probably because of incompatibility between how things
work and my expectations of how they 'should' work.
For example, I was surprised (and still am) that a load module would be
compiled to run in just one fixed location and never any other. I could
understand a preferred location but not a fixed one.
If you don't have relocations, there isn't any other option.
(Position-independent executable wasn't such an option.)
Post by James Harris
Is there a particular advantage in stripping off the relocations? The
change in file size will surely be tiny and I cannot see any security
advantages to removing the relocs. Just the opposite, in fact.
True. I don't know why they weren't/aren't generated out of the box.
Kinda stupid and careless.
Post by James Harris
(It's no longer a problem, BTW. I've worked round it. My comment about
stripped relocs is purely philosophical.)
Post by Alexei A. Frounze
I simply wrote my own linker to be able to produce the different kinds
of executables I wanted. And it works great in the relatively simple
scenarios I have (statically linked executables for the most part;
Windows ones are an exception since the Windows API isn't defined
as some kind of int#/fxn# in DOS or Linux and must be imported
from system libraries).
OK, though what is the format of the object inputs to your linker?
ELF. One object format that's good enough for all 32-bit x86 needs.
And for some 16-bit needs too (there are 16-bit relocation extensions
in ELF that NASM supports and my compiler/linker depends on).
Pretty neat.
Post by James Harris
ISTM that whenever we have to work with systems or formats someone else
has designed we end up having to deal with lots of complexity.
You're somehow overcomplicating the PE format beyond what it really is.
If you ignore all those special sections and options and what not
and just stick to a few code and data sections and maybe relocations,
it's gonna be very simple.
Post by James Harris
Post by Alexei A. Frounze
Post by James Harris
Those I make seem to come without any relocation entries.
Might be that you aren't asking the linker to produce them for you.
...
Thanks for the info (now snipped). You have worked out how to wrangle
the tools into submission - which is what I felt like I was trying to do.
I spent some time looking at PE and then at a.out for booting but I
ended up getting my compiler to emit PIC so I can boot via a flat
binary. I expect later to have to use PE - not least because of its
association with UEFI - but it will be easier for me to load PE using
HLL code rather than assembly (which is what I need at boot time).
As I looked in to it I was becoming increasingly uncertain that PE could
be made to do what I wanted. For example, for my environment the linker
expects sections to be loaded into memory at 4k alignment but they have
smaller alignment in the file.
There are two alignment fields in the PE header: section alignment
and file alignment. If you set both to 4KB (there should be an option
or a way to do it via the linker script), you'll probably just get
what you need.
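
To show what the two alignment fields do, here is a simplified sketch in C that lays sections out one after another and records each section's RVA (where it should sit in memory) and file offset. The names `align_up`, `section_layout` and `layout_sections` are invented for this example, and it ignores details a real linker handles, such as sections whose size in memory differs from their size in the file.

```c
#include <stdint.h>

/* Round value up to a multiple of align (align must be a power of two). */
uint32_t align_up(uint32_t value, uint32_t align)
{
    return (value + align - 1u) & ~(align - 1u);
}

struct section_layout {
    uint32_t rva;       /* offset from the image base once loaded */
    uint32_t file_off;  /* offset of the section's data in the file */
};

/* Place n sections after the headers. RVAs advance in steps of
 * section_align, file offsets in steps of file_align. */
void layout_sections(const uint32_t *sizes, int n, uint32_t headers_size,
                     uint32_t section_align, uint32_t file_align,
                     struct section_layout *out)
{
    uint32_t rva = align_up(headers_size, section_align);
    uint32_t off = align_up(headers_size, file_align);

    for (int i = 0; i < n; i++) {
        out[i].rva = rva;
        out[i].file_off = off;
        rva += align_up(sizes[i], section_align);
        off += align_up(sizes[i], file_align);
    }
}
```

With section alignment 0x1000 and file alignment 0x200 the two columns drift apart, so the file cannot simply be run where it was read in. With both set to 0x1000 every section's file offset equals its RVA in this model.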
Post by James Harris
I needed the executable to run from
wherever the file happened to be loaded. It /may/ have been relocatable
as it stood but I couldn't be sure and thought maybe I was trying to get
the format to do something it was not designed for.
My guess is you wanted it to look like a flat binary, which is missing the point
of the format and the format isn't terribly complex for what you really
need at this point.

Get a Windows system (or Wine) and make a few PEs by hand
and see them work (or break them and see them not).
Here's one example (without relocations; possibly not 100% perfect,
but Windows doesn't complain), compilable by NASM,
no linker needed:
https://github.com/alexfru/SmallerC/blob/master/v0100/smlrcw.asm
Btw, it has the sections 4KB aligned within the file.

Alex
James Harris
2022-02-10 13:17:57 UTC
...
Post by Alexei A. Frounze
Post by James Harris
As I looked in to it I was becoming increasingly uncertain that PE could
be made to do what I wanted. For example, for my environment the linker
expects sections to be loaded into memory at 4k alignment but they have
smaller alignment in the file.
There are two alignment fields in the PE header: section alignment
and file alignment. If you set both to 4KB (there should be an option
or a way to do it via the linker script), you'll probably just get
what you need.
Good point. It turns out that ld even includes switches for
--file-alignment and --section-alignment.
Post by Alexei A. Frounze
Post by James Harris
I needed the executable to run from
wherever the file happened to be loaded. It /may/ have been relocatable
as it stood but I couldn't be sure and thought maybe I was trying to get
the format to do something it was not designed for.
My guess is you wanted it to look like a flat binary, which is missing the point
of the format and the format isn't terribly complex for what you really
need at this point.
It may have been possible to get PE to do what I wanted but there were
certainly incompatibilities between my mental model and what PE was
designed for.

By contrast there were some things about it that were better than what I
had in mind. For example, its dividing patch lists up into the pages for
which they list the fixups. (Though I am not sure what is supposed to
happen when the bytes to be fixed up cross a page boundary.)
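
On the page-boundary question, a hedged note with a small C sketch. Each 16-bit entry in a PE fixup block carries a type in its top 4 bits and a 12-bit offset into the block's 4 KB page. As far as the format is generally understood, a 32-bit field that starts in the last three bytes of a page is simply listed in the block for the page where it starts, and the loader patches all four bytes even though some of them lie in the following page. That reading is an assumption here, not something established in this thread. The helper names are invented for illustration.

```c
#include <stdint.h>

/* Split a 16-bit base relocation entry into its two parts. */
unsigned reloc_type(uint16_t entry)   { return entry >> 12; }
unsigned reloc_offset(uint16_t entry) { return entry & 0x0FFFu; }

/* A 32-bit field starting at page offset 0xFFD, 0xFFE or 0xFFF runs
 * past the end of its 4 KB page into the next one. */
int reloc_crosses_page(uint16_t entry)
{
    return reloc_offset(entry) > 0x1000u - 4u;
}
```

A loader that patches one page at a time (for example while paging in on demand) would need to treat such entries specially; one that relocates the whole image in memory in a single pass does not.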
--
James Harris
James Harris
2022-01-08 13:37:34 UTC
On Wed, 5 Jan 2022 15:40:27 +0000
...
What formats of image file are best for the OS itself?
...
1. Flat binary
It turns out I cannot use a flat binary as my compiler does not (yet)
emit fully position-independent code so I've dropped flat binary as an
option.
2b. Include the whole executable file, including the headers, and
write some asm code to parse the headers and jump to the executable
part of it.
This seems to be the best option for now, i.e.:

1. Load a file in its entirety and jump to its first byte.

2. Have some code there which understands the format of the executable
which follows and applies fixups to it.

...
So, any thoughts on what format an OS image should have?
As long as your 16-bit OS loader code or GRUB etc can properly transfer
execution to your 32-bit OS image, does the format really matter? ...
The problem is that the main executable needs to have fixups applied
before it can be run. For instance, if it includes an instruction such as

mov ebx, gdt

then that becomes an absolute load:

bb <location of label gdt>

where bb is the opcode in hex.

Because of what is, admittedly, a policy decision, i.e. that I don't
want to require the code to be in a fixed location, I need to correct
the 'location of label gdt' field in the above instruction.

AFAICS that ought to be easy enough but ATM I cannot seem to get the
linker to tell me which locations need fixing up. If I don't know where
they are then I cannot fix em! :-(
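
Once the locations are known, the fix-up pass itself is small. The following C sketch assumes a hypothetical table of byte offsets into the image (it is not the output format of any particular linker) and adds the load delta to each 32-bit field. The name `apply_fixups` and the table layout are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Add `delta` (actual load address minus the address the image was
 * linked for) to every 32-bit field listed in `fixups`.
 * `fixups` is a hypothetical table of byte offsets into the image.
 * memcpy is used so unaligned fields are safe; the field is read in
 * host byte order, which matches the image when this runs on the
 * little-endian x86 target itself.
 * Returns 0 on success, -1 if an offset lies outside the image.
 */
int apply_fixups(uint8_t *image, size_t image_size,
                 const uint32_t *fixups, size_t nfixups, uint32_t delta)
{
    for (size_t i = 0; i < nfixups; i++) {
        uint32_t off = fixups[i];
        uint32_t field;

        if (off > image_size || image_size - off < sizeof field)
            return -1;
        memcpy(&field, image + off, sizeof field);
        field += delta;                       /* wraps modulo 2^32 */
        memcpy(image + off, &field, sizeof field);
    }
    return 0;
}
```

For a five-byte instruction such as `mov eax, imm32` (opcode B8 followed by the 32-bit immediate) placed at image offset 0, the table entry would be 1, i.e. the offset of the immediate rather than of the opcode.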
--
James Harris
Rod Pemberton
2022-01-09 11:41:32 UTC
Permalink
On Sat, 8 Jan 2022 13:37:34 +0000
Post by James Harris
The problem is that the main executable needs to have fixups applied
before it can be run.
Why do you need fixups applied?

Is this due to the type of executable being used?

Is this due to you moving position-dependent code?

What is generating the instruction below?

(FYI, from one line of assembly, it's hard to figure out where or why
the problem you have exists, as it could be from multiple things.)
Post by James Harris
I need to correct the 'location of label gdt' field in the [below]
instruction.
What is generating that instruction? Assembly? C code? In both
cases, I'd think you should be able to convert "gdt" to an indirect
address or variable. Then, you can play around with what's stored in
the variable.
Post by James Harris
For instance, if it includes an instruction such as
mov ebx, gdt
bb <location of label gdt>
where bb is the opcode in hex.
By "absolute load", you mean you can't change the binary value of
"<location of label gdt>" stored within the binary object, yes?

(Sorry, I was confusing what you said, "absolute load", with "absolute
address" for quite a while ...)

If you're generating this instruction from assembly, you can change it,
so it generates an indirect address loading from memory.

If a compiler is generating this instruction, I'd wonder why it was
doing that, i.e., it should access a C variable stored in memory, i.e.,
an indirect access. E.g., I'd expect something more like "mov ebx,
[gdt_pointer]" where gdt_pointer is the memory address where
the address of gdt is stored. Of course, you can do that for assembly.

You might be able to fix up issues like a gdt pointing to an incorrect
location by adjusting either the 16-bit RM segment for CS/DS or the
32-bit/64-bit PM base address for CS/DS (stored in the descriptor for
the PM selector).

That solution will work well for position independent code.

For position dependent code, the code might attempt to access memory
regions outside the application space, e.g., memory-mapped or below 1MB
on x86. If the position dependent code is loaded to some address other
than what it was compiled for, accessing these "outside" memory regions
won't work. You'd need to code special functions to make these memory
regions available from your relocated/moved code. This is usually done
by creating additional CS/DS selectors for the low memory or memory
mapped device. Execution is transferred from one code segment to
another, the function does the work, then returns.
Post by James Harris
What do you do in the 32-bit code where you use absolute references
such as the
mov eax, gdt
that I mentioned in my prior reply?
The vast majority of the 32-bit code in my OS is compiled C. So, I'm
not generating any "mov eax, gdt" or similar instructions in inline
assembly in C.

My OS is entered directly from 32-bit PM via the bootloader (or from my
DOS startup TSR). The GDT for my OS is actually set up twice. My OS
inherits a minimally functional GDT from the bootloader (or from my DOS
startup TSR). I don't access the inherited GDT from my OS, as it's
"discarded" immediately. My OS immediately sets up its own proper GDT
in C, using an array of descriptors (typedef'd structs) for the GDT,
and loads the gdt via inline assembly for lgdt instruction. If I need
to access the OS' gdt, it's accessed directly from C variables and
functions.

I do wrap a large variety of x86 system instructions such as lgdt, lidt,
... in inline assembly. These x86 instruction functions are passed the
C variables directly. The C compiler patches the C code and assembly
code together correctly.
Post by James Harris
Because of what is, admittedly, a policy decision, i.e. that I don't
want to require the code to be in a fixed location
From this, my guess is that it seems like your compiler is producing
position-dependent code, but you're wanting position-independent code?
... If so, that's a problem.

If you load a position-dependent PM executable or object compiled for
1MB to say 2MB, you can adjust the base address of the PM selectors for
CS/DS from 1MB to 2MB. However, that won't fix any accesses to
memory-mapped devices or stuff below 1MB. To do that, you have to set
up additional CS/DS selectors for those regions, and you'd have to code
and call functions that transfer to code in those regions, do the work,
then return to your image.
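
To make the "adjust the base address" step concrete, here is a small C sketch that packs a 32-bit protected-mode segment descriptor from a base, limit, access byte and flags nibble. The bit layout is the standard x86 one; the function name `make_descriptor` and the choice of returning the descriptor as a `uint64_t` are assumptions made for illustration.

```c
#include <stdint.h>

/*
 * Pack an 8-byte x86 segment descriptor.
 *   base   : 32-bit linear base address of the segment
 *   limit  : 20-bit limit (counted in 4 KiB units when G is set)
 *   access : access byte, e.g. 0x9A = present ring-0 code,
 *            0x92 = present ring-0 writable data
 *   flags  : high nibble, e.g. 0xC = 4 KiB granularity + 32-bit
 */
uint64_t make_descriptor(uint32_t base, uint32_t limit,
                         uint8_t access, uint8_t flags)
{
    uint64_t d = 0;

    d |= (uint64_t)(limit & 0xFFFFu);              /* limit 15:0  -> bits 15:0  */
    d |= (uint64_t)(base & 0xFFFFFFu) << 16;       /* base 23:0   -> bits 39:16 */
    d |= (uint64_t)access << 40;                   /* access byte -> bits 47:40 */
    d |= (uint64_t)((limit >> 16) & 0xFu) << 48;   /* limit 19:16 -> bits 51:48 */
    d |= (uint64_t)(flags & 0xFu) << 52;           /* flags       -> bits 55:52 */
    d |= (uint64_t)(base >> 24) << 56;             /* base 31:24  -> bits 63:56 */
    return d;
}
```

Moving an image then means rebuilding the CS/DS descriptors with the new base (for example 0x00200000 instead of 0x00100000) and reloading the segment registers, while separate base-0 descriptors stay available for memory below 1MB or memory-mapped devices.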

The DJGPP compiler is a prime example of this. It uses GCC C compiler
but a custom DOS library. To access things below 1MB, you call special
functions. The DJGPP images can be loaded anywhere above 1MB because
they're position independent. OpenWatcom, on the other hand, is position
dependent. Its images are only intended to be loaded at the compiled-for
address. If you move an OW image from its compiled-for address,
you can't access memory or devices outside the image space without
coding your own functions to access these regions, like DJGPP ...
--
Once a President becomes a permanent failure, he then becomes a
fearmonger.
James Harris
2022-02-07 13:05:40 UTC
Permalink
Post by Rod Pemberton
On Sat, 8 Jan 2022 13:37:34 +0000
Post by James Harris
The problem is that the main executable needs to have fixups applied
before it can be run.
Why do you need fixups applied?
Is this due to the type of executable being used?
Is this due to you moving position-dependent code?
Yes, the code was position-dependent - which, it turns out, is often the
case with executables. Whereas locals may be addressed off the frame
pointer it's fairly normal for addresses of globals not to be relative
to any register but to be hardcoded. For example,

int e;
void f() { e = 0; }

When that's compiled the "e = 0" assignment may become

mov [address_of_e], dword 0

The value 'address_of_e' becomes literally an address; it is only
finalised at link/load time.

...
Post by Rod Pemberton
What is generating that instruction? Assembly? C code? In both
cases, I'd think you should be able to convert "gdt" to an indirect
address or variable. Then, you can play around with what's stored in
the variable.
It was generated by my compiler.
Post by Rod Pemberton
Post by James Harris
For instance, if it includes an instruction such as
mov ebx, gdt
By "absolute load" I mean where the address is essentially hardcoded in
the object code - except that the linker and/or loader can patch the
address to suit where the code is to be run from.

...
Post by Rod Pemberton
If you're generating this instruction from assembly, you can change it,
so it generates an indirect address loading from memory.
If a compiler is generating this instruction, I'd wonder why it was
doing that, i.e., it should access a C variable stored in memory, i.e.,
an indirect access. E.g., I'd expect something more like "mov ebx,
[gdt_pointer]" where gdt_pointer is the memory address where
the address of gdt is stored. Of course, you can do that for assembly.
But then you would need to know the address of gdt_pointer!
Post by Rod Pemberton
You might be able to fix up issues like a gdt pointing to an incorrect
location by adjusting either the 16-bit RM segment for CS/DS or the
32-bit/64-bit PM base address for CS/DS (stored in the descriptor for
the PM selector).
That solution will work well for position independent code.
Yes, that would work if using segmentation - although I am using a flat
memory image so it won't work in my case.

...
Post by Rod Pemberton
Post by James Harris
Because of what is, admittedly, a policy decision, i.e. that I don't
want to require the code to be in a fixed location
From this, my guess is that it seems like your compiler is producing
position-dependent code, but you're wanting position-independent code?
... If so, that's a problem.
Yes. As I saw it, my choices were:

1. Change my compiler to produce position-independent code.
2. Work out a way to relocate a PE file 'in place'.
3. Work out a way to do the same with an Elf file.

After spending quite a while looking at option 2 I became less and less
certain it would do what I wanted. For example, PE expects sections to
be loaded with different separation than they have in the file and I
needed a loaded image to run 'in place', i.e. without altering such
separations.

Elf may have had its own difficulties so I went for option 1 and decided
to change my compiler. It has been a fair bit of work but the compiler
is now emitting PIC and seems to be doing the job. I just got it working
yesterday.
--
James Harris
Alexei A. Frounze
2022-02-08 01:50:03 UTC
Permalink
...
Post by James Harris
Post by Rod Pemberton
From this, my guess is that it seems like your compiler is producing
position-dependent code, but you're wanting position-independent code?
... If so, that's a problem.
1. Change my compiler to produce position-independent code.
2. Work out a way to relocate a PE file 'in place'.
3. Work out a way to do the same with an Elf file.
After spending quite a while looking at option 2 I became less and less
certain it would do what I wanted. For example, PE expects sections to
be loaded with different separation than they have in the file and I
needed a loaded image to run 'in place', i.e. without altering such
separations.
Um, you essentially wanted a PE file to look like a flat one with just
PE headers prepended? It's kinda defeating the purpose of the format.
However, it's not impossible. It's just a matter of padding your sections
to whole 4KB pages. And for that you may need to adjust the linker
script.
But the format is simple enough to just properly support its basic
code and data sections, even relocations.

Alex
James Harris
2022-01-08 13:55:44 UTC
Permalink
On Wed, 5 Jan 2022 15:40:27 +0000
...
E.g., I'm using two DOS C compilers (DJGPP, OpenWatcom) which produce
executable images for 16-bit MS-DOS plus 32-bit DPMI, i.e., a 16-bit DOS
stub plus 32-bit compiled C code intended for a DPMI host. Obviously,
these executables were never designed, nor intended to be a suitable
format for an OS image.
Yes. Combining 16-bit and 32-bit code in one file has to be unusual but
AFAICS needed to get a pmode OS running.

What do you do in the 32-bit code where you use absolute references such
as the

mov eax, gdt

that I mentioned in my prior reply? (At a guess, perhaps you know at
compile/assemble time where the 16- and 32-bit pieces of code will be in
memory, in which case you could build them with the absolute references
already what they need to be.)
If I was using assembly or my own customizable tool chain, then I'd go
with a .COM or flat binary, but for 32-bit or 64-bit instead of 16-bit.
To transfer execution to your OS, you just need the appropriate
processor mode for the code (32-bit PM) to be set up and an entry point
to jump to. For GRUB, you can do that with a small header (Multiboot)
and a pinch of assembly which points to your OS' entry function.
Once your OS is in memory and running, it can re-configure, load or
re-load, or move around, whatever it wants, i.e., total control.
It might be easier, depending on what you're doing, to have a 16-bit
stub on the OS at the start. Then, your OS loader doesn't need to pack
in the code to setup 32-bit PM, as the OS can do the PM switch later on.
Yes, essentially I have 16-bit and 32-bit code in the same file. The
hard-drive boot process is as follows. Criticism welcome.

One could say that the VBR is 'Stage1'. It finds the Stage2 bootloader
and loads the first sector thereof to 0x8000.

The first sector of Stage2 loads up to 31.5k more of itself (so that the
part loaded initially fits in the address range 0x8000..0xffff inclusive).

The first 32k of the Stage2 code then finds out how much memory there is
in the machine and responds accordingly. In the simple case, it will
load the rest of itself to addresses 64k and later but that's not the
only option. As an alternative, it could load the whole of itself to 1M
and above.

Either way, once the Stage2 code is fully loaded it does what it has to
do in real mode (BIOS calls etc), and switches to pmode.

At this point my HLL code starts - but I need to fix it up for wherever
it finds itself before it will run properly.
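
The 31.5k figure follows directly from the addresses above; a short C sketch of the arithmetic, with constant and function names invented for illustration:

```c
#include <stdint.h>

enum {
    SECTOR_SIZE      = 512,
    STAGE2_LOAD_ADDR = 0x8000,   /* where the VBR puts Stage2's first sector */
    STAGE2_LOW_END   = 0x10000   /* one past 0xffff */
};

/* Bytes the first sector may still load so the low part ends at 0xffff. */
uint32_t stage2_remaining_low_bytes(void)
{
    return STAGE2_LOW_END - STAGE2_LOAD_ADDR - SECTOR_SIZE;
}

/* The same amount expressed in whole sectors. */
uint32_t stage2_remaining_low_sectors(void)
{
    return stage2_remaining_low_bytes() / SECTOR_SIZE;
}
```

That is 32256 bytes, or 63 sectors, on top of the sector the VBR already loaded.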
--
James Harris
James Harris
2022-01-08 14:42:21 UTC
Permalink
After a long absence from OS development I recently returned to it - and
it feels great to be doing this stuff again!!! The reason for the
absence (other than life!) was that I was developing a language to write
the OS in.
What's wrong with C?
That's a good question to ask in comp.lang.misc. There's a guy there who
has compiled a very good list. I agree with the points he had on his
list when I last looked at it though that was some time ago.

That said, for writing an operating system the flaws in C are nothing
major. I decided to produce my own language to make OS writing easier
but found out that I could do a lot of what I wanted the OS for by using
my own language.

That said, I would definitely not recommend anyone go down the course
that I did unless he is a genius! Producing a 'good' language has been
far more difficult than I expected.

...
In either case, I always load to a fixed address, so I
may as well produce a flat binary, so that is what I
am likely to do.
That makes sense. I chose to allow the code to run from different
locations so that option is not open to me.
Elf and PE have the opposite problem. Either of them should be easy to
2a. Extract the executable part (how?) for inclusion in the loadable image.
You will need the same logic as a loader. Public domain
https://sourceforge.net/p/pdos/gitcode/ci/master/tree/bios/exeload.c
Thanks for the link. I see in its exeloadLoadPE function the following.

if ((coff_hdr.Characteristics & IMAGE_FILE_RELOCS_STRIPPED) != 0)
{
    printf("only relocatable executables supported\n");
    return (2);
}

and I think that's what's happening here. The linker or something else
is stripping out the relocs. ATM I cannot imagine why it would do such a
thing.

Assuming I can fix the link process I can also see some useful
calculations in the page you mention such as

rel_block = exeStart + data_dir->VirtualAddress

Such calcs may be simple but they confirm what needs to be done and are
easier to read than the specs!
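
For reference, a hedged C sketch of the two header lookups involved, written against a PE32 file held in a byte buffer rather than against exeload.c's own structures. The offsets used (e_lfanew at 0x3C, Characteristics at COFF offset 18, data directories at optional-header offset 96, base relocations at index 5) are the standard PE32 ones; the function name `find_base_relocs` and its return codes are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define IMAGE_FILE_RELOCS_STRIPPED      0x0001
#define IMAGE_DIRECTORY_ENTRY_BASERELOC 5
#define PE32_MAGIC                      0x010B

static uint16_t le16(const uint8_t *p) { return (uint16_t)(p[0] | (p[1] << 8)); }
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Locate the base relocation directory of a PE32 file held in `file`.
 * On success stores its RVA and size and returns 0.
 * Returns -1 for a malformed or unsupported file, -2 if relocations
 * were stripped or the directory is absent/empty.
 */
int find_base_relocs(const uint8_t *file, size_t size,
                     uint32_t *rva, uint32_t *len)
{
    if (size < 0x40 || file[0] != 'M' || file[1] != 'Z')
        return -1;

    uint32_t pe = le32(file + 0x3C);                 /* e_lfanew */
    /* signature (4) + COFF header (20) + optional header through
       data directory entry 5 (96 + 6*8 = 144 bytes) */
    if (pe > size || size - pe < 4 + 20 + 144)
        return -1;

    const uint8_t *sig  = file + pe;
    const uint8_t *coff = sig + 4;
    const uint8_t *opt  = coff + 20;

    if (sig[0] != 'P' || sig[1] != 'E' || sig[2] != 0 || sig[3] != 0)
        return -1;
    if (le16(coff + 16) < 144)                       /* SizeOfOptionalHeader */
        return -1;
    if (le16(opt) != PE32_MAGIC)
        return -1;
    if (le16(coff + 18) & IMAGE_FILE_RELOCS_STRIPPED) /* Characteristics */
        return -2;
    if (le32(opt + 92) <= IMAGE_DIRECTORY_ENTRY_BASERELOC) /* NumberOfRvaAndSizes */
        return -2;

    const uint8_t *dir = opt + 96 + 8 * IMAGE_DIRECTORY_ENTRY_BASERELOC;
    *rva = le32(dir);
    *len = le32(dir + 4);
    return (*rva != 0 && *len != 0) ? 0 : -2;
}
```

With the sections placed at their RVAs relative to exeStart, the returned RVA is what gets added to exeStart to reach the first relocation block.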
2b. Include the whole executable file, including the headers, and write
some asm code to parse the headers and jump to the executable part of it.
Why can't you write 16-bit C code to do that instead
of assembler? That's what I do. If I had my time again
I would write it using the huge memory model of
Watcom C instead.
If you mean you'd become dependent on a particular compiler then I'd
suggest that that's a trap. I thought you wrote in C89. Your same source
should be compilable by ANY compatible compiler.

As for me, by the time I need to relocate the PE or Elf file I will be
in 32-bit mode so I can, thankfully, avoid different 16-bit memory models.
--
James Harris
BGB
2022-01-14 22:59:45 UTC
Permalink
Post by James Harris
After a long absence from OS development I recently returned to it - and
it feels great to be doing this stuff again!!! The reason for the
absence (other than life!) was that I was developing a language to write
the OS in.
What's wrong with C?
That's a good question to ask in comp.lang.misc. There's a guy there who
has compiled a very good list. I agree with the points he had on his
list when I last looked at it though that was some time ago.
That said, for writing an operating system the flaws in C are nothing
major. I decided to produce my own language to make OS writing easier
but found out that I could do a lot of what I wanted the OS for by using
my own language.
That said, I would definitely not recommend anyone go down the course
that I did unless he is a genius! Producing a 'good' language has been
far more difficult than I expected.
FWIW, my own languages had generally ended up mostly being:
BGBScript family, JS/AS style syntax, mostly dynamic types;
BGBScript2 family, slightly more Java like syntax, hybrid types.

At present, I have a compiler (for my own ISA) which does variants of
BS, BS2, and C.

Though, because these are static compiled in this case, and are on a
relatively resource constrained target, they lack support for "eval" or
similar at present.


In this compiler, some amount of the BGBScript and BGBScript2 features
are available in C via language extensions, and it is possible in this
case to fairly directly call from one language to another.

So, say (BS):
var obj={x:3, y:4};
var arr=[3, 4, 5];
var fcn=function(x,y) { return(x+y); };
...
Maps to (C extension):
__var obj=__var{x:3, y:4};
__var arr=__var[3, 4, 5];
__var fcn=__function(x,y) { return(x+y); };


Things like type-tagging are partially supported at the hardware level,
and the dynamic type-tagging system is also partially specified in the ABI.


Generally, for pointer type values:
* (63:60): 0000
* (59:48): Object Type-Tag
* (47: 0): Memory Address (48-bit address, up to 47 bit userland, *1)
** *1: If virtual memory was usable...

Normal pointer access (in hardware) simply ignores bits (63:48), though
they may apply in certain contexts.

Fixint:
(63:62): 01
(61: 0): Value (62-bit, sign extended to 64 bits).

Flonum:
(63:62): 10
(61: 0): Value (Double, right-shifted by 2 bits)

Some other misc stuff goes into the remaining tag space.
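
A rough C sketch of packing and unpacking values in the layout listed above. It follows only what the bit tables say; details such as whether the flonum conversion rounds or truncates are not given here, so this version simply truncates, and all function names are invented for illustration.

```c
#include <stdint.h>
#include <string.h>

#define TAG_SHIFT  62
#define TAG_FIXINT UINT64_C(1)                 /* bits 63:62 = 01 */
#define TAG_FLONUM UINT64_C(2)                 /* bits 63:62 = 10 */
#define LOW62      ((UINT64_C(1) << 62) - 1)
#define LOW48      ((UINT64_C(1) << 48) - 1)

/* Fixint: 62-bit value, sign-extended to 64 bits on extraction. */
uint64_t make_fixint(int64_t v)
{
    return (TAG_FIXINT << TAG_SHIFT) | ((uint64_t)v & LOW62);
}

int64_t fixint_value(uint64_t t)
{
    uint64_t v    = t & LOW62;
    uint64_t sign = UINT64_C(1) << 61;         /* sign bit of the 62-bit field */
    return (int64_t)((v ^ sign) - sign);       /* portable sign extension */
}

/* Flonum: double bit pattern shifted right by 2 (low 2 mantissa bits lost). */
uint64_t make_flonum(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (TAG_FLONUM << TAG_SHIFT) | (bits >> 2);
}

double flonum_value(uint64_t t)
{
    uint64_t bits = t << 2;                    /* tag bits shift out the top */
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Pointer: bits 63:60 zero, 59:48 object type tag, 47:0 address. */
uint64_t make_tagged_ptr(uint64_t addr, unsigned type_tag)
{
    return ((uint64_t)(type_tag & 0xFFFu) << 48) | (addr & LOW48);
}

uint64_t ptr_address(uint64_t t)  { return t & LOW48; }
unsigned ptr_type_tag(uint64_t t) { return (unsigned)((t >> 48) & 0xFFFu); }
```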


In the past, did compare a prime sieve written in normal C (static
types), vs one using dynamic types, and saw a roughly 3x speed
difference. However, this was "pretty good" all things considered
(dynamically typed code consists primarily of runtime calls; array
accesses require a fair bit more heavy lifting, ... So seeing only a 3x
delta here was better than expected).

For code which primarily uses static types, the relative cost of
occasionally using dynamic types is much less.


It is also possible to retain portability (with normal C compilers) by
wrapping the dynamic types in a way that they can fall back to a plain C
implementation on different compilers (eg, via a bunch of preprocessor
macros).


However, for higher-level scripting tasks, a design more like BS or BS2
makes more sense.
Post by James Harris
...
In either case, I always load to a fixed address, so I
may as well produce a flat binary, so that is what I
am likely to do.
That makes sense. I chose to allow the code to run from different
locations so that option is not open to me.
On x86, if one needs a 16-bit section, then using a flat binary for this
part makes sense. Would probably also write it in ASM as well.


One loader option is, say, for the boot loader:
Read in second stage loader image;
Read in kernel image (using most of conventional memory as buffer space);
Second stage loads/unpacks kernel image to its intended load address.


Something like LZ4 still makes sense, since this allows fitting a
potentially larger kernel image into the memory one has available in
real mode. Also it is simple enough to not be unreasonably complicated
to write a decoder for it in ASM (unlike, say, Deflate or LZMA, which
are significantly more complicated).


In my custom ISA project though, the FAT driver, PE/PEL4 loader, ... are
all handled by the Boot ROM though. This benefits from having 32K
available to work with, although the FAT driver (FAT16/FAT32) and
"Sanity Check" code eat up a lot of this (the sanity check code
basically tests out various ISA instructions and features to make sure
things behave as expected).

Granted, the FAT driver is potentially a little overbuilt for what is
theoretically needed in this case. For example, if one mandates that the
boot image is contiguous (rather than walking the FAT chain), then a bit
of simplification would be possible.
Post by James Harris
Elf and PE have the opposite problem. Either of them should be easy to
2a. Extract the executable part (how?) for inclusion in the loadable image.
You will need the same logic as a loader. Public domain
https://sourceforge.net/p/pdos/gitcode/ci/master/tree/bios/exeload.c
Thanks for the link. I see in its exeloadLoadPE function the following.
    if ((coff_hdr.Characteristics & IMAGE_FILE_RELOCS_STRIPPED) != 0)
    {
        printf("only relocatable executables supported\n");
        return (2);
    }
and I think that's what's happening here. The linker or something else
is stripping out the relocs. ATM I cannot imagine why it would do such a
thing.
Assuming I can fix the link process I can also see some useful
calculations in the page you mention such as
  rel_block = exeStart + data_dir->VirtualAddress
such calcs may be simple but they confirm what needs to be done and are
easier to read than the specs!
The mainstream linkers generally tend to strip relocs by default for EXE
files.

One possible workaround would be to compile their binaries as DLLs.
Post by James Harris
2b. Include the whole executable file, including the headers, and write
some asm code to parse the headers and jump to the executable part of it.
Why can't you write 16-bit C code to do that instead
of assembler? That's what I do. If I had my time again
I would write it using the huge memory model of
Watcom C instead.
If you mean you'd become dependent on a particular compiler then I'd
suggest that that's a trap. I thought you wrote in C89. Your same source
should be compilable by ANY compatible compiler.
As for me, by the time I need to relocate the PE or Elf file I will be
in 32-bit mode so I can, thankfully, avoid different 16-bit memory models.
For any real-mode code, would probably assume sticking to ASM.
BGB
2022-01-14 19:26:02 UTC
Permalink
After a long absence from OS development I recently returned to it - and
it feels great to be doing this stuff again!!! The reason for the
absence (other than life!) was that I was developing a language to write
the OS in.
And, me just randomly poking in this group, not been around for a while.
Well, I now have a working compiler and a language which, while it is
currently primitive, is usable.
C is pretty hard to beat, IMO.


I have my own languages as well, but C is pretty hard to beat for
low-level tasks even when one has the ability to write their own
compilers (if anything, it gives more insight into why C is the way it is).

Well, and also why a few of the "less well received" C99 features, such
as VLAs are, in practice, "kinda awful" (one can implement support for
them, but making them "not suck" is harder than what might be otherwise
implied).
At this point it seems to me that there's an opportunity for a win-win.
If I use the language to work on the OS then that will let me make
progress on the OS while at the same time using the experience to
provide useful feedback on how the language should develop.
So that's what I plan to do, and the above is background to the query of
   What formats of image file are best for the OS itself?
My compiler currently emits x86 32-bit code (and its output is readily
linkable with other code which can be written in 32-bit assembly) so
pmode is my target. I have enough 16-bit asm code to load the bytes of
an image and switch to pmode but the next problem is what format the
1. Flat binary
A 32-bit flat binary would be easy to invoke as I could just jump to its
first byte. It would not be relocatable but it looks as though I could
change my compiler so that as long as I avoid globals I can emit
position-independent code - which could be handy! But I am not sure how
to create a 32-bit flat binary. My copy of ld doesn't seem to support
such an output, though maybe there's a way to persuade it.
2. Elf or PE
Elf and PE have the opposite problem. Either of them should be easy to
2a. Extract the executable part (how?) for inclusion in the loadable image.
2b. Include the whole executable file, including the headers, and write
some asm code to parse the headers and jump to the executable part of it.
Or maybe there's another option. I've a feeling we've discussed this
before but at the moment I cannot think of what we concluded. Plus, I
need to work with what my compiler can produce (32-bit Nasm) which may
be a new constraint.
So, any thoughts on what format an OS image should have?
One of my own projects, which has occupied much of the last several
years of my life, is a custom CPU ISA (mostly used on FPGA boards thus
far...).


For this, I mostly went with a tweaked PE/COFF variant:
* Omits the 'MZ' stub, as it is basically useless in this case.
** Typically the file starts at a 'PE\0\0' or 'PEL4' magic or similar.
** The MZ stub is disallowed entirely in the LZ compressed variants.
** The MZ stub may be present for uncompressed 'PE\0\0' files.
* Optional (per-image) LZ compression (typically LZ4 in this case).
** The LZ decoding is integrated with reading the image off the SDcard.
* Adds an RVA==Offset restriction.
* ...

The addition of an RVA==Offset restriction means that it is possible to
essentially just read (or unpack) the EXE into its target address and
then jump to its entry point. Though, my loader also zeroes the ".bss"
section and similar. Without the restriction, it would be necessary to
first read the binary to an intermediate location and then copy its
sections to their destination addresses.

Though, if using a generic linker which does not follow this rule, one
would need to first read into a buffer and then copy out the sections
(or do "seek and read" for each section if one has a "proper" filesystem
driver).
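
Both cases can be sketched in C roughly as follows. The `struct section` here is a cut-down stand-in for a real PE/COFF section header (only the four fields that matter), and the function names are made up; it is an illustration of the idea, not the loader described above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified view of a section header: only the fields used here. */
struct section {
    uint32_t virtual_address;     /* RVA of the section once loaded */
    uint32_t virtual_size;        /* size in memory (may exceed raw size) */
    uint32_t pointer_to_raw_data; /* offset of the section in the file */
    uint32_t size_of_raw_data;    /* bytes actually present in the file */
};

/* Returns 1 if every section with file data satisfies RVA == offset,
   i.e. the file can be read straight to its load address. */
int rva_equals_offset(const struct section *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (s[i].size_of_raw_data != 0 &&
            s[i].virtual_address != s[i].pointer_to_raw_data)
            return 0;
    return 1;
}

/*
 * General case: copy each section from a file buffer to its RVA in the
 * destination image, zeroing whatever the file does not supply
 * (e.g. ".bss"). Returns 0 on success, -1 if a section does not fit.
 */
int load_sections(uint8_t *image, size_t image_size,
                  const uint8_t *file, size_t file_size,
                  const struct section *s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t raw = s[i].size_of_raw_data;
        uint32_t mem = s[i].virtual_size > raw ? s[i].virtual_size : raw;

        if (s[i].pointer_to_raw_data > file_size ||
            file_size - s[i].pointer_to_raw_data < raw)
            return -1;
        if (s[i].virtual_address > image_size ||
            image_size - s[i].virtual_address < mem)
            return -1;

        memcpy(image + s[i].virtual_address,
               file + s[i].pointer_to_raw_data, raw);
        memset(image + s[i].virtual_address + raw, 0, mem - raw);
    }
    return 0;
}
```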


For programs within the OS, the same basic format was used, except that
the ABI splits the binary into two separate areas:
* One for '.text' and other read-only sections.
* An area for '.data' and '.bss' and similar.

The modifiable sections would then be addressed relative to a "Global
Register" (oddly enough, PE/COFF already had the fields for this; albeit
they were unused for x86/x64, mostly intended for MIPS and similar).

This allows multiple logical instances of the same program within a
single address space (without also needing multiple instances of the
".text" section). Implicitly, the ".data" section points to a table to
allow the main EXE (and any DLLs) to reload its own data sections
(typically needed for DLL exports and calls via function pointers, which
may not necessarily have the correct data section in the global register
on entry to the function).


Base relocations could be performed easily enough, but are N/A for
loading up an OS image. The image needs to have its base set to its
starting address.
* In my case, this is generally 01100000 (or 17MB)
* This is 1MB past the start of DRAM, 01000000 (16MB)
** The first 1MB of DRAM is generally reserved for stacks and similar.

Base relocations are typically applied (once) when loading up program
binaries and DLLs though. These fix up the binary both for the load
address, and also its index into the table used for reloading the global
pointer. The base reloc format is basically the same as in normal
PE/COFF, with a few minor tweaks and extensions.


Addresses are generally:
* 00000000..0000FFFF: ROM, Boot SRAM
* 00010000..000FFFFF: Special Hardware Pages (Fixed Contents).
** There is a page of 00 bytes, a page of NOPs, BREAK, RTS, ...
** These are partly intended for use by virtual memory and similar.
* 01000000..7FFFFFFF: DRAM Range (RAM may wrap within this space).
* 80000000..EFFFFFFF: Reserved for now
* F0000000..FFFFFFFF: MMIO (Low Range)

Ranges above the 4GB mark also exist (47:32):
* 0001..7FFF: Virtual Address Space
** Virtual memory generally goes in this range.
** Stuff below 4GB being physically mapped.


In the case of boot-loading, the PE/COFF image is treated as (more or
less) functionally equivalent to a flat binary, just with the entry
point pulled from the PE/COFF header.

For the optional LZ scheme:
* The image is compressed in terms of 1K blocks, starting at 1K.
* The first 1K remains uncompressed (PE headers go here).
* Decoding happens in terms of discrete 1K blocks.
** This avoids the need for using a large intermediate buffer.


In my case, I experimented with several LZ schemes:
* LZ4
** Base encoding is very similar to the standard LZ4.
*** However, absent any headers or file packaging (as in ".lz4" files).
** A distance of 0 is a special "escape case"; generally used for EOB.
*** This typically happens once, at the end of the PE image.
* LZ4LLB
** Tweaked version of LZ4 which adds match/literal length restrictions.
** Adds special case to deal with long runs of literal bytes (no match).
** Allows a slightly faster/simpler decoder, but worse compression.
* RP2 (a custom LZ scheme)
** Custom format, but remains byte-oriented (like LZ4)
** Not generally for binaries as it does worse than LZ4 in this case.


For the case of binaries (at least with my ISA, but also appears to hold
true for x86-64 and ARM as well), LZ4 came out ahead of the various
options I had tried.

My RP2 format tends to do better for general purpose data compression,
however its design assumes that match length and distance are positively
correlated (it uses a unary coded match-format selector, with several
combinations of length and distance bits as bit-packed values), eg:
* dddddddd-dlllrrr0 (l=3..10, d=0..511, r=0..7)
* dddddddd-dddddlll-lllrrr01 (l=4..67, d=0..8191, r=0..7)
* dddddddd-dddddddd-dlllllll-llrrr011 (l=4..515, d=0..131071, r=0..7)
* rrrr0111 (Raw Bytes, r=(r+1)*8, 8..128)
* rrr01111 (Long Match, *1)
* rr011111 (r=1..3 bytes, 0=EOB)
* rrrrrrrr-r0111111 (Long Raw, r=(r+1)*8, 8..4096)

So, typically, the match formats only support 0..7 literal bytes, with
longer runs of literal bytes using explicit encodings.
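
Reading the table above, the format selector is the run of 1 bits at the low end of the tag, terminated by a 0. A small C sketch of that classification follows; it assumes the selector sits in the low bits of the first byte, which is how the bit diagrams read but is not spelled out, and the enum and function names are invented.

```c
#include <stdint.h>

enum rp2_kind {
    RP2_MATCH_16,    /* .......0 : l=3..10,  d=0..511    */
    RP2_MATCH_24,    /* ......01 : l=4..67,  d=0..8191   */
    RP2_MATCH_32,    /* .....011 : l=4..515, d=0..131071 */
    RP2_RAW_BYTES,   /* ....0111 : 8..128 raw bytes      */
    RP2_LONG_MATCH,  /* ...01111                         */
    RP2_SHORT_RAW,   /* ..011111 : 1..3 raw bytes, 0=EOB */
    RP2_LONG_RAW,    /* .0111111 : 8..4096 raw bytes     */
    RP2_UNKNOWN      /* not listed in the table          */
};

/* Count the trailing 1 bits of the tag to pick the format. */
enum rp2_kind rp2_classify(uint8_t tag)
{
    int ones = 0;
    while (ones < 8 && (tag & (1u << ones)))
        ones++;
    return ones <= 6 ? (enum rp2_kind)ones : RP2_UNKNOWN;
}

/* Raw-bytes case (rrrr0111): length is (r+1)*8, giving 8..128. */
unsigned rp2_raw_len(uint8_t tag)
{
    return ((unsigned)(tag >> 4) + 1u) * 8u;
}
```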


On my ISA, RP2 can also beat out LZ4 in terms of decode speed, however
on x86 and x86-64, LZ4 tends to be faster.


LZ4 uses a fixed-length (16-bit) distance over a 64K sliding window,
which, as it happens, better matches the patterns typically seen in program
binaries (where it seems there is no particular correlation between
match length and distance; with lots of short and medium matches spread
more-or-less at random within the sliding window).

Eg, quick and dirty (from memory):
* Tag Byte (rrrr-llll), Raw=0..14, Length=4..18
** Values of 15 in either field encoding a longer length.
*** 00..FE: 0..254, added to preceding length.
*** FF, add 255 to length, read another length byte.

With a match structured like (IIRC):
* Token [LitLen] LiteralBytes* Distance [MatchLen]
** Distance encoded as a 16-bit word (little endian).

Traditionally, in LZ4 the EOB case was encoded by running into the end
of the input buffer. Supporting Distance==0 as an escape case deals
better with input buffers without an explicit length.
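
A compact C sketch of a block decoder along these lines. It follows the standard raw LZ4 block layout (token, optional length bytes, literals, 16-bit little-endian distance, optional match-length bytes) and additionally treats a distance of 0 as end-of-block, as described above. It is an illustration with bounds checks, not the decoder actually used; the function name is invented.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one raw LZ4 block (no frame header).
 * Stops at the end of the input (classic LZ4) or at a match whose
 * distance is 0 (escape case used as end-of-block).
 * Returns the number of bytes written, or -1 on malformed input or
 * when the output buffer is too small.
 */
long lz4_decode_block(const uint8_t *src, size_t src_len,
                      uint8_t *dst, size_t dst_cap)
{
    size_t ip = 0, op = 0;

    while (ip < src_len) {
        uint8_t token = src[ip++];

        /* Literal run: high nibble; 15 means more length bytes follow. */
        size_t lit = token >> 4;
        if (lit == 15) {
            uint8_t b;
            do {
                if (ip >= src_len) return -1;
                b = src[ip++];
                lit += b;
            } while (b == 255);
        }
        if (lit > src_len - ip || lit > dst_cap - op) return -1;
        for (size_t i = 0; i < lit; i++) dst[op++] = src[ip++];

        if (ip == src_len) break;        /* last sequence: literals only */

        /* 16-bit little-endian distance. */
        if (src_len - ip < 2) return -1;
        size_t dist = (size_t)src[ip] | ((size_t)src[ip + 1] << 8);
        ip += 2;
        if (dist == 0) break;            /* escape: end of block */
        if (dist > op) return -1;

        /* Match length: low nibble + 4, extended the same way. */
        size_t len = token & 15;
        if (len == 15) {
            uint8_t b;
            do {
                if (ip >= src_len) return -1;
                b = src[ip++];
                len += b;
            } while (b == 255);
        }
        len += 4;
        if (len > dst_cap - op) return -1;
        /* Byte by byte, so overlapping matches (dist < len) repeat. */
        for (size_t i = 0; i < len; i++, op++) dst[op] = dst[op - dist];
    }
    return (long)op;
}
```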

In the binaries, the "hits end of buffer" scheme was used for encoding
within most of the 1K blocks, and the explicit EOB case was used for the
final block. Arguably, could have also used 512B blocks (a single
sector), but 1K worked.



The LZ4LLB case was further tweaked:
* Match and Literal length cases were limited to a single byte.
* A match with (Literal!=0, Dist=0, Length=4)
** Encodes an EOB (as in other variant)
* A match with (Literal!=0, Dist=0, Length=5)
** Encodes a run of literal bytes with no match copy operation.

Though, if one has a decoder which does both formats, there was little
advantage to LZ4LLB. Likewise, having a dedicated decoder gave (at best)
a fairly small speed increase.


The PE loaders in my case also make use of checksums, albeit using a
different checksum algorithm from that used in normal PE/COFF.
This was mostly because the original checksum algo was "very weak" and
could miss the sorts of damage which could be done by a misbehaving LZ
decoder.

More or less:
* Keep a running sum of the current values (32-bit DWORDs);
* Keep a running sum of the preceding sums;
* Twiddle and XOR them together at the end (32-bit result).

Though, done as 4 parallel sets of sums in my ISA, as this was faster in
my case (though does produce a different checksum from using a single
set of sums).

The addition of a second sum-of-sums significantly increases its error
detection ability with only a minor increase in computational cost.

The single set of sums option would likely be faster for x86 or x86-64
due to these ISAs having a smaller GPR space (running 4 sets of sums in
parallel would eat more registers than are available in the ISA).

So, a simplified "x86 friendly" variant would probably be, say:
uint32_t *cs, *cse;
uint32_t sum1, sum2;
cs=(uint32_t *)(buf);
cse=(uint32_t *)(buf+sz);
sum1=0; sum2=0;
while(cs<cse)
{
    sum1+=*cs++;
    sum2+=sum1;
}
return(sum1^sum2);

Though the "actual version" uses 64-bit sums (of 32-bit values) and then
adds the high half back to the low-half at the end, which does help with
its error-detection rate (but, 32-bit x86 doesn't really have the
register space for this, but this property could be emulated with, say
"LODSD; ADD ECX, EAX; ADC ECX, 0; ADD EDX, ECX; ADC EDX, 0;" or similar,
which works within the limits of 32-bit accumulators).
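
As a rough illustration of the 64-bit-sum idea, here is a C sketch that accumulates in 64 bits and folds the high half back into the low half at the end. The final "twiddle" is not specified above, so this version just XORs the two folded sums as in the simplified variant; it should not be expected to reproduce the real checksum values, and the function names are invented.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 64-bit accumulator to 32 bits by adding the high half into
   the low half until nothing is left above bit 31. */
static uint32_t fold64(uint64_t x)
{
    while (x >> 32)
        x = (x & 0xFFFFFFFFu) + (x >> 32);
    return (uint32_t)x;
}

/*
 * Sum / sum-of-sums checksum over an array of 32-bit words.
 * Note: sum2 can itself wrap modulo 2^64 on very large inputs; a
 * production version might fold periodically to avoid that.
 */
uint32_t image_checksum(const uint32_t *words, size_t nwords)
{
    uint64_t sum1 = 0, sum2 = 0;

    for (size_t i = 0; i < nwords; i++) {
        sum1 += words[i];   /* running sum of the values */
        sum2 += sum1;       /* running sum of the preceding sums */
    }
    return fold64(sum1) ^ fold64(sum2);
}
```

Because sum2 weights each word by its position, swapping two words changes the result, which a plain sum would not detect.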


Though a major incentive for binary compression is that the SDcard is
fairly slow (particularly noticeable in simulation, IRL ~ 600 kB/s isn't
too bad; though FAT filesystem overheads also take their cut), and so
compressing the binary can be used to accelerate loading times (with
decompression speed being somewhat higher than the IO speed).

Partly though, this is due to the use of polling IO. In the current
design, the MMIO interface can move multiple bytes at a time. Originally
it was a single byte at a time, and the IO speed limit was somewhat
slower (~ 100-200 kB/s).

Saving a few (virtual) seconds in a Verilog simulation running ~ 1000x
slower than real-time, is quite noticeable.