BGB
2024-12-11 11:35:22 UTC
Granted, this group isn't very active, but it is one of the few groups
where this sort of thing is on-topic, so...
So, for my project I have made a new experimental filesystem I am
calling DFS2.
Why?...
The existing filesystems didn't really fit what I wanted.
Namely:
Can support Unix-style features;
File ownership;
Symbolic links;
...
Not excessively complicated;
Needs to be relatively low overhead.
Neither RAM nor CPU time are abundant.
Target is 50MHz with RAM measured in MB.
I also wanted things like file compression.
But, more for IO optimization than saving space.
Currently I am using FAT32:
Does not natively support the desired features;
FAT chain walking and large clusters are not efficient to deal with;
No built-in compression.
A few existing options:
Minix:
Too limited, missing features.
EXT2:
Not 1:1 with wanted features;
Some aspects of the design seem annoying.
NTFS:
Significantly over-complicated.
Most other Linux filesystems:
Tend to be over-complicated, mostly trying to "aim higher" than EXT2;
Or are specialized for specific cases, like "initrd".
General design I went with:
Uses an inode table (kinda like EXT2)
Though, the table is itself represented with an inode (pros/cons);
The superblock gives the address of the inode table,
with inode 0 being the inode-table inode.
Inode structure consists of multiple tagged members.
Sorta like the NTFS MFT entries.
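As a rough sketch of the tagged-member idea (the header shape and tag
handling here are simplified and illustrative, not the exact on-disk
format):

  #include <stdint.h>

  /* Each inode member is a small self-describing record, vaguely like
   * an NTFS MFT attribute: a tag byte saying what it holds, a length
   * byte, then the payload. */
  typedef struct {
      uint8_t tag;   /* what this member holds (attrs, block map, ...) */
      uint8_t len;   /* total member size in bytes, header included    */
  } dfs2_itag_t;

  /* Walk the members packed inside a 256- or 512-byte inode, returning
   * the first one with the requested tag, or NULL if absent. */
  static const dfs2_itag_t *
  dfs2_inode_find(const uint8_t *inode, int isz, uint8_t tag)
  {
      int ofs = 0;
      while (ofs + 2 <= isz) {
          const dfs2_itag_t *m = (const dfs2_itag_t *)(inode + ofs);
          if (m->len < 2)
              break;               /* end marker / corrupt entry */
          if (m->tag == tag)
              return m;
          ofs += m->len;
      }
      return NULL;
  }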
Fixed-size directory entries:
They are 64 bytes, with a 48-byte name field;
Multiple entries are used for longer names (sorta like FAT).
Though, this should be infrequent,
as names longer than 48 bytes are rare.
Directory entries are organized into an AVL tree.
Tradeoff for AVL:
An AVL tree is more complicated than I would have liked, but it is
less complicated than a B-tree would have been. While a case could have
been made for using linear search (like FAT), AVL seemed
worth the complexity (it does mean directory lookups are roughly
log2(N)).
By my current estimates, the break-even point between an AVL tree
and a B-tree (with k=16) would likely not be reached until around 1000
files, which is well beyond the size of most directories. If one could
argue that most directories hold fewer than around 8-16 files, a case
could be made for linear search; but AVL doesn't really add much
cost beyond some additional code complexity.
Each directory entry encodes:
A name field;
The inode number;
Left, Right, and Parent links;
The Z depth of this node;
Directory Entry Type.
Initially, there was no parent-link in the tree, but the lack of a
parent link added significant complexity to tasks like walking the
directory or rebalancing nodes (it was necessary to keep track of this
externally), so I ended up adding a link. Though, at present this
reduces the maximum theoretical directory size to 2M files (unlikely to
be an issue).
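As a sketch, a 64-byte entry could look something like this (the exact
field layout here is a simplified illustration, not the actual on-disk
format; packing the three links as 21 bits each is what caps a
directory at 2^21, about 2M, entries):

  #include <stdint.h>

  typedef struct {
      uint8_t  name[48];  /* UTF-8, zero-padded; long names span entries */
      uint64_t links;     /* left/right/parent node indices, 21b each    */
      uint32_t inode;     /* inode number this entry refers to           */
      uint8_t  zdepth;    /* AVL Z depth, used when rebalancing          */
      uint8_t  type;      /* directory entry type                        */
      uint8_t  resv[2];   /* pad to 64 bytes                             */
  } dfs2_dirent_t;

  #define DENT_LEFT(e)    ((uint32_t)( (e)->links        & 0x1FFFFF))
  #define DENT_RIGHT(e)   ((uint32_t)(((e)->links >> 21) & 0x1FFFFF))
  #define DENT_PARENT(e)  ((uint32_t)(((e)->links >> 42) & 0x1FFFFF))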
Note that the filesystem is case-sensitive and assumes UTF-8 for
filenames (with some wonk for names longer than 48 bytes).
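Lookup is then a plain AVL descent, comparing names as raw UTF-8 bytes
(consistent with being case-sensitive). Roughly, using the entry
sketch above, and assuming slot 0 is left unused so it can double as a
nil link (ignoring multi-entry long names for brevity):

  #include <string.h>

  static const dfs2_dirent_t *
  dfs2_dir_lookup(const dfs2_dirent_t *dir, uint32_t root, const char *name)
  {
      uint8_t key[48] = {0};
      strncpy((char *)key, name, 48);   /* zero-pads short names */

      uint32_t i = root;
      while (i != 0) {                  /* 0 doubles as a nil link */
          const dfs2_dirent_t *e = &dir[i];
          int c = memcmp(key, e->name, 48);
          if (c == 0)
              return e;
          i = (c < 0) ? DENT_LEFT(e) : DENT_RIGHT(e);
      }
      return NULL;                      /* not found */
  }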
Block management within inodes:
Uses a scheme similar to EXT2, namely (with 32-bit block numbers):
16 entries, direct block number (16KB with 1K blocks)
8 entries, indirect block (2MB)
4 entries, double indirect block (256MB)
2 entries, triple indirect block (32GB)
1 entry, quad indirect block (4TB)
1 entry, penta indirect block (1PB)
For larger volumes and compressed files, 64-bit block numbers are used:
8 entries, direct block number (8K / 128K, with 1K or 16K blocks)
4 entries, indirect block (512K / 8MB)
2 entries, double indirect block (32MB / 512MB)
1 entry, triple indirect block (2GB / 32GB)
1 entry, quad indirect block (256GB / 1TB)
Blocks and inodes are allocated via bitmaps, which are themselves
assigned inodes (and store their data the same as normal files).
Reducing the number of entries here avoids making the inode bigger.
Currently, the design can use 256 or 512 byte inodes.
The 64-bit entry layout requires a 512-byte inode.
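To make the tiering concrete: resolving a logical block index to an
indirection depth under the 32-bit layout is a short walk down the
table (1K blocks and 4-byte entries give 256 block numbers per
indirect block; the helper name is just for illustration):

  #include <stdint.h>

  /* Entries per tier: 16 direct, 8 indirect, 4 double-, 2 triple-,
   * 1 quad-, 1 penta-indirect. With 256 block numbers per indirect
   * block, the tiers span 16KB, 2MB, 256MB, 32GB, 4TB, and 1PB. */
  static const unsigned dfs2_tier_ents[6] = { 16, 8, 4, 2, 1, 1 };

  /* Returns the indirection depth (0 = direct) covering logical block
   * index lblk, or -1 if it is beyond the penta-indirect range. */
  int dfs2_blk_tier(uint64_t lblk)
  {
      uint64_t per = 1;    /* blocks mapped per entry at this tier */
      for (int t = 0; t < 6; t++) {
          uint64_t span = dfs2_tier_ents[t] * per;
          if (lblk < span)
              return t;
          lblk -= span;    /* skip past this tier's range */
          per  *= 256;
      }
      return -1;
  }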
I had also considered span tables, but they didn't seem worth it (while
conceptually simpler, spans end up being more complicated to deal with
in terms of code).
For the 64-bit block numbers, with compressed files, the high-order bits
are used for some additional metadata (such as how this logical block of
the file is stored, and how many disk-blocks were used).
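Something along these lines (the exact bit split is illustrative; the
real widths aren't pinned down here):

  #include <stdint.h>

  /* Hypothetical split of a 64-bit entry for a compressed logical
   * block: low 48 bits address the first disk block, high bits say
   * how the block is stored and how many disk blocks it uses. */
  #define CBLK_ADDR(e)   ((e) & 0xFFFFFFFFFFFFull)   /* disk block number */
  #define CBLK_NBLKS(e)  (((e) >> 48) & 0xFF)        /* disk blocks used  */
  #define CBLK_MODE(e)   (((e) >> 56) & 0xFF)        /* storage mode      */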
For file compression:
A larger logical block size is used in this case:
Say, 16K/32K/64K vs 1K/2K/4K.
Currently I am using 16K logical blocks and 1K disk blocks.
This larger block is compressed,
and stored as a variable number of disk blocks.
Blocks are decompressed when loaded into a block-cache;
And, re-compressed when evicted (and dirty).
Underlying blocks are dynamically reallocated as needed.
Currently, using a non-entropy-coded LZ77 variant.
Say, along vaguely similar lines to LZ4.
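The decode side of such a scheme is the usual copy-literals /
copy-match loop. A minimal sketch (the token format here is invented
for illustration, not the actual one):

  #include <stdint.h>
  #include <string.h>

  /* Token byte: high 4 bits = literal count, low 4 bits = match
   * length minus 3. Literals follow the token, then a 16-bit LE
   * match distance; distance 0 marks end of stream. */
  size_t lz_decode(const uint8_t *src, uint8_t *dst, size_t dst_max)
  {
      uint8_t *d = dst, *end = dst + dst_max;
      for (;;) {
          uint8_t tok  = *src++;
          size_t  nlit = tok >> 4;
          size_t  nmat = (size_t)(tok & 15) + 3;
          if (nlit > (size_t)(end - d))
              break;                        /* would overrun output */
          memcpy(d, src, nlit);             /* copy literal run     */
          d += nlit;  src += nlit;
          uint16_t dist = (uint16_t)(src[0] | (src[1] << 8));
          src += 2;
          if (dist == 0)
              break;                        /* end-of-stream marker */
          const uint8_t *m = d - dist;      /* copy from history;   */
          while (nmat-- && d < end)         /* bytewise, so an      */
              *d++ = *m++;                  /* overlap self-extends */
      }
      return (size_t)(d - dst);             /* decompressed size    */
  }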
Currently, with all of this, code size is still moderately smaller than
my existing FAT32 driver... Granted, this isn't saying much.
Any thoughts?...