Discussion: ZenFS: A New File System
Mark
2004-08-06 02:28:49 UTC
I'm announcing here the release of the specifications for a new file
system: ZenFS. It's a 64-bit journaled file system, with support for
attributes, filesets, indexing, compressed files, and both per-file
and full volume encryption. It's the first part in a series of
designs for a secure operating system I'm writing.

If anyone is interested and has time, I'd appreciate feedback. I'm
open to all suggestions, comments, and constructive criticism. The
design could use a little peer review. Also, if anyone is in need of a
file system for their own projects, please feel free to use it. The
design is unpatented, not copyrighted, and free for use.

The specification can be found in many formats at the top of the
following site:

http://mark.friedenbach.net/

happy coding,
-Mark Friedenbach
<***@yahoo.com>
Nick Roberts
2004-08-06 17:24:36 UTC
Post by Mark
I'm announcing here the release of the specifications for a new file
system: ZenFS. It's a 64-bit journaled file system, with support
for attributes, filesets, indexing, compressed files, and both per-
file and full volume encryption. It's the first part in a series of
designs for a secure operating system I'm writing.
If anyone is interested and has time, I'd appreciate feedback. I'm
open to all suggestions, comments, and constructive criticism. The
design could use a little peer review. Also, if anyone is in need of
a file system for their own projects, please feel free to use it.
The design is unpatented, not copyrighted, and free for use.
The specification can be found in many formats at the top of the following site:
http://mark.friedenbach.net/
Wow!! Fantastic!! And I very rarely use two exclamation marks!!

I have some comments, on an initial reading, which I've put below.
Remember that these comments follow a brief initial reading of the
spec. I'll probably have more to say later.

However, what I want to say up front is that I think the design is
just /great/, and the specification is very well written (clear,
unambiguous, well organised). So much so that I think I would like
to adopt the format for AdaOS (Aquila), if you would agree to this.

AdaOS Aquila will be written in the Ada language, so the specification
would have to pin down the format of everything to the bit. (I think
it seems to do this, in fact.)

Would you consider adding the following features?

* Could files be permitted to be excluded from journalling at user
request (to trade integrity maintenance of the file for some increase
in performance)?

* I'd like to be able to have files that do not have file names (i.e.
their inodes would have 0 links).

* I'd like to permit the forward slash '/' character in file names
(they would be escaped, in software, to distinguish them from path
separators). CORBA (and the Macintosh?) seem to permit them.

* Where at the moment you have a 'flags' field, would you divide it
into two fields, 'mandatory_flags' and 'auxiliary_flags', the idea
being that software can ignore auxiliary flags (but not mandatory
ones)?

* In my own preliminary design, I have a 16-bit inode type field.
The basic types I have myself defined (so far) are:

1 -- inode metafile
2-255 -- reserved (for other metafiles)
256 -- unknown (plain binary)
257 -- Simple Text
258-511 -- reserved (for other simple types)
513 -- AdaOS Basic Directory
522 -- AdaOS NFCH
512,514-521,523-767 -- reserved (for AdaOS)
768-65535 -- reserved (for further types)

Code 256 is used for any file of a type (format) that is unknown or
does not correspond to any defined type code. 'AdaOS Basic Directory'
could surely be replaced by INODETYPE_DIRECTORY (with code 16?).
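
To restate that table as code (a sketch only; the C names here are
mine, not from any specification):

    /* Nick's preliminary 16-bit inode type codes. */
    enum inode_type {
        ITYPE_INODE_METAFILE = 1,    /* 2-255: reserved (other metafiles) */
        ITYPE_UNKNOWN        = 256,  /* plain binary / unrecognised type  */
        ITYPE_SIMPLE_TEXT    = 257,  /* 258-511: reserved (simple types)  */
        ITYPE_ADAOS_DIR      = 513,
        ITYPE_ADAOS_NFCH     = 522   /* 512,514-521,523-767: AdaOS;
                                        768-65535: further types */
    };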

* For a really small file, the data could all be stored in the inode (if
it will fit), in place of the 'first_blocks' field.

* Could 'data_size' be in terms of bits instead of bytes (since this
is what SQL-92 requires for a binary string)?

* I'd like to propose a small improvement to updating of the
superblock. You seem to suggest the superblock is only stored in one
place. I feel that's a bit dangerous. My idea is to permanently
reserve a certain number of blocks to store copies of the superblock,
spread evenly across the volume (maybe one every 256 MiB?). When
updating volume status, you merely update the nearest (to the current
allocation point) superblock copy. Upon mounting the volume, all the
superblock copies must be scanned to ascertain the most recently
updated one (but I think this can be assumed to take a reasonably
short amount of time). It's possibly not a bad idea to move the
armature across the disk once (when mounting the volume), since a
major mechanical fault may be detected in so doing (and it might even
prompt the drive to go into a recalibration cycle, and volume
mounting time seems an appropriate time to do this). To be honest, I
want to think about this some more :-)
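
As a rough sketch of that mount-time scan (C, with a hypothetical
update counter and block-layer read; none of these names come from
the ZenFS spec):

    #include <stdint.h>
    #include <stddef.h>

    #define SB_STRIDE (256ULL * 1024 * 1024)   /* one copy per 256 MiB */

    struct superblock {
        uint64_t magic;
        uint64_t update_seq;    /* bumped on every superblock write */
        /* ... volume status fields ... */
    };

    /* Assumed block-layer read; returns 0 on success. */
    extern int read_at(uint64_t byte_offset, void *buf, size_t len);

    /* Return the offset of the most recently updated valid copy. */
    uint64_t newest_superblock(uint64_t volume_size, uint64_t magic)
    {
        struct superblock sb;
        uint64_t best_off = 0, best_seq = 0;

        for (uint64_t off = 0; off < volume_size; off += SB_STRIDE) {
            if (read_at(off, &sb, sizeof sb) != 0)
                continue;               /* unreadable copy: skip it */
            if (sb.magic != magic)
                continue;               /* corrupt copy: skip it */
            if (sb.update_seq >= best_seq) {
                best_seq = sb.update_seq;
                best_off = off;
            }
        }
        return best_off;
    }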

* I'd like to support 'smart copying' of files. The idea is that each
run/extent descriptor also contains a reference counter (which need
only be small), so that a file can be copied merely by pointing its
inode at the same index structure, and incrementing the reference
count (of the top index). When a write occurs, the blocks containing
data and affected indices are split (duplicated), and the reference
counts of the indices pointing to them decremented. This supports
transaction management perfectly: any file updated within a
transaction is smart copied; to commit the transaction, the original is
deleted (so that overwritten data is lost, but new and unchanged data
is kept); to roll back the transaction, the new file is deleted (so
that the original data is restored). Combined with journalling, this
facility would provide the basis for bullet-proof ACID enforcement.
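
A rough single-level sketch of that write path in C (the structure
and names are hypothetical; a real implementation would walk the
whole index tree):

    #include <stdint.h>
    #include <stdlib.h>

    struct file_run {
        uint64_t start_block;   /* first block of the extent */
        uint32_t length;        /* run length in blocks */
        uint32_t refcount;      /* how many owners share this extent */
    };

    /* 'Smart copy': the new file simply shares the existing runs. */
    void smart_copy(struct file_run *runs, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            runs[i].refcount++;
    }

    /* Called before a write lands in a shared run. */
    struct file_run *split_for_write(struct file_run *run)
    {
        if (run->refcount == 1)
            return run;             /* sole owner: write in place */

        struct file_run *priv = malloc(sizeof *priv);
        if (priv == NULL)
            return NULL;
        *priv = *run;               /* duplicate the descriptor */
        priv->refcount = 1;         /* writer gets a private copy */
        run->refcount--;            /* shared run loses one owner */
        /* ...allocate fresh blocks and copy the data itself here... */
        return priv;
    }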

~~~

My initial comments are as follows.

* I doubt that it is necessary to 'sprinkle' magic numbers
interspersed among the data. If a disk sector gets written badly,
this fact will be detected by the controller. The only role a
magic number plays is to allow: regular software to double-check
that a structural element (block) really is what it thinks it should
be; data recovery software to reconstruct data (or the volume in
place), possibly using heuristics (guesswork). For both of these
purposes, all that is required is at most one magic number per
sector.

* You say:

A file is a named collection of bytes managed by the file system.
A file is a collection of bytes which may be ...

I think you should say:

A file is a named sequence of bytes managed by the file system.
A file is a sequence of bytes which may be ...

* You say:

A disk is split into one or more fixed size (but user specified)
partitions.

I think you should say:

A disk may be split into one or more fixed size (but user
specified) partitions.

since the IBM PC partitioning scheme is technically optional, and
it is not used on non-PC computers. Some (e.g. the Mac) use a
different partitioning scheme, but most use none.

* An 'index' (inode/file) might just as well be a proper database
table? Why not go the whole hog, and provide the complete nuts and
bolts of a database engine?

* Instead of using UTF-8 in various places, I suggest that: in
directories, Latin-1 only is used, because this means that computer
(operating) systems which cannot handle Unicode (e.g. an embedded
system) are not embarrassed (strings of %XYs are not helpful,
even to a someone who does not know the Roman letters); define
attribute types such as STRING_ISO_8859_1, STRING_UTF_8,
STRING_UTF_16 and so on.

* Does the '.parent' reverse mapping (inode to parent directory) mean
that hardlinks are not all equal, but rather that there is always
exactly one hardlink which is 'preferred' (or 'distinguished'), as
the file's 'real' (principal) name? I would actually prefer this to
be the case, since it means that disk usage information can be
presented to the user in terms of names (rather than inode numbers),
notwithstanding my suggestion for nameless files.

* I like the encryption idea. I particularly like the idea that the
file system software could: cache an encrypted file unencrypted in
memory; block such a file with 'big' blocks (say 64 KiB each);
remember an encryption state for each 'big' block in cache; encrypt
a (dirty) big block before writing it back to disk. Cool. This kind
of 'encryption in place' would be superior to anything offered by any
existing file system, to my knowledge. Is full-volume encryption
useful?

* I think the compression needs to be significantly rethought. I
suspect that we should not be concerned with compression at the file
system level, except maybe for something simple (e.g. RLE). It's just
too complex, in general, and better handled at higher levels. What
are your thoughts on this?

Some minor points on grammar and spelling:

* 'Elementary Principals' -> 'Elementary Principles' (I keep confusing
these two words myself)

* 'basic underlying principals of file' same

* 'ZenFS's on disk data' -> 'ZenFS's on-disk data'

* 'block size of 1K' -> 'block size of 1 KiB'

Would you consider using DocBook/XML to originate your documentation?

Finally, a comment on your copyright notice. I think it is fine; however,
I suspect that the sentence:

A verbatim copy may be reproduced or distributed for personal or
educational use in any medium physical or electronic without the
express permission of the author.

is legally meaningless, because these uses almost certainly fall within
the 'fair uses' explicitly excepted by copyright law. But IANAL.

More later.
--
Respect,
Nick Roberts
Mark
2004-08-07 05:35:34 UTC
Post by Nick Roberts
Wow!! Fantastic!! And I very rarely use two exclamation marks!!
<snip>
However, what I want to say up front is that I think the design is
just /great/, and the specification is very well written (clear,
unambiguous, well organised). So much so, that I think I would like
to adopt the format for AdaOS (Aquila), if you would agree to this.
Why, thank you! And thank you for your well-written reply, I've
addressed your points in series below.

And please, feel free to use ZenFS. It's out there for anyone who
wants it, no strings attached.
Post by Nick Roberts
Would you consider adding the following features?
* Could files be permitted to be excluded from journalling at user
request (to trade integrity maintenance of the file for some increase
in performance)?
You can actually do this as it stands right now. The only things that
are assumed to be journaled are updates to directories, indices, or
any of the on-disk file system structures. Systems are allowed to
also log file data, but that's a matter of policy, not mechanism (the
journal format is general enough to allow for this).

PS: Whoops, I don't think the spec says what has to be journaled.
I'll fix that.
Post by Nick Roberts
* I'd like to be able to have files that do not have file names (i.e.
their inodes would have 0 links).
That's an interesting idea. It's certainly possible, but needs to be
worked out a little more I think. How would you access the file?
Where would the inode address be stored?
Post by Nick Roberts
* I'd like to permit the forward slash '/' character in file names
(they would be escaped, in software, to distinguish them from path
separators). CORBA (and the Macintosh?) seem to permit them.
Yeah, that decision was pretty arbitrary. I chose that to remain
compatible as a Linux file system (POSIX specifies that restriction),
but I personally really have no preference. I thought of it as a
lowest common denominator kind of thing, I didn't realize CORBA lacked
that restriction. I'll remove it.
Post by Nick Roberts
* Where at the moment you have a 'flags' field, would you divide it
into two fields, 'mandatory_flags' and 'auxiliary_flags', the idea
being that software can ignore auxiliary flags (but not mandatory
ones)?
Hrm, possibly. What kind of flags do you have in mind? I ask only
because TYPE_BOOL attributes might be the mechanism of choice for
storing auxiliary flags, if the flags relate to the format of the data
and not where it is on disk. Sharing an auxiliary_flags field could
lead to collisions, whereas with attributes you can avoid that with a
prefix for all your attribute names (e.g. ".AdaOS:FLAG_NAME").
Post by Nick Roberts
* In my own preliminary design, I have a 16-bit inode type field.
1 -- inode metafile
2-255 -- reserved (for other metafiles)
256 -- unknown (plain binary)
257 -- Simple Text
258-511 -- reserved (for other simple types)
513 -- AdaOS Basic Directory
522 -- AdaOS NFCH
512,514-521,523-767 -- reserved (for AdaOS)
768-65535 -- reserved (for further types)
Code 256 is used for any file of a type (format) that is unknown or
does not correspond to any defined type code. 'AdaOS Basic Directory'
could surely be replaced by INODETYPE_DIRECTORY (with code 16?).
And 256 & 257 would fall under INODETYPE_FILE; I'm not sure I follow on
what 1 and 522 are for. I take it the reserved values are for
defining new user-level types? Like 258=UTF-8 text file, 1024=PNG
image file, 1025=JPEG file, etc?

I think this falls again under "mechanism vs. policy." For my own OS,
I'm taking the BeOS route and having a TYPE_STRING attribute
".myOS:TYPE" which specifies the MIME type for the file. For your
needs you might want a TYPE_UINT16 attribute (".AdaOS:TYPE"?). For
interoperability reasons I think it would be best to limit
inode_type's values to those relating to the file system itself.
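
To make that concrete, a hypothetical call sequence (the attribute
API shown here is assumed, not part of the spec):

    #include <stdint.h>
    #include <stddef.h>

    /* Assumed attribute interface -- illustrative only. */
    enum attr_type { TYPE_STRING, TYPE_UINT16 };
    extern int set_attr(int fd, const char *name, enum attr_type type,
                        const void *value, size_t len);

    void tag_png(int fd)
    {
        /* BeOS-style MIME tagging, with a prefixed attribute name
         * so different systems cannot collide. */
        set_attr(fd, ".myOS:TYPE", TYPE_STRING, "image/png",
                 sizeof "image/png" - 1);
    }

    void tag_adaos(int fd)
    {
        uint16_t t = 522;   /* e.g. AdaOS NFCH, from Nick's table */
        set_attr(fd, ".AdaOS:TYPE", TYPE_UINT16, &t, sizeof t);
    }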
Post by Nick Roberts
* For a really small file, the data could all stored in the inode (if
it will fit), in place of the 'first_blocks' field.
It's in there. (page 26, do a search for the ".data" attribute)
Post by Nick Roberts
* Could 'data_size' be in terms of bits instead of bytes (since this
is what SQL-92 requires for a binary string)?
Do you mean the data_size in the inode, or the data_size in the
small_attr region? In either case it'd cut the maximum size one can
store down by a factor of 8... You could always shift three bits.
Post by Nick Roberts
* I'd like to propose a small improvement to updating of the
superblock.... <snip>
I thought about this too. As you demonstrated, there isn't really a
good solution. I'm not convinced yet there's much of a problem
either. The chances of the superblock wearing out are just so small on
a modern hard drive (compared to the life of the hard drive). Unless
the whole region gets scratched by the drive head or something, in
which case you're screwed anyway because you'd have lost the
neighboring boot block as well. I'll think about it some more too,
but I'm pretty sure the complexity will outweigh any benefit.
Post by Nick Roberts
* I'd like to support 'smart copying' of files. The idea is that each
run/extent descriptor also contains a reference counter (which need
only be small), so that a file can be copied merely by pointing its
inode at the same index structure, and incrementing the reference
count (of the top index). When a write occurs, the blocks containing
data and affected indices are split (duplicated), and the reference
counts of the indices pointing to them decremented. This supports
transaction management perfectly: any file updated within a
transaction is smart copied; to commit the transaction, the original is
deleted (so that overwritten data is lost, but new and unchanged data
is kept); to roll back the transaction, the new file is deleted (so
that the original data is restored). Combined with journalling, this
facility would provide the basis for bullet-proof ACID enforcement.
Oooh, this is a unique idea. You don't need it for transaction
management though; that can all be done with user-specified
transactions in the existing journal. Where else might this be
useful? It is a really neat concept. It'd require extending the
file_run structure by another 8 bytes though (for alignment reasons).
Post by Nick Roberts
~~~
My initial comments are as follows.
* I doubt that it is necessary to 'sprinkle' magic numbers
interspersed among the data. If a disk sector gets written badly,
this fact will be detected by the controller. The only role a
magic number plays is to allow: regular software to double-check
that a structural element (block) really is what it thinks it should
be; data recovery software to reconstruct data (or the volume in
place), possibly using heuristics (guesswork). For both of these
purposes, all that is required is at most one magic number per
sector.
Actually, they're all there for good reasons. The first one in the
superblock identifies the volume, the next two help detect buffer
overrun of the volume name, the rest are padding bytes for alignment
that I couldn't think of any other use for (same goes for the inode
structure). The bytes are there, might as well put them to good use
(it's unlikely, as you say, that the disk might garble one and not the
other, but buggy software might).
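
For illustration, the overrun check might look like this (the field
names and magic values are made up; the real layout is in the spec):

    #include <stdint.h>

    #define SB_MAGIC2 0x0123456789ABCDEFULL   /* hypothetical values */
    #define SB_MAGIC3 0xFEDCBA9876543210ULL

    struct superblock_head {
        uint64_t magic1;     /* identifies the volume */
        char     name[64];   /* fixed-size volume name */
        uint64_t magic2;     /* guards the end of name[] */
        uint64_t magic3;
    };

    /* If software wrote past name[], the guard values are destroyed. */
    int name_overrun(const struct superblock_head *sb)
    {
        return sb->magic2 != SB_MAGIC2 || sb->magic3 != SB_MAGIC3;
    }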
Post by Nick Roberts
* An 'index' (inode/file) might just as well be a proper database
table? Why not go the whole hog, and provide the complete nuts and
bolts of a database engine?
Well, there are two reasons for that:

1) mechanism vs. policy; most of the things you'd need to build a
general database system are there, but it's up to software to do that.
I didn't see any need to go any further in the spec.

2) I'm not a database expert. Is there anything missing that should
be there?

Compact implementations that do not require database features need only
support inode->inode (for ".parent" index) and string->inode (for
directories) indices.
Post by Nick Roberts
* Instead of using UTF-8 in various places, I suggest that: in
directories, Latin-1 only is used, because this means that computer
(operating) systems which cannot handle Unicode (e.g. an embedded
system) are not embarrassed (strings of %XYs are not helpful,
even to a someone who does not know the Roman letters); define
attribute types such as STRING_ISO_8859_1, STRING_UTF_8,
STRING_UTF_16 and so on.
Well, UTF-8 is quickly becoming de facto for new stuff (unless you're
in the NT world). Drivers for older systems that don't support it can
always convert on the fly. I'm a little hesitant to include something
like this since any format under the sun can be converted to UTF-8.
Adding extra types would complicate things for everyone, even those
who already run UTF-8.
Post by Nick Roberts
* Does the '.parent' reverse mapping (inode to parent directory) mean
that hardlinks are not all equal, but rather that there is always
exactly one hardlink which is 'preferred' (or 'distinguished'), as
the file's 'real' (principal) name? I would actually prefer this to
be the case, since it means that disk usage information can be
presented to the user in terms of names (rather than inode numbers),
notwithstanding my suggestion for nameless files.
Duplicate entries are allowed in the '.parent' index (or any attribute
index, for that matter). A hard linked file will have multiple
entries in the '.parent' index.

I thought about doing semi-hard links as well, with one entry being
the preferred, but it didn't really make sense except in the situation
you mentioned, and added an extra layer of indirection. You can
always just choose the first entry that shows up in '.parent'.
Post by Nick Roberts
* I like the encryption idea. I particularly like the idea that the
file system software could: cache an encrypted file unencrypted in
memory; block such a file with 'big' blocks (say 64 KiB each);
remember an encryption state for each 'big' block in cache; encrypt
a (dirty) big block before writing it back to disk. Cool. This kind
of 'encryption in place' would be superior to anything offered by any
existing file system, to my knowledge. Is full-volume encryption
useful?
Yes. At the very least, it's a catch-all for files that you forget to
encrypt, or applications that do something stupid like store
unencrypted info in a temp file (people seem to do stupid stuff like
this all the time).

It also prevents an attacker from learning anything from the hierarchy
structure and file names (a more serious concern). The file name,
creation time, size, and location in the hierarchy can give away a lot
of info, for example.

Not everyone will need it, however.
Post by Nick Roberts
* I think the compression needs to be significantly rethought. I
suspect that we should not be concerned with compression at the file
system level, except maybe for something simple (e.g. RLE). It's just
too complex, in general, and better handled at higher levels. What
are your thoughts on this?
Sparse files.

Scientific compute servers might have sparse files many terabytes in
size (conceivably the whole 2^64 byte range), but with perhaps only a
few gigabytes of real data. The data's usually highly compressible
too. As I note in the spec though, uniformly compressing such a
large file (as is the only real option if compression were at a higher
level) doesn't work as well as you might think, unless you come up
with a new compression format designed for that, but I didn't want to
open that can of worms.

Besides, if you don't need it, and interoperability isn't an issue,
you don't need to implement it.
Post by Nick Roberts
Would you consider using DocBook/XML to originate your documentation?
Yes. I'm going to switch to LaTeX or DocBook soon. I've just never
used either, and wanted to get some ideas down on paper quickly before
learning a new system.


Well, that's all I guess. Thanks for the comments, including the
grammar ones I skipped (I'll make the corrections). I hope to hear
more ideas if you have them.
-Mark Friedenbach
Nick Roberts
2004-08-08 11:05:24 UTC
Post by Mark
And please, feel free to use ZenFS. It's out there for anyone who
wants it, no strings attached.
Excellent. It really solves a problem for me, as I have been
agonising over what format to use for Aquila for a long time now.
Post by Mark
Post by Nick Roberts
* Could files be permitted to be excluded from journalling at user
request (to trade integrity maintenance of the file for some
increase in performance)?
You can actually do this as it stands right now. The only things
that are assumed to be journaled are updates to directories, indices,
or any of the on-disk file system structures. Systems are allowed to
also log file data, but that's a matter of policy, not mechanism (the
journal format is general enough to allow for this).
Perfect.
Post by Mark
PS: Whoops, I don't think the spec says what has to be journaled.
I'll fix that.
:-)
Post by Mark
Post by Nick Roberts
* I'd like to be able to have files that do not have file names
(i.e. their inodes would have 0 links).
That's an interesting idea. It's certainly possible, but needs to
be worked out a little more I think. How would you access the
file?
The API would provide an alternative open(), which accepts an inode
number instead of a (path) name.
Post by Mark
Where would the inode address be stored?
There would also be an alternative creat(), which does not take a new
(path) name, but which returns the inode number of the created file.
The application program would be responsible for storing the inode
number (possibly in memory, for a temporary file, or in another file
or in a database for longer term storage).
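
A sketch of what those calls might look like in C (these interfaces
are hypothetical -- not POSIX, and not in the ZenFS spec):

    #include <stdint.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <fcntl.h>

    typedef uint64_t inode_no;

    /* Assumed kernel interfaces -- purely illustrative. */
    extern int open_by_inode(inode_no ino, int flags);
    extern int creat_anonymous(inode_no *ino_out, mode_t mode);

    int example(void)
    {
        inode_no ino;

        /* Create a nameless file: no directory entry, zero links. */
        int fd = creat_anonymous(&ino, 0600);
        if (fd < 0)
            return -1;
        write(fd, "blob", 4);
        close(fd);

        /* Reopen it later, directly by inode number; the application
         * is responsible for having remembered 'ino' somewhere. */
        return open_by_inode(ino, O_RDWR);
    }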

This facility would be quite important to me, I think. AdaOS (which
will be based on CORBA, I think) is going to be an object-oriented OS.
Accessing objects (and perhaps database BLOBs) by inode number (rather
than any name) would be an important capability.
Post by Mark
Post by Nick Roberts
* I'd like to permit the forward slash '/' character in file names
(they would be escaped, in software, to distinguish them from path
separators). CORBA (and the Macintosh?) seem to permit them.
Yeah, that decision was pretty arbitrary. I chose that to remain
compatible as a Linux file system (POSIX specifies that
restriction), but I personally really have no preference. I thought
of it as a lowest common denominator kind of thing, I didn't realize
CORBA lacked that restriction. I'll remove it.
It's a dilemma, because we both know full well that if Joe User
creates a pile of files with '/'s in their names, and then, say, tries
to FTP them onto a Unix system, it's going to go nasty, and naturally
it's the OS or file system that will get the blame.

The problem is, it cuts both ways. If Jane User tries to FTP a pile
of files from her Macintosh (with '/'s in their names) to a ZenFS
system, and it goes belly-up, it's ZenFS that'll get the blame again.

We cannot win :-)
Post by Mark
Post by Nick Roberts
* Where at the moment you have a 'flags' field, would you divide it
into two fields, 'mandatory_flags' and 'auxiliary_flags', the idea
being that software can ignore auxiliary flags (but not mandatory
ones)?
Hrm, possibly. What kind of flags do you have in mind? I ask only
because TYPE_BOOL attributes might be the mechanism of choice for
storing auxiliary flags, if the flags relate to the format of the
data and not where it is on disk. Sharing an auxiliary_flags field
could lead to collisions, whereas with attributes you can avoid that
with a prefix for all your attribute names (e.g. ".AdaOS:FLAG_NAME")
I think I explained this badly. I should probably have said something
like 'important flags' and 'unimportant' flags, instead of 'mandatory'
and 'auxiliary'.

The idea is simply that, by defining some of a set of future potential
flags as 'unimportant', new unimportant flags can be introduced without
requiring a version (or revision) change. A new flag can be deemed
unimportant if software can safely ignore it (innocent of its
existence).

To concoct an example, a new inode flag that means something like
'make access to this file especially fast' (maybe for use by heavily
accessed files) could probably be considered unimportant. An inode flag
that means 'safety copy of another inode, do not use' would probably
have to be considered important, and its introduction would therefore
require a version change.

This idea is not important, but it might be nice.
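
A sketch of how such a split might be checked (the half-and-half
split and the flag values are made up, using the two examples above):

    #include <stdint.h>

    /* Hypothetical split: the low half of the word holds 'important'
     * flags (software must understand them), the high half holds
     * 'unimportant' ones (software may safely ignore them). */
    #define FLAGS_IMPORTANT   0x0000FFFFu
    #define FLAG_SAFETY_COPY  0x00000001u  /* important: do not use */
    #define FLAG_FAST_ACCESS  0x00010000u  /* unimportant: a hint */

    /* Refuse an inode carrying important flags we don't recognise;
     * unknown unimportant flags are silently ignored. */
    int inode_usable(uint32_t flags, uint32_t known_important)
    {
        return (flags & FLAGS_IMPORTANT & ~known_important) == 0;
    }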
Post by Mark
Post by Nick Roberts
* In my own preliminary design, I have a 16-bit inode type field.
...
And 256&257 would fall under INODETYPE_FILE; I'm not sure I follow on
what 1 and 522 are for.
Type 1 denotes a 'metafile' containing the inodes (in my own preliminary
design). Unimportant.

Type 522 denotes files which start with the AdaOS 'Native File Common
Header'. There is a field in this header which specifies the (sub-)type
of the file. It is mainly used for AdaOS-specific things (such as
executable modules). Little relevance to ZenFS.
Post by Mark
I take it the reserved values are for defining new user-level types?
Like 258=UTF-8 text file, 1024=PNG image file, 1025=JPEG file, etc?
It was really only to define a few low-level file types, so that low-
level software (boot programs, basic device drivers, etc.) had a way
to distinguish (or check) file types, without recourse to complicated
mechanisms. So probably PNG and JPEG files would not fall into this
category (they would only be used by high-level application
software). However, some OS-specific files (such as AdaOS executable
modules) would fall into this category.
Post by Mark
For interoperability reasons I think it would be best to limit
inode_type's values to those relating to the file system itself.
I thought it might be handy to have a 16-bit type field, where the
upper byte is used to distinguish operating systems (say 0=ZenFS,
1=yourOS, 2=AdaOS, 3=Linux, 4=BSD, etc.), and the lower byte to
specify a basic file type within the OS. Sorry I didn't explain this
better in the first place!
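
In C terms, something like this (the macros are illustrative; the OS
codes are the ones suggested above):

    #include <stdint.h>

    enum { OS_ZENFS = 0, OS_YOUROS = 1, OS_ADAOS = 2, OS_LINUX = 3 };

    #define MAKE_TYPE(os, t)  ((uint16_t)(((os) << 8) | ((t) & 0xFF)))
    #define TYPE_OS(v)        ((uint8_t)((v) >> 8))    /* which OS */
    #define TYPE_CODE(v)      ((uint8_t)((v) & 0xFF))  /* type in OS */

    /* e.g. MAKE_TYPE(OS_ADAOS, 13) for an AdaOS-specific file type. */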
Post by Mark
I think this falls again under "mechanism vs. policy." For my own
OS, I'm taking the BeOS route and having a TYPE_STRING attribute
".myOS:TYPE" which specifies the MIME type for the file. For your
needs you might want a TYPE_UINT16 attribute (".AdaOS:TYPE"?).
Yes, that's perfect. And so a PNG would be .AdaOS:TYPE=image/png and
a JPEG would be .AdaOS:TYPE=image/jpeg and so on.

However, I think there would be value in having a few conventional
types (defined in a separate specification). It might be pretty handy
to have a conventional file type attribute such as "MIME_TYPE" for
MIME types.

On the subject of attributes, the spiel for Reiser4 raises a couple
of questions.

Supposing we said that any directory can be opened as either a (raw)
datafile or as a directory. If it is opened as a file, we search for
a special entry (named ".DEFAULT" say), and open it. Suppose,
furthermore, that any entry in a directory can be marked as
'inherited': if a search of a subdirectory (down to any level) for
the same name fails, the inherited entry is returned instead. Might
this provide a sufficient basis for attribute storage? Might it also
be sufficient for the storage of streams? (But it might be dangerous
to spend too much time reading about Reiser4 :-)
Post by Mark
Post by Nick Roberts
* For a really small file, the data could all be stored in the inode
(if it will fit), in place of the 'first_blocks' field.
It's in there. (page 26, do a search for the ".data" attribute)
Aha! Cool.
Post by Mark
Post by Nick Roberts
* Could 'data_size' be in terms of bits instead of bytes (since
this is what SQL-92 requires for a binary string)?
Do you mean the data_size in the inode, or the data_size in the
small_attr region? In either case it'd cut the maximum size one
can store down by a factor of 8... You could always shift three
bits.
Sorry, I meant the uint64 data_size in an inode. It would mean
maximum file size was 2.3x10^18 bytes. Personally, I think that's
okay.

If the uint16 data_size in the small_attr region were also in terms
of bits, it would reduce the maximum size to 8191 7/8 bytes. Since
these attributes are meant to be small, isn't that okay?

Actually, I might suggest the offset index need only contain uint16s
(rather than uint32s), and each could be in terms of n/8, where n is
the offset in bytes (from the beginning of the small_attr area).

Not sure how important any of this is.
Post by Mark
Post by Nick Roberts
* I'd like to propose a small improvement to updating of the
superblock.... <snip>
I thought about this too. As you demonstrated, there isn't really
a good solution.
Indeed!
Post by Mark
I'm not convinced yet there's much of a problem either. The chances
of the superblock wearing out are just so small on a modern hard
drive (compared to the life of the hard drive).
I think it's extremely unlikely (but I'm not sure). On modern disks,
sectors can be relocated (at the factory); I believe that if a disk
has sector 0 bad, the disk is rejected (scrapped). However, a volume
on a partitioned disk can start anywhere. Modern disks do not use a
jelly substrate any more, and destructive head crashes are very rare.
I've heard of plans to reduce the inter-sector gap to nearly zero for
the next generation of disks, but I think there are ways to maintain
reliability (involving end-of-track redundancy data, I suspect).
Post by Mark
Unless the whole region gets scratched by the drive head or
something,
Probably just a track, or maybe a few adjacent tracks.
Post by Mark
in which case you're screwed anyway because you'd have lost the
neighboring boot block as well.
That's not a disaster at all. It is a standard technique (now) in
this situation to boot from another source (e.g. a floppy disk or
CD-ROM) and perform recovery or repair of the volume.
Post by Mark
I'll think about it some more too, but I'm pretty sure the
complexity will outweigh any benefit.
I don't feel what I suggested is all that complex. But it does need
thinking about. Let's avoid introducing any new major problems while
trying to solve a few relatively minor ones!
Post by Mark
Post by Nick Roberts
* I'd like to support 'smart copying' of files. ...
Oooh, this is a unique idea.
It's definitely an idea that's been knocking around in computer science
for yonks, but its application to file systems /might/ be unique. I've
not personally come across any file system that does it.
Post by Mark
You don't need it for transaction management though; that can all
be done with user-specified transactions in the existing journal.
Doh! I'm an idiot! I'm not used to journalling technology yet, and I
tend to forget. Sorry.

In fact, on this subject, the astute may recall that I've written
recently (in alt.os.development) criticising journalling. But I'm
learning more about it, and I'm beginning to think that my criticisms
were hasty. I can see that a log could help speed up safe rewrites, by
allowing them to be reordered.
Post by Mark
Where else might this be useful? It is a really neat concept.
Hmm. To be honest, I'm not so sure now.
Post by Mark
It'd require extending the file_run structure by another 8 bytes
though (for alignment reasons).
Yes, this is the problem.
Post by Mark
Post by Nick Roberts
* I doubt that it is necessary to 'sprinkle' magic numbers
interspersed among the data. ...
Actually, they're all there for good reasons. The first one in
the superblock identifies the volume, the next two help detect
buffer overrun of the volume name,
Hehe. You're speaking to an Ada programmer (for whom buffer overrun
just doesn't happen). I forgot about that one.
Post by Mark
the rest are padding bytes for alignment that I couldn't think of
any other use for (same goes for the inode structure). The bytes
are there, might as well put them to good use
I disagree with this. The bytes are there now, but we may well wish
to use them (for other purposes) in future ZenFS versions.
Post by Mark
(it's unlikely, as you say, that the disk might garble one and not
the other, but buggy software might).
However, I agree with this. In the end, it's a minor issue. Keep
things as they are.
Post by Mark
Post by Nick Roberts
* An 'index' (inode/file) might just as well be a proper database
table? Why not go the whole hog, and provide the complete nuts
and bolts of a database engine?
1) mechanism vs. policy; most of the things you'd need to build a
general database system are there, but it's up to software to do
that. I didn't see any need to go any further in the spec.
2) I'm not a database expert. Is there anything missing that
should be there?
I'm not quite sure if I can call myself a database expert. I am sort
of. Trouble is, you could study databases all your life and not
really be an expert. Hehe ;-)

I think we should provide everything that a SQL-92 CLI (ODBC)
database engine (device driver) would require. That gives us the
following checklist:

* types (char string, bit string, large & small integer, specific
decimal scale exact numeric, single & double float, various
datetimes and time intervals, possibly also BLOBs, references,
OIDs);

* organisational structures (catalogs, schemas);

* basic elements (tables, views, domains, constraints and
assertions, character sets & collations?, character set
translations?);

* permissions on these elements (insert, update, delete, select,
reference, usage) and authorisation (passwords?);

* transactions (roll back on recovery);

* stored procedures? (I don't think so);

* the Definition Schema (containing database metadata).

This checklist leaves out a lot of detail, but also it is only a
list of things that need to be covered by /some/ ZenFS mechanism
(a rose by any other name).

I think this stuff would be so useful, it would be worth
pursuing.
Post by Mark
Post by Nick Roberts
* Instead of using UTF-8 in various places, I suggest that: in
directories, Latin-1 only is used, because this means that
computer (operating) systems which cannot handle Unicode are not
embarrassed; define attribute types such as STRING_ISO_8859_1,
STRING_UTF_8, STRING_UTF_16 and so on.
Well, UTF-8 is quickly becoming de facto for new stuff (unless
you're in the NT world).
That may be true, but I think it is for all the wrong reasons. The
principal (wrong) reason is that most of the standards and
specifications which specify UTF-8 are being devised by North
Americans (and to a lesser extent, Europeans), such as the IETF
and W3C, for example, who naturally (but foolishly) assume that an
encoding which is efficient only for the characters in the Latin-x
sets will be acceptable to everyone in the world. A Chinese,
Japanese, Korean, Taiwanese, or Thai filename encoded in UTF-8 will
be diabolically inefficient. In most countries in the world, UTF-8
is just irrelevant and silly. For once, Microsoft are right.

I am getting the impression that there is a gradual consensus
accumulating in the standards communities that it is better to
stick to a simple ASCII-based 8-bit character set (encoding) for
names that will be used in programs (file/path names, environment
variable names, configuration item names, database table and field
names, and so on). This way, programs can continue to work in
execution environments for which support of bigger character
repertoires would be inappropriate. I divine that this is the
current thrust of the OMG's thinking wrt CORBA now.

For more sophisticated environments, files and other objects can
be given alternative names or descriptions using Unicode (the UCS),
which can be used for human display purposes. The extensible
attribute mechanism of ZenFS seems to be ideal for this purpose.
Post by Mark
Drivers for older systems that don't support it can always convert
on the fly. I'm a little hesitant to include something like this
since any format under the sun can be converted to UTF-8. Adding
extra types would complicate things for everyone, even those who
already run UTF-8.
I don't want to sound like this is a hobby horse for me. I can see
that a UTF-8 encoding doesn't actually /prevent/ the use of Latin-1
characters only throughout the file system. But would it be
acceptable for some implementations of ZenFS to simply not support
non-Latin-1 characters?

For example, supposing I wanted to write a data recovery tool
suitable for use from a bootable floppy disk. It would have to be a
relatively small program to fit. It would be quite acceptable for
this program to use text mode (no GUI or mouse). But supposing this
program had an option to display directory listings (perhaps to
allow the user to do selective deletion). It would be hard to
support the display of non-Latin-1 characters (and a few Latin-1
characters too, in fact).

There are an awful lot of existing programs that cannot support
encodings other than a simple 8-bit ASCII-based one. These programs
would be at risk of failing if they tried to manipulate a file with
a (path) name that contained exotic characters. Can we expect
systems implementing ZenFS to not support legacy software?

I think this is an important issue, that needs careful
consideration.
Post by Mark
Post by Nick Roberts
...
I thought about doing semi-hard links as well, with one entry being
the preferred, but it didn't really make sense except in the
situation you mentioned [disk usage reporting], and added an extra
layer of indirection. You can always just choose the first entry
that shows up in '.parent'.
There are alternative possible ways to solve the disk usage reporting
problem. Maybe this isn't an important issue at this level (file
system format specification).
Post by Mark
Post by Nick Roberts
Is full-volume encryption useful?
Yes. At the very least, it's a catch-all for files that you forget
to encrypt, or applications that do something stupid like store
unencrypted info in a temp file (people seem to do stupid stuff like
this all the time).
But that could be handled in different ways. The user's execution
environment could impose a default behaviour upon all software of
encrypting a file (unless explicitly told not to).
Post by Mark
It also prevents an attacker from learning anything from the
hierarchy structure and file names (a more serious concern). The
file name, creation time, size, and location in the hierarchy can
give away a lot of info, for example.
Not everyone will need it, however.
I think just about all of that information could be protected by a
mechanism that permitted encryption of individual directories and
inodes.

I think there is a problem with encrypting an entire volume, if that
volume is to be used by more than one user. It denies users the option
of not encrypting a file, which might be an important efficiency issue
in some cases.

I think individual structures (directories, inodes, indices) should be
individually encryptable. In which case, I think it would probably not
be necessary to have an option to encrypt the entire volume.
Post by Mark
Post by Nick Roberts
* I think the compression needs to be significantly rethought. ...
What are your thoughts on this?
Sparse files.
...
I'm pretty convinced that a system's API should expose methods to
allow application software to discover the location of holes in
sparse files, for various reasons. As I understand it, applications
running on scientific compute servers need to be able to see (and wait
on) the holes in order to use the sparse file as a communication (and
synchronisation) mechanism. That being the case, I still think that
compression should be left to higher levels of software; I feel sure
that a good algorithm for finding the redundancy in a particular data
set will require knowledge of the data set (which the file system does
not have). An increasing number of file formats these days have their
own specialist compression schemes designed-in (e.g. PNG, JPEG, MPEG).
Post by Mark
Post by Nick Roberts
Would you consider using DocBook/XML to originate your documentation?
Yes. I'm going to switch to LaTeX or DocBook soon. I've just never
used either, and wanted to get some ideas down on paper quickly before
learning a new system.
I use DocBook/XML myself. I would be happy to do most of the conversion
from your current format to DocBook for you, if you wish. (I'd then hand
it back to you for you to finish.)

My e-mail is real (***@acm.org) if you wish to e-mail me.
--
Best wishes,
Nick Roberts
Zhen Lin
2004-08-08 16:05:52 UTC
Nick Roberts wrote:
| I don't want to sound like this is a hobby horse for me. I can see
| that a UTF-8 encoding doesn't actually /prevent/ the use of Latin-1
| characters only throughout the file system. But would it be
| acceptable for some implementations of ZenFS to simply not support
| non-Latin-1 characters?

If you want to be minimalist, kill even Latin-1. Go down to ASCII. It
will be upwards compatible with virtually any encoding system. (For most
characters anyway. I know that Shift-JIS replaces \ with (Yen) and
another character)

UTF-8 does preempt Latin-1 anyway. A valid Latin-1 string is not always
a valid UTF-8 string; conversely, a valid UTF-8 string containing only
Latin-1 characters is byte-for-byte a valid Latin-1 string, but it
decodes to different characters. That is to say, what would be encoded
as \xE9 in Latin-1 would be encoded as \xC3 \xA9 in UTF-8.
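
A minimal Latin-1-to-UTF-8 converter makes the relationship concrete
(every Latin-1 code point needs at most two UTF-8 bytes; the function
name is mine):

    #include <stddef.h>

    /* 'out' must have room for 2*n bytes in the worst case. */
    size_t latin1_to_utf8(const unsigned char *in, size_t n,
                          unsigned char *out)
    {
        size_t o = 0;
        for (size_t i = 0; i < n; i++) {
            if (in[i] < 0x80) {
                out[o++] = in[i];                  /* ASCII unchanged */
            } else {
                out[o++] = 0xC0 | (in[i] >> 6);    /* 0xE9 -> 0xC3 */
                out[o++] = 0x80 | (in[i] & 0x3F);  /* 0xE9 -> 0xA9 */
            }
        }
        return o;    /* number of UTF-8 bytes written */
    }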

Nonetheless, your complaints about the inefficiency of UTF-8 for
non-ASCII characters are valid. Allowing multiple encodings would
probably make storage more efficient; however, it would complicate
code...

| There are an awful lot of existing programs that cannot support
| encodings other than a simple 8-bit ASCII-based one. These programs
| would be at risk of failing if they tried to manipulate a file with
| a (path) name that contained exotic characters. Can we expect
| systems implementing ZenFS to not support legacy software?

If it is 8-bit safe, it's UTF-8 safe. Assuming the user doesn't mutilate
half-a-dozen byte sequences in the process though, or create their own
invalid byte sequences...

UTF-8 does not generate any control characters anyway. (Assuming your
definition of control characters is \x00 - \x1F. Technically, \x80 -
\x9F in Latin-1 are control characters as well, and UTF-8 certainly
generates a few of those...)

P.S. With the advent of Mac OS X, / has become illegal in MacOS file
names. The reason why / is disallowed on *nix is because path parsing is
done on strings that have no escape mechanism for /. (Or for that
matter, \0.) (That I know of anyway.)
Nick Roberts
2004-08-09 11:34:45 UTC
Post by Zhen Lin
| I don't want to sound like this is a hobby horse for me. I can see
| that a UTF-8 encoding doesn't actually /prevent/ the use of Latin-1
| characters only throughout the file system. But would it be
| acceptable for some implementations of ZenFS to simply not support
| non-Latin-1 characters?
If you want to be minimalist, kill even Latin-1. Go down to ASCII. It
will be upwards compatible with virtually any encoding system. (For
most characters anyway. I know that Shift-JIS replaces \ with (Yen)
and another character)
It's not that I want to be minimalist, exactly! But I think it would be
helpful for files to have names that are easy for simple software to
deal with, in addition to other names. This is analogous to NTFS
supporting 8.3 names, to permit old DOS software to be able to access
files.
Post by Zhen Lin
UTF-8 does preempt Latin-1 anyway. A valid Latin-1 string is not
always a valid UTF-8 string; conversely, valid UTF-8 strings
containing only Latin-1 characters are also valid Latin-1 strings but
with different values. That is to say, what would be encoded as \xE9
in Latin-1 would be encoded as \xC3 \xA9.
So this would suggest that ASCII (ISO 646) would be a better choice
than Latin-1 (ISO 8859-1). I repeat that I would expect most files
to have at least one alternate name, in an alternate encoding, maybe
stored in a ZenFS extended attribute.
Post by Zhen Lin
Nonetheless, your complaints are valid, regarding the inefficiency
of UTF-8 for non-ASCII characters. Allowing multiple encodings would
probably make it more efficient to store, however, it would
complicate code...
I think the complication could be hidden from application programs,
generally, into libraries. Some application programs would access
the multiple names (for example, a directory display program) as a
matter of their primary (or extended) functionality.
Post by Zhen Lin
| There are an awful lot of existing programs that cannot support
| encodings other than a simple 8-bit ASCII-based one. These programs
| would be at risk of failing if they tried to manipulate a file with
| a (path) name that contained exotic characters. Can we expect
| systems implementing ZenFS to not support legacy software?
If it is 8-bit safe, it's UTF-8 safe. Assuming the user doesn't
mutilate half-a-dozen byte sequences in the process though, or create
their own invalid byte sequences...
Yes, you are right. I was thinking about display to the user (e.g. a
program that displayed "Do you wish to delete êÈÇÒóþññÑËî÷îÐÚæåÔÏØ?").
It might be better for it to display "Do you wish to delete WP01025?",
but I admit, I can see the flaw in my own argument. Hopefully the
user could easily obtain a display of the proper (extended) file name
for WP01025 in this situation (in a separate GUI window?), but that's
far from perfect.
Post by Zhen Lin
UTF-8 does not generate any control characters anyway. (Assuming your
definition of control characters is \x00 - \x1F. Technically, \x80 -
\x9F in Latin-1 are control characters as well, and UTF-8 certainly
generates a few of those...)
But at worst they'll generate a bit of an odd display. Not too bad.
Post by Zhen Lin
P.S. With the advent of Mac OS X, / has become illegal in MacOS file
names. The reason why / is disallowed on *nix is because path parsing
is done on strings that have no escape mechanism for /. (Or for that
matter, \0.) (That I know of anyway.)
Aha! Yes, of course. It would presumably have been quite hard to support
'/'s under OS X. Maybe this argues for disallowing '/' after all.
--
Nick Roberts
Sylvain
2004-08-19 17:06:00 UTC
Hi,
Post by Nick Roberts
spread evenly across the volume (maybe one every 256 MiB?)
Not a good idea, because if you have a RAID 0 array made of chunks whose size is a multiple of the fixed value you choose (whatever it is), then _all_ superblocks would be written to the same disk. This is not performance-friendly ;-)

One must do everything possible to avoid such a situation and try to spread all control information evenly across _all_ physical volumes.

Superblocks should be spread across the disk using a random interval between each copy. A linked-list mechanism gives you their exact positions across the whole disk. The random value can also be replaced with user input (because the user knows the physical layout and can supply a base value consistent with it).
Post by Nick Roberts
* I'd like to support 'smart copying' of files.
Very good idea. But should this be done in the filesystem structures? I don't think so. Journaling the modifications and choosing to commit or roll back is another way to do it without adding complexity to the filesystem. You can do it in your filesystem driver without any need for this to be permanently stored on the disk.

Filesystem specs should be as light as possible so that future extensions:
1- are possible without having to maintain compatibility for too many features!
2- do not break too many things!
Post by Nick Roberts
* Instead of using UTF-8 in various places
UTF is the future.
Post by Nick Roberts
* I like the encryption idea.
* ... I suspect that we should not be concerned with compression at the file
system level
Why encryption and not compression?

Sylvain.
Nick Roberts
2004-08-19 23:02:26 UTC
Post by Sylvain
Post by Nick Roberts
spread evenly across the volume (maybe one every 256 MiB?)
Not a good idea because if you have a RAID 0 array made of chunks
whose size is a multiple of the fixed value you choose (whatever it
is), then _all_ superblocks would be written to the same disk. This
is not performance-friendly ;-)
I think Mark and I have dropped the idea of multiple superblocks, in
fact. I think the necessary redundancy would be achieved by simply
having the information in the superblock repeated.

For example, if the block size is 4 KiB, 2 KiB are reserved as the
(maximum) size of the data stored in a superblock, and whenever the
superblock is updated, that 2 KiB of data is written out twice in
succession.

I imagine we can assume that it is unlikely, on a modern hard disk,
for one of those copies to be lost, so therefore it is acceptably
(extremely) unlikely for both to be lost. I imagine all the data
required will fit into 2 KiB. Of course, this is assuming a block
size of 4 KiB.
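
A sketch of that write/read pattern (the helper names are assumed;
sb_valid() stands in for whatever checksum or magic test the format
provides):

    #include <stdint.h>
    #include <string.h>

    #define SB_SIZE 2048   /* 2 KiB payload, assuming 4 KiB blocks */

    /* Assumed block-layer I/O and validity check. */
    extern int write_block(uint64_t block, const void *buf, size_t len);
    extern int read_block(uint64_t block, void *buf, size_t len);
    extern int sb_valid(const uint8_t *sb);

    /* Write the payload twice, back to back, into one block. */
    int superblock_write(uint64_t block, const uint8_t data[SB_SIZE])
    {
        uint8_t buf[2 * SB_SIZE];
        memcpy(buf, data, SB_SIZE);
        memcpy(buf + SB_SIZE, data, SB_SIZE);
        return write_block(block, buf, sizeof buf);
    }

    /* On read, fall back to the second copy if the first is bad. */
    int superblock_read(uint64_t block, uint8_t out[SB_SIZE])
    {
        uint8_t buf[2 * SB_SIZE];
        if (read_block(block, buf, sizeof buf) != 0)
            return -1;
        if (sb_valid(buf))
            memcpy(out, buf, SB_SIZE);
        else if (sb_valid(buf + SB_SIZE))
            memcpy(out, buf + SB_SIZE, SB_SIZE);
        else
            return -1;
        return 0;
    }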
Post by Sylvain
Post by Nick Roberts
* I'd like to support 'smart copying' of files.
Very good idea. But should this be done in the filesystem
structures? I don't think so. Journaling the modifications and
choosing to commit or roll back is another way to do it without
adding complexity to the filesystem.
Yes, that's quite right. I forgot!
Post by Sylvain
You can do it in your filesystem driver without the need for this to be
permanently stored on the disk.
The effect could be achieved at a higher level of software, so that
the file system software (or format) needn't be involved.
Post by Sylvain
Post by Nick Roberts
* Instead of using UTF-8 in various places
UTF is the future.
Do you mean UTF-8 specifically?
Post by Sylvain
Post by Nick Roberts
* I like the encryption idea.
* ... I suspect that we should not be concerned with compression
at the file system level
Why encryption and not compression?
Because, to be efficient and effective, encryption does not require
any knowledge of the (nature of) the data, whereas compression
generally does.

Many common and standard file formats these days include their own
compression schemes, for exactly this reason. The compression
methods they mandate are specially chosen to be suitable for the
kind of data stored in the file. If the file system software also
tries to compress data, I believe much of the time it will just
be making futile efforts to compress data that is already
compressed, and even when this is not the case, the compression it
performs may not be very efficient. I think compression is
therefore best left to higher levels of software.

However, encryption can be done on any data, and it is appropriate
for the file system to do it, because it can hold data unencrypted
in its cache (in memory) and re-encrypt any dirty data just before
flushing it back to disk. Higher level software cannot do this,
because it doesn't have access to the cache.
--
Nick Roberts
Sylvain
2004-08-20 01:21:49 UTC
Post by Nick Roberts
Post by Sylvain
This
is not performance-friendly ;-)
on a modern hard disk,
for one of those copies to be lost
I was talking about performance only. Having all superblocks on the same disk will be a _major_ performance problem.
Post by Nick Roberts
Do you mean UTF-8 specifically?
I mean UTF-8 should never have been made. But here it is, and it does save space compared with UTF-16. Still, I think UTF-16 would be a better choice in any case ...
Post by Nick Roberts
Post by Sylvain
Post by Nick Roberts
* I like the encryption idea.
* ... I suspect that we should not be concerned with compression
at the file system level
Why encryption and not compression?
Because, to be efficient
Efficient encryption? Good modern encryption requires huge amounts of CPU. Some tests I saw using openssl required more than one hour to encrypt a 50 MB file on a Pentium 350 MHz ... You really were talking about this kind of encryption?
Post by Nick Roberts
encryption does not require
any knowledge of the (nature of) the data, whereas compression
generally does.
I don't agree. It generally doesn't. It can specifically help in some cases, and it is required for lossy compression. In this case, compression should be smart: the filesystem driver should not store compressed data when the ratio is not high enough.

Also note that the first stage of compression at the filesystem level is to not use fixed-size blocks. A 1-byte file should use only 1 byte of storage (+ file information structures) on the disk, not a full fixed-size block. This is the first and easiest level of compression (e.g. ReiserFS does this).

Another benefit of compression done at the filesystem level is the improved performance that can arise from it in _some_ specific cases: where the system spends more time waiting for the disk than doing anything else _and_ you work on highly compressible data. Here, you can get better performance because reading the compressed data and uncompressing it (data are read more often than written) will probably be faster than reading the whole uncompressed data from the disk. Text information (as in many databases) is a perfect example where it could be possible to achieve better disk performance by using compression.

Compression is a good idea in some cases and a very bad one in others. You could imagine trying to compress some data while the system is idle, then checking whether there is any benefit (a good enough compression ratio) in keeping the compressed version of the data. This background activity should however be "manageable" by the system's admin.
Post by Nick Roberts
I believe much of the time it will just
be making futile efforts to compress data that is already
compressed
You mean all the data on your disk is currently compressed? I operate several high-end Linux servers, and if 1% of the stored data is compressed, that'd already be a very big ratio!

Sylvain.
Nick Roberts
2004-08-20 17:44:09 UTC
Post by Sylvain
[re superblocks]
I was talking about performance only. Having all superblocks on the
same disk will be a _major_ performance problem.
Do you really mean superblocks? Are you thinking of other kinds of
administrative block?
Post by Sylvain
Do you mean UTF-8 specifically?
I mean UTF-8 should never have been made. But here it is, and it
does save space compared with UTF-16. Still, I think UTF-16 would
be a better choice in any case ...
But do you think files should have UTF-16 names /instead/ of Latin-1
names, or /in addition/ to them (as I was suggesting)?
Post by Sylvain
Post by Sylvain
Why encryption and not compression?
Because, to be efficient
Efficient encryption? Good modern encryption requires huge amounts
of CPU. Some tests I saw using openssl required more than one hour
to encrypt a 50 MB file on a Pentium 350 MHz ... You really were
talking about this kind of encryption?
I really am talking about this kind of encryption! Remember that
there is a difference between the amount of processor time required
by an operation, and the efficiency of the operation. It is
inefficient if it uses up a lot of processor time /unnecessarily/.

My point is that the amount of processor time used up by (just about)
any encryption algorithm -- however much it may be -- is independent
of the (nature of) the data being encrypted.
Post by Sylvain
encryption does not require any knowledge of the (nature of) the
data, whereas compression generally does.
I don't agree. It generally doesn't.
I think you'll find that your opinion defies the consensus in the
world of information science. Admittedly, scientists can be wrong.
Post by Sylvain
Knowledge of the data can help in specific cases, and it is required for lossy compression.
I am certain that a useful lossy compression algorithm requires
knowledge of the nature and encoding of the data. Would, for example,
the JPEG algorithm be useful for anything other than JPEG encoded
pictures? I doubt it.
Post by Sylvain
In any case, compression should be smart: the filesystem driver should not store compressed data when the ratio is not high enough.
Indeed, but is that a reason why the file system driver should perform the compression rather than a higher-level piece of software?
Post by Sylvain
Also note that the first stage of compression at the filesystem level is to not use fixed-size blocks. A 1-byte file should use only 1 byte of storage ...
Yes, and I am not suggesting that this form of compression is removed
from the ZenFS specification.
Post by Sylvain
Another point in favor of compression at the filesystem level is the improved performance ... where the system spends more time waiting for the disk than doing anything else _and_ you work on highly compressible data. ...
Indeed, and I would say it will nearly always be quicker, on a typical
modern machine, to read a compressed file and decompress it than to
read it uncompressed.

But my argument is that the compression should be done at higher
levels of software, and should not be the concern of the file system
driver or other low-level file system software.
Post by Sylvain
Text information (as in many databases) is a perfect example where
it could be possible to achieve better disk performance by using
compression.
It is one example, and I think it supports my case. If the database
engine (or equivalent) were to carry out this compression itself, it
would be able to: compress only selected fields; use different
algorithms for different fields; perform centralised compression on
data stored in multiple files (e.g. multiple tables); perform
recompression at a time when it knows the data is less likely to be
accessed. If the file system driver were to perform the compression
it could do none of these things.
Post by Sylvain
Compression is a good idea in some cases and a very bad one in others. You could imagine trying to compress some data while the system is idle, then checking whether there is any interest (a good enough compression ratio) in keeping the compressed version of the data. This background activity should, however, be "manageable" by the system's admin.
Quite. I am not trying to argue against compression being done; I am
only arguing against it being done at such a low level (as the file
system driver).
Post by Sylvain
I believe much of the time it will just be making futile efforts
to compress data that is already compressed
You mean all the data on your disk is currently compressed?
As it happens, a great deal of the data on my machine's hard disk is
in one of the following formats: PNG; JPEG; MPEG; ZIP; an OpenOffice
format; CAB; compressed PDF. All of these formats are well-compressed.
I'm really not sure what percentage of the total data they represent,
but I wouldn't be surprised if it was more than 50%. My disk's
capacity is 38 GB, and the total reported usage is currently about 18
GB (which remains roughly stable). There is a lot of cruft.

If a file system driver tried to compress any of these files, it
would fail (the result would be no smaller), and so it would have
been wasting its time. I'm sure it would achieve about a 20% to 40% reduction in the others, which would represent a saving of less than 3 GB. It's hard to say what the resulting speed increase would be (I'm sure there would be /some/), but from the point of view of saving disk space, I doubt the gain would be of much significance.
Post by Sylvain
I operate several high-end Linux servers, and if 1% of the stored data were compressed, that'd already be a very big ratio!
If you don't mind me asking, what kind of data is stored?

Frankly, I'm willing to be persuaded that some kind of file system
level compression would be worthwhile, but I think a good (and
detailed) argument would be required.
--
Nick Roberts
Sylvain
2004-08-20 22:34:14 UTC
Permalink
Post by Nick Roberts
Do you really mean superblocks? Are you thinking of other kinds of
administrative block?
Any block, in fact, although superblocks are the best and easiest example.
Post by Nick Roberts
But do you think files should have UTF-16 names /instead/ of Latin-1
names, or /in addition/ to them (as I was suggesting)?
Only support one charset. It'll save you many problems ;-) Also, why Latin-1? Why not choose a Chinese charset?
Post by Nick Roberts
Post by Sylvain
Post by Sylvain
Why encryption and not compression?
I think everything you wrote just demonstrates that compression _must_ be included and encryption _must not_.

If you need to encrypt just a few files, then do it at a higher software level (command-line ssl!). If you need all data encrypted, then do it at the partition level.
On the other hand, including compression can offer services to users which would not be possible otherwise, because xyz software may only work with uncompressed text files, and the only way to achieve compression in that case is at the filesystem level.
Post by Nick Roberts
Post by Sylvain
I don't agree. It generally doesn't.
I think you'll find that your opinion is in defiance of the consensus in the world of information science. Admittedly, scientists can be wrong.
Let's say it another way: general-purpose algorithms are good enough for a very high percentage of current data representations. Average cases are the first concern of a filesystem structure.
Post by Nick Roberts
Indeed, but is that a reason why the file system driver should perform the compression rather than a higher-level piece of software?
See before.
Post by Nick Roberts
But my argument is that the compression should be done at higher
levels of software, and should not be the concern of the file system
driver or other low-level file system software.
Post by Sylvain
Text information (as in many databases) is a perfect example where
it could be possible to achieve better disk performance by using
compression.
It is one example, and I think it supports my case. If the database
engine (or equivalent) were to carry out this compression itself, it
would be able to: compress only selected fields; use different
algorithms for different fields; perform centralised compression on
data stored in multiple files (e.g. multiple tables); perform
recompression at a time when it knows the data is less likely to be
accessed. If the file system driver were to perform the compression
it could do none of these things.
It's an efficiency matter. In the case you give, you would have to write and integrate compression algorithms in a thousand places, perhaps a hundred times more than that (in every piece of software that accesses the filesystem, in fact)! If the capability is in the filesystem: one development, period. Of course it won't achieve the same ratios as specialized processing for some specific data, but since it can try to compress anything transparently (if enabled at the user level), it will certainly beat the previous ratios if you consider the total number of systems using it. And you can still add specific development in user-space software to achieve better results.
Post by Nick Roberts
As it happens, a great deal of the data on my machine's hard disk is
in one of the following formats: PNG; JPEG; MPEG; ZIP; an OpenOffice
format; CAB; compressed PDF. All of these formats are well-compressed.
I'm really not sure what percentage of the total data they represent,
but I wouldn't be surprised if it was more than 50%. My disk's
capacity is 38 GB, and the total reported usage is currently about 18
GB (which remains roughly stable). There is a lot of cruft.
Mine is 60 GB, and I think I could gain 50% more with good filesystem compression. It's just a matter of the kind of data you're working on ...
Post by Nick Roberts
Post by Sylvain
I operate several high-end Linux servers and if 1% of stored data
is currently compressed, that'd already be very big ratio !
If you don't mind me asking, what kind of data is stored?
Text! Source code (several languages), HTML, XML, logs, EDI data, ... All of this is stored in files or databases. I can compress a SQL dump of one server at a better than 98% ratio. I drop just 40 MB on the cartridge when backups run ;-)
Post by Nick Roberts
Frankly, I'm willing to be persuaded that some kind of file system
level compression would be worthwhile, but I think a good (and
detailed) argument would be required.
Without the details: the development cost of putting it elsewhere ;-) Having the option to do it there costs no more than checking one more bit when accessing a file. And you are not required to make use of it. Leave the option.

In fact, I'm asking myself the same question about encryption now that I have read your speech about compression ... I'm less and less sure it has to be in the fs specs. The way Linux did it (cryptoloop) is probably the one I'd defend.

Sylvain.
Sylvain
2004-08-19 16:51:07 UTC
Permalink
Hi,
Post by Mark
I'm announcing here the release of the specifications for a new file
system: ZenFS. It's a 64-bit journaled file system, with support for
attributes, filesets, indexing, compressed files, and both per-file
and full volume encryption. It's the first part in a series of
designs for a secure operating system I'm writing.
I still haven't had time to read everything, but it sounds interesting and well written.
Post by Mark
I'm open to all suggestions, comments, and constructive criticism.
One early question: why is it big-endian based? Do you mean it can only be useful on one kind of CPU?
A one-bit flag (set at format time) indicating whether the structures are coded big-endian or little-endian would be interesting. The OS can then choose to support only one option or the other (thus refusing to mount an unsupported xxx-endian system), or both, by accepting the overhead and calculating the required transforms.

Sylvain.
Nick Roberts
2004-08-19 22:38:14 UTC
Permalink
Post by Sylvain
One early question: why is it big-endian based? Do you mean it can only be useful on one kind of CPU?
I'm sure that was not Mark's intention.
Post by Sylvain
A one-bit flag (set at format time) indicating whether the structures are coded big-endian or little-endian would be interesting.
I think that would be a good idea.
--
Nick Roberts
amertime
2004-08-20 04:51:57 UTC
Permalink
Post by Mark
I'm announcing here the release of the specifications for a new file
system: ZenFS. It's a 64-bit journaled file system, with support for
attributes, filesets, indexing, compressed files, and both per-file
and full volume encryption. It's the first part in a series of
designs for a secure operating system I'm writing.
If anyone is interested and has time, I'd appreciate feedback. I'm
open to all suggestions, comments, and constructive criticism. The
design could us a little peer review. Also, if anyone is in need of a
file system for their own projects, please feel free to use it. The
design is unpatented, not copyrighted, and free for use.
The specification can be found in many formats at the top of the
http://mark.friedenbach.net/
happy coding,
-Mark Friedenbach
i have quickly read your spec and i have some questions:

- where and how do you keep track of used inodes.
- zenfs seems to not handle ACL?

i will have some questions later, after a complete read.
Mark
2004-08-21 05:21:47 UTC
Permalink
There's a new version (0.1pre3) up at mark.friedenbach.net. The only major changes so far have been to the list of supported encryption and hashing algorithms, lots of small editing changes, and a move from Gobe to DocBook for formatting (the PDF and HTML formats are now much easier to read).

I want to thank everyone for the feedback! It's good to see such a
lively discussion. Hopefully we can start to reach a consensus on the
issues discussed so far, so as to work changes into future drafts.
With that in mind, I'll address the questions asked of me, and throw in my 2 cents, if I may, on some of the discussions going on:

Sylvain: "why is it big-endian based ? You mean it can only
be usefull on one kind of CPU ?"

The big-endian choice is intended to increase interoperability. Byte
swapping is such a simple and quick operation that there's no reason a
file system driver shouldn't support it. If both big- and
little-endian formats were supported, however, then we're bound to end
up with naive implementations that only support one or the other,
which is something I want to avoid.

The choice of big-endian over little-endian was made because a lot of
the data structures are going to be reused verbatim in an archive
format for network transmissions. It made sense then to choose
big-endian since that's how the rest of the internet runs.
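
(For illustration only: a little-endian driver can decode an on-disk field with a handful of shifts, as in the generic helper below -- nothing here is from the spec -- and a decent compiler collapses it into the CPU's byte-swap instruction where one exists.)

#include <stdint.h>

/* Decode a big-endian 64-bit on-disk field into host order. Reading
   byte-by-byte is endian-neutral; on little-endian CPUs the compiler
   typically reduces the whole thing to one byte-swap instruction. */
static uint64_t get_be64(const unsigned char *p)
{
    return ((uint64_t)p[0] << 56) | ((uint64_t)p[1] << 48) |
           ((uint64_t)p[2] << 40) | ((uint64_t)p[3] << 32) |
           ((uint64_t)p[4] << 24) | ((uint64_t)p[5] << 16) |
           ((uint64_t)p[6] <<  8) |  (uint64_t)p[7];
}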

amertime: "where and how do you keep track of used inodes"

Inodes can be located anywhere on disk. To allocate a new inode, you
simply choose whatever unallocated block is convenient. Unlike other
file systems, there is no central pool of used/unused inodes.
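
(A sketch of what allocation might look like under this scheme -- every name below is invented, and I'm assuming here that the inode's block address doubles as its identifier:)

#include <stdint.h>

struct zen_inode;                          /* layout as per the spec       */
uint64_t alloc_block(void);                /* the driver's block allocator */
struct zen_inode *map_block(uint64_t blk); /* bring the block into memory  */
void init_inode(struct zen_inode *ino);    /* fill in fresh metadata       */

uint64_t alloc_inode(void)
{
    uint64_t blk = alloc_block(); /* any convenient free block will do   */
    init_inode(map_block(blk));
    return blk;                   /* the block address doubles as the id */
}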

amertime: "zenfs seems to not handle ACL?"

Mechanism vs. policy. The choice of protection mechanism (ACL vs.
UNIX groups vs. any other method) is up to the OS. Both ACL's and
UNIX protection can be implemented with per-file attributes.
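
(Sketching what I mean -- the attribute name and record layout below are invented for illustration, not part of the spec:)

#include <stdint.h>

/* Hypothetical contents of, say, a ".posix:perm" attribute (the name
   is made up). An ACL scheme would instead use an attribute holding
   a list of (principal, rights) entries. */
struct posix_perm_attr {
    uint32_t uid;  /* owning user                              */
    uint32_t gid;  /* owning group                             */
    uint16_t mode; /* rwxrwxrwx plus setuid/setgid/sticky bits */
};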

--Multiple Superblocks--

my opinion: generally a bad idea. figuring out where to place them is
hard enough: on a single disk you want them far apart to protect
against localized damage, with RAID you might want them right next to
each other (so it'd be the same sector, but on separate disks).
having the location of the superblock depend on the underlying
hardware is not a pretty solution at all. and no matter how you slice
it, keeping every superblock consistent is troublesome and potentially
a performance problem.

for this to make it into the spec, all these problems would have to be
solved (or at least bounded in an acceptable way)

--The '/' In File Names--

The ".target" attribute in symbolic links needs a way of specifying
paths, and the '/' is as good a character as any. Unless there's a
Unicode character for this purpose. Anyone know of one? If not, we
could allocate one from the private use range, or restrict the '/' as
a path name separator, and allocate a private use code that gets
translated as the '/' by conforming implementations. is it worth the
effort? i'm not sure (having '/' in a file name would be troublesome
on other OS's)
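
(If we went the private-use route, the translation itself is trivial. A sketch, with U+E000 -- UTF-8 bytes EE 80 80 -- picked arbitrarily as the stand-in code:)

#include <stddef.h>

/* Replace each literal '/' in a name with the private-use code point
   U+E000 when storing; the reverse mapping applies when displaying. */
size_t escape_slashes(const char *in, char *out, size_t outsz)
{
    size_t o = 0;
    if (outsz == 0)
        return 0;
    for (; *in != '\0'; in++) {
        if (*in == '/') {
            if (o + 3 >= outsz)
                break;
            out[o++] = (char)0xEE; /* U+E000 in UTF-8 */
            out[o++] = (char)0x80;
            out[o++] = (char)0x80;
        } else {
            if (o + 1 >= outsz)
                break;
            out[o++] = *in;
        }
    }
    out[o] = '\0';
    return o;
}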

--Encryption--

Full volume encryption is necessary for my application. I think we
can all agree though that if full volume encryption is a requirement
(which it is), it's best to do it at the file system level. It also
has the nice property that an encrypted volume is easily recognizable.
Implementations which choose not to support it can fail at mount time
and display to the user a reasonable error.

File encryption, not so much (especially if the encrypted file is
critical to the way the system works with the volume, like a recycle
bin folder or something). But the neat thing about file encryption is
that the syntax and code underlying it is almost identical to full
volume (really it's the reverse, but the point still holds), and has
been designed to be as non-invasive as possible. A disk defragmenter
or repair utility doesn't have to be encryption-aware--it can move
blocks around and examine file system properties without knowing keys
or other info (unless it's an encrypted volume of course)

In addition, while per-file encryption can be handled just fine by
higher levels in theory, real systems tend to do some pretty stupid
stuff (e.g. decrypting to a temp file, storing working copies and such
in plaintext form). Security is only as strong as the weakest link,
and moving file encryption into the system (while enforcing some API
restrictions at the same time) helps to close some of these holes.

PS: i'm considering removing encryption of directories, since that
only complicates matters and doesn't enhance security at all (since
all that info is in the .parent index anyway). thoughts?

--Compression--

This, notably, was *not* a design goal of ZenFS. When designing the
inode format, however, it became clear that compression could be added
with minimal changes to the spec.

The limited choice of compression algorithms was definitely on purpose. Gzip and bzip2 are general-purpose compression algorithms that work well on most redundant data. Gzip (or more accurately, DEFLATE) is fast enough to achieve a clear step up in performance when
reading compressed files on a slow hard drive. Desktop systems might
consider transparently compressing old files with gzip if a clear
improvement is to be gained. Bzip2 was optimized to compress text
(ASCII text to be fair, I don't know about general Unicode), and does
so quite well. Systems with a ton of text files (internet servers
come to mind) might consider bzip2'ing old non-volatile files to save
space.

Far more important, however, is the security gained by using compression in conjunction with file encryption. Most binary file types have a very regular structure, which attackers can take advantage of when attempting to recover the key. Compressing a file (even if no space is saved) eliminates this problem by destroying that regular structure. This is why PGP always gzips its input, even if the result is a larger message.
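
(In driver terms the write path is simply "compress, then encrypt". A sketch using zlib; encrypt_blocks() is a made-up stand-in for whatever cipher the volume uses:)

#include <stdlib.h>
#include <zlib.h>

int encrypt_blocks(const void *data, size_t len); /* made-up cipher hook */

/* Compress first (destroying the plaintext's regular structure),
   then encrypt. The compressed form is used even if it came out
   larger -- the point is structure removal, not space. */
int write_encrypted(const unsigned char *data, size_t len)
{
    uLongf clen = compressBound(len);
    unsigned char *cbuf = malloc(clen);
    if (cbuf == NULL || compress(cbuf, &clen, data, len) != Z_OK) {
        free(cbuf);
        return -1;
    }
    int rc = encrypt_blocks(cbuf, clen);
    free(cbuf);
    return rc;
}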

If file encryption were moved out of the system into a higher-level
interface, this requirement would go away, of course. But I've already presented above what is (in my own view) a good reason why encryption should remain at the file system level.

--Multiple String Formats--

My opinion: adds too many complications. which string is the correct
one? what happens when the user uses han characters, and your system
uses Latin-1 as the back end? what happens if the strings don't
match? what happens if one of the other encodings exceeds the 255
byte limit? where are the extra strings stored? not in the directory
i hope, that would really complicate the indexing format. what
happens if one system doesn't understand one of the encodings (in
particular if the user wants to rename a file)?

The current system, I think, is best. Systems can store other names as
attributes (e.g. ".Win32:DOSNAME"), but the UTF-8 name specified in
the standard always takes precedence as the "official" name string
when confusion arises.

--UTF-8 vs. UTF-16/32--

UTF-8's variable-length sequences aren't any more difficult to process than UTF-16's surrogate pairs and endianness. Most new Unicode systems these days default to UTF-8 for their encoding (the internet
standards come to mind). And there's a lot to be said for binary
ASCII compatibility. The only real argument against UTF-8 is its 50%
expansion of east asian strings.

When designing ZenFS, I chose 32 characters as the minimum for a maximum file name length, with 48 chars in all languages being the goal. English text gets 255 chars, most European somewhat less, and eastern European and Middle Eastern about 127. East Asian languages get about 85 characters or so. The worst-case scenario (all supplementary-plane characters) would limit the file name to no more than 63 characters. Still plenty according to the standards I chose, so I don't see length



Well that's it for now.
Again, I appreciate your feedback!
Thank you,
-Mark Friedenbach
<***@yahoo.com>
Ben Gainey
2004-08-21 11:57:04 UTC
Permalink
Post by Mark
--The '/' In File Names--
The ".target" attribute in symbolic links needs a way of specifying
paths, and the '/' is as good a character as any. Unless there's a
Unicode character for this purpose. Anyone know of one? If not, we
could allocate one from the private use range, or restrict the '/' as
a path name separator, and allocate a private use code that gets
translated as the '/' by conforming implementations. is it worth the
effort? i'm not sure (having '/' in a file name would be troublesome
on other OS's)
i don't know if this has been suggested before, but why not format the
.target attribute like so:

struct SYMLINK_TARGET
{
    uint64_t Length;          // the size of the whole .target attribute
    uint32_t NumberOfParts;   // the number of segments in the path...
    struct
    {
        uint32_t Size;        // the length of the string (strings are
                              // counted rather than nul terminated)
        char     Name[];      // the name of the path segment.
    } FileNameParts[];        // an array containing each path segment
};

You break down the file name into an array of path segments and store them
in the structure... that way you can disreguard the path separators allowing
each operating system to use whatever form of separator it chooses...

so for example, "/home/ben/somefile" becomes:

{
    39
    3              // number of parts == the number of segments in the path
    {
        {4, "home"}
        {3, "ben"}
        {8, "somefile"}
    }
}

this method may also make it easier for the filesystem to parse the symlink, given that it is essentially already parsed for it.
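
a rough sketch of that parsing (assuming the structure above has been read into a flat buffer; i'm ignoring the big-endian decoding and the bounds checking a real driver would need):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Walk the FileNameParts array and print the path using whatever
   separator the host OS prefers. */
void print_target(const unsigned char *attr, char sep)
{
    uint32_t nparts;
    memcpy(&nparts, attr + 8, 4);       /* NumberOfParts, after Length */
    const unsigned char *p = attr + 12; /* first FileNamePart          */
    for (uint32_t i = 0; i < nparts; i++) {
        uint32_t size;
        memcpy(&size, p, 4);            /* Size of this segment */
        printf("%c%.*s", sep, (int)size, (const char *)(p + 4));
        p += 4 + size;                  /* step to the next part */
    }
    putchar('\n');
}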

the only problem i can see with this is that if, say, operating system "A" uses "/" as a separator and allows "\" in the file name, and then operating system "B" comes along and uses "\" as the separator, you could have the situation where

"A" encodes the path "/home/ben/stupid\dangerous practices.doc" into a symlink whereby "stupid\dangerous practices.doc" is a single unit.
"B" decodes the symlink as "\home\ben\stupid\dangerous practices.doc", which is seen as a different path.

does ZenFS therefore have a list of banned characters in names such as / \ >
< & * ? | " or does everything go?
Nick Roberts
2004-08-21 22:32:37 UTC
Permalink
Post by Mark
There's a new version (0.1pre3) up at mark.friedenbach.net.
Cool. The presentation looks great. I've only had a brief look at it.
Post by Mark
Sylvain: "why is it big-endian based ? You mean it can only
be usefull on one kind of CPU ?"
The big-endian choice is intended to increase interoperability.
Byte swapping is such a simple and quick operation that there's no
reason a file system driver shouldn't support it. If both big- and
little-endian formats were supported, however, then we're bound to
end up with naive implementations that only support one or the
other, which is something I want to avoid.
The choice of big-endian over little-endian was made because a lot
of the data structures are going to be reused verbatim in an
archive format for network transmissions. It made sense then to
choose big-endian since that's how the rest of the internet runs.
The counter-argument is that it creates extra work for a little-
endian machine that might, in some cases, be significant. To write
back an index block could require many (a hundred?) integers to be
rewritten. A bit of a close call.

It has to be said, supporting both formats could seriously
complicate device driver code. I think I prefer it Mark's way.
Post by Mark
amertime: "zenfs seems to not handle ACL?"
Mechanism vs. policy. The choice of protection mechanism (ACL vs.
UNIX groups vs. any other method) is up to the OS. Both ACL's and
UNIX protection can be implemented with per-file attributes.
I think I reluctantly agree with this approach.

I would suggest that the inode header has a 128 bit opaque space
reserved in it, labelled 'security'. ZenFS would not define how this
space is used, but it would be quickly and easily accessible
(modified, in particular). Would an attribute do this job okay?

I'd also suggest that whatever security scheme is used by any system,
it is documented and published separately.
Post by Mark
--Multiple Superblocks--
my opinion: generally a bad idea.
Again, I think I agree with this, now.

I would suggest that the (one and only) superblock contains two
copies of the header information (one in the first half of the block,
the other in the second half).
Post by Mark
--The '/' In File Names--
The ".target" attribute in symbolic links needs a way of specifying
paths, and the '/' is as good a character as any. Unless there's a
Unicode character for this purpose. Anyone know of one?
I don't think there is a specific Unicode (or ISO) character for the
purpose, but it does seem to me that a control character could be
used. There's quite a choice. Perhaps HT (U+0009)?
Post by Mark
If not, we could allocate one from the private use range,
Or one of those ;-)
Post by Mark
or restrict the '/' as a path name separator, and allocate a private
use code that gets translated as the '/' by conforming
implementations.
That does sound like a feasible plan.
Post by Mark
is it worth the effort? i'm not sure (having '/' in a file name
would be troublesome on other OS's)
It wouldn't be much effort to implement, and I don't really see it
doing any harm.

I suppose operating systems that forbid '/' would have to translate
between this character and some other substitution character.

My personal feeling is that it is occasionally annoying (confusing,
even) not to be able to use '/' in a file name.
Post by Mark
--Encryption--
Full volume encryption is necessary for my application. I think
we can all agree though that if full volume encryption is a
requirement (which it is), it's best to do it at the file system
level.
I certainly agree with this.
Post by Mark
It also has the nice property that an encrypted volume is easily
recognizable.
And I agree with this too.
Post by Mark
Implementations which choose not to support it can fail at mount
time and display to the user a reasonable error.
Right.
Post by Mark
File encryption, not so much (especially if the encrypted file is
critical to the way the system works with the volume, like a recycle
bin folder or something).
I think the volume header should contain a flag which indicates
whether the volume contains (or might contain) encrypted files. This
would make it easy for a driver which doesn't support file encryption
to fail (relatively) gracefully at mount time.

My main argument for supporting file encryption at the file system
(driver) level would be that encryption can be deferred until blocks
need to be actually written out to disk. This trick would avoid the
encryption of data that never gets written to disk, such as a
temporary or ephemeral file, which would represent a significant
saving of processor time, without any danger of data getting written
unencrypted.
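
A sketch of the idea, with every name invented for illustration -- the point is simply that the encryption happens at the flush, not at each write:

#include <stdint.h>
#include <string.h>

struct cache_block {
    unsigned char data[4096]; /* held in plaintext while cached */
    uint64_t      disk_addr;
    int           dirty;
};

/* Invented hooks standing in for the cipher and the disk layer. */
void encrypt_in_place(unsigned char *buf, size_t len, uint64_t addr);
void write_sectors(uint64_t addr, const unsigned char *buf, size_t len);

void flush_block(struct cache_block *b)
{
    if (!b->dirty)
        return; /* never written out, so never encrypted */
    unsigned char tmp[4096];
    memcpy(tmp, b->data, sizeof tmp);
    encrypt_in_place(tmp, sizeof tmp, b->disk_addr); /* CPU cost paid only here */
    write_sectors(b->disk_addr, tmp, sizeof tmp);
    b->dirty = 0;
}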
Post by Mark
In addition, while per-file encryption can be handled just fine by
higher levels in theory, real systems tend to do some pretty stupid
stuff (e.g. decrypting to a temp file, storing working copies and
such in plaintext form).
Exactly.
Post by Mark
PS: i'm considering removing encryption of directories, since that
only complicates matters and doesn't enhance security at all
(since all that info is in the .parent index anyway). thoughts?
I agree with this in principle, but I'm fairly sure it would cause difficulties in practice.

For example, the '/home' directory is used on many systems to keep the home directories of users. User ids are often based on users' names (or other genuinely identifying information). It may well be desired to keep the identity of users secure (from a potential disk thief).

I don't think encryption of directories adds that much complication,
actually. Directory blocks will always be unencrypted in memory.
Post by Mark
--Compression--
Can't comment on this yet. Still reading the spec (sorry).
Post by Mark
--Multiple String Formats--
My opinion: adds too many complications.
I think it would actually reduce complications. I'll see if I can answer the questions.
Post by Mark
which string is the correct one?
The one which is UTF-8 and limited to 255 octets would be the 'canonical' name, I think, but I don't think it's really important. If it came to it, the canonical names are the ones a data recovery program would rely on (if there was a conflict).
Post by Mark
what happens when the user uses han characters, and your system
uses Latin-1 as the back end?
Well that was precisely my objection to using UTF-8 instead of Latin-1
for the 'canonical' names. I suppose the answer is that the back end
(do you mean low-level software?) would have to display non-Latin
characters in a form such as "[U+nnnn]" or "%%nnnn".
Post by Mark
what happens if the strings don't match?
I don't think they need to match, normally, except in terms of human
intelligibility.
Post by Mark
what happens if one of the other encodings exceeds the 255 byte
limit?
Therefore I don't think this matters.
Post by Mark
where are the extra strings stored? not in the directory
i hope, that would really complicate the indexing format.
No, not in the (canonical) directory, agreed. In attributes, I think, perhaps separately indexed.
Post by Mark
what happens if one system doesn't understand one of the encodings
(in particular if the user wants to rename a file)?
To my mind, the idea is that different sets of tools use the name
'dimension' that suits them.

For example, most Unix or GNU tools are difficult to use with
anything other than ASCII (apart from Emacs). They could use a
dimension that stuck to ASCII (or Latin-1?). On the other hand Emacs
could use a Unicode dimension, and so could Python. DOS programs
could use an 8.3 dimension (like NTFS).

Of course, if people weren't careful, they could get into a tangle.
What would be valuable would be some good tools for allowing users
to view all the name dimensions side-by-side and make changes.
Post by Mark
The current system, I think, is best. Systems can store other names
as attributes (e.g. ".Win32:DOSNAME"), but the UTF-8 name specified
in the standard always takes precedence as the "official" name
string when confusion arises.
That's really what I am suggesting.

I grudgingly accept the use of UTF-8 :-)
Post by Mark
--UTF-8 vs. UTF-16/32--
I now accept the use of UTF-8. I admit it does automatically offer
the choice of efficient ASCII or less efficient Unicode. I would
agree that storing file names efficiently on disk is unlikely to be
an important issue.

However, is the 255 length limit necessary? I admit, there could be
problems (of interchange) with a higher limit, but there will be
some problems with the 255 limit anyway, so maybe this isn't an
issue.

~~~

Just a few quick comments on the spec I have read so far. None
important, I think.

You say:

The file system must have efficient support for large sparse files.
Arbitrary holes of any size must be supported at any location in a
file (holes being regions of the file never written to which read
back as zeros). Applications which are sparse-file aware should be
able to deallocate existing blocks in a file and replace them with
holes, as need be.

It's the necessity for deallocation that I'm questioning.

From my understanding of how sparse files are meant to work, I don't
think it is necessary to be able to deallocate blocks. Of course, it
probably wouldn't hurt to provide the ability to deallocate blocks,
and I don't actually object to providing it. But I think it is
incorrect to think that supporting deallocation is necessary to
support sparse files.

In fact, I don't think it is strictly necessary for the file system
to support sparse files at all (since an intermediate software layer
can simulate them) but no doubt it is somewhat more efficient.
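
(For what it's worth, the read side of a hole is trivial either way -- a sketch, with lookup_extent() invented here to return 0 for a file block that has no backing storage:)

#include <stdint.h>
#include <string.h>

uint64_t lookup_extent(uint64_t file_block); /* invented; returns 0 for a hole */
void read_disk_block(uint64_t addr, unsigned char *buf);

void read_file_block(uint64_t file_block, unsigned char buf[4096])
{
    uint64_t addr = lookup_extent(file_block);
    if (addr == 0)
        memset(buf, 0, 4096); /* a hole reads back as zeros */
    else
        read_disk_block(addr, buf);
}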

You say:

ZenFS is mostly unique in its support for very large numbers of
independent filesets, as most file systems support only one "root
directory."

I'm not objecting to this capability, but may I ask please why you
consider it important to support very large numbers of independent
filesets? I'm intrigued.

Perhaps:

The length field stores one minus the length of the block run.

would be better phrased:

The length field stores the length of the block run minus one.

(You didn't really mean 1-n did you?)

That's all for now. I'm going to pore through the design to see if
I can find any howlers ;-)
--
Best wishes,
Nick Roberts
Mark
2004-08-22 04:31:35 UTC
Permalink
Post by Nick Roberts
The counter-argument is that it creates extra work for a little-
endian machine that might, in some cases, be significant. To write
back an index block could require many (a hundred?) integers to be
rewritten. A bit of a close call.
It has to be said, supporting both formats could seriously
complicate device driver code.
But it needs only be done when system blocks are read or written,
which won't be that often in most cases. Of course such statements
are dangerous--If a server programmer made heavy use of indices in her
application, she might feel a little differently.

She might argue that most computers are little-endian, and so the
default format should be little-endian, and big-endian machines should
be the ones to byte swap, and complain loudly. I would disagree. The
majority of little-endian machines are x86 based, and intel has a
special instruction for swapping bytes. Big-endian machines, on the
other hand, are mostly RISC, and lack this potential optimization.
Switching to little-endian would grant x86 only a small speed increase, while costing big-endian machines rather more.

In addition, the cost of byte swapping really is trivial, when
compared to disk access times, or especially the cost of encryption.
It may take a few cycles from other applications, but what
applications are both disk *and* CPU bound?
Post by Nick Roberts
I don't think there is a specific Unicode (or ISO) character for the
purpose, but it does seem to me that a control character could be
used. There's quite a choice. Perhaps HT (U+0009)?
Or U+001C, the file separator code. But I think Mr. Gainey's might be the better solution. The reason I've been hung up on the .target attribute is that I wanted a way for symbolic links to point across volumes (e.g. a desktop shortcut to the floppy disk). A data structure for symbolic links would provide enough leeway for this.
Post by Nick Roberts
I think the volume header should contain a flag which indicates
whether the volume contains (or might contain) encrypted files. This
would make it easy for a driver which doesn't support file encryption
to fail (relatively) gracefully at mount time.
That might be an unnecessary restriction, however. Encrypted user
files should not cause any unnecessary confusion for system
software. In fact, a system which doesn't support encryption should just treat such a file like a normal file; that way user-level code can look for cryptographic attributes and provide its own handling of the issue.

One of the beautiful aspects of encryption as it stands now is that it
is totally non-intrusive. The only issue I can see, like I said, is
if a necessary system file is encrypted. But why would a *system*
file be encrypted, if the system didn't support encryption? It could
be a security risk if the OS handles the situation poorly, but that's
about the only problem I can think of.
Post by Nick Roberts
For example, the '/home' directory is used on many systems to keep the home directories of users. User ids are often based on users' names (or other genuinely identifying information). It may well be desired to keep the identity of users secure (from a potential disk thief).
I don't think encryption of directories adds that much complication,
actually. Directory blocks will always be unencrypted in memory.
In which case the attacker'd just match entries in the .parent index
to entries in the .name index. File names (like all other system
attributes) are not designed to be secure. The solution here is to
rename the user's home directory to a random string, and then store
the (username,homedir) pair in an encrypted file somewhere (or full
volume encryption).

Encrypting directories makes it difficult to get around the file
system, but doesn't actually protect anything. A determined attacker
can still find what they need in the indices (which can't be encrypted
because implementations which do not support encryption still need
access to the indices). What it does do is make life difficult for
everyone else.
Post by Nick Roberts
However, is the 255 length limit necessary?
No. But is more than 255 bytes necessary? I decided on that limit
before the current index format, when longer strings would actually
have been an issue. But I honestly do think it's a fair limitation.
Most OS's limit name strings to 255 bytes already, so it'll make life
easier for them. Lengths can be represented with a single byte.

But really, when would you use more than 255 bytes? File names 32
bytes long are already too long. More than 80 bytes is absurd. The
longest file names I've ever seen are in my music collection:

"Artist - Album - Disk # - Track ## - Title.ogg"

-and even the longest names here are <75 chars in length. Do you
really need more than 255 bytes?
Post by Nick Roberts
It's the necessity for deallocation that I'm questioning.
From my understanding of how sparse files are meant to work, I don't
think it is necessary to be able to deallocate blocks...
That's correct. It's an extra design requirement on top of sparse
file support. Not every system has to support block deallocation, but
I wanted to ensure that it at least *could* be supported. The
reference implementation will support it.
Post by Nick Roberts
I'm not objecting to this capability, but may I ask please why you
consider it important to support very large numbers of independent
filesets? I'm intrigued.
As you know, ZenFS is the file system for the OS I'm developing. In this system the concept of a unified file hierarchy has been removed. Instead, each installed program, module, or user home directory is given its own fileset. A public FTP server would have a fileset for each of the domains it manages, for example. As you can see, the number of filesets is not easily bounded in practice.


-mark
Nick Roberts
2004-08-22 19:00:51 UTC
Permalink
Post by Mark
Post by Nick Roberts
The counter-argument is that it creates extra work for a little-
endian machine that might, in some cases, be significant. To write
back an index block could require many (a hundred?) integers to be
rewritten. A bit of a close call.
It has to be said, supporting both formats could seriously
complicate device driver code.
But it need only be done when system blocks are read or written, which won't be that often in most cases. Of course such statements are dangerous--if a server programmer made heavy use of indices in her application, she might feel a little differently.
And also, having big-endianness required by the specification is
only relevant for interchange of a ZenFS volume between different
machines (of different endiannesses).

An implementation X on a little-endian machine could defy the
specification and use little-endian integers instead of big-endian
ones, and get away with it, providing the volume was never moved
to a machine whose ZenFS driver was not X.

I think it's a difficult question to resolve, actually.
Post by Mark
She might argue that most computers are little-endian, and so the
default format should be little-endian, and big-endian machines
should be the ones to byte swap, and complain loudly. I would
disagree. The majority of little-endian machines are x86 based,
and intel has a special instruction for swapping bytes.
I know that the PowerPC architecture permits processors to support
both endiannesses (there is a flag to select which). I don't know
how many models actually support this feature, but for those which
do, the overhead would be zero.
Post by Mark
Big-endian machines, on the other hand, are mostly RISC, and lack
this potential optimization.
I think this is true, so if an endianness has to be chosen, I agree
it should be big-endianness.
Post by Mark
In addition, the cost of byte swapping really is trivial, when
compared to disk access times, or especially the cost of
encryption. It may take a few cycles from other applications,
but what applications are both disk *and* CPU bound?
Can we assume that most ZenFS volumes will, in practice, be on
non-removable disks? An implementor might complain that the big-
endian requirement was pointless in such cases, however little the
overhead.

Maybe there should be a flag which permits the endianness to be
selected, but with the admonishment in the specification that big-
endianness /must/ be used on any volume on an interchangeable
medium (even if the interchange will only be occasional).
Post by Mark
Post by Nick Roberts
I don't think there is a specific Unicode (or ISO) character for
the purpose, but it does seem to me that a control character
could be used. There's quite a choice. Perhaps HT (U+0009)?
Or U+001C, the file separator code. But I think Mr. Gainey's might be the better solution. The reason I've been hung up on the .target attribute is that I wanted a way for symbolic links to point across volumes (e.g. a desktop shortcut to the floppy disk). A data structure for symbolic links would provide enough leeway for this.
I think that would be fine. For the record, Ben (Gainey) suggested
the following structure:

struct SYMLINK_TARGET
{
uint64_t Length;
uint32_t NumberOfParts;
struct
{
uint32_t Size;
char Name[];
} FileNameParts[];
};

Length is the size of the whole .target attribute. NumberOfParts is
the number of segments in the path. FileNameParts is an array
containing each path segment. For each element of this array, Size
is the length of the string (strings are counted rather than nul
terminated), and the Name array contains the name of the path
segment.

I question the need for Length to be a 64-bit integer, and for
NumberOfParts and Size to be 32 bits.

CORBA (rearing its ugly head again) specifies two parts to a path
component: 'name' and 'kind', which are both 8-bit character strings.
The 'kind' part is really the file 'type' or 'extension' of DOS (but
with no specific length or character set restrictions), or simply
'the part after the dot'. So "myfile.txt" has name "myfile" and kind
"txt".

So my suggestion would be:

typedef struct
{
uint16 length;
uint8 part_count;
struct
{
uint8 name_len;
uint8 kind_len;
char name[];
char kind[];
} parts[];
} CPathStructure;

typedef struct
{
uint16 magic;
CPathStructure path;
} CSymlinkContent;

I realise that this is not valid C, but it expresses the format. I've
thrown in a magic number, which is obviously there to confirm that
the file really contains a symlink. (Please change the naming style
as you see fit.)

I suppose the character encoding should also be UTF-8.
Post by Mark
Post by Nick Roberts
I think the volume header should contain a flag which indicates
whether the volume contains (or might contain) encrypted files.
This would make it easy for a driver which doesn't support file
encryption to fail (relatively) gracefully at mount time.
That might be an unnecessary restriction, however. Encrypted user
files should not cause any unnecessary confusion for system
software.
True.
Post by Mark
In fact, a system which doesn't support encryption should just treat such a file like a normal file; that way user-level code can look for cryptographic attributes and provide its own handling of the issue.
Or possibly the file system driver could just make those files
invisible.
Post by Mark
One of the beautiful aspects of encryption as it stands now is that
it is totally non-intrusive. The only issue I can see, like I said,
is if a necessary system file is encrypted. But why would a
*system* file be encrypted, if the system didn't support encryption?
When a ZenFS volume is transferred from one machine to another?

I suggest there should be a flag to indicate if any system files are
encrypted. If there are only user files encrypted, the file system
driver should proceed as you suggest.
Post by Mark
Post by Nick Roberts
I don't think encryption of directories adds that much
complication, actually. Directory blocks will always be unencrypted
in memory.
In which case the attacker'd just match entries in the .parent index
to entries in the .name index.
Couldn't the .parent index also be encrypted? I see that this is
indiscriminate. It's not neat. But possible.
Post by Mark
File names (like all other system attributes) are not designed to be
secure.
I agree with this principle. Unfortunately, I'm pretty sure that many
existing files (as stored on existing file systems) and programs will
be found that do not conform to this principle. In practice, I think
it is necessary to support them (unfortunately).
Post by Mark
[re user names in the '/home' directory]
The solution here is to rename the user's home directory to a random
string, and then store the (username,homedir) pair in an encrypted
file somewhere (or full volume encryption).
Yes, but that approach will not be compatible with a lot of existing
software and organisational infrastructure.

I think the specification needs to support system files (including
directories) being encrypted. Actual implementations could always
make the policy decision of not doing so, for all the reasons you
state.
Post by Mark
Post by Nick Roberts
However, is the 255 length limit necessary?
No. But is more than 255 bytes necessary? I decided on that limit
before the current index format, when longer strings would actually
have been an issue. But I honestly do think it's a fair limitation.
Most OS's limit name strings to 255 bytes already, so it'll make
life easier for them. Lengths can be represented with a single
byte.
Okay, I agree with that.
Post by Mark
Post by Nick Roberts
From my understanding of how sparse files are meant to work, I don't
think it is necessary to be able to deallocate blocks...
That's correct. It's an extra design requirement on top of sparse
file support. Not every system has to support block deallocation,
but I wanted to ensure that it at least *could* be supported. The
reference implementation will support it.
Okay, that's perfect.
Post by Mark
Post by Nick Roberts
I'm not objecting to this capability, but may I ask please why you
consider it important to support very large numbers of independent
filesets? I'm intrigued.
As you know, ZenFS is the file system for the OS I'm developing. In this system the concept of a unified file hierarchy has been removed. Instead, each installed program, module, or user home directory is given its own fileset. A public FTP server would have a fileset for each of the domains it manages, for example. As you can see, the number of filesets is not easily bounded in practice.
That's cool, but how are the different file sets identified, and where
are the associations between file set and program/module/home/server
stored?

My thinking is that perhaps these things should also be dealt with by
the specification.

Still ploughing my way through it!
--
Best wishes,
Nick Roberts