Post by Mark
And please, feel free to use ZenFS. It's out there for anyone who
wants it, no strings attached.
Excellent. It really solves a problem for me, as I have been
agonising over what format to use for Aquila for a long time now.
Post by Mark
Post by Nick Roberts
* Could files be permitted to be excluded from journalling at user
request (to trade integrity maintenance of the file for some
increase in performance)?
You can actually do this as it stands right now. The only things
that are assumed to be journaled are updates to directories, indices,
or any of the on-disk file system structures. Systems are allowed to
also log file data, but that's a matter of policy, not mechanism (the
journal format is general enough to allow for this).
Perfect.
Post by Mark
PS: Whoops, I don't think the spec says what has to be journaled.
I'll fix that.
:-)
Post by Mark
Post by Nick Roberts
* I'd like to be able to have files that do not have file names
(i.e. their inodes would have 0 links).
That's an interesting idea. It's certainly possible, but needs to
be worked out a little more I think. How would you access the
file?
The API would provide an alternative open(), which accepts an inode
number instead of a (path) name.
Post by Mark
Where would the inode address be stored?
There would also be an alternative creat(), which does not take a new
(path) name, but which returns the inode number of the created file.
The application program would be responsible for storing the inode
number (possibly in memory, for a temporary file, or in another file
or in a database for longer term storage).
This facility would be quite important to me, I think. AdaOS (which
will be based on CORBA, I think) is going to be an object-oriented OS.
Accessing objects (and perhaps database BLOBs) by inode number (rather
than any name) would be an important capability.
Post by Mark
Post by Nick Roberts
* I'd like to permit the forward slash '/' character in file names
(they would be escaped, in software, to distinguish them from path
separators). CORBA (and the Macintosh?) seem to permit them.
Yeah, that decision was pretty arbitrary. I chose that to remain
compatible as a Linux file system (POSIX specifies that
restriction), but I personally really have no preference. I thought
of it as a lowest common denominator kind of thing, I didn't realize
CORBA lacked that restriction. I'll remove it.
It's a dilemma, because we both know full well that if Joe User
creates a pile of files with '/'s in their names, and then, say, tries
to FTP them onto a Unix system, it's going to go nasty, and naturally
it's the OS or file system that will get the blame.
The problem is, it cuts both ways. If Jane User tries to FTP a pile
of files from her Macintosh (with '/'s in their names) to a ZenFS
system, and it goes belly-up, it's ZenFS that'll get the blame again.
We cannot win :-)
Post by Mark
Post by Nick Roberts
* Where at the moment you have a 'flags' field, would you divide it
into two fields, 'mandatory_flags' and 'auxiliary_flags', the idea
being that software can ignore auxiliary flags (but not mandatory
ones)?
Hrm, possibly. What kind of flags do you have in mind? I ask only
because TYPE_BOOL attributes might be the mechanism of choice for
storing auxiliary flags, if the flags relate to the format of the
data and not where it is on disk. Sharing an auxiliary_flags field
could lead to collisions, whereas with attributes you can avoid that
with a prefix for all your attribute names (e.g. ".AdaOS:FLAG_NAME")
I think I explained this badly. I should probably have said something
like 'important flags' and 'unimportant' flags, instead of 'mandatory'
and 'auxiliary'.
The idea is simply that, by defining some of a set of future potential
flags as 'unimportant', new unimportant flags can be introduced without
requiring a version (or revision) change. A new flag can be deemed
unimportant if software can safely ignore it (innocent of its
existence).
To concoct an example, a new inode flag that means something like
'make access to this file especially fast' (maybe for use by heavily
accessed files) could probably be considered unimportant. An inode flag
that means 'safety copy of another inode, do not use' would probably
have to be considered important, and its introduction would therefore
require a version change.
This idea is not important, but it might be nice.
Post by Mark
Post by Nick Roberts
* In my own preliminary design, I have a 16-bit inode type field.
...
And 256&257 would fall under INODETYPE_FILE; I'm not sure I follow on
what 1 and 522 are for.
Type 1 denotes a 'metafile' containing the inodes (in my own preliminary
design). Unimportant.
Type 522 denotes files which start with the AdaOS 'Native File Common
Header'. There is a field in this header which specifies the (sub-)type
of the file. It is mainly used for AdaOS-specific things (such as
executable modules). Little relevance to ZenFS.
Post by Mark
I take it the reserved values are for defining new user-level types?
Like 258=UTF-8 text file, 1024=PNG image file, 1025=JPEG file, etc?
It was really only to define a few low-level file types, so that low-
level software (boot programs, basic devices drivers, etc.) had a way
to distinguish (or check) file types, without recourse to complicated
mechanisms. So probably PNG and JPEG files would not fall into this
category (they would only be used by high-level application
software). However, some OS-specific files (such as AdaOS executable
modules) would fall into this category.
Post by Mark
For interoperability reasons I think it would be best to limit
inode_type's values to those relating to the file system itself.
I thought it might be handy to have a 16-bit type field, where the
upper byte is used to distinguish operating systems (say 0=ZenFS,
1=yourOS, 2=AdaOS, 3=Linux, 4=BSD, etc.), and the lower byte to
specify a basic file type within the OS. Sorry I didn't explain this
better in the first place!
Post by Mark
I think this falls again under "mechanism vs. policy." For my own
OS, I'm taking the BeOS route and having a TYPE_STRING attribute
".myOS:TYPE" which specifies the MIME type for the file. For your
needs you might want a TYPE_UINT16 attribute (".AdaOS:TYPE"?).
Yes, that's perfect. And so a PNG would be .AdaOS:TYPE=image/png and
a JPEG would be .AdaOS:TYPE=image/jpeg and so on.
However, I think there would be value in having a few conventional
types (defined in a separate specification). It might be pretty handy
to have a conventional file type attribute such as "MIME_TYPE" for
MIME types.
On the subject of attributes, the spiel for Reiser4 raises a couple
of questions.
Supposing we said that any directory can be opened as either a (raw)
datafile or as a directory. If it is opened as a file, we search for
a special entry (named ".DEFAULT" say), and open it. Suppose,
furthermore, that any entry in a directory can be marked as
'inherited': if a search of a subdirectory (down to any level) for
the same name fails, the inherited entry is returned instead. Might
this provide a sufficient basis for attribute storage? Might it also
be sufficient for the storage of streams? (But it might be dangerous
to spend too much time reading about Reiser4 :-)
Post by Mark
Post by Nick Roberts
* For a really small file, the data could all be stored in the inode
(if it will fit), in place of the 'first_blocks' field.
It's in there. (page 26, do a search for the ".data" attribute)
Aha! Cool.
Post by Mark
Post by Nick Roberts
* Could 'data_size' be in terms of bits instead of bytes (since
this is what SQL-92 requires for a binary string)?
Do you mean the data_size in the inode, or the data_size in the
small_attr region? In either case it'd cut the maximum size one
can store down by a factor of 8... You could always shift three
bits.
Sorry, I meant the uint64 data_size in an inode. It would mean
maximum file size was 2.3x10^18 bytes. Personally, I think that's
okay.
If the uint16 data_size in the small_attr region were also in terms
of bits, it would reduce the maximum size to 8191+7/8 bytes. Since
these attributes are meant to be small, isn't that okay?
Actually, I might suggest the offset index need only contain uint16s
(rather than uint32s), and each could be in terms of n/8, where n is
the offset in bytes (from the beginning of the small_attr area).
Not sure how important any of this is.
Post by Mark
Post by Nick Roberts
* I'd like to propose a small improvement to updating of the
superblock.... <snip>
I thought about this too. As you demonstrated, there isn't really
a good solution.
Indeed!
Post by Mark
I'm not convinced yet there's much of a problem either. The chances
of the superblock wearing out is just so small on a modern hard
drive (compared to the life of the hard drive).
I think it's extremely unlikely (but I'm not sure). On modern disks,
sectors can be relocated (at the factory); I believe that if a disk
has sector 0 bad, the disk is rejected (scrapped). However, a volume
on a partitioned disk can start anywhere. Modern disks do not use a
jelly substrate any more, and destructive head crashes are very rare.
I've heard of plans to reduce the inter-sector gap to nearly zero for
the next generation of disks, but I think there are ways to maintain
reliability (involving end-of-track redundancy data, I suspect).
Post by Mark
Unless the whole region gets scratched by the drive head or
something,
Probably just a track, or maybe a few adjacent tracks.
Post by Mark
in which case you're screwed anyway because you'd have lost the
neighboring boot block as well..
That's not a disaster at all. It is a standard technique (now) in
this situation to boot from another source (e.g. a floppy disk or
CD-ROM) and perform recovery or repair of the volume.
Post by Mark
I'll think about it some more too, but i'm pretty sure the
complexity will outweigh any benefit.
I don't feel what I suggested is all that complex. But it does need
thinking about. Let's avoid introducing any new major problems while
trying to solve a few relatively minor ones!
Post by Mark
Post by Nick Roberts
* I'd like to support 'smart copying' of files. ...
Oooh, this is a unique idea.
It's definitely an idea that's been knocking around in computer science
for yonks, but its application to file systems /might/ be unique. I've
not personally come across any file system that does it.
Post by Mark
You don't need it for transaction management though; that can all
be done with user-specified transactions in the existing journal.
Doh! I'm an idiot! I'm not used to journalling technology yet, and I
tend to forget. Sorry.
In fact, on this subject, the astute may recall that I've written
recently (in alt.os.development) criticising journalling. But I'm
learning more about it, and I'm beginning to think that my criticisms
were hasty. I can see that a log could help speed up safe rewrites, by
allowing them to be reordered.
Post by Mark
Where else might this be useful? It is a really neat concept.
Hmm. To be honest, I'm not so sure now.
Post by Mark
It'd require extending the file_run structure by another 8 bytes
though (for alignment reasons)..
Yes, this is the problem.
Post by Mark
Post by Nick Roberts
* I doubt that it is necessary to 'sprinkle' magic numbers
interspersed among the data. ...
Actually, they're all there for good reasons. The first one in
the superblock identifies the volume, the next two help detect
buffer overrun of the volume name,
Hehe. You're speaking to an Ada programmer (for whom buffer overrun
just doesn't happen). I forgot about that one.
Post by Mark
the rest are padding bytes for alignment that I couldn't think of
any other use for (same goes for the inode structure). The bytes
are there, might as well put them to good use
I disagree with this. The bytes are there now, but we may well wish
to use them (for other purposes) in future ZenFS versions.
Post by Mark
(it's unlikely, as you say, that the disk might garble one and not
the other, but buggy software might).
However, I agree with this. In the end, it's a minor issue. Keep
things as they are.
Post by Mark
Post by Nick Roberts
* An 'index' (inode/file) might just as well be a proper database
table? Why not go the whole hog, and provide the complete nuts
and bolts of a database engine?
1) mechanism vs. policy; most of the things you'd need to build a
general database system are there, but it's up to software to do
that. i didn't see any need to go any further in the spec.
2) i'm not a database expert. is there anything missing that
should be there?
I'm not quite sure if I can call myself a database expert. I am,
sort of. Trouble is, you could study databases all your life and not
really be an expert. Hehe ;-)
I think we should provide everything that a SQL-92 CLI (ODBC)
database engine (device driver) would require. That gives us the
following checklist:
* types (char string, bit string, large & small integer, specific
decimal scale exact numeric, single & double float, various
datetimes and time intervals, possibly also BLOBs, references,
OIDs);
* organisational structures (catalogs, schemas);
* basic elements (tables, views, domains, constraints and
assertions, character sets & collations?, character set
translations?);
* permissions on these elements (insert, update, delete, select,
reference, usage) and authorisation (passwords?);
* transactions (roll back on recovery);
* stored procedures? (I don't think so);
* the Definition Schema (containing database metadata).
This checklist leaves out a lot of detail, but also it is only a
list of things that need to be covered by /some/ ZenFS mechanism
(a rose by any other name).
I think this stuff would be so useful, it would be worth
pursuing.
Post by Mark
Post by Nick Roberts
* Instead of using UTF-8 in various places, I suggest that: in
directories, Latin-1 only is used, because this means that
computer (operating) systems which cannot handle Unicode are not
embarrassed; define attribute types such as STRING_ISO_8859_1,
STRING_UTF_8, STRING_UTF_16 and so on.
Well, UTF-8 is quickly becoming de facto for new stuff (unless
you're in the NT world).
That may be true, but I think it is for all the wrong reasons. The
principal (wrong) reason is that most of the standards and
specifications which specify UTF-8 are being devised by North
Americans (and to a lesser extent, Europeans), such as the IETF
and W3C, for example, who naturally (but foolishly) assume that an
encoding which is efficient only for the characters in the Latin-x
sets will be acceptable to everyone in the world. A Chinese,
Japanese, Korean, Taiwanese, or Thai filename encoded in UTF-8 will
be diabolically inefficient. In most countries in the world, UTF-8
is just irrelevant and silly. For once, Microsoft are right.
I am getting the impression that there is a gradual consensus
accumulating in the standards communities that it is better to
stick to a simple ASCII-based 8-bit character set (encoding) for
names that will be used in programs (file/path names, environment
variable names, configuration item names, database table and field
names, and so on). This way, programs can continue to work in
execution environments for which support of bigger character
repertoires would be inappropriate. I divine that this is the
current thrust of the OMG's thinking wrt CORBA now.
For more sophisticated environments, files and other objects can
be given alternative names or descriptions using Unicode (the UCS),
which can be used for human display purposes. The extensible
attribute mechanism of ZenFS seems to be ideal for this purpose.
Post by Mark
Drivers for older systems that don't support it can always convert
on the fly. I'm a little hesitant to include something like this
since any format under the sun can be converted to UTF-8. Adding
extra types would complicate things for everyone, even those who
already run UTF-8..
I don't want to sound like this is a hobby horse for me. I can see
that a UTF-8 encoding doesn't actually /prevent/ the use of Latin-1
characters only throughout the file system. But would it be
acceptable for some implementations of ZenFS to simply not support
non-Latin-1 characters?
For example, supposing I wanted to write a data recovery tool
suitable for use from a bootable floppy disk. It would have to be a
relatively small program to fit. It would be quite acceptable for
this program to use text mode (no GUI or mouse). But supposing this
program had an option to display directory listings (perhaps to
allow the user to do selective deletion). It would be hard to
support the display of non-Latin-1 characters (and a few Latin-1
characters too, in fact).
There are an awful lot of existing programs that cannot support
encodings other than a simple 8-bit ASCII-based one. These programs
would be at risk of failing if they tried to manipulate a file with
a (path) name that contained exotic characters. Can we expect
systems implementing ZenFS to not support legacy software?
I think this is an important issue, that needs careful
consideration.
Post by Mark
I thought about doing semi-hard links as well, with one entry being
the preferred, but it didn't really make sense except in the
situation you mentioned [disk usage reporting], and added an extra
layer of indirection. You can always just choose the first entry
that shows up in '.parent'.
There are alternative possible ways to solve the disk usage reporting
problem. Maybe this isn't an important issue at this level (file
system format specification).
Post by Mark
Post by Nick Roberts
Is full-volume encryption useful?
Yes. At the very least, it's a catch-all for files that you forget
to encrypt, or applications that do something stupid like store
unencrypted info in a temp file (people seem to do stupid stuff like
this all the time).
But that could be handled in different ways. The user's execution
environment could impose a default behaviour upon all software of
encrypting a file (unless explicitly told not to).
Post by Mark
It also prevents an attacker from learning anything from the
hierarchy structure and file names (a more serious concern). The
file name, creation time, size, and location in the hierarchy can
give away a lot of info, for example.
Not everyone will need it, however.
I think just about all of that information could be protected by a
mechanism that permitted encryption of individual directories and
inodes.
I think there is a problem with encrypting an entire volume, if that
volume is to be used by more than one user. It denies users the option
of not encrypting a file, which might be an important efficiency issue
in some cases.
I think individual structures (directories, inodes, indices) should be
individually encryptable. In which case, I think it would probably not
be necessary to have an option to encrypt the entire volume.
Post by Mark
Post by Nick Roberts
* I think the compression needs to be significantly rethought. ...
What are your thoughts on this?
Sparse files.
...
I'm pretty convinced that a system's API should expose methods to
allow application software to discover the location of holes in
sparse files, for various reasons. As I understand it, applications
running on scientific compute servers need to be able to see (and wait
on) the holes in order to use the sparse file as a communication (and
synchronisation) mechanism. That being the case, I still think that
compression should be left to higher levels of software; I feel sure
that a good algorithm for finding the redundancy in a particular data
set will require knowledge of the data set (which the file system does
not have). An increasing number of file formats these days have their
own specialist compression schemes designed-in (e.g. PNG, JPEG, MPEG).
Post by Mark
Post by Nick Roberts
Would you consider using DocBook/XML to originate your documentation?
Yes. I'm going to switch to LaTeX or DocBook soon. I've just never
used either, and wanted to get some ideas to paper quickly before
learning a new system.
I use DocBook/XML myself. I would be happy to do most of the conversion
from your current format to DocBook for you, if you wish. (I'd then hand
it back to you for you to finish.)
My e-mail is real (***@acm.org) if you wish to e-mail me.
--
Best wishes,
Nick Roberts