James Harris
2021-02-13 10:52:01 UTC
Well, here's a strange one. After I copied a file from one drive to
another I compared them and found them to be different! :-(
If your first thought is that I must have done something wrong you are
not alone. That's what I thought. But as far as I can tell it has turned
out to be a genuine error in either the hardware or in system software.
The reason for posting is partly that you might find this interesting
and partly because if it was a hardware error I wonder if there's
anything an operating system can do to detect or prevent such errors.
Here are the details.
---
The original file was very large, a 500GB disk image. The error happened
in a copy job I left to run overnight, along the lines of (on Unix):
  at 2am
  cp -i /mnt/backups0/disk.img /mnt/backups1/work.img
Only one bit changed. A byte about 1.3% into the file was originally
0x70 but in the copy it became 0x71.
After the copy both diff and cmp reported a mismatch.
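For anyone who wants to reproduce that kind of check, here is a minimal sketch in C of a cmp-like comparison. It is not what diff or cmp do internally; the names first_mismatch and differing_bits are made up for illustration. It reports the offset of the first differing byte and how many bits differ between the two byte values.

```c
#include <assert.h>
#include <stdio.h>

#define BLOCK 65536

/* Count the bit positions in which two bytes differ (hypothetical helper). */
int differing_bits(unsigned char a, unsigned char b)
{
    unsigned char x = a ^ b;   /* set bits mark the positions that differ */
    int n = 0;
    while (x) {
        n += x & 1;
        x >>= 1;
    }
    return n;
}

/*
 * Compare two streams block by block (hypothetical helper).
 * Returns the offset of the first differing byte and stores both byte
 * values in *va and *vb. Returns -1 if the streams are identical,
 * -2 if one stream is a proper prefix of the other, -3 on a read error.
 */
long long first_mismatch(FILE *fa, FILE *fb,
                         unsigned char *va, unsigned char *vb)
{
    static unsigned char ba[BLOCK], bb[BLOCK];
    long long base = 0;

    assert(fa && fb && va && vb);
    for (;;) {
        size_t na = fread(ba, 1, BLOCK, fa);
        size_t nb = fread(bb, 1, BLOCK, fb);
        size_t n = na < nb ? na : nb;

        if (ferror(fa) || ferror(fb))
            return -3;
        for (size_t i = 0; i < n; i++) {
            if (ba[i] != bb[i]) {
                *va = ba[i];
                *vb = bb[i];
                return base + (long long)i;
            }
        }
        if (na != nb)
            return -2;          /* lengths differ */
        if (na == 0)
            return -1;          /* both at end of file, no mismatch */
        base += (long long)n;
    }
}
```

With the values from this case, differing_bits(0x70, 0x71) is 1, since 0x70 ^ 0x71 is 0x01.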
To see what the mismatch looked like in context I wrote a short C
program to extract a part of the file using fseek. (I found 'cut' to be
unsuitable as it read the file from the beginning until it got to the
requested offset; fseek was a better approach.)
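The extraction program itself is not reproduced here, so what follows is only a sketch of the same idea, under some assumptions. A 500GB offset does not fit in a 32-bit long, so the sketch uses the POSIX fseeko with off_t rather than plain fseek. The function name extract_range is hypothetical.

```c
#define _FILE_OFFSET_BITS 64      /* request a 64-bit off_t on 32-bit systems */
#define _POSIX_C_SOURCE 200809L   /* make fseeko visible under strict modes */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Copy up to `count` bytes of `in`, starting at byte `offset`, to `out`.
 * Returns the number of bytes copied (less than `count` if the file ends
 * first), or -1 if the seek or a write fails.
 */
long long extract_range(FILE *in, off_t offset, size_t count, FILE *out)
{
    unsigned char buf[4096];
    long long total = 0;

    assert(in && out);
    if (fseeko(in, offset, SEEK_SET) != 0)
        return -1;                 /* jumps straight to the offset */

    while (count > 0) {
        size_t want = count < sizeof buf ? count : sizeof buf;
        size_t got = fread(buf, 1, want, in);
        if (got == 0)
            break;                 /* end of file or read error */
        if (fwrite(buf, 1, got, out) != got)
            return -1;
        total += (long long)got;
        count -= got;
    }
    return total;
}
```

Calling this with the input opened in "rb" mode and out set to stdout would dump a window around the bad byte, which can then be piped into a hex dump tool.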
I am confident that that was the only mismatch because after I used a
program called hexedit to correct the copy both diff and cmp found no
mismatch.
---
That's the situation. What do you think?
The only places I can think of where the bit could have been changed
from 0 to 1 are the following four.
* Between source drive and host system.
* In the host system's CPU or RAM. (It's plain RAM, not ECC.)
* Between host system and target drive.
* Within the target drive when it transferred its RAM to the disk surface.
Any others?
Given the above, if you are interested, where do you think such an error
could have occurred, and what could an OS do about it?
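As one possible starting point for the detection side, here is a rough sketch of a copy that verifies itself: it hashes the data as read from the source, writes it out, then reads the copy back and compares hashes. Everything in it is an assumption for illustration. The hash is 64-bit FNV-1a, picked only because it is short; copy_and_verify and fnv1a_update are made-up names. It also has clear limits: the read-back may be served from the OS cache rather than the disk surface, and a bit that flips in RAM before the data is hashed would go unnoticed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define FNV_BASIS 0xcbf29ce484222325ULL   /* FNV-1a 64-bit offset basis */
#define FNV_PRIME 0x100000001b3ULL        /* FNV-1a 64-bit prime */

/* Fold n bytes into a running FNV-1a hash. */
uint64_t fnv1a_update(uint64_t h, const unsigned char *p, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= FNV_PRIME;
    }
    return h;
}

/*
 * Copy src to dst, then re-read dst and compare hashes.
 * dst must be open for both writing and reading (e.g. "w+b").
 * Returns 0 if the hashes match, 1 if they differ, -1 on an I/O error.
 */
int copy_and_verify(FILE *src, FILE *dst)
{
    static unsigned char buf[65536];
    uint64_t h_src = FNV_BASIS, h_dst = FNV_BASIS;
    size_t n;

    assert(src && dst);
    while ((n = fread(buf, 1, sizeof buf, src)) > 0) {
        h_src = fnv1a_update(h_src, buf, n);   /* hash what was read */
        if (fwrite(buf, 1, n, dst) != n)
            return -1;
    }
    if (ferror(src) || fflush(dst) != 0)
        return -1;

    rewind(dst);                               /* read the copy back */
    while ((n = fread(buf, 1, sizeof buf, dst)) > 0)
        h_dst = fnv1a_update(h_dst, buf, n);
    if (ferror(dst))
        return -1;

    return h_src == h_dst ? 0 : 1;
}
```

A per-block hash stored alongside the data, rather than one hash for the whole file, would additionally say where the damage is, at the cost of extra space.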
--
James Harris