[clug] Anyone want to give a talk for this week's PSIG meeting?

Paul Wayper paulway at mabula.net
Wed Mar 12 04:51:15 MDT 2014



On 12/03/14 15:23, Bob Edwards wrote:
> On 11/03/14 20:09, Paul Wayper wrote:
> 
>> I've got a topic of my own I'd like to ask people about, if that's OK...
[snip]
>> 
>> What I'd like to do at the PSIG meeting is talk about what kind of
>> operations could be supported by this library, how the underlying
>> storage might work and what limitations I'm going to have to overcome.
>> 
>> Any interest in this?
> 
> Hi Paul,
> 
> I'm sort of interested. I am thinking, though, that inserting and
> deleting bytes from a file without reference to the structure of the
> data may have limited usefulness? If the file contains offsets etc. and
> you just add or delete bytes from it then all those offsets will be
> incorrect?
> 
> Surely this is why we have "standards" like XML, JSON and databases for
> structuring data?
> 
> On the other hand, if you are not inserting or deleting bytes, just
> changing chunks "in place", then the existing seek() and write() API
> calls will do just fine?
> 
> Or am I missing something?

No, you're not missing anything.  It's just that this is still a
relatively new concept, and I'm trying to get an idea of how, or
whether, it would be used in practice.

With the operations I'm thinking of, it would be possible to edit a file
in place - each time the user changed a line, the editor could
immediately apply that change on disk, adding or deleting bytes as
necessary, without having to rewrite every subsequent line.  Likewise,
something like sed could pass through a file, chopping and changing
lines, without ever having to write out a new file.
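
To make that concrete, here's a rough sketch of the calls I have in
mind.  None of this exists yet - the names and signatures below are
just placeholders for discussion:

    /* Hypothetical API - nothing here exists yet; all names are
       placeholders for discussion. */
    #include <stddef.h>
    #include <sys/types.h>

    typedef struct edfile edfile;   /* handle for an editable file */

    edfile *ed_open(const char *path);
    int     ed_close(edfile *f);

    /* Insert 'len' bytes from 'buf' at 'offset'; everything after
       'offset' logically moves back by 'len' bytes - no bulk rewrite. */
    int ed_insert(edfile *f, off_t offset, const void *buf, size_t len);

    /* Delete 'len' bytes at 'offset'; later data moves forward. */
    int ed_delete(edfile *f, off_t offset, size_t len);

    /* Plain reads and in-place writes, as with pread()/pwrite(). */
    ssize_t ed_read(edfile *f, off_t offset, void *buf, size_t len);
    ssize_t ed_write(edfile *f, off_t offset, const void *buf, size_t len);

An editor deleting a 30-byte line would then just call
ed_delete(f, line_offset, 30) instead of rewriting the rest of the file.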

However, the library I'm thinking of wouldn't provide that kind of
functionality directly, because it wouldn't operate on a raw file at
all.  On disk, the file the library works with would be a block of
header information followed by zero or more blocks of file metadata,
file data, and blank space waiting to be filled.  The library presents
the higher-level application with a view that is basically just a large
memory area, albeit one that allows inserting or deleting arbitrary
numbers of bytes anywhere.  It's a bit like the view Perl gives you of
an array - the interface stays flat even if the storage underneath is
something like a b-tree.
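
Very roughly - and this is a sketch of the shape of the thing, not a
settled format - the on-disk layout might look something like:

    #include <stdint.h>

    /* Illustrative only: field names and sizes are placeholders. */
    struct ed_header {
        uint32_t magic;          /* identifies the container format */
        uint32_t block_size;     /* size of each on-disk block */
        uint64_t logical_size;   /* bytes visible to the application */
        uint64_t index_root;     /* offset of the root of the block index */
    };

    struct ed_block {
        uint64_t logical_offset; /* where this block appears in the view */
        uint32_t capacity;       /* total data bytes this block can hold */
        uint32_t used;           /* bytes currently in use */
        uint64_t next;           /* on-disk offset of the next block */
        /* 'capacity' bytes of file data follow, 'used' of them valid;
           the rest is blank space waiting to be filled */
    };

Inserting bytes then mostly means soaking up a block's blank space, or
splitting the block and updating the index, rather than shuffling the
rest of the file along.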

The use case that I'm thinking of for this library is supporting an
application that works with chunks of semi-structured data - say an audio
editor.  When people edit audio, they might record or import large
quantities of audio data, then remove the start, end and some intervening
bits ('um' removal, say).  Audacity, for example, implements this with
directories full of little files (roughly one file per second of audio,
or part thereof) and an index that describes which audio files belong
to which tracks and at what time offsets.  When you delete some audio
in Audacity, it restructures the index and, if necessary, writes new
files to link into it.
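
As a deliberately simplified illustration of that index idea - this is
NOT Audacity's actual format, just the general shape of it - each track
is essentially a list of clips:

    /* Simplified sketch of a chunked-audio index; not Audacity's real
       on-disk format. */
    struct clip {
        char   chunk_file[64];  /* small file holding ~1s of samples */
        double track_time;      /* where the clip starts in the track */
        long   first_sample;    /* region of the chunk actually used */
        long   sample_count;
    };

Deleting a few seconds of audio then means editing this list (and
perhaps writing a couple of new partial chunks); the bulk of the
recorded data never moves.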

Structured data standards like XML and JSON do allow "insertion",
"modification" and "deletion", but only (really) by reading the whole
file and marshalling it into structures in memory, restructuring those,
then serialising them and writing the result back out to disk.  That
becomes trickier when files are larger than memory - a consideration
that is becoming relevant again in this age of Raspberry Pis and
embedded computers.  Databases do keep data on disk, but there's a
trade-off between databases that preallocate and manage their own block
storage and databases like SQLite that keep everything in a single
on-disk file which (generally) tries to stay reasonably small.
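
For contrast, this is the whole-file round trip the library is trying
to avoid - a byte insertion done the only way a flat file allows today.
It works, but the entire file has to fit in memory and every byte gets
rewritten:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Insert 'n' bytes at 'off' by rewriting the whole file. */
    int insert_bytes(const char *path, long off, const char *ins, size_t n)
    {
        FILE *f = fopen(path, "rb");
        if (!f) return -1;
        fseek(f, 0, SEEK_END);
        long size = ftell(f);
        rewind(f);
        char *buf = malloc(size + n);     /* whole file in memory... */
        if (!buf) { fclose(f); return -1; }
        fread(buf, 1, size, f);
        fclose(f);
        memmove(buf + off + n, buf + off, size - off);
        memcpy(buf + off, ins, n);
        f = fopen(path, "wb");            /* ...and every byte rewritten */
        if (!f) { free(buf); return -1; }
        fwrite(buf, 1, size + n, f);
        fclose(f);
        free(buf);
        return 0;
    }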

The idea came up as the result of a Linux Weekly News article about a
proposed 'file collapse' operation in the kernel:

https://lwn.net/Articles/589260/ (subscriber-only until next Friday,
free after that).

The proposed operation is like 'truncate', but instead of shortening a
file at the end it removes ('collapses') a range of data inside it.
All the data after that range 'moves forward' in the file to begin at
the start of the collapsed section.  Would this support removing
arbitrary quantities of data, or just
block-aligned chunks?  What happens on file systems such as tmpfs that
aren't extent based?  Could you also 'inflate' the file?  Would it be better
to have a generic 'move these blocks from here to here' operation as Andrew
Morton suggested?
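
For reference, the interface being discussed is a new flag to
fallocate(2), so - assuming the proposal lands in roughly its current
form - using it would look something like this:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>  /* FALLOC_FL_COLLAPSE_RANGE, per the patch */

    /* Remove 'len' bytes at 'off': data after the range moves forward
       and the file shrinks, without rewriting the rest of the file.
       The patch expects 'off' and 'len' to be block-aligned, and only
       some filesystems (e.g. ext4 and XFS) would support it. */
    int collapse_range(int fd, off_t off, off_t len)
    {
        return fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, off, len);
    }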

Anyway, let's discuss this more at the meeting!

Have fun,

Paul

