[clug] Unique Id's and CD's

steve jenkin sjenkin at canb.auug.org.au
Fri May 8 02:27:41 GMT 2009


Andrew Janke wrote on 8/5/09 11:27 AM:

> Aye in my case I just wanted to make sure that I had (or hadn't) seen
> the data on the disc before. We get sent imaging data from all over
> for a project; in some cases, the techs are unsure if they have sent a
> particular dataset so they just send it again. I also have added a
> bit of cataloguing into the CD extraction process so that I can
> reconcile what I have with what I am told I should have. This also
> means that the date of the CD creation and the files on it are
> somewhat irrelevant, yes the new burnt data might be different and
> exactly the same size but this is a problem that I can live with.  The
> pain of having duplicate data/UIDs is far more of a problem than this.
> 
> So the primary reason for the "unique" ID was to give each CD an ID
> and from there put its contents into a database somewhere. So, as has
> been rightly pointed out, not unique (I should know better being a CS
> honours student.... :)
> 
> So my "suck-cd.sh" script has ended up as such. Run it on an old
> machine, feed it CDs whenever it gets hungry and sticks its CD tray
> out and begs for more.  dcmsort is a home-grown perl script to extract
> DICOM (medical imaging) data and in doing so sort out a bunch of QC
> issues with missing slices/data/acquisitions. I am sure there are
> bugs in there but it works for now and I have a nice directory of 500
> or so txt files and growing. Only another 2500 or so CDs to go. ;)
> 
> Thanks again all for the pointers.
> 
> 
> --
> Andrew Janke
> (a.janke at gmail.com || http://a.janke.googlepages.com/)
> Canberra->Australia    +61 (402) 700 883


Andrew,

Something I'm unclear about: does each CD have just one image file or
many? Only images, or other stuff too?

Something to test on your existing set of image files is the uniqueness
of MD5s of the first 128 KB, 512 KB, 1 MB, 4 MB, ... of each file,
and/or of the last fraction of the file.
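
A rough sketch of that test in Python, in case it helps. The function
names (partial_md5, collisions) are made up for illustration and have
nothing to do with suck-cd.sh or dcmsort; it assumes you can hand it a
plain list of file paths.

```python
import hashlib
import os
from collections import defaultdict


def partial_md5(path, nbytes, from_end=False):
    """Return the MD5 hex digest of the first (or last) nbytes of a file.

    If the file is shorter than nbytes, the whole file is hashed.
    """
    with open(path, "rb") as f:
        if from_end:
            size = os.path.getsize(path)
            f.seek(max(0, size - nbytes))
        return hashlib.md5(f.read(nbytes)).hexdigest()


def collisions(paths, nbytes, from_end=False):
    """Group files whose partial MD5s match.

    Returns a list of sorted path groups, keeping only groups with more
    than one member. A group may hold true duplicates or merely files
    that share a header/trailer, so each group still needs a full
    comparison before drawing conclusions.
    """
    groups = defaultdict(list)
    for p in paths:
        groups[partial_md5(p, nbytes, from_end)].append(p)
    return [sorted(g) for g in groups.values() if len(g) > 1]
```

Run collisions() over the existing files for each size (128 * 1024,
512 * 1024, and so on) and again with from_end=True. The smallest size
that yields no groups other than genuine duplicates is probably enough
to hash.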

If they are JPEGs, doesn't ImageMagick dump all/some of the metadata?
IIRC, there's an MD5 in the metadata (optional or mandatory?).

If the headers/trailers/meta-data are unique, you'll have a quick &
scalable solution.

If your data providers decide to change media, like dual-layer DVDs or
USB HDDs, they could be sending you many objects on a single medium :-(

A file-based approach is 'good' in that case, and timestamps and
filenames need to be considered completely unreliable :-/
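
For what it's worth, here is a minimal sketch of a file-based check that
keys everything on content and never looks at names or timestamps. The
names (file_md5, scan_media) are hypothetical, and the "seen" dict is
only a stand-in for whatever database you end up using.

```python
import hashlib
import os


def file_md5(path, chunk=1 << 20):
    """Return the MD5 hex digest of a whole file, read in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def scan_media(root, seen):
    """Walk a mounted medium and split its files into new and duplicate.

    seen maps digest -> first path that carried that content, and is
    updated in place so it can be reused across media. File names and
    timestamps play no part in the decision.
    """
    new, dupes = [], []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            digest = file_md5(path)
            if digest in seen:
                dupes.append(path)
            else:
                seen[digest] = path
                new.append(path)
    return new, dupes
```

Calling scan_media() once per CD, DVD or USB disk with the same "seen"
dict flags a resent dataset even when it arrives renamed or reburnt
with fresh dates.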

HTH
sj

--
Steve Jenkin, Info Tech, Systems and Design Specialist.
0412 786 915 (+61 412 786 915)
PO Box 48, Kippax ACT 2615, AUSTRALIA

sjenkin at canb.auug.org.au http://members.tip.net.au/~sjenkin

