[clug] Unique Id's and CD's
a.janke at gmail.com
Fri May 8 01:27:36 GMT 2009
2009/5/8 Alex Satrapa <alexsatrapa at mac.com>:
> You may even find that just using the table of contents for each/every
> session on the disk might be enough to ensure uniqueness for your problem
> space. I doubt that two disks having the exact same directory structure will
> end up being different. At which point you could just use a checksum on the
> output of "ls -lR" which is hardly going to be the entire data content of
> the disc, but will uniquely identify any file system.
> The MD5+SHA1 of "ls -lR" should work to uniquely identify a version of a CD
> that is regularly updated by a vendor, for example. They might ship out a
> disk labelled "WONDERFUL-UPDATES" which contains updates to their
> "Wonderful™" product. The file structure might look exactly the same each
> time: one "SETUP.EXE" file in the root of the file system. You'd expect that
> the datestamp and file size would change between updates, so this would be
> covered by the "ls -lR" (since it contains all the information).
Aye, in my case I just wanted to make sure that I had (or hadn't) seen
the data on a disc before. We get sent imaging data from all over for
a project, and in some cases the techs are unsure whether they have
already sent a particular dataset, so they just send it again. I have
also added a bit of cataloguing to the CD extraction process so that I
can reconcile what I have with what I am told I should have. This also
means that the creation date of the CD and the dates on its files are
somewhat irrelevant; yes, newly burnt data might be different yet
exactly the same size, but that is a problem I can live with. The pain
of duplicate data/UIDs is a far bigger problem than this.
So the primary reason for the "unique" ID was to give each CD an ID
and from there put its contents into a database somewhere. So, as has
been rightly pointed out, not truly unique (I should know better,
being a CS honours student... :)
So my "suck-cd.sh" script has ended up as follows. Run it on an old
machine and feed it CDs whenever it gets hungry, sticks its CD tray
out and begs for more. dcmsort is a home-grown Perl script to extract
DICOM (medical imaging) data and, in doing so, sort out a bunch of QC
issues with missing slices/data/acquisitions. I am sure there are bugs
in there, but it works for now and I have a nice and growing directory
of 500 or so txt files. Only another 2500 or so CDs to go. ;)
Thanks again all for the pointers.
(a.janke at gmail.com || http://a.janke.googlepages.com/)
Canberra->Australia +61 (402) 700 883
#! /bin/sh
# Andrew L Janke <a.janke at gmail.com>
# watch the CD drive and suck all the DICOM we can

# config (device and paths here are illustrative, adjust to taste)
dev=/dev/cdrom
mountpt=/cdrom
home=$HOME
cddb=cd-ids.txt
cdcont=cd-contents
cdnum="none"

# set up tmpfile
tmpfile=$(tempfile --prefix=suck-cd --suffix .txt) || exit
#trap "rm -f -- '$tmpfile' && echo 'Cleaned up'" 0 1 2 3 13 15

mkdir -p $cdcont

while true
do
   echo "+++ Waiting... [$(date)] (last was: $cdnum)"
   ready=`scsi_ready $dev | tail -1 | sed -e 's/\ //g'`
   echo "    got: $ready"

   # if we have a winner, hoover all the DICOMs we can find
   if [ "$ready" = "ready" ]
   then
      echo "--- We have a winner, going for it ---"
      mount $mountpt

      # first dump the CD contents
      du -ak $mountpt > $tmpfile

      # get the sha1sum
      sha1sum=$(cat $tmpfile | sort -k 2 | sha1sum | cut -f1 -d\ )
      #echo "Got sum: $sha1sum"

      # check if we have this one yet
      grep -q $sha1sum $cddb
      if [ $? = 0 ]
      then
         # get the ID
         cdnum=$(grep $sha1sum $cddb | cut -f1 -d\ )
         echo "+++ Found $sha1sum CD#: $cdnum +++"
      else
         # find the next id (numerically)
         last=$(sort -n -k 1 -t " " $cddb | tail -1 | cut -f1 -d\ )
         cdnum=$(printf "%05d" $(expr $last + 1))
         echo "+++ Not found yet, adding as $cdnum-$sha1sum +++"

         # add it to db and catalog
         echo "$cdnum $sha1sum" >> $cddb
         mv -i $tmpfile $cdcont/$cdnum-$sha1sum.txt

         # suck the data itself
         dcmsort --copy --by_id --outdir $home/db /cdrom/dicom
      fi

      # eject, we are done
      umount $mountpt
      eject $dev
   fi

   sleep 10
done