[clug] bash oddities... No NULLs in strings, builtin echo params & can't set variable to newline ('\n') without changing IFS.

steve jenkin sjenkin at canb.auug.org.au
Sun Dec 27 05:16:56 UTC 2015

In doing some simple testing using bash, I found some unexpected results.
I could’ve swapped to another scripting language, such as Perl, but I wanted a bash solution for handling binary data in case I need to do this again.

The script set the variable using the builtin echo’s ‘-e’ flag to interpret control sequences:
	var=$(echo -en "\0$by1\0$by2")
and then used it with:
	(echo -n "$var"; ...

The second echo sometimes emitted nothing, because the shell variable was empty.

The solution is to abandon the shell substitution and not save the intermediate result (where by1 & by2 are octal digit strings), i.e.
	(echo -en "\0$by1\0$by2"; ...
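A minimal demonstration of why the variable route fails (the ‘A’, NUL, ‘B’ bytes are arbitrary test values):

```shell
# Bash cannot store a NUL byte in a variable: command substitution
# silently drops it (newer bash prints a warning to stderr).
var=$(printf 'A\000B' 2>/dev/null)   # three bytes in, NUL dropped
echo "${#var}"                       # prints 2, not 3

# Keeping the bytes in a pipeline never loads them into a variable,
# so they survive intact:
printf '\000\101' | od -An -c
```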

I got empty variables for certain byte values. It took me a while to translate from octal & hex into the character values and see the problem:

When I realised ‘-e’, ‘-E’ and ‘-n’ in the data were being parsed as options - the command effectively became ‘var=$(echo -en -e)’ - it was clear this wasn’t going to work.
I didn’t find a way to mark the end of echo’s arguments. getopt()-style commands support "--", but it doesn’t work here: echo just prints it.
I briefly tried the printf builtin, but couldn’t see how to output all octal values.
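For the record, the printf builtin can emit any byte value by formatting it as octal first and feeding that back as an escape - a sketch, where 65 (‘A’) is an arbitrary test value:

```shell
# Format the value as 3-digit octal, then emit it as a single byte.
byte=65
printf -v oct '%03o' "$byte"   # oct=101
printf "\\$oct"                # emits the single byte 'A'
echo

# Unlike echo, printf accepts '--' to end option parsing, so data
# beginning with -e/-E/-n cannot be eaten as options:
printf -- '-en\n'
```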

Bash seems to use NUL-terminated (\000) C strings internally for variables, so a variable can never hold a NUL byte.
Also, bash only seems to handle octal escapes well, like K&R C.

A random blog points out that the builtin ‘read’ supports NUL as a delimiter, meaning it will correctly parse the output of "find -print0" with read -r -d $'\0'.
I’d not seen that form for binary values before: $'\0'.
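A sketch of that pattern (note bash expands $'\0' to the empty string, so -d '' is the same thing; printf stands in for find so the input is predictable):

```shell
# Consume NUL-delimited records, e.g. from "find -print0".
printf 'one\0two\0' |
while IFS= read -r -d '' name; do
    printf '<%s>' "$name"    # brackets make the field boundaries visible
done
echo
```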

There was a problem assigning a variable the octal version of ‘\n’, i.e.:
	var=$(echo -en "\012\012")
I presumed IFS was making the shell treat the newlines as token separators, but the more likely culprit is command substitution itself: $( ) strips all trailing newlines, so a value consisting only of newlines comes back empty.

If you need to embed newlines in variables, here’s a good treatment.
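Two workable approaches, for reference - $'\n' (ANSI-C quoting) needs no IFS changes, and a sentinel byte protects trailing newlines from command substitution's stripping:

```shell
# 1. ANSI-C quoting: a literal newline, IFS not involved.
nl=$'\n'
var="line1${nl}line2"

# 2. Command substitution strips ALL trailing newlines, so append a
#    sentinel character and remove it afterwards:
var2=$(printf '\012\012x'); var2=${var2%x}
echo "${#var2}"    # prints 2 -- both newlines survived
```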



Why the test?

MD5 is officially broken as a cryptographic hash - meaning it’s relatively inexpensive to create ‘hash collisions’: the same hash value for files with different data (see md5sum & the examples below).

But it’s still a very useful hash function for many applications.
If you’re relying on MD5 as a hash function, your algorithm must now handle hash collisions differently.
It cannot assume that ‘same MD5 means same data’ - which could be a radical change if you deduplicate data on a single MD5 hash.

Hashing algorithms have always needed a method for handling hash collisions.
When the ‘vendor’ claims "this hash will never collide", you can either trust that claim or prepare for collisions.
The downsides of collisions, such as corrupting permanent archival data, may be higher than you care to accept.

The short & easy answer for simple data dedupe is now “use SHA1”, but that leaves you making the same assumption:
	there will _never_ be a collision in the lifetime of the application, for some value of ‘never’.

A data-deduplication system that has to resort to a full byte-by-byte comparison of potential duplicates would be woefully slow: to declare a match it must usually read to the end of both files.

Forcing brute-force checks on every hash match is computationally and IO expensive, exactly what we’re trying to avoid in a ‘fast’ system.

Minimal processing for a large data-ingestion system must:
 - never backtrack, i.e. only stream data blocks over the IO path (limited buffering is OK),
 - treat duplicate hashes as potential collisions, not strict evidence of equivalence,
 - rely on "different hash, different data",
 - in the worst case, use only a predictable amount of resources & run-time to resolve a duplicate hash as either a match or a collision.
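As a sketch of that policy (dedupe_check is a hypothetical helper, not from the scripts below): hash equality only nominates a candidate pair, and a single sequential pass with cmp settles it:

```shell
# Classify a candidate pair: a hash mismatch ends it cheaply; a hash
# match is confirmed (or exposed as a collision) byte-by-byte.
dedupe_check() {
    local a=$1 b=$2
    [ "$(md5sum < "$a")" = "$(md5sum < "$b")" ] || { echo different; return; }
    if cmp -s "$a" "$b"
    then echo duplicate     # same hash, same bytes
    else echo collision     # same hash, different bytes
    fi
}
```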

To continue using complex cryptographic hash functions like MD5 & SHA1 for other purposes, like Data DeDupe, collisions must now be actively checked for. Work is being done on finding exploits for SHA1.

I wanted to test how easily the "insert selected 64-byte blocks at 64-byte boundaries" (Wang) method of breaking MD5, used in the examples below, could be disrupted.

First test, similar to the ‘salt’ used in Unix passwd files:
  concatenate 2 bytes to the start of the data stream, then check the resulting hashes for collisions. Done correctly, no collisions were found.
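The test amounts to this (salted_md5 is a hypothetical helper; \101\102 stand in for the two salt bytes):

```shell
# Prepend two salt bytes to the stream before hashing: any fixed pair
# of bytes shifts the 64-byte block alignment the manufactured
# collision blocks depend on.
salted_md5() { (printf '\101\102'; cat "$1") | md5sum | cut -d' ' -f1; }
```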

This simple test, while suggesting a solution, doesn’t prove that prepending just two bytes will disrupt all manufactured collisions:
	"absence of evidence is not evidence of absence"

MD5 collisions

Project HashClash

How I created two images with the same MD5 hash [10240b each]

Three way MD5 collision [280704b each]

Create your own MD5 collisions

Counter-cryptanalysis: a reference implementation of collision detection for MD5 and SHA-1. [Tool to detect manufactured collisions]

[WORKING] script


# Create an array (A) of 2^8 entries: entry x holds the 3-digit octal value of x.
# Prefixing the string ‘\0’ makes echo -e interpret it as an octal escape, i.e. the byte with that value.

for x in {0..255}; do printf -v A[$x] "%03o" $x; done

# For each of the test/hash-clash files, prepend two bytes & recompute the MD5.
# Save in file ‘md5-clash.lst’, tab-separated.

for f in ~steve/MD5-clash/*.jpg
do for x in {0..255}
  do for y in {0..255}
    do echo $( (echo -en "\0${A[$x]}\0${A[$y]}"; cat "$f") | md5sum -) "$f ${A[$x]} ${A[$y]}"
    done
  done
done |
sed -e 's, - ,,' -e 's,/.*/, ,' -e 's/ /\t/g' >md5-clash.lst

[NON-WORKING script: NULs, newlines and arguments (-e, -E, -n) fail in bash]


for f in ~steve/MD5-clash/*.jpg
do for x in {0..3};do for y in {0..7};do for z in {0..7}
  do for a in {0..3};do for b in {0..7};do for c in {0..7}
    do salt=`echo -en "\0$x$y$z\0$a$b$c"`;
    echo `(echo -n "$salt";cat $f)|md5sum -` " $f 0$x$y$z 0$a$b$c"
  done; done; done;
 done; done; done;
done |\
 sed -e 's, -  ,,' -e 's,/.*/,        ,' -e 's/ /     /g' >md5-clash.lst

Steve Jenkin, IT Systems and Design 
0412 786 915 (+61 412 786 915)
PO Box 48, Kippax ACT 2615, AUSTRALIA

mailto:sjenkin at canb.auug.org.au http://members.tip.net.au/~sjenkin
