UTF-8 Char Found In TRB Genesis

[ Foreword ]

This blog post is a cut-down copy of my original blog post. First, I wanted to repost it because I make use of the tool contained in here, 'check_nonascii_bytes.pl', during my TRB Keccak Regrind. Second, I wanted to cut out the latter part of the original post: at the time, I had taken the extra steps to regrind the vpatches that were impacted by the single change to genesis.vpatch; however, since TRB was getting a full regrind with keccak hashes, it's no longer relevant to repost the old experimental vpatches.

[ Preface ]

It was brought to the attention of The Bitcoin Foundation by phf that a UTF-8 character was discovered in the original genesis.vpatch of The Bitcoin Reference Implementation, also known as The Real Bitcoin (TRB). UTF-8 characters are forbidden in The Reference Implementation source code, and will be removed. This post is meant to present, in detail for Lords of the Most Serene Republic, all of the requisite work to remove the UTF-8 character and regrind genesis.vpatch, as well as its descendant vpatches in v0.5.4 RELEASE.

[ The Offending Character ]

Upon inspection, we have found the original culprit of this offense. The character in question appears on line 23962 of genesis.vpatch. It is a Unicode character substituted for a hyphen: the three bytes e2 80 94, which are the UTF-8 encoding of U+2014 (EM DASH).

[ First Things First: Seek & Destroy ]

After the UTF-8 character was found in genesis.vpatch, I took it upon myself to search through the rest of the v0.5.4 Release vpatches to see if any others were lurking in the thousands of lines of source code.

There are two ways to go about the ``Seek & Destroy UTF-8'' mission: automate the search, or do the check by hand. To cover the former, I created two different programs, one in Perl and one in C, that check each byte to see whether it falls within the 7-bit ASCII range (0x00 through 0x7f). It should be said that a check by hand is also necessary, as even the most well-intentioned automation can miss something that would be obvious to the human eye. However, to save time right now (there are well over a million bytes to examine in v0.5.4 Release), we'll just check with automation.

mod6@gentoo ~/check_nonascii_bytes $ ls patches/
asciilifeform-kills-integer-retardation.vpatch                         bitcoin-asciilifeform.2-https_snipsnip.vpatch
asciilifeform_add_verifyall_option.vpatch                              bitcoin-asciilifeform.3-turdmeister-alert-snip.vpatch
asciilifeform_and_now_we_have_block_dumper_corrected.vpatch            bitcoin-asciilifeform.4-goodbye-win32.vpatch
asciilifeform_and_now_we_have_eatblock.vpatch                          bitcoin-v0_5_3-db_config.6.vpatch
asciilifeform_dns_thermonyukyoolar_kleansing.vpatch                    bitcoin-v0_5_3_1-rev_bump.7.vpatch
asciilifeform_dnsseed_snipsnip.vpatch                                  bitcoin-v0_5_3_1-static_makefile_v002.8.vpatch
asciilifeform_lets_lose_testnet.vpatch                                 genesis.vpatch
asciilifeform_maxint_locks_corrected.vpatch                            makefiles.vpatch
asciilifeform_orphanage_thermonuke.vpatch                              malleus_mikehearnificarum.vpatch
asciilifeform_tx-orphanage_amputation.vpatch                           mod6_der_high_low_s.vpatch
asciilifeform_ver_now_5_4_and_irc_is_gone_and_now_must_give_ip.vpatch  mod6_fix_dumpblock_params.vpatch
asciilifeform_zap_hardcoded_seeds.vpatch                               programmable-versionstring.vpatch
asciilifeform_zap_showmyip_crud.vpatch                                 rm_rf_upnp.vpatch
bitcoin-asciilifeform.1.vpatch
mod6@gentoo ~/check_nonascii_bytes $ ls -l patches | awk '{print $5}' | perl -e '@a=<STDIN>; $total=0; foreach $n (@a) { chomp($n); $total+=$n; } print "Total Bytes: $total\n";'
Total Bytes: 1049853
 

WARNING: Neither of these programs is pretty, nor is either meant to be used for anything else, but for our purposes here they get the job done.

Let's start with the perl program that I wrote to search through all the v0.5.4 Release vpatches. First, I'll present the execution of this program, followed by posting the source code.

mod6@gentoo ~/check_nonascii_bytes $ ./v.pl i http://thebitcoin.foundation
Full vpatch sync complete to "/home/mod6/check_nonascii_bytes/patches"
Seal sync complete to "/home/mod6/check_nonascii_bytes/.seals"
mod6@gentoo ~/check_nonascii_bytes $ for i in `ls patches`; do echo "vpatch: $i" && cat patches/$i | ./check_nonascii_bytes.pl; done
vpatch: asciilifeform-kills-integer-retardation.vpatch
vpatch: asciilifeform_add_verifyall_option.vpatch
vpatch: asciilifeform_and_now_we_have_block_dumper_corrected.vpatch
vpatch: asciilifeform_and_now_we_have_eatblock.vpatch
vpatch: asciilifeform_dns_thermonyukyoolar_kleansing.vpatch
vpatch: asciilifeform_dnsseed_snipsnip.vpatch
vpatch: asciilifeform_lets_lose_testnet.vpatch
vpatch: asciilifeform_maxint_locks_corrected.vpatch
vpatch: asciilifeform_orphanage_thermonuke.vpatch
vpatch: asciilifeform_tx-orphanage_amputation.vpatch
vpatch: asciilifeform_ver_now_5_4_and_irc_is_gone_and_now_must_give_ip.vpatch
vpatch: asciilifeform_zap_hardcoded_seeds.vpatch
vpatch: asciilifeform_zap_showmyip_crud.vpatch
vpatch: bitcoin-asciilifeform.1.vpatch
vpatch: bitcoin-asciilifeform.2-https_snipsnip.vpatch
vpatch: bitcoin-asciilifeform.3-turdmeister-alert-snip.vpatch
vpatch: bitcoin-asciilifeform.4-goodbye-win32.vpatch
vpatch: bitcoin-v0_5_3-db_config.6.vpatch
vpatch: bitcoin-v0_5_3_1-rev_bump.7.vpatch
vpatch: bitcoin-v0_5_3_1-static_makefile_v002.8.vpatch
vpatch: genesis.vpatch
NON-7BIT-ASCII VALUE IN SOURCE FILE: a/bitcoin/src/util.h ; BYTE VALUE: e2
NON-7BIT-ASCII VALUE IN SOURCE FILE: a/bitcoin/src/util.h ; BYTE VALUE: 80
NON-7BIT-ASCII VALUE IN SOURCE FILE: a/bitcoin/src/util.h ; BYTE VALUE: 94
vpatch: makefiles.vpatch
vpatch: malleus_mikehearnificarum.vpatch
vpatch: mod6_der_high_low_s.vpatch
vpatch: mod6_fix_dumpblock_params.vpatch
vpatch: programmable-versionstring.vpatch
vpatch: rm_rf_upnp.vpatch
mod6@gentoo ~/check_nonascii_bytes $

As we can see from the above, we sync'd all of the vpatches and seals from The Bitcoin Foundation's mirror of v0.5.4 Release vpatches. After sync'ing, we iterate through each vpatch, cat it, and feed it into `check_nonascii_bytes.pl`. Exactly three offending bytes were discovered in the whole set of v0.5.4 Release vpatches: 'e2', '80', and '94', all in genesis.vpatch, in the source file 'util.h'. Together they are the very same substituted hyphen mentioned earlier in this post.
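
To see why the three reported bytes amount to a single character, here is a minimal sketch (not part of the original tooling) that decodes a short UTF-8 sequence into its code point. The function name `decode_utf8_upto3` is my own invention for illustration; it handles only one- to three-byte sequences and does not reject overlong encodings, so treat it as a demonstration rather than a validator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decode one UTF-8 sequence of 1 to 3 bytes found
 * at s, where n bytes are available. Returns the code point, or -1 if
 * the bytes do not form a sequence this sketch understands.
 * Assumption: overlong forms and 4-byte sequences are out of scope. */
long decode_utf8_upto3(const unsigned char *s, size_t n)
{
    assert(s != NULL);

    /* 0xxxxxxx: plain 7-bit ASCII, the only form allowed in TRB. */
    if (n >= 1 && s[0] < 0x80)
        return s[0];

    /* 110xxxxx 10xxxxxx: two-byte sequence. */
    if (n >= 2 && (s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80)
        return ((long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);

    /* 1110xxxx 10xxxxxx 10xxxxxx: three-byte sequence. */
    if (n >= 3 && (s[0] & 0xF0) == 0xE0 &&
        (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80)
        return ((long)(s[0] & 0x0F) << 12) |
               ((long)(s[1] & 0x3F) << 6) |
               (s[2] & 0x3F);

    return -1;
}
```

Fed the bytes e2 80 94, this yields 0x2014: the lead byte contributes 0x2 shifted left by 12, the second byte contributes nothing, and the third contributes 0x14.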

The following is source code for `check_nonascii_bytes.pl`:

#!/usr/bin/perl

use strict;

my @vpatch=<STDIN>;

# Keep track of the current source file we're looking through.
my $src_file = "";

# Iterate over the vpatch one line at a time.
foreach my $line (@vpatch) {  

  if($line =~ /^--- (.*) .*/) {
    $src_file = $1;
  }

  # unpack the line into hex i.e. ;
  # 64696666202d754e7220612f626974636f696e2f2e67697469676e6f726520622f626974636f696e2f2e67697469676e6f72650a
  my $hex_line = unpack "H*", $line;

  # add spaces between each two character hex byte, i.e. ;
  # 64 69 66 66 20 2d 75 4e 72 20 61 2f 62 69 74 63 6f 69 6e 2f 2e 67 69 74 69 67 6e 6f 72 65 20 62 2f 62 69 74 63 6f 69 6e 2f 2e 67 69 74 69 67 6e 6f 72 65 0a
  $hex_line =~ s/(..)(?!$)/$1 /g;

  # Split out each two character hex representation by spaces, into array.
  my @line_bytes = split / /, $hex_line;

  # Iterate through the array of two character representations of bytes, one at a time.
  foreach my $lb (@line_bytes) {

    # Check the current two character representations.
    # First character must be between 0 and 7
    # Second character can be 0-9, or a-f.

    # If the two character byte representation is *NOT* in this range,
    # we've found a non-seven-bit-ascii char.
    if($lb !~ /^[0-7][0-9a-f]$/i) {
      print "NON-7BIT-ASCII VALUE IN SOURCE FILE: $src_file ; BYTE VALUE: $lb\n";
    }
  }
}
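
The Perl check works on the hex text of each byte: a first hex digit of 0 through 7 means the byte is at most 0x7f. As a rough sketch of why that is the same thing as testing the high bit of the raw byte, the following C compares the two tests over all 256 byte values. The function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Mirror of the Perl test: render the byte as two hex digits and
 * require the first digit to be 0-7. The second digit is always a
 * valid hex digit, so it never changes the outcome. */
int is_ascii_by_hex(unsigned char b)
{
    char hex[3];
    int n = snprintf(hex, sizeof hex, "%02x", b);

    assert(n == 2); /* one byte always formats as exactly two digits */
    return hex[0] >= '0' && hex[0] <= '7';
}

/* The same test on the raw byte: the high bit must be clear. */
int is_ascii_by_bit(unsigned char b)
{
    return (b & 0x80) == 0;
}

/* Count the byte values on which the two tests disagree. */
int count_disagreements(void)
{
    int b, bad = 0;

    for (b = 0; b <= 0xFF; b++) {
        if (is_ascii_by_hex((unsigned char)b) !=
            is_ascii_by_bit((unsigned char)b))
            bad++;
    }
    return bad;
}
```

Since the first hex digit is just the top four bits of the byte, it is 8 or greater exactly when the top bit is set, so no byte value should make the two tests disagree.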

Next, I wrote a C program that completes the same task. Here's what its execution looks like:

mod6@gentoo ~/check_nonascii_bytes $ for i in `ls patches`; do echo "vpatch: $i"; ./cb patches/$i; done
vpatch: asciilifeform-kills-integer-retardation.vpatch
vpatch: asciilifeform_add_verifyall_option.vpatch
vpatch: asciilifeform_and_now_we_have_block_dumper_corrected.vpatch
vpatch: asciilifeform_and_now_we_have_eatblock.vpatch
vpatch: asciilifeform_dns_thermonyukyoolar_kleansing.vpatch
vpatch: asciilifeform_dnsseed_snipsnip.vpatch
vpatch: asciilifeform_lets_lose_testnet.vpatch
vpatch: asciilifeform_maxint_locks_corrected.vpatch
vpatch: asciilifeform_orphanage_thermonuke.vpatch
vpatch: asciilifeform_tx-orphanage_amputation.vpatch
vpatch: asciilifeform_ver_now_5_4_and_irc_is_gone_and_now_must_give_ip.vpatch
vpatch: asciilifeform_zap_hardcoded_seeds.vpatch
vpatch: asciilifeform_zap_showmyip_crud.vpatch
vpatch: bitcoin-asciilifeform.1.vpatch
vpatch: bitcoin-asciilifeform.2-https_snipsnip.vpatch
vpatch: bitcoin-asciilifeform.3-turdmeister-alert-snip.vpatch
vpatch: bitcoin-asciilifeform.4-goodbye-win32.vpatch
vpatch: bitcoin-v0_5_3-db_config.6.vpatch
vpatch: bitcoin-v0_5_3_1-rev_bump.7.vpatch
vpatch: bitcoin-v0_5_3_1-static_makefile_v002.8.vpatch
vpatch: genesis.vpatch
OUT OF 7-BIT ASCII RANGE! : e2
OUT OF 7-BIT ASCII RANGE! : 80
OUT OF 7-BIT ASCII RANGE! : 94
vpatch: makefiles.vpatch
vpatch: malleus_mikehearnificarum.vpatch
vpatch: mod6_der_high_low_s.vpatch
vpatch: mod6_fix_dumpblock_params.vpatch
vpatch: programmable-versionstring.vpatch
vpatch: rm_rf_upnp.vpatch

Just as before, we iterate through each vpatch and feed it into our program, `cb` (check_bytes.c). The result matches the Perl run: three bytes in the whole set of v0.5.4 Release vpatches, specifically 'e2', '80', and '94' from genesis.vpatch, which together are the very same substituted hyphen mentioned earlier in this post.

The following is source code for `check_bytes.c`:

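// Ask for POSIX.1-2008 so that getline() and ssize_t are declared.
#define _POSIX_C_SOURCE 200809L

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>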

void read_file(char *filename) {

  FILE      *file;
  char      *line  = NULL;
  size_t    len    = 0;
  ssize_t   read;

  int pos          = 0;  // Position in line
  short int d      = 0;  // holds the short decimal value of the char

  // attempt to open our file, read-only mode.
  file = fopen(filename, "r");

  // if our file stream is NULL, print error and exit.
  if (file == NULL) {
    fprintf(stderr, "FILE COULD NOT BE OPENED! EXITING.\n");
    exit(-1);
  }

  // Read a file line by line.
  while((read = getline(&line, &len, file)) != -1) {

    pos = 0;

    // Iterate through every byte getline() returned. Using the returned
    // length, rather than stopping at '\0', also covers embedded NUL bytes.
    while(pos < read) {

      // set the current line position char in numeric form
      d = (short int) line[pos];

      // test numeric form of char if within 7-BIT ASCII range
      if(d < 0x0 || d > 0x7f) {
        fprintf(stdout, "OUT OF 7-BIT ASCII RANGE! : %02x\n", (unsigned char) line[pos]);
      }

      // move to the next position in the line
      pos++;
    }
  }

  // Clean up
  free(line);
  fclose(file);
}

int main(int argc, char **argv) {
  int i;

  // Iterate through all of the files given to us via command line
  for(i = 1; i < argc; i++) {
    read_file(argv[i]); // Call read_file for each given filename
  }

  return 0;
}
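
Neither program above says where in the vpatch an offending byte sits, which is what one needs for the follow-up check by hand. As a minimal sketch under my own assumptions (the name `find_first_nonascii` and its 1-based line and column convention are hypothetical, not part of the original tools), a scan that tracks position could look like this:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: scan buf[0..len) and report the 1-based line and
 * column (counted in bytes) of the first byte outside 7-bit ASCII.
 * Returns 1 and fills *line and *col if one is found, 0 otherwise. */
int find_first_nonascii(const unsigned char *buf, size_t len,
                        unsigned long *line, unsigned long *col)
{
    unsigned long ln = 1, cl = 1;
    size_t i;

    assert(buf != NULL && line != NULL && col != NULL);

    for (i = 0; i < len; i++) {
        if (buf[i] & 0x80) {      /* high bit set: not 7-bit ASCII */
            *line = ln;
            *col = cl;
            return 1;
        }
        if (buf[i] == '\n') {     /* newline: next byte starts a new line */
            ln++;
            cl = 1;
        } else {
            cl++;
        }
    }
    return 0;
}
```

Called on the contents of a vpatch read into memory, it gives a line number to jump to in an editor, in the same way this post cites a line of genesis.vpatch.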
