Creating a Substitution Cipher with C#

Vulpes
13y
137.1k
0
4

Article

Introduction

I thought for a bit of fun I'd invent a new cipher and, depending on how 'strong' it turned out to be, perhaps use it in my projects to encrypt string literals, files etc.

Also, I wanted to base this on traditional 'substitution' principles, which are easily understand, rather than the complex mathematical algorithms which are used to manipulate bit patterns in modern cryptography.

So what is a substitution cipher?

In the simplest kind of substitution cipher, one simply substitutes one letter for another. Here's a basic program which does that using a key which consists of all 26 letters of the alphabet:

using System;

class SubstitutionCipher
{
   static void Main()
   {
      string key = "jfkgotmyvhspcandxlrwebquiz";
      string plainText = "the quick brown fox jumps over the lazy dog";
      string cipherText = Encrypt(plainText, key);
      string decryptedText = Decrypt(cipherText, key);

      Console.WriteLine("Plain     : {0}", plainText);
      Console.WriteLine("Encrypted : {0}", cipherText);
      Console.WriteLine("Decrypted : {0}", plainText);
      Console.ReadKey();
   }

   static string Encrypt(string plainText, string key)
   {
      char[] chars = new char[plainText.Length];

      for(int i = 0; i < plainText.Length; i++)
      {
         if (plainText[i] == ' ')
         {
            chars[i] = ' ';
         }
         else
         {
            int j = plainText[i] - 97;
            chars[i] = key[j];
         }
      }

return new string(chars);
}

   static string Decrypt(string cipherText, string key)
   {
      char[] chars = new char[cipherText.Length];

      for(int i = 0; i < cipherText.Length; i++)
      {
         if (cipherText[i] == ' ')
         {
            chars[i] = ' ';
         }
         else
         {
            int j = key.IndexOf(cipherText[i]) - 97;
            chars[i] = (char)j;
         }
      }

      return new string(chars);
   }
}

The output is:

Plain : the quick brown fox jumps over the lazy dog

Encrypted : wyo xevks flnqa tnu hecdr nbol wyo pjzi gnm

Decrypted : the quick brown fox jumps over the lazy dog

Notice that you don't necessarily need to substitute different letters in every case and here we leave the letter 'z' unchanged. Notice also that the 'plain text' is usually presented as a lower case string with all punctuation except spaces removed.

How easy is it to 'crack' a substitution cipher?

A substitution cipher is not very secure and can be attacked in the following main ways:

Various studies have shown that the letters of the alphabet occur in roughly the same frequencies in a piece of English text. For example, the commonest letters are: E, T, A and O and the least common are: X, J, Q and Z. By calculating the number of times each letter occurs in the cipher text and comparing it with the frequencies in a table (such as the one at http://en.wikipedia.org/wiki/Letter_frequency), you can usually guess what some of the letters represent. Of course, the longer the cipher text or the more extracts you have, the easier this is.
English has two one-letter words ('a' and 'I') and many common two and three letter words. A knowledge of these helps one to guess the meaning of other letters in the cipher.
Although 'E' is the commonest letter in English, it is far less common as the first letter of a word. Similarly, although 'O' and 'A' are also 'high frequency' letters, they don't often occur at the end of words.
Some two or three letter combinations of letters are quite common and others never occur at all. By analysing the frequency of digraphs and trigraphs (as they are technically called) in the cipher text, one may be able to guess what further letters in the cipher mean.
There are some letters which are frequently doubled (ss, ee, tt, ff etc) and others which are very seldom doubled in English (aa, uu, yy). A study of doubled letters in the cipher text may therefore yield some further clues to deciphering the cipher text.

Can anything be done to make a substitution cipher more secure?

If all letters had the same frequencies and there were no spaces between words, then it would be far harder to crack. This suggests that we should represent the most frequent letters by not just one but by several different letters or symbols and we should also represent spaces by a number of symbols as well so it's difficult to tell where a word begins or ends. For each substitution, one of the available symbols should be chosen at random. So, for example, in a given cipher sometimes 'd' might be represented by 'a' and sometimes by 'w'.

By rounding each letter's frequency to the nearer 2% and then allocating one symbol for each 2% in the Wikipedia table, I found that the number of symbols needed would be as follows:

Number of Symbols	Letters
7	Space
6	E
5	T
4	A, I, O
3	H, N, R, S
2	C, D, L, M, U
1	B, F, G, J, K, P,Q, V, W, X, Y, Z

It will be noticed that a space has been given 7 symbols compared to E's 6 symbols because it says in the Wikipedia article that the former is about 7% more frequent which would give it a percentage of 13.87%.

Also, after looking at some other studies, I decided to allocate 4 (rather than 3) symbols to 'I' and, for various other reasons, I decided to allocate 2 (rather than 1) symbols to 'U', 'C' and 'M'.

This means that we need a total of 64 symbols to represent letters and spaces.

Of course, the lowest frequency letters : Z, X, Q, J , K, V and (to some extent) B are going to be under-represented in the cipher text even though they have only been allocated one symbol. Moreover, the symbols for U, C and M are also going to be under-represented as I've decided to allocate them two symbols each when they don't strictly merit this treatment.

What about capitalization, punctuation and numerals?

In traditional substitution ciphers, capitalization and punctuation (except spaces) are usually ignored with the result that the text can be challenging to read even after you've deciphered it!

This is unsatisfactory and so I decided to allow for both in my cipher.

I found a study which indicated that, on average, every 100 letters would include 94 lower case and 6 upper case letters. To be consistent with the other assumptions, we therefore need three symbols (6%) to indicate that the following letter should be capitalized. Bearing in mind that capital letters often begin words or sentences, the relative frequency of capital letters in English is not the same as the relative frequency of lower case letters but, in the interests of simplicity, I decided to ignore this.

According to the Wikipedia article, the frequency of punctuation and numerals is equivalent to a letter between 'A' and 'T' in the table and so needs four symbols (say), split equally between the two, to indicate the nature of the following symbol. However, we can use this 'following symbol' to 'even up' the occurrence of the 10 letters which were identified as being under-represented in the previous section which is a useful bonus.

By far the most common punctuation symbols are the comma and period. These are followed by quotes, apostrophes and dashes. So, by allocating the least used letters to the commonest punctuation marks, we can address the unbalance to some extent.

Although one might expect that the numerals would occur with equal frequency, in fact 0 and 1 are the most frequent and, in general, the smaller digits are slightly more frequent than the larger ones. We can again use this fact to further address the imbalance in the less frequently used letters.

So, the total number of symbols we need has now increased to: 64 + 3 + 2 + 2 = 71.

OK, but what are we going to do about initial, final and double letters?

Even though it's now going to be difficult for someone to apply the frequency tables for initial and final letters to the cipher text, there's still the problem of the initial and final words of the text as a whole.

Fortunately, this is easy to deal with - all we need to do is to add a random number of 'noise'' symbols to the beginning and end of the cipher text, say between 1 and 5 at each end (the average length of a word is about 5 letters). These symbols will have the same frequency as they would otherwise have in the middle of words.

For double letters, I decided to create a special symbol which doubles the previous one so that adjacent symbols will always be different in the cipher text.

This means that we need a grand total of 72 symbols in our 'alphabet'. As well as the 62 alphanumeric symbols: 'A' to 'Z', 'a' to 'z' and '0' to '9', I added 10 other symbols which should be available on most keyboards, namely: !, #, $, %, &, *, +, =, ? and @.

All 72 of these symbols have an equal chance of appearing in the noise symbols at the beginning and end of the cipher text which makes them difficult to distinguish from 'normal' symbols.

The same key will, of course, be used for both encryption and decryption and will be 74 characters long to encompass the 72 symbol alphabet, plus the number of noise symbols (which can be represented as a single upper case letter as there are only 25 combinations) plus a further character which I'll discuss later. I've also allowed for a special 'no noise' key (see later).

What about control codes, foreign letters, mathematical symbols and the like?

The only control codes which actually affect the layout of text nowadays are: \r, \n, and \t which represent carriage return, new line and tab respectively. All of these occur quite frequently in a long piece of text and it's common for \r and \n to occur together.

I decided to represent \n by (capitals) space, \t by (numerals) space and \r by (numerals) E. As 'space' and 'E ' can be represented by 7 and 6 symbols and 'capitals' and 'numerals' by 3 and 2 symbols respectively, there are a large number of possible combinations available which will make these characters difficult to spot.

The other control codes can be dealt with using a (punctuation)(double) prefix.

The 32 unicode punctuation symbols with code points between 160 and 191 fit nicely with the symbols below 127 into the 'punctuation' category.

The 64 accented letters with code points between 192 and 255 can be dealt with using a (capitals)(double) prefix and any other unicode characters with code points above 255 can be represented by a (letter)(double) prefix followed by a 'base 64' three symbol code.

To give complete unicode coverage, the non-printing character DEL (127) is represented by (numerals) P and the Euro symbol (Unicode 8364) which appears on many European keyboards as (numerals) Y, rather than by the five symbol code that would otherwise apply.

Where's the source code so I can try this?

Although some of this sounds complicated, in fact it's quite easy to code and the download for this article contains the complete class, which I've called FASC ('Frequency Adjusted Substitution Cipher') all of whose methods are static.

If you want to use this class in your projects, then it can either be pasted in as it stands or it can be compiled to a class library first and then a reference to the .dll added to your project.

How should I use this class in practice?

You can if you wish generate a key and then use that key to encrypt/decrypt a string or a whole file. It's also possible to generate a special key which produces cipher text without any noise, if you want to minimize the length of the cipher text for some reason.

However, a better way of proceeding is to generate a key file which consists of 72 different keys, each of which contains a unique symbol (the 74th character mentioned earlier). Each time a string or file is encrypted a key is selected at random from this file and used to encrypt the plain text. The cipher text is then prefixed with the symbol used (a single character), in a slightly disguised form.

When you then decrypt the cipher text, the key symbol is read, the appropriate key is loaded from the file and the decipherment then takes place using this key.

The beauty of this system is that you never need to deal with the keys - it's all done for you in the background. Moreover, if you were encrypting a bunch of stuff in the same program, different keys would probably be used to encrypt each one making it much more difficult for someone to decipher it.

Of course, you still have to know which key file was used to encrypt a piece of text; there's no way to record that as the system is currently constituted. You might also want to encrypt the key file using, say, windows data security and bury it somewhere in the filing system where it cannot be easily identified.

By default, an unencrypted key file called default.fkf is generated and placed in the same directory as the assembly itself.

Even if you use the single key system, then the 74th character in the key is still used as a rough and ready way of telling whether the cipher text was produced with this key or not.

Can I have an example?

Here's an example which generates an individual key and then encrypts the famous 'pangram' (a phrase containing all 26 letters of the alphabet) mentioned earlier which has now been decorated with some upper case letters, numerals and punctuation marks.

The example also illustrates the use of a special key and a key file on some other pangrams, as well as the encryption and decryption of a text file; an entertaining piece I found in this blog (http://blogs.msdn.com/b/ericlippert/archive/2011/02/14/what-would-feynman-do.aspx).

class Test
{
    static void Main()
    {
        // using a single key

        string key = FASC.CreateKey();
        string plainText = "The quick Brown Fox, jumps over the 2nd lazy-dog!";
        string cipherText = FASC.Encrypt(plainText, key);
        string decryptedText = FASC.Decrypt(cipherText, key);

        Console.WriteLine("Key       : {0}", key);
        Console.WriteLine("Plain     : {0}", plainText);
        Console.WriteLine("Encrypted : {0}", cipherText);
        Console.WriteLine("Decrypted : {0}", plainText);
        Console.WriteLine();

// using a special single key (no noise)

        key = FASC.CreateSpecialKey();
        plainText = "Pack my red box (with five dozen quality jugs).";
        cipherText = FASC.Encrypt(plainText, key);
        decryptedText = FASC.Decrypt(cipherText, key);

// using a key file

        FASC.CreateKeyFile(false);
        plainText = "Woven silk pyjamas; exchanged for blue quartz :)";
        cipherText = FASC.Encrypt(plainText);
        decryptedText = FASC.Decrypt(cipherText);

        Console.WriteLine("Plain     : {0}", plainText);
        Console.WriteLine("Encrypted : {0}", cipherText);
        Console.WriteLine("Decrypted : {0}", plainText);

// encrypting a file

FASC.EncryptFile("feynman.txt", "feynman2.txt");
FASC.DecryptFile("feynman2.txt", "feynman3.txt");

Console.ReadKey();
}
}

The output is shown below though you will obviously get different values for the keys and cipher text if you run this program yourself:

Plain : The quick Brown Fox, jumps over the 2nd lazy-dog!
Encrypted : P8eXglf=Scpbh3xY!va#$t8TfdD4PZ+5m*JXTIB@y7%OV7KY62nElZsPYNDGP0gKOB@
Decrypted : The quick Brown Fox, jumps over the 2nd lazy-dog!

Plain : Pack my red box (with five dozen quality jugs).
Encrypted : x&UNRsro1%c5GrO!Er0V6zp8+TziMvb!a57%XANjQp1yJAP3hR0X
Decrypted : Pack my red box (with five dozen quality jugs).

Plain : Woven silk pyjamas; exchanged for blue quartz :)
Encrypted : ytC$JPHNuUlhT268YeWb4c&cG=BTxR31ckFlIgSo#7Bwtve0ty#rA75C=3+Nt
Decrypted : Woven silk pyjamas; exchanged for blue quartz :)

Conclusion

Clearly, a substitution cipher of this nature is not to going to be anywhere near as 'cryptographically strong' as modern algorithms such as Rijndael but it should certainly prevent casual inspection of confidential information and won't be too easy for professional hackers to crack even if they know the underlying basis. It also works quickly and on text of any length. You can even encode an empty string!

Although it's certainly possible to encrypt text written in other Western European languages which use the Latin alphabet (including French, German, Spanish, Portuguese and Italian) using this cipher, it will not be as secure because the frequency of letters in those languages will differ from English. Any native speaker of those languages would therefore be better developing his or her own version of the cipher using the same principles.

Although , for completeness, the FASC cipher can encode all Unicode characters, it is not suitable for encoding binary, as opposed to text, files.

Creating a Substitution Cipher with C#

Number of Symbols

Letters