.NET String Immutability and Related .NET Framework Bugs

This article explains several aspects of the internal structure of the .NET String class. Examples and workarounds are given for three distinct manifestations of a reported bug related to the TextRenderer.MeasureText method when it is used with the PathEllipsis and ModifyString flags.

Introduction

The immutability of strings is one of the fundamental facts of .NET. Basically, it means that the internal buffer containing the characters is, at least in theory, read-only. Any function that appears to change a string actually allocates a new String object and returns it. The original object cannot be changed.

A String object holds a buffer of Char values. Every Char is a UTF-16 code unit. Usually, one letter in a string is represented by one Char. There are, however, exceptions: characters outside the Basic Multilingual Plane are encoded by UTF-16 as a pair of Char values (a surrogate pair), and a single grapheme, such as a letter followed by a combining accent, may span several Char values.
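The difference between Char count and perceived character count can be checked with a minimal sketch like the one below. The class name CharCountDemo and the helper TextElementCount are made up for this illustration; StringInfo.LengthInTextElements is the framework member that counts text elements.

```csharp
using System;
using System.Globalization;

static class CharCountDemo
{
    // Number of text elements (user-perceived characters) in a string.
    public static int TextElementCount(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    static void Main()
    {
        string plain = "A";              // one Char, one text element
        string emoji = "\U0001F600";     // outside the BMP: stored as a surrogate pair
        string accented = "e\u0301";     // 'e' followed by a combining acute accent

        Console.WriteLine("{0} Char(s), {1} text element(s)", plain.Length, TextElementCount(plain));
        Console.WriteLine("{0} Char(s), {1} text element(s)", emoji.Length, TextElementCount(emoji));
        Console.WriteLine("{0} Char(s), {1} text element(s)", accented.Length, TextElementCount(accented));

        // The two Char values of the emoji are the high and low surrogate.
        Console.WriteLine(char.IsHighSurrogate(emoji[0]) && char.IsLowSurrogate(emoji[1]));
    }
}
```

The Length property always counts Char values, so the last two strings report a length of 2 even though each is displayed as a single character.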

Once a Char buffer is established and a String object references it, no more changes should be allowed to the contents and size of that specific buffer. That is the main point of string immutability. One consequence of this idea is that the buffer can be shared among multiple string references. This does not appear to be stated explicitly in the MSDN documentation, but if you think it through, you'll find no reason against it.

An Anecdote

I once heard a man, a mathematician by profession, answer a very simple question: you have an array of integers, how will you find the smallest of them? His answer was: I'll sort the array in ascending order and pick the first value. This answer, although mathematically correct, is wrong on more than one count. Sorting, if comparison-based, requires on the order of N*logN comparisons, while a simple minimum-finding algorithm requires only N-1 comparisons, which is O(N). Also, the original array is not guaranteed to reside on a medium that allows it to be sorted in place (for example, a read-only medium such as a pressed DVD).
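For comparison, here is a minimal sketch of the linear scan mentioned above. The names MinimumDemo and FindMin are chosen for this illustration only. The method reads the array once and never writes to it, so it also works when the data cannot be modified.

```csharp
using System;

static class MinimumDemo
{
    // Single pass over the array: exactly Length - 1 comparisons, input is only read.
    public static int FindMin(int[] values)
    {
        if (values == null || values.Length == 0)
            throw new ArgumentException("Array must contain at least one element.", "values");

        int min = values[0];
        for (int i = 1; i < values.Length; i++)
        {
            if (values[i] < min)
                min = values[i];
        }
        return min;
    }

    static void Main()
    {
        int[] data = { 7, 3, 9, -2, 5 };
        Console.WriteLine(FindMin(data));   // prints -2
    }
}
```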

Now we may think of asking this mathematician a new question: Can I allocate a million times the same million-character long string and keep all those strings in memory all the time?

If your answer is no (despite the opposite hint in the introduction), then you're wrong. The answer is yes. Someone will probably ask how that is possible, since it would seem to require two terabytes of memory (a million strings of a million two-byte characters each). Well, those strings are the same, as stated in the question. Someone might complain that strings are immutable, and hence they must be copied every time. But again, the answer is no. The strings are the same. There is only one character string. All others refer to the same internal buffer.
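The claim can be tried out with a sketch like the following. SharedReferenceDemo and Replicate are names invented for this example. The array holds one million references to a single million-character string, so the character data exists only once.

```csharp
using System;

static class SharedReferenceDemo
{
    // Fills an array with references to one and the same string object.
    public static string[] Replicate(string value, int count)
    {
        string[] copies = new string[count];
        for (int i = 0; i < copies.Length; i++)
            copies[i] = value;      // copies a reference, not the characters
        return copies;
    }

    static void Main()
    {
        string million = new string('x', 1000000);       // about 2 MB of character data
        string[] copies = Replicate(million, 1000000);   // one million references to it

        Console.WriteLine(copies.Length);
        Console.WriteLine(object.ReferenceEquals(copies[0], copies[copies.Length - 1]));
    }
}
```

The array itself still costs one reference per element (4 or 8 bytes each, depending on the platform), which is a few megabytes rather than terabytes.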

Apart from the fact that the first mathematician's answer was wrong and the second correct, both given on the same grounds, these two answers have something in common. They are naive. They give out simple truths that programmers sometimes fail to see without additional thinking.

Consequences

The thesis of sharing the same buffer among multiple string instances is not something to be found in the MSDN. (At least I haven't found explanations; forgive me if I am wrong, the MSDN is such a large website these days.) However it is easy to test. Here is the code:

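// One million characters, two bytes each: roughly 2 MB of character data.
string s = new string('x', 1000000);
string[] strings = new string[10];     // will hold plain copies of s
string[] modified = new string[10];    // will hold s with one character appended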

Console.WriteLine("Testing simple copy:");
for (int i = 0; i < strings.Length; i++)
{
    strings[i] = s;
    Console.WriteLine("{0,4} Allocated bytes: {1}", i + 1, GC.GetTotalMemory(false));
}
Console.WriteLine();

Console.WriteLine("Testing modified strings:");
for (int i = 0; i < modified.Length; i++)
{
    modified[i] = s + "x";
    Console.WriteLine("{0,4} Allocated bytes: {1}", i + 1, GC.GetTotalMemory(false));
}

This code produces the following output:

Testing simple copy:

1 Allocated bytes: 2268788
2 Allocated bytes: 2268788
3 Allocated bytes: 2268788
4 Allocated bytes: 2268788
5 Allocated bytes: 2268788
6 Allocated bytes: 2268788
7 Allocated bytes: 2276980
8 Allocated bytes: 2276980
9 Allocated bytes: 2276980
10 Allocated bytes: 2276980

Testing modified strings:

1 Allocated bytes: 4277012
2 Allocated bytes: 6116656
3 Allocated bytes: 8116688
4 Allocated bytes: 10115560
5 Allocated bytes: 12115592
6 Allocated bytes: 14115624
7 Allocated bytes: 16115656
8 Allocated bytes: 18115688
9 Allocated bytes: 20115720
10 Allocated bytes: 22115752

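Now observe how the amount of allocated memory remains practically the same although a 2MB-sized string is supposedly copied ten times (the jump of about 8 KB at the seventh line is far too small to be a copy of the string). In the second part of the test, however, one letter is appended to the original string in each copy, which produces a new string of 2MB plus two bytes for the additional Char in every iteration.

This shows that string buffers are shared internally. In C# terms, assigning one string variable to another copies only the reference, so both variables denote the same String object and therefore the same characters.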

I have heard people complaining about this specific example, saying that "it proves that strings aren't really immutable". But they have the wrong impression. Immutability does not mean that a string must be deep-copied every time it is copied. It means that its internal buffer is read-only. If two strings point to the same internal buffer, the buffer is read-only nonetheless.
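The distinction between sharing and mutating can be illustrated with a short sketch. ImmutabilityDemo and SameObject are names made up for this example. A plain assignment yields the same object, while an operation that "changes" the string yields a new object and leaves the original intact.

```csharp
using System;

static class ImmutabilityDemo
{
    // True when both variables refer to the very same String object.
    public static bool SameObject(string a, string b)
    {
        return object.ReferenceEquals(a, b);
    }

    static void Main()
    {
        string original = new string('x', 5);
        string alias = original;            // same object, no characters copied
        string longer = original + "y";     // a new object; original is untouched

        Console.WriteLine(SameObject(original, alias));    // True
        Console.WriteLine(SameObject(original, longer));   // False
        Console.WriteLine(original);                       // xxxxx
    }
}
```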

Bug #1

This bug occurs when the System.Windows.Forms.TextRenderer.MeasureText method is called with the PathEllipsis and ModifyString flags set. The PathEllipsis flag tells the text renderer to replace the middle part of the given string with an ellipsis, so that the text fits within the given bounds. The MeasureText method does the job by invoking the PathCompactPathEx Windows API function, which alone is not a problem. The problem lies in the ModifyString flag: it asks MeasureText to write the modified text back into the original string object. Which it does. By overwriting the internal buffer. Which is, you guessed it, shared.

The following piece of code depicts this most unfortunate situation:

string path = Environment.GetFolderPath(Environment.SpecialFolder.System);
string originalPath = path;    // copies the reference only; both variables denote one string

System.Drawing.Font font = new System.Drawing.Font(System.Drawing.FontFamily.GenericSansSerif, 8.25F);
System.Drawing.Size proposedSize = new System.Drawing.Size(100, 100);
System.Windows.Forms.TextFormatFlags flags = System.Windows.Forms.TextFormatFlags.ModifyString | System.Windows.Forms.TextFormatFlags.PathEllipsis;
System.Windows.Forms.TextRenderer.MeasureText(path, font, proposedSize, flags);

Console.WriteLine(path);
Console.WriteLine(originalPath);

The code produces output like this:

C:\Win...\system32
C:\Win...\system32

We have changed only the first string. But both string objects contain the modified value. The workaround for this problem is simple. Instead of the line:

string originalPath = path;

Write this:

string originalPath = string.Copy(path);

That will force a new buffer to be created, so the path and originalPath objects will not share the internal buffer any more. The result is as expected:

C:\Win...\system32
C:\Windows\system32

I had hoped that at least the documentation of the String.Copy method would mention buffer sharing, but it does not.
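The difference between an alias and a forced copy can be verified without calling MeasureText at all. The sketch below uses a hypothetical helper named DeepCopy, built on the String(char[]) constructor, as a stand-in for string.Copy. The placeholder text is not a real path.

```csharp
using System;

static class StringCopyDemo
{
    // Builds a new String object from the characters of the source.
    public static string DeepCopy(string source)
    {
        if (source == null) throw new ArgumentNullException("source");
        return new string(source.ToCharArray());
    }

    static void Main()
    {
        string path = new string('p', 8);   // placeholder text, not a real path
        string alias = path;                // shares the object with path
        string copy = DeepCopy(path);       // owns its own characters

        Console.WriteLine(object.ReferenceEquals(path, alias));  // True
        Console.WriteLine(object.ReferenceEquals(path, copy));   // False
        Console.WriteLine(path == copy);                         // True
    }
}
```

Anything that overwrites the characters of path in place would therefore show through alias, but not through copy.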

Bug #2

The last output may look fine, but it is not. If we change the printout code, the difference will emerge:

Console.WriteLine("{0} chars - [{1}]", path.Length, path);
Console.WriteLine("{0} chars - [{1}]", originalPath.Length, originalPath);

This code prints different output:

19 chars - [C:\Win...\system32 ]
19 chars - [C:\Windows\system32]

Now observe the output; both strings report the same length!

The problem is that the MeasureText method has made another mistake, related to the previous one. It did not modify the string length. But that is not all; the PathCompactPathEx Windows API has placed a C-style string terminating character '\0' at the end of the compacted path. This character is not considered a terminator in .NET. In most cases it is treated as a regular character (it is ignored in culture-based string comparisons, but treated as a regular character in ordinal string comparisons). The problem becomes obvious if we pick a longer path to compact (the application data folder, for instance). The output looks like this:

31 chars - [C:\User...\Roaming Data\Roaming]
31 chars - [C:\Users\zoranh\AppData\Roaming]

As you can see, the string was compacted and the middle part was replaced by an ellipsis, but after the end there is what looks like a blank space (the '\0' character) and then the remainder of the original string. All of this is the consequence of leaving the original string length unchanged instead of modifying it so that the string ends at the C string terminator character.
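That '\0' is an ordinary character to .NET can be confirmed with a small sketch. NullCharDemo and OrdinalEquals are names invented for this illustration, and the test strings are arbitrary.

```csharp
using System;

static class NullCharDemo
{
    // Ordinal comparison looks at every Char, including '\0'.
    public static bool OrdinalEquals(string a, string b)
    {
        return string.Equals(a, b, StringComparison.Ordinal);
    }

    static void Main()
    {
        string withNull = "abc\0def";

        Console.WriteLine(withNull.Length);                  // 7: '\0' counts, and so does what follows
        Console.WriteLine(withNull.IndexOf('\0'));           // 3
        Console.WriteLine(OrdinalEquals("abc\0", "abc"));    // False
    }
}
```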

This bug has been reported to Microsoft, and it is currently in the status Won't fix, which means that Microsoft has higher priorities. The bug is reported at this address: http://connect.microsoft.com/VisualStudio/feedback/details/620330/textrenderer-measuretext-with-textformatflags-pathellipsis-textformatflags-modifystring-flags#details.

The bug is presumably of no priority to Microsoft because there is a simple workaround. After adding an ellipsis to the string, we should manually search for the '\0' character and cut the string at that position:

int pos = path.IndexOf('\0');
if (pos >= 0)
    path = path.Substring(0, pos);

The output now looks correct:

18 chars - [C:\User...\Roaming]
31 chars - [C:\Users\zoranh\AppData\Roaming]
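The same workaround can be wrapped in a reusable helper. The sketch below uses the hypothetical names PathCleanup and TruncateAtNull. The damaged string is typed in by hand to mimic the buffer contents shown earlier, so the example runs without calling MeasureText.

```csharp
using System;

static class PathCleanup
{
    // Returns the part of the text that precedes the first '\0', or the text itself if there is none.
    public static string TruncateAtNull(string text)
    {
        if (text == null)
            return null;

        int pos = text.IndexOf('\0');
        return pos >= 0 ? text.Substring(0, pos) : text;
    }

    static void Main()
    {
        // Hand-built imitation of the string left behind by MeasureText.
        string damaged = "C:\\User...\\Roaming\0Data\\Roaming";
        string cleaned = TruncateAtNull(damaged);

        Console.WriteLine("{0} chars - [{1}]", damaged.Length, damaged);
        Console.WriteLine("{0} chars - [{1}]", cleaned.Length, cleaned);
    }
}
```

Substring allocates a new string, so the cleaned value no longer shares anything with the damaged one.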

Bug #3

Suppose that we decide to use System.IO.DirectoryInfo to reference the application data directory; quite a common requirement. We can keep the string with an ellipsis for the purpose of printing out a directory path of limited length, and maintain the DirectoryInfo object pointing to the directory. The code looks like this:

string path = Environment.GetFolderPath(Environment.SpecialFolder.ApplicationData);
System.IO.DirectoryInfo dir = new System.IO.DirectoryInfo(path);

System.Drawing.Font font = new System.Drawing.Font(System.Drawing.FontFamily.GenericSansSerif, 8.25F);
System.Drawing.Size proposedSize = new System.Drawing.Size(100, 100);
System.Windows.Forms.TextFormatFlags flags = System.Windows.Forms.TextFormatFlags.ModifyString | System.Windows.Forms.TextFormatFlags.PathEllipsis;
System.Windows.Forms.TextRenderer.MeasureText(path, font, proposedSize, flags);

int pos = path.IndexOf('\0');
if (pos >= 0)
    path = path.Substring(0, pos);

try
{

    Console.WriteLine("{0} chars - [{1}]", path.Length, path);
    Console.WriteLine("Directory path: {0}", dir.FullName);

}
catch (System.Exception ex)
{
    Console.WriteLine("ERROR: {0}", ex.Message);
}

The output looks like this:

18 chars - [C:\User...\Roaming]
Directory path: C:\Users\zoranh\AppData\Roaming

Notice the try-catch block around the printout statements. It is there for a reason. Starting with .NET Framework 4.0, the DirectoryInfo's FullPath field (the value returned by the FullName property) shares the same buffer with the path string! So if we run the same code on .NET Framework 4.0, the output turns into this:

18 chars - [C:\User...\Roaming]
ERROR: Illegal characters in path.

The exception is caused by the '\0' character that was written into the FullPath buffer when the path was changed by the MeasureText method. This character is not legal in a file system path, and hence the exception is thrown.

The problem occurs only when a DirectoryInfo object is initialized with an absolute path. If a relative path is used then DirectoryInfo cannot share the buffer with the initialization string, because it requires a full path to the directory. The reason is simple: a relative path becomes invalid if the current directory is changed.

Here is the code that uses a relative path to initialize the DirectoryInfo object. There is no exception, even on .NET Framework 4.0.

string path = Environment.GetFolderPath(Environment.SpecialFolder.ApplicationData);

System.IO.DirectoryInfo dir = new System.IO.DirectoryInfo(path);
string root = dir.Root.FullName;

System.IO.Directory.SetCurrentDirectory(root);
path = path.Substring(root.Length);
dir = new System.IO.DirectoryInfo(path);

Console.WriteLine("Current directory: {0}", System.IO.Directory.GetCurrentDirectory());
Console.WriteLine("Relative path: {0}", path);
Console.WriteLine("Absolute path: {0}", dir.FullName);

System.Drawing.Font font = new System.Drawing.Font(System.Drawing.FontFamily.GenericSansSerif, 8.25F);
System.Drawing.Size proposedSize = new System.Drawing.Size(100, 100);
System.Windows.Forms.TextFormatFlags flags = System.Windows.Forms.TextFormatFlags.ModifyString | System.Windows.Forms.TextFormatFlags.PathEllipsis;
System.Windows.Forms.TextRenderer.MeasureText(path, font, proposedSize, flags);

int pos = path.IndexOf('\0');
if (pos >= 0)
    path = path.Substring(0, pos);

try
{

    Console.WriteLine("{0} chars - [{1}]", path.Length, path);
    Console.WriteLine("Directory path: {0}", dir.FullName);

}
catch (System.Exception ex)
{
    Console.WriteLine("ERROR: {0}", ex.Message);
}

The output looks like this:

Current directory: C:\
Relative path: Users\zoranh\AppData\Roaming
Absolute path: C:\Users\zoranh\AppData\Roaming
18 chars - [Users\z...\Roaming]
Directory path: C:\Users\zoranh\AppData\Roaming

We can see that DirectoryInfo has converted the relative path to an absolute path, which is not affected by the MeasureText method because it doesn't share the character buffer with the path string object.
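The conversion from a relative to an absolute path can be observed in isolation. In the sketch below, RelativePathDemo and Resolve are hypothetical names and the directory "Users\Public" is only an example; DirectoryInfo does not require the directory to exist.

```csharp
using System;
using System.IO;

static class RelativePathDemo
{
    // Lets DirectoryInfo resolve the path against the current directory.
    public static string Resolve(string relativePath)
    {
        return new DirectoryInfo(relativePath).FullName;
    }

    static void Main()
    {
        string relative = Path.Combine("Users", "Public");   // example path only
        string full = Resolve(relative);

        Console.WriteLine(Path.IsPathRooted(relative));             // False
        Console.WriteLine(Path.IsPathRooted(full));                 // True
        Console.WriteLine(object.ReferenceEquals(relative, full));  // False: a different string object
    }
}
```

Because the absolute path is a newly built string, nothing that later happens to the relative string can reach it.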

The workaround when DirectoryInfo is initialized in the general case (that may include an absolute path) is again to force a copy of the character buffer as in the following:

System.IO.DirectoryInfo dir = new System.IO.DirectoryInfo(string.Copy(path));

This way the DirectoryInfo object maintains its own internal character buffer regardless of the path object.
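Both workarounds, copying first and cutting at '\0' afterwards, can be combined into one helper so that the original path is never handed to MeasureText. The following is only a sketch under assumptions: SafeCompaction, Compact and the measureInPlace delegate are names invented here, and the delegate is expected to wrap the MeasureText call from the earlier examples. Passing the call in as a delegate keeps the sketch free of a Windows Forms reference.

```csharp
using System;

static class SafeCompaction
{
    // Hands a private copy of the path to measureInPlace, then cuts the copy at the first '\0'.
    // measureInPlace is assumed to call TextRenderer.MeasureText with ModifyString | PathEllipsis.
    public static string Compact(string path, Action<string> measureInPlace)
    {
        if (path == null) throw new ArgumentNullException("path");
        if (measureInPlace == null) throw new ArgumentNullException("measureInPlace");

        string scratch = new string(path.ToCharArray());   // private copy; path itself is never passed on
        measureInPlace(scratch);

        int pos = scratch.IndexOf('\0');
        return pos >= 0 ? scratch.Substring(0, pos) : scratch;
    }

    static void Main()
    {
        string path = Environment.CurrentDirectory;

        // A do-nothing delegate stands in for the real call, which in a Windows Forms project
        // would be: s => TextRenderer.MeasureText(s, font, proposedSize, flags)
        string display = Compact(path, s => { });

        Console.WriteLine(display == path);   // True here, because nothing modified the copy
    }
}
```

With this arrangement, a DirectoryInfo created from path, or any other variable referring to it, keeps seeing the unmodified text.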

Conclusion

This article has demonstrated how the .NET Framework optimizes memory use in strings where possible. The optimization is based on the idea that a plain copy of a string doesn't change the contents and length of the internal character buffer, which means that the copy can point to the same internal buffer without loss of information.

We have further shown that the MeasureText function breaks the rule that the character buffer of a string object is read-only, and allows the Windows API to write to it. Not only do the contents of the buffer become invalid, but the supporting members of the string object, like the one specifying the length of the string, become incorrect as well. We have pointed to three real-life situations in which these problems are fully manifested when working with file system paths. Adding an ellipsis to paths is a common operation, and here we have shown that it is not possible to rely on the .NET Framework alone to do that; appropriate workarounds must be employed to avoid the effects of these bugs.

Some people prefer writing their own functions to compact paths. I find that approach wrong, because unless the path is printed using a monospaced font, you cannot determine how many characters, and which ones, should be dropped to satisfy the space limitations without measuring parts of the string. The MeasureText method from the .NET Framework is the way to do it. Never forget that it is coded to operate with the underlying GDI and platform. Changes in external libraries may cause your custom text-measuring code to fail in the future, while MeasureText would be expected to reflect the changes and continue working correctly. Bugs in string manipulation, especially correctable ones, are not sufficient motive to give up and try managing things in a custom manner.