General Formatter for .NET 2/4: Design


This is the second part of a four-part article:
  • Introduction - Part in which proposed goals and desired features of the solution are defined.
  • Design - Part in which solution design is outlined, explaining critical details that the solution will have to implement.
  • Implementation - Third part which explains actual implementation of the formatter classes.
  • Example - Final part which lists numerous examples of formatter use.
The last part of the article comes with an attached compressed file containing the complete, commented source code of the formatter classes, the source code of the demonstration project and the compiled library.

Solution Design

These examples are similar in the sense that they share the same general structure (i.e. specify the type, then the name and value) and then list the contents of the object by iterating through its fields and properties. Each contained member is then printed recursively by the same rules that were applied to its containing instance. The representations differ only in the substrings used to separate items, break lines and indent them. The parameters listed below, which will later become string properties in the solution, define these substrings:
  • LinePrefix - String put at the beginning of every line, which is useful in multi-line representations. Typically used to indent all content by a fixed number of space or horizontal tab characters, e.g. setting this property to new string(' ', 15) would indent all content by 15 character positions.
  • FirstContainedValuePrefix - String prepended to first value which is a child of the currently processed object. For example, it could be an opened curly brace or opened bracket.
  • LastContainedValueSuffix - String appended after last value which is a child of the currently processed object.
  • IndentationString - String applied at every indentation level in multi-lined formats. For example, if set to single horizontal tab character, then line with indentation level 3 will be prepended by three consecutive tab characters. If this value is set to null or empty string then text will not be indented. This value is ignored in single-lined formats.
  • RightMostIndentationString - Similar to IndentationString, only applied to the last indentation level just before the indented text begins. If this value is set to null, then IndentationString value is used instead.
  • LastIndentationString - Similar to IndentationString, only applied to indent the last value in the list of values contained in the same object. In other words, this string is applied at the last line after which indentation level is going to be reduced. If this value is set to null, then IndentationString value is used instead.
  • LastRightMostIndentationString - Same as RightMostIndentationString only applied to the last item in the array of items with the same parent object. If this value is set to null then value of RightMostIndentationString is used instead.
  • FieldDelimiter - String used to delimit successive fields. In multi-line representations this string should contain new line characters, e.g. Environment.NewLine. In single-line representations this property should be set to an appropriate list delimiter, e.g. ", " or the current culture's TextInfo.ListSeparator property value. This value is ignored if set to null or empty string.
By varying these values we can produce all the different formats shown in the examples above. Matters become more complicated when we try to present an array of values:
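The parameters above can be gathered into a single settings class. The following is only a minimal sketch: the property names come from the list, but the class name FormatSettings, the Effective* helpers and the two preset methods are assumptions made for illustration, and the preset values are guesses modeled on the sample outputs in this article. The chained fallback of LastRightMostIndentationString to IndentationString is likewise one possible reading of the rules.

```csharp
using System;

// Hypothetical container for the formatting strings described in the list.
public class FormatSettings
{
    public string LinePrefix { get; set; }
    public string FirstContainedValuePrefix { get; set; }
    public string LastContainedValueSuffix { get; set; }
    public string IndentationString { get; set; }
    public string RightMostIndentationString { get; set; }
    public string LastIndentationString { get; set; }
    public string LastRightMostIndentationString { get; set; }
    public string FieldDelimiter { get; set; }

    // Fallback rules: a null value defers to the more general string.
    public string EffectiveRightMost
    {
        get { return RightMostIndentationString ?? IndentationString; }
    }

    public string EffectiveLast
    {
        get { return LastIndentationString ?? IndentationString; }
    }

    public string EffectiveLastRightMost
    {
        get { return LastRightMostIndentationString ?? EffectiveRightMost; }
    }

    // Assumed preset for compact single-line output.
    public static FormatSettings SingleLine()
    {
        return new FormatSettings
        {
            FirstContainedValuePrefix = " { ",
            LastContainedValueSuffix = " }",
            FieldDelimiter = ", "
        };
    }

    // Assumed preset for tree-like multi-line output.
    public static FormatSettings Tree()
    {
        return new FormatSettings
        {
            FirstContainedValuePrefix = " {",
            LastContainedValueSuffix = " }",
            IndentationString = " |   ",
            RightMostIndentationString = " |-- ",
            LastIndentationString = "     ",
            LastRightMostIndentationString = " +-- ",
            FieldDelimiter = Environment.NewLine
        };
    }
}
```

With only IndentationString set, all three Effective* properties return that same string, which matches the "used instead" rules in the list.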

int[5] { 1, 2, 3, 4, 5 }

One step further leads us to formatting any object that implements the IEnumerable interface, for example a list:

List<Int32> {
 |-- int Item[0]=56
 |-- int Item[1]=39
 |-- int Item[2]=95
 |-- int Item[3]=94
 |-- int Item[4]=35
 |-- int Item[5]=60
 |-- int Item[6]=64
 |-- int Item[7]=58
 |-- int Item[8]=77
 +-- int Item[9]=26 }

In all these cases the representation of a member variable is somewhat arbitrary. We write Point to represent the System.Drawing.Point type, but at the same time we write int for the System.Int32 type rather than Int32. This is the result of an effort to make the output more readable. On the other hand, it is a conscious step back from the proclaimed goal of being general, taken only to make the output look nicer. The key to such customizations is finding the proper measure between two extremes. One is to know nothing about the particular object being converted, which produces bloated results, as in the Rectangle example. The opposite extreme is to gather all imaginable knowledge about the object at hand so as to produce the most beautiful string possible, which in turn amounts to coding every possible custom format. The proper solution lies between these extremes: recognize the simplifications that affect the output so frequently that gathering the required knowledge eventually pays off. For example, we can name primitive types by their C# keywords int, long, etc. rather than by their short CTS names Int32 and Int64. This effort is limited to a small set of types and, once implemented, makes the output string shorter and consequently easier to read.
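The type-name simplification described above could be sketched roughly as follows. The class name TypeNames and the method GetDisplayName are hypothetical; the sketch only assumes that a small lookup table of C# keywords plus a generic-argument expansion is enough. Note that it prints List<int>, whereas one of the sample outputs in this article shows List<Int32>, so the actual formatter may treat generic arguments differently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper that maps CLR types to short, readable names.
public static class TypeNames
{
    // Small, fixed set of types that have a C# keyword.
    private static readonly Dictionary<Type, string> Aliases = new Dictionary<Type, string>
    {
        { typeof(bool), "bool" }, { typeof(byte), "byte" }, { typeof(sbyte), "sbyte" },
        { typeof(char), "char" }, { typeof(short), "short" }, { typeof(ushort), "ushort" },
        { typeof(int), "int" }, { typeof(uint), "uint" }, { typeof(long), "long" },
        { typeof(ulong), "ulong" }, { typeof(float), "float" }, { typeof(double), "double" },
        { typeof(decimal), "decimal" }, { typeof(string), "string" }, { typeof(object), "object" }
    };

    public static string GetDisplayName(Type type)
    {
        string alias;
        if (Aliases.TryGetValue(type, out alias))
            return alias;

        if (type.IsGenericType)
        {
            // Reflection names generic types like "List`1"; cut the arity suffix.
            string name = type.Name;
            int tick = name.IndexOf('`');
            if (tick >= 0)
                name = name.Substring(0, tick);

            Type[] args = type.GetGenericArguments();
            string[] argNames = new string[args.Length];
            for (int i = 0; i < args.Length; i++)
                argNames[i] = GetDisplayName(args[i]);

            return name + "<" + string.Join(", ", argNames) + ">";
        }

        // No special knowledge: fall back to the short CTS name.
        return type.Name;
    }
}
```

Types without an alias, such as DateTime or Point, simply keep their short name, which keeps the special knowledge confined to the lookup table.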

Additional knowledge is applied when an array is converted to a string, in which case we know that only the length and the series of items need to appear in the output. Similarly, objects implementing IEnumerable, e.g. List, Stack or ArrayList, can be converted to a string in almost the same way as a plain array. Specific knowledge can also be applied to key-value collections, e.g. Dictionary, Hashtable, SortedList, etc. The output in that case could resemble the array case, only with key values shown in place of numerical indices. For example, it could look like this:

SortedDictionary<int, Point> = {
 |-- Item[0] = {
 |    |-- int Key=28
 |    +-- Point Value { bool IsEmpty=false, int X=31, int Y=68 } }
 |-- Item[1] = {
 |    |-- int Key=255
 |    +-- Point Value { bool IsEmpty=false, int X=43, int Y=4 } }
 |-- Item[2] = {
 |    |-- int Key=272
 |    +-- Point Value { bool IsEmpty=false, int X=95, int Y=67 } }
 |-- Item[3] = {
 |    |-- int Key=516
 |    +-- Point Value { bool IsEmpty=false, int X=35, int Y=54 } }
 |-- Item[4] = {
 |    |-- int Key=695
 |    +-- Point Value { bool IsEmpty=false, int X=71, int Y=34 } }
 |-- Item[5] = {
 |    |-- int Key=743
 |    +-- Point Value { bool IsEmpty=false, int X=84, int Y=5 } }
 |-- Item[6] = {
 |    |-- int Key=940
 |    +-- Point Value { bool IsEmpty=false, int X=38, int Y=14 } }
 |-- Item[7] = {
 |    |-- int Key=984
 |    +-- Point Value { bool IsEmpty=false, int X=96, int Y=23 } }
 +-- Item[8] = {
      |-- int Key=998
      +-- Point Value { bool IsEmpty=false, int X=56, int Y=62 } } }
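A rough single-line sketch of such a key-value conversion is given here. It assumes the collection implements the non-generic System.Collections.IDictionary interface, which Dictionary, Hashtable, SortedList and SortedDictionary all do. The class name KeyValueSketch is hypothetical, and keys and values are printed with their own ToString rather than by a full formatter.

```csharp
using System;
using System.Collections;
using System.Text;

// Hypothetical converter for key-value collections, single-line form only.
public static class KeyValueSketch
{
    public static string Format(IDictionary map)
    {
        StringBuilder sb = new StringBuilder("{");
        int index = 0;

        // Enumerating a non-generic IDictionary yields DictionaryEntry values.
        foreach (DictionaryEntry entry in map)
        {
            sb.Append(index == 0 ? " " : ", ");
            sb.Append("Item[").Append(index).Append("] = { Key=");
            sb.Append(entry.Key);
            sb.Append(", Value=");
            sb.Append(entry.Value);
            sb.Append(" }");
            index++;
        }

        sb.Append(" }");
        return sb.ToString();
    }
}
```

For unordered collections such as Hashtable the item order is whatever the collection's enumerator produces; sorted collections give a stable order.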

These examples show that some knowledge about data types pays off. One more, somewhat vaguer, kind of knowledge has already been used in the multi-line examples: the occasional decision to inline an object's representation rather than spread it across several lines. Scalar types are inlined by default. Complex types, however, consist of multiple values, which are formatted according to properties such as FieldDelimiter or IndentationString. The solution is to attempt to override these formatting strings with single-line ones and, if the resulting string is acceptably short, to consider the work done. Otherwise, if the single-line attempt produces a line that is too long, we give up and format the object according to the original formatting strings.

The inlined form is more readable, as the examples above show. But inlining a large structure such as Rectangle would produce quite unreadable output. The inlining rule can be stated like this: an instance is inlined in the output if doing so produces a short string. Inlining an instance means converting it to a compact single-line string regardless of the actual values of the formatting properties. Consequently, any formatter will use the single-line formatter to inline objects when possible, effectively overriding the settings made by the caller.

Having come this far, we have prepared the ground for designing the solution. Converting an object to a string will be a recurrent, though not necessarily recursive, operation. At every step one object is processed by printing its type, name (if available) and contents. Processing the contents of an object sometimes pushes other, contained objects onto the stack of objects not yet processed.

At this point we must stop for a moment to discuss an anomaly that may occur as this recursion unfolds. The object model may contain loops: the object at hand contains a reference to an object which contains a reference, and so on, until a reference to the first object is reached. An attempt to print such an object would cause an infinite loop unless the situation is recognized and the loop broken. To resolve the issue, we can keep a stack of all reference type instances visited in the current chain of links, and stop once we reach an instance that is already on the stack.
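The loop check can be illustrated on a toy linked structure. The Node class, the CycleGuardSketch class and the "<loop to ...>" marker are all hypothetical; the sketch only demonstrates the idea of keeping the current chain of visited instances and comparing by reference, not by Equals.

```csharp
using System;
using System.Collections.Generic;

// Toy reference type used only to build a loop.
public class Node
{
    public string Name;
    public Node Next;
}

public static class CycleGuardSketch
{
    public static string Describe(Node start)
    {
        // The list plays the role of the stack of visited instances.
        return Describe(start, new List<object>());
    }

    private static string Describe(Node node, List<object> chain)
    {
        if (node == null)
            return "null";

        // Reference comparison: two distinct but equal objects are not a loop.
        foreach (object visited in chain)
        {
            if (object.ReferenceEquals(visited, node))
                return "<loop to " + node.Name + ">";
        }

        chain.Add(node);                     // push before descending
        string text = node.Name + " -> " + Describe(node.Next, chain);
        chain.RemoveAt(chain.Count - 1);     // pop when this branch is done
        return text;
    }
}
```

Because an instance is popped once its branch is finished, the same object may still be printed twice when it is reachable through two separate branches; only a true loop in the current chain is cut.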

In addition to this measure, we can limit the maximum depth of any instance sent to the output string. That lets the caller indirectly control the total size of the resulting string, which is an important goal. In many applications, printing objects more than, say, three references deep makes the output harder to read rather than helping the user find important information. Log files are the opposite case: there we probably want to see the full contents of an object, however complex, because whoever reads a log to find an anomaly in program execution cares more about finding a particular piece of information than about saving time, and the analysis may fail altogether if some object was only partially written to the log.
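A depth limit can be sketched on nested object arrays, which stand in here for arbitrary contained objects. The class name DepthLimitSketch, the maxDepth parameter and the "{...}" placeholder are assumptions for illustration; the article does not say how truncated content is marked.

```csharp
using System;
using System.Globalization;

public static class DepthLimitSketch
{
    // maxDepth counts how many levels of containers may still be opened.
    public static string Format(object value, int maxDepth)
    {
        object[] items = value as object[];
        if (items == null)
            return Convert.ToString(value, CultureInfo.InvariantCulture);

        // Container reached but no depth left: emit a placeholder instead.
        if (maxDepth <= 0)
            return "{...}";

        string[] parts = new string[items.Length];
        for (int i = 0; i < items.Length; i++)
            parts[i] = Format(items[i], maxDepth - 1);

        return "{ " + string.Join(", ", parts) + " }";
    }
}
```

A caller that writes to a log file could pass int.MaxValue to get the full content, while a user interface might pass a small number such as 3.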

To this list of wishes we can add an option to cancel the conversion once the resulting string grows longer than a specified length. This somewhat exotic feature can then be used to implement inlining. At every step we try to convert the current object using the single-line formatter, limiting its output to, say, 80 characters. If that attempt fails, i.e. exceeds the length limit, the object is not inlined; it is instead converted using the current formatter's settings. Inlining is not attempted at all if the current formatter is already single-line, in which case the string is simply formatted using the current format and settings.
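The inlining attempt with a length limit might look like the sketch below. It works on already formatted member strings to stay short; the names InlineSketch, TryFormatSingleLine and maxInlineLength are hypothetical, and the multi-line fallback uses hard-coded tree strings and "\n" instead of the configurable properties, purely to keep the example deterministic.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class InlineSketch
{
    // Single-line attempt that cancels itself once the limit is exceeded.
    public static bool TryFormatSingleLine(IList<string> items, int maxLength, out string result)
    {
        StringBuilder sb = new StringBuilder("{ ");
        for (int i = 0; i < items.Count; i++)
        {
            if (i > 0)
                sb.Append(", ");
            sb.Append(items[i]);

            if (sb.Length > maxLength)
            {
                result = null;      // too long already, stop early
                return false;
            }
        }

        sb.Append(" }");
        result = sb.Length <= maxLength ? sb.ToString() : null;
        return result != null;
    }

    public static string Format(IList<string> items, int maxInlineLength)
    {
        string inline;
        if (TryFormatSingleLine(items, maxInlineLength, out inline))
            return inline;

        // Fallback: one item per line, last item marked differently.
        StringBuilder sb = new StringBuilder("{");
        for (int i = 0; i < items.Count; i++)
        {
            sb.Append("\n");
            sb.Append(i == items.Count - 1 ? " +-- " : " |-- ");
            sb.Append(items[i]);
        }
        sb.Append(" }");
        return sb.ToString();
    }
}
```

The early return inside the loop is what makes the attempt cheap for large objects: the single-line conversion stops as soon as it is clear that the limit cannot be met.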

Control Flow

Now we will outline the control flow of the formatter once an object is at hand and a string must be produced to represent it. There are obviously many ways to convert an object to a string, each possibly producing a different result, and some results are preferable to others. So we need to decide which conversion method is most appropriate for each particular object we may encounter.

Consequently, we need to devise a set of conversion routines, each able to decide whether it is applicable to a given data type. If applicable, the routine performs the conversion and returns a status saying so. Otherwise it leaves the object unconverted and returns a status indicating that nothing was done. These routines are organized into a series so that the simplest ones are invoked first, followed by more and more advanced ones. The last routine is so general that it can convert any data type, no matter how complex. The chain of invocations ends when the first routine returns a success status, which ensures that the object has been converted by exactly one routine: the first one that claimed to be capable of doing so.
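The chain of routines described above can be sketched with a delegate that returns a success status and hands back the text through an out parameter. The delegate TryConvert, the class ConverterChainSketch and the three routines are hypothetical and much cruder than a real formatter; the point is only the "first success wins" dispatch.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Globalization;

// Status-returning conversion routine: true means "I handled it".
public delegate bool TryConvert(object value, out string text);

public static class ConverterChainSketch
{
    private static bool TryScalar(object value, out string text)
    {
        if (value == null) { text = "null"; return true; }
        if (value is string) { text = "\"" + value + "\""; return true; }
        if (value.GetType().IsPrimitive || value is Enum)
        {
            text = Convert.ToString(value, CultureInfo.InvariantCulture);
            return true;
        }
        text = null;
        return false;
    }

    private static bool TryEnumerable(object value, out string text)
    {
        IEnumerable sequence = value as IEnumerable;
        if (sequence == null) { text = null; return false; }

        List<string> parts = new List<string>();
        foreach (object item in sequence)
            parts.Add(Format(item));        // each item goes through the whole chain again

        text = "{ " + string.Join(", ", parts.ToArray()) + " }";
        return true;
    }

    private static bool TryGeneral(object value, out string text)
    {
        // Stand-in for the most general routine: it never refuses.
        text = value.GetType().Name + " {...}";
        return true;
    }

    // Order matters: specific routines first, the general one last.
    private static readonly TryConvert[] Chain = { TryScalar, TryEnumerable, TryGeneral };

    public static string Format(object value)
    {
        foreach (TryConvert routine in Chain)
        {
            string text;
            if (routine(value, out text))
                return text;
        }
        throw new InvalidOperationException("The general routine should always succeed.");
    }
}
```

The ordering is visible with strings: a string also implements IEnumerable, but because the scalar routine runs first it is printed as one quoted value rather than as a list of characters.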

The converters are invoked, as we said, from the simple ones towards the more complex ones or, more precisely, from more specific to more general formatters. Hence we first deal with scalar types such as int, bool, enumerations and DateTime, and with string. Next, key-value collections such as Dictionary or Hashtable are tried. Matrices and jagged arrays of simple types come next, because they can be formatted in a familiar grid-like way. Further down the line, arrays of simple types are processed, for similar reasons as matrices. After that, general (multidimensional) arrays are tested, regardless of the types of their items. If that attempt fails too, IEnumerable implementations follow; this converter simply enumerates the contents of the given object, ignoring any other properties it might expose. If that attempt also fails, leaving the object unprocessed, the most general method is applied, which possesses no specific knowledge about data types. This last formatter is the one with the heavy axe: it knows nothing about any particular data type other than the fact that data types expose properties and fields that can be read. So the general formatter iterates through the contents of the object and recursively applies the root formatter to each of them. This guarantees that each contained value is again formatted in the best possible way.
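The most general formatter can be approximated with reflection. In this sketch the Sample class, the GeneralFormatterSketch class and the root parameter are hypothetical; root stands for the root formatter that is applied to each contained value. Only public instance fields and readable, non-indexed public properties are visited, which is an assumption about what "contents of the object" covers.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Toy type with one public field and one public property.
public class Sample
{
    public int Id = 7;
    public string Name { get; set; }
}

public static class GeneralFormatterSketch
{
    public static string Format(object value, Func<object, string> root)
    {
        Type type = value.GetType();
        BindingFlags flags = BindingFlags.Public | BindingFlags.Instance;
        List<string> parts = new List<string>();

        foreach (FieldInfo field in type.GetFields(flags))
            parts.Add(field.Name + "=" + root(field.GetValue(value)));

        foreach (PropertyInfo property in type.GetProperties(flags))
        {
            // Skip write-only properties and indexers such as this[int].
            if (!property.CanRead || property.GetIndexParameters().Length > 0)
                continue;
            parts.Add(property.Name + "=" + root(property.GetValue(value, null)));
        }

        return type.Name + " { " + string.Join(", ", parts.ToArray()) + " }";
    }
}
```

Passing the root formatter as a delegate keeps the general routine free of any type-specific knowledge: every member value is handed back to the full chain, so a contained list or dictionary is still formatted by its specialized converter. Reflection does not promise a particular member order, so a real implementation may want to sort members by name.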

This article continues in General Formatter for .NET 3/4: Implementation.