Application Of Unicode In .NET Core

Afzaal Ahmad Zeeshan

Introduction and Background

Previously, I wrote an article about the .NET Framework that covered 95 percent of Unicode programming there (using the C# language). Since my interest has shifted to .NET Core, I wanted to cover .NET Core on Linux and test how it works in that environment. I must admit, it didn't let me down. I was intrigued to find that the same workflow used on Windows also works on Linux to build applications that consume and process Unicode data.

I would, first of all, recommend that you go through my previous article, Reading and Writing Unicode Data in .NET, which covers everything you need to become proficient with Unicode on the .NET Framework.

This image was taken previously for my last article about Unicode. I am going to re-apply the same concepts to .NET Core and see how it responds to my code. You can expect to learn the following concepts in this article.

.NET Core and Unicode support - all on Linux.
ASP.NET Core and Unicode support - using local deployment. Both work the same way.
See how Linux file editors work.

So, let us get started.

.NET Core and Unicode support

.NET Core is a cross-platform solution by Microsoft that carries over most of what the .NET Framework had up its sleeve. To understand whether .NET Core supports Unicode, you first need to learn how .NET Core manages character data. If you read the documentation for the "char" type, you will find the statement, "Represents a character as a UTF-16 code unit". From this, we can see that a character (the building block of every String in .NET Core) is a Unicode-based value encoded as UTF-16 (LE). The following program prints some characters along with their integral values in decimal and in hexadecimal.

using System;

namespace ConsoleApplication
{
    public class Program
    {
        public static void Main(string[] args)
        {
            // Latin, Devanagari, Greek, and Cyrillic samples.
            char[] characters = new char[] { 'a', 'A', 'क', 'α', 'л' };
            Console.WriteLine("Character\t|ASCII value\t|Unicode value");
            foreach (var character in characters)
            {
                PrintCharacterInfo(character);
            }
        }

        // Prints the character, its UTF-16 code unit in decimal, and the same value as a \uXXXX escape.
        public static void PrintCharacterInfo(char character)
        {
            Console.WriteLine($"{character}\t\t|{((int)character)}\t\t|\\u{((int)character).ToString("X4")}");
        }
    }
}

The program above prints each character, its code value in decimal (the column labeled "ASCII value"), and its Unicode representation in hexadecimal. The result of the above program was the following.

As seen, the program shows the characters, the decimal values, and the Unicode values for each of them. There is a simple relationship between the two numeric columns that can be easily seen in the program above: the Unicode representation is the hexadecimal form of the same number. So, 0x61 in the Unicode column is the value 97 in the decimal column, both representing the "a" character. Strictly speaking, only the values from 0 to 127 are ASCII; Unicode keeps those 128 assignments unchanged, and characters such as 'क', 'α', and 'л' have no ASCII value at all.
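
To make the decimal and hexadecimal relationship concrete, here is a minimal sketch that converts between the two forms using only standard .NET calls. The class and method names (CodeUnitDemo, ToEscape, FromHex, IsAscii) are my own, chosen for illustration, and are not part of any library.

```csharp
using System;
using System.Globalization;

public static class CodeUnitDemo
{
    // Formats a char the same way the program above does: \u plus four hex digits.
    public static string ToEscape(char c)
    {
        return "\\u" + ((int)c).ToString("X4");
    }

    // Parses hex digits (for example "0915") back into the char they name.
    public static char FromHex(string hex)
    {
        return (char)int.Parse(hex, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
    }

    // Only code values 0 through 127 are part of ASCII.
    public static bool IsAscii(char c)
    {
        return c <= 127;
    }

    public static void Main()
    {
        Console.WriteLine(ToEscape('a'));          // \u0061
        Console.WriteLine((int)FromHex("0061"));   // 97
        Console.WriteLine(IsAscii('a'));           // True
        Console.WriteLine(IsAscii('\u0915'));      // False (Devanagari letter KA)
    }
}
```

The same number is simply written in two bases; nothing is converted between "ASCII" and "Unicode" along the way.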

One thing that I find interesting is that Linux terminals render Unicode glyphs natively, while the Windows Command Shell did not render them for me. The terminal that I am using is the KDE terminal, "Konsole". I tried the same on Ubuntu (gnome-terminal) and got the same results. This is interesting because, if your web applications and native applications need to show data in the terminal, you don't need to write a graphical application to render the results. Which reminds me that I can also show the ASP.NET Core results natively, in the terminal.

Linux is primarily a terminal-oriented environment, which is presumably why its terminals support Unicode characters natively. As I understand it, there was a great debate in the Linux community about how to provide Unicode support, and the upshot is that .NET Core is not the only layer that supports Unicode; the platform underneath it does too. So, when you open a stream to a file (fopen, for instance!), the stream gives you bytes, and on a modern distribution those bytes are conventionally interpreted as UTF-8 rather than ASCII.

Linux file handlers and text editor programs are also fine-tuned to work with various text encodings. There are numerous character encodings to select from. A few of these are shown below.

These options contain further options for the encoding schemes. For example, Unicode further contains UTF-8, UTF-16, etc. You need to select the same encoding when storing the data to disk and when reading it back; otherwise the characters will not survive the round trip.
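
As a rough illustration of why the encoding chosen for writing must match the one used for reading, the sketch below writes a Hindi word as UTF-8 and then reads the file back twice, once as UTF-8 and once as ASCII. The class name FileEncodingDemo and the helper RoundTrip are hypothetical names of mine; the file is a throwaway created with Path.GetTempFileName.

```csharp
using System;
using System.IO;
using System.Text;

public static class FileEncodingDemo
{
    // Writes text as UTF-8, then decodes the same file with the given encoding.
    public static string RoundTrip(string text, Encoding readEncoding)
    {
        // Throwaway file in the temp directory, for illustration only.
        var path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, text, new UTF8Encoding(false)); // UTF-8 without a BOM
            return File.ReadAllText(path, readEncoding);
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        var text = "हिंदी";
        // Same encoding on both sides: the text survives.
        Console.WriteLine(RoundTrip(text, Encoding.UTF8) == text);   // True
        // Mismatched encoding: every non-ASCII byte is replaced, so the text is lost.
        Console.WriteLine(RoundTrip(text, Encoding.ASCII) == text);  // False
    }
}
```

This is the same choice a text editor makes when it asks you to pick an encoding: the bytes on disk do not change, only their interpretation does.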

ASP.NET Core and Unicode

In the ASP.NET Core section, I want to walk you through using the terminal together with JSON data transfer from an ASP.NET Web API. We have already seen that most of our Unicode data renders natively in the terminal, without any additional graphics pack being installed. So, I am going to write an API program on the server that provides the language characters when needed (and as needed). The following is my ASP.NET Web API controller in ASP.NET Core:

using System.Collections.Generic;
using Microsoft.AspNetCore.Mvc;

namespace WebApplication.Controllers
{
    [Route("/api/data")]
    public class DataApi : Controller
    {
        // Sample text for each supported language, keyed by lowercase language name.
        private readonly Dictionary<string, string> source = new Dictionary<string, string>();

        public DataApi()
        {
            source.Add("arabic", "بِسْمِ اللهِ الرَّحْمٰنِ الرَّحِيْمِ");
            source.Add("urdu", "یونیکوڈ ڈیٹا میں اردو");
            source.Add("hindi", "यूनिकोड डेटा में हिंदी");
            source.Add("russian", "рцы слово твердо");
            source.Add("english", "Love for all, hatred for none!");
        }

        // GET /api/data returns every language with its sample text.
        [HttpGet]
        public Dictionary<string, string> GetData()
        {
            return source;
        }

        // GET /api/data/{language} returns the sample text for one language.
        [HttpGet]
        [Route("{language}")]
        public string GetItem(string language)
        {
            // The keys are lowercase, so normalize the route value before the lookup.
            string text;
            return source.TryGetValue(language.ToLowerInvariant(), out text)
                ? text
                : "Language not found.";
        }
    }
}

A very simple, yet beautiful solution to the problem is shown here. You might recognize this as the data that I used previously to support various languages and show how their glyphs work in WPF. This time, however, I am going to use .NET Core and the terminals of the Linux environment. The API will provide me with the language data when I call it. Start the project using:

$ dotnet run

Finally, open a web browser and send a request to "http://localhost:5000/api/data/arabic". You will get the following,

This is the result in the browser. Now, consider trying the same in the terminal, and saving the data as well. The following program does both: it renders the result on the terminal and saves the data to a file too,

using System;
using System.IO;
using System.Net.Http;

namespace ConsoleApplication
{
    public class Program
    {
        public static void Main(string[] args)
        {
            Worker();
        }

        public static void Worker()
        {
            Console.WriteLine("Languages: [ 'Arabic', 'English', 'Urdu', 'Hindi', 'Russian' ]");
            Console.WriteLine("Which language do you want to get the data for? ");
            var language = Console.ReadLine();

            using (var client = new HttpClient())
            {
                client.BaseAddress = new Uri("http://localhost:5000");

                // .Result blocks until the request completes; acceptable in a small console demo.
                var result = client.GetStringAsync($"/api/data/{language}").Result;
                var message = $"Result for '{language}' is '{result}'.";
                Console.WriteLine(message);

                // File.WriteAllText encodes the text as UTF-8 when no encoding is given.
                File.WriteAllText("/home/afzaal/Documents/DotnetCore/Unicode/downloaded.txt", message);
                Console.WriteLine("Data written to the text file.");
            }
        }
    }
}

Executing the above program would do the task that we are expecting it to do. Have a look at the results,

Why am I using "Urdu" only? The reason is simple. The other languages each have a script block of their own in Unicode, whereas there is no separate block for Urdu: it is written with the Arabic script block, which also carries the additional Persian and Urdu letters. I am not sure why Urdu has not been given more dedicated support by now, as a great number of people write in Urdu, especially in the Microsoft Office suite. But this is an off-topic discussion. What matters is that the data is shown in the terminal without any additional graphical layer. Another interesting thing to see is,

This is also rendered correctly in text editors. Of course, that isn't the interesting point. The interesting point is that the Unicode data was encoded as UTF-8 in the HTTP context. The ".NET Core uses UTF-16 LE encoding" rule only describes strings in memory; on the wire, ASP.NET Core encodes text responses as UTF-8 by default, which is what the web expects. If you try to read the data with a different encoding, the data might be corrupted or lost. This text editor (Kate) detected the encoding as UTF-8 and used it; had there been errors in that encoding, it would have shown an error message.

The web favors UTF-8 because it is a variable-width encoding: each code point takes between 1 and 4 bytes. UTF-16 is also variable-width, but its unit is 16 bits: code points up to U+FFFF take a single 16-bit code unit, and higher code points take a surrogate pair, that is, two 16-bit code units. Both encodings can represent every Unicode code point; they differ only in how many bytes they spend. English text stays in the realm where ASCII belongs and needs 1 byte per character in UTF-8, while the Urdu characters used here belong to the Arabic block and need 2 bytes each.
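
The size difference between the two encodings can be checked directly with Encoding.UTF8 and Encoding.Unicode (the .NET name for UTF-16 LE). The sketch below is only an illustration; the class name EncodingSizeDemo and its helper methods are hypothetical, and the sample characters are written as escapes so the source file's own encoding does not matter.

```csharp
using System;
using System.Text;

public static class EncodingSizeDemo
{
    // Number of bytes the string occupies when encoded as UTF-8.
    public static int Utf8Bytes(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    // Number of bytes the string occupies when encoded as UTF-16 LE.
    public static int Utf16Bytes(string s)
    {
        return Encoding.Unicode.GetByteCount(s);
    }

    public static void Main()
    {
        // U+0041 Latin A, U+06CC Farsi yeh (used in Urdu), U+0915 Devanagari ka,
        // U+1F600 an emoji outside the 16-bit range.
        var samples = new[] { "\u0041", "\u06CC", "\u0915", "\U0001F600" };
        foreach (var s in samples)
        {
            // UTF-8 sizes are 1, 2, 3, and 4 bytes; UTF-16 sizes are 2, 2, 2, and 4 bytes.
            // The last sample has Length 2 because it is stored as a surrogate pair.
            Console.WriteLine($"UTF-8: {Utf8Bytes(s)} bytes, UTF-16: {Utf16Bytes(s)} bytes, chars: {s.Length}");
        }
    }
}
```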

That said, Unicode is confusing, and you should leave the handling of bytes to the framework itself. Never handle the bytes yourself, unless you know what you are doing.
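
One way this advice plays out in practice: when bytes arrive in chunks (from a socket or a file buffer), a multi-byte character can be split across two chunks. The sketch below, with hypothetical names of my own (ChunkDecodingDemo, DecodeNaively, DecodeWithDecoder), contrasts decoding each chunk on its own with letting the framework's System.Text.Decoder carry the incomplete sequence over to the next call.

```csharp
using System;
using System.Text;

public static class ChunkDecodingDemo
{
    // Hand-rolled approach: decode every chunk independently.
    // A character split across two chunks is lost.
    public static string DecodeNaively(byte[][] chunks)
    {
        var sb = new StringBuilder();
        foreach (var chunk in chunks)
        {
            sb.Append(Encoding.UTF8.GetString(chunk));
        }
        return sb.ToString();
    }

    // Framework approach: a Decoder keeps incomplete byte sequences between calls.
    public static string DecodeWithDecoder(byte[][] chunks)
    {
        var decoder = Encoding.UTF8.GetDecoder();
        var sb = new StringBuilder();
        foreach (var chunk in chunks)
        {
            // GetMaxCharCount gives a buffer that is always large enough.
            var buffer = new char[Encoding.UTF8.GetMaxCharCount(chunk.Length)];
            int written = decoder.GetChars(chunk, 0, chunk.Length, buffer, 0);
            sb.Append(buffer, 0, written);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        var original = "\u0915"; // Devanagari ka, three bytes in UTF-8
        var bytes = Encoding.UTF8.GetBytes(original);
        // Split the three bytes so the character straddles two chunks.
        var chunks = new[] { new[] { bytes[0], bytes[1] }, new[] { bytes[2] } };

        Console.WriteLine(DecodeNaively(chunks) == original);      // False
        Console.WriteLine(DecodeWithDecoder(chunks) == original);  // True
    }
}
```

Higher-level types such as StreamReader do this bookkeeping for you, which is the practical meaning of leaving the bytes to the framework.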

Final words

In this post, you were given an overview of Unicode support on Linux in .NET Core. The framework uses the same concepts and technology as the .NET Framework. However, glyphs render better in Linux terminals, because those terminals are built to support Unicode characters in every locale. On Windows that is not the case: in my experience the Windows console does not render characters outside a limited set of code points (there is some support for non-English characters too). In these cases, graphical frameworks on Windows, such as Windows Forms or Windows Presentation Foundation, are used.

Since .NET Core is a cross-platform tool, we should consider both platforms. So, if you read this post, you now know that a program that works on Linux may not behave the same way on Windows (for example, when showing Unicode data in the terminal). In such cases, this post should help you plan the architecture of your application in a way that lets you show the data to your clients and customers on both platforms. .NET Core might also get a graphical framework soon. When that happens, it will be a Christmas before Christmas.