Encodings

In this blog post, I will explain how to play around with different types of encodings using C#. I will mainly concentrate on ASCII, UTF-8 and UTF-16.
A character encoding is a code that pairs each character from a given set with something else, such as a number.
Different types of encodings
ASCII - works on basic English characters. It is a 7-bit code that covers 128 characters. For example, in ASCII capital 'A' is represented by decimal 65 and small 'a' by 97.
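To see those numbers from C#, here is a minimal sketch using Encoding.ASCII from System.Text. The class name AsciiDemo and the helper ToAscii are just illustrative names.

```csharp
using System;
using System.Text;

static class AsciiDemo
{
    // Returns the ASCII byte value of every character in the text.
    public static byte[] ToAscii(string text) => Encoding.ASCII.GetBytes(text);

    static void Main()
    {
        byte[] bytes = ToAscii("Aa");
        // Prints the decimal values of 'A' and 'a'.
        Console.WriteLine(string.Join(", ", bytes)); // 65, 97
    }
}
```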
UNICODE - works on a much broader set of characters. Strictly speaking, Unicode is not an encoding but a standard: it assigns a number (a code point) to each character in most of the world's writing systems. It can be implemented by different character encodings, the most famous being UTF-8 and UTF-16. UTF stands for Unicode Transformation Format.
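The split between the standard and its encodings can be seen by taking one character and looking at its code point next to its encoded bytes. This is only a sketch: CodePointDemo, CodePoint and Hex are names made up for illustration. Note that in .NET, Encoding.Unicode means UTF-16 little-endian.

```csharp
using System;
using System.Text;

static class CodePointDemo
{
    // Formats the first code point of the text in the usual U+XXXX notation.
    public static string CodePoint(string text) =>
        "U+" + char.ConvertToUtf32(text, 0).ToString("X4");

    // Encodes the text and returns the bytes as hex, e.g. "E6-B0-B4".
    public static string Hex(Encoding encoding, string text) =>
        BitConverter.ToString(encoding.GetBytes(text));

    static void Main()
    {
        string s = "\u6C34"; // the character 水
        Console.WriteLine(CodePoint(s));                      // U+6C34
        Console.WriteLine(Hex(Encoding.UTF8, s));             // E6-B0-B4
        Console.WriteLine(Hex(Encoding.Unicode, s));          // 34-6C (UTF-16, little-endian)
        Console.WriteLine(Hex(Encoding.BigEndianUnicode, s)); // 6C-34 (UTF-16, big-endian)
    }
}
```

The code point is the same in every case; only the bytes chosen to carry it differ.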
Differences between UTF-8, UTF-16 and UTF-32
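  • UTF-8 uses at least 1 byte per character while UTF-16 uses at least 2, so files that are mostly ASCII text are generally smaller in UTF-8 than in UTF-16. For text dominated by characters that take 3 bytes in UTF-8 but 2 in UTF-16, such as Chinese, the opposite holds.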
  • UTF-8 is backward compatible with ASCII while UTF-16 is not. 
  • Both UTF-8 and UTF-16 are variable-length encodings. That is, it's not guaranteed that UTF-8 will always use 1 byte and UTF-16 will always use 2 bytes. UTF-8 uses 1 to 4 bytes per character, and UTF-16 uses 2 or 4 bytes.
  • UTF-32 is a fixed-length encoding. Every character takes up 4 bytes of space. The advantage of this is that byte arrays are easily indexable. That is, you know that the second character in the array will always start at the 5th byte, that is barray[4].
  • Because every character takes 4 bytes, UTF-32 is space inefficient.
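The byte counts in the list above can be checked with GetByteCount, and the indexing property of UTF-32 with a little arithmetic. A minimal sketch, assuming the little-endian byte order that Encoding.UTF32 produces; WidthDemo, Bytes and CodePointAt are illustrative names.

```csharp
using System;
using System.Text;

static class WidthDemo
{
    // Number of bytes the encoding needs for the text (no byte order mark included).
    public static int Bytes(Encoding encoding, string text) => encoding.GetByteCount(text);

    // Reads the code point at a character position from little-endian UTF-32 bytes.
    // Character n always starts at byte index n * 4.
    public static int CodePointAt(byte[] utf32, int position)
    {
        int i = position * 4;
        return utf32[i] | (utf32[i + 1] << 8) | (utf32[i + 2] << 16) | (utf32[i + 3] << 24);
    }

    static void Main()
    {
        // U+0041, U+00E9, U+6C34 and U+1F600: one sample for each UTF-8 length.
        string[] samples = { "A", "\u00E9", "\u6C34", "\U0001F600" };
        foreach (string s in samples)
        {
            Console.WriteLine(
                $"U+{char.ConvertToUtf32(s, 0):X4}: " +
                $"UTF-8 {Bytes(Encoding.UTF8, s)}, " +
                $"UTF-16 {Bytes(Encoding.Unicode, s)}, " +
                $"UTF-32 {Bytes(Encoding.UTF32, s)}");
        }

        byte[] utf32 = Encoding.UTF32.GetBytes("cat");
        // The second character (position 1) starts at utf32[4].
        Console.WriteLine((char)CodePointAt(utf32, 1)); // a
    }
}
```

For the four samples the UTF-8 counts are 1, 2, 3 and 4 bytes, the UTF-16 counts are 2, 2, 2 and 4 bytes, and UTF-32 is always 4.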
Base-64 Encoding
Base-64 is an encoding scheme that represents binary data as an ASCII string by translating it into a radix-64 representation. Base-64 is different from ASCII and Unicode in the sense that it is not used to represent text but to transport arbitrary bytes safely through channels that only handle text. To convert a string into a Base-64 encoded string, we do the following:
  • Let's take a string "cat". 
  • Converting it into ASCII chars will give us: 99, 97, 116
  • Represent them in binary: 01100011 01100001 01110100
  • Combine them: 011000110110000101110100
  • Pack them in groups of 6 bits: 011000 110110 000101 110100
  • Convert each group back to a number (24, 54, 5, 52) and look each number up in the Base-64 alphabet: "Y2F0"
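The steps above can be sketched in C# and compared against the built-in Convert.ToBase64String. This is a simplified illustration, not a full encoder: the hypothetical helper EncodeTriples only accepts input whose length is a multiple of 3 bytes, so it never has to deal with "=" padding.

```csharp
using System;
using System.Text;

static class Base64Demo
{
    // The standard Base-64 alphabet: index 0 is 'A', index 63 is '/'.
    const string Alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Manual encoder for inputs whose length is divisible by 3 (no padding handled).
    public static string EncodeTriples(byte[] data)
    {
        if (data.Length % 3 != 0)
            throw new ArgumentException("This sketch only handles lengths divisible by 3.");

        var sb = new StringBuilder();
        for (int i = 0; i < data.Length; i += 3)
        {
            // Combine three bytes into one 24-bit value.
            int bits = (data[i] << 16) | (data[i + 1] << 8) | data[i + 2];
            // Split it into four 6-bit groups and look each one up.
            sb.Append(Alphabet[(bits >> 18) & 0x3F]);
            sb.Append(Alphabet[(bits >> 12) & 0x3F]);
            sb.Append(Alphabet[(bits >> 6) & 0x3F]);
            sb.Append(Alphabet[bits & 0x3F]);
        }
        return sb.ToString();
    }

    static void Main()
    {
        byte[] bytes = Encoding.ASCII.GetBytes("cat"); // 99, 97, 116
        Console.WriteLine(EncodeTriples(bytes));          // Y2F0
        Console.WriteLine(Convert.ToBase64String(bytes)); // Y2F0
        // Decoding gives the original string back.
        Console.WriteLine(Encoding.ASCII.GetString(Convert.FromBase64String("Y2F0"))); // cat
    }
}
```

In real code Convert.ToBase64String and Convert.FromBase64String are the ones to use; the manual version is only there to mirror the bit-packing steps.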
Code Examples
Let's take a look at some code examples. First up, this is the code:
When I run it with the string "cat", these are the results:

When I run it with the string "水", these are the results:
In this case, ASCII cannot represent this character, so the results for ASCII, UTF-8 and Base-64 encoding are wrong: they are actually the results for the string "?".
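The "?" substitution is easy to reproduce. A minimal sketch, assuming the default replacement behaviour of Encoding.ASCII; the class name FallbackDemo and the round trip through ASCII are illustrative guesses at how a "?" can leak into the other results, not taken from the original listing.

```csharp
using System;
using System.Text;

static class FallbackDemo
{
    // Formats bytes as hex, e.g. "E6-B0-B4".
    public static string Hex(byte[] bytes) => BitConverter.ToString(bytes);

    static void Main()
    {
        string s = "\u6C34"; // the character 水

        // ASCII has no code for this character, so the encoder substitutes '?' (0x3F).
        byte[] ascii = Encoding.ASCII.GetBytes(s);
        Console.WriteLine(Hex(ascii));                    // 3F
        Console.WriteLine(Convert.ToBase64String(ascii)); // Pw==

        // Encoding the original string directly as UTF-8 keeps the character.
        Console.WriteLine(Hex(Encoding.UTF8.GetBytes(s))); // E6-B0-B4

        // Assumption: if the string passes through ASCII first, the damage sticks,
        // and every later encoding sees "?" instead of the original character.
        string lossy = Encoding.ASCII.GetString(ascii);
        Console.WriteLine(lossy);                              // ?
        Console.WriteLine(Hex(Encoding.UTF8.GetBytes(lossy))); // 3F
    }
}
```

Base-64 is only as good as the bytes it is given, so encoding the UTF-8 or UTF-16 bytes rather than the ASCII bytes avoids the loss.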
