Introduction to fonts and Unicode
19 October 2008
What is a font anyway?
Fonts, we spend all day looking at them. On paper, on the screen, on the walls and desks and posterboards, fonts, everywhere.
A font was originally a fount, but at some point we must have surrendered to the deranged Noah Webster. (Should we reverse this and call them founts?)
So let's break it down. We have a character (or grapheme if you want to be posh), such as the letter 'a' or a semi-colon. This is represented graphically as a glyph. So two different founts will both have the character 'a', but the glyphs will look a little different in each fount.
So characters are represented as glyphs, and a set of glyphs is a fount. For example, my web browser on my computer by default shows DejaVu Sans. DejaVu Sans is a fount.
DejaVu Sans is a sans serif fount, one of a family of founts. Other members include a serif version, a monospaced version and so on. The general family of founts with the same style (in this case DejaVu) is called a typeface.
Character encoding
            _____   _____  _____  _____
    /\     / ____| / ____||_   _||_   _|
   /  \   | (___  | |       | |    | |
  / /\ \   \___ \ | |       | |    | |
 / ____ \  ____) || |____  _| |_  _| |_
/_/    \_\|_____/  \_____||_____||_____|
A character encoding maps each character to a value that the computer understands. This encoding is then used to put the correct glyphs on the screen.
The character encoding ASCII, the American Standard Code for Information Interchange, was completed in 1963. ASCII maps each supported character to a number (which can be represented in decimal, hexadecimal, 7-bit binary and so on).
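To make that mapping concrete, here is a minimal sketch in Python (my choice of language for the examples here, nothing in ASCII itself requires it). The built-ins ord and chr convert between a character and its number; the helper name describe is made up for this illustration.

```python
def describe(ch):
    """Return the ASCII code of ch as (decimal, hex, 7-bit binary)."""
    code = ord(ch)  # character -> number
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return code, format(code, "02X"), format(code, "07b")


print(describe("a"))  # (97, '61', '1100001')
print(describe(";"))  # (59, '3B', '0111011')
print(chr(65))        # number -> character: A
```

The three values in each tuple are the same number written three ways, which is all "decimal, hexadecimal, 7-bit binary" really means.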
In ASCII, decimal numbers 0-31 were defined as control characters, most of which fell out of use pretty quickly. A few survive: number 10 is the line feed, still used as the Unix new line character (\n); number 13 is the carriage return (\r or ^M), which Windows puts before the line feed at the end of every line; and 7 is the system bell/alert (\a or ^G).
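A quick way to see those surviving control characters is to look at the numbers behind a string that contains them. This is only a sketch; the sample text is invented, and the escapes \a, \r and \n are Python's spellings of the bell, carriage return and line feed.

```python
# A made-up string ending in bell, carriage return, line feed.
text = "ding\a\r\n"

codes = [ord(ch) for ch in text]
print(codes)  # [100, 105, 110, 103, 7, 13, 10]
```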
Decimal number 32 (hex 20) is the space character. The printable characters then start at 33 (hex 21) with the exclamation mark and various punctuation and other marks (!"#$%&'()*+,-./). Next, 48 to 57 (hex 30 to 39) are the digits 0 to 9, then a few more punctuation marks (:;<=>?@) before capital A at decimal 65 (hex 41) through to capital Z at decimal 90 (hex 5A), a few more marks ([\]^_`) and then lowercase 'a' at decimal 97 (hex 61) through to lowercase 'z' at decimal 122 (hex 7A). Lastly, there is a final set of marks ({|}) ending with the tilde ~ at decimal 126 (hex 7E).
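If you would rather see the layout than take my word for it, a few lines of Python will print each block. The function name ascii_range is just something I made up for this sketch.

```python
def ascii_range(first, last):
    """Return the characters with codes first..last, inclusive."""
    return "".join(chr(code) for code in range(first, last + 1))


print(ascii_range(33, 47))    # !"#$%&'()*+,-./
print(ascii_range(48, 57))    # 0123456789
print(ascii_range(58, 64))    # :;<=>?@
print(ascii_range(65, 90))    # ABCDEFGHIJKLMNOPQRSTUVWXYZ
print(ascii_range(91, 96))    # [\]^_`
print(ascii_range(97, 122))   # abcdefghijklmnopqrstuvwxyz
print(ascii_range(123, 126))  # {|}~
```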
If you have ever listed files at the Unix, Linux or Mac OS X command line with the 'ls' command, you will have noticed that files beginning with capital letters are listed before those beginning with lowercase letters (at least in the traditional C locale). Now you know why: the capital letters are encoded before the lowercase letters.
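The same ordering shows up anywhere strings are sorted by their raw codes. A small sketch, with invented file names, using Python's sorted (which compares strings code by code unless told otherwise):

```python
# Hypothetical directory listing.
names = ["readme.txt", "Makefile", "notes", "TODO"]

# Plain sort: capitals (65-90) come before lowercase (97-122).
by_code = sorted(names)
print(by_code)  # ['Makefile', 'TODO', 'notes', 'readme.txt']

# Sorting on a lowercased key gives the order a human would expect.
by_letter = sorted(names, key=str.lower)
print(by_letter)  # ['Makefile', 'notes', 'readme.txt', 'TODO']
```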
So ASCII could represent almost all words in American English, except a few loan words considered pretentious and superfluous by those in technical control. This meant that other countries had to tweak the character set to enable them to create documents in their language, from the British adding the pound sterling symbol £ through to more extreme revisions for languages that have accents and other non-English characters.
Eventually an extra bit was added (from 7 to 8 bits), enabling most Western European languages to be more or less represented in the decimal range 128-255. This was standardised as ISO/IEC 8859, but is often still informally called ASCII. You can read the text of the standard (PDF) for all the gory details.
However, the world is a lot bigger than Western Europe, and there are lots of ancient languages with different characters. These were represented by redefining the printable characters within the 256 values of the 8-bit character set, meaning that different founts would represent different characters with the same decimal (and hex) number.
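Here is a sketch of that ambiguity using three parts of ISO/IEC 8859 that ship with Python as codecs (Latin-1, Cyrillic and Greek). The function name decode_everywhere is invented for the example; the byte value 0xE1 (decimal 225) is an arbitrary pick.

```python
def decode_everywhere(raw, codecs=("iso8859_1", "iso8859_5", "iso8859_7")):
    """Decode the same bytes under several 8-bit encodings."""
    return {codec: raw.decode(codec) for codec in codecs}


print(decode_everywhere(bytes([0xE1])))
# {'iso8859_1': 'á', 'iso8859_5': 'с', 'iso8859_7': 'α'}
```

One byte, three different letters: an accented Latin a, a Cyrillic es and a Greek alpha. Nothing in the file itself says which one the author meant.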
This causes complete chaos when you use more than one language in the same file or web page, and it requires the user to obtain a new font/fount for each new language. Even worse, different founts created by different companies for the same language would map the characters to different encodings. So for example, one person creates documents in say the SymbolGreek fount and another person creates documents in the TechniaGreek fount; to combine their work, one of them would have to re-key their documents in the other fount, or they would have to get a programmer like us to come and write a script to translate the document from one fount to another.
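The kind of translation script I mean can be sketched like this. To be clear, the two tables below are invented for illustration; they are not the real key layouts of SymbolGreek or TechniaGreek, and the names FOUNT_A, FOUNT_B, build_table and convert are all hypothetical.

```python
# Hypothetical layouts: which Greek letter each fount draws for a key.
FOUNT_A = {"a": "α", "b": "β", "q": "θ", "l": "λ"}
FOUNT_B = {"a": "α", "b": "β", "u": "θ", "k": "λ"}


def build_table(source, target):
    """Map each source key to the target key that draws the same letter."""
    key_for_letter = {letter: key for key, letter in target.items()}
    return str.maketrans(
        {key: key_for_letter[letter] for key, letter in source.items()}
    )


def convert(text, source, target):
    """Re-key text typed for the source fount so it reads the same in target."""
    return text.translate(build_table(source, target))


print(convert("qlab", FOUNT_A, FOUNT_B))  # ukab
```

A real script would need a complete table for every letter, accent and punctuation mark in both founts, which is exactly the tedious part.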
By the 1990s, everyone with any sense had had enough of this and decided to create one giant encoding to rule them all. This became 'Unicode'.
Unicode
In a future techno-utopia, every character in every writing system ever developed should be assigned a code, and every Unicode fount should be able to represent all of them.
In today's reality, the Unicode Consortium gives each character a code, over one hundred thousand of them so far. However, different founts have glyphs for different subsets of codes, so getting all the odd symbols you need is still sometimes a bit of a challenge. It is the Law of Leaky Abstractions.
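You can look up the code and official name of any character with Python's standard unicodedata module. A minimal sketch; the helper name code_point is my own, and the U+XXXX form is just the usual way of writing a Unicode code in hex.

```python
import unicodedata


def code_point(ch):
    """Format a character's Unicode code in the usual U+XXXX style."""
    return f"U+{ord(ch):04X}"


for ch in "a£α€":
    print(code_point(ch), unicodedata.name(ch))
# U+0061 LATIN SMALL LETTER A
# U+00A3 POUND SIGN
# U+03B1 GREEK SMALL LETTER ALPHA
# U+20AC EURO SIGN
```

Note that 'a' keeps its old ASCII number, 97 (hex 61). Unicode only tells you the code, though; whether your fount has a glyph for it is a separate question.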
Sadly, there is normally no inheritance or defaults. Back to Unicode utopia: in my opinion, when the Unicode Consortium assign a new code, they should add a default glyph to a default (public domain) glyph set. Third party Unicode founts could then inherit from that, overriding the glyphs they are interested in. If you use a code that your fount author has not provided a glyph for, then you get the default glyph, rather than a block or, worse, white space.
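The inheritance I have in mind could look something like the sketch below. Everything here is hypothetical: the two glyph tables are toy data, the strings stand in for real outline data, and collections.ChainMap is simply a convenient standard-library way to say "look here first, then fall back".

```python
from collections import ChainMap

# Hypothetical glyph tables, keyed by Unicode code.
default_fount = {0x61: "default a", 0x3B1: "default alpha", 0x20AC: "default euro"}
fancy_fount = {0x61: "fancy a"}  # this author only drew the letter a

# Lookups try fancy_fount first, then fall back to default_fount.
glyphs = ChainMap(fancy_fount, default_fount)


def glyph_for(ch):
    """Return the glyph for ch, or a placeholder if nobody has drawn one."""
    return glyphs.get(ord(ch), "missing-glyph block")


print(glyph_for("a"))  # fancy a
print(glyph_for("€"))  # default euro
print(glyph_for("ж"))  # missing-glyph block
```

With a complete default table, that last case would never happen for an assigned code.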
Of course you can swap out the Unicode founts very easily, that is part of the big idea, but if you have used a character without a glyph, then you have to do a lot of nasty regular expressions to hunt down and replace the blanks.
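For a flavour of the regular expression hunt, here is a minimal sketch. It assumes you already know which characters your fount covers; the covered set below is invented, and a real one would have to be read out of the fount file with a proper tool.

```python
import re

# Hypothetical coverage: the characters one imaginary fount has glyphs for.
covered = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,:£")

# A negated character class matches anything the fount cannot draw.
missing = re.compile("[^" + re.escape("".join(sorted(covered))) + "]")


def find_missing(text):
    """Return (position, character) pairs that would render as blanks."""
    return [(m.start(), m.group()) for m in missing.finditer(text)]


print(find_missing("Price: £5 or €6"))  # [(13, '€')]
```

Finding the blanks is the easy half. Deciding what to replace each one with is where the nastiness comes in.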


