Introduction to fonts and Unicode

19 October 2008

What is a font anyway?

Fonts, we spend all day looking at them. On paper, on the screen, on the walls and desks and posterboards, fonts, everywhere.

A font was originally a fount, but at some point we must have surrendered to the deranged Noah Webster. (Should we reverse this and call them founts?)

So let's break it down. We have a character (or grapheme if you want to be posh), such as the letter 'a' or a semi-colon. This is represented graphically as a glyph. So two different founts will both have the character 'a', but the glyphs will look a little different in each fount.

So characters are represented as glyphs, and a set of glyphs is a fount. For example, my web browser on my computer by default shows DejaVu Sans. DejaVu Sans is a fount.

DejaVu Sans is a sans serif fount, one of a family of founts. Other members include a serif version, a monospaced version and so on. The general family of founts with the same style (in this case DejaVu) is called a typeface.

Character encoding

            _____   _____  _____  _____
    /\     / ____| / ____||_   _||_   _|
   /  \   | (___  | |       | |    | |
  / /\ \   \___ \ | |       | |    | |
 / ____ \  ____) || |____  _| |_  _| |_
/_/    \_\|_____/  \_____||_____||_____|

A character encoding maps each character to a value that the computer understands. This encoding is then used to put the correct glyphs on the screen.

The character encoding ASCII, the American Standard Code for Information Interchange, was first published in 1963. ASCII maps each supported character to a number (which can be represented in decimal, hexadecimal, 7-bit binary and so on).
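As a quick sketch in Python (my choice of language here; the variable names are only for illustration), the built-in ord and chr functions show this mapping in both directions:

```python
# ord() gives the number behind a character; chr() goes the other way.
code = ord("a")

decimal_form = str(code)           # the number written in decimal
hex_form = format(code, "x")       # the same number in hexadecimal
binary_form = format(code, "07b")  # the same number as seven binary digits

print(decimal_form, hex_form, binary_form)
print(chr(code))
```

The number is the same in all three cases; only the way it is written down changes.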

So in ASCII, decimal numbers 0-31 were defined as control characters, most of which fell out of use pretty quickly. A few survive: number 10 is the line feed, still used as the Unix new line character (\n); number 13 is the carriage return (\r or ^M), which Windows puts before the line feed to end a line; and 7 is the system bell/alert (\a or ^G).
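A small Python sketch to check those control codes for yourself. The caret arithmetic is my own illustration of how the ^M style of name lines up with the numbers: take the code of the letter and subtract 64.

```python
# Escape sequences are just convenient names for control codes.
newline = ord("\n")          # line feed
carriage_return = ord("\r")  # carriage return
bell = ord("\a")             # bell/alert

# Caret notation: ^M means Control-M, which is the code for 'M' minus 64.
caret_m = ord("M") - 64
caret_g = ord("G") - 64

print(newline, carriage_return, bell, caret_m, caret_g)
```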

Decimal number 32 (hex 20) is the space character. The printable characters start at 33 (hex 21) with the exclamation mark, followed by various punctuation and other marks ("#$%&'()*+,-./). Numbers 48 to 57 (hex 30 to 39) are the digits 0 to 9, then come a few more punctuation marks (:;<=>?@) before capital A at decimal 65 (hex 41) through to capital Z at decimal 90 (hex 5A). A few more marks follow ([\]^_`), and then lowercase 'a' at decimal 97 (hex 61) through to lowercase 'z' at decimal 122 (hex 7A). Lastly comes a final set of marks ({|}), ending with the tilde ~ at decimal 126 (hex 7E). Decimal 127 (hex 7F) is one last control character, delete.
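If you would rather see the whole printable range than take my word for it, here is a minimal Python sketch that rebuilds it from the numbers and spot-checks a few of the landmarks mentioned above:

```python
# Build the printable ASCII range: 32 (space) up to and including 126 (tilde).
printable = "".join(chr(n) for n in range(32, 127))
print(printable)

# Spot-check a few landmarks: character, decimal code, hex code.
for ch in "!09AZaz~":
    print(ch, ord(ch), format(ord(ch), "X"))
```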

If you have ever listed files at the Unix, Linux or Mac OS X command line with the 'ls' command, you may have noticed that files beginning with capital letters are listed before those beginning with lowercase letters (at least when the locale sorts by raw character code, as the traditional C locale does). Now you know why: the capital letters are encoded before the lowercase letters.
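Python's plain sorted function compares strings by character code, so it shows the same effect. The file names here are made up for the example:

```python
# Made-up file names, purely for illustration.
filenames = ["readme.txt", "Makefile", "apple.py", "Zebra.c"]

# Plain sorting compares character codes, so capitals (65-90)
# come before lowercase letters (97-122).
by_code = sorted(filenames)

# Sorting on a lowercased key ignores case instead.
case_blind = sorted(filenames, key=str.lower)

print(by_code)
print(case_blind)
```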

So ASCII could represent almost all words in American English, except a few loan words considered pretentious and superfluous by those in technical control. This meant that other countries had to tweak the character set to enable them to create documents in their language, from the British adding the pound sterling symbol £ through to more extreme revisions for languages that have accents and other non-English characters.

Eventually an extra bit was added (from 7 to 8 bits), enabling most Western European languages to be more or less represented in the decimal range 128-255. This was standardised as ISO/IEC 8859, but is often still informally called ASCII. You can read the text of the standard (PDF) for all the gory details.
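To make the extra bit concrete, here is a small Python sketch using the pound sign. I am assuming the Western European part of the standard, which Python knows as the "iso-8859-1" codec:

```python
pound = "\u00a3"  # the pound sterling sign

# Plain 7-bit ASCII has no room for it.
try:
    pound.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

# The 8-bit encoding gives it a single byte in the upper half of the range.
latin1_bytes = pound.encode("iso-8859-1")

print(fits_in_ascii, latin1_bytes, latin1_bytes[0])
```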

The world is a lot bigger than Western Europe, however, and there are lots of ancient languages with different characters. These were represented by redefining the printable characters within the 256 values of the 8-bit character set, meaning that different founts would represent different characters with the same decimal (and hex) number.
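The same thing happened with the standardised 8-bit encodings, which makes for an easy demonstration. This Python sketch takes one byte and reads it under three parts of ISO/IEC 8859 (the choice of byte and of these three codecs is mine, just to illustrate the clash):

```python
import unicodedata

raw = bytes([0xE1])  # a single byte, decimal 225

# The same byte read under three different 8-bit encodings.
readings = {
    codec: raw.decode(codec)
    for codec in ("iso-8859-1", "iso-8859-5", "iso-8859-7")
}

for codec, ch in readings.items():
    # Print the official character name rather than the glyph itself.
    print(codec, unicodedata.name(ch))
```

One number, three different letters: Latin, Cyrillic and Greek.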

This causes complete chaos when you use more than one language in the same file or web page, and it requires the user to obtain a new font/fount for each new language. Even worse, different founts created by different companies for the same language would map the characters to different encodings. So for example, one person creates documents in say the SymbolGreek fount and another person creates documents in the TechniaGreek fount; to combine their work, one of them would have to re-key their documents in the other fount, or they would have to get a programmer like us to come and write a script to translate the document from one fount to another.
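For a flavour of what such a translation script looks like, here is a minimal Python sketch. The two layouts below are invented for illustration; I am not reproducing the real SymbolGreek or TechniaGreek tables, and the names FOUNT_A, FOUNT_B and convert are hypothetical.

```python
# HYPOTHETICAL layouts: which typed character each fount draws as which Greek letter.
FOUNT_A = {"a": "alpha", "b": "beta", "g": "gamma"}
FOUNT_B = {"A": "alpha", "V": "beta", "G": "gamma"}

# Invert the second layout so we can look up its character by letter name.
name_to_b = {name: ch for ch, name in FOUNT_B.items()}

# Build a character-for-character translation table from layout A to layout B.
table = str.maketrans({ch: name_to_b[name] for ch, name in FOUNT_A.items()})


def convert(text):
    """Re-key text typed for fount A so it displays correctly in fount B."""
    return text.translate(table)


print(convert("bag"))
```

Characters that are not in the table pass through untouched, which is usually what you want for spaces and punctuation.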

By the 1990s, everyone with any sense had had enough of this and decided to create one giant encoding to rule them all. This became 'Unicode'.

Unicode

In a future techno-utopia, every character in every writing system ever developed should be assigned a code, and every Unicode fount should be able to represent all of them.

In today's reality, the Unicode Consortium gives each character a code, over one hundred thousand of them so far. However, different founts have glyphs for different subsets of those codes, so getting all the odd symbols you need is still sometimes a bit of a challenge. It is the Law of Leaky Abstractions.
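Python ships with a copy of the Unicode character database in its unicodedata module, so you can look up the code and official name of any character. A small sketch, with sample characters of my own choosing:

```python
import unicodedata

# Sample characters: a, pound sign, Greek alpha, euro sign.
samples = ["a", "\u00a3", "\u03b1", "\u20ac"]

# Map each character's official Unicode name to its code in the usual U+ notation.
code_points = {unicodedata.name(ch): "U+%04X" % ord(ch) for ch in samples}

for name, code in code_points.items():
    print(code, name)
```

Note that this only tells you the character has a code. Whether your fount has a glyph for it is a separate question.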

Sadly, there is normally no inheritance or defaults. Back to Unicode utopia: in my opinion, when the Unicode Consortium assign a new code, they should add a default glyph to a default (public domain) fount. The third party Unicode founts could then inherit from that, overriding the glyphs they are interested in. If you use a code that your fount author has not provided a glyph for, then you get the default glyph, rather than a block, or worse, a white-space.
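To be clear, this is wishful thinking and not how founts work today, but the inheritance idea can be sketched in a few lines of Python. The glyph tables and the glyph_for function are hypothetical, and the glyphs are just descriptive strings standing in for real outlines:

```python
from collections import ChainMap

# HYPOTHETICAL glyph tables, keyed by Unicode code.
default_glyphs = {0x61: "default a", 0x3B1: "default alpha", 0x20AC: "default euro"}
my_fount_glyphs = {0x61: "fancy a"}  # this fount author only drew an 'a'

# Look in the third party fount first, then fall back to the default set.
fount = ChainMap(my_fount_glyphs, default_glyphs)


def glyph_for(ch):
    """Return the glyph for a character, or the dreaded block if nobody drew one."""
    return fount.get(ord(ch), "missing-glyph block")


print(glyph_for("a"))
print(glyph_for("\u03b1"))
print(glyph_for("\u4e2d"))
```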

Of course you can swap out the Unicode founts very easily, that is part of the big idea, but if you have used a character without a glyph, then you have to write a lot of nasty regular expressions to hunt down and replace the blanks.
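Here is a rough Python sketch of that kind of hunt. It assumes, purely for illustration, a fount that only covers printable ASCII and the pound sign; a real fount's coverage would have to come from the fount file itself, which this example does not attempt to read.

```python
import re

# HYPOTHETICAL coverage: printable ASCII (hex 20 to 7E) plus the pound sign (hex A3).
# The pattern matches anything outside that set, ignoring white-space.
UNCOVERED = re.compile(r"[^\x20-\x7E\xA3\s]")


def find_uncovered(text):
    """List (position, code) for every character the fount cannot draw."""
    return [(m.start(), "U+%04X" % ord(m.group())) for m in UNCOVERED.finditer(text)]


# Sample text containing a pound sign, a euro sign and a Greek alpha.
sample = "Price: \u00a35 or \u20ac6, angle \u03b1"
print(find_uncovered(sample))
```

Once you know the positions and codes, you can decide whether to replace the characters or go shopping for a better fount.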


About

Hello, my name is Zeth, I'll be your host here.

Command Line Warriors is about taking control of your own technology. It looks at our experiences of computing, especially using GNU/Linux, the Python programming language and the command-line, and at issues such as techno-ethics, best practices and whatever is cool now. If you take control of your technology then you are a Warrior too!

This site is your site too which means that you can contribute and get involved. You can leave comments using the facility provided. For me, the comments and discussions are by far the best part of the site. So please do have your say!
