Character Storage

ASCII

Characters are stored as a standardized set of binary patterns based on ASCII (the American Standard Code for Information Interchange), pronounced "askee". The original ASCII set is a 7-bit binary code.

Although ASCII values are not numbers, we often use the decimal equivalent of their binary patterns to represent them for convenience rather than writing out the lengthy binary patterns.

Decimal codes 0-31 and 127 are the control characters. They were originally designated to control movement of the print head and scrolling on text-only printers and teletypes (basically typewriters that send keyboard input to a computer and print computer output on paper rather than a screen). ASCII terminals such as the VT100, and later terminal emulators such as xterm, replaced teletypes and use these codes to move the cursor and scroll the screen in the same way.

Below are some of the more commonly used control characters.

Binary      Dec Symbol  Keyboard    C       Meaning
00000000    0   NUL     (Ctrl+@)    \0      Null character (string terminator in C)
00000100    4   EOT     (Ctrl+d)    \004    End of transmission
00000111    7   BEL     (Ctrl+g)    \007    Beep printer/terminal
00001000    8   BS      (Ctrl+h)    \b      Move head/cursor left
00001001    9   TAB     (Ctrl+i)    \t      Move head/cursor to next tab stop
00001010    10  LF      (Ctrl+j)    \n      Move cursor down, scroll paper up
00001100    12  FF      (Ctrl+l)    \f      Scroll to start of next page
00001101    13  CR      (Ctrl+m)    \r      Move head/cursor to left margin
01111111    127 DEL     -           \177    Delete char under cursor
            

CR+LF moves to the beginning of a new line. By default, the Unix terminal driver adds a CR to each LF on output, since we usually don't want to go down a line without also returning to the left margin. Terminal-based programs therefore need only send a newline, so a C program would write "Hello, world!\n" instead of "Hello, world!\n\r" or "Hello, world!\r\n".

Characters 32-126 are the printable characters, i.e. anything that has a font pattern (including a space character, which is just a printable character with all pixels off). Sending one of these to a terminal or teletype prints the character and moves the head/cursor to the right.

Binary      Decimal C
00100000    32      ' '
00100001    33      '!'
...
00110000    48      '0'
00110001    49      '1'
...
01000001    65      'A'
01000010    66      'B'
...
01100001    97      'a'     Note: Only 1 bit differs from 'A'
01100010    98      'b'
            

In C, we often use the int data type rather than char for scalar character variables to avoid promotions in arithmetic expressions. C promotes char values to int in many situations, so storing the character as an int to begin with can make the program slightly faster.

We do not use int for character arrays (strings), since the wasted space would generally add up to more than the value of the speed gains.

int     ch = 'A';
char    greeting[] = "Hello, world!\n";
            

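The content of ch in the program above will be 01000001, padded with leading zeros to the size of an int, which is usually 32 bits on modern computers, so 00000000000000000000000001000001. (In C, a character constant such as 'A' already has type int, so no conversion takes place in this assignment.)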

Note

The size of the C int type varies among different CPUs, but it is generally chosen to be an integer that can be processed with a single instruction (i.e. no larger than the CPU's word size). This allows the programmer to write portable code that runs at near-optimal speed on any CPU from a 16-bit microcontroller to a 64-bit server.

ISO

The International Organization for Standardization (ISO) extended the ASCII set to 8 bits to include non-English characters, and eventually non-Latin-based characters such as those found in many Asian languages. ISO Latin1 (ISO 8859-1) was for many years the most common 8-bit character set used in the West.

In addition to European letters, the ISO character sets also add some graphic characters such as mathematical symbols, smooth border lines, etc.

Unicode

Unicode is a computing industry standard for representing the characters in most of the world's writing systems. It currently consists of well over 100,000 characters from more than 150 scripts.

The Unicode Transformation Formats (UTF) provide ways to encode the Unicode character set using a stream of bytes. UTF-8 is the most commonly used format, and is backward-compatible with ASCII. UTF-8 encodes each of the Unicode characters using one to four bytes.

Practice

Note

Be sure to thoroughly review the instructions in Section 2, “Practice Problem Instructions”, before doing the practice problems below.

  1. What is ASCII?

  2. What is ISO Latin1?

  3. What is Unicode?