Characters are stored as a standardized set of binary patterns based on ASCII (the American Standard Code for Information Interchange), pronounced "askee". The original ASCII set is a 7-bit binary code.
Although ASCII values are not numbers, we often use the decimal equivalent of their binary patterns to represent them for convenience rather than writing out the lengthy binary patterns.
Decimal codes 0-31 and 127 are the control characters. They were originally designated to control movement of the print head and scrolling on text-only printers and teletypes (basically typewriters that send keyboard input to a computer and print computer output on paper rather than a screen). ASCII terminals such as the vt100 and xterm replaced teletypes and use these codes to move the cursor and scroll the screen in the same way.
Below are some of the more commonly used control characters.
Binary    Dec  Symbol  Keyboard  C     Meaning
00000000    0  NUL
00000100    4  EOT     (Ctrl+d)  \004  End of transmission
00000111    7  BEL     (Ctrl+g)  \007  Beep printer/terminal
00001000    8  BS      (Ctrl+h)  \b    Move head/cursor left
00001001    9  TAB     (Ctrl+i)  \t    Move head/cursor to next tab stop
00001010   10  LF      (Ctrl+j)  \n    Move cursor down, scroll paper up
00001100   12  FF      (Ctrl+l)  \f    Scroll to start of next page
00001101   13  CR      (Ctrl+m)  \r    Move head/cursor to left margin
01111111  127  DEL     -         \177  Delete char under cursor
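As a rough illustration of how the C escape sequences in the table map to decimal codes, the sketch below looks up the symbol for a control code. The function name ascii_control_name is hypothetical (it is not a standard library function), and the code assumes the compiler uses ASCII for its execution character set, which is the case on virtually all current systems.

```c
/*
 * Hypothetical helper: return the symbol for a few common ASCII
 * control codes, or "?" for any code not listed here.
 * Assumes an ASCII execution character set, so that '\n' == 10, etc.
 */
const char *ascii_control_name(int code)
{
    switch (code) {
    case 0:      return "NUL";   /* All bits zero */
    case '\004': return "EOT";   /* Octal 004 = decimal 4 */
    case '\007': return "BEL";   /* Octal 007 = decimal 7 */
    case '\b':   return "BS";    /* Decimal 8 */
    case '\t':   return "TAB";   /* Decimal 9 */
    case '\n':   return "LF";    /* Decimal 10 */
    case '\f':   return "FF";    /* Decimal 12 */
    case '\r':   return "CR";    /* Decimal 13 */
    case '\177': return "DEL";   /* Octal 177 = decimal 127 */
    default:     return "?";
    }
}
```

For example, ascii_control_name(10) and ascii_control_name('\n') both yield "LF", since the escape sequence is just another way of writing the same code.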
CR+LF goes to the beginning of a new line. The Unix terminal driver adds a CR to each LF on output by default, since we usually don't want to go down a line without also going to the beginning of the new line. Terminal-based programs therefore need only send a newline, so a C program would write "Hello, world!\n" instead of "Hello, world!\n\r" or "Hello, world!\r\n".
Characters 32-126 are the printable characters, i.e. anything that has a font pattern (including a space character, which is just a printable character with all pixels off). Sending one of these to a terminal or teletype prints the character and moves the head/cursor to the right.
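The split between control and printable codes can be sketched with two small range checks. The function names below are made up for illustration; real programs would normally call iscntrl() and isprint() from <ctype.h>, which give the same answers for ASCII codes in the default "C" locale.

```c
/*
 * Hypothetical helpers that classify a 7-bit ASCII code using the
 * ranges described in the text.
 */

/* Control characters: decimal 0-31 and 127 */
int ascii_is_control(int code)
{
    return (code >= 0 && code <= 31) || code == 127;
}

/* Printable characters: decimal 32-126, including the space */
int ascii_is_printable(int code)
{
    return code >= 32 && code <= 126;
}
```

Every code from 0 through 127 falls into exactly one of the two groups, so a loop over that range could use these checks to decide whether to print a character or its decimal value.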
Binary    Decimal  C
00100000       32  ' '
00100001       33  '!'
...
00110000       48  '0'
00110001       49  '1'
...
01000001       65  'A'
01000010       66  'B'
...
01100001       97  'a'    Note: Only 1 bit differs from 'A'
01100010       98  'b'
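Because 'A' (01000001) and 'a' (01100001) differ only in one bit, case conversion in ASCII amounts to setting or clearing that bit. The following is a minimal sketch; the names ASCII_CASE_BIT, ascii_to_lower, ascii_to_upper, and ascii_digit_value are invented for this example, and real code would normally use tolower() and toupper() from <ctype.h>.

```c
/* The one bit that differs between 'A' and 'a': 00100000 = decimal 32 */
#define ASCII_CASE_BIT 0x20

/* Set the case bit to turn an upper case ASCII letter into lower case */
int ascii_to_lower(int ch)
{
    if (ch >= 'A' && ch <= 'Z')
        return ch | ASCII_CASE_BIT;
    return ch;      /* Not an upper case letter: leave unchanged */
}

/* Clear the case bit to turn a lower case ASCII letter into upper case */
int ascii_to_upper(int ch)
{
    if (ch >= 'a' && ch <= 'z')
        return ch & ~ASCII_CASE_BIT;
    return ch;      /* Not a lower case letter: leave unchanged */
}

/* The digits '0'-'9' are consecutive codes, so subtracting '0' (48)
   from a digit character gives its numeric value */
int ascii_digit_value(int ch)
{
    return ch - '0';
}
```

The range checks matter: applying the bit operation to a non-letter such as '!' would produce an unrelated character.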
In C, we often use the int data type rather than char for scalar character variables to avoid promotions in algebraic expressions. C promotes char values to int in many situations, so storing as an int to begin with will make the program slightly faster.
We do not use int for character arrays (strings), since the wasted space would generally add up to more than the value of the speed gains.
int     ch = 'A';
char    greeting[] = "Hello, world!\n";
The content of ch in the program above will be 01000001, padded on the left with zeros to the size of an int (the value is positive, so the upper bits are all 0). An int is usually 32 bits on modern computers, so ch contains 00000000000000000000000001000001.
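To see the bit pattern stored in an int, one possible approach is to build a string of '0' and '1' characters from its bits. The function name int_to_binary is hypothetical, and the sketch makes no assumption about the width of int beyond what sizeof reports.

```c
#include <limits.h>     /* CHAR_BIT: number of bits in a byte */

/*
 * Hypothetical helper: write the bits of value into buf, most
 * significant bit first, followed by a null terminator.
 * buf must have room for sizeof(int) * CHAR_BIT + 1 characters.
 */
void int_to_binary(int value, char *buf)
{
    /* Use an unsigned copy so that right shifts are well defined */
    unsigned int bits = (unsigned int)value;
    int nbits = (int)(sizeof(int) * CHAR_BIT);

    for (int i = 0; i < nbits; ++i)
        buf[i] = ((bits >> (nbits - 1 - i)) & 1u) ? '1' : '0';
    buf[nbits] = '\0';
}
```

Calling int_to_binary('A', buf) and printing buf with printf("%s\n", buf) should show a row of zeros ending in 01000001.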
The International Organization for Standardization (ISO) extended the ASCII set to 8 bits to include non-English characters, and eventually non-Latin-based characters such as those found in many Asian languages. ISO Latin-1 (ISO 8859-1) is the most common of these character sets in the West.
In addition to European letters, the ISO character sets also add some graphic characters such as mathematical symbols, smooth border lines, etc.
Unicode is a computing industry standard for representing the characters in most of the world's writing systems. It currently consists of over 100,000 characters from more than 90 writing scripts.
The Unicode Transformation Formats (UTF) provide ways to encode the Unicode character set using a stream of bytes. UTF-8 is the most commonly used format, and is backward-compatible with ASCII. UTF-8 encodes each of the Unicode characters using one to four bytes.
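The one-to-four-byte scheme can be sketched as a small encoder. The function name utf8_encode is an assumption for this example, not a standard C library call. It follows the usual UTF-8 bit layout: codes up to 0x7F are stored as a single byte identical to ASCII, and larger codes are split into 6-bit groups carried in continuation bytes of the form 10xxxxxx.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode the Unicode code point cp as UTF-8.
 * out must have room for 4 bytes.  Returns the number of bytes
 * written (1 to 4), or 0 if cp is not a valid code point.
 */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {
        /* 0xxxxxxx: same single byte as ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {
        /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {
        /* 0xD800-0xDFFF are reserved for UTF-16 surrogates */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;   /* Beyond the Unicode range */
}
```

With this sketch, 'A' (code 0x41) encodes to the single byte 0x41, which is why plain ASCII text is already valid UTF-8, while the euro sign (code 0x20AC) encodes to the three bytes 0xE2 0x82 0xAC.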
What is ASCII?
What is ISO Latin1?
What is Unicode?