Understand Endianness

The problem of Endianness is essentially a question about how computers store large numbers.

I do not fear computers. I fear lack of them.
Isaac Asimov (American writer and professor of biochemistry, best known for his hard science fiction)

We know that one basic memory unit holds one byte and that each memory unit has its own address. An integer larger than decimal 255 (0xff in hexadecimal) requires more than one memory unit. For example, 4660 is 0x1234 in hexadecimal and requires two bytes. Different computer systems store these two bytes in different orders. On a common PC, the least-significant byte 0x34 is stored in the low-address memory unit and the most-significant byte 0x12 in the high-address memory unit. On Sun workstations the opposite is true: 0x34 is in the high-address memory unit and 0x12 is in the low-address memory unit. The former arrangement is called Little Endian and the latter Big Endian.

How can you remember these two storage modes? It is quite simple. First, remember that the addresses of the memory units we are talking about are always arranged from low to high. For a multi-byte number, if the first byte you see, the one at the lowest address, is the least-significant byte, the system is Little Endian: the "little" end of the number sits at the low address. Big Endian is the opposite: the most-significant byte, the "big" end, comes first at the lowest address, and the least-significant byte ends up at the high address.
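The rule above can be sketched in a few lines of C. The two function names below are hypothetical helpers introduced only for illustration; each writes a 16-bit value into a two-element array whose index 0 stands for the low address and index 1 for the high address.

```c
#include <stdint.h>

/* Hypothetical helper: lay out a 16-bit value in Little Endian order.
 * out[0] plays the role of the low address, out[1] the high address. */
void layout_little_endian16(uint16_t value, unsigned char out[2])
{
    out[0] = (unsigned char)(value & 0xff);  /* least-significant byte first */
    out[1] = (unsigned char)(value >> 8);    /* most-significant byte last */
}

/* Hypothetical helper: lay out a 16-bit value in Big Endian order. */
void layout_big_endian16(uint16_t value, unsigned char out[2])
{
    out[0] = (unsigned char)(value >> 8);    /* most-significant byte first */
    out[1] = (unsigned char)(value & 0xff);  /* least-significant byte last */
}
```

For the value 0x1234, the first function produces 34 12 and the second produces 12 34, matching the PC and Sun workstation layouts described earlier.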

Program Example

To deepen our understanding of Endianness, let's look at the following example of a C program:

char a = 1;
char b = 2;
short c = 255; /* 0x00ff */
long d = 0x44332211;

On Intel 80x86-based systems, the memory contents corresponding to variables a, b, c, and d are shown in the following table:

Address Offset    Memory Content
0x0000            01 02 FF 00
0x0004            11 22 33 44

We can immediately tell that this system is Little Endian. For the 16-bit integer short c, we see the least-significant byte 0xff first, followed by 0x00. Similarly, for the 32-bit integer long d, the least-significant byte 0x11 is stored at the lowest address, 0x0004. On a Big Endian computer, the memory content would be 01 02 00 FF 44 33 22 11.
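To check a table like this on your own machine, you can dump the bytes of a variable in address order. The following is a minimal sketch; `format_bytes` is a hypothetical helper name, not a library function, and it simply walks the object through an `unsigned char` pointer starting at the lowest address.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format the n bytes starting at obj, lowest address
 * first, as uppercase hex pairs separated by single spaces.
 * out must have room for at least 3 * n characters (at least 1 when n is 0). */
void format_bytes(const void *obj, size_t n, char *out)
{
    const unsigned char *p = (const unsigned char *)obj;
    size_t i;

    out[0] = '\0';                       /* empty string when n == 0 */
    for (i = 0; i < n; i++) {
        if (i > 0)
            *out++ = ' ';                /* separator between bytes */
        out += sprintf(out, "%02X", p[i]);
    }
}
```

Calling `format_bytes(&d, sizeof d, text)` on the variable d from the example gives the byte sequence as it actually sits in memory. Note that on systems where long is 8 bytes wide, the dump is eight bytes long rather than four.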

At run time, every processor must operate in one of these two byte orders. The following is a short list of processor types and the Endian modes they support:

  • Pure Big Endian: Sun SPARC, Motorola 68000, Java Virtual Machine
  • Bi-Endian running Big Endian mode: MIPS with IRIX, PA-RISC, most Power and PowerPC systems
  • Bi-Endian running Little Endian mode: ARM, MIPS with Ultrix, most DEC Alpha, IA-64 with Linux
  • Little Endian: Intel x86, AMD64, DEC VAX

How can a program detect the Endianness of the local system? The following function can be called for a quick check. It reads the byte at the lowest address of an integer whose value is 1. If the return value is 1, the system is Little Endian; otherwise it is Big Endian.

int test_endian(void) {
    int x = 1;
    /* The byte at the lowest address of x is 1 on Little Endian, 0 on Big Endian */
    return *((char *)&x);
}

Network Order

Endianness is also important for computer communications. When a Little Endian system communicates with a Big Endian system, the receiver and the sender will interpret the data completely differently if the byte order is not handled properly. For example, for the variable d in the C program segment above, the Little Endian sender sends the four bytes 11 22 33 44, which the Big Endian receiver converts to the value 0x11223344. This is very different from the original value 0x44332211. To solve this problem, the TCP/IP protocol suite specifies a special "network byte order" (referred to as "network order"), which means that regardless of the Endianness of the computer system, the most-significant byte is always sent first when transmitting data. From this definition, we can see that network order corresponds to Big Endian.
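The mismatch described above can be reproduced without any networking. In this minimal sketch, `read_be32` and `read_le32` are hypothetical helper names; each rebuilds a 32-bit value from four received bytes using shifts, so the result does not depend on the Endianness of the machine running the code.

```c
#include <stdint.h>

/* Hypothetical helper: interpret 4 bytes as a Big Endian (network order)
 * 32-bit value, i.e. the first byte is the most significant. */
uint32_t read_be32(const unsigned char b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Hypothetical helper: interpret 4 bytes as a Little Endian 32-bit value,
 * i.e. the first byte is the least significant. */
uint32_t read_le32(const unsigned char b[4])
{
    return ((uint32_t)b[3] << 24) | ((uint32_t)b[2] << 16) |
           ((uint32_t)b[1] << 8)  |  (uint32_t)b[0];
}
```

Given the received bytes 11 22 33 44, `read_le32` recovers the sender's value 0x44332211, while `read_be32` yields 0x11223344, the value a Big Endian receiver would see.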

To avoid communication problems caused by Endianness and to help software developers write portable programs, some C preprocessing macros are defined for conversion between network byte order and local (host) byte order. htons() and htonl() convert from local byte order to network byte order; the former works on 16-bit unsigned numbers and the latter on 32-bit unsigned numbers. ntohs() and ntohl() implement the conversion in the opposite direction. A typical definition of these four macros looks like the following (on Linux systems they are available through the netinet/in.h file).

#if defined(BIG_ENDIAN) && !defined(LITTLE_ENDIAN)

#define htons(A) (A)
#define htonl(A) (A)
#define ntohs(A) (A)
#define ntohl(A) (A)

#elif defined(LITTLE_ENDIAN) && !defined(BIG_ENDIAN)

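/* Swap the two bytes of a 16-bit value */
#define htons(A) ((((uint16)(A) & 0xff00) >> 8) | \
                  (((uint16)(A) & 0x00ff) << 8))
/* Reverse the four bytes of a 32-bit value */
#define htonl(A) ((((uint32)(A) & 0xff000000) >> 24) | \
                  (((uint32)(A) & 0x00ff0000) >> 8)  | \
                  (((uint32)(A) & 0x0000ff00) << 8)  | \
                  (((uint32)(A) & 0x000000ff) << 24))
/* Byte swapping is its own inverse, so the reverse conversions reuse the same macros */
#define ntohs htons
#define ntohl htonl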

#else

#error "Either BIG_ENDIAN or LITTLE_ENDIAN must be #defined, but not both."

#endif
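In practice, programs usually call the versions the system provides rather than defining their own. The following is a minimal sketch of how a 32-bit value might be prepared for transmission and recovered on the other side, assuming a POSIX system where htonl() and ntohl() are declared in arpa/inet.h. The names `put_u32_network` and `get_u32_network` are hypothetical helpers, not part of any library.

```c
#include <arpa/inet.h>  /* htonl, ntohl (POSIX) */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: write value into buf in network byte order. */
void put_u32_network(uint32_t value, unsigned char buf[4])
{
    uint32_t wire = htonl(value);     /* host order -> network order */
    memcpy(buf, &wire, sizeof wire);  /* buf[0] now holds the most-significant byte */
}

/* Hypothetical helper: read a network-order value from buf into host order. */
uint32_t get_u32_network(const unsigned char buf[4])
{
    uint32_t wire;
    memcpy(&wire, buf, sizeof wire);  /* copy the raw bytes as received */
    return ntohl(wire);               /* network order -> host order */
}
```

Whatever the Endianness of the host, `put_u32_network(0x44332211, buf)` fills buf with 44 33 22 11, and `get_u32_network(buf)` returns the original value.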