How does C distinguish between a byte long character and a 2 byte long character?

I have this sample code:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void){
    printf("%lin",sizeof(char));
    char mytext[20];
    read(1,mytext,3);
    printf("%s",mytext);
    return 0;
}

First run:

koray@koray-VirtualBox:~$ ./a.out 
1
pp
pp
koray@koray-VirtualBox:~$ 

Well, I think this is all expected: ‘p’ is a 1-byte character defined in ASCII and I am reading 3 bytes (two ‘p’s and the line break). In the terminal, I again see the 2 characters.
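
To see the raw bytes, here is a minimal sketch (separate from the program above, my own illustration) that hex-dumps whatever read() stored in the buffer; typing pp and Enter should print 70 70 0a, i.e. two ASCII ‘p’ bytes plus the line feed:

#include <stdio.h>
#include <unistd.h>

int main(void){
    unsigned char buf[20];
    ssize_t n = read(0, buf, 3);      /* read up to 3 bytes from stdin */
    for (ssize_t i = 0; i < n; i++)
        printf("%02x ", buf[i]);      /* show each byte as hex */
    printf("\n");
    return 0;
}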

Now let’s try with a character that is 2 bytes long:

koray@koray-VirtualBox:~$ ./a.out 
1
ğ
ğ

What I do not understand is this: when I send the character ‘ğ’ to the memory pointed to by the mytext variable, 16 bits are written to that area. As ‘ğ’ is 11000100:10011111 in UTF-8, those two bytes are written.

My question is: when printing back to standard output, how does C (or should I say the kernel?) know that it should read 2 bytes and interpret them as 1 character instead of as two 1-byte characters?
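
As a side check, here is a minimal sketch (assuming the source file itself is saved as UTF-8) showing that the string literal "ğ" really does occupy two bytes, 0xc4 0x9f, and that strlen() counts bytes, not characters:

#include <stdio.h>
#include <string.h>

int main(void){
    const char *s = "ğ";                     /* UTF-8: 0xc4 0x9f */
    printf("strlen = %zu\n", strlen(s));     /* prints 2 */
    for (size_t i = 0; s[i] != '\0'; i++)
        printf("byte %zu: %02x\n", i, (unsigned char)s[i]);
    return 0;
}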

Answer

C doesn’t interpret it. Your program reads 2 bytes and outputs the same 2 bytes, without caring what characters (or anything else) they represent.

Your terminal encodes your input (here as UTF-8) and decodes your output back into the same two-byte character.
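
To illustrate the point, a minimal sketch (my own, not part of the original answer): it writes the two raw bytes 0xc4 0x9f plus a newline to standard output, and a UTF-8 terminal displays them as the single character ‘ğ’; the program itself never treats them as one character:

#include <unistd.h>

int main(void){
    unsigned char bytes[] = { 0xc4, 0x9f, '\n' };  /* UTF-8 encoding of 'ğ' + newline */
    write(1, bytes, sizeof bytes);                 /* fd 1 = stdout; just raw bytes */
    return 0;
}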
