How does C distinguish between a byte long character and a 2 byte long character?

I have this sample code:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void){
    printf("%lin",sizeof(char));
    char mytext[20];
    read(1,mytext,3);
    printf("%s",mytext);
    return 0;
}

First run:

koray@koray-VirtualBox:~$ ./a.out 
1
pp
pp
koray@koray-VirtualBox:~$ 

Well, I think this is all expected: ‘p’ is a 1-byte character defined in ASCII and I am reading 3 bytes (two ‘p’s and the line break). In the terminal, I again see the 2 characters.
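
To see the raw bytes, here is a minimal sketch (separate from the program above, my own illustration) that hex-dumps whatever read() stored in the buffer; typing pp and Enter should print 70 70 0a, i.e. two ASCII ‘p’ bytes plus the line feed:

#include <stdio.h>
#include <unistd.h>

int main(void){
    unsigned char buf[20];
    ssize_t n = read(0, buf, 3);      /* read up to 3 bytes from stdin */
    for (ssize_t i = 0; i < n; i++)
        printf("%02x ", buf[i]);      /* show each byte as hex */
    printf("\n");
    return 0;
}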

Now let’s try with a character that is 2 bytes long:

koray@koray-VirtualBox:~$ ./a.out 
1
ğ
ğ

What I do not understand is this: when I send the character ‘ğ’ to the memory pointed to by the mytext variable, 16 bits are written to that area. As ‘ğ’ is 11000100:10011111 in UTF-8, those two bytes are written.

My question is: when printing back to standard output, how does C (or should I say the kernel?) know that it should read 2 bytes and interpret them as 1 character instead of as two 1-byte characters?
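
As a side check, here is a minimal sketch (assuming the source file itself is saved as UTF-8) showing that the string literal "ğ" really does occupy two bytes, 0xc4 0x9f, and that strlen() counts bytes, not characters:

#include <stdio.h>
#include <string.h>

int main(void){
    const char *s = "ğ";                     /* UTF-8: 0xc4 0x9f */
    printf("strlen = %zu\n", strlen(s));     /* prints 2 */
    for (size_t i = 0; s[i] != '\0'; i++)
        printf("byte %zu: %02x\n", i, (unsigned char)s[i]);
    return 0;
}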

Answer

C doesn’t interpret it. Your program reads 2 bytes and outputs the same 2 bytes, without caring what characters (or anything else) they represent.

Your terminal encodes your input (here as UTF-8) and decodes your output back into the same two-byte character.
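
To illustrate the point, a minimal sketch (my own, not part of the original answer): it writes the two raw bytes 0xc4 0x9f plus a newline to standard output, and a UTF-8 terminal displays them as the single character ‘ğ’; the program itself never treats them as one character:

#include <unistd.h>

int main(void){
    unsigned char bytes[] = { 0xc4, 0x9f, '\n' };  /* UTF-8 encoding of 'ğ' + newline */
    write(1, bytes, sizeof bytes);                 /* fd 1 = stdout; just raw bytes */
    return 0;
}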
