fgets returns fewer characters

Question

I'm writing an assembly program for practice. The program uses the C library functions, and I'm concerned in particular with the fgets() function. The fgets manual page states that it reads "at most one less than size characters from stream". I have declared a buffer of 1024 bytes and used it in the fgets function to read text from a file, but the program is returning 1019 characters. It always seems to return

Accepted Answer

Transferring comments into an answer.

Run on Mac OS X, your code produces an output file of 1023 bytes according to ls -l. But my output file ends after 'KDE Software' (with a trailing blank), as you find. How are you establishing the file size on output? How sure are you of your counting? Does the problem appear with shorter buffer sizes (say 32 bytes); that is, is the output 5 bytes shorter than you thought it should be?

And then rici correctly noted:

"It is surely relevant that the sample text includes two instances of U+2014 EM DASH (—), whose UTF-8 encoding is e2 80 94."

That is highly probable, to the point of being certain. It explains why vim seemed to be misplacing the cursor when I used 1024|, which was confusing me: it is counting characters, not bytes. When I run wc -m on the Mac, I get 1019 (multi-byte) characters, but still 1023 bytes.

user1803784 observed:

"I used atom.io text editor to get the count and the error starts occurring at 256 bytes. I tried 128 bytes, 64 bytes, 32 bytes and the error does not occur; it returns 127 bytes, 63 bytes, 31 bytes respectively (as the manual page stated, 'at most one less than size characters from stream')."

Since the first '—' em-dash appears at offset 194, it appears that your problems are entirely related to 'bytes versus characters' and the fact that you're using UTF-8 encoded data. Treated as a pure stream of non-NUL (non-zero) bytes, you can read up to 1023 bytes into buff, and that's what your code is doing. However, if you count characters rather than bytes, you have two 3-byte characters (the two em-dash characters). Each one occupies 2 bytes more than a single-byte character, which means your character count is 4 less than your byte count: 1019 rather than 1023. You have just learned that your editor counts characters; programs such as ls report bytes.
The two numbers are, in general, different.

We can also observe that the 'characters' referred to by the quoted manual page are char-type characters, aka 'bytes' (on most systems these are 8-bit bytes, but there are machines where char is not 8 bits wide). The confusion stems in part from the C standard.

ISO/IEC 9899:2011 §7.21.7.2 The fgets function says:

"The fgets function reads at most one less than the number of characters specified by n from the stream pointed to by stream into the array pointed to by s. No additional characters are read after a new-line character (which is retained) or after end-of-file. A null character is written immediately after the last character read into the array."

By contrast, the POSIX specification of fgets() is written in terms of bytes:

"The fgets() function shall read bytes from stream into the array pointed to by s, until n-1 bytes are read, or a <newline> is read and transferred to s, or an end-of-file condition is encountered. The string is then terminated with a null byte."

The page is annotated with:

"The functionality described on this reference page is aligned with the ISO C standard. Any conflict between the requirements described here and the ISO C standard is unintentional. This volume of POSIX.1-2008 defers to the ISO C standard."

That is referencing ISO/IEC 9899:1999 because POSIX.1-2008 was published before C11, but the wording in C99 §7.19.7.2 is the same as in C11. Arguably, the POSIX wording is more exact (and more easily understood) than the C standard wording.
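The "at most n-1" limit in both quotations can be checked directly. What follows is a minimal sketch, not code from the question: bytes_from_one_fgets is a hypothetical helper that writes a run of single-byte characters (with no newline) to a temporary file and reports how many bytes a single fgets() call stores. It assumes tmpfile() is usable on the system.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not from the question): writes `total` copies of
 * the byte 'x', with no newline, to a temporary file, rewinds it, and
 * makes one fgets() call with a buffer size of `bufsize` bytes.
 * Returns the number of bytes stored before the terminating null byte,
 * or -1 on failure (including when fgets() returns NULL).
 */
long bytes_from_one_fgets(size_t total, size_t bufsize)
{
    char buf[4096];

    /* The size passed to fgets() must fit the local buffer. */
    assert(bufsize >= 2 && bufsize <= sizeof buf);

    FILE *fp = tmpfile();
    if (fp == NULL)
        return -1;

    for (size_t i = 0; i < total; i++) {
        if (fputc('x', fp) == EOF) {
            fclose(fp);
            return -1;
        }
    }
    rewind(fp);

    long result = -1;
    if (fgets(buf, (int)bufsize, fp) != NULL)
        result = (long)strlen(buf);   /* bytes read, excluding the '\0' */

    fclose(fp);
    return result;
}
```

With 2000 bytes in the file, bytes_from_one_fgets(2000, 1024) should return 1023 and bytes_from_one_fgets(2000, 32) should return 31, which matches the counts reported in the comments. The count is in bytes regardless of how the file is encoded.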
However, the definitions section of the standard says:

3.7 character
〈abstract〉 member of a set of elements used for the organization, control, or representation of data

3.7.1 character, single-byte character
〈C〉 bit representation that fits in a byte

3.7.2 multibyte character
sequence of one or more bytes representing a member of the extended character set of either the source or the execution environment
NOTE The extended character set is a superset of the basic character set.

3.7.3 wide character
value representable by an object of type wchar_t, capable of representing any character in the current locale

Thus, in context, 'character' means what most people think of as 'byte' (with caveats: not all machines have CHAR_BIT == 8).
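The gap between the byte count and the character count can be reproduced in a few lines of C. This is only a sketch under stated assumptions: utf8_char_count and utf8_extra_bytes are hypothetical helper names, and the input is assumed to be valid UTF-8 (nothing is validated). The idea is that every UTF-8 character has exactly one byte that is not a continuation byte (10xxxxxx), so counting those bytes counts characters in the sense an editor or wc -m would, while strlen() counts bytes as fgets() does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: counts UTF-8 encoded characters (code points) in
 * a null-terminated string by counting every byte that is not a
 * continuation byte. Continuation bytes have the bit pattern 10xxxxxx.
 * Assumes valid UTF-8; no validation is performed.
 */
size_t utf8_char_count(const char *s)
{
    assert(s != NULL);

    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/*
 * Hypothetical helper: how many more bytes than characters the string
 * contains, i.e. the difference between what ls and an editor report.
 */
size_t utf8_extra_bytes(const char *s)
{
    return strlen(s) - utf8_char_count(s);
}
```

For the string "one \xE2\x80\x94 two \xE2\x80\x94 three", which contains two em-dashes, strlen() gives 21 bytes and utf8_char_count() gives 17 characters. The difference of 4 is the same gap as between 1023 and 1019 in the question.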

Answer