I’m writing an assembly program for practice. The assembly program uses the C library functions. I’m concerned in particular with the fgets() function. The fgets manual page states:
fgets() reads in at most one less than size characters from stream and stores them into the buffer pointed to by s. Reading stops after an EOF or a newline. If a newline is read, it is stored into the buffer. A terminating null byte ('\0') is stored after the last character in the buffer.
I have declared a buffer of 1024 bytes and used it in the fgets function to read text from a file. But the program is returning 1019 characters. It always seems to return 5 fewer characters, so if I use a buffer of 1029 it will indeed return 1024 characters. I was wondering why the fgets function works in this manner, or is it my code? My program is as follows:
#include <stdio.h>

int main(){
    FILE *fopen(), *fp, *fp2;
    char buff[1024];
    fp = fopen("test.txt", "r");
    fgets(buff, 1024, (FILE*)fp);
    fp2 = fopen("outputtest.txt", "w");
    //fprintf(fp2, "This is testing for fprintf...\n");
    fputs(buff, fp2);
    fclose(fp);
    fclose(fp2);
}
The input does not contain any null byte or newline character around position 1020, so up to 1023 characters should be returned. The following is the input:
this a test file. The development of Linux is one of the most prominent examples of free and open-source software collaboration. The underlying source code may be used, modified and distributed—commercially or non-commercially—by anyone under the terms of its respective licenses, such as the GNU General Public License. Typically, Linux is packaged in a form known as a Linux distribution, for both desktop and server use. Some of the popular mainstream Linux distributions are Debian, Ubuntu, Linux Mint, Fedora, openSUSE, Arch Linux and Gentoo, together with commercial Red Hat Enterprise Linux and SUSE Linux Enterprise Server distributions. Linux distributions include the Linux kernel, supporting utilities and libraries, and usually a large amount of application software to fulfill the distribution’s intended use. Distributions oriented toward desktop use typically include X11, a Wayland implementation or Mir as the windowing system, and an accompanying desktop environment such as GNOME or the KDE Software Compilation; some distributions may also include a less resource-intensive desktop such as LXDE or Xfce. Distributions intended to run on servers may omit all graphical environments from the standard install, and instead include other software to set up and operate a solution stack such as LAMP. Because Linux is freely redistributable, anyone may create a distribution for any intended use.
The output is as follows:
this a test file. The development of Linux is one of the most prominent examples of free and open-source software collaboration. The underlying source code may be used, modified and distributed—commercially or non-commercially—by anyone under the terms of its respective licenses, such as the GNU General Public License. Typically, Linux is packaged in a form known as a Linux distribution, for both desktop and server use. Some of the popular mainstream Linux distributions are Debian, Ubuntu, Linux Mint, Fedora, openSUSE, Arch Linux and Gentoo, together with commercial Red Hat Enterprise Linux and SUSE Linux Enterprise Server distributions. Linux distributions include the Linux kernel, supporting utilities and libraries, and usually a large amount of application software to fulfill the distribution’s intended use. Distributions oriented toward desktop use typically include X11, a Wayland implementation or Mir as the windowing system, and an accompanying desktop environment such as GNOME or the KDE Software
The above ends with a space, which completes the 1019 characters returned. I was wondering what is causing this. My assembly program works, but of course the number of characters read isn’t the correct amount. Can someone explain to me why this is occurring?
Thanks in advance.
Answer
Transferring comments into an answer.
Run on Mac OS X, your code produces an output file of 1023 bytes according to ls -l. But my output file ends after ‘KDE Software’ (with a trailing blank), as you found. How are you establishing the file size on output? How sure are you of your counting? Does the problem appear with shorter buffer sizes (say 32 bytes); that is, is the output 5 bytes shorter than you thought it should be?
And then rici correctly noted:
It is surely relevant that the sample text includes two instances of U+2014 EM DASH (—), whose UTF-8 encoding is e2 80 94.
That is highly probable, to the point of being certain. It explains why vim seemed to be misplacing the cursor when I used 1024| (it is counting characters, not bytes), which was confusing me. When I run wc -m on the Mac, I get 1019 (multi-byte) characters, but still 1023 bytes.
I used the atom.io text editor to get the count, and the error starts occurring at 256 bytes. I tried 128 bytes, 64 bytes and 32 bytes, and the error does not occur: it returns 127 bytes, 63 bytes and 31 bytes respectively (as the manual page stated, “at most one less than size characters from stream”).
Since the first ‘—’ em-dash appears at offset 194, it appears that your problems are entirely related to ‘bytes versus characters’ and the fact that you’re using UTF-8 encoded data. Treated as a pure stream of non-NUL bytes, you can read up to 1023 bytes into buff, and that’s what your code is doing. However, if you count characters rather than bytes, you have two 3-byte characters (the two em-dash characters), each of which counts as one character but occupies three bytes, which means your character count is 4 less than your byte count. You have just learned that your editor counts characters; programs such as ls report bytes. The two numbers are, in general, different.
We can also observe that the ‘characters’ referred to by the quoted manual page are char-type characters, aka ‘bytes’ (on most systems; there are machines where char values are not 8-bit bytes). The confusion stems in part from the C standard.
ISO/IEC 9899:2011 §7.21.7.2 The fgets function says:
The fgets function reads at most one less than the number of characters specified by n from the stream pointed to by stream into the array pointed to by s. No additional characters are read after a new-line character (which is retained) or after end-of-file. A null character is written immediately after the last character read into the array.
Italic emphasis added
By contrast, the POSIX specification of fgets() says that fgets() is specified in terms of bytes:
The fgets() function shall read bytes from stream into the array pointed to by s, until n-1 bytes are read, or a <newline> is read and transferred to s, or an end-of-file condition is encountered. The string is then terminated with a null byte.
Italic emphasis added
The page is annotated with:
The functionality described on this reference page is aligned with the ISO C standard. Any conflict between the requirements described here and the ISO C standard is unintentional. This volume of POSIX.1-2008 defers to the ISO C standard.
That is referencing ISO/IEC 9899:1999 because POSIX.1-2008 was published before C11, but the wording in C99 §7.19.7.2 is the same as in C11. Arguably, the POSIX wording is more exact or accurate than the C standard wording. However, the definitions section of the C standard says:
3.7
1 character
〈abstract〉 member of a set of elements used for the organization, control, or representation of data
3.7.1
1 character
single-byte character
〈C〉 bit representation that fits in a byte
3.7.2
1 multibyte character
sequence of one or more bytes representing a member of the extended character set of either the source or the execution environment
2 NOTE The extended character set is a superset of the basic character set.
3.7.3
1 wide character
value representable by an object of type wchar_t, capable of representing any character in the current locale
Thus, in context, ‘character’ means what most people think of as ‘byte’ (with caveats: not all machines have CHAR_BIT == 8).