Is char Nice? Code That Doesn't Increment, and the Problematic char Type in C

Hello, I'm incompetent.

I was curious, so I tried it out.

https://blog.httrack.com/blog/2014/05/30/c-corner-cases-and-funny-things

First, I'll compile and run the code from the post linked above.

#include <stdio.h>

/** Return the next character within a \0-terminated string, or EOF. **/
int my_read_char(const char *buffer, size_t *offs) {
  if (buffer[*offs] != '\0') {
    return buffer[*offs++];  /* here's the trap */
  } else {
    return EOF;
  }
}

int main() {
  const char *buffer = "Hello, World!";
  size_t offs = 0;

  printf("Reading characters from buffer:\n");
  while (1) {
    int ch = my_read_char(buffer, &offs);
    if (ch == EOF) {
      break;
    }
    printf("Character read: %c, offs: %zu\n", ch, offs);
  }

  return 0;
}

Now I'll compile and run it.

gcc -o bugcode bugcode.c
./bugcode
Character read: H, offs: 0
Character read: H, offs: 0
Character read: H, offs: 0
Character read: H, offs: 0
Character read: H, offs: 0
Character read: H, offs: 0
Character read: H, offs: 0
Character read: H, offs: 0
Character read: H, offs: 0
Character read: H, offs: 0
Character read: H, offs: 0
Character read: H, offs: 0
Charact^C
alleycat:[haturatu]:~/clang$ 

Aaaah!!!

It's risky, but I'll redirect the output to a file named 'log' so I can look at the very start of the run.
Even though I quickly ran ps -ef | grep bugcode to find the process ID and killed it...

alleycat:[haturatu]:~/clang$ ls -la
total 1256824
drwxr-xr-x  2 haturatu haturatu       4096 Jun 14 23:38 .
drwx------ 39 haturatu haturatu       4096 Jun 14 23:38 ..
~
-rw-r--r-- 1 haturatu haturatu 1286922240 Jun 14 23:39 log

The file grew to an outrageous size, about 1.2 GB, in roughly a minute... This is dangerous code that could fill the disk and bring the system down. I couldn't open the file with view, but vim managed it.

alleycat:[haturatu]:~/clang$ head log
Reading characters from buffer:
Character read: H, offs: 0
Character read: H, offs: 0
Character read: H, offs: 0
Character read: H, offs: 0
Character read: H, offs: 0
Character read: H, offs: 0
Character read: H, offs: 0
Character read: H, offs: 0

offs was never incremented, right from the very first call.

I want to understand what this means, with the help of ChatGPT.

Distinction between EOF and \0
EOF (End Of File) is a macro that expands to a negative int, usually -1, and is returned to signal the end of input. On the other hand, '\0' is the null character that marks the end of a C string, and its value is 0.

I feel like I understood something there, but my head is getting muddled.
To calm down for a moment, I'll ask about the flow of this code as it stands. The function name in the answer has changed to my_read_char_with_cast, but please ignore that.

Flow Details

Initial State:
offs is 0.
The buffer is not null-terminated (no '\0').

Loop Start:
The my_read_char_with_cast function is called.
buffer[*offs] is 'H' (the first character), so it's not '\0'.
'H' is returned, and offs becomes 1.

Loop Continuation:
In the next call, buffer[1] ('e') is read and returned.
offs becomes 2.

Repetition:
This process continues.
Eventually, offs becomes 11, buffer[11] ('!') is read and returned.
offs becomes 12.

Infinite Loop:
The value of buffer[12] is undefined (referencing memory outside the buffer) and is not '\0'.
Since buffer[*offs] is never '\0', an infinite loop occurs without reaching EOF.

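Somehow, I'm even more confused now, because this explanation doesn't match the code. A string literal such as "Hello, World!" is always null-terminated: '!' is at index 12 and '\0' is at index 13. More importantly, the actual run looped forever on the very first character, so offs never got anywhere near 12. With some debug prints added: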

Returning character: H
Character read: H, offs: 0
my_read_char called with offs: 0
Returning character: H
Character read: H, offs: 0
my_read_char called with offs: 0
Returning character: H
Character read: H, o^C
alleycat:[haturatu]:~/clang$ 

As expected, offs remains 0.

Now, let's go back to the HTTrack blog.

This function returns a negative value for ASCII > 127. Specifically, for ASCII 255 (0xFF), it returns the value -1. This is also the value of EOF.

I see? Then why does it return -1? I'll ask the AI.

Special characteristics of the char type

Signed 8-bit integer: Whether plain char is signed or unsigned is implementation-defined. On many platforms (gcc on x86, for example) it is a signed 8-bit integer with a range of -128 to 127.
Values above 127: A byte with a bit pattern from 0x80 to 0xFF does not fit in that range as a positive number, so a signed char holds it as a negative value. In two's complement, 0xFF is -1.

Consideration regarding 255 (0xFF)

Outside the ASCII range: ASCII is a 7-bit code covering 0 to 127, so 255 (0xFF) is not an ASCII character at all; it is simply a byte value.
Relationship with EOF: In C, EOF is a negative int, usually -1, that signals end of input. When a signed char holding 0xFF is returned as an int, it is sign-extended to -1, which is the same value.

Meaning of the function's behavior

The function returns -1 for the byte 0xFF for the following reasons:

Out-of-range value: 255 is outside the range of a signed char, so the byte is read as -1.
Collision with EOF: The function uses EOF (-1) to signal the end of the buffer, so a 0xFF byte in the data is indistinguishable from that signal. This is an accidental collision, not a convention.
The caller therefore stops reading as soon as it meets a 0xFF byte, even if the string continues. Caution is required whenever bytes above 127 need to be handled correctly.

I see, a signed char can't hold 255 in the first place. The byte 0xFF comes out as -1, which is the same value as EOF, so the caller would stop reading too early. That is a separate problem from my infinite loop, though: it makes the loop end early rather than run forever.

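The corrected code as I copied it from the HTTrack site is below. However, when I compiled it with gcc, the infinite loop was still there. The cast only changes how the returned byte is widened to int; the index expression is still *offs++, so offs is still never advanced.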

/** Return the next character within a \0-terminated string, or EOF. **/
int my_read_char(const char *buffer, size_t *offs) {
  if (buffer[*offs] != '\0') {
    return (unsigned char) buffer[*offs++];
  } else {
    return EOF;
  }
}

The AI suggested this instead.

int my_read_char(const char *buffer, size_t *offs) {
  if (buffer[*offs] != '\0') {
    return buffer[(*offs)++];
  } else {
    return EOF;
  }
}

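In the end, the cast alone isn't enough: the parentheses in (*offs)++ are what make the counter itself increment. Without them, *offs++ is parsed as *(offs++), which moves the function's local copy of the pointer and leaves the caller's offs untouched.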

I asked again about the difference between these.

Details of the Difference

1. Function including cast to unsigned char
Reason for casting:
buffer is an array of char, and buffer[*offs] returns a char type value.
The char type is often a signed 8-bit integer, but this is implementation-dependent (it can be unsigned depending on the compiler or platform).
If a value in the string is negative, and not cast, the sign is preserved when the value is extended to an int type.
For example, if the value of buffer[*offs] is -1, it returns -1 if not cast, but returns 255 (0xFF) if cast.

2. Function without cast
Handling of signed characters:
If buffer[*offs] can have a negative value, the negative value is returned as is if not cast.
This can sometimes lead to unintended behavior.

So, what's the conclusion?

Which one should be used?
Function including cast:

When treating character data as binary or when you want to treat it as an unsigned value, casting to unsigned char is common.
This is especially important when dealing with non-ASCII characters (e.g., UTF-8 encoded data).
Function without cast:

When dealing only with ASCII characters or when it's intended to be treated as signed, it might not be a problem even without casting.

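This is quite a special case, isn't it?
One thing I had to get straight: in the latter version, buffer[(*offs)++] has nothing to do with treating the value as a char or making it unsigned. The parentheses only decide what gets incremented, namely the size_t that offs points to rather than the pointer itself. Whether the returned byte comes back signed is a separate question, and that is the one the cast to unsigned char answers.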

Differences in character encoding are quite difficult to visualize, and there are many traps around us, aren't there...? Well, that was interesting!
