Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wchar_t / unicode support broken? - initscr breaking putwc #53

Open
0xE70000000 opened this issue Aug 28, 2022 · 19 comments
Open

wchar_t / unicode support broken? - initscr breaking putwc #53

0xE70000000 opened this issue Aug 28, 2022 · 19 comments

Comments

@0xE70000000
Copy link

0xE70000000 commented Aug 28, 2022

Hi,

i've spent quite a bit of time trying to debug this issue and at this point i feel confident to say it might be a bug. If putwc is called with a valid codepoint (i've used 0xF6 for testing which should render as "ö") before initscr everything works as expected but anywhere after initscr non 7bit codepoints render as unknown characters (inverse color question marks) and as putwc is used by all functions printing wchar_t strings (for example waddwstr) this basically breaks unicode support. Locale settings (setlocale) don't seem to affect this at all (en_US.UTF-8, C or an invalid setting all result in the same behavior).

Trying to track down the cause i've come to the conclusion that it seems to be related to tputs/ti_puts calls (adding macros to curses_private.h turning those into noops keeps putwc functional but obviously otherwise breaks rendering badly). Sadly i wasn't able to pin this to a certain call as it seems at least multiple (or even all of them?) result in putwc breaking. Any kind of hint or pointer on to how to get this working would be greatly appreciated.

Testing was done on Devuan Ascii (a linux distribution pretty much identical to Debian 9.0 sans systemd). Curses version used is 0.3.2 release. I've also tried a handful of different terminal definitions which made no difference regarding putwc behavior (terminal used for most testing is sakura which is libvte based identifying as xterm-256color. Xterm (the application, not just the definition) gave the same results though.

@0xE70000000
Copy link
Author

0xE70000000 commented Sep 5, 2022

Small update: In the meantime i have rebuilt my application against ncursesw which resolved the issue with no code changes whatsoever besides including the appropriate header files. This is somewhat unfortunate as i would rather use netbsd-curses but given that even after extensive debugging i have failed to track down the cause of putwc failing it seems that is the best i can do for now.

@rofl0r
Copy link
Member

rofl0r commented Sep 5, 2022

hmm, interesting, i didnt get a github notification when you opened this issue 7 days ago.
anyway, did you try the 0.3.1 release ? it's what i'm still using and i haven't come across any issues so far. if you can confirm that 0.3.1 works we can try to bisect the commit that broke it.

@0xE70000000
Copy link
Author

0xE70000000 commented Sep 5, 2022

Sadly building against 0.3.1 does not seem to make a difference. Non 7bit characters still show up as inverse question marks. What might be somewhat interesting though is that adding borders generally works. I guess the characters used to render the border aren't 7bit either but then rendering those is done using "alternative charset mode" (i am not sure how that works). I've tried patching one of the codepoints that seemed to be assigned to the be the "alternate" representation of a 7bit character but the change was ignored so being able to render those might not mean much as it seems the association is hardcoded one way or another.

Edit: I have just tried all versions down to 0.0.4 which while somewhat differing in rendering all exhibit the same behavior regarding non 7bit wchars.

@rofl0r
Copy link
Member

rofl0r commented Sep 5, 2022

thanks for testing. could you come up with a minimal example program exposing the behaviour ?

@0xE70000000
Copy link
Author

0xE70000000 commented Sep 7, 2022

I've tried but it proved difficult. My application does not do a whole lot before initializing curses but even after stripping basically everything it still seems to do "something" that results in a different scenario than just calling putwc, initscr and then putwc again. The most similar behavior i managed to trigger was put putwc failing before and after initscr. Trying to replicate the original scenario i noticed a lot of weird behavior from putwc for no obvious reason though, so i decided to dig into glibc to see what putwc was actually doing and that's when things started making a lot more sense:

glibc handles wide characters pretty much separate from normal output. Down to using different buffers based on which kind of character is being written. Intermixing both kind of writes seems like a pretty bad idea with this logic and from what i saw in my testing i am almost certain doing so causes errors in regards to what is actually written to the screen (missing output - likely due to reading from the wrong internal buffer - and similar weird behavior).

There is also another thing which is probably the main cause of what i was seeing earlier. Glibc keeps kind of a "wide character state" on the FILE* handle (mostly stdout in this case). It's not obvious (especially in regards to stdout/stderr) where this is actually decided (it's possible to force it on open - i highly doubt this is done for the default descriptors though) but it seems possible for glibc to set this the fly when handling functions that write to the FILE* handle. Now this state contains a reference to a conversion function that deals with turning the wchar_t data into something the terminal understands. This state is set ONCE and never touched again, so if there is any output before a usable LC_CTYPE (which is queried it seems) is setup or it changes later on the conversion function reference on the file handle goes out of sync...

Now my theory is that my original code somehow triggers glibc to set the wide character conversion function to something that goes out of sync during initscr. Either that or the tputs/ti_puts calls using non wide writes simply confuse glibc's internal state.

I am somewhat short on time right now but i'll do a bit more testing to see if a call to freopen (which should reset the wide character state) has any effect. It also seems that fprintf( stdout, "%lc", wchar ) has a slightly more intelligent logic when it comes to intermixing character types. So if the freopen approach fails i'll probaly try to replace putwc with fprintf which would admittedly be kinda ugly but so seems glibc's approach at handling wide character output.

@0xE70000000
Copy link
Author

0xE70000000 commented Sep 7, 2022

I've had a little more time than i initially thought so i investigated a bit. Here is some code demonstrating glibc's confusion when mixing putwc with putc calls:

#include <stdio.h>
#include <locale.h>
#include <unistd.h>
#include ".build/include/curses.h"

// broken putwc output
void test1( void ) {
	fprintf( stderr, "locale = %s\n", setlocale( LC_ALL, "en_US.UTF-8" ) );
	putc( 'f', stdout );
	putc( '\n', stdout );
	putwc( 0xf6, stdout );
	putwc( '\n', stdout );
}

// putc output ignored
void test2( void ) {
	fprintf( stderr, "locale = %s\n", setlocale( LC_ALL, "en_US.UTF-8" ) );
	putwc( 0xf6, stdout );
	putwc( '\n', stdout );
	putc( 'f', stdout );
	putc( '\n', stdout );
}

// broken putwc after initscr (likely due to calls to putc)
void test3( void ) {
	fprintf( stderr, "locale = %s\n", setlocale( LC_ALL, "en_US.UTF-8" ) );
	initscr();
	putwc( 0xf6, stdout );
	putwc( '\n', stdout );
}

// surprisingly working putwc
// (maybe due to putwc before initscr setting stdout to wide orientation?)
void test4( void ) {
	fprintf( stderr, "locale = %s\n", setlocale( LC_ALL, "en_US.UTF-8" ) );
	putwc( 0xf6, stdout );
	putwc( '\n', stdout );
	sleep( 5 );
	initscr();
	putwc( 0xf6, stdout );
	putwc( '\n', stdout );
}

// unsurprisingly broken putwc
void test5( void ) {
	fprintf( stderr, "locale = %s\n", setlocale( LC_ALL, "en_US.UTF-8" ) );
	putc( 'f', stdout );
	putc( '\n', stdout );
	sleep( 5 );
	initscr();
	putwc( 0xf6, stdout );
	putwc( '\n', stdout );
}

// fputwc is not identical to putwc (no output)
void test6( void ) {
	fprintf( stderr, "locale = %s\n", setlocale( LC_ALL, "en_US.UTF-8" ) );
	putc( 'f', stdout );
	putc( '\n', stdout );
	sleep( 5 );
	initscr();
	fputwc( 0xf6, stdout );
	fputwc( '\n', stdout );
}

// unsurprisingly broken putc
void test7( void ) {
	fprintf( stderr, "locale = %s\n", setlocale( LC_ALL, "en_US.UTF-8" ) );
	putwc( 0xf6, stdout );
	putwc( '\n', stdout );
	sleep( 5 );
	initscr();
	putc( 'f', stdout );
	putc( '\n', stdout );
}

// missing linebreak in waddstr and waddwstr
void test8( void ) {
	char test1[] = { 'f', '\n', 0 };
	wchar_t test2[] = { 0xf6, '\n', 0x00 };
	WINDOW *win;
	
	fprintf( stderr, "locale = %s\n", setlocale( LC_ALL, "en_US.UTF-8" ) );
	putwc( 0xf6, stdout );
	putwc( '\n', stdout );
	sleep( 5 );
	initscr();
	win = newwin( 3, 10, 0, 0 );
	waddstr( win, test1 );
	waddwstr( win, test2 );
	wrefresh( win );
	sleep( 5 );
}

// waddwstr output broken
void test9( void ) {
	char test1[] = { 'f', '\n', 0 };
	wchar_t test2[] = { 0xf6, '\n', 0x00 };
	WINDOW *win;
	
	fprintf( stderr, "locale = %s\n", setlocale( LC_ALL, "en_US.UTF-8" ) );
	putc( 'f', stdout );
	putc( '\n', stdout );
	sleep( 5 );
	initscr();
	win = newwin( 3, 10, 0, 0 );
	waddstr( win, test1 );
	waddwstr( win, test2 );
	wrefresh( win );
	sleep( 5 );
}

// waddwstr output broken
void test10( void ) {
	char test1[] = { 'f', '\n', 0 };
	wchar_t test2[] = { 0xf6, '\n', 0x00 };
	WINDOW *win;
	
	fprintf( stderr, "locale = %s\n", setlocale( LC_ALL, "en_US.UTF-8" ) );
	sleep( 5 );
	initscr();
	win = newwin( 3, 10, 0, 0 );
	waddstr( win, test1 );
	waddwstr( win, test2 );
	wrefresh( win );
	sleep( 5 );
}

// waddwstr output broken
void test11( void ) {
	char test1[] = { 'f', '\n', 0 };
	wchar_t test2[] = { 0xf6, '\n', 0x00 };
	WINDOW *win;
	
	fprintf( stderr, "locale = %s\n", setlocale( LC_ALL, "en_US.UTF-8" ) );
	sleep( 5 );
	initscr();
	win = newwin( 3, 10, 0, 0 );
	waddwstr( win, test2 );
	waddstr( win, test1 );
	wrefresh( win );
	sleep( 5 );
}

// missing linebreak in waddstr and waddwstr
void test12( void ) {
	char test1[] = { 'f', '\n', 0 };
	wchar_t test2[] = { 0xf6, '\n', 0x00 };
	WINDOW *win;
	
	fprintf( stderr, "locale = %s\n", setlocale( LC_ALL, "en_US.UTF-8" ) );
	sleep( 5 );
	initscr();
	stdout = freopen( NULL, "w", stdout );
	win = newwin( 3, 10, 0, 0 );
	waddwstr( win, test2 );
	waddstr( win, test1 );
	wrefresh( win );
	sleep( 5 );
}

int main( int argc, char **argv ) {
	test12();
	return 0;
}

Interestingly from reading glibc source it seems neither putc nor putwc itself set the stream orientation (which is usually done using an implicit call to fwide) but obviously something down the line does. Calling freopen after initscr to reset the internal state actually did make a difference but output is still broken (ignoring line breaks). To be honest i am pretty baffled with glibc's behavior. I've tried patching putchar.c:92 to use status = fprintf(outfd, "%lc", wch); instead of status = putwc(wch, outfd); which does fix the weirdness but not really because it's more intelligent or even "doing the right thing". What fprintf does for wide characters is call wcrtomb to convert to multibyte no matter what which does work in my case because of of the UTF-8 locale but would likely fail for anything else.

@rofl0r
Copy link
Member

rofl0r commented Sep 7, 2022

ouch, sorry that i didn't realize it before, but you cannot safely mix stdio output with curses output, because both systems use internal buffering and bookkeeping of positions and such, and are unaware of each other. a curses application shouldnt output anything before calling initscr(), and after that only use curses routines for output.

@0xE70000000
Copy link
Author

0xE70000000 commented Sep 7, 2022

No need to be sorry. It's not really related to stdio anyways. You can take for example test10 and remove the fprintf to stderr (which was actually chosen to avoid messing with stdouts internal state) and the result is still the same:

// waddwstr output broken
void test10( void ) {
	char test1[] = { 'f', '\n', 0 };
	wchar_t test2[] = { 0xf6, '\n', 0x00 };
	WINDOW *win;
	
	setlocale( LC_ALL, "en_US.UTF-8" );
	initscr();
	win = newwin( 3, 10, 0, 0 );
	waddstr( win, test1 );
	waddwstr( win, test2 );
	wrefresh( win );
	sleep( 5 );
}

It's not even about mixing waddstr and waddwstr. The following version just calling waddwstr does the same:

// waddwstr output broken
void test10( void ) {
	wchar_t test2[] = { 0xf6, '\n', 0x00 };
	WINDOW *win;
	
	setlocale( LC_ALL, "en_US.UTF-8" );
	initscr();
	win = newwin( 3, 10, 0, 0 );
	waddwstr( win, test2 );
	wrefresh( win );
	sleep( 5 );
}

I am pretty sure it's related to glibc's internal stream state in regard to wide character handling associated with stdout being chosen by the first write occurring to it which is likely some putc call invoked by tputs/ti_puts during initscr (like i've said initially, hacking libcurses to turn tputs/ti_puts into noops keeps wide character output functional). At least i don't have a better explanation for what i am seeing. It's not like i've read every single line of libcurses / glibc and fully comprehend every little detail (especially in regards to libcurses - the code is pretty crazy at times!).

@rofl0r
Copy link
Member

rofl0r commented Sep 7, 2022

test10 prints f and ö separated by a newline on my musl-libc based system, and i suspect that's what you expect it to do. i can't currently test on glibc as i have to set up a test env first. what happens if you remove the setlocale call and export LC_ALL=C before running the program on glibc ?

@0xE70000000
Copy link
Author

0xE70000000 commented Sep 7, 2022

Yes, that would be exactly what i'd expect to show up on the screen. What i am getting is an f and an (inverse color question mark) though. Removing setlocale and exporting LC_ALL makes no difference. Same goes for calling setlocale( LC_ALL, "C" ) or setlocale( LC_ALL, "some-garbage" ).

@0xE70000000
Copy link
Author

0xE70000000 commented Sep 7, 2022

By the way there is no need to rush this issue. I've added a switch to my build scripts for now that selects between netbsd-curses and ncurses (if i end up with a bit of free time i might also take a look to see if i can figure out what ncurses does differently in regards to wide character handling) so i can easily switch back if a solution is found and it doesn't bother me much right now. I am generally quite grateful you are taking the time to look into this.

@richfelker
Copy link

richfelker commented Sep 7, 2022

It looks like netbsd curses is internally mixing byte and wide stdio calls, which produces undefined behavior. See at least:

status = putc(ch, outfd);
fflush(outfd);
return status;
}
#ifdef HAVE_WCHAR
int
__cputwchar(wchar_t wch)
{
#ifdef DEBUG
__CTRACE(__CTRACE_OUTPUT, "__cputwchar: 0x%x\n", wch);
#endif
return __cputwchar_args(wch, _cursesi_screen->outfd);
}
/*
* This is the same as __cputchar but the extra argument holds the file
* descriptor to write the output to. This function can only be used with
* the "new" libterm interface.
*/
int
__cputwchar_args(wchar_t wch, void *args)
{
FILE *outfd = (FILE *)args;
int status;
#ifdef DEBUG
__CTRACE(__CTRACE_OUTPUT, "__cputwchar_args: 0x%x on fd %d\n",
wch, outfd->_file);
#endif
status = putwc(wch, outfd);

I think you want to use narrow stdio functions entirely regardless of the type of character being printed. See if replacing the putwc call with fprintf(outfd, "%lc", wch) fixes it. There may be other places where a similar fix is needed. Of course printf is probably unwanted overhead here, so if it does work, you should probably switch to manually doing mbrtowc then fwrite or fputs.

@0xE70000000
Copy link
Author

0xE70000000 commented Sep 8, 2022

Well put. I pretty much agree with you. Regarding wcrtomb (i think that's what you actually meant?) i am little worried though if it's really sufficient when dealing with non UTF-8 capable terminals. Glibc is jumping through a lot of hoops to properly convert wide character output (see iofwide.c). According to the comments in that file the logic chooses appropriate conversion routines based on the value of LC_CTYPE. If replacing putwc is found to be the solution somewhat replicating that logic (which mostly seems to rely on __wcsmbs_clone_conv returning pointers to the needed functions) might be preferable to simply calling wcrtomb. I have to admit that i am not really sure if this is just kind of a low level way of using wcrtomb though.

@rofl0r
Copy link
Member

rofl0r commented Sep 8, 2022

@michaelforney has produced this patch: http://ix.io/49Ui - you might want to try it out.

@0xE70000000
Copy link
Author

Nice, works for me! I've build the test10 code against libcurses patched with your link and it printed f newline ö newline. If you feel confident with the approach this issue can be closed i guess. Thanks a lot!

@rofl0r
Copy link
Member

rofl0r commented Sep 8, 2022

thanks for confirmation. i'll close the issue once the patch is merged - which can take some time as michael wants to upstream it to netbsd proper so i have to backport the changes since last year from there.

@0xE70000000
Copy link
Author

Alright. Seems sensible.

@richfelker
Copy link

Regarding wcrtomb (i think that's what you actually meant?) i am little worried though if it's really sufficient when dealing with non UTF-8 capable terminals.

wcrtomb will convert to whatever the locale's multibyte encoding is, which is exactly what wide character stdio would do except that the choice of encoding is based on the current locale rather than the locale that was active at the time the FILE stream became wide-oriented.

@0xE70000000
Copy link
Author

I see. The proposed solution is probably fine then. Sorry, i didn't realize until now that you are the author of musl and likely know a thing or two about what standard functions do ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants