2018 HtDig Log

January 13

apt-get update/upgrade/dist-upgrade
My example was a minimum edit of ugrep.cpp saved as ~/git/icu/samples/ugrep/usr.cpp
The structures in use are:

const char *pattern = NULL;     // The regular expression
UErrorCode status = U_ZERO_ERROR;   // All ICU operations report success or failure
UParseError    parseErr;            // In the event of a syntax error in the regex pattern,
RegexPattern  *rePat = RegexPattern::compile((const UnicodeString)pattern, parseErr, status);
UnicodeString empty;
RegexMatcher *matcher = rePat->matcher(empty, status);
UnicodeString s(FALSE, ucharBuf+lineStart, lineEnd-lineStart);
matcher->reset(s);
if (matcher->find()) {
    UErrorCode st = U_ZERO_ERROR;   // ICU status codes must be initialized before use
    UnicodeString r("http://$1.dyndns-pics.com/$2");
    cout << "Replacement: " << matcher->replaceAll(r, st) << endl;
    matchFound = TRUE;
    printMatch();
}
In htdig.conf, we have both the pattern and the replacement—we have to read them, in Configuration::AddParsed.
This is what I already analysed. The value contains both the pattern and the replacement.
Working on ParsedString::get...
Looking at icu/source/samples/citer/citer.cpp...
I may have confused incrementing the iterator with getting to the next word (but was there a next word? no...!?). Well... not quite sure: it is a recursive function...
I change the return value of the function: from pointer to reference. Hoping that the value may be initialized to an empty string, and that nothing breaks...
Done —ParsedString...
Next: Configuration... Changing the interface: from char* to UnicodeString&
Compiled.
I have still not solved the issue of setting up HtURLRewriter...

January 20

So... There is a singleton HtURLRewriter, and this one is constructed from config["url_rewrite_rules"], which contains strings with space separated pattern and replacement. A TAB separates these strings from the search_rewrite_rules: prefix. I moved the HtRegexReplaceList files into the away directory. In fact, it contained a list of pairs, which it used as from and to.
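A minimal sketch of the "list of pairs used as from and to" idea, assuming the rule value is a whitespace-separated sequence of pattern and replacement tokens. It uses std::regex as a stand-in for the ICU RegexMatcher, so it only handles plain char strings; the names Rule, parse_rules and rewrite are hypothetical and not taken from htdig.

```cpp
#include <regex>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// One rewrite rule: first = regular expression, second = replacement ($1, $2...).
using Rule = std::pair<std::string, std::string>;

// Reads whitespace-separated tokens and pairs them up as (pattern, replacement).
// A trailing pattern without a replacement is ignored.
std::vector<Rule> parse_rules(const std::string& value) {
    std::istringstream in(value);
    std::vector<Rule> rules;
    std::string from, to;
    while (in >> from >> to) {
        rules.emplace_back(from, to);
    }
    return rules;
}

// Applies every rule in order; each rule sees the output of the previous one.
std::string rewrite(const std::string& url, const std::vector<Rule>& rules) {
    std::string out = url;
    for (const Rule& r : rules) {
        out = std::regex_replace(out, std::regex(r.first), r.second);
    }
    return out;
}
```

With the (made up) rule "http://([^/.]+)\.example\.org/(.*) http://$1.dyndns-pics.com/$2", the URL http://photos.example.org/a/b.jpg would be rewritten to http://photos.dyndns-pics.com/a/b.jpg.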
My strlist replacement for StringList is not adequate, as it uses char instead of UChar, although it might work as long as the urls do not contain multibyte chars.
But ICU most probably offers better parsing tools.
Extracted a test case in ~/git/tests/parse, with a local copy of strlist.
Fails abominably:

parse> make CXXFLAGS="-std=c++11 -g -O0"
g++  -std=c++11 -g -O0 -I/usr/local/include   -c -o parse.o parse.cc
g++  -std=c++11 -g -O0 -I/usr/local/include  parse.o strlist.o -L/usr/local/lib -licui18n -licuio -licuuc -licudata  -o parse
parse> ./parse 'il y de la joie Ваня'
original text: il y de la joie Ваня
l.join('+'): il y de la joie Ваня
First, the default constructor sets '\t' as the separator, so the string is not parsed/split.
Found BreakIterator, which begets:

parse> ./parse 'il y a de la joie Ваня'
original text: il y a de la joie Ваня
l.join('+'): il+ +y+ +a+ +de+ +la+ +joie+ +Ваня
So, now, getting rid of the whitespace—done, although not perfect: this will split (and record) also on punctuation, including tabs.

parse> ./parse ' il y a de la joie Ваня'
original text:  il y a de la joie Ваня
l.join('+'): il+y+a+de+la+joie+Ваня
parse> ./parse ' il y a de la joie, Ваня'
original text:  il y a de la joie, Ваня
l.join('+'): il+y+a+de+la+joie+,+Ваня
parse> ./parse ' il y a-aussi-de la joie	Ваня'
original text:  il y a-aussi-de la joie	Ваня
l.join('+'): il+y+a+-+aussi+-+de+la+joie+	+Ваня
Adapted for HtURLRewriter.

January 27

Still working in the icu branch.

htdig> find . -type f -name Stack\*
./htlib/Stack.cc
./htlib/Stack.h
htdig> grep -rl Stack --include=*.h .
./htdig/Server.h
./db/include/btree.h
./htsearch/parser.h
./htlib/Stack.h
In btree.h, the match is in a comment. The actual type is:

typedef struct __epg EPG;
struct __epg {
	PAGE	 *page;			/* The page. */
	db_indx_t indx;			/* The index on the page. */
	DB_LOCK	  lock;			/* The page's lock. */
};
PAGE is defined as a large struct in db_page.h
The hope for a start is that I do not need to modify the db...
This is version 2.6.4 (12/16/98) of Sleepycat Software's Berkeley DB product.
Before looking at replacing Stack, I might look at ResultList, as e.g. in parser.cc, one pushes such a list into (and pops from) the stack member.
ResultList specialized Dictionary which I removed, replacing it with an included map.
Fixed the issue with strlist of recording punctuation and tabs.

parse> ./parse ' il y a de la joie, Ваня'
original text:  il y a de la joie, Ваня
l.join('+'): il+y+a+de+la+joie+Ваня
parse> ./parse ' il y a de la joie	Ваня'
original text:  il y a de la joie	Ваня
l.join('+'): il+y+a+de+la+joie+Ваня
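A rough sketch of the behaviour shown in these transcripts, for UTF-8 input and using only the standard library. This is not the BreakIterator-based strlist: it only treats ASCII whitespace and punctuation as separators and keeps every byte of a multibyte character inside its word. The names split_words and join are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Splits UTF-8 text into words, dropping ASCII whitespace and punctuation.
std::vector<std::string> split_words(const std::string& utf8) {
    std::vector<std::string> words;
    std::string current;
    for (unsigned char c : utf8) {
        // Bytes >= 0x80 are parts of multibyte characters, never separators here.
        const bool sep = c < 0x80 && (std::isspace(c) || std::ispunct(c));
        if (!sep) {
            current += static_cast<char>(c);
        } else if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        words.push_back(current);
    }
    return words;
}

// Joins the words with a single separator character.
std::string join(const std::vector<std::string>& words, char sep) {
    std::string out;
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (i != 0) out += sep;
        out += words[i];
    }
    return out;
}
```

Unlike the real word boundary rules, this would also split "a-aussi-de" into three words, since the hyphen counts as ASCII punctuation.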
Added const iterators.
New test:

find> ./indexof mydefault:0 default:
text: mydefault:0, pattern: default:
position of pattern in text: 2
text after removing pattern: my0
find> ./indexof default:0 foo
text: default:0, pattern: foo
position of pattern in text: -1
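The same two operations can be sketched with std::u16string standing in for UnicodeString (an assumption made to keep the example free of ICU; the function names index_of and remove_first are hypothetical). The position is reported as -1 when the pattern is absent, matching the transcript above.

```cpp
#include <string>

// Position of the first occurrence of pattern in text, or -1 if there is none.
int index_of(const std::u16string& text, const std::u16string& pattern) {
    const std::u16string::size_type pos = text.find(pattern);
    return pos == std::u16string::npos ? -1 : static_cast<int>(pos);
}

// Copy of text with the first occurrence of pattern removed (unchanged if absent).
std::u16string remove_first(std::u16string text, const std::u16string& pattern) {
    const std::u16string::size_type pos = text.find(pattern);
    if (pos != std::u16string::npos) {
        text.erase(pos, pattern.size());
    }
    return text;
}
```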

March 17

Reindex, with new errors:

public_html> sudo /opt/www/htdig/bin/rundig
DB2 problem...: Unable to allocate 1852256170 bytes from mpool shared region: Cannot allocate memory
...
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted
Where was I? If I build under htsearch, the first error is for Display.cc, which includes QuotedStringList.h, now renamed to qstrings.h.
If I build htsearch.o, the first error is for configFile, which is still treated as a string instead of a UnicodeString.
The notes say to look into WordList, which is not opened yet.
OK: we are under parser, and result is a member there.
Building parser.o, the first error is for tokens, a vector<UnicodeString> member.
The point is that the first thing one does in lexan is to cast the token into a WeightWord... So, that's what it should be: done.
But a problem is the way the tokens get iterated. There was a Start_Get to reset the list in fullexpr, following which current was set to Get_Next at the beginning of lexan, itself called from expr, etc.
But that's not how it works with vector... OK: converted, although with a doubt: e.g. now perform_push may return, if the end of tokens was reached. Hopefully I do not skip any token (I go to the next in lexan).
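A sketch of how the Start_Get/Get_Next style can be kept on top of a vector, assuming a small cursor object that returns a null pointer once the end of the tokens is reached (the case in which a caller such as perform_push has to return early). TokenCursor and its members are hypothetical names, and std::string stands in for the real token type.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Cursor over a token vector that mimics the old list interface.
// The vector must outlive the cursor, which only keeps a reference to it.
class TokenCursor {
public:
    explicit TokenCursor(const std::vector<std::string>& tokens)
        : tokens_(tokens), next_(0) {}

    // Plays the role of Start_Get: rewind to the first token.
    void start() { next_ = 0; }

    // Plays the role of Get_Next: current token, or nullptr at the end.
    const std::string* next() {
        return next_ < tokens_.size() ? &tokens_[next_++] : nullptr;
    }

private:
    const std::vector<std::string>& tokens_;
    std::size_t next_;
};
```

Because next() both returns a token and advances, each token is handed out exactly once per pass, which is one way to check that none gets skipped.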
Committed all the changes to date.
Downloaded Berkeley DB: db-6.2.32.tar.gz, and extracted under git.

April 15

I noticed a few days back, that my last htdig db update had in fact aborted before indexing the last pages (no new hits for climat although expected).
I ran apt-get update/upgrade/dist-upgrade and retried: same result. In fact, there are no hits for distinction either, which should already have been covered by the previous update.
Ran in verbose mode, found multiple errors (invalid links), fixed some, and ran again:

public_html> sudo /opt/www/htdig/bin/rundig -vvv 2>&1 | egrep -B5 ^DB2 
*href: http://berry314/bdb/docs/gsg/C/CoreCursorUsage.html (example_database_read)
resolving 'http://berry314/bdb/docs/gsg/C/CoreCursorUsage.html'
*href: http://berry314/bdb/docs/gsg/C/preface.html (Next)
resolving 'http://berry314/bdb/docs/gsg/C/preface.html'
* size = 17666
DB2 problem...: PANIC: Bad address
Note that this is not the exact same page as previously:

+href: http://berry314/bdb/docs/gsg/C/CoreEnvUsage.html (Managing Databases in Environments)
resolving 'http://berry314/bdb/docs/gsg/C/CoreEnvUsage.html'
DB2 problem...: PANIC: Bad address
It seems in fact to be later, but this log itself gets indexed, so that it may contribute to pushing the error (should be backwards?).
What I have practiced is e.g:

winfo> perl -0777 -pi -e 's%<!--.*?>%%gsm' $(find . -type f -name \*.html)
More...

public_html> sudo /opt/www/htdig/bin/rundig -vvv 2>&1 | egrep -B5 ^DB2
resolving 'http://berry314/bdb/docs/gsg/C/databaseLimits.html'

   pushing http://berry314/bdb/docs/gsg/C/databaseLimits.html
+href: http://berry314/bdb/docs/gsg/C/environments.html (Environments)
resolving 'http://berry314/bdb/docs/gsg/C/environments.html'
DB2 problem...: PANIC: Invalid argument
The problem may be that I added a large base with bdb, and overran a limit... I'll try removing it.

public_html> sudo ls -ld /var/www/html/bdb
lrwxrwxrwx 1 root root 18 Mar 18 14:22 /var/www/html/bdb -> /home/marc/git/bdb
public_html> sudo rm /var/www/html/bdb
OK... It looks like this is the explanation... I restore the link and exclude it:

public_html> sudo ln -s /home/marc/git/bdb /var/www/html/bdb
public_html> sudo perl -pi.bak -e 's%^(exclude_urls:.*)%$1 /bdb/%' /opt/www/htdig/conf/htdig.conf
public_html> diff /opt/www/htdig/conf/htdig.conf.bak /opt/www/htdig/conf/htdig.conf
52c52
< exclude_urls:		/cgi-bin/ .cgi
---
> exclude_urls:		/cgi-bin/ .cgi /bdb/
That's enough to get the climat indexed, but I still get a DB2 error.
I'll also exclude icu:

public_html> sudo perl -pi.bak -e 's%^(exclude_urls:.*)%$1 /icu/%' /opt/www/htdig/conf/htdig.conf
public_html> diff /opt/www/htdig/conf/htdig.conf.bak /opt/www/htdig/conf/htdig.conf
52c52
< exclude_urls:		/cgi-bin/ .cgi /bdb/
---
> exclude_urls:		/cgi-bin/ .cgi /bdb/ /icu/
Still a DB2 problem. Let's hope moving to bdb will solve it.

April 22

Progress logged in objects

April 28-29

Built and extended the udata example of icu.
Explored std::fstream, in an fstr test, trying to identify what to implement in terms of write.cpp, and unewdata.cpp, and as explicit instantiations of the std templates, for UChar or UnicodeString. Working as such with wchar_t, but not with char16_t (empty string—maybe just a locale issue?):

fstr> make
g++  -std=c++11  -I/usr/local/include   -c -o fstr.o fstr.cc
g++  -std=c++11  -I/usr/local/include  fstr.o -L/usr/local/lib -licui18n -licuio -licuuc -licudata  -o fstr
fstr> ./fstr 
fstr> cat test.txt
Il y a de la joie
fstr> nm -C ./fstr | grep wchar_t
         U std::basic_ofstream<wchar_t, std::char_traits<wchar_t> >::close()@@GLIBCXX_3.4
         U std::basic_ofstream<wchar_t, std::char_traits<wchar_t> >::basic_ofstream(char const*, std::_Ios_Openmode)@@GLIBCXX_3.4
         U std::basic_ofstream<wchar_t, std::char_traits<wchar_t> >::~basic_ofstream()@@GLIBCXX_3.4
         U std::basic_ostream<wchar_t, std::char_traits<wchar_t> >& std::operator<< <wchar_t, std::char_traits<wchar_t> >(std::basic_ostream<wchar_t, std::char_traits<wchar_t> >&, wchar_t const*)@@GLIBCXX_3.4
OK... Maybe framework ready for the actual implementation (for now: dummy/empty bodies)...

fstr> make
g++  -std=c++11  -I/usr/local/include   -c -o fstr.o fstr.cc
g++  -std=c++11  -I/usr/local/include   -c -o uofstream.o uofstream.cc
g++  -std=c++11  -I/usr/local/include  fstr.o uofstream.o -L/usr/local/lib -licui18n -licuio -licuuc -licudata  -o fstr

April 30

Minor progress in integration, although:

May 19

Committed: builds, but doesn't work (with char16_t).
I kept both code variants, writing to a wchar_t stream (file: Ltest.txt), and to a char16_t stream (file: utest.txt).
So far only with plain ASCII (only the second uses ICU).
[Slightly edited the transcript: the magic string of the utest.txt file is not text]

fstr> make
g++  -std=c++11  -I/usr/local/include  -I/home/marc/git/icu/source/tools/toolutil  -c -o fstr.o fstr.cc
g++  -std=c++11  -I/usr/local/include  -I/home/marc/git/icu/source/tools/toolutil fstr.o uofstream.o -L/usr/local/lib -licui18n -licuio -licutu -licuuc -licudata  -o fstr
fstr> ./fstr 
fstr> cat Ltest.txt; echo; cat utest.txt; echo
Il y a de la joie

^@[...]MyDt[...]^@ Copyright (C) 2016 and later: Unicode, Inc. and others. License & terms of use: http://www.unicode.org/copyright.html 
fstr> nm -C ./fstr | grep wchar_t
         U std::basic_ofstream<wchar_t, std::char_traits<wchar_t> >::close()@@GLIBCXX_3.4
         U std::basic_ofstream<wchar_t, std::char_traits<wchar_t> >::basic_ofstream(char const*, std::_Ios_Openmode)@@GLIBCXX_3.4
         U std::basic_ofstream<wchar_t, std::char_traits<wchar_t> >::~basic_ofstream()@@GLIBCXX_3.4
         U std::basic_ostream<wchar_t, std::char_traits<wchar_t> >& std::operator<< <wchar_t, std::char_traits<wchar_t> >(std::basic_ostream<wchar_t, std::char_traits<wchar_t> >&, wchar_t const*)@@GLIBCXX_3.4
fstr> nm -C ./fstr | grep char16_t | sort -u | grep 'close()'
00012538 W std::basic_ofstream<char16_t, std::char_traits<char16_t> >::close()
00012dd8 W std::basic_filebuf<char16_t, std::char_traits<char16_t> >::close()::__close_sentry::__close_sentry(std::basic_filebuf<char16_t, std::char_traits<char16_t> >*)
00012e0c W std::basic_filebuf<char16_t, std::char_traits<char16_t> >::close()::__close_sentry::~__close_sentry()
00012ed0 W std::basic_filebuf<char16_t, std::char_traits<char16_t> >::close()
fstr> nm -C ./fstr | grep char16_t | sort -u | grep 'basic_ofstream(char const*'
00012168 W std::basic_ofstream<char16_t, std::char_traits<char16_t> >::basic_ofstream(char const*, std::_Ios_Openmode)
fstr> nm -C ./fstr | grep char16_t | sort -u | grep 'basic_ofstream()'
00012328 W std::basic_ofstream<char16_t, std::char_traits<char16_t> >::~basic_ofstream()
000123c8 W virtual thunk to std::basic_ofstream<char16_t, std::char_traits<char16_t> >::~basic_ofstream()
0001240c W std::basic_ofstream<char16_t, std::char_traits<char16_t> >::~basic_ofstream()
0001243c W virtual thunk to std::basic_ofstream<char16_t, std::char_traits<char16_t> >::~basic_ofstream()
fstr> nm -C ./fstr | grep char16_t | sort -u | grep 'operator<<'
00017934 T std::basic_ostream<char16_t, std::char_traits<char16_t> >& std::operator<< <char16_t, std::char_traits<char16_t> >(std::basic_ostream<char16_t, std::char_traits<char16_t> >&, char16_t const*)
W in the nm output denotes a weak symbol.
Anyway, I have not yet achieved what I got in the udata sample:

udata> ./reader 
Read value 2000 from data file
Read string EXAMPLE from data file
Read ustring from data file: архипелаг
Read ustring from data file: Ça va... Il y a toujours à boire et à manger
Although... I have no reader... Only the writer... But well, there is nothing past the copyright in the test file.
Ah... I only wrote stubs so far!
And only part of the stubs (for the template specializations).
For instance, I did not specialize the basic_ostream base class, and it looks like operator<< is a member function of it...
Inserted now a specialization of it —so far no change from the default. But the point is indeed to add to it an operator<< taking UChar...
Or is it needed? UChar is one type with which to instantiate the template...
Only needed if it requires some change in the declaration, such as passing a buffer. Otherwise, all the specialization was useless.
Maybe the only thing I need to specialize is ostream::_M_insert
Or maybe the __ostream_insert template function.
The problem is the basic_filebuf or the basic_streambuf (base class of the previous)
Taken away (into ostr.h) the specialization attempt of basic_ostream
Next: basic_streambuf... specialized in bstr.h, just to see it...
But UNewDataMemory is maybe more of basic_filebuf specialization? Here we go...
Maybe only some member functions might be enough and legal... First open (init comes from basic_ios).
In the template code, I have:

      _M_file.open(s, mode);
      if (this->is_open()) {
	_M_allocate_internal_buffer();
to replace with code from icu/source/tools/toolutil/unewdata.cpp.
_M_file is a basic_file<char>, defined in /usr/include/arm-linux-gnueabihf/c++/4.9/bits/basic_file.h
The file in UNewDataMemory is a FileStream defined in source/tools/toolutil/filestrm.h
It is opened in unewdata.cpp on line 102, i.e. in filestrm.cpp on line 36
So far: this is equivalent: no need to change anything.
Next (in my fstr.h):

	_M_allocate_internal_buffer();
At this point, something must have set the value of _M_buf_size.
The _M_buf is not the equivalent of the UNewDataMemory.
One needs to specialize the diverse operator<< at least in order to add the padding...
Added to ostr.h, but with no change (yet), e.g. from unewdata.cpp
Also, the two magic values, and the copyright header need to be inserted into the new file — maybe no need for a data structure to hold them...

May 26

Trying to understand the file header in utest.txt. Its size is 90H i.e. 9 lines of 16 bytes (144).
2 bytes of size: 9000, i.e. 144
2 bytes of magic: da27
20 bytes of dataInfo (from uofstream.h): 8 bytes (1400 0000 0000 0200), then 12 ('MyDt' 0100 0000 0100 0000)
119 bytes (20 43...6c 20): The Copyright string (from uvernum.h), including one space before and after
1 byte: 00
This is in fact written to the file by T_FileStream_write in filestrm.cpp, invoked 3 times from udata_create in unewdata.cpp (with sizes 4, pInfo->size, and commentLength), plus once for the 0 padding.
It uses plain fwrite.
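The byte counts above can be checked with a small sketch. The two structs are hypothetical mirrors of the fields seen in the dump, not the ICU declarations, and the rounding of the total to a multiple of 16 is an assumption that the 144-byte header is consistent with (4 + 20 + 119 + 1 is already 144).

```cpp
#include <cstdint>

// Hypothetical mirror of the first 4 bytes: header size and the two magic bytes.
struct HeaderPrefix {
    std::uint16_t headerSize;  // 0x0090 = 144 in utest.txt
    std::uint8_t  magic1;      // 0xda
    std::uint8_t  magic2;      // 0x27
};

// Hypothetical mirror of the 20-byte dataInfo block (8 bytes, then 12).
struct InfoBlock {
    std::uint16_t size;             // 0x0014 = 20
    std::uint16_t reservedWord;
    std::uint8_t  isBigEndian;
    std::uint8_t  charsetFamily;
    std::uint8_t  sizeofUChar;      // 2
    std::uint8_t  reservedByte;
    std::uint8_t  dataFormat[4];    // 'MyDt'
    std::uint8_t  formatVersion[4];
    std::uint8_t  dataVersion[4];
};

static_assert(sizeof(HeaderPrefix) == 4, "prefix must be 4 bytes");
static_assert(sizeof(InfoBlock) == 20, "info block must be 20 bytes");

// Prefix + info + comment + terminating 0, rounded up to a multiple of 16
// (the rounding is an assumption, see the text above).
constexpr std::uint32_t header_size(std::uint32_t commentLength) {
    return (static_cast<std::uint32_t>(sizeof(HeaderPrefix)) +
            static_cast<std::uint32_t>(sizeof(InfoBlock)) +
            commentLength + 1u + 15u) & ~15u;
}
```

For the 119-byte copyright string this gives 144, i.e. the 90H found in the file.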
Now: where should I write this header? In the open function of basic_filebuf when I'm creating the file?
Followed this plan: only specialized basic_filebuf<UChar, uctraits>::open
Same result as previously: header in the file, but no text inserted.

May 27

open is a member function, not the constructor: the object is ready, so one can use the operator instead of the implementation-dependent details of _M_file. Except that open is a member of basic_filebuf<>, and operator<< of ostream<>...
What is probably missing is the conversion between the streams, as endl and the ascii text return an ostream<char>
No: endl is a template as well: it should return an ostream<UChar>
In the mypkg_example.dat file produced by the udata example, the header is similar to the one in utest.txt, and even EXAMPLE is in contiguous chars, but Il y a toujours... is in 16-bit UChars.
I need to specialize the instances of _M_insert for the different types: uint8_t, char, const char*, following the example of udata_write8, udata_write16, udata_writeString in unewdata.cpp, based on T_FileStream_write in filestrm.cpp.
I'm surprised by udata_writePadding padding with 0xaa, not with 0, but I cannot find it used—keep this in mind.
Also, the int16_t and UChar automatic specializations of _M_insert should be OK...?
Debugged. Stepping on line 17 in fstr.cc, I get into ostream_insert.h:88, in __ostream_insert, and there, __out.width() equals 0, and __n is 18 (the length of the string in chars), but even then, the stream is in badbit state, and nothing gets inserted.
Added a default specialization of __ostream_write, which never reaches line 85: SIGSEGV—not even always reaching line 84:

Breakpoint 1, std::__ostream_write<char16_t, std::char_traits<char16_t> > (
    out=..., s=0x180bc u"Il y a de la joie\n", n=18) at fstr.cc:84
84	    const streamsize put = out.rdbuf()->sputn(s, n);
(gdb) n
0x000132d4 in std::__ostream_insert<char16_t, std::char_traits<char16_t> > (
    __out=..., __s=0x180bc u"Il y a de la joie\n", __n=18)
    at /usr/include/c++/4.9/bits/ostream_insert.h:109
109		  __catch(...)
(gdb) bt
#0  0x000132d4 in std::__ostream_insert<char16_t, std::char_traits<char16_t> >
    (__out=..., __s=0x180bc u"Il y a de la joie\n", __n=18)
    at /usr/include/c++/4.9/bits/ostream_insert.h:109
#1  0x00012050 in std::operator<< <char16_t, std::char_traits<char16_t> > (
    __out=..., __s=0x180bc u"Il y a de la joie\n")
    at /usr/include/c++/4.9/ostream:518
#2  0x00011424 in main () at fstr.cc:17
Debugging, the basic_streambuf this pointer is 0 in streambuf:451, when invoked from fstr.cc:84.
However, the basic_streambuf base of basic_filebuf, itself base of basic_ofstream, was initialized in streambuf:466, invoked from fstream.tcc:85:

std::basic_streambuf<char16_t, std::char_traits<char16_t> >::basic_streambuf (
    this=0xbefff8c0) at /usr/include/c++/4.9/streambuf:466
466	      _M_buf_locale(locale()) 
(gdb) bt
#0  std::basic_streambuf<char16_t, std::char_traits<char16_t> >::basic_streambuf (this=0xbefff8c0) at /usr/include/c++/4.9/streambuf:466
#1  0x00012d18 in std::basic_filebuf<char16_t, std::char_traits<char16_t> >::basic_filebuf (this=0xbefff8c0) at /usr/include/c++/4.9/bits/fstream.tcc:85
#2  0x00011e00 in std::basic_ofstream<char16_t, std::char_traits<char16_t> >::basic_ofstream (this=0xbefff8bc, __s=0x180b0 "utest.txt", __mode=std::_S_out, 
    __in_chrg=<optimized out>, __vtt_parm=<optimized out>)
    at /usr/include/c++/4.9/fstream:645
#3  0x00011414 in main () at fstr.cc:16
(gdb) p this
$15 = (std::basic_streambuf<char16_t, std::char_traits<char16_t> > * const) 0xbefff8c0
There is a second basic_streambuf which comes as a member of the basic_ios base: _M_streambuf.
This is the one not initialized...
Progressed, although not absolutely clear why. Maybe just forced a better order of definition by introducing a specialization of rdbuf.
Now failing because of the _M_codecvt facet being found 0 in /usr/include/c++/4.9/bits/fstream.tcc:640

June 10

Still in tests/fstr
Looking for the missing facet.
Doesn't crash!? Ah, no. It shouldn't crash: just do nothing.

fstr> gdb fstr
(gdb) b 84
(gdb) r
(gdb) s
std::operator<< <char16_t, std::char_traits<char16_t> > (__out=..., 
    __s=0x18128 u"瑵獥\x2e74硴t") at /usr/include/c++/4.9/ostream:513
513	    operator<<(basic_ostream<_CharT, _Traits>& __out, const _CharT* __s)
(gdb) n
515	      if (!__s)
(gdb) p __s
$1 = 0x18134 u"Il y a de la joie\n"
(gdb) n
519				 static_cast<streamsize>(_Traits::length(__s)));
(gdb) s
std::char_traits<char16_t>::length (__s=0x18134 u"Il y a de la joie\n")
    at /usr/include/c++/4.9/bits/char_traits.h:421
421		size_t __i = 0;
(gdb) finish
Run till exit from #0  std::char_traits<char16_t>::length (
    __s=0x18134 u"Il y a de la joie\n")
    at /usr/include/c++/4.9/bits/char_traits.h:421
0x000129ec in std::operator<< <char16_t, std::char_traits<char16_t> > (
    __out=..., __s=0x18134 u"Il y a de la joie\n")
    at /usr/include/c++/4.9/ostream:519
519				 static_cast<streamsize>(_Traits::length(__s)));
Value returned is $2 = 18
(gdb) s
518		__ostream_insert(__out, __s,
(gdb) s
std::__ostream_insert<char16_t, std::char_traits<char16_t> > (__out=..., 
    __s=0x18134 u"Il y a de la joie\n", __n=18)
    at /usr/include/c++/4.9/bits/ostream_insert.h:82
82	      typename __ostream_type::sentry __cerb(__out);
(gdb) n
83	      if (__cerb)
(gdb) 
87		      const streamsize __w = __out.width();
(gdb) p __out
$3 = (std::basic_ostream<char16_t, std::char_traits<char16_t> > &) @0xbefff8bc: {<std::basic_ios<char16_t, std::char_traits<char16_t> >> = {<std::ios_base> = {<No data fields>}, _M_tie = 0x85e2f8, _M_fill = 7808 u'Ẁ', 
    _M_fill_init = 140, _M_streambuf = 0x0, _M_ctype = 0x31, 
    _M_num_put = 0x7273752f, _M_num_get = 0x636e692f}, 
  _vptr.basic_ostream = 0x182d4 <vtable for std::basic_ofstream<char16_t, std::char_traits<char16_t> >+12>}
(gdb) s
std::ios_base::width (this=0xbefff94c)
    at /usr/include/c++/4.9/bits/ios_base.h:645
645	    { return _M_width; }
(gdb) finish
Run till exit from #0  std::ios_base::width (this=0xbefff94c)
    at /usr/include/c++/4.9/bits/ios_base.h:645
0x00013dd4 in std::__ostream_insert<char16_t, std::char_traits<char16_t> > (
    __out=..., __s=0x18134 u"Il y a de la joie\n", __n=18)
    at /usr/include/c++/4.9/bits/ostream_insert.h:87
87		      const streamsize __w = __out.width();
Value returned is $4 = 0
(gdb) n
88		      if (__w > __n)
(gdb) n
101			__ostream_write(__out, __s, __n);
(gdb) s
std::__ostream_write<char16_t, std::char_traits<char16_t> > (out=..., 
    s=0x18134 u"Il y a de la joie\n", n=18) at fstr.cc:74
74	    const streamsize put = out.rdbuf()->sputn(s, n);
(gdb) s
std::basic_ios<char16_t, std::char_traits<char16_t> >::rdbuf (this=0xbefff94c)
    at /usr/include/c++/4.9/bits/basic_ios.h:316
316	      { return _M_streambuf; }
(gdb) s
std::basic_streambuf<char16_t, std::char_traits<char16_t> >::sputn (
    this=0xbefff8c0, __s=0x18134 u"Il y a de la joie\n", __n=18)
    at /usr/include/c++/4.9/streambuf:451
451	      { return this->xsputn(__s, __n); }
(gdb) s
std::basic_filebuf<char16_t, std::char_traits<char16_t> >::xsputn (
    this=0xbefff8c0, __s=0x18134 u"Il y a de la joie\n", __n=18)
    at /usr/include/c++/4.9/bits/fstream.tcc:640
640	      streamsize __ret = 0;
(gdb) n
644	      const bool __testout = (_M_mode & ios_base::out
(gdb) n
645				      || _M_mode & ios_base::app);
(gdb) n
646	      if (__check_facet(_M_codecvt).always_noconv()
(gdb) p __testout
$8 = true
(gdb) s
std::__check_facet<std::codecvt<char16_t, char, __mbstate_t> > (__f=0x0)
    at /usr/include/c++/4.9/bits/basic_ios.h:48
48	      if (!__f)
(gdb) p __f
$9 = (const std::codecvt<char16_t, char, __mbstate_t> *) 0x0
And here is our unset facet, which results in a bad_cast exception.
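The trace shows that the ostream part works up to sputn/xsputn, and that only basic_filebuf consults the codecvt facet. A minimal sketch of that separation, assuming a hypothetical stream buffer (U16CollectBuf) that overrides xsputn and overflow and never looks at any facet; it collects the UChars in memory instead of writing a file, so it is only an illustration of the call path, not a replacement for the file stream.

```cpp
#include <cstddef>
#include <ios>
#include <ostream>
#include <streambuf>
#include <string>

// Hypothetical buffer: receives what operator<< inserts, with no codecvt involved.
class U16CollectBuf : public std::basic_streambuf<char16_t> {
public:
    const std::u16string& data() const { return data_; }

protected:
    // Called by sputn() for whole sequences.
    std::streamsize xsputn(const char16_t* s, std::streamsize n) override {
        data_.append(s, static_cast<std::size_t>(n));
        return n;
    }

    // Called by sputc() because this buffer has no put area.
    int_type overflow(int_type c = traits_type::eof()) override {
        if (!traits_type::eq_int_type(c, traits_type::eof())) {
            data_.push_back(traits_type::to_char_type(c));
        }
        return traits_type::not_eof(c);
    }

private:
    std::u16string data_;
};

// Inserts text through a basic_ostream<char16_t> and returns what reached the buffer.
std::u16string collect(const char16_t* text) {
    U16CollectBuf buf;
    std::basic_ostream<char16_t> out(&buf);  // the constructor passes &buf to init()
    out << text;
    return buf.data();
}
```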
In /usr/include/c++/4.9/fstream:84, there is:

      typedef codecvt<char_type, char, __state_type>    __codecvt_type;

The codecvt template class is defined in: /usr/include/c++/4.9/bits/codecvt.h
There is also a specialization for char:

  template<>
    class codecvt<char, char, mbstate_t>
    : public __codecvt_abstract_base<char, char, mbstate_t>
...
and one for wchar_t, plus a few extern definitions in order to...

  // Inhibit implicit instantiations for required instantiations,
  // which are defined via explicit instantiations elsewhere.
namely in /usr/include/c++/4.9/ext/codecvt_specializations.h
Added a specialization (copied from the one for wchar_t) for
codecvt<UChar, char, mbstate_t>
but... now, I need to specialize the member functions, and maybe the base class (hopefully not).
Inserted now a specialization for the first member: do_out from the partial specialization in codecvt_specializations.h...
Note: the partial specialization uses iconv, and obviously, that's what needs to be changed.
Asked for help/guidance in the ICU support mailing list.

June 16

The specializations for wchar_t use the encoding_state class, defined in /usr/include/c++/4.9/ext/codecvt_specializations.h, and which uses iconv.
I'll define and use a new uencstate instead. Shortened the variable names, including those of protected members (stripped underscores). Stripped comments (see original).
Removing references to iconv (from /usr/include/iconv.h)?
Note: boost has boost_1_62_0/libs/locale/src/util/iconv.hpp
Maybe something here (Boost 1.67.0)
Although, that's only for regex?!
ICU has uconv as an iconv replacement.
That's for the standalone tool, but there is its source code.
The main header file in /usr/local/include/unicode/ucnv.h
I'll replace:

iconv_t iconv_open(const char *tocode, const char *fromcode);
with

UConverter* ucnv_open(const char *converterName, UErrorCode *err);
assuming tocode in the first is implicit in the second.
I commit my intermediate code before doing this, just in case.
I get rid of:

    typedef iconv_t descriptor_type;
But this descriptor_type will now become, depending on the context, either UConverter* or UErrorCode*...
I have only one desc left, since the other is implicit...
Maybe not so: iconv works both ways, whereas in ICU, there are two functions: ucnv_toUnicode and ucnv_fromUnicode...
do_out is from UChar to char.
It should therefore use ucnv_fromUnicode (ucnv_toUnicode goes the other way, from char to UChar).
I didn't know what to put for the flush argument. Used true to compile.
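To make the direction of do_out concrete: it takes a range of UChars and has to produce chars. A hand-written sketch for the UTF-16 to UTF-8 case, working on whole strings and not going through the ICU converter at all; the name utf16_to_utf8 and the choice of replacing unpaired surrogates by U+FFFD are assumptions of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Converts UTF-16 code units to UTF-8 bytes (UChar to char direction).
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        // Combine a lead surrogate with a following trail surrogate.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            const std::uint32_t trail = in[i + 1];
            if (trail >= 0xDC00 && trail <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (trail - 0xDC00);
                ++i;
            }
        }
        // Anything still in the surrogate range is unpaired: substitute U+FFFD.
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

A real do_out additionally has to work on partial buffers and report how far it got in both the source and the target, which this whole-string version ignores.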

June 17

I specialized the codecvt template class and its do_out member. But this is derived from a member in __codecvt_abstract_base.
Given a compilation error, maybe I need to specialize this one?

fstr.cc:162:5: error: template-id ‘do_out<>’ for ‘std::codecvt_base::result std::codecvt<char16_t, char, uencodstate>::do_out(uencodstate&, const UChar*, const UChar*, const UChar*&, char*, char*, char*&) const’ does not match any template declaration
     codecvt<UChar, char, uencodstate>::
     ^
No: just syntax error: template<> not used for a member of a specialization.
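The rule behind this error, reduced to a toy example: once a class template has an explicit specialization, the members of that specialization are defined out of class like members of an ordinary class, without a template<> prefix. Conv is a hypothetical template used only for this illustration.

```cpp
// Primary template, with its member defined out of class in the usual way.
template <typename T>
struct Conv {
    int width() const;
};

template <typename T>
int Conv<T>::width() const {
    return static_cast<int>(sizeof(T));
}

// Explicit specialization of the whole class for char16_t.
template <>
struct Conv<char16_t> {
    int width() const;
};

// No "template<>" here: Conv<char16_t> is already a complete, ordinary class.
// Writing "template<> int Conv<char16_t>::width() const" is rejected by g++.
int Conv<char16_t>::width() const {
    return 2;
}
```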
Successful build. Checking where we are in terms of the error:

fstr> make CXXFLAGS="-std=c++11 -g -O0"
fstr> gdb fstr
(gdb) b 205
(gdb) r
Breakpoint 1, std::__ostream_write<char16_t, std::char_traits<char16_t> > (
    out=..., s=0x755ac u"Il y a de la joie\n", n=18) at fstr.cc:205
205	    const streamsize put = out.rdbuf()->sputn(s, n);
(gdb) s
(gdb) s
(gdb) s
(gdb) n
(gdb) n
(gdb) n
646	      if (__check_facet(_M_codecvt).always_noconv()
(gdb) s
std::__check_facet<std::codecvt<char16_t, char, __mbstate_t> > (__f=0x0)
    at /usr/include/c++/4.9/bits/basic_ios.h:48
48	      if (!__f)
(gdb) p __f
$1 = (const std::codecvt<char16_t, char, __mbstate_t> *) 0x0
char16_t, OK; why __mbstate_t and not uencodstate?
Besides: no progress at all!
Looks like there is a locale class, in /usr/include/c++/4.9/bits/locale_classes.h with template friends, and has_facet<UChar> returns false
Again, in /usr/include/c++/4.9/bits/locale_classes.tcc, there are extern declarations for explicit instantiations elsewhere of has_facet<collate<char> > and has_facet<collate<wchar_t> >.
Added collate<UChar> (probably only needing the member implementations). The comment actually said:
These virtual functions are hooks for developers to implement the behavior they require from the collate facet
I cannot see how I can work with a derived class.
To work with an explicit specialization, I need to insert between #include <bits/localefwd.h> and #include <bits/locale_classes.h> extern declarations (and more!?) such as the ones above for char and wchar_t. No, this doesn't work: explicit instantiation ... before definition of template.
Hopefully not needed.
It looks like it is not: provided stubs of the member functions, to be filled in with ICU code...
Builds.
has_facet still returns false in /usr/include/c++/4.9/bits/basic_ios.tcc:159 but __i was now 28 in /usr/include/c++/4.9/bits/locale_classes.tcc:106
Downloaded and extracted under ~/git boost 1.67.0; Committed

August 11

Long pause. Where was I?
under work says htsearch...
But wasn't it rather in some tests? fstr...
I had just installed the boost libraries, and this related to the task at hand...
Tried and failed to understand how to setup boost. Created a /usr/local/boost with write access for group girod, and did:

boost> ./bootstrap.sh --show-libraries --prefix=/usr/local/boost --with-icu
boost> ./b2 install

August 12

Setting the prefix did not apply as I expected.
Need to replay all the cp, and then to /usr/local (forget /usr/local/boost)
Also, the install command ended up in core dump!?
Looking at the adaptations of ICU made for C++:

/scp:berry:/home/marc/git/boost/boost/:
  find . \( -type f -exec grep -q -e U_NAMESPACE_QUALIFIER \{\} \; \) -ls
regex/v4/u32regex_token_iterator.hpp
regex/v4/u32regex_iterator.hpp
regex/icu.hpp

/scp:berry:/usr/local/include/unicode/:
find . \( -type f -exec grep -q -e U_NAMESPACE_QUALIFIER \{\} \; \) -ls
394946    8 -rw-r--r--   1 root     staff        6597 Apr 16 01:23
Some interesting traits...

August 18

So... what I was interested in was facets, collate? Maybe boost/detail/utf8_codecvt_facet.hpp and .ipp, as well as boost/regex/icu.hpp, may be sources of inspiration...
Back to fstr. The debug is as previously, except that the first breakpoint moved to line 239.
Tracing the construction of basic_ofstream<UChar> on line 248 in fstr.cc, I get in sequence to:

std::basic_ofstream<char16_t, std::char_traits<char16_t> >::basic_ofstream (
    this=0xbefff86c, __s=0x759bc "utest.txt", __mode=std::_S_out, 
    __in_chrg=<optimized out>, __vtt_parm=<optimized out>)
    at /usr/include/c++/4.9/fstream:645
645	      : __ostream_type(), _M_filebuf()
std::basic_ios<char16_t, std::char_traits<char16_t> >::basic_ios (
    this=0xbefff8fc) at /usr/include/c++/4.9/bits/basic_ios.h:456
456		_M_streambuf(0), _M_ctype(0), _M_num_put(0), _M_num_get(0)
std::basic_ostream<char16_t, std::char_traits<char16_t> >::basic_ostream (
    this=0xbefff86c, 
    __vtt_parm=0x75bc4 <VTT for std::basic_ofstream<char16_t, std::char_traits<char16_t> >+4>, __in_chrg=<optimized out>) at /usr/include/c++/4.9/ostream:385
385	      { this->init(0); }
std::basic_ios<char16_t, std::char_traits<char16_t> >::init (this=0xbefff8fc, 
    __sb=0x0) at /usr/include/c++/4.9/bits/basic_ios.tcc:129
129	      ios_base::_M_init();
132	      _M_cache_locale(_M_ios_locale);
std::basic_ios<char16_t, std::char_traits<char16_t> >::_M_cache_locale (
    this=0xbefff8fc, __loc=...) at /usr/include/c++/4.9/bits/basic_ios.tcc:159
159	      if (__builtin_expect(has_facet<__ctype_type>(__loc), true))
std::has_facet<std::ctype<char16_t> > (__loc=...)
    at /usr/include/c++/4.9/bits/locale_classes.tcc:106
106	      const size_t __i = _Facet::id._M_id();
107	      const locale::facet** __facets = __loc._M_impl->_M_facets;
(gdb) p __facets[__i]
$3 = (const std::locale::facet *) 0xb6f9d4d8 <vtable for std::__timepunct<wchar_t>+8>
(gdb) p __i
$4 = 28
So... the issue is that __facets[28] points to a wchar_t (automatic?) specialization of std::__timepunct, instead of to one for UChar aka char16_t?
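The lookup seen in the trace (take the facet's id as an index, then dynamic_cast whatever sits in that slot) can be modelled in a few lines. This is only a toy sketch, not libstdc++'s code: Facet, FacetId, Table, has_facet_model and the three facet structs are hypothetical stand-ins. It assumes, as the trace suggests, that an id gets its index on first use, and it shows two ways the test can fail: the index lies past the end of the table, or the slot holds a facet of another type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for std::locale::facet (hypothetical).
struct Facet { virtual ~Facet() {} };

// Toy stand-in for std::locale::id: the index is handed out on first use.
struct FacetId {
    std::size_t index() {
        if (slot == 0) slot = ++next;   // first use: take the next free index
        return slot - 1;
    }
private:
    std::size_t slot = 0;
    static std::size_t next;
};
std::size_t FacetId::next = 0;

// Three unrelated facet types, each with its own id.
struct CtypeNarrow : Facet { static FacetId id; };
struct TimePunct   : Facet { static FacetId id; };
struct CtypeU16    : Facet { static FacetId id; };
FacetId CtypeNarrow::id;
FacetId TimePunct::id;
FacetId CtypeU16::id;

// Toy stand-in for the facet table held by a locale implementation.
struct Table {
    std::vector<const Facet*> facets;
    void install(FacetId& id, const Facet* f) {
        const std::size_t i = id.index();
        if (i >= facets.size()) facets.resize(i + 1, nullptr);
        facets[i] = f;
    }
};

// Bounds check first, then a dynamic_cast on the slot content.
template <typename F>
bool has_facet_model(const Table& t) {
    const std::size_t i = F::id.index();
    return i < t.facets.size()
        && dynamic_cast<const F*>(t.facets[i]) != nullptr;
}
```

In this model, a facet type that was never installed gets a fresh index, so the answer is false whatever happens to be stored next to the table.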

fstr> grep -rl __timepunct /usr/include/c++/4.9
/usr/include/c++/4.9/bits/locale_facets_nonio.tcc
/usr/include/c++/4.9/bits/locale_facets_nonio.h
There are indeed extern declarations for explicit specializations of the templates for char and wchar_t, but the code itself is not there.
Thought of defining the id for char16_t as 39, compile and see.
But then found that boost has a collator template.

August 19

Tried to set the id, but this is not the way it goes.

fstr.cc:106:19: error: ‘id’ in ‘class std::collate<char16_t>’ does not name a type
   collate<UChar>::id = locale::id(39);
I can also cheat and define:

  template<> bool has_facet<collate<UChar>>(const locale& l) throw() {
    return true;
  }
...which compiles, but I don't think this is right. The template code should work fine. It should test the id which should have been installed into the proper list data member in locale::_Impl by _M_init_facet... I can try to force specialize this one, from the template code. Except that I fail:

fstr.cc:106:48: error: variable or field ‘_M_init_facet’ declared void
   template<> void locale::_Impl::_M_init_facet(collate<UChar>* f) {
Not sure this would be more right. In fact, I feel I'll have big problems with basic_string<UChar>, whereas what I really have is icu::UnicodeString, which has a different interface.
I'd rather implement my cheat, only to notice that... collate is not the right facet! What we are checking now is ctype...
Of course, I can cheat this one as well, but I end up in a bad cast anyway, with __facets[28] still pointing to vtable for std::__timepunct<wchar_t>+8 which fails the dynamic cast to yet another facet...
One benefit of this cheating is to check which facets are involved... Trying now the ones I find in locale_facets.h
Under the debugger, hitting ctype, and dying on bad cast.
Unfortunately, I cannot see from where this was thrown. And I cannot as easily cheat use_facet, because I'd need to return one, and the destructor is protected.
So, I'd need to find where _M_init_facet ought to be called, and why it is not.
Cloning now the new unicode-org git repo (v 62.1, release candidate)
I don't build it yet, waiting for the release. I guess I'll have to fork it?
I try to specialize ctype for UChar (char16_t), starting from the wchar_t specialization in locale_facets.h.
Now the linker complains about missing the code:

fstr> make CXXFLAGS="-std=c++11 -g -O0"
g++  -std=c++11 -g -O0 -I/usr/local/include   -c -o fstr.o fstr.cc
g++  -std=c++11 -g -O0 -I/usr/local/include  fstr.o -L/usr/local/lib -licui18n -licuio -licuuc -licudata  -o fstr
fstr.o: In function `bool std::has_facet<std::ctype<char16_t> >(std::locale const&)':
/usr/include/c++/4.9/bits/locale_classes.tcc:114: undefined reference to `std::ctype<char16_t>::id'
/usr/include/c++/4.9/bits/locale_classes.tcc:114: undefined reference to `typeinfo for std::ctype<char16_t>'
fstr.o: In function `std::ctype<char16_t> const& std::use_facet<std::ctype<char16_t> >(std::locale const&)':
/usr/include/c++/4.9/bits/locale_classes.tcc:143: undefined reference to `std::ctype<char16_t>::id'
/usr/include/c++/4.9/bits/locale_classes.tcc:143: undefined reference to `typeinfo for std::ctype<char16_t>'
collect2: error: ld returned 1 exit status
../parse/rules.mk:13: recipe for target 'fstr' failed
make: *** [fstr] Error 1
Though, I don't declare or invoke these functions explicitly.
Cloning gcc git repo now in ~/git/gnu/gcc...
Except that it failed. Twice.

gnu> git clone https://github.com/gcc-mirror/gcc.git
Cloning into 'gcc'...
remote: Counting objects: 2356004, done.        
remote: Compressing objects: 100% (51/51), done.        
remote: Total 2356004 (delta 12), reused 14 (delta 7), pack-reused 2355945        
Receiving objects: 100% (2356004/2356004), 2.57 GiB | 978.00 KiB/s, done.
Resolving deltas:  95% (1838195/1932842)
error: index-pack died of signal 11
fatal: index-pack failed
Trying now to fork, and clone the fork... This worked, even if it won't work for pull requests, as this is a mirror.

August 20

Tried to specialize ctype::_M_initialize_ctype from gcc/libstdc++-v3/config/locale/gnu/ctype_members.cc but failed to compile:

fstr> make CXXFLAGS="-std=c++11 -g -O0"
g++  -std=c++11 -g -O0 -I/usr/local/include   -c -o fstr.o fstr.cc
fstr.cc: In member function ‘void std::ctype<char16_t>::_M_initialize_ctype()’:
fstr.cc:140:51: error: ‘__uselocale’ was not declared in this scope
     __c_locale old = __uselocale(_M_c_locale_ctype);
                                                   ^

August 25

Found that mandb was running forever.
Tried a reboot, which failed, because of disk corruption.
This time, the boot string in /boot/cmdline.txt was OK.
ran:

# umount /dev/mmcblk0p6
# e2fsck -n /dev/mmcblk0p6
# e2fsck -p /dev/mmcblk0p6
# e2fsck /dev/mmcblk0p6
and accepted everything.
There was:
  1. one old issue with:
          Multiply-claimed block(s) 5345739
    
          (There are two inodes containing multiply-claimed blocks.)
          file /home/marc/public_html/externsw/www/csn/dev-env/doc/share.html (inode #1327195, mod time Sat Jun  9 13:17:11 2018)
          /home/marc/public_html/externsw/www/csn/dev-env/doc/hlink.html (inode #1327229, mod time Sat Jun  9 15:51:02 2018)
        
    Cloned the block, and later restored the corrupted file from sartre.
  2. an old (Mar 21 2017) issue with Sergey's files
  3. a recent issue with the failure to clone the gcc git repo
The errors were:

Entry 'sort' in /home/marc/git/mgirod/gcc/libgo/go (294060) has deleted/unused inode 687723. Clear<y>? yes
...
Entry 'fortran' in /home/marc/git/mgirod/gcc/libgo/misc/go (572908) has deleted/unused inode 572933. Clear<y>? yes
...
Pass 3: Checking directory connectivity
Unconnected directory inode 573083 (...)
Connect to lost+found<y>? yes
...
   [note: these are entries 'cleared' earlier...]
Pass 4: Checking reference counts
Inode 294060 ref count is 45, should be 36. Fix<y>? yes
Inode 572908 ref count is 20, should be 13. Fix<y>? yes
   [note: these had deleted/unused entries, see above]
Unattached zero-length inode 573041, Clear<y>? yes
...
   [note: contiguous numbers up to 573082]
Inode 573083 ref count is 3, should be 2. Fix<y>? yes
   [note this was connected to lost+found]
...
Unattached inode 815073
Connect to lost+found<y>? yes
Inode 815073 ref count is 2, should be 1. Fix<y>? yes
...
Pass 5: Checking group summary information
Block bitmap differences: -1666987 -(2108371--2108379) -...
Fix<y>? yes
Free blocks count wrong for group #0 (21148, counted 21147)
Fix<y>? yes
...
Free blocks count wrong for group #69 (226, counted 326)
Fix<y>? yes
Directories count wrong for group #69 (317, counted 308)
Fix<y>? yes
...
Afterwards, cleaned up lost+found
Back to fstr...

  extern "C" __typeof(uselocale) __uselocale;
And now defining the specializations in order to link:

/home/marc/git/tests/fstr/fstr.cc:153: undefined reference to `std::ctype<char16_t>::_M_convert_to_wmask(unsigned short) const'
/usr/include/c++/4.9/bits/locale_classes.tcc:114: undefined reference to `std::ctype<char16_t>::id'
/usr/include/c++/4.9/bits/locale_classes.tcc:114: undefined reference to `typeinfo for std::ctype<char16_t>'
But there comes ICU... The logic is different... Anyway... not clear what the intention is. Let's debug when running.
I'm afraid ctype is far too simple for ICU.
May also try to debug printUnicodeString e.g. in ilc.
Puzzled by typeinfo. Cannot see where it comes from. No symbol with that name explicit in the source.

August 26

Noting the use of quotes in the error: typeinfo is not a symbol written in the source; it is generated by the compiler, and was probably not emitted because the class is not completely defined.
Defining the virtual destructor... This did it. Only now, the linker complains about the other virtual member functions.
Of course... at least some of these functions involve Unicode!
Obviously do_toupper and do_tolower, but also do_widen and do_narrow!
In fact, I leave for now _M_widen, which may use UConverter.
Same with _M_narrow, and wctob.
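The typeinfo behaviour can be reproduced on a small class. A minimal sketch, assuming GCC's usual ABI rule that the vtable and typeinfo of a class are emitted in the translation unit that defines its first non-inline virtual function; Shape, Circle and is_circle are hypothetical names made up for the illustration.

```cpp
#include <cassert>
#include <string>
#include <typeinfo>

struct Shape {
    virtual ~Shape();                 // declared only; defined out of line below
    virtual std::string name() const; // same
};

// These out-of-line definitions are what makes the compiler emit the vtable
// and typeinfo for Shape. Under the assumption above, leaving out the
// destructor definition gives link errors of the kind seen in the log
// ("undefined reference to `typeinfo for ...'").
Shape::~Shape() {}
std::string Shape::name() const { return "shape"; }

struct Circle : Shape {
    std::string name() const override { return "circle"; }
};

// typeid on a polymorphic reference needs the typeinfo objects at run time.
bool is_circle(const Shape& s) { return typeid(s) == typeid(Circle); }
```

This matches the sequence in the log: once the destructor is defined, the typeinfo error goes away and the linker moves on to the remaining virtual members.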
Compiled and linked... Doesn't crash, but still produces nothing. In fact, there is still no change whatsoever:

fstr> gdb fstr
(gdb) b 446
(gdb) r
446	  basic_ofstream<UChar> ufs("utest.txt", basic_ofstream<UChar>::out);
(gdb) s
...
std::has_facet<std::ctype<char16_t> > (__loc=...)
    at /usr/include/c++/4.9/bits/locale_classes.tcc:106
106	      const size_t __i = _Facet::id._M_id();
(gdb) s
107	      const locale::facet** __facets = __loc._M_impl->_M_facets;
(gdb) p __i
$1 = 28
(gdb) s
110		      && dynamic_cast<const _Facet*>(__facets[__i]));
(gdb) p __facets[__i]
$2 = (const std::locale::facet *) 0xb6f9d4d8 <vtable for std::__timepunct<wchar_t>+8>
New run, stopping earlier:

446	  basic_ofstream<UChar> ufs("utest.txt", basic_ofstream<UChar>::out);
(gdb) s
std::basic_ofstream<char16_t, std::char_traits<char16_t> >::basic_ofstream (
    this=0xbefff86c, __s=0x8095c "utest.txt", __mode=std::_S_out, 
    __in_chrg=<optimized out>, __vtt_parm=<optimized out>)
    at /usr/include/c++/4.9/fstream:645
645	      : __ostream_type(), _M_filebuf()
(gdb) s
std::basic_ios<char16_t, std::char_traits<char16_t> >::basic_ios (
    this=0xbefff8fc) at /usr/include/c++/4.9/bits/basic_ios.h:456
456		_M_streambuf(0), _M_ctype(0), _M_num_put(0), _M_num_get(0)
(gdb) s
457	      { }
(gdb) s
std::basic_ostream<char16_t, std::char_traits<char16_t> >::basic_ostream (
    this=0xbefff86c, 
    __vtt_parm=0x80bc4 <VTT for std::basic_ofstream<char16_t, std::char_traits<char16_t> >+4>, __in_chrg=<optimized out>) at /usr/include/c++/4.9/ostream:385
385	      { this->init(0); }
(gdb) s
std::basic_ios<char16_t, std::char_traits<char16_t> >::init (this=0xbefff8fc, 
    __sb=0x0) at /usr/include/c++/4.9/bits/basic_ios.tcc:129
129	      ios_base::_M_init();
(gdb) s
132	      _M_cache_locale(_M_ios_locale);
(gdb) s
std::basic_ios<char16_t, std::char_traits<char16_t> >::_M_cache_locale (
    this=0xbefff8fc, __loc=...) at /usr/include/c++/4.9/bits/basic_ios.tcc:159
159	      if (__builtin_expect(has_facet<__ctype_type>(__loc), true))
(gdb) bt
#0  std::basic_ios<char16_t, std::char_traits<char16_t> >::_M_cache_locale (
    this=0xbefff8fc, __loc=...) at /usr/include/c++/4.9/bits/basic_ios.tcc:159
#1  0x00015c50 in std::basic_ios<char16_t, std::char_traits<char16_t> >::init
    (this=0xbefff8fc, __sb=0x0) at /usr/include/c++/4.9/bits/basic_ios.tcc:132
#2  0x000158e8 in std::basic_ostream<char16_t, std::char_traits<char16_t> >::basic_ostream (this=0xbefff86c, 
    __vtt_parm=0x80bc4 <VTT for std::basic_ofstream<char16_t, std::char_traits<char16_t> >+4>, __in_chrg=<optimized out>) at /usr/include/c++/4.9/ostream:385
#3  0x0001467c in std::basic_ofstream<char16_t, std::char_traits<char16_t> >::basic_ofstream (this=0xbefff86c, __s=0x8095c "utest.txt", __mode=std::_S_out, 
    __in_chrg=<optimized out>, __vtt_parm=<optimized out>)
    at /usr/include/c++/4.9/fstream:645
#4  0x00013434 in main () at fstr.cc:446
And has_facet<char_traits<char16_t>>(_M_ios_locale) returns false, and if we step in, we get our old:

(gdb) s
std::has_facet<std::ctype<char16_t> > (__loc=...)
    at /usr/include/c++/4.9/bits/locale_classes.tcc:106
106	      const size_t __i = _Facet::id._M_id();
(gdb) bt
#0  std::has_facet<std::ctype<char16_t> > (__loc=...)
    at /usr/include/c++/4.9/bits/locale_classes.tcc:106
...
with __i 28 (i.e. wchar_t).

September 2


libstdc++-v3> egrep -rl 'has_facet<(std::)?ctype<wchar_t> ?>' .
./include/bits/locale_facets.tcc
Only extern declarations...

libstdc++-v3> egrep -rl 'has_facet<(std::)?ctype<wchar_t> ?>' ..
I believe that the problem is ctype<UChar>::_M_initialize_ctype
It should add a new entry (e.g. 29) into __facets
Breakpoint in the function: not caught...

fstr> nm -C fstr  | grep ctype | wc -l
35
fstr> nm -C fstr  | egrep 'ctype(_abstract_base)?<char16_t>' | wc -l
31
fstr> nm -C fstr  | grep ctype | egrep -v 'ctype(_abstract_base)?<char16_t>'
000135c0 t _GLOBAL__sub_I__ZNSt5ctypeIDsE2idE
         U __wctype_l@@GLIBC_2.4
00080f60 V typeinfo for std::ctype_base
00080f50 V typeinfo name for std::ctype_base
fstr> nm -C /usr/lib/gcc/arm-linux-gnueabihf/4.9/libstdc++.a 2>/dev/null | grep 'ctype<wchar_t>' | wc -l
50
fstr> nm -C /usr/lib/gcc/arm-linux-gnueabihf/4.9/libstdc++.a 2>/dev/null | grep 'ctype<wchar_t>::_M'
         U std::ctype<wchar_t>::_M_initialize_ctype()
00000000 T std::ctype<wchar_t>::_M_convert_to_wmask(unsigned short) const
00000000 T std::ctype<wchar_t>::_M_initialize_ctype()
fstr> nm -C fstr  | grep 'ctype<char16_t>::_M'
000129e8 T std::ctype<char16_t>::_M_convert_to_wmask(unsigned short) const
00012c30 T std::ctype<char16_t>::_M_initialize_ctype()
Some hope: char16_t and wchar_t are distinct. My function, however incomplete, is just not invoked yet.
Added the constructors, but they are not called:

fstr> nm -C fstr  | grep 'ctype<char16_t>::ctype'
00012ed8 T std::ctype<char16_t>::ctype(unsigned int)
00012f54 T std::ctype<char16_t>::ctype(__locale_struct*, unsigned int)
00012ed8 T std::ctype<char16_t>::ctype(unsigned int)
00012f54 T std::ctype<char16_t>::ctype(__locale_struct*, unsigned int)
fstr> nm -C /usr/lib/gcc/arm-linux-gnueabihf/4.9/libstdc++.a 2>/dev/null | grep 'ctype<wchar_t>::ctype'
00000000 T std::ctype<wchar_t>::ctype(unsigned int)
00000000 T std::ctype<wchar_t>::ctype(__locale_struct*, unsigned int)
00000000 T std::ctype<wchar_t>::ctype(unsigned int)
00000000 T std::ctype<wchar_t>::ctype(__locale_struct*, unsigned int)
         U std::ctype<wchar_t>::ctype(unsigned int)
         U std::ctype<wchar_t>::ctype(__locale_struct*, unsigned int)
In gcc/libstdc++-v3/src/c++98/localename.cc, there is a locale::_Impl::_Impl(...) which constructs all the facets for char and wchar_t. It does it with its private template member function _M_init_facet, defined inline in gcc/libstdc++-v3/include/bits/locale_classes.h:

_M_install_facet(&_Facet::id, __facet);
The use would be:

  _M_init_facet(new ctype<UChar>());
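_M_init_facet is private to locale::_Impl, so user code cannot call it. The public route to the same effect is the locale constructor that takes an existing locale and a facet pointer. A minimal sketch with a made-up facet (unit_label and with_unit_label are hypothetical names, not from fstr.cc):

```cpp
#include <cassert>
#include <cstddef>
#include <locale>
#include <string>
#include <utility>

// A user-defined facet only needs to derive from std::locale::facet and to
// expose a static std::locale::id member named `id`.
class unit_label : public std::locale::facet {
public:
    static std::locale::id id;
    explicit unit_label(std::string text, std::size_t refs = 0)
        : std::locale::facet(refs), text_(std::move(text)) {}
    const std::string& text() const { return text_; }
private:
    std::string text_;
};
std::locale::id unit_label::id;

// Returns a copy of `base` with a unit_label facet added. The locale takes
// ownership of the facet because it is constructed with refs == 0.
std::locale with_unit_label(const std::locale& base, const std::string& text) {
    return std::locale(base, new unit_label(text));
}
```

has_facet<unit_label> is false on the base locale and true on the returned copy, and use_facet<unit_label> gives access to the stored text. Streams constructed afterwards only see the facet if this locale is imbued on them or made global.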

September 9

Looking for classes with protected members, that one might extend in derived classes:

September 15

The class locale (in locale_classes.h), is commented as: an extensible container for user-defined localization.
Inside, there are 3 private _Impl pointer members: _M_impl (shared), and 2 static: _S_classic ("C" reference) and _S_global (Current).

libstdc++-v3> pwd
/home/marc/git/mgirod/gcc/libstdc++-v3
libstdc++-v3> find src -type f -name localename.cc
src/c++98/localename.cc
libstdc++-v3> cksum include/bits/locale_classes.h
1219623670 24897 include/bits/locale_classes.h
libstdc++-v3> cksum /usr/include/c++/4.9/bits/locale_classes.h
2000905944 22985 /usr/include/c++/4.9/bits/locale_classes.h
libstdc++-v3> diff include/bits/locale_classes.h /usr/include/c++/4.9/bits/locale_classes.h
3c3
< // Copyright (C) 1997-2018 Free Software Foundation, Inc.
---
> // Copyright (C) 1997-2014 Free Software Foundation, Inc.
...
Significant additions (~9%)
Trying an update/upgrade cycle...
This did not affect libstdc++
Maybe Apache has an example...

int main () {
    std::locale loc;  // Default locale
    std::locale my_loc (loc, new ex_codecvt);
Found the source code, and attempted to build it in ~/git/tests/imbue
Not trivial: uses RogueWave...
Committed the files as such, and switched to a dev branch.
Cleaned up the RW macros, and built!
This shows ISO 8859-1 converted to US ASCII.
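For reference, the general shape of such an example, without the RogueWave macros. This is not the Apache code: upper_codecvt, write_upper and read_all are hypothetical names, and the conversion (upper-casing on output) is chosen only because it is easy to see in the resulting file. The facet replaces the standard codecvt<char, char, mbstate_t> of the locale imbued on the stream.

```cpp
#include <cassert>
#include <cctype>
#include <cwchar>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical facet: converts to upper case on the way out to the file.
class upper_codecvt : public std::codecvt<char, char, std::mbstate_t> {
protected:
    // Tell the file buffer that a conversion really has to be run.
    bool do_always_noconv() const noexcept override { return false; }

    result do_out(state_type&, const char* from, const char* from_end,
                  const char*& from_next, char* to, char* to_end,
                  char*& to_next) const override {
        from_next = from;
        to_next = to;
        while (from_next != from_end && to_next != to_end) {
            const unsigned char c = static_cast<unsigned char>(*from_next++);
            *to_next++ = static_cast<char>(std::toupper(c));
        }
        return from_next == from_end ? ok : partial;
    }
};

// Writes `text` to `path` through a stream imbued with the facet.
void write_upper(const std::string& path, const std::string& text) {
    std::ofstream out;
    // Imbue before opening, so the file buffer picks up the facet from the start.
    out.imbue(std::locale(std::locale::classic(), new upper_codecvt));
    out.open(path.c_str());
    assert(out.is_open());
    out << text;
}

// Reads the whole file back with a plain stream (no custom facet).
std::string read_all(const std::string& path) {
    std::ifstream in(path.c_str());
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

The point of interest for fstr is the order of operations: the facet is installed in a locale copy, and that copy is imbued on the stream before any output happens.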
Trying to apply, by constructing a locale copy with the ctype<UChar> facet. Builds, but no effect.
More and more symbols using char16_t, mostly weak:

fstr> nm -C ./fstr | grep char16_t | wc -l
280
fstr> nm -C ./fstr | grep char16_t | perl -nle '$h{$1}++ if /^\w+ (\w) /;END{print"$_: $h{$_}" for sort keys %h}'
B: 1
R: 6
T: 54
V: 27
W: 166
r: 1
t: 18
u: 7

September 16

I confused traits (e.g. char_traits) and facets (e.g. ctype).
It looks like ICU doesn't build upon the C locale
Explored constructing the ctype<UChar> with a refs argument of 1:

  locale loc(defloc, new ctype<UChar>(1));
Which leads to:

std::locale::locale<std::ctype<char16_t> > (this=0xbefffa98, __other=..., 
    __f=0xc5008) at /usr/include/c++/4.9/bits/locale_classes.tcc:47
47	      _M_impl = new _Impl(*__other._M_impl, 1);
(gdb) bt
#0  std::locale::locale<std::ctype<char16_t> > (this=0xbefffa98, __other=..., 
    __f=0xc5008) at /usr/include/c++/4.9/bits/locale_classes.tcc:47
#1  0x0001387c in main () at fstr.cc:451
(gdb) p __other
$5 = (const std::locale &) @0xbefffa9c: {static none = 0, static ctype = 1, 
  static numeric = 2, static collate = 4, static time = 8, 
  static monetary = 16, static messages = 32, static all = 63, 
  _M_impl = 0xb6fa3d14, static _S_classic = <optimized out>, 
  static _S_global = <optimized out>, static _S_categories = <optimized out>, 
  static _S_once = <optimized out>}
(gdb) s
50		{ _M_impl->_M_install_facet(&_Facet::id, __f); }
(gdb) p _Facet::id
$6 = {_M_index = 0, static _S_refcount = <optimized out>}
(gdb) s
56	      delete [] _M_impl->_M_names[0];
(gdb) 
57	      _M_impl->_M_names[0] = 0;   // Unnamed.
(gdb) 
58	    }
again with no other visible effect (as the new _Impl gets deleted?).
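On the refs argument: the standard says that a facet constructed with refs == 0 is deleted by the implementation when the last locale holding it goes away, whereas with refs == 1 it is never deleted by the library. A small sketch to observe this, with a hypothetical counted_facet whose destructor counts its calls:

```cpp
#include <cassert>
#include <cstddef>
#include <locale>

// Hypothetical facet that records how often its destructor runs.
class counted_facet : public std::locale::facet {
public:
    static std::locale::id id;
    static int destroyed;
    explicit counted_facet(std::size_t refs) : std::locale::facet(refs) {}
    ~counted_facet() { ++destroyed; }
};
std::locale::id counted_facet::id;
int counted_facet::destroyed = 0;

// Installs a facet built with the given refs value into a short-lived locale
// and returns how many facet destructions happened when that locale died.
int destructions_for(std::size_t refs) {
    const int before = counted_facet::destroyed;
    counted_facet* f = new counted_facet(refs);
    {
        std::locale loc(std::locale::classic(), f);
        assert(std::has_facet<counted_facet>(loc));
    }   // loc, the only locale holding the facet, is destroyed here
    const int during = counted_facet::destroyed - before;
    if (refs != 0) delete f;   // with refs == 1 the caller still owns the facet
    return during;
}
```

So passing 1 keeps the facet object alive after the locale is gone, but it does not by itself make any stream use it.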

October 27

Tried to refresh my fork of gcc, but failed to pull:

gcc> git remote -v
origin	https://github.com/mgirod/gcc.git (fetch)
origin	https://github.com/mgirod/gcc.git (push)
upstream	https://github.com/gcc-mirror/gcc.git (fetch)
upstream	https://github.com/gcc-mirror/gcc.git (push)
gcc> git pull upstream master
...
Updating 9dec9a1..698c03a
error: Your local changes to the following files would be overwritten by merge:
	libgo/go/runtime/mapspeed_test.go
...
	libgomp/testsuite/libgomp.oacc-c++/non-scalar-data.C
Please, commit your changes or stash them before you can merge.
Aborting
gcc> git commit -m 'changes between mirrors' -a
[master b82f073] changes between mirrors
 2616 files changed, 484545 deletions(-)
...
gcc> git status
On branch master
Your branch is ahead of 'origin/master' by 1 commit.
  (use "git push" to publish your local commits)
nothing to commit, working directory clean
gcc> git push
...
  C-c C-c
gcc> git status
On branch master
Your branch is ahead of 'origin/master' by 1 commit.
  (use "git push" to publish your local commits)
nothing to commit, working directory clean
gcc> cd ..
mgirod> rm -rf gcc
mgirod> git clone [email protected]:mgirod/gcc.git
...
Receiving objects: 100% (2356011/2356011), 2.49 GiB | 1.17 MiB/s, done.
Connection to github.com closed by remote host.
...
Resolving deltas: 100% (1940070/1940070), done.
Checking connectivity... done.
Checking out files: 100% (81432/81432), done.
mgirod> cd gcc
gcc>  git remote add upstream [email protected]:gcc-mirror/gcc.git
gcc> git remote -v
origin	[email protected]:mgirod/gcc.git (fetch)
origin	[email protected]:mgirod/gcc.git (push)
upstream	[email protected]:gcc-mirror/gcc.git (fetch)
upstream	[email protected]:gcc-mirror/gcc.git (push)
gcc> git pull upstream master
...
Receiving objects: 100% (14662/14662), 29.93 MiB | 1.44 MiB/s, done.
Resolving deltas: 100% (11859/11859), completed with 3474 local objects.
From github.com:gcc-mirror/gcc
 * branch            master     -> FETCH_HEAD
 * [new branch]      master     -> upstream/master
Updating 9dec9a1..698c03a
Checking out files: 100% (4453/4453), done.
Fast-forward
 ChangeLog                                           |    45 +-
...
 create mode 100644 libstdc++-v3/testsuite/ext/new_allocator/eq.cc
gcc> git branch
* master
gcc> git status
On branch master
Your branch is ahead of 'origin/master' by 1215 commits.
  (use "git push" to publish your local commits)

It took 2.65 seconds to enumerate untracked files. 'status -uno'
may speed it up, but you have to be careful not to forget to add
new files yourself (see 'git help status').
nothing to commit, working directory clean
gcc> git config --global push.default simple
gcc> git push

October 28

Trying again to debug... I don't touch the source for now, i.e. I leave line 451 as it is (refs = 1 will keep the ctype object):

  locale loc(defloc, new ctype<UChar>(1));
But during the basic_ofstream construction, one initializes a new ctype object, this time with 0 (line 456 in basic_ios.h; _M_ctype is a member object of type ctype<char16_t>):

      basic_ios()
      : ios_base(), _M_tie(0), _M_fill(char_type()), _M_fill_init(false), 
	_M_streambuf(0), _M_ctype(0), _M_num_put(0), _M_num_get(0)
      { }
and the constructor is empty!?
My breakpoints get caught from the construction on line 451, but not from the one on line 452:

fstr> gdb fstr
(gdb) b 290
(gdb) b 275
(gdb) b 452
(gdb) run
Breakpoint 1, std::ctype<char16_t>::ctype (this=0xc5008, refs=1) at fstr.cc:290
290	  _M_c_locale_ctype(_S_get_c_locale()), _M_narrow_ok(false) { _M_initialize_ctype(); }
(gdb) info stack
#0  std::ctype<char16_t>::ctype (this=0xc5008, refs=1) at fstr.cc:290
#1  0x00013864 in main () at fstr.cc:451
(gdb) c
Breakpoint 2, std::ctype<char16_t>::_M_initialize_ctype (this=0xc5008) at fstr.cc:275
275	    for (i = 0; i < 128; ++i) {
(gdb) c
Breakpoint 3, main () at fstr.cc:452
452	  basic_ofstream<UChar> ufs("utest.txt", basic_ofstream<UChar>::out);
(gdb) c
Continuing.
[Inferior 1 (process 2067) exited normally]
Tried to solve this with forward declarations... Didn't help.
Commented out the loc construction, in case it was this which prevented the second construction, but no.
I am building with:

fstr> make CXXFLAGS="-std=c++11 -g -O0"
And yet I get:

(gdb) s
std::basic_ofstream<char16_t, std::char_traits<char16_t> >::basic_ofstream (this=0xbefff84c, __s=0x80bfc "utest.txt", __mode=std::_S_out, 
    __in_chrg=<optimized out>, __vtt_parm=<optimized out>) at /usr/include/c++/4.9/fstream:645
645	      : __ostream_type(), _M_filebuf()
Trying:

fstr> make CXXFLAGS="-std=c++11 -ggdb -Og"
Now, no breakpoint is caught!
Removed -Og, and got back to the original situation.
More forward declarations may help, esp:

  template<typename C, typename T> class basic_ios;
  template<> class basic_ios<UChar, char_traits<UChar> >;
But then, one must provide the definition before including ostream.
So I did. Now caught in the linker, for undefined symbols for the non inlined members...

fstr.o: In function `std::basic_ios<char16_t, std::char_traits<char16_t> >::setstate(std::_Ios_Iostate)':
/home/marc/git/tests/fstr/fstr.cc:138: undefined reference to `std::basic_ios<char16_t, std::char_traits<char16_t> >::clear(std::_Ios_Iostate)'
fstr.o: In function `std::basic_ofstream<char16_t, std::char_traits<char16_t> >::basic_ofstream(char const*, std::_Ios_Openmode)':
/usr/include/c++/4.9/fstream:647: undefined reference to `std::basic_ios<char16_t, std::char_traits<char16_t> >::init(std::basic_streambuf<char16_t, std::char_traits<char16_t> >*)'
fstr.o: In function `std::basic_ostream<char16_t, std::char_traits<char16_t> >::basic_ostream()':
/usr/include/c++/4.9/ostream:385: undefined reference to `std::basic_ios<char16_t, std::char_traits<char16_t> >::init(std::basic_streambuf<char16_t, std::char_traits<char16_t> >*)'
fstr.o: In function `std::basic_ofstream<char16_t, std::char_traits<char16_t> >::open(char const*, std::_Ios_Openmode)':
/usr/include/c++/4.9/fstream:724: undefined reference to `std::basic_ios<char16_t, std::char_traits<char16_t> >::clear(std::_Ios_Iostate)'
OK: I managed to provide inline implementations of these functions.
Now, maybe reimplementing them in terms of ICU?

(gdb) s
std::basic_ostream<char16_t, std::char_traits<char16_t> >::basic_ostream (this=0xbefff84c, 
    __vtt_parm=0x80e14 <VTT for std::basic_ofstream<char16_t, std::char_traits<char16_t> >+4>, __in_chrg=<optimized out>)
    at /usr/include/c++/4.9/ostream:385
385	      { this->init(0); }
...
std::has_facet<std::codecvt<char16_t, char, __mbstate_t> > (__loc=...) at /usr/include/c++/4.9/bits/locale_classes.tcc:106
106	      const size_t __i = _Facet::id._M_id();
(gdb) n
107	      const locale::facet** __facets = __loc._M_impl->_M_facets;
(gdb) p __i
$1 = 31
(gdb) s
110		      && dynamic_cast<const _Facet*>(__facets[__i]));
(gdb) p __facets[__i]
$2 = (const std::locale::facet *) 0xb6e1c784 <_nl_C_locobj>
(gdb) s
114	    }
(gdb) 
std::basic_filebuf<char16_t, std::char_traits<char16_t> >::basic_filebuf (this=0xbefff850) at /usr/include/c++/4.9/bits/fstream.tcc:89
89	    }
(gdb) p _M_codecvt
$3 = (const std::basic_filebuf<char16_t, std::char_traits<char16_t> >::__codecvt_type *) 0x0
Still nothing added to the utest.txt file.
It looks like the point where it fails is:

std::basic_filebuf<char16_t, std::char_traits<char16_t> >::xsputn (this=0xbefff850, __s=0x80bcc u"Il y a de la joie\n", 
    __n=18) at /usr/include/c++/4.9/bits/fstream.tcc:640
640	      streamsize __ret = 0;
(gdb) n
644	      const bool __testout = (_M_mode & ios_base::out
(gdb) 
645				      || _M_mode & ios_base::app);
(gdb) 
646	      if (__check_facet(_M_codecvt).always_noconv()
(gdb) p __testout
$15 = true
(gdb) n
0x000168cc in std::__ostream_insert<char16_t, std::char_traits<char16_t> > (__out=..., 
    __s=0x80bcc u"Il y a de la joie\n", __n=18) at /usr/include/c++/4.9/bits/ostream_insert.h:109
109		  __catch(...)
(gdb) info stack
#0  0x000168cc in std::__ostream_insert<char16_t, std::char_traits<char16_t> > (__out=..., 
    __s=0x80bcc u"Il y a de la joie\n", __n=18) at /usr/include/c++/4.9/bits/ostream_insert.h:109
#1  0x000154e8 in std::operator<< <char16_t, std::char_traits<char16_t> > (__out=..., __s=0x80bcc u"Il y a de la joie\n")
    at /usr/include/c++/4.9/ostream:518
#2  0x000135cc in main () at fstr.cc:632
Back, deeper inside:

std::basic_filebuf<char16_t, std::char_traits<char16_t> >::xsputn (this=0xbefff850, __s=0x80bcc u"Il y a de la joie\n", 
    __n=18) at /usr/include/c++/4.9/bits/fstream.tcc:640
640	      streamsize __ret = 0;
(gdb) n
644	      const bool __testout = (_M_mode & ios_base::out
(gdb) 
645				      || _M_mode & ios_base::app);
(gdb) 
646	      if (__check_facet(_M_codecvt).always_noconv()
(gdb) s
std::__check_facet<std::codecvt<char16_t, char, __mbstate_t> > (__f=0x0) at /usr/include/c++/4.9/bits/basic_ios.h:48
48	      if (!__f)
(gdb) p __f
$2 = (const std::codecvt<char16_t, char, __mbstate_t> *) 0x0
(gdb) info stack
#0  std::__check_facet<std::codecvt<char16_t, char, __mbstate_t> > (__f=0x0) at /usr/include/c++/4.9/bits/basic_ios.h:48
#1  0x00019ba0 in std::basic_filebuf<char16_t, std::char_traits<char16_t> >::xsputn (this=0xbefff850, 
    __s=0x80bcc u"Il y a de la joie\n", __n=18) at /usr/include/c++/4.9/bits/fstream.tcc:646
#2  0x0001521c in std::basic_streambuf<char16_t, std::char_traits<char16_t> >::sputn (this=0xbefff850, 
    __s=0x80bcc u"Il y a de la joie\n", __n=18) at /usr/include/c++/4.9/streambuf:451
#3  0x00014404 in std::__ostream_write<char16_t, std::char_traits<char16_t> > (out=..., s=0x80bcc u"Il y a de la joie\n", 
    n=18) at fstr.cc:620
#4  0x0001687c in std::__ostream_insert<char16_t, std::char_traits<char16_t> > (__out=..., 
    __s=0x80bcc u"Il y a de la joie\n", __n=18) at /usr/include/c++/4.9/bits/ostream_insert.h:101
#5  0x000154e8 in std::operator<< <char16_t, std::char_traits<char16_t> > (__out=..., __s=0x80bcc u"Il y a de la joie\n")
    at /usr/include/c++/4.9/ostream:518
#6  0x000135cc in main () at fstr.cc:632
This used __check_facet instead of my inline chkfac, and it returned 0, even if of the expected type.
It is not that it returns 0, it has a 0 _M_codecvt:

646	      if (__check_facet(_M_codecvt).always_noconv()
(gdb) p _M_codecvt
$1 = (const std::basic_filebuf<char16_t, std::char_traits<char16_t> >::__codecvt_type *) 0x0
This is a member of basic_filebuf
It is initialized in the constructor (fstream.tcc:88):

	_M_codecvt = &use_facet<__codecvt_type>(this->_M_buf_locale);
Except that for this to happen, has_facet must have returned true, and...

(gdb) s
std::has_facet<std::codecvt<char16_t, char, __mbstate_t> > (__loc=...) at /usr/include/c++/4.9/bits/locale_classes.tcc:106
106	      const size_t __i = _Facet::id._M_id();
(gdb) s
107	      const locale::facet** __facets = __loc._M_impl->_M_facets;
(gdb) s
110		      && dynamic_cast<const _Facet*>(__facets[__i]));
(gdb) s
114	    }
(gdb) s
std::basic_filebuf<char16_t, std::char_traits<char16_t> >::basic_filebuf (this=0xbefff850)
    at /usr/include/c++/4.9/bits/fstream.tcc:89
89	    }
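As long as the file buffer has no codecvt for char16_t, a fallback is to do the conversion by hand and write bytes through an ordinary char stream. This is not what fstr.cc does, only a sketch of that alternative: utf16_to_utf8 is a hypothetical helper, and replacing unpaired surrogates by U+FFFD is a policy chosen for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Encodes UTF-16 code units as UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: needs a low surrogate right after it.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;   // unpaired high surrogate
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;       // stray low surrogate
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The result can be sent to a plain std::ofstream, e.g. `std::ofstream("utest.txt") << utf16_to_utf8(u"Il y a de la joie\n");`. With ICU at hand, UnicodeString::toUTF8String does the same job.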

October 29

Next: the generic behaviour is not OK, in fact, it is not inline, not virtual,

195		ios_base::_M_init();
codecvt
This example compiles with GCC 8.1 (C++2a)...
One considered deprecating codecvt...
Looking for ios_base::_M_init implementation:

libstdc++-v3> git checkout gcc-4_9_4-release
...
HEAD is now at d319148... Mark as release
Found the code I was looking for in ios_locale.cc:

  // Called only by basic_ios<>::init.
  void
  ios_base::_M_init() throw()
  {
    // NB: May be called more than once
    _M_precision = 6;
    _M_width = 0;
    _M_flags = skipws | dec;
    _M_ios_locale = locale();
  }
which means that it is not what I was interested in.
I leave the gcc fork in the 4.9 release state for now.
But now, I missed the point where _Facet::id._M_id() was created:

std::has_facet<std::codecvt<char16_t, char, __mbstate_t> > (__loc=...)
    at /usr/include/c++/4.9/bits/locale_classes.tcc:106
106	      const size_t __i = _Facet::id._M_id();
which must be in my own code, since it is 31, instead of previously 28.
_Facet here is codecvt<UChar, char, uencodstate>
It is assigned to in locale::_Impl::_M_install_facet but this one just installs what it gets as argument, and that's when constructing the locale object, in ~/git/mgirod/gcc/libstdc++-v3/src/c++98/codecvt.cc.
Added a definition for the id object.
_M_buf_locale is a member of basic_streambuf, which is a base of basic_filebuf.
Added the explicit specialization, if only to be able to put a breakpoint.
Decided to clone my own fork into a 4.9 tree, in order to free gcc for updates:

mgirod> git clone gcc gcc-4.9
mgirod> cd gcc
gcc> git checkout master

November 25


fstr> make CXXFLAGS="-std=c++11 -g -O0"
make: Nothing to be done for 'all'.
fstr> gdb fstr
(gdb) b 294
(gdb) run
Breakpoint 1, std::basic_streambuf<char16_t, std::char_traits<char16_t> >::basic_streambuf (this=0xbefff850) at fstr.cc:296
296	      _M_out_cur(0), _M_out_end(0), _M_buf_locale(locale()) { }
(gdb) info stack
#0  std::basic_streambuf<char16_t, std::char_traits<char16_t> >::basic_streambuf (this=0xbefff850) at fstr.cc:296
#1  0x0001643c in std::basic_filebuf<char16_t, std::char_traits<char16_t> >::basic_filebuf (this=0xbefff850) at /usr/include/c++/4.9/bits/fstream.tcc:85
#2  0x0001532c in std::basic_ofstream<char16_t, std::char_traits<char16_t> >::basic_ofstream (this=0xbefff84c, __s=0x80bd0 "utest.txt", __mode=std::_S_out, 
    __in_chrg=<optimized out>, __vtt_parm=<optimized out>)
    at /usr/include/c++/4.9/fstream:645
#3  0x00013644 in main () at fstr.cc:636
(gdb) s
std::basic_filebuf<char16_t, std::char_traits<char16_t> >::basic_filebuf (
    this=0xbefff850) at /usr/include/c++/4.9/bits/fstream.tcc:87
87	      if (has_facet<__codecvt_type>(this->_M_buf_locale))
There, __codecvt_type is codecvt<UChar, char, CTUC::state_type>
CTUC::state_type is std::mbstate_t by default as _Char_types if not overridden/specialized for UChar.
mbstate is defined in cwchar as int[6].

December 8

For a couple of weeks, got into trouble: could not reindex htdig, and failed to update berry.
I ended up upgrading to the next OS version, stretch, although a mere reboot would probably have solved the cause of the problems.
However, one consequence was that the htdig.conf file had got corrupted, and it is only today that I was at last able to log in again over ssh, and regenerate a config file.

December 9

With stretch, the version of gcc is now 6.3.0
So, that's the version I check out in my reference repo.
Marc Girod