Discussion:
[bug #58796] preconv: want option to write traditional [g|t]roff special characters where possible
G. Branden Robinson
2020-07-21 11:27:58 UTC
URL:
<https://savannah.gnu.org/bugs/?58796>

Summary: preconv: want option to write traditional [g|t]roff
special characters where possible
Project: GNU troff
Submitted by: gbranden
Submitted on: Tue 21 Jul 2020 11:27:56 AM UTC
Category: Preprocessor preconv
Severity: 1 - Wish
Item Group: New feature
Status: Need Info
Privacy: Public
Assigned to: gbranden
Open/Closed: Open
Discussion Lock: Any
Planned Release: None

_______________________________________________________

Details:

preconv is a good thing, but its conversion of absolutely everything that isn't
US-ASCII into inscrutable hexadecimal Unicode code points means that it's not
well suited to "offline" source document conversion whose output would be
pleasant to maintain.

I'd like:

1. A flag, possibly -s, to convert non-ASCII code points to special character
escapes, e.g., \['e], where such entities exist. The Unicode-to-glyph list in
`src/libs/libgroff/uniglyph.cpp` might be helpful here.

2. A flag, possibly -t, to do the same except convert only those code points
documented in CSTR #54 to special character escapes, and use the old-fashioned
escape form, e.g., \('e.
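
A minimal sketch of the proposed conversion, in Python for illustration only
(the real tool is C++, and the mapping below is a tiny assumed excerpt; the
authoritative table lives in `src/libs/libgroff/uniglyph.cpp`):

```python
# Sketch of the proposed -s behavior: emit a named special-character
# escape where one exists, and fall back to the \[uXXXX] form that
# preconv emits today. A -t variant would restrict the table to the
# CSTR #54 set and emit the old \('e spelling instead.
SPECIAL = {
    0x00E9: r"\['e]",  # LATIN SMALL LETTER E WITH ACUTE
    0x00E8: r"\[`e]",  # LATIN SMALL LETTER E WITH GRAVE
    0x2014: r"\[em]",  # EM DASH
}

def to_special(text):
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                # US-ASCII passes through
        elif cp in SPECIAL:
            out.append(SPECIAL[cp])       # named special character
        else:
            out.append(r"\[u%04X]" % cp)  # preconv's current fallback
    return "".join(out)

print(to_special("caf\u00e9"))  # caf\['e]
```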

In the slightly longer term I'd like a Single Source of Truth for the special
character escapes, one that can be macro-processed (not by groff!) into
`uniglyph.cpp` and `glyphuni.cpp` as well as groff_char(7) and grops(1) (which
is where I think the PostScript column of the tables in groff_char(7) should
move).

Thoughts?

_______________________________________________________

Reply to this item at:

<https://savannah.gnu.org/bugs/?58796>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/
Dave
2020-07-25 20:02:19 UTC
Follow-up Comment #1, bug #58796 (project groff):

Stepping back a bit: preconv is essentially a hack to work around groff's
limitation of natively handling only Latin-1 input. As long-term strategies
go, addressing this limitation in core groff is a better fix than patching up
the interim tool.

To keep existing pipelines working, preconv would have to exist in some form
for a while, but if groff natively accepted UTF-8 input (bug #40720), preconv
could turn into a simple wrapper for iconv, instead of being essentially a
reimplementation of it but outputting groff-isms rather than standard
character sets. Fewer wheels reinvented.

And making groff speak UTF-8 shouldn't require wheel reinvention either. I'm
no C++ programmer, but surely the language has standard libraries to handle
UTF-8 that would just need to be plugged into the appropriate places in
groff's input handling. (I say "just" in complete ignorance of how big this
task actually is.)

Coders being in short supply here, it probably makes sense to devote this
limited resource to the best long-term solution.

However, I have to assume preconv exists at all because writing it from
scratch was once deemed substantially easier than updating groff's input
handling. So if retooling preconv remains the substantially easier task, I
have no quarrel with the proposals put forth here.

(Since groff does speak Latin-1, I'm actually not sure why preconv need emit
things like \['e] or \[u00E9] at all, rather than the more widely understood
Latin-1 character those things represent.)

Ingo Schwarze
2020-07-25 20:41:12 UTC
Follow-up Comment #2, bug #58796 (project groff):

Hi Dave,
Post by Dave
a bit of a hack
Not so much, actually. Making good use of pipes is among the design
principles of the whole roff ecosystem, to harmonize with the overall UNIX
design philosophy that every tool should solve one task only, but solve it
well and in a way that facilitates combination with the other tools. In this
sense, groff is actually more UNIXy than mandoc, which does integrate
preconv.
Post by Dave
wrapper for iconv
I would hate it if groff would start requiring iconv. I consider it an
important asset that so far, it does not.
Post by Dave
the language has standard libraries to handle UTF-8
Yes, indeed the C language contains a vast array of library functions to
deal with wide characters and with multibyte characters. But the design of
these C library facilities is atrocious, and using something else that is
non-standard would be even worse. Either way, rewriting a program to natively
support wide characters is usually an extremely tedious, extremely intrusive,
very time-consuming, and highly error-prone task. Even when done as designed,
it adds horrible complication to the code and makes the code much more
fragile. For small programs, ways exist to cheat one's way around these
notorious downsides; see my presentation at EuroBSDCon in Beograd a few years
ago. But i doubt something like that could be pulled off for a program as
large as groff, at least not easily.
Post by Dave
not sure why preconv need emit things like \['e] or \[u00E9] at all
Because single-byte 8-bit locales have been obsolete for many years, and some
operating systems don't even support them any longer. And even among people
using Linux, almost nobody uses LC_CTYPE=*.Latin-1 nowadays, so if preconv
emitted raw Latin-1, you could no longer look at its output with a pager. When
you do groff-specific encoding anyway, it's much better to encode all
non-ASCII characters and not force users to adopt an obsolete locale.


While in general, i hate adding options to programs, in particular when it can
be expected that they will be used rarely, i do see that an occasional need
for what Branden asks for might arise. When picking new options, please don't
forget to look at https://mandoc.bsd.lv/man/man.options.1.html - the groff/man
option space is seriously crowded already, and having several programs in a
single package or in two very closely related packages that all use the same
option letter but each one for a different purpose isn't user-friendly at
all.

Either way, i would judge this task as somewhat low-priority. The situation it
addresses, namely that you want to maintain the document source in US-ASCII
(which implies there are only occasional non-ASCII characters in it; otherwise
you would surely maintain the source in UTF-8 in the first place), yet enough
stray wide characters have crept in that you want to encode them automatically
rather than fix them one by one, may occasionally occur, but not all that
often, i think.

Dave
2020-07-30 17:11:08 UTC
Follow-up Comment #3, bug #58796 (project groff):

Thanks for the comments, Ingo. I understand and support the Unix philosophy,
but I disagree with some of your underlying assumptions.

If you developed a brand-new tool to do some text-processing task, something
designed to be used in pipelines with other tools, you could choose to specify
that:
a) the input character set of your tool be a Unicode encoding, or
b) the tool only take some subset of Unicode as input, and require another
tool to pipe in translations for the rest of Unicode, using a syntax invented
specifically for these tools and not standardized anywhere else.

If you chose (b) on the grounds "pipelines are more Unixy," this would not be
a popular choice. Requiring helper applications to understand modern
character sets is not inherently "the Unix way." It's a stopgap used for
historical applications whose cores do not (yet) speak Unicode.

Groff is a historical application. It will always support \['e] because it
must always be able to process historical documents that used such character
representations. But \['e] should in no way be considered the canonical way
to represent the Unicode character LATIN SMALL LETTER E WITH ACUTE. Unicode
gives us the canonical representation. \['e] and \[u00E9] are merely
additional, roff-specific ways to represent this character.
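
The point about representations can be made concrete. A short Python
illustration (for demonstration only) of the same character in the three
forms under discussion:

```python
# One character, three spellings: the general-purpose UTF-8 byte
# sequence that any Unicode-aware tool understands, and the two
# roff-specific escapes that only groff and friends understand.
ch = "\u00e9"                 # LATIN SMALL LETTER E WITH ACUTE
print(ch.encode("utf-8"))     # b'\xc3\xa9'
print(r"\['e]")               # groff named special character
print(r"\[u%04X]" % ord(ch))  # \[u00E9], preconv's numeric escape
```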

The "roff-specific" part is important: the entire Unix philosophy of pipelines
requires that all I/O be in as general a form as possible to be able to
interact with as wide a range of other programs as possible. groff and
preconv, by contrast, communicate in a secret code that no other tool uses.
That's not the Unix way; that's a band-aid to cover up something that Werner
identified as one of the four major areas of groff that needed to be updated
back in 2013. The need has not lessened in the intervening years.

That groff is a historical package does not absolve it from modern best
practices in software design. Looking to the long term, this is what we
should be striving for. preconv is a very useful bridge in the meantime; I
believe you that the task of converting historical C++ code to natively handle
UTF-8 input is big and messy.* Nonetheless it should be considered groff's
ultimate goal.

* I'm currently going through a similar process--on a much smaller
scale--with some Perl code. And Perl actually handles a lot of the logic
automatically that a C program would have to manually implement. I don't know
what C++'s facilities are like, but I do know that no matter how good the
language's design, you'll run into stupid problems
<http://www.perlmonks.org/?node_id=11119633> that will derail you for a few
hours.

[comment #2:]
Post by Ingo Schwarze
I would hate it if groff would start requiring iconv.
It's far better to leverage existing code that does what you need than to
re-implement the same logic in your own code. The principle "solve one task
only, but solve it well" ought to free the groff package from implementing its
own conversions between character encodings and let it instead focus on its
primary task.

Anyway, if groff handled Unicode I/O natively (and thus also ASCII, a subset
thereof), I wouldn't expect iconv to become an installation requirement; it
would be a run-time requirement for those users who need to feed in documents
in other character encodings.
Post by Ingo Schwarze
it's much better to encode all non-ASCII characters and not force users to
adopt an obsolete locale.

Good points here; I agree. I fell into the trap of looking at the encoding
groff currently natively handles, and not at the big picture.

Ingo Schwarze
2020-08-05 13:50:29 UTC
Follow-up Comment #4, bug #58796 (project groff):

Regarding note #3 (it's all a bit tangential to this ticket):

I meant requiring iconv(3) at compile time, not iconv(1) at run-time. The
former would be horrible, the latter almost harmless.

I don't think groff is really C++; it is more like C with some aspects of
classes. Nor am i aware of features in C++ to handle Unicode: when you handle
Unicode in C++, you just use C library features. And Unicode handling in Perl
and C is so different that i can hardly think of any commonalities, so talking
about Perl is really pointless here.

C features for handling Unicode are so bad that even for a new program, i
would seriously consider making it ASCII-only even today rather than using
them. When Kristaps started mandoc ten years ago, he made exactly that
decision, and i'm very grateful to him for that; it was a very wise decision.
Even though mandoc internally uses practically no C-library Unicode-handling
features and no other Unicode-handling library, it has practically perfect and
in particular extremely robust and simple Unicode support. I very much doubt
that converting an existing program like groff would be a good idea, even if
you had time to waste on a purely make-work project, in particular since
groff already has a way of handling Unicode that works reasonably well and is
simpler and more robust than anything you could do with the native C library
features.
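
For a sense of what "simple and robust without library features" can look
like, here is a hand-rolled UTF-8 decoder sketched in Python (purely
illustrative; mandoc's actual decoder is C, and this sketch omits the
overlong- and surrogate-rejection checks a production decoder needs):

```python
# Decode UTF-8 by hand: no locale, no wide-character library, just the
# bit layout from the encoding definition. Yields one code point per
# character; raises ValueError on a malformed byte sequence.
def decode_utf8(data):
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            cp, extra = b, 0          # 1-byte (ASCII)
        elif 0xC2 <= b <= 0xDF:
            cp, extra = b & 0x1F, 1   # 2-byte sequence
        elif 0xE0 <= b <= 0xEF:
            cp, extra = b & 0x0F, 2   # 3-byte sequence
        elif 0xF0 <= b <= 0xF4:
            cp, extra = b & 0x07, 3   # 4-byte sequence
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        for j in range(1, extra + 1):
            cont = data[i + j]
            if cont & 0xC0 != 0x80:
                raise ValueError("invalid continuation at offset %d" % (i + j))
            cp = (cp << 6) | (cont & 0x3F)
        i += extra + 1
        yield cp

print([hex(cp) for cp in decode_utf8("caf\u00e9".encode())])
# ['0x63', '0x61', '0x66', '0xe9']
```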

Of course, all this turns out to argue slightly in favour of Branden's idea,
but i still think this ticket isn't high priority.

G. Branden Robinson
2020-08-06 14:35:20 UTC
Follow-up Comment #5, bug #58796 (project groff):

Just a couple of miscellaneous data points here.

1. "I neither think that groff is really C++, it is more like C with some
aspects of classes"

My understanding from perusal of the sources over the past few years and a
hazy recollection of Stroustrup's 2nd edition is that groff is written in a
limited subset of C++ as it existed in 1990 or so. Maybe a little past the
1st edition of the C++ book, but not a lot. Templates are not used anywhere
I've seen. None of the newer cast operators (static_cast, const_cast,
reinterpret_cast, dynamic_cast) appear to be present.

There are still comments in the source code that refer to workarounds for
CFRONT.

2. Something no one ever seems to mention when talking about supporting
Unicode natively is which Normalization Form(s) we should support.

Werner LEMBERG
2020-08-06 15:48:31 UTC
Post by G. Branden Robinson
2. Something no one ever seems to mention when talking about supporting
Unicode natively is which Normalization Form(s) we should support.
Pfft. If you search for 'normalization' in the groff info manual, you
will find

For simplicity, all Unicode characters that are composites must be
decomposed maximally (this is normalization form D in the Unicode
standard);
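
The quoted rule is easy to demonstrate (a Python sketch for illustration;
groff itself does the decomposition internally):

```python
# NFD: the precomposed U+00E9 decomposes maximally into the base
# letter U+0065 plus the combining acute accent U+0301.
import unicodedata

composed = "\u00e9"
decomposed = unicodedata.normalize("NFD", composed)
print([hex(ord(c)) for c in decomposed])  # ['0x65', '0x301']
```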

:-)


Werner
G. Branden Robinson
2020-08-06 15:54:27 UTC
Post by Werner LEMBERG
Pfft. If you search for 'normalization' in the groff info manual, you
will find
For simplicity, all Unicode characters that are composites must be
decomposed maximally (this is normalization form D in the Unicode
standard);
Post by Werner LEMBERG
:-)
Thanks! I'm glad to be wrong, as NFD is my favorite anyway.

Please excuse my ignorance--I haven't gotten around to messing up that part of
our Texinfo manual yet. ;-)

