[bug #58930] take baby steps toward Unicode

Discussion:

Dave

2020-08-10 14:56:08 UTC

URL:
<https://savannah.gnu.org/bugs/?58930>

Summary: take baby steps toward Unicode
Project: GNU troff
Submitted by: barx
Submitted on: Mon 10 Aug 2020 09:56:06 AM CDT
Category: Core
Severity: 3 - Normal
Item Group: New feature
Status: None
Privacy: Public
Assigned to: None
Open/Closed: Open
Discussion Lock: Any
Planned Release: None

_______________________________________________________

Details:

One small change that would improve groff's Unicode support would be to
recognize Unicode versions of things groff already knows how to do.

Four examples:

==== U+00A0 NO-BREAK SPACE ====

This character is in the Latin-1 character set, which groff recognizes, and
when groff's input is in Latin-1 encoding, it correctly handles this character
(though I'm not certain whether it interprets it as "\~" or "\ ").

But if the input is some other encoding, preconv converts the character into
the string "\[u00A0]", which groff does _not_ recognize. In macro space, a
simple

.char \[u00A0] \~

is enough to take care of this; presumably the equivalent mechanism to make
the code handle it internally is just as simple.

==== U+200B ZERO WIDTH SPACE ====

This is another character implemented in an existing groff escape (\:) but
unrecognized as "\[u200B]".

In this case, the simple, obvious, elegant solution that worked above:

.char \[u200B] \:

stupidly, irritatingly, and undocumentedly doesn't work. (.char being unable
to map something to an escape, or at least to this particular escape, is
another bug--either in the implementation, or the lack of documentation of the
restriction--for another day.)

==== U+202F NARROW NO-BREAK SPACE ====

Groff has two nonbreaking thin spaces, \| and \^. It is perhaps unclear which
of these groff should map "\[u202F]" to, but either one would be an
improvement over its current mapping to the warning "can't find special
character `u202F'".

==== U+2011 NON-BREAKING HYPHEN ====

I deem this change "extra credit" as it's the least likely to be easily
implementable, groff syntax having no direct correlate. Groff can only (via
\%) make an entire "word" (sequence of non-whitespace, including hyphens)
unbreakable, but has no easy way to support a mix of breaking and nonbreaking
hyphens in the same word, such as making the first hyphen of "jack-in-the-box"
nonbreaking but the other two breakable. (This can be done with a mix of \%
and \: escapes, as "\%jack-in-\:the-\:box" -- or even, taking advantage of the
bug/quirk Branden discovered
<http://lists.gnu.org/archive/html/groff/2020-07/msg00047.html>, as
"\%jack-in-\:the-box" -- but this is not obvious.) So it's possible, but
convoluted, to represent "jack\[u2011]in-the-box" in groff syntax; whether
this means it's equally convoluted in the underlying code, or whether the code
actually does have the concept of a nonbreaking hyphen but just doesn't expose
a direct representation of it to user space, I cannot guess.
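
For illustration only, the first three proposals above can be summarized as a lookup table. The following is a minimal Python sketch, not groff code: the names PROPOSED and rewrite are hypothetical, and it simply rewrites preconv-style "\[uXXXX]" names into the existing escapes before groff would see them. Choosing \| for U+202F is an assumption (the text above leaves \| versus \^ open), and U+2011 is deliberately left untouched because no single escape corresponds to it.

```python
import re

# Hypothetical table: preconv-style special character names mapped to the
# existing groff escapes proposed in this report.
PROPOSED = {
    "u00A0": r"\~",  # NO-BREAK SPACE
    "u200B": r"\:",  # ZERO WIDTH SPACE
    "u202F": r"\|",  # NARROW NO-BREAK SPACE (assumed; \^ is the alternative)
}

# Matches names of the form \[uXXXX] with 4 to 6 uppercase hex digits.
_NAME = re.compile(r"\\\[(u[0-9A-F]{4,6})\]")


def rewrite(line: str) -> str:
    """Replace known \\[uXXXX] names; leave unknown ones as they are."""
    return _NAME.sub(lambda m: PROPOSED.get(m.group(1), m.group(0)), line)


print(rewrite(r"10\[u202F]km, jack\[u2011]in-the-box"))
```

A table like this only shows the intended correspondence; where the mapping belongs inside groff is a separate question.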

_______________________________________________________

Reply to this item at:

<https://savannah.gnu.org/bugs/?58930>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/

G. Branden Robinson

2020-08-14 10:00:02 UTC

Update of bug #58930 (project groff):

Status: None => Need Info
Assigned to: None => gbranden

_______________________________________________________

Follow-up Comment #1:

It's a little demoralizing that even these baby steps seem fraught with
complication.

1. "U+00A0 NO-BREAK SPACE

This character is in the Latin-1 character set, which groff recognizes, and
when groff's input is in Latin-1 encoding, it correctly handles this character
(though I'm not certain whether it interprets it as "\~" or "\ ")."

None of the above, it seems:

$ cat EXPERIMENTS/spaces.groff
.pl 1v
.if '\ '\ ' \eSP = \eSP
.if '\ '\~' \eSP = \e\[ti]
.if '\ '\[u00A0]' \eSP = \e[u00A0]
.br
.if '\~'\ ' \e\[ti] = \eSP
.if '\~'\~' \e\[ti] = \e\[ti]
.if '\~'\[u00A0]' \e\[ti] = \e[u00A0]
.br
.if '\[u00A0]'\ ' \e[u00A0] = \eSP
.if '\[u00A0]'\~' \e[u00A0] = \e\[ti]
.if '\[u00A0]'\[u00A0]' \e[u00A0] = \e[u00A0]
$ ./build/test-groff -Tutf8
\SP = \SP
\~ = \~
\[u00A0] = \[u00A0]

None of these are equivalent to the others. :-/

2. The behavior of \: when used as the RHS of a .char request does indeed seem
a bit strange. It looks like the transform is just not happening:

.pl 1v
.char \[u200B] \:
.ds a \[u200B]
.length i \*a
\ni
8

.pl 1v
.ds a \[u200B]
.length i \*a
\ni
8

.pl 1v
.char a b
.ds a a
\*a
b

That unchanged length of 8, the exact character count of "\[u200B]", is highly
suspicious to me.
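
The count itself is easy to confirm outside groff. A small Python check, used here only as a calculator; it confirms the arithmetic, not the hypothesis that .length is measuring the unexpanded name:

```python
# The special character name exactly as typed in the input, unexpanded.
name = r"\[u200B]"

# Count its characters: backslash, bracket, "u200B", bracket.
print(len(name))
```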

3. Narrow no-break space. Have you named all of the non-breaking spaces in
Unicode in this ticket? I know there are a bunch of others (hair space, thin
space, ideographic space, ...) but I don't know what their breaking semantics
are in Unicode.

4. A non-breaking hyphen would then be something that looks like \[hy] but
doesn't actually break? I don't know that this is actually the hardest of the
tasks on this list. You can just use the character as-is in input. groff
doesn't know it's a hyphen, and no hyphenation patterns include it, so it
never gets a break after it.

$ cat EXPERIMENTS/non-breaking-hyphen.groff
.pl 1v
.ds a a\[u2011]
.nr b 50 -1
.while \n+b \*a\c

troff: warning [p 1, 0.0i]: can't break line
a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑a‑

Let me know what you think of these findings.


Dave

2020-08-15 03:04:15 UTC

Follow-up Comment #2, bug #58930 (project groff):

[comment #1:]

Post by G. Branden Robinson
1. "U+00A0 NO-BREAK SPACE
None of these are equivalent to the others. :-/

"\~" and "\ " _shouldn't_ be equivalent; they're documented as behaving
differently.

The input string "\[u00A0]" being equivalent to neither of these is exactly
the problem this plank of this bug report is looking to solve.

It's only the character NO-BREAK SPACE in its Latin-1 form, which groff
accepts as direct input, that groff recognizes and interprets as a nonbreaking
space. groff_char(7) (which I only now thought to check) says it maps to \~.
But that appears to be less than 100% accurate:

$ LC_CTYPE=en_US.iso88591 printf ".if '\u00A0'\~' .tm equal\n" | groff
$

But the upshot is, however groff interprets a Latin-1 A0, it really ought to
interpret the form of that character emitted by preconv, \[u00A0],
identically.

Post by G. Branden Robinson
2. The behavior of \: when used as the RHS of a .char request
does indeed seem a bit strange.

Yeah, I really need to open a separate bug report for this, because it's
unrelated to everything else here.

Post by G. Branden Robinson
3. Narrow no-break space. Have you named all of the non-breaking
spaces in Unicode in this ticket?

No. I was intentionally trying to keep it simple and minimal. But it turns
out there are only three:

http://en.wikipedia.org/wiki/Whitespace_character#Unicode

So the only one I didn't cover was U+2007 FIGURE SPACE, which should map to
groff's (already nonbreaking) \0.
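
The count of three can be cross-checked against the Unicode character database. The sketch below is a rough proxy under stated assumptions: it uses the <noBreak> decomposition tag exposed by Python's standard unicodedata module rather than the line-breaking classes of UAX #14 (which the standard library does not expose), and the function name no_break_spaces is made up for this example.

```python
import sys
import unicodedata


def no_break_spaces():
    """Return code points of space separators (category Zs) whose
    UnicodeData decomposition carries the <noBreak> tag."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if (unicodedata.category(ch) == "Zs"
                and unicodedata.decomposition(ch).startswith("<noBreak>")):
            found.append(cp)
    return found


for cp in no_break_spaces():
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")
```

With the Unicode data bundled in current Python releases, this lists U+00A0, U+2007, and U+202F, matching the three named in this ticket.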

Post by G. Branden Robinson
there are a bunch of others (hair space, thin space, ideographic space,
...) but I don't know what their breaking semantics are in Unicode.

Irrational, IMO. Unicode considers U+2009 THIN SPACE and
U+200A HAIR SPACE breakable, for no good reason that I can see. Groff (quite
sensibly, since the concept is sort of absurd) does not offer breaking
versions of these spaces, and the only reason to add them would be strict
compliance with a Unicode property that probably no one who uses those code
points actually wants: I can't think of a single real-world use case for a
breaking thin space (though perhaps this is merely a failure of my
imagination).

This is all another can of worms I intentionally didn't address in what I
intended to be a simple change.

Post by G. Branden Robinson
4. A non-breaking hyphen would then be something that looks
like \[hy] but doesn't actually break?

Yes.

Post by G. Branden Robinson
You can just use the character as-is in input.

Ah, I guess you used -Tutf8 output, where that does work. (Somehow your groff
command got stripped from your comment.) All other output formats (notably
-Tps and -Tpdf) produce "warning: can't find special character 'u2011'".


Dave

2020-08-15 03:29:05 UTC

Follow-up Comment #3, bug #58930 (project groff):

[comment #1:]

Post by G. Branden Robinson
2. The behavior of \: when used as the RHS of a .char request
does indeed seem a bit strange.

Now its very own bug! Bug #58958.


G. Branden Robinson

2020-08-15 04:05:39 UTC

Follow-up Comment #4, bug #58930 (project groff):

[comment #2:]

Post by Dave
"\~" and "\ " _shouldn't_ be equivalent; they're documented as behaving
differently.

No, not suggesting they should, just lamenting the total disjunctivity of the
set.

Post by Dave
The input string "\[u00A0]" being equivalent to neither of these is exactly
the problem this plank of this bug report is looking to solve.

Post by Dave
It's only the character NO-BREAK SPACE in its Latin-1 form, which groff
accepts as direct input, that groff recognizes and interprets as a nonbreaking
space. groff_char(7) (which I only now thought to check) says it maps to \~.

Post by Dave
$ LC_CTYPE=en_US.iso88591 printf ".if '\u00A0'\~' .tm equal\n" | groff
$

But the upshot is, however groff interprets a Latin-1 A0, it really ought to
interpret the form of that character emitted by preconv, \[u00A0],
identically.

Yes, I think I agree here. I can't think of a more appropriate mapping for
it.

Post by Dave
So the only one I didn't cover was U+2007 FIGURE SPACE, which should map to
groff's (already nonbreaking) \0.

Might as well sweep that one into this report, then. Once the "where" to fix
this has been determined, the incremental effort to handle that one will
probably be tiny.

Post by Dave
Post by G. Branden Robinson
there are a bunch of others (hair space, thin space, ideographic space,
...) but I don't know what their breaking semantics are in Unicode.

Irrational, IMO. Unicode considers U+2009 THIN SPACE and
U+200A HAIR SPACE breakable, for no good reason that I can see. Groff
(quite sensibly, since the concept is sort of absurd) does not offer breaking
versions of these spaces, and the only reason to add them would be strict
compliance with a Unicode property that probably no one who uses those code
points actually wants: I can't think of a single real-world use case for a
breaking thin space (though perhaps this is merely a failure of my
imagination).

Well, I can't think of one either.

Post by Dave
This is all another can of worms I intentionally didn't address in what I
intended to be a simple change.

Hah. This is Sparta^Wgroff! Complexity rapidly ramifies.

Post by Dave
Post by G. Branden Robinson
4. A non-breaking hyphen would then be something that looks
like \[hy] but doesn't actually break?

Yes.

Post by G. Branden Robinson
You can just use the character as-is in input.

Ah, I guess you used -Tutf8 output, where that does work. (Somehow your
groff command got stripped from your comment.)

The "somehow" was me not thinking to include it.

Post by Dave
All other output formats (notably -Tps and -Tpdf) produce "warning: can't
find special character 'u2011'".

Okay, yes, that seems like another mapping issue.

And it appears to be a one-liner fix (morally).

tmac/pdf.tmac sources tmac/ps.tmac so the fix only has to be made in one
place.

I did have to goose the loop count in the test up to 100.

$ ./build/test-groff -Tps ./EXPERIMENTS/non-breaking-hyphen.groff >|/tmp/2011.ps
troff: warning [p 1, 0.0i]: can't break line
$ ./build/test-groff -Tpdf ./EXPERIMENTS/non-breaking-hyphen.groff >|/tmp/2011.pdf
troff: warning [p 1, 0.0i]: can't break line
$ cat ./EXPERIMENTS/non-breaking-hyphen.groff
.pl 1v
.ds a a\[u2011]
.nr b 100 -1
.while \n+b \*a\c
$ git di tmac/ps.tmac
diff --git a/tmac/ps.tmac b/tmac/ps.tmac
index 18928765..860919e1 100644
--- a/tmac/ps.tmac
+++ b/tmac/ps.tmac
@@ -28,6 +28,9 @@
 .
 .cflags 8 \[an]
 .
+\# non-breaking hyphen
+.fchar \[u2011] -
+.
 .char \[radicalex] \h'-\w'\[sr]'u'\[radicalex]\h'\w'\[sr]'u'
 .fchar \[sqrtex] \[radicalex]
 .char \[mo] \h'.08m'\[mo]\h'-.08m'


Dave

2020-08-15 17:38:02 UTC

Follow-up Comment #5, bug #58930 (project groff):

[comment #2:]

Post by Dave
groff_char(7) (which I only now thought to check) says it maps to \~.
But that appears to be less than 100% accurate:

On further investigation, it appears in fact to be 0% accurate. See bug
#58962.


G. Branden Robinson

2020-08-15 17:46:43 UTC

Follow-up Comment #6, bug #58930 (project groff):

[comment #5:]

Post by Dave
On further investigation, it appears in fact to be 0% accurate. See bug
#58962.

groff_char(7) is _full_ of problems with accuracy.

It's on my (s)hit list. I recently fixed up the introductory material but it
needs a lot more work.


Dave

2020-08-16 00:53:18 UTC

Follow-up Comment #7, bug #58930 (project groff):

[comment #4:]

Post by G. Branden Robinson
just lamenting the total disjunctivity of the set.

That two of the three, intended to serve different purposes, are disjunct
seems more laudable than lamentable. But I'm not here to police your
feelings.

Post by G. Branden Robinson
I can't think of a more appropriate mapping for it.

Well, if there were a more appropriate mapping for \[u00A0], that mapping
should also apply to the Latin-1 A0. They're the same character, just with
different input representations.

Speaking more generally, for a Latin-1 input file, "groff latin1.txt" and
"groff -Klatin1 latin1.txt" should produce identical output. Presently, for
this character they do not.

Post by G. Branden Robinson
Might as well sweep that one into this report, then.

As long as it doesn't change the billing, I won't complain about you doing
more work than I asked for.

Post by G. Branden Robinson
tmac/pdf.tmac sources tmac/ps.tmac so the fix only has to be made in one
place.

I should have said "notably but not limited to -Tps and -Tpdf." Fixing this
in the device-specific tmac file then requires duplicating that fix for at
least -Tascii, -Tlatin1, and the various -TX* devices, and I couldn't even
begin to guess about the more obscure legacy devices.

On the one hand, I get that \[u2011] is a character, and characters are mapped
to glyphs, and glyphs reside in fonts, and fonts are device-specific, so some
device-specific code seems a reasonable place to handle it.

But zooming out, the semantics of U+2011 NON-BREAKING HYPHEN are not
device-specific; as an output glyph, it is always identical (as you note) to
\[hy], or \[u2010]. What separates them is its behavior--and this should be
the same across all devices, suggesting it should be handled in a
device-independent section of the code.

I mean, I don't want to back-seat drive, and tell you your very simple
solution, which covers most output formats most people care about, isn't good
enough--except I guess I do, because that's kind of what I'm doing.


Dave

2020-08-20 05:23:18 UTC

Follow-up Comment #8, bug #58930 (project groff):

[comment #2:]

Post by Dave
Unicode considers U+2009 THIN SPACE and U+200A HAIR SPACE breakable...
Groff... does not offer breaking versions of these spaces, and the only
reason to add them would be strict compliance with a Unicode property
that probably no one who uses those code points actually wants

I believe my reasoning here was inaccurate. Although Unicode _allows_
breaking at a thin space or hair space, it does not _require_ it,* so groff
declining to treat these as break points does not violate Unicode compliance
at all. Thus I now propose that U+2009 THIN SPACE be mapped to groff's
(nonbreaking) \|, and U+200A HAIR SPACE to groff's (nonbreaking) \^.

* The gory details: Unicode line breaking is covered in "Unicode Standard
Annex #14: Unicode Line Breaking Algorithm"
(http://www.unicode.org/reports/tr14/tr14-45.html), whose introductory section
makes its scope clear: "Given an input text, [this algorithm] produces a set
of positions called 'break opportunities' that are appropriate points to begin
a new line. The selection of actual line break positions from the set of break
opportunities is not covered by the Unicode Line Breaking Algorithm, but is in
the domain of higher level software." Groff declining to break at points that
Unicode specifies as "break opportunities" is perfectly in line with this.
