fix: Fix hyphenation and rendering of decomposed characters (#1037)

## Summary

* This PR fixes decomposed diacritic handling end-to-end:
- Hyphenation: normalize common Latin base+combining sequences to
precomposed codepoints before Liang pattern matching, so decomposed
words hyphenate correctly
- Rendering: correct combining-mark placement logic so non-spacing marks
are attached to the preceding base glyph in normal and rotated text
rendering paths, with corresponding text-bounds consistency updates.
- Hyphenation around non breaking space variants have been fixed (and
extended)
- Hyphenation of terms that already included of hyphens were fixed to
include Liang pattern application (eg "US-Satellitensystem" was
*exclusively* broken at the existing hyphen)

## Additional Context

* Before
<img width="800" height="480" alt="2"
src="https://github.com/user-attachments/assets/b9c515c4-ab75-45cc-8b52-f4d86bce519d"
/>


* After
<img width="480" height="800" alt="fix1"
src="https://github.com/user-attachments/assets/4999f6a8-f51c-4c0a-b144-f153f77ddb57"
/>
<img width="800" height="480" alt="fix2"
src="https://github.com/user-attachments/assets/7355126b-80c7-441f-b390-4e0897ee3fb6"
/>

* Note 1: the hyphenation fix is not a 100% bullet proof implementation.
It adds composition of *common* base+combining sequences (e.g. O +
U+0308 -> Ö) during codepoint collection. A complete solution would
require implementing proper Unicode normalization (at least NFC,
possibly NFKC in specific cases) before hyphenation and rendering,
instead of hand-mapping a few combining marks. That was beyond the scope
of this fix.

* Note 2: the render fix should be universal and not limited to the
constraints outlined above: it properly x-centers the compund glyph over
the previous one, and it uses at least 1pt of visual distance in y.

Before:
<img width="478" height="167" alt="Image"
src="https://github.com/user-attachments/assets/f8db60d5-35b1-4477-96d0-5003b4e4a2a1"
/>

After: 
<img width="479" height="180" alt="Image"
src="https://github.com/user-attachments/assets/1b48ef97-3a77-475a-8522-23f4aca8e904"
/>

* This should resolve the issues described in #998 
---

### AI Usage

While CrossPoint doesn't have restrictions on AI tools in contributing,
please be transparent about their usage as it
helps set the right context for reviewers.

Did you use AI tools to help write this code? _**PARTIALLY**_
This commit is contained in:
jpirnay
2026-02-22 03:11:07 +01:00
committed by GitHub
parent 10a2678584
commit 5f5561b684
8 changed files with 459 additions and 29 deletions

View File

@@ -174,6 +174,213 @@ std::vector<CodepointInfo> collectCodepoints(const std::string& word) {
while (*ptr != 0) {
const unsigned char* current = ptr;
const uint32_t cp = utf8NextCodepoint(&ptr);
// If this is a combining diacritic (e.g., U+0301 = acute) and there's
// a previous base character that can be composed into a single
// precomposed Unicode scalar (Latin-1 / Latin-Extended), do that
// composition here. This provides lightweight NFC-like behavior for
// common Western European diacritics (acute, grave, circumflex, tilde,
// diaeresis, cedilla) without pulling in a full Unicode normalization
// library.
if (!cps.empty()) {
uint32_t prev = cps.back().value;
uint32_t composed = 0;
switch (cp) {
case 0x0300: // grave
switch (prev) {
case 0x0041:
composed = 0x00C0;
break; // A -> À
case 0x0061:
composed = 0x00E0;
break; // a -> à
case 0x0045:
composed = 0x00C8;
break; // E -> È
case 0x0065:
composed = 0x00E8;
break; // e -> è
case 0x0049:
composed = 0x00CC;
break; // I -> Ì
case 0x0069:
composed = 0x00EC;
break; // i -> ì
case 0x004F:
composed = 0x00D2;
break; // O -> Ò
case 0x006F:
composed = 0x00F2;
break; // o -> ò
case 0x0055:
composed = 0x00D9;
break; // U -> Ù
case 0x0075:
composed = 0x00F9;
break; // u -> ù
default:
break;
}
break;
case 0x0301: // acute
switch (prev) {
case 0x0041:
composed = 0x00C1;
break; // A -> Á
case 0x0061:
composed = 0x00E1;
break; // a -> á
case 0x0045:
composed = 0x00C9;
break; // E -> É
case 0x0065:
composed = 0x00E9;
break; // e -> é
case 0x0049:
composed = 0x00CD;
break; // I -> Í
case 0x0069:
composed = 0x00ED;
break; // i -> í
case 0x004F:
composed = 0x00D3;
break; // O -> Ó
case 0x006F:
composed = 0x00F3;
break; // o -> ó
case 0x0055:
composed = 0x00DA;
break; // U -> Ú
case 0x0075:
composed = 0x00FA;
break; // u -> ú
case 0x0059:
composed = 0x00DD;
break; // Y -> Ý
case 0x0079:
composed = 0x00FD;
break; // y -> ý
default:
break;
}
break;
case 0x0302: // circumflex
switch (prev) {
case 0x0041:
composed = 0x00C2;
break; // A -> Â
case 0x0061:
composed = 0x00E2;
break; // a -> â
case 0x0045:
composed = 0x00CA;
break; // E -> Ê
case 0x0065:
composed = 0x00EA;
break; // e -> ê
case 0x0049:
composed = 0x00CE;
break; // I -> Î
case 0x0069:
composed = 0x00EE;
break; // i -> î
case 0x004F:
composed = 0x00D4;
break; // O -> Ô
case 0x006F:
composed = 0x00F4;
break; // o -> ô
case 0x0055:
composed = 0x00DB;
break; // U -> Û
case 0x0075:
composed = 0x00FB;
break; // u -> û
default:
break;
}
break;
case 0x0303: // tilde
switch (prev) {
case 0x0041:
composed = 0x00C3;
break; // A -> Ã
case 0x0061:
composed = 0x00E3;
break; // a -> ã
case 0x004E:
composed = 0x00D1;
break; // N -> Ñ
case 0x006E:
composed = 0x00F1;
break; // n -> ñ
default:
break;
}
break;
case 0x0308: // diaeresis/umlaut
switch (prev) {
case 0x0041:
composed = 0x00C4;
break; // A -> Ä
case 0x0061:
composed = 0x00E4;
break; // a -> ä
case 0x0045:
composed = 0x00CB;
break; // E -> Ë
case 0x0065:
composed = 0x00EB;
break; // e -> ë
case 0x0049:
composed = 0x00CF;
break; // I -> Ï
case 0x0069:
composed = 0x00EF;
break; // i -> ï
case 0x004F:
composed = 0x00D6;
break; // O -> Ö
case 0x006F:
composed = 0x00F6;
break; // o -> ö
case 0x0055:
composed = 0x00DC;
break; // U -> Ü
case 0x0075:
composed = 0x00FC;
break; // u -> ü
case 0x0059:
composed = 0x0178;
break; // Y -> Ÿ
case 0x0079:
composed = 0x00FF;
break; // y -> ÿ
default:
break;
}
break;
case 0x0327: // cedilla
switch (prev) {
case 0x0043:
composed = 0x00C7;
break; // C -> Ç
case 0x0063:
composed = 0x00E7;
break; // c -> ç
default:
break;
}
break;
default:
break;
}
if (composed != 0) {
cps.back().value = composed;
continue; // skip pushing the combining mark itself
}
}
cps.push_back({cp, static_cast<size_t>(current - base)});
}