lib/Utf8/Utf8.h

#pragma once

#include <cstdint>
#include <string>
#define REPLACEMENT_GLYPH 0xFFFD

uint32_t utf8NextCodepoint(const unsigned char** string);
// Remove the last UTF-8 codepoint from a std::string and return the new size.
size_t utf8RemoveLastChar(std::string& str);
// Truncate string by removing N UTF-8 codepoints from the end.
void utf8TruncateChars(std::string& str, size_t numChars);

// Truncate a raw char buffer to the last complete UTF-8 codepoint boundary.
// Returns the new length (<= len). If the buffer ends mid-sequence, the
// incomplete trailing bytes are excluded.
int utf8SafeTruncateBuffer(const char* buf, int len);

// Returns true for Unicode combining diacritical marks that should not advance the cursor.
inline bool utf8IsCombiningMark(const uint32_t cp) {
  return (cp >= 0x0300 && cp <= 0x036F)      // Combining Diacritical Marks
         || (cp >= 0x1DC0 && cp <= 0x1DFF)   // Combining Diacritical Marks Supplement
         || (cp >= 0x20D0 && cp <= 0x20FF)   // Combining Diacritical Marks for Symbols
         || (cp >= 0xFE20 && cp <= 0xFE2F);  // Combining Half Marks
}
Public release 2025-12-03 22:00:29 +11:00			`#pragma once`

			`#include <cstdint>`
fix: truncating chapter titles using UTF-8 safe function (#599) ## Summary * Truncating chapter titles using utf8 safe functions (Cyrillic titles were split mid codepoint) * refactoring of lib/Utf8 --- ### AI Usage While CrossPoint doesn't have restrictions on AI tools in contributing, please be transparent about their usage as it helps set the right context for reviewers. Did you use AI tools to help write this code? _< PARTIALLY >_ 2026-02-01 16:23:48 +05:00			`#include <string>`
fix: render U+FFFD replacement character instead of ? (#366) The current behavior of rendering `?` for an unknown Unicode character can be hard to distinguish from a typo. Use the standard Unicode "replacement character" instead, that's what it's designed for: https://en.wikipedia.org/wiki/Specials_(Unicode_block) I'm making this PR as a draft because I'm not sure I did everything that was needed to change the character set covered by the fonts. Running that script is in its own commit. If this is proper, I'll rebase/squash into one commit and un-draft. Co-authored-by: Maeve Andrews <maeve@git.mail.maeveandrews.com> 2026-01-19 05:58:43 -06:00			`#define REPLACEMENT_GLYPH 0xFFFD`

Public release 2025-12-03 22:00:29 +11:00			`uint32_t utf8NextCodepoint(const unsigned char** string);`
fix: truncating chapter titles using UTF-8 safe function (#599) ## Summary * Truncating chapter titles using utf8 safe functions (Cyrillic titles were split mid codepoint) * refactoring of lib/Utf8 --- ### AI Usage While CrossPoint doesn't have restrictions on AI tools in contributing, please be transparent about their usage as it helps set the right context for reviewers. Did you use AI tools to help write this code? _< PARTIALLY >_ 2026-02-01 16:23:48 +05:00			`// Remove the last UTF-8 codepoint from a std::string and return the new size.`
			`size_t utf8RemoveLastChar(std::string& str);`
			`// Truncate string by removing N UTF-8 codepoints from the end.`
			`void utf8TruncateChars(std::string& str, size_t numChars);`
fix: Fix hyphenation and rendering of decomposed characters (#1037) ## Summary * This PR fixes decomposed diacritic handling end-to-end: - Hyphenation: normalize common Latin base+combining sequences to precomposed codepoints before Liang pattern matching, so decomposed words hyphenate correctly - Rendering: correct combining-mark placement logic so non-spacing marks are attached to the preceding base glyph in normal and rotated text rendering paths, with corresponding text-bounds consistency updates. - Hyphenation around non breaking space variants have been fixed (and extended) - Hyphenation of terms that already included of hyphens were fixed to include Liang pattern application (eg "US-Satellitensystem" was exclusively broken at the existing hyphen) ## Additional Context * Before <img width="800" height="480" alt="2" src="https://github.com/user-attachments/assets/b9c515c4-ab75-45cc-8b52-f4d86bce519d" /> * After <img width="480" height="800" alt="fix1" src="https://github.com/user-attachments/assets/4999f6a8-f51c-4c0a-b144-f153f77ddb57" /> <img width="800" height="480" alt="fix2" src="https://github.com/user-attachments/assets/7355126b-80c7-441f-b390-4e0897ee3fb6" /> * Note 1: the hyphenation fix is not a 100% bullet proof implementation. It adds composition of common base+combining sequences (e.g. O + U+0308 -> Ö) during codepoint collection. A complete solution would require implementing proper Unicode normalization (at least NFC, possibly NFKC in specific cases) before hyphenation and rendering, instead of hand-mapping a few combining marks. That was beyond the scope of this fix. * Note 2: the render fix should be universal and not limited to the constraints outlined above: it properly x-centers the compund glyph over the previous one, and it uses at least 1pt of visual distance in y. Before: <img width="478" height="167" alt="Image" src="https://github.com/user-attachments/assets/f8db60d5-35b1-4477-96d0-5003b4e4a2a1" /> After: <img width="479" height="180" alt="Image" src="https://github.com/user-attachments/assets/1b48ef97-3a77-475a-8522-23f4aca8e904" /> * This should resolve the issues described in #998 --- ### AI Usage While CrossPoint doesn't have restrictions on AI tools in contributing, please be transparent about their usage as it helps set the right context for reviewers. Did you use AI tools to help write this code? _PARTIALLY_ 2026-02-22 03:11:07 +01:00
perf: font-compression improvements (#1056) ## Purpose This PR includes some preparatory changes that are needed for an upcoming performant CJK font feature. The changes have no impact on render time and heap allocation for latin text. Despite this, I think these changes stand on their own as a better font compression/decompression implementation. ## Summary - Font decompressor rewrite: Replaced the 4-slot LRU group cache with a two-tier system — a page buffer (glyphs prewarmed before rendering begins) and a hot-group fallback (last decompressed group retained for non-prewarmed glyphs). - Byte-aligned compressed bitmap format: Glyph bitmaps within compressed groups are now stored row-padded rather than tightly packed before DEFLATE compression, improving compression ratios by making identical pixel rows produce identical byte patterns. Glyphs are compacted back to packed format on demand at render time. Reduces flash size by 155 KB. - Page prewarm system: Added `Page::collectText` and `Page::getDominantStyle` to extract per-style glyph requirements before rendering, and `GfxRenderer::prewarmFontCache` to pre-decompress only the groups needed for the dominant style — eliminating mid-render decompression for the common case. - UTF-8 robustness fixes: `utf8NextCodepoint` now validates continuation bytes and returns a replacement glyph on malformed input; `ChapterHtmlSlimParser` correctly preserves incomplete multi-byte sequences across word-buffer flush boundaries rather than splitting them. --- ### AI Usage While CrossPoint doesn't have restrictions on AI tools in contributing, please be transparent about their usage as it helps set the right context for reviewers. Did you use AI tools to help write this code? _YES_ Architecture and design was done by me, refined a bit by Claude. Code mostly by Claude, but not entirely. 2026-03-12 07:05:46 +11:00			`// Truncate a raw char buffer to the last complete UTF-8 codepoint boundary.`
			`// Returns the new length (<= len). If the buffer ends mid-sequence, the`
			`// incomplete trailing bytes are excluded.`
			`int utf8SafeTruncateBuffer(const char* buf, int len);`

fix: Fix hyphenation and rendering of decomposed characters (#1037) ## Summary * This PR fixes decomposed diacritic handling end-to-end: - Hyphenation: normalize common Latin base+combining sequences to precomposed codepoints before Liang pattern matching, so decomposed words hyphenate correctly - Rendering: correct combining-mark placement logic so non-spacing marks are attached to the preceding base glyph in normal and rotated text rendering paths, with corresponding text-bounds consistency updates. - Hyphenation around non breaking space variants have been fixed (and extended) - Hyphenation of terms that already included of hyphens were fixed to include Liang pattern application (eg "US-Satellitensystem" was exclusively broken at the existing hyphen) ## Additional Context * Before <img width="800" height="480" alt="2" src="https://github.com/user-attachments/assets/b9c515c4-ab75-45cc-8b52-f4d86bce519d" /> * After <img width="480" height="800" alt="fix1" src="https://github.com/user-attachments/assets/4999f6a8-f51c-4c0a-b144-f153f77ddb57" /> <img width="800" height="480" alt="fix2" src="https://github.com/user-attachments/assets/7355126b-80c7-441f-b390-4e0897ee3fb6" /> * Note 1: the hyphenation fix is not a 100% bullet proof implementation. It adds composition of common base+combining sequences (e.g. O + U+0308 -> Ö) during codepoint collection. A complete solution would require implementing proper Unicode normalization (at least NFC, possibly NFKC in specific cases) before hyphenation and rendering, instead of hand-mapping a few combining marks. That was beyond the scope of this fix. * Note 2: the render fix should be universal and not limited to the constraints outlined above: it properly x-centers the compund glyph over the previous one, and it uses at least 1pt of visual distance in y. Before: <img width="478" height="167" alt="Image" src="https://github.com/user-attachments/assets/f8db60d5-35b1-4477-96d0-5003b4e4a2a1" /> After: <img width="479" height="180" alt="Image" src="https://github.com/user-attachments/assets/1b48ef97-3a77-475a-8522-23f4aca8e904" /> * This should resolve the issues described in #998 --- ### AI Usage While CrossPoint doesn't have restrictions on AI tools in contributing, please be transparent about their usage as it helps set the right context for reviewers. Did you use AI tools to help write this code? _PARTIALLY_ 2026-02-22 03:11:07 +01:00			`// Returns true for Unicode combining diacritical marks that should not advance the cursor.`
			`inline bool utf8IsCombiningMark(const uint32_t cp) {`
			`return (cp >= 0x0300 && cp <= 0x036F) // Combining Diacritical Marks`
			`\|\| (cp >= 0x1DC0 && cp <= 0x1DFF) // Combining Diacritical Marks Supplement`
			`\|\| (cp >= 0x20D0 && cp <= 0x20FF) // Combining Diacritical Marks for Symbols`
			`\|\| (cp >= 0xFE20 && cp <= 0xFE2F); // Combining Half Marks`
			`}`