perf: font-compression improvements (#1056)

## Purpose

This PR includes preparatory changes needed for an upcoming performant
CJK font feature. The changes have no impact on render time or heap
allocation for Latin text. **Even so, I think these changes stand on
their own as a better font compression/decompression implementation.**

## Summary

- Font decompressor rewrite: Replaced the 4-slot LRU group cache with a
  two-tier system consisting of a page buffer (glyphs prewarmed before
  rendering begins) and a hot-group fallback (the last decompressed group,
  retained for non-prewarmed glyphs).
- Byte-aligned compressed bitmap format: Glyph bitmaps within compressed
  groups are now stored row-padded rather than tightly packed before
  DEFLATE compression, improving compression ratios by making identical
  pixel rows produce identical byte patterns. Glyphs are compacted back to
  packed format on demand at render time. Reduces flash size by 155 KB.
- Page prewarm system: Added `Page::collectText` and
  `Page::getDominantStyle` to extract per-style glyph requirements before
  rendering, and `GfxRenderer::prewarmFontCache` to pre-decompress only
  the groups needed for the dominant style, eliminating mid-render
  decompression for the common case.
- UTF-8 robustness fixes: `utf8NextCodepoint` now validates continuation
  bytes and returns a replacement glyph on malformed input;
  `ChapterHtmlSlimParser` correctly preserves incomplete multi-byte
  sequences across word-buffer flush boundaries rather than splitting
  them.
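
To make the row-padded versus packed distinction concrete, here is a minimal sketch of the render-time compaction step. It is not the code from this PR: the helper name `compactRows`, the 1-bit-per-pixel depth, and the MSB-first bit order are all assumptions chosen for illustration, and the real glyph format may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the PR): convert a row-padded 1-bpp bitmap,
// where every row starts on a byte boundary, into a tightly packed bitmap,
// where rows follow each other bit by bit. Bit order is assumed MSB-first.
std::vector<uint8_t> compactRows(const std::vector<uint8_t>& padded, int width, int height) {
  const size_t stride = (static_cast<size_t>(width) + 7) / 8;  // bytes per padded row
  assert(padded.size() >= stride * static_cast<size_t>(height));

  const size_t totalBits = static_cast<size_t>(width) * static_cast<size_t>(height);
  std::vector<uint8_t> packed((totalBits + 7) / 8, 0);

  size_t outBit = 0;  // index of the next bit to write in the packed output
  for (int y = 0; y < height; y++) {
    const uint8_t* row = padded.data() + static_cast<size_t>(y) * stride;
    for (int x = 0; x < width; x++, outBit++) {
      const bool on = ((row[x / 8] >> (7 - (x % 8))) & 1) != 0;
      if (on) {
        packed[outBit / 8] |= static_cast<uint8_t>(0x80u >> (outBit % 8));
      }
    }
  }
  return packed;
}
```

For a 5-pixel-wide, 2-row glyph whose rows are both `10101`, the padded form is two identical bytes (`0xA8 0xA8`), which is the kind of repetition DEFLATE can exploit. The packed form of the same glyph is `0xAD 0x40`, where the second row no longer shares a byte pattern with the first.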

---

### AI Usage

While CrossPoint doesn't have restrictions on AI tools in contributing,
please be transparent about their usage as it
helps set the right context for reviewers.

Did you use AI tools to help write this code? _**YES**_. Architecture and
design were done by me and refined a bit by Claude. The code is mostly by
Claude, but not entirely.
This commit is contained in:
Adrian Wilkins-Caruana
2026-03-12 07:05:46 +11:00
committed by GitHub
parent b467ea7973
commit f1e9dc7f30
70 changed files with 104438 additions and 120059 deletions


@@ -13,23 +13,69 @@ uint32_t utf8NextCodepoint(const unsigned char** string) {
     return 0;
   }
-  const int bytes = utf8CodepointLen(**string);
+  const unsigned char lead = **string;
+  const int bytes = utf8CodepointLen(lead);
   const uint8_t* chr = *string;
-  *string += bytes;
+  // Invalid lead byte (stray continuation byte 0x80-0xBF, or 0xFE/0xFF)
+  if (bytes == 1 && lead >= 0x80) {
+    (*string)++;
+    return REPLACEMENT_GLYPH;
+  }
   if (bytes == 1) {
+    (*string)++;
     return chr[0];
   }
+  // Validate continuation bytes before consuming them
+  for (int i = 1; i < bytes; i++) {
+    if ((chr[i] & 0xC0) != 0x80) {
+      // Missing or invalid continuation byte; skip all bytes consumed so far
+      *string += i;
+      return REPLACEMENT_GLYPH;
+    }
+  }
   uint32_t cp = chr[0] & ((1 << (7 - bytes)) - 1);  // mask header bits
   for (int i = 1; i < bytes; i++) {
     cp = (cp << 6) | (chr[i] & 0x3F);
   }
+  // Reject overlong encodings, surrogates, and out-of-range values
+  const bool overlong = (bytes == 2 && cp < 0x80) || (bytes == 3 && cp < 0x800) || (bytes == 4 && cp < 0x10000);
+  const bool surrogate = (cp >= 0xD800 && cp <= 0xDFFF);
+  if (overlong || surrogate || cp > 0x10FFFF) {
+    (*string)++;
+    return REPLACEMENT_GLYPH;
+  }
+  *string += bytes;
   return cp;
 }
+int utf8SafeTruncateBuffer(const char* buf, int len) {
+  if (len <= 0) return 0;
+  // Walk back past continuation bytes (10xxxxxx) to find the lead byte
+  int leadPos = len - 1;
+  while (leadPos > 0 && (static_cast<uint8_t>(buf[leadPos]) & 0xC0) == 0x80) {
+    leadPos--;
+  }
+  // Determine expected length of the sequence starting at leadPos
+  int expectedLen = utf8CodepointLen(static_cast<unsigned char>(buf[leadPos]));
+  int actualLen = len - leadPos;
+  if (actualLen < expectedLen && leadPos > 0) {
+    // Incomplete UTF-8 sequence at the end; exclude it
+    return leadPos;
+  }
+  return len;
+}
 size_t utf8RemoveLastChar(std::string& str) {
   if (str.empty()) return 0;
   size_t pos = str.size() - 1;
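
The sketch below shows how a caller might use `utf8SafeTruncateBuffer` to carry an incomplete multi-byte sequence across a word-buffer flush. It is an illustration under stated assumptions, not the parser code from this commit: `flushWordBuffer` is a hypothetical name, and the body of `utf8CodepointLen` is not shown in this diff, so the version here is an assumed stand-in.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Assumed stand-in for utf8CodepointLen (its real body is not in this diff):
// sequence length implied by the lead byte, 1 for anything unrecognized.
static int utf8CodepointLen(unsigned char c) {
  if (c < 0x80) return 1;
  if ((c >> 5) == 0x06) return 2;  // 110xxxxx
  if ((c >> 4) == 0x0E) return 3;  // 1110xxxx
  if ((c >> 3) == 0x1E) return 4;  // 11110xxx
  return 1;
}

// Same logic as the function added in this commit.
int utf8SafeTruncateBuffer(const char* buf, int len) {
  if (len <= 0) return 0;
  int leadPos = len - 1;
  while (leadPos > 0 && (static_cast<uint8_t>(buf[leadPos]) & 0xC0) == 0x80) {
    leadPos--;
  }
  const int expectedLen = utf8CodepointLen(static_cast<unsigned char>(buf[leadPos]));
  const int actualLen = len - leadPos;
  if (actualLen < expectedLen && leadPos > 0) {
    return leadPos;  // trailing sequence is incomplete, so leave it out
  }
  return len;
}

// Hypothetical flush routine: append only whole codepoints to `out`, move the
// incomplete tail (if any) to the front of `buf`, and return how many bytes
// were carried over for the next fill.
int flushWordBuffer(char* buf, int len, std::string& out) {
  const int safeLen = utf8SafeTruncateBuffer(buf, len);
  out.append(buf, static_cast<size_t>(safeLen));
  const int carried = len - safeLen;
  std::memmove(buf, buf + safeLen, static_cast<size_t>(carried));
  return carried;
}
```

If the buffer holds `caf` followed by the lone lead byte `0xC3`, the first flush emits `caf` and carries one byte; once `0xA9` arrives, the next flush emits the complete `é`. Note that because of the `leadPos > 0` condition, a buffer consisting solely of an incomplete sequence is returned at full length rather than truncated to zero, so a caller like this one would still emit it.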


@@ -10,6 +10,11 @@ size_t utf8RemoveLastChar(std::string& str);
 // Truncate string by removing N UTF-8 codepoints from the end.
 void utf8TruncateChars(std::string& str, size_t numChars);
+// Truncate a raw char buffer to the last complete UTF-8 codepoint boundary.
+// Returns the new length (<= len). If the buffer ends mid-sequence, the
+// incomplete trailing bytes are excluded.
+int utf8SafeTruncateBuffer(const char* buf, int len);
 // Returns true for Unicode combining diacritical marks that should not advance the cursor.
 inline bool utf8IsCombiningMark(const uint32_t cp) {
   return (cp >= 0x0300 && cp <= 0x036F)  // Combining Diacritical Marks