lib/Utf8/Utf8.cpp

#include "Utf8.h"

#include <cstddef>
#include <cstdint>
#include <string>

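// Returns the sequence length (1-4) implied by a UTF-8 lead byte.
// Invalid lead bytes report 1 so callers can skip them one byte at a time.
int utf8CodepointLen(const unsigned char c) {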
  if (c < 0x80) return 1;          // 0xxxxxxx
  if ((c >> 5) == 0x6) return 2;   // 110xxxxx
  if ((c >> 4) == 0xE) return 3;   // 1110xxxx
  if ((c >> 3) == 0x1E) return 4;  // 11110xxx
  return 1;                        // fallback for invalid
}

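// Decodes the codepoint at *string and advances *string past it.
// Expects a NUL-terminated string: returns 0 at the terminator without advancing,
// and REPLACEMENT_GLYPH for malformed input.
uint32_t utf8NextCodepoint(const unsigned char** string) {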
  if (**string == 0) {
    return 0;
  }

  const unsigned char lead = **string;
  const int bytes = utf8CodepointLen(lead);
  const uint8_t* chr = *string;

  // Invalid lead byte (stray continuation byte 0x80-0xBF, or 0xF8-0xFF)
  if (bytes == 1 && lead >= 0x80) {
    (*string)++;
    return REPLACEMENT_GLYPH;
  }

  if (bytes == 1) {
    (*string)++;
    return chr[0];
  }

  // Validate continuation bytes before consuming them
  for (int i = 1; i < bytes; i++) {
    if ((chr[i] & 0xC0) != 0x80) {
      // Missing or invalid continuation byte: skip the lead byte and any valid
      // continuation bytes before it, leaving the offending byte for the next call
      *string += i;
      return REPLACEMENT_GLYPH;
    }
  }

  uint32_t cp = chr[0] & ((1 << (7 - bytes)) - 1);  // mask header bits

  for (int i = 1; i < bytes; i++) {
    cp = (cp << 6) | (chr[i] & 0x3F);
  }

  // Reject overlong encodings, surrogates, and out-of-range values
  const bool overlong = (bytes == 2 && cp < 0x80) || (bytes == 3 && cp < 0x800) || (bytes == 4 && cp < 0x10000);
  const bool surrogate = (cp >= 0xD800 && cp <= 0xDFFF);
  if (overlong || surrogate || cp > 0x10FFFF) {
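    // Skip only the lead byte; each remaining continuation byte is then
    // rejected as a stray byte on a later call
    (*string)++;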
    return REPLACEMENT_GLYPH;
  }

  *string += bytes;

  return cp;
}

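// Returns len, or the offset of a trailing incomplete UTF-8 sequence so the
// buffer can be cut there without splitting a codepoint.
int utf8SafeTruncateBuffer(const char* buf, int len) {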
  if (len <= 0) return 0;

  // Walk back past continuation bytes (10xxxxxx) to find the lead byte
  int leadPos = len - 1;
  while (leadPos > 0 && (static_cast<uint8_t>(buf[leadPos]) & 0xC0) == 0x80) {
    leadPos--;
  }

  // Determine expected length of the sequence starting at leadPos
  int expectedLen = utf8CodepointLen(static_cast<unsigned char>(buf[leadPos]));
  int actualLen = len - leadPos;

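  if (actualLen < expectedLen && leadPos > 0) {
    // Incomplete UTF-8 sequence at the end: exclude it.
    // A sequence starting at index 0 is left in place, so a non-empty
    // buffer is never truncated to 0.
    return leadPos;
  }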
  return len;
}

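// Removes the last UTF-8 character (lead byte plus its continuation bytes)
// and returns the new size in bytes.
size_t utf8RemoveLastChar(std::string& str) {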
  if (str.empty()) return 0;
  size_t pos = str.size() - 1;
  while (pos > 0 && (static_cast<unsigned char>(str[pos]) & 0xC0) == 0x80) {
    --pos;
  }
  str.resize(pos);
  return pos;
}

// Truncate string by removing N UTF-8 characters from the end
void utf8TruncateChars(std::string& str, const size_t numChars) {
  for (size_t i = 0; i < numChars && !str.empty(); ++i) {
    utf8RemoveLastChar(str);
  }
}