2025-12-13 19:36:01 +11:00
|
|
|
#pragma once
|
|
|
|
|
#include <Print.h>
|
|
|
|
|
|
perf: optimize large EPUB indexing from O(n^2) to O(n) (#458)
## Summary
Optimizes EPUB metadata indexing for large books (2000+ chapters) from
~30 minutes to ~50 seconds by replacing O(n²) algorithms with O(n log n)
hash-indexed lookups.
Fixes #134
## Problem
Three phases had O(n²) complexity due to nested loops:
| Phase | Operation | Before (2768 chapters) |
|-------|-----------|------------------------|
| OPF Pass | For each spine ref, scan all manifest items | ~25 min |
| TOC Pass | For each TOC entry, scan all spine items | ~5 min |
| buildBookBin | For each spine item, scan ZIP central directory | ~8.4
min |
Total: **~30+ minutes** for first-time indexing of large EPUBs.
## Solution
Replace linear scans with sorted hash indexes + binary search:
- **OPF Pass**: Build `{hash(id), len, offset}` index from manifest,
binary search for each spine ref
- **TOC Pass**: Build `{hash(href), len, spineIndex}` index from spine,
binary search for each TOC entry
- **buildBookBin**: New `ZipFile::fillUncompressedSizes()` API - single
ZIP central directory scan with batch hash matching
All indexes use FNV-1a hashing with length as secondary key to minimize
collisions. Indexes are freed immediately after each phase.
## Results
**Shadow Slave EPUB (2768 chapters):**
| Phase | Before | After | Speedup |
|-------|--------|-------|---------|
| OPF pass | ~25 min | 10.8 sec | ~140x |
| TOC pass | ~5 min | 4.7 sec | ~60x |
| buildBookBin | 506 sec | 34.6 sec | ~15x |
| **Total** | **~30+ min** | **~50 sec** | **~36x** |
**Normal EPUB (87 chapters):** 1.7 sec - no regression.
## Memory
Peak temporary memory during indexing:
- OPF index: ~33KB (2770 items × 12 bytes)
- TOC index: ~33KB (2768 items × 12 bytes)
- ZIP batch: ~44KB (targets + sizes arrays)
All indexes cleared immediately after each phase. No OOM risk on
ESP32-C3.
## Note on Threshold
All optimizations are gated by `LARGE_SPINE_THRESHOLD = 400` to preserve
existing behavior for small books. However, the algorithms work
correctly for any book size and are faster even for small books:
| Book Size | Old O(n²) | New O(n log n) | Improvement |
|-----------|-----------|----------------|-------------|
| 10 ch | 100 ops | 50 ops | 2x |
| 100 ch | 10K ops | 800 ops | 12x |
| 400 ch | 160K ops | 4K ops | 40x |
If preferred, the threshold could be removed to use the optimized path
universally.
## Testing
- [x] Shadow Slave (2768 chapters): 50s first-time indexing, loads and
navigates correctly
- [x] Normal book (87 chapters): 1.7s indexing, no regression
- [x] Build passes
- [x] clang-format passes
## Files Changed
- `lib/Epub/Epub/parsers/ContentOpfParser.h/.cpp` - OPF manifest index
- `lib/Epub/Epub/BookMetadataCache.h/.cpp` - TOC index + batch size
lookup
- `lib/ZipFile/ZipFile.h/.cpp` - New `fillUncompressedSizes()` API
- `lib/Epub/Epub.cpp` - Timing logs
<details>
<summary><b>Algorithm Details</b> (click to expand)</summary>
### Phase 1: OPF Pass - Manifest to Spine Lookup
**Problem**: Each `<itemref idref="ch001">` in spine must find matching
`<item id="ch001" href="...">` in manifest.
```
OLD: For each of 2768 spine refs, scan all 2770 manifest items
= 7.6M string comparisons
NEW: While parsing manifest, build index:
{ hash("ch001"), len=5, file_offset=120 }
Sort index, then binary search for each spine ref:
2768 × log₂(2770) ≈ 2768 × 11 = 30K comparisons
```
### Phase 2: TOC Pass - TOC Entry to Spine Index Lookup
**Problem**: Each TOC entry with `href="chapter0001.xhtml"` must find
its spine index.
```
OLD: For each of 2768 TOC entries, scan all 2768 spine entries
= 7.6M string comparisons
NEW: At beginTocPass(), read spine once and build index:
{ hash("OEBPS/chapter0001.xhtml"), len=25, spineIndex=0 }
Sort index, binary search for each TOC entry:
2768 × log₂(2768) ≈ 30K comparisons
Clear index at endTocPass() to free memory.
```
### Phase 3: buildBookBin - ZIP Size Lookup
**Problem**: Need uncompressed file size for each spine item (for
reading progress). Sizes are in ZIP central directory.
```
OLD: For each of 2768 spine items, scan ZIP central directory (2773 entries)
= 7.6M filename reads + string comparisons
Time: 506 seconds
NEW:
Step 1: Build targets from spine
{ hash("OEBPS/chapter0001.xhtml"), len=25, index=0 }
Sort by (hash, len)
Step 2: Single pass through ZIP central directory
For each entry:
- Compute hash ON THE FLY (no string allocation)
- Binary search targets
- If match: sizes[target.index] = uncompressedSize
Step 3: Use sizes array directly (O(1) per spine item)
Total: 2773 entries × log₂(2768) ≈ 33K comparisons
Time: 35 seconds
```
### Why Hash + Length?
Using 64-bit FNV-1a hash + string length as a composite key:
- Collision probability: ~1 in 2⁶⁴ × typical_path_lengths
- No string storage needed in index (just 12-16 bytes per entry)
- Integer comparisons are faster than string comparisons
- Verification on match handles the rare collision case
</details>
---
_AI-assisted development. All changes tested on hardware._
2026-01-27 06:29:15 -08:00
|
|
|
#include <algorithm>
|
|
|
|
|
#include <vector>
|
|
|
|
|
|
2025-12-13 19:36:01 +11:00
|
|
|
#include "Epub.h"
|
|
|
|
|
#include "expat.h"
|
|
|
|
|
|
2025-12-24 22:36:13 +11:00
|
|
|
class BookMetadataCache;
|
|
|
|
|
|
2025-12-13 19:36:01 +11:00
|
|
|
class ContentOpfParser final : public Print {
|
|
|
|
|
enum ParserState {
|
|
|
|
|
START,
|
|
|
|
|
IN_PACKAGE,
|
|
|
|
|
IN_METADATA,
|
|
|
|
|
IN_BOOK_TITLE,
|
2025-12-30 21:15:44 +10:00
|
|
|
IN_BOOK_AUTHOR,
|
2026-01-19 17:56:26 +05:00
|
|
|
IN_BOOK_LANGUAGE,
|
port: upstream PR #1342 - Book Info screen, richer metadata, safer controls
Ports upstream PR #1342 (feat: Add Book Info screen, richer metadata,
and safer file-browser controls) with mod-specific adaptations:
- Parse and cache series, seriesIndex, description from EPUB OPF
- Bump book.bin cache version to 6 for new metadata fields
- Add BookInfoActivity (new screen) accessible via Right button in FileBrowser
- Add ManageBook menu via Left button in FileBrowser (replaces upstream hidden delete)
- Guard all delete/archive actions with ConfirmationActivity (10 call sites)
- Add inputArmed gating to ConfirmationActivity to prevent accidental confirmation
- Safe deserialization: readString now returns bool with MAX_STRING_LENGTH guard
- Add series field to RecentBooksStore with JSON and binary serialization
- Add i18n keys: STR_BOOK_INFO, STR_AUTHOR, STR_SERIES, STR_FILE_SIZE, etc.
Made-with: Cursor
2026-03-09 00:39:32 -04:00
|
|
|
IN_BOOK_DESCRIPTION,
|
|
|
|
|
IN_BOOK_SERIES,
|
|
|
|
|
IN_BOOK_SERIES_INDEX,
|
2026-03-09 02:43:10 -04:00
|
|
|
IN_BOOK_PUBLISHER,
|
|
|
|
|
IN_BOOK_DATE,
|
|
|
|
|
IN_BOOK_SUBJECT,
|
|
|
|
|
IN_BOOK_RIGHTS,
|
|
|
|
|
IN_BOOK_CONTRIBUTOR,
|
|
|
|
|
IN_BOOK_IDENTIFIER,
|
2025-12-13 19:36:01 +11:00
|
|
|
IN_MANIFEST,
|
|
|
|
|
IN_SPINE,
|
2025-12-30 13:02:46 +01:00
|
|
|
IN_GUIDE,
|
2025-12-13 19:36:01 +11:00
|
|
|
};
|
|
|
|
|
|
2025-12-24 22:36:13 +11:00
|
|
|
const std::string& cachePath;
|
2025-12-13 19:36:01 +11:00
|
|
|
const std::string& baseContentPath;
|
|
|
|
|
size_t remainingSize;
|
|
|
|
|
XML_Parser parser = nullptr;
|
|
|
|
|
ParserState state = START;
|
2025-12-24 22:36:13 +11:00
|
|
|
BookMetadataCache* cache;
|
2025-12-30 15:09:30 +10:00
|
|
|
FsFile tempItemStore;
|
2025-12-24 22:36:13 +11:00
|
|
|
std::string coverItemId;
|
2026-03-09 02:43:10 -04:00
|
|
|
bool identifierIsIsbn = false;
|
2025-12-13 19:36:01 +11:00
|
|
|
|
perf: optimize large EPUB indexing from O(n^2) to O(n) (#458)
## Summary
Optimizes EPUB metadata indexing for large books (2000+ chapters) from
~30 minutes to ~50 seconds by replacing O(n²) algorithms with O(n log n)
hash-indexed lookups.
Fixes #134
## Problem
Three phases had O(n²) complexity due to nested loops:
| Phase | Operation | Before (2768 chapters) |
|-------|-----------|------------------------|
| OPF Pass | For each spine ref, scan all manifest items | ~25 min |
| TOC Pass | For each TOC entry, scan all spine items | ~5 min |
| buildBookBin | For each spine item, scan ZIP central directory | ~8.4
min |
Total: **~30+ minutes** for first-time indexing of large EPUBs.
## Solution
Replace linear scans with sorted hash indexes + binary search:
- **OPF Pass**: Build `{hash(id), len, offset}` index from manifest,
binary search for each spine ref
- **TOC Pass**: Build `{hash(href), len, spineIndex}` index from spine,
binary search for each TOC entry
- **buildBookBin**: New `ZipFile::fillUncompressedSizes()` API - single
ZIP central directory scan with batch hash matching
All indexes use FNV-1a hashing with length as secondary key to minimize
collisions. Indexes are freed immediately after each phase.
## Results
**Shadow Slave EPUB (2768 chapters):**
| Phase | Before | After | Speedup |
|-------|--------|-------|---------|
| OPF pass | ~25 min | 10.8 sec | ~140x |
| TOC pass | ~5 min | 4.7 sec | ~60x |
| buildBookBin | 506 sec | 34.6 sec | ~15x |
| **Total** | **~30+ min** | **~50 sec** | **~36x** |
**Normal EPUB (87 chapters):** 1.7 sec - no regression.
## Memory
Peak temporary memory during indexing:
- OPF index: ~33KB (2770 items × 12 bytes)
- TOC index: ~33KB (2768 items × 12 bytes)
- ZIP batch: ~44KB (targets + sizes arrays)
All indexes cleared immediately after each phase. No OOM risk on
ESP32-C3.
## Note on Threshold
All optimizations are gated by `LARGE_SPINE_THRESHOLD = 400` to preserve
existing behavior for small books. However, the algorithms work
correctly for any book size and are faster even for small books:
| Book Size | Old O(n²) | New O(n log n) | Improvement |
|-----------|-----------|----------------|-------------|
| 10 ch | 100 ops | 50 ops | 2x |
| 100 ch | 10K ops | 800 ops | 12x |
| 400 ch | 160K ops | 4K ops | 40x |
If preferred, the threshold could be removed to use the optimized path
universally.
## Testing
- [x] Shadow Slave (2768 chapters): 50s first-time indexing, loads and
navigates correctly
- [x] Normal book (87 chapters): 1.7s indexing, no regression
- [x] Build passes
- [x] clang-format passes
## Files Changed
- `lib/Epub/Epub/parsers/ContentOpfParser.h/.cpp` - OPF manifest index
- `lib/Epub/Epub/BookMetadataCache.h/.cpp` - TOC index + batch size
lookup
- `lib/ZipFile/ZipFile.h/.cpp` - New `fillUncompressedSizes()` API
- `lib/Epub/Epub.cpp` - Timing logs
<details>
<summary><b>Algorithm Details</b> (click to expand)</summary>
### Phase 1: OPF Pass - Manifest to Spine Lookup
**Problem**: Each `<itemref idref="ch001">` in spine must find matching
`<item id="ch001" href="...">` in manifest.
```
OLD: For each of 2768 spine refs, scan all 2770 manifest items
= 7.6M string comparisons
NEW: While parsing manifest, build index:
{ hash("ch001"), len=5, file_offset=120 }
Sort index, then binary search for each spine ref:
2768 × log₂(2770) ≈ 2768 × 11 = 30K comparisons
```
### Phase 2: TOC Pass - TOC Entry to Spine Index Lookup
**Problem**: Each TOC entry with `href="chapter0001.xhtml"` must find
its spine index.
```
OLD: For each of 2768 TOC entries, scan all 2768 spine entries
= 7.6M string comparisons
NEW: At beginTocPass(), read spine once and build index:
{ hash("OEBPS/chapter0001.xhtml"), len=25, spineIndex=0 }
Sort index, binary search for each TOC entry:
2768 × log₂(2768) ≈ 30K comparisons
Clear index at endTocPass() to free memory.
```
### Phase 3: buildBookBin - ZIP Size Lookup
**Problem**: Need uncompressed file size for each spine item (for
reading progress). Sizes are in ZIP central directory.
```
OLD: For each of 2768 spine items, scan ZIP central directory (2773 entries)
= 7.6M filename reads + string comparisons
Time: 506 seconds
NEW:
Step 1: Build targets from spine
{ hash("OEBPS/chapter0001.xhtml"), len=25, index=0 }
Sort by (hash, len)
Step 2: Single pass through ZIP central directory
For each entry:
- Compute hash ON THE FLY (no string allocation)
- Binary search targets
- If match: sizes[target.index] = uncompressedSize
Step 3: Use sizes array directly (O(1) per spine item)
Total: 2773 entries × log₂(2768) ≈ 33K comparisons
Time: 35 seconds
```
### Why Hash + Length?
Using 64-bit FNV-1a hash + string length as a composite key:
- Collision probability: ~1 in 2⁶⁴ × typical_path_lengths
- No string storage needed in index (just 12-16 bytes per entry)
- Integer comparisons are faster than string comparisons
- Verification on match handles the rare collision case
</details>
---
_AI-assisted development. All changes tested on hardware._
2026-01-27 06:29:15 -08:00
|
|
|
// Index for fast idref→href lookup (used only for large EPUBs)
|
|
|
|
|
struct ItemIndexEntry {
|
|
|
|
|
uint32_t idHash; // FNV-1a hash of itemId
|
|
|
|
|
uint16_t idLen; // length for collision reduction
|
|
|
|
|
uint32_t fileOffset; // offset in .items.bin
|
|
|
|
|
};
|
|
|
|
|
std::vector<ItemIndexEntry> itemIndex;
|
|
|
|
|
bool useItemIndex = false;
|
|
|
|
|
|
|
|
|
|
static constexpr uint16_t LARGE_SPINE_THRESHOLD = 400;
|
|
|
|
|
|
|
|
|
|
// FNV-1a hash function
|
|
|
|
|
static uint32_t fnvHash(const std::string& s) {
|
|
|
|
|
uint32_t hash = 2166136261u;
|
|
|
|
|
for (char c : s) {
|
|
|
|
|
hash ^= static_cast<uint8_t>(c);
|
|
|
|
|
hash *= 16777619u;
|
|
|
|
|
}
|
|
|
|
|
return hash;
|
|
|
|
|
}
|
|
|
|
|
|
2025-12-13 19:36:01 +11:00
|
|
|
static void startElement(void* userData, const XML_Char* name, const XML_Char** atts);
|
|
|
|
|
static void characterData(void* userData, const XML_Char* s, int len);
|
|
|
|
|
static void endElement(void* userData, const XML_Char* name);
|
|
|
|
|
|
|
|
|
|
public:
|
|
|
|
|
std::string title;
|
2025-12-30 21:15:44 +10:00
|
|
|
std::string author;
|
2026-01-19 17:56:26 +05:00
|
|
|
std::string language;
|
port: upstream PR #1342 - Book Info screen, richer metadata, safer controls
Ports upstream PR #1342 (feat: Add Book Info screen, richer metadata,
and safer file-browser controls) with mod-specific adaptations:
- Parse and cache series, seriesIndex, description from EPUB OPF
- Bump book.bin cache version to 6 for new metadata fields
- Add BookInfoActivity (new screen) accessible via Right button in FileBrowser
- Add ManageBook menu via Left button in FileBrowser (replaces upstream hidden delete)
- Guard all delete/archive actions with ConfirmationActivity (10 call sites)
- Add inputArmed gating to ConfirmationActivity to prevent accidental confirmation
- Safe deserialization: readString now returns bool with MAX_STRING_LENGTH guard
- Add series field to RecentBooksStore with JSON and binary serialization
- Add i18n keys: STR_BOOK_INFO, STR_AUTHOR, STR_SERIES, STR_FILE_SIZE, etc.
Made-with: Cursor
2026-03-09 00:39:32 -04:00
|
|
|
std::string series;
|
|
|
|
|
std::string seriesIndex;
|
|
|
|
|
std::string description;
|
2026-03-09 02:43:10 -04:00
|
|
|
std::string publisher;
|
|
|
|
|
std::string date;
|
|
|
|
|
std::string subjects;
|
|
|
|
|
std::string rights;
|
|
|
|
|
std::string contributor;
|
|
|
|
|
std::string identifier;
|
|
|
|
|
std::string rating;
|
2025-12-13 19:36:01 +11:00
|
|
|
std::string tocNcxPath;
|
2026-01-03 16:10:35 +08:00
|
|
|
std::string tocNavPath; // EPUB 3 nav document path
|
2025-12-24 22:36:13 +11:00
|
|
|
std::string coverItemHref;
|
2026-02-16 07:24:30 -05:00
|
|
|
std::string guideCoverPageHref; // Guide reference with type="cover" or "cover-page" (points to XHTML wrapper)
|
2025-12-30 13:02:46 +01:00
|
|
|
std::string textReferenceHref;
|
feat: Add CSS parsing and CSS support in EPUBs (#411)
## Summary
* **What is the goal of this PR?**
- Adds basic CSS parsing to EPUBs and determine the CSS rules when
rendering to the screen so that text is styled correctly. Currently
supports bold, underline, italics, margin, padding, and text alignment
## Additional Context
- My main reason for wanting this is that the book I'm currently
reading, Carl's Doomsday Scenario (2nd in the Dungeon Crawler Carl
series), relies _a lot_ on styled text for telling parts of the story.
When text is bolded, it's supposed to be a message that's rendered
"on-screen" in the story. When characters are "chatting" with each
other, the text is bolded and their names are underlined. Plus, normal
emphasis is provided with italicizing words here and there. So, this
greatly improves my experience reading this book on the Xteink, and I
figured it was useful enough for others too.
- For transparency: I'm a software engineer, but I'm mostly frontend and
TypeScript/JavaScript. It's been _years_ since I did any C/C++, so I
would not be surprised if I'm doing something dumb along the way in this
code. Please don't hesitate to ask for changes if something looks off. I
heavily relied on Claude Code for help, and I had a lot of inspiration
from how [microreader](https://github.com/CidVonHighwind/microreader)
achieves their CSS parsing and styling. I did give this as good of a
code review as I could and went through everything, and _it works on my
machine_ 😄
### Before


### After


---
### AI Usage
Did you use AI tools to help write this code? **YES**, Claude Code
2026-02-05 05:28:10 -05:00
|
|
|
std::vector<std::string> cssFiles; // CSS stylesheet paths
|
2025-12-13 19:36:01 +11:00
|
|
|
|
2025-12-24 22:36:13 +11:00
|
|
|
explicit ContentOpfParser(const std::string& cachePath, const std::string& baseContentPath, const size_t xmlSize,
|
|
|
|
|
BookMetadataCache* cache)
|
|
|
|
|
: cachePath(cachePath), baseContentPath(baseContentPath), remainingSize(xmlSize), cache(cache) {}
|
2025-12-21 15:43:53 +11:00
|
|
|
~ContentOpfParser() override;
|
2025-12-13 19:36:01 +11:00
|
|
|
|
|
|
|
|
bool setup();
|
|
|
|
|
|
|
|
|
|
size_t write(uint8_t) override;
|
|
|
|
|
size_t write(const uint8_t* buffer, size_t size) override;
|
|
|
|
|
};
|