crosspoint-reader-mod/scripts/build_html.py
pablohc 7d56810ee6 feat: integrated epub optimizer (#1224)
## Problem

Many e-ink readers natively support only a limited set of image decoders.
EPUBs with images in formats other than **baseline JPEG** frequently
cause:

- **Broken images**: pages render as blank, corrupted noise, or never
load
- **Slow rendering**: unoptimized images cause severe delays on e-ink
hardware, with page turns taking up to 7 seconds and cover images taking
up to 59.7 seconds to render
- **Broken covers**: the book thumbnail never generates

Today, fixing this means running the EPUB through external tools before uploading.

---

## What this PR does

Adds an **optional, on-demand EPUB optimizer** to the file upload flow.
When enabled,
it converts all images to baseline JPEG directly in the browser — no
server, no internet,
no external tools needed.

**Conversion is opt-in. The standard upload flow is unchanged.**

---

## Real-world impact

The optimizer was applied in batch to **61 EPUBs**:
- 60 standard EPUBs: 198 MB → 55 MB (**−72.2%**, 143 MB saved)
  - Text-dominant books: 8–46% smaller (covers and inline images converted)
  - Image-heavy / illustrated books: 65–93% smaller
- 1 large manga volume: 594 MB → 72 MB (**−87.8%**, 522 MB saved)
- EPUB structural integrity fully maintained — zero new validation
issues introduced across all 61 books

*Size and integrity analysis:
[epub-comparator](https://github.com/pablohc/epub-comparator)*

From that set, **17 books were selected** as a representative sample
covering different content
types: image-heavy novels, pure manga, light novels with broken images,
and text-dominant books.
Each was benchmarked on two devices running in parallel, one on `master`
and one
on `PR#1224` — measuring render time across ~30 pages per book on
average.

### Rendering bugs fixed

| Book | Problem (original) | After optimization |
|------|--------------------|--------------------|
| Fairy Tale — Stephen King | Cover took **59.7 s** to render | 2.1 s (−96%) |
| Cycle of the Werewolf — Stephen King | Cover took **23.3 s** to render | 1.7 s (−93%) |
| Tomie: Complete Deluxe Ed. — Junji Ito | Cover took **18.3 s** to render | 2.0 s (−89%) |
| Joel Dicker — El tigre (Ed. Ilustrada) | Cover took **14.5 s** to render | 1.4 s (−90%) |
| Jackson, Holly — Asesinato para principiantes | Cover failed completely (blank) | 2.0 s ✓ |
| Sentenced to Be a Hero — Yen Press | Cover failed, **8 images failed to load** | All fixed ✓ |
| Flynn, Gillian — Perdida | Cover failed completely (blank) | 1.6 s ✓ |
| Chandler, Raymond — Asesino en la lluvia | Cover failed completely (blank) | 2.0 s ✓ |

### Page render times — image-heavy EPUBs (avg per page)

| Book | Pages | Avg original | Avg optimized | Improvement | File size |
|------|-------|--------------|---------------|-------------|-----------|
| Fairy Tale — Stephen King | 30 | 3,028 ms | 1,066 ms | **−64.8%** | 32.4 MB → 9.1 MB (−72%) |
| Cycle of the Werewolf — Stephen King | 33 | 3,026 ms | 1,558 ms | **−48.5%** | 35.1 MB → 2.9 MB (−92%) |
| Joel Dicker — El tigre (Ed. Ilustrada) | 16 | 1,846 ms | 1,051 ms | **−43.1%** | 5.3 MB → 0.4 MB (−93%) |
| Tomie: Complete Deluxe Ed. — Junji Ito | 30 | 4,817 ms | 2,802 ms | **−41.8%** | 593.8 MB → 72.2 MB (−87.8%) |
| Sentenced to Be a Hero — Yen Press | 30 | 1,719 ms | 1,388 ms | **−19.2%** | 15.2 MB → 1.6 MB (−90%) |

### Text-heavy EPUBs — no regression

| Book | Pages | Avg original | Avg optimized | Delta |
|------|-------|--------------|---------------|-------|
| Christie — Asesinato en el Orient Express | 30 | 1,672 ms | 1,646 ms | −1.6% |
| Flynn — Perdida | 30 | 1,327 ms | 1,291 ms | −2.7% |
| Dicker — La verdad sobre el caso Harry Quebert | 30 | 1,132 ms | 1,084 ms | −4.2% |
| Hammett — El halcón maltés | 30 | 1,009 ms | 966 ms | −4.3% |
| Chandler — Asesino en la lluvia | 30 | 989 ms | 1,007 ms | +1.8% |

*Differences within ±5% — consistent with device measurement noise.*

*Render time benchmark:
[epub-optimization-benchmark](https://github.com/pablohc/epub-optimization-benchmark)*
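
The percentages in the tables above are consistent with the usual relative-change formula, `(after - before) / before`. The following is a minimal sketch of that arithmetic, not the benchmark's own tooling: `percent_change` and the dictionary are hypothetical names, and the inputs are the rounded averages from the text-heavy table, so results can differ in the last digit from figures computed on raw measurements.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent (negative means smaller or faster)."""
    return (after - before) / before * 100.0


# Average render times in ms (original, optimized), copied from the text-heavy table
text_heavy = {
    "Asesinato en el Orient Express": (1672, 1646),
    "Perdida": (1327, 1291),
    "La verdad sobre el caso Harry Quebert": (1132, 1084),
    "El halcón maltés": (1009, 966),
    "Asesino en la lluvia": (989, 1007),
}

deltas = {title: round(percent_change(a, b), 1) for title, (a, b) in text_heavy.items()}
for title, delta in deltas.items():
    print(f"{title}: {delta:+.1f}%")

# Total size of the 60 standard EPUBs: 198 MB before, 55 MB after
size_delta = round(percent_change(198, 55), 1)
print(f"Batch size: {size_delta:+.1f}%")
```

Every text-heavy delta computed this way stays inside the ±5% band quoted above.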

---

## How to use it

**Single file:**
1. Click **Upload** (top of the page) — a modal opens. Use **Choose
files** to select one EPUB from your device.
2. Check **Optimize**.
   - *(Optional)* Expand **Advanced Mode** to adjust quality, rotation, or
     overlap, or to set individual images to H-Split / V-Split / Rotate.
3. Click **Optimize & Upload**.

**Batch (2+ files):**
1. Click **Upload** (top of the page) — a modal opens. Use **Choose
files** to select multiple EPUBs from your device.
2. Check **Optimize**.
   - *(Optional)* Expand **Advanced Mode** — adjust quality.
3. Click **Upload** — all files are converted and uploaded sequentially.

Batch file upload, without optimization:
<img width="810" height="671" alt="image"
src="https://github.com/user-attachments/assets/d892ae13-0b87-4ea4-b6b8-340d56efc763"
/>

Batch file upload, with standard optimization:
<img width="809" height="707" alt="image"
src="https://github.com/user-attachments/assets/d32dbc88-1208-4555-bfcf-330ab91d2174"
/>

Optimization Phase (1/2):
<img width="807" height="1055" alt="image"
src="https://github.com/user-attachments/assets/fd4cd5f9-e56e-4ca1-9777-6926b9baf2bb"
/>

Upload Phase (2/2):
<img width="805" height="1065" alt="image"
src="https://github.com/user-attachments/assets/483294f0-02f0-4569-ae11-c10b3581d747"
/>

Batch upload successfully confirmed:
<img width="812" height="1043" alt="image"
src="https://github.com/user-attachments/assets/80c135bf-05c3-4c80-8755-2a04c68235bc"
/>

---

## Options

**Always active when the converter is enabled:**
- Converts PNG, WebP, BMP, GIF → baseline JPEG
- Smart downscaling to 480×800 px max (preserves aspect ratio)
- True grayscale for e-ink (BT.709 luminance, always on)
- SVG cover fix + OPF/NCX compliance repairs
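
As an illustration of the two always-on image steps, here is a minimal Python sketch of aspect-preserving downscaling to a 480×800 box and of BT.709 grayscale conversion. The optimizer itself runs in the browser, so this is not its code: `fit_within` and `bt709_luma` are hypothetical names, and the sketch assumes that images are never upscaled and that the luminance weights are applied directly to 8-bit RGB values. The weights themselves (0.2126, 0.7152, 0.0722) are the standard BT.709 coefficients.

```python
MAX_W, MAX_H = 480, 800  # target bounding box from the options list


def fit_within(width: int, height: int, max_w: int = MAX_W, max_h: int = MAX_H) -> tuple:
    """Return (width, height) scaled to fit the box, keeping the aspect ratio.

    Assumption: images that already fit are left at their original size.
    """
    scale = min(max_w / width, max_h / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


def bt709_luma(r: int, g: int, b: int) -> int:
    """Convert one 8-bit RGB pixel to an 8-bit gray value using BT.709 weights."""
    return round(0.2126 * r + 0.7152 * g + 0.0722 * b)


print(fit_within(1600, 1200))   # landscape image, limited by width
print(fit_within(300, 400))     # already small enough, unchanged
print(bt709_luma(255, 0, 0))    # pure red maps to a fairly dark gray
```

Green carries most of the weight, which is why a pure-red pixel ends up much darker than a pure-green one after conversion.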

**Advanced Mode (opt-in) — single file:**
- JPEG quality presets: 30% / 45% / 60% / 75% / **85%** (default) / 95%
- Rotation direction for split images: CW (default) / CCW
- Min overlap when splitting: 5% (default) / 10% / 15%
- Auto-download conversion log toggle (detailed stats per image)
- Per-image picker: set Normal / H-Split / V-Split / Rotate per image
individually,
  with "Apply to all" for bulk assignment

**Advanced Mode (opt-in) — batch (2+ files):**
- JPEG quality presets: 30% / 45% / 60% / 75% / **85%** (default) / 95%
- Auto-download conversion log toggle (aggregated stats for all files)

---

## ⚠️ Known limitations

**KoReader hash-based sync will break** for converted files. The file
content changes,
so the hash no longer matches the original. Filename-based sync is
unaffected.

If you rely on KoReader hash sync, use the Calibre plugin or the web
tool instead.
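
The reason is easy to see with any content hash. The sketch below uses a plain MD5 over the whole file purely as a stand-in; it is an assumption for illustration and not KoReader's actual hashing scheme, and the byte strings are made-up placeholders rather than real EPUB data.

```python
import hashlib


def file_digest(data: bytes) -> str:
    """Hex digest of the file content (stand-in for a reader's content hash)."""
    return hashlib.md5(data).hexdigest()


# Placeholder payloads standing in for an EPUB before and after optimization
original = b"PK\x03\x04 original epub bytes"
optimized = b"PK\x03\x04 optimized epub bytes"

print(file_digest(original))
print(file_digest(optimized))
print("same document id:", file_digest(original) == file_digest(optimized))
```

Any change to the bytes, including re-encoding a single image, yields a different digest, so progress stored under the original hash no longer matches the converted file.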

---

## Build size impact

| Metric | master (53beeee) | PR #1224 (a2ba5db) | Delta |
|---------------|------------------|--------------------|----------------|
| Flash used | 5,557 KB | 5,616 KB | +59 KB (+1.1%) |
| Flash free | 843 KB | 784 KB | −59 KB |
| Flash usage | 86.8% | 87.7% | +0.9 pp |
| RAM used | 95,156 B | 95,156 B | no change |

> Both builds compiled with `gh_release` environment in release mode
(ESP32-C3, 6,400 KB Flash).
> The +59 KB increase is entirely due to `jszip.min.js` embedded as a
> gzipped static asset served from Flash. RAM usage is identical,
> confirming no runtime overhead — the library runs in the browser,
> not on the ESP32. ~784 KB of Flash remain available.

---

## Alternatives considered

| Approach | Friction |
|----------|----------|
| **This PR** (integrated in upload flow) | Zero: convert + upload in one step, offline, any browser |
| Calibre plugin (in parallel development) | Requires a computer with Calibre installed, same network |
| Web converters | Requires extra upload / download / transfer steps |

---

## Credits

Based on the converter algorithm developed by @zgredex.
Co-authored-by: @zgredex

---

### AI Usage

Did you use AI tools to help write this code? **PARTIALLY**

---------

Co-authored-by: zgredex <zgredex@users.noreply.github.com>
2026-03-22 19:53:15 +00:00

94 lines
3.7 KiB
Python

import os
import re
import gzip

SRC_DIR = "src"


def minify_html(html: str) -> str:
    # Tags where whitespace should be preserved
    preserve_tags = ['pre', 'code', 'textarea', 'script', 'style']
    preserve_regex = '|'.join(preserve_tags)

    # Protect preserve blocks with placeholders
    preserve_blocks = []

    def preserve(match):
        preserve_blocks.append(match.group(0))
        return f"__PRESERVE_BLOCK_{len(preserve_blocks)-1}__"

    html = re.sub(rf'<({preserve_regex})[\s\S]*?</\1>', preserve, html, flags=re.IGNORECASE)

    # Remove HTML comments
    html = re.sub(r'<!--.*?-->', '', html, flags=re.DOTALL)

    # Collapse all whitespace between tags
    html = re.sub(r'>\s+<', '><', html)

    # Collapse every remaining run of whitespace into a single space
    html = re.sub(r'\s+', ' ', html)

    # Restore preserved blocks
    for i, block in enumerate(preserve_blocks):
        html = html.replace(f"__PRESERVE_BLOCK_{i}__", block)

    return html.strip()


def sanitize_identifier(name: str) -> str:
    """Sanitize a filename to create a valid C identifier.

    C identifiers must:
    - Start with a letter or underscore
    - Contain only letters, digits, and underscores
    """
    # Replace non-alphanumeric characters (including hyphens) with underscores
    sanitized = re.sub(r'[^a-zA-Z0-9_]', '_', name)

    # Prefix with underscore if starts with a digit
    if sanitized and sanitized[0].isdigit():
        sanitized = f"_{sanitized}"

    return sanitized


for root, _, files in os.walk(SRC_DIR):
    for file in files:
        if file.endswith(".html") or file.endswith(".js"):
            file_path = os.path.join(root, file)
            with open(file_path, "r", encoding="utf-8") as f:
                content = f.read()

            # Only minify HTML files; JS files are typically pre-minified (e.g., jszip.min.js)
            if file.endswith(".html"):
                processed = minify_html(content)
            else:
                processed = content

            # Encode once so that every size below is a byte count, not a character count
            original_size = len(content.encode('utf-8'))
            processed_bytes = processed.encode('utf-8')

            # Compress with gzip (compresslevel 9 is maximum compression)
            # IMPORTANT: we don't use brotli because Firefox only supports brotli in secure contexts (HTTPS)
            compressed = gzip.compress(processed_bytes, compresslevel=9)

            # Create valid C identifier from filename
            # Use appropriate suffix based on file type
            suffix = "Html" if file.endswith(".html") else "Js"
            base_name = sanitize_identifier(f"{os.path.splitext(file)[0]}{suffix}")
            header_path = os.path.join(root, f"{base_name}.generated.h")

            with open(header_path, "w", encoding="utf-8") as h:
                h.write("// THIS FILE IS AUTOGENERATED, DO NOT EDIT MANUALLY\n\n")
                h.write("#pragma once\n")
                h.write("#include <cstddef>\n\n")

                # Write the compressed data as a byte array
                h.write(f"constexpr char {base_name}[] PROGMEM = {{\n")

                # Write bytes in rows of 16
                for i in range(0, len(compressed), 16):
                    chunk = compressed[i:i+16]
                    hex_values = ', '.join(f'0x{b:02x}' for b in chunk)
                    h.write(f"  {hex_values},\n")

                h.write("};\n\n")
                h.write(f"constexpr size_t {base_name}CompressedSize = {len(compressed)};\n")
                h.write(f"constexpr size_t {base_name}OriginalSize = {len(processed_bytes)};\n")

            # Guard the ratios against empty input files
            ratio_base = max(original_size, 1)
            print(f"Generated: {header_path}")
            print(f"  Original: {original_size} bytes")
            print(f"  Minified: {len(processed_bytes)} bytes ({100*len(processed_bytes)/ratio_base:.1f}%)")
            print(f"  Compressed: {len(compressed)} bytes ({100*len(compressed)/ratio_base:.1f}%)")
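
One way to sanity-check the generated headers is to parse the hex literals back out and confirm they decompress to the input. The sketch below is a standalone illustration under stated assumptions, not part of the build script: `to_c_array` and `from_c_array` are hypothetical helpers that mirror the 16-bytes-per-row layout used above, and the sample HTML string is made up.

```python
import gzip
import re


def to_c_array(data: bytes, per_row: int = 16) -> str:
    """Format bytes as rows of C hex literals, 16 per row by default."""
    rows = []
    for i in range(0, len(data), per_row):
        chunk = data[i:i + per_row]
        rows.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    return "\n".join(rows)


def from_c_array(text: str) -> bytes:
    """Recover the bytes from text containing 0xNN literals."""
    return bytes(int(tok, 16) for tok in re.findall(r"0x([0-9a-fA-F]{2})", text))


html = "<html><body><p>h\u00e9llo</p></body></html>"  # one non-ASCII character
raw = html.encode("utf-8")
compressed = gzip.compress(raw, compresslevel=9)

array_text = to_c_array(compressed)
restored = gzip.decompress(from_c_array(array_text))

print("round trip ok:", restored == raw)
print("characters:", len(html), "bytes:", len(raw))
```

The last line shows why sizes written into a header should be measured on the encoded bytes: a single non-ASCII character already makes the byte length exceed the character count.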