Commit Graph

4 Commits

Author SHA1 Message Date
Uri Tauber
1abe307f20 perf: Optimize HTML entities lookup to O(log(n)) (#1194)
## Summary

**What is the goal of this PR?** Replace the linear scan of
`lookupHtmlEntity` with a simple binary search to improve lookup
performance.

**What changes are included?**
`lib/Epub/Epub/Entities/htmlEntities.cpp`: 
 - Sorted the `ENTITY_LOOKUP` array.
 - Added a compile-time assertion to guarantee the array remains sorted.
 - Rewrote `lookupHtmlEntity` to use a binary search.

## Additional Context

Benchmarked on my x64 laptop (probably will be different on RISC-V)
```
=== Benchmark (53 entities x 10000 iterations) ===

Version           Total time   Avg per lookup
----------------------------------------------
linear          236.97 ms total     447.11 ns/lookup
binary search    22.09 ms total      41.68 ns/lookup

=== Summary ===

Binary search is 10.73x faster than linear scan.
```

This is a simplified alternative to #1180, focused on keeping the
implementation clean, and maintainable.

### AI Usage


Did you use AI tools to help write this code? _**< NO >**_

---------

Co-authored-by: Zach Nelson <zach@zdnelson.com>
2026-02-26 12:55:31 -06:00
Jake Kenneally
6e51afb977 fix: Account for nbsp; character as non-breaking space (#757)
## Summary

Closes #743.

**What is the goal of this PR?**

- Add back handling for HTML entities in expat. This was originally part
of the code that got removed
[here](https://github.com/crosspoint-reader/crosspoint-reader/pull/274)
- Handle `&nbsp;` characters to resolve issue #743 

**What changes are included?**

- Brought back HTML entity table from previous commit and refactored it
to use a static const char * table with linear lookup to reduce heap
allocations.
- Used `XML_SetDefaultHandlerExpand` in expat to parse out the entities
correctly, without needing them defined in DOCTYPE
- Added handling for `&nbsp;` so that the text stays together and
doesn't break onto a new line with text separated by an `&nbsp;`

## Additional Context

- This supersedes [this
PR](https://github.com/crosspoint-reader/crosspoint-reader/pull/751)
that simply handled `nbsp;` as whitespace. Instead, we want that
character to serve its true purpose and affect the line-breaking
algorithm.
- Updated my test EPUB [here](https://github.com/jdk2pq/css-test-epub)
with `&nbsp;` characters examples at the end of the book

---

### AI Usage

While CrossPoint doesn't have restrictions on AI tools in contributing,
please be transparent about their usage as it
helps set the right context for reviewers.

Did you use AI tools to help write this code? _**YES**_, Claude Code
2026-02-13 15:46:46 +01:00
Dave Allie
2b12a65011 Remove HTML entity parsing (#274)
## Summary

* Remove HTML entity parsing
  * This has been completely useless since the introduction of expat
* expat tries to parse all entities in the document, but only knows of
HTML ones
* Parsing will never end with HTML entities in the text, so the
additional step to parse them that we had went completely unused
* We should figure out the best way to parse that content in the future,
but for now remove that module as it generates a lot of heap allocations
with its map and strings
2026-01-07 23:08:43 +11:00
Dave Allie
2ccdbeecc8 Public release 2025-12-03 22:06:45 +11:00