Discussion: Understanding Late Interaction Models in Multimodal Retrieval

The ability to properly process and chunk financial statements is huge. This is something I’ve spent a ton of time working on with sonar.us, but there is a lot more work to do. It seems to me that OCR isn’t going to be the clear winner when it comes to chunking PDFs. Will be digging deeper into ColPali!

Has anyone come up with a good system for chunking large Excel files? The problem is that a workbook can contain a bunch of sheets, and each sheet can hold dozens of disconnected tables (some of which run to thousands of cells). Would love to hear how other folks are tackling this problem.

I haven’t been through the Excel‑to‑ColPali implementation fire myself yet, but here’s how I’d tackle it:

Snapshot each sheet / major table as an image, pull the sheet name + header row as text, and index both. The patch vectors preserve the grid layout, while the text embeddings keep keywords (e.g., “profit margin”) searchable, so MaxSim lands on the exact table. A rough sketch of the snapshot step is below.
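Here’s a minimal sketch of that snapshot step, assuming pandas + matplotlib as a stand-in renderer. The file name, the row cap, and the assumption that row 1 holds the header are all illustrative; a real pipeline would also need to detect multiple disconnected tables within a sheet rather than treating each sheet as one grid.

```python
# Rough sketch: turn each sheet into (image, header text) chunks.
# Assumptions: pandas + matplotlib are available, row 1 is the header,
# and "q3_financials.xlsx" is a placeholder file name.
import pandas as pd
import matplotlib.pyplot as plt

def snapshot_workbook(path: str):
    """Yield (sheet_name, image_path, header_text) for every sheet."""
    sheets = pd.read_excel(path, sheet_name=None)  # {sheet_name: DataFrame}
    for name, df in sheets.items():
        # Text side: sheet name + header row, so keyword queries still hit.
        header_text = f"{name}: " + ", ".join(str(c) for c in df.columns)

        # Image side: render the grid so ColPali-style patch embeddings can
        # pick up the visual layout. Cap rows so huge tables don't produce
        # unreadable renders; very large tables should really be tiled.
        view = df.head(200).astype(str)
        fig, ax = plt.subplots(figsize=(2 + view.shape[1], 1 + 0.25 * len(view)))
        ax.axis("off")
        ax.table(cellText=view.values, colLabels=view.columns, loc="center")
        image_path = f"{name}.png"
        fig.savefig(image_path, dpi=150, bbox_inches="tight")
        plt.close(fig)

        yield name, image_path, header_text

for sheet, img, text in snapshot_workbook("q3_financials.xlsx"):
    print(sheet, img, text[:80])
```

The matplotlib render is just the cheapest thing that works; converting sheets to PDF with headless LibreOffice and rasterizing the pages would preserve real cell formatting better.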

If wiring a multi‑vector index in Qdrant (or another vector database) feels heavyweight, Mixpeek already hosts that stack and can ingest these Excel‑derived chunks and handle retrieval over them. For reference, a rough sketch of the Qdrant wiring is below.
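For anyone who does want to self-host it, here’s a minimal sketch of the Qdrant side, assuming qdrant-client 1.10+ (which added multivector / MaxSim support). The collection name, the 128‑dim vector size, and the dummy patch embeddings are illustrative placeholders, not a definitive setup.

```python
# Minimal sketch of a late-interaction (MaxSim) collection in Qdrant.
# Assumes qdrant-client >= 1.10 and a local Qdrant instance; names and
# sizes are placeholders.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="excel_pages",
    vectors_config=models.VectorParams(
        size=128,                         # one 128-d vector per image patch
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM  # late interaction
        ),
    ),
)

# Dummy patch embeddings standing in for a ColPali-style encoder's output:
# one vector per image patch of the sheet snapshot above.
patch_vectors = [[0.0] * 128 for _ in range(32)]

client.upsert(
    collection_name="excel_pages",
    points=[models.PointStruct(
        id=1,
        vector=patch_vectors,
        payload={"sheet": "P&L", "header": "P&L: revenue, cogs, profit margin"},
    )],
)
```

At query time you’d embed the question with the same model and pass its token-level vectors as the query (e.g., via client.query_points), and Qdrant applies MaxSim against the stored patch vectors.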

We’re prepping a full demo and deep‑dive write‑up on this workflow. Would be great to swap notes or feature your sonar.us use case if you’re up for it.
