Show HN: I used Claude Code to discover connections between 100 books
(trails.pieterma.es)
469 points by pmaze a day ago
I think LLMs are overused to summarise and underused to help us read deeper.
I built a system for Claude Code to browse 100 non-fiction books and find interesting connections between them.
I started out with a staged pipeline, chaining together LLM calls to build up context about the library. I was mainly getting back the insights I had baked into the prompts, and the results weren't particularly surprising.
On a whim, I gave Claude Code access to my debug CLI tools and found that it wiped the floor with that approach: actually interesting results, and very little orchestration required in comparison.
One of my favourite trails of excerpts goes from Jobs’ reality distortion field to Theranos’ fake demos, to Thiel on startup cults, to Hoffer on mass movement charlatans (https://trails.pieterma.es/trail/useful-lies/). A fun tendency is that Claude kept getting distracted by topics of secrecy, conspiracy, and hidden systems, as if the task itself summoned a Foucault’s Pendulum mindset.
Details:
* The books are picked from HN’s favourites (which I collected before: https://hnbooks.pieterma.es/).
* Chunks are indexed by topic using Gemini Flash Lite (first sketch below). Indexing the whole library cost about £10.
* Topics are organised into a tree structure using recursive Leiden partitioning and LLM labels (second sketch below). This gives a high-level sense of the themes.
* There are several ways to browse. The most useful are embedding similarity, topic-tree siblings, and topics co-occurring within a chunk window (third sketch below).
* Everything is stored in SQLite and manipulated using a set of CLI tools.
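For the curious, the per-chunk indexing step looks roughly like this. A minimal sketch: the model id, prompt, and table layout here are illustrative, not the exact code.

    # Rough sketch of per-chunk topic indexing. Model id, prompt, and
    # table layout are illustrative, not the exact production code.
    import json
    import sqlite3
    import google.generativeai as genai

    genai.configure(api_key="...")
    model = genai.GenerativeModel("gemini-2.0-flash-lite")

    def index_chunk(db: sqlite3.Connection, chunk_id: int, text: str):
        prompt = ("List the 3-6 main topics of this passage as a JSON array "
                  "of short noun phrases.\n\n" + text)
        resp = model.generate_content(prompt)
        topics = json.loads(resp.text)  # assumes the model returns bare JSON
        db.executemany(
            "INSERT INTO chunk_topics (chunk_id, topic) VALUES (?, ?)",
            [(chunk_id, t) for t in topics])
        db.commit()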
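The topic tree can be approximated by running Leiden recursively over a topic co-occurrence graph. Again a sketch, with the graph construction and stopping rule simplified:

    # Sketch of recursive Leiden partitioning over a topic graph whose
    # vertices are topics (with a "name" attribute) and whose edges are
    # co-occurrences. The stopping rule here is a simplification.
    import igraph as ig
    import leidenalg

    def build_tree(graph: ig.Graph, depth: int = 0, max_depth: int = 3) -> dict:
        part = leidenalg.find_partition(graph, leidenalg.ModularityVertexPartition)
        node = {"topics": graph.vs["name"], "children": []}
        if depth >= max_depth or len(part) <= 1:
            return node  # leaf: small or indivisible community
        for community in part:
            sub = graph.subgraph(community)
            node["children"].append(build_tree(sub, depth + 1, max_depth))
        return node  # each node then gets an LLM-generated label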
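And the window co-occurrence browse comes down to a single SQLite query, assuming a chunk_topics(chunk_id, topic) table with sequential chunk ids (a simplification of the real schema):

    # Sketch: topics co-occurring with a given topic within a +/- N chunk
    # window. Assumes sequential chunk ids; ignores book boundaries for brevity.
    import sqlite3

    def cooccurring_topics(db: sqlite3.Connection, topic: str, window: int = 2):
        return db.execute(
            """
            SELECT b.topic, COUNT(*) AS n
            FROM chunk_topics a
            JOIN chunk_topics b
              ON b.chunk_id BETWEEN a.chunk_id - ? AND a.chunk_id + ?
             AND b.topic <> a.topic
            WHERE a.topic = ?
            GROUP BY b.topic
            ORDER BY n DESC
            LIMIT 20
            """,
            (window, window, topic),
        ).fetchall()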
I wrote more about the process here: https://pieterma.es/syntopic-reading-claude/
I’m curious whether this way of reading resonates with anyone else, LLM-mediated or not.
I did something similar: I used pdfplumber to extract text from my PDF book collection, dumped it into PostgreSQL, then chunked the text into 100-char chunks with a 10-char overlap. The chunks were embedded directly into a 384D space using Python sentence_transformers; I then averaged all chunk vectors for a doc and wrote that single vector back to PostgreSQL. Finally I used UMAP + HDBSCAN to perform dimensionality reduction and clustering, ending up with a 2D data set that I can plot with plotly to see my clusters (sketched below). It is very cool to play with. Importing 100 PDF files takes hours, but I can take one folder that contains a mix of programming titles, self-help, math, science fiction, etc., and after the fully automated analysis you can clearly see the different topic clusters.
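The whole flow fits in a short script, roughly like this; the model and parameters are illustrative guesses on my part, though all-MiniLM-L6-v2 does match the 384 dimensions:

    # Sketch of the pipeline: PDF -> chunks -> mean embedding -> UMAP -> HDBSCAN.
    # Model choice and clustering parameters are illustrative guesses.
    from pathlib import Path
    import numpy as np
    import pdfplumber
    import umap
    import hdbscan
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings

    def doc_vector(pdf_path: Path, size: int = 100, overlap: int = 10) -> np.ndarray:
        with pdfplumber.open(pdf_path) as pdf:
            text = "\n".join(page.extract_text() or "" for page in pdf.pages)
        # Fixed-size character chunks with a small overlap.
        step = size - overlap
        chunks = [text[i:i + size]
                  for i in range(0, max(len(text) - overlap, 1), step)]
        # One vector per document: the mean of all chunk embeddings.
        return model.encode(chunks).mean(axis=0)

    vectors = np.stack([doc_vector(p) for p in Path("books").glob("*.pdf")])
    coords = umap.UMAP(n_components=2).fit_transform(vectors)  # 384D -> 2D
    labels = hdbscan.HDBSCAN(min_cluster_size=3).fit_predict(coords)
    # coords + labels can now go straight into a plotly scatter.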
I just spent time getting it all running on Docker Compose and moved my web UI from Express to Flask. I want to clean up the code and open source it at some point.