Comment by johnwatson11218 a day ago
I did something similar: I used pdfplumber to extract text from my PDF book collection, dumped it into PostgreSQL, then split the text into 100-character chunks with a 10-character overlap. Each chunk was embedded into a 384-dimensional space with the Python sentence_transformers library. I then averaged all the chunk vectors for a document and wrote that single vector back to PostgreSQL. Finally, I used UMAP + HDBSCAN for dimensionality reduction and clustering, which left me with a 2D data set I can plot with plotly to see my clusters. It is very cool to play with. Importing 100 PDF files takes hours, but I can point it at one folder containing a mix of programming titles, self-help, math, science fiction, etc., and after the fully automated analysis you can clearly see the distinct topic clusters.
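The core of the pipeline looks roughly like this (a simplified sketch, not my exact code; the model name, folder path, and HDBSCAN parameters are illustrative placeholders):

    # Sketch of the pipeline: chunk, embed, average, then reduce + cluster.
    # "all-MiniLM-L6-v2", the books/ folder, and min_cluster_size are placeholders.
    import numpy as np
    import pdfplumber
    import umap
    import hdbscan
    import plotly.express as px
    from pathlib import Path
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # a 384-dim model

    def chunk_text(text, size=100, overlap=10):
        # Fixed-width character chunks with a small overlap.
        step = size - overlap
        return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

    def embed_pdf(path):
        # Extract all page text, embed each chunk, average into one doc vector.
        with pdfplumber.open(path) as pdf:
            text = "\n".join(page.extract_text() or "" for page in pdf.pages)
        vectors = model.encode(chunk_text(text))  # shape: (n_chunks, 384)
        return vectors.mean(axis=0)               # single 384-d vector per doc

    pdf_paths = sorted(Path("books/").glob("*.pdf"))
    doc_vectors = np.stack([embed_pdf(p) for p in pdf_paths])

    # UMAP down to 2D, then density-based clustering on the projection.
    coords = umap.UMAP(n_components=2, metric="cosine").fit_transform(doc_vectors)
    labels = hdbscan.HDBSCAN(min_cluster_size=3).fit_predict(coords)

    fig = px.scatter(x=coords[:, 0], y=coords[:, 1], color=labels.astype(str))
    fig.show()

Clustering directly on the 2D projection keeps it simple; a common variant is to cluster in a higher-dimensional UMAP output and only project to 2D for the plot.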
I just spent time getting it all running on Docker Compose and moved my web UI from Express.js to Flask. I want to get the code cleaned up and open-source it at some point.
Thanks for the supportive comments. I'm definitely thinking I should release sooner rather than later. I have been using LLMs for specific tasks; here is a sample stored procedure I had an LLM write for me.
    --
    -- Name: refresh_topic_tables(); Type: PROCEDURE; Schema: public; Owner: postgres
    --
    CREATE PROCEDURE public.refresh_topic_tables()
        LANGUAGE plpgsql
        AS $$
    BEGIN
        -- Drop tables in reverse dependency order
        DROP TABLE IF EXISTS topic_top_terms;
        DROP TABLE IF EXISTS topic_term_tfidf;
        DROP TABLE IF EXISTS term_df;
        DROP TABLE IF EXISTS term_tf;
        DROP TABLE IF EXISTS topic_terms;
    EXCEPTION
        WHEN OTHERS THEN
            RAISE EXCEPTION 'Error refreshing topic tables: %', SQLERRM;
    END;
    $$;
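Since it's a procedure rather than a function, invoking it after each import (from psql, or from the Flask side via any driver that supports procedures) is just:

    CALL public.refresh_topic_tables();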