Show HN: Use Claude Code to Query 600 GB Indexes over Hacker News, ArXiv, etc.

88 points by Xyra 4 hours ago

Paste in my prompt to Claude Code with an embedded API key for accessing my public readonly SQL+vector database, and you have a state-of-the-art research tool over Hacker News, arXiv, LessWrong, and dozens of other high-quality public commons sites. Claude whips up the monster SQL queries that safely run on my machine, to answer your most nuanced questions.

There's also an Alerts feature: just ask Claude to submit a SQL query as an alert, and you'll be emailed when the ultra-nuanced criteria are met (and the output changes). For example, I want to know when somebody posts about "estrogen" in a psychoactive context, or uses enough biology metaphors when talking about building infrastructure.
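
The alert idea described above (re-run a saved query, notify only when its output changes) could be sketched roughly like this. It is an illustration under assumptions, not the actual implementation: an in-memory SQLite table stands in for the real database, the `posts` table and the function names `result_fingerprint` and `check_alert` are hypothetical, and a print statement stands in for the email.

```python
import hashlib
import json
import sqlite3


def result_fingerprint(conn, sql):
    """Run the query and return (rows, sha256 hex digest of the serialized rows)."""
    rows = conn.execute(sql).fetchall()
    payload = json.dumps(rows, default=str).encode("utf-8")
    return rows, hashlib.sha256(payload).hexdigest()


def check_alert(conn, sql, last_fingerprint):
    """Return (fired, new_fingerprint); fires only if rows exist and differ from last run."""
    rows, fingerprint = result_fingerprint(conn, sql)
    fired = bool(rows) and fingerprint != last_fingerprint
    return fired, fingerprint


# Hypothetical stand-in data; the real schema is not described in the post.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO posts (title) VALUES ('Show HN: a build tool')")

ALERT_SQL = "SELECT id, title FROM posts WHERE title LIKE '%estrogen%' ORDER BY id"

fired_1, fp_1 = check_alert(conn, ALERT_SQL, None)   # no matching rows yet
conn.execute("INSERT INTO posts (title) VALUES ('Notes on estrogen and mood')")
fired_2, fp_2 = check_alert(conn, ALERT_SQL, fp_1)   # a new match appears
fired_3, fp_3 = check_alert(conn, ALERT_SQL, fp_2)   # nothing changed since last run

if fired_2:
    print("alert: query output changed")  # a real service would send an email here
```

Storing only a fingerprint of the previous result keeps the per-alert state small; the ORDER BY in the query matters, since otherwise a reordering of identical rows would look like a change.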

Currently embedded:
posts: 1.4M / 4.6M
comments: 15.6M / 38M
That's with Voyage-3.5-lite. And you can do amazing compositional vector search, like searching @FTX_crisis - (@guilt_tone - @guilt_topic) to find writing that is about the FTX crisis and distinctly without guilty tones, but that can still mention "guilt".
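
To make the vector arithmetic concrete, here is a toy sketch of a query like @FTX_crisis - (@guilt_tone - @guilt_topic). Everything in it is an assumption for illustration: the 3-dimensional vectors are hand-made rather than real Voyage embeddings, the document names are invented, and how the actual service normalizes or weights the terms is not stated in the post.

```python
import math


def sub(a, b):
    """Element-wise a - b."""
    return [x - y for x, y in zip(a, b)]


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


# Hypothetical toy axes: [FTX topic, guilt as a topic, guilty tone].
ftx_crisis = [1.0, 0.0, 0.0]
guilt_topic = [0.0, 1.0, 0.0]
guilt_tone = [0.0, 1.0, 1.0]  # guilty-sounding text also tends to mention guilt

# @FTX_crisis - (@guilt_tone - @guilt_topic)
tone_only = sub(guilt_tone, guilt_topic)  # isolates tone by removing the topic part
query = sub(ftx_crisis, tone_only)        # about FTX, pushed away from guilty tone

docs = {
    "neutral FTX analysis that mentions guilt": [1.0, 0.5, 0.0],
    "remorseful FTX confession": [1.0, 0.5, 1.0],
    "unrelated guilty diary entry": [0.0, 0.5, 1.0],
}

ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
for name in ranked:
    print(f"{cosine(query, docs[name]):+.3f}  {name}")
```

Subtracting @guilt_topic from @guilt_tone before subtracting the result from the main query is what lets a document mention guilt without being penalized: only the tone direction is pushed away.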

I can embed everything and all the other sources for cheap, I just literally don't have the money.

barishnamazov 2 hours ago

I like that this relies on generating SQL rather than just being a black-box chat bot. It feels like the right way to use LLMs for research: as a translator from natural language to a rigid query language, rather than as the database itself. Very cool project!

Hopefully your API doesn't get exploited and you are doing timeouts/sandboxing -- it'd be easy to do a massive join on this.

I also have a question mostly stemming from me being not knowledgeable in the area -- have you noticed any semantic bleeding when research is done between your datasets? e.g., "optimization" probably means different things under ArXiv, LessWrong, and HN. Wondering if vector searches account for this given a more specific question.
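
On the timeout/sandboxing concern raised above, one possible shape for a guard is sketched below. It is only a stand-in: the thread does not say which database engine or safeguards the service uses, so this uses SQLite from the Python standard library, a hypothetical `comments` table, and a hypothetical helper `run_guarded`. Writes are rejected with `PRAGMA query_only`, and long-running statements are interrupted through a progress handler.

```python
import sqlite3
import time


def run_guarded(conn, sql, timeout_s=0.2):
    """Run a query, aborting it if it runs longer than timeout_s seconds."""
    deadline = time.monotonic() + timeout_s

    def abort_if_late():
        # A non-zero return value tells SQLite to interrupt the running statement.
        return 1 if time.monotonic() > deadline else 0

    conn.set_progress_handler(abort_if_late, 1000)  # check every 1000 VM instructions
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.set_progress_handler(None, 0)  # remove the handler again


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO comments (body) VALUES (?)", [("x",)] * 2000)
conn.commit()
conn.execute("PRAGMA query_only = ON")  # reject writes from here on

ok_rows = run_guarded(conn, "SELECT COUNT(*) FROM comments")

try:
    # Unconstrained three-way self join: 2000^3 row combinations.
    run_guarded(conn, "SELECT COUNT(*) FROM comments a, comments b, comments c")
    join_aborted = False
except sqlite3.OperationalError:
    join_aborted = True

try:
    run_guarded(conn, "DELETE FROM comments")
    write_rejected = False
except sqlite3.OperationalError:
    write_rejected = True

print(ok_rows, join_aborted, write_rejected)
```

A server-grade database would typically enforce the same two properties with a read-only role and a per-statement timeout setting rather than client-side hooks.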

keeeba an hour ago

I don’t have the experiments to prove this, but in my experience it’s highly variable between embedding models.
Larger, more capable embedding models are better able to separate the different uses of a given word in the embedding space; smaller models are not.

- A4ET8a8uTh0_v2 38 minutes ago
  
  I've been thinking about this a fair bit lately. We have all sorts of benchmarks that describe a lot of factors in detail, but they are all very abstract and do not seem to map clearly onto well-observed behaviors. I think we need a different way to describe those.
  
nielsole 40 minutes ago

I think a prompt + an external dataset is a very simple distribution channel right now to explore anything quickly with low friction. The curl | bash of 2026

kburman an hour ago

> a state-of-the-art research tool over Hacker News, arXiv, LessWrong, and dozens

what makes this state of the art?

7moritz7 36 minutes ago

The scale. How many tools do you know that can query the content of all arXiv papers?

ashirviskas an hour ago

First, so best in this?

nandomrumber an hour ago

The tool is state of the art, the sources are historical.

m11a 27 minutes ago

The quick setup is cool! I’ve not seen this onboarding flow for other tools, and I quite like its simplicity.

7777777phil 2 hours ago

Really useful. I'm currently working on an autonomous academic research system [1] and thinking about integrating this. I'm currently using a custom prompt + the Edison Scientific API. Any plans to make this open source?

[1] https://github.com/giatenica/gia-agentic-short

nineteen999 2 hours ago

That's just not a good use of my Claude plan. If you can make it so a self-hosted Llama or Qwen 7B can query it, then that's something.

mcintyre1994 an hour ago

I think that’s just a matter of their capabilities, rather than anything specific to this?

mentalgear 2 hours ago

Nice, but would you consider open-sourcing it? I (and, I assume, others) am not keen on sharing my API keys with a third party.

nielsole an hour ago

I think you misunderstood. The API key is for their API, not Anthropic's.
If you take a look at the prompt, you'll find that they have a static API key that they created for this demo ("exopriors_public_readonly_v1_2025").

fragmede 27 minutes ago

> I can embed everything and all the other sources for cheap, I just literally don't have the money.

How much do you need for the various leaks, like the Paradise Papers, the Panama Papers, the Offshore Leaks, the Bahamas Leaks, the FinCEN Files, the Uber Files, etc., and what's your Venmo?

gtsnexp 2 hours ago

Is the appeal of this tool its ability to identify semantic similarity?

A4ET8a8uTh0_v2 25 minutes ago

The use case could vary from person to person. When you think about it, Hacker News has a large enough data set (and one that is widely accessible) to allow all sorts of fun analyses. In a sense, the appeal is: who knows what kind of fun patterns could emerge?

bugglebeetle 2 hours ago

Seems very cool, but IMO you’d be better off doing an open source version and then a hosted SaaS.

octoberfranklin 2 hours ago

"Claude Code and Codex are essentially AGI at this point"

Okaaaaaaay....

Closi an hour ago

Just comes down to your own view of what AGI is, as it's not particularly well defined.
While a bit 'time-machiney' - I think if you took an LLM of today and showed it to someone 20 years ago, most people would probably say AGI has been achieved. If someone wrote a definition of AGI 20 years ago, we would probably have met that.
We have certainly blasted past some science-fiction examples of AI like Agnes from The Twilight Zone, which 20 years ago looked a bit silly, and now looks like a remarkable prediction of LLMs.
By today's definition of AGI we haven't met it yet, but eventually it comes down to 'I know it when I see it'. The problem with this definition is that it is polluted by what people have already seen.

- bananaflag 37 minutes ago
  
  > If someone wrote a definition of AGI 20 years ago, we would probably have met that.
  No, as long as people can do work that a robot cannot do, we don't have AGI. That was always, if not the definition, at least implied by the definition.
  I don't know why the meme of AGI being not well defined has had such success over the past few years.
  
  Closi 29 minutes ago
  
  Completely disagree - Your definition (in my opinion) is more aligned to the concept of Artificial Super Intelligence.
  Surely the 'General Intelligence' definition has to be consistent between 'Artificial General Intelligence' and 'Human General Intelligence', and humans can be generally intelligent even if they can't solve calculus equations or protein folding problems. My bar for general intelligence is much lower than most people's: I think a dog is probably generally intelligent, although obviously in a different way (dogs are better at learning how to run and catch a ball, and worse at programming Python).
  
phatfish an hour ago

I want to know what the "intelligence explosion" is, sounds much cooler than AGI.

- adammarples an hour ago
  
  When AI gets so good it can improve on itself
  
Hamuko 2 hours ago

I have noticed that Claude users seem to be about as intelligent as Claude itself, and wouldn't be able to surpass its output.
