KTibow 2 days ago

Without any changes, you can already use Codex with a remote or local API by setting base URL and key environment variables.
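
For example, a rough sketch with the OpenAI Python SDK, which reads both variables from the environment (the Node SDK that Codex is built on handles them the same way, as far as I know; the URL, key, and model name below are placeholders for whatever you run):

    import os
    from openai import OpenAI

    # Point the SDK at any OpenAI-compatible endpoint, remote or local.
    os.environ["OPENAI_BASE_URL"] = "http://localhost:11434/v1"
    os.environ["OPENAI_API_KEY"] = "unused-but-required"

    client = OpenAI()  # picks up both variables automatically
    reply = client.chat.completions.create(
        model="llama3",  # whatever model name your server exposes
        messages=[{"role": "user", "content": "Say hello."}],
    )
    print(reply.choices[0].message.content)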

kingo55 2 days ago

Does it work for local though? It's my understanding this is still missing.

  • KTibow 2 days ago

    It does, as long as your favorite LLM inference program can serve a Chat Completions API.
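
    A quick way to check is to hit the /v1/chat/completions route directly - port and model name depend on your server; this sketch assumes llama.cpp's llama-server on its default port 8080:

        import requests

        # Any OpenAI-compatible local server exposes the same route; adjust to taste.
        resp = requests.post(
            "http://localhost:8080/v1/chat/completions",
            json={
                "model": "phi-4-mini",  # some servers ignore this, others require it
                "messages": [{"role": "user", "content": "Reply with the word 'ok'."}],
            },
            timeout=60,
        )
        resp.raise_for_status()
        print(resp.json()["choices"][0]["message"]["content"])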

    • codingmoh 2 days ago

      Thanks for bringing that up - it's exactly why I approached it this way from the start.

      Technically, you can use the original Codex CLI with a local LLM - as long as your inference provider implements the OpenAI Chat Completions API, including function calling.

      But based on what I had in mind - the idea that small models can be really useful if optimized for very specific use cases - I figured the current architecture of Codex CLI wasn't the best fit for that. So instead of forking it, I started from scratch.

      Here's the rough thinking behind it:

         1. You still have to manually set up and run your own inference server (e.g., Ollama, LM Studio, vLLM).
         2. You need to ensure that the model you choose works well with Codex's pre-defined prompt setup and configuration.
         3. Prompting patterns for small open-source models (like phi-4-mini) often need to be very different - they don't generalize as well.
         4. The function calling format (or structured output) might not even be supported by your local inference provider.
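
      To make point 4 concrete: agentic tool use goes through the Chat Completions function-calling format, and the local server has to reproduce it faithfully. Roughly this shape - a sketch with a made-up run_shell tool, not Codex's actual tool schema:

          from openai import OpenAI

          client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

          # The request carries a JSON schema for each tool; the server must answer
          # with a well-formed tool_calls entry for the agent loop to work at all.
          tools = [{
              "type": "function",
              "function": {
                  "name": "run_shell",  # hypothetical tool, for illustration only
                  "description": "Run a shell command and return its output.",
                  "parameters": {
                      "type": "object",
                      "properties": {"command": {"type": "string"}},
                      "required": ["command"],
                  },
              },
          }]

          resp = client.chat.completions.create(
              model="phi-4-mini",
              messages=[{"role": "user", "content": "List the files in this directory."}],
              tools=tools,
          )
          calls = resp.choices[0].message.tool_calls
          print(calls[0].function.arguments if calls else "no tool call emitted")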
      
      Codex CLI's implementation and prompts seem tailored for a specific class of hosted, large-scale models (e.g. GPT, Gemini, Grok). But if you want to get good results with small, local models, everything - prompting, reasoning chains, output structure - often needs to be different.

      So I built this with a few assumptions in mind:

         - Write the tool specifically to run _locally_ out of the box, no inference API server required.
         - Use the model directly (currently phi-4-mini via llama-cpp-python); see the sketch below.
         - Optimize the prompt and execution logic _per model_ to get the best performance.
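
      As a rough illustration of the "use the model directly" part - this is a sketch, not the tool's actual code, and the model path and system prompt are placeholders:

          from llama_cpp import Llama

          # Load the GGUF weights in-process: no separate inference server needed.
          llm = Llama(model_path="models/phi-4-mini-q4_k_m.gguf", n_ctx=4096)

          # A prompt tuned for one specific small model rather than a generic
          # large-model agent prompt; kept deliberately short and rigid.
          SYSTEM = (
              "You are a coding assistant. Reply with a single shell command "
              "inside <cmd></cmd> tags and nothing else."
          )

          out = llm.create_chat_completion(
              messages=[
                  {"role": "system", "content": SYSTEM},
                  {"role": "user", "content": "Show git status in short form."},
              ],
              temperature=0.2,
          )
          print(out["choices"][0]["message"]["content"])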
      
      Instead of forcing small models into a system meant for large, general-purpose APIs, I wanted to explore a local-first, model-specific alternative that's easy to install and extend — and free to run.

asadm 2 days ago

I think this was made before that PR was merged into Codex.

  • KTibow 2 days ago

    Good correction - while the SDK used has supported changing the API through environment variables for a long time, Codex itself only recently added Chat Completions support.

  • xiphias2 2 days ago

    Maybe it was part of the reason they accepted the PR. Forks would happen anyway if they didn't allow arbitrary LLMs.

    A bit like how Android came after the iPhone with an open-source implementation.