[GH-ISSUE #502] Cerebras inference support #82

Open
opened 2026-03-03 13:52:51 +03:00 by kerem · 1 comment

Originally created by @neoOpus on GitHub (Jun 29, 2025).
Original GitHub issue: https://github.com/jehna/humanify/issues/502

Hi,

I would like to know if anyone has worked on making humanify support Cerebras inference, as it is OpenAI-compatible and can be a better alternative in terms of speed and cost.

https://inference-docs.cerebras.ai/resources/openai


@0xdevalias commented on GitHub (Jun 30, 2025):

> as it is compatible with OpenAI
@neoOpus Have you tried using the `humanify openai --baseURL` param in the way they suggest?

- https://inference-docs.cerebras.ai/resources/openai#configuring-openai-to-use-cerebras-api
  - > Configuring OpenAI to Use Cerebras API

https://github.com/jehna/humanify/blob/7beba2d32433e58bb77d0e1b0eda01c470fec3e2/src/commands/openai.ts#L20-L24

I'd be interested to hear if you manage to get it to work, and also your feedback on the speed differences, how effective the different models are when used with humanify, etc.
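
For reference, a rough sketch of what that invocation might look like (untested; the base URL is taken from Cerebras' OpenAI-compatibility docs, and the flag names other than `--baseURL` are assumptions — check `humanify openai --help`):

```shell
# Untested sketch: point humanify's OpenAI provider at Cerebras'
# OpenAI-compatible endpoint (per https://inference-docs.cerebras.ai/resources/openai).
# The --apiKey / --model flag names are assumptions; verify against the CLI help.
humanify openai \
  --baseURL "https://api.cerebras.ai/v1" \
  --apiKey "$CEREBRAS_API_KEY" \
  --model "llama-3.3-70b" \
  minified.js
```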


It seems it's also usable via OpenRouter:

- https://github.com/jehna/humanify/issues/416
- https://inference-docs.cerebras.ai/resources/openrouter-cerebras
- https://openrouter.ai/provider/cerebras

These seem to be the models currently available:

- https://inference-docs.cerebras.ai/introduction
  - > The Cerebras Inference API currently provides access to the following models:
    >
    > | Model Name | Model ID | Parameters | Speed (tokens/s) |
    > |:---|:---|:---|:---|
    > | Llama 4 Scout | `llama-4-scout-17b-16e-instruct` | 109 billion | ~2600 tokens/s |
    > | Llama 3.1 8B | `llama3.1-8b` | 8 billion | ~2200 tokens/s |
    > | Llama 3.3 70B | `llama-3.3-70b` | 70 billion | ~2100 tokens/s |
    > | Qwen 3 32B\* | `qwen-3-32b` | 32 billion | ~2100 tokens/s |
    > | DeepSeek R1 Distill Llama 70B\* | `deepseek-r1-distill-llama-70b` | 70 billion | ~1700 tokens/s |
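
As a quick smoke test with one of the model IDs above, a chat completion against the OpenAI-compatible endpoint might look like this (a sketch based on Cerebras' docs; the endpoint path and payload follow the standard OpenAI chat-completions shape, not anything humanify-specific):

```shell
# Untested sketch: OpenAI-style chat completion against Cerebras' endpoint.
# Requires a CEREBRAS_API_KEY; model IDs are from the table above.
curl https://api.cerebras.ai/v1/chat/completions \
  -H "Authorization: Bearer $CEREBRAS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1-8b",
    "messages": [{"role": "user", "content": "Suggest a readable name for a variable currently called `a`"}]
  }'
```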

The pricing:

- https://inference-docs.cerebras.ai/support/pricing
  - > Pricing
  - > Our free tier supports a context length of 8,192 tokens. For all supported models, we also offer context lengths up to 128K upon request.
- https://inference-docs.cerebras.ai/support/pricing#exploration-tier-pricing
  - > | Model | Speed | Input | Output |
    > |:---|:---|:---|:---|
    > | Llama 4 Scout | ~2600 tokens/s | \$0.65/M tokens | \$0.85/M tokens |
    > | Llama 3.1 8B | ~2200 tokens/s | \$0.10/M tokens | \$0.10/M tokens |
    > | Llama 3.3 70B | ~2100 tokens/s | \$0.85/M tokens | \$1.20/M tokens |
    > | Qwen 3 32B | ~2100 tokens/s | \$0.40/M tokens | \$0.80/M tokens |
    > | Deepseek R1 Distill Llama 70B | ~1700 tokens/s | \$2.20/M tokens | \$2.50/M tokens |

And the rate limits:

- https://inference-docs.cerebras.ai/support/rate-limits
  - > Rate Limits

And further docs about tool use/function calling:

- https://inference-docs.cerebras.ai/capabilities/tool-use
  - > Tool Use
- https://inference-docs.cerebras.ai/agent-bootcamp/section-2
  - > Tool Use and Function Calling

See Also:

- https://github.com/jehna/humanify/issues/400
- https://github.com/jehna/humanify/issues/84