Test Token Throughput Speed
Measure and share model performance metrics
Overview
OllaMan includes a built-in performance measurement tool that shows how fast your models generate responses. Understanding token throughput helps you evaluate model efficiency and optimize your hardware setup.
Viewing Token Speed
Token speed is automatically measured during every chat interaction.
Start a Chat
Navigate to the Chat page and select a model to chat with.

Send a Message
Type and send any message to the model. The longer and more complex the response, the more accurate the speed measurement.
Wait for Response
As the model generates its response, you'll see the streaming text appear in real-time.
Check Speed in Status Bar
After the model completes its response, look at the top status bar in the Chat interface.
You'll see the token generation speed displayed as:
- Tokens per second (e.g., "45.2 tokens/s")
- Total tokens generated
- Response time
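Under the hood, these numbers come from the final message of Ollama's streaming response, which reports `eval_count` (tokens generated) and `eval_duration` (in nanoseconds). A minimal sketch of the arithmetic (the example values are illustrative, not from a real run):

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed derived from Ollama's final stream message:
    eval_count tokens over eval_duration nanoseconds."""
    return eval_count / (eval_duration_ns / 1e9)

# e.g. 452 tokens generated over 10 seconds
print(f"{tokens_per_second(452, 10_000_000_000):.1f} tokens/s")
```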
Understanding the Metrics
Tokens Per Second
What it means: How many tokens (words/word pieces) the model generates each second
Typical ranges:
- 5-15 tokens/s: Slow, may feel laggy (large model on CPU)
- 15-30 tokens/s: Moderate, acceptable for most uses
- 30-60 tokens/s: Fast, smooth experience
- 60+ tokens/s: Very fast, excellent performance
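If you log speeds yourself, the tiers above can be expressed as a simple lookup. The labels here just mirror this table; they are not OllaMan's internal rating code:

```python
def speed_rating(tokens_per_sec: float) -> str:
    """Map a measured speed to the rough tiers listed above."""
    if tokens_per_sec >= 60:
        return "Very fast"
    if tokens_per_sec >= 30:
        return "Fast"
    if tokens_per_sec >= 15:
        return "Moderate"
    if tokens_per_sec >= 5:
        return "Slow"
    return "Very slow"
```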
Factors Affecting Speed
Model Size
Larger models (70B) are slower than smaller ones (7B)
Quantization
Higher-precision quantization (e.g., Q8_0) is slower than lower-precision quantization (e.g., Q4_0)
Hardware
GPU acceleration is often 10-100x faster than CPU-only inference
Context Length
Longer conversations slow down response generation
System Load
Other applications competing for resources reduce speed
Prompt Complexity
Complex requests may process slower than simple ones
Sharing Performance Cards
OllaMan lets you create beautiful, shareable performance cards to showcase your model's speed.
Generate a Response
Complete a chat interaction to measure token speed (as described above).
Click on Speed Display
In the top status bar, click directly on the token speed number.

View Performance Card
A polished performance card appears, showing:
- Model name and version
- Token generation speed
- Total tokens generated
- Response time
- System information (hardware)
- Visual speed rating
Capture Screenshot
Use your operating system's screenshot tool to capture the card:
- macOS: ⌘ + Shift + 4
- Windows: Win + Shift + S
- Linux: PrtSc or your distribution's screenshot tool
Share on Social Media
Share your performance card on:
- Twitter/X
- Reddit (r/LocalLLaMA, r/ollama)
- Discord communities
- GitHub discussions
- Tech forums
Privacy Note
The performance card includes only the model name, speed metrics, and a hardware summary:
- No conversation content is included
- No personal information is shared
- You control what you screenshot and share
Benchmarking Your Setup
Use token speed testing to optimize your configuration:
Testing Different Models
Choose Test Prompt
Use a consistent prompt for fair comparison:
"Write a detailed explanation of how photosynthesis works in plants."
Test Each Model
Send the same prompt to different models and record speeds.
Compare Results
Create a comparison table:
| Model | Size | Speed | Quality |
|---|---|---|---|
| llama3.1:8b | 4.7GB | 42 t/s | Good |
| mistral:7b | 4.1GB | 48 t/s | Good |
| llama3.1:70b | 40GB | 8 t/s | Excellent |
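A comparison like this can also be scripted directly against Ollama's REST API (`POST /api/generate` on the default port 11434). The endpoint and the `eval_count`/`eval_duration` fields are Ollama's; the model list and URL are placeholders for your setup. A sketch, using only the standard library:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint
TEST_PROMPT = "Write a detailed explanation of how photosynthesis works in plants."

def throughput(data: dict) -> float:
    """tokens/s from a completed /api/generate response:
    eval_count tokens over eval_duration nanoseconds."""
    return data["eval_count"] / (data["eval_duration"] / 1e9)

def measure(model: str, prompt: str = TEST_PROMPT) -> float:
    """Send one non-streaming request and return generation speed."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return throughput(json.load(resp))

# With a local Ollama server running, you could loop over your models:
# for model in ["llama3.1:8b", "mistral:7b"]:
#     print(f"{model}: {measure(model):.1f} t/s")
```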
Testing Quantization Levels
Compare the same model with different quantizations:
- Install multiple versions (e.g., llama3:8b-q4_0, llama3:8b-q8_0)
- Test with identical prompts
- Evaluate speed vs. quality trade-offs
- Choose the best balance for your needs
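A rough rule of thumb for the disk/VRAM side of the trade-off: file size scales with parameters times effective bits per weight. The bit counts below are approximations for GGUF block quantization (4-bit or 8-bit weights plus per-block scales), not exact figures for any specific model:

```python
def approx_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough model size: parameters x effective bits per weight.
    Q4_0 ~ 4.5 bits, Q8_0 ~ 8.5 bits (including per-block scales)."""
    return params_billion * bits_per_weight / 8

print(f"8B @ Q4_0 ~ {approx_size_gb(8, 4.5):.1f} GB")
print(f"8B @ Q8_0 ~ {approx_size_gb(8, 8.5):.1f} GB")
```

This lines up with the sizes in the table above (e.g., an 8B model at Q4-class quantization landing near 4.5-4.7 GB).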
Testing Hardware Configurations
Track speeds before and after:
- GPU driver updates
- RAM upgrades
- Moving from CPU to GPU
- System optimizations
Interpreting Results
What's "Good" Speed?
"Good" depends on your use case: for interactive chat, 30+ tokens/s feels smooth, 15-30 tokens/s is workable, and below 15 tokens/s you will notice the wait (see the typical ranges above).
Quality vs. Speed
Remember that speed isn't everything:
- Slower large models often produce better results
- Q4_0 is faster than Q8_0 but may have lower quality
- Consider your use case when choosing models
- Sometimes waiting an extra second is worth it for better output
Optimization Tips
Improve Your Token Speed
Hardware Upgrades:
- Enable GPU acceleration in Ollama
- Upgrade to a dedicated GPU (NVIDIA recommended)
- Add more RAM for larger context windows
- Use NVMe SSD for faster model loading
Software Optimization:
- Close unnecessary applications during inference
- Use smaller quantized models when quality allows
- Keep context windows reasonable (avoid very long chats)
- Update Ollama to the latest version
- Update GPU drivers regularly
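If you call Ollama's API directly, the "keep context windows reasonable" tip maps to the `num_ctx` option on a generate request. The option name is Ollama's; the model, prompt, and value here are illustrative:

```python
import json

# Illustrative /api/generate request body: num_ctx caps the context window,
# trading long-conversation memory for faster generation and lower RAM use.
payload = {
    "model": "llama3.1:8b",
    "prompt": "Summarize the plot of Hamlet in three sentences.",
    "stream": False,
    "options": {"num_ctx": 2048},
}
print(json.dumps(payload, indent=2))
```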
Model Selection:
- Use smaller models for simple tasks (7B instead of 70B)
- Choose appropriate quantization (Q4_0 for speed, Q8_0 for quality)
- Test different model families (Mistral models are often faster than comparable Llama models)