OllaMan Docs

Test Token Throughput Speed

Measure and share model performance metrics

Overview

OllaMan includes a built-in performance measurement tool that shows how fast your models generate responses. Understanding token throughput helps you evaluate model efficiency and optimize your hardware setup.


Viewing Token Speed

Token speed is automatically measured during every chat interaction.

Start a Chat

Navigate to the Chat page and select a model to chat with.


Send a Message

Type and send any message to the model. The longer and more complex the response, the more accurate the speed measurement.

Wait for Response

As the model generates its response, you'll see the streaming text appear in real-time.

Check Speed in Status Bar

After the model completes its response, look at the top status bar in the Chat interface.

You'll see the generation metrics displayed:

  • Tokens per second (e.g., "45.2 tokens/s")
  • Total tokens generated
  • Response time

Understanding the Metrics

Tokens Per Second

What it means: How many tokens (words/word pieces) the model generates each second

Typical ranges:

  • 5-15 tokens/s: Slow, may feel laggy (large model on CPU)
  • 15-30 tokens/s: Moderate, acceptable for most uses
  • 30-60 tokens/s: Fast, smooth experience
  • 60+ tokens/s: Very fast, excellent performance
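A tokens/s figure like this can be derived from the timing metrics Ollama returns with each response: `eval_count` (tokens generated) and `eval_duration` (generation time in nanoseconds) are real fields in Ollama's API, though whether OllaMan computes the number exactly this way is an assumption. A minimal sketch:

```python
# Sketch: deriving tokens/s from the metrics in an Ollama API response.
# eval_count and eval_duration (nanoseconds) are real response fields;
# that OllaMan computes its figure the same way is an assumption.

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Tokens generated per second of decode time."""
    return eval_count / (eval_duration_ns / 1e9)

# Illustrative values: 452 tokens generated in 10 seconds of decode time
print(f"{tokens_per_second(452, 10_000_000_000):.1f} tokens/s")  # 45.2 tokens/s
```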

Factors Affecting Speed

Model Size

Larger models (70B) are slower than smaller ones (7B)

Quantization

Higher-precision quantization (Q8_0) is slower than lower-precision quantization (Q4_0)

Hardware

GPU acceleration can be 10-100x faster than CPU-only inference

Context Length

Longer conversations slow down response generation

System Load

Other applications competing for resources reduce speed

Prompt Complexity

Complex requests may process slower than simple ones


Sharing Performance Cards

OllaMan lets you create beautiful, shareable performance cards to showcase your model's speed.

Generate a Response

Complete a chat interaction to measure token speed (as described above).

Click on Speed Display

In the top status bar, click directly on the token speed number.


View Performance Card

A polished, shareable performance card will appear showing:

  • Model name and version
  • Token generation speed
  • Total tokens generated
  • Response time
  • System information (hardware)
  • Visual speed rating

Capture Screenshot

Use your operating system's screenshot tool to capture the card:

  • macOS: ⌘ + Shift + 4
  • Windows: Win + Shift + S
  • Linux: PrtSc or screenshot tool

Share on Social Media

Share your performance card on:

  • Twitter/X
  • Reddit (r/LocalLLaMA, r/ollama)
  • Discord communities
  • GitHub discussions
  • Tech forums

Privacy Note

The performance card is designed with privacy in mind:

  • Only the model name and speed metrics are shown
  • No conversation content is included
  • No personal information is shared
  • You control what you screenshot and share

Benchmarking Your Setup

Use token speed testing to optimize your configuration:

Testing Different Models

Choose Test Prompt

Use a consistent prompt for fair comparison:

"Write a detailed explanation of how photosynthesis works in plants."

Test Each Model

Send the same prompt to different models and record speeds.

Compare Results

Create a comparison table:

Model            Size      Speed     Quality
llama3.1:8b      4.7 GB    42 t/s    Good
mistral:7b       4.1 GB    48 t/s    Good
llama3.1:70b     40 GB     8 t/s     Excellent
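The steps above can be scripted against Ollama's HTTP API (default port 11434). This is a sketch under those assumptions; the model names are examples and must already be installed locally:

```python
# Benchmark one prompt across several installed models via Ollama's
# /api/generate endpoint (non-streaming). Assumes a local Ollama server on
# the default port; the model names below are examples.
import json
import urllib.request

PROMPT = "Write a detailed explanation of how photosynthesis works in plants."
MODELS = ["llama3.1:8b", "mistral:7b"]

def speed_from_metrics(body: dict) -> float:
    """tokens/s from the eval_count / eval_duration fields of a response."""
    return body["eval_count"] / (body["eval_duration"] / 1e9)

def benchmark(model: str, prompt: str) -> float:
    """Send one generate request and return the measured tokens/s."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return speed_from_metrics(json.load(resp))

def run_all() -> None:
    for model in MODELS:
        print(f"{model}: {benchmark(model, PROMPT):.1f} tokens/s")

# run_all()  # uncomment with a local Ollama server running
```

Because every model sees the identical prompt, the resulting tokens/s numbers are directly comparable.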

Testing Quantization Levels

Compare the same model with different quantizations:

  1. Install multiple versions (e.g., llama3:8b-q4_0, llama3:8b-q8_0)
  2. Test with identical prompts
  3. Evaluate speed vs. quality trade-offs
  4. Choose the best balance for your needs
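Step 3's trade-off is easier to weigh with a concrete number. A small helper, using illustrative (not measured) speeds for hypothetical q4_0 and q8_0 variants:

```python
# Speed-vs-quality comparison for two quantizations of the same model.
# The tokens/s values below are illustrative placeholders, not measurements.
def speedup_percent(fast_tps: float, slow_tps: float) -> float:
    """How much faster the faster quantization is, as a percentage."""
    return (fast_tps / slow_tps - 1) * 100

# e.g. a hypothetical llama3:8b-q4_0 at 48 t/s vs llama3:8b-q8_0 at 30 t/s
print(f"q4_0 is {speedup_percent(48.0, 30.0):.0f}% faster")  # q4_0 is 60% faster
```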

Testing Hardware Configurations

Track speeds before and after:

  • GPU driver updates
  • RAM upgrades
  • Moving from CPU to GPU
  • System optimizations
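Before/after comparisons are easier with a running log. A minimal sketch; the file name and column layout are arbitrary choices:

```python
# Append one speed measurement per line to a CSV so hardware and driver
# changes can be compared over time. File name and columns are arbitrary.
import csv
from datetime import date

def log_speed(model: str, tokens_per_s: float, note: str,
              path: str = "speed_log.csv") -> None:
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(
            [date.today().isoformat(), model, f"{tokens_per_s:.1f}", note])

log_speed("llama3.1:8b", 42.3, "after GPU driver update")
```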

Interpreting Results

What's "Good" Speed?

Quality vs. Speed

Remember that speed isn't everything:

  • Slower large models often produce better results
  • Q4_0 is faster than Q8_0 but may have lower quality
  • Consider your use case when choosing models
  • Sometimes waiting an extra second is worth it for better output

Optimization Tips

Improve Your Token Speed

Hardware Upgrades:

  • Enable GPU acceleration in Ollama
  • Upgrade to a dedicated GPU (NVIDIA recommended)
  • Add more RAM for larger context windows
  • Use NVMe SSD for faster model loading

Software Optimization:

  • Close unnecessary applications during inference
  • Use smaller quantized models when quality allows
  • Keep context windows reasonable (avoid very long chats)
  • Update Ollama to the latest version
  • Update GPU drivers regularly

Model Selection:

  • Use smaller models for simple tasks (7B instead of 70B)
  • Choose appropriate quantization (Q4_0 for speed, Q8_0 for quality)
  • Test different model families (Mistral models are often faster than comparable Llama models)
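One way to verify the GPU tips above took effect: `ollama ps` lists loaded models with a processor column showing GPU vs CPU placement. A sketch; the exact output layout is an assumption about current Ollama releases:

```python
# Check whether loaded models are running on GPU via `ollama ps`.
# The exact column layout of `ollama ps` output is an assumption
# about current Ollama releases.
import subprocess

def on_gpu(ps_row: str) -> bool:
    """True if an `ollama ps` row reports any GPU placement."""
    return "GPU" in ps_row

def check_gpu_usage() -> None:
    out = subprocess.run(["ollama", "ps"], capture_output=True, text=True).stdout
    for row in out.splitlines()[1:]:  # skip the header row
        if row.strip():
            name = row.split()[0]
            status = "GPU" if on_gpu(row) else "CPU only (expect slow speeds)"
            print(f"{name}: {status}")

# check_gpu_usage()  # uncomment with Ollama installed and a model loaded
```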

Troubleshooting


Next Steps