OllaMan Docs

Test Token Throughput Speed

Measure and share model performance metrics

Overview

OllaMan includes a built-in performance measurement tool that shows how fast your models generate responses. Understanding token throughput helps you evaluate model efficiency and optimize your hardware setup.


Viewing Token Speed

Token speed is automatically measured during every chat interaction.

Start a Chat

Navigate to the Chat page and select a model to chat with.


Send a Message

Type and send any message to the model. The longer and more complex the response, the more accurate the speed measurement.

Wait for Response

As the model generates its response, you'll see the streaming text appear in real-time.

Check Speed in Status Bar

After the model completes its response, look at the top status bar in the Chat interface.

You'll see the generation metrics displayed:

  • Tokens per second (e.g., "45.2 tokens/s")
  • Total tokens generated
  • Response time

Understanding the Metrics

Tokens Per Second

What it means: How many tokens (words/word pieces) the model generates each second

Typical ranges:

  • 5-15 tokens/s: Slow, may feel laggy (large model on CPU)
  • 15-30 tokens/s: Moderate, acceptable for most uses
  • 30-60 tokens/s: Fast, smooth experience
  • 60+ tokens/s: Very fast, excellent performance
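A tokens/s figure like this can be derived from the timing metrics Ollama returns with each response: `eval_count` (tokens generated) and `eval_duration` (generation time in nanoseconds) are real fields in Ollama's API, though whether OllaMan computes the number exactly this way is an assumption. A minimal sketch:

```python
# Sketch: deriving tokens/s from the metrics in an Ollama API response.
# eval_count and eval_duration (nanoseconds) are real response fields;
# that OllaMan computes its figure the same way is an assumption.

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Tokens generated per second of decode time."""
    return eval_count / (eval_duration_ns / 1e9)

# Illustrative values: 452 tokens generated in 10 seconds of decode time
print(f"{tokens_per_second(452, 10_000_000_000):.1f} tokens/s")  # 45.2 tokens/s
```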

Factors Affecting Speed

Model Size

Larger models (70B) are slower than smaller ones (7B)

Quantization

Higher-precision quantization (Q8_0) is slower than lower-precision quantization (Q4_0)

Hardware

GPU acceleration can be 10-100x faster than CPU-only inference

Context Length

Longer conversations slow down response generation

System Load

Other applications competing for resources reduce speed

Prompt Complexity

Complex requests may process slower than simple ones


Sharing Performance Cards

OllaMan lets you create beautiful, shareable performance cards to showcase your model's speed.

Generate a Response

Complete a chat interaction to measure token speed (as described above).

Click on Speed Display

In the top status bar, click directly on the token speed number.


View Performance Card

A polished, shareable performance card will appear showing:

  • Model name and version
  • Token generation speed
  • Total tokens generated
  • Response time
  • System information (hardware)
  • Visual speed rating

Capture Screenshot

Use your operating system's screenshot tool to capture the card:

  • macOS: ⌘ + Shift + 4
  • Windows: Win + Shift + S
  • Linux: PrtSc or screenshot tool

Share on Social Media

Share your performance card on:

  • Twitter/X
  • Reddit (r/LocalLLaMA, r/ollama)
  • Discord communities
  • GitHub discussions
  • Tech forums

Privacy Note

The performance card is designed with privacy in mind:

  • Only the model name and speed metrics are shown
  • No conversation content is included
  • No personal information is shared
  • You control what you screenshot and share

Benchmarking Your Setup

Use token speed testing to optimize your configuration:

Testing Different Models

Choose Test Prompt

Use a consistent prompt for fair comparison:

"Write a detailed explanation of how photosynthesis works in plants."

Test Each Model

Send the same prompt to different models and record speeds.

Compare Results

Create a comparison table:

Model            Size      Speed     Quality
llama3.1:8b      4.7 GB    42 t/s    Good
mistral:7b       4.1 GB    48 t/s    Good
llama3.1:70b     40 GB     8 t/s     Excellent
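The steps above can be scripted against Ollama's HTTP API (default port 11434). This is a sketch under those assumptions; the model names are examples and must already be installed locally:

```python
# Benchmark one prompt across several installed models via Ollama's
# /api/generate endpoint (non-streaming). Assumes a local Ollama server on
# the default port; the model names below are examples.
import json
import urllib.request

PROMPT = "Write a detailed explanation of how photosynthesis works in plants."
MODELS = ["llama3.1:8b", "mistral:7b"]

def speed_from_metrics(body: dict) -> float:
    """tokens/s from the eval_count / eval_duration fields of a response."""
    return body["eval_count"] / (body["eval_duration"] / 1e9)

def benchmark(model: str, prompt: str) -> float:
    """Send one generate request and return the measured tokens/s."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return speed_from_metrics(json.load(resp))

def run_all() -> None:
    for model in MODELS:
        print(f"{model}: {benchmark(model, PROMPT):.1f} tokens/s")

# run_all()  # uncomment with a local Ollama server running
```

Because every model sees the identical prompt, the resulting tokens/s numbers are directly comparable.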

Testing Quantization Levels

Compare the same model with different quantizations:

  1. Install multiple versions (e.g., llama3:8b-q4_0, llama3:8b-q8_0)
  2. Test with identical prompts
  3. Evaluate speed vs. quality trade-offs
  4. Choose the best balance for your needs
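Step 3's trade-off is easier to weigh with a concrete number. A small helper, using illustrative (not measured) speeds for hypothetical q4_0 and q8_0 variants:

```python
# Speed-vs-quality comparison for two quantizations of the same model.
# The tokens/s values below are illustrative placeholders, not measurements.
def speedup_percent(fast_tps: float, slow_tps: float) -> float:
    """How much faster the faster quantization is, as a percentage."""
    return (fast_tps / slow_tps - 1) * 100

# e.g. a hypothetical llama3:8b-q4_0 at 48 t/s vs llama3:8b-q8_0 at 30 t/s
print(f"q4_0 is {speedup_percent(48.0, 30.0):.0f}% faster")  # q4_0 is 60% faster
```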

Testing Hardware Configurations

Track speeds before and after:

  • GPU driver updates
  • RAM upgrades
  • Moving from CPU to GPU
  • System optimizations
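Before/after comparisons are easier with a running log. A minimal sketch; the file name and column layout are arbitrary choices:

```python
# Append one speed measurement per line to a CSV so hardware and driver
# changes can be compared over time. File name and columns are arbitrary.
import csv
from datetime import date

def log_speed(model: str, tokens_per_s: float, note: str,
              path: str = "speed_log.csv") -> None:
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(
            [date.today().isoformat(), model, f"{tokens_per_s:.1f}", note])

log_speed("llama3.1:8b", 42.3, "after GPU driver update")
```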

Interpreting Results

What's "Good" Speed?

Quality vs. Speed

Remember that speed isn't everything:

  • Slower large models often produce better results
  • Q4_0 is faster than Q8_0 but may have lower quality
  • Consider your use case when choosing models
  • Sometimes waiting an extra second is worth it for better output

Optimization Tips

Improve Your Token Speed

Hardware Upgrades:

  • Enable GPU acceleration in Ollama
  • Upgrade to a dedicated GPU (NVIDIA recommended)
  • Add more RAM for larger context windows
  • Use NVMe SSD for faster model loading

Software Optimization:

  • Close unnecessary applications during inference
  • Use smaller quantized models when quality allows
  • Keep context windows reasonable (avoid very long chats)
  • Update Ollama to the latest version
  • Update GPU drivers regularly

Model Selection:

  • Use smaller models for simple tasks (7B instead of 70B)
  • Choose appropriate quantization (Q4_0 for speed, Q8_0 for quality)
  • Test different model families (Mistral models are often faster than comparable Llama models)
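One way to verify the GPU tips above took effect: `ollama ps` lists loaded models with a processor column showing GPU vs CPU placement. A sketch; the exact output layout is an assumption about current Ollama releases:

```python
# Check whether loaded models are running on GPU via `ollama ps`.
# The exact column layout of `ollama ps` output is an assumption
# about current Ollama releases.
import subprocess

def on_gpu(ps_row: str) -> bool:
    """True if an `ollama ps` row reports any GPU placement."""
    return "GPU" in ps_row

def check_gpu_usage() -> None:
    out = subprocess.run(["ollama", "ps"], capture_output=True, text=True).stdout
    for row in out.splitlines()[1:]:  # skip the header row
        if row.strip():
            name = row.split()[0]
            status = "GPU" if on_gpu(row) else "CPU only (expect slow speeds)"
            print(f"{name}: {status}")

# check_gpu_usage()  # uncomment with Ollama installed and a model loaded
```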

Troubleshooting


Next Steps