Introduction: Why FastAPI Uvicorn Concurrency Tuning Matters
When I started deploying FastAPI services with Uvicorn, I quickly realized that default settings were fine for demos but terrible for real traffic. Concurrency and worker configuration decide whether your API feels snappy under load or crumbles the moment multiple clients hit it at once.
FastAPI Uvicorn concurrency tuning is all about how many requests your app can handle in parallel, how efficiently it uses CPU cores, and how predictably it behaves under spikes. Done well, you can cut tail latency, keep p95 and p99 response times low, and avoid over-provisioning expensive servers or containers.
In production microservices, I’ve seen small changes to workers, max connections, and event loop settings turn a flaky system into a stable one without touching a single line of business logic. The right tuning helps you:
- Increase throughput without blindly scaling out more instances.
- Reduce latency for both CPU-bound and IO-heavy endpoints.
- Control costs by squeezing more performance out of each node.
- Improve reliability during traffic spikes and background task bursts.
This article walks through practical strategies I use to configure FastAPI and Uvicorn so you can match concurrency to your workload instead of relying on guesswork or defaults.
1. Understand the FastAPI Uvicorn Concurrency Model
Before I start tuning FastAPI Uvicorn concurrency on any project, I make sure I’m crystal clear on what actually runs where: the async event loop, the thread pool, and the worker processes. Without that mental model, it’s easy to misconfigure workers and wonder why performance barely changes.
Under the hood, FastAPI uses Starlette as the ASGI framework and AnyIO as an abstraction over async backends like asyncio. Uvicorn is the ASGI server that:
- Accepts incoming HTTP connections.
- Dispatches each request into the async event loop.
- Hands the request to your FastAPI application for routing and response.
Inside a single Uvicorn process, there is typically:
- One event loop handling many concurrent requests with async I/O.
- A thread pool (managed by AnyIO) for blocking work you wrap with run_in_threadpool, and for FastAPI path operations or dependencies written as plain sync functions.
In my experience, the biggest confusion comes from mixing up concurrency and workers:
- Concurrency is how many requests can be in-flight on the event loop at once (e.g., many DB calls waiting, many responses being prepared).
- Workers are separate processes, each with its own Python interpreter, GIL, event loop, and thread pool.
When you start Uvicorn directly, you usually have one worker process. When you run via Gunicorn with the Uvicorn worker class, you might have multiple worker processes like:
gunicorn app.main:app \
  -k uvicorn.workers.UvicornWorker \
  --workers 4 \
  --bind 0.0.0.0:8000
Here’s how I think about sizing:
- Use more workers to scale across CPU cores and bypass the GIL for CPU-heavy endpoints.
- Rely on the async event loop and good non-blocking I/O to get high concurrency for I/O-bound APIs (DB, HTTP calls, queues).
- Use the thread pool sparingly for legacy blocking code; too many blocking calls here can starve the loop.
To make this concrete, here’s a small FastAPI example mixing async I/O and a blocking operation offloaded to the thread pool:
from fastapi import FastAPI
from fastapi.concurrency import run_in_threadpool
import asyncio
import time

app = FastAPI()

# Stand-in for a real async I/O call (HTTP request, DB query, etc.)
async def some_async_call():
    await asyncio.sleep(0.1)

# I/O-bound example (plays nicely with the event loop)
@app.get("/io-bound")
async def io_bound():
    # Simulate async I/O, e.g., an HTTP call or DB query
    await some_async_call()
    return {"status": "ok"}

# CPU/blocking example offloaded to thread pool
@app.get("/blocking")
async def blocking():
    def heavy_task():
        time.sleep(2)  # blocking operation
        return "done"

    result = await run_in_threadpool(heavy_task)
    return {"result": result}
Once I separated in my head what the event loop, thread pool, and workers each do, it became much easier to reason about why adding more workers helped some endpoints a lot and barely affected others. The rest of the tuning process builds on this foundation: matching worker count and concurrency settings to your actual workload rather than copying someone else’s defaults. (Further reading: Comparing ASGI Servers: Uvicorn, Hypercorn, and Daphne, 2024.)
2. Choose the Right Number of Uvicorn Workers per CPU
When I’m tuning FastAPI Uvicorn concurrency, the first hard number I decide on is how many worker processes to run per CPU core. This choice controls how well you use the machine, how stable latencies are, and how predictable behavior is under load.
A Uvicorn worker is a full Python process with its own GIL, event loop, and memory footprint. More workers can help, but only to a point; beyond that you just add context-switching and memory pressure.
Rules of thumb by workload type
Here are the guidelines I actually use in deployments:
- CPU-bound APIs (heavy data processing, lots of JSON transformations): I usually start with 1 worker per physical core. On a 4-core VM, that means 4 workers.
- I/O-bound APIs (DB-heavy, HTTP calls to other services): I often go up to 2 workers per core, especially if each request spends most of its time waiting on network or disk.
- Mixed workloads (some CPU-heavy endpoints, some I/O heavy): I start around 1–1.5 workers per core and adjust after observing real metrics (see the sizing sketch below).
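To turn those ratios into a first number, I sometimes start from the visible core count. Here’s a minimal sketch, assuming the multipliers above; in containers, substitute the pod’s CPU limit for os.cpu_count():
import os

def initial_workers(workload: str = "mixed") -> int:
    """Rough starting point for --workers, to be validated under load."""
    cores = os.cpu_count() or 1  # in containers, use the pod's CPU limit instead
    multiplier = {"cpu": 1.0, "mixed": 1.5, "io": 2.0}[workload]
    return max(1, round(cores * multiplier))

print(initial_workers("io"))  # e.g., 8 on a 4-core machine
It’s only a starting point: the load tests later in this article are what actually validate the number.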
One thing I learned the hard way was that simply cranking workers to “as many as possible” often hurts performance: CPU gets saturated, context switching explodes, and p95 latency goes sideways even if raw throughput looks higher.
Adapting to containers and memory limits
In containers (Docker, Kubernetes), I also budget workers based on memory per process, not just cores. A FastAPI app with large models or heavy imports can eat a few hundred MB per worker, so 8 workers on a small pod might lead to OOM kills.
In my own setups, I typically:
- Fix CPU requests/limits for the pod (e.g., 2 cores).
- Estimate memory per worker from a small load test.
- Choose workers so that workers × memory-per-worker stays comfortably under the pod’s memory limit.
Practical example command
Here’s what a balanced configuration might look like for a 4-core, I/O-bound microservice using Gunicorn with Uvicorn workers:
# 4 cores, mostly I/O-bound, start with 6–8 workers
gunicorn app.main:app \
  -k uvicorn.workers.UvicornWorker \
  --workers 6 \
  --bind 0.0.0.0:8000
From there, I watch CPU, memory, and latency under a realistic load test. If CPU is low and latency is flat, I might try one or two more workers. If CPU is pegged and latencies spike, I back off. FastAPI Uvicorn concurrency is always a balance: enough workers to keep cores busy and hide I/O waits, but not so many that the OS spends more time juggling processes than serving requests.
3. Prefer Async Endpoints but Use Thread Pools Wisely
In my experience, the single biggest win for FastAPI Uvicorn concurrency is writing endpoints as async whenever the work is I/O-bound. That lets the event loop handle many in-flight requests while each one waits on the database, cache, or external HTTP calls.
When to use async def vs def
- Use async def for anything that can call async libraries (async DB drivers, HTTP clients, queues). This keeps the event loop free and scales concurrency efficiently.
- Use def only when you must call blocking or legacy code that has no async equivalent (old SDKs, CPU-heavy libraries).
One mistake I made early on was mixing blocking operations directly into async endpoints. It “worked” in tests, but under load it quietly killed concurrency because each request blocked the event loop.
Offloading blocking work to the thread pool
FastAPI, via Starlette and AnyIO, provides helpers to push blocking work into a thread pool so the event loop stays responsive. Here’s a pattern I use for integrating legacy blocking code safely:
from fastapi import FastAPI
from fastapi.concurrency import run_in_threadpool
import asyncio
import time

app = FastAPI()

# Stand-in for a real async DB call
async def some_async_db_call():
    await asyncio.sleep(0.1)

# Async, I/O-friendly endpoint
@app.get("/async-io")
async def async_io():
    # imagine an async DB call here
    await some_async_db_call()
    return {"status": "ok"}

# Blocking code isolated in thread pool
@app.get("/blocking-safe")
async def blocking_safe():
    def slow_blocking_task():
        time.sleep(2)  # blocking call
        return "done"

    result = await run_in_threadpool(slow_blocking_task)
    return {"result": result}
Used sparingly, the thread pool is a great escape hatch when migrating from a synchronous stack. But if I see lots of thread-pool work in traces or metrics, that’s my cue to refactor toward true async libraries; otherwise the thread pool becomes the new bottleneck and undermines the concurrency benefits of FastAPI and Uvicorn. The sketch below shows what such a refactor can look like. (Further reading: Concurrency and async / await – FastAPI docs.)
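As a concrete example, here’s a minimal sketch of an endpoint using httpx’s async client instead of a blocking HTTP library (httpx is my choice here for illustration; the URL is a placeholder):
import httpx
from fastapi import FastAPI

app = FastAPI()

@app.get("/downstream")
async def downstream():
    # Non-blocking HTTP call: the event loop stays free while we wait
    async with httpx.AsyncClient() as client:
        resp = await client.get("https://example.com/api")  # placeholder URL
    return {"status_code": resp.status_code}
In a real service I’d reuse a single AsyncClient across requests instead of creating one per call, but the shape of the await is the point here.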
4. Tune Uvicorn Worker Class, Loop, and HTTP Settings
Once I’ve got workers-per-core roughly right, the next layer of FastAPI Uvicorn concurrency tuning is the worker class and low-level HTTP options. These don’t change your code, but they can noticeably shift throughput and tail latency under real traffic.
Pick the right worker class and event loop
In production behind Gunicorn, I almost always use the dedicated Uvicorn worker class:
gunicorn app.main:app \
  -k uvicorn.workers.UvicornWorker \
  --workers 4 \
  --bind 0.0.0.0:8000
This ensures each Gunicorn worker is a full-featured Uvicorn server. On CPython, enabling uvloop (and the httptools HTTP parser; both ship with uvicorn[standard]) often gives a small but free win for I/O-bound services:
uvicorn app.main:app \
  --host 0.0.0.0 --port 8000 \
  --loop uvloop --http httptools
In my benchmarks, uvloop helps most when you have lots of concurrent connections and lightweight handlers; for heavy CPU work the gains are smaller but still worth trying.
HTTP keep-alive and connection limits
HTTP settings decide how connections are reused and how many concurrent clients a single worker can realistically support.
- Keep-alive timeout: Too low and clients reconnect constantly; too high and you waste sockets on idle clients. I usually start around 5–15 seconds depending on traffic patterns.
- Max keep-alive connections: Caps how many simultaneous persistent connections a worker will maintain. This guards against a small number of clients monopolizing resources.
Example with tuned HTTP options:
uvicorn app.main:app \
  --host 0.0.0.0 --port 8000 \
  --workers 4 \
  --loop uvloop --http httptools \
  --limit-concurrency 1000 \
  --timeout-keep-alive 10
One thing I’ve learned is to pair these settings with real load tests that mirror production: watch how many open connections each pod holds, how p95 latency behaves as you approach --limit-concurrency, and whether keep-alive values match upstream expectations (API gateways, load balancers). Small tweaks here can smooth out spikes without touching your FastAPI code at all.
5. Load Test Your FastAPI Uvicorn Concurrency Settings
Every time I’ve skipped load testing my FastAPI Uvicorn concurrency setup, production traffic eventually exposed something embarrassing—usually ugly p95 and p99 latency. A quick, focused benchmark before rollout is the safest way to validate your worker counts and concurrency limits.
Designing a simple, realistic test
I like to start with one or two representative endpoints: one that’s I/O-heavy (database, external HTTP) and one that’s more CPU-intensive. Then I simulate realistic request patterns instead of just hammering a single path at maximum rate.
With tools like Locust or k6, you can ramp up users and see where the system bends. Here’s a minimal k6 script I’d use to probe concurrency behavior:
cat > loadtest.js << 'EOF'
import http from 'k6/http';
import { sleep } from 'k6';
export let options = {
stages: [
{ duration: '30s', target: 50 }, // ramp up
{ duration: '60s', target: 50 }, // steady
{ duration: '30s', target: 0 }, // ramp down
],
};
export default function () {
http.get('http://localhost:8000/health');
sleep(1);
}
EOF
k6 run loadtest.js
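If you prefer Locust, a roughly equivalent minimal test looks like this; it’s a sketch assuming the same /health endpoint:
from locust import HttpUser, task, between

class ApiUser(HttpUser):
    # Simulated users pause 0.5–2 s between requests
    wait_time = between(0.5, 2)

    @task
    def health(self):
        self.client.get("/health")
Run it with locust -f locustfile.py --host http://localhost:8000 and ramp users from the web UI or command-line flags.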
What to watch during the test
During and after the run, I focus on:
- p95/p99 latency: Do they spike sharply at a certain concurrency level?
- CPU and memory per pod/VM: Are workers saturating cores or hitting OOM limits?
- Connection count and errors: Any timeouts, 5xx responses, or connection resets?
One thing I’ve learned is to treat each test as a feedback loop: tweak workers, --limit-concurrency, or keep-alive settings, rerun the same script, and compare metrics. Within a couple of iterations, you usually land on a configuration that’s both fast and predictable for your real-world traffic. (Further reading: Locust – A modern load testing framework.)
6. Align FastAPI Uvicorn Concurrency with Container Resources
In containers, I’ve found that the best FastAPI Uvicorn concurrency settings are always tied to the actual CPU and memory limits of the pod or Docker container, not the underlying host. If I ignore those limits, I either waste capacity or get surprise throttling and OOM kills.
Respect CPU limits when choosing workers
In Kubernetes, I treat the CPU limit as my “effective machine size.” If a pod has cpu: 2, I size workers as if I’m on a 2-core VM:
- Mostly I/O-bound: start with 3–4 workers per pod.
- More CPU-heavy: start with 2 workers per pod.
I’ve run into issues when I set workers based on the node’s cores instead of the pod’s limit—Kubernetes just throttled the container and latencies spiked even though top on the node looked fine.
Budget per-worker memory usage
Each worker is a full Python process, and in my experience memory is often the true constraint. My process looks like this:
- Run a single worker in a staging environment under realistic load.
- Measure its steady-state RSS memory (e.g., via kubectl top pod or a profiler).
- Multiply that by the number of workers I want, then add a safety margin (20–30%).
If that sum gets close to the pod’s memory limit, I reduce workers or split the service into more replicas. It’s better to run more pods with fewer workers each than one overloaded pod that keeps flirting with OOM.
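As a back-of-the-envelope version of that budget (a sketch; every number here is made up and should come from your own measurements):
# Hypothetical figures from a staging load test
pod_memory_limit_mb = 2048   # pod memory limit
per_worker_rss_mb = 350      # measured steady-state RSS of one worker
headroom = 0.75              # keep a 25% safety margin

max_workers = int(pod_memory_limit_mb * headroom // per_worker_rss_mb)
print(max_workers)  # -> 4; more than this risks OOM kills under load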
Combine replicas and workers intentionally
For real deployments, I usually think in terms of total capacity:
- Replicas × workers per replica = total workers for the service.
- Scale horizontally (more pods) when a single pod is near its CPU or memory limit even after tuning workers.
- Adjust workers when a pod has headroom but can’t saturate its limit under load.
One thing I learned over several iterations is that the sweet spot is rarely “max out workers in one big pod.” Instead, aligning workers with container CPU/memory and then letting the orchestrator scale replicas gives you much smoother, more predictable performance as traffic grows.
7. Production Hardening: Timeouts, Backpressure, and Graceful Shutdown
After I’m happy with raw FastAPI Uvicorn concurrency numbers, I shift focus to protecting the service in production. Timeouts, backpressure, and clean shutdowns don’t raise your max RPS, but they stop bad traffic patterns or deployments from taking the whole system down.
Enforce timeouts end-to-end
In my experience, unbounded waits are the fastest way to turn a spike into an outage. I like to set timeouts at three layers:
- Uvicorn worker timeout (for stuck requests under Gunicorn):
gunicorn app.main:app \
  -k uvicorn.workers.UvicornWorker \
  --workers 4 \
  --timeout 30
- Client-side / gateway timeout: your API gateway or ingress should time out before users give up.
- Downstream timeouts inside FastAPI (DB, HTTP calls) so a slow dependency doesn’t hog all your workers.
One rule I use: downstream timeouts are shorter than my external SLA, and gateway timeouts are just above my typical p99.
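For the downstream layer, this is the kind of client-level budget I mean. A minimal sketch with httpx, assuming an external SLA of roughly 2 seconds (the numbers are illustrative):
import httpx

# Tighter than the external SLA so a slow dependency fails fast
DOWNSTREAM_TIMEOUT = httpx.Timeout(1.5, connect=0.5)

async def fetch_downstream(url: str) -> dict:
    async with httpx.AsyncClient(timeout=DOWNSTREAM_TIMEOUT) as client:
        resp = await client.get(url)
        resp.raise_for_status()
        return resp.json()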
Apply backpressure with concurrency and queue limits
Backpressure is about saying “not now” instead of trying to serve more than the app can handle. Uvicorn gives you some basic knobs:
uvicorn app.main:app \
  --workers 4 \
  --limit-concurrency 800 \
  --backlog 2048
- --limit-concurrency caps in-flight requests per worker process.
- --backlog limits how many pending connections sit in the socket queue.
When I first tuned these, I watched metrics carefully: I’d rather return a controlled 503 when saturated than let latency explode for everyone. This makes the service a better neighbor in a microservice environment.
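If you want the same behavior inside the app itself, for example to return a clean JSON 503 instead of a refused connection, here’s a minimal sketch using a semaphore-based middleware (MAX_INFLIGHT is an assumed cap you’d tune from load tests):
import asyncio
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

MAX_INFLIGHT = 100  # assumed cap; tune from load tests
_inflight = asyncio.Semaphore(MAX_INFLIGHT)

@app.middleware("http")
async def shed_load(request: Request, call_next):
    # Reject immediately instead of queueing when saturated
    if _inflight.locked():
        return JSONResponse({"detail": "server busy"}, status_code=503)
    async with _inflight:
        return await call_next(request)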
Implement graceful shutdown hooks in FastAPI
Rolling deploys and autoscaling are painful without graceful shutdown. I’ve seen pods killed mid-request, leaving half-finished operations and confused clients. Now I always use FastAPI’s lifespan handler to manage startup and shutdown cleanly:
from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # startup: init DB pools, clients, etc.
    yield
    # shutdown: close DB pools, flush queues, release resources

app = FastAPI(lifespan=lifespan)
On the platform side, I make sure Kubernetes terminationGracePeriodSeconds is long enough for active requests to finish, and that the load balancer stops sending new traffic to a pod as soon as it gets a SIGTERM. (Further reading: Lifespan Events – FastAPI docs.)
Once these safeguards are in place, I’m far more comfortable pushing changes: even under high concurrency, the service fails fast, sheds excess load, and shuts down predictably instead of collapsing in unpredictable ways.
Conclusion: A Practical Checklist for FastAPI Uvicorn Concurrency
When I’m reviewing a FastAPI service before it goes live, I run through a simple concurrency checklist. It keeps me from over-optimizing one area while forgetting a basic setting somewhere else.
- Workers: Set workers ≈ CPU cores, then adjust for I/O vs CPU-heavy workloads.
- Async usage: Make endpoints async def where possible; isolate blocking work with run_in_threadpool.
- Uvicorn/Gunicorn options: Use uvicorn.workers.UvicornWorker, consider --loop uvloop, and tune keep-alive and --limit-concurrency.
- Container alignment: Size workers to pod/container CPU and memory limits, not the host.
- Load tests: Benchmark with realistic traffic to validate worker counts and concurrency limits.
- Protection: Configure timeouts, backpressure (--limit-concurrency, --backlog), and graceful shutdown hooks.
If you walk through this list for each FastAPI microservice, you’ll end up with a setup that not only hits good throughput numbers, but also behaves predictably when traffic spikes or deployments go sideways.

Hi, I’m Cary Huang — a tech enthusiast based in Canada. I’ve spent years working with complex production systems and open-source software. Through TechBuddies.io, my team and I share practical engineering insights, curate relevant tech news, and recommend useful tools and products to help developers learn and work more effectively.