It took a new developer three to five days to set up their machine before they could write a single line of code. Not to learn the project, just to make the machine usable. That was our reality, and it cost more than we realized.
This post walks through how we solved that problem at Critical Manufacturing, where we build a Manufacturing Execution System (MES) designed to be extended by developers across multiple organizations. Our solution — cloud-based development environments we call DevBoxes — reduced onboarding time to under half an hour, eliminated an entire category of authentication failures, and changed how we think about developer tooling altogether.
The problem space: multi-org, multi-version development
Our MES platform is extensible by design. The developers building on top of it aren't a single, co-located team. They're spread across Critical Manufacturing's internal teams, partner companies, and sometimes the customer's own development teams. Each of these organizations has its own infrastructure, security policies, network configurations, and hardware restrictions.
At the same time, our developers aren't working on a single version of the product. MES has multiple release lines, each with a different technology stack:
- Different .NET versions
- Different TypeScript and Angular versions
- Different database versions
A developer might be supporting a factory running MES 10.2 while actively developing a feature for a customer on MES 11.1. These aren't minor differences; the toolchains are incompatible. You can't just switch branches; you need an entirely different development environment.
So the challenge was never just "multi-org." It was multi-org and multi-version, which together created a combinatorial mess.
The real cost: it's about the 10th percentile
Before we fixed it, onboarding looked like this:
- Machine setup: 3–5 days. Installing required tooling, resolving version conflicts, battling admin permission restrictions. This had nothing to do with learning the project; it was just making the machine usable.
- Authentication setup: ~50% of new developers needed direct support. Multiple repositories, each with its own credential flow. Docker, npm, and NuGet each had to be configured separately, each with its own failure modes.
- Time to first commit: roughly a week. By the time a developer had a working environment and understood the basics, a full week had elapsed.
The support burden was real and continuous, not just at onboarding. Questions like "why won't this compile on my machine?", "which .NET version do I need for this project?", and "my setup broke after switching projects" were common enough to be a running joke.
The underlying cause was heterogeneity. Partner A works on Windows with a corporate proxy. Partner B is on Linux behind a restrictive firewall. Customer C's network uses deep packet inspection. Developer D has no admin rights on their laptop. Each of these edge cases is individually manageable, but collectively they're a support nightmare.
Developer tooling isn't just about the happy path — it's about the 10th percentile worst-case scenario. That's where the support cost actually lives.
We weren't trying to optimize for the developer who had a clean Windows machine on the corporate network. We were designing for everyone else.
The solution: Cloud DevBoxes
We introduced DevBoxes: dedicated, headless cloud VMs that serve as each developer's development machine. The core components are intentionally few:
- VS Code Server — a browser-accessible IDE running on the VM
- Devcontainers — Docker-based environments that isolate tooling per project
- Unified OIDC authentication — single sign-on across all project infrastructure
- Automatic token rotation — credentials stay fresh without developer intervention
How a developer uses a DevBox:
Local VS Code
↓ (SSH tunnel)
DevBox (Remote VM)
↓
VS Code Server
↓
Devcontainer (Docker)
↓
Project Code + Tooling
The local machine runs VS Code and an SSH connection. That's it. No toolchain on the host, no version conflicts, no admin permission requirements. The development environment lives entirely on the DevBox, inside a devcontainer defined by the project.
We deliberately kept the stack minimal. SSH is remarkably resilient in restrictive network conditions; it works through proxies, survives deep packet inspection, and doesn't require special firewall rules. VS Code's remote development extension turns that SSH connection into a full IDE experience. The project can extend its environment through the devcontainer definition without touching the underlying VM.
We chose resilience over features. Every additional host-side requirement was another failure point for developers in restrictive environments. By pushing complexity into the devcontainer and keeping local requirements minimal, we gave developers in even the most constrained environments a reliable path to work.
Version isolation through devcontainers
The multi-version problem is solved cleanly by devcontainers. Each project defines its own container image, specifying exactly the tools it needs:
{
  "name": "MES 10.2 Project",
  "image": "mes-devcontainer:10.2.5",
  "features": {
    "ghcr.io/devcontainers/features/dotnet:2": { "version": "6.0" },
    "ghcr.io/devcontainers/features/node:1": { "version": "18" }
  }
}
A developer switching from an MES 10.2 project to an MES 11.1 project reopens their IDE into a different container. The toolchains are completely isolated — different .NET runtimes, different Node versions, different database clients. There's no version conflict because the environments don't share state.
This also means the entire team works with identical tooling. A new developer joining the project gets the exact same environment as a developer who's been on it for two years. "Works on my machine" stops being a problem.
Teams can further customize their environment via a Dockerfile or devcontainer features, adding project-specific extensions, custom tooling, or environment configuration, without those customizations affecting any other project or team.
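As a rough sketch of what that looks like in practice (the extension IDs and setup script here are illustrative stand-ins, not our actual configuration), a team's devcontainer definition might build from a project-owned Dockerfile and layer on its own editor extensions and setup step:

{
  "name": "MES 11.1 Project",
  // Build from a Dockerfile in the project repository instead of a prebuilt image.
  "build": { "dockerfile": "Dockerfile" },
  "customizations": {
    "vscode": {
      // Extensions are installed inside the container, not on the host.
      "extensions": [
        "ms-dotnettools.csharp",
        "Angular.ng-template"
      ]
    }
  },
  // Hypothetical project script; runs once after the container is created.
  "postCreateCommand": "./scripts/restore-dependencies.sh"
}

Because the Dockerfile and any setup scripts live in the project repository, the environment definition is versioned alongside the code it supports.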
Authentication: unified and automatic
Authentication was a significant pain point before. Developers had to manually configure credentials for each repository — Docker registries, npm feeds, NuGet sources — and about half of them needed direct support to get it right.
We unified all of it under OIDC. A developer logs into the Customer Portal once, and that login cascades automatically to everything else:
Developer logs in ONCE to Customer Portal
↓
Automatic authentication to:
- DevBox (SSH access)
- Git repositories
- Docker registries
- Kubernetes clusters
- Development environments
Token rotation happens automatically when the devcontainer is reopened. If a token is stale — say, a developer hasn't worked in a week — the tooling detects it and redirects to the Customer Portal for a fresh login. The developer never has to think about individual credentials.
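The devcontainer lifecycle is a natural place to hook this in. As an illustrative sketch (the mes-auth command below is a hypothetical placeholder for internal tooling, not a published CLI), a postStartCommand can run the freshness check every time the container starts:

{
  // postStartCommand runs each time the container starts,
  // so the token check happens before the developer touches anything.
  // "mes-auth" is a hypothetical placeholder for internal tooling.
  "postStartCommand": "mes-auth refresh --login-if-expired"
}

Whether the check lives in a lifecycle hook like this or in the command that opens the IDE is an implementation detail; the point is that it runs without the developer having to ask for it.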
Auth failures went from affecting roughly half of new developers to essentially zero.
The developer experience end-to-end
What does this look like from the developer's perspective?
First-time setup (~10 minutes, usually with IT):
- Install Node LTS and Visual Studio Code
- Run 4 terminal commands to install our CLI tools
Per-project setup (~10–20 minutes):
- Run a single command that handles: extension installation, SSH configuration, repository clone, and devcontainer build
Daily work:
- Run one command; it checks and refreshes auth if needed, then opens the IDE
- Code, commit, push; authentication works for all dependencies automatically
- Open a PR when ready
There's no overhead around environment management. The complexity is handled once, at setup, and then it stays out of the way.
Multi-tenancy through isolation
Isolation between organizations is handled at the VM level. Each developer gets their own DevBox. Access control is managed centrally in the Customer Portal. A developer from Partner A and a developer from Partner B can both have access to a shared project through the same infrastructure without any visibility into each other's environments.
The DevBox itself has no pre-installed tooling beyond the bare minimum. All project dependencies live inside the devcontainer. This means that even if two organizations share a DevBox host, the projects are fully isolated at the container level.
Real-world constraints we handle:
- ✓ Deep packet inspection proxies
- ✓ No admin access on the developer's laptop
- ✓ FDA-regulated compliance environments
- ✗ Air-gapped networks (genuinely not supported)
Treating developer tooling like a production system
The architectural decisions above were reasonably straightforward once we understood the problem. The harder shift was operational.
At some point during this work, we recognized that our developer tooling is production-critical. It has a direct and daily impact on hundreds of developers. When the DevBox is down, developers are blocked. When authentication fails, projects stall. When a build agent is unhealthy, CI pipelines stop.
When production incidents occur, the ability to reproduce and investigate them depends on having a working development environment. Slow or unreliable tooling slows down incident response. We weren't just maintaining a convenience tool; we were operating something teams depended on to do their jobs.
Once we accepted that framing, we had to operate accordingly.
Reproducible infrastructure
Manual setup doesn't belong in a production system, and it doesn't belong in developer tooling either. We provision everything through infrastructure as code, with versioned configuration. Every developer gets the same environment, with the same tooling, every time. Onboarding becomes fast and predictable because there's nothing to improvise.
Observability
If developers depend on this infrastructure daily, we need to know what's happening before they feel the impact. We monitor across the full stack:
DevBox:
- CPU and memory usage
- Session duration and SSH connectivity
Project infrastructure:
- Development environment health
- CI/CD build agent telemetry
- Database health
Proactive, not reactive
Visibility lets us act on signals instead of waiting for a support ticket:
- Disk usage above 80% → auto-cleanup and alert
- Build failure → notify team and capture logs
- Auth token approaching expiry → pre-rotate
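What matters is that these rules are written down and versioned rather than living in someone's head. A purely hypothetical encoding (the schema and field names below are invented for this sketch and are not our actual monitoring configuration; only the 80% threshold comes from the rules above) might look like:

{
  "rules": [
    { "signal": "disk_usage_percent", "above": 80, "actions": ["auto-cleanup", "alert"] },
    { "signal": "build_failure", "actions": ["notify-team", "capture-logs"] },
    { "signal": "auth_token_near_expiry", "actions": ["pre-rotate"] }
  ]
}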
Because provisioning is automated, we can patch, rebuild, and roll out changes without manual steps or scheduled maintenance windows.
You can't fix what you can't see.
The shift from reactive ("a developer has a problem, now we fix it") to proactive ("we see a degrading condition, we address it before it becomes a problem") reduced our support burden significantly, and improved the developer experience in ways that are harder to measure — the problems that never happen.
Results
- Onboarding time: 3–5 days → under 30 minutes (10 minutes for initial setup, 10–20 minutes per project for source clone and devcontainer build)
- Authentication failures: ~50% of developers needed support → ~0%
- Time to first commit: ~1 week → same day
- "Works on my machine" issues: eliminated — all machines are the same machine
The less obvious change: support conversations shifted in nature. Instead of "I can't get my environment working," the questions became "how do I extend the devcontainer to add X" or "how does this CI pipeline work." Developers stopped spending mental energy on environment management and started spending it on the actual work.
What we learned
Keep it simple
SSH plus VS Code Server plus Docker covers the core requirement. Every additional component we required from the developer's local machine was a potential failure mode in a restrictive network environment. Reliability beats features.
Authentication is worth the investment
Multi-platform systems with multiple login flows are a significant friction point — enough to block developers entirely. Unified auth with automatic token rotation removed an entire class of support request. If you're building developer infrastructure that spans multiple services, getting auth right from the start pays dividends continuously.
Treat internal tools like products
Developers are users. Their tooling is a product. Downtime and unreliability have real costs: blocked work, delayed projects, lost momentum. Once we accepted this, investing in observability, reproducible infrastructure, and proactive monitoring was an obvious call.
Start with observability
Not after something breaks — from the beginning. The ability to see what's happening in your infrastructure before it becomes a user-facing problem is the difference between proactive and reactive support. We should have done this earlier than we did.
Is this the right approach for you?
This isn't a universal solution. It made sense for us given specific constraints: multiple organizations with different hardware and security policies, multiple product versions with incompatible toolchains, developers working in environments we don't control.
You probably don't need this if:
- You have a single organization with homogeneous hardware
- Your team is small and co-located
- GitHub Codespaces or Gitpod covers your requirements
- You don't want to operate additional infrastructure
It's worth considering if:
- You're dealing with multi-organizational collaboration
- You need strict data isolation between teams
- You have complex authentication requirements across multiple services
- You're supporting multiple product versions simultaneously
- You have little or no control over developers' local machines
The principles — minimal host requirements, isolated environments, unified auth, production-grade operations — transfer broadly even if the specific implementation doesn't.
Solving developer onboarding isn't primarily a tooling problem. It's an operational one. The tools — devcontainers, VS Code Server, SSH, OIDC — are well-understood and widely available. What made the difference was treating the whole system with the same discipline we'd apply to production infrastructure. That's a choice, not a prerequisite.