It took a new developer three to five days to set up their machine before they could write a single line of code. Not to learn the project, just to make the machine usable. That was our reality, and it cost more than we realized.

Cloud Development Environments: Powering Multi-Org Collaboration at Scale

This post walks through how we solved that problem at Critical Manufacturing, where we build a Manufacturing Execution System (MES) designed to be extended by developers across multiple organizations. Our solution — cloud-based development environments we call DevBoxes — reduced onboarding time to under half an hour, eliminated an entire category of authentication failures, and changed how we think about developer tooling altogether.

The problem space: multi-org, multi-version development

Our MES platform is extensible by design. The developers building on top of it aren't a single, co-located team. They're spread across Critical Manufacturing's internal teams, partner companies, and sometimes the customer's own development teams. Each of these organizations has its own infrastructure, security policies, network configurations, and hardware restrictions.

At the same time, our developers aren't working on a single version of the product. MES has multiple release lines, each with a different technology stack.

A developer might be supporting a factory running MES 10.2 while actively developing a feature for a customer on MES 11.1. These aren't minor differences; the toolchains are incompatible. You can't just switch branches; you need an entirely different development environment.

So the challenge was never just "multi-org." It was multi-org and multi-version, which together created a combinatorial mess.

The real cost: it's about the 10th percentile

Before we fixed it, onboarding meant days of installing toolchains, configuring credentials, and troubleshooting machine-specific failures before a developer could do any real work.

The support burden was real and continuous, not just at onboarding. Questions like "why won't this compile on my machine?", "which .NET version do I need for this project?", and "my setup broke after switching projects" were common enough to be a running joke.

The underlying cause was heterogeneity. Partner A works on Windows with a corporate proxy. Partner B is on Linux behind a restrictive firewall. Customer C's network uses deep packet inspection. Developer D has no admin rights on their laptop. Each of these edge cases is individually manageable, but collectively they're a support nightmare.

Developer tooling isn't just about the happy path — it's about the 10th percentile worst-case scenario. That's where the support cost actually lives.

We weren't trying to optimize for the developer who had a clean Windows machine on the corporate network. We were designing for everyone else.

The solution: Cloud DevBoxes

We introduced DevBoxes: dedicated, headless cloud VMs that serve as each developer's development machine. The core components are intentionally few: SSH, VS Code Server, and Docker-based devcontainers.

How a developer uses a DevBox:

Local VS Code
  ↓ (SSH tunnel)
DevBox (Remote VM)
  ↓
VS Code Server
  ↓
Devcontainer (Docker)
  ↓
Project Code + Tooling

The local machine runs VS Code and an SSH connection. That's it. No toolchain on the host, no version conflicts, no admin permission requirements. The development environment lives entirely on the DevBox, inside a devcontainer defined by the project.

We deliberately kept the stack minimal. SSH is remarkably resilient in restrictive network conditions; it works through proxies, survives deep packet inspection, and doesn't require special firewall rules. VS Code's remote development extension turns that SSH connection into a full IDE experience. The project can extend its environment through the devcontainer definition without touching the underlying VM.

We chose resilience over features. Every additional host-side requirement was another failure point for developers in restrictive environments. By pushing complexity into the devcontainer and keeping local requirements minimal, we gave developers in even the most constrained environments a reliable path to work.
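To make the "minimal local footprint" concrete: in practice the local machine needs little more than an SSH config entry, which VS Code's Remote-SSH extension picks up directly. A sketch, with hypothetical host names and proxy details (not our actual configuration):

```
Host devbox
    HostName devbox.example.com
    User dev
    IdentityFile ~/.ssh/devbox_key
    ServerAliveInterval 30
    # Only needed behind an HTTP proxy; nc tunnels SSH through CONNECT:
    # ProxyCommand nc -X connect -x proxy.corp.example:3128 %h %p
```

Even the proxy case is a one-line change on the client, which is exactly why SSH holds up in the restrictive networks described above.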

Version isolation through devcontainers

The multi-version problem is solved cleanly by devcontainers. Each project defines its own container image, specifying exactly the tools it needs:

{
  "name": "MES 10.2 Project",
  "image": "mes-devcontainer:10.2.5",
  "features": {
    "ghcr.io/devcontainers/features/dotnet:2": { "version": "6.0" },
    "ghcr.io/devcontainers/features/node:1": { "version": "18" }
  }
}

A developer switching from an MES 10.2 project to an MES 11.1 project reopens their IDE into a different container. The toolchains are completely isolated — different .NET runtimes, different Node versions, different database clients. There's no version conflict because the environments don't share state.

This also means the entire team works with identical tooling. A new developer joining the project gets the exact same environment as a developer who's been on it for two years. "Works on my machine" stops being a problem.

Teams can further customize their environment via a Dockerfile or devcontainer features, adding project-specific extensions, custom tooling, or environment configuration without those customizations affecting any other project or team.
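As a sketch of that customization (project name and extension choice are illustrative), a project can point its devcontainer at a Dockerfile and declare editor customizations, all versioned alongside the code:

```json
{
  "name": "MES 11.1 Project",
  "build": { "dockerfile": "Dockerfile" },
  "customizations": {
    "vscode": {
      "extensions": ["ms-dotnettools.csharp"]
    }
  }
}
```

Because the Dockerfile lives in the repository, a change to the team's tooling is just a reviewed commit, and everyone picks it up on the next container rebuild.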

Authentication: unified and automatic

Authentication was a significant pain point before. Developers had to manually configure credentials for each service — Docker registries, npm feeds, NuGet sources — and about half of them needed direct support to get it right.

We unified all of it under OIDC. A developer logs into the Customer Portal once, and that login cascades automatically to everything else:

Developer logs in ONCE to Customer Portal
     ↓
Automatic authentication to:
- DevBox (SSH access)
- Git repositories
- Docker registries
- Kubernetes clusters
- Development environments

Token rotation happens automatically when the devcontainer is reopened. If a token is stale — say, a developer hasn't worked in a week — the tooling detects it and redirects to the Customer Portal for a fresh login. The developer never has to think about individual credentials.
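The rotation logic itself is internal, but the staleness check reduces to something like this sketch (the token path, age threshold, and portal URL are assumptions for illustration, not our actual tooling):

```shell
#!/bin/sh
# Sketch: detect a stale token and prompt a fresh Customer Portal login.
TOKEN_FILE="${TOKEN_FILE:-$HOME/.devbox/token}"
MAX_AGE_SECONDS="${MAX_AGE_SECONDS:-604800}"  # one week

token_is_stale() {
  # A missing token counts as stale.
  [ -f "$TOKEN_FILE" ] || return 0
  now=$(date +%s)
  # GNU stat first, BSD stat as a fallback.
  mtime=$(stat -c %Y "$TOKEN_FILE" 2>/dev/null || stat -f %m "$TOKEN_FILE")
  [ $((now - mtime)) -gt "$MAX_AGE_SECONDS" ]
}

if token_is_stale; then
  echo "Token stale: redirecting to Customer Portal for a fresh login"
  # xdg-open "https://portal.example.com/login"  # hypothetical portal URL
fi
```

The point is that the check runs on devcontainer reopen, so the developer only ever sees a browser redirect, never an individual credential prompt.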

Auth failures went from affecting roughly half of new developers to essentially zero.

The developer experience end-to-end

What does this look like from the developer's perspective?

First-time setup (~10 minutes, usually with IT): install VS Code and its remote development extension, log into the Customer Portal, and verify the SSH connection to the assigned DevBox.

Per-project setup (~10–20 minutes): clone the repository and reopen it in its devcontainer; the project's image and tooling are pulled automatically.

Daily work: open VS Code, connect to the DevBox, and start coding.

There's no overhead around environment management. The complexity is handled once, at setup, and then it stays out of the way.

Multi-tenancy through isolation

Isolation between organizations is handled at the VM level. Each developer gets their own DevBox. Access control is managed centrally in the Customer Portal. A developer from Partner A and a developer from Partner B can both have access to a shared project through the same infrastructure without any visibility into each other's environments.

The DevBox itself has no pre-installed tooling beyond the bare minimum. All project dependencies live inside the devcontainer. This means that even if two organizations share a DevBox host, the projects are fully isolated at the container level.

Real-world constraints we handle: corporate proxies, restrictive firewalls, deep packet inspection, and developers without admin rights on their own machines.

Treating developer tooling like a production system

The architectural decisions above were reasonably straightforward once we understood the problem. The harder shift was operational.

At some point during this work, we recognized that our developer tooling is production-critical. It has a direct and daily impact on hundreds of developers. When the DevBox is down, developers are blocked. When authentication fails, projects stall. When a build agent is unhealthy, CI pipelines stop.

When production incidents occur, the ability to reproduce and investigate them depends on having a working development environment. Slow or unreliable tooling slows down incident response. We weren't just maintaining a convenience tool; we were operating something teams depended on to do their jobs.

Once we accepted that framing, we had to operate accordingly.

Reproducible infrastructure

Manual setup doesn't belong in a production system, and it doesn't belong in developer tooling either. We provision everything through infrastructure as code, with versioned configuration. Every developer gets the same environment, with the same tooling, every time. Onboarding becomes fast and predictable because there's nothing to improvise.
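As a hedged sketch of what "nothing to improvise" means (the package list is illustrative; our real definitions are more extensive), a DevBox base image can be described declaratively, for example with cloud-init:

```
#cloud-config
# Illustrative DevBox provisioning: only the bare minimum lives on the host.
packages:
  - docker.io        # devcontainers run here
  - openssh-server   # the only way in
runcmd:
  - [systemctl, enable, --now, docker]
```

Everything project-specific stays out of this definition and inside the devcontainer, which is what keeps the host image small and boring.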

Observability

If developers depend on this infrastructure daily, we need to know what's happening before they feel the impact. We monitor across the full stack: the DevBoxes themselves (health, resource usage, SSH connectivity) and the project infrastructure around them (build agents, CI pipelines, authentication services).

Proactive, not reactive

Visibility lets us act on signals instead of waiting for a support ticket: an unhealthy build agent gets rebuilt before CI stalls, and a degrading DevBox gets attention before the developer notices.

Because provisioning is automated, we can patch, rebuild, and roll out changes without manual steps or scheduled maintenance windows.

You can't fix what you can't see.

The shift from reactive ("a developer has a problem, now we fix it") to proactive ("we see a degrading condition, we address it before it becomes a problem") reduced our support burden significantly, and improved the developer experience in ways that are harder to measure — the problems that never happen.

Results

The headline numbers: onboarding went from three to five days to under half an hour, and authentication failures went from affecting roughly half of new developers to essentially zero.

The less obvious change: support conversations shifted in nature. Instead of "I can't get my environment working," the questions became "how do I extend the devcontainer to add X" or "how does this CI pipeline work." Developers stopped spending mental energy on environment management and started spending it on the actual work.

What we learned

Keep it simple

SSH plus VS Code Server plus Docker covers the core requirement. Every additional component we required from the developer's local machine was a potential failure mode in a restrictive network environment. Reliability beats features.

Authentication is worth the investment

Multi-platform systems with multiple login flows are a significant friction point — enough to block developers entirely. Unified auth with automatic token rotation removed an entire class of support request. If you're building developer infrastructure that spans multiple services, getting auth right from the start pays dividends continuously.

Treat internal tools like products

Developers are users. Their tooling is a product. Downtime and unreliability have real costs: blocked work, delayed projects, lost momentum. Once we accepted this, investing in observability, reproducible infrastructure, and proactive monitoring was an obvious call.

Start with observability

Not after something breaks — from the beginning. The ability to see what's happening in your infrastructure before it becomes a user-facing problem is the difference between proactive and reactive support. We should have done this earlier than we did.

Is this the right approach for you?

This isn't a universal solution. It made sense for us given specific constraints: multiple organizations with different hardware and security policies, multiple product versions with incompatible toolchains, developers working in environments we don't control.

You probably don't need this if you're a single, co-located team working on one product version, on machines and networks you control.

It's worth considering if you support multiple organizations with different security policies, maintain multiple release lines with incompatible toolchains, or have developers working in environments you don't control.

The principles — minimal host requirements, isolated environments, unified auth, production-grade operations — transfer broadly even if the specific implementation doesn't.


Solving developer onboarding isn't primarily a tooling problem. It's an operational one. The tools — devcontainers, VS Code Server, SSH, OIDC — are well-understood and widely available. What made the difference was treating the whole system with the same discipline we'd apply to production infrastructure. That's a choice, not a prerequisite.