The Missing Layer in AI: Systems Engineering

Here’s a pattern I see constantly:

  1. Data scientist builds a model in a notebook. Accuracy looks great.
  2. Team demos it. Everyone claps.
  3. Six months later, it’s still not in production.

Why? Because nobody built the system around it.

The model was never the hard part. The hard part is everything else.

The Demo-to-Production Gap

A Jupyter notebook demo needs a dataset on disk, a trained model, and a cell that prints a good-looking accuracy number.

A production system needs:

- Data ingestion pipeline (where does new data come from?)
- Data validation (is the data even clean?)
- Feature store (how do you serve features at inference time?)
- Model registry (which model version is deployed?)
- Serving infrastructure (how do you handle 10K requests/sec?)
- Monitoring (how do you know when the model is wrong?)
- Rollback strategy (what happens when it breaks?)
- Logging (what happened at 3:47 AM last Tuesday?)

That second list? That’s systems engineering. And most AI teams don’t have anyone who knows how to build it.
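To make one of those boxes concrete, here's a minimal sketch of the data validation step. The record shape and rules are hypothetical (real systems reach for tools like Great Expectations or pandera); the point is that every record gets checked before it can poison training or inference.

```python
from dataclasses import dataclass

@dataclass
class FrameRecord:
    # Hypothetical shape of an incoming image record
    device_id: str
    width: int
    height: int
    data: bytes

def validate(record: FrameRecord) -> list[str]:
    """Return a list of validation errors; an empty list means clean."""
    errors = []
    if not record.device_id:
        errors.append("missing device_id")
    if record.width <= 0 or record.height <= 0:
        errors.append(f"bad dimensions {record.width}x{record.height}")
    # A 3-channel uint8 image must carry exactly w*h*3 bytes
    if len(record.data) != record.width * record.height * 3:
        errors.append("payload size does not match dimensions")
    return errors

good = FrameRecord("cam-01", 2, 2, bytes(12))
bad = FrameRecord("", 0, 2, b"\x00")
```

Records that fail go to a dead-letter queue for inspection, not silently into the training set.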

Real Systems, Real Stack

Let me walk through what a production computer vision pipeline actually looks like. Not the textbook version — the one I’ve built and debugged at 2 AM.

Data Ingestion

Data doesn’t arrive in neat CSV files. It arrives as:

// Edge device sends frames via gRPC stream
func (s *Server) StreamFrames(stream pb.FrameService_StreamFramesServer) error {
    for {
        frame, err := stream.Recv()
        if err == io.EOF {
            return stream.SendAndClose(&pb.StreamResult{
                FramesProcessed: s.count,
            })
        }
        if err != nil {
            return err
        }
        // Push to Kafka for async processing
        // because blocking here means dropping frames
        s.producer.Send(frame.DeviceID, frame.Data)
        s.count++
    }
}

You can’t just pd.read_csv() this. You need message queues, serialization, backpressure handling, and retry logic.
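The retry logic alone is worth showing: a sketch of exponential backoff with jitter around a flaky send. The `send` callable here is a stand-in for a real producer client, not a Kafka API.

```python
import random
import time

def send_with_retry(send, payload, max_attempts=5, base_delay=0.1):
    """Retry a flaky send with exponential backoff plus jitter.

    `send` is any callable that raises on failure — a stand-in for a
    real producer. Jitter prevents a thundering herd of synchronized
    retries when a broker comes back up.
    """
    for attempt in range(max_attempts):
        try:
            return send(payload)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure, don't swallow it
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

Five attempts with a 0.1s base covers a few seconds of broker hiccup; anything longer should page a human instead of retrying forever.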

Model Serving

“Just deploy the model” is a sentence that has never been true.

In production, a single image might go through:

  1. Preprocessing — resize, normalize, pad (NVIDIA DALI on GPU, not OpenCV on CPU)
  2. Detection model — ONNX runtime or TensorRT for speed
  3. Cropping — extract regions of interest
  4. Classification model — another model on the cropped regions
  5. Postprocessing — aggregate results, apply business rules

# Triton ensemble model config (config.pbtxt; input_map/output_map omitted for brevity)
# This is one "inference call" from the client's perspective
# It's actually 4 models and 2 preprocessing steps

ensemble_scheduling {
  step {
    model_name: "dali_preprocess"
    model_version: 1
  }
  step {
    model_name: "detection_trt"
    model_version: 3
    # TensorRT model: <10ms per image
    # The ONNX version was 45ms. Not acceptable.
  }
  step {
    model_name: "crop_and_resize"
    model_version: 1
  }
  step {
    model_name: "classifier_onnx"
    model_version: 2
  }
}

Each step has its own model version, its own resource requirements, and its own failure modes. Triton handles the orchestration. But someone has to design the orchestration.
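Stripped of Triton specifics, the design problem looks like this: each stage is named and versioned, stages run in order, and a failure anywhere tells you exactly which model version died. A sketch in plain Python, with hypothetical lambdas standing in for the actual models (the stage names mirror the config above):

```python
class StageError(Exception):
    """Wraps a failure with the name and version of the stage that caused it."""

def run_ensemble(image, stages):
    """Run versioned stages in order, tagging any failure with its stage.

    `stages` is a list of (name, version, fn) tuples — the same shape as
    the ensemble_scheduling steps, with callables standing in for models.
    """
    x = image
    for name, version, fn in stages:
        try:
            x = fn(x)
        except Exception as exc:
            # Knowing *which* model version failed is half of debugging at 2 AM
            raise StageError(f"{name} v{version} failed") from exc
    return x

# Stand-in pipeline: the callables are toys, the structure is the point
pipeline = [
    ("dali_preprocess", 1, lambda img: img.lower()),
    ("detection_trt",   3, lambda img: [img]),
    ("crop_and_resize", 1, lambda boxes: boxes[:1]),
    ("classifier_onnx", 2, lambda crops: {"label": crops[0]}),
]
```

Triton does this for you in-process on the GPU; the sketch just shows the contract you're designing when you write that config.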

The Worker Architecture

Not everything can be synchronous. For heavy tasks:

# FastAPI endpoint — receives request, publishes to queue
# (kafka_producer/kafka_consumer are aiokafka clients configured elsewhere)
from uuid import uuid4

from fastapi import FastAPI, UploadFile

app = FastAPI()

@app.post("/analyze")
async def analyze(image: UploadFile):
    task_id = str(uuid4())
    await kafka_producer.send("inference-tasks", {
        "task_id": task_id,
        "image_key": await store_image(image),
    })
    return {"task_id": task_id, "status": "processing"}

# Separate worker process — consumes from queue, runs inference
async def inference_worker():
    async for message in kafka_consumer:
        task = message.value
        result = await run_pipeline(task["image_key"])
        await store_result(task["task_id"], result)
        # If this crashes, Kafka retains the message
        # The task gets picked up by another worker
        # No data loss. No silent failures.

This is not over-engineering. This is survival engineering. When your inference server OOMs (and it will), you need the task to retry automatically. When traffic spikes 10x (and it will), you need horizontal scaling. When the GPU box catches fire (okay, hopefully not), you need the system to route around it.

The MLOps Pipeline

Training a model once is a science project. Retraining it continuously is engineering.

┌──────────┐    ┌──────────┐    ┌──────────┐
│ New Data │───▶│ Validate │───▶│  Train   │
└──────────┘    └──────────┘    └──────────┘
                                      │
                                      ▼
┌──────────┐    ┌──────────┐    ┌──────────┐
│  Deploy  │◀───│    QC    │◀───│ Register │
└──────────┘    └──────────┘    └──────────┘

Orchestrated by: Jenkins
Tracked by: ClearML
Stored in: MinIO + Qdrant
Served by: Triton

Every box in this diagram is a service that needs to be built, deployed, monitored, and maintained. The model training step? That’s maybe 20% of the total engineering effort.
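The QC box, for example, is usually a promotion gate: the candidate model replaces the deployed one only if it beats it on a held-out set by a meaningful margin. A minimal sketch — the metric names and threshold here are hypothetical:

```python
def should_promote(candidate_metrics, production_metrics, min_gain=0.005):
    """Promote only if the candidate beats production on every tracked
    metric by at least `min_gain`; otherwise keep serving the old model.

    Metrics dicts map metric name -> score, higher is better. The margin
    exists because retraining noise alone can move a metric by a fraction
    of a point — a tie is not a win.
    """
    for name, prod_score in production_metrics.items():
        if candidate_metrics.get(name, 0.0) < prod_score + min_gain:
            return False
    return True

prod = {"map50": 0.812, "precision": 0.905}
cand_good = {"map50": 0.831, "precision": 0.917}
cand_noise = {"map50": 0.813, "precision": 0.906}  # within noise: reject
```

The gate runs automatically on every retrain; a human only gets involved when it fails repeatedly.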

Why This Matters

The AI industry has a talent shape problem: plenty of people who can train models, far fewer who can build the systems those models have to live inside.

If you’re an ML engineer who can’t explain what Kafka does, you’re going to have a hard time shipping anything. If you’re a systems engineer who doesn’t understand inference optimization, you’re going to build infrastructure that’s 10x more expensive than it needs to be.

The missing layer in AI is not a better model. It’s not more data. It’s not a fancier architecture.

It’s someone who knows how to build the damn system.


AI is becoming systems engineering disguised as machine learning. The sooner we acknowledge that, the sooner we start building things that actually work.