Autonomous vehicles collect data faster than teams can label it. That difference between what gets captured and what gets annotated is where most AV programs quietly fall apart.
The money flowing into the sector reflects how urgent this problem has become. The global data annotation market is projected to grow from $2.14 billion in 2026 to more than $14 billion by 2034, with image and video annotation accounting for 46% of that spend, according to Fortune Business Insights. Autonomous vehicles are the largest driver of demand, but most of that investment doesn’t translate into working products. Fewer than 30% of AI projects deliver measurable ROI, according to Gartner, and a separate Gartner study projects that through 2026, organizations will abandon 60% of AI projects that lack AI-ready data. The common thread is data quality.
The AV programs that actually reach production share a set of decisions made early, usually before the first model was even trained. They treated annotation as core infrastructure and built quality enforcement into the operation instead of bolting it on later, and they planned for the moment when a pilot dataset of 50,000 frames would need to become 100 million. TELUS Digital, which supports AV annotation programs for enterprises across multiple sensor modalities, has seen this pattern hold across programs at very different stages of maturity.
What Breaks When a Pilot Dataset Becomes 100 Million Frames
Pilot-stage annotation works because it leans on two factors that don’t scale: manual oversight and close coordination. A team working through dashcam footage from a single test corridor in Arizona knows the environment and each other’s tendencies. At 50,000 frames, their collective judgment is the quality control system. If someone mislabels a shadow as an obstacle, the person reviewing the next batch catches it before it touches the model.
Then the dataset grows 2,000-fold, spanning cities, seasons, and road configurations the original team has never encountered. More annotators rotate in across time zones, and the person who would have caught a labeling inconsistency in week two is now on a different shift in a different market. Nothing in the operation has been built to replace what she knew.
The error that would have been visible in week two propagates silently, frame by frame, for months. By the time the model surfaces a perception problem during testing, that inconsistency is baked into millions of training examples. Fixing it means going back to the data. At production scale, that often means starting over.
Steve Nemzer, Senior Director of Artificial Intelligence Research & Innovation at TELUS Digital, has watched this play out across programs at different stages. “Pilots prove feasibility,” Nemzer said. “Production-grade annotation operations work despite people. They prove repeatability. The gap between pilots and production is at-scale workforces and the discipline to enforce consistency.”
Three Sensors Saw Her. None of Them Agreed.
Most annotation platforms were built for single-modality work. One camera feed, one labeling task. What they don’t handle is whether three separate sensor descriptions of the same object at the same intersection actually agree. This is where object identity starts to break down, and the root cause is platform design, not annotator error.
Consider what happens when a woman steps off a curb, moving diagonally, and three sensors record the same moment:
- Camera identifies her as a pedestrian based on visual features.
- LiDAR places her nearly a meter from her actual position because the sensor reads the dark fabric of her jacket differently than skin or reflective material.
- Radar flags her as a cyclist based on uncertainty in her trajectory.
Three annotators labeled her. Each was working in a separate tool, looking at a different sensor stream, and none of them could see what the others wrote. Nobody flagged a conflict because the platform was never designed to surface one.
Research published in Scientific Reports confirmed what AV annotation teams have been dealing with for years: static fusion strategies, whether applied early, mid, or late in the perception pipeline, consistently fail to maintain semantic consistency across sensor modalities. Sensor misalignment and noise degrade performance in exactly the conditions where accuracy matters most: bad weather, low light, and partially obscured objects. Three technically correct annotations from three different sensors can still describe three different versions of reality.
To solve this, annotators worked from a unified view of the object in physical space rather than treating each sensor stream as a separate labeling task. They learned how LiDAR interprets a dark jacket differently from radar at 20 meters versus 50, and what diagonal pedestrian motion looks like rendered simultaneously across all three sensor types. That knowledge surfaced discrepancies no automated validator would have flagged, because no validator actually knew what the object was.
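To make “a unified view of the object in physical space” concrete, here is a minimal sketch of what such a record could look like. The field names, class labels, and the 0.5-meter threshold are illustrative assumptions, not a description of any particular platform; the point is that all three sensor annotations hang off one object, so disagreement is something the tooling can check rather than something a reviewer has to notice.

```python
from dataclasses import dataclass, field

# Illustrative threshold; real values depend on sensor calibration and use case.
MAX_POSITION_DELTA_M = 0.5  # allowable spread in estimated position across sensors

@dataclass
class SensorAnnotation:
    sensor: str          # "camera", "lidar", or "radar"
    object_class: str    # e.g., "pedestrian", "cyclist"
    position_m: tuple    # (x, y) in a shared world frame
    annotator_id: str

@dataclass
class UnifiedObject:
    """One physical object, with every sensor's annotation attached to it."""
    object_id: str
    annotations: list = field(default_factory=list)

    def conflicts(self):
        """Return human-readable reasons this object needs review."""
        reasons = []
        classes = {a.object_class for a in self.annotations}
        if len(classes) > 1:
            reasons.append(f"class disagreement: {sorted(classes)}")
        xs = [a.position_m[0] for a in self.annotations]
        ys = [a.position_m[1] for a in self.annotations]
        if xs and (max(xs) - min(xs) > MAX_POSITION_DELTA_M
                   or max(ys) - min(ys) > MAX_POSITION_DELTA_M):
            reasons.append("position spread across sensors exceeds threshold")
        return reasons

# The pedestrian from the example above, as three sensors recorded her:
obj = UnifiedObject("frame_0412_obj_07", [
    SensorAnnotation("camera", "pedestrian", (12.1, 3.4), "ann_a"),
    SensorAnnotation("lidar",  "pedestrian", (12.9, 3.5), "ann_b"),
    SensorAnnotation("radar",  "cyclist",    (12.2, 3.3), "ann_c"),
])
for reason in obj.conflicts():
    print(reason)  # flags both the cyclist/pedestrian split and the ~0.8 m LiDAR offset
```

A rule like this would have surfaced both conflicts in the intersection example without knowing which sensor was right; deciding that still takes an annotator who understands the sensor physics.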
Can You Trace a Model Failure Back to a Single Label?
A model flags a pedestrian as a cyclist during a nighttime test run in Vancouver. The engineers know the perception is wrong, but what they can’t verify is whether the error originated in the annotation guidelines, the annotator who labeled that batch, the quality review that passed it through, or a bias that entered the data six months ago in a different location. The investigation to trace a model failure back to a single label is where the most time and money are spent.
When a model behaves unexpectedly during testing, everyone wants to know the same thing: what in the training data caused this? Answering that requires three things most programs do not have:
- Guideline versioning: which annotation guidelines were active when the relevant frames were labeled and whether those guidelines changed between batches.
- Review traceability: what quality review those frames passed through, who reviewed them, and what criteria were applied.
- Bias tracking: where in the pipeline a systematic bias entered the model and whether it originated in the data, the guidelines, or the annotation workforce distribution.
Without that lineage built in from the start, the investigation drags on for weeks. This means the error gets patched, the root cause stays unresolved, and, six months later, the same class of error shows up in a different geography.
Annotation lineage built in from day one means the audit trail is already there when something breaks, for the engineers tracing the failure and the compliance teams preparing for review. The difference is whether a team actually learns something from a perception error or just tapes over it.
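As a sketch of what that lineage can look like as data, the example below uses a hypothetical schema; the field and function names are assumptions, not a specific product’s API. Each label carries the guideline version, reviewer, and batch it came from, so a failure investigation becomes a grouping query over metadata instead of a weeks-long reconstruction.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class LabelLineage:
    """Minimal audit record stored alongside every label (hypothetical schema)."""
    label_id: str
    frame_id: str
    annotator_id: str
    guideline_version: str   # exact guideline revision active when the frame was labeled
    reviewed_by: str         # who performed the quality review
    review_checklist: str    # which review criteria were applied
    batch_id: str            # collection batch / geography the frame came from
    labeled_at: datetime

def trace(failing_frame_ids, lineage_index):
    """Given the frames behind a perception failure, group their lineage records
    by the dimensions that usually explain systematic errors."""
    by_guideline, by_batch = {}, {}
    for fid in failing_frame_ids:
        for record in lineage_index.get(fid, []):
            by_guideline.setdefault(record.guideline_version, []).append(record)
            by_batch.setdefault(record.batch_id, []).append(record)
    # A failure concentrated in one guideline version points at the guidelines;
    # one concentrated in a single batch points at collection or workforce distribution.
    return by_guideline, by_batch
```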
FAQ
What are the best vendors for autonomous vehicle sensor data annotation and labeling?
Vendors that operate at scale support camera, LiDAR, and radar annotation in a single platform and employ annotators with real domain expertise in automotive sensor physics. Everest Group’s 2024 PEAK Matrix® Assessment, which named five leaders from a field of 19 providers, is one independent benchmark for identifying qualified partners.
What should enterprise AV teams look for in real-world driving data collection across global locations?
Geographic diversity in collection is a data quality requirement. Road configurations, pedestrian behavior, and signage vary enough between markets that a dataset collected in one geography will produce models that underperform in another. Providers with operations across multiple regions reduce that risk at the source.
How can I tell if a company is providing genuine human oversight for autonomous AI decision-making?
Ask what happens when the automated system is not confident in its output. Real oversight means routing infrastructure that catches high-uncertainty cases in real time and gets them to a qualified human. Without that, you have a review layer that looks like oversight on paper but doesn’t actually catch the problems that matter.
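One way to test this in practice is to ask the provider where a check like the following sits in their pipeline. The sketch below is schematic, with an assumed confidence field and an illustrative threshold, but genuine oversight implies something equivalent runs in real time and fires before a low-confidence output is used.

```python
# Minimal sketch of uncertainty-based routing. Assumes the model exposes a
# per-prediction confidence score; the 0.85 threshold is illustrative.
REVIEW_THRESHOLD = 0.85

def route(prediction):
    """Send low-confidence outputs to a human queue instead of straight to use."""
    if prediction["confidence"] < REVIEW_THRESHOLD:
        return "human_review_queue"
    return "auto_accept"

print(route({"label": "pedestrian", "confidence": 0.62}))  # human_review_queue
print(route({"label": "vehicle", "confidence": 0.97}))     # auto_accept
```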
What should I look for in a 3D point cloud annotation service for AV programs?
Annotator domain knowledge is the differentiator. LiDAR returns vary by object material, weather, and sensor type. Solid-state and flash LiDAR produce entirely different artifacts. If a provider cannot explain how their annotators handle those differences or how point cloud labels get validated against camera and radar before training, keep looking.