“After years of searching, there is still no cure for Digital Disposophobia”
What it takes to move a multi-petabyte archive from legacy tape to hybrid object storage—and why planning, hashing, and real-world limitations matter more than any cloud calculator.
Introduction — The Hidden Costs of Data Migration
When people hear you’re migrating 34 petabytes of data, they expect it’ll be expensive—but not that expensive. After all, storage is cheap. Cloud providers quote pennies per gigabyte per month, and object storage vendors often pitch compelling cost-per-terabyte pricing. Tape is still considered low-cost. Object systems are marketed as plug-and-play. And the migration itself? Supposedly, just a big copy job.
In reality? The true cost of a large-scale data migration isn’t in the storage—it’s in the movement.
If you’re managing long-term digital archives at scale, you already know: every file has history, metadata, and risk. Every storage platform has bottlenecks. Every bit has to be accounted for. And every misstep—be it silent corruption, metadata loss, or bad recall logic—can cost you time, money, and trust.
This article outlines the early stages of our ongoing migration of 34 petabytes of tape-based archival data to a new on-premises hybrid object storage system—and the operational, technical, and hidden costs we’re uncovering along the way.
The Day-to-Day Life of the Preservation Environment
Before we examine the scale and complexity of our migration effort, it’s important to understand the operational heartbeat of the current digital preservation environment. This is not a cold archive sitting idle—this is a living, actively maintained preservation system adhering to a rigorous 3-2-1 policy: at least three copies, on two distinct media types, with one copy geographically off-site.
3-2-1 in Practice
Our preservation strategy is based on three concurrent and deliberately separated storage layers:
- Primary Copy (Tape-Based, On-Premises): Housed in our main data center, this is the primary deep archive. It includes Oracle SL8500 robotic libraries using T10000D media and a Quantum i6000 with LTO-9 cartridges, all orchestrated by Versity ScoutAM.
- Secondary Copy (Tape-Based, Alternate Facility): Located in a separate data center, this second copy is maintained on a distinct tape infrastructure. It acts as both a resiliency layer and a compliance requirement, ensuring survivability in case of a catastrophic site failure at the primary location.
- Tertiary Copy (Cloud-Based, AWS us-east-2): Every morning, newly ingested files written to the Versity ScoutAM system are reviewed and queued for replication to Amazon S3 buckets in the us-east-2 region. This process is automated and hash-validated, ensuring the offsite copy is both complete and independently recoverable.
Importantly, this cloud-based copy is contractual in nature—subject to renewal terms, vendor viability, and pricing structures. To uphold the 3-2-1 preservation standard long-term, we treat this copy as replaceable yet essential: if and when the cloud contract expires, the full cloud copy is re-propagated to a new geographically distributed storage location—potentially another cloud region, vendor, or sovereign archive environment. This design ensures that dependency on any single cloud provider is temporary, not foundational.
Daily Lifecycle Operations
Despite the appearance of a “cold archive,” this system is active, transactional, and managed daily. Key operations include:
- New Ingests: Files continue to be written to ScoutFS via controlled data pipelines. These often come from internal digitization projects, external partners, or ongoing digital collections initiatives.
- Fixity Verification: For each new ingest, cryptographic checksums are embedded into the user hash space of ScoutFS to ensure future validation. These hashes are stored at time of write and used for all subsequent checks.
- Replication Pipeline (Cloud Offsite Copy): Once a file is written and verified locally, a daily script locates the current scheduler in the Versity environment and gathers entries from archiver.log to identify directories whose archive jobs ran the previous day. The identified files are queued for replication to AWS S3 in the us-east-2 region, transmitted in their original structure, and, upon successful upload, the cloud-stored copies are validated against the same hash metadata. Any mismatch is flagged for remediation. (A minimal sketch of this flow follows this list.)
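For illustration, the daily pass reduces to something like the sketch below. The log location, log line format, extended-attribute name, and bucket are assumptions standing in for our actual tooling—they are not ScoutAM's real log schema or interface.

```python
#!/usr/bin/env python3
"""Sketch of the daily replication pass. The log location, log line
format, xattr name, and bucket are illustrative assumptions, not the
real ScoutAM schema."""
import hashlib
import os
from datetime import date, timedelta
from pathlib import Path

import boto3

LOG_PATH = Path("/var/log/scoutam/archiver.log")  # assumed location
HASH_XATTR = "user.fixity.sha256"                 # hypothetical attribute name
BUCKET = "preservation-offsite-copy"              # hypothetical bucket
YESTERDAY = (date.today() - timedelta(days=1)).isoformat()

def sha256_of(path: Path) -> str:
    """Stream in 1 MiB chunks so multi-gigabyte files fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def yesterdays_archive_paths():
    """Yield files whose archive jobs ran yesterday, assuming log lines
    shaped like '<ISO-timestamp> ARCHIVE <path>'."""
    for line in LOG_PATH.read_text().splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0].startswith(YESTERDAY) and fields[1] == "ARCHIVE":
            yield Path(fields[2])

def replicate() -> None:
    s3 = boto3.client("s3", region_name="us-east-2")
    for path in yesterdays_archive_paths():
        stored = os.getxattr(path, HASH_XATTR).decode()  # hash recorded at ingest
        if sha256_of(path) != stored:
            print(f"FIXITY MISMATCH, flagged for remediation: {path}")
            continue
        # Keep the original directory structure as the object key, carry
        # the fixity hash as object metadata, and have the SDK verify each
        # uploaded part with an end-to-end SHA-256 checksum.
        s3.upload_file(str(path), BUCKET, str(path).lstrip("/"),
                       ExtraArgs={"ChecksumAlgorithm": "SHA256",
                                  "Metadata": {"sha256": stored}})

if __name__ == "__main__":
    replicate()
```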
A Moving Target
This is the reality we are migrating from—not a static legacy tape pool, but an active, resilient, and highly instrumented preservation environment.
The migration plan outlined in the next section doesn’t replace this environment overnight—it transitions just one of the three preservation copies to a new hybrid object storage model. The second tape copy remains fully operational, continuing to receive daily writes, while cloud replication continues for all eligible content. This overlapping strategy allows us to validate new infrastructure in production without putting preservation guarantees at risk.
Upcoming Migration — From Tape to Hybrid Object Archive
We’re in the early planning stages of a migration project to move 34PB of legacy cold storage to a new on-premises hybrid object archival storage system. “Hybrid” here refers to an architecture that blends both high-capacity disk and modern tape tiers, all behind an S3-compatible interface. This design gives us the best of both worlds: faster recall and metadata access when needed, with cost-effective, long-term retention via tape.
Legacy Environment:
- Oracle SL8500 robotic tape libraries containing the majority of our archive, based on T10000D cartridges
- Approximately 100 LTO-9 tapes also stored within the SL8500 system
- A Quantum i6000 tape library housing another ~500 LTO-9 cartridges
- Managed and orchestrated via Versity ScoutAM, which handles VSN-driven recalls, staging, and archival policy across both libraries
This mixed tape environment presents real-world operational challenges:
- Legacy T10000D drives are slower, with long mount and seek times
- LTO-9 drives are higher performing but operate in a separate mechanical and logical tier
- Drive sharing, recall contention, and concurrent read bandwidth must be carefully managed
To reduce risk and improve data fidelity, we’ve started integrating fixity hash values directly into the user hash space within the ScoutFS file system. This ensures each file can be validated during staging, catching any corruption, truncation, or misread before it’s written to the new system.
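In rough terms, that workflow is: compute a streaming hash at write time, persist it alongside the file, and re-check it whenever the file is staged back. ScoutFS keeps these values in its user hash space; the extended-attribute name in the sketch below is a stand-in for illustration, not the actual ScoutFS interface.

```python
import hashlib
import os
from pathlib import Path

HASH_XATTR = "user.fixity.sha256"  # stand-in name, not the real ScoutFS field

def stream_sha256(path: Path) -> str:
    """Hash in 1 MiB chunks so multi-gigabyte files don't exhaust memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def record_fixity(path: Path) -> None:
    """At ingest: store the hash next to the file for all future checks."""
    os.setxattr(path, HASH_XATTR, stream_sha256(path).encode())

def verify_staged(path: Path) -> bool:
    """At staging: a mismatch means corruption, truncation, or a bad tape
    read, and the file goes to triage instead of the new object tier."""
    stored = os.getxattr(path, HASH_XATTR).decode()
    return stream_sha256(path) == stored
```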
Our migration target includes not just the 34PB of existing tape-based data, but enough capacity to absorb an additional ~4PB of new ingest annually, for at least the first year. The total provisioned capacity in the new system is 40PB—designed to give us a buffer without overextending infrastructure.
The Real Costs in Migration
Migrations of this scale aren’t just about buying space—they’re about managing risk, trust, throughput, future-proofing, and time. It’s not enough to copy data from point A to point B. At any given moment, you’re balancing three active datasets:
- Current production (new data being ingested)
- Data in migration (from legacy tape to staging)
- Data in verification (testing the copied files post-ingest)
Most vendor proposals and cloud calculators overlook the operational cost of running all three states simultaneously. Here’s a breakdown of what truly drives cost and complexity in the real world:
System Cost
The new hybrid on-premises archive system is provisioned to support approximately 40PB, allowing us to:
- Absorb the full 34PB migration dataset
- Accommodate at least 1 year of new ingest, estimated at ~4PB annually
The migration from the legacy tape environment is orchestrated by Versity ScoutAM, which manages a multi-stage pipeline (sketched conceptually after this list):
- Volume serial number (VSN)-driven recalls from both T10000D and LTO-9 cartridges
- Staging of data into disk-based scratch/cache pools
- Controlled archival into the new S3-compatible object storage system
Additional cache storage was provisioned to:
- Support simultaneous ingest and migration staging
- Handle production workloads
- Allow for delayed verification of migrated files before release
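One way to see why the extra cache matters is to model each file as a small state machine. The sketch below is our conceptual model, not ScoutAM's actual internals, and the state names are invented for illustration.

```python
from enum import Enum, auto

class FileState(Enum):
    """Conceptual stages a file passes through during migration."""
    ON_LEGACY_TAPE = auto()  # resident only on T10000D / LTO-9 media
    RECALL_QUEUED = auto()   # grouped by VSN to minimize mounts
    STAGED = auto()          # sitting in the disk scratch/cache pool
    VERIFIED = auto()        # staged copy matches its stored fixity hash
    ARCHIVED = auto()        # committed to the S3-compatible object tier
    FAILED = auto()          # mismatch or read error; routed to triage

# Cache space is only reclaimable once a file reaches ARCHIVED, which is
# why the scratch pools must be sized for every in-flight state at once.
RELEASABLE_STATES = {FileState.ARCHIVED}
```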
Validation Overhead
To ensure bit-level data fidelity, we’ve begun populating user hash space fields in the ScoutFS file system with cryptographic fixity checksums prior to recall.
This approach enables:
- On-the-fly validation of files as they are staged from tape
- Comparison of staged file hashes with original stored hashes to immediately detect:
- File corruption
- Byte truncation
- Mismatches from degraded tape or faulty drives
This strategy significantly reduces:
- Redundant hashing workloads during object ingest
- Silent corruption risks introduced during mechanical tape reads
- Migration delays due to manual file triage or inconsistent validation logic
Hidden Taxes — Time, Energy, and Human Overhead
Some of the most significant costs in a multi-petabyte migration don’t show up on vendor quotes or capacity calculators—they’re buried in the human effort, infrastructure overlap, and round-the-clock support needed to make it all happen.
Here’s what that looks like in practice:
1. Dual-System Overhead
We expect to operate both the legacy and new archival systems in parallel for at least two full years. That means:
- Power, cooling, and maintenance costs for legacy robotics, tape drives, and storage controllers—even as data is actively migrating away
- Infrastructure costs for the new system (rack space, spinning disk, tape robotics, S3 interface endpoints) that must scale up before the old system scales down
- Ongoing monitoring and maintenance across both environments, which includes two independent telemetry stacks, alerting layers, and queue management processes
The dual-stack reality introduces complexity not just in capacity planning, but in operational overhead—particularly when issues affect both sides of the migration simultaneously.
2. Staffing Requirements
To meet our timeline and operational commitments, the migration team is scheduled for:
- 6-day-per-week operations, running 24 hours per day
- Shifts covering:
- Tape handling and media recalls
- Staging and ingest monitoring
- Fixity verification and issue resolution
- Log review, alerting, and dashboard tuning
- Daily oversight of both legacy and new systems
Staff must be able to respond to issues across multiple layers—tape robotics, disk cache performance, object storage health, and software automation pipelines.
3. ScoutAM Operational Load
While Versity ScoutAM serves as the backbone of the migration orchestration, it requires constant operational intervention in a complex legacy environment:
- Frequent manual remediation for ACSLS (Automated Cartridge System Library Software) issues, which affect tape visibility and mount accuracy
- Managing high stage queues, which can stall throughput if not carefully balanced across drives, media pools, and disk cache availability
- Regular validation and tuning of configuration to prevent deadlocks, retries, or starvation scenarios under load
This means that even with automation in place, the system must be actively managed and routinely adjusted to avoid migration stalls.
4. Migration Timeline Pressure
The goal: complete 34PB of migration in 18 to 23 months. That requires:
- Continuous tuning of recall-to-ingest pipelines
- Load balancing across tape drives, scratch pools, and object ingest nodes
- Real-time monitoring of errors, retries, and throughput drops
- Maintaining progress while still supporting current ingest and user requests
Every delay has downstream consequences:
- A failed or slow tape recall can back up staging
- A hash mismatch triggers manual triage
- A missed verification step risks corrupted long-term storage
These aren’t exceptions—they’re expected parts of the workflow. And they require human expertise, resilience, and continuous iteration to manage effectively.
The Vendor Blind Spot: Why Calculators Don’t Work
Storage vendors and cloud platforms love calculators. Plug in how many terabytes you have, pick a redundancy level, maybe add a retrieval rate, and out comes a tidy monthly cost or migration estimate. It all looks scientific—until you actually try to move 34 petabytes of long-term archive data.
The reality is that most calculators are built for static cost modeling, not for complex data movement and verification pipelines that span years, formats, and evolving systems.
Here’s where they fall short:
1. They Don’t Account for Legacy Media Complexity
Calculators assume all your data is neatly stored and instantly accessible. But we’re migrating from:
- T10000D cartridges with long mount and seek times
- LTO-9 cartridges in multiple libraries
- A blend of media types, drive generations, and recall strategies
Vendor models don’t include the cost of slow robotic mounts, incompatible drive pools, or long recall chains. And they certainly don’t account for manual intervention required to babysit legacy systems like ACSLS.
2. They Ignore Fixity Validation Workflows
Most calculators focus on bytes moved, not bytes verified. In our case:
- Every file must be validated against stored checksums in ScoutFS
- Hash mismatches trigger triage workflows
- Post-write verification in the object system must be staged, timed, and tracked
This adds both compute and storage demand to the migration, as data often exists in three states:
- Original tape format
- Staged file on disk
- Verified object in long-term archive
The calculators? They don’t factor in staging costs, hash workloads, or space for verification.
3. They Omit Human Labor
People run migrations—not spreadsheets.
Calculators ignore:
- 24/6 staffing models
- On-call support
- Tape librarians
- Log monitoring teams
- Software maintainers
We’re running two live environments for two years, with full coverage across:
- Legacy tape infrastructure
- Object archive ingest
- Monitoring and verification systems
The people-hours alone are non-trivial operational costs, yet they never appear on vendor estimates.
4. They Assume Ideal Conditions
Calculators assume perfect conditions:
- All tapes readable
- All files intact
- All drives healthy
- No queue contention
- No ingest bottlenecks
That’s not real life. In production:
- Drives fail
- Mounts time out
- Fixity fails
- Scripts stall
- Resources saturate
And every hour lost to those failures is time you can’t get back—or model.
5. They Treat Migration as a Cost, Not a Capability
Most importantly, calculators treat migration as a one-time line item, not as a multi-phase operational capability that must be:
- Designed
- Tuned
- Scaled
- Monitored
- Documented
For us, migration is a platform feature—not a side task. It requires:
- Real-time logging
- Prometheus/Grafana-based alerting
- API-level orchestration
- Hash-aware data flow management
None of this is in the default TCO calculator.
Recommendations for Teams Planning Large Migrations
If you’re planning a multi-petabyte migration—especially from legacy tape to modern hybrid storage—understand that your success depends less on how much storage you buy and more on how well you architect your operational pipeline.
Here are our key takeaways for teams facing similar challenges:
1. Map Your Environment Thoroughly
- Inventory every media type, volume serial number, and drive model
- Understand robotic behaviors and drive sharing limitations
- Track mount latencies, not just theoretical throughput (see the sketch after this list)
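To make that last point concrete, even a few lines of Python can turn timestamped mount events into per-drive latency profiles. The sample values below are invented placeholders; real figures would come from your library and ACSLS logs.

```python
import statistics
from collections import defaultdict

# Hypothetical (drive_model, seconds-to-first-byte) mount events; real
# samples would be parsed from library and ACSLS logs.
events = [
    ("T10000D", 142.0), ("T10000D", 188.5), ("T10000D", 95.3),
    ("LTO-9", 41.2), ("LTO-9", 38.7),
]

latencies = defaultdict(list)
for drive, seconds in events:
    latencies[drive].append(seconds)

for drive, samples in sorted(latencies.items()):
    # Median is more robust than mean when a few mounts hang or retry.
    print(f"{drive}: median mount latency {statistics.median(samples):.0f}s "
          f"across {len(samples)} mounts")
```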
2. Build for Simultaneous Ingest, Recall, and Verification
- Expect to run multiple systems in parallel for months to years
- Provision dedicated staging storage to buffer tape recalls and object ingest
- Treat hash verification as a core architectural feature—not a post-process
3. Treat Hashing as Core Metadata
- Use file system-level hash fields (like ScoutFS user hash space) early
- Don’t rehash if you can avoid it—store once, validate often
- Ensure every copy operation is backed by fixity-aware logic
4. Invest in Open Monitoring and Alerting
- Use tools like Prometheus, Grafana, and custom log collectors
- Instrument every part of the pipeline—from tape mount to hash verification
- Build dashboards and alert rules before your first PB moves (a minimal exporter sketch follows this list)
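As one example of what "instrument everything" can look like, the sketch below exports a few pipeline metrics with the Python prometheus_client library for Prometheus to scrape and Grafana to chart. The metric names are illustrative choices of ours, not a standard.

```python
import time

from prometheus_client import Counter, Gauge, start_http_server

# Illustrative metric names; define and document your own taxonomy early.
BYTES_MIGRATED = Counter(
    "migration_bytes_total",
    "Bytes verified and committed to the object tier")
FIXITY_FAILURES = Counter(
    "migration_fixity_failures_total",
    "Staged files whose hash did not match the hash stored at ingest")
STAGE_QUEUE_DEPTH = Gauge(
    "migration_stage_queue_depth",
    "Files recalled from tape and awaiting verification")

def on_file_verified(size_bytes: int) -> None:
    """Called by the pipeline worker after a staged file passes fixity."""
    BYTES_MIGRATED.inc(size_bytes)
    STAGE_QUEUE_DEPTH.dec()

def on_fixity_mismatch() -> None:
    FIXITY_FAILURES.inc()

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes this endpoint
    while True:
        time.sleep(60)       # keep the exporter alive alongside the worker
```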
5. Automate What You Can, Document What You Can’t
- Script all recall, ingest, and validation tasks
- Maintain a living runbook for exceptions and intervention playbooks
- Expect edge cases. Document them when they happen.
6. Design for Graceful Failure and Retry
- Every file should have a known failure state and retry path
- Don’t let bad tapes, bad hashes, or stalled queues stop the pipeline
- Build small, testable units of work, not monolithic jobs (see the sketch below)
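In practice this can be as simple as wrapping each unit of work with bounded retries and an explicit failure state. A minimal sketch, with illustrative names and placeholder triage handling:

```python
import time

def queue_for_triage(task) -> None:
    """Placeholder: append to a persistent triage queue for human review."""
    print(f"queued for triage: {task!r}")

def run_with_retry(task, attempts: int = 3, backoff_s: int = 30):
    """Run one small unit of work (e.g., recall-and-verify one file).
    On repeated failure, record a known failure state and move on rather
    than stalling the whole pipeline."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as exc:  # real code would catch narrower errors
            print(f"attempt {attempt}/{attempts} failed: {exc}")
            time.sleep(backoff_s * attempt)  # linear backoff between tries
    queue_for_triage(task)
    return None
```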
Conclusion: Migration Is Infrastructure, Not a One-Time Task
Moving 34PB of data isn’t a project—it’s the creation of an ongoing operational platform that defines how preservation happens, how access is retained, and how risk is managed.
For many institutions, the assumption has been that data needs to be migrated from tape every 7 to 10 years, driven by:
- Media obsolescence
- Hardware aging
- And shifting vendor support lifecycles
That rhythm alone is expensive—and it multiplies with every additional tape copy you maintain “just in case.”
But what if the storage platform itself was built for permanence?
What we’re working toward is not just a migration—but a transition to an archival system that inherently supports long-term durability:
- Built-in fault tolerance
- Geographic or media-tier redundancy
- Self-healing mechanisms like checksums and erasure coding
- Verification pipelines that ensure data integrity over decades
If these characteristics are fully realized, it opens the door to reducing the number of physical tape copies required to meet digital preservation standards. Instead of three physical copies to ensure survivability, you may achieve equivalent or better protection with:
- A primary object storage layer
- A cold, fault-tolerant tape tier
- And a hash-validated verification log or metadata registry
It doesn’t eliminate preservation requirements—it modernizes how we meet them.
True digital stewardship means designing systems that migrate themselves, that verify without intervention, and that allow future generations to access and trust the data without redoing all the work.
Preservation is no longer about saving the bits. It’s about building platforms that do it for us—consistently, verifiably, and automatically.
As we look beyond this migration cycle, a compelling evolution of the traditional 3-2-1 preservation strategy is the integration of ultra-resilient, long-lived media for one of the three preservation copies—specifically, Copy 2. By writing this second copy to a century-class storage medium such as DNA-based storage, fused silica glass (e.g., Project Silica), ceramic, or film, we can significantly reduce the operational burden of decadal migrations. These emerging storage formats offer write-once, immutable characteristics with theoretical lifespans of 100 years or more, making them ideal candidates for infrequently accessed preservation tiers.

If successfully adopted, this approach would allow institutions to focus active migration and infrastructure upgrades on only a single dynamic copy, while the long-lived copy serves as a stable anchor across technology generations. It's not a replacement for redundancy—it's an enhancement of durability and sustainability in preservation planning.