How do virtual machines store data?

A virtual machine (VM) is a software application that emulates a physical computer and its components. VMs utilize virtualization technology to abstract the hardware layer and allocate computing resources from a host machine to power one or more guest VMs. An essential aspect of VMs is how they handle data storage and access.

When a VM is created, storage resources are allocated to it in the form of virtual hard disks. These virtual disks emulate the functionality of physical disks but are actually disk image files hosted on some physical storage media. There are a few common virtual disk formats used by major virtualization platforms:

Virtual Hard Disks

  • VHD: The Virtual Hard Disk format used predominantly by Microsoft and Citrix hypervisors. A VHD file holds a complete disk image, including the partition table and file systems, and can be fixed-size or dynamically expanding. Microsoft’s newer VHDX format succeeds it.
  • VMDK: VMware’s Virtual Machine Disk format. Like VHD, a VMDK file stores virtual disk data including partition tables and file systems. It supports sparse (thin) provisioning, where space is allocated only as the guest writes data.
  • QED: The QEMU Enhanced Disk format. Despite the name, QEMU’s native and most widely used format is QCOW2. QED is an older format kept mainly for compatibility with earlier QEMU versions, and it is not as widely supported as VHD or VMDK.
  • VDI: Oracle’s VirtualBox platform utilizes the Virtual Disk Image (VDI) format for its virtual disks. It has options for fixed or dynamic disk images.

In summary, common virtual disk image formats like VHD, VMDK, QED, and VDI act as containers that store the guest’s data just as a physical disk would. The virtualization platform presents the image to the guest VM as an emulated storage drive.
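Because each format is just a file with its own on-disk layout, a tool can often tell the formats apart from a few signature bytes. The sketch below is a minimal illustration, not any hypervisor’s actual detection logic. The function names sniff_format and sniff_file are hypothetical, only a handful of well-known signatures are checked (including QCOW and VHDX, which the list above mentions only in passing), and anything unrecognized is reported as unknown or raw.

```python
# Hypothetical helpers: guess a disk image format from its signature bytes.
# Only a few well-known signatures are checked; this is a sketch, not a
# complete or authoritative detector.

def sniff_format(head: bytes, tail: bytes = b"") -> str:
    """Return a best-guess format name for a disk image.

    head: the first bytes of the file (8 bytes are enough here)
    tail: the last 512 bytes of the file, where a VHD footer normally sits
    """
    if head.startswith(b"QFI\xfb"):
        return "qcow"       # QEMU copy-on-write family (qcow, qcow2)
    if head.startswith(b"KDMV"):
        return "vmdk"       # VMware sparse extent header
    if head.startswith(b"vhdxfile"):
        return "vhdx"       # Microsoft VHDX file identifier
    if head.startswith(b"conectix") or tail.startswith(b"conectix"):
        return "vhd"        # VHD footer cookie; dynamic VHDs keep a copy at offset 0
    return "unknown-or-raw"


def sniff_file(path: str) -> str:
    """Read just enough of a file on disk to call sniff_format()."""
    with open(path, "rb") as f:
        head = f.read(8)
        f.seek(0, 2)                     # move to the end of the file
        size = f.tell()
        f.seek(max(size - 512, 0))       # last 512 bytes (or the whole file)
        tail = f.read(512)
    return sniff_format(head, tail)
```

A raw image has no signature at all, which is why the fallback cannot distinguish "raw" from "unrecognized". Production tools inspect far more of the header than this.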

Virtual Disk Location

The virtual disk image files containing a VM’s data can be hosted in several locations:

  • Local storage on the hypervisor server itself, if the host’s disks have capacity. This typically gives the VM the lowest-latency data access but is limited by the host’s physical storage capacity.
  • A storage area network (SAN) can host disk images on shared storage arrays accessible over the network. This allows more storage to be allocated to VMs independently of the hypervisor server’s own resources.
  • Network shared folders on storage servers such as NAS can also hold virtual disk images, served over standard file sharing protocols such as NFS or SMB. Performance may lag behind SAN or local storage, depending on the network.

In on-premises deployments, local hypervisor storage is commonly combined with shared SAN or NAS arrays to balance VM storage flexibility and performance.

For VMs running in the cloud on AWS, Azure, GCP, or other IaaS platforms, the VM’s disks are stored as virtual disk images on the cloud provider’s backend storage infrastructure. The location and actual storage hardware are abstracted from the user, which in turn allows flexible, software-defined storage allocation.

Some other storage considerations for VMs include:

  • Virtual disk file formats – Storage efficiency features such as thin (dynamic) provisioning depend on the disk image type in use. Where supported, space is allocated only as data is written, which improves utilization.
  • RAID configurations – For local and SAN/NAS storage, the RAID level affects performance and redundancy. Some platforms can also provide software RAID on the host.
  • Caching settings – Memory caching in the hypervisor and storage subsystem improves VM disk performance substantially in most workloads by keeping frequently accessed data in memory.
  • IOPS quotas – Hypervisors allow setting input/output quotas per VM so that I/O bandwidth is shared appropriately, especially on shared storage repositories.
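The thin provisioning idea from the first item can be sketched in a few lines. This is a toy model under stated assumptions, not how any particular image format implements it: the ThinDisk class, its method names, and the 4 KiB block size are all invented for illustration. The disk advertises a large virtual size, but a block only consumes space once something is written to it, and unwritten regions read back as zeros.

```python
BLOCK_SIZE = 4096  # bytes per allocation unit (arbitrary choice for this sketch)


class ThinDisk:
    """Toy thin-provisioned disk: blocks are allocated on first write."""

    def __init__(self, virtual_size: int) -> None:
        self.virtual_size = virtual_size   # size the guest sees
        self.blocks = {}                   # block index -> bytearray of data

    def _check(self, offset: int, length: int) -> None:
        if offset < 0 or length < 0 or offset + length > self.virtual_size:
            raise ValueError("request outside the virtual disk")

    def write(self, offset: int, data: bytes) -> None:
        self._check(offset, len(data))
        pos = 0
        while pos < len(data):
            index, start = divmod(offset + pos, BLOCK_SIZE)
            chunk = data[pos:pos + BLOCK_SIZE - start]
            # Allocate the block only now, the first time it is written.
            block = self.blocks.setdefault(index, bytearray(BLOCK_SIZE))
            block[start:start + len(chunk)] = chunk
            pos += len(chunk)

    def read(self, offset: int, length: int) -> bytes:
        self._check(offset, length)
        out = bytearray()
        while len(out) < length:
            index, start = divmod(offset + len(out), BLOCK_SIZE)
            take = min(BLOCK_SIZE - start, length - len(out))
            block = self.blocks.get(index)
            # Unallocated blocks read back as zeros without using space.
            out += block[start:start + take] if block is not None else bytes(take)
        return bytes(out)

    @property
    def allocated_bytes(self) -> int:
        return len(self.blocks) * BLOCK_SIZE


disk = ThinDisk(virtual_size=10 * 1024**3)   # guest sees a 10 GiB disk
disk.write(4090, b"hello world!")            # 12 bytes straddling two blocks
```

After that single write, the model has allocated two 4 KiB blocks for a disk that the guest sees as 10 GiB. A fixed (thick) disk would instead reserve the full virtual size up front.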

How VMs Access Storage

When a guest VM boots up and starts running its operating system and applications, read/write access calls are made to the virtual disks attached to the VM. The hypervisor handles routing these I/O requests from the guest VM to the appropriate location in the virtual disk image file.

Some typical examples of I/O flow include:

  • Boot files – Initial operating system boot files like Windows bootloader data or Linux initrd/initramfs files get loaded from specific areas of the virtual system disk image mapped by the hypervisor.
  • OS files – As the guest OS starts, it reads executables, libraries and configuration files from the root file system on the virtual disk’s partitions.
  • Application data – As applications execute in the guest VM, any files and databases they access for read/write end up as I/O requests to the appropriate locations in the guest VM’s virtual disk images.
  • New data – Any new files created by users and applications running inside the VM get written by the guest OS into free space allocated in the virtual disk image.

In this manner, everything users and applications create or change is persisted inside the virtual hard disk files that back the VM. The hypervisor intercepts disk I/O coming from the guest VMs and routes it to the appropriate location in the disk image files, which are in turn backed by the underlying physical storage.
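For the simplest case, a flat (raw, fully allocated) image, the routing described above reduces to arithmetic: the guest asks for a sector number, and the backend multiplies it by the sector size to find the byte offset in the image file. The sketch below illustrates only that mapping. The FlatImageBackend class and its method names are hypothetical, a 512-byte sector is assumed, and an in-memory buffer stands in for the image file on host storage. Real formats such as VHD or VMDK add headers and allocation tables on top of this.

```python
import io

SECTOR_SIZE = 512  # bytes per sector (the traditional value, assumed here)


class FlatImageBackend:
    """Toy backend that maps guest sector requests onto a flat image file."""

    def __init__(self, image, num_sectors: int) -> None:
        self.image = image                # any seekable binary file object
        self.num_sectors = num_sectors    # size of the virtual disk in sectors

    def _offset(self, lba: int, count: int) -> int:
        # Reject requests that fall outside the virtual disk.
        if lba < 0 or count < 0 or lba + count > self.num_sectors:
            raise ValueError("request outside the virtual disk")
        return lba * SECTOR_SIZE          # sector number -> byte offset in the file

    def read_sectors(self, lba: int, count: int) -> bytes:
        self.image.seek(self._offset(lba, count))
        return self.image.read(count * SECTOR_SIZE)

    def write_sectors(self, lba: int, data: bytes) -> None:
        if len(data) % SECTOR_SIZE:
            raise ValueError("data must be a whole number of sectors")
        self.image.seek(self._offset(lba, len(data) // SECTOR_SIZE))
        self.image.write(data)


# Stand-in for an image file on host storage: 2048 zeroed sectors (1 MiB).
image = io.BytesIO(bytes(2048 * SECTOR_SIZE))
disk = FlatImageBackend(image, num_sectors=2048)
disk.write_sectors(100, b"A" * SECTOR_SIZE)   # guest writes sector 100
```

The guest never sees byte offsets or file names, only sector numbers on what looks like a disk. Passing a file opened with open(path, "r+b") instead of the in-memory buffer would apply the same mapping to a real raw image.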

If multiple VMs need access to the same data, there are a few approaches that can be taken:

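  • Share virtual disk files – A single virtual disk containing application data can be attached to multiple VMs. This is safe only if the disk is attached read-only or the guests use a cluster-aware file system. Otherwise, concurrent writes can corrupt the data.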
  • Network shared folders – VMs can mount remote storage over the network using protocols like SMB for Windows or NFS for Linux to enable access to common files from multiple VMs.
  • Shared databases – For structured application data, having multiple VMs connect to a shared central database server like SQL Server provides unified persistent storage.

In summary, VMs store all their data, including the OS, software and user files, in virtual disk images that the hypervisor maps onto the underlying physical disks. Changes made by applications inside the VM are persisted to this emulated storage because the hypervisor routes each I/O request to the right place in the image.

Key Takeaway

Virtual machines store their data, such as the OS, software and user files, in virtual hard disk image containers. Hypervisors route the I/O requests coming from VMs to the appropriate location in these VHD, VMDK or similar disk images, which are hosted on the physical storage infrastructure. Tuning this mapping together with the underlying storage setup lets VM data needs be met efficiently.

Conclusion

In conclusion, virtual machines use virtualization technology to emulate physical computer hardware, including storage disks. Virtual hard disk image formats like VHD and VMDK hold all the VM data that needs to persist, from system files to application databases.

The hypervisor routes all read/write access coming from the guest VMs to locations in these virtual disk files, which are backed by physical disks on the host or on shared SAN or NAS storage. Tuning this mapping along with caching, I/O quotas and RAID configurations helps optimize VM storage performance.

So while VMs appear to be self-contained systems, their persistent data storage relies on virtual disk images that the hypervisor maps onto physical storage media. Understanding this architecture makes it easier to design and manage VM storage efficiently.

Frequently Asked Questions

Q: Where are VM disk image files typically stored?
A: Locally on the hypervisor host server, on a shared storage area network (SAN), in network shared folders on NAS filers, or on the cloud provider’s infrastructure for cloud VMs.

Q: What is the difference between VHD and VMDK formats?
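A: VHD is Microsoft’s Virtual Hard Disk format while VMDK is VMware’s. Both serve as containers for guest VM disk data such as partitions and file systems, and both offer fixed-size and dynamically growing variants. They differ mainly in on-disk layout and in which hypervisors support them natively.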

Q: Do VMs running on different hypervisors share physical storage?
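A: They can share the same physical array, but not the same virtual disks. A SAN or NAS can serve several hypervisor hosts at once, yet each hypervisor handles virtual disk mapping and I/O routing only for its own guest VMs, so data does not cross over between platforms directly.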

Q: How is a new file created in my VM physically stored?
A: Any new writes done in the VM get routed by the hypervisor and written into free allocated space in the virtual hard disk image backing that VM.

Q: Can VMs directly access SAN storage?
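A: Not by default. VM storage access is normally virtualized, with the hypervisor interfacing with the physical SAN volumes that host the VM’s disk images. Many platforms do offer exceptions, such as passing a SAN LUN through to a guest as a raw device or running an iSCSI initiator inside the guest.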

Q: Where should VM disk image files be stored for best performance?
A: Local hypervisor server storage or a shared SAN generally offers the fastest data access for VMs, compared with NAS or cloud object storage.

Q: Do all guest VMs need their own disk images?
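A: In general, yes. The virtual hard disk containing the guest OS is tied to an individual VM, even when several VMs run the same workload, which provides isolation. Some platforms let VMs share a read-only base image, with each VM writing its changes to its own separate disk.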

Q: What causes data written in my VM to be lost or corrupted?
A: Issues such as an unexpected VM shutdown before caches are flushed, failures in the underlying physical disks, or corruption of the disk image file can all lead to VM data loss.

Q: How can data be backed up from within VMs?
A: Users can run backup agents inside VMs to take image-based or file-based backups. Host-based backup solutions can back up the disk images directly.

Q: What is a swap file and where is it physically stored?
A: The guest OS uses a swap file to extend virtual memory when RAM fills up. It lives in allocated space on the VM’s virtual system disk, so it ends up inside the disk image like any other file.

Q: Can VMs utilize cloud object storage like S3 as backend data store?
A: Object storage lacks native block-level access required for VM virtual disks. Special solutions that emulate such functionality would be required.

Q: How can VM storage performance be optimized?
A: Prefer fixed-size (preallocated) disks where write latency matters, leverage hypervisor caching, ensure adequate IOPS allocations, and provision space from a high-performance SAN rather than NAS filers.

Q: What makes VMs running in the public cloud different in terms of storage?
A: The backend cloud infrastructure abstracts the actual data center hardware from end users, while flexible, software-defined storage can be allocated on demand.

Q: Can physical host server disks store VM data from multiple hypervisors?
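A: Generally only with coordination. A volume formatted with an ordinary file system should be managed by a single hypervisor at a time, which handles storage allocation and virtual disk mapping for its resident VMs. Sharing one volume among several hosts requires a cluster-aware file system or a shared file service such as NFS.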

Q: If a VM crashes, is its disk image still intact?
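A: Usually, yes. The virtual disk image file persists independently of the VM’s running state, so the VM can be rebooted from it. Writes that were still in a cache at the time of the crash may be lost, so the guest file system may need to run its usual recovery on the next boot.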

Q: Can VMs use vendor specific file formats for better performance?
A: Widely supported formats such as VHD and VMDK are preferred when images need to move between platforms. A hypervisor’s native format may offer extra features or better performance on that platform, at the cost of portability.

Q: Do all writes to VM disk images happen synchronously?
A: No. Writes that the guest requires to be durable, such as explicit flushes, must be confirmed by the storage to preserve crash consistency. Other writes can be cached or committed asynchronously, which boosts performance and reduces the impact of backend latency.

Q: How can the size of VM image files be reduced?
A: Use sparse or thin provisioning, which allocates storage on demand instead of pre-allocating the full capacity up front. This avoids waste and sprawl and can save substantial storage.

Q: What is the logical equivalent of VM disks in physical servers?
A: The system disk in a VM emulates the boot drives in a physical server while any attached data disks emulate additional storage drives mounted internally or externally.
