Storage Concepts In The Data Center

Introduction 

 
We all understand the importance of data storage in our personal lives. Imagine having a phone that could only store five songs at a time. Every time you wanted to download a new song, you’d have to create space by deleting one of your five songs! How inconvenient would that be, and how much time would be wasted?
 
Now imagine the impact that inadequate storage would have on a data center. Bear in mind that roughly 2.5 quintillion bytes of data are produced around the world every single day: the “structured data” of spreadsheets and databases, and the “unstructured data” of our audio files, video files, images, messages, text files, and presentations. It’s been estimated that by 2025 there’ll be over 40 billion devices connected to the internet, generating nearly 80 zettabytes of data. A zettabyte is 1,000,000,000,000,000,000,000 bytes, or one trillion gigabytes, in case you were wondering! We’ll have to wait and see if that estimate comes to pass, but with the amount of data stored around the world doubling every two years, the need for storage that can handle this volume of data can only grow.
 
Demand for storage already outstrips supply. When we consider the fact that data centers exist solely to store and process data, it’s not difficult to see why storage might be viewed as the most important “layer” of a data center’s infrastructure. Data center storage shortages mean application failures and service disruptions. (Viewed at a high level, most modern data centers have three “layers”: a management layer, a virtual layer, and a physical layer. The physical layer is itself made up of three sub-layers: a network layer, a compute layer (i.e. computing resources), and a storage layer.)
 
Many data centers contain devices built for storage only. This is because a server’s own memory and storage – random access memory (RAM) and hard drives – can only accommodate a certain amount of data, depending upon the model. RAM is a kind of “short-term” memory that’s used to hold data that needs to be accessed, read, and written speedily by the central processing unit (CPU). Hard drive capacity is still in the double-digit terabyte range, far lower than the three-figure terabyte range of some storage-only devices (which reaches the petabyte range when the devices are clustered together). In binary terms, one petabyte is 1,024 terabytes or 1,048,576 gigabytes.
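
The unit conversions quoted so far are easy to check for yourself. The short Python sketch below is only an illustration: it reproduces the binary petabyte figures and the decimal zettabyte figure, and then estimates how many drives a petabyte would need. The 20-terabyte drive size and the drives_needed helper are assumptions made up for this example, not figures from the course.

```python
# Decimal (SI) units: each step up is a factor of 1,000.
GB = 10**9
TB = 10**12
PB = 10**15
ZB = 10**21

# Binary units: each step up is a factor of 1,024.
GiB = 2**30
TiB = 2**40
PiB = 2**50


def drives_needed(target_bytes: int, drive_bytes: int) -> int:
    """Return the number of whole drives needed to reach a target capacity."""
    # Ceiling division using integers only, so large values stay exact.
    return -(-target_bytes // drive_bytes)


print(PiB // TiB)  # 1024 terabytes in a petabyte (binary)
print(PiB // GiB)  # 1048576 gigabytes in a petabyte (binary)
print(ZB // GB)    # 1000000000000, i.e. one trillion gigabytes in a zettabyte

# Hypothetical 20 TB hard drive, chosen only to illustrate "double-digit terabytes".
print(drives_needed(PB, 20 * TB))  # 50
```

Note that the result ignores any capacity set aside for redundancy, which is discussed later in this introduction.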
 
These storage-only devices are typically connected to the server or to the network the servers are on. Apart from its larger capacity, another benefit of storage-only hardware is its ability to deliver data to users quickly and efficiently.
 
A data center’s storage must deliver high “availability”. “Availability” refers to the expectation that a storage device is running rather than experiencing downtime. If you lost a flash drive containing all your favorite photos and videos, it would be incredibly inconvenient. Imagine, then, if a data center storage device went offline: the businesses and enterprises that relied on that storage would suffer, both financially and reputationally. Data center storage, therefore, clearly needs to be far more robust and fail-safe than personal storage.
 
The method used to improve the availability of storage is called “redundancy”. Redundancy is the duplication of critical components of a system to provide a back-up solution.
 
It can be implemented in such a way that the system will create a copy of data, saved in another location, that can be accessed if the original location becomes corrupted or breaks down. As you can probably guess, redundant storage is critical for data centers.
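
To see why redundancy helps availability, it can be useful to put rough numbers on it. The Python sketch below is a simplified illustration, not a sizing method: it treats availability as the fraction of time a device is up, and it assumes that redundant copies fail independently of one another, which real hardware sharing power, network, or location often does not. The function names and the 99% starting figure are made up for this example.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of total time that a device was running."""
    return uptime_hours / (uptime_hours + downtime_hours)


def redundant_availability(single: float, copies: int) -> float:
    """Availability of data held on several copies.

    Assumption: copies fail independently, so the data is unavailable
    only when every copy is down at the same time.
    """
    return 1 - (1 - single) ** copies


def downtime_per_year(avail: float) -> float:
    """Expected hours of downtime in a year for a given availability."""
    return (1 - avail) * HOURS_PER_YEAR


single = 0.99  # hypothetical device that is up 99% of the time
for copies in (1, 2, 3):
    combined = redundant_availability(single, copies)
    print(f"{copies} copy/copies: availability {combined:.6f}, "
          f"about {downtime_per_year(combined):.2f} hours down per year")
```

Under these assumptions, each extra copy shrinks the expected downtime by a large factor, which is why redundant storage is treated as essential rather than optional.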
 
You may not be familiar with some of the words that are used in the industry. Here’s an explanation of some of the key ones, with others explained throughout the course.
 
Abstraction (noun) – in a complex system or piece of software, abstraction is focusing on the most relevant details and hiding what can be ignored
 
Array (noun) – data storage made up of multiple storage devices and cache memory (that’s temporary memory for fast data access)
 
Block storage or block-level storage (noun) – data is saved in huge fixed-sized volumes called “blocks”; each block is treated as an individual storage device, has a unique identifier, and has its own file system (file systems will be discussed in general in section 2.4)
 
Deploy (verb) – to install, test, and run hardware or software in a live environment
 
File storage or file-level storage (noun) – data is saved in files and folders in a hierarchical system of directories and sub-directories; to be accessed, the storage drives need to be configured with the Network File System (NFS – discussed in section 2.5) if it’s a Unix or Linux system, or Server Message Block (SMB) if it’s a Microsoft Windows system
 
Logical (adjective) – not physical; describes a resource as it is presented by software rather than as a piece of hardware
 
Mirror (verb) – to make an exact copy of data from one storage device to another storage device in real time, to prevent data loss in the event of a disk failure; also known as “RAID 1” (RAID is defined below)
 
Object (noun) – with vSAN (discussed in sections 2.7 and 6-6.5), an object is a virtual machine disk file (VMDK), a snapshot (a copy of a VMDK taken at a specific point in time), or the virtual machine home folder
 
Object storage or object-based storage (noun) – data is bundled together with its metadata (more information about the data, e.g., date created/modified, size, author) and a unique identifier to form an “object”
 
Policy (noun) – a set of rules about the storage requirements of virtual machines and the applications that they run
 
Provision (verb) – to set up IT resources, make them available, and manage them
 
RAID (noun) – a “redundant array of independent disks” is storage that is made up of multiple separate hard drives; data is spread or duplicated across the different disks using a variety of methods, known as “RAID levels”; mirroring (see above) and striping (see below) are two of these methods
 
Stripe (verb) – to divide a piece of data into equally-sized units which are then spread across multiple storage devices; no copies of the data are made; also known as “RAID 0”
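
The difference between mirroring and striping, as defined in the list above, is easier to see with a small model. The Python sketch below is purely illustrative: it represents each “disk” as an in-memory value and uses made-up function names (stripe, unstripe, mirror). It is not how a RAID controller is implemented; it only shows where the data ends up in each case.

```python
from typing import List


def stripe(data: bytes, num_disks: int, unit_size: int) -> List[List[bytes]]:
    """RAID 0 style: cut data into units and deal them out across the disks in turn."""
    disks: List[List[bytes]] = [[] for _ in range(num_disks)]
    for offset in range(0, len(data), unit_size):
        unit = data[offset:offset + unit_size]  # the final unit may be shorter
        disks[(offset // unit_size) % num_disks].append(unit)
    return disks


def unstripe(disks: List[List[bytes]]) -> bytes:
    """Rebuild the original data by reading one unit from each disk in turn."""
    units = []
    for row in range(max(len(disk) for disk in disks)):
        for disk in disks:
            if row < len(disk):
                units.append(disk[row])
    return b"".join(units)


def mirror(data: bytes, num_disks: int = 2) -> List[bytes]:
    """RAID 1 style: every disk holds a complete copy of the data."""
    return [bytes(data) for _ in range(num_disks)]


data = b"ABCDEFGHIJ"

striped = stripe(data, num_disks=2, unit_size=2)
print(striped)            # [[b'AB', b'EF', b'IJ'], [b'CD', b'GH']]
print(unstripe(striped))  # b'ABCDEFGHIJ'

mirrored = mirror(data)
surviving = mirrored[1:]  # pretend the first disk has failed
print(surviving[0])       # b'ABCDEFGHIJ' - the full data is still readable
```

In the striped layout, no single disk holds the whole file, so losing either disk loses the data; in the mirrored layout, any one surviving disk is enough, at the cost of using twice the capacity.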