As more and more businesses are impacted by data proliferation, data management has become a top priority.
Data proliferation refers to the amount of data, structured and unstructured, that businesses and governments continue to generate at an alarming rate. While this has always included electronic business documents, such as reports and spreadsheets, there is a new wave of image, video, and audio data being generated at an unprecedented rate. Sorting and managing this amount of data has become impossible using traditional, manual methods, and simple elimination of data is dangerous in light of regulatory requirements and potential litigation. The path of least resistance is to add more storage as needed, but this is expensive and only postpones the inevitable while increasing the magnitude of the problem.
Ironically, it is often in the largest and most costly-to-maintain primary storage environments that data is the least usable. Generally speaking, as the size of a storage environment increases, usability and ROI decrease. In impacted environments, data is often lost, difficult to find, or difficult to restore.
Best practices in IT management dictate that primary storage be kept lean, a proposition that is all but impossible without a comprehensive data management program in place. Primary storage should be reserved for the creation of new data and the utilization of active data. Once data has reached a point in its lifecycle at which it is accessed less frequently, it should be moved to a secondary tier of storage.
How old ways fail
Common stand-alone applications, such as word processing systems, financial solutions, and engineering systems, require primary storage with little regard for the proliferation of data as it ages. Even secondary systems, such as document management solutions, email archiving systems, and backup solutions, fail at a basic level to deal with the challenges and requirements of secondary storage. This is because they utilize only a very shallow pool of hard disk or tape storage that has little or no management capability. Most of these solutions fail to accommodate the movement of eligible data to secondary storage, let alone the ability to actually manage data over its lifecycle. Most of them fail to provide services such as data de-duplication, by which duplicate data is purged from storage. They also fail to capture meta-data that would transform data into information. They simply do not eliminate enough management overhead and operational complexity.
Data requirements over the long term
Digital information has a lifecycle, as do other corporate assets. The active portion of information's lifecycle, when it is created and heavily utilized, is very short. In contrast, the length of time over which it must be preserved is often much longer. During this part of its lifecycle, when the information is being stored only for reference, maintained under regulatory requirements, or archived for posterity, it consumes a great deal more IT resources than at any other time. After data leaves primary storage, where it is stored and how it is managed are best determined by its value to the organization. This value, or classification, determines the data's service-level requirements (SLRs), its recovery time objectives (RTOs), and its storage longevity.
Enter the information repository
An information repository is a tier of secondary storage for data that no longer needs to be housed in primary storage. It comprises resources that are less expensive to acquire and utilize and that are better suited to data that is accessed less frequently. When deployed effectively, an information repository provides several advantages. It can keep primary storage clear of non-essential data. It turns 'data' into 'knowledge' and manages it according to SLRs and RTOs. It provides a searchable environment for data in long-term storage. In addition, it provides an alternative to shallow, stand-alone application storage.
What do information repositories do?
Information repositories house and manage information in secondary storage. They also provide a means to manage secondary storage resources themselves. They mitigate problems associated with data proliferation and eliminate the need for separately deployed secondary storage solutions, which were required in the past because of the concurrent deployment of diverse storage technologies running diverse operating systems.
Information repositories are automated. They rely on policy-driven ingest and data service software tools to, respectively, ingest data automatically from primary storage and manage it in secondary storage. During the ingest process, context information (meta-data) about the data is extracted and cataloged.
Within the information repository, content information is also cataloged. The context and content information is then used with service policies to manage data according to time, events, data age, content, and other parameters. It is also used to locate data quickly with search queries.
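The ingest-and-catalog flow described above can be sketched in a few lines of Python. Everything here is illustrative: the `MetaCatalog` class, its field names, and the tags are hypothetical stand-ins for whatever schema a real product would define. The key idea is that each ingested item contributes a meta-data record, and later queries run against those records rather than the stored data itself.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    source: str
    size: int
    checksum: str
    tags: dict = field(default_factory=dict)

class MetaCatalog:
    """Hypothetical meta-data catalog populated during ingest."""

    def __init__(self):
        self.entries = []

    def ingest(self, name, source, payload: bytes, **tags):
        # Extract context information (size, checksum, caller-supplied
        # tags) as the data enters the repository, and catalog it.
        entry = CatalogEntry(
            name=name,
            source=source,
            size=len(payload),
            checksum=hashlib.sha256(payload).hexdigest(),
            tags=tags,
        )
        self.entries.append(entry)
        return entry

    def query(self, **criteria):
        # Queries match against the catalog, not the stored data itself.
        return [e for e in self.entries
                if all(e.tags.get(k) == v for k, v in criteria.items())]

catalog = MetaCatalog()
catalog.ingest("q1.xlsx", "la-acct-01", b"...", dept="accounting", age_days=12)
catalog.ingest("cam7.mp4", "la-sec-02", b"...", dept="security", age_days=200)
print([e.name for e in catalog.query(dept="accounting")])  # ['q1.xlsx']
```

Because the catalog is small and structured, policies and searches can be evaluated without touching the (possibly offline) storage media that hold the data.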
Deploying an information repository
A simple information repository can be composed of a single secondary storage resource. Complex information repositories can include several storage resources of diverse technologies.
Storage resources within the repository, or 'Vaults' as they are commonly referred to, usually consist of a storage resource and a host computer. The host provides database resources for meta-data catalogues, file lookup, and content information. It also makes the storage resource available on the network. Multiple vaults can be combined in a grid-like architecture, where each operates autonomously and can support others as part of a group. Vaults can be virtualized as a composite storage set. This diminishes the complexity of a configuration utilizing multiple different storage resources by encapsulating them within the confines of the information repository. Adding centralized management from a console that may be operated from anywhere enables the information repository to be managed as a federated environment.
The utilization of storage pools provides a great deal of flexibility to the information repository. Storage pools are a grouping of several units of media that can be identified and used for a single purpose. Media in a storage pool may be the same or of different types, and a pool may span multiple vaults. Using storage pools as defined destination criteria better ensures that a data manipulation (or service) policy will succeed in sending data. If one or more units of media are full, or even if a vault is offline, the policy will still succeed as long as there is a unit of media within the pool on a vault that is operating normally.
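The failover behavior of storage pools can be sketched as a destination-selection routine. The `Vault` and `Media` structures and the `pick_destination` function are hypothetical, but they illustrate the rule stated above: a policy succeeds as long as any unit of media in the pool sits on an online vault and has room.

```python
from dataclasses import dataclass, field

@dataclass
class Media:
    label: str
    capacity: int
    used: int = 0

    @property
    def free(self):
        return self.capacity - self.used

@dataclass
class Vault:
    name: str
    online: bool
    media: list = field(default_factory=list)

def pick_destination(pool_vaults, size_needed):
    """Return (vault, media) with room, skipping offline vaults and full media."""
    for vault in pool_vaults:
        if not vault.online:
            continue  # entire vault is down; try the next one in the pool
        for m in vault.media:
            if m.free >= size_needed:
                return vault, m
    raise RuntimeError("policy failed: no usable media in pool")

# One vault offline, one media unit full; the policy still succeeds.
pool = [
    Vault("la-disk-1", online=False, media=[Media("D001", 100)]),
    Vault("la-disk-2", online=True,  media=[Media("D002", 100, used=100),
                                            Media("D003", 100, used=40)]),
]
vault, media = pick_destination(pool, size_needed=30)
print(vault.name, media.label)  # la-disk-2 D003
```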
An information repository is easily deployable and easily scalable. Furthermore, it is self-contained and supports resource management to add, maintain, recycle, and terminate media as well as track off-line media.
Ingesting data into the information repository
Data can be ingested into the information repository via applications and policy driven utilities. Backup applications can serve a dual purpose of protecting data in primary storage and ingesting it into a repository for secondary storage. Data mover utilities and copy utilities may also be deployed to send data to the information repository. Other methods of getting eligible data into the information repository include file virtualization, email, and message archiving. Data producing applications, such as video surveillance, pre/post production solutions, engineering solutions, and others are also very good candidates for integration with an information repository.
Ingest policies should be able to understand and utilize information about the data's source, type, age, permission structure, and other parameters. This enables policies to be focused on certain eligible data types that can be identified and then processed into the information repository. Whenever meta-data can be extracted as the data is ingested, and catalogued in a database, data management policies and search queries can be far more efficient and comprehensive. This is because the process of searching and targeting eligible data is conducted against the meta-databases, not the actual stored data itself. Searching the meta-databases is orders of magnitude faster than trying to sift through actual storage space to locate data specific to a policy or query. It also allows data that resides offline to be searched as quickly as data that is online.
Managing data in the information repository
When an information repository is configured using multiple types of storage technologies, it should be utilized as a 'tiered' storage environment. The layers within the environment should be structured from high tiers utilizing fast access technologies emphasizing high performance, to lower tiers utilizing slower access technology emphasizing economical, long-term storage.
Once information is in the information repository, automated data service policies should be designed and deployed to manage the stored information. The data service policies utilize data mover, replication, and purge utilities to manipulate the stored information. They should be designed to consider details such as the information's source, type, context, content, and ownership. This enables the policies to be focused to work on a narrow selection of files. Policies should be rounded out with destination criteria if they are designed to move or copy files.
Migration (move), replication (copy), and purge policies should be designed around an organization's data classification needs. When designed effectively, they should ensure that data is always housed on the media that best meets the information's SLRs and RTOs. This is done by determining what type (or tier) of storage will best accommodate the information's recovery performance needs and storage longevity requirements balanced against its value, classification, and utilization rate.
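A tier-selection rule of the kind just described can be sketched as a small function. The thresholds and tier names below are assumptions chosen for illustration, not fixed industry values; the point is that tight RTOs and active data map to fast tiers, while long retention with relaxed RTOs maps to economical tiers.

```python
def choose_tier(rto_hours, retention_years, inactive_days):
    """Map a data classification's requirements to an (illustrative) storage tier."""
    if rto_hours <= 1 and inactive_days < 90:
        return "tier1-disk"   # fast recovery for active data
    if retention_years >= 7:
        return "tier3-tape"   # economical long-term storage
    return "tier2-disk"       # reference data with moderate RTO

print(choose_tier(rto_hours=0.5, retention_years=1, inactive_days=10))   # tier1-disk
print(choose_tier(rto_hours=24,  retention_years=10, inactive_days=200)) # tier3-tape
```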
Purge policies should be equipped with an authorization trigger that may be manually initiated before they will actually delete data. This will mitigate inadvertent file deletion.
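The authorization trigger can be sketched as a two-phase purge: matching files are only staged, and nothing is deleted until an operator explicitly authorizes the run. The `PurgePolicy` class and its interface are hypothetical.

```python
class PurgePolicy:
    """Illustrative purge policy gated by a manual authorization trigger."""

    def __init__(self, max_inactive_days):
        self.max_inactive_days = max_inactive_days
        self.staged = []

    def evaluate(self, files):
        # files: list of (name, inactive_days). Stage matches; delete nothing.
        self.staged = [name for name, age in files
                       if age >= self.max_inactive_days]
        return self.staged

    def execute(self, authorized=False):
        # The actual deletion runs only after explicit authorization,
        # mitigating inadvertent file deletion.
        if not authorized:
            raise PermissionError("purge staged but not authorized")
        purged, self.staged = self.staged, []
        return purged

policy = PurgePolicy(max_inactive_days=180)
policy.evaluate([("q1.xlsx", 200), ("q2.xlsx", 30)])
print(policy.execute(authorized=True))  # ['q1.xlsx']
```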
If storage space is at a premium, a de-duplication utility should be deployed to eliminate duplicate data. However, careful consideration should be paid to the design of this technology. Since an information repository contains data that is no longer eligible to reside in primary storage, it is advisable to store more than a single copy of a file within the repository as a fail-safe. A de-duplication utility is therefore most desirable when it is flexible enough to allow duplication within specified storage tiers, pools, vaults, and media while disallowing it in others, adapting to a particular environment's needs.
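A minimal sketch of such tier-aware de-duplication follows. Duplicates are detected by content hash and purged only within tiers where the (hypothetical) policy disallows them, so a fail-safe copy can still be kept elsewhere. Tier names are illustrative.

```python
import hashlib

def deduplicate(items, dedup_tiers):
    """items: list of (tier, name, payload). Keep only the first copy of
    each payload within a de-duplicating tier; leave other tiers untouched."""
    seen = set()
    kept = []
    for tier, name, payload in items:
        digest = hashlib.sha256(payload).hexdigest()
        if tier in dedup_tiers and (tier, digest) in seen:
            continue  # duplicate within a de-duplicating tier: purge it
        seen.add((tier, digest))
        kept.append((tier, name))
    return kept

items = [
    ("tier1-disk", "a.doc",  b"same-bytes"),
    ("tier1-disk", "b.doc",  b"same-bytes"),  # duplicate on disk: purged
    ("tier3-tape", "a.doc",  b"same-bytes"),  # duplication allowed on tape
    ("tier3-tape", "a2.doc", b"same-bytes"),  # fail-safe copy retained
]
print(deduplicate(items, dedup_tiers={"tier1-disk"}))
# [('tier1-disk', 'a.doc'), ('tier3-tape', 'a.doc'), ('tier3-tape', 'a2.doc')]
```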
Finding and recovering data
One of the biggest advantages of an information repository is that it refines mountains of unstructured information into manageable containers of organized information. When a search engine is attached to the repository, it provides the ability to locate an individual file and discover sets of data that meet a common set of parameters within a query.
Search queries can be highly refined. Efficient queries can specify a file based on any of its context or content information. They can also restrict the search to a particular storage pool, storage tier, media type, vault, computer of origin, unit of media, and even define time and date criteria.
Data security is preserved by search engines that utilize the information's permission structure. Users never even see files they do not have authorization to access.
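A search that applies both query criteria and the information's permission structure can be sketched as follows. The record layout and the `search` function are illustrative; the behavior to note is that unauthorized files are filtered out before results are returned, so users never see them.

```python
# Hypothetical catalog records carrying a permission structure.
records = [
    {"name": "q1.xlsx",      "pool": "Accounting",   "readers": {"alice", "bob"}},
    {"name": "payroll.xlsx", "pool": "Accounting",   "readers": {"alice"}},
    {"name": "cam7.mp4",     "pool": "Surveillance", "readers": {"carol"}},
]

def search(user, pool=None):
    """Return names matching the query that the user is authorized to read."""
    return [r["name"] for r in records
            if user in r["readers"]
            and (pool is None or r["pool"] == pool)]

print(search("bob", pool="Accounting"))  # ['q1.xlsx']
print(search("alice"))                   # ['q1.xlsx', 'payroll.xlsx']
```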
When information is located, users should be able to recover individual files or multiple files to either their original network computer or any computer to which they have access.
Implementing management policies within the information repository
The following is an example of how policies can be designed and deployed to move data from primary storage into an information repository. In this scenario, a business with sites in Los Angeles and Chicago wants its weekly financial and customer data in its Los Angeles office protected in secondary storage utilizing a disk-based vault every night. They also want the files to be immediately replicated to a vault in their Chicago location for further protection. Furthermore, once the data has not been accessed for 90 days, they want the data to be migrated to different tiers within the information repository and removed from primary storage. Finally, 90 days later, when the data has been inactive for 6 months, they want to purge all stored copies of the files (except for those in long-term archive storage).
To accommodate this environment, an information repository will be equipped with a three-tier hierarchy that includes at least two hard-disk vaults in the two locations and at least one tape-based vault in the Los Angeles location. The following scenario is managed using several ingest and data service policies.
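The policies for this scenario can be written out as a declarative configuration. The field names and policy vocabulary below are assumptions for illustration; actual products define their own schemas. The five entries correspond to the five steps that follow.

```python
# Illustrative, declarative statement of the scenario's policies.
policies = [
    {"name": "nightly-ingest", "type": "ingest",
     "source": "LA accounting computers", "schedule": "daily 01:00",
     "destinations": ["Accounting", "Client Records"]},       # LA disk pools
    {"name": "offsite-replicate", "type": "replicate",
     "after": "nightly-ingest",
     "source_pools": ["Accounting", "Client Records"],
     "destination": "Accounting Failover"},                   # Chicago disk pool
    {"name": "archive-90d", "type": "replicate",
     "when": {"inactive_days": 90},
     "source_pools": ["Accounting", "Client Records"],
     "destination": "Accounting Archive"},                    # LA tape pool
    {"name": "purge-primary-90d", "type": "purge",
     "when": {"inactive_days": 90},
     "target": "primary storage"},
    {"name": "purge-repo-180d", "type": "purge",
     "when": {"inactive_days": 180},
     "target_pools": ["Accounting", "Client Records", "Accounting Failover"]},
]

print([p["name"] for p in policies if p["type"] == "purge"])
# ['purge-primary-90d', 'purge-repo-180d']
```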
Step 1: Store files in-house to hard disk-based storage pools
Several computers in the accounting department in the Los Angeles location have ingest policies that back up important information every morning at 1:00 AM. The accounting data backed up from these computers is sent to a disk-based storage pool named 'Accounting.' The customer data is sent to a disk-based storage pool named 'Client Records.'
At this point, the information exists in its original primary storage and on two storage pools in the information repository.
Step 2: Replicate files offsite to hard disk-based storage pools
Data service policies are deployed that replicate information from the Accounting and Client Records storage pools to a storage pool in Chicago named 'Accounting Failover.' This step ensures that no information is lost in case of a device failure or other unforeseen disastrous event; it takes place after all of the ingest jobs have finished.
At this point, the information exists in its original primary storage and on three storage pools in two separate locations, one being in Los Angeles; the other being in Chicago.
Step 3: Implementing a tape-based storage pool
As the data ages and is accessed less frequently, two policies will manage the next stage of its lifecycle. First, all of the information that has been in the Accounting and Client Records storage pools in Los Angeles that has been inactive for 90 days will be replicated to a tape-based storage pool named 'Accounting Archive' also at the Los Angeles location. Next, a purge policy will target the same data in primary storage and eliminate it.
At this point, the information is in the information repository in the highly accessible disk-based Accounting and Client Records storage pools and the tape-based Accounting Archive storage pool in Los Angeles. It is also on a hard disk storage pool in Chicago, but it is no longer taking up space in primary storage.
Step 4: Move files to a long-term archive
When the data has been inactive for 6 months and is eligible for long-term storage, a system administrator can simply remove the tapes from the Accounting Archive storage pool in Los Angeles as they fill up and store them at a safe, off-site location. When the data tapes are removed, the information repository should denote that they are offline and retain their meta-data and content databases. This enables the information repository to retain the information it needs to conduct searches for data on the offline media, and provides a great deal more flexibility and speed for the overall storage solution.
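Tracking offline media can be sketched as below: taking a tape offline updates its status and physical location, but its catalog entry is retained, so searches still find data on removed media and report where it lives. The structure and function names are illustrative.

```python
# Hypothetical media catalog: meta-data survives removal of the tape.
catalog = {
    "T0012": {"online": True,
              "files": ["fy23-q1.xlsx", "fy23-q2.xlsx"],
              "location": "Accounting Archive, Los Angeles"},
}

def take_offline(tape_id, location):
    # Mark the tape offline and record where it went; the catalog
    # entry (and thus searchability) is retained.
    catalog[tape_id]["online"] = False
    catalog[tape_id]["location"] = location

def find(filename):
    """Search the catalog, including media that is offline."""
    for tape_id, info in catalog.items():
        if filename in info["files"]:
            return tape_id, info["online"], info["location"]
    return None

take_offline("T0012", "off-site vault")
print(find("fy23-q1.xlsx"))  # ('T0012', False, 'off-site vault')
```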
At this point, the information is still readily available from disk-based storage in the information repository in both the Los Angeles and Chicago locations. It is also on Accounting Archive tapes removed from the Los Angeles office and stored in a secure off-site location.
Step 5: Purge unneeded files
Three sets of identical information are no longer required based on the information's declining value over a 6-month period. Therefore, when the original data has been inactive for six months, and after the information has been moved to tape in the Los Angeles site and the tapes have been stored securely off-site, the information will be purged from the Los Angeles and Chicago storage pools.
At this point, the information that we ingested in Step 1 exists only on tapes stored at a secure location off-site. However, because the data tapes' meta-data and content databases are still tracked in the information repository, the information on them can be easily located using the information repository's search and retrieve functionality.
The preceding scenario can easily be augmented to include any number of storage devices, vaults, units of media, and storage pools. Adding specialized storage pools for information that needs to be set aside for regulatory purposes is as easy as installing information repository software on a new device and creating policies to ingest the needed information from primary storage and manage it in secondary storage.
Storage technology and application refreshing
Another issue the information repository resolves is the problem created as old storage technologies are phased out. As new media technologies are implemented, simple migration policies can move data from media technologies being phased out onto the newer media.
It can also eliminate the problem of keeping archived data in-step with current application versions by putting the data through an application refresh. In this scenario, data is located and opened with a current version of the application software that created it. It is then saved to a network storage location that acts as an ingest point into the information repository. This ensures that data remains viable over long archive durations.
By implementing an information repository, businesses can solve the problem of information management and data proliferation in their current storage environments and invest in new storage hardware technologies only as the need arises to scale their information repository. Unlike traditional storage solutions that try to meet the challenges of data management and data proliferation by increasing the amount of available storage, information repositories have the potential to reduce the size of storage environments while simultaneously ensuring that all of the data being preserved is easily found, recovered, and is totally usable.
In short, the information repository is an easily deployable, easily scalable, policy-driven, federated secondary tier of information storage with robust search and recovery functionality. It can comprise multiple networked data storage technologies already in place, running on diverse operating systems. Because of this, it provides configuration flexibility, virtually limitless extensibility, redundancy, and reliable failover. Net benefits include reduced proliferation of lower-priority data in primary and secondary storage, smaller primary and secondary storage needs, better efficiency, lower costs, and more cost-effective management and utilization of corporate information assets.