Vault Storage
Vault storage is a tape-based storage product designed for long-term data storage. It is best used for data that must be retained for reference but is unlikely to be accessed every day. Vault is also the best place to store redundant backups of data that is stored locally. Vault storage can be presented as a desktop share or network drive (SMB), mounted on a server (NFS), or accessed via SFTP or rsync. Access to shares is controlled by user group for SMB shares (via the University's GroupAdmin access group control) or by network address for NFS shares.
Vault storage is a combination of disk and tape capacity and has the advantage of being extremely efficient when storing large amounts of data. The disk capacity is used as a fast cache for quickly accessing data recently written to the Vault, and the tape capacity is used to store older files that have not been accessed recently. By default, data that has not been accessed for seven days is migrated to tape, but this policy can be altered in special cases; contact the RDSM team for more information. Files migrated to tape are still displayed in the relevant directory, and their content is automatically recalled to disk from the relevant tape(s) when the files are accessed. Recalling data from tape can take five or more minutes, depending on the file size and the number of files.
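The seven-day default described above can be approximated from a mounted share. The following is a minimal sketch, assuming the share reports ordinary last-access timestamps and that reading file metadata does not itself trigger a recall; neither assumption is confirmed by this page. The function name `files_not_accessed_since` is hypothetical, and the result is only an estimate of which files may have been migrated to tape, not the Vault's actual migration state.

```python
import os
import time

SECONDS_PER_DAY = 24 * 60 * 60


def files_not_accessed_since(root, days=7, now=None):
    """Return sorted paths under `root` whose last access is older than `days` days.

    Assumption: the file system exposes a meaningful st_atime. The seven-day
    default mirrors the migration policy described in the text.
    """
    now = time.time() if now is None else now
    cutoff = now - days * SECONDS_PER_DAY
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # os.stat reads metadata only; it does not open the file contents.
            if os.stat(path).st_atime < cutoff:
                stale.append(path)
    return sorted(stale)
```

Calling `files_not_accessed_since` on the folder where a share is mounted gives a rough list of files that may need a recall before use, which can help when deciding whether to contact the RDSM team first.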
Due to the way in which tapes store data [1], recalling tape content to disk can be a lengthy process if large numbers of files are each accessed individually. This is because for each file the tape robot needs to:
- Look up the file in the database,
- Locate the specific tape that contains the file (large files may span multiple tapes),
- Load the appropriate tape into the tape drive,
- Move to the position on the tape where the data is stored, and
- Copy the data to disk.
As a result, large numbers of small files can take a long time to retrieve, and the delay is compounded if the files were not created at the same time, because they could be located across multiple tapes. For this reason, it is strongly recommended that users store data in an orderly structure of folders that represent experiments, projects, instruments or people, and that they bundle large numbers of small files together into archives. This can be accomplished with a file compression or archive tool such as ZIP, TAR or SquashFS. A single archive containing a large number of files can be recalled far more quickly and efficiently than the same data stored as many individual files. If there is a need to access a large number of files within a short timeframe, a "bulk recall" can be requested by contacting the RDSM team. Please note, however, that bulk recalls are uncommon and are assessed on a case-by-case basis.
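The bundling recommendation above can be sketched in Python using the standard library's `tarfile` module. This is a minimal illustration, not a prescribed workflow: the function name `bundle_directory` and the choice of gzip compression are assumptions made for the example, and the archive should be written outside the folder being bundled.

```python
import os
import tarfile


def bundle_directory(src_dir, archive_path):
    """Pack everything under `src_dir` into one gzip-compressed tar archive.

    Returns the number of regular files stored in the archive.
    `archive_path` should be located outside `src_dir`.
    """
    base = os.path.basename(os.path.normpath(src_dir))
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname keeps paths inside the archive relative to the folder name.
        archive.add(src_dir, arcname=base)
    # Re-open the archive to confirm what was actually written.
    with tarfile.open(archive_path, "r:gz") as archive:
        return sum(1 for member in archive.getmembers() if member.isfile())
```

One archive per experiment or project folder keeps the orderly structure recommended above while reducing the number of individual files that need to be recalled from tape.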
Vault storage can be made accessible to the Internet (not recommended for critical or sensitive data) using Aspera, which is useful for sharing large amounts of data with external collaborators. There is a significant amount of Vault capacity available and allocations are assigned individual quotas. Vault storage can be requested via the Data Dashboard.
Resources
- Vault Storage User Guide
- Data Dashboard User Guide
- Data storage guidelines
- Guidelines for managing research data
- Electronic Information Security - Information Classification Procedure
- Slide Presentation on Better Use of Vault
Technical Information
:Protocols: SMB, NFS, SFTP, rsync.
:Supported Operating Systems: Windows [2], macOS, Linux.
:Security: Secure access groups (SMB), specific machines (NFS) via hostname or IP, local accounts (SFTP, rsync).
:Security Classification: Public, Restricted.
:Backup Schedule: Daily, 30-day backup retention period.
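As one illustration of the rsync protocol listed above, the sketch below builds and runs an upload command from Python. It assumes rsync is installed locally and that the local account is reached over SSH, which this page does not confirm. The host name `vault.example.org`, the account name and all paths are hypothetical placeholders; the real values come from the allocation details.

```python
import subprocess


def build_rsync_command(source_dir, user, host, dest_dir):
    """Return an rsync argument list that copies the contents of `source_dir`.

    -a preserves the directory structure, timestamps and permissions;
    --partial keeps partially transferred files so a retry can resume.
    """
    # A trailing slash makes rsync copy the folder's contents, not the folder.
    src = source_dir.rstrip("/") + "/"
    return ["rsync", "-a", "--partial", src, f"{user}@{host}:{dest_dir}"]


def upload(source_dir, user, host, dest_dir):
    """Run the transfer; raises CalledProcessError if rsync exits non-zero."""
    subprocess.run(build_rsync_command(source_dir, user, host, dest_dir), check=True)
```

For example, `upload("/data/project_x", "alice", "vault.example.org", "/vault/project_x")` would attempt the copy using those placeholder values. Building the argument list separately makes it easy to inspect the command before anything is transferred.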
Footnotes
[1] Data on tape is stored in "chunks", and each chunk is written to a specific tape at a specific time. If large numbers of files are added to a collection over a long period of time, it is very likely that the files will have been stored across multiple tapes. Because each tape must be loaded into a drive and then wound to the location of the chunk in order to retrieve each file, the fewer chunks there are to read, the more efficiently the data can be recalled. For this reason, bundling individual files into archives speeds up retrieval, because there are fewer, contiguous chunks to locate and recall.
[2] Note that machines running Windows Vista Enterprise or later versions can be configured to mount NFS shares, but doing so is problematic. As a result, NFS is recommended only for macOS and Linux machines. SFTP is not natively supported on Windows, but third-party applications can be used. Tools like Cygwin are required in order to run rsync on Windows machines.