Unified storage and management platform for AI computing network data

The AI computing network must schedule a large number of tasks onto the computing resources of its AI computing centers. Different computing tasks depend on a variety of datasets, and these datasets may be distributed across one or more AI computing centers. Each AI computing center may use a different storage system, and these systems vary widely in storage architecture and external interface. Further problems arise when datasets are migrated repeatedly between clouds: the system becomes more complex for users to operate, and overhead costs are paid in bandwidth, latency, and duplicated storage space. The AI computing network therefore needs a unified storage and management platform for inter-cloud data, one that collaboratively manages the heterogeneous storage media and datasets of the AI computing centers. Such a platform improves the efficiency of resource sharing among participants in the AI computing network ecosystem, gives the network higher scalability, and makes it easier for more AI computing centers and datasets to join the ecosystem in the future.

Functions

Fast access to heterogeneous storage systems: The storage system of each AI computing center may be heterogeneous, using inconsistent data storage solutions; the application interface protocols may be S3, OBS, OSS, MinIO, FTP, or various custom storage system APIs. To manage network data efficiently, the platform must adapt and unify the interfaces of these heterogeneous storage systems into a single internal unified storage management interface.
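One common way to realize such an internal unified interface is an adapter pattern: a single abstract interface that each center's storage protocol is wrapped behind. The sketch below is illustrative only; the class and registry names are hypothetical, and the in-memory backend stands in for real adapters that would wrap S3/OBS/OSS/MinIO/FTP client SDKs.

```python
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Internal unified storage interface every center's adapter implements."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class InMemoryBackend(StorageBackend):
    """Stand-in adapter; a real one would wrap an S3/OBS/OSS/MinIO/FTP client."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


# Registry mapping a center's declared protocol name to an adapter factory.
ADAPTERS = {"memory": InMemoryBackend}


def open_backend(protocol: str) -> StorageBackend:
    """Callers depend only on StorageBackend, never on a concrete protocol."""
    return ADAPTERS[protocol]()
```

Callers above the interface never see which protocol a center actually speaks, which is the point of the unification.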

System high availability: The AI computing centers belong to different computing power service providers and are isolated both geographically and administratively. Access to their heterogeneous storage systems cannot be assumed to be completely reliable or always online. If one AI computing center suddenly takes its storage system offline, the other AI computing centers must be able to continue serving normally.

Sense the dataset: To improve the efficiency of computing tasks, the platform realizes a scheduling strategy that moves computation to where the data is. The AI computing network needs to generate a network-wide unique id for each dataset and to sense where that dataset is stored in each AI computing center, forming a global data storage view. Based on this global view, the AI task scheduling and management platform can query whether the target data already exists in a target AI computing center. When multiple computing centers can satisfy the computing power demand of a task at the same time, the task can be dispatched to a center where the dataset already exists, improving task start-up efficiency.
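The global view described above can be sketched as a mapping from network-wide dataset id to the set of centers holding a copy, with the scheduler preferring a candidate that already has the data. This is a minimal illustration under assumed names (`GlobalDataView`, `pick_center`), not the platform's actual data model.

```python
class GlobalDataView:
    """Global data storage view: dataset id -> centers known to hold a copy."""

    def __init__(self) -> None:
        self.holders: dict[str, set[str]] = {}

    def register(self, dataset_id: str, center: str) -> None:
        """Sense/record that a center stores this dataset."""
        self.holders.setdefault(dataset_id, set()).add(center)

    def pick_center(self, dataset_id: str, candidates: list[str]) -> str:
        """Among centers that satisfy the compute demand, prefer one that
        already holds the dataset (compute follows data); otherwise fall
        back to the first candidate, which would then trigger a migration."""
        local = self.holders.get(dataset_id, set())
        for c in candidates:
            if c in local:
                return c
        return candidates[0]
```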

Verifying dataset integrity: Datasets in the AI domain may contain a large number of folders and files, and AI training tasks depend strongly on their datasets; inconsistencies in a dataset will affect the quality of the AI models generated from it. Quickly verifying dataset integrity is therefore a challenge for the platform.
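One way to make verification fast, consistent with the random block sampling the Data structure section describes later, is to digest fixed-size chunks and spot-check a random subset instead of re-hashing everything. The function names and chunk size below are illustrative assumptions.

```python
import hashlib
import random


def chunk_digests(blob: bytes, chunk_size: int = 4) -> list[str]:
    """SHA-256 digest of every fixed-size chunk of a dataset blob."""
    return [hashlib.sha256(blob[i:i + chunk_size]).hexdigest()
            for i in range(0, len(blob), chunk_size)]


def spot_check(blob: bytes, expected: list[str], samples: int = 2) -> bool:
    """Verify a random sample of chunks rather than the full dataset;
    raise `samples` for stronger guarantees at higher cost."""
    actual = chunk_digests(blob)
    if len(actual) != len(expected):
        return False
    picks = random.sample(range(len(expected)), min(samples, len(expected)))
    return all(actual[i] == expected[i] for i in picks)
```

Sampling trades certainty for speed: a small sample catches gross corruption cheaply, while `samples=len(expected)` degenerates to a full check.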

Support for multiple versions of datasets: Datasets in the field of artificial intelligence hold a variety of annotated data. As society develops and commercial companies invest in development, similar datasets tend to keep accumulating new content. New version numbers are then needed to distinguish and archive the contents of a dataset from different periods.
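A simple way to archive versions is to make each one an immutable manifest of chunk ids, so publishing a new version never disturbs an earlier one. The class and method names here are hypothetical illustrations of that idea.

```python
class VersionedDataset:
    """Each version is an immutable manifest of chunk ids; publishing a new
    version archives new content without touching earlier versions."""

    def __init__(self, dataset_id: str) -> None:
        self.dataset_id = dataset_id
        self.versions: dict[str, list[str]] = {}

    def publish(self, version: str, chunk_ids: list[str]) -> None:
        if version in self.versions:
            raise ValueError(f"version {version} is already archived")
        self.versions[version] = list(chunk_ids)

    def manifest(self, version: str) -> list[str]:
        """Chunk ids making up a given archived version."""
        return self.versions[version]
```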

Inter-cloud migration of datasets: A computing service provider may deploy its AI computing center as a private cloud or on a public cloud. When a computational task is scheduled to AI computing center N but the target dataset specified by the task is not available in that center, the dataset must be migrated between clouds. The platform senses that the target dataset exists in AI computing center M and can migrate it, with center M as the data source, into the storage system of the target center N.
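The M-to-N migration can be sketched as a copy between two centers' stores once the source has been sensed. Plain dicts stand in for each center's storage backend here; the function name and signature are assumptions for illustration.

```python
def migrate_dataset(dataset_id: str,
                    stores: dict[str, dict[str, bytes]],
                    source: str, target: str) -> None:
    """Copy a dataset's bytes from the sensed source center into the target
    center's store; `stores` stands in for per-center storage backends."""
    if dataset_id not in stores[source]:
        raise KeyError(f"{dataset_id} not found at {source}")
    stores[target][dataset_id] = stores[source][dataset_id]
```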

Dataset compression: Most datasets in the field of artificial intelligence consist of unstructured data. Compressing datasets before storing them in an AI computing center saves the AI computing network a great deal of transmission bandwidth, transmission time, and storage space.
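As a minimal sketch, compress-before-store can be done with any stream compressor; gzip from the standard library is used here purely as an example, and the function names are assumptions.

```python
import gzip


def pack_dataset(raw: bytes) -> bytes:
    """Compress before storing or transferring; redundant unstructured data
    (text annotations, repeated headers, etc.) often shrinks substantially."""
    return gzip.compress(raw)


def unpack_dataset(packed: bytes) -> bytes:
    """Restore the original bytes at the consuming center."""
    return gzip.decompress(packed)
```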

Dataset caching: Migration of datasets between clouds requires three steps: sensing the data source, downloading the dataset from the data source to an intermediate proxy server, and uploading the dataset from the proxy server to the target AI computing center. Each migration passes through the public network for both upload and download, at a huge cost in bandwidth and time. Building a memory cache network for datasets inside the AI computing network would greatly improve the efficiency of dataset transfer, the efficiency of AI task scheduling, and the overall experience of using the AI computing network.
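The caching idea can be sketched as a fetch-through cache at the proxy layer: a hit skips the public-network download entirely. The LRU policy, class name, and callback-based fetch below are assumptions for illustration, not the platform's design.

```python
from collections import OrderedDict
from typing import Callable


class DatasetCache:
    """Fetch-through LRU cache: a hit avoids the public-network download."""

    def __init__(self, capacity: int, fetch: Callable[[str], bytes]) -> None:
        self.capacity = capacity
        self.fetch = fetch          # slow path, e.g. download via the proxy
        self._cache: OrderedDict[str, bytes] = OrderedDict()
        self.hits = 0

    def get(self, dataset_id: str) -> bytes:
        if dataset_id in self._cache:
            self._cache.move_to_end(dataset_id)   # mark as recently used
            self.hits += 1
            return self._cache[dataset_id]
        data = self.fetch(dataset_id)             # cache miss: slow path
        self._cache[dataset_id] = data
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)       # evict least recently used
        return data
```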

Support multi-tenancy: The AI computing network will serve multiple users and multiple organizations. Datasets have privacy attributes, so the platform requires a complete user and role permission management system that lets users set a dataset's access rights to private, limited-domain public, or network-wide public.
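The three visibility levels named in the text can be sketched as a single access-check function; the field names (`owner`, `domain`, `visibility`) are illustrative assumptions about how a dataset record might be shaped.

```python
# The three visibility levels the platform distinguishes.
PRIVATE, DOMAIN, PUBLIC = "private", "domain", "public"


def can_read(dataset: dict, user: str, user_domain: str) -> bool:
    """Access check for the three levels: network-wide public is readable by
    anyone, limited-domain public by the owning organization, private only
    by the owner."""
    if dataset["visibility"] == PUBLIC:
        return True
    if dataset["visibility"] == DOMAIN:
        return user_domain == dataset["domain"]
    return user == dataset["owner"]
```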

External unified API: The AI data unified storage and management platform and the AI task unified scheduling and management platform adopt a storage-compute separation architecture, and the storage platform supports multi-tenancy. This requires a set of external unified API interfaces through which multiple parties can invoke dataset awareness and migration.

Data structure

Before a dataset is stored in this system, the original data is compressed and formatted into two parts: a dataset metadata part and a dataset blobs part. The metadata part stores the description information and folder hierarchy of the original dataset files, while the blobs part holds the chunked contents of all files in the dataset. Each data chunk in the blobs part yields a hash digest, and the digest algorithm is designed to be variable and configurable (e.g. SHA-256 or BLAKE3) to adapt both to increasingly powerful computational cracking attacks and to higher-performance hash algorithm innovations. From the folder hierarchy of the original dataset, a Merkle tree of hash digests is formed in the metadata part, and a unique id for the dataset is generated. Because the blobs part is chunked, some or all of the data blocks within a single dataset can be selected at random and the dataset's integrity verified with the corresponding digest algorithm.
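A minimal sketch of the Merkle construction over chunk digests, under the assumption that the tree root serves as the dataset's network-wide unique id (the text says an id is generated from the tree but does not pin down the exact scheme; the odd-level duplication rule below is also an assumed convention).

```python
import hashlib


def _digest(data: bytes) -> str:
    # The platform's digest algorithm is configurable; SHA-256 is used here.
    return hashlib.sha256(data).hexdigest()


def merkle_root(chunk_digests: list[str]) -> str:
    """Fold leaf digests pairwise up to a single root; the root can act as
    the dataset's unique id, since any chunk change alters it."""
    level = list(chunk_digests)
    if not level:
        return _digest(b"")
    while len(level) > 1:
        if len(level) % 2:                 # duplicate last node on odd levels
            level.append(level[-1])
        level = [_digest((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
    return level[0]
```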

Figure 11: Structure of an AI computing network dataset

Multi-version dataset structure: The shared data block part splits the original files of the dataset into blocks and stitches them together, and is responsible for storing the actual byte contents of the current version of the dataset. Such a separated data structure improves system storage efficiency: high-frequency-access metadata can be kept in memory and on high-speed storage devices, while the shared data block files, which may occupy huge amounts of space, can live on less costly persistent storage.
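The "shared" aspect can be sketched as content-addressed block storage: blocks are keyed by digest and stored once, so overlapping versions pay only for their unique blocks. The class name and the dict-based store are illustrative assumptions.

```python
class SharedBlockStore:
    """Content-addressed block store: each block is stored once under its
    digest, so dataset versions that overlap share physical storage."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}

    def add(self, digest: str, data: bytes) -> None:
        """Deduplicating write: a block already present is not stored again."""
        self.blocks.setdefault(digest, data)

    def total_bytes(self) -> int:
        """Physical bytes held, counting each shared block once."""
        return sum(len(b) for b in self.blocks.values())
```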

Figure 12: Structure of a multi-version dataset in the AI computing network

Architecture

The platform is a unified data storage and management system that supports fast access to heterogeneous AI computing centers, aiming to provide a globally unified storage management and efficient data scheduling solution for the AI computing network. It adopts a microservice architecture and consists of four parts: a system management module, a dataset management module, a unified agent service module, and a unified storage application interface module.
