One-stop platform for debugging and training deployment of AI computing centers
The one-stop platform for debugging, training, and deployment in AI computing centers serves single-center computing power users and accepts their computing power orders. It provides resource management and usage functions for data, algorithms, images, models, and computing power, so that users of an AI Computing Center can build an AI task computing environment in one stop. For managers, it provides single-center resource management, computing task management, and monitoring to support the operation and analysis of a single center.
Characteristics
One-stop development: Provides users with one-stop debugging, training, and deployment for AI computing scenarios, covering the whole AI computing chain through the data management, model development, and model training modules. Management is equally streamlined: platform managers get a one-stop resource management platform whose visualization tools for resource allocation, monitoring, and access control greatly reduce management costs.
Superior performance: Provides a high-performance distributed computing experience, ensures each environment runs smoothly through multi-faceted optimization, and further improves model training efficiency through resource scheduling optimization and distributed computing optimization.
Good compatibility: The platform supports heterogeneous hardware and heterogeneous networks, such as GPU, NPU, FPGA, InfiniBand, and Ethernet, to meet the deployment needs of different hardware clusters. It also supports a variety of deep learning frameworks, such as TensorFlow, PyTorch, and PaddlePaddle, and new frameworks can be added by way of custom images.
User functions
Image panel: Container images package applications together with the operating system environments they depend on, providing the environment in which algorithms run. The image management module covers both user images and platform-administrator preset images. Its main functions include: get image list, create/query/update/delete images, get image version list, create/query/update/delete image versions, share image versions to the Wise Computing Network data mart, and cancel sharing.
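The function list above reads as a CRUD-plus-versioning API, and the algorithm, data, and model panels below expose the same pattern. As a non-authoritative sketch, assuming a REST-style API (the base URL, paths, and field names are all invented for illustration, not the platform's documented schema):

```python
import requests

# Hypothetical base URL and token; the real endpoint layout is an assumption.
BASE = "https://platform.example.com/api/v1"
HEADERS = {"Authorization": "Bearer <token>"}

def list_images():
    """Get the image list (user images plus administrator preset images)."""
    return requests.get(f"{BASE}/images", headers=HEADERS).json()

def create_image_version(image_id: str, tag: str, registry_ref: str):
    """Create a new version under an existing image record."""
    payload = {"tag": tag, "registryRef": registry_ref}
    return requests.post(f"{BASE}/images/{image_id}/versions",
                         json=payload, headers=HEADERS).json()

def share_image_version(image_id: str, version_id: str):
    """Share an image version to the Wise Computing Network data mart."""
    return requests.post(f"{BASE}/images/{image_id}/versions/{version_id}/share",
                         headers=HEADERS).json()

def cancel_share(image_id: str, version_id: str):
    """Cancel a previous share."""
    return requests.delete(f"{BASE}/images/{image_id}/versions/{version_id}/share",
                           headers=HEADERS).json()
```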
Algorithm panel: Users can upload their code files to the platform and then create an AI debugging environment to modify the code. After the algorithm code is debugged and saved, the corresponding AI training environment can be launched with one click to output model results and training process logs. The main functions of this module include: get algorithm list, create/query/update/delete algorithms, get algorithm version list, create/query/update/delete/download algorithm versions, share algorithm versions to the Wise Computing Network data mart, and cancel sharing.
Data panel: Datasets in a single Wise Computing Center fall into My Datasets, Public Datasets, and Preset Datasets. My Datasets are the user's own datasets, Public Datasets are datasets shared by users, and Preset Datasets are datasets uploaded by the administrator. All datasets in a single center can serve as data sources for the Wise Computing Network by connecting to its unified data storage and management platform. The main functions of this module include: get dataset list, create/query/update/delete datasets, get dataset version list, create/query/update/delete/download dataset versions, share dataset versions to the Wise Computing Network data mart, and cancel sharing.
Debugging jobs: The platform provides an online programming environment for debugging, running, and saving algorithms, supporting the subsequent creation of training jobs. The debugging module supports various online programming environments (e.g. JupyterLab, VS Code) and provides management functions to create, open, start, save, stop, and delete debugging jobs, as sketched below.
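As a sketch of that lifecycle, reusing the same hypothetical REST layout as the image example above (every path and field name here is an assumption):

```python
import requests

BASE = "https://platform.example.com/api/v1"   # assumed endpoint layout
HEADERS = {"Authorization": "Bearer <token>"}

# Create a debugging job bound to an image version and an algorithm version.
job = requests.post(f"{BASE}/debug-jobs", headers=HEADERS, json={
    "name": "bert-debug",
    "ide": "jupyterlab",                       # or "vscode"
    "imageVersion": "pytorch-2.1-cuda12",
    "algorithmVersion": "bert-finetune-v3",
}).json()
jid = job["id"]

requests.post(f"{BASE}/debug-jobs/{jid}/start", headers=HEADERS)   # allocate compute
ide_url = requests.get(f"{BASE}/debug-jobs/{jid}",                 # open: fetch IDE URL
                       headers=HEADERS).json()["ideUrl"]
requests.post(f"{BASE}/debug-jobs/{jid}/save", headers=HEADERS)    # snapshot the environment
requests.post(f"{BASE}/debug-jobs/{jid}/stop", headers=HEADERS)    # release compute, keep state
```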
Training jobs: This module lets users create distributed and non-distributed AI training jobs. It is compatible with multiple deep learning frameworks (such as TensorFlow, PaddlePaddle, MindSpore, and PyTorch); users select the corresponding algorithm and OS image to start an AI training job. It supports one-stop "training-to-model" development: the model is saved automatically after training, ready to deploy and call.
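A training job therefore needs a framework, an algorithm version, an image, a dataset, and per-sub-task replica settings. The specification below is a minimal sketch of such a request; all field names are illustrative assumptions, not the real schema:

```python
# Hypothetical distributed training-job specification.
training_job = {
    "name": "resnet50-imagenet",
    "framework": "PyTorch",            # TensorFlow / PaddlePaddle / MindSpore also supported
    "algorithmVersion": "resnet50-train-v2",
    "imageVersion": "pytorch-2.1-cuda12",
    "datasetVersion": "imagenet-1k-v1",
    "distributed": True,
    "subTasks": [
        {"role": "master", "replicas": 1, "resourcePackage": "gpu-a100-1x"},
        {"role": "worker", "replicas": 4, "resourcePackage": "gpu-a100-1x"},
    ],
    # "Training-to-model": archive the output to the model panel automatically.
    "autoSaveModel": True,
}
```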
Model panel: Models are the results generated by algorithm training; managing them helps users archive training results and provides model data for subsequent model validation, deployment, and evaluation. The main functions of this module include: get model list, create/update/delete models, get model version list, create/query/update/delete/download model versions, share models to the Wise Computing Network data mart, and cancel sharing.
Inference service: This module supports one-click deployment of models as online inference services and can be configured to generate external API documentation for each service, giving users an inference platform for large-model business. Users can query the service name, model name, model version, service description, creation time, running time, and status. The inference service is built on an elastic scaling architecture: when nobody visits at night, the number of replicas is reduced to a minimum of 0, which cuts resource occupation, saves users' costs, and lets the platform's inference compute cards be shared and utilized as efficiently as possible.
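The scale-to-zero behavior can be pictured as a small reconciliation loop. The sketch below is an illustrative policy only (the thresholds, helper functions, and service name are invented); in practice an off-the-shelf autoscaler such as KEDA or Knative on Kubernetes would play this role:

```python
import time

def get_recent_request_count() -> int:
    """Stub: in practice, query the service's request-rate metrics."""
    return 0

def scale_inference_service(name: str, replicas: int) -> None:
    """Stub: in practice, patch the deployment's replica count."""
    print(f"scale {name} -> {replicas} replicas")

def desired_replicas(recent_requests: int, max_replicas: int = 4) -> int:
    """Scale-to-zero policy: no traffic means zero replicas, freeing idle
    inference cards for other users; thresholds here are illustrative."""
    if recent_requests == 0:
        return 0
    return min(max_replicas, 1 + recent_requests // 100)

if __name__ == "__main__":
    while True:  # reconcile every 30 seconds
        scale_inference_service("llm-chat",
                                desired_replicas(get_recent_request_count()))
        time.sleep(30)
```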
Management functions
Multi-tenant management: The platform supports multi-tenant resource isolation to meet the differing resource needs of different teams; it divides separate workspaces and matches them with the corresponding training resources, making it easier for a team leader to manage training resources.
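One common way to realize workspace isolation of this kind is a Kubernetes namespace per team plus a ResourceQuota capping its share of training resources. This is an assumption about the underlying mechanism, not a documented detail of the platform:

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# One namespace per workspace (the team name is a placeholder).
namespace = "team-nlp"
core.create_namespace(
    client.V1Namespace(metadata=client.V1ObjectMeta(name=namespace)))

# Cap the workspace's CPU, memory, and GPU requests.
quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name=f"{namespace}-quota"),
    spec=client.V1ResourceQuotaSpec(hard={
        "requests.cpu": "64",
        "requests.memory": "256Gi",
        "requests.nvidia.com/gpu": "8",
    }),
)
core.create_namespaced_resource_quota(namespace, quota)
```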
Platform monitoring: The platform monitors two kinds of objects: Wise Computing Center cluster monitoring and training task monitoring. Prometheus aggregates the monitoring data exposed by each node's NodeExporter into its time-series database, and the Grafana web service periodically queries metrics from Prometheus to dynamically display graphs of cluster metrics and of the tasks the user is running.
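Grafana panels are driven by PromQL queries like the ones below; the same data can be pulled directly from Prometheus's standard HTTP API. The Prometheus URL, the GPU metric, and the pod-label pattern are illustrative assumptions (node_cpu_seconds_total is a standard node_exporter metric):

```python
import requests

PROM = "http://prometheus.example.com:9090"   # assumed Prometheus address

def query(promql: str):
    """Run an instant PromQL query against Prometheus's HTTP API."""
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Cluster view: per-node CPU utilization from node_exporter metrics.
cpu = query('1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))')

# Task view: GPU utilization of one user's training pods (labels assumed).
gpu = query('avg(DCGM_FI_DEV_GPU_UTIL{pod=~"train-resnet50-.*"})')
```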
Resource management: Platform resource management objects can be divided into the server node list, system resource objects, custom resource objects, resource package objects, and resource pool objects.
Machine time management: The platform bills in units of machine time. When a user creates an AI debugging or training environment, the corresponding machine time is deducted. The platform sets a unit price of n Utility coins per machine hour according to the price of the resource package. The billing rules for AI jobs are as follows (a worked example in code follows the list):
1. Job machine time = sub-task 1 machine time + sub-task 2 machine time + ... + sub-task n machine time
2. Sub-task machine time = replica 1 machine time + replica 2 machine time + ... + replica n machine time
3. Replica machine time = resource package usage weight × (replica run end time − replica run start time)
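The three rules compose directly, as this sketch shows (the example weights and durations are invented; only the formulas come from the rules above):

```python
from dataclasses import dataclass

@dataclass
class Replica:
    weight: float   # resource package usage weight
    start: float    # run start time, in hours
    end: float      # run end time, in hours

def replica_machine_time(r: Replica) -> float:
    """Rule 3: weight * (end - start)."""
    return r.weight * (r.end - r.start)

def subtask_machine_time(replicas: list[Replica]) -> float:
    """Rule 2: sum over the sub-task's replicas."""
    return sum(replica_machine_time(r) for r in replicas)

def job_machine_time(subtasks: list[list[Replica]]) -> float:
    """Rule 1: sum over all sub-tasks."""
    return sum(subtask_machine_time(s) for s in subtasks)

# Example: one master replica (weight 1.0) and two workers (weight 2.0 each,
# e.g. a larger resource package), all running for 3 hours.
job = [[Replica(1.0, 0.0, 3.0)],
       [Replica(2.0, 0.0, 3.0), Replica(2.0, 0.0, 3.0)]]
print(job_machine_time(job))   # 15.0 machine hours
# Cost = machine time * n Utility coins per machine hour.
```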
Preset data management: A preset dataset is a dataset uploaded by the administrator of the Wise Computing Center, generally a public dataset, providing the initial data source for users within a single Wise Computing Center and across the Wise Computing Network.
Pre-set algorithm management: Pre-set algo- rithm is the algorithm code created and uploaded by the administrator of AI computing center, and users in AI Computing Center can query, use, copy, download, and migrate the pre-set algorithm.
Preset image management: Preset images are container images created by the administrator of the AI Computing Center; users within the AI Computing Center can query, use, copy, download, and migrate preset images.
Debugging job management: The administrator of the AI Computing Center can use this module to operate and maintain users' debugging jobs and assist them in troubleshooting debugging job problems. It can query individual job details, view job monitoring information along each dimension, view job logs, and force-stop jobs. Because the platform separates storage from computation, the platform administrator cannot view original input and output data such as users' datasets, algorithms, and models.
Training job management: The administrator of the AI Computing Center can use this module to operate and maintain users' training jobs and assist them in troubleshooting training job problems. It can query individual job details, view job monitoring information along each dimension, view job logs, and force-stop jobs. Because the platform separates storage from computation, the platform administrator cannot view original input and output data such as users' datasets, algorithms, and models.
Inference service management: The administrator of the AI Computing Center can operate and maintain all inference services in the AI Computing Center through this module; its operation and maintenance functions include querying the service list, viewing service details, and force-stopping services.