AI computational network task unified scheduling and management platform
The artificial intelligence network task unified scheduling and management platform effectively schedules and manages different types of online resources from different AI computing centers to maximize the utilization and efficiency of the entire AI computing network.
Functions
Collecting resource information: computing resources include CPU, GPU, TPU, NPU, GPGPU, FPGA, ASIC, etc.; network resources include supercomputing intranets with dedicated Internet access, the public network, controlled and managed supercomputing intranets, controlled and managed AI computing intranets, etc.; data resources include various public AI datasets, users' private datasets, etc. Each resource on the network has its own advantages and application scenarios.
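As a minimal sketch of what a collected resource record might look like, the following Python dataclasses model the three resource categories described above; the class names, fields, and enumeration values are illustrative assumptions, not the platform's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum

class ResourceKind(Enum):
    COMPUTE = "compute"   # CPU, GPU, TPU, NPU, GPGPU, FPGA, ASIC, ...
    NETWORK = "network"   # supercomputing/AI computing intranets, public network, ...
    DATA = "data"         # public AI datasets, users' private datasets, ...

@dataclass
class ResourceRecord:
    """One resource entry collected from an AI computing center (illustrative schema)."""
    resource_id: str
    kind: ResourceKind
    center_id: str                             # which AI computing center reported it
    spec: dict = field(default_factory=dict)   # e.g. {"type": "GPU", "model": "A100", "count": 8}

# Example record reported by a hypothetical center "center-01"
gpu_record = ResourceRecord("res-001", ResourceKind.COMPUTE, "center-01",
                            {"type": "GPU", "model": "A100", "count": 8})
```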
Integrating resources: Since the minimum granularity of each heterogeneous resource differs, a single computing resource is often combined with other kinds of heterogeneous resources to form a resource package that is convenient for AI use. Each resource package is then allocated to the task requirements that fit it best, optimizing task completion efficiency and resource utilization.
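To illustrate how resources of different granularities might be bundled and matched against task requirements, here is a small sketch; the package structure and matching rule are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class ResourcePackage:
    """A bundle of heterogeneous resources allocated as one unit (illustrative)."""
    package_id: str
    gpu_count: int          # compute granularity: whole GPUs
    bandwidth_gbps: float   # network granularity: guaranteed bandwidth
    dataset_ids: list       # data resources bundled with the package

def fits(package: ResourcePackage, demand: dict) -> bool:
    """Check whether a package satisfies a task's resource demand specification."""
    return (package.gpu_count >= demand.get("gpu_count", 0)
            and package.bandwidth_gbps >= demand.get("bandwidth_gbps", 0.0)
            and set(demand.get("dataset_ids", [])) <= set(package.dataset_ids))

pkg = ResourcePackage("pkg-01", gpu_count=8, bandwidth_gbps=100.0, dataset_ids=["imagenet"])
print(fits(pkg, {"gpu_count": 4, "dataset_ids": ["imagenet"]}))  # True
```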
Managing jobs: Each AI job often contains multiple nested tasks, called sub-tasks, and each sub-task can be scheduled to a different AI computing center in a distributed manner; a successfully scheduled job thus forms the network of sub-tasks it requires. The platform must also monitor and manage the running status and performance of jobs and sub-tasks, so that problems can be detected and handled in a timely manner.
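A minimal sketch of the job/sub-task relationship and status monitoring described above; the class, field, and status names are hypothetical (requires Python 3.10+ for the `str | None` annotation).

```python
from dataclasses import dataclass, field

@dataclass
class SubTask:
    """One schedulable unit of an AI job (illustrative)."""
    task_id: str
    center_id: str | None = None   # AI computing center it was dispatched to
    status: str = "pending"        # pending -> running -> succeeded / failed

@dataclass
class Job:
    job_id: str
    sub_tasks: list[SubTask] = field(default_factory=list)

    def failed_tasks(self) -> list[SubTask]:
        """Surface sub-tasks that need timely handling."""
        return [t for t in self.sub_tasks if t.status == "failed"]
```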
Scheduling jobs: The platform schedules the jobs submitted by users, dispatching each sub-task in a job to an appropriate AI computing center based on the load of each center, data location, computing power price, communication efficiency, the job's resource demand, and other factors (referred to as "scheduling factors" and outlined below), so as to maximize task completion efficiency and resource utilization.
Scheduling strategies: Each scheduling strategy takes a number of scheduling factors as inputs, applies its own processing logic to the candidate AI computing centers that meet the job's resource requirements, and outputs a normalized score for each candidate center. The output of each scheduling strategy is then used as input to the scheduling evaluation model for the next step of comprehensive decision making. Scheduling strategies are chosen according to the actual scenario requirements; optional strategies include minimum-load priority, idle-resource priority, data affinity, lowest computing power price priority, highest computing power performance priority, network performance priority, etc.
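To illustrate, here is a sketch of one such strategy, minimum-load priority, mapping candidate centers' scheduling factors to normalized scores; the factor fields, and the assumption that load is already normalized to [0, 1], are illustrative, not the platform's actual interface.

```python
def min_load_priority(candidates: dict[str, dict]) -> dict[str, float]:
    """Score candidate centers so that lower load yields a higher normalized score.

    candidates maps center_id -> scheduling factors, e.g. {"load": 0.35, ...},
    where "load" is assumed to be already normalized to [0, 1].
    Returns center_id -> score in [0, 1].
    """
    return {cid: 1.0 - factors["load"] for cid, factors in candidates.items()}

scores = min_load_priority({"center-01": {"load": 0.35}, "center-02": {"load": 0.80}})
# center-01 scores higher because it is less loaded
```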
Scheduling factors: Scheduling factors are selected according to the actual scenario requirements, and custom scheduling factors can also be defined. Optional scheduling factors include: job requirements, resource package demand specifications, data location, computing power price, computing power performance, AI computing center load, etc.
Scheduling evaluation model: The outputs of multiple scheduling strategies are used as inputs to the evaluation model, which computes the optimal AI computing center and thus the final scheduling result. The following evaluation model is used: for a job waiting to be scheduled, scheduling strategies $S_1, S_2, \ldots, S_n$ are applied, with weights $W_1, W_2, \ldots, W_n$ assigned to them. If each strategy outputs scores $G_{k1}, G_{k2}, \ldots, G_{kn}$ for the $k$-th AI computing center $C_k$, then the final score $G_k$ of center $C_k$ is

$$G_k = \sum_{i=1}^{n} W_i \, G_{ki}.$$

From all AI computing centers $C_1, C_2, \ldots, C_m$, the one with the highest score $G_k$ is selected as the final scheduling result, and the task is dispatched to the target AI computing center.
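As a minimal sketch of this evaluation model, the following function combines the per-strategy scores $G_{ki}$ with the weights $W_i$ as in the formula above and selects the highest-rated center; the function signatures are illustrative assumptions.

```python
def evaluate(strategies, weights, candidates):
    """Combine strategy scores per G_k = sum_i W_i * G_ki and pick the best center.

    strategies: list of functions, each mapping candidates -> {center_id: score}
    weights:    list of floats W_1..W_n, one per strategy
    candidates: {center_id: scheduling factors} for centers meeting the job's requirements
    """
    per_strategy = [s(candidates) for s in strategies]          # G_ki for each strategy i
    totals = {cid: sum(w * scores[cid] for w, scores in zip(weights, per_strategy))
              for cid in candidates}                            # G_k per center
    return max(totals, key=totals.get)                          # highest-rated center

best = evaluate(
    strategies=[lambda c: {cid: 1.0 - f["load"] for cid, f in c.items()}],
    weights=[1.0],
    candidates={"center-01": {"load": 0.35}, "center-02": {"load": 0.80}},
)
print(best)  # center-01
```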
Data scheduling factor: Through the query interface of the AI data unified storage and management platform (see section 6.2 for details), the platform can obtain the ID, name, version, data source distribution locations, data cache distribution locations, access rights, and other related information of the target dataset. Through the same platform it can also query, for each AI computing center, whether downloading is allowed, the average download time, the maximum download concurrency, the maximum download bandwidth, the recent download failure rate, and other information. The platform uses this information about the target dataset and the target AI computing centers as one of its scheduling factors.
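The records below sketch what such query results might contain; the class names and fields are assumptions derived from the description above, not the actual interface of the AI data unified storage and management platform.

```python
from dataclasses import dataclass

@dataclass
class DatasetInfo:
    """Dataset metadata returned by the data platform's query interface (illustrative)."""
    dataset_id: str
    name: str
    version: str
    source_locations: list[str]   # centers holding the authoritative copy
    cache_locations: list[str]    # centers holding cached copies
    access_rights: str

@dataclass
class CenterDownloadStats:
    """Per-center download characteristics used as a data scheduling factor (illustrative)."""
    center_id: str
    download_allowed: bool
    avg_download_seconds: float
    max_concurrency: int
    max_bandwidth_gbps: float
    recent_failure_rate: float
```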
Data scheduling policies: The scheduling engine decides whether to migrate datasets across AI computing centers based on the output of the data scheduling policies. The platform has the following two data scheduling policies.
Compute-follows-data policy: the AI computing centers where the target dataset already exists can serve as candidate centers to which the target job may be scheduled.

Data-follows-compute policy: when a target AI computing center has been selected for the job, the optimal data source is chosen based on the multidimensional data scheduling factors of the candidate data sources; the target dataset is then migrated to the target center and cached, and the target job is scheduled once the data migration task has completed.
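The following sketch illustrates how the scheduling engine might act on these two policies; the decision logic and interfaces are illustrative assumptions.

```python
def plan_data_scheduling(target_center, dataset_locations, source_quality):
    """Decide whether data migration is needed before dispatching the job.

    dataset_locations: centers that already hold the target dataset (assumed non-empty)
    source_quality:    candidate source center -> score from data scheduling factors
    Returns (migration_source or None, can_dispatch_now).
    """
    if target_center in dataset_locations:
        # Compute follows data: the target center already holds the dataset,
        # so the job can be dispatched immediately with no migration.
        return None, True
    # Data follows compute: pick the optimal data source, migrate and cache the
    # dataset at the target center, then dispatch the job once migration completes.
    best_source = max(dataset_locations, key=lambda c: source_quality[c])
    return best_source, False
```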
Architecture
The core of the unified scheduling and management platform for artificial intelligence tasks is the inter-cloud management and scheduling system, which is responsible for assigning user tasks to different heterogeneous AI computing centers.
When designing the platform architecture, we adopted the design principle of separating storage and computation, splitting inter-cloud resource management into two major components, computing resource management and data management, in order to achieve a high degree of scalability and flexibility.
In this architecture, the inter-cloud management and scheduling system is divided into four main modules: the job management module, the scheduling engine module, the heterogeneous computing power resource management module, and the data management module. The main functions of each module are as follows:
Job management module: responsible for creating, deleting, updating, and querying job objects, as well as log management and running-load monitoring.
Scheduling engine module: based on the comprehensive scores produced by the internal scheduling strategies and the scheduling evaluation model, it dispatches computing jobs to the target AI computing center to maximize task completion efficiency and resource utilization.
Heterogeneous computing power resource management module: responsible for unified access to and management of the heterogeneous computing, network, and data resources in each AI computing center, periodically collecting resource information and classifying, storing, and managing it for use in subsequent job scheduling.
Data management module: it senses the target data resources of the AI computing network in real time; when the target AI computing center lacks the target dataset required by an AI job, the data management module identifies the data source with the best network location and provides the scheduling engine with a data migration strategy, realizing the compute-follows-data and data-follows-compute scheduling effects.
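To make the division of responsibilities concrete, the following sketch wires the four modules together along the job-submission path; all module interfaces here are hypothetical.

```python
class InterCloudScheduler:
    """Illustrative wiring of the four modules of the inter-cloud management
    and scheduling system; the module interfaces are assumptions."""

    def __init__(self, jobs, engine, resources, data):
        self.jobs = jobs            # job management module
        self.engine = engine        # scheduling engine module
        self.resources = resources  # heterogeneous computing power resource management module
        self.data = data            # data management module

    def submit(self, job):
        self.jobs.register(job)
        candidates = self.resources.candidates_for(job)      # centers meeting job demands
        target = self.engine.select_center(job, candidates)  # weighted strategy scores
        plan = self.data.migration_plan(job, target)         # migrate dataset if missing
        if plan is not None:
            self.data.migrate(plan)                          # data follows compute
        self.jobs.dispatch(job, target)
```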