Publication No.: US20250110784A1
Publication Date: 2025-04-03
Application No.: US18478185
Filing Date: 2023-09-29
Applicant: Amazon Technologies, Inc.
Inventor: Deepti Laxman RAGHA , Pratyush Kumar RANJAN , Michael PHAM , Maximiliano MACCANTI
IPC: G06F9/50
Abstract: Techniques for caching in a machine learning (ML) model hosting service are described. ML model usage data is aggregated from host usage data provided by each host of a first set of hosts, the ML model usage data including, for a particular ML model, a number of inference requests to the particular ML model. A priority order of hosts in a second set of hosts to service an inference request for the particular ML model is calculated. Based on the ML model usage data and the priority order, a set of ML models to load to a particular host in the second set of hosts is determined. The particular host is caused to load the set of ML models. A router is updated to direct ML model inference requests amongst the second set of hosts.
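
The sketch below illustrates the general shape of the technique the abstract describes: aggregating per-host usage counts, ranking hosts per model, and choosing which models each host should preload. The abstract does not disclose the actual ranking or placement functions, so this is a minimal sketch under assumptions: the rendezvous-style hash ranking, the per-host capacity limit, and all names (aggregate_usage, priority_order, models_to_load, cache_slots_per_host) are hypothetical, not taken from the patent.

```python
from collections import Counter
import hashlib

def aggregate_usage(host_usage_reports):
    """Sum per-model inference request counts across hosts in the first set.
    Each report is assumed to be a mapping {model_id: request_count}."""
    totals = Counter()
    for report in host_usage_reports:
        totals.update(report)  # Counter.update adds counts from a mapping
    return totals

def priority_order(model_id, hosts):
    """Rank hosts in the second set for a given model.
    A rendezvous (highest-random-weight) hash is one plausible choice;
    the patent abstract does not specify the ranking function."""
    return sorted(
        hosts,
        key=lambda h: hashlib.sha256(f"{model_id}:{h}".encode()).hexdigest(),
        reverse=True,
    )

def models_to_load(usage, hosts, cache_slots_per_host):
    """Assign the most-requested models to their highest-priority hosts,
    respecting an assumed per-host cache capacity. The resulting placement
    is what a router could consult to direct inference requests."""
    placement = {h: [] for h in hosts}
    for model_id, _count in usage.most_common():  # hottest models first
        for host in priority_order(model_id, hosts):
            if len(placement[host]) < cache_slots_per_host:
                placement[host].append(model_id)
                break
    return placement

if __name__ == "__main__":
    # Hypothetical usage reports from two hosts in the first set.
    reports = [{"m1": 120, "m2": 30}, {"m1": 80, "m3": 55}]
    usage = aggregate_usage(reports)
    print(models_to_load(usage, ["host-a", "host-b"], cache_slots_per_host=2))
```

A deterministic per-model host ranking like the one sketched here has the property that the same model consistently maps to the same preferred hosts, which keeps caches warm; whether the claimed invention uses hashing, load metrics, or another signal to compute the priority order is not stated in the abstract.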