Publication No.: US12014276B2
Publication Date: 2024-06-18
Application No.: US18219555
Filing Date: 2023-07-07
Applicant: Google LLC
Inventor: Gaurav Mishra, Adam Joseph Roberts, Maarten Paul Bosma, Noam M. Shazeer, Jr.
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a machine learning model using a deterministic data pipeline. One of the methods may include receiving a first request to generate a deterministic training dataset: transforming raw training examples obtained from the raw data source into pre-processed training examples; assigning a unique index to each pre-processed training example; and caching the pre-processed training examples into the cache directory specified in the received first request; receiving a second request to use the deterministic training dataset to train a machine learning model, the second request specifying a start index; and in response to receiving the second request: reading, from the cache directory, the pre-processed training examples that have indices beginning from the start index; and providing the read training examples in an order of the assigned indices for use in training the machine learning model.
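The two-request flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `build_cache` and `read_from`, the JSON Lines file `examples.jsonl`, and the use of an in-memory iterable as the raw data source are all assumptions made for illustration only.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, Iterator

CACHE_FILE = "examples.jsonl"  # hypothetical file name, not from the patent


def build_cache(raw_examples: Iterable[dict],
                preprocess: Callable[[dict], dict],
                cache_dir: str) -> int:
    """First request: pre-process each raw example, assign it a unique
    index, and write it to the cache directory. Returns the example count."""
    path = Path(cache_dir)
    path.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(path / CACHE_FILE, "w", encoding="utf-8") as f:
        for index, raw in enumerate(raw_examples):
            record = {"index": index, "example": preprocess(raw)}
            f.write(json.dumps(record) + "\n")
            count += 1
    return count


def read_from(cache_dir: str, start_index: int) -> Iterator[dict]:
    """Second request: yield cached examples whose index is at least
    start_index, in ascending index order."""
    with open(Path(cache_dir) / CACHE_FILE, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    # Sort explicitly so the order depends only on the assigned indices.
    for record in sorted(records, key=lambda r: r["index"]):
        if record["index"] >= start_index:
            yield record
```

Because the order depends only on the cached indices, calling `read_from` again with the same start index yields the same sequence, which is what would let a training run resume from a known position after an interruption.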