KNOWLEDGE DISTILLATION METHOD FOR COMPRESSING TRANSFORMER NEURAL NETWORK AND APPARATUS THEREOF

    Publication Number: US20240330648A1

    Publication Date: 2024-10-03

    Application Number: US18596994

    Application Date: 2024-03-06

    CPC classification number: G06N3/042; G06N3/082; G06N3/096

    Abstract: A method is disclosed for training a student network that includes at least one transformer neural network, using knowledge distillation from a teacher network that also includes at least one transformer neural network. The method includes: pre-training the teacher network using training data and fine-tuning the trained teacher network; copying weight parameters of the bottom layers of the teacher network to the student network; and performing knowledge distillation to the student network through the fine-tuned teacher network. Performing the knowledge distillation includes: extracting a feature structure from the result value of a layer of the fine-tuned teacher network; extracting a feature structure from the result value of a layer of the student network; and adjusting the extracted feature structure of the student network based on the extracted feature structure of the teacher network.
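    The claimed flow lends itself to a short sketch. Below is a minimal PyTorch illustration of the steps the abstract names: copying the teacher's bottom-layer weights into the student, extracting a "feature structure" from each layer's output, and adjusting the student's structure toward the teacher's. All function names, the choice of a pairwise token-similarity matrix as the feature structure, the uniform layer mapping, and the MSE objective are assumptions made for illustration, not the patent's specified implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def copy_bottom_layers(teacher_layers: nn.ModuleList,
                       student_layers: nn.ModuleList) -> None:
    """Initialize the (shallower) student by copying the teacher's
    bottom-layer weights; zip stops at the student's depth."""
    for s_layer, t_layer in zip(student_layers, teacher_layers):
        s_layer.load_state_dict(t_layer.state_dict())


def feature_structure(hidden: torch.Tensor) -> torch.Tensor:
    """Summarize a layer's output (batch, seq, dim) as a pairwise
    token-similarity matrix (batch, seq, seq) -- one plausible reading
    of the 'feature structure' extracted from a layer's result value."""
    h = F.normalize(hidden, dim=-1)
    return h @ h.transpose(-1, -2)


def distillation_loss(teacher_hiddens: list, student_hiddens: list) -> torch.Tensor:
    """Adjust each student layer's feature structure toward that of a
    corresponding teacher layer (uniform layer mapping assumed)."""
    ratio = len(teacher_hiddens) // len(student_hiddens)
    loss = 0.0
    for i, s_h in enumerate(student_hiddens):
        t_h = teacher_hiddens[(i + 1) * ratio - 1]
        loss = loss + F.mse_loss(feature_structure(s_h),
                                 feature_structure(t_h).detach())
    return loss


# Toy check: hidden states from a 12-layer teacher and a 4-layer student.
teacher_hiddens = [torch.randn(2, 16, 64) for _ in range(12)]
student_hiddens = [torch.randn(2, 16, 64) for _ in range(4)]
print(distillation_loss(teacher_hiddens, student_hiddens))
```

    One appeal of matching similarity matrices rather than raw hidden states is that the resulting loss compares (seq, seq) structures, so the teacher and student hidden dimensions need not be equal.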
