Abstract:
Methods, systems, and apparatuses, including computer programs encoded on computer-readable media, for tokenizing n-grams from a plurality of text units. A multi-dimensional array is created having a plurality of dimensions based upon the plurality of text units and the n-grams from the plurality of text units. The multi-dimensional array is normalized, and its dimensionality is reduced. The reduced-dimensionality array is then clustered to generate a plurality of clusters, each of which includes one or more of the plurality of text units.
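
The sequence of steps summarized above (tokenize n-grams, build the array, normalize, reduce dimensionality, cluster) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the abstract: the function names are hypothetical, the n-grams are word-level unigrams and bigrams, normalization is per-row L2 scaling, dimensionality reduction is approximated by keeping the highest-variance columns (a simple stand-in for a technique such as SVD or PCA), and clustering is plain k-means with a deterministic farthest-point initialization.

```python
from collections import Counter
import math


def ngrams(text, n_max=2):
    """Word n-grams of length 1..n_max from one text unit (assumed tokenization)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]


def build_matrix(texts, n_max=2):
    """Rows are text units, columns are distinct n-grams, cells are raw counts."""
    counts = [Counter(ngrams(t, n_max)) for t in texts]
    vocab = sorted(set().union(*counts))
    return [[c[g] for g in vocab] for c in counts], vocab


def l2_normalize(matrix):
    """Scale each row to unit Euclidean length (assumed normalization)."""
    out = []
    for row in matrix:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        out.append([v / norm for v in row])
    return out


def reduce_columns(matrix, keep):
    """Keep the `keep` highest-variance columns (stand-in for SVD/PCA)."""
    n = len(matrix)

    def variance(j):
        col = [row[j] for row in matrix]
        mean = sum(col) / n
        return sum((v - mean) ** 2 for v in col) / n

    cols = sorted(range(len(matrix[0])), key=variance, reverse=True)[:keep]
    cols.sort()
    return [[row[j] for j in cols] for row in matrix]


def kmeans(points, k, iters=20):
    """Plain k-means; initial centers are chosen by farthest-point selection."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_texts(texts, k=2, keep=8):
    """Full pipeline: n-grams -> array -> normalize -> reduce -> cluster."""
    matrix, _ = build_matrix(texts)
    reduced = reduce_columns(l2_normalize(matrix), keep)
    return kmeans(reduced, k)


texts = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "stock prices fell sharply today",
    "stock prices rose sharply today",
]
labels = cluster_texts(texts)
print(labels)
```

On this toy input the two "cat" sentences land in one cluster and the two "stock" sentences in the other. A practical system would likely replace the column-selection step with a true projection method and choose the number of clusters from the data; both are left out here to keep the sketch short.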