ProTrek embedding storage

The website stores all protein embeddings generated by the 650M version of ProTrek.

Extremely large databases, such as GOPC and OMG, are split into smaller numbered parts, and the protein embeddings of each part are stored separately.

Each embedding folder contains two files: embeddings_xxx.npy, which stores the generated embeddings, and ids.tsv, which records each protein's ID, sequence and sequence length. The two files are aligned line by line: each row of embeddings_xxx.npy is the embedding of the protein on the corresponding line of ids.tsv. In the file name embeddings_xxx.npy, xxx is the number of proteins in the folder. To read the file, use the np.memmap function. For example:

import numpy as np

npy_path = "embeddings_xxx.npy"
embeddings = np.memmap(npy_path, dtype=np.float32, mode="r", shape=(xxx, 1024)) # replace xxx with the number of proteins (rows) in this folder
print(embeddings.shape)
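
If the protein count is not at hand, it can usually be recovered from the file size. The sketch below assumes that the file is a raw float32 buffer with 1024 values per protein and no extra header, which is what the np.memmap call above implies. It also assumes that ids.tsv is tab-separated with no header line. The helper names open_embeddings and read_ids are illustrative and not part of ProTrek.

```python
import os
import numpy as np

EMBED_DIM = 1024  # embedding width, taken from the example above
ROW_BYTES = EMBED_DIM * np.dtype(np.float32).itemsize  # 4096 bytes per protein


def open_embeddings(npy_path):
    """Memory-map a raw float32 embedding file, inferring the row count from its size."""
    size = os.path.getsize(npy_path)
    if size % ROW_BYTES != 0:
        # A remainder means the layout assumption (raw float32, 1024 wide) does not hold.
        raise ValueError(f"{npy_path}: size {size} is not a multiple of {ROW_BYTES}")
    n_rows = size // ROW_BYTES
    return np.memmap(npy_path, dtype=np.float32, mode="r", shape=(n_rows, EMBED_DIM))


def read_ids(tsv_path):
    """Read ids.tsv into a list of tuples of strings, one tuple per line.

    Assumption: no header line; columns are tab-separated (ID, sequence, length).
    """
    records = []
    with open(tsv_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                records.append(tuple(line.split("\t")))
    return records
```

Comparing len(read_ids("ids.tsv")) with the number of rows returned by open_embeddings is a quick check that the two files are still aligned after download.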

Note: for the PDB database, we do not provide embeddings_xxx.npy. Instead, we uploaded a .index file, which you can load with the faiss library. For example:

import faiss

index_path = "sequence.index"
index = faiss.read_index(index_path)
embeddings = index.reconstruct_n(0, index.ntotal)
print(embeddings.shape)
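
Reconstructing every vector in one call keeps the whole matrix in memory. When that is inconvenient, the same reconstruct_n call can be made in batches. The sketch below is a hypothetical helper, not part of ProTrek or faiss. It relies only on the ntotal attribute and the reconstruct_n(start, count) method already used above, and it assumes the index type supports reconstruction.

```python
def iter_reconstructed(index, batch_size=100_000):
    """Yield (start, vectors) pairs, reconstructing batch_size vectors at a time.

    `index` is any object exposing `ntotal` and `reconstruct_n(start, count)`,
    such as the faiss index loaded in the example above.
    """
    total = index.ntotal
    for start in range(0, total, batch_size):
        count = min(batch_size, total - start)  # the last batch may be shorter
        yield start, index.reconstruct_n(start, count)


# Possible usage with the index loaded above:
# for start, vectors in iter_reconstructed(index, batch_size=50_000):
#     print(start, vectors.shape)
```

The start offset returned with each batch gives the position of its first vector in the index, so batches can be written to disk or processed one at a time.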

Available Download Files

GOPC (492 files)
MGnify (96 files)
NCBI (142 files)
OMG (636 files)
OMG_prot50 (64 files)
PDB (4 files)
Uncharacterized (2 files)
UniRef50 (2 files)