This website hosts all protein embeddings generated by the 650M version of ProTrek.
Extremely large databases, such as GOPC and OMG, are split into smaller numbered parts, and the protein embeddings of each part are stored separately.
Each embedding folder contains two files: embeddings_xxx.npy, which stores the generated embeddings, and ids.tsv, which records protein information (ID, sequence, and sequence length). The two files are aligned line by line, so the i-th embedding belongs to the protein on the i-th line of ids.tsv. In embeddings_xxx.npy, xxx is the number of proteins in the folder. To read the file, use the np.memmap function. For example:
import numpy as np

npy_path = "embeddings_xxx.npy"
# Replace xxx with the number of proteins (lines) in this folder; each embedding has 1024 dimensions
embeddings = np.memmap(npy_path, dtype=np.float32, mode="r", shape=(xxx, 1024))
print(embeddings.shape)
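The row count does not have to be typed in by hand. The following is a minimal sketch, assuming the file is a raw float32 buffer with no header (which is what the np.memmap call above implies) and that ids.tsv is tab-separated with the protein ID in the first column and no header row. The helper names open_embeddings and read_ids are hypothetical, and the demo runs on a tiny synthetic folder rather than on a real download.

```python
import os
import tempfile

import numpy as np

EMBED_DIM = 1024  # embedding width used in the np.memmap example


def open_embeddings(npy_path, dim=EMBED_DIM):
    """Memory-map a raw float32 embedding file, inferring the row count from its size."""
    row_bytes = dim * np.dtype(np.float32).itemsize
    n_bytes = os.path.getsize(npy_path)
    if n_bytes % row_bytes != 0:
        # The raw-buffer assumption does not hold for this file
        raise ValueError(f"{npy_path}: size {n_bytes} is not a multiple of {row_bytes}")
    return np.memmap(npy_path, dtype=np.float32, mode="r",
                     shape=(n_bytes // row_bytes, dim))


def read_ids(tsv_path):
    """Return the first tab-separated field (assumed to be the protein ID) of each line."""
    with open(tsv_path) as handle:
        return [line.split("\t")[0].strip() for line in handle if line.strip()]


# Synthetic stand-in for one downloaded folder (3 proteins)
tmp_dir = tempfile.mkdtemp()
demo_npy = os.path.join(tmp_dir, "embeddings_3.npy")
demo_tsv = os.path.join(tmp_dir, "ids.tsv")
np.arange(3 * EMBED_DIM, dtype=np.float32).reshape(3, EMBED_DIM).tofile(demo_npy)
with open(demo_tsv, "w") as handle:
    handle.write("P1\tMKV\t3\nP2\tMA\t2\nP3\tMGGS\t4\n")

embeddings = open_embeddings(demo_npy)
ids = read_ids(demo_tsv)

# Row i of the embeddings belongs to line i of ids.tsv
assert len(ids) == embeddings.shape[0]
print(embeddings.shape)
print(ids[1], embeddings[1, :3])
```

If open_embeddings raises a ValueError on a real file, the layout assumption is wrong for that file, and the shape should be taken from the file name as described above.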
Note: for the PDB database, we do not provide embeddings_xxx.npy. Instead, we uploaded a .index file, which you can load with the faiss library. For example:
import faiss

index_path = "sequence.index"
index = faiss.read_index(index_path)
# Recover all stored vectors from the index as a NumPy array
embeddings = index.reconstruct_n(0, index.ntotal)
print(embeddings.shape)
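Whichever way the embeddings are loaded, the result is a 2-D array with one row per protein. As an illustration of what can be done with it, here is a minimal sketch of a brute-force similarity lookup in plain NumPy. It assumes cosine similarity is a sensible measure for these embeddings, which this page does not state; the function top_k_cosine is hypothetical, and the demo uses random vectors in place of real data. Rows are processed in chunks so that a memory-mapped array is never read into memory all at once.

```python
import numpy as np


def top_k_cosine(query, embeddings, k=5, chunk_size=4096):
    """Return (indices, scores) of the k rows most cosine-similar to the query."""
    query = np.asarray(query, dtype=np.float32)
    query = query / np.linalg.norm(query)
    n_rows = embeddings.shape[0]
    scores = np.empty(n_rows, dtype=np.float32)
    for start in range(0, n_rows, chunk_size):
        # Copy one chunk into memory; works for np.memmap and regular arrays
        block = np.asarray(embeddings[start:start + chunk_size], dtype=np.float32)
        norms = np.maximum(np.linalg.norm(block, axis=1), 1e-12)
        scores[start:start + chunk_size] = (block @ query) / norms
    k = min(k, n_rows)
    top = np.argpartition(-scores, k - 1)[:k]  # unordered top k
    top = top[np.argsort(-scores[top])]        # sort them by descending score
    return top, scores[top]


# Random stand-in data: 100 vectors of width 1024
rng = np.random.default_rng(0)
demo = rng.standard_normal((100, 1024)).astype(np.float32)

indices, sims = top_k_cosine(demo[42], demo, k=3)
print(indices, sims)
```

The returned indices are row numbers, so they can be mapped back to protein IDs through the corresponding lines of ids.tsv.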