This website hosts all protein embeddings generated by the 650M version of ProTrek.
Extremely large databases, such as GOPC and OMG, are divided into smaller parts, and the protein embeddings of each part are stored separately (the parts are labeled with numbers).
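Because a split database is stored as several parts, working with it means visiting every part in turn. The sketch below shows one way to do that, assuming all parts were downloaded under a single local directory and that each part contains an embeddings_xxx.npy file whose xxx is the protein count, as described below. The helper name find_embedding_files and the directory layout are illustrative assumptions, not part of ProTrek.

```python
import re
from pathlib import Path

# Assumption: file names follow the embeddings_<N>.npy pattern described on
# this page, where <N> is the number of proteins stored in that folder.
EMBEDDING_FILE = re.compile(r"embeddings_(\d+)\.npy$")


def find_embedding_files(root):
    """Return (path, n_proteins) for every embeddings_<N>.npy under root.

    Results are sorted by path, so the order follows the folder names as
    plain strings (hypothetical folder names; adjust to your download).
    """
    found = []
    for path in sorted(Path(root).rglob("embeddings_*.npy")):
        match = EMBEDDING_FILE.search(path.name)
        if match:
            # The protein count is read from the file name, not the contents.
            found.append((path, int(match.group(1))))
    return found
```

Each returned pair supplies the two values the np.memmap call on this page needs: the file path and the number of rows, for example np.memmap(path, dtype=np.float32, mode="r", shape=(n, 1024)). The ids.tsv file for the same part sits next to the embeddings file, at path.parent / "ids.tsv".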
Each embedding folder contains two files: an embeddings_xxx.npy file that stores the generated embeddings and an ids.tsv file that records information about each protein (ID, sequence, and sequence length). The lines of the two files correspond one to one. In the file name embeddings_xxx.npy, xxx is the number of proteins in the folder. To read the file, use the np.memmap function. For example:
import numpy as np
npy_path = "embeddings_xxx.npy"
embeddings = np.memmap(npy_path, dtype=np.float32, mode="r", shape=(xxx, 1024)) # replace xxx with the number of proteins (rows) in the file
print(embeddings.shape)
Note: for the PDB database, we do not provide an embeddings_xxx.npy file. Instead, we uploaded a .index file, which you can load with the faiss library. For example:
import faiss
index_path = "sequence.index"
index = faiss.read_index(index_path)
embeddings = index.reconstruct_n(0, index.ntotal)
print(embeddings.shape)
Click on database names to expand/collapse file lists: