
PepBDB-ML

This project generates an enriched dataset from the PepBDB database for machine-learning and computational-biology research. The script processes peptide-protein interaction data, extracts sequences, and adds biochemical features, producing a tabular dataset suitable for Random Forest, XGBoost, and similar models. Each row is labeled as a binding residue (1) or a non-binding residue (0).

Source: GitHub

Created: Jun 26, 2024

Updated: Jun 28, 2024

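PepBDB-ML Dataset Generation

Overview

This project aims to generate an enriched dataset from the PepBDB database for machine-learning and computational-biology research.

The script processes peptide-protein interaction data, extracts sequences, and enriches them with various biochemical features, creating a tabular dataset suitable for further analysis with Random Forest, XGBoost, and similar methods. Each row is labeled as a binding residue (1) or a non-binding residue (0).

Tabular dataset `peppi_data.csv`: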

AA	Protein Hydrophobicity	Protein Steric Parameter	Protein Volume	Protein Polarizability	Protein Helix Probability	Protein Beta Probability	Protein Isoelectric Point	Protein HSE Up	Protein HSE Down	Protein Pseudo Angles	Protein ASA	Protein Phi	Protein Psi	Protein SS H	A	R	N	D	C	Q	E	G	H	I	L	K	M	F	P	S	T	W	Y	V	Binding Indices
L	0.6891891891891891	0.9607843137254901	0.8222778473091366	0.6241610738255033	0.6823529411764706	0.7473684210526315	0.40175219023779735	0.3333333333333333	0.42857142857142855	0.8699882132974325	0.4456686291000842	0.23066692000760028	0.0816007154035323	1.0	0.0	0.0	0.0	0.0	0.6666666666666666	0.0	0.0	0.0	0.16666666666666663	0.5555555555555556	1.0	0.0	0.5555555555555556	0.5	0.33333333333333326	0.0	0.0	0.25	0.30000000000000004	0.7142857142857142	0
K	0.518018018018018	0.6666666666666667	0.8610763454317898	0.7348993288590604	0.6941176470588235	0.25263157894736843	0.8723404255319149	0.0	0.5714285714285714	0.8747249797378288	0.7270984020185031	0.19703591107733237	0.09479096803040464	1.0	0.3333333333333333	0.4444444444444444	0.36363636363636365	0.30000000000000004	0.6666666666666666	0.5	0.3333333333333333	0.30000000000000004	0.5	0.2222222222222222	0.5	0.8571428571428571	0.4444444444444444	0.2	1.0	0.8	0.2857142857142857	0.125	0.2	0.42857142857142855	1
D	0.2072072072072072	0.7450980392156863	0.40175219023779735	0.3523489932885906	0.48235294117647065	0.18947368421052624	0.0	0.0	0.49999999999999994	0.6108288105124712	0.7711867992384177	0.1911457343720312	0.1553767046724793	1.0	0.0	0.2222222222222222	0.4545454545454546	1.0	0.0	0.375	0.6666666666666666	0.2	0.5	0.0	0.0	0.42857142857142855	0.0	0.0	0.6666666666666666	0.6000000000000001	0.14285714285714285	0.0	0.09999999999999998	0.14285714285714285	1
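
As a minimal sketch of how a table with this layout could be split into features and labels, the snippet below treats `AA` as a residue identifier and `Binding Indices` as the label, and reads every other column as a numeric feature. The comma delimiter and the tiny in-memory sample (three columns, values rounded) are assumptions made for illustration; check the real file before relying on them.

```python
import csv
import io

LABEL_COLUMN = "Binding Indices"   # label column, as named in the header above
RESIDUE_COLUMN = "AA"              # one-letter residue code, not a numeric feature


def split_features_and_labels(handle, delimiter=","):
    """Read a peppi_data-style table and return (feature_names, X, y)."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    feature_names = [
        name for name in reader.fieldnames
        if name not in (RESIDUE_COLUMN, LABEL_COLUMN)
    ]
    X, y = [], []
    for row in reader:
        X.append([float(row[name]) for name in feature_names])
        y.append(int(row[LABEL_COLUMN]))
    return feature_names, X, y


# Tiny in-memory stand-in: three of the real columns, values rounded.
sample = io.StringIO(
    "AA,Protein Hydrophobicity,Protein ASA,Binding Indices\n"
    "L,0.689,0.446,0\n"
    "K,0.518,0.727,1\n"
)
names, X, y = split_features_and_labels(sample)
print(names, X, y)
```

For the full file, the same function could be given a handle from `gzip.open("peppi_data.csv.gz", "rt", newline="")`.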

Image dataset `peppi_data_imgs`:

```bash
peppi_data_imgs
├── binding
│   ├── img1.jpg
│   ├── img2.jpg
│   ├── img3.jpg
│   └── ...
└── nonbinding
    ├── img4.jpg
    ├── img5.jpg
    ├── img6.jpg
    └── ...
```

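Data Preparation Process

Loading Data

The script starts by loading the `peptidelist.txt` file from the PepBDB database. Column names are renamed for readability and convenience.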

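Initial Filtering

The script filters out:

- Entries involving nucleic acids.
- Models with a resolution above 2.5 Å, to ensure quality.
- Peptides shorter than 10 amino acids.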
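
The loading and filtering steps can be sketched roughly as follows. The record layout used here (six whitespace-separated fields) and the field names are hypothetical and chosen only for illustration; the real `peptidelist.txt` has its own column order, and the PDB IDs below are made up. Only the three filter criteria come from the description above.

```python
from dataclasses import dataclass

MAX_RESOLUTION = 2.5      # Å; models with a larger value are discarded
MIN_PEPTIDE_LENGTH = 10   # peptides shorter than this are discarded


@dataclass
class Entry:
    # Hypothetical, readable field names standing in for the renamed columns.
    pdb_id: str
    peptide_chain: str
    peptide_length: int
    protein_chain: str
    resolution: float
    has_nucleic_acid: bool


def parse_line(line):
    """Parse one whitespace-separated record (hypothetical column order)."""
    pdb_id, pep_chain, pep_len, prot_chain, resolution, nucleic = line.split()
    return Entry(pdb_id, pep_chain, int(pep_len), prot_chain,
                 float(resolution), nucleic == "1")


def keep(entry):
    """Apply the three filters listed above."""
    return (not entry.has_nucleic_acid
            and entry.resolution <= MAX_RESOLUTION
            and entry.peptide_length >= MIN_PEPTIDE_LENGTH)


lines = [
    "1abc A 12 B 1.90 0",   # kept
    "2def C 8 D 2.00 0",    # dropped: peptide too short
    "3ghi E 15 F 3.10 0",   # dropped: resolution above 2.5 Å
    "4jkl G 11 H 2.20 1",   # dropped: involves nucleic acid
]
kept = [e for e in map(parse_line, lines) if keep(e)]
print([e.pdb_id for e in kept])
```

Boundary values are kept in this sketch: a resolution of exactly 2.5 Å and a peptide of exactly 10 residues both pass, matching the wording "above 2.5 Å" and "shorter than 10".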

Sequence Extraction

Sequences are extracted from the PDB files using BioPython. Sequences containing non-standard amino acids are also filtered out.
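
One possible way to do this with BioPython is sketched below; it is not necessarily how `gendata.py` does it. The function names are hypothetical. `chain_sequence` needs Biopython installed and imports it lazily, so the pure-Python check `is_standard_sequence` can be used on its own.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_standard_sequence(sequence):
    """True if the sequence is non-empty and uses only the 20 standard residues."""
    return bool(sequence) and set(sequence) <= STANDARD_AMINO_ACIDS


def chain_sequence(pdb_path, chain_id):
    """Return the one-letter sequence of one chain, or None if it is non-standard.

    Requires Biopython; imported here so the check above works without it.
    """
    from Bio.PDB import PDBParser
    from Bio.SeqUtils import seq1

    structure = PDBParser(QUIET=True).get_structure("model", pdb_path)
    chain = structure[0][chain_id]  # first model, requested chain
    # Skip water molecules; seq1 maps unrecognized three-letter codes to "X",
    # which the standard-residue check below rejects.
    sequence = "".join(
        seq1(residue.get_resname())
        for residue in chain
        if residue.id[0] != "W"
    )
    return sequence if is_standard_sequence(sequence) else None


print(is_standard_sequence("LKD"), is_standard_sequence("LKX"))
```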

Binding Residue Identification

Binding residues are identified using PRODIGY (default parameters).
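
To give an intuition for what a contact-based binding label looks like, here is a simplified stand-in that does not call PRODIGY: a protein residue is labeled 1 if any of its atoms lies within a distance cutoff of any peptide atom. The 5.5 Å cutoff and the input layout are assumptions for this sketch; PRODIGY's own criteria and defaults should be taken from its documentation.

```python
import math

CUTOFF = 5.5  # Å; assumed contact distance for this sketch, not read from PRODIGY


def binding_labels(protein_residues, peptide_atoms, cutoff=CUTOFF):
    """Label each protein residue 1 if any of its atoms lies within `cutoff`
    of any peptide atom, else 0.

    protein_residues: list of residues, each a list of (x, y, z) coordinates
    peptide_atoms:    flat list of (x, y, z) coordinates
    """
    labels = []
    for atoms in protein_residues:
        in_contact = any(
            math.dist(a, p) <= cutoff for a in atoms for p in peptide_atoms
        )
        labels.append(1 if in_contact else 0)
    return labels


# Toy coordinates: the first residue sits next to the peptide, the second is far away.
protein = [
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    [(20.0, 0.0, 0.0), (21.5, 0.0, 0.0)],
]
peptide = [(4.0, 0.0, 0.0), (5.5, 0.0, 0.0)]
print(binding_labels(protein, peptide))  # [1, 0]
```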

Feature Extraction

Residue-specific features are extracted using AAindex1.
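
A per-residue index lookup with min-max scaling could look like the sketch below. The Kyte-Doolittle hydropathy scale (AAindex1 entry KYTJ820101) is used purely as an example; the source does not say which AAindex1 entries `gendata.py` uses, and the scaling to [0, 1] is an assumption inferred from the value range in the sample table, so the numbers here will not match that table.

```python
# Kyte-Doolittle hydropathy (AAindex1 entry KYTJ820101), used only as an example.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def min_max_scale(index):
    """Rescale an amino-acid index so its smallest value is 0 and its largest is 1."""
    lo, hi = min(index.values()), max(index.values())
    return {aa: (value - lo) / (hi - lo) for aa, value in index.items()}


def featurize(sequence, scaled_index):
    """Map each residue of a sequence to its scaled index value."""
    return [scaled_index[aa] for aa in sequence]


scaled = min_max_scale(KYTE_DOOLITTLE)
print(featurize("LKD", scaled))
```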

Data Enrichment

Additional biochemical features are added, including HSE, ASA, DSSP codes, and PSSM profiles.

Running the Script

To run the script, simply execute:

```bash
tar -xzf pepbdb-20200318.tgz
python gendata.py
```

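`gendata.py` can also generate images similar to those of the Visual dataset. To enable this option, set the `--images` flag to true and specify the full paths for the binding and non-binding images: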

```bash
python gendata.py --images True --binding_path path/to/binding --nonbinding_path path/to/nonbinding
```
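
For readers wondering how a `--images True` style flag can be handled, the following is a hypothetical stand-in for the command-line interface, not the actual parsing code of `gendata.py`. Only the three flag names come from the command above; the helper names, defaults, and the rule that both paths are required when images are enabled are assumptions.

```python
import argparse


def str_to_bool(value):
    """Parse 'True'/'false'/'1'/'0' style strings.

    argparse's type=bool would treat any non-empty string, including
    'False', as True, so an explicit converter is used instead.
    """
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Hypothetical stand-in for the gendata.py CLI"
    )
    parser.add_argument("--images", type=str_to_bool, default=False)
    parser.add_argument("--binding_path")
    parser.add_argument("--nonbinding_path")
    args = parser.parse_args(argv)
    # Assumed rule: image generation needs both output directories.
    if args.images and not (args.binding_path and args.nonbinding_path):
        parser.error("--images True requires --binding_path and --nonbinding_path")
    return args


args = parse_args(["--images", "True",
                   "--binding_path", "path/to/binding",
                   "--nonbinding_path", "path/to/nonbinding"])
print(args.images, args.binding_path, args.nonbinding_path)
```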

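Important: remember to modify `paths.py` with the paths specific to your system.

Make sure you have the necessary input files and directories, as specified in the script.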

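Notes

The image archive `peppi_data_imgs.tgz` and the tabular dataset `peppi_data.csv.gz` do not correspond 1-1, and the CSV is not a label file for the images. Although they are built from the same data, they do not contain the same number of records.
- `peppi_data.csv` contains 811,830 records
  - Binding: 110,268
  - Non-binding: 701,562
- `peppi_data_imgs` contains 806,129 images
  - Binding: 109,880
  - Non-binding: 696,249
This is because some rows (residues) in `peppi_data.csv` have NaN values. These rows are dropped individually before the CSV is exported. However, the same faulty row/residue can appear in multiple images (since each image represents seven consecutive residues). To preserve usability, every image containing such a residue is removed.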
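
The effect can be illustrated with a small sketch. It assumes that images are built from a window of seven consecutive residues that slides one residue at a time along a single chain, with no padding at the ends; the source only states the window size, so the stride and the edge handling are assumptions.

```python
WINDOW = 7  # residues per image, as stated above


def surviving_counts(has_nan, window=WINDOW):
    """Return (rows kept, images kept) for one chain.

    has_nan[i] is True if residue i has a NaN feature. Rows are dropped one
    by one; an image (a window of consecutive residues) is dropped if any
    residue inside it has a NaN.
    """
    rows_kept = sum(not flag for flag in has_nan)
    images_kept = sum(
        not any(has_nan[start:start + window])
        for start in range(len(has_nan) - window + 1)
    )
    return rows_kept, images_kept


# 20 residues with a single bad residue in the middle (index 10).
flags = [False] * 20
flags[10] = True
print(surviving_counts(flags))  # (19, 7)
```

In this toy chain a single NaN residue removes one CSV row but seven of the fourteen possible images, which is why the two datasets end up with different record counts.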
