背景
当联合甲基化和转录组数据时,有不少文献直接用交集求出基因。但还有个更科学的方法——MethlMix包通过Beta混合模型(beta mixture model)识别低/高甲基化状态,并通过约对应基因的表达水平的相关性分析从而找到疾病中DNA甲基化驱动的基因。
DNA 甲基化是一种将甲基添加到 CpG 位点的机制。这些 CpG 位点的甲基化与基因沉默有关,是正常组织发育的重要机制,通常与癌症等疾病有关。最近已经产生了许多高通量数据,在全基因组基础上分析 CpG 位点甲基化。这为许多疾病创造了大量关于 DNA 甲基化的数据。需要对 DNA 甲基化数据进行计算分析,以识别与正常组织相比潜在的异常 DNA 甲基化。
MethylMix (Gevaert 2015; Gevaert, Tibshirani & Plevritis 2015) 的开发是为了使用计算方法解决这个问题。MethylMix 通过使用 Beta 混合模型来识别与正常组织相比具有不同 DNA 甲基化的样本亚群,从而识别差异和功能性 DNA 甲基化。功能性 DNA 甲基化是指基于匹配的基因表达数据的显着负相关。
软件发表和说明书
- 文章发表在Bioinformatics,Published: 14 April 2018,标题为:MethylMix 2.0: an R package for identifying DNA methylation genes
- 使用说明书
- bioconductor: MethylMix
- https://www.bioconductor.org/packages/release/bioc/vignettes/MethylMix/inst/doc/vignettes.html
- https://github.com/gevaertlab/MethylMix
示例流程代码复现
# 主题:MethylMix识别甲基化驱动基因------# Tools: MethylMix包# R包教程地址:https://www.bioconductor.org/packages/release/bioc/vignettes/MethylMix/inst/doc/vignettes.html# 运行前的一些参数设置--------------# rm(list = ls())options(stringsAsFactors = F)# 如果要想从TCGA下载目标癌症的数据,下载时间可能会非常久# 那么可能需要将反应时间延长# To retrieve an optiongetOption('timeout')# To set an optionoptions(timeout=10000)# 安装R包-------------# if (!requireNamespace("BiocManager", quietly=TRUE))# install.packages("BiocManager")# BiocManager::install("MethylMix")library("MethylMix")# 多核并行,不推荐单核运行,因为花的时间太久了library(doParallel)cl <- makeCluster(15)registerDoParallel(cl)# 获取TCGA的目标癌症数据的27K/450K甲基化和基因表达数据(GetData)---------# 方法一:GetData函数一次性获取和预处理不同的组学数据# 注意查看这个函数做的内容,此函数耗时非常非常非常久,嫌时间长可以直接使用测试数据help(GetData)cancerSite <- "LIHC"targetDirectory <- "/sdc1/home/ldl/lzw/project/software_train/DNAmeth_tools/MethylMix/"GetData(cancerSite, targetDirectory)stopCluster(cl)# 方法二:单独的函数获取不同的组学数据cancerSite <- "LIHC"home="/sdc1/home/ldl/lzw/project/software_train/DNAmeth_tools/MethylMix/21211231_test/"targetDirectory <- paste0(home, "/")# Downloading methylation dataMETdirectories <- Download_DNAmethylation(cancerSite, targetDirectory)# Processing methylation dataMETProcessedData <- Preprocess_DNAmethylation(cancerSite, METdirectories)# Saving methylation processed datasaveRDS(METProcessedData, file = paste0(targetDirectory, "MET_", cancerSite, "_Processed.rds"))# Downloading gene expression dataGEdirectories <- Download_GeneExpression(cancerSite, targetDirectory)# Processing gene expression dataGEProcessedData <- Preprocess_GeneExpression(cancerSite, GEdirectories)# Saving gene expression processed datasaveRDS(GEProcessedData, file = paste0(targetDirectory, "GE_", cancerSite, "_Processed.rds"))# Clustering probes to genes methylation dataMETProcessedData <- readRDS(paste0(targetDirectory, "MET_", cancerSite, "_Processed.rds"))res <- ClusterProbes(METProcessedData[[1]], METProcessedData[[2]])# Putting everything together in one filetoSave <- list(METcancer = res[[1]], METnormal = res[[2]], GEcancer = GEProcessedData[[1]],GEnormal = GEProcessedData[[2]], ProbeMapping = res$ProbeMapping)saveRDS(toSave, file = paste0(targetDirectory, "data_", cancerSite, ".rds"))stopCluster(cl)# 4. Data input for MethylMix------library(MethylMix)data(METcancer)data(METnormal)data(GEcancer)dim(METcancer);head(METcancer[, 1:4]) #14 251dim(METnormal);head(METnormal) #14 4dim(GEcancer);head(GEcancer[, 1:4]) #14 251# 5. Running MethylMix---------MethylMixResults <- MethylMix(METcancer, GEcancer, METnormal)# MethylMixResults是最重要的计算结果,它存储以下6个对象,分别为:# 1. MethylationDrivers:得到driver gene(差异甲基化,转录调控)# 2. NrComponents: 每个driver gene的对应的甲基化状态;# 3. MixtureStates: 每个driver—gene的差异甲基化值(即肿瘤样本与正常样本之间的平均甲基化的差值);# 4. MethylationStates: 以样本为列,driver gene为行的差异甲基化值的矩阵;# 5. Classifications: Matrix with integers indicating to which mixture component each cancer sample was assigned to, for each gene# 6. Models:每个driver gene的beta mixture model参数;# 简单查看这些对象MethylMixResults$MethylationDriverslength(MethylMixResults$NrComponents)MethylMixResults$NrComponentsMethylMixResults$MixtureStates# 6. Graphical output---------------# Plot the most famous methylated gene for glioblastomaplots <- MethylMix_PlotModel("MGMT", MethylMixResults, METcancer)plots$MixtureModelPlot# Plot MGMT also with its normal methylation variationplots <- MethylMix_PlotModel("MGMT", MethylMixResults, METcancer, METnormal = METnormal)plots$MixtureModelPlot# Plot a MethylMix model for another geneplots <- MethylMix_PlotModel("ZNF217", MethylMixResults, METcancer, METnormal = METnormal)plots$MixtureModelPlot# Also plot the inverse correlation with gene expression (creates two separate# plots)plots <- MethylMix_PlotModel("MGMT", MethylMixResults, METcancer, GEcancer, METnormal)plots$MixtureModelPlotplots$CorrelationPlot# Plot all functional and differential genespdf("demo_MethylMix_MixtureModelPlot.pdf")for (gene in MethylMixResults$MethylationDrivers) {plots <- MethylMix_PlotModel(gene, MethylMixResults, METcancer, METnormal = METnormal)plots$MixtureModelPlotplots$CorrelationPlot}dev.off()
分步:1、获取TCGA上目标癌种的甲基化芯片和基因表达数据
注意:下载数据的环境配置
问题描述:R语言默认的timeout时间为60 seconds,这么短的时间如果要下载大文件,经常容易因为下载期间长期没反馈导致time out报错,如:Timeout of 60 seconds was reached
解决办法:设置timeout时间更久。
方法一:使用GetData函数一次性获取转录组和DNA甲基化芯片的数据
函数:GetData
注意查看这个函数做的内容(help(GetData)),如果下载的TCGA癌症类型为样本数目较多的癌症(癌症缩略词见附表),此函数耗时非常非常非常久,久到四五天也没有结果,放后台运行了至少十天左右,最终获得可以用于下游分析的文件!嫌时间长可以直接使用测试数据。
#!/path/to/Rscript# 主题:MethylMix识别甲基化驱动基因------# Tools: MethylMix包# R包教程地址:https://www.bioconductor.org/packages/release/bioc/vignettes/MethylMix/inst/doc/vignettes.html# 安装R包-------------# rm(list = ls())options(stringsAsFactors = F)# if (!requireNamespace("BiocManager", quietly=TRUE))# install.packages("BiocManager")# BiocManager::install(c("MethylMix", "doParallel"))library(MethylMix)# 获取TCGA的27K/450K甲基化和基因表达数据(GetData)---------# 单核运行(不推荐)# cancerSite <- "LIHC"# targetDirectory <- paste0(getwd(), "/")# GetData(cancerSite, targetDirectory)# 多核并行library(doParallel)cl <- makeCluster(5)registerDoParallel(cl)cancerSite <- "LIHC"targetDirectory <- paste0(getwd(), "/")GetData(cancerSite, targetDirectory)stopCluster(cl)
方法二:使用MethylMix包的其它函数单独下载不同的组学数据
代码如下:
cancerSite <- "OV"targetDirectory <- paste0(getwd(), "/")cl <- makeCluster(5)registerDoParallel(cl)# Downloading methylation dataMETdirectories <- Download_DNAmethylation(cancerSite, targetDirectory)# Processing methylation dataMETProcessedData <- Preprocess_DNAmethylation(cancerSite, METdirectories)# Saving methylation processed datasaveRDS(METProcessedData, file = paste0(targetDirectory, "MET_", cancerSite, "_Processed.rds"))# Downloading gene expression dataGEdirectories <- Download_GeneExpression(cancerSite, targetDirectory)# Processing gene expression dataGEProcessedData <- Preprocess_GeneExpression(cancerSite, GEdirectories)# Saving gene expression processed datasaveRDS(GEProcessedData, file = paste0(targetDirectory, "GE_", cancerSite, "_Processed.rds"))# Clustering probes to genes methylation dataMETProcessedData <- readRDS(paste0(targetDirectory, "MET_", cancerSite, "_Processed.rds"))res <- ClusterProbes(METProcessedData[[1]], METProcessedData[[2]])# Putting everything together in one filetoSave <- list(METcancer = res[[1]], METnormal = res[[2]], GEcancer = GEProcessedData[[1]],GEnormal = GEProcessedData[[2]], ProbeMapping = res$ProbeMapping)saveRDS(toSave, file = paste0(targetDirectory, "data_", cancerSite, ".rds"))stopCluster(cl)
网上教程
- MethylMix识别甲基化驱动基因
- 太猛了!2021年这个套路发了134篇5+生信文章,还不快学!
- 使用MethylMix包识别甲基化驱动的癌症基因,下面这个数据挖掘路线图就是来自文章:Methylation-Driven Genes Identified as Novel Prognostic Indicators for Thyroid Carcinoma

附录
TCGA 癌症研究类型和Codes
TCGA / 癌症简称 / 缩写 / TCGA癌症中英文对照,链接:TCGA Study Abbreviations
| Study Abbreviation | Study Name | 中文全称 |
|---|---|---|
| ACC | Adrenocortical carcinoma | 肾上腺皮质癌 |
| BLCA | Bladder Urothelial Carcinoma | 膀胱尿路上皮癌 |
| BRCA | Breast invasive carcinoma | 乳腺浸润癌 |
| CESC | Cervical squamous cell carcinoma and endocervical adenocarcinoma | 宫颈鳞癌和腺癌 |
| CHOL | Cholangiocarcinoma | 胆管癌 |
| COAD | Colon adenocarcinoma | 结肠癌 |
| COADREAD | Colon adenocarcinoma/Rectum adenocarcinoma Esophageal carcinoma | 结直肠癌 |
| DLBC | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma | 弥漫性大B细胞淋巴瘤 |
| ESCA | Esophageal carcinoma | 食管癌 |
| FPPP | FFPE Pilot Phase II | FFPE试点二期 |
| GBM | Glioblastoma multiforme | 多形成性胶质细胞瘤 |
| GBMLGG | Glioma | 胶质瘤 |
| HNSC | Head and Neck squamous cell carcinoma | 头颈鳞状细胞癌 |
| KICH | Kidney Chromophobe | 肾嫌色细胞癌 |
| KIPAN | Pan-kidney cohort (KICH+KIRC+KIRP) | 混合肾癌 |
| KIRC | Kidney renal clear cell carcinoma | 肾透明细胞癌 |
| KIRP | Kidney renal papillary cell carcinoma | 肾乳头状细胞癌 |
| LAML | Acute Myeloid Leukemia | 急性髓细胞样白血病 |
| LGG | Brain Lower Grade Glioma | 脑低级别胶质瘤 |
| LIHC | Liver hepatocellular carcinoma | 肝细胞肝癌 |
| LUAD | Lung adenocarcinoma | 肺腺癌 |
| LUSC | Lung squamous cell carcinoma | 肺鳞癌 |
| MESO | Mesothelioma | 间皮瘤 |
| OV | Ovarian serous cystadenocarcinoma | 卵巢浆液性囊腺癌 |
| PAAD | Pancreatic adenocarcinoma | 胰腺癌 |
| PCPG | Pheochromocytoma and Paraganglioma | 嗜铬细胞瘤和副神经节瘤 |
| PRAD | Prostate adenocarcinoma | 前列腺癌 |
| READ | Rectum adenocarcinoma | 直肠腺癌 |
| SARC | Sarcoma | 肉瘤 |
| SKCM | Skin Cutaneous Melanoma | 皮肤黑色素瘤 |
| STAD | Stomach adenocarcinoma | 胃癌 |
| STES | Stomach and Esophageal carcinoma | 胃和食管癌 |
| TGCT | Testicular Germ Cell Tumors | 睾丸癌 |
| THCA | Thyroid carcinoma | 甲状腺癌 |
| THYM | Thymoma | 胸腺癌 |
| UCEC | Uterine Corpus Endometrial Carcinoma | 子宫内膜癌 |
| UCS | Uterine Carcinosarcoma | 子宫肉瘤 |
| UVM | Uveal Melanoma | 葡萄膜黑色素瘤 |
