Mouse Proteomics: An R Pipeline Guide

by RICHARD

Unveiling Mouse Proteomics: A Comprehensive R Pipeline

Hey there, data enthusiasts! Ready to dive into the exciting world of mouse proteomics? In this comprehensive guide, we'll walk through a complete R pipeline designed to analyze LFQ (Label-Free Quantification) proteomics data. Whether you're a seasoned bioinformatician or just starting, this pipeline provides a clear, step-by-step approach to data import, cleaning, analysis, and insightful visualization. We'll be using a variety of powerful R packages, from data wrangling with tidyverse to sophisticated statistical analysis with limma and functional enrichment with clusterProfiler. Get ready to unlock the secrets hidden within your proteomics data!

Setting Up Your Proteomics Analysis in R

First things first, let's get our environment ready. This section sets up the libraries and packages the rest of the pipeline depends on, handling both installation and loading. It's the foundation the entire analysis is built on, much like gathering all the ingredients before you start cooking: the tools for data manipulation, statistical analysis, and visualization all need to be in place before anything else can run.

Let's break down this setup phase. The code first checks whether the BiocManager package is installed and installs it if not, since BiocManager is the gateway to Bioconductor, the main repository for bioinformatics software. It then defines two vectors: cran_pkgs, listing packages available on CRAN (the Comprehensive R Archive Network), and bioc_pkgs, listing packages from Bioconductor. Two for loops walk through these vectors and install anything not already present on your system. Finally, a series of library() calls attaches the packages so they are available throughout the analysis. This systematic approach minimizes errors, keeps dependencies explicit, and is a cornerstone of reproducibility.

##### 1. Setup #####
if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager")
cran_pkgs <- c("readxl","tidyverse","openxlsx","pheatmap","ggrepel","ComplexHeatmap","circlize")
bioc_pkgs <- c("limma","clusterProfiler","org.Mm.eg.db","enrichplot","DOSE","pathview")
for(p in cran_pkgs) if(!require(p, character.only=TRUE)) install.packages(p)
for(p in bioc_pkgs) if(!require(p, character.only=TRUE)) BiocManager::install(p)
library(readxl); library(dplyr); library(tidyr); library(ggplot2)
library(limma); library(pheatmap); library(ggrepel)
library(clusterProfiler); library(org.Mm.eg.db); library(openxlsx)
library(ComplexHeatmap); library(circlize)
library(enrichplot); library(DOSE); library(pathview)
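
Once the loop finishes, it can be worth confirming that everything actually attached. The quick check below is not part of the original script, just a small illustrative addition for reproducibility:

# Optional sanity check (illustrative, not in the original pipeline):
# confirm every requested package is loaded and record versions
pkgs <- c(cran_pkgs, bioc_pkgs)
not_loaded <- pkgs[!pkgs %in% loadedNamespaces()]
if (length(not_loaded) > 0) warning("Not loaded: ", paste(not_loaded, collapse=", "))
sessionInfo()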

Importing and Cleaning Your Proteomics Data

Once the environment is set up, the next step is importing and cleaning the raw proteomics data. Here the code reads the raw output of your mass spectrometry experiment, typically an Excel file, and turns it into a tidy, usable dataset. Using read_excel from readxl together with filter and rename_with from dplyr, it loads the data, removes contaminants and reverse-database entries, and strips prefixes from the column names so they are easier to work with. Getting this step right is essential for the accuracy and reliability of everything downstream.

First, the read_excel function from the readxl package is used to import the data from an Excel file named "RAW data.xlsx", specifically from the sheet labeled "Raw data". Next, the code filters out any rows containing "CON" (contaminants) or "REV" (reverse database hits) in the Protein IDs column. This is a crucial step to remove non-protein entries that can skew results. Following the filtering, the code renames the columns. It uses rename_with and sub functions to remove the prefix "LFQ intensity " from column names that start with "LFQ intensity". This makes the column names cleaner and easier to read and reference. The final cleaned dataset is stored in the raw variable.

##### 2. Import & Clean #####
raw <- read_excel("RAW data.xlsx", sheet="Raw data")
raw <- raw %>% filter(!grepl("CON|REV", `Protein IDs`))
raw <- raw %>% rename_with(~sub("^LFQ intensity ", "", .x), starts_with("LFQ intensity"))
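
A quick look at the cleaned object helps confirm that the filtering and renaming behaved as expected. This check is illustrative and simply assumes the column layout described above:

# Illustrative sanity check: contaminant/reverse hits removed, LFQ prefix stripped
dim(raw)                                            # proteins x columns after filtering
grep("^LFQ intensity", colnames(raw), value=TRUE)   # should return character(0)
head(colnames(raw))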

Transforming and Filtering Proteomics Data

Following data import and cleaning, the next step is to reshape the data into a long format and apply filtering criteria. The long format, produced with pivot_longer, puts each measurement of a protein in a specific sample on its own row, which makes it easy to manipulate with tidyverse functions. The filtering step then removes proteins that are not consistently detected across samples, which improves the reliability of the downstream statistics and visualizations.

The code starts by converting the raw data into a long format using the pivot_longer function. It selects all columns except 'Protein IDs' and 'Gene names'. The names_to="Sample" and values_to="Intensity" arguments reshape the data, creating 'Sample' and 'Intensity' columns, respectively. The separate function then splits the 'Sample' column into 'Condition' and 'Replicate' based on the "-" separator. The remove = FALSE argument ensures the original column is retained. Finally, the code replaces any zero values in the 'Intensity' column with NA values, which is crucial for proper handling of missing data in subsequent analyses. After formatting the data, the code performs filtering to remove proteins with too many missing values. It groups the data by Protein IDs and calculates the number of valid intensity values for both control and AKI conditions. Proteins with fewer than two valid values in either condition are excluded. This filtering helps to ensure that the remaining data are of good quality and that the subsequent analyses are based on reliable information.

##### 3. Long Format #####
long <- raw %>%
  pivot_longer(-c(`Protein IDs`,`Gene names`), names_to="Sample", values_to="Intensity") %>%
  separate(Sample, into=c("Condition","Replicate"), sep="-", remove=FALSE) %>%
  mutate(Intensity=na_if(Intensity,0))

##### 4. Filter #####
counts <- long %>% group_by(`Protein IDs`) %>%
  summarize(nC=sum(Condition=="Control"&!is.na(Intensity)),
            nA=sum(Condition=="AKI"&!is.na(Intensity)))
keep <- counts %>% filter(nC>=2,nA>=2) %>% pull(`Protein IDs`)
long <- long %>% filter(`Protein IDs`%in%keep)
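
To see how much this filter trimmed, you can compare the number of proteins before and after. A minimal illustrative check:

# Illustrative: how many proteins survive the >=2-valid-values-per-group filter
nrow(counts)                        # proteins before filtering
length(keep)                        # proteins retained
n_distinct(long$`Protein IDs`)      # should match length(keep)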

Data Transformation and Matrix Preparation

Now it's time to log2-transform the intensities and reshape the data into a wide matrix for statistical analysis. Raw LFQ intensities are strongly right-skewed; the log2 transformation makes their distribution far better behaved for linear modeling. The data is then restructured into a wide format, with proteins as rows and samples as columns, which is the matrix layout expected by the normalization, PCA, and differential expression steps that follow.

The first part of this step applies a log2 transformation to the intensity values. This is a standard procedure in proteomics to normalize the data and reduce the impact of outliers. The mutate function is used to create a new column called Log2Int, where the log2 of the intensity values is calculated. Next, the code prepares the data for statistical analysis by creating a wide matrix. This is done using the pivot_wider function, which reshapes the data so that each protein is represented by a row, and each sample is represented by a column. The resulting mat object will be used for subsequent analysis such as normalization, PCA, and differential expression analysis.

##### 5. Log2 #####
long <- long %>% mutate(Log2Int=log2(Intensity))

##### 6. Wide Matrix #####
wide <- long %>% select(`Protein IDs`,`Gene names`,Sample,Log2Int) %>%
  pivot_wider(names_from=Sample,values_from=Log2Int)
mat <- as.matrix(wide %>% select(-c(`Protein IDs`,`Gene names`)))
rownames(mat) <- wide$`Protein IDs`
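
Before normalizing, it can help to look at the matrix dimensions and the amount of missing data per sample. This overview is an illustrative addition, not part of the original script:

# Illustrative overview of the expression matrix
dim(mat)                 # proteins x samples
colSums(is.na(mat))      # missing values per sample
summary(as.vector(mat))  # overall distribution of log2 intensities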

Normalization, Design, and Statistical Analysis with limma

With the data transformed and prepared, the next steps are normalization, the experimental design, and differential expression analysis with limma. Normalization adjusts for systematic differences between samples, the design matrix encodes which samples belong to which group, and limma then identifies proteins whose abundance changes significantly between the Control and AKI conditions.

The code first normalizes the data using the median centering method. The median of each column (sample) is calculated, and then each value in the matrix is adjusted by subtracting the corresponding median. This approach ensures that the data are comparable across all samples, which is important for accurate downstream analysis. After normalization, the code sets up the experimental design. It defines the samples and groups, creating a factor variable group that categorizes each sample as either "Control" or "AKI". The model.matrix function is then used to create the design matrix, which will be used in the limma analysis. This matrix represents the experimental design and allows limma to compare the different groups effectively. After that, the code performs differential expression analysis using the limma package. The lmFit function fits a linear model to the normalized data, considering the design matrix. The eBayes function then performs empirical Bayes shrinkage of the standard errors to improve the stability of the results. Finally, the topTable function extracts the top differentially expressed proteins, based on adjusted p-values, and joins this information with the gene names for easier interpretation. This step identifies the proteins that are significantly up- or down-regulated between the control and AKI conditions.

##### 7. Normalize #####
med <- apply(mat,2,median,na.rm=TRUE)
mat_norm <- sweep(mat,2,med,FUN="-")
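
Because the normalization is a simple median centering, a boxplot of the centered matrix is an easy way to verify it worked. The snippet below is an illustrative check rather than part of the original script:

# Illustrative check: after centering, every sample median should sit at ~0
boxplot(mat_norm, las=2, main="Log2 LFQ intensity after median centering")
round(apply(mat_norm, 2, median, na.rm=TRUE), 3)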

##### 8. Design #####
samples <- colnames(mat_norm)
group <- factor(ifelse(grepl("^Control",samples),"Control","AKI"),
                levels=c("Control","AKI"))
design <- model.matrix(~group)

##### 9. limma #####
fit <- lmFit(mat_norm,design); fit<-eBayes(fit)
res <- topTable(fit,coef="groupAKI",number=Inf,adjust.method="BH")
res <- res %>% rownames_to_column("Protein IDs") %>%
  left_join(wide %>% select(`Protein IDs`,`Gene names`), by="Protein IDs")
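
At this point it's common to summarize how many proteins pass a significance cutoff and to write the full table to disk. The sketch below is illustrative: the 0.05 cutoff and the output file name are arbitrary choices, not from the original script, and it uses the already-loaded openxlsx package:

# Illustrative summary and export of the limma results
sum(res$adj.P.Val < 0.05)                                       # number of significant proteins
table(Up = res$logFC > 0, Significant = res$adj.P.Val < 0.05)   # split by direction
write.xlsx(res, "limma_results.xlsx")                           # hypothetical file name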

Imputation and Principal Component Analysis (PCA)

Next, we tackle missing data with imputation and visualize the data with PCA (Principal Component Analysis). This is an important step, as proteomics data often contains missing values, which can affect the accuracy of the analysis. Imputation addresses these missing values, while PCA helps visualize the overall data structure and identify potential batch effects or outliers. Here’s how we deal with those missing values and get a good overview of the dataset.

The code starts by imputing missing values in the normalized matrix. A simple per-sample imputation is used: missing values are replaced with random draws from a normal distribution whose mean is shifted 1.8 standard deviations below the sample mean and whose width is 0.3 standard deviations, mimicking intensities below the detection limit. This down-shifted imputation is a common approach for left-censored missing data in proteomics. After imputation, the code performs PCA on the transposed, imputed matrix. PCA reduces the dimensionality of the data while preserving as much variance as possible, allowing the samples to be visualized in a lower-dimensional space. The prcomp function computes the PCA, and the variance explained by each principal component is derived from it. A PDF plot then shows the samples in the plane of the first two principal components, colored by group (Control or AKI), with a label for each sample and the variance captured by each PC in the axis titles. This makes it easy to assess the overall structure of the data and spot outliers or batch effects.

##### 10. Impute #####
set.seed(123); imp<-mat_norm
for(j in seq_len(ncol(imp))) {
  mu<-mean(imp[,j],na.rm=TRUE); sd<-sd(imp[,j],na.rm=TRUE)
  m<-mu-1.8*sd; s<-0.3*sd
  idx<-is.na(imp[,j]); imp[idx,j]<-rnorm(sum(idx),m,s)
}
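
Because this left-shifted imputation is meant to mimic values below the detection limit, it's worth comparing the imputed and observed distributions. An illustrative QC plot, not part of the original script:

# Illustrative QC: imputed values (blue) should form a narrow, left-shifted distribution
hist(mat_norm[!is.na(mat_norm)], breaks=50, col="grey80",
     main="Observed vs imputed intensities", xlab="Normalized log2 intensity")
hist(imp[is.na(mat_norm)], breaks=20, col="steelblue", add=TRUE)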

##### 11. PCA #####
pca<-prcomp(t(imp),center=TRUE,scale.=TRUE)
var<-pca$sdev^2/sum(pca$sdev^2)
pdf("PCA_plot.pdf")
plot_df<-data.frame(PC1=pca$x[,1],PC2=pca$x[,2],Sample=samples,Group=group)
ggplot(plot_df, aes(PC1, PC2, color=Group, label=Sample)) +
  geom_point(size=3) + geom_text_repel() +
  labs(x=sprintf("PC1 (%.1f%%)", 100*var[1]),
       y=sprintf("PC2 (%.1f%%)", 100*var[2])) +
  theme_minimal()
dev.off()
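
The setup loads pheatmap and ComplexHeatmap, although neither appears in the code recovered above. As an illustrative sketch, here is how pheatmap could display the most significant proteins using the imputed, normalized matrix (the choice of 25 proteins is arbitrary):

# Illustrative heatmap of the 25 most significant proteins (imputed, normalized data)
top_ids <- res %>% arrange(adj.P.Val) %>% slice_head(n=25) %>% pull(`Protein IDs`)
ann <- data.frame(Group=group, row.names=samples)
pheatmap(imp[top_ids, ], annotation_col=ann, scale="row",
         main="Top 25 differentially abundant proteins")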

Functional Enrichment Analysis: GO and KEGG

Let's perform GO (Gene Ontology) and KEGG (Kyoto Encyclopedia of Genes and Genomes) enrichment analysis to identify the biological pathways and processes associated with the differentially expressed proteins. These analyses provide insights into the functional significance of the changes observed. This is where we explore the functional roles of the identified proteins. The code uses clusterProfiler to perform GO enrichment and KEGG pathway analysis. It helps to find out the biological functions and pathways that are overrepresented in the dataset, providing context to the observed protein changes.

For GO enrichment, the enrichGO function is used to identify over-represented GO terms. It takes the gene symbols of the significant proteins (adjusted p < 0.05) as input, along with the organism database (org.Mm.eg.db) and the ontology to analyze (Biological Process, in this case). The results include enriched GO terms, their descriptions, gene counts, and adjusted p-values. Several visualizations then summarize the GO enrichment results: a heatplot showing the GO terms alongside the fold changes of their genes, a barplot of the top enriched terms, and Excel sheets listing the genes behind the top GO terms. For KEGG enrichment, the enrichKEGG function is used in the same spirit, but for KEGG pathways. It takes Entrez IDs as input, which are derived from the gene symbols using clusterProfiler's bitr function with org.Mm.eg.db. The resulting KEGG pathways, along with their descriptions, gene counts, and adjusted p-values, are visualized with a barplot, and Excel sheets listing the genes behind the top pathways are written as well. Together, these enrichment analyses provide valuable insight into the biological processes and pathways affected by the experimental conditions.

##### 12. GO Enrichment #####
ego<-enrichGO(gene=res %>% filter(adj.P.Val<0.05) %>% pull(`Gene names`),
              OrgDb=org.Mm.eg.db,keyType="SYMBOL",ont="BP",
              pAdjustMethod="BH",qvalueCutoff=0.05)
# Heatplot
fc_sym <- setNames(res$logFC, res$`Gene names`)
pdf("GO_heatplot.pdf", width=6, height=4)
heatplot(ego, showCategory=5, foldChange=fc_sym) + ggtitle("GO Heatplot")
dev.off()

# Barplot
library(forcats)
go_df <- ego@result %>% arrange(p.adjust) %>% slice_head(n=5) %>%
  mutate(Term=fct_reorder(Description, Count), negLog=-log10(p.adjust))
pdf("GO_barplot.pdf", width=6, height=4)
ggplot(go_df, aes(Count, Term, fill=negLog)) +
  geom_col(width=0.6) +
  scale_fill_gradient(low="lightsteelblue", high="steelblue4") +
  labs(x="Gene Count", fill=expression(-log[10](adj.p))) +
  theme_minimal() +
  theme(axis.text.y=element_text(face="italic"))
dev.off()

# Gene lists
go_lists <- ego@result %>% arrange(p.adjust) %>% slice_head(n=5) %>%
  transmute(ID, Description, geneID) %>%
  mutate(List=strsplit(geneID, "/")) %>%
  unnest(List)
wb_go <- createWorkbook()
for(id in unique(go_lists$ID)){
  df <- go_lists %>% filter(ID==id) %>% select(GeneSymbol=List)
  addWorksheet(wb_go, sheetName=gsub(":", "", id))
  writeData(wb_go, sheet=gsub(":", "", id), df)
}
saveWorkbook(wb_go, "GO_top5_gene_lists.xlsx", overwrite=TRUE)

##### 13. KEGG #####
ek <- enrichKEGG(gene=bitr(res$`Gene names`, fromType="SYMBOL",
                           toType="ENTREZID", OrgDb=org.Mm.eg.db)$ENTREZID,
                 organism="mmu", pAdjustMethod="BH", qvalueCutoff=0.05)
pdf("KEGG_barplot.pdf", width=6, height=4)
barplot(ek, showCategory=5) + ggtitle("KEGG Pathways")
dev.off()
kegg_lists <- ek@result %>% arrange(p.adjust) %>% slice_head(n=5) %>%
  transmute(ID, Pathway=Description, geneID) %>%
  mutate(List=strsplit(geneID, "/")) %>%
  unnest(List)
wb_kegg <- createWorkbook()
for(id in unique(kegg_lists$ID)){
  df <- kegg_lists %>% filter(ID==id) %>% select(EntrezID=List)
  addWorksheet(wb_kegg, sheetName=sub("mmu", "", id))
  writeData(wb_kegg, sheet=sub("mmu", "", id), df)
}
saveWorkbook(wb_kegg, "KEGG_top5_gene_lists.xlsx", overwrite=TRUE)
# End of script
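
The script ends here. One natural extension, not part of the recovered code even though the setup loads pathview, is to render a single enriched KEGG pathway with per-gene fold changes. The sketch below assumes ek contains at least one enriched pathway; the pathway pick and output suffix are purely illustrative:

# Illustrative extension: overlay log2 fold changes on the top enriched KEGG pathway
entrez_map <- bitr(res$`Gene names`, fromType="SYMBOL",
                   toType="ENTREZID", OrgDb=org.Mm.eg.db)
fc_entrez <- setNames(res$logFC[match(entrez_map$SYMBOL, res$`Gene names`)],
                      entrez_map$ENTREZID)
top_pathway <- ek@result$ID[1]    # e.g. "mmu04110"; purely illustrative
pathview(gene.data=fc_entrez, pathway.id=top_pathway,
         species="mmu", out.suffix="AKI_vs_Control")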

And that's it, folks! You've successfully navigated a comprehensive proteomics analysis pipeline in R. This pipeline provides a solid foundation for analyzing your own proteomics data, from import to insightful visualizations and functional enrichment analysis. Now you're equipped to dive deep into your own data, uncover key insights, and make impactful discoveries in the world of proteomics! Remember, this is a starting point, and you can always customize and extend this pipeline to fit your specific research needs. Happy analyzing!