--- title: "Create Illumina Sample Sheet and metadata" author: "Ramon Gallego" date: "11/8/2019" output: html_document params: Assay: "Nextera XT" Index_Adapters: "Nextera XT Index Kit (96 Indexes 384 Samples)" Cycles_per_pairend: 301 Set: 1 input.metadata: value: ../data_sub/metadata.csv input: file date: !r Sys.Date() output_dir: ../data_sub/ --- ## How to use this To run this Rmarkdown, open it in Rstudio and from the knit drop-down menu, choose Knit with parameters ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) library (tidyverse) library (here) library (kableExtra) ``` ## What does this do It creates the Illumina Sample sheet and the `metadata.csv` file that you need for running the pipeline. You need to double check the output - weird things happen all the time. ## Required input You need an input file with at least the following columns: - Sample_name: Avoid spaces - for compatibility with MiSeq's sample sheet, use hyphen "-" instead of underscores "_" or full stops "." - Well: in the 96-well plate used for the Nextera reaction. - PrimerF: using IUPAC wildcard character - PrimerR: ditto - Locus: Name of the locus, again without spaces. - i7_Index_Name: Adapter used to identify samples. Format is N7XX - i5_Index_Name: Adapter used to identify samples. Format is N5XX ## Load the Illumina adapters ```{r load nextera} Nextera_Adapters_i7 <- read_csv(here("data_sub","Nextera_adapters_i7.csv")) Nextera_Adapters_i5 <- read_csv(here("data_sub","Nextera_adapters_i5.csv")) ``` ## Loading the metadata you want to sequence ```{r load metadata} init.metadata <- read_csv(params$input.metadata) init.metadata ``` ### Check that the metadata has all needed fields ```{r Check metadata} init.metadata %>% rownames_to_column("Sample_number") %>% mutate_all(as.character) %>% pivot_longer(-Sample_number, names_to = "Variable", values_to = "Value") %>% summarise(Sample_name = case_when(sum(str_detect(Variable, "Sample")) > 0 ~ "Present", TRUE ~ "Absent"), Well = case_when(sum(str_detect(Variable, "Well")) > 0 ~ "Present", TRUE ~ "Absent"), PrimerF = case_when(sum(str_detect(Variable, "PrimerF")) > 0 ~ "Present", TRUE ~ "Absent"), PrimerR = case_when(sum(str_detect(Variable, "PrimerR")) > 0 ~ "Present", TRUE ~ "Absent"), i5_Adapter = case_when(sum(str_detect(Variable, "i5_Index_Name")) > 0 ~ "Present", TRUE ~ "Absent"), i7_Adapter = case_when(sum(str_detect(Variable, "i7_Index_Name")) > 0 ~ "Present", TRUE ~ "Absent")) -> Checks if(sum(str_detect(Checks,"Absent")) > 0){ knitr::knit_exit(append = "## ERROR: Initial metadata is missing some of the key column names: - Sample_name - Well - PrimerF - PrimerR - Plate - i5_Index_Name - i7_Index_Name ### Change the column names / add the infromation to your csv file")} ``` The i5 and i7 indices can start by either N, S, or H. But the sequence is the same. Let's make sure that they are compatible with the Nextera Indices file ```{r} init.metadata %>% mutate(i5_Index_Name = str_replace(i5_Index_Name, "^[EHNS]", "N"), i7_Index_Name = str_replace(i7_Index_Name, "^[HN]", "N")) -> init.metadata ``` Make the check that all the indices are present ```{r} case_when(nrow(anti_join(init.metadata, Nextera_Adapters_i5)) + nrow(anti_join(init.metadata, Nextera_Adapters_i7)) == 0 ~ "All indices present in the dataset", # == 0 ~ "Pass i7", TRUE ~ "Either some of the i5 or the i7 indices are not present in the Nextera Indices list - That will result in NAs in the final Sample Sheet") ``` ## Load the Illumina SampleSheet template ```{r Loading SampleSheet} template.sample.sheet <- read_lines(here("data_sub","SampleSheet.csv")) ``` Locate all the lines in which the new parameters are going to be written ```{r locate terms} date.row <- str_which(template.sample.sheet, "^\\Date") Assay.row <- str_which(template.sample.sheet, "^Assay") Index.row <- str_which(template.sample.sheet, "^Index Adapters") Reads.row <- str_which(template.sample.sheet, "^\\[Reads]") data.start <- str_which(template.sample.sheet, "^\\[Data]") data.header <- str_which(template.sample.sheet, "^Sample_ID") ``` ## Merge metadata with illumina adapters ```{r merging} init.metadata %>% left_join(Nextera_Adapters_i7) %>% left_join(Nextera_Adapters_i5) -> metadata ``` ## Fill Illumina SampleSheet In the case where there are many loci represented on each sample, usually you have them in the same well. If this is the case, your Illumina samplesheet will have fewer entries (rows) than your metadata sheet. In the Illumina sampleSheet, there are two parts: one that refers to the whole plate, and one that specificies the samples run. ### Parameters that refer to the whole sequencing run These include date of analysis, ```{r} template.sample.sheet[date.row] <- paste0("Date,",params$date) template.sample.sheet[Assay.row] <- paste0("Assay," ,params$Assay) template.sample.sheet[Index.row] <- paste0("Index Adapters,",params$Index_Adapters) template.sample.sheet[Reads.row +1 ] <- template.sample.sheet[Reads.row + 2] <- params$Cycles_per_pairend ``` Then with the rest of the dataset - collapsing first by primers ```{r collapse by primer} metadata %>% group_by( i5_Index_Name,i7_Index_Name ) %>% tally() -> occurrence.of.combos if(max(occurrence.of.combos$n > 1 )){ print("There are combinations of barcodes assigned to more than 1 sample - Please Check they have different PCR primers so you can separate them after") metadata %>% group_by(i5_Index_Name,i7_Index_Name ) %>% slice(1) -> metadata.for.sample.sheet print ("Keeping the first occurence of each combo for your samplesheet")}else{metadata.for.sample.sheet <- metadata} ``` Now write the SampleSheet to a file ```{r, echo = T} sample.data <- tibble (Sample_ID = metadata.for.sample.sheet$Sample_name, Sample_Plate = params$Set, Sample_Well = metadata.for.sample.sheet$Well, I7_Index_ID = metadata.for.sample.sheet$i7_Index_Name, index = metadata.for.sample.sheet$Bases_for_Sample_Sheet_i7, I5_Index_ID = metadata.for.sample.sheet$i5_Index_Name, index2 = metadata.for.sample.sheet$Bases_for_Sample_Sheet_i5, Sample_Project = "", Description = "") nsamples <- nrow(sample.data) template.sample.sheet <- template.sample.sheet[1:data.header] for (i in 1:nsamples){ template.sample.sheet[i+data.header] <- paste(sample.data[i,], collapse = ",") } write_lines(template.sample.sheet, file.path(params$output_dir,paste0( "SampleSheet_",Sys.Date(),".csv")), sep = "\r\n") ``` ## Fill in metadata We are going to assume that the Illumina always returns files with the format `Sample_name` `Sample_number` `_L001_R` [1-2] '_001.fastq`. So let's fill the metadata accordingly. ### Step1 Get the filenames ```{r} sample.data %>% mutate(Sample_name = str_replace_all(Sample_ID,pattern = "[\\.|_]",replacement = "-")) %>% rownames_to_column("Sample_number") %>% mutate(file1 = paste0(Sample_name, "_S", Sample_number, "_L001_R1_001.fastq"), file2 = paste0(Sample_name, "_S", Sample_number, "_L001_R2_001.fastq")) %>% select(i7_Index_Name = I7_Index_ID, i5_Index_Name = I5_Index_ID, file1, file2) -> filenames ``` ```{r, echo=F} metadata %>% left_join(filenames) %>% select(Sample_name, file1, file2, i7_Index_Name, i5_Index_Name, Well, PrimerF, PrimerR, Locus) -> metadata ``` ### Check that primer sequences do not include non-IUPAC characters ```{r NON-IUPAC characters} IUAPAC.char <- read_csv(here("data_sub","IUAPAC.csv")) IUPAC.char.1 <- paste(IUAPAC.char$Nucleotide.symbol, collapse = "") IUPAC.char.1 <- paste0("[^",IUPAC.char.1,"]") allprimers = paste0(metadata$PrimerR, metadata$PrimerF, collapse = "") case_when(str_count(allprimers, IUPAC.char.1) > 0 ~ print("There are non-IUAPAC characters in your primers, changing them to N")) metadata %>% mutate_at(.vars = vars(starts_with("Primer")), function(x)str_replace_all(x, IUPAC.char.1, "N")) -> metadata ``` ```{r} write_csv(metadata, file.path(params$output_dir,paste0( "metadata_",Sys.Date(),".csv"))) kable(metadata, align= "c", format = "html") %>% kable_styling(bootstrap_options= "striped", fixed_thead = T,full_width = T, position = "center") %>% column_spec(2, bold=T) %>% scroll_box(width = "1200px", height = "600px") ```