---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
date: 2017-06-22 20:26:59+00:00
layout: post
slug: github-curation
title: GitHub Curation
categories:
  - 2017
  - Miscellaneous
tags:
  - GitHub
---

Updated a couple of GitHub Wikis:


* * *


Created a new repo in the [RobertsLab Organization GitHub account](https://github.com/RobertsLab) with a wiki to provide [an overview of how to use of Hyak (mox) computing node.](https://github.com/RobertsLab/hyak_mox/wiki) This was lightly modified from what [Sean already had in his personal repo.](https://github.com/seanb80/seanb80.github.io/wiki)


* * *


As a quick test, I updated all the md files in  the [sr320/LabDocs/code](https://github.com/sr320/LabDocs/tree/master/code) md files to format headers for GitHub's newest interpretation of headers. The headers (represented by a series of '#') require a space between them and the subsequent text.  I used the following command in bash:


    for i in *.md; do sed -i.bak 's/^#*/& /g' "$i"; done


The code works as follows:


  * Run for loop on all .md files in the directory


  * Use `sed` to edit the files in place: `-i.bak` (this command structure is needed for Mac OS X).


  * `'s/^#*/&amp; /g'`: Performs a substitution by identifying all lines beginning (^) with a pound symbol (#) and match zero or more occurrences of the pound symbol (*), then substituting the same pattern that was matched and adding a space at the end of the pattern (& ). Do this for all occurrences found within the document (g).


Since this worked, I'll probably run this through all of the md files in all of our various repos to quickly and easily fix header formatting issues.


* * *


Working on updating the [Genome-sequencing-December-2016-(UW-PacBio) wiki](https://github.com/RobertsLab/project-olympia.oyster-genomic/wiki/Genome-sequencing-December-2016-(UW-PacBio)), but need to work out the kinks on any easy, documentable way to rename and move some files around in order to make files/organization compliant with our [data management plan (DMP)(https://github.com/sr320/LabDocs/wiki/Data-Management#ngs-data-management-plan).


Current strategy:


  * Generate MD5 checksums for fastq files for each of the SMRT cell runs.


  * Copy file names from the the .xml file in the top level of each SMRT cell run folder to an array.


  * Use parameter substitution (in bash) to strip path and suffix from each index of the array (results likely stored in a secondary or tertiary array).


  * Use bash `find` command to copy the `filtered_subreads.fastq.gz` from each SMRT cell run folder, and append each of the corresponding stripped filenames in the final array to the beginning of the fastq file, to the `owl/nightingales/O_lurida` directory.


  * Generate new MD5 checksums on the copied files and compare to original MD5 checksums. This will confirm two things: 1 - The data did not get corrupted when copied. 2 - The new filenames correspond to the correct, original `filtered_subreads.fastq.gz` file (renaming a file doesn't alter the MD5 checksum).


  * Archive the original SMRT cell run folders (which contain a ton of metdata files)