---
title: "Learning awk for GNM569"
author: "Kathleen Durkin"
date: "2024-03-27"
categories: ["misc"]
format:
  html:
    toc: true
engine: knitr
---

This quarter I'm taking GENOME569, which covers developing bioinformatic workflows for high-throughput sequencing. Today we briefly covered some of the primaries of Unix, and I realized I've never learned awk or written awk commands fro scratch! These are some awk practice problems from the class to hopefully help me get up to speed.\

## Practice file:

grep_sed_example1.txt

```         
Experiment notes:
We are using the hg19 genome build
The cell line is A549
A549 cells have a KRAS G12C mutation
Another cell line is MCF7
```

## grep

### Find the lines that mention "cell":

```{bash, eval=FALSE}
$ grep "cell" grep_sed_example1.txt
```

```         
The cell line is A549
A549 cells have a KRAS G12C mutation
Another cell line is MCF7
```

"cell": This is the regular expression pattern to match. It simply looks for the string "cell" within each line of the file.

### Find the lines that talk about "A549":

```{bash, eval=FALSE}
$ grep "A549" grep_sed_example1.txt
```

```         
The cell line is A549
A549 cells have a KRAS G12C mutation
```

"A549": This pattern matches the string "A549" within each line of the file.

### Find the lines that talk about either A549 or MCF7:

```{bash, eval=FALSE}
$ grep -E "A549|MCF7" grep_sed_example1.txt
```

```         
The cell line is A549
A549 cells have a KRAS G12C mutation
Another cell line is MCF7
```

-E: Enables extended regular expressions, allowing the use of the \| (OR) operator.

"A549\|MCF7": This pattern matches lines containing either "A549" or "MCF7".

### Find the lines that end with a cell line id (i.e., A549 or MCF7):

```{bash, eval=FALSE}
$ grep -E "A549$|MCF7$" grep_sed_example1.txt
```

```         
Another cell line is MCF7
```

-E: Enables extended regular expressions, allowing the use of the \$ anchor to match the end of a line and the use of the \| (OR) operator

"A549\$\|MCF7\$": This pattern matches lines that end (\$) with either "A549" or "MCF7".

## sed

Note that sed will not make a permanent edit to the original file unless you specifically instruct it to do so using the -i option. Default behavior is to just output the modified text to standard output.

### Change instances of A549 to MCF7:

```{bash, eval=FALSE}
$ sed 's/A549/MCF7/g' grep_sed_example1.txt
```

```         
Experiment notes:
We are using the hg19 genome build
The cell line is MCF7
MCF7 cells have a KRAS G12C mutation
Another cell line is MCF7
```

s: This is the substitute command in sed, which is used to perform substitutions.

/A549/MCF7/: This is the substitution operation. It finds all occurrences of "A549" and replaces them with "MCF7".

g: This is the global flag, which tells sed to perform the substitution globally within each line, not just the first occurrence.

### Change only the second instance of A549 to A549_LUNG:

### Change instances of A549 to MCF7, but without stating "A549":

```{bash, eval=FALSE}
$ sed 's/[A-Z][0-9]\{3\}/MCF7/g' grep_sed_example1.txt
```

```         
Experiment notes:
We are using the hg19 genome build
The cell line is MCF7
MCF7 cells have a KRAS G12C mutation
Another cell line is MCF7
```

\[A-Z\]: Matches any uppercase letter from A to Z.

\[0-9\]: Matches any digit from 0 to 9.

\\{3\\}: Specifies that the previous pattern (digit) should occur exactly three times.

/MCF7/: Replaces the matched pattern with "MCF7".

g: This flag is used for global substitution, ensuring all occurrences within each line are replaced.

### How can I make multiple changes in a single line?

-e: This option allows you to specify multiple sed commands in one line. Each -e flag indicates the beginning of a new sed command.

Then you can just sequentially use the expressions you want to apply!