01.63-lncRNA-pipeline-getFastas
================
Kathleen Durkin
2026-06-10

- [0.1 GTF/BED formatting problem](#01-gtfbed-formatting-problem)
- [0.2 Manual GTF and BED correction](#02-manual-gtf-and-bed-correction)
  - [0.2.1 Apul](#021-apul)
  - [0.2.2 Peve](#022-peve)
  - [0.2.3 Ptuh](#023-ptuh)
- [0.3 Convert corrected BED to
  FASTA](#03-convert-corrected-bed-to-fasta)
  - [0.3.1 Apul](#031-apul)
  - [0.3.2 Peve](#032-peve)
  - [0.3.3 Ptuh](#033-ptuh)

When Zach re-did the lncRNA pipeline there doesn’t seem to have been
.fasta output files produced (just the expression matrices, .bed, and
.gtf files). I need lncRNA fastas for all three species for other
analyses, so let’s generate those real quick using `bedtools`.

## 0.1 GTF/BED formatting problem

``` bash
/home/shared/bedtools2/bin/bedtools getfasta -fi ../../D-Apul/data/Apulchra-genome.fa -bed ../output/01.6-lncRNA-pipeline/lncRNA.gtf -fo ../output/01.63-lncRNA-pipeline-getFastas/Apul_lncRNA.fasta -fullHeader
```

aaaannnnd it turns out there are formatting issues with the gtf and bed
files. The gtf has several instances of a “-1” starting position (in gtf
format, which is 1-start, a 1 starting position would represent the
first nucleotide of a strand – -1 is impossible).

``` bash
awk '$4 == "-1"' ../output/01.6-lncRNA-pipeline/lncRNA.gtf
```

    ## ptg000011l   .   lncRNA  -1  3144    .   +   .   gene_id "lncRNA_28934";
    ## ptg000050l   .   lncRNA  -1  248 .   +   .   gene_id "lncRNA_76192";
    ## ptg000055l   .   lncRNA  -1  1570    .   +   .   gene_id "lncRNA_76255";
    ## ptg000072c   .   lncRNA  -1  272 .   +   .   gene_id "lncRNA_76756";

Let’s check the other two species’ lncRNA gtf files:

Peve

``` bash
awk '$4 == "-1"' ../output/01.61-lncRNA-pipeline/lncRNA.gtf | head
```

    ## Porites_evermani_scaffold_107    .   lncRNA  -1  3351    .   +   .   gene_id "lncRNA_1387";
    ## Porites_evermani_scaffold_1102   .   lncRNA  -1  3128    .   +   .   gene_id "lncRNA_2068";
    ## Porites_evermani_scaffold_1108   .   lncRNA  -1  4071    .   +   .   gene_id "lncRNA_2159";
    ## Porites_evermani_scaffold_1115   .   lncRNA  -1  836 .   +   .   gene_id "lncRNA_2282";
    ## Porites_evermani_scaffold_1122   .   lncRNA  -1  10971   .   +   .   gene_id "lncRNA_2404";
    ## Porites_evermani_scaffold_1247   .   lncRNA  -1  3359    .   +   .   gene_id "lncRNA_4347";
    ## Porites_evermani_scaffold_1308   .   lncRNA  -1  395 .   +   .   gene_id "lncRNA_5314";
    ## Porites_evermani_scaffold_1335   .   lncRNA  -1  14276   .   +   .   gene_id "lncRNA_5733";
    ## Porites_evermani_scaffold_1335   .   lncRNA  -1  35893   .   +   .   gene_id "lncRNA_5734";
    ## Porites_evermani_scaffold_1335   .   lncRNA  -1  35893   .   +   .   gene_id "lncRNA_5735";

Ptuh

``` bash
awk '$4 == "-1"' ../output/01.62-lncRNA-pipeline/lncRNA.gtf
```

    ## Pocillopora_meandrina_HIv1___xfSc0001122 .   lncRNA  -1  270 .   +   .   gene_id "lncRNA_56352";

yeah, the problem is present in all three gtf files. This suggests
there’s a formatting issue in the lncRNA pipeline itself
(`./01.6-lncRNA-pipeline`). My best guess is that there’s an issue in
accounting for a BED-based 0-start position? There are apparently also
problems with the BED file formats, though, so the correction must have
been mistakenly applied before gtf conversion. This would mean that
either the entire “start” column OR both the “start” *and* “end” columns
of each file are shifted down by 2 (so, e.g., the “-1” should be “1”, a
“123” should actually be a “125”, etc.)

I’m going to have Claude Opus 4.8 look through Zach’s pipeline code to
see if it can identify the problem.

Claude response:

> Let `T` = the true 1-based start of a candidate transcript in
> `lncRNA_candidates.gtf` (this is correct, coming straight from
> gffcompare/StringTie). `bedtools getfasta -name` writes headers like
> transcript::chrom:start-end. In bedtools ≥2.27, the appended
> chrom:start-end is the 0-based interval, even when the input is a
> 1-based GTF (bedtools normalizes everything to 0-based BED
> internally). So the header start is `T − 1`. The `while` read loop
> (and the equivalent Python block) then does start=\$((start - 1)) with
> the comment “Convert the start position to 0-based by subtracting 1.”
> But the header value is already 0-based. So this subtracts a second
> time → BED start = `T − 2`. `renamegtf` chunk builds lncRNA.gtf with
> printf … %d … \$2, \$3 — it writes the BED start (\$2) straight into
> the GTF start column with no +1. Result: GTF start = `T − 2`.

I confirmed this by manually reading through the code (and confirmed
that our bedtools version, `/home/shared/bedtools2/bin/bedtools` is
\>=2.27). Yes, it seems like there is a double-correction in converting
a gtf (1-start) candidate file to a bed (0-start) final file, *and* that
bed final file was not appropriately converted into the final gtf file.
This means that both the final `lncRNA.bed` and `lncRNA.gtf` file have
inaccurate start positions – **the `lncRNA.bed` starts are off by 1
position, and the `lncRNA.gtf` starts are off by 2**. Importantly, since
.bed and .gtf end up having the same end positions, the end columns
didn’t undergo any conversions and should still be correct in all files.

Since the `./01.6-lncRNA-pipeline` code is Zach’s I want to leave it to
him to confirm the problem and fix, since he’s much more familiar with
the code structure. For now, I’ll manually correct the gtf files, under
the assumption they need to be start-shifted up by 2 positions, so that
I can generate fasta files

## 0.2 Manual GTF and BED correction

GTF: add 2 to all start positions (e.g., so that -1 becomes 1); leave
all end positions unchanged BED: add 1 to all start positions (e.g., so
that -1 becomes 0); leave all end positions unchanged

### 0.2.1 Apul

``` bash
awk 'BEGIN{FS=OFS="\t"} /^#/ {print; next} {$4=$4+2} 1' \
  "../output/01.6-lncRNA-pipeline/lncRNA.gtf" > "../output/01.63-lncRNA-pipeline-getFastas/Apul_lncRNA.start_corrected.gtf"
  
awk 'BEGIN{FS=OFS="\t"} /^#/ || /^track/ || /^browser/ {print; next} {$2=$2+1} 1' \
  "../output/01.6-lncRNA-pipeline/lncRNA.bed" > "../output/01.63-lncRNA-pipeline-getFastas/Apul_lncRNA.start_corrected.bed"

# Verify
# GTF: no illegal starts, start <= end everywhere
echo "verify GTF"
awk -F'\t' '!/^#/ && $4<1'  "../output/01.63-lncRNA-pipeline-getFastas/Apul_lncRNA.start_corrected.gtf" | wc -l   # expect 0
awk -F'\t' '!/^#/ && $4>$5' "../output/01.63-lncRNA-pipeline-getFastas/Apul_lncRNA.start_corrected.gtf"           # expect nothing

# BED: no negative starts, start < end everywhere (BED is half-open)
echo "verify BED"
awk -F'\t' '!/^#/ && $2<0'   "../output/01.63-lncRNA-pipeline-getFastas/Apul_lncRNA.start_corrected.bed" | wc -l  # expect 0
awk -F'\t' '!/^#/ && $2>=$3' "../output/01.63-lncRNA-pipeline-getFastas/Apul_lncRNA.start_corrected.bed"          # expect nothing

# Confirm the shift actually applied (compare a few rows side by side)
echo "confirm shift"
paste <(cut -f4 "../output/01.6-lncRNA-pipeline/lncRNA.gtf" | grep -v '^#') \
      <(cut -f4 "../output/01.63-lncRNA-pipeline-getFastas/Apul_lncRNA.start_corrected.gtf" | grep -v '^#') | head -3
```

    ## verify GTF
    ## 0
    ## verify BED
    ## 0
    ## confirm shift
    ## 684  686
    ## 32055    32057
    ## 25260    25262

while we’re here, let’s also make a bed file that contains the lncRNA
IDs – ours currently only contains genomic coordinates

``` bash
awk 'BEGIN{FS=OFS="\t"} !/^#/ {
    match($9, /gene_id "[^"]+"/)
    name = substr($9, RSTART+9, RLENGTH-10)   # strip leading: gene_id "  and trailing: "
    print $1, $4-1, $5, name, ".", $7          # $4-1 = 1-based -> 0-based start
}' "../output/01.63-lncRNA-pipeline-getFastas/Apul_lncRNA.start_corrected.gtf" \
> "../output/01.63-lncRNA-pipeline-getFastas/Apul_lncRNA.start_corrected.named.bed"
```

### 0.2.2 Peve

``` bash
awk 'BEGIN{FS=OFS="\t"} /^#/ {print; next} {$4=$4+2} 1' \
  "../output/01.61-lncRNA-pipeline/lncRNA.gtf" > "../output/01.63-lncRNA-pipeline-getFastas/Peve_lncRNA.start_corrected.gtf"
  
awk 'BEGIN{FS=OFS="\t"} /^#/ || /^track/ || /^browser/ {print; next} {$2=$2+1} 1' \
  "../output/01.61-lncRNA-pipeline/lncRNA.bed" > "../output/01.63-lncRNA-pipeline-getFastas/Peve_lncRNA.start_corrected.bed"

# Verify
# GTF: no illegal starts, start <= end everywhere
echo "verify GTF"
awk -F'\t' '!/^#/ && $4<1'  "../output/01.63-lncRNA-pipeline-getFastas/Peve_lncRNA.start_corrected.gtf" | wc -l   # expect 0
awk -F'\t' '!/^#/ && $4>$5' "../output/01.63-lncRNA-pipeline-getFastas/Peve_lncRNA.start_corrected.gtf"           # expect nothing

# BED: no negative starts, start < end everywhere (BED is half-open)
echo "verify BED"
awk -F'\t' '!/^#/ && $2<0'   "../output/01.63-lncRNA-pipeline-getFastas/Peve_lncRNA.start_corrected.bed" | wc -l  # expect 0
awk -F'\t' '!/^#/ && $2>=$3' "../output/01.63-lncRNA-pipeline-getFastas/Peve_lncRNA.start_corrected.bed"          # expect nothing

# Confirm the shift actually applied (compare a few rows side by side)
echo "confirm shift"
paste <(cut -f4 "../output/01.61-lncRNA-pipeline/lncRNA.gtf" | grep -v '^#') \
      <(cut -f4 "../output/01.63-lncRNA-pipeline-getFastas/Peve_lncRNA.start_corrected.gtf" | grep -v '^#') | head -3
```

    ## verify GTF
    ## 0
    ## verify BED
    ## 0
    ## confirm shift
    ## 7718 7720
    ## 8490 8492
    ## 26132    26134

while we’re here, let’s also make a bed file that contains the lncRNA
IDs – ours currently only contains genomic coordinates

``` bash
awk 'BEGIN{FS=OFS="\t"} !/^#/ {
    match($9, /gene_id "[^"]+"/)
    name = substr($9, RSTART+9, RLENGTH-10)   # strip leading: gene_id "  and trailing: "
    print $1, $4-1, $5, name, ".", $7          # $4-1 = 1-based -> 0-based start
}' "../output/01.63-lncRNA-pipeline-getFastas/Peve_lncRNA.start_corrected.gtf" \
> "../output/01.63-lncRNA-pipeline-getFastas/Peve_lncRNA.start_corrected.named.bed"
```

### 0.2.3 Ptuh

``` bash
awk 'BEGIN{FS=OFS="\t"} /^#/ {print; next} {$4=$4+2} 1' \
  "../output/01.62-lncRNA-pipeline/lncRNA.gtf" > "../output/01.63-lncRNA-pipeline-getFastas/Ptuh_lncRNA.start_corrected.gtf"
  
awk 'BEGIN{FS=OFS="\t"} /^#/ || /^track/ || /^browser/ {print; next} {$2=$2+1} 1' \
  "../output/01.62-lncRNA-pipeline/lncRNA.bed" > "../output/01.63-lncRNA-pipeline-getFastas/Ptuh_lncRNA.start_corrected.bed"

# Verify
# GTF: no illegal starts, start <= end everywhere
echo "verify GTF"
awk -F'\t' '!/^#/ && $4<1'  "../output/01.63-lncRNA-pipeline-getFastas/Ptuh_lncRNA.start_corrected.gtf" | wc -l   # expect 0
awk -F'\t' '!/^#/ && $4>$5' "../output/01.63-lncRNA-pipeline-getFastas/Ptuh_lncRNA.start_corrected.gtf"           # expect nothing

# BED: no negative starts, start < end everywhere (BED is half-open)
echo "verify BED"
awk -F'\t' '!/^#/ && $2<0'   "../output/01.63-lncRNA-pipeline-getFastas/Ptuh_lncRNA.start_corrected.bed" | wc -l  # expect 0
awk -F'\t' '!/^#/ && $2>=$3' "../output/01.63-lncRNA-pipeline-getFastas/Ptuh_lncRNA.start_corrected.bed"          # expect nothing

# Confirm the shift actually applied (compare a few rows side by side)
echo "confirm shift"
paste <(cut -f4 "../output/01.62-lncRNA-pipeline/lncRNA.gtf" | grep -v '^#') \
      <(cut -f4 "../output/01.63-lncRNA-pipeline-getFastas/Ptuh_lncRNA.start_corrected.gtf" | grep -v '^#') | head -3
```

    ## verify GTF
    ## 0
    ## verify BED
    ## 0
    ## confirm shift
    ## 10856    10858
    ## 74884    74886
    ## 74884    74886

while we’re here, let’s also make a bed file that contains the lncRNA
IDs – ours currently only contains genomic coordinates

``` bash
awk 'BEGIN{FS=OFS="\t"} !/^#/ {
    match($9, /gene_id "[^"]+"/)
    name = substr($9, RSTART+9, RLENGTH-10)   # strip leading: gene_id "  and trailing: "
    print $1, $4-1, $5, name, ".", $7          # $4-1 = 1-based -> 0-based start
}' "../output/01.63-lncRNA-pipeline-getFastas/Ptuh_lncRNA.start_corrected.gtf" \
> "../output/01.63-lncRNA-pipeline-getFastas/Ptuh_lncRNA.start_corrected.named.bed"
```

## 0.3 Convert corrected BED to FASTA

### 0.3.1 Apul

``` bash
/home/shared/bedtools2/bin/bedtools getfasta \
-fi ../../D-Apul/data/Apulchra-genome.fa \
-bed ../output/01.63-lncRNA-pipeline-getFastas/Apul_lncRNA.start_corrected.named.bed \
-s -nameOnly \
-fo ../output/01.63-lncRNA-pipeline-getFastas/Apul_lncRNA.fasta
```

### 0.3.2 Peve

``` bash
/home/shared/bedtools2/bin/bedtools getfasta \
-fi ../../E-Peve/data/Porites_evermanni_v1.fa \
-bed ../output/01.63-lncRNA-pipeline-getFastas/Peve_lncRNA.start_corrected.named.bed \
-s -nameOnly \
-fo ../output/01.63-lncRNA-pipeline-getFastas/Peve_lncRNA.fasta
```

### 0.3.3 Ptuh

``` bash
/home/shared/bedtools2/bin/bedtools getfasta \
-fi ../../F-Ptuh/data/Pocillopora_meandrina_HIv1.assembly.fa \
-bed ../output/01.63-lncRNA-pipeline-getFastas/Ptuh_lncRNA.start_corrected.named.bed \
-s -nameOnly \
-fo ../output/01.63-lncRNA-pipeline-getFastas/Ptuh_lncRNA.fasta
```