---
title: "Getting fa files"
output: html_document
date: "2023-05-06"
---

This code takes file in each genome dir and makes new file with genome prefix for example

```         
../data_raw/GCA_028115085.1/protein.faa
../data_raw/GCA_028115085.1/GCA_028115085.1_protein.faa
```

Specfically, ghis code snippet is a shell command that finds and copies files in a directory structure while renaming them. Let me break it down for you step by step:

1.  **`find ../data_raw/ -type f -name "*.faa" -print0`**: The **`find`** command searches for files in the **`../data_raw/`** directory and its subdirectories. It looks for regular files (**`-type f`**) with names ending in **`.faa`** (**`-name "*.faa"`**). The **`-print0`** flag prints the results separated by null characters, which is useful when filenames contain spaces or special characters.

2.  **`|`**: The pipe (**`|`**) symbol is used to pass the output of the **`find`** command to the next command, **`xargs`**.

3.  **`xargs -0 -I {}`**: The **`xargs`** command reads items from standard input, separated by null characters (because of the **`-0`** flag). For each input item, it runs the following command, replacing occurrences of **`{}`** with the input item. The **`-I {}`** flag specifies the placeholder for the input item in the command.

4.  **`bash -c '...'`**: This part runs a Bash command specified as a string. It defines several variables and runs a copy (**`cp`**) command.

5.  **`input_file="{}"`**: This sets the **`input_file`** variable to the current file path, which comes from the output of the **`find`** command.

6.  **`output_file="$(dirname "$input_file")/$(basename "$(dirname "$input_file")")_$(basename "$input_file")"`**: This line constructs the output file path by combining the following parts:

    -   **`$(dirname "$input_file")`**: This extracts the directory path of the input file.

    -   **`$(basename "$(dirname "$input_file")")`**: This extracts the name of the input file's parent directory.

    -   **`_`**: This is a separator to make the output file name more readable.

    -   **`$(basename "$input_file")`**: This extracts the name of the input file.

    The output file path has the format: **`<input_file_directory>/<parent_directory_name>_<input_file_name>`**.

7.  **`cp "$input_file" "$output_file"`**: This copies the input file to the output file, effectively renaming it according to the defined naming scheme.

In summary, this code snippet finds all **`.faa`** files in the **`../data_raw/`** directory and its subdirectories, and then copies them while renaming them to include their parent directory name in the file name.

```{bash}
find ../data_raw/ -type f -name "*.faa" -print0 | xargs -0 -I {} bash -c 'input_file="{}"; output_file="$(dirname "$input_file")/$(basename "$(dirname "$input_file")")_$(basename "$input_file")"; cp "$input_file" "$output_file"'

```


How many files?
```{r, engine='bash'}
find ../data_raw/ -type f -name "G*.faa" | wc -l
```


```{r, engine='bash'}
#given 1k+ output files creating subdirectory
mkdir ../output/bigblast-01

find ../data_raw/ -type f -name "G*.faa" \
| xargs basename -s _protein.faa | xargs -I{} \
/home/shared/ncbi-blast-2.11.0+/bin/blastp \
-query ../data_raw/{}/{}_protein.faa \
-db ../blastdb/95-protein \
-out ../output/bigblast-01/{}_95_blastp.tab \
-evalue 1E-20 \
-num_threads 40 \
-max_target_seqs 1 \
-outfmt 6
```