--- title: "Getting fa files" output: html_document date: "2023-05-06" --- This code takes file in each genome dir and makes new file with genome prefix for example ``` ../data_raw/GCA_028115085.1/protein.faa ../data_raw/GCA_028115085.1/GCA_028115085.1_protein.faa ``` Specfically, ghis code snippet is a shell command that finds and copies files in a directory structure while renaming them. Let me break it down for you step by step: 1. **`find ../data_raw/ -type f -name "*.faa" -print0`**: The **`find`** command searches for files in the **`../data_raw/`** directory and its subdirectories. It looks for regular files (**`-type f`**) with names ending in **`.faa`** (**`-name "*.faa"`**). The **`-print0`** flag prints the results separated by null characters, which is useful when filenames contain spaces or special characters. 2. **`|`**: The pipe (**`|`**) symbol is used to pass the output of the **`find`** command to the next command, **`xargs`**. 3. **`xargs -0 -I {}`**: The **`xargs`** command reads items from standard input, separated by null characters (because of the **`-0`** flag). For each input item, it runs the following command, replacing occurrences of **`{}`** with the input item. The **`-I {}`** flag specifies the placeholder for the input item in the command. 4. **`bash -c '...'`**: This part runs a Bash command specified as a string. It defines several variables and runs a copy (**`cp`**) command. 5. **`input_file="{}"`**: This sets the **`input_file`** variable to the current file path, which comes from the output of the **`find`** command. 6. **`output_file="$(dirname "$input_file")/$(basename "$(dirname "$input_file")")_$(basename "$input_file")"`**: This line constructs the output file path by combining the following parts: - **`$(dirname "$input_file")`**: This extracts the directory path of the input file. - **`$(basename "$(dirname "$input_file")")`**: This extracts the name of the input file's parent directory. - **`_`**: This is a separator to make the output file name more readable. - **`$(basename "$input_file")`**: This extracts the name of the input file. The output file path has the format: **`/_`**. 7. **`cp "$input_file" "$output_file"`**: This copies the input file to the output file, effectively renaming it according to the defined naming scheme. In summary, this code snippet finds all **`.faa`** files in the **`../data_raw/`** directory and its subdirectories, and then copies them while renaming them to include their parent directory name in the file name. ```{bash} find ../data_raw/ -type f -name "*.faa" -print0 | xargs -0 -I {} bash -c 'input_file="{}"; output_file="$(dirname "$input_file")/$(basename "$(dirname "$input_file")")_$(basename "$input_file")"; cp "$input_file" "$output_file"' ``` How many files? ```{r, engine='bash'} find ../data_raw/ -type f -name "G*.faa" | wc -l ``` ```{r, engine='bash'} #given 1k+ output files creating subdirectory mkdir ../output/bigblast-01 find ../data_raw/ -type f -name "G*.faa" \ | xargs basename -s _protein.faa | xargs -I{} \ /home/shared/ncbi-blast-2.11.0+/bin/blastp \ -query ../data_raw/{}/{}_protein.faa \ -db ../blastdb/95-protein \ -out ../output/bigblast-01/{}_95_blastp.tab \ -evalue 1E-20 \ -num_threads 40 \ -max_target_seqs 1 \ -outfmt 6 ```