02. Unix Commands for Data Mining
Practice
Author: Dr. Alejandra Rougon
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
🚴 Exercise 1
Now it is time for you to practice what you have learned. Try to solve as many questions as possible. There are usually several ways to solve each problem.
Let's start with some data mining on a few tiny files. Once you have learned to do this, you will be able to work with large genomic data files.
In your home directory, create a new directory called Exercise1.
Inside the folder Exercise1, use vim to create a new file called ToyPlant.fasta. You can copy and paste the following contents:
>Plant_1
ACCACCGATACATGCGGTGCGTTGT
>Plant_3
CCACTGTGTTCGAGTTGTGATACAG
>Plant_3
CCACTGTGTTCGAGTTGTGATACAG
>Plant_2
CCAGCATTTGTAGTCACAACGCCGC
>Plant_4
TAGAGTTGTACACGCGTTTGTACGA
>Plant_4
TAGAGTTGTACACGCGTTTGTACGA
>Plant_1
ACCACCGATACATGCGGTGCGTTGT
See the file permissions
Give read, write and execute permissions to everyone on ToyPlant.fasta
How many lines does the file have?
How many records does the file have?
How many unique records does the file have?
Calculate the total number of bases [the genome size]
How many sequences contain the string GATACA? [Specific sequence strings that may have a particular function or structure are called motifs or domains.]
Make a backup of that file in Documents.
In the folder Exercise1, create the following file called ToyPlant.genes
chr1 height ht-1 100 1000 + (100-150,400-500,900-1000)
chr1 height ht-2 100 1000 + (100-150,900-1000)
chr1 resist res-1 1500 2000 + (1500-1750,1800-1850,1099-2000)
chr1 resist res-2 1500 2000 + (1500-2000)
chr2 color color-1 3400 4200 - (3400-3600,4000-4200)
chr2 color color-2 3400 4200 - (3400-3550,3800-3900,4000-4200)
chr2 color color-3 3400 4200 - (3400-3600,3800-3900,4100-4200)
chr3 fruit fru-1 50 800 + (50-400,700-800)
chr3 fruit fru-1 1100 1500 + (1100-1200,1450-1500)
chr3 smell smell-1 2000 2600 - (2000-2300,2500-2600)
chr3 smell smell-2 2000 2600 - (2000-2050,2200-2300,2500-2600)
chr4 dev dev-1 3100 3700 - (3100-3500,3600-3700)
chr4 dev dev-2 3100 3700 - (3100-3200,3400-3500,3600-3700)
chr4 height2 ht2-1 4500 4800 + (4500-4800)
chr5 shape shape9-1 200 1000 - (200-450,550-650,800-1000)
chr5 shape shape10-1 110 1700 + (110-1400,1500-1700)
How many transcripts does the file show? (all lines)
How many different chromosomes does the file have? (column 1; the field separator is a space)
How many different genes does the genome have? (column 2)
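One possible way to work through Exercise 1 is sketched below; there are other valid solutions. As assumptions of this sketch, the two files are written with here-documents instead of vim, and everything happens in a temporary directory created with mktemp that stands in for your home directory. The variable names are illustrative only.

```shell
# Scratch area standing in for the home directory (an assumption of this sketch).
workdir=$(mktemp -d)
mkdir -p "$workdir/Documents" "$workdir/Exercise1"
cd "$workdir/Exercise1" || exit 1

# Create the FASTA file (the exercise uses vim; a here-document is used here).
cat > ToyPlant.fasta <<'EOF'
>Plant_1
ACCACCGATACATGCGGTGCGTTGT
>Plant_3
CCACTGTGTTCGAGTTGTGATACAG
>Plant_3
CCACTGTGTTCGAGTTGTGATACAG
>Plant_2
CCAGCATTTGTAGTCACAACGCCGC
>Plant_4
TAGAGTTGTACACGCGTTTGTACGA
>Plant_4
TAGAGTTGTACACGCGTTTGTACGA
>Plant_1
ACCACCGATACATGCGGTGCGTTGT
EOF

ls -l ToyPlant.fasta          # show the current permissions
chmod a+rwx ToyPlant.fasta    # read, write and execute for everyone

fasta_lines=$(wc -l < ToyPlant.fasta)
fasta_records=$(grep -c '^>' ToyPlant.fasta)
# Every record here spans exactly two lines, so paste can join header and sequence.
fasta_unique=$(paste - - < ToyPlant.fasta | sort -u | wc -l)
# Drop headers and newlines, then count the remaining characters.
total_bases=$(grep -v '^>' ToyPlant.fasta | tr -d '\n' | wc -c)
motif_seqs=$(grep -v '^>' ToyPlant.fasta | grep -c 'GATACA')

cp ToyPlant.fasta ../Documents/ToyPlant.fasta    # backup copy

# Create the space-separated annotation file.
cat > ToyPlant.genes <<'EOF'
chr1 height ht-1 100 1000 + (100-150,400-500,900-1000)
chr1 height ht-2 100 1000 + (100-150,900-1000)
chr1 resist res-1 1500 2000 + (1500-1750,1800-1850,1099-2000)
chr1 resist res-2 1500 2000 + (1500-2000)
chr2 color color-1 3400 4200 - (3400-3600,4000-4200)
chr2 color color-2 3400 4200 - (3400-3550,3800-3900,4000-4200)
chr2 color color-3 3400 4200 - (3400-3600,3800-3900,4100-4200)
chr3 fruit fru-1 50 800 + (50-400,700-800)
chr3 fruit fru-1 1100 1500 + (1100-1200,1450-1500)
chr3 smell smell-1 2000 2600 - (2000-2300,2500-2600)
chr3 smell smell-2 2000 2600 - (2000-2050,2200-2300,2500-2600)
chr4 dev dev-1 3100 3700 - (3100-3500,3600-3700)
chr4 dev dev-2 3100 3700 - (3100-3200,3400-3500,3600-3700)
chr4 height2 ht2-1 4500 4800 + (4500-4800)
chr5 shape shape9-1 200 1000 - (200-450,550-650,800-1000)
chr5 shape shape10-1 110 1700 + (110-1400,1500-1700)
EOF

transcripts=$(wc -l < ToyPlant.genes)
chromosomes=$(cut -d' ' -f1 ToyPlant.genes | sort -u | wc -l)
genes=$(cut -d' ' -f2 ToyPlant.genes | sort -u | wc -l)

echo "fasta: lines=$fasta_lines records=$fasta_records unique=$fasta_unique bases=$total_bases motif=$motif_seqs"
echo "genes: transcripts=$transcripts chromosomes=$chromosomes genes=$genes"
```

The paste step only works because each toy record has a single sequence line. For multi-line FASTA files, the records would first need to be joined onto one line each.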
🚴 Exercise 2
We are studying effectors, proteins involved in pathogenicity, found in the sequences of the phytopathogen Hyaloperonospora arabidopsidis. We want to know how many of those sequences are RxLR effectors (RxLR for containing Arginine, any amino acid, Leucine and Arginine). We also want to know which ones are cysteine-rich, and which of the RxLR effectors belong to a specific strain called Emoy2.
At the moment, we are only going to look for the strings ‘RxLR’ and ‘cysteine-rich’ within the description line. However, you could also look for specific domains within the sequences, using other tools to remove the line breaks and to search for ambiguous bases or amino acids. To convert the fasta file into a tabular file with each fasta record on a single line, you can use this command:
awk -v RS='\n>' -v ORS='\n>' -v OFS='' -F'\n' '{$1=$1 "\t"}1' file.fasta
Then you can select the second column and look for a string. You can use a dot (.) to find ambiguous bases or amino acids: the dot is a regular expression that matches any single character. So, instead of RxLR, you would have to search for R.LR in the actual sequence. To use extended regular expressions with grep, add the option -E (the dot also works without it). To see the matched string colored in your grep output, use the option --color.
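To see why removing the line breaks matters, here is a minimal sketch using a hypothetical toy protein file; the record names and sequences are invented for illustration and do not come from Hp1.fasta. It assumes an awk that accepts a multi-character RS, as gawk and mawk do.

```shell
# Hypothetical toy protein file; names and sequences are invented for this sketch.
toy=$(mktemp)
cat > "$toy" <<'EOF'
>prot_A
MKTRSLRAA
GGH
>prot_B
MAAAGGG
TTT
>prot_C
MRQ
LRKK
EOF

# Searching the raw sequence lines misses prot_C: its RQLR is split over two lines.
raw_hits=$(grep -v '^>' "$toy" | grep -E -c 'R.LR')

# Convert to one record per line with the awk command from the text,
# then keep column 2 (the sequence). cut -s drops lines that have no tab,
# such as a stray trailing ">" left by the output record separator.
tab_hits=$(awk -v RS='\n>' -v ORS='\n>' -v OFS='' -F'\n' '{$1=$1 "\t"}1' "$toy" \
  | cut -s -f2 | grep -E -c 'R.LR')

# Print the matching sequences; --color highlights the match on a terminal.
awk -v RS='\n>' -v ORS='\n>' -v OFS='' -F'\n' '{$1=$1 "\t"}1' "$toy" \
  | cut -s -f2 | grep -E --color 'R.LR'

echo "raw lines matching: $raw_hits, joined sequences matching: $tab_hits"
```

Counting on the joined sequences finds one more match than counting on the raw lines, because the motif in prot_C only appears once its two sequence lines are concatenated.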
Go to the folder Exercise1 and create a new directory called Analysis.
Upload the following file to your virtual terminal: Hp1.fasta
How many records does Hp1.fasta have?
For the next three questions we will analyze the identifiers and not the sequences.
How many of those records are RxLR proteins?
How many of those records are cysteine-rich proteins?
How many of the RxLR proteins belong to the strain Emoy2?
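The counting pattern for these questions can be sketched as below. Since Hp1.fasta is not reproduced here, the sketch runs on a small stand-in file whose headers and sequences are entirely hypothetical; the real description lines may be worded differently, so the search strings may need adjusting.

```shell
# Stand-in for Hp1.fasta; every header and sequence below is hypothetical.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>toy_001 RxLR effector, strain Emoy2
MRLRAA
>toy_002 cysteine-rich protein, strain Emoy2
MCCCC
>toy_003 RxLR effector, strain Other1
MRSLRK
>toy_004 hypothetical protein
MAAA
EOF

# One record per header line.
records=$(grep -c '^>' "$fasta")

# Restrict the search to header lines first, then count the descriptions of interest.
rxlr=$(grep '^>' "$fasta" | grep -c 'RxLR')
cys_rich=$(grep '^>' "$fasta" | grep -c 'cysteine-rich')

# Chain two filters: RxLR headers that also mention the strain.
rxlr_emoy2=$(grep '^>' "$fasta" | grep 'RxLR' | grep -c 'Emoy2')

echo "records=$records RxLR=$rxlr cysteine-rich=$cys_rich RxLR+Emoy2=$rxlr_emoy2"
```

Filtering on '^>' first keeps the search on the identifiers, so a sequence line can never be counted by accident.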
Thank you for completing this activity!