Find potential wheat homeologs

Tools

Find potential wheat homeologs (best hit with >90% identity and an alignment covering >60% of the CDS length) and their putative functions, based on the top BLAST hit in Arabidopsis (At) and rice (Os).

Please paste gene IDs (e.g. TraesCS5A02G391700) below, one gene per line.

To start a new job, click the “Clear” button below and resubmit (this is faster than refreshing the page).

Database to search:
Output At/Os best hits only
Find wheat genes that match given At/Os genes (i.e. At/Os genes -> wheat genes)

Output below

or Export to CSV

WheatGeneID	Best Wheat matches	Wheat %identity	Best At matches	At %identity	At align length	At description	Best Os matches	Os %identity	Os align length	Os description

Update

2024-09-18: modified the blastp method (added -seg yes) to match the Ensembl BLAST output (this only affects some of the top Arabidopsis hits).
2024-09-18: added some low-confidence genes that are hits of high-confidence genes. For example, the B homeolog of PLATZ-A1 (TraesCS6A02G156600) is a low-confidence gene.
2024-11-01: added the BLAST alignment length for At and Os hits. Without the alignment length, we cannot tell which wheat gene is the best At/Os homolog.

Methods

Here are the commands I used to prepare the homeologs and the best hits in Arabidopsis and rice. Arabidopsis and rice sequences were downloaded from Ensembl Plants. Kronos cDNAs were downloaded from Zenodo. CS IWGSC annotation v1.1 HC cDNAs were downloaded from Wheat URGI.

## homeolog search by self blast
### blast self
blastn -task blastn -db ../blastdb/Kronos.v1.0.all.cds.fa -query ../blastdb/Kronos.v1.0.all.cds.fa -outfmt "6 std qlen slen" -perc_identity 90 -word_size 20 -num_threads 40 -out out_Kronos_v1.0_cdna_self_wordsize20.txt &
blastn -task blastn -query /Users/galaxy/blastdb/IWGSC_v1.1_HC_20170706_cds.fasta -db /Users/galaxy/blastdb/IWGSC_v1.1_HC_20170706_cds.fasta -outfmt "6 std qlen slen" -perc_identity 90 -word_size 20 -num_threads 40 -out out_CS_v1.1_HC_self_wordsize20.txt &

### organize results (self3): keep hits whose alignment length is >60% of the query CDS length (a loose cutoff, to allow for splice variation); report each gene pair once
gawk '$4>$13*0.6 {split($1,aa,"."); split($2,bb,"."); qq=aa[1]; ss=bb[1]; if(!(qq"\t"ss in cc)) {cc[qq"\t"ss]++; printf("%s\t%s\t%.f\t%s\n",qq,ss,$3,$4)} }' out_CS_v1.1_HC_self_wordsize20.txt > filtered_CS_v1.1_HC_self3.txt
gawk '$4>$13*0.6 {split($1,aa,"."); split($2,bb,"."); qq=aa[1]; ss=bb[1]; if(!(qq"\t"ss in cc)) {cc[qq"\t"ss]++; printf("%s\t%s\t%.f\t%s\n",qq,ss,$3,$4)} }' out_Kronos_v1.0_cdna_self_wordsize20.txt > filtered_Kronos_self3.txt
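To illustrate what the filter above keeps and drops, here is a minimal sketch on three invented BLAST rows in the same 14-column layout (outfmt "6 std qlen slen", so column 4 is the alignment length and column 13 is the query length). The gene IDs, the numbers, and the function name `filter_homeologs` are all made up for illustration. Plain `awk` is used instead of `gawk` on the assumption that only POSIX features are needed.

```shell
# Toy input: 14 tab-separated columns per row; every value here is invented.
toy=$(mktemp)
printf 'geneA.1\tgeneB.1\t95.2\t900\t40\t2\t1\t900\t1\t900\t0.0\t1200\t1000\t1100\n'  > "$toy"  # 900 > 0.6*1000: kept
printf 'geneA.2\tgeneB.1\t96.0\t950\t35\t1\t1\t950\t1\t950\t0.0\t1300\t1000\t1100\n' >> "$toy"  # same gene pair once isoform suffixes are stripped: skipped
printf 'geneA.1\tgeneC.1\t92.0\t500\t40\t0\t1\t500\t1\t500\t1e-150\t600\t1000\t800\n' >> "$toy"  # 500 is not > 600: dropped

# Hypothetical helper that applies the same rule as the gawk one-liner above.
filter_homeologs() {
  awk -F'\t' '$4 > $13 * 0.6 {
      split($1, q, "."); split($2, s, ".")      # strip the isoform suffix (".1", ".2", ...)
      key = q[1] "\t" s[1]
      if (!(key in seen)) {                     # report each gene pair only once
          seen[key] = 1
          printf("%s\t%s\t%.f\t%s\n", q[1], s[1], $3, $4)
      }
  }' "$1"
}

result=$(filter_homeologs "$toy")
echo "$result"
rm -f "$toy"
```

Only the first row survives, written as gene-level IDs with the identity rounded to a whole number. Note that the first isoform pair encountered wins, so the reported identity and alignment length come from whichever isoform pair BLAST listed first.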

## blast Os and At
# update 2024-09-18: add '-seg yes'
### Kronos
blastp -db ../blastdb/Arabidopsis_thaliana.TAIR10.pep.all.fa -query ../blastdb/Kronos.v1.0.all.pep.fa -outfmt "6 std qlen slen stitle" -max_target_seqs 6 -word_size 3 -num_threads 40 -out out_Kronos_v1.0_against_Arabidopsis_TAIR10_pep_wordsize3.txt -seg yes &
blastn -task blastn -db /Users/galaxy/blastdb/Oryza_sativa.IRGSP-1.0.cds.all.fa -query ../blastdb/Kronos.v1.0.all.cds.fa -outfmt "6 std qlen slen stitle" -max_target_seqs 6 -word_size 15 -num_threads 40 -out out_Kronos_v1.0_against_rice_IRGSP-1.0_cdna_wordsize15.txt &

gawk 'bb[$1]<1{bb[$1]=1; print}' out_Kronos_v1.0_against_Arabidopsis_TAIR10_pep_wordsize3.txt > top1hit_out_Kronos_v1.0_against_Arabidopsis_TAIR10_pep_wordsize3.txt
sed -i 's/ gene:/\t/g;s/ gene_symbol:/\t/g;s/ description:/\t/g;s/ \[Source/\t/g' top1hit_out_Kronos_v1.0_against_Arabidopsis_TAIR10_pep_wordsize3.txt

gawk 'bb[$1]<1{bb[$1]=1; print}' out_Kronos_v1.0_against_rice_IRGSP-1.0_cdna_wordsize15.txt > top1hit_out_Kronos_v1.0_against_rice_IRGSP-1.0_cdna_wordsize15.txt
sed -i 's/ gene:/\t/g;s/ gene_biotype:/\t/g; s/ gene_symbol:/\t/g;s/ description:/\t/g' top1hit_out_Kronos_v1.0_against_rice_IRGSP-1.0_cdna_wordsize15.txt
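The two post-processing steps above (keep the first row per query, then split the Ensembl-style title into tab-separated fields) can be sketched on toy data as follows. The rows use a shortened, invented 4-column layout rather than the real 15-column BLAST output, and the IDs, symbols, and descriptions are made up. `awk '!seen[$1]++'` is used here as an equivalent idiom for the `bb[$1]<1` test, and a literal tab is passed to `sed` through a variable because `\t` in a replacement is a GNU sed extension.

```shell
# Toy input: query, subject, %identity, title (invented values and layout).
toy=$(mktemp)
printf 'wheatGene1.1\tAtX.1\t62.5\tAtX.1 pep gene:AtX gene_symbol:SYM1 description:example protein [Source:made-up]\n'  > "$toy"
printf 'wheatGene1.1\tAtY.1\t48.0\tAtY.1 pep gene:AtY gene_symbol:SYM2 description:another protein [Source:made-up]\n' >> "$toy"
printf 'wheatGene2.1\tAtY.1\t71.3\tAtY.1 pep gene:AtY gene_symbol:SYM2 description:another protein [Source:made-up]\n' >> "$toy"

tab=$(printf '\t')   # literal tab, so the sed call does not depend on GNU-only "\t" handling

# Step 1: BLAST lists the hits of each query best-first, so the first row per query is its top hit.
# Step 2: turn the " gene:", " gene_symbol:", " description:" and " [Source" markers into column breaks.
top_hits=$(awk -F'\t' '!seen[$1]++' "$toy" |
  sed "s/ gene:/${tab}/g;s/ gene_symbol:/${tab}/g;s/ description:/${tab}/g;s/ \[Source/${tab}/g")

echo "$top_hits"
rm -f "$toy"
```

After this step the gene ID, gene symbol, and description each sit in their own column, which makes the table straightforward to load into a database.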

### CS
blastp -db ../blastdb/Arabidopsis_thaliana.TAIR10.pep.all.fa -query ../blastdb/Triticum_aestivum.IWGSC.pep.all.fa -outfmt "6 std qlen slen stitle" -max_target_seqs 6 -word_size 3 -num_threads 40 -out out_CS_v1.1_against_Arabidopsis_TAIR10_pep_wordsize3.txt -seg yes &
blastn -task blastn -db /Users/galaxy/blastdb/Oryza_sativa.IRGSP-1.0.cds.all.fa -query /Users/galaxy/blastdb/IWGSC_v1.1_HC_20170706_cds.fasta -outfmt "6 std qlen slen stitle" -max_target_seqs 6 -word_size 11 -num_threads 40 -out out_CS_v1.1_against_rice_IRGSP-1.0_cdna_wordsize11.txt &

gawk 'bb[$1]<1{bb[$1]=1; print}' out_CS_v1.1_against_Arabidopsis_TAIR10_pep_wordsize3.txt > top1hit_out_CS_v1.1_against_Arabidopsis_TAIR10_pep_wordsize3.txt
sed -i 's/ gene:/\t/g;s/ gene_symbol:/\t/g;s/ description:/\t/g;s/ \[Source/\t/g' top1hit_out_CS_v1.1_against_Arabidopsis_TAIR10_pep_wordsize3.txt

gawk 'bb[$1]<1{bb[$1]=1; print}' out_CS_v1.1_against_rice_IRGSP-1.0_cdna_wordsize11.txt > top1hit_CS_v1.1_against_rice_IRGSP-1.0_cdna_wordsize11.txt
sed -i 's/ gene:/\t/g; s/ gene_biotype:/\t/g; s/ gene_symbol:/\t/g; s/ description:/\t/g'  top1hit_CS_v1.1_against_rice_IRGSP-1.0_cdna_wordsize11.txt

## then I prepared a sqlite3 database for the webtool
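The database step is not shown in the commands above. As a rough sketch only, a tab-separated table such as the filtered homeolog file could be loaded with the `sqlite3` command-line shell like this. The table name `homeologs`, its column names, the index, and the toy rows are assumptions made for illustration and are not the schema actually used by the webtool.

```shell
# Toy stand-in for a filtered homeolog table (4 tab-separated columns; invented values).
tsv=$(mktemp)
db=$(mktemp)
printf 'geneA\tgeneB\t95\t900\n'  > "$tsv"
printf 'geneA\tgeneD\t93\t850\n' >> "$tsv"
printf 'geneB\tgeneA\t95\t900\n' >> "$tsv"

# Hypothetical schema: create the table first, so .import treats every row as data.
sqlite3 "$db" <<EOF
CREATE TABLE homeologs (query TEXT, subject TEXT, identity INTEGER, align_len INTEGER);
.mode tabs
.import $tsv homeologs
CREATE INDEX idx_homeologs_query ON homeologs(query);
EOF

# Example lookup: all recorded matches of one gene, best identity first.
hits=$(sqlite3 "$db" "SELECT subject FROM homeologs WHERE query = 'geneA' ORDER BY identity DESC;")
echo "$hits"
rm -f "$tsv" "$db"
```

An index on the query column keeps per-gene lookups fast, which matters when the same file is queried in the browser through sql.js.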

Acknowledgment

IWGSC for CS Refseq v1 assembly and annotation
Krasileva lab for Kronos v1 assembly and annotation
Ensembl Plants for hosting many plant genomes
sqlite3 for database preparation
sql.js for using sqlite3 in the browser