HSDFinder

Run HSDFinder

Text Input

Prepare two spreadsheets in tab-separated values (tsv) format as input files.

The first spreadsheet comes from a protein BLAST search of the genome's genes against themselves (E-value cut-off 10^-5, BLASTp output format 6).

See File Examples
File 1
File 2

The second spreadsheet comes from InterProScan, software that automatically assigns protein signatures such as Pfam domains.

Or Upload

File1 (tsv.)
The BLAST results should be a 12-column spreadsheet containing the key information for each hit, from query name to percentage identity and so on (see the web FAQ for more).

File2 (tsv.)
The InterProScan output file is in tab-separated values (tsv) format by default.

Amino acid pairwise identities:

Amino acid length variance (aa):

Protein function databases:

Visualization

For comparative analysis of HSDs across different species, we developed an online heatmap plotting option to visualize HSD results in different KEGG pathway categories.

Create Heatmap [Download Hands-on protocol to create heatmap.pdf]

HSD File
HSD_File_example.txt

Gene list with KO annotation
Genelist_KO_annotation_example.txt

Organism name

See More Example Files Here

Figure Size:

row col

Once the input files have been submitted, the number of HSDs for each species is displayed in a heatmap under different KEGG functional categories. The color bar on the left indicates the broad category of the HSDs that have pathway function matches, such as carbohydrate metabolism, energy metabolism, or translation. The color of each cell in the matrix indicates the number of HSDs across species.

What's new

Aug. 5th, 2020: Updated to version 1.5.
The predicted HSDs are displayed in a spreadsheet, which offers an alternative way to browse the results in graphical and tabular form. The software provides a first-pass selection of HSDs; manual curation is still needed to filter out partial genes and pseudogenes.

Aug. 1st, 2020: Updated to version 1.0.
The web server can analyze unannotated genome sequences by integrating results from InterProScan (e.g., Pfam) and KEGG.

What's HSDatabase

HSDatabase serves as a database for eukaryotic genomics and has facilitated studies of highly similar duplicates, such as gene duplicate detection and prediction. The HSD predictions stored in HSDatabase have been supported by a series of experiments, including those published in New Phytologist and iScience. The current website is available at HSDatabase.

Tutorial

Workflow of HSDFinder (8 steps) [Download Tutorial for HSDFinder.pdf]

HSDFinder Online server Tutorial: Identification of highly similar duplicates in eukaryotic genomes.

1. Upload a protein BLAST search result file of your genome in tab-separated values (tsv) format as the first input file (File 1) of HSDFinder.
2. Upload an InterProScan search result file of your genome in tab-separated values (tsv) format as the second input file (File 2) of HSDFinder.
3. Yielding the output of HSDFinder with three personalized options.
4. Visualizing the HSDFinder outputs via the Excel tools (optional).
5. Upload the results of HSDFinder from your respective genomes.
6. Upload a gene list with KO annotation from KEGG database.
7. The output files of the online Heatmap Visualization tool.
8. The heatmap of HSDs levels across species.

Webinars: Using Bioinformatics Tools to Predict, Collect and Visualize the Highly Similar Duplicates in Eukaryotic Genomes

HSDFinder - an integrated tool to predict highly similar duplicates (HSDs) in eukaryotic genomes. (http://hsdfinder.com)

Series 1: How to run HSDFinder?
Series 2: How to visualize HSDs in HSDFinder?
Series 3: How to run HSDFinder locally by downloading from GitHub?

HSDatabase - a database of highly similar duplicate genes in eukaryotic genomes. (http://hsdfinder.com/database/)

Series 4: How to use the HSDatabase - browse and search?
Series 5: How to use the HSDatabase - BLAST and KEGG?

How to cite

Xi Zhang, Yining Hu, Zhengyu Cheng, John M. Archibald (2023). HSDecipher: A pipeline for comparative genomic analysis of highly similar duplicate genes in eukaryotic genomes. STAR Protocols. doi: https://doi.org/10.1016/j.xpro.2022.102014
Xi Zhang, Yining Hu, David Roy Smith (2022). An overview of online resources for intra-species detection of gene duplications. Frontiers in Genetics. doi: http://doi.org/10.3389/fgene.2022.1012788
Xi Zhang, Yining Hu, David Roy Smith (2022). HSDatabase - a database of highly similar duplicate genes from plants, animals, and algae. Database. doi: http://doi.org/10.1093/database/baac086
Xi Zhang, Yining Hu, David Roy Smith (2021). HSDFinder: a BLAST-based strategy to search for highly similar duplicated genes in eukaryotic genomes. Frontiers in Bioinformatics. doi: http://doi.org/10.3389/fbinf.2021.803176
Xi Zhang, Yining Hu, David Roy Smith (2021). Protocol for HSDFinder: Identifying, annotating, categorizing, and visualizing duplicated genes in eukaryotic genomes. STAR Protocols. doi: https://doi.org/10.1016/j.xpro.2021.100619
Xi Zhang, Marina Cvetkovska, Rachael Morgan-Kiss, Norman P. A. Hüner, David Roy Smith (2021). Draft genome sequence of the Antarctic green alga Chlamydomonas sp. UWO241. iScience. doi: https://doi.org/10.1016/j.isci.2021.102084

Where to download

The distribution version of HSDFinder is also available.
Current version: v1 (5 August 2020) [download].

Links to InterProScan and KEGG
InterProScan: https://github.com/ebi-pf-team/interproscan
KEGG: https://www.kegg.jp/kegg/

Frequently Asked Questions (FAQs)

How to prepare the input files?

Before running HSDFinder, prepare two tab-delimited text files as inputs. The first comes from a protein BLAST search of the genes against themselves (suggested parameters: E-value cut-off ≤ 10^-5, BLASTP -outfmt 6). The BLAST result for the amino acid sequences should be a 12-column tab-delimited text file containing the key information for each hit, from the query name to percentage identity and so on (see the HSDFinder tutorial on GitHub for details). The second tab-delimited text file comes from InterProScan, which scans the genes against different protein signature databases, such as Pfam. The InterProScan output is a tab-delimited text file by default.
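
A quick sanity check before uploading can catch malformed rows in File 1. The sketch below assumes a POSIX shell with awk and reports any line that does not have exactly 12 tab-separated fields. The function name check_blast_columns and the example file name are hypothetical and are not part of HSDFinder.

```shell
# Hypothetical helper (not part of HSDFinder): verify that every row of a
# BLAST -outfmt 6 file has exactly 12 tab-separated fields.
# Prints the offending line numbers and returns non-zero if any row is malformed.
check_blast_columns() {
    awk -F'\t' '
        NF != 12 { printf "line %d: %d fields\n", NR, NF; bad = 1 }
        END      { exit bad + 0 }
    ' "$1"
}

# Example call (the file name is a placeholder):
# check_blast_columns my_genome.BLAST.tabular && echo "File 1 looks OK"
```

Note that rows appended later with fewer columns (for example, gene length rows) would be reported by this check, so it is best run on the raw BLAST output.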

How to run HSDFinder?

The two tab-delimited text files can then be uploaded to HSDFinder with some personalized options. By default, HSDFinder filters for highly similar duplicates (HSDs) with near-identical protein lengths (within 10 amino acids of each other) and ≥ 90% pairwise amino acid identity. Such a relatively strict cut-off might exclude some genuine duplicates. In our experience with green algal genomes, however, these thresholds capture the majority of detected highly similar duplicates. Because duplicates vary among eukaryotic organisms, users can always adjust the thresholds for their own datasets (pairwise amino acid identity from 30% to 100%, and amino acid length variance from 0 to 100), although relaxing the thresholds risks increasing the number of false positives. The HSDFinder output is an 8-column tab-delimited text file containing information such as the HSD identifier, gene copy number, and Pfam domain.
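
To make the two default thresholds concrete, here is a minimal sketch applied directly to a BLAST table. It is an illustration only and not HSDFinder's actual implementation. It assumes the standard -outfmt 6 column order (query, subject, percent identity and alignment length in columns 1 to 4) and assumes that a gene's length can be taken from its longest self-hit, that is, a row where query equals subject. The function name filter_hsd_pairs and its arguments are hypothetical.

```shell
# Hypothetical sketch (not HSDFinder itself): list gene pairs that meet the
# identity and length-difference thresholds.
# Usage: filter_hsd_pairs BLAST_TSV [MIN_IDENTITY] [MAX_LEN_DIFF]
filter_hsd_pairs() {
    # The file is read twice: pass 1 records each gene's length from its
    # longest self-hit; pass 2 keeps non-self pairs that pass both thresholds.
    awk -F'\t' -v min_id="${2:-90}" -v max_diff="${3:-10}" '
        NR == FNR {
            if ($1 == $2 && $4 + 0 > len[$1] + 0) len[$1] = $4
            next
        }
        $1 != $2 && ($1 in len) && ($2 in len) && $3 + 0 >= min_id + 0 {
            d = len[$1] - len[$2]
            if (d < 0) d = -d
            # Output: query, subject, percent identity, length difference
            if (d <= max_diff + 0) print $1 "\t" $2 "\t" $3 "\t" d
        }
    ' "$1" "$1"
}

# Example calls (the file name is a placeholder):
# filter_hsd_pairs my_genome.BLAST.tabular          # defaults: 90% identity, 10 aa
# filter_hsd_pairs my_genome.BLAST.tabular 70 50    # relaxed thresholds
```

The sketch only lists candidate pairs; grouping pairs into HSDs and attaching Pfam annotations are left to HSDFinder.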

How to visualize the HSDs across species?

For comparative analyses of HSDs across different species, we developed an online heatmap plotting option to visualize HSD results in different KEGG pathway categories. To use it, first generate HSD results for each species of interest by following the previous steps. Plotting the heatmap requires at least two species and two input files per species. Examples are provided to show the expected input files (see the hands-on protocol on creating a heatmap with example data). The first input file is the HSDFinder output for the species of interest; the second is retrieved from the KEGG database and maps each gene model identifier to its KEGG Orthology (KO) accession (the detailed steps are given in the HSDFinder tutorial on GitHub). Once the input files have been submitted for each species, the HSDs are displayed in a heatmap (the color of each cell reflects the number of HSDs across species) and in a tab-delimited text file, organized by KEGG functional categories such as carbohydrate metabolism, energy metabolism, and translation.

How to acquire the length of the gene models?

If HSDFinder fails because gene length information is missing, use the workaround below. For a genome with amino acid sequences, run the following command to compute the length of each protein; make sure the gene identifiers match those used in the input files.
awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' '/.../.../protein.fa' | paste - - | sed 's/^>//' | awk -F'\t' '{print $1"\t"$1"\t"100"\t"$2}' > ##.protein.length.aa
Here "##" stands for your own file name prefix. The contents of the output file "##.protein.length.aa" can then be pasted into "##.BLAST.tabular", which is used as the input file.
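
When diagnosing this error, it may help to see which genes have no self-hit row in the BLAST table. The sketch below assumes that gene lengths are read from self-hits (rows where query equals subject), which is what the length rows generated above imitate. The function name genes_missing_self_hit is hypothetical.

```shell
# Hypothetical helper (not part of HSDFinder): print, in sorted order, the gene
# identifiers that appear in a BLAST tabular file but have no self-hit row.
genes_missing_self_hit() {
    awk -F'\t' '
        { seen[$1] = 1; seen[$2] = 1 }   # every gene seen as query or subject
        $1 == $2 { self[$1] = 1 }        # genes that have a self-hit
        END { for (g in seen) if (!(g in self)) print g }
    ' "$1" | sort
}

# Example call (the file name is a placeholder):
# genes_missing_self_hit my_genome.BLAST.tabular
```

An empty result suggests that every gene already has a self-hit row.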