Exporting and searching ideal components
The ideal components constructed by RNA-clique can be interpreted as sets of
orthologous genes identified among all of the analyzed samples. RNA-clique uses
alignments among these identified orthologs as the basis for its distance
computation without specifically selecting and writing the ortholog sequences to
the disk, but since it is sometimes useful to be able to manually inspect or
search the detected orthologs, RNA-clique can optionally export and search
orthologous genes in ideal components via the export_and_search module. This
tutorial describes how to use export_and_search to export orthologous gene
sequences from ideal components and search those genes for sequences of
interest.
The first part of this tutorial assumes that the user has completed the end-to-end "From RNA-seq reads to a phylogenetic tree with RNA-clique" tutorial, and the second part assumes that the user has also completed the "Quickly computing subsets of existing analyses" tutorial. These earlier tutorials give us some data to export and search.
Background
The metadata for the six samples analyzed in the end-to-end tutorial is reproduced below.
| SRA Accession | Genotype | Endophyte |
|---|---|---|
| SRR2321388 | CTE46 | infected |
| SRR2321385 | CTE46 | minus |
| SRR8003761 | CTE27 | infected |
| SRR8003762 | CTE27 | minus |
| SRR7990321 | FATG4 | infected |
| SRR8003736 | NTE | infected |
Recall that some of the samples have an endophyte—the fungus Epichloë coenophiala—such samples are those labeled as "infected" in the endophyte column of the sample metadata. In this tutorial, I will also describe such samples as "E+" ("E plus") and will refer to samples without endophytes as "E−" ("E minus"). In this exercise, we will export the orthologs from the full six-sample analysis and search for sequences from the genome of Epichloë coenophiala within the ideal components. We will then repeat the process for the subset of four infected samples analyzed in the subsetting tutorial.
We know that the sequence reads from which we assembled the E+ samples also contain RNA sequences from the endophyte, and, in fact, some of these reads were assembled into transcripts by rnaSPAdes. Since the E− samples lack endophytes, there should be neither endophyte RNA-seq reads nor assembled transcripts for those samples.
Endophyte transcripts in the assembled transcriptomes could potentially introduce error into the distance calculations if alignments between endophyte transcripts were used in the distance calculation. For such transcripts to be used, their genes would need to appear in some ideal component in the gene matches graph. In the case of the full six-sample analysis, we expect that no such endophyte genes will be in ideal components because endophyte transcripts should not have identifiable orthologs in the E− transcriptomes. For the subset of four infected samples, however, the presence of endophyte genes in all samples means there is a chance that some of the genes will end up in ideal components (i.e., have orthologs in all samples). In this tutorial, we will check this intuition using RNA-clique's export and searching features.
Check your environment variables
Before we begin, check that your RNA_CLIQUE and TUTORIAL_DIR environment
variables are set and pointing to the RNA-clique root directory and the working
directory for the tutorial, respectively.
If either of these commands prints an empty line or prints a path to a directory that does not exist, you may need to reset the environment variables to the values described in the "Creating a directory for our work" section of the end-to-end tutorial.
Also, make sure that you are in your rna_clique_venv virtual environment. You
should see (rna_clique_venv) at the beginning of your prompt. If not, source
the activate script to activate the environment.
Download Epichloë genome assemblies
Note
If you downloaded this software from Zenodo, you already have the Epichloë
coenophiala genomes at test_data/ec_genomes. Instead of downloading them
again, you can skip this step and simply move them to the intended location.
We will download two genome assemblies of Epichloë coenophiala to use as query sequences for searching the exported ideal components in both the original and the infected subset RNA-clique analyses. First, create a directory for the genomes
Change to the new directory.
To download the genomes, run
Then, extract the genomes with tar.
Verify that the genomes have been extracted to $TUTORIAL_DIR/ec_genomes.
You should see three files: ec_genomes.tar.xz,
e19_scaffolds-JAFEMN000000000.fasta, and
e4305_Mas339_20200623_Ref_Scaffolds_CLS.fasta.
Export and search ideal components in the original analysis
To export and search the orthologous genes in ideal components at once, run the
export_and_search module. We'll use a low e-value threshold, 1e-99, to
ensure that we only get very close matches.
python -m rna_clique.export_and_search \
-C "$TUTORIAL_DIR"/rna_clique_out/config.yaml \
-Q "$TUTORIAL_DIR"/ec_genomes/*.fasta \
-X "$TUTORIAL_DIR"/full_ec_search_out \
-e 1e-99
The results should be written to
$TUTORIAL_DIR/full_ec_search_out/rna_clique_out. (If we had provided
export_and_search multiple configuration files, there would be multiple
directories under $TUTORIAL_DIR/full_ec_search_out/.)
Examining and interpreting results
Exported ideal components
The exported ideal components should be under the
$TUTORIAL_DIR/full_ec_search_out/rna_clique_out/export directory. If you list
the contents of the export directory, you should see many FASTA files
corresponding to the ideal components identified by RNA-clique. If you open one
of these files, you will see that the FASTA headers differ from those in the
original input transcriptome files. For example, a sequence might have the FASTA
header >-NODE_1_length_14835_cov_23.346544_g0_i0:SRR7990321. In this modified
header, the leading - denotes that the sequence is the reverse complement of
the original sequence in the input transcriptome. The sequence was reoriented to
ensure that all sequences in the same file are in the same orientation. The
trailing :SRR7990321 indicates that the original sequence comes from the
transcriptome of sample SRR7990321. Everything between the leading - and
final : is the original FASTA sequence header from the input transcriptome.
Other FASTA headers may lack the leading -. In that case, the sequence appears
exactly as it did in the input transcriptome; it has not be reoriented.
Under the export directory is also a file called all_ideal.fasta, which
contains all sequences from the exported ideal components. all_ideal.fasta is
akin to a concatenation of all of the ideal_component FASTA files, but
sequence headers in the all_ideal.fasta file are further modified to indicate
from which ideal components they come. The ideal component to which a sequence
belongs is indicated by a suffix appended to the sequence header. For example,
the header
>NODE_1_length_15383_cov_32.255511_g0_i0:SRR2321385:ideal_component_0 comes
from ideal_component_0.
Search results
The search results are
organized by query file, so we should have directories named
search_e19_scaffolds-JAFEMN000000000 and
search_e4305_Mas339_20200623_Ref_Scaffolds_CLS under
$TUTORIAL_DIR/full_ec_search_out/rna_clique_out. Since we did not perform an
"extended search" to find higher e-value matches in ideal components where there
were already low e-value matches, we should only see three files in each of
these two directories—queries.sam, stats, and subjects.fasta. (For
an explanation of the extended search and the files produced in such a search,
see the Command-line usage guide entry for
search_ideal_components.)
stats
The stats file provides statistics for the search results in JSON format. Each
stats file should have three JSON keys—hits, seqs, and
components. hits is the number of BLAST HSPs found in the search. seqs is
the number of distinct transcripts matched in the search, and components is
the number of ideal components in which at least one transcript matched.
For search_e19_scaffolds-JAFEMN000000000 directory, the stats file should
look something like this:
For the search_e4305_Mas339_20200623_Ref_Scaffolds_CLS directory, the stats
file should look something like this:
Surprisingly, in both cases, several components contained matching sequences, even though some of the input samples possessed no endophyte. It's possible, however, that the matching transcripts were all from the E+ samples, and that the matches are simply misassembled transcripts that happened to incorporate reads from the endophyte. Alternatively, it could be that the query genome assemblies inadvertently incorporated reads from the host plants. We will investigate these possibilities later in this tutorial.
subjects.fasta
The subjects.fasta file contains the sequences of all transcripts from the
all_ideal.fasta file that matched a query sequence in the BLAST search. The
number of sequences in this file should match the value associated with the
seqs key in the stats file.
Since the sequences in subjects.fasta are taken from all_ideal.fasta, each
sequence header indicates the sample and ideal component in which the transcript
is found.
queries.sam
queries.sam contains the BLAST alignments of the query sequences to the ideal
component transcript sequences (in subjects.fasta) in Sequence Alignment
Map (SAM) format. The number of
alignments in this file should match the value associated with the hits key in
the stats file.
Checking represented samples in matching components
As mentioned in the stats subsection, the endophyte sequences found
among ideal components could possibly be explained by transcriptome assembly
errors. Specifically, it is possible that RNA-seq reads from the endophyte were
erroneously incorporated into plant transcripts, creating spurious in silico
hybrid transcripts. If that were the case, then we would expect all hits to
occur in E+ samples where such endophyte reads would be present.
We can use the sequence headers in subjects.fasta to check whether matches all
come from E+ samples.
grep "$TUTORIAL_DIR"/full_ec_search_out/*/search_*/subjects.fasta \
-e '>' | grep '[^:]*:[^:]*$' -o | sort -u | sort -t: -k2
The command above gets the sample name and ideal component from every FASTA
header in subjects.fasta. sort is then used twice to get the unique
sample-ideal component pairs and then sort them by ideal component.
You should see something like this:
SRR7990321:ideal_component_1127
SRR8003761:ideal_component_1331
SRR8003761:ideal_component_1459
SRR2321388:ideal_component_2366
SRR2321388:ideal_component_3289
SRR8003761:ideal_component_3939
SRR2321388:ideal_component_4110
SRR8003761:ideal_component_511
SRR8003761:ideal_component_6114
SRR2321385:ideal_component_6249
SRR2321388:ideal_component_6249
SRR7990321:ideal_component_6249
SRR8003736:ideal_component_6249
SRR8003761:ideal_component_6249
SRR8003762:ideal_component_6249
SRR8003761:ideal_component_667
SRR2321385:ideal_component_6777
SRR2321388:ideal_component_6777
SRR7990321:ideal_component_6777
SRR8003736:ideal_component_6777
SRR8003761:ideal_component_6777
SRR8003762:ideal_component_6777
SRR7990321:ideal_component_7010
SRR8003761:ideal_component_7855
SRR2321388:ideal_component_793
SRR7990321:ideal_component_8416
Notice that transcripts from SRR2321385 in ideal components \(6249\) and \(6777\) matched the endophyte assemblies. We can conclude that the hits are not merely a result of assembly errors incorporating endophyte reads into plant sequences.
Visualizing alignments produced by export_and_search
Since export_and_search produces alignments in SAM format, we can use a
standard genome browser to view the alignments. In this example, I will show the
alignments using Integrative Genomics Viewer
(IGV).
Before visualizing the alignments, make sure that you have copies of the
subjects.fasta and queries.sam files on the computer that will be running
IGV, and make sure you can find these files on your system.
Downloading, installing, and launching IGV
IGV can be installed from https://igv.org/doc/desktop/#DownloadPage/ . Select the download most appropriate for your system. In most cases, you will want to install a version with Java included. (Otherwise, IGV will try to use the system Java installation, which may not be compatible.)
On Windows, run and complete the installer. You should be able to find IGV in your start menu.
On macOS, open the .zip file you downloaded and move the app within to your
Applications folder, then launch IGV from there.
On Linux, unzip the downloaded archive and run igv.sh.
Loading the subjects.fasta file
Before we can load the alignments, we need to load the reference sequences into IGV. IGV considers any reference sequences to be "genomes," but it will work with our transcripts as well.
In the menu bar, under Genomes, select "Load Genome from File...". A file
selection dialog will appear; select the subjects.fasta file from the
e19_scaffolds-JAFEMN000000000 search.
Loading the queries.sam file
Now that the references (subjects) have been loaded, we can open alignments to display as features along the subjects.
In the menu, under File, select "Load from File..." and choose the
queries.sam file from the file selection dialog that appears.
You will most likely be prompted about a missing index file for the
queries.sam alignments. Click "Go" to create the index file automatically.
Viewing the alignments
By default, IGV is set to try to view "All chromosomes," i.e., all subject sequences, simultaneously. In this mode, no alignments will appear, so we need to select a "chromosome" (subject sequence) from the drop down menu at the top of the window. Let's select the first transcript from ideal component 6249. You should see alignments like the ones below:
You can repeat these steps for the subjects.fasta and queries.sam from the
e4305_Mas339_20200623_Ref_Scaffolds_CLS search. In both cases, you should see
should see the alignments span most of the transcript, and if you look at the
other transcripts from the same ideal component, you should see the same thing
for the other alignments. This is true even for the E− samples, such as
SRR2321385, and the alignments clearly span much more than the length of a
single read. Thus, it is unlikely that misassembled transcripts explain the hits
matching the E−samples.
Searching the Nucleotide database for matched transcripts
One other possible explanation for the alignments we see is that the genome assemblies could inadvertently have included some sequences from the host plant. We can check if this is true by BLASTing the transcripts in the matched exported ideal components against the NCBI non-redundant Nucleotide database. From such a BLAST search, we could see in which kinds of organisms we get hits. We will also learn more about the identity of the matching sequences, which might help explain the matches to the E− samples.
First, let's concatenate the exported transcripts from the ideal components where all samples had matching transcripts. For this run, we want ideal components \(6249\) and \(6777\), but check the results you got at the end of the "Checking represented samples in matching component" section to make sure you are using the right ideal component IDs here.
cat "$TUTORIAL_DIR"/full_ec_search_out/*/export/ideal_component_6249.fasta \
"$TUTORIAL_DIR"/full_ec_search_out/*/export/ideal_component_6777.fasta \
> "$TUTORIAL_DIR"/matching_components.fasta
Now, we will search the created matching_components.fasta against the NCBI
Nucleotide database. (You may prefer to do this step in the web BLAST interface
to avoid rate limits.)
blastn -query "$TUTORIAL_DIR"/matching_components.fasta -evalue 1e-99 -remote \
-db nr -outfmt 6 -out "$TUTORIAL_DIR"/remote_results
If you view the subject sequence IDs in the resulting remote_results file and
search for them within the
Nucleotide database, you should find
that the sequences match various fungal sequences closely. Further investigation
of the sequences we found reveals that they code for highly conserved proteins,
and we are likely getting matches to both a plant and the fungal homolog of the
gene. This explains why we get matches for the endophyte sequence even in
E− samples.
Export and search ideal components in the subset analysis
Now that we've looked at the results for the full analysis, we will see how our
results differ for the subset analysis that included only E− samples. Run
export_and_search again but provide it with the configuration file for the
subset analysis and a different output directory.
python -m rna_clique.export_and_search
-C "$TUTORIAL_DIR"/infected_subset_out/config.yaml \
-Q "$TUTORIAL_DIR"/ec_genomes/*.fasta \
-X "$TUTORIAL_DIR"/subset_ec_search_out \
-e 1e-99
Examining results for the subset analysis
The output export and search files should be located at
$TUTORIAL_DIR/subset_ec_search_out/infected_subset_out. Check the stats
files under search_e19_scaffolds-JAFEMN000000000 and
search_e4305_Mas339_20200623_Ref_Scaffolds_CLS. You should see that the number
of components in which matching transcripts have been found has increased
dramatically compared to the results for the original set of samples, from
around 15 to over 500 for both genomes. Although the subset only had around
120% as many ideal components as the original set, the subset has at least 3200%
as many ideal components matching the endophyte genomes. This suggests that
RNA-clique filters out some of the endophyte transcripts when the full analysis
with E- samples is performed, but RNA-clique cannot filter such transcripts when
all the samples are E+.