RNA-clique Python API
This document describes some options for using RNA-clique in your own Python code. Most programs included in RNA-clique (described in the Command-line usage guide) have corresponding functions that can be called to perform the same operations of the program from Python code. Just as different programs offer finer-grained control of the RNA-clique analysis, the corresponding functions allow for customized analyses from within custom Python code.
Running a full RNA-clique analysis
The rna_clique function in the
rna_clique module
provides a way to perform a full RNA-clique analysis from Python code.
Unlike the rna-clique program, the rna_clique function must be provided
with explicit paths for the outputs. The parameters accepted by RNA-clique are
summarized in the table below.
| Formal parameter | Description | Default |
|---|---|---|
dirs |
Paths to the transcriptomes. | |
out_dir_1 |
Directory for the top \(n\) genes of each transcriptome. | |
out_dir_2 |
Directory for gene matches tables. | |
cache_dir |
Directory for the simple_blast BLAST database caches |
|
output_graph |
Output gene matches graph. | |
output_matrix |
Output distance matrix. | |
top_genes |
Number of top genes to select by \(k\)-mer coverage. | |
top_matches |
Count a match if a gene pair is among this many top pairs in both directions. | 1 |
id_parser |
Function for parsing transcript IDs. | default_parser |
evalue |
e-value cutoff to use for BLAST searches. | 1e-99 |
keep_all |
Whether to keep all gene pairs in the case of a tie by bitscore. | True |
store_dfs |
Whether to store the computed dataframes in memory. | False |
jobs |
Number of parallel jobs to use. | multiprocessing.cpu_count() - 1 |
from rna_clique.rna_clique import rna_clique
from pathlib import Path
out_dir = Path("rna_clique_out")
out_dir.mkdir(exist_ok=True)
# Get the SampleSimilarity object and a dict mapping paths to their sample
# names.
sim, path_to_sample = rna_clique(
[
Path("path/to/transcriptome1_dir"),
Path("path/to/transcriptome2_dir"),
Path("path/to/transcriptome3_dir"),
],
out_dir_1=out_dir / "od1",
out_dir_2=out_dir / "od2",
cache_dir=out_dir / "db_cache",
output_graph=output_dir / "graph.pkl",
output_matrix=output_dir / "matrix.h5",
top_genes=50000
)
print(sim.get_dissimilarity_df())
The rna_clique function returns two values. The first value is a
rna_clique.filtered_distance.SampleSimilarity object that can be used to get
the "dissimilarity" (distance) matrix as a Pandas
DataFrame
using the get_dissimilarity_df method as in the example above. The second
value is a dictionary mapping each sample's top genes path to its sample name.
SampleSimilarity objects can also provide other information about the
filtering process.
from multiset_key_dict import MultisetKeyDict
# Get number of samples.
number_of_samples: int = sim.sample_count
# Get a list of samples used in the analysis.
samples: list[Path] = sim.samples
# Get a DataFrame containing the genes found in ideal components.
ideal_component_genes: pd.DataFrame = sim.valid
# Get the gene matches tables.
tables: MultisetKeyDict = sim.comparison_dfs
# Get specific gene matches table.
transcriptome1_vs_2: pd.DataFrame = tables[
[
"path/to/trancsriptome1_dir",
"path/to/transcriptome2_dir"
]
]
# Get gene matches tables, filtered to include only genes in ideal components.
filtered_tables = MultisetKeyDict(sim.restricted_comparison_dfs())
# Get gene matches graph.
graph: nx.Graph = sim.graph
Run individual RNA-clique steps
It is possible to use the RNA-clique Python API to run individual steps of the RNA-clique method. As of version 0.2, RNA-clique supports running all steps through its Python API.
Select top genes by k-mer coverage
You can select top \(n\) genes by \(k\)-mer coverage and save the results for a
single transcriptome using the select_top_and_save function of the
rna_clique.select_top_genes_all module. select_top_genes_and_save takes at
least four parameters and returns the name of the output file and the inferred
sample name.
| Formal parameter | Description | Default |
|---|---|---|
out_dir |
Output directory in which to save top \(n\) genes. | |
transcripts |
Name (basename) of FASTA file containing transcripts. | |
x |
Directory containing the transcripts FASTA file. | |
top |
Number of top genes to select. | |
parse_transcript_id |
Function for parsing transcript IDs. | default_parser |
default_parser in the table above is rna_clique.transcripts.default_parser,
a function that uses the default regular expression designed for rnaSPAdes
assemblies, ^.*cov_([0-9]+(?:\.[0-9]+))_g([0-9]+)_i([0-9]+). For more
information about functions for parsing transcript IDs, see the "Working with
transcript IDs" section.
from rna_clique.select_top_genes_all import select_top_and_save
# Select top genes for "path/to/transcriptome1/transcripts.fasta" and save to
# "top_genes_dir"
top_genes_file, sample_name = select_top_genes_and_save(
"top_genes_dir",
"transcripts.fasta",
Path("path/to/transcriptome1"),
50000,
)
If you don't wish to save the selected top genes, you can also use the
TopGeneSelector class from the rna_clique.select_top_genes
module. TopGeneSelector can work on genes that are not read from a file; to
accomplish this, it requires as the first argument to its constructor a nullary
function that produces an iterator of Bio.SeqRecord objects from which to
select top genes. If you would rather simply pass TopGeneSelector a FASTA file
from which to select top genes, use the from_path classmethod instead of the
constructor. from_path requires at least two arguments.
| Formal parameter | Description | Default |
|---|---|---|
path |
Path to FASTA file containing transcripts. | |
top |
Number of top genes to select. | |
parse_transcript_id |
Function for parsing transcript IDs. | default_parser |
from rna_clique.select_top_genes import TopGeneSelector
selector = TopGeneSelector.from_path(
Path("path/to/transcriptome1/transcripts.fasta"),
50000
)
To get the top genes, use the get_top_gene_seqs method.
Alternatively, to just get the gene IDs of the top genes (which is faster), use
the get_top_genes method instead.
Getting gene matches tables
You can obtain gene matches tables for all pairs of some set of samples via the
find_all_pairs function from the rna_clique.find_all_pairs
module. find_all_pairs requires at least four arguments and returns an
iterator over the gene matches tables, an iterator over the paths to the gene
matches tables, and the total number of gene matches tables computed. The
arguments to find_all_pairs are summarized in the table below.
| Formal parameter | Description | Default value |
|---|---|---|
inputs |
Paths to top \(n\) genes for samples. | |
output_dir |
Directory in which to save gene matches tables. | |
cache_dir |
Directory for the simple_blast BLAST database caches |
|
path_to_sample |
Function mapping transcript FASTA file paths to sample names. | |
hf_args |
Arguments to pass to HomologFinder. |
[] |
jobs |
Number of parallel jobs to use. | multiprocessing.cpu_count() - 1 |
find_all_pairs uses the find_homologs_and_save function, which finds the
gene matches table for just a single pair of samples. find_all_pairs needs at
least three arguments and returns the gene matches table that was computed and
saved. The arguments for find_homologs_and_save are summarized below.
| Formal parameter | Description | Default value |
|---|---|---|
transcripts1 |
Path to top \(n\) genes of first sample. | |
transcripts2 |
Path to top \(n\) genes of second sample. | |
out_path |
File in which to save gene matches table. | |
hf_args |
Arguments to pass to HomologFinder. |
None |
hf_kwargs |
Keyword arguments to pass to HomologFilter. |
None |
In turn, find_homologs_and_save uses the HomologFinder class from the
rna_clique.find_homologs module. HomologFinder can find the gene matches
table for a pair of samples without saving it to a file. Many of
HomologFinder's constructor's arguments are relevant to the BLAST search used
for getting a gene matches table. The constructor arguments are described below.
| Formal parameter | Description | Default value |
|---|---|---|
parse_transcript_id |
Function for parsing transcript IDs. | |
top_n |
Number of top hits to select for each query gene (big \(N\)). | |
evalue |
e-value cutoff to use for BLAST search. | |
keep_all |
Keep all matches in the case of ties. | |
debug |
Enable debug behavior. | False |
If any additional keyword arguments are given to the HomologFinder
constructor, they are passed to the constructor for the simple_blast BLAST
search used.
Once a HomologFinder has been constructed, its get_match_table method can be
used to find the gene matches table for two sets of top \(n\) genes. The paths to
these top \(n\) genes are the parameters to the get_match_table method.
from rna_clique.transcripts import default_parser
# Get all gene matches tables with find_all_pairs.
from rna_clique.find_all_pairs import find_all_pairs
tables_iter, paths_iter, count = find_all_pairs(
[
Path("top/transcriptome1_top.fasta"),
Path("top/transcriptome2_top.fasta"),
Path("top/transcriptome2_top.fasta"),
],
Path("gene_matches_tables"),
Path("db_cache"),
# Could also just use rna_clique.path_to_sample.path_to_sample here.
path_to_sample={
Path("top/transcriptome1_top.fasta"): "transcriptome1",
Path("top/transcriptome2_top.fasta"): "transcriptome2",
Path("top/transcriptome2_top.fasta"): "transcriptome3",
}.__getitem__,
hf_args=[
default_parser,
1,
1e-99,
True
]
)
# Get one gene matches table with find_homologs_and_save.
from rna_clique.find_all_pairs import find_homologs_and_save
table = find_homologs_and_save(
Path("top/transcriptome1_top.fasta"),
Path("top/transcriptome2_top.fasta"),
Path("gene_matches_tables/transcriptome1--transcriptome2.h5"),
hf_args=[
default_parser,
1,
1e-99,
True
],
)
# Get one gene matches table with FindHomologs.
from rna_clique.find_homologs import HomologFinder
finder = HomologFinder(
default_parser,
1,
1e-99,
True
)
tble = finder.get_match_table(
Path("top/transcriptome1_top.fasta"),
Path("top/transcriptome2_top.fasta"),
)
Getting a gene matches graph
A gene matches graph can be constructed from gene matches tables using the
build_graph function in the rna_clique.build_graph module. build_graph
takes only one argument—an iterable of Pandas DataFrames representing the
gene matches tables.
For information on loading gene matches tables that have already been saved to disk, see the "Loading gene matches tables" section.
Getting ideal components
You can obtain the ideal components from a gene matches graph using the
get_ideal_components function in the rna_clique.filtered_distance module.
get_ideal_components requires two arguments and returns a list of NetworkX
Graph objects representing the ideal components. The arguments required by
get_ideal_components are summarized in the table below.
| Formal parameter | Description |
|---|---|
g |
The graph from which to get ideal components. |
samples |
The number of samples in the analysis. |
from rna_clique.filtered_distance import get_ideal_components
ideal_components: list[nx.Graph] = get_ideal_components(
graph,
number_of_samples
)
Filtering gene matches tables
Gene matches tables are most easily filtered to include only genes from ideal
components using the rna_clique.filtered_distance.SampleSimilarity class.
SampleSimilarity's constructor arguments are summarized below.
| Formal parameter | Description | |
|---|---|---|
graph |
Gene matches graph | |
comparison_dfs |
Iterable of gene matches tables. | |
sample_count |
Number of samples in the analysis. | None |
When sample_count is not provided, SampleSimilarity counts the number of
samples from the gene matches graph.
Once a SampleSimilarity object has been constructed, the filtered gene matches
tables can be retrieved via the restricted_comparison_dfs method, which yields
pairs. The first element in each pair is a frozenset containing the samples
that the table is for. The second element is the table itself. The iterator
returned by restricted_comparison_dfs can be passed to MultisetKeyDict's
constructor to get a mapping from pairs of samples to filtered gene matches
tables.
from rna_clique.filtered_distance import SampleSimilarity
from multiset_key_dict import MultisetKeyDict
sim = SampleSimilarity(graph, tables)
filtered_tables = MultisetKeyDict(sim.restricted_comparison_dfs())
# Get an individual table.
table = filtered_tables[
[
Path("od1/transcriptome1_top.fasta"),
Path("od1/transcriptome2_top.fasta")
]
]
Getting a genetic distance
If you have an rna_clique.filtered_distance.SampleSimilarity object, getting
the genetic distance is as simple as calling the get_dissimilarity_df()
method.
You can also obtain similarities from arbitrary (filtered or unfiltered) gene
matches tables using the similarities_from_dfs function from the
rna_clique.similarity_computer module. similarities_from_dfs accepts just
one argument—an iterable of pairs. The first element of each pair should
be an unordered pair of samples that the table is for. The second element of
each pair should be the table itself.
from rna_clique.similarity_computer import similarities_from_dfs
similarities = similarities_from_dfs(tables)
Each distance can then be computed as 1 - similarity.
from multiset_key_dict import MultisetKeyDict
distances = MultisetKeyDict((k, 1 - v) for (k, v) in similarities)
Working with transcript IDs
RNA-clique expects to be able to read various metadata about transcripts in transcriptomes from the transcripts' FASTA headers (also sometimes called "transcript IDs" in this guide). In order to retrieve such metadata from a transcript FASTA header, RNA-clique must be able to parse the transcript IDs.
RNA-clique was originally designed with rnaSPAdes assemblies in mind, and although the defaults in both the command line and Python API reflect this original design, RNA-clique is flexible enough to allow custom parsing for non-SPAdes assemblies. At the command-line, RNA-clique provides this flexibility by allowing the user to provide a custom regular expression for parsing transcript IDs. In its Python API, RNA-clique provides even more flexibility via functional programming.
A transcript ID parser is a unary function that maps a transcript's FASTA ID to
an rna_clique.transcripts.TranscriptID object. A TranscriptID object is a
kind of namedtuple with three fields, summarized in the table below.
| Position | Type | Name | Description |
|---|---|---|---|
| 0 | float |
coverage | \(k\)-mer coverage, expressed as a floating-point number |
| 1 | int |
gene | Gene ID, a non-negative integer |
| 2 | int |
isoform | Transcript isoform ID within the gene. |
Unlike an ordinary namedtuple, TranscriptID automatically casts the
parameters at its construction to the types shown in the table, ensuring that
the values of the fields always have the types given.
For example, a function that simply splits the FASTA ID by colons (:) and
takes the last three elements as the transcript ID is shown below.
from rna_clique.transcripts import TranscriptID
def custom_transcript_id_parser(x):
return TranscriptID(*x.split(":")[-3:])
You can also create a parser from a regular expression in the same fashion as
the command-line interface using the TranscriptID.parser_from_re classmethod.
For example,
import re
my_parser = TranscriptID.parser_from_re(re.compile(r"([^:]*):([^:]*):([^:]*)$"))
parsed = my_parser("12:13:14")
print(parsed) # TranscriptID(coverage=12.0, gene=13, isoform=14)
Loading gene matches tables
An individual gene matches table file can
be loaded using the read_table function from the
rna_clique.gene_matches_tables module. read_table requires at least one
argument, the path to the gene matches table to load, and it returns a Pandas
DataFrame representing the loaded gene matches table.
from rna_clique.gene_matches_tables import read_table
table = read_table(Path("od2/transcriptome1--transcriptome2.h5"))
To get a list of gene matches tables in a directory, use get_table_files from
the same module.
from rna_clique.gene_matches_tables import get_table_files
table_files = get_table_files(Path("od2"))
tables = [read_table(t) for t in table_files]
Exporting orthologs (ideal components)
You can export orthologs from Python code using the OrthologExporter class
found in the rna_clique.export_orthologs module. The OrthologExporter class
requires at least two arguments for construction. The parameters accepted by the
constructor are summarized in the table below.
| Formal parameter | Description | Default value |
|---|---|---|
sim |
SampleSimilarity for an analysis. |
|
parse_transcript_id |
Function for parsing transcript IDs. | |
non_contributing |
Whether to filter components where there are no differences in aligned regions. | True |
consistent_strands |
Whether to reorient transcripts to be consistent per ideal component. | True |
allow_inconsistent |
Whether to continue when basic transcript reorientation method fails. | True |
jobs |
Number of parallel jobs to use. | 1 |
debug |
Whether to enable debug behavior. | False |
After construction, calling an OrthologExporter object's by_component or
by_sample method will export the orthologs, organizing them into files by
component or by sample, respectively. The by_sample and by_component methods
also accept a few different options. These options are summarized in the table
below.
| Formal parameter | Description | Default value |
|---|---|---|
out_dir |
Output directory for exported orthologs. | |
rename |
Function to give a new name to a transcript from its sample ID, gene, and isoform ID. | None |
order |
Whether to put the new name before or after the original name. | after |
set_rlimit |
(by_component only) Try to Set rlimit for writing many files at once. |
True |
make_all |
Create combined file containing all exported orthologs. | True |
Importantly make_all must be set to True if you wish to later search
the ortholog sequences because
the search function requires a path to the combined file with all the
exported orthologs.
by_sample and by_component both return dictionaries mapping each group by
which the orthologs were organized to the path to the file in which the
orthologs in that group were saved. For by_sample, the dict maps each sample
name to the file in which orthologs from that sample are stored. For
by_component, the dict maps each ideal component ID (expressed as an
integer) to the file in which orthologs from that ideal component are stored.
from rna_clique.export_orthologs import OrthologExporter
from rna_clique.transcripts import default_parser
exporter = OrthologExporter(sim, default_parser)
component_to_path = exporter.by_component(Path("exported_orthologs"))
Searching ideal components (exported orthologs)
Exported orthologs can be searched using the search function from the
rna_clique.search_ideal_components module. search requires at least five
arguments but can accept many optional arguments. The possible parameters are
described in the table below.
| Formal parameter | Description | Default value |
|---|---|---|
sim |
SampleSimilarity for analysis in which search is performed. |
|
exported |
Path to combined FASTA file containing all exported orthologs. | |
db_cache_loc |
Directory for the simple_blast BLAST database caches. |
|
out_dir |
Output directory in which to store search results. | |
query |
Path to FASTA file containing query sequences. | |
parse_transcript_id |
Function for parsing transcript IDs. | default_parser |
path_to_sample |
Function retrieving sample names from top gene paths. | path_to_sample |
extended_evalue |
e-value cutoff for extended search. None disables extended search. |
None |
export_components |
Whether to export components in which matches are found. | True |
merge_sams |
Whether to merge extended searach results into one SAM file. | False |
strand_graph |
Orientation graph (strand graph) to use. Will be created if not provided. | False |
node_to_ccc |
Mapping from gene ID, isoform ID pairs to components in the meta-strand graph. Will be created if not provided. | None |
evalue |
e-value cutoff to use for initial searches. | None |
jobs |
Number of parallel jobs to use. | 1e-50 |
debug |
Enable debug behavior. | False |
If you have an OrthologExporter object, you can speed search up by providing
it with the OrthologExporter's strand_graph and
node_to_component_component attributes for its strand_graph and
node_to_ccc parameters, respectively,
search returns a rna_clique.search_ideal_components.SearchResult object,
which is a namedtuple containing attributes hits, seqs, and
components. hits is the number of BLAST HSPs to the ideal components. seqs
is the number of transcripts producing HSPs. components is the number of
components producing HSPs.
from rna_clique.search_ideal_components import search, SearchResult
result: SearchResult = search(
sim,
Path("exported_orthologs/all_ideal.fasta"),
Path("db_cache"),
Path("search_results),
Path("queries.fasta"),
)
# Print the number of total HSPs, transcripts and components producing HSPs.
print(*result)
# Get a faster result by providing strand_graph and node_to_cc from the
# OrthologExporter.
fast_result: SearchResult = search(
sim,
Path("exported_orthologs/all_ideal.fasta"),
Path("db_cache"),
Path("search_results),
Path("queries.fasta"),
strand_graph=exporter.strand_graph,
node_to_ccc=exporter.node_to_component_component,
)
Working with configuration objects
RNA-clique supports configuration via YAML configuration files as well as via a command-line interface. Additionally, some RNA-clique programs will automatically produce a configuration file for reproducing an analysis. Althoug using RNA-clique configuration files in not necessary to use the Python API, it can sometimes be convenient to be able to read or create configuration files from your own Python code.
To load an RNA-clique configuration file, use the
rna_clique.config.RNACliqueConfig.yaml_load classmethod. yaml_load takes
just one argument—the path to the configuration file—and loads and
returns the configuration at that path.
import config as config_module
config = config_module.RNACliqueConfig.yaml_load(Path("path/to/config.yaml"))
The attributes of configuration objects are documented in the Configuration
guide. To save an RNACliqueConfig object, use the yaml_save
method. yaml_save takes one argument, the output path.
Making subsets
To create a subset of an existing analysis from Python code, you can use the
rna_clique.make_subset.SubsetAnalysisCreator class. SubsetAnalysisCreator's
constructor requires three arguments, which are summarized in the table below.
| Formal parameter | Description |
|---|---|
matches |
Predicate indicating whether a path should be included in the subset. |
super_config |
RNACliqueConfig for the parent (original, superset) analysis. |
config |
Partial RNACliqueConfig for the subset analysis. |
Each of these parameters requires some further explanation.
First, the matches parameter is a predicate—i.e., a function that
returns a bool. It should take the Path to a top
genes file and return True if the Path should be
included in the subset, or False otherwise. The accepted matches parameter
is a function to provide maximum flexibility, but some other functions
(discussed below) are also provided to make creating such functions easier.
Second, the super_config parameter is the RNACliqueConfig for the original
analysis to be subsetted. The super_config is used to get the location of the
original gene matches tables and the paths to the top genes files.
Third, the config parameter is the RNACliqueConfig for the subset
analysis. Since the subset has not yet been created, config is not expected to
have all of its attributes assigned. The only required (non-None) attributes
are table_dir and graph—these paths are where the
SubsetAnalysisCreator will put the gene matches table links and new gene
matches graph for the subset analysis. SubsetAnalysisCreator does not modify
either RNACliqueConfig passed to it in the constructor but does create and
modify a copy of config.
After constructing a SubsetAnalysisCreator, calling the object's make method
creates the analysis using the parameters provided to the constructor. After
make has been called, the config attribute of the SubsetAnalysisCreator
instance will be a complete RNACliqueConfig for the subset analysis.
from rna_clique.make_subset import SubsetAnalysisCreator
from rna_clique.config import RNACliqueConfig
child_config = RNACliqueConfig()
child_config.tables_dir = Path("subset/tables_dir")
child_confid.graph = Path("subset/graph.pkl")
subsetter = SubsetAnalysisCreator(
# Exclude samples that start with the letter "B".
lambda x: not x.name.startswith("B"),
parent_config,
child_config
)
subsetter.make()
Since creating a child analysis RNACliqueConfig just to provide it to
SubsetAnalysisCreator is somewhat cumbersome, SubsetAnalysisCreator also
provides a classmethod, from_paths that allows one to simply provide the
subset analysis' tables_dir and graph directly to
SubsetAnalysisCreator. The parameters accepted by that classmethod are
described in the table below.
| Formal parameter | Description |
|---|---|
matches |
Predicate indicating whether a path should be included in the subset. |
super_config |
RNACliqueConfig for the parent (original, superset) analysis. |
tables_dir |
Directory in which to store child's links to parent's gene matches tables. |
graph |
Path at which to create gene matches graph for child analysis. |
from rna_clique.make_subset import SubsetAnalysisCreator
subsetter = SubsetAnalysisCreator.from_paths(
# Exclude samples that start with the letter "B".
lambda x: not x.name.startswith("B"),
parent_config,
Path("subset/tables_dir"),
Path("subset/graph.pkl"),
)
subsetter.make()
Visualization
Although visualization is not one of the primary features of RNA-clique, RNA-clique nevertheless provides some code designed for visualization. The API for those function is described in a separate document.