RNA-clique Python API

This document describes some options for using RNA-clique in your own Python code. Most programs included in RNA-clique (described in the Command-line usage guide) have corresponding functions that can be called to perform the same operations of the program from Python code. Just as different programs offer finer-grained control of the RNA-clique analysis, the corresponding functions allow for customized analyses from within custom Python code.

Running a full RNA-clique analysis

The rna_clique function in the rna_clique module provides a way to perform a full RNA-clique analysis from Python code.

Unlike the rna-clique program, the rna_clique function must be provided with explicit paths for the outputs. The parameters accepted by RNA-clique are summarized in the table below.

Formal parameter	Description	Default
`dirs`	Paths to the transcriptomes.
`out_dir_1`	Directory for the top \(n\) genes of each transcriptome.
`out_dir_2`	Directory for gene matches tables.
`cache_dir`	Directory for the `simple_blast` BLAST database caches
`output_graph`	Output gene matches graph.
`output_matrix`	Output distance matrix.
`top_genes`	Number of top genes to select by \(k\)-mer coverage.
`top_matches`	Count a match if a gene pair is among this many top pairs in both directions.	`1`
`id_parser`	Function for parsing transcript IDs.	`default_parser`
`evalue`	e-value cutoff to use for BLAST searches.	`1e-99`
`keep_all`	Whether to keep all gene pairs in the case of a tie by bitscore.	`True`
`store_dfs`	Whether to store the computed dataframes in memory.	`False`
`jobs`	Number of parallel jobs to use.	`multiprocessing.cpu_count() - 1`

from rna_clique.rna_clique import rna_clique
from pathlib import Path

out_dir = Path("rna_clique_out")
out_dir.mkdir(exist_ok=True)
# Get the SampleSimilarity object and a dict mapping paths to their sample
# names.
sim, path_to_sample = rna_clique(
    [
        Path("path/to/transcriptome1_dir"),
        Path("path/to/transcriptome2_dir"),
        Path("path/to/transcriptome3_dir"),
    ],
    out_dir_1=out_dir / "od1",
    out_dir_2=out_dir / "od2",
    cache_dir=out_dir / "db_cache",
    output_graph=output_dir / "graph.pkl",
    output_matrix=output_dir / "matrix.h5",
    top_genes=50000
)
print(sim.get_dissimilarity_df())

The rna_clique function returns two values. The first value is a rna_clique.filtered_distance.SampleSimilarity object that can be used to get the "dissimilarity" (distance) matrix as a Pandas DataFrame using the get_dissimilarity_df method as in the example above. The second value is a dictionary mapping each sample's top genes path to its sample name.

SampleSimilarity objects can also provide other information about the filtering process.

from multiset_key_dict import MultisetKeyDict

# Get number of samples.
number_of_samples: int = sim.sample_count

# Get a list of samples used in the analysis.
samples: list[Path] = sim.samples

# Get a DataFrame containing the genes found in ideal components.
ideal_component_genes: pd.DataFrame = sim.valid

# Get the gene matches tables.
tables: MultisetKeyDict = sim.comparison_dfs
# Get specific gene matches table.
transcriptome1_vs_2: pd.DataFrame = tables[
    [
        "path/to/trancsriptome1_dir",
        "path/to/transcriptome2_dir"
    ]
]

# Get gene matches tables, filtered to include only genes in ideal components.
filtered_tables = MultisetKeyDict(sim.restricted_comparison_dfs())

# Get gene matches graph.
graph: nx.Graph = sim.graph

Run individual RNA-clique steps

It is possible to use the RNA-clique Python API to run individual steps of the RNA-clique method. As of version 0.2, RNA-clique supports running all steps through its Python API.

Select top genes by k-mer coverage

You can select top \(n\) genes by \(k\)-mer coverage and save the results for a single transcriptome using the select_top_and_save function of the rna_clique.select_top_genes_all module. select_top_genes_and_save takes at least four parameters and returns the name of the output file and the inferred sample name.

Formal parameter	Description	Default
`out_dir`	Output directory in which to save top \(n\) genes.
`transcripts`	Name (basename) of FASTA file containing transcripts.
`x`	Directory containing the transcripts FASTA file.
`top`	Number of top genes to select.
`parse_transcript_id`	Function for parsing transcript IDs.	`default_parser`

default_parser in the table above is rna_clique.transcripts.default_parser, a function that uses the default regular expression designed for rnaSPAdes assemblies, ^.*cov_([0-9]+(?:\.[0-9]+))_g([0-9]+)_i([0-9]+). For more information about functions for parsing transcript IDs, see the "Working with transcript IDs" section.

from rna_clique.select_top_genes_all import select_top_and_save

# Select top genes for "path/to/transcriptome1/transcripts.fasta" and save to
# "top_genes_dir"
top_genes_file, sample_name = select_top_genes_and_save(
    "top_genes_dir",
    "transcripts.fasta",
    Path("path/to/transcriptome1"),
    50000,
)

If you don't wish to save the selected top genes, you can also use the TopGeneSelector class from the rna_clique.select_top_genes module. TopGeneSelector can work on genes that are not read from a file; to accomplish this, it requires as the first argument to its constructor a nullary function that produces an iterator of Bio.SeqRecord objects from which to select top genes. If you would rather simply pass TopGeneSelector a FASTA file from which to select top genes, use the from_path classmethod instead of the constructor. from_path requires at least two arguments.

Formal parameter	Description	Default
`path`	Path to FASTA file containing transcripts.
`top`	Number of top genes to select.
`parse_transcript_id`	Function for parsing transcript IDs.	`default_parser`

from rna_clique.select_top_genes import TopGeneSelector

selector = TopGeneSelector.from_path(
    Path("path/to/transcriptome1/transcripts.fasta"),
    50000
)

To get the top genes, use the get_top_gene_seqs method.

top_genes: list[Bio.SeqRecord] = list(selector.get_top_gene_seqs())

Alternatively, to just get the gene IDs of the top genes (which is faster), use the get_top_genes method instead.

top_gene_ids: list[int] = list(select.get_top_genes())

Getting gene matches tables

You can obtain gene matches tables for all pairs of some set of samples via the find_all_pairs function from the rna_clique.find_all_pairs module. find_all_pairs requires at least four arguments and returns an iterator over the gene matches tables, an iterator over the paths to the gene matches tables, and the total number of gene matches tables computed. The arguments to find_all_pairs are summarized in the table below.

Formal parameter	Description	Default value
`inputs`	Paths to top \(n\) genes for samples.
`output_dir`	Directory in which to save gene matches tables.
`cache_dir`	Directory for the `simple_blast` BLAST database caches
`path_to_sample`	Function mapping transcript FASTA file paths to sample names.
`hf_args`	Arguments to pass to `HomologFinder`.	`[]`
`jobs`	Number of parallel jobs to use.	`multiprocessing.cpu_count() - 1`

find_all_pairs uses the find_homologs_and_save function, which finds the gene matches table for just a single pair of samples. find_all_pairs needs at least three arguments and returns the gene matches table that was computed and saved. The arguments for find_homologs_and_save are summarized below.

Formal parameter	Description	Default value
`transcripts1`	Path to top \(n\) genes of first sample.
`transcripts2`	Path to top \(n\) genes of second sample.
`out_path`	File in which to save gene matches table.
`hf_args`	Arguments to pass to `HomologFinder`.	`None`
`hf_kwargs`	Keyword arguments to pass to `HomologFilter`.	`None`

In turn, find_homologs_and_save uses the HomologFinder class from the rna_clique.find_homologs module. HomologFinder can find the gene matches table for a pair of samples without saving it to a file. Many of HomologFinder's constructor's arguments are relevant to the BLAST search used for getting a gene matches table. The constructor arguments are described below.

Formal parameter	Description	Default value
`parse_transcript_id`	Function for parsing transcript IDs.
`top_n`	Number of top hits to select for each query gene (big \(N\)).
`evalue`	e-value cutoff to use for BLAST search.
`keep_all`	Keep all matches in the case of ties.
`debug`	Enable debug behavior.	`False`

If any additional keyword arguments are given to the HomologFinder constructor, they are passed to the constructor for the simple_blast BLAST search used.

Once a HomologFinder has been constructed, its get_match_table method can be used to find the gene matches table for two sets of top \(n\) genes. The paths to these top \(n\) genes are the parameters to the get_match_table method.

from rna_clique.transcripts import default_parser

# Get all gene matches tables with find_all_pairs.
from rna_clique.find_all_pairs import find_all_pairs

tables_iter, paths_iter, count = find_all_pairs(
    [
        Path("top/transcriptome1_top.fasta"),
        Path("top/transcriptome2_top.fasta"),
        Path("top/transcriptome2_top.fasta"),
    ],
    Path("gene_matches_tables"),
    Path("db_cache"),
    # Could also just use rna_clique.path_to_sample.path_to_sample here.
    path_to_sample={
        Path("top/transcriptome1_top.fasta"): "transcriptome1",
        Path("top/transcriptome2_top.fasta"): "transcriptome2",
        Path("top/transcriptome2_top.fasta"): "transcriptome3", 
    }.__getitem__,
    hf_args=[
        default_parser,
        1,
        1e-99,
        True
    ]
)

# Get one gene matches table with find_homologs_and_save.
from rna_clique.find_all_pairs import find_homologs_and_save

table = find_homologs_and_save(
    Path("top/transcriptome1_top.fasta"),
    Path("top/transcriptome2_top.fasta"),
    Path("gene_matches_tables/transcriptome1--transcriptome2.h5"),
    hf_args=[
        default_parser,
        1,
        1e-99,
        True
    ],
)

# Get one gene matches table with FindHomologs.
from rna_clique.find_homologs import HomologFinder

finder = HomologFinder(
    default_parser,
    1,
    1e-99,
    True
)
tble = finder.get_match_table(
    Path("top/transcriptome1_top.fasta"),
    Path("top/transcriptome2_top.fasta"),
)

Getting a gene matches graph

A gene matches graph can be constructed from gene matches tables using the build_graph function in the rna_clique.build_graph module. build_graph takes only one argument—an iterable of Pandas DataFrames representing the gene matches tables.

For information on loading gene matches tables that have already been saved to disk, see the "Loading gene matches tables" section.

from rna_clique.build_graph import build_graph

graph = build_graph(tables)

Getting ideal components

You can obtain the ideal components from a gene matches graph using the get_ideal_components function in the rna_clique.filtered_distance module.

get_ideal_components requires two arguments and returns a list of NetworkX Graph objects representing the ideal components. The arguments required by get_ideal_components are summarized in the table below.

Formal parameter	Description
`g`	The graph from which to get ideal components.
`samples`	The number of samples in the analysis.

from rna_clique.filtered_distance import get_ideal_components

ideal_components: list[nx.Graph] = get_ideal_components(
    graph,
    number_of_samples
)

Filtering gene matches tables

Gene matches tables are most easily filtered to include only genes from ideal components using the rna_clique.filtered_distance.SampleSimilarity class. SampleSimilarity's constructor arguments are summarized below.

Formal parameter	Description
`graph`	Gene matches graph
`comparison_dfs`	Iterable of gene matches tables.
`sample_count`	Number of samples in the analysis.	`None`

When sample_count is not provided, SampleSimilarity counts the number of samples from the gene matches graph.

Once a SampleSimilarity object has been constructed, the filtered gene matches tables can be retrieved via the restricted_comparison_dfs method, which yields pairs. The first element in each pair is a frozenset containing the samples that the table is for. The second element is the table itself. The iterator returned by restricted_comparison_dfs can be passed to MultisetKeyDict's constructor to get a mapping from pairs of samples to filtered gene matches tables.

from rna_clique.filtered_distance import SampleSimilarity
from multiset_key_dict import MultisetKeyDict

sim = SampleSimilarity(graph, tables)
filtered_tables = MultisetKeyDict(sim.restricted_comparison_dfs())

# Get an individual table.
table = filtered_tables[
    [
        Path("od1/transcriptome1_top.fasta"),
        Path("od1/transcriptome2_top.fasta")
    ]
]

Getting a genetic distance

If you have an rna_clique.filtered_distance.SampleSimilarity object, getting the genetic distance is as simple as calling the get_dissimilarity_df() method.

dist: pd.DataFrame = sim.get_dissimilarity_df()

You can also obtain similarities from arbitrary (filtered or unfiltered) gene matches tables using the similarities_from_dfs function from the rna_clique.similarity_computer module. similarities_from_dfs accepts just one argument—an iterable of pairs. The first element of each pair should be an unordered pair of samples that the table is for. The second element of each pair should be the table itself.

from rna_clique.similarity_computer import similarities_from_dfs

similarities = similarities_from_dfs(tables)

Each distance can then be computed as 1 - similarity.

from multiset_key_dict import MultisetKeyDict

distances = MultisetKeyDict((k, 1 - v) for (k, v) in similarities)

Working with transcript IDs

RNA-clique expects to be able to read various metadata about transcripts in transcriptomes from the transcripts' FASTA headers (also sometimes called "transcript IDs" in this guide). In order to retrieve such metadata from a transcript FASTA header, RNA-clique must be able to parse the transcript IDs.

RNA-clique was originally designed with rnaSPAdes assemblies in mind, and although the defaults in both the command line and Python API reflect this original design, RNA-clique is flexible enough to allow custom parsing for non-SPAdes assemblies. At the command-line, RNA-clique provides this flexibility by allowing the user to provide a custom regular expression for parsing transcript IDs. In its Python API, RNA-clique provides even more flexibility via functional programming.

A transcript ID parser is a unary function that maps a transcript's FASTA ID to an rna_clique.transcripts.TranscriptID object. A TranscriptID object is a kind of namedtuple with three fields, summarized in the table below.

Position	Type	Name	Description
0	`float`	coverage	\(k\)-mer coverage, expressed as a floating-point number
1	`int`	gene	Gene ID, a non-negative integer
2	`int`	isoform	Transcript isoform ID within the gene.

Unlike an ordinary namedtuple, TranscriptID automatically casts the parameters at its construction to the types shown in the table, ensuring that the values of the fields always have the types given.

For example, a function that simply splits the FASTA ID by colons (:) and takes the last three elements as the transcript ID is shown below.

from rna_clique.transcripts import TranscriptID

def custom_transcript_id_parser(x):
    return TranscriptID(*x.split(":")[-3:])

You can also create a parser from a regular expression in the same fashion as the command-line interface using the TranscriptID.parser_from_re classmethod. For example,

import re

my_parser = TranscriptID.parser_from_re(re.compile(r"([^:]*):([^:]*):([^:]*)$"))
parsed = my_parser("12:13:14")
print(parsed) # TranscriptID(coverage=12.0, gene=13, isoform=14)

Loading gene matches tables

An individual gene matches table file can be loaded using the read_table function from the rna_clique.gene_matches_tables module. read_table requires at least one argument, the path to the gene matches table to load, and it returns a Pandas DataFrame representing the loaded gene matches table.

from rna_clique.gene_matches_tables import read_table

table = read_table(Path("od2/transcriptome1--transcriptome2.h5"))

To get a list of gene matches tables in a directory, use get_table_files from the same module.

from rna_clique.gene_matches_tables import get_table_files

table_files = get_table_files(Path("od2"))
tables = [read_table(t) for t in table_files]

Exporting orthologs (ideal components)

You can export orthologs from Python code using the OrthologExporter class found in the rna_clique.export_orthologs module. The OrthologExporter class requires at least two arguments for construction. The parameters accepted by the constructor are summarized in the table below.

Formal parameter	Description	Default value
`sim`	`SampleSimilarity` for an analysis.
`parse_transcript_id`	Function for parsing transcript IDs.
`non_contributing`	Whether to filter components where there are no differences in aligned regions.	`True`
`consistent_strands`	Whether to reorient transcripts to be consistent per ideal component.	`True`
`allow_inconsistent`	Whether to continue when basic transcript reorientation method fails.	`True`
`jobs`	Number of parallel jobs to use.	`1`
`debug`	Whether to enable debug behavior.	`False`

After construction, calling an OrthologExporter object's by_component or by_sample method will export the orthologs, organizing them into files by component or by sample, respectively. The by_sample and by_component methods also accept a few different options. These options are summarized in the table below.

Formal parameter	Description	Default value
`out_dir`	Output directory for exported orthologs.
`rename`	Function to give a new name to a transcript from its sample ID, gene, and isoform ID.	`None`
`order`	Whether to put the new name before or after the original name.	`after`
`set_rlimit`	(`by_component` only) Try to Set `rlimit` for writing many files at once.	`True`
`make_all`	Create combined file containing all exported orthologs.	`True`

Importantly make_all must be set to True if you wish to later search the ortholog sequences because the search function requires a path to the combined file with all the exported orthologs.

by_sample and by_component both return dictionaries mapping each group by which the orthologs were organized to the path to the file in which the orthologs in that group were saved. For by_sample, the dict maps each sample name to the file in which orthologs from that sample are stored. For by_component, the dict maps each ideal component ID (expressed as an integer) to the file in which orthologs from that ideal component are stored.

from rna_clique.export_orthologs import OrthologExporter
from rna_clique.transcripts import default_parser

exporter = OrthologExporter(sim, default_parser)
component_to_path = exporter.by_component(Path("exported_orthologs"))

Searching ideal components (exported orthologs)

Exported orthologs can be searched using the search function from the rna_clique.search_ideal_components module. search requires at least five arguments but can accept many optional arguments. The possible parameters are described in the table below.

Formal parameter	Description	Default value
`sim`	`SampleSimilarity` for analysis in which search is performed.
`exported`	Path to combined FASTA file containing all exported orthologs.
`db_cache_loc`	Directory for the `simple_blast` BLAST database caches.
`out_dir`	Output directory in which to store search results.
`query`	Path to FASTA file containing query sequences.
`parse_transcript_id`	Function for parsing transcript IDs.	`default_parser`
`path_to_sample`	Function retrieving sample names from top gene paths.	`path_to_sample`
`extended_evalue`	e-value cutoff for extended search. `None` disables extended search.	`None`
`export_components`	Whether to export components in which matches are found.	`True`
`merge_sams`	Whether to merge extended searach results into one SAM file.	`False`
`strand_graph`	Orientation graph (strand graph) to use. Will be created if not provided.	`False`
`node_to_ccc`	Mapping from gene ID, isoform ID pairs to components in the meta-strand graph. Will be created if not provided.	`None`
`evalue`	e-value cutoff to use for initial searches.	`None`
`jobs`	Number of parallel jobs to use.	`1e-50`
`debug`	Enable debug behavior.	`False`

If you have an OrthologExporter object, you can speed search up by providing it with the OrthologExporter's strand_graph and node_to_component_component attributes for its strand_graph and node_to_ccc parameters, respectively,

search returns a rna_clique.search_ideal_components.SearchResult object, which is a namedtuple containing attributes hits, seqs, and components. hits is the number of BLAST HSPs to the ideal components. seqs is the number of transcripts producing HSPs. components is the number of components producing HSPs.

from rna_clique.search_ideal_components import search, SearchResult

result: SearchResult = search(
    sim,
    Path("exported_orthologs/all_ideal.fasta"),
    Path("db_cache"),
    Path("search_results),
    Path("queries.fasta"),
)
# Print the number of total HSPs, transcripts and components producing HSPs.
print(*result)

# Get a faster result by providing strand_graph and node_to_cc from the
# OrthologExporter.
fast_result: SearchResult = search(
    sim,
    Path("exported_orthologs/all_ideal.fasta"),
    Path("db_cache"),
    Path("search_results),
    Path("queries.fasta"),
    strand_graph=exporter.strand_graph,
    node_to_ccc=exporter.node_to_component_component,
)

Working with configuration objects

RNA-clique supports configuration via YAML configuration files as well as via a command-line interface. Additionally, some RNA-clique programs will automatically produce a configuration file for reproducing an analysis. Althoug using RNA-clique configuration files in not necessary to use the Python API, it can sometimes be convenient to be able to read or create configuration files from your own Python code.

To load an RNA-clique configuration file, use the rna_clique.config.RNACliqueConfig.yaml_load classmethod. yaml_load takes just one argument—the path to the configuration file—and loads and returns the configuration at that path.

import config as config_module

config = config_module.RNACliqueConfig.yaml_load(Path("path/to/config.yaml"))

The attributes of configuration objects are documented in the Configuration guide. To save an RNACliqueConfig object, use the yaml_save method. yaml_save takes one argument, the output path.

config.yaml_save(Path("path/to/config.yaml"))

Making subsets

To create a subset of an existing analysis from Python code, you can use the rna_clique.make_subset.SubsetAnalysisCreator class. SubsetAnalysisCreator's constructor requires three arguments, which are summarized in the table below.

Formal parameter	Description
`matches`	Predicate indicating whether a path should be included in the subset.
`super_config`	`RNACliqueConfig` for the parent (original, superset) analysis.
`config`	Partial `RNACliqueConfig` for the subset analysis.

Each of these parameters requires some further explanation.

First, the matches parameter is a predicate—i.e., a function that returns a bool. It should take the Path to a top genes file and return True if the Path should be included in the subset, or False otherwise. The accepted matches parameter is a function to provide maximum flexibility, but some other functions (discussed below) are also provided to make creating such functions easier.

Second, the super_config parameter is the RNACliqueConfig for the original analysis to be subsetted. The super_config is used to get the location of the original gene matches tables and the paths to the top genes files.

Third, the config parameter is the RNACliqueConfig for the subset analysis. Since the subset has not yet been created, config is not expected to have all of its attributes assigned. The only required (non-None) attributes are table_dir and graph—these paths are where the SubsetAnalysisCreator will put the gene matches table links and new gene matches graph for the subset analysis. SubsetAnalysisCreator does not modify either RNACliqueConfig passed to it in the constructor but does create and modify a copy of config.

After constructing a SubsetAnalysisCreator, calling the object's make method creates the analysis using the parameters provided to the constructor. After make has been called, the config attribute of the SubsetAnalysisCreator instance will be a complete RNACliqueConfig for the subset analysis.

from rna_clique.make_subset import SubsetAnalysisCreator
from rna_clique.config import RNACliqueConfig

child_config = RNACliqueConfig()
child_config.tables_dir = Path("subset/tables_dir")
child_confid.graph = Path("subset/graph.pkl")
subsetter = SubsetAnalysisCreator(
    # Exclude samples that start with the letter "B".
    lambda x: not x.name.startswith("B"),
    parent_config,
    child_config
)
subsetter.make()

Since creating a child analysis RNACliqueConfig just to provide it to SubsetAnalysisCreator is somewhat cumbersome, SubsetAnalysisCreator also provides a classmethod, from_paths that allows one to simply provide the subset analysis' tables_dir and graph directly to SubsetAnalysisCreator. The parameters accepted by that classmethod are described in the table below.

Formal parameter	Description
`matches`	Predicate indicating whether a path should be included in the subset.
`super_config`	`RNACliqueConfig` for the parent (original, superset) analysis.
`tables_dir`	Directory in which to store child's links to parent's gene matches tables.
`graph`	Path at which to create gene matches graph for child analysis.

from rna_clique.make_subset import SubsetAnalysisCreator

subsetter = SubsetAnalysisCreator.from_paths(
    # Exclude samples that start with the letter "B".
    lambda x: not x.name.startswith("B"),
    parent_config,
    Path("subset/tables_dir"),
    Path("subset/graph.pkl"),   
)
subsetter.make()

Visualization

Although visualization is not one of the primary features of RNA-clique, RNA-clique nevertheless provides some code designed for visualization. The API for those function is described in a separate document.